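If the reviews are written in several languages, route each review to a tokenizer that suits its language. NLTK's TweetTokenizer works well for space-delimited languages such as English and Spanish, while jieba (jieba.cut) segments Chinese, which has no spaces between words. Picking the tokenizer per review avoids mis-segmenting the non-English text. Here is an example: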

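from nltk.tokenize import TweetTokenizer
import jieba

# Tokenize English (or another space-delimited language)
def tokenize_en(text):
    tknzr = TweetTokenizer()
    tokens = tknzr.tokenize(text)
    return tokens

# Tokenize Chinese
def tokenize_cn(text):
    tokens = jieba.cut(text)  # jieba.cut returns a generator
    return list(tokens)

# Detect Chinese text. str.isalpha() is also True for Chinese characters,
# so test the CJK Unified Ideographs range (U+4E00 to U+9FFF) instead.
def contains_chinese(text):
    return any('\u4e00' <= c <= '\u9fff' for c in text)

# Customer reviews
comments = ["The hotel was great. The room was clean and comfortable.", "这家酒店服务很好，房间也很干净。", "El hotel es muy bonito y acogedor. La habitación es amplia y cómoda."]

# Tokenize each review with the matching tokenizer
tokenized_comments = []
for comment in comments:
    if contains_chinese(comment):  # contains Chinese characters
        tokens = tokenize_cn(comment)
    else:  # English or another space-delimited language
        tokens = tokenize_en(comment)
    tokenized_comments.append(tokens)

print(tokenized_comments)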

Output (the exact Chinese segmentation may vary with the jieba version and dictionary):

[['The', 'hotel', 'was', 'great', '.', 'The', 'room', 'was', 'clean', 'and', 'comfortable', '.'],
 ['这家', '酒店', '服务', '很', '好', '，', '房间', '也', '很', '干净', '。'],
 ['El', 'hotel', 'es', 'muy', 'bonito', 'y', 'acogedor', '.', 'La', 'habitación', 'es', 'amplia', 'y', 'cómoda', '.']]
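
A single review can also mix Chinese and another language in one sentence, in which case choosing one tokenizer per review is too coarse. The following is a minimal sketch, not a definitive implementation: it splits the text into runs of Chinese and non-Chinese characters and hands each run to its own tokenizer. The names `split_by_script` and `tokenize_mixed` are hypothetical, and the sketch assumes the CJK Unified Ideographs block (U+4E00 to U+9FFF) is enough to recognize Chinese. It uses stand-in tokenizers so that it runs with the standard library alone.

```python
import re

# Assumption: the CJK Unified Ideographs block (U+4E00 to U+9FFF) is enough
# to spot Chinese text; rarer extension blocks are ignored in this sketch.
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def split_by_script(text):
    """Split text into (is_cjk, segment) runs, dropping empty segments."""
    runs = []
    pos = 0
    for match in CJK_RUN.finditer(text):
        before = text[pos:match.start()].strip()
        if before:
            runs.append((False, before))
        runs.append((True, match.group()))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        runs.append((False, tail))
    return runs


def tokenize_mixed(text, cjk_tokenizer, other_tokenizer):
    """Tokenize each run with the tokenizer that matches its script."""
    tokens = []
    for is_cjk, segment in split_by_script(text):
        tokenizer = cjk_tokenizer if is_cjk else other_tokenizer
        tokens.extend(tokenizer(segment))
    return tokens


# Stand-in tokenizers so the sketch runs without third-party packages:
# one token per Chinese character, whitespace splitting for everything else.
print(tokenize_mixed("房间很干净 and the staff were friendly", list, str.split))
```

In practice the stand-ins could be swapped for the real tokenizers, for example `tokenize_mixed(text, jieba.lcut, TweetTokenizer().tokenize)`. Note that full-width punctuation such as `，` and `。` lies outside the range checked here, so it ends up in the non-Chinese runs.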
