Python text processing: removing stopwords and punctuation with the Baidu and HIT stopword lists

Open the text file:

text_file = open('online review data.txt', 'r', encoding='utf-8')

Read the data:

text = text_file.read()

Check the type of the data read:

print(type(text))
print(' ')

Print the text:

print(text)
print(' ')

Import the required library:

import jieba

Segment the text into words (precise mode):

sentences = list(jieba.cut(text, cut_all=False))  # note: holds individual words, despite the name
print('Precise mode: ' + '/ '.join(sentences))

Print the segmentation result:

print(sentences)
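
For comparison, jieba also offers a full mode that enumerates every candidate word (possibly overlapping) and a search-engine mode that further re-splits long words. A minimal sketch using the same text variable as above; the names full and search are my own:

full = list(jieba.cut(text, cut_all=True))  # full mode: all candidate words
print('Full mode: ' + '/ '.join(full))
search = list(jieba.cut_for_search(text))  # search mode: re-splits long words for indexing
print('Search mode: ' + '/ '.join(search))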

Filter the words by part of speech, keeping only nouns:

import jieba.posseg as pseg

tag_filter = ['n']  # keep only the generic noun tag
words_pair = pseg.cut(text)
result = []
for word, flag in words_pair:
    if flag in tag_filter:
        result.append(word)
print('POS filtering complete')
print(result)
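
The filter above keeps only the generic noun tag 'n'. jieba's tagset also has noun subtypes such as 'nr' (person names) and 'ns' (place names); if those should be kept as well, matching on the tag prefix is one option. A sketch, assuming the subtypes are wanted; the name nouns is my own:

nouns = [word for word, flag in pseg.cut(text) if flag.startswith('n')]  # matches n, nr, ns, nt, nz, ...
print(nouns)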

Load the stopword lists:

stopwords_file1 = open('baidu_stopwords.txt', 'r', encoding='utf-8')
stopwords_file2 = open('hit_stopwords.txt', 'r', encoding='utf-8')
stopwords1 = stopwords_file1.read()
stopwords2 = stopwords_file2.read()
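
Note that the file handles in this walkthrough are never closed. Wrapping the reads in with blocks closes each file automatically, even if an exception occurs; a minimal equivalent sketch:

with open('baidu_stopwords.txt', 'r', encoding='utf-8') as f:  # closed automatically at block exit
    stopwords1 = f.read()
with open('hit_stopwords.txt', 'r', encoding='utf-8') as f:
    stopwords2 = f.read()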

Convert the stopword lists to Python lists:

stopwords1_list = stopwords1.split()
stopwords2_list = stopwords2.split()

Remove stopwords and punctuation:

filtered_words = []
for word in sentences:
    if (word not in stopwords1_list and word not in stopwords2_list
            and word != '\n' and word != '\u3000' and word != '\r\n'):
        filtered_words.append(word)
print('Stopword and punctuation removal complete')
print(filtered_words)
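
Because "word not in stopwords1_list" scans the whole list for every token, the loop above does a linear search per word. Merging both lists and the whitespace tokens into a single set makes each lookup O(1) on average; an equivalent sketch, where skip and filtered are names of my own choosing:

skip = set(stopwords1_list) | set(stopwords2_list) | {'\n', '\u3000', '\r\n'}  # one set of all unwanted tokens
filtered = [word for word in sentences if word not in skip]  # same result as filtered_words above
print(filtered)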
