Python text processing: removing stopwords and punctuation with the Baidu and HIT stopword lists

Open the text file:

text_file = open('online review data.txt', 'r', encoding='utf-8')

Read the data:

text = text_file.read()

Check the type of the data read:

print(type(text))
print(' ')

Print the text:

print(text)
print(' ')

Import the required library:

import jieba

Segment the text into words (precise mode):

sentences = list(jieba.cut(text, cut_all=False))  # note: holds individual words, despite the name
print('Precise mode: ' + '/ '.join(sentences))

Print the segmentation result:

print(sentences)
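
For comparison, jieba also offers a full mode that enumerates every candidate word (possibly overlapping) and a search-engine mode that further re-splits long words. A minimal sketch using the same text variable as above; the names full and search are my own:

full = list(jieba.cut(text, cut_all=True))  # full mode: all candidate words
print('Full mode: ' + '/ '.join(full))
search = list(jieba.cut_for_search(text))  # search mode: re-splits long words for indexing
print('Search mode: ' + '/ '.join(search))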

Filter the words by part of speech, keeping only nouns:

import jieba.posseg as pseg

tag_filter = ['n']  # keep only the generic noun tag
words_pair = pseg.cut(text)
result = []
for word, flag in words_pair:
    if flag in tag_filter:
        result.append(word)
print('POS filtering complete')
print(result)
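
The filter above keeps only the generic noun tag 'n'. jieba's tagset also has noun subtypes such as 'nr' (person names) and 'ns' (place names); if those should be kept as well, matching on the tag prefix is one option. A sketch, assuming the subtypes are wanted; the name nouns is my own:

nouns = [word for word, flag in pseg.cut(text) if flag.startswith('n')]  # matches n, nr, ns, nt, nz, ...
print(nouns)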

Load the stopword lists:

stopwords_file1 = open('baidu_stopwords.txt', 'r', encoding='utf-8')
stopwords_file2 = open('hit_stopwords.txt', 'r', encoding='utf-8')
stopwords1 = stopwords_file1.read()
stopwords2 = stopwords_file2.read()
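
Note that the file handles in this walkthrough are never closed. Wrapping the reads in with blocks closes each file automatically, even if an exception occurs; a minimal equivalent sketch:

with open('baidu_stopwords.txt', 'r', encoding='utf-8') as f:  # closed automatically at block exit
    stopwords1 = f.read()
with open('hit_stopwords.txt', 'r', encoding='utf-8') as f:
    stopwords2 = f.read()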

Convert the stopword lists to Python lists:

stopwords1_list = stopwords1.split()
stopwords2_list = stopwords2.split()

Remove stopwords and punctuation:

filtered_words = []
for word in sentences:
    if (word not in stopwords1_list and word not in stopwords2_list
            and word != '\n' and word != '\u3000' and word != '\r\n'):
        filtered_words.append(word)
print('Stopword and punctuation removal complete')
print(filtered_words)
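
Because "word not in stopwords1_list" scans the whole list for every token, the loop above does a linear search per word. Merging both lists and the whitespace tokens into a single set makes each lookup O(1) on average; an equivalent sketch, where skip and filtered are names of my own choosing:

skip = set(stopwords1_list) | set(stopwords2_list) | {'\n', '\u3000', '\r\n'}  # one set of all unwanted tokens
filtered = [word for word in sentences if word not in skip]  # same result as filtered_words above
print(filtered)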
