Python Text Analysis: Stop-Word Handling and Word-Frequency Counting

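This article introduces common techniques for text analysis in Python, including how to load a stop-word list and how to count word frequencies, with code examples for each.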

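1. Import the required libraries

First, import some commonly used libraries:

re: regular expressions.
os: operating-system functionality.
jieba: Chinese word segmentation.
pickle: serializing and deserializing objects.
numpy: scientific computing.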

import re
import os
import jieba
import pickle
import numpy as np

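2. Load the stop-word list

The function getstopWords loads the stop-word list. Its parameter txt_path gives the path to the stop-word file and defaults to 'D:/6wanDownload/新建文件夹/stopwords.txt'. The function creates an empty list stopWords, opens the stop-word file, reads it line by line, and appends each line, with its trailing newline removed, to stopWords. It then returns the list.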

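# Load the stop-word list
def getstopWords(txt_path='D:/6wanDownload/新建文件夹/stopwords.txt'):
    stopWords = []
    # Assumes the file is UTF-8 encoded; change the encoding if yours differs
    with open(txt_path, 'r', encoding='utf-8') as f:
        for line in f:
            # rstrip removes only the line ending, so a last line without one stays intact
            stopWords.append(line.rstrip('\r\n'))
    return stopWords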
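
As a variation on the function above, the following sketch returns a set instead of a list, which makes a later `word not in stop_words` check a constant-time lookup on average, and it skips blank lines. The name `load_stop_words` and the sample file contents are hypothetical, chosen only for illustration; the file is assumed to be UTF-8 encoded with one stop word per line.

```python
import os
import tempfile


def load_stop_words(txt_path, encoding="utf-8"):
    """Read one stop word per line and return them as a set (hypothetical helper)."""
    stop_words = set()
    with open(txt_path, "r", encoding=encoding) as f:
        for line in f:
            word = line.strip()  # drop the newline and surrounding whitespace
            if word:             # ignore blank lines
                stop_words.add(word)
    return stop_words


# Demonstration with a throwaway file; the contents are made up for this example.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("的\n了\n\n是")  # note: no newline after the last word
    demo_path = tmp.name

demo_stop_words = load_stop_words(demo_path)
os.remove(demo_path)
print(sorted(demo_stop_words))
```

Because the result is a set, duplicate entries in the file collapse into one and the original line order is not preserved; keep the list version if order matters.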

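3. Count word frequencies

The function list2Dict tallies the items of a list into a dictionary. Its parameter wordsList is the list to count, and wordsDict is the dictionary that holds the counts. The function iterates over every element of wordsList: if the element is already a key in wordsDict, its value is incremented by 1; otherwise the element is added as a new key with the value 1. Finally, the function returns the updated wordsDict.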

# Tally the items of a list into a dict
def list2Dict(wordsList, wordsDict):
    for word in wordsList:
        # Membership is tested on the dict directly; .keys() is not needed
        if word in wordsDict:
            wordsDict[word] += 1
        else:
            wordsDict[word] = 1
    return wordsDict
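
The standard library's `collections.Counter` performs the same kind of tally. A minimal sketch, assuming the words are already available as a list; the sample tokens are invented for illustration and are not taken from any real text:

```python
from collections import Counter

sample_words = ["文本", "分析", "文本", "词频", "文本", "分析"]  # made-up sample tokens

# Counter counts hashable items in one pass, much like list2Dict does by hand.
counts = Counter(sample_words)

# update() adds further items to an existing tally, similar to calling
# list2Dict again with the same dictionary.
counts.update(["词频"])

print(counts["文本"])
print(counts.most_common(1))
```

A `Counter` is a `dict` subclass, so code that expects the plain dictionary returned by list2Dict can generally use it unchanged.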

4. Example code

The following code shows how to use the getstopWords and list2Dict functions for text analysis.

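# Load the stop-word list
stop_words = getstopWords()

# Read the text to analyze
with open('your_text_file.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Segment the text with jieba. lcut returns a list, so words can be iterated
# more than once (jieba.cut returns a generator, which is exhausted after one pass)
words = jieba.lcut(text)

# Count word frequencies
word_counts = {}
word_counts = list2Dict(words, word_counts)

# Remove stop words
filtered_words = [word for word in words if word not in stop_words]

# Count word frequencies again
filtered_word_counts = {}
filtered_word_counts = list2Dict(filtered_words, filtered_word_counts)

# Print the results
print('Raw word frequencies:', word_counts)
print('Word frequencies after removing stop words:', filtered_word_counts)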
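
Printing a whole dictionary becomes hard to read for longer texts. One way to show only the most frequent words is to sort the items by count. The sketch below assumes a plain dict like the one list2Dict returns; the helper name `top_n` and the sample counts are hypothetical.

```python
def top_n(word_counts, n=3):
    """Return the n (word, count) pairs with the highest counts (hypothetical helper)."""
    # Sort by count in descending order; ties are broken alphabetically
    # so the output is deterministic.
    ranked = sorted(word_counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


sample_counts = {"python": 5, "text": 3, "analysis": 3, "word": 1}  # made-up counts

for word, count in top_n(sample_counts, n=3):
    print(f"{word}\t{count}")
```

The same call could be applied to filtered_word_counts to list the most frequent words that remain after stop-word removal.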

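Summary

This article showed how to carry out simple text analysis in Python, including loading a stop-word list, segmenting text, and counting word frequencies. I hope you find it helpful.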