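Removing Stop Words and Computing TF-IDF for English Text in Python: Extracting Key Terms

Use Python to remove stop words from English text, compute TF-IDF, and display the top 10 terms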

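Stop words are words that occur frequently in text but carry little meaning on their own, such as 'the', 'is', and 'in'. Removing them can improve the efficiency and accuracy of text processing.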
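
As a minimal sketch of the idea, the snippet below filters a sentence against a small hand-written stop-word set. The set `SAMPLE_STOP_WORDS` and the helper `remove_stop_words` are illustrative assumptions, not part of any library; real stop-word lists such as NLTK's are much larger.

```python
import re

# Hypothetical, deliberately tiny stop-word set, for illustration only
SAMPLE_STOP_WORDS = {"the", "is", "in", "over", "not"}

def remove_stop_words(text, stop_words=SAMPLE_STOP_WORDS):
    """Lowercase the text, split it into alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in stop_words]

print(remove_stop_words("The quick brown fox jumps over the lazy dog."))
# ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```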

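TF-IDF (term frequency-inverse document frequency) is a widely used text-analysis measure of how important a word is to a document within a collection of documents. A word's TF-IDF value rises with how often it occurs in the document and falls with how many documents contain it, so a higher value indicates a word that is more important to that document.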
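
To make the definition concrete, here is a small sketch that computes TF-IDF by hand. It assumes one common variant: the raw term count multiplied by a smoothed idf, ln((1 + N) / (1 + df)) + 1, where N is the number of documents and df is the number of documents containing the term. Other variants exist, and the function name `tf_idf` and the sample documents are illustrative only.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document; docs is a list of token lists."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts in this document
        weights.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return weights

# Two tiny, already-tokenized sample documents (assumed for illustration)
docs = [
    ["quick", "brown", "fox", "lazy", "dog"],
    ["lazy", "dog", "impressed"],
]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

With these two documents, terms that appear in both ("lazy", "dog") get a weight of 1.0, while terms that appear in only one get ln(1.5) + 1, roughly 1.405, so the rarer terms score higher.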

The following example code removes stop words and computes TF-IDF in Python:

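import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

# Download the NLTK resources used below (skipped if already present)
nltk.download('stopwords')
nltk.download('punkt')      # tokenizer models
nltk.download('punkt_tab')  # required by newer NLTK releases

# English text
text = 'The quick brown fox jumps over the lazy dog. The lazy dog, however, is not impressed.'

# TF-IDF compares documents, so treat each sentence as one document
sentences = nltk.sent_tokenize(text)

# Lowercase, split into words, and remove stop words and punctuation
stop_words = set(stopwords.words('english'))
docs = []
for sentence in sentences:
    words = nltk.word_tokenize(sentence.lower())
    words = [word for word in words if word.isalpha() and word not in stop_words]
    docs.append(' '.join(words))

# Compute TF-IDF values (one row per sentence, one column per term)
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(docs)

# Get the vocabulary and score each term by its total TF-IDF across all sentences
# (summing is one simple way to rank terms; max or mean would also work)
terms = tfidf_vectorizer.get_feature_names_out()
scores = tfidf_matrix.toarray().sum(axis=0)

# Combine terms and scores into a list of tuples, sorted by TF-IDF value
tfidf_tuples = sorted(zip(terms, scores), key=lambda x: x[1], reverse=True)

# Print the top 10
for word, score in tfidf_tuples[:10]:
    print(f'{word} {score:.3f}')

The output is as follows:

dog 0.728
lazy 0.728
however 0.576
impressed 0.576
brown 0.447
fox 0.447
jumps 0.447
quick 0.447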

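As shown, after stop-word removal and TF-IDF computation, the words in the text are sorted by TF-IDF value, which makes the most important words easier to see at a glance.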