Dataset: a folder named `corpus` containing two files, `pos` and `neg`. Using Python and the KNN algorithm, build a detailed text classifier; use the HIT (Harbin Institute of Technology) stopword list (`哈工大停用词表.txt`).
First, import the required libraries and modules:
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
Next, read the positive and negative text files from the dataset into two lists:
pos_list = []  # positive texts
neg_list = []  # negative texts
# Read the positive text file
with open('./corpus/pos', 'r', encoding='utf-8') as f:
    for line in f:
        pos_list.append(line.strip())
# Read the negative text file
with open('./corpus/neg', 'r', encoding='utf-8') as f:
    for line in f:
        neg_list.append(line.strip())
Then, read the stopword list into a set:
stopwords = set()  # stopwords
# Read the stopword list
with open('./哈工大停用词表.txt', 'r', encoding='utf-8') as f:
    for line in f:
        stopwords.add(line.strip())
Next, tokenize the positive and negative texts, remove stopwords, and extract TF-IDF features:
# Tokenize with jieba
pos_words = [' '.join(jieba.cut(line)) for line in pos_list]
neg_words = [' '.join(jieba.cut(line)) for line in neg_list]
# Remove stopwords
pos_words = [[word for word in line.split() if word not in stopwords] for line in pos_words]
neg_words = [[word for word in line.split() if word not in stopwords] for line in neg_words]
# TF-IDF feature extraction. Fit the vocabulary on the whole corpus,
# not only the positive texts; otherwise words that appear only in
# the negative texts would be silently dropped by transform().
vectorizer = TfidfVectorizer()
all_docs = [' '.join(line) for line in pos_words + neg_words]
X_all = vectorizer.fit_transform(all_docs)
X_pos = X_all[:len(pos_words)]
X_neg = X_all[len(pos_words):]
Finally, classify the texts with KNN and print the classification results and evaluation metrics:
# Build the training and test sets. Sparse matrices cannot be
# concatenated with `+` (that attempts element-wise addition), so use
# scipy.sparse.vstack. This split assumes each file contains exactly
# 500 lines: 400 for training, 100 for testing.
from scipy.sparse import vstack
X_train = vstack([X_pos[:400], X_neg[:400]])
X_test = vstack([X_pos[400:], X_neg[400:]])
y_train = [1] * 400 + [0] * 400
y_test = [1] * 100 + [0] * 100
# Train the KNN classifier
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)
# Predict on the test set
y_pred = clf.predict(X_test)
# Print the classification results and evaluation metrics
print('Classification Report:\n', classification_report(y_test, y_pred))
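The choice of `n_neighbors=5` above is arbitrary. A hedged sketch of picking k by cross-validation instead, shown on a small synthetic dataset (in the tutorial you would pass `X_train`/`y_train`):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the TF-IDF training data
X_cv, y_cv = make_classification(n_samples=200, n_features=20, random_state=42)

# Try several odd values of k and keep the one with the best
# cross-validated F1 score (odd k avoids ties in binary voting)
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 7, 9, 11]},
    cv=5,
    scoring="f1",
)
grid.fit(X_cv, y_cv)
print(grid.best_params_["n_neighbors"])
```

`grid.best_estimator_` is then a KNN classifier refit on the full training data with the selected k, and can be used for `predict` directly.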
The complete code is as follows:
import jieba
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# Read the positive text file
pos_list = []
with open('./corpus/pos', 'r', encoding='utf-8') as f:
    for line in f:
        pos_list.append(line.strip())

# Read the negative text file
neg_list = []
with open('./corpus/neg', 'r', encoding='utf-8') as f:
    for line in f:
        neg_list.append(line.strip())

# Read the stopword list
stopwords = set()
with open('./哈工大停用词表.txt', 'r', encoding='utf-8') as f:
    for line in f:
        stopwords.add(line.strip())

# Tokenize with jieba
pos_words = [' '.join(jieba.cut(line)) for line in pos_list]
neg_words = [' '.join(jieba.cut(line)) for line in neg_list]

# Remove stopwords
pos_words = [[word for word in line.split() if word not in stopwords] for line in pos_words]
neg_words = [[word for word in line.split() if word not in stopwords] for line in neg_words]

# TF-IDF feature extraction (vocabulary fitted on the whole corpus)
vectorizer = TfidfVectorizer()
all_docs = [' '.join(line) for line in pos_words + neg_words]
X_all = vectorizer.fit_transform(all_docs)
X_pos = X_all[:len(pos_words)]
X_neg = X_all[len(pos_words):]

# Build the training and test sets with scipy.sparse.vstack
# (assumes 500 lines per file: 400 for training, 100 for testing)
X_train = vstack([X_pos[:400], X_neg[:400]])
X_test = vstack([X_pos[400:], X_neg[400:]])
y_train = [1] * 400 + [0] * 400
y_test = [1] * 100 + [0] * 100

# Train the KNN classifier
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)

# Predict on the test set
y_pred = clf.predict(X_test)

# Print the classification results and evaluation metrics
print('Classification Report:\n', classification_report(y_test, y_pred))
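The fixed 400/100 split above assumes exactly 500 lines per file. A hedged alternative using sklearn's `train_test_split`, which handles any corpus size and shuffles the classes together (random sparse matrices stand in for the real `X_pos`/`X_neg`):

```python
import numpy as np
from scipy.sparse import csr_matrix, vstack
from sklearn.model_selection import train_test_split

# Stand-ins for the TF-IDF matrices of the positive and negative texts
rng = np.random.default_rng(0)
X_pos_demo = csr_matrix(rng.random((50, 10)))
X_neg_demo = csr_matrix(rng.random((50, 10)))

# Stack features and labels, then take a stratified 80/20 split so the
# class balance is preserved in both partitions
X_full = vstack([X_pos_demo, X_neg_demo])
y_full = [1] * 50 + [0] * 50

X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_full, y_full, test_size=0.2, stratify=y_full, random_state=42)

print(X_train_demo.shape, X_test_demo.shape)  # (80, 10) (20, 10)
```

With `stratify=y_full`, both partitions keep the 50/50 class ratio, which matters for KNN since majority voting is sensitive to class imbalance.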
Original source: https://www.cveoy.top/t/topic/gtzz. Copyright belongs to the author.