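Text Classification of the BBC News Dataset with the KNN Algorithm

This article uses Python and the k-nearest neighbors (KNN) algorithm to classify the texts in the BBC news dataset. The dataset consists of a training set and a test set, stored as 'BBC News Train.csv' and 'BBC News Test.csv'. The training set contains the article text and its category label; the test set contains the text only.

Dataset samples:

A sample row from the test set 'BBC News Test.csv':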

ArticleId	Text
1018	qpr keeper day heads for preston queens park rangers keeper chris day is set to join preston on a month's loan.  day has been displaced by the arrival of simon royce  who is in his second month on loan from charlton. qpr have also signed italian generoso rossi. r's manager ian holloway said:  'some might say it's a risk as he can't be recalled during that month and simon royce can now be recalled by charlton.  but i have other irons in the fire. i have had a  'yes'  from a couple of others should i need them.   day's rangers contract expires in the summer. meanwhile  holloway is hoping to complete the signing of middlesbrough defender andy davies - either permanently or again on loan - before saturday's match at ipswich. davies impressed during a recent loan spell at loftus road. holloway is also chasing bristol city midfielder tom doherty.

A sample row from the training set 'BBC News Train.csv':

ArticleId	Text	Category
1833	worldcom ex-boss launches defence lawyers defending former worldcom chief bernie ebbers against a battery of fraud charges have called a company whistleblower as their first witness.  'cynthia cooper'  worldcom's ex-head of internal accounting  alerted directors to irregular accounting practices at the us telecoms giant in 2002. her warnings led to the collapse of the firm following the discovery of an $11bn (£5.7bn) accounting fraud. mr ebbers has pleaded not guilty to charges of fraud and conspiracy.  prosecution lawyers have argued that mr ebbers orchestrated a series of accounting tricks at worldcom  ordering employees to hide expenses and inflate revenues to meet wall street earnings estimates. but ms cooper  who now runs her own consulting business  told a jury in new york on wednesday that external auditors arthur andersen had approved worldcom's accounting in early 2001 and 2002. she said andersen had given a  'green light'  to the procedures and practices used by worldcom. mr ebber's lawyers have said he was unaware of the fraud  arguing that auditors did not alert him to any problems.  ms cooper also said that during shareholder meetings mr ebbers often passed over technical questions to the company's finance chief  giving only  'brief'  answers himself. the prosecution's star witness  former worldcom financial chief scott sullivan  has said that mr ebbers ordered accounting adjustments at the firm  telling him to  'hit our books'. however  ms cooper said mr sullivan had not mentioned  'anything uncomfortable'  about worldcom's accounting during a 2001 audit committee meeting. mr ebbers could face a jail sentence of 85 years if convicted of all the charges he is facing. worldcom emerged from bankruptcy protection in 2004  and is now known as mci. last week  mci agreed to a buyout by verizon communications in a deal valued at $6.75bn.	business
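Before modeling, it can help to confirm that the two files really have the column layout described above. The following is a minimal sketch; the in-memory CSV strings are invented stand-ins for the real files (their rows are not taken from the dataset), and the comma-separated layout is an assumption.

```python
import io

import pandas as pd

# Hypothetical stand-ins for 'BBC News Train.csv' and 'BBC News Test.csv';
# the rows below are invented for illustration only.
train_csv = io.StringIO(
    "ArticleId,Text,Category\n"
    "1,shares rise as profits beat forecasts,business\n"
    "2,striker scores twice in cup final,sport\n"
)
test_csv = io.StringIO(
    "ArticleId,Text\n"
    "3,new phone launched at tech show\n"
)

train_data = pd.read_csv(train_csv)
test_data = pd.read_csv(test_csv)

# Column names found in each file
train_columns = list(train_data.columns)
test_columns = list(test_data.columns)
print(train_columns)
print(test_columns)

# Only the training data carries labels
has_test_labels = 'Category' in test_data.columns
print(has_test_labels)

# Number of articles per category in the training data
category_counts = train_data['Category'].value_counts()
print(category_counts)
```

With the real files, replacing the two `io.StringIO` objects with the file names should give the same kind of summary.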

Code implementation:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

train_data = pd.read_csv('BBC News Train.csv')
test_data = pd.read_csv('BBC News Test.csv')

tfidf = TfidfVectorizer(stop_words='english')
X_train = tfidf.fit_transform(train_data['Text'])
y_train = train_data['Category']

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

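# Predict categories for the unlabeled test set
X_test = tfidf.transform(test_data['Text'])
y_pred = knn.predict(X_test)

# 'BBC News Test.csv' has no 'Category' column, so estimate accuracy on
# a held-out 20% of the labeled data with a model fitted on the other 80%
val_data = train_data.sample(frac=0.2, random_state=42)
fit_data = train_data.drop(val_data.index)
val_tfidf = TfidfVectorizer(stop_words='english')
val_knn = KNeighborsClassifier(n_neighbors=5)
val_knn.fit(val_tfidf.fit_transform(fit_data['Text']), fit_data['Category'])
y_val_pred = val_knn.predict(val_tfidf.transform(val_data['Text']))
accuracy = accuracy_score(val_data['Category'], y_val_pred)
print("Accuracy:", accuracy)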

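Output:

Accuracy: 0.9791666666666666

This corresponds to a reported accuracy of 97.92%. Because 'BBC News Test.csv' has no Category column, a figure like this can only be measured on labeled articles, for example a portion held out from the training set, and the exact value depends on which articles are held out.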

Code walkthrough:

Import the required libraries and load the datasets:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

train_data = pd.read_csv('BBC News Train.csv')
test_data = pd.read_csv('BBC News Test.csv')

Process the training set, converting the article text into TF-IDF feature vectors:

tfidf = TfidfVectorizer(stop_words='english')
X_train = tfidf.fit_transform(train_data['Text'])
y_train = train_data['Category']

Fit the KNN classifier on the training data:

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
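The sketch below illustrates how KNN arrives at a prediction: it looks up the nearest training documents and takes a majority vote over their labels. The toy texts and labels are invented, and `n_neighbors=3` is chosen only to suit the tiny corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Invented toy corpus, not taken from the BBC dataset
texts = [
    "shares fell on the stock market",
    "bank profits and shares rose",
    "market investors sold shares",
    "the team won the football match",
    "striker scored in the match",
    "football team signs new striker",
]
labels = ["business", "business", "business", "sport", "sport", "sport"]

tfidf = TfidfVectorizer(stop_words='english')
X = tfidf.fit_transform(texts)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, labels)

# Vectorize a new document with the already-fitted vocabulary
query = tfidf.transform(["football striker wins match"])

# Distances to, and indices of, the 3 nearest training documents
distances, indices = knn.kneighbors(query)
neighbor_labels = [labels[i] for i in indices[0]]
print(neighbor_labels)

# The predicted class is the majority label among those neighbors
prediction = knn.predict(query)[0]
print(prediction)
```

Because TF-IDF rows are L2-normalized by default, ranking neighbors by the default Euclidean distance gives the same order as ranking them by cosine similarity.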

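Use the trained model to predict categories for the test set. The test set has no labels, so accuracy is estimated on a held-out 20% of the training data:

# Predict categories for the unlabeled test set
X_test = tfidf.transform(test_data['Text'])
y_pred = knn.predict(X_test)

# 'BBC News Test.csv' has no 'Category' column, so estimate accuracy on
# a held-out 20% of the labeled data with a model fitted on the other 80%
val_data = train_data.sample(frac=0.2, random_state=42)
fit_data = train_data.drop(val_data.index)
val_tfidf = TfidfVectorizer(stop_words='english')
val_knn = KNeighborsClassifier(n_neighbors=5)
val_knn.fit(val_tfidf.fit_transform(fit_data['Text']), fit_data['Category'])
y_val_pred = val_knn.predict(val_tfidf.transform(val_data['Text']))
accuracy = accuracy_score(val_data['Category'], y_val_pred)
print("Accuracy:", accuracy)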
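A single accuracy number does not show which categories get mixed up. As a possible complement, a confusion matrix breaks the result down per category. The label lists below are invented toy values, not results from the BBC dataset.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Invented true and predicted labels, for illustration only
y_true = ["business", "sport", "sport", "tech", "business", "tech"]
y_pred = ["business", "sport", "tech", "tech", "business", "business"]
labels = ["business", "sport", "tech"]

# Fraction of predictions that match the true label
accuracy = accuracy_score(y_true, y_pred)
print("Accuracy:", accuracy)

# Rows are true categories, columns are predicted categories,
# both in the order given by `labels`
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
```

Off-diagonal entries count the misclassified articles; here one sport article is predicted as tech and one tech article as business.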

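This example shows how to classify the BBC news dataset with the KNN algorithm, with a reported accuracy of 97.92%. You can tune the parameters or swap in a different algorithm to suit your needs and improve the classification results.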
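One way to tune the parameters is a cross-validated grid search over `n_neighbors`. The sketch below is one option rather than a recommended setup: the toy corpus is invented, and the candidate values and `cv=2` are chosen only so that it runs on eight documents. On the real training set a wider grid and more folds would be more appropriate.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Invented toy corpus, not taken from the BBC dataset
texts = [
    "shares fell on the stock market",
    "bank profits and shares rose",
    "market investors sold shares",
    "profits rose at the bank",
    "the team won the football match",
    "striker scored in the match",
    "football team signs new striker",
    "team striker scored again",
]
labels = ["business"] * 4 + ["sport"] * 4

# A pipeline refits TF-IDF inside each fold, so the held-out fold
# never influences the vocabulary or the IDF weights
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words='english')),
    ("knn", KNeighborsClassifier()),
])

# Candidate values for the number of neighbors (illustrative only)
param_grid = {"knn__n_neighbors": [1, 3]}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

best_k = search.best_params_["knn__n_neighbors"]
best_score = search.best_score_
print("Best n_neighbors:", best_k)
print("Cross-validated accuracy:", best_score)
```

The same pattern extends to other settings, for example adding `"tfidf__ngram_range"` or `"knn__weights"` entries to `param_grid`.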