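Classifying BBC News Text with the KNN Algorithm

This article shows how to classify BBC news text with the k-nearest neighbors (KNN) algorithm and provides a complete Python code example.

Dataset

We use the following three files:

  • BBC News Sample Solution.csv: example article IDs with a category for each, in the same two-column layout as the predictions file written at the end.
  • BBC News Train.csv: training article IDs, text, and categories.
  • BBC News Test.csv: test article IDs and text, with no category column.

Example from BBC News Sample Solution.csv: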

ArticleId,Category
1018,sport
1319,tech
1138,business
459,entertainment
1020,politics
51,sport
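
As a small illustration of this file format, the sketch below parses the sample rows from an in-memory string and builds an ArticleId-to-Category lookup. The `sample_csv` string is a stand-in introduced here so the example runs without the dataset; with the real file you would pass its path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Stand-in for 'BBC News Sample Solution.csv': the rows shown above.
sample_csv = """ArticleId,Category
1018,sport
1319,tech
1138,business
459,entertainment
1020,politics
51,sport
"""

solution_df = pd.read_csv(io.StringIO(sample_csv))

# Map each ArticleId to its category.
solution_dict = dict(zip(solution_df['ArticleId'], solution_df['Category']))

# Count how many sample rows fall into each category.
category_counts = solution_df['Category'].value_counts().to_dict()

print(solution_dict[1018])
print(category_counts)
```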

Example from BBC News Train.csv:

ArticleId,Text,Category
1833,'worldcom ex-boss launches defence lawyers defending former worldcom chief bernie ebbers against a battery of fraud charges have called a company whistleblower as their first witness.  cynthia cooper  worldcom s ex-head of internal accounting  alerted directors to irregular accounting practices at the us telecoms giant in 2002. her warnings led to the collapse of the firm following the discovery of an $11bn (£5.7bn) accounting fraud. mr ebbers has pleaded not guilty to charges of fraud and conspiracy.  prosecution lawyers have argued that mr ebbers orchestrated a series of accounting tricks at worldcom  ordering employees to hide expenses and inflate revenues to meet wall street earnings estimates. but ms cooper  who now runs her own consulting business  told a jury in new york on wednesday that external auditors arthur andersen had approved worldcom s accounting in early 2001 and 2002. she said andersen had given a  green light  to the procedures and practices used by worldcom. mr ebber s lawyers have said he was unaware of the fraud  arguing that auditors did not alert him to any problems.  ms cooper also said that during shareholder meetings mr ebbers often passed over technical questions to the company s finance chief  giving only  brief  answers himself. the prosecution s star witness  former worldcom financial chief scott sullivan  has said that mr ebbers ordered accounting adjustments at the firm  telling him to  hit our books . however  ms cooper said mr sullivan had not mentioned  anything uncomfortable  about worldcom s accounting during a 2001 audit committee meeting. mr ebbers could face a jail sentence of 85 years if convicted of all the charges he is facing. worldcom emerged from bankruptcy protection in 2004  and is now known as mci. last week  mci agreed to a buyout by verizon communications in a deal valued at $6.75bn.',business
154,'german business confidence slides german business confidence fell in february knocking hopes of a speedy recovery in europe s largest economy.  munich-based research institute ifo said that its confidence index fell to 95.5 in february from 97.5 in january  its first decline in three months. the study found that the outlook in both the manufacturing and retail sectors had worsened. observers had been hoping that a more confident business sector would signal that economic activity was picking up.   we re surprised that the ifo index has taken such a knock   said dz bank economist bernd weidensteiner.  the main reason is probably that the domestic economy is still weak  particularly in the retail trade.  economy and labour minister wolfgang clement called the dip in february s ifo confidence figure  a very mild decline . he said that despite the retreat  the index remained at a relatively high level and that he expected  a modest economic upswing  to continue.  germany s economy grew 1.6% last year after shrinking in 2003. however  the economy contracted by 0.2% during the last three months of 2004  mainly due to the reluctance of consumers to spend. latest indications are that growth is still proving elusive and ifo president hans-werner sinn said any improvement in german domestic demand was sluggish. exports had kept things going during the first half of 2004  but demand for exports was then hit as the value of the euro hit record levels making german products less competitive overseas. on top of that  the unemployment rate has been stuck at close to 10% and manufacturing firms  including daimlerchrysler  siemens and volkswagen  have been negotiating with unions over cost cutting measures. analysts said that the ifo figures and germany s continuing problems may delay an interest rate rise by the european central bank. 
eurozone interest rates are at 2%  but comments from senior officials have recently focused on the threat of inflation  prompting fears that interest rates may rise.',business
1101,'bbc poll indicates economic gloom citizens in a majority of nations surveyed in a bbc world service poll believe the world economy is worsening.  most respondents also said their national economy was getting worse. but when asked about their own family s financial outlook  a majority in 14 countries said they were positive about the future. almost 23 000 people in 22 countries were questioned for the poll  which was mostly conducted before the asian tsunami disaster. the poll found that a majority or plurality of people in 13 countries believed the economy was going downhill  compared with respondents in nine countries who believed it was improving. those surveyed in three countries were split. in percentage terms  an average of 44% of respondents in each country said the world economy was getting worse  compared to 34% who said it was improving. similarly  48% were pessimistic about their national economy  while 41% were optimistic. and 47% saw their family s economic conditions improving  as against 36% who said they were getting worse.  the poll of 22 953 people was conducted by the international polling firm globescan  together with the program on international policy attitudes (pipa) at the university of maryland.  while the world economy has picked up from difficult times just a few years ago  people seem to not have fully absorbed this development  though they are personally experiencing its effects   said pipa director steven kull.  people around the world are saying:  i m ok  but the world isn t .  there may be a perception that war  terrorism and religious and political divisions are making the world a worse place  even though that has not so far been reflected in global economic performance  says the bbc s elizabeth blunt.  the countries where people were most optimistic  both for the world and for their own families  were two fast-growing developing economies  china and india  followed by indonesia. 
china has seen two decades of blistering economic growth  which has led to wealth creation on a huge scale  says the bbc s louisa lim in beijing. but the results also may reflect the untrammelled confidence of people who are subject to endless government propaganda about their country s rosy economic future  our correspondent says. south korea was the most pessimistic  while respondents in italy and mexico were also quite gloomy. the bbc s david willey in rome says one reason for that result is the changeover from the lira to the euro in 2001  which is widely viewed as the biggest reason why their wages and salaries are worth less than they used to be. the philippines was among the most upbeat countries on prospects for respondents  families  but one of the most pessimistic about the world economy. pipa conducted the poll from 15 november 2004 to 3 january 2005 across 22 countries in face-to-face or telephone interviews. the interviews took place between 15 november 2004 and 5 january 2005. the margin of error is between 2.5 and 4 points  depending on the country. in eight of the countries  the sample was limited to major metropolitan areas.',business

Example from BBC News Test.csv:

ArticleId,Text
1018,'qpr keeper day heads for preston queens park rangers keeper chris day is set to join preston on a month s loan.  day has been displaced by the arrival of simon royce  who is in his second month on loan from charlton. qpr have also signed italian generoso rossi. r s manager ian holloway said:  some might say it s a risk as he can t be recalled during that month and simon royce can now be recalled by charlton.  but i have other irons in the fire. i have had a  yes  from a couple of others should i need them.   day s rangers contract expires in the summer. meanwhile  holloway is hoping to complete the signing of middlesbrough defender andy davies - either permanently or again on loan - before saturday s match at ipswich. davies impressed during a recent loan spell at loftus road. holloway is also chasing bristol city midfielder tom doherty.'

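Implementation steps

  1. Read BBC News Sample Solution.csv and store its ArticleId and Category columns in a dictionary.
  2. Read BBC News Train.csv and store its Text and Category columns in two lists.
  3. Preprocess each text in the Text list: tokenize, remove stop words, and stem.
  4. Convert the preprocessed texts to a bag-of-words or TF-IDF representation to obtain the training feature vectors.
  5. Fit a KNN classifier on the training feature vectors and labels.
  6. Read BBC News Test.csv, preprocess each text the same way, and convert it to a feature vector using the vocabulary fitted on the training set.
  7. Use the KNN classifier to predict a category for each test article.
  8. Save the predictions to a CSV file.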
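
To make steps 4, 5, and 7 concrete, here is a minimal pure-Python sketch of the idea behind KNN text classification. It is not the article's implementation: it uses raw word counts rather than TF-IDF, cosine similarity rather than a library distance metric, and a toy corpus made up for illustration. The helper names (`tokenize`, `cosine_similarity`, `knn_predict`) are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace, keeping alphabetic tokens only."""
    return [tok for tok in text.lower().split() if tok.isalpha()]


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counter objects)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_predict(train_vectors, train_labels, query, k=3):
    """Return the majority label among the k most similar training vectors."""
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: cosine_similarity(pair[0], query),
        reverse=True,
    )
    top_labels = [label for _, label in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]


# Toy corpus (made up for illustration, not taken from the BBC data).
train_docs = [
    ("the striker scored a late goal in the match", "sport"),
    ("the keeper saved a penalty in the cup match", "sport"),
    ("the club signed a new striker on loan", "sport"),
    ("shares fell as the bank reported a loss", "business"),
    ("the firm reported strong profit and revenue growth", "business"),
    ("interest rates rise as the economy slows", "business"),
]
train_vectors = [Counter(tokenize(text)) for text, _ in train_docs]
train_labels = [label for _, label in train_docs]

query = Counter(tokenize("the bank reported a profit as shares rose"))
prediction = knn_predict(train_vectors, train_labels, query, k=3)
print(prediction)
```

Note that common words such as "the" and "a" contribute to every similarity score here, which is one reason the preprocessing step removes stop words before vectorizing.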

Code

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Download the NLTK resources used below (skipped if already present).
# Newer NLTK releases may also need 'punkt_tab' for word_tokenize.
nltk.download('punkt')
nltk.download('stopwords')

# Load the training and test sets
train_df = pd.read_csv('BBC News Train.csv')
test_df = pd.read_csv('BBC News Test.csv')

# Load the sample solution as an ArticleId -> Category lookup.
# It is not used for training or prediction below.
solution_df = pd.read_csv('BBC News Sample Solution.csv')
solution_dict = dict(zip(solution_df['ArticleId'], solution_df['Category']))

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()


def preprocess(text):
    """Lowercase, tokenize, drop stop words and non-alphabetic tokens, then stem."""
    tokens = word_tokenize(text.lower())
    tokens = [token for token in tokens if token.isalpha() and token not in stop_words]
    return ' '.join(stemmer.stem(token) for token in tokens)


# Apply the same preprocessing to the training and test texts
train_texts = train_df['Text'].apply(preprocess).tolist()
train_labels = train_df['Category'].tolist()
test_texts = test_df['Text'].apply(preprocess).tolist()
test_ids = test_df['ArticleId'].tolist()

# Convert to TF-IDF features: fit on the training texts only,
# then reuse the fitted vocabulary for the test texts
vectorizer = TfidfVectorizer()
train_features = vectorizer.fit_transform(train_texts)
test_features = vectorizer.transform(test_texts)

# Fit the KNN classifier
k = 5
knn_classifier = KNeighborsClassifier(n_neighbors=k)
knn_classifier.fit(train_features, train_labels)

# Predict a category for each test article
test_predictions = knn_classifier.predict(test_features)

# Save the predictions
predictions_df = pd.DataFrame({'ArticleId': test_ids, 'Category': test_predictions})
predictions_df.to_csv('BBC News Test Predictions.csv', index=False)
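
The test set has no Category column, so the script above cannot report how accurate its predictions are, and k = 5 is used without comparison against other values. One common way to check both is cross-validation on the labelled training data. The sketch below is a suggestion rather than part of the original code: it uses a tiny made-up corpus so that it runs on its own, and the candidate values of k are arbitrary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Toy labelled corpus (made up for illustration); in practice pass the
# preprocessed training texts and their labels instead.
texts = [
    "striker scores late goal in cup match",
    "keeper saves penalty in league match",
    "club signs striker on loan deal",
    "manager praises team after match win",
    "team wins cup final with late goal",
    "striker injured before league match",
    "bank shares fall after profit warning",
    "firm reports strong profit growth",
    "interest rates rise as economy slows",
    "shares rise after bank reports profit",
    "economy growth lifts firm shares",
    "bank cuts interest rates to lift economy",
]
labels = ["sport"] * 6 + ["business"] * 6

# Mean 3-fold cross-validated accuracy for each candidate k.
scores_by_k = {}
for k in (1, 3, 5):
    model = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, texts, labels, cv=3)
    scores_by_k[k] = scores.mean()

best_k = max(scores_by_k, key=scores_by_k.get)
print(scores_by_k, best_k)
```

Wrapping the vectorizer and classifier in one pipeline means the TF-IDF vocabulary is refitted on each training fold, so the held-out fold does not leak into the features.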

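Notes

  1. KNeighborsClassifier.fit takes the feature matrix and the list of labels as input.
  2. Tokenization, stop-word removal, and stemming rely on NLTK. Its tokenizer models and stop-word list must be downloaded once with nltk.download before use.
  3. The TF-IDF conversion uses the TfidfVectorizer class from sklearn. Fit it on the training texts only, then reuse it to transform the test texts.
  4. The classifier is the KNeighborsClassifier class from sklearn; n_neighbors sets the number of neighbors (5 in this example).
  5. The predictions are saved with the pandas DataFrame class and its to_csv method.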
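
To show what the preprocessing in note 2 does to a sentence, here is a minimal sketch that needs no NLTK data download. It is a simplification, not the article's exact pipeline: it splits on whitespace instead of calling word_tokenize, and `STOP_WORDS` is a tiny hand-picked set assumed for this example rather than NLTK's full English list.

```python
from nltk.stem import PorterStemmer

# Tiny hand-picked stop-word set (an assumption for this sketch); the
# article's code uses NLTK's full English stop-word list instead.
STOP_WORDS = {"the", "a", "of", "is", "are", "for", "against"}

stemmer = PorterStemmer()


def preprocess(text):
    """Lowercase, split on whitespace, filter tokens, and stem what remains."""
    tokens = text.lower().split()
    tokens = [t for t in tokens if t.isalpha() and t not in STOP_WORDS]
    return [stemmer.stem(t) for t in tokens]


stems = preprocess("The lawyers are defending the former chief")
print(stems)
```

Stemming maps inflected forms such as "defending" to a shared stem ("defend"), so different forms of the same word count as one TF-IDF feature.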

We hope this article helps you understand how to classify BBC news text with the KNN algorithm.


Original article: https://www.cveoy.top/t/topic/ov45 Copyright belongs to the author. Please do not repost or scrape.
