Write the code for a spam email detection program in Python. Requirements: at least 4 features and at least 100 lines of code in total.
Below is a simple Python spam-detection program that uses a machine learning algorithm to decide whether an email is spam.
import re
import string
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
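# Load the dataset
spam_data = pd.read_csv('spam.csv', encoding='latin-1')
# Drop the empty trailing columns; errors='ignore' skips any that are absent
spam_data = spam_data.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1, errors='ignore')
# Assumption: if the file uses the headers v1/v2, map them to the names used below
spam_data = spam_data.rename(columns={'v1': 'class', 'v2': 'text'})
# Encode the label as 1 (spam) or 0 (not spam)
spam_data['class'] = np.where(spam_data['class'] == 'spam', 1, 0)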
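# Data preprocessing
def clean_text(text):
    """Lowercase the text and strip bracketed text, punctuation, and words containing digits."""
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)  # remove text in square brackets
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # remove ASCII punctuation
    text = re.sub(r'\w*\d\w*', '', text)  # remove words that contain digits
    text = re.sub('[‘’“”…]', '', text)  # remove curly quotes and ellipses
    text = re.sub(r'\n', ' ', text)  # replace newlines with spaces so adjacent words do not merge
    return text
spam_data['text'] = spam_data['text'].apply(clean_text)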
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(spam_data['text'], spam_data['class'], test_size=0.2, random_state=42)
# Feature extraction
cv = CountVectorizer(stop_words='english')
X_train_cv = cv.fit_transform(X_train)
X_test_cv = cv.transform(X_test)
# Train the model
spam_detect_model = MultinomialNB().fit(X_train_cv, y_train)
# Predict
y_pred = spam_detect_model.predict(X_test_cv)
# Evaluate
print('Accuracy:', accuracy_score(y_test, y_pred))
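Accuracy alone can look good on spam data even when the model misses spam, because spam datasets are often dominated by non-spam messages. As a minimal, dependency-free sketch of what precision and recall add, the snippet below counts the confusion-matrix cells by hand. The function names `confusion_counts` and `precision_recall` are hypothetical helpers, and the labels are hand-made for illustration, not results from the program above. In practice, `sklearn.metrics` provides `precision_score` and `recall_score` for the same purpose.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 means spam."""
    tp = fp = fn = tn = 0
    for truth, guess in zip(y_true, y_pred):
        if truth == 1 and guess == 1:
            tp += 1  # spam correctly flagged
        elif truth == 0 and guess == 1:
            fp += 1  # non-spam wrongly flagged
        elif truth == 1 and guess == 0:
            fn += 1  # spam that slipped through
        else:
            tn += 1  # non-spam correctly passed
    return tp, fp, fn, tn


def precision_recall(y_true, y_pred):
    """Return (precision, recall) for the spam class; 0.0 when undefined."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hand-made labels (not real results): 8 non-spam, 2 spam, one spam missed
y_true = [0] * 8 + [1, 1]
y_pred = [0] * 8 + [0, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall = precision_recall(y_true, y_pred)
print('Accuracy:', accuracy)
print('Precision:', precision)
print('Recall:', recall)
```

With these made-up labels the accuracy is high while recall shows that half of the spam was missed, which is the kind of gap a single accuracy number hides.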
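This program first loads the dataset and converts the spam label to binary form, then cleans the text, extracts features with CountVectorizer, trains a MultinomialNB model, and finally predicts and evaluates on the test set.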
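The steps described here stop at evaluation, so the following is a rough sketch of how the same vectorizer and classifier might be used to label new messages. It assumes scikit-learn is installed. The training sentences are invented toy data rather than the contents of spam.csv, and the helper name `classify` is hypothetical. Wrapping both steps in `make_pipeline` is a design choice for this sketch, not something the original program does.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data, invented for illustration only (1 = spam, 0 = not spam)
train_texts = [
    'win a free prize now',
    'free cash offer claim now',
    'claim your free prize',
    'are we meeting for lunch today',
    'see you at the office tomorrow',
    'can you call me after lunch',
]
train_labels = [1, 1, 1, 0, 0, 0]

# The pipeline applies the vectorizer and then the classifier,
# so raw strings can be passed to fit() and predict() directly.
model = make_pipeline(CountVectorizer(stop_words='english'), MultinomialNB())
model.fit(train_texts, train_labels)


def classify(messages):
    """Return 'spam' or 'ham' for each raw message string."""
    return ['spam' if label == 1 else 'ham' for label in model.predict(messages)]


print(classify(['free prize', 'lunch tomorrow']))
```

Keeping the vectorizer inside the pipeline means new messages are transformed with the vocabulary learned during training, which avoids accidentally refitting the vectorizer on unseen text.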