Naive Bayes spam classification on the erron dataset, with a tkinter interface
First, download and import the erron dataset. The following code can be used:
import nltk
nltk.download('erron')
Then load the dataset with the following code:
from nltk.corpus import erron
spam = erron.sents(categories=['spam'])
ham = erron.sents(categories=['ham'])
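Next, preprocess the data: lowercase every word in each message and strip its punctuation (each sentence is already a list of individual words). The following code can be used: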
import string
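def process_sentence(sentence):
    # Build a translation table that deletes every ASCII punctuation character.
    table = str.maketrans('', '', string.punctuation)
    # Lowercase each word, then strip its punctuation.
    words = [word.lower().translate(table) for word in sentence]
    return words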
spam = [process_sentence(sentence) for sentence in spam]
ham = [process_sentence(sentence) for sentence in ham]
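To make the effect of the preprocessing step concrete, here is a small sketch that runs the same logic on a made-up token list. The sample tokens are hypothetical and not taken from the dataset, and the final filtering step is an optional addition that the original code does not perform.

```python
import string

# Same logic as process_sentence above, repeated so this snippet runs on its own.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def process_sentence(sentence):
    """Lowercase each token and delete ASCII punctuation from it."""
    return [word.lower().translate(PUNCT_TABLE) for word in sentence]

# Hypothetical tokenized sentence, not taken from the dataset.
tokens = ['Free', 'MONEY', '!!!', 'Click', 'here', '.']
cleaned = process_sentence(tokens)
print(cleaned)    # ['free', 'money', '', 'click', 'here', '']

# Optional extra step (an assumption, not in the original): drop empty tokens.
non_empty = [word for word in cleaned if word]
print(non_empty)  # ['free', 'money', 'click', 'here']
```

Tokens that consist only of punctuation become empty strings. They disappear anyway once the words are joined with spaces and vectorized, so filtering them is a matter of tidiness rather than correctness.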
Now split the dataset into a training set and a test set. The following code can be used:
from sklearn.model_selection import train_test_split
X = spam + ham
y = ['spam'] * len(spam) + ['ham'] * len(ham)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
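As a rough illustration of what an 80/20 split with a fixed seed does, the sketch below shuffles indices and slices them in plain Python. It is a simplified stand-in, not scikit-learn's implementation; `simple_train_test_split` and the toy data are hypothetical names introduced only for this example.

```python
import random

def simple_train_test_split(X, y, test_size=0.2, seed=42):
    """Shuffle indices with a fixed seed and split samples and labels together."""
    if len(X) != len(y):
        raise ValueError('X and y must have the same length')
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(X) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# Toy data: each sample name encodes its own label so alignment is easy to check.
X_demo = ['spam-%d' % i for i in range(5)] + ['ham-%d' % i for i in range(5)]
y_demo = ['spam'] * 5 + ['ham'] * 5
Xtr, Xte, ytr, yte = simple_train_test_split(X_demo, y_demo)
print(len(Xtr), len(Xte))  # 8 2
```

Splitting samples and labels through the same index list is what keeps each message paired with its label. With imbalanced classes, `train_test_split` also accepts a `stratify` argument to preserve the spam/ham ratio in both parts.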
Next, use CountVectorizer to turn the words into numeric count vectors. The following code can be used:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform([' '.join(sentence) for sentence in X_train])
X_test_counts = vectorizer.transform([' '.join(sentence) for sentence in X_test])
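The idea behind the count vectors can be sketched in plain Python. This is a simplified stand-in for illustration only: CountVectorizer itself returns a sparse matrix and, by default, tokenizes differently (for example, it ignores single-character tokens). The function names and sample documents below are hypothetical.

```python
from collections import Counter

def build_vocabulary(documents):
    """Map each distinct word in the training documents to a column index."""
    words = sorted({word for doc in documents for word in doc.split()})
    return {word: index for index, word in enumerate(words)}

def to_count_vector(document, vocabulary):
    """Count how often each vocabulary word occurs; unseen words are ignored."""
    counts = Counter(document.split())
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:
            vector[vocabulary[word]] = count
    return vector

train_docs = ['free money free', 'meeting at noon']
vocab = build_vocabulary(train_docs)  # columns: at, free, meeting, money, noon
train_vectors = [to_count_vector(doc, vocab) for doc in train_docs]
print(train_vectors)  # [[0, 2, 0, 1, 0], [1, 0, 1, 0, 1]]

# A new document is encoded with the training vocabulary only ("lunch" is dropped).
test_vector = to_count_vector('free lunch at noon', vocab)
print(test_vector)    # [1, 1, 0, 0, 1]
```

This mirrors why the code above calls `fit_transform` on the training data but only `transform` on the test data: the vocabulary is learned once from the training set and then reused unchanged.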
Now train a model with the naive Bayes algorithm and predict labels for the test set. The following code can be used:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X_train_counts, y_train)
y_pred = clf.predict(X_test_counts)
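The predictions in `y_pred` are not compared against `y_test` in the code above. A minimal sketch of how that comparison could be done in plain Python follows; the demo label lists are made up for illustration, and `accuracy` and `confusion_counts` are hypothetical helper names.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError('y_true and y_pred must have the same length')
    if not y_true:
        raise ValueError('cannot compute accuracy of an empty list')
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def confusion_counts(y_true, y_pred):
    """Count each (true label, predicted label) pair."""
    return Counter(zip(y_true, y_pred))

# Made-up labels for illustration; in the pipeline these would be y_test and y_pred.
y_true_demo = ['spam', 'ham', 'ham', 'spam', 'ham']
y_pred_demo = ['spam', 'ham', 'spam', 'spam', 'ham']

demo_accuracy = accuracy(y_true_demo, y_pred_demo)
demo_counts = confusion_counts(y_true_demo, y_pred_demo)
print(demo_accuracy)                 # 0.8
print(demo_counts[('ham', 'spam')])  # 1 ham message wrongly flagged as spam
```

For a spam filter the per-pair counts are often more informative than accuracy alone, since flagging a legitimate message as spam is usually the costlier mistake.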
Finally, use tkinter to build a simple interface where the user types in a message and gets a prediction. The following code can be used:
import tkinter as tk
from tkinter import messagebox
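def predict():
    # Read the text typed into the entry box.
    sentence = entry.get()
    # Apply the same preprocessing used for the training data.
    sentence = process_sentence(sentence.split())
    sentence = ' '.join(sentence)
    # Vectorize with the fitted vocabulary and classify.
    sentence_count = vectorizer.transform([sentence])
    prediction = clf.predict(sentence_count)[0]
    messagebox.showinfo('Prediction', prediction)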
root = tk.Tk()
root.title('Spam Classifier')
label = tk.Label(root, text='Enter your message:')
label.pack()
entry = tk.Entry(root)
entry.pack()
button = tk.Button(root, text='Predict', command=predict)
button.pack()
root.mainloop()
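One possible refinement, sketched here under stated assumptions, is to keep the prediction logic in a function that does not touch any widget, so it can be checked without opening a window. `classify_text`, `KeywordVectorizer`, and `ThresholdClassifier` are hypothetical names; the two classes are toy stand-ins that only mimic the `transform` and `predict` methods of the fitted objects above.

```python
def classify_text(text, preprocess, vectorizer, classifier):
    """Return the predicted label for raw text, or None for blank input."""
    if not text.strip():
        return None
    tokens = preprocess(text.split())
    features = vectorizer.transform([' '.join(tokens)])
    return classifier.predict(features)[0]

class KeywordVectorizer:
    """Toy stand-in: one feature, the number of times 'free' occurs."""
    def transform(self, documents):
        return [[doc.split().count('free')] for doc in documents]

class ThresholdClassifier:
    """Toy stand-in: label a row as spam when its single feature is positive."""
    def predict(self, rows):
        return ['spam' if row[0] > 0 else 'ham' for row in rows]

def lowercase(tokens):
    return [token.lower() for token in tokens]

demo_vectorizer = KeywordVectorizer()
demo_classifier = ThresholdClassifier()
print(classify_text('FREE money now', lowercase, demo_vectorizer, demo_classifier))  # spam
print(classify_text('see you at noon', lowercase, demo_vectorizer, demo_classifier))  # ham
print(classify_text('   ', lowercase, demo_vectorizer, demo_classifier))              # None
```

In the tkinter callback, the same function could be called with `process_sentence`, `vectorizer`, and `clf`, showing a prompt instead of a prediction when it returns `None`.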
The result is a simple interface in which the user enters a message and receives a spam or ham prediction.