An Ensemble Method for Spam Classification Based on Link Features
This article implements, in Python, an ensemble method for spam classification based on link features. The method consists of the following steps:
- Data preprocessing: merge the link-feature file 'uk-2006-05.1.txt' and the label file 'IdNameLabel.txt' into a single file containing both the link features and the labels.
- Dataset split: split the data into a test set and training sets. The test set contains 500 normal and 500 spam samples; each training set is formed by randomly drawing 500 normal and 500 spam samples from the remaining data, and this process is repeated 15 times to obtain 15 training sets.
- Model training: train three classification algorithms (decision tree, logistic regression, and neural network) on each of the 15 training sets, yielding 45 base classifiers.
- Model ensembling: have all 45 base classifiers classify every sample in the test set, producing 45 predictions per sample, then combine these predictions with majority voting and with a genetic algorithm to obtain two final classification results.
- Performance evaluation: visualize the evaluation metrics Precision, Recall, F-measure, and Accuracy for the best of the 45 base classifiers and for the majority-vote and genetic-algorithm ensembles.
The following Python code implements the requirements above:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
import matplotlib.pyplot as plt
# Read the link-feature file
link_features = pd.read_csv('uk-2006-05.1.txt', sep=' ', header=None)
# Read the label file (columns: id, name, label)
labels = pd.read_csv('IdNameLabel.txt', sep=' ', header=None)
# Merge the link features with the label column
# (assumes both files list the hosts in the same row order)
df = pd.concat([link_features, labels.iloc[:, 2]], axis=1)
df.columns = ['Feature1', 'Feature2', 'Label']
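The positional concatenation above silently assumes both files are sorted identically. If, as the file name 'IdNameLabel.txt' suggests, each file carries a shared host id in its first column, a key-based merge is more robust. A minimal sketch on toy frames, where the column layout is an assumption:

```python
import pandas as pd

# Hypothetical layout: column 0 of each file is a shared host id
link_features = pd.DataFrame({0: [1, 2, 3], 1: [0.5, 0.7, 0.2], 2: [10, 3, 8]})
labels = pd.DataFrame({0: [3, 1, 2],
                       1: ['c.com', 'a.com', 'b.com'],
                       2: ['spam', 'normal', 'normal']})

# Join on the id column instead of relying on row order
df = link_features.merge(labels[[0, 2]].rename(columns={2: 'Label'}),
                         on=0, how='left')
df = df.rename(columns={1: 'Feature1', 2: 'Feature2'}).drop(columns=[0])
print(df)
```

With mismatched row orders the positional concat would mislabel two of the three hosts, while the key-based merge pairs each feature row with the correct label.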
# Split off the test set: 500 normal and 500 spam samples
df_test_normal = df[df['Label'] == 'normal'].sample(n=500)
df_test_spam = df[df['Label'] == 'spam'].sample(n=500)
df_test = pd.concat([df_test_normal, df_test_spam], axis=0)
df_train = df.drop(df_test.index)
df_train = df_train.sample(frac=1)  # shuffle the remaining training pool
# Draw 15 training sets, each with 500 normal and 500 spam samples
train_sets = []
for i in range(15):
    df_train_normal = df_train[df_train['Label'] == 'normal'].sample(n=500)
    df_train_spam = df_train[df_train['Label'] == 'spam'].sample(n=500)
    train_sets.append(pd.concat([df_train_normal, df_train_spam], axis=0))
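The description calls for 500 normal and 500 spam samples per training set; pandas can draw such class-balanced samples in a single `groupby(...).sample(...)` call (available since pandas 1.1). A small sketch on a toy frame, with the column names assumed:

```python
import pandas as pd

# Toy frame with an unbalanced Label column
df_train = pd.DataFrame({
    'Feature1': range(10),
    'Label': ['normal'] * 7 + ['spam'] * 3,
})

# Draw 2 samples per class in one call
balanced = df_train.groupby('Label').sample(n=2, random_state=0)
print(balanced['Label'].value_counts().to_dict())  # {'normal': 2, 'spam': 2}
```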
# Train the decision-tree models and predict the test set
dt_models = []
dt_predictions = []
for train_set in train_sets:
    X_train = train_set[['Feature1', 'Feature2']]
    y_train = train_set['Label']
    dt_model = DecisionTreeClassifier()
    dt_model.fit(X_train, y_train)
    dt_models.append(dt_model)
    dt_predictions.append(dt_model.predict(df_test[['Feature1', 'Feature2']]))
# Train the logistic-regression models and predict the test set
lr_models = []
lr_predictions = []
for train_set in train_sets:
    X_train = train_set[['Feature1', 'Feature2']]
    y_train = train_set['Label']
    lr_model = LogisticRegression()
    lr_model.fit(X_train, y_train)
    lr_models.append(lr_model)
    lr_predictions.append(lr_model.predict(df_test[['Feature1', 'Feature2']]))
# Train the neural-network models and predict the test set
nn_models = []
nn_predictions = []
for train_set in train_sets:
    X_train = train_set[['Feature1', 'Feature2']]
    y_train = train_set['Label']
    nn_model = MLPClassifier()
    nn_model.fit(X_train, y_train)
    nn_models.append(nn_model)
    nn_predictions.append(nn_model.predict(df_test[['Feature1', 'Feature2']]))
# Majority-vote ensemble
ensemble_predictions_majority = []
for i in range(len(df_test)):
    predictions = []
    for j in range(15):
        predictions.append(dt_predictions[j][i])
        predictions.append(lr_predictions[j][i])
        predictions.append(nn_predictions[j][i])
    ensemble_predictions_majority.append(max(set(predictions), key=predictions.count))
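The element-wise vote above can be expressed more directly with `collections.Counter`. An equivalent sketch on toy predictions (the base-classifier outputs here are made up for illustration):

```python
from collections import Counter

# Toy predictions from three hypothetical base classifiers over four samples
base_predictions = [
    ['spam', 'normal', 'spam', 'normal'],
    ['spam', 'spam', 'normal', 'normal'],
    ['normal', 'spam', 'spam', 'spam'],
]

# Majority vote per sample: the most common label among the base classifiers
ensemble = [Counter(sample_preds).most_common(1)[0][0]
            for sample_preds in zip(*base_predictions)]
print(ensemble)  # ['spam', 'spam', 'spam', 'normal']
```

`zip(*base_predictions)` transposes the model-by-sample matrix so that each iteration sees all votes for one test sample.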
# Genetic-algorithm ensemble
# NOTE: as written, this block simply repeats the majority vote; a real genetic
# algorithm would instead search for per-classifier voting weights.
ensemble_predictions_genetic = []
for i in range(len(df_test)):
    predictions = []
    for j in range(15):
        predictions.append(dt_predictions[j][i])
        predictions.append(lr_predictions[j][i])
        predictions.append(nn_predictions[j][i])
    ensemble_predictions_genetic.append(max(set(predictions), key=predictions.count))
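The block above, despite its heading, casts the same majority vote again. A genuine genetic-algorithm ensemble would evolve a weight per base classifier and combine predictions by weighted voting, using accuracy on held-out data as the fitness. Below is a minimal, self-contained sketch of that idea; the toy setup, classifier accuracies, and all GA hyperparameters (population size, elitism, mutation scale) are illustrative assumptions, not the article's method:

```python
import random

random.seed(0)

# Toy setup: 5 hypothetical base classifiers predicting 20 labeled samples;
# classifier 0 is accurate, the others are noisy.
true_labels = [random.choice(['normal', 'spam']) for _ in range(20)]

def noisy(acc):
    # Flip each true label with probability 1 - acc
    return [y if random.random() < acc else
            ('spam' if y == 'normal' else 'normal') for y in true_labels]

base_preds = [noisy(0.95), noisy(0.6), noisy(0.55), noisy(0.6), noisy(0.55)]

def weighted_vote(weights):
    # For each sample, sum the weight of every classifier behind each label
    result = []
    for i in range(len(true_labels)):
        score = {'normal': 0.0, 'spam': 0.0}
        for w, preds in zip(weights, base_preds):
            score[preds[i]] += w
        result.append(max(score, key=score.get))
    return result

def fitness(weights):
    # Fitness = accuracy of the weighted vote
    preds = weighted_vote(weights)
    return sum(p == y for p, y in zip(preds, true_labels)) / len(true_labels)

# Minimal genetic algorithm over non-negative weight vectors
population = [[random.random() for _ in range(5)] for _ in range(30)]
for generation in range(40):
    population.sort(key=fitness, reverse=True)
    survivors = population[:10]                     # elitist selection
    children = []
    while len(children) < 20:
        a, b = random.sample(survivors, 2)
        child = [(x + y) / 2 for x, y in zip(a, b)]           # crossover
        k = random.randrange(5)
        child[k] = max(0.0, child[k] + random.gauss(0, 0.3))  # mutation
        children.append(child)
    population = survivors + children

best = max(population, key=fitness)
print(best, fitness(best))
```

In the article's setting, `base_preds` would be the 45 base classifiers' test-set predictions, and the fitness would be evaluated on a validation split rather than the test set itself to avoid overfitting the weights.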
# Compute classification metrics ('spam' is treated as the positive class)
y_true = df_test['Label']

def evaluate(y_pred):
    return [precision_score(y_true, y_pred, pos_label='spam'),
            recall_score(y_true, y_pred, pos_label='spam'),
            f1_score(y_true, y_pred, pos_label='spam'),
            accuracy_score(y_true, y_pred)]

# Pick the best of the 45 base classifiers by test accuracy
best_base = max(dt_predictions + lr_predictions + nn_predictions,
                key=lambda p: accuracy_score(y_true, p))
best_scores = evaluate(best_base)
ensemble_majority_scores = evaluate(ensemble_predictions_majority)
ensemble_genetic_scores = evaluate(ensemble_predictions_genetic)

# Visualization
metric_names = ['Precision', 'Recall', 'F-measure', 'Accuracy']
x = np.arange(len(metric_names))
width = 0.25
fig, ax = plt.subplots()
ax.bar(x - width, best_scores, width, label='Best base classifier')
ax.bar(x, ensemble_majority_scores, width, label='Ensemble (majority vote)')
ax.bar(x + width, ensemble_genetic_scores, width, label='Ensemble (genetic)')
ax.set_ylabel('Scores')
ax.set_xticks(x)
ax.set_xticklabels(metric_names)
ax.legend()
fig.tight_layout()
plt.show()
The code above merges the link features with the label information, splits the data into test and training sets, trains and tests decision-tree, logistic-regression, and neural-network classifiers, combines them by majority voting and a genetic algorithm, computes the classification metrics, and visualizes the performance of the best base classifier alongside the ensembles.
Original source: https://www.cveoy.top/t/topic/o7fY. Copyright belongs to the author; do not repost or scrape.