Clickbait Detection: Combining Datasets and Training a Bag-of-Words Classifier
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn import metrics
Read the datasets and attach labels (clickbait = 1, not clickbait = 0)
# The raw files are one headline per line with no header row, so read each as a
# single-column frame and add the label the later code expects in data['label']
positive_data = pd.read_csv('https://raw.githubusercontent.com/pfrcks/clickbait-detection/master/clickbait',
                            sep='\t', header=None, names=['text'])
positive_data['label'] = 1
negative_data = pd.read_csv('https://raw.githubusercontent.com/pfrcks/clickbait-detection/master/not-clickbait',
                            sep='\t', header=None, names=['text'])
negative_data['label'] = 0
Combine both datasets into a single dataset
data = pd.concat([positive_data, negative_data], ignore_index=True)
Shuffle the dataset
# np.random.shuffle(data.values) does not reliably reorder a mixed-dtype frame
# (.values can be a copy); DataFrame.sample returns a properly shuffled copy
data = data.sample(frac=1, random_state=42).reset_index(drop=True)
Split the dataset into train, test, and validation datasets
train_data, test_data, train_labels, test_labels = train_test_split(
    data['text'], data['label'], test_size=0.2, random_state=42)
train_data, val_data, train_labels, val_labels = train_test_split(
    train_data, train_labels, test_size=0.1, random_state=42)
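If the target rates computed below drift between splits, `train_test_split` accepts a `stratify` argument that keeps the positive rate identical across them. A minimal sketch on a made-up frame (the 80/20 toy labels here are purely illustrative, not the real datasets):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: 80 negative and 20 positive rows
toy = pd.DataFrame({'text': ['headline'] * 100,
                    'label': [0] * 80 + [1] * 20})

# stratify=toy['label'] preserves the 20% positive rate in both splits
tr_X, te_X, tr_y, te_y = train_test_split(
    toy['text'], toy['label'], test_size=0.2,
    random_state=42, stratify=toy['label'])

print(tr_y.mean(), te_y.mean())  # both splits keep a 0.20 positive rate
```

Without `stratify`, small splits of an imbalanced dataset can end up with noticeably different class balances, which is exactly what the target-rate printout below is checking for.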
Calculate the target rate of each dataset
test_target_rate = test_labels.mean()
train_target_rate = train_labels.mean()
val_target_rate = val_labels.mean()
print('Target rate of test dataset: {:.2f}%'.format(test_target_rate * 100))
print('Target rate of train dataset: {:.2f}%'.format(train_target_rate * 100))
print('Target rate of validation dataset: {:.2f}%'.format(val_target_rate * 100))
Create a pipeline for training a BOW Naive Bayes model
pipeline = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1, 2))),
    ('classifier', MultinomialNB())
])
Fit the classifier on the training set
pipeline.fit(train_data, train_labels)
Predict on the training set
train_predictions = pipeline.predict(train_data)
Compute precision, recall, and F1-score on the training set
train_precision = metrics.precision_score(train_labels, train_predictions)
train_recall = metrics.recall_score(train_labels, train_predictions)
train_f1_score = metrics.f1_score(train_labels, train_predictions)
Predict on the validation set
val_predictions = pipeline.predict(val_data)
Compute precision, recall, and F1-score on the validation set
val_precision = metrics.precision_score(val_labels, val_predictions)
val_recall = metrics.recall_score(val_labels, val_predictions)
val_f1_score = metrics.f1_score(val_labels, val_predictions)
print('Training set - Precision: {:.2f}, Recall: {:.2f}, F1-score: {:.2f}'.format(train_precision, train_recall, train_f1_score))
print('Validation set - Precision: {:.2f}, Recall: {:.2f}, F1-score: {:.2f}'.format(val_precision, val_recall, val_f1_score))
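The test split carved off earlier is deliberately left unscored above: it should be evaluated only once, after all model choices are final, using the same three metrics. A self-contained sketch of that last step, with toy headlines standing in for `train_data`/`test_data` (the tiny corpus here is invented, so the printed scores are not meaningful):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn import metrics

# Toy stand-ins for the real train/test splits (1 = clickbait)
train_X = ["you won't believe this trick", "10 shocking facts about cats",
           "senate passes budget bill", "scientists publish climate study"]
train_y = [1, 1, 0, 0]
test_X = ["you won't believe these shocking facts", "senate passes climate study"]
test_y = [1, 0]

pipe = Pipeline([('vectorizer', CountVectorizer(ngram_range=(1, 2))),
                 ('classifier', MultinomialNB())])
pipe.fit(train_X, train_y)

# Final, one-time evaluation on the held-out test set
preds = pipe.predict(test_X)
print('Test set - Precision: {:.2f}, Recall: {:.2f}, F1-score: {:.2f}'.format(
    metrics.precision_score(test_y, preds, zero_division=0),
    metrics.recall_score(test_y, preds),
    metrics.f1_score(test_y, preds)))
```

Keeping the test set out of every tuning decision (vocabulary, n-gram range, smoothing) is what makes its scores an honest estimate of performance on unseen headlines.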