Clickbait Detection using Bag-of-Words Classifier
This project aims to build a machine learning model to detect clickbait headlines. We will use a Bag-of-Words (BOW) classifier, specifically a Naive Bayes model, to classify headlines as either 'clickbait' or 'not clickbait'. The project involves the following steps:
Problem 1 - Reading the Data (5 pts)
- Read two datasets containing clickbait and non-clickbait headlines from GitHub repositories:
- Positive Examples: 'https://github.com/pfrcks/clickbait-detection/blob/master/clickbait'
- Negative Examples: 'https://github.com/pfrcks/clickbait-detection/blob/master/not-clickbait'
- Combine both datasets into a single, shuffled dataset.
- Split the dataset into training, validation, and testing sets using a 72%-8%-20% split, respectively.
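The 72%-8%-20% split falls out of two nested `train_test_split` calls: first hold out 20% for testing, then take 10% of the remaining 80% (which is 8% of the whole) for validation. A minimal sketch, using a dummy 100-row frame in place of the real headline dataset:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Dummy frame standing in for the combined headline dataset
# (assumption: 100 rows so the split sizes come out as whole numbers).
data = pd.DataFrame({'text': [f'headline {i}' for i in range(100)]})

# First split: hold out 20% for testing.
train, test = train_test_split(data, test_size=0.2, random_state=42)
# Second split: 10% of the remaining 80% is 8% of the full dataset.
train, val = train_test_split(train, test_size=0.1, random_state=42)

print(len(train), len(val), len(test))  # 72 8 20
```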
Problem 3 - Training a Single Bag-of-Words (BOW) Text Classifier (20 pts)
- Use scikit-learn's Pipeline module to create a pipeline for training a BOW Naive Bayes model. We will use the CountVectorizer and MultinomialNB classes.
- Include both unigrams and bigrams in the vectorizer vocabulary (ngram_range parameter).
- Fit the classifier on the training set.
- Calculate precision, recall, and F1-score on both training and validation sets using functions from sklearn.metrics. Show the results in your notebook. Use 'clickbait' as the target class (y=1 for clickbait and y=0 for non-clickbait).
Problem 4 - Hyperparameter Tuning (20 pts)
- Use the ParameterGrid class to perform a grid search, varying at least three parameters:
  - max_df for the CountVectorizer (threshold for document frequency filtering)
  - alpha (smoothing) for the Naive Bayes model
  - One other parameter of your choice; for example, whether or not to include bigrams (ngram_range parameter in CountVectorizer).
- Show precision, recall, and F1-score metrics on the validation set. If your grid search is large, limit the output to the highest and lowest results.
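One way to keep a large grid's output manageable is to collect one row of metrics per configuration in a DataFrame, sort by validation F1, and print only the head and tail. A minimal sketch, using a tiny synthetic corpus as a stand-in for the real train/validation splits (the texts, labels, and grid values below are illustrative assumptions):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import ParameterGrid
from sklearn.metrics import f1_score

# Tiny synthetic corpus purely to illustrate the bookkeeping.
train_texts = ['you will not believe this trick', 'ten secrets doctors hate',
               'senate passes budget bill', 'stocks fall on rate fears'] * 5
train_labels = [1, 1, 0, 0] * 5
val_texts = ['this one trick will shock you', 'court rules on trade dispute']
val_labels = [1, 0]

grid = ParameterGrid({'max_df': [0.75, 1.0], 'alpha': [0.1, 1.0],
                      'ngram_range': [(1, 1), (1, 2)]})
rows = []
for params in grid:
    model = Pipeline([
        ('vectorizer', CountVectorizer(max_df=params['max_df'],
                                       ngram_range=params['ngram_range'])),
        ('classifier', MultinomialNB(alpha=params['alpha'])),
    ])
    model.fit(train_texts, train_labels)
    rows.append({**params,
                 'val_f1': f1_score(val_labels, model.predict(val_texts))})

results = pd.DataFrame(rows).sort_values('val_f1', ascending=False)
# For a large grid, show only the best and worst configurations.
print(results.head(3))
print(results.tail(3))
```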
Problem 5 - Model Selection (10 pts)
- Select a model based on the validation set metrics from the previous problem. You can choose the model with the highest F1-score on the validation set.
- Apply the selected model to the test set and compute precision, recall, and F1-score. Show the results in your notebook.
Problem 6 - Key Indicators (10 pts)
- Using the log-probabilities of the selected model, identify five words that are strong clickbait indicators. These words could be used to filter headlines based on a single word, without using a machine learning model.
- Show the list of keywords in your notebook. You can choose how to handle bigrams.
Problem 7 - Regular Expressions (10 pts)
- Write a regular expression that checks if any of the keywords from the previous problem are found in the text. The regular expression should consider word boundaries.
- Using the re library in Python, apply the regular expression to your test set. What is the precision and recall of this classifier? Show the results in your notebook.
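The word-boundary requirement is what keeps a keyword from matching inside a longer word. A minimal sketch with a hypothetical keyword list (in the assignment these would come from Problem 6):

```python
import re

# Hypothetical keywords, for illustration only.
keywords = ['secrets', 'trick', 'shock']

# re.escape guards against regex metacharacters inside a keyword; \b anchors
# each match at word boundaries, so 'trick' does not match inside 'tricky'.
pattern = re.compile(
    r'\b(?:' + '|'.join(re.escape(k) for k in keywords) + r')\b',
    flags=re.IGNORECASE)

print(bool(pattern.search('Ten SECRETS doctors hate')))  # True (case-insensitive)
print(bool(pattern.search('a tricky negotiation')))      # False (no full-word match)
```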
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.model_selection import ParameterGrid
import re
# PROBLEM 1 - Reading the data
clickbait_url = 'https://raw.githubusercontent.com/pfrcks/clickbait-detection/master/clickbait'
not_clickbait_url = 'https://raw.githubusercontent.com/pfrcks/clickbait-detection/master/not-clickbait'
# sep='\t' keeps each headline in one column even if it contains commas
clickbait_data = pd.read_csv(clickbait_url, sep='\t', header=None, names=['text'])
clickbait_data['label'] = 1  # positive class: clickbait
not_clickbait_data = pd.read_csv(not_clickbait_url, sep='\t', header=None, names=['text'])
not_clickbait_data['label'] = 0  # negative class: not clickbait
combined_data = pd.concat([clickbait_data, not_clickbait_data], ignore_index=True)
combined_data = combined_data.sample(frac=1, random_state=42).reset_index(drop=True)  # Shuffle the dataset reproducibly
# Splitting the dataset: 72% train, 8% validation, 20% test
train_data, test_data = train_test_split(combined_data, test_size=0.2, random_state=42)
train_data, val_data = train_test_split(train_data, test_size=0.1, random_state=42)  # 10% of the remaining 80% = 8%
# PROBLEM 3 - Training a single Bag-of-Words (BOW) Text Classifier
bow_pipeline = Pipeline([
('vectorizer', CountVectorizer(ngram_range=(1, 2))),
('classifier', MultinomialNB())
])
bow_pipeline.fit(train_data['text'], train_data['label'])
# Compute precision, recall, and F1-score on training set
train_predictions = bow_pipeline.predict(train_data['text'])
train_precision = precision_score(train_data['label'], train_predictions)
train_recall = recall_score(train_data['label'], train_predictions)
train_f1 = f1_score(train_data['label'], train_predictions)
# Compute precision, recall, and F1-score on validation set
val_predictions = bow_pipeline.predict(val_data['text'])
val_precision = precision_score(val_data['label'], val_predictions)
val_recall = recall_score(val_data['label'], val_predictions)
val_f1 = f1_score(val_data['label'], val_predictions)
print('Training set performance:')
print('Precision:', train_precision)
print('Recall:', train_recall)
print('F1-score:', train_f1)
print('Validation set performance:')
print('Precision:', val_precision)
print('Recall:', val_recall)
print('F1-score:', val_f1)
# PROBLEM 4 - Hyperparameter Tuning
parameters = {
'vectorizer__max_df': [0.5, 0.75, 1.0],
'classifier__alpha': [0.1, 0.5, 1.0],
'vectorizer__ngram_range': [(1, 1), (1, 2)]
}
grid = ParameterGrid(parameters)
best_model = None
best_f1 = 0.0
for params in grid:
model = Pipeline([
('vectorizer', CountVectorizer(max_df=params['vectorizer__max_df'], ngram_range=params['vectorizer__ngram_range'])),
('classifier', MultinomialNB(alpha=params['classifier__alpha']))
])
model.fit(train_data['text'], train_data['label'])
val_predictions = model.predict(val_data['text'])
val_f1 = f1_score(val_data['label'], val_predictions)
if val_f1 > best_f1:
best_model = model
best_f1 = val_f1
# PROBLEM 5 - Model selection
test_predictions = best_model.predict(test_data['text'])
test_precision = precision_score(test_data['label'], test_predictions)
test_recall = recall_score(test_data['label'], test_predictions)
test_f1 = f1_score(test_data['label'], test_predictions)
print('Test set performance:')
print('Precision:', test_precision)
print('Recall:', test_recall)
print('F1-score:', test_f1)
# PROBLEM 6 - Key Indicators
vectorizer = best_model.named_steps['vectorizer']
feature_names = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
classifier = best_model.named_steps['classifier']
log_probabilities = classifier.feature_log_prob_
# classes_ is [0, 1], so row 1 holds the clickbait log-probabilities.
# A word is a strong clickbait indicator when its log-probability under the
# clickbait class far exceeds its log-probability under the non-clickbait class.
log_odds = log_probabilities[1] - log_probabilities[0]
top_indices = np.argsort(log_odds)[::-1][:5]
clickbait_indicator_words = [feature_names[i] for i in top_indices]
print('Clickbait indicator words:')
print(clickbait_indicator_words)
# PROBLEM 7 - Regular expressions
# Escape each keyword and require word boundaries on both sides
keywords_regex = '|'.join(re.escape(word) for word in clickbait_indicator_words)
regex_pattern = r'\b(?:' + keywords_regex + r')\b'
test_data = test_data.copy()  # avoid SettingWithCopyWarning when adding a column
test_data['predicted_label'] = test_data['text'].apply(lambda x: bool(re.search(regex_pattern, x, flags=re.IGNORECASE)))
precision = precision_score(test_data['label'], test_data['predicted_label'])
recall = recall_score(test_data['label'], test_data['predicted_label'])
print('Precision:', precision)
print('Recall:', recall)
Please make sure that you have installed the necessary libraries: numpy, pandas, and scikit-learn (plus nltk only if you choose to remove stop words). This code performs the operations for the problems described above on the provided dataset and prints the relevant performance metrics. Make sure the 'label' column in your DataFrames encodes clickbait as 1 and non-clickbait as 0.