Clickbait Detection Project

This project aims to build a clickbait detection model using Python and machine learning techniques. We'll start by combining two datasets containing examples of clickbait and non-clickbait headlines. Then, we'll train a Bag-of-Words (BOW) classifier, tune its hyperparameters, and evaluate its performance. Additionally, we'll investigate key indicators and create a rule-based classifier using regular expressions.

Problem 1 - Reading and Combining Data (5 pts)

import pandas as pd

# The source files are plain text with one headline per line and no header row.
# A tab separator (which does not occur in the headlines) keeps commas inside
# a headline from being split into extra columns.

# Read positive examples and label them
positive_data = pd.read_csv(
    'https://raw.githubusercontent.com/pfrcks/clickbait-detection/master/clickbait',
    sep='\t', header=None, names=['text'])
positive_data['label'] = 'clickbait'

# Read negative examples and label them
negative_data = pd.read_csv(
    'https://raw.githubusercontent.com/pfrcks/clickbait-detection/master/not-clickbait',
    sep='\t', header=None, names=['text'])
negative_data['label'] = 'not-clickbait'

# Combine the datasets, then shuffle with a fixed seed for reproducibility
combined_data = pd.concat([positive_data, negative_data], ignore_index=True)
combined_data = combined_data.sample(frac=1, random_state=42).reset_index(drop=True)

# Display the combined dataset
print(combined_data.head())
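Problems 3-5 refer to `train_data`, `validation_data`, and `test_data`, so the combined frame needs to be split before training. One way is a two-stage `train_test_split`; the sketch below uses a small stand-in frame with assumed `text`/`label` columns and a 70/15/15 split, neither of which the assignment mandates.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for combined_data; in the project this is the frame from Problem 1
# (columns assumed here: 'text' and 'label').
combined_data = pd.DataFrame({
    'text': [f'headline {i}' for i in range(10)],
    'label': ['clickbait', 'not-clickbait'] * 5,
})

# Carve off the test set first, then split the remainder into train/validation.
# Stratifying keeps the clickbait ratio roughly equal across the three splits.
train_val, test = train_test_split(
    combined_data, test_size=0.15, random_state=42,
    stratify=combined_data['label'])
train, validation = train_test_split(
    train_val, test_size=0.15 / 0.85, random_state=42,
    stratify=train_val['label'])

train_data, train_labels = train['text'], train['label']
validation_data, validation_labels = validation['text'], validation['label']
test_data, test_labels = test['text'], test['label']
```

The test set is held out until Problem 5; only the validation set is used for tuning.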

Problem 2 - Baseline Performance (10 pts - Answer in Blackboard)

  • Question: Assume a trivial baseline classifier that labels every text as clickbait. What is its precision, recall, and F1-score on the test set? Do you think there's another good baseline classifier with a higher F1-score?
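As a sanity check for this question: if a fraction p of the test set is clickbait, an always-clickbait classifier has precision p, recall 1, and F1 = 2p / (1 + p). A minimal sketch with toy labels (illustrative numbers, not the real test set):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy test labels (illustrative only): 3 clickbait out of 5, so p = 0.6
test_labels = ['clickbait', 'not-clickbait', 'clickbait',
               'clickbait', 'not-clickbait']

# The trivial baseline predicts 'clickbait' for every example
baseline_preds = ['clickbait'] * len(test_labels)

p = precision_score(test_labels, baseline_preds, pos_label='clickbait')   # 0.6
r = recall_score(test_labels, baseline_preds, pos_label='clickbait')      # 1.0
f1 = f1_score(test_labels, baseline_preds, pos_label='clickbait')         # 0.75
print(p, r, f1)
```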

Problem 3 - Training a Bag-of-Words (BOW) Text Classifier (20 pts)

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import precision_score, recall_score, f1_score

# Define the pipeline
pipeline = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1, 2))),
    ('classifier', MultinomialNB())
])

# Fit the classifier on the training set
pipeline.fit(train_data, train_labels)

# Calculate metrics on training and validation sets
train_preds = pipeline.predict(train_data)
validation_preds = pipeline.predict(validation_data)

train_precision = precision_score(train_labels, train_preds, pos_label='clickbait')
validation_precision = precision_score(validation_labels, validation_preds, pos_label='clickbait')

train_recall = recall_score(train_labels, train_preds, pos_label='clickbait')
validation_recall = recall_score(validation_labels, validation_preds, pos_label='clickbait')

train_f1 = f1_score(train_labels, train_preds, pos_label='clickbait')
validation_f1 = f1_score(validation_labels, validation_preds, pos_label='clickbait')

# Display results
print('Training Set - Precision:', train_precision)
print('Training Set - Recall:', train_recall)
print('Training Set - F1-score:', train_f1)
print('Validation Set - Precision:', validation_precision)
print('Validation Set - Recall:', validation_recall)
print('Validation Set - F1-score:', validation_f1)

Problem 4 - Hyperparameter Tuning (20 pts)

from sklearn.model_selection import ParameterGrid

# Define parameter grid
param_grid = {
    'vectorizer__max_df': [0.5, 0.75, 1.0],
    'classifier__alpha': [0.1, 1.0, 10.0],
    'vectorizer__ngram_range': [(1, 1), (1, 2)]
}

# Iterate through the grid
best_f1 = 0
best_params = None
for params in ParameterGrid(param_grid):
    pipeline.set_params(**params)
    pipeline.fit(train_data, train_labels)
    validation_preds = pipeline.predict(validation_data)
    f1 = f1_score(validation_labels, validation_preds, pos_label='clickbait')
    if f1 > best_f1:
        best_f1 = f1
        best_params = params

# Print the best parameters and F1-score
print('Best Parameters:', best_params)
print('Best F1-score:', best_f1)
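The manual loop above can also be expressed with scikit-learn's `GridSearchCV`, using a `PredefinedSplit` to pin the existing validation set as the single evaluation fold. A self-contained sketch with toy data (the texts and the reduced grid are placeholders, not the project's real data):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stand-ins; in the project, use the real train/validation texts and labels
train_texts = ['you will not believe this', 'shocking trick revealed',
               'senate passes budget bill', 'court rules on appeal']
train_y = ['clickbait', 'clickbait', 'not-clickbait', 'not-clickbait']
val_texts = ['this trick will shock you', 'parliament debates new bill']
val_y = ['clickbait', 'not-clickbait']

# test_fold: -1 = always in training, 0 = member of the single validation fold
X = train_texts + val_texts
y = train_y + val_y
test_fold = [-1] * len(train_texts) + [0] * len(val_texts)

pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB()),
])
param_grid = {
    'vectorizer__ngram_range': [(1, 1), (1, 2)],
    'classifier__alpha': [0.1, 1.0],
}

# Score with F1 on the clickbait class, matching the manual loop above
scorer = make_scorer(f1_score, pos_label='clickbait')
search = GridSearchCV(pipeline, param_grid, scoring=scorer,
                      cv=PredefinedSplit(test_fold))
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

With `refit=True` (the default), `search.best_estimator_` is retrained on all the data passed to `fit`, so in the project you may prefer to refit manually on the training set alone.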

Problem 5 - Model Selection (10 pts)

# Select the hyperparameters with the highest F1-score on the validation set,
# then re-fit the pipeline before touching the test set (set_params alone does
# not retrain the model)
selected_model = pipeline.set_params(**best_params)
selected_model.fit(train_data, train_labels)

# Apply the selected model to the test set (used only once, for the final report)
test_preds = selected_model.predict(test_data)

# Calculate metrics on the test set
test_precision = precision_score(test_labels, test_preds, pos_label='clickbait')
test_recall = recall_score(test_labels, test_preds, pos_label='clickbait')
test_f1 = f1_score(test_labels, test_preds, pos_label='clickbait')

# Display test set results
print('Test Set - Precision:', test_precision)
print('Test Set - Recall:', test_recall)
print('Test Set - F1-score:', test_f1)

Problem 6 - Key Indicators (10 pts)

import numpy as np

# Access the fitted vectorizer and Naive Bayes model from the pipeline
vectorizer = selected_model.named_steps['vectorizer']
nb_model = selected_model.named_steps['classifier']

# get_feature_names_out() returns features in the same order as the model's
# coefficients (iterating vocabulary_.keys() does not)
feature_names = vectorizer.get_feature_names_out()

# Find the row of log-probabilities that belongs to the clickbait class rather
# than assuming it is at index 1
clickbait_idx = list(nb_model.classes_).index('clickbait')
log_probs = nb_model.feature_log_prob_[clickbait_idx]

# Get the 5 features with the highest log-probability under the clickbait class
top_keywords = np.argsort(log_probs)[-5:]
key_indicators = [feature_names[i] for i in top_keywords]

# Print the key indicators
print('Top 5 Key Indicators:', key_indicators)

Problem 7 - Regular Expressions (10 pts)

import re

# Escape the keywords (some may be bigrams or contain regex metacharacters)
# and join them into a single alternation
regex = r'\b(?:' + '|'.join(re.escape(k) for k in key_indicators) + r')\b'

# Predict the same string labels as the rest of the project, matching
# case-insensitively
regex_preds = ['clickbait' if re.search(regex, text, flags=re.IGNORECASE)
               else 'not-clickbait'
               for text in test_data]

# Calculate precision and recall against the same positive label as before
regex_precision = precision_score(test_labels, regex_preds, pos_label='clickbait')
regex_recall = recall_score(test_labels, regex_preds, pos_label='clickbait')

# Display results
print('Regex Classifier - Precision:', regex_precision)
print('Regex Classifier - Recall:', regex_recall)

Problem 8 - Comparing Results (15 pts - Answer in Blackboard)

  • Question: Compare the rule-based classifier and the machine learning solution. Which one performed better? Why do you think it performed better? How did they both compare to the trivial baseline (Problem 2)?

Further Exploration:

If you had more time to improve this clickbait detection solution, you could explore:

  • Trying different classifiers like SVM or Random Forest
  • Adjusting hyperparameters like ngram_range, max_df, and alpha
  • Pre-processing text data more thoroughly, including removing stop words and stemming
  • Using more complex feature engineering methods like TF-IDF or word embeddings
  • Collecting more data to improve the model's generalization ability
  • Employing ensemble learning techniques like voting classifiers or stacking models to enhance performance
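As a starting point for the TF-IDF idea above, the Problem 3 pipeline only needs its vectorizer swapped. The sketch below uses toy texts (illustrative only; in the project you would fit on the real training split):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Same pipeline shape as Problem 3, with TF-IDF weighting instead of raw
# counts and English stop words removed
tfidf_pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer(ngram_range=(1, 2), stop_words='english')),
    ('classifier', MultinomialNB()),
])

# Toy data (illustrative only)
texts = ['you will not believe this trick', 'senate passes budget bill']
labels = ['clickbait', 'not-clickbait']
tfidf_pipeline.fit(texts, labels)
print(tfidf_pipeline.predict(['this one trick will shock you']))
```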