Clickbait Detection: Combining Datasets and Training a Bag-of-Words Classifier
This project explores clickbait detection using a combination of machine learning and rule-based approaches. We will start by combining two datasets containing positive and negative examples of clickbait headlines, then train a Bag-of-Words classifier to identify clickbait headlines.
Problem 1 - Reading the Data (5 pts)
- Using Python, read in the 2 clickbait datasets:
- Positive Examples: 'https://github.com/pfrcks/clickbait-detection/blob/master/clickbait'
- Negative Examples: 'https://github.com/pfrcks/clickbait-detection/blob/master/not-clickbait'
- Combine both datasets into a single, shuffled dataset.
- Split your dataset into train, test, and validation datasets using a split of 72% train, 8% validation, and 20% test.
- Calculate the 'target rate' (percentage of clickbait headlines) for each dataset.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from urllib.request import urlopen

# Read datasets (one headline per line; pandas' default comma separator would
# split headlines containing commas, and sep='\n' is rejected by recent pandas
# versions, so read the raw lines directly)
clickbait_url = 'https://raw.githubusercontent.com/pfrcks/clickbait-detection/master/clickbait'
not_clickbait_url = 'https://raw.githubusercontent.com/pfrcks/clickbait-detection/master/not-clickbait'

def read_headlines(url):
    lines = urlopen(url).read().decode('utf-8').splitlines()
    return pd.DataFrame({'text': [line for line in lines if line.strip()]})

clickbait_data = read_headlines(clickbait_url)
not_clickbait_data = read_headlines(not_clickbait_url)
# Add labels
clickbait_data['label'] = 1
not_clickbait_data['label'] = 0
# Combine datasets (ignore_index avoids duplicate index values after concat)
dataset = pd.concat([clickbait_data, not_clickbait_data], ignore_index=True)
# Shuffle dataset
dataset = dataset.sample(frac=1, random_state=42).reset_index(drop=True)
# Split dataset: 72% train, 8% validation, 20% test.
# The 28% holdout must be split 8/28 vs 20/28, not 10/90.
train, test_val = train_test_split(dataset, train_size=0.72, random_state=42)
validation, test = train_test_split(test_val, train_size=8/28, random_state=42)
# Calculate target rates
target_rate_test = test['label'].mean()
target_rate_validation = validation['label'].mean()
target_rate_train = train['label'].mean()
print('Target Rate - Test: {:.2f}%'.format(target_rate_test * 100))
print('Target Rate - Validation: {:.2f}%'.format(target_rate_validation * 100))
print('Target Rate - Train: {:.2f}%'.format(target_rate_train * 100))
Problem 2 - Trivial Baseline (5 pts)
- Describe a trivial baseline classifier for this problem.
- What would the precision, recall, and F1-score be for this classifier?
Answer: A trivial baseline classifier predicts the same label for every headline. A classifier that always predicts clickbait has a recall of 100% (it flags every actual clickbait headline) but a precision equal to the dataset's target rate t, since every non-clickbait headline is also flagged; its F1-score is therefore 2t/(1 + t). Conversely, a classifier that always predicts non-clickbait has a recall of 0%, and its precision is undefined because it makes no positive predictions (scikit-learn reports it as 0), giving an F1-score of 0.
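The always-clickbait baseline can be checked numerically. A minimal sketch with hypothetical toy labels at a 30% target rate:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy labels with a 30% target rate (hypothetical, for illustration only)
y_true = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])
y_pred = np.ones_like(y_true)  # the "always predict clickbait" baseline

precision = precision_score(y_true, y_pred)  # equals the target rate: 0.30
recall = recall_score(y_true, y_pred)        # 1.0 by construction
f1 = f1_score(y_true, y_pred)                # 2*0.3/(1 + 0.3) ≈ 0.46
print(precision, recall, f1)
```

The F1-score 2t/(1 + t) follows from plugging precision = t and recall = 1 into the harmonic mean.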
Problem 3 - Training a Single Bag-of-Words (BOW) Text Classifier (20 pts)
- Create a scikit-learn pipeline to train a BOW Naive Bayes model using `CountVectorizer` and `MultinomialNB`.
- Include both unigrams and bigrams in your model.
- Fit your classifier on the training set and calculate precision, recall, and F1-score on both the training and validation sets.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import precision_score, recall_score, f1_score
# Create pipeline
model = Pipeline([
('vectorizer', CountVectorizer(ngram_range=(1, 2))),
('classifier', MultinomialNB())
])
# Fit the model
model.fit(train['text'], train['label'])
# Evaluate on training set
train_predictions = model.predict(train['text'])
precision_train = precision_score(train['label'], train_predictions)
recall_train = recall_score(train['label'], train_predictions)
f1_train = f1_score(train['label'], train_predictions)
# Evaluate on validation set
validation_predictions = model.predict(validation['text'])
precision_validation = precision_score(validation['label'], validation_predictions)
recall_validation = recall_score(validation['label'], validation_predictions)
f1_validation = f1_score(validation['label'], validation_predictions)
print('Training Set Metrics:')
print('Precision: {:.2f}'.format(precision_train))
print('Recall: {:.2f}'.format(recall_train))
print('F1-score: {:.2f}'.format(f1_train))
print('\nValidation Set Metrics:')
print('Precision: {:.2f}'.format(precision_validation))
print('Recall: {:.2f}'.format(recall_validation))
print('F1-score: {:.2f}'.format(f1_validation))
Problem 4 - Hyperparameter Tuning (20 pts)
- Use `ParameterGrid` to perform a grid search where you vary at least three parameters:
  - `max_df` for the `CountVectorizer` (threshold to filter document frequency)
  - `alpha`, the smoothing parameter of the Naive Bayes model
  - One other parameter of your choice.
- Show the metrics (precision, recall, F1-score) on the validation set.
from sklearn.model_selection import GridSearchCV
# Define parameter grid
param_grid = {
'vectorizer__max_df': [0.5, 0.75, 1.0],
'classifier__alpha': [0.1, 1.0, 10.0],
'vectorizer__ngram_range': [(1, 1), (1, 2)]
}
# Create grid search object. GridSearchCV sweeps the same candidates that
# ParameterGrid would enumerate, scoring each with 5-fold CV; scoring='f1'
# matches the positive-class F1 used elsewhere in this notebook.
grid_search = GridSearchCV(model, param_grid, scoring='f1', cv=5)
# Fit grid search
grid_search.fit(train['text'], train['label'])
# Show best parameters
print('Best Parameters:', grid_search.best_params_)
print('Best F1-score:', grid_search.best_score_)
# cv_results_ is in grid order, not score order, so rank configurations explicitly
order = np.argsort(grid_search.cv_results_['mean_test_score'])[::-1]
print('\nTop Results:')
for i in order[:3]:
    print('Params:', grid_search.cv_results_['params'][i])
    print('F1-score:', grid_search.cv_results_['mean_test_score'][i])
print('\nBottom Results:')
for i in order[-3:]:
    print('Params:', grid_search.cv_results_['params'][i])
    print('F1-score:', grid_search.cv_results_['mean_test_score'][i])
# Metrics on the validation set for the refit best model
val_preds = grid_search.best_estimator_.predict(validation['text'])
print('\nValidation Precision: {:.2f}'.format(precision_score(validation['label'], val_preds)))
print('Validation Recall: {:.2f}'.format(recall_score(validation['label'], val_preds)))
print('Validation F1-score: {:.2f}'.format(f1_score(validation['label'], val_preds)))
Problem 5 - Model Selection (10 pts)
- Select one model from the grid search based on its performance on the validation set.
- Apply this model to the test set and calculate precision, recall, and F1-score.
# Select best model
best_model = grid_search.best_estimator_
# Evaluate on test set
test_predictions = best_model.predict(test['text'])
precision_test = precision_score(test['label'], test_predictions)
recall_test = recall_score(test['label'], test_predictions)
f1_test = f1_score(test['label'], test_predictions)
print('Test Set Metrics:')
print('Precision: {:.2f}'.format(precision_test))
print('Recall: {:.2f}'.format(recall_test))
print('F1-score: {:.2f}'.format(f1_test))
Problem 6 - Key Indicators (10 pts)
- Using the log-probabilities of the selected model, identify 5 words that are strong clickbait indicators.
# Get feature names
feature_names = best_model.named_steps['vectorizer'].get_feature_names_out()
# The largest log-probabilities within the clickbait class are dominated by
# common words ("the", "you", ...), so instead rank features by the gap between
# the two class log-probabilities: log P(w | clickbait) - log P(w | not-clickbait)
log_probs = best_model.named_steps['classifier'].feature_log_prob_
indicator_scores = log_probs[1] - log_probs[0]
# Sort scores in descending order
sorted_indices = np.argsort(indicator_scores)[::-1]
# Print top 5 key indicators
print('Top 5 Key Indicators:')
for i in range(5):
print(feature_names[sorted_indices[i]])
Problem 7 - Regular Expressions (10 pts)
- Create a regular expression to detect if any of the key indicators are found in a text, taking into account word boundaries.
- Apply this regular expression to the test set using the `re` library and calculate precision and recall.
import re
# Key indicator words (placeholders: substitute the indicators found in Problem 6)
keywords = ['keyword1', 'keyword2', 'keyword3', 'keyword4', 'keyword5']
# Regular expression with word boundaries; escape keywords in case any contain
# regex metacharacters, and match case-insensitively since the vectorizer lowercases
pattern = re.compile(r'\b(?:{})\b'.format('|'.join(map(re.escape, keywords))), re.IGNORECASE)
# Apply regular expression to test set
predictions = test['text'].apply(lambda text: int(bool(pattern.search(text))))
# Calculate precision and recall
precision = precision_score(test['label'], predictions)
recall = recall_score(test['label'], predictions)
print('Precision: {:.2f}'.format(precision))
print('Recall: {:.2f}'.format(recall))
Problem 8 - Comparing Results (15 pts)
- Compare the performance of the rule-based classifier and the machine learning solution, considering their metrics and strengths.
- Discuss how both approaches compare to the trivial baseline.
- If you had more time to improve the clickbait detection solution, what would you explore?
Answer:
- Metrics: compare precision, recall, and F1-score of both classifiers on the same test set. The keyword rules typically trade recall for precision: headlines containing one of the five indicators are usually clickbait, but clickbait that avoids those exact words is missed, so recall tends to be low. The BOW model weighs evidence from every unigram and bigram and usually achieves a better balance.
- Strengths: the rule-based classifier is fast, transparent, and trivial to deploy, but depends entirely on the selection of keywords. The machine learning approach is more complex but can learn more nuanced patterns directly from the data.
- Both approaches should clearly beat the trivial baseline (always predicting clickbait, with F1 = 2t/(1 + t) for target rate t); a classifier that does not clear this bar adds no value over guessing.
- With more time, worthwhile directions include different feature representations (e.g. TF-IDF weighting, Word2Vec embeddings), other algorithms (e.g. logistic regression, SVM, Random Forest), a broader hyperparameter search, more training data, and combining multiple models into a more robust ensemble.
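As one example of these directions, swapping raw counts for TF-IDF weighting is a one-line change to the Problem 3 pipeline. A minimal sketch on hypothetical toy headlines (the real experiment would reuse the train/validation splits from Problem 1):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy headlines standing in for the real training split
train_text = [
    "you won't believe what happened next",
    'senate passes annual budget bill',
    '10 things only 90s kids will remember',
    'stock markets close lower on rate fears',
]
train_label = [1, 0, 1, 0]

# Same pipeline shape as Problem 3, with TF-IDF weighting instead of raw counts
tfidf_model = Pipeline([
    ('vectorizer', TfidfVectorizer(ngram_range=(1, 2))),
    ('classifier', MultinomialNB()),
])
tfidf_model.fit(train_text, train_label)
preds = tfidf_model.predict(["10 things you won't believe"])
print(preds)
```

Because the pipeline interface is unchanged, the same `ParameterGrid` sweep from Problem 4 can be rerun on this variant to compare the two vectorizers fairly.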