Clickbait Detection using Bag-of-Words Classifier
This project aims to build a machine learning model to detect clickbait headlines. We will use a Bag-of-Words (BOW) classifier, specifically a Naive Bayes model, to classify headlines as either 'clickbait' or 'not clickbait'. The project involves the following steps:
Problem 1 - Reading the Data (5 pts)
- Read two datasets containing clickbait and non-clickbait headlines from GitHub repositories:
- Positive Examples: 'https://github.com/pfrcks/clickbait-detection/blob/master/clickbait'
- Negative Examples: 'https://github.com/pfrcks/clickbait-detection/blob/master/not-clickbait'
- Combine both datasets into a single, shuffled dataset.
- Split the dataset into training, validation, and testing sets using a 72%-8%-20% split, respectively.
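The 72%-8%-20% split falls out of two nested `train_test_split` calls: first hold out 20% for testing, then take 10% of the remaining 80% (which is 8% of the whole) for validation. A minimal sketch, using a dummy 100-row frame in place of the real headline dataset:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Dummy frame standing in for the combined headline dataset
# (assumption: 100 rows so the split sizes come out as whole numbers).
data = pd.DataFrame({'text': [f'headline {i}' for i in range(100)]})

# First split: hold out 20% for testing.
train, test = train_test_split(data, test_size=0.2, random_state=42)
# Second split: 10% of the remaining 80% is 8% of the full dataset.
train, val = train_test_split(train, test_size=0.1, random_state=42)

print(len(train), len(val), len(test))  # 72 8 20
```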
Problem 3 - Training a Single Bag-of-Words (BOW) Text Classifier (20 pts)
- Use scikit-learn's Pipeline module to create a pipeline for training a BOW Naive Bayes model. We will use the CountVectorizer and MultinomialNB classes.
- Include both unigrams and bigrams in the vectorizer vocabulary (ngram_range parameter).
- Fit the classifier on the training set.
- Calculate precision, recall, and F1-score on both training and validation sets using functions from sklearn.metrics. Show the results in your notebook. Use 'clickbait' as the target class (y=1 for clickbait and y=0 for non-clickbait).
Problem 4 - Hyperparameter Tuning (20 pts)
- Use the ParameterGrid class to perform a grid search, varying at least three parameters:
  - max_df for the CountVectorizer (threshold for document frequency filtering)
  - alpha (smoothing) for the Naive Bayes model
  - One other parameter of your choice; for example, whether or not to include bigrams (ngram_range parameter in CountVectorizer).
- Show precision, recall, and F1-score metrics on the validation set. If your grid search is large, limit the output to the highest and lowest results.
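One way to keep a large grid's output manageable is to collect one row of metrics per configuration in a DataFrame, sort by validation F1, and print only the head and tail. A minimal sketch, using a tiny synthetic corpus as a stand-in for the real train/validation splits (the texts, labels, and grid values below are illustrative assumptions):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import ParameterGrid
from sklearn.metrics import f1_score

# Tiny synthetic corpus purely to illustrate the bookkeeping.
train_texts = ['you will not believe this trick', 'ten secrets doctors hate',
               'senate passes budget bill', 'stocks fall on rate fears'] * 5
train_labels = [1, 1, 0, 0] * 5
val_texts = ['this one trick will shock you', 'court rules on trade dispute']
val_labels = [1, 0]

grid = ParameterGrid({'max_df': [0.75, 1.0], 'alpha': [0.1, 1.0],
                      'ngram_range': [(1, 1), (1, 2)]})
rows = []
for params in grid:
    model = Pipeline([
        ('vectorizer', CountVectorizer(max_df=params['max_df'],
                                       ngram_range=params['ngram_range'])),
        ('classifier', MultinomialNB(alpha=params['alpha'])),
    ])
    model.fit(train_texts, train_labels)
    rows.append({**params,
                 'val_f1': f1_score(val_labels, model.predict(val_texts))})

results = pd.DataFrame(rows).sort_values('val_f1', ascending=False)
# For a large grid, show only the best and worst configurations.
print(results.head(3))
print(results.tail(3))
```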
Problem 5 - Model Selection (10 pts)
- Select a model based on the validation set metrics from the previous problem. You can choose the model with the highest F1-score on the validation set.
- Apply the selected model to the test set and compute precision, recall, and F1-score. Show the results in your notebook.
Problem 6 - Key Indicators (10 pts)
- Using the log-probabilities of the selected model, identify five words that are strong clickbait indicators. These words could be used to filter headlines based on a single word, without using a machine learning model.
- Show the list of keywords in your notebook. You can choose how to handle bigrams.
Problem 7 - Regular Expressions (10 pts)
- Write a regular expression that checks if any of the keywords from the previous problem are found in the text. The regular expression should consider word boundaries.
- Using the re library in Python, apply the regular expression to your test set. What is the precision and recall of this classifier? Show the results in your notebook.
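The word-boundary requirement is what keeps a keyword from matching inside a longer word. A minimal sketch with a hypothetical keyword list (in the assignment these would come from Problem 6):

```python
import re

# Hypothetical keywords, for illustration only.
keywords = ['secrets', 'trick', 'shock']

# re.escape guards against regex metacharacters inside a keyword; \b anchors
# each match at word boundaries, so 'trick' does not match inside 'tricky'.
pattern = re.compile(
    r'\b(?:' + '|'.join(re.escape(k) for k in keywords) + r')\b',
    flags=re.IGNORECASE)

print(bool(pattern.search('Ten SECRETS doctors hate')))  # True (case-insensitive)
print(bool(pattern.search('a tricky negotiation')))      # False (no full-word match)
```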
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.model_selection import ParameterGrid
import re
# PROBLEM 1 - Reading the data
clickbait_url = 'https://raw.githubusercontent.com/pfrcks/clickbait-detection/master/clickbait'
not_clickbait_url = 'https://raw.githubusercontent.com/pfrcks/clickbait-detection/master/not-clickbait'
# sep='\t' keeps each headline in one column even if it contains commas
clickbait_data = pd.read_csv(clickbait_url, sep='\t', header=None, names=['text'])
clickbait_data['label'] = 1  # positive class: clickbait
not_clickbait_data = pd.read_csv(not_clickbait_url, sep='\t', header=None, names=['text'])
not_clickbait_data['label'] = 0  # negative class: not clickbait
combined_data = pd.concat([clickbait_data, not_clickbait_data], ignore_index=True)
combined_data = combined_data.sample(frac=1, random_state=42).reset_index(drop=True)  # Shuffle the dataset reproducibly
# Splitting the dataset: 72% train, 8% validation, 20% test
train_data, test_data = train_test_split(combined_data, test_size=0.2, random_state=42)
train_data, val_data = train_test_split(train_data, test_size=0.1, random_state=42)  # 10% of the remaining 80% = 8%
# PROBLEM 3 - Training a single Bag-of-Words (BOW) Text Classifier
bow_pipeline = Pipeline([
('vectorizer', CountVectorizer(ngram_range=(1, 2))),
('classifier', MultinomialNB())
])
bow_pipeline.fit(train_data['text'], train_data['label'])
# Compute precision, recall, and F1-score on training set
train_predictions = bow_pipeline.predict(train_data['text'])
train_precision = precision_score(train_data['label'], train_predictions)
train_recall = recall_score(train_data['label'], train_predictions)
train_f1 = f1_score(train_data['label'], train_predictions)
# Compute precision, recall, and F1-score on validation set
val_predictions = bow_pipeline.predict(val_data['text'])
val_precision = precision_score(val_data['label'], val_predictions)
val_recall = recall_score(val_data['label'], val_predictions)
val_f1 = f1_score(val_data['label'], val_predictions)
print('Training set performance:')
print('Precision:', train_precision)
print('Recall:', train_recall)
print('F1-score:', train_f1)
print('Validation set performance:')
print('Precision:', val_precision)
print('Recall:', val_recall)
print('F1-score:', val_f1)
# PROBLEM 4 - Hyperparameter Tuning
parameters = {
'vectorizer__max_df': [0.5, 0.75, 1.0],
'classifier__alpha': [0.1, 0.5, 1.0],
'vectorizer__ngram_range': [(1, 1), (1, 2)]
}
grid = ParameterGrid(parameters)
best_model = None
best_f1 = 0.0
for params in grid:
model = Pipeline([
('vectorizer', CountVectorizer(max_df=params['vectorizer__max_df'], ngram_range=params['vectorizer__ngram_range'])),
('classifier', MultinomialNB(alpha=params['classifier__alpha']))
])
model.fit(train_data['text'], train_data['label'])
val_predictions = model.predict(val_data['text'])
val_f1 = f1_score(val_data['label'], val_predictions)
if val_f1 > best_f1:
best_model = model
best_f1 = val_f1
# PROBLEM 5 - Model selection
test_predictions = best_model.predict(test_data['text'])
test_precision = precision_score(test_data['label'], test_predictions)
test_recall = recall_score(test_data['label'], test_predictions)
test_f1 = f1_score(test_data['label'], test_predictions)
print('Test set performance:')
print('Precision:', test_precision)
print('Recall:', test_recall)
print('F1-score:', test_f1)
# PROBLEM 6 - Key Indicators
vectorizer = best_model.named_steps['vectorizer']
feature_names = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
classifier = best_model.named_steps['classifier']
log_probabilities = classifier.feature_log_prob_
# classes_ is [0, 1], so row 1 holds the clickbait log-probabilities.
# A word is a strong clickbait indicator when its log-probability under the
# clickbait class far exceeds its log-probability under the non-clickbait class.
log_odds = log_probabilities[1] - log_probabilities[0]
top_indices = np.argsort(log_odds)[::-1][:5]
clickbait_indicator_words = [feature_names[i] for i in top_indices]
print('Clickbait indicator words:')
print(clickbait_indicator_words)
# PROBLEM 7 - Regular expressions
# Escape each keyword and require word boundaries on both sides
keywords_regex = '|'.join(re.escape(word) for word in clickbait_indicator_words)
regex_pattern = r'\b(?:' + keywords_regex + r')\b'
test_data = test_data.copy()  # avoid SettingWithCopyWarning when adding a column
test_data['predicted_label'] = test_data['text'].apply(lambda x: bool(re.search(regex_pattern, x, flags=re.IGNORECASE)))
precision = precision_score(test_data['label'], test_data['predicted_label'])
recall = recall_score(test_data['label'], test_data['predicted_label'])
print('Precision:', precision)
print('Recall:', recall)
Please make sure that you have installed the necessary libraries: numpy, pandas, and scikit-learn (plus nltk only if you choose to remove stop words). This code performs the operations for the problems described above on the provided dataset and prints the relevant performance metrics. Make sure the 'label' column in your DataFrames encodes clickbait as 1 and non-clickbait as 0.