Clickbait Detection Project: Combining Datasets, Training a Bag-of-Words Classifier, and Evaluating Performance
This project focuses on building a clickbait detection system using machine learning techniques. It involves the following steps:
Problem 1: Data Preparation (5 pts)
- Reading the Data: Read the two clickbait datasets (positive and negative examples) from the provided GitHub repository links:
  - Positive Examples: https://github.com/pfrcks/clickbait-detection/blob/master/clickbait
  - Negative Examples: https://github.com/pfrcks/clickbait-detection/blob/master/not-clickbait
- Combining Datasets: Combine the positive and negative datasets into a single, shuffled dataset. Use the numpy.random.shuffle function to shuffle the data.
- Data Splitting: Split the combined dataset into training, validation, and testing sets with the following proportions: 72% training, 8% validation, and 20% testing. Alternatively, you can save each split as an index (list of row numbers) instead of creating separate datasets.
- Target Rate Calculation: Determine the 'target rate' for each dataset (training, validation, and testing). The target rate is the percentage of headlines labeled as clickbait in each set. Display these rates in your notebook.
Example Code (Problem 1)
import numpy as np
import pandas as pd
import urllib.request
# Each dataset file is plain text with one headline per line (no header row, no label column)
def read_headlines(url):
    with urllib.request.urlopen(url) as response:
        return [line.strip() for line in response.read().decode('utf-8').splitlines() if line.strip()]
# Read positive and negative datasets
pos_headlines = read_headlines('https://raw.githubusercontent.com/pfrcks/clickbait-detection/master/clickbait')
neg_headlines = read_headlines('https://raw.githubusercontent.com/pfrcks/clickbait-detection/master/not-clickbait')
# Label the examples: 1 = clickbait, 0 = not clickbait
pos_dataset = pd.DataFrame({'headline': pos_headlines, 'label': 1})
neg_dataset = pd.DataFrame({'headline': neg_headlines, 'label': 0})
# Combine datasets
dataset = pd.concat([pos_dataset, neg_dataset], ignore_index=True)
# Shuffle the dataset by permuting the row order with numpy.random.shuffle
indices = np.arange(len(dataset))
np.random.shuffle(indices)
dataset = dataset.iloc[indices].reset_index(drop=True)
# Split into train (72%), validation (8%), and test (20%) sets
train_size = int(len(dataset) * 0.72)
val_size = int(len(dataset) * 0.08)
train_dataset = dataset[:train_size]
val_dataset = dataset[train_size:train_size + val_size]
test_dataset = dataset[train_size + val_size:]
# Calculate target rates (share of headlines labeled clickbait in each split)
target_rate_train = train_dataset['label'].sum() / len(train_dataset)
target_rate_val = val_dataset['label'].sum() / len(val_dataset)
target_rate_test = test_dataset['label'].sum() / len(test_dataset)
print('Train Dataset Target Rate:', target_rate_train)
print('Validation Dataset Target Rate:', target_rate_val)
print('Test Dataset Target Rate:', target_rate_test)
Problem 2: Baseline Performance (10 pts - Answer in Blackboard)
- Trivial Baseline Classifier: Assume a simple classifier that labels every headline as clickbait. Calculate the precision, recall, and F1-score of this classifier on the test set (see the example code below). This baseline represents the performance achievable without using any model at all.
- Alternative Baseline: Consider whether another baseline classifier might perform better. This might involve labeling all headlines as non-clickbait or using a more sophisticated, yet still simple, approach.
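Example Code (Problem 2)
The following is a minimal sketch, assuming the test_dataset split (with its 0/1 'label' column) produced by the Problem 1 example above:
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score
y_test = test_dataset['label']
# Baseline 1: label every headline as clickbait (predict 1 everywhere)
all_positive = np.ones(len(y_test), dtype=int)
print('All-clickbait baseline precision:', precision_score(y_test, all_positive))
print('All-clickbait baseline recall:', recall_score(y_test, all_positive))
print('All-clickbait baseline F1-score:', f1_score(y_test, all_positive))
# Baseline 2: label every headline as non-clickbait (predict 0 everywhere);
# precision is undefined here, so zero_division=0 reports it as 0 without a warning
all_negative = np.zeros(len(y_test), dtype=int)
print('All-non-clickbait baseline precision:', precision_score(y_test, all_negative, zero_division=0))
print('All-non-clickbait baseline recall:', recall_score(y_test, all_negative))
print('All-non-clickbait baseline F1-score:', f1_score(y_test, all_negative, zero_division=0))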
Answer to Problem 2 in Blackboard
- The trivial baseline classifier that labels all headlines as clickbait will have a precision equal to the actual proportion of clickbait headlines in the test set. Its recall will be 1 (since it flags every clickbait headline), and its F1-score, 2 * (precision * recall) / (precision + recall), works out to twice the proportion of clickbait headlines divided by (that proportion + 1). This baseline performs poorly because it never correctly identifies non-clickbait headlines.
- An alternative baseline classifier could label all headlines as non-clickbait. With clickbait as the positive class, this classifier makes no positive predictions, so its precision is undefined (scikit-learn reports it as 0 by default), its recall is 0, and its F1-score is 0. This baseline performs even worse than the previous one because it fails to identify any clickbait headlines.
- A more sophisticated, yet still simple, baseline classifier could use a rule-based approach. For example, it could flag headlines that contain certain words or phrases often associated with clickbait. However, the effectiveness of this baseline would depend heavily on the chosen rules and the quality of the dataset.
Problem 3: Training a Bag-of-Words Classifier (20 pts)
- Pipeline Creation: Use scikit-learn's Pipeline class to create a pipeline for training a Bag-of-Words (BOW) Naive Bayes classifier. Include both unigrams and bigrams in the vectorizer vocabulary (using the ngram_range parameter of CountVectorizer).
- Model Training: Fit the classifier on the training set.
- Performance Evaluation: Compute the precision, recall, and F1-score on both the training and validation sets using functions from sklearn.metrics. Display these metrics in your notebook. Use 'clickbait' as your target class (y = 1 for clickbait, y = 0 for non-clickbait).
Example Code (Problem 3)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import precision_score, recall_score, f1_score
# Create pipeline
pipeline = Pipeline([
('vectorizer', CountVectorizer(ngram_range=(1, 2))),
('classifier', MultinomialNB())
])
# Fit the model on training data
pipeline.fit(train_dataset['headline'], train_dataset['label'])
# Make predictions on training and validation sets
train_predictions = pipeline.predict(train_dataset['headline'])
val_predictions = pipeline.predict(val_dataset['headline'])
# Calculate metrics for training and validation sets
train_precision = precision_score(train_dataset['label'], train_predictions)
train_recall = recall_score(train_dataset['label'], train_predictions)
train_f1 = f1_score(train_dataset['label'], train_predictions)
val_precision = precision_score(val_dataset['label'], val_predictions)
val_recall = recall_score(val_dataset['label'], val_predictions)
val_f1 = f1_score(val_dataset['label'], val_predictions)
print('Training Set Precision:', train_precision)
print('Training Set Recall:', train_recall)
print('Training Set F1-score:', train_f1)
print('Validation Set Precision:', val_precision)
print('Validation Set Recall:', val_recall)
print('Validation Set F1-score:', val_f1)
Problem 4: Hyperparameter Tuning (20 pts)
- Grid Search: Use the ParameterGrid class to perform a grid search by varying at least three parameters of your model (see the example code below):
  - max_df of your count vectorizer (threshold for filtering document frequency)
  - alpha, the smoothing parameter of your Naive Bayes model
  - One other parameter of your choice. This could be a non-numeric parameter such as ngram_range in CountVectorizer (to include or exclude bigrams).
- Evaluation: Display the precision, recall, and F1-score metrics on the validation set for each combination of hyperparameters. If your grid search is extensive (> 50 rows), you may limit the output to the highest and lowest performing results.
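Example Code (Problem 4)
The following is a minimal sketch, assuming the train_dataset and val_dataset splits from Problem 1; the specific values in param_grid are illustrative choices, not required settings:
import pandas as pd
from sklearn.model_selection import ParameterGrid
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import precision_score, recall_score, f1_score
# Hyperparameter grid: max_df and ngram_range for the vectorizer, alpha for Naive Bayes
param_grid = ParameterGrid({
    'vectorizer__max_df': [0.5, 0.8, 1.0],
    'vectorizer__ngram_range': [(1, 1), (1, 2)],
    'classifier__alpha': [0.1, 0.5, 1.0],
})
results = []
for params in param_grid:
    # Build a fresh pipeline, apply this parameter combination, and fit on the training set
    pipeline = Pipeline([
        ('vectorizer', CountVectorizer()),
        ('classifier', MultinomialNB()),
    ])
    pipeline.set_params(**params)
    pipeline.fit(train_dataset['headline'], train_dataset['label'])
    # Score the combination on the validation set
    val_predictions = pipeline.predict(val_dataset['headline'])
    results.append({
        **params,
        'precision': precision_score(val_dataset['label'], val_predictions),
        'recall': recall_score(val_dataset['label'], val_predictions),
        'f1': f1_score(val_dataset['label'], val_predictions),
    })
# Display every combination, best validation F1-score first
results_df = pd.DataFrame(results).sort_values('f1', ascending=False)
print(results_df)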
Problem 5: Model Selection (10 pts)
- Best Model Selection: Using the validation set metrics from the previous problem, select one model as your 'selected model'. You can choose based on the highest F1-score on the validation set, or use another approach based on your priorities.
- Test Set Evaluation: Apply the selected model to the test set and compute the precision, recall, and F1-score (see the example code below). Display these results in your notebook.
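Example Code (Problem 5)
The following is a minimal sketch that assumes the results_df table (sorted by validation F1-score) from the Problem 4 sketch above; adapt the parameter lookup if you stored your grid-search results differently:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import precision_score, recall_score, f1_score
# Take the hyperparameter combination with the highest validation F1-score
best_params = results_df.iloc[0]
selected_model = Pipeline([
    ('vectorizer', CountVectorizer(max_df=best_params['vectorizer__max_df'],
                                   ngram_range=best_params['vectorizer__ngram_range'])),
    ('classifier', MultinomialNB(alpha=best_params['classifier__alpha'])),
])
# Refit the selected model on the training set and evaluate it on the held-out test set
selected_model.fit(train_dataset['headline'], train_dataset['label'])
test_predictions = selected_model.predict(test_dataset['headline'])
print('Test Set Precision:', precision_score(test_dataset['label'], test_predictions))
print('Test Set Recall:', recall_score(test_dataset['label'], test_predictions))
print('Test Set F1-score:', f1_score(test_dataset['label'], test_predictions))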
Problem 6: Key Indicators (10 pts)
- Indicator Selection: Using the log-probabilities of the selected model, identify five words (or n-grams) that are strong clickbait indicators. These words should be good choices for filtering headlines based on a single word, without relying on a machine learning model. Show this list of keywords in your notebook.
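Example Code (Problem 6)
The following is a minimal sketch, assuming the selected_model pipeline from Problem 5 (with steps named 'vectorizer' and 'classifier'); ranking n-grams by the gap between their class log-probabilities is one reasonable way to surface strong indicators:
import numpy as np
vectorizer = selected_model.named_steps['vectorizer']
classifier = selected_model.named_steps['classifier']
# feature_log_prob_ has shape (n_classes, n_features); with labels 0/1, row 1 is the clickbait class
feature_names = vectorizer.get_feature_names_out()  # use get_feature_names() on older scikit-learn versions
log_ratio = classifier.feature_log_prob_[1] - classifier.feature_log_prob_[0]
# The five n-grams most strongly associated with clickbait
top_indices = np.argsort(log_ratio)[::-1][:5]
keywords = [feature_names[i] for i in top_indices]
print('Top clickbait indicators:', keywords)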
Problem 7: Regular Expressions (10 pts)
- Regular Expression: Write a regular expression that checks whether any of the keywords identified in the previous problem are present in a text. Your regular expression should be aware of word boundaries to avoid false positives. Use the Python re library to apply your function to the test set (using re.search); see the example code below.
- Rule-Based Classifier Evaluation: Calculate the precision and recall of this rule-based classifier. Display these results in your notebook.
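Example Code (Problem 7)
The following is a minimal sketch, assuming the keywords list from the Problem 6 sketch and the test_dataset split from Problem 1:
import re
from sklearn.metrics import precision_score, recall_score
# Build one pattern of the form \b(?:kw1|kw2|...)\b so keywords only match as whole words;
# re.escape protects any regex metacharacters inside a keyword
pattern = re.compile(r'\b(?:' + '|'.join(re.escape(kw) for kw in keywords) + r')\b', re.IGNORECASE)
# Rule-based classifier: predict clickbait (1) whenever the pattern matches the headline
rule_predictions = [1 if pattern.search(headline) else 0 for headline in test_dataset['headline']]
print('Rule-Based Precision:', precision_score(test_dataset['label'], rule_predictions))
print('Rule-Based Recall:', recall_score(test_dataset['label'], rule_predictions))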
Problem 8: Comparing Results (15 pts - Answer in Blackboard)
- Comparison: Compare the performance of the rule-based classifier (from Problem 7) with your machine learning solution (from Problem 5). Analyze which classifier achieved better metrics and explain the reasons behind their performance. How do both classifiers compare to the trivial baseline (from Problem 2)?
- Future Improvements: If you had more time to improve the clickbait detection solution, what would you explore? Discuss potential areas for improvement based on your results and understanding of the problem.
Answer to Problem 8 in Blackboard
- Compare the F1-scores of the machine learning classifier and the rule-based classifier to determine which model performs better. The model with the higher F1-score is considered better.
- Explain the reasons for the performance difference. Factors to consider include:
- Complexity: Machine learning models often capture more complex patterns in data, while rule-based classifiers rely on explicit rules. The effectiveness of each approach depends on the nature of the data and the complexity of the patterns involved.
- Data Size and Quality: Machine learning models generally benefit from larger and more diverse datasets, which allows them to learn more robust patterns. Rule-based classifiers might be more sensitive to data limitations.
- Keyword Selection: The quality of the keywords selected for the rule-based classifier significantly affects its performance. If the selected keywords are not representative of clickbait content or if they appear in non-clickbait headlines, the classifier's accuracy will be compromised.
- Compare the performance of both classifiers to the trivial baseline. The machine learning and rule-based classifiers should outperform the trivial baseline because they leverage information about clickbait patterns, even if imperfectly. However, the extent of improvement over the baseline depends on the effectiveness of the chosen approaches.
- For future improvements, explore these possibilities:
- More Data: Use a larger and more diverse dataset to train the machine learning model, potentially incorporating data from different sources or domains.
- Feature Engineering: Develop more informative features by incorporating linguistic analysis, sentiment analysis, or other relevant information beyond simple word counts.
- Advanced Models: Experiment with more sophisticated machine learning models, such as recurrent neural networks (RNNs) or transformer models, which are often more effective in handling sequential data like text.
- Ensemble Methods: Combine multiple models (e.g., using bagging or boosting techniques) to improve prediction accuracy and robustness.
- Human Feedback: Integrate human feedback into the system to refine the rules or improve the training data, leading to more accurate and relevant predictions.
Important Note: This response provides a comprehensive framework for the clickbait detection project. You'll need to implement the code, analyze the results, and provide the answers to Problem 2 and Problem 8 in Blackboard. Remember to adapt the code examples and analysis based on your specific implementation and observations.