Clickbait Detection with Python: A Text Classification Experiment
This homework assignment guides you through building a clickbait detection model using Python's powerful scikit-learn machine learning library. You will learn how to perform a text classification experiment, explore the intricacies of bag-of-words techniques, and gain practical experience with word embeddings.
Goals:
- To conduct a text classification experiment and analyze the results.
- To familiarize yourself with the scikit-learn machine learning library.
- Optional exercise: To gain practical experience with word embeddings.
Data:
You will be working with a dataset of clickbait headlines, designed to identify text intended to attract attention and encourage users to click links. The data has been curated by other NLP researchers on GitHub:
- Positive examples: https://github.com/pfrcks/clickbait-detection/blob/master/clickbait
- Negative examples: https://github.com/pfrcks/clickbait-detection/blob/master/not-clickbait
While reading research papers on clickbait detection is not mandatory for this assignment, you are encouraged to explore the original publication that shared this dataset.
Tools:
This assignment requires you to utilize functions from scikit-learn, an open-source Python library widely used by data scientists. You are expected to complete the assignment exclusively in Python; other programming languages will not be accepted.
Note on Model Performance:
The primary objective of this homework is to execute an NLP experiment and present your findings clearly to others (specifically, the course TAs). Different students will observe different results. Rest assured that obtaining slightly lower metrics than others will not impact your grade.
What to Submit:
Please submit the following through Blackboard:
- For Problems 1 and 3-7:
- Upload a Jupyter Notebook (.ipynb file) showcasing your work for both datasets. Include cell output to demonstrate your code's execution.
- Include a PDF copy of the same notebook (identical code and output).
- For Problems 2 and 8 (no-code Q&A problems):
- Provide your written answers directly in Blackboard along with your homework submission.
Problem 1 – Reading the Data (5 pts)
- Using Python, read in the two clickbait datasets (refer to the Data section) and merge them into a single, shuffled dataset. You can use the `numpy.random.shuffle` function for shuffling.
- Next, split your dataset into training, validation, and test sets. Use a 72% train, 8% validation, and 20% test split (equivalent to holding out a 20% test set, then splitting the remaining data 90%/10% between training and validation).
- You have the option to save each split as an index (list of row numbers) instead of creating separate datasets.
- Determine the 'target rate' for each of these three datasets. This refers to the percentage of each dataset labeled as clickbait. Display your findings in your notebook.
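The steps above can be sketched as follows. The headline lists here are tiny synthetic stand-ins; in your notebook, read the two downloaded files instead (the file names in the comment are assumptions, use whatever paths you saved them to):

```python
import numpy as np

# Illustrative sketch with synthetic stand-in data; in your notebook,
# read the real headline files instead, e.g.:
#   with open("clickbait") as f:
#       clickbait = [line.strip() for line in f if line.strip()]
clickbait = ["You Won't Believe What Happened Next"] * 50
not_clickbait = ["Senate passes annual budget bill"] * 50

texts = np.array(clickbait + not_clickbait)
labels = np.array([1] * len(clickbait) + [0] * len(not_clickbait))

# Shuffle an index array so texts and labels stay aligned.
np.random.seed(0)
order = np.arange(len(texts))
np.random.shuffle(order)
texts, labels = texts[order], labels[order]

# 72% train / 8% validation / 20% test.
n = len(texts)
n_train, n_val = int(0.72 * n), int(0.08 * n)
train_idx = np.arange(0, n_train)
val_idx = np.arange(n_train, n_train + n_val)
test_idx = np.arange(n_train + n_val, n)

# Target rate: fraction of each split labeled as clickbait.
for name, idx in [("train", train_idx), ("val", val_idx), ("test", test_idx)]:
    print(f"{name}: {labels[idx].mean():.2f}")
```

Keeping the splits as index arrays (as the problem allows) avoids copying the data and makes it easy to slice `texts` and `labels` consistently.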
Problem 2 – Baseline Performance (10 pts – Answer in Blackboard)
- Imagine you have a simple baseline classifier that flags every text presented to it as clickbait. What would the precision, recall, and F1-score be for this classifier on your test set? Do you believe there is another good baseline classifier that might achieve a higher F1-score?
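Although this problem requires no code, the arithmetic behind it is worth a sanity check: a classifier that flags everything has recall 1 by construction, and its precision equals the target rate. The target rate below is a placeholder, not a real result; substitute the value you computed in Problem 1.

```python
# Placeholder target rate -- substitute the value you computed in Problem 1.
target_rate = 0.4

# Flagging everything as clickbait catches every true positive (recall = 1),
# but precision collapses to the fraction of examples that really are clickbait.
precision = target_rate
recall = 1.0
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, round(f1, 3))
```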
Problem 3 – Training a Single Bag-of-Words (BOW) Text Classifier (20 pts)
- Use the scikit-learn pipelines module to construct a Pipeline for training a BOW naive Bayes model. We recommend using the `CountVectorizer` and `MultinomialNB` classes. Include both unigrams and bigrams in your model's vectorizer vocabulary (check the `ngram_range` parameter).
- Fit your classifier on the training set.
- Calculate precision, recall, and F1-score on both your training and validation datasets using functions from `sklearn.metrics`. Display your results in your notebook. Use 'clickbait' as your target class (i.e., y=1 for clickbait and y=0 for non-clickbait).
Alternative: If you are already familiar with Naive Bayes, feel free to select another bag-of-words classifier for this problem. Ensure your chosen method incorporates a way to select top features or key indicators, mapping them to words or n-grams within your vocabulary. This will enable you to complete the remaining problems.
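A minimal pipeline sketch, using a tiny made-up corpus in place of your real splits from Problem 1:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import precision_recall_fscore_support

# Tiny made-up corpus; swap in your train/validation splits from Problem 1.
X_train = ["You Won't Believe This Trick", "Senate passes budget bill",
           "10 Things Only Cat Owners Know", "Council approves road repairs"]
y_train = [1, 0, 1, 0]

# ngram_range=(1, 2) puts both unigrams and bigrams in the vocabulary.
pipe = Pipeline([
    ("vec", CountVectorizer(ngram_range=(1, 2))),
    ("clf", MultinomialNB()),
])
pipe.fit(X_train, y_train)

# Score with clickbait (y=1) as the positive class.
pred = pipe.predict(X_train)
p, r, f1, _ = precision_recall_fscore_support(
    y_train, pred, pos_label=1, average="binary")
print(p, r, f1)
```

The named pipeline steps ("vec", "clf") matter later: they are how the grid search in Problem 4 addresses each component's parameters.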
Problem 4 – Hyperparameter Tuning (20 pts)
- Using the `ParameterGrid` class, perform a small grid search where you modify at least three parameters of your model:
  - `max_df` for your count vectorizer (threshold to filter document frequency)
  - `alpha`, the smoothing parameter of your Naive Bayes model
  - One other parameter of your choice. This can be non-numeric; for example, you can compare a model with and without bigrams (see the `ngram_range` parameter of the `CountVectorizer` class).
- Show metrics for precision, recall, and F1-score on your validation set. If your grid search is very extensive (over 50 rows), you may limit the output to the highest and lowest results.
Alternative – If you opted for a method other than Naive Bayes in Problem 3, ensure it's clear which hyperparameters you tuned in Problem 4.
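One way to structure the grid search, again with placeholder data standing in for your real splits (your own loop should record precision and recall alongside F1):

```python
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score

# Placeholder splits -- use the real ones from Problem 1.
X_train = ["You Won't Believe This Trick", "Senate passes budget bill",
           "10 Things Only Cat Owners Know", "Council approves road repairs"]
y_train = [1, 0, 1, 0]
X_val, y_val = ["You Won't Believe These 10 Things"], [1]

# Pipeline parameters are addressed as "<step name>__<parameter>".
grid = ParameterGrid({
    "vec__max_df": [0.5, 1.0],
    "clf__alpha": [0.1, 1.0],
    "vec__ngram_range": [(1, 1), (1, 2)],
})

results = []
for params in grid:
    pipe = Pipeline([("vec", CountVectorizer()), ("clf", MultinomialNB())])
    pipe.set_params(**params)
    pipe.fit(X_train, y_train)
    score = f1_score(y_val, pipe.predict(X_val), pos_label=1, zero_division=0)
    results.append((params, score))

# Sort so the best configurations appear first.
for params, score in sorted(results, key=lambda r: -r[1]):
    print(round(score, 3), params)
```

This grid yields 2 × 2 × 2 = 8 configurations, well under the 50-row threshold, so you would show them all.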
Problem 5 – Model Selection (10 pts)
- Based on the validation-set metrics from the previous problem, choose one model as your final selected model. You can decide how to select this model; one approach is to choose the model that exhibits the highest F1-score on your validation set.
- Apply your chosen model to the test set and compute precision, recall, and F1. Display the results in your notebook.
Problem 6 – Key Indicators (10 pts)
- Using the log-probabilities of the model you selected in the preceding problem, identify five words that serve as strong Clickbait indicators. In other words, if you needed to filter headlines based on a single word without a machine learning model, these words would be excellent choices.
- Show this list of keywords in your notebook.
- You can decide how to handle bigrams (e.g., 'win big'); you may choose to ignore them and select only unigram vocabulary words as key indicators.
Problem 7 – Regular Expressions (10 pts)
- Your IT department has reached out to you because they've heard you can help them identify clickbait. They are interested in your machine learning model, but they need a solution today.
- Construct a regular expression that checks for the presence of any of the keywords you identified in the previous problem within a text. You should create a single regular expression that detects any of your top five keywords. Your regular expression should account for word boundaries in some way. For instance, the keyword 'win' should not be detected in the text 'Gas prices up in winter months.'
- Using the Python `re` library, apply your regular expression to your test set (see the `re.search` function). Determine the precision and recall of this classifier. Show your results in your notebook.
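A sketch of such a rule-based classifier; the keyword list and the two-headline "test set" here are hypothetical placeholders, to be replaced with your Problem 6 keywords and your real test split:

```python
import re
from sklearn.metrics import precision_score, recall_score

# Hypothetical top-5 keywords -- substitute the ones you found in Problem 6.
keywords = ["believe", "things", "trick", "you", "must"]

# \b enforces word boundaries, so e.g. "win" would not match inside "winter";
# re.escape guards against regex metacharacters appearing in a keyword.
pattern = re.compile(
    r"\b(?:" + "|".join(map(re.escape, keywords)) + r")\b", re.IGNORECASE)

# Placeholder test set -- apply this to your real test split instead.
texts = ["You Won't Believe This", "Gas prices up in winter months"]
labels = [1, 0]

preds = [1 if pattern.search(t) else 0 for t in texts]
print(precision_score(labels, preds, zero_division=0),
      recall_score(labels, preds, zero_division=0))
```

The non-capturing group `(?:...)` keeps the alternation inside the two `\b` anchors, so each keyword individually must match on word boundaries.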
Problem 8 – Comparing Results (15 pts – Answer in Blackboard)
- Compare your rule-based classifier from the previous problem with your machine learning solution. Which classifier demonstrated the best model metrics? Why do you think it performed better? How did both compare to your trivial baseline (Problem 2)?
- If you had more time to try to improve this clickbait detection solution, what would you explore? (There is no single correct answer to this question; review your results and come up with your own ideas.)