
NEWS CLASSIFICATION 🗞️

This repository hosts a notebook featuring an in-depth analysis of several RNN models, together with a CNN and a Multinomial Naive Bayes model, along with an app deployment using Streamlit. The following models were evaluated (one of the architectures is sketched after the list):

  • Basic Keras Model
  • LSTM Model
  • LSTM GRU Model
  • LSTM Bidirectional Model
    • TextVectorization + Keras Embedding
    • Text_to_word_sequence + Word2Vec Embedding
  • Basic CNN Model
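
As a rough illustration, the bidirectional variant with TextVectorization might be assembled as below. This is a minimal sketch: the vocabulary size, sequence length, and layer widths are assumptions, not the notebook's exact settings.

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 10_000  # assumed vocabulary size
SEQ_LEN = 200        # assumed maximum sequence length

# Map raw strings to integer token sequences.
vectorizer = layers.TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_sequence_length=SEQ_LEN,
)
# vectorizer.adapt(train_texts)  # learn the vocabulary from the training text

model = tf.keras.Sequential([
    layers.Input(shape=(1,), dtype=tf.string),  # one raw news string per sample
    vectorizer,
    layers.Embedding(input_dim=VOCAB_SIZE, output_dim=128),
    layers.Bidirectional(layers.LSTM(64)),      # reads the sequence in both directions
    layers.Dense(1, activation="sigmoid"),      # binary fake/real output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```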

The dataset was downloaded from Kaggle and contains a set of fake and real news articles.

The app can be tested via this link.

👨‍💻 Tech Stack

Visual Studio Code · Jupyter Notebook · Python · Pandas · NumPy · Plotly · Matplotlib · scikit-learn · TensorFlow · Linux · Git · Streamlit

📐 Set Up

In the first stage, a set of helper functions was created to easily visualize the data analysis and modelling results (one of them is sketched after the list):

  • Plot Word Cloud: generates a word cloud for a specific label value and displays it in a subplot
  • Plot Confusion Matrix: plots a confusion matrix to visualize the classification results
  • Plot Precision/Recall Results: calculates the accuracy, precision, recall, and F1-score of a binary classification model and returns the results as a DataFrame
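
As an example, the precision/recall helper could look roughly like this (the function name and signature are assumptions; the notebook's actual helper may differ):

```python
import pandas as pd
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def calculate_results(y_true, y_pred, model_name="model"):
    """Return accuracy, precision, recall and F1 of a binary classifier as a DataFrame."""
    accuracy = accuracy_score(y_true, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary"
    )
    return pd.DataFrame(
        {"accuracy": [accuracy], "precision": [precision],
         "recall": [recall], "f1": [f1]},
        index=[model_name],
    )
```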

👨‍🔬 Data Analysis

The first approach was to analyze the dataset columns and their distributions. The dataset contains the following columns, split across two files (fake and true):

  • Title
  • Text
  • Subject
  • Date

After merging the datasets, we see that the labels are well balanced, with each class close to 50%, so there is no need for oversampling or undersampling. After removing 209 duplicate rows, the dataset contains 23,481 fake and 21,417 true news articles.
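
A minimal sketch of the merge and balance check (the file names follow the usual Kaggle layout and are an assumption):

```python
import pandas as pd

fake = pd.read_csv("Fake.csv").assign(label=0)  # fake news file
true = pd.read_csv("True.csv").assign(label=1)  # true news file

# Merge both files and drop exact duplicates.
df = pd.concat([fake, true], ignore_index=True).drop_duplicates()

# Roughly a 50/50 split, so no resampling is needed.
print(df["label"].value_counts(normalize=True))
```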

On the other hand, the subject column contains 8 topics, of which the 2 most popular contain only true news and the other 6 only fake news. This means that there is no mix of labels within subjects.

In the word clouds, Trump and US are among the most common words for both labels.
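
The word-cloud helper might be sketched as follows, reusing the merged `df` from the sketch above (the `wordcloud` package and the column names are assumptions):

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

def plot_wordcloud(df, label, ax):
    """Draw a word cloud for one label value on the given subplot axis."""
    text = " ".join(df.loc[df["label"] == label, "text"])
    ax.imshow(WordCloud(width=800, height=400, background_color="white").generate(text))
    ax.axis("off")

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
plot_wordcloud(df, 0, axes[0])  # fake news
plot_wordcloud(df, 1, axes[1])  # true news
plt.show()
```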

👨📶 Preprocessing

Alongside the data analysis, the following preprocessing steps were taken in order to create a clean dataset for the subsequent modelling step:

  • Removal of duplicated rows
  • Removal of rows with empty cells
  • Merging of the title and text columns into a single column
  • Cleaning of the dataframe: removal of punctuation, numbers, special characters, and stopwords, plus lemmatization

This cleaning again produced around 6,000 duplicate rows, which were removed, leaving a final dataset of 38,835 rows. A sketch of the cleaning pipeline follows.
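
A minimal sketch of that cleaning pipeline, assuming NLTK for stopwords and lemmatization (the notebook's exact steps and order may differ):

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_text(text: str) -> str:
    # Lowercase, then strip punctuation, numbers and special characters.
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    # Drop stopwords and lemmatize the remaining tokens.
    tokens = [lemmatizer.lemmatize(w) for w in text.split() if w not in stop_words]
    return " ".join(tokens)

# df["clean"] = (df["title"] + " " + df["text"]).apply(clean_text)  # column names assumed
```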

👨‍🔬 Modelling

The first approach was to train two PyTorch EfficientNet models (EffNetB0, EffNetB2) for 5 and 10 epochs, using the pretrained EffNetB0 model weights for the DataLoaders, in order to establish a baseline. The EffNetB2 with 10 epochs showed the best performance, above 93% on the test set.

↗️ Model Improvement

Then the EffNetB2 with 10 epochs was trained again, this time using the pretrained EffNetB2 model weights for the DataLoaders. This time an accuracy above 95% on the test set and above 93% on the validation set was achieved.

👏 App Deployment

The last step was to deploy the app with Streamlit, as linked above. The app can be tested with the available samples or with your own input.
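
A minimal Streamlit sketch of such an app (the model path is a hypothetical placeholder, and the model is assumed to include the vectorization layer so it accepts raw strings):

```python
import streamlit as st
import tensorflow as tf

st.title("News Classification 🗞️")

@st.cache_resource
def load_model():
    # Hypothetical artifact name; replace with the path of the trained model.
    return tf.keras.models.load_model("model.h5")

text = st.text_area("Paste a news article to classify")

if st.button("Classify") and text:
    # Sigmoid score: values near 1 mean "true", near 0 mean "fake".
    prob = float(load_model().predict([text])[0][0])
    st.write(("Real news" if prob >= 0.5 else "Fake news") + f" (score: {prob:.2f})")
```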