Named Entity Recognition & Topic modeling of news articles

This project performs Named Entity Recognition (using spaCy and BERT) and topic modeling (using LDA and BERTopic) on real news articles sourced from Kaggle.

To access the live version of the app, click here.

Data

The data is fetched from Kaggle. First, we merge all the different datasets (business, education, etc.) into a master dataset called master_data.csv.

Then, we remove irrelevant columns and save the resulting data as display_data.csv. This is the data used for all the tasks from here on.
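The merge-and-trim step described above could look roughly like the sketch below. It is only an illustration: the `category` column, the `headline` and `link` column names, and the helper names `build_master` and `build_display` are assumptions, not taken from the project's data or code.

```python
import pandas as pd


def build_master(frames: dict) -> pd.DataFrame:
    """Stack per-category datasets into one frame, tagging each row with its category."""
    tagged = [df.assign(category=name) for name, df in frames.items()]
    return pd.concat(tagged, ignore_index=True)


def build_display(master: pd.DataFrame, drop_cols: list) -> pd.DataFrame:
    """Drop columns that are not needed downstream (silently skipped if absent)."""
    return master.drop(columns=drop_cols, errors="ignore")


# Tiny in-memory stand-ins for the Kaggle CSVs (hypothetical columns).
frames = {
    "business": pd.DataFrame({"headline": ["Markets rally"], "link": ["u1"]}),
    "education": pd.DataFrame({"headline": ["Schools reopen"], "link": ["u2"]}),
}
master = build_master(frames)
display_df = build_display(master, drop_cols=["link"])
```

With the real files, each frame would come from `pd.read_csv(...)`, and the two results would be written out with `master.to_csv("master_data.csv", index=False)` and `display_df.to_csv("display_data.csv", index=False)`.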

Dependencies

pandas to manipulate the data.
scikit-learn for machine learning tasks.
matplotlib for visualizations.
NLTK for NLP tasks like tokenization, removing stop words, etc.
spaCy which has pre-trained statistical models for common NLP tasks.
transformers which is a framework from HuggingFace🤗 that has pre-trained models for NLP tasks.
BERTopic which is a topic modeling framework built on top of Transformers by HuggingFace🤗.
Streamlit Community Cloud to deploy the app.

Methodology

First, we do exploratory data analysis on display_data.csv, which includes wordclouds.

You can follow notebook.ipynb for a detailed walkthrough.

Named Entity Recognition using spaCy

We import display_data.csv and carry out the following steps:

Preprocessing (lowercasing & removing special characters)
Use spaCy to find the different entity types (saving the output dataframe as entity_df.csv) and visualize them with a bar chart.
Use displacy to visualize the linguistic structure of the sentences of the original data.

For the detailed steps you can look at NER.ipynb.
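A minimal sketch of these steps follows. It assumes the small English model `en_core_web_sm` (the README does not say which spaCy model the notebook loads) and reads "removing special characters" as keeping only letters, digits, and whitespace. The function names are hypothetical.

```python
import re
from collections import Counter


def preprocess(text: str) -> str:
    """Lowercase the text and replace special characters with spaces."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def extract_entities(texts, model_name="en_core_web_sm"):
    """Return (entity text, entity label) pairs for every entity spaCy finds.

    spaCy is imported here so that preprocess() works without it installed.
    The model name is an assumption.
    """
    import spacy

    nlp = spacy.load(model_name)
    return [(ent.text, ent.label_) for doc in nlp.pipe(texts) for ent in doc.ents]


def count_entity_types(entities) -> Counter:
    """Count how often each entity label occurs (the input to a bar chart)."""
    return Counter(label for _, label in entities)


print(preprocess("Apple's Q3 revenue rose 5%!"))  # apple s q3 revenue rose 5
```

For the displacy step, `spacy.displacy.render(doc, style="dep")` draws the dependency parse of a processed `doc`, and `style="ent"` highlights the recognized entities instead.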

Named Entity Recognition using BERT by HuggingFace🤗

We import display_data.csv and carry out the following steps:

First, we load the pre-trained NER model called bert-large-cased-finetuned-conll03-english and apply it to the data.
Next, we save the output dataframe as entity_bert.csv.
Finally, we print a random observation and its respective entities.

For the detailed steps you can look at NER_BERT.ipynb.
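The steps above can be sketched with the `transformers` pipeline API. This assumes the checkpoint named above is the one published on the Hugging Face Hub as `dbmdz/bert-large-cased-finetuned-conll03-english`; the helper names and the output column names are hypothetical, and the sample output is written by hand to show the shape of the data rather than copied from a real run.

```python
def load_ner(model_name="dbmdz/bert-large-cased-finetuned-conll03-english"):
    """Build a token-classification pipeline (downloads the model on first use)."""
    from transformers import pipeline

    # aggregation_strategy="simple" merges word pieces into whole entities.
    return pipeline("ner", model=model_name, aggregation_strategy="simple")


def entities_to_rows(article_id, ner_output):
    """Flatten one article's pipeline output into rows for a dataframe or CSV."""
    return [
        {
            "article_id": article_id,
            "entity": ent["word"],
            "label": ent["entity_group"],
            "score": round(float(ent["score"]), 4),
        }
        for ent in ner_output
    ]


# Hand-written example of the aggregated pipeline output (not real model output).
sample_output = [
    {"entity_group": "ORG", "score": 0.99, "word": "Reuters", "start": 0, "end": 7},
    {"entity_group": "LOC", "score": 0.98, "word": "London", "start": 11, "end": 17},
]
rows = entities_to_rows(0, sample_output)
```

In practice, `ner = load_ner()` followed by `entities_to_rows(i, ner(text))` for each article would produce the rows, and `pd.DataFrame(rows).to_csv("entity_bert.csv", index=False)` would save them. Long articles may exceed BERT's 512-token input limit, so they may need to be split into shorter passages first.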

Topic modeling using Latent Dirichlet Allocation

We import display_data.csv and carry out the following steps:

First, we apply the LDA model and limit the number of topics to 15, so the model assigns each article to one of 15 topics (numbered from 1 to 15).
Next, we save the output as LDA_topics.csv.
Finally, we visualize the results as occurrences of the top 10 words, topic-wise articles, and a wordcloud for each topic.

For the detailed steps you can look at topicmodeling_LDA.ipynb.

Topic modeling using BERTopic

We import display_data.csv and carry out the following steps:

Preprocessing (lowercasing & removing special characters)
Training BERTopic & saving the results as bertopic_results.csv
Finally, we visualize the results as occurrences of the top 10 words, topic-wise articles, and a wordcloud for each topic.

For the detailed steps you can look at topicmodeling_BERTopic.ipynb.
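A hedged sketch of the BERTopic step is shown below, assuming default BERTopic settings (the notebook may configure the embedding model or clustering differently). The helper names are hypothetical, and the topic ids in the demo are written by hand to stand in for real model output.

```python
from collections import Counter


def fit_bertopic(texts):
    """Fit BERTopic with default settings and return (model, topic id per text).

    Imported lazily because BERTopic pulls in heavy dependencies.
    """
    from bertopic import BERTopic

    model = BERTopic()
    topics, _probs = model.fit_transform(texts)
    return model, list(topics)


def articles_per_topic(topics, include_outliers=False):
    """Count articles per topic; BERTopic labels outliers with topic -1."""
    counts = Counter(topics)
    if not include_outliers:
        counts.pop(-1, None)
    return dict(sorted(counts.items()))


# Hand-written topic ids standing in for fit_bertopic() output.
example_topics = [0, 1, 0, -1, 2, 0]
print(articles_per_topic(example_topics))  # {0: 3, 1: 1, 2: 1}
```

After fitting, `model.get_topic(k)` returns the top words of topic `k` with their scores, and `model.get_topic_info()` returns a summary dataframe, either of which could feed the top-words chart and the wordclouds.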

Run Locally

Clone the project

  git clone https://github.com/jaideep156/ner-topicmodeling

Go to the project directory

  cd ner-topicmodeling

Install dependencies

  pip install -r requirements.txt

Start the server

  streamlit run 1_🏠_Home.py

Deployment

The app is deployed on Streamlit Community Cloud, with 1_🏠_Home.py as the entry-point file.
