jaideep156/ner-topicmodeling

Named Entity Recognition & Topic modeling of news articles

This project performs Named Entity Recognition (using spaCy and BERT) and topic modeling (using LDA and BERTopic) on real news articles sourced from Kaggle.

To access the live version of the app, click here.

Data

The data is fetched from Kaggle. First, we merge all the different datasets (business, education, etc.) into a master dataset called master_data.csv.

Then, we remove irrelevant columns and save the resulting data as display_data.csv. This is the data used for all the tasks from here on.
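
As a rough illustration of the merge-then-trim flow described above, here is a minimal pandas sketch. The column names (headline, url, content, category), the helper build_datasets, and the tiny in-memory frames are all hypothetical stand-ins; the actual Kaggle files and the columns dropped in this project may differ.

```python
import pandas as pd


def build_datasets(category_frames, keep_columns):
    """Merge per-category frames into one master frame, then keep selected columns.

    `category_frames` maps a category name to its DataFrame (hypothetical layout).
    """
    # Tag each frame with its category so the origin survives the merge.
    tagged = [df.assign(category=name) for name, df in category_frames.items()]
    master = pd.concat(tagged, ignore_index=True)
    # Keep only the columns needed downstream; everything else is dropped.
    display = master[keep_columns].copy()
    return master, display


# Tiny in-memory stand-ins for the per-category Kaggle CSVs (assumed columns).
business = pd.DataFrame(
    {"headline": ["Markets rally"], "url": ["u1"], "content": ["Stocks rose."]}
)
education = pd.DataFrame(
    {"headline": ["New term begins"], "url": ["u2"], "content": ["Schools reopened."]}
)

master_data, display_data = build_datasets(
    {"business": business, "education": education},
    keep_columns=["headline", "content", "category"],
)
```

In the project itself, the two frames would then be written out with something like master_data.to_csv("master_data.csv", index=False) and display_data.to_csv("display_data.csv", index=False).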

Dependencies

  • pandas to manipulate the data.
  • scikit-learn for machine learning tasks.
  • matplotlib for visualizations.
  • NLTK for NLP tasks like tokenization, removing stop words, etc.
  • spaCy which has pre-trained statistical models for common NLP tasks.
  • transformers which is a framework from HuggingFace🤗 that has pre-trained models for NLP tasks.
  • BERTopic which is a topic modeling framework built on top of Transformers by HuggingFace🤗
  • streamlit cloud to deploy the app.

Methodology

First, we perform exploratory data analysis on display_data.csv, including wordclouds.

You can follow notebook.ipynb for a detailed walkthrough.

Named Entity Recognition using spaCy

We import display_data.csv and carry out the following steps:

  • Preprocessing (lowercasing & removing special characters)
  • Using spaCy to find out different entity types (saving the output dataframe as entity_df.csv) & visualizing them using a bar chart.
  • Use displacy to visualize the linguistic structure of the sentences of the original data.
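
The entity-type counting step above can be sketched roughly as follows. This is only an illustration: the helper count_entity_labels and the sample sentences are hypothetical, and a blank pipeline with a rule-based entity_ruler stands in for a pre-trained model so the example runs without a model download. The notebook presumably loads a pre-trained pipeline instead (for example spacy.load("en_core_web_sm"), which is an assumption here).

```python
from collections import Counter

import spacy


def count_entity_labels(nlp, texts):
    """Count how often each entity label (ORG, GPE, ...) occurs across texts."""
    counts = Counter()
    # nlp.pipe processes the texts as a stream, which is faster than one call per text.
    for doc in nlp.pipe(texts):
        counts.update(ent.label_ for ent in doc.ents)
    return counts


# Stand-in pipeline: a blank English tokenizer plus two hand-written patterns.
# A pre-trained model would replace these three statements in real use.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(
    [
        {"label": "ORG", "pattern": "Reuters"},
        {"label": "GPE", "pattern": "India"},
    ]
)

label_counts = count_entity_labels(
    nlp, ["Reuters reported from India.", "India responded."]
)
```

The resulting counter maps each label to its frequency, so it can be passed straight to a matplotlib bar chart or turned into a DataFrame before saving.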

For the detailed steps you can look at NER.ipynb.

Named Entity Recognition using BERT by HuggingFace🤗

We import display_data.csv; for the detailed steps you can look at NER_BERT.ipynb.

Topic modeling using Latent Dirichlet Allocation

We import display_data.csv and carry out the following steps:

  • First, I apply the LDA model and limit the number of topics to 15, so the model assigns each article to one of 15 topics (numbered from 1 to 15).
  • Next, I save the output as LDA_topics.csv.
  • Finally, I visualize the results as occurrences of the top 10 words, topic-wise articles, and their respective wordclouds.

For the detailed steps you can look at topicmodeling_LDA.ipynb.

Topic modeling using BERTopic

We import display_data.csv and carry out the following steps:

  • Preprocessing (lowercasing & removing special characters)
  • Training BERTopic & saving the results as bertopic_results.csv
  • Finally, I visualize the results as occurrences of the top 10 words, topic-wise articles, and their respective wordclouds.

For the detailed steps you can look at topicmodeling_BERTopic.ipynb.
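
The preprocessing step listed for both the spaCy and BERTopic workflows (lowercasing and removing special characters) might look like the sketch below. Treating everything other than ASCII letters, digits, and whitespace as a "special character" is an assumption; the notebooks may use a different rule. The function name preprocess is hypothetical.

```python
import re


def preprocess(text):
    """Lowercase text and replace special characters with spaces."""
    text = text.lower()
    # Assumed rule: anything that is not a-z, 0-9, or whitespace is "special".
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse the runs of whitespace left behind by the substitution.
    return re.sub(r"\s+", " ", text).strip()


cleaned = preprocess("Sensex up 2.5%, says RBI!")
# cleaned == "sensex up 2 5 says rbi"
```

Replacing special characters with a space rather than deleting them keeps adjacent words from being fused together (for example, "rate-cut" becomes "rate cut" rather than "ratecut").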

Run Locally

Clone the project

  git clone https://github.com/jaideep156/ner-topicmodeling

Go to the project directory

  cd ner-topicmodeling

Install dependencies

  pip install -r requirements.txt

Start the server

  streamlit run 1_🏠_Home.py

Deployment

The app is deployed on Streamlit Community Cloud, with 1_🏠_Home.py as the entry-point file.