This project performs Named Entity Recognition (using spaCy and BERT) and topic modeling (using LDA and BERTopic) on real news articles sourced from Kaggle.
To access the live version of the app, click here.
The data is fetched from Kaggle. First, we merge the different datasets (business, education, etc.) into a master dataset called master_data.csv.
Then, we remove irrelevant columns and save the resulting data as display_data.csv. This is the data that will be used for all the tasks from here on.
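The merge-and-trim step described above could look roughly like the sketch below. The column names (`headline`, `content`, `url`), the `category` tag, and the helper `build_master` are assumptions made for illustration; the actual Kaggle files may be shaped differently.

```python
import pandas as pd

# Hypothetical per-category datasets; the real ones come from Kaggle CSV files.
business = pd.DataFrame({
    "headline": ["Markets rally"],
    "content": ["Stocks rose sharply on Monday."],
    "url": ["https://example.com/a"],
})
education = pd.DataFrame({
    "headline": ["New school opens"],
    "content": ["A new school opened in the city."],
    "url": ["https://example.com/b"],
})


def build_master(frames: dict) -> pd.DataFrame:
    """Stack the per-category frames and record each row's category."""
    tagged = [df.assign(category=name) for name, df in frames.items()]
    return pd.concat(tagged, ignore_index=True)


master_data = build_master({"business": business, "education": education})

# Drop columns that are not needed downstream ("url" is an assumed example).
display_data = master_data.drop(columns=["url"])

# In the project these frames would be written to disk, for example:
# master_data.to_csv("master_data.csv", index=False)
# display_data.to_csv("display_data.csv", index=False)
```

Tagging each row with its source category before concatenating keeps that information available after the individual files are merged.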
- pandas to manipulate the data.
- scikit-learn for machine learning tasks.
- matplotlib for visualizations.
- NLTK for NLP tasks like tokenization, removing stop words, etc.
- spaCy which has pre-trained statistical models for common NLP tasks.
- transformers which is a framework from HuggingFace🤗 that has pre-trained models for NLP tasks.
- BERTopic which is a topic modeling framework built on top of Transformers by HuggingFace🤗
- streamlit cloud to deploy the app.
First, we do exploratory data analysis on display_data.csv, which includes wordclouds.
You can follow notebook.ipynb for a detailed walkthrough.
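A wordcloud is essentially a picture of word frequencies, so the counting behind it can be sketched with the standard library alone. The tiny stop-word set and the function name `word_frequencies` are illustrative assumptions; the notebook may well use NLTK's stop-word list and a dedicated wordcloud package instead.

```python
import re
from collections import Counter

# A tiny illustrative stop-word set; NLTK's English list is far longer.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is"}


def word_frequencies(texts):
    """Count how often each non-stop-word appears across all texts."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts


freqs = word_frequencies([
    "The economy is growing.",
    "Growth in the economy slowed.",
])
print(freqs.most_common(3))
```

A frequency mapping like `freqs` is the kind of input a wordcloud renderer typically sizes its words by.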
We import display_data.csv & carry out the steps below:
- Preprocessing (lowercasing & removing special characters)
- Using spaCy to find the different entity types (saving the output dataframe as entity_df.csv) & visualizing them using a bar chart.
- Using displacy to visualize the linguistic structure of the sentences of the original data.
For the detailed steps you can look at NER.ipynb.
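The preprocessing and entity-extraction steps above can be sketched as follows. This is a minimal sketch, not the notebook's code: the helper names are hypothetical, the exact characters treated as "special" are an assumption, and `en_core_web_sm` is an assumed model choice that has to be downloaded separately.

```python
import re
from collections import Counter


def clean_text(text: str) -> str:
    """Lowercase, replace non-alphanumeric characters with spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def extract_entities(texts, model_name="en_core_web_sm"):
    """Return (text_index, entity_text, entity_label) tuples found by spaCy."""
    import spacy  # imported lazily so the helpers above work without spaCy

    nlp = spacy.load(model_name)  # assumed model; install it before calling
    rows = []
    for i, doc in enumerate(nlp.pipe(texts)):
        for ent in doc.ents:
            rows.append((i, ent.text, ent.label_))
    return rows


def count_labels(rows):
    """Count entities per label, i.e. the numbers a bar chart would show."""
    return Counter(label for _, _, label in rows)


print(clean_text("Hello, World! 2023"))
```

Calling `count_labels(extract_entities(articles))` on a list of article strings gives the per-type counts to plot; the rows themselves are what would be saved as a dataframe.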
We import display_data.csv & carry out the steps below:
- First, we load the pre-trained NER model called bert-large-cased-finetuned-conll03-english & apply it to this data.
- Next, I save the output dataframe as entity_bert.csv
- Finally, I print a random observation & its respective entities.
For the detailed steps you can look at NER_BERT.ipynb.
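As a rough illustration of the steps above, the model can be loaded through the `transformers` token-classification pipeline. The hub id with the `dbmdz/` namespace, the `aggregation_strategy="simple"` setting, and the helper names are assumptions for this sketch; the notebook may load the model differently.

```python
def load_ner(model_id="dbmdz/bert-large-cased-finetuned-conll03-english"):
    """Build a token-classification pipeline (downloads the model on first use)."""
    from transformers import pipeline  # imported lazily; the model is large

    # "simple" aggregation merges word pieces into whole entity spans.
    return pipeline("ner", model=model_id, aggregation_strategy="simple")


def entities_to_rows(article_id, entities):
    """Flatten pipeline output into plain dicts suitable for a dataframe."""
    return [
        {
            "article_id": article_id,
            "entity": e["word"],
            "label": e["entity_group"],
            "score": float(e["score"]),
        }
        for e in entities
    ]


# Example usage (not run here because it downloads the model):
# ner = load_ner()
# rows = entities_to_rows(0, ner("Angela Merkel visited Paris."))
```

Rows collected this way for every article can be turned into a dataframe with `pandas.DataFrame(rows)` before saving.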
We import display_data.csv & carry out the steps below:
- First, I apply the LDA model & limit the topics to 15. The LDA model then assigns each article to one of 15 topics (numbered from 1 to 15).
- Next, I save the output as LDA_topics.csv
- Finally, I visualize it as occurrences of the top 10 words, topic-wise articles and their respective wordclouds.
For the detailed steps you can look at topicmodeling_LDA.ipynb.
We import display_data.csv & carry out the steps below:
- Preprocessing (lowercasing & removing special characters)
- Training BERTopic & saving the results as bertopic_results.csv
- Finally, I visualize it as occurrences of the top 10 words, topic-wise articles and their respective wordclouds.
For the detailed steps you can look at topicmodeling_BERTopic.ipynb.
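A minimal sketch of the BERTopic training step follows, assuming default settings. The helper names and the `article`/`topic` column names are hypothetical, and the notebook may configure the model differently.

```python
def fit_bertopic(docs):
    """Fit BERTopic with default settings and return the model and topic ids."""
    from bertopic import BERTopic  # imported lazily; pulls in embedding models

    topic_model = BERTopic()
    topics, _probs = topic_model.fit_transform(docs)
    return topic_model, topics


def build_results(docs, topics):
    """Pair each article with its topic id; BERTopic uses -1 for outliers."""
    return [{"article": d, "topic": int(t)} for d, t in zip(docs, topics)]


# Example usage (not run here; it needs a reasonably large corpus):
# model, topics = fit_bertopic(articles)
# results = build_results(articles, topics)
# model.get_topic_info()  # summary table with one row per topic
```

Unlike LDA, the number of topics is not fixed in advance here; with default settings BERTopic derives it from the clustering of document embeddings.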
Clone the project
git clone https://github.com/jaideep156/ner-topicmodeling
Go to the project directory
cd ner-topicmodeling
Install dependencies
pip install -r requirements.txt
Start the server
streamlit run 1_🏠_Home.py
The app is deployed on Streamlit Community Cloud, with 1_🏠_Home.py as the entry-point file.