Quora Insincere Questions Classification

Thinkful Final Capstone Project

Overview

Quora is a service that helps people learn from each other by asking and answering questions. A key challenge in providing this type of service is filtering out insincere questions. Quora is attempting to filter out toxic and divisive content to uphold its policy of “Be Nice, Be Respectful”.

Data Source

https://www.kaggle.com/c/quora-insincere-questions-classification/data

Goals

Identify and flag insincere questions using machine learning.
Maximize F1 score by accurately predicting whether a question is sincere or not.
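
Because sincere questions far outnumber insincere ones, accuracy alone can look good for a model that flags nothing, which is one reason to target F1 instead. The sketch below illustrates this with made-up counts (not dataset statistics). The helper name `f1_score` and the toy label lists are illustrative assumptions; label 1 stands for "insincere".

```python
def f1_score(y_true, y_pred):
    """F1 for the positive class (label 1 = insincere)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # sincere but flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # insincere but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (assumption): 94 sincere questions followed by 6 insincere ones.
y_true = [0] * 94 + [1] * 6

# A model that never flags anything is 94% accurate but has an F1 of 0.
never_flag = [0] * 100
accuracy_never_flag = sum(t == p for t, p in zip(y_true, never_flag)) / len(y_true)
f1_never_flag = f1_score(y_true, never_flag)

# A model with 2 false alarms that catches 4 of the 6 insincere questions.
some_flags = [1] * 2 + [0] * 92 + [1] * 4 + [0] * 2
f1_some_flags = f1_score(y_true, some_flags)

print(accuracy_never_flag, f1_never_flag, round(f1_some_flags, 3))
```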

Specializations

Advanced NLP
TensorFlow and Keras

Value of Solution

An accurate solution can help Quora develop more scalable methods to detect toxic and misleading content and combat online trolls.

This solution will help Quora uphold its policy of “Be Nice, Be Respectful”.

Baseline Models Used

Logistic Regression
Naive Bayes
XGBoost
Voting Classifier

Deep Learning Models Used

Convolutional Neural Network
- Self-Trained Embedding
- Google News Vectors
Long Short-Term Memory Network
- Google News Vectors

Notebooks

Data Exploration and Cleaning

This notebook is an exploratory analysis used to gain insight into the data.

Baseline Models with Downsampling

This notebook uses downsampling to address the class imbalance in the data, then trains several shallow learning baseline models.
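
As a rough illustration of downsampling, the sketch below randomly discards majority-class rows until both classes are the same size. It is a dependency-free sketch, not the notebook's code. The function name `downsample`, the toy rows, and the seed are assumptions, and the keys `question_text` and `target` are assumed to mirror the dataset's columns.

```python
import random


def downsample(rows, label_key="target", seed=42):
    """Randomly drop majority-class rows until both classes have equal size."""
    rng = random.Random(seed)
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    # Order the two groups by size so we know which one to shrink.
    minority, majority = sorted([positives, negatives], key=len)
    kept = rng.sample(majority, len(minority))  # sample without replacement
    balanced = minority + kept
    rng.shuffle(balanced)
    return balanced


# Toy data (assumption): 10 insincere rows out of 100.
rows = [
    {"question_text": f"question {i}", "target": 1 if i % 10 == 0 else 0}
    for i in range(100)
]
balanced = downsample(rows)
print(len(balanced))
```

The trade-off is that most of the majority-class examples are thrown away, so the model sees far less data than is available.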

Baseline Models with Upsampling

This notebook attempts to fix the generalization problem discovered in the previous notebook. We use upsampling and try different text pre-processing methods.
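
Upsampling takes the opposite approach: minority-class rows are repeated until the classes are balanced. The sketch below is a dependency-free illustration, not the notebook's code. The function name `upsample`, the toy rows, and the 80/20 split are assumptions. The validation rows are held out before upsampling so that duplicated rows cannot appear on both sides of the split.

```python
import random


def upsample(train_rows, label_key="target", seed=42):
    """Repeat minority-class rows (sampling with replacement) until balanced."""
    rng = random.Random(seed)
    positives = [r for r in train_rows if r[label_key] == 1]
    negatives = [r for r in train_rows if r[label_key] == 0]
    minority, majority = sorted([positives, negatives], key=len)
    if not minority:
        raise ValueError("cannot upsample: one class has no examples")
    extra = rng.choices(minority, k=len(majority) - len(minority))
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return balanced


# Toy data (assumption): 10 insincere rows out of 100.
rows = [
    {"question_text": f"question {i}", "target": 1 if i % 10 == 0 else 0}
    for i in range(100)
]

# Hold out validation rows first, then upsample only the training rows.
train_rows, valid_rows = rows[:80], rows[80:]
balanced_train = upsample(train_rows)
print(len(balanced_train), len(valid_rows))
```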

Topic Modeling with Gensim

This notebook uses Gensim’s LDA to model topics. We use the coherence score to find the optimal number of topics.

Self-Trained Embedding + CNN

In this notebook we use self-trained word embeddings in a CNN.

CNN + Google News Vectors

Here we use the same CNN as the previous kernel but add the pre-trained Google News word embeddings.

Stacking CNN

This kernel stacks multiple instances of our CNN model.

LSTM Trainable Embeddings

Here we allow the pre-trained embeddings to be updated during training and use an LSTM model. To date, this is the best-performing model.

Evaluation of Models

The best-performing model is the LSTM with pre-trained embeddings that continue to be updated during training. This was determined by comparing F1 scores. All neural network models, including the CNNs with self-trained and pre-trained embeddings, outperform the shallow learning scikit-learn and XGBoost models.

Production and Beyond

In a production environment, this model can be used to evaluate new questions as they are asked. When a user submits a new question on Quora, the model predicts whether the question is sincere. If the question is predicted to be sincere, it is posted online; if it is predicted to be insincere, the user is prompted to edit and resubmit it. To keep improving the model, new questions and their labels will be added to the training data and the model will be retrained on the updated data.
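
The submission flow described above could be sketched roughly as follows. Everything here is hypothetical: `review_question`, the 0.5 threshold, and the keyword-based `toy_model` are stand-ins for illustration, and a real deployment would call the trained LSTM instead.

```python
from typing import Callable


def review_question(
    text: str,
    predict_insincere: Callable[[str], float],
    threshold: float = 0.5,  # assumed cut-off; would be tuned on validation F1
) -> str:
    """Return 'publish' for sincere questions and 'revise' for insincere ones."""
    score = predict_insincere(text)  # estimated probability of being insincere
    return "revise" if score >= threshold else "publish"


def toy_model(text: str) -> float:
    """Stand-in scorer (assumption): flags questions containing listed words."""
    flagged = {"idiots", "stupid"}
    words = [w.strip("?.,!") for w in text.lower().split()]
    return 1.0 if any(w in flagged for w in words) else 0.0


decision_ok = review_question("How do vaccines work?", toy_model)
decision_bad = review_question("Why are people such idiots?", toy_model)
print(decision_ok, decision_bad)
```

Passing the scorer in as a function keeps the routing logic independent of the model, so a retrained model can be swapped in without changing the flow.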

