- Case Study Chosen: Sentiment Analysis
- This repository contains notebooks for sentiment analysis
- Different approaches have been tried for the same problem, in order of increasing model complexity
- The approaches and a summary of the results are as follows:
- Bag of words + XGBoost Classifier
- Doc2vec + Neural Network
- Classifier using LSTM
- Transfer Learning Classifier using BERT
- This problem has class imbalance: negative (63%), neutral (21%) and positive (16%)
- Hence the chosen metric for comparison is the f1-score
- Based on f1-score, Bag of Words + XGBoost and the LSTM classifier perform well
- If accuracy is considered as well, Bag of Words + XGBoost is the clear winner
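Because the classes are imbalanced, the macro-averaged f1-score weights each class equally instead of letting the dominant negative class dominate. A minimal sketch with scikit-learn (the `y_true`/`y_pred` arrays are hypothetical, for illustration only; 0 = negative, 1 = neutral, 2 = positive):

```python
from sklearn.metrics import classification_report, f1_score

# Hypothetical labels and predictions for illustration
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 0, 2]

# Macro f1 averages the per-class f1 scores equally, so the
# minority neutral and positive classes count as much as negative
print(f1_score(y_true, y_pred, average="macro"))

# classification_report produces the per-class tables shown below
print(classification_report(y_true, y_pred))
```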
- Bag of words + XGBoost Classifier (0 = negative, 1 = neutral, 2 = positive):

|              | precision | recall | f1-score | support |
|--------------|-----------|--------|----------|---------|
| 0            | 0.77      | 0.92   | 0.84     | 1834    |
| 1            | 0.65      | 0.35   | 0.45     | 628     |
| 2            | 0.69      | 0.60   | 0.64     | 466     |
| accuracy     |           |        | 0.75     | 2928    |
| macro avg    | 0.70      | 0.62   | 0.64     | 2928    |
| weighted avg | 0.73      | 0.75   | 0.73     | 2928    |
- Doc2vec + Neural Network:

|              | precision | recall | f1-score | support |
|--------------|-----------|--------|----------|---------|
| 0            | 0.63      | 0.99   | 0.77     | 916     |
| 1            | 0.22      | 0.01   | 0.01     | 312     |
| 2            | 0.00      | 0.00   | 0.00     | 236     |
| accuracy     |           |        | 0.62     | 1464    |
| macro avg    | 0.28      | 0.33   | 0.26     | 1464    |
| weighted avg | 0.44      | 0.62   | 0.48     | 1464    |
- Classifier using LSTM:

|              | precision | recall | f1-score | support |
|--------------|-----------|--------|----------|---------|
| 0            | 0.82      | 0.85   | 0.83     | 916     |
| 1            | 0.54      | 0.63   | 0.58     | 312     |
| 2            | 0.67      | 0.44   | 0.54     | 236     |
| accuracy     |           |        | 0.74     | 1464    |
| macro avg    | 0.68      | 0.64   | 0.65     | 1464    |
| weighted avg | 0.74      | 0.74   | 0.73     | 1464    |
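The LSTM classifier's architecture can be sketched in Keras: an embedding layer over a tokenized vocabulary, an LSTM, and a softmax over the three sentiment classes. The vocabulary size, sequence length, and layer widths below are assumed, not taken from the repository:

```python
import numpy as np
from tensorflow.keras import layers, models

VOCAB_SIZE = 5000   # assumed tokenizer vocabulary size
MAX_LEN = 40        # assumed padded sequence length

# Embedding -> LSTM -> softmax over the 3 sentiment classes
model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, 64),
    layers.LSTM(64),
    layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Forward pass on dummy padded token ids to check the output shape
dummy = np.zeros((2, MAX_LEN), dtype="int32")
probs = model(dummy)
print(probs.shape)
```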
- Transfer Learning Classifier using BERT:

|              | precision | recall | f1-score | support |
|--------------|-----------|--------|----------|---------|
| 0            | 0.63      | 0.68   | 0.65     | 916     |
| 1            | 0.22      | 0.18   | 0.20     | 312     |
| 2            | 0.15      | 0.14   | 0.15     | 236     |
| accuracy     |           |        | 0.49     | 1464    |
| macro avg    | 0.33      | 0.33   | 0.33     | 1464    |
| weighted avg | 0.47      | 0.49   | 0.48     | 1464    |
- Since the number of classes is small, "One vs Rest" classification can also be tried
- Hyperparameter optimization for the LSTM model is likely to improve its metrics
- Adding more data will definitely help improve metrics, especially for the deep learning models
- BERT is pretrained on a Wikipedia corpus; it could either be further pretrained on a Twitter corpus or fine-tuned
- For Doc2Vec, round-trip accuracy can be used for better tuning and hence better-quality embeddings
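The One-vs-Rest idea above can be sketched with scikit-learn: one binary classifier per sentiment class, with the highest-scoring class winning. The toy texts and labels are hypothetical, for illustration only:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; 0 = negative, 1 = neutral, 2 = positive
texts = ["terrible flight", "it was fine", "loved the crew",
         "awful delay", "average trip", "great service"]
labels = [0, 1, 2, 0, 1, 2]

# One binary logistic regression per class; at prediction time the
# class whose binary classifier scores highest is returned
ovr = make_pipeline(CountVectorizer(),
                    OneVsRestClassifier(LogisticRegression()))
ovr.fit(texts, labels)
preds = ovr.predict(["great crew"])
print(preds)
```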