This repository contains my projects for Udacity's Data Scientist Nanodegree.
For this project I conducted exploratory data analysis on a Wine Reviews dataset from Kaggle containing approximately 130k reviews from Wine Enthusiast. I explored the data and communicated my findings in a blog post on Medium that gives the reader insight into the questions posed.
Link to notebook
Link to Medium blog post
I applied my data engineering skills to analyze disaster data from Figure Eight and build a model for an API that classifies disaster messages. I created a machine learning pipeline to categorize real messages sent during disaster events so that each message could be routed to an appropriate disaster relief agency. The project includes a web app where an emergency worker can input a new message and get classification results in several categories. The web app also displays visualizations of the data.
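As a rough illustration of what a multi-category message classifier can look like, here is a minimal sketch using scikit-learn. It is not the project's actual pipeline: the toy messages, the two category columns, and the choice of TF-IDF features with a random forest are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Toy messages invented for illustration (not taken from the Figure Eight data).
messages = [
    "we need water and food",
    "house collapsed after the earthquake",
    "please send drinking water",
    "earthquake destroyed the bridge",
    "food supplies are running out",
    "buildings damaged by earthquake",
]

# One row per message, one column per category.
# Hypothetical categories: [water_or_food, earthquake]
labels = np.array([[1, 0], [0, 1], [1, 0], [0, 1], [1, 0], [0, 1]])

# Text is vectorized with TF-IDF, then one classifier is fitted per category.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultiOutputClassifier(
        RandomForestClassifier(n_estimators=50, random_state=0)
    )),
])
pipeline.fit(messages, labels)

# Classify a new message; the result has one 0/1 flag per category.
prediction = pipeline.predict(["we have no water"])
```

Wrapping the estimator in `MultiOutputClassifier` lets a single message receive several category flags at once, which matches the "classification results in several categories" behaviour described above.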
I analysed the interactions that users have with articles on the IBM Watson Studio platform and made recommendations to them about new articles I thought they'd like. I performed EDA, Rank-Based Recommendations, User-User Based Collaborative Filtering, and Matrix Factorisation.
Link to notebook
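To give a feel for two of the recommendation techniques named in this project, here is a minimal pure-Python sketch of rank-based recommendations and user-user collaborative filtering. The interaction log, the function names `top_articles` and `user_user_recs`, and the use of shared-article counts as the similarity measure are assumptions for illustration, not the notebook's actual code.

```python
from collections import Counter

# Hypothetical (user_id, article_id) interaction log.
interactions = [
    (1, "a"), (1, "b"), (1, "c"),
    (2, "a"), (2, "b"), (2, "d"),
    (3, "a"), (3, "e"),
    (4, "b"), (4, "c"), (4, "d"),
]


def top_articles(interactions, n):
    """Rank-based: return the n articles with the most interactions."""
    counts = Counter(article for _, article in interactions)
    # Sort by count (descending), then by article id so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [article for article, _ in ranked[:n]]


def user_user_recs(user_id, interactions, n):
    """User-user: recommend unseen articles from the most similar users."""
    seen = {}
    for user, article in interactions:
        seen.setdefault(user, set()).add(article)
    target = seen.get(user_id, set())

    # Similarity here is simply the number of articles two users share.
    neighbours = sorted(
        (user for user in seen if user != user_id),
        key=lambda user: (-len(seen[user] & target), user),
    )

    recs = []
    for user in neighbours:
        for article in sorted(seen[user] - target):
            if article not in recs:
                recs.append(article)
            if len(recs) == n:
                return recs
    return recs


popular = top_articles(interactions, 2)
personal = user_user_recs(1, interactions, 2)
```

Rank-based recommendations need no history for the target user, so they are a natural fallback for new users, while the user-user approach only works once a user has interacted with at least a few articles.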
I used PySpark to predict customer churn for a music streaming service. The project involved:
- Loading and cleaning a small subset (128MB) of the full dataset (12GB)
- Conducting Exploratory Data Analysis to understand the data and which features are useful for predicting churn
- Feature Engineering to create features that will be used in the modelling process
- Modelling using machine learning algorithms such as Logistic Regression, Random Forest, Gradient Boosted Trees, Linear SVM, and Naive Bayes
Link to notebook
Link to blog post