Course materials for General Assembly's Data Science course in Washington, D.C. (4/15/17 - 6/24/17)
Fill me out at the end of each class!
You've all been invited to use Slack for chat during class and throughout the day. Please consider this the primary way to contact other students. Office hours are listed weekly on Slack in the office hours channel.
Lead Instructor: Alex Egorenkov
Lead Instructor: Alex Sherman
| Class | Date | Topic | Instructor | Homework |
| --- | --- | --- | --- | --- |
| 1 | 4/15 | What is Data Science | | |
| | | Exploratory Data Analysis with Pandas | Alex S. | IMDB with Pandas; Pandas Homework |
| 2 | 4/22 | Git, GitHub, and the Command Line | Alex S. | |
| | | Introduction to Machine Learning | Alex E. | Command Line & First Project Presentation |
| 3 | 4/29 | Statistics Fundamentals 1 | Alex E. | |
| | | Web Scraping and APIs | Alex S. | Chipotle Python; Web Scraping - IMDB (Optional) |
| 4 | 5/6 | Statistics Fundamentals 2 | Alex E. | |
| | | K-Nearest Neighbors (KNN) | Alex S. | [Final Project 2] Project Brainstorming - Project Question and Dataset Due |
| 5 | 5/13 | Evaluating Model Fit | Alex S. | |
| | | Linear Regression | Alex E. | Yelp Votes Linear Regression (Optional) |
| 6 | 5/20 | Logistic Regression | Alex S. | |
| | | Introduction to Time Series | Alex E. | [Final Project 2] Project Outline |
| 7 | 6/3 | Decision Trees and Random Forests | Alex E. | |
| | | Natural Language Processing (NLP) | Alex S. | Naive Bayes with Yelp Review Text (Optional); [Final Project 3] Exploratory Data Analysis |
| 8 | 6/10 | Dimensionality Reduction | Alex E. | |
| | | Unsupervised Learning - Clustering | Alex S. | [Final Project 4] Modeling and Analysis |
| 9 | 6/17 | Advanced scikit-learn | Alex S. | |
| | | In-Class Kaggle Competition | Alex E. | [Final Project 5] Presentations |
| 10 | 6/24 | Introduction to Databases | Alex E. | |
| | | Data Science Careers | | |
| | | Final Project Presentations | | |
Class Resources:
- MovieLens 100k movie ratings (data dictionary, website)
- Alcohol consumption by country (article)
- Reports of UFO sightings (website)
Pandas Resources:
- Browsing or searching the Pandas API Reference is an excellent way to locate a function even if you don't know its exact name.
- To learn more Pandas, read this three-part tutorial, or review these two excellent (but extremely long) notebooks on Pandas: introduction and data wrangling.
- If you want to go really deep into Pandas (and NumPy), read the book Python for Data Analysis, written by the creator of Pandas.
- This notebook demonstrates the different types of joins in Pandas, for when you need to figure out how to merge two DataFrames.
- This is a nice, short tutorial on pivot tables in Pandas.
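If you just want to see the join types side by side before reading the notebook above, here is a minimal sketch using two made-up DataFrames (the column names and values are invented for illustration, not from the class datasets):

```python
import pandas as pd

# Toy DataFrames sharing a "movie_id" key (invented data)
ratings = pd.DataFrame({"movie_id": [1, 2, 2, 3], "rating": [4, 5, 3, 4]})
movies = pd.DataFrame({"movie_id": [1, 2, 4], "title": ["Alien", "Brazil", "Dune"]})

# Inner join keeps only keys present in both frames
inner = pd.merge(ratings, movies, on="movie_id", how="inner")

# Left join keeps every rating, filling missing titles with NaN
left = pd.merge(ratings, movies, on="movie_id", how="left")

print(inner.shape)  # (3, 3): movie_id 3 has no title, movie_id 4 has no rating
print(left.shape)   # (4, 3): all four ratings survive
```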
Python Resources:
- Codecademy's Python course: Good beginner material, including tons of in-browser exercises.
- DataQuest: Similar interface to Codecademy, but focused on teaching Python in the context of data science.
- Google's Python Class: Slightly more advanced, including hours of useful lecture videos and downloadable exercises (with solutions).
- A Crash Course in Python for Scientists: Read through the Overview section for a quick introduction to Python.
- Python for Informatics: A very beginner-oriented book, with associated slides and videos.
- Python Tutor: Allows you to visualize the execution of Python code.
- My code isn't working is a great flowchart explaining how to debug Python errors.
- PEP 8 is Python's "classic" style guide, and is worth a read if you want to write readable code that is consistent with the rest of the Python community.
Advanced Python Material:
- Want to understand Python's comprehensions? Think in Excel or SQL may be helpful if list comprehensions still confuse you.
- If you want to understand Python at a deeper level: Ned Batchelder's Loop Like A Native, Python Names and Values, Raymond Hettinger's Transforming Code into Beautiful, Idiomatic Python and Python Epiphanies are excellent presentations.
- Everything is an object in Python
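To make the comprehension-versus-loop equivalence concrete, here is a tiny example (the SQL comparison in the link above is an analogy, not literal SQL):

```python
# List comprehension: roughly "SELECT x*x FROM nums WHERE x is even"
nums = [1, 2, 3, 4, 5, 6]
even_squares = [x * x for x in nums if x % 2 == 0]

# The equivalent explicit loop
even_squares_loop = []
for x in nums:
    if x % 2 == 0:
        even_squares_loop.append(x * x)

print(even_squares)  # [4, 16, 36]
```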
Resources:
- For a useful look at the different types of data scientists, read Analyzing the Analyzers (32 pages).
- For some thoughts on what it's like to be a data scientist, read these short posts from Win-Vector and Datascope Analytics.
- Introduction to Statistical Learning
- Data Science vs Statistics
- 15 Books every Data Scientist Should Read
- 50+ Free Data Science Books
- Building Data Science Teams
- Doing Data Science
- Getting Started with Data Science
- Quora has a data science topic FAQ with lots of interesting Q&A.
- Keep up with local data-related events through the Data Community DC event calendar or weekly newsletter.
- Stack Overflow - Developer Survey Results 2017
- Nate Silver on the Art and Science of Prediction
- Three waves of AI
Material for Next Class:
- Setting up Python for machine learning: scikit-learn and IPython Notebook. This video includes an overview of the Jupyter Notebook, which is used in the homework assignment.
- Pro Git is an excellent book for learning Git. Read the first two chapters to gain a deeper understanding of version control and basic commands.
- Work through GA's friendly command line tutorial using Terminal (Linux/Mac) or Git Bash (Windows), and then browse through this command line reference.
Class Resources: Set your Git username and email
Command Line Resources:
- The Linux command line
- If you want to go much deeper into the command line, Data Science at the Command Line is a great book. The companion website provides installation instructions for a "data science toolbox" (a virtual machine with many more command line tools), as well as a long reference guide to popular command line tools.
- If you want to do more at the command line with CSV files, try out csvkit, which can be installed via `pip`.
Git and Markdown Resources:
- GitHub for Beginners
- If you want to practice a lot of Git (and learn many more commands), Git Immersion looks promising.
- If you want to understand how to contribute on GitHub, you first have to understand forks and pull requests.
- GitRef is my favorite reference guide for Git commands, and Git quick reference for beginners is a shorter guide with commands grouped by workflow.
- Markdown Cheatsheet provides a thorough set of Markdown examples with concise explanations. GitHub's Mastering Markdown is a simpler and more attractive guide, but is less comprehensive.
- Introducing GitHub is a nice, quick-reading introduction to GitHub.
- Version Control with Git
- Cracking the Code to GitHub's Growth explains why GitHub is so popular among developers.
- How to remove .DS_Store from GitHub
Machine Learning Resources:
- For a very quick summary of the key points about machine learning, watch What is machine learning, and how does it work? (10 minutes) or read the associated notebook.
- For a more in-depth introduction to machine learning, read section 2.1 (14 pages) of Hastie and Tibshirani's excellent book, An Introduction to Statistical Learning. (It's a free PDF download!)
- For a really nice comparison of supervised versus unsupervised learning, plus an introduction to reinforcement learning, watch this video (13 minutes) from Caltech's Learning From Data course.
- For a preview of some of the machine learning content we will cover during the course, read Sebastian Raschka's overview of the supervised learning process.
- Copy the link from your new forked repo.
- Clone your forked repo to your computer: `git clone git@github.com:YOUR_USERNAME/ds-dc-19.git`
- `cd` (change directory) into the cloned repo.
- Add the class repo as an upstream remote: `git remote add upstream https://github.com/ga-students/ds-dc-19`
- Repeat this step often to keep your repo up to date with the class repo: `git pull upstream master`
Class Resources:
- APIs (code)
- Web scraping (code)
- Autocomplete in Spyder
Statistics Resources:
- Read How Software in Half of NYC Cabs Generates $5.2 Million a Year in Extra Tips for an excellent example of exploratory data analysis.
- Read Anscombe's Quartet, and Why Summary Statistics Don't Tell the Whole Story for a classic example of why visualization is useful.
- What I do when I get a new data set as told through tweets is a fun (yet enlightening) look at the process of exploratory data analysis.
- Khan Academy Statistics and Probability is a good refresher if you need one.
- ThinkStats is a good statistics book, with Python code using NumPy and Pandas.
- Bessel's correction: sample variance is surprisingly subtle once you look into it, which makes it one of the simplest examples of meaningful bias and variance.
- Bias of an estimator: more on the bias of the sample variance estimator.
- Mean Squared Error: many fields obsess over unbiased estimators, while machine learning obsesses over MSE; includes more examples involving the sample variance estimator.
- Understanding the Bias-Variance Tradeoff: a deep topic that we will dive into later in the course, and worth a preview.
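As a quick illustration of Bessel's correction, NumPy exposes both divisors through the `ddof` argument (the sample values here are toy numbers chosen by hand):

```python
import numpy as np

sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# ddof=0 divides by n (biased when estimating from a sample);
# ddof=1 divides by n-1, which is Bessel's correction
var_biased = sample.var(ddof=0)
var_unbiased = sample.var(ddof=1)
print(var_biased, round(var_unbiased, 3))  # 4.0 and 4.571
```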
Web Scraping Resources:
- The Beautiful Soup documentation is incredibly thorough, but is hard to use as a reference guide. However, the section on specifying a parser may be helpful if Beautiful Soup appears to be parsing a page incorrectly.
- For more Beautiful Soup examples and tutorials, see Web Scraping 101 with Python, this notebook from Stanford's Text As Data course, and this notebook and associated video from Harvard's Data Science course.
- For a much longer web scraping tutorial covering Beautiful Soup, lxml, XPath, and Selenium, watch Web Scraping with Python (3 hours 23 minutes) from PyCon 2014. The slides and code are also available.
- For more complex web scraping projects, Scrapy is a popular application framework that works with Python. It has excellent documentation, and here's a tutorial with detailed slides and code.
- robotstxt.org has a concise explanation of how to write (and read) the `robots.txt` file.
- import.io and Kimono claim to allow you to scrape websites without writing any code.
- How a Math Genius Hacked OkCupid to Find True Love and How Netflix Reverse Engineered Hollywood are two fun examples of how web scraping has been used to build interesting datasets.
- Be Suspicious Of Online Movie Ratings, Especially Fandango's is an interesting example of web scraping in action, from FiveThirtyEight.
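As a minimal taste of Beautiful Soup before working through the tutorials above (the HTML snippet is invented, not from a real site):

```python
from bs4 import BeautifulSoup

# A tiny invented HTML document standing in for a scraped page
html = """
<html><body>
  <h1>Example Movies</h1>
  <div class="movie"><span class="title">Alien</span> <span class="year">1979</span></div>
  <div class="movie"><span class="title">Brazil</span> <span class="year">1985</span></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all locates every matching tag; class_ avoids the Python keyword "class"
titles = [div.find("span", class_="title").text
          for div in soup.find_all("div", class_="movie")]
print(titles)  # ['Alien', 'Brazil']
```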
API Resources:
- Mashape and Apigee allow you to explore tons of different APIs. Alternatively, a Python API wrapper is available for many popular APIs.
- API Integration in Python provides a very readable introduction to REST APIs.
- Microsoft's Face Detection API, which powers How-Old.net, is a great example of how a machine learning API can be leveraged to produce a compelling web application.
Selenium Resources:
- What is Selenium
- Chromedriver download
- Selenium with Python Documentation
- Selenium Webdriver Python Tutorial For Web Automation
KNN Resources:
- For a recap of the key points about KNN and scikit-learn, watch Getting started in scikit-learn with the famous iris dataset (15 minutes) and Training a machine learning model with scikit-learn (20 minutes).
- KNN supports distance metrics other than Euclidean distance, such as Mahalanobis distance, which takes the scale of the data into account.
- A Detailed Introduction to KNN is a bit dense, but provides a more thorough introduction to KNN and its applications.
- This lecture on Image Classification shows how KNN could be used for detecting similar images, and also touches on topics we will cover in future classes (hyperparameter tuning and cross-validation).
- Some applications for which KNN is well-suited are object recognition, satellite image enhancement, document categorization, and gene expression analysis.
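A minimal KNN sketch using the iris dataset from the videos above (the flower measurements passed to `predict` are made up, and `n_neighbors=5` is an arbitrary choice):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# Fit a 5-nearest-neighbors classifier on all 150 flowers
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)

# Predict the species of a new (invented) flower measurement
pred = knn.predict([[5.1, 3.5, 1.4, 0.2]])
print(iris.target_names[pred[0]])  # setosa
```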
Seaborn Resources:
- To get started with Seaborn for visualization, the official website has a series of detailed tutorials and an example gallery.
- Data visualization with Seaborn is a quick tour of some of the popular types of Seaborn plots.
- Visualizing Google Forms Data with Seaborn and How to Create NBA Shot Charts in Python are both good examples of Seaborn usage on real-world data.
Fundamental Statistics Resources:
- causal-data-science: a great series about DAGs, association, and causation.
- Khan Academy Statistics and Probability: still useful for basic topics.

Homework: read http://scott.fortmann-roe.com/docs/BiasVariance.html, a visual guide to the bias-variance tradeoff and how it relates to over- and under-fitting.
Model Evaluation Resources:
- For a recap of some of the key points from today's lesson, watch Comparing machine learning models in scikit-learn (27 minutes).
- For another explanation of training error versus testing error, the bias-variance tradeoff, and train/test split (also known as the "validation set approach"), watch Hastie and Tibshirani's video on estimating prediction error (12 minutes, starting at 2:34).
- Caltech's Learning From Data course includes a fantastic video on visualizing bias and variance (15 minutes).
- Random Test/Train Split is Not Always Enough explains why random train/test split may not be a suitable model evaluation procedure if your data has a significant time element.
Reproducibility Resources:
- What We've Learned About Sharing Our Data Analysis includes tips from BuzzFeed News about how to publish a reproducible analysis.
- Software development skills for data scientists discusses the importance of writing functions and proper code comments (among other skills), which are highly useful for creating a reproducible analysis.
- Data science done well looks easy - and that is a big problem for data scientists explains how a reproducible analysis demonstrates all of the work that goes into proper data science.
Linear Regression Resources:
- To go much more in-depth on linear regression, read Chapter 3 of An Introduction to Statistical Learning. Alternatively, watch the related videos or read my quick reference guide to the key points in that chapter.
- This introduction to linear regression is more detailed and mathematically thorough, and includes lots of good advice.
- This is a relatively quick post on the assumptions of linear regression.
- Setosa has an interactive visualization of linear regression.
- For a brief introduction to confidence intervals, hypothesis testing, p-values, and R-squared, as well as a comparison between scikit-learn code and Statsmodels code, read my DAT7 lesson on linear regression.
- Here is a useful explanation of confidence intervals from Quora.
- Hypothesis Testing: The Basics provides a nice overview of the topic, and John Rauser's talk on Statistics Without the Agonizing Pain (12 minutes) gives a great explanation of how the null hypothesis is rejected.
- In 2015, a major scientific journal banned the use of p-values:
- Scientific American has a nice summary of the ban.
- This response to the ban in Nature argues that "decisions that are made earlier in data analysis have a much greater impact on results".
- Andrew Gelman has a readable paper in which he argues that "it's easy to find a p < .05 comparison even if nothing is going on, if you look hard enough".
- Science Isn't Broken includes a neat tool that allows you to "p-hack" your way to "statistically significant" results.
- Accurately Measuring Model Prediction Error compares adjusted R-squared, AIC and BIC, train/test split, and cross-validation.
Other Resources:
- Section 3.3.1 of An Introduction to Statistical Learning (4 pages) has a great explanation of dummy encoding for categorical features.
- Kaggle has some nice visualizations of the bikeshare data we used today.
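A tiny scikit-learn linear regression sketch on synthetic data (the true slope of 3 and intercept of 2 are invented so you can check how well the fit recovers them):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 3x + 2 plus a little Gaussian noise
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 3 * X.ravel() + 2 + rng.normal(0, 0.5, size=50)

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)  # close to 3 and 2
```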
Logistic Regression Resources:
- To go deeper into logistic regression, read the first three sections of Chapter 4 of An Introduction to Statistical Learning, or watch the first three videos (30 minutes) from that chapter.
- For a math-ier explanation of logistic regression, watch the first seven videos (71 minutes) from week 3 of Andrew Ng's machine learning course, or read the related lecture notes compiled by a student.
- For more on interpreting logistic regression coefficients, read this excellent guide by UCLA's IDRE and these lecture notes from the University of New Mexico.
- The scikit-learn documentation has a nice explanation of what it means for a predicted probability to be calibrated.
- Supervised learning superstitions cheat sheet is a very nice comparison of four classifiers we cover in the course (logistic regression, decision trees, KNN, Naive Bayes) and one classifier we do not cover (Support Vector Machines).
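A toy logistic regression sketch (the hours-studied data is invented; the point is the difference between `predict`, which returns a class, and `predict_proba`, which returns a probability):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented binary problem: hours studied vs. whether the student passed
hours = np.array([[0.5], [1], [1.5], [2], [2.5], [3], [3.5], [4], [4.5], [5]])
passed = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

clf = LogisticRegression().fit(hours, passed)

# predict returns a hard class; predict_proba returns P(class) per column
proba = clf.predict_proba([[4.0]])[0, 1]
print(clf.predict([[1.0]])[0], clf.predict([[4.0]])[0], round(proba, 2))
```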
Confusion Matrix Resources:
- This simple guide to confusion matrix terminology may be useful to you as a reference.
- This blog post about Amazon Machine Learning contains a neat graphic showing how classification threshold affects different evaluation metrics.
- This notebook (from another DAT course) explains how to calculate "expected value" from a confusion matrix by treating it as a cost-benefit matrix.
- Watch Rahul Patwari's videos on Intuitive sensitivity and specificity (9 minutes) and The tradeoff between sensitivity and specificity (13 minutes).
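To make the terminology concrete, here is a small confusion matrix computed from invented labels, along with the sensitivity and specificity that Patwari's videos explain:

```python
from sklearn.metrics import confusion_matrix

# Invented true vs. predicted labels (1 = positive class)
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)  # true positive rate (recall)
specificity = tn / (tn + fp)  # true negative rate
print(tn, fp, fn, tp)  # 5 1 1 3
```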
ROC Resources:
- Rahul Patwari has a great video on ROC Curves (12 minutes).
- An introduction to ROC analysis is a very readable paper on the topic.
- ROC curves can be used across a wide variety of applications, such as comparing different feature sets for detecting fraudulent Skype users, and comparing different classifiers on a number of popular datasets.
NLP Resources:
- If you want to learn a lot more NLP, check out the excellent video lectures and slides from this Coursera course (which is no longer being offered).
- This slide deck defines many of the key NLP terms.
- Natural Language Processing with Python is the most popular book for going in-depth with the Natural Language Toolkit (NLTK).
- A Smattering of NLP in Python provides a nice overview of NLTK, as does this notebook from DAT5.
- spaCy is a newer Python library for text processing that is focused on performance (unlike NLTK).
- If you want to get serious about NLP, Stanford CoreNLP is a suite of tools (written in Java) that is highly regarded.
- When working with a large text corpus in scikit-learn, HashingVectorizer is a useful alternative to CountVectorizer.
- Automatically Categorizing Yelp Businesses discusses how Yelp uses NLP and scikit-learn to solve the problem of uncategorized businesses.
- Modern Methods for Sentiment Analysis shows how "word vectors" can be used for more accurate sentiment analysis.
- Identifying Humorous Cartoon Captions is a readable paper about identifying funny captions submitted to the New Yorker Caption Contest.
- DC Natural Language Processing is an active Meetup group in our local area.
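Before diving into NLTK or spaCy, scikit-learn's own `CountVectorizer` covers the basic bag-of-words step; a minimal sketch with invented documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the cat sat on the mat", "dogs chase cats"]

# Build a sparse document-term matrix of token counts
vect = CountVectorizer()
dtm = vect.fit_transform(docs)

print(sorted(vect.vocabulary_))  # learned vocabulary, alphabetical
print(dtm.shape)                 # (3 documents, 8 unique tokens)
```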
Naive Bayes Resources:
- Sebastian Raschka's article on Naive Bayes and Text Classification covers the conceptual material from today's class in much more detail.
- For more on conditional probability, read these slides, or read section 2.2 of the OpenIntro Statistics textbook (15 pages).
- For an intuitive explanation of Naive Bayes classification, read this post on airport security.
- For more details on Naive Bayes classification, Wikipedia has two excellent articles (Naive Bayes classifier and Naive Bayes spam filtering), and Cross Validated has a good Q&A.
- When applying Naive Bayes classification to a dataset with continuous features, it is better to use GaussianNB rather than MultinomialNB. This notebook compares their performances on such a dataset. Wikipedia has a short description of Gaussian Naive Bayes, as well as an excellent example of its usage.
- These slides from the University of Maryland provide more mathematical details on both logistic regression and Naive Bayes, and also explain how Naive Bayes is actually a "special case" of logistic regression.
- Andrew Ng has a paper comparing the performance of logistic regression and Naive Bayes across a variety of datasets.
- If you enjoyed Paul Graham's article, you can read his follow-up article on how he improved his spam filter and this related paper about state-of-the-art spam filtering in 2004.
- Yelp has found that Naive Bayes is more effective than Mechanical Turks at categorizing businesses.
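A short sketch of `GaussianNB` on the continuous-feature case discussed above (iris is used purely as a convenient all-continuous dataset; 5-fold cross-validation is an arbitrary choice):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Iris features are continuous measurements, so GaussianNB is the
# appropriate variant (MultinomialNB expects count-like features)
gnb_score = cross_val_score(GaussianNB(), X, y, cv=5).mean()
print(round(gnb_score, 3))
```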
Decision Tree Resources:
- Introduction to Statistical Learning - Chapter 8 (Tree-Based Methods)
- scikit-learn's documentation on decision trees includes a nice overview of trees as well as tips for proper usage.
- For a more thorough introduction to decision trees, read section 4.3 (23 pages) of Introduction to Data Mining. (Chapter 4 is available as a free download.)
- If you want to go deep into the different decision tree algorithms, this slide deck contains A Brief History of Classification and Regression Trees.
- The Science of Singing Along contains a neat regression tree (page 136) for predicting the percentage of an audience at a music venue that will sing along to a pop song.
- Decision trees are common in the medical field for differential diagnosis, such as this classification tree for identifying psychosis.
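A minimal decision tree sketch; `max_depth=2` is an arbitrary choice here, shown only to illustrate limiting tree growth, and `export_text` prints the learned rules in a form similar to the diagnosis trees linked above:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X, y = iris.data, iris.target

# max_depth caps tree growth, guarding against overfitting
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Print the tree's if/else rules as text
print(export_text(tree, feature_names=iris.feature_names))
```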
Ensembling Resources:
- scikit-learn's documentation on ensemble methods covers both "averaging methods" (such as bagging and Random Forests) as well as "boosting methods" (such as AdaBoost and Gradient Tree Boosting).
- MLWave's Kaggle Ensembling Guide is very thorough and shows the many different ways that ensembling can take place.
- Browse the excellent solution paper from the winner of Kaggle's CrowdFlower competition for an example of the work and insight required to win a Kaggle competition.
- Interpretable vs Powerful Predictive Models: Why We Need Them Both is a short post on how the tactics useful in a Kaggle competition are not always useful in the real world.
- Not Even the People Who Write Algorithms Really Know How They Work argues that the decreased interpretability of state-of-the-art machine learning models has a negative impact on society.
- For an intuitive explanation of Random Forests, read Edwin Chen's answer to How do random forests work in layman's terms?
- Large Scale Decision Forests: Lessons Learned is an excellent post from Sift Science about their custom implementation of Random Forests.
- Unboxing the Random Forest Classifier describes a way to interpret the inner workings of Random Forests beyond just feature importances.
- Understanding Random Forests: From Theory to Practice is an in-depth academic analysis of Random Forests, including details of its implementation in scikit-learn.
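A small Random Forest sketch showing the feature importances that several of the posts above discuss (the hyperparameters are arbitrary, and iris is used only for convenience):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
X, y = iris.data, iris.target

rf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)

# Feature importances (summing to 1) show which measurements
# drive the forest's predictions
for name, imp in zip(iris.feature_names, rf.feature_importances_):
    print(name, round(imp, 3))
```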
Clustering Resources:
- K-means: documentation, visualization 1, visualization 2
- DBSCAN: documentation, visualization
- For a very thorough introduction to clustering, read chapter 8 (69 pages) of Introduction to Data Mining (available as a free download), or browse through the chapter 8 slides.
- scikit-learn's user guide compares many different types of clustering.
- This PowerPoint presentation from Columbia's Data Mining class provides a good introduction to clustering, including hierarchical clustering and alternative distance metrics.
- An Introduction to Statistical Learning has useful videos on K-means clustering (17 minutes) and hierarchical clustering (15 minutes).
- This is an excellent interactive visualization of hierarchical clustering.
- This is a nice animated explanation of mean shift clustering.
- The K-modes algorithm can be used for clustering datasets of categorical features without converting them to numerical values. Here is a Python implementation.
- Here are some fun examples of clustering: A Statistical Analysis of the Work of Bob Ross (with data and Python code), How a Math Genius Hacked OkCupid to Find True Love, and characteristics of your zip code.
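A minimal K-means sketch on synthetic blobs (the blob locations are invented so the expected cluster centers are obvious; `n_clusters=2` matches the data by construction):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated blobs of points around (0, 0) and (5, 5)
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # one center near (0, 0), one near (5, 5)
```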
Dimensionality Reduction Resources:
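As a starting point, here is a minimal PCA sketch (iris is used only for convenience, and 2 components is an arbitrary choice for easy plotting):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)

# Project the 4-dimensional iris data down to 2 principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# How much of the original variance the 2 components retain
print(X_2d.shape, round(pca.explained_variance_ratio_.sum(), 3))
```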
scikit-learn Resources:
- This is a longer example of feature scaling in scikit-learn, with additional discussion of the types of scaling you can use.
- Practical Data Science in Python is a long and well-written notebook that uses a few advanced scikit-learn features: pipelining, plotting a learning curve, and pickling a model.
- Sebastian Raschka has a number of excellent resources for scikit-learn users, including a repository of tutorials and examples, a library of machine learning tools and extensions, a new book, and a semi-active blog.
- scikit-learn has an incredibly active mailing list that is often much more useful than Stack Overflow for researching functions and asking questions.
- If you forget how to use a particular scikit-learn function that we have used in class, don't forget that this repository is fully searchable!
- Helper functions: Pipeline, GridSearchCV
- To learn how to use GridSearchCV and RandomizedSearchCV for parameter tuning, watch How to find the best model parameters in scikit-learn (28 minutes) or read the associated notebook.
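The `GridSearchCV` helper mentioned above can be sketched like this (the candidate `n_neighbors` values and `cv=5` are arbitrary choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try several values of n_neighbors, scoring each with 5-fold cross-validation
param_grid = {"n_neighbors": [1, 5, 15, 25]}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X, y)

print(grid.best_params_, round(grid.best_score_, 3))
```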
Database Resources:
- This GA slide deck provides a brief introduction to databases and SQL. The Python script from that lesson demonstrates basic SQL queries, as well as how to connect to a SQLite database from Python and how to query it using Pandas.
- The repository for this SQL Bootcamp contains an extremely well-commented SQL script that is suitable for walking through on your own.
- This GA notebook provides a shorter introduction to databases and SQL that helpfully contrasts SQL queries with Pandas syntax.
- SQLZOO, Mode Analytics, Khan Academy, Codecademy, Datamonkey, and Code School all have online beginner SQL tutorials that look promising. Code School also offers an advanced tutorial, though it's not free.
- w3schools has a sample database that allows you to practice SQL from your browser. Similarly, Kaggle allows you to query a large SQLite database of Reddit Comments using their online "Scripts" application.
- What Every Data Scientist Needs to Know about SQL is a brief series of posts about SQL basics, and Introduction to SQL for Data Scientists is a paper with similar goals.
- 10 Easy Steps to a Complete Understanding of SQL is a good article for those who have some SQL experience and want to understand it at a deeper level.
- SQLite's article on Query Planning explains how SQL queries "work".
- A Comparison Of Relational Database Management Systems gives the pros and cons of SQLite, MySQL, and PostgreSQL.
- If you want to go deeper into databases and SQL, Stanford has a well-respected series of 14 mini-courses.
- Blaze is a Python package enabling you to use Pandas-like syntax to query data living in a variety of data storage systems.
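You can also practice SQL without installing anything, using Python's built-in `sqlite3` module and an in-memory database (the table and values below are invented):

```python
import sqlite3

# In-memory SQLite database: nothing is written to disk
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE orders (item TEXT, quantity INTEGER, price REAL)")
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [("burrito", 2, 8.50), ("taco", 3, 2.25), ("burrito", 1, 8.50)])

# Aggregate with GROUP BY, analogous to a Pandas groupby
cur.execute("SELECT item, SUM(quantity) FROM orders "
            "GROUP BY item ORDER BY item")
rows = cur.fetchall()
print(rows)  # [('burrito', 3), ('taco', 3)]
conn.close()
```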
Resources:
- scikit-learn's machine learning map may help you to choose the "best" model for your task.
- Choosing a Machine Learning Classifier is a short and highly readable comparison of several classification models, Comparing supervised learning algorithms is a model comparison table that I created, and Supervised learning superstitions cheat sheet is a more thorough comparison (with links to lots of useful resources).
- Machine Learning Done Wrong, Machine Learning Gremlins (31 minutes), Clever Methods of Overfitting, and Common Pitfalls in Machine Learning all offer thoughtful advice on how to avoid common mistakes in machine learning.
- Practical machine learning tricks from the KDD 2011 best industry paper and Andrew Ng's Advice for applying machine learning include slightly more advanced advice than the resources above.
- An Empirical Comparison of Supervised Learning Algorithms is a readable research paper from 2006, which was also presented as a talk (77 minutes).
- Good Data Management Practices for Data Analysis briefly summarizes the principles of "tidy data".
- Hadley Wickham's paper explains tidy data in detail and includes lots of good examples.
- Example of a tidy dataset: Bob Ross
- Examples of untidy datasets: NFL ticket prices, airline safety, Jets ticket prices, Chipotle orders
- If your co-workers tend to create spreadsheets that are unreadable by computers, they may benefit from reading these tips for releasing data in spreadsheets. (There are some additional suggestions in this answer from Cross Validated.)
Regular Expressions Resources:
- Google's Python Class includes an excellent introductory lesson on regular expressions (which also has an associated video).
- Python for Informatics has a nice chapter on regular expressions. (If you want to run the examples, you'll need to download mbox.txt and mbox-short.txt.)
- Breaking the Ice with Regular Expressions is an interactive Code School course, though only the first "level" is free.
- If you want to go really deep with regular expressions, RexEgg includes endless articles and tutorials.
- 5 Tools You Didn't Know That Use Regular Expressions demonstrates how regular expressions can be used with Excel, Word, Google Spreadsheets, Google Forms, text editors, and other tools.
- Exploring Expressions of Emotions in GitHub Commit Messages is a fun example of how regular expressions can be used for data analysis, and Emojineering explains how Instagram uses regular expressions to detect emoji in hashtags.
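A tiny `re` example in the spirit of the resources above (the pattern is deliberately simplified; real-world email matching is far messier, as RexEgg will happily demonstrate):

```python
import re

text = "Contact us at info@example.com or sales@example.org for details."

# A simplified email pattern: word/dot characters, an @, a domain, a TLD
emails = re.findall(r"[\w.]+@[\w.]+\.\w+", text)
print(emails)  # ['info@example.com', 'sales@example.org']
```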