ds-dc-19

Course materials for General Assembly's Data Science course in Washington, D.C. (4/15/17 - 6/17/17).

Exit Tickets

Fill me out at the end of each class!

Homework Submission Form

Fill me out for each homework assignment!

Slack

You've all been invited to use Slack for chat during class and throughout the day. Please consider this the primary way to contact other students. Office hours are listed weekly on Slack in the Office Hours channel.

Your Team

Lead Instructors: Alex Egorenkov and Alex Sherman

Schedule

| Class | Date | Topics | Instructor | Homework |
| --- | --- | --- | --- | --- |
| 1 | 4/15 | What is Data Science<br>Exploratory Data Analysis with Pandas | Alex S. | IMDB with Pandas<br>Pandas Homework |
| 2 | 4/22 | Git, GitHub, and the Command Line<br>Introduction to Machine Learning | Alex S.<br>Alex E. | Command Line & First Project Presentation |
| 3 | 4/29 | Statistics Fundamentals 1<br>Web Scraping and APIs | Alex E.<br>Alex S. | Chipotle Python & Web Scraping<br>IMDB (Optional) |
| 4 | 5/6 | Statistics Fundamentals 2<br>K-Nearest Neighbors (KNN) | Alex E.<br>Alex S. | [Final Project 1] Project Brainstorming: Project Question and Dataset Due |
| 5 | 5/13 | Evaluating Model Fit<br>Linear Regression | Alex S.<br>Alex E. | Yelp Votes Linear Regression (Optional) |
| 6 | 5/20 | Logistic Regression<br>Introduction to Time Series | Alex S.<br>Alex E. | [Final Project 2] Project Outline |
| 7 | 6/3 | Decision Trees and Random Forests<br>Natural Language Processing (NLP) | Alex E.<br>Alex S. | Naive Bayes with Yelp Review Text (Optional)<br>[Final Project 3] Exploratory Data Analysis |
| 8 | 6/10 | Dimensionality Reduction<br>Unsupervised Learning: Clustering | Alex E.<br>Alex S. | [Final Project 4] Modeling and Analysis |
| 9 | 6/17 | Advanced scikit-learn<br>In-class Kaggle Competition | Alex S.<br>Alex E. | [Final Project 5] Presentations |
| 10 | 6/24 | Introduction to Databases<br>Data Science Careers<br>Final Project Presentations | Alex E. | |

Class 1:

Class Resources:

Pandas Resources:

  • Browsing or searching the Pandas API Reference is an excellent way to locate a function even if you don't know its exact name.
  • To learn more Pandas, read this three-part tutorial, or review these two excellent (but extremely long) notebooks on Pandas: introduction and data wrangling.
  • If you want to go really deep into Pandas (and NumPy), read the book Python for Data Analysis, written by the creator of Pandas.
  • This notebook demonstrates the different types of joins in Pandas, for when you need to figure out how to merge two DataFrames.
  • This is a nice, short tutorial on pivot tables in Pandas.
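To make the joins and pivot-table links above concrete, here is a minimal sketch of both in Pandas. The DataFrames are made up purely for illustration:

```python
import pandas as pd

# Two toy DataFrames sharing a "movie_id" key (made-up data)
movies = pd.DataFrame({"movie_id": [1, 2, 3],
                       "title": ["Alien", "Blade Runner", "Casablanca"]})
ratings = pd.DataFrame({"movie_id": [1, 1, 2, 4],
                        "rating": [8, 9, 7, 6]})

# Inner join keeps only keys present in both frames -> 3 rows
inner = movies.merge(ratings, on="movie_id", how="inner")

# Left join keeps every movie, filling missing ratings with NaN -> 4 rows
left = movies.merge(ratings, on="movie_id", how="left")

# Pivot table: mean rating per movie_id
pivot = ratings.pivot_table(index="movie_id", values="rating", aggfunc="mean")
```

Changing `how=` to `"right"` or `"outer"` covers the remaining join types described in the notebook linked above.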

Python Resources:

  • Codecademy's Python course: Good beginner material, including tons of in-browser exercises.
  • DataQuest: Similar interface to Codecademy, but focused on teaching Python in the context of data science.
  • Google's Python Class: Slightly more advanced, including hours of useful lecture videos and downloadable exercises (with solutions).
  • A Crash Course in Python for Scientists: Read through the Overview section for a quick introduction to Python.
  • Python for Informatics: A very beginner-oriented book, with associated slides and videos.
  • Python Tutor: Allows you to visualize the execution of Python code.
  • My code isn't working is a great flowchart explaining how to debug Python errors.
  • PEP 8 is Python's "classic" style guide, and is worth a read if you want to write readable code that is consistent with the rest of the Python community.

Advanced Python Material:

Resources:

Material for Next Class:


Class 2:

Class Resources: Set your Git username and email

Command Line Resources:

Git and Markdown Resources:

Machine Learning Resources:


Git Repo setup

Step 1

Fork the class repo.

Step 2

Copy the link from your new forked repo.

Step 3

Clone your new forked repo to your computer: `git clone git@github.com:YOUR_USERNAME/ds-dc-19.git`

Step 4

`cd` (change directory) into the cloned repo.

Step 5

Add the class repo as an upstream remote: `git remote add upstream https://github.com/ga-students/ds-dc-19`

Step 6

Repeat this step often to keep your repo up to date with the class repo: `git pull upstream master`


Class 3:

Class Resources:

Statistics Resources:

Web Scraping Resources:

API Resources:

Selenium Resources:


Class 4:

KNN Resources:

Seaborn Resources:

  • Fundamental Statistics

Homework: Read http://scott.fortmann-roe.com/docs/BiasVariance.html, a visual guide to the bias-variance trade-off and how it relates to over- and under-fitting.
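The trade-off in that reading can be reproduced in a few lines of NumPy: a low-degree polynomial underfits noisy sine data (high bias), while a very high-degree one typically chases the noise (high variance). The data and degrees here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

# Hold out every fourth point as a test set
test = np.arange(x.size) % 4 == 0
x_tr, y_tr = x[~test], y[~test]
x_te, y_te = x[test], y[test]

def mse(degree):
    """Fit a polynomial of the given degree and return test-set MSE."""
    fit = np.polynomial.Polynomial.fit(x_tr, y_tr, degree)
    return float(np.mean((fit(x_te) - y_te) ** 2))

# degree 1 underfits (high bias); degree 15 tends to overfit (high variance);
# degree 4 is flexible enough to capture one period of a sine
errors = {d: mse(d) for d in (1, 4, 15)}
```

Plotting the three fits against the raw points makes the picture in the article immediately recognizable.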


Class 5:

Model Evaluation Resources:

Reproducibility Resources:

Linear Regression Resources:

Other Resources:


Class 6:

Logistic Regression Resources:

  • To go deeper into logistic regression, read the first three sections of Chapter 4 of An Introduction to Statistical Learning, or watch the first three videos (30 minutes) from that chapter.
  • For a math-ier explanation of logistic regression, watch the first seven videos (71 minutes) from week 3 of Andrew Ng's machine learning course, or read the related lecture notes compiled by a student.
  • For more on interpreting logistic regression coefficients, read this excellent guide by UCLA's IDRE and these lecture notes from the University of New Mexico.
  • The scikit-learn documentation has a nice explanation of what it means for a predicted probability to be calibrated.
  • Supervised learning superstitions cheat sheet is a very nice comparison of four classifiers we cover in the course (logistic regression, decision trees, KNN, Naive Bayes) and one classifier we do not cover (Support Vector Machines).
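A minimal scikit-learn sketch tying the points above together — predicted probabilities from `predict_proba` and coefficients interpreted as log-odds. The dataset is synthetic, generated just for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic two-class data (made up for illustration)
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_tr, y_tr)

# predict_proba returns class probabilities in [0, 1];
# whether they are well calibrated is what the scikit-learn docs discuss
proba = model.predict_proba(X_te)[:, 1]

# Coefficients are log-odds: exp(coef) gives the odds ratio
# for a one-unit increase in each feature
odds_ratios = np.exp(model.coef_[0])
accuracy = model.score(X_te, y_te)
```

Comparing `proba` against the 0/1 predictions from `model.predict` is a quick way to see that classification is just probability estimation plus a 0.5 threshold.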

Confusion Matrix Resources:

ROC Resources:


Class 7:

NLP Resources:

Naive Bayes Resources:

  • Sebastian Raschka's article on Naive Bayes and Text Classification covers the conceptual material from today's class in much more detail.
  • For more on conditional probability, read these slides, or read section 2.2 of the OpenIntro Statistics textbook (15 pages).
  • For an intuitive explanation of Naive Bayes classification, read this post on airport security.
  • For more details on Naive Bayes classification, Wikipedia has two excellent articles (Naive Bayes classifier and Naive Bayes spam filtering), and Cross Validated has a good Q&A.
  • When applying Naive Bayes classification to a dataset with continuous features, it is better to use GaussianNB rather than MultinomialNB. This notebook compares their performances on such a dataset. Wikipedia has a short description of Gaussian Naive Bayes, as well as an excellent example of its usage.
  • These slides from the University of Maryland provide more mathematical details on both logistic regression and Naive Bayes, and also explain how Naive Bayes is actually a "special case" of logistic regression.
  • Andrew Ng has a paper comparing the performance of logistic regression and Naive Bayes across a variety of datasets.
  • If you enjoyed Paul Graham's article, you can read his follow-up article on how he improved his spam filter and this related paper about state-of-the-art spam filtering in 2004.
  • Yelp has found that Naive Bayes is more effective than Mechanical Turks at categorizing businesses.
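The text-classification pipeline described above (and used in the Yelp homework) can be sketched with `CountVectorizer` + `MultinomialNB`. The tiny "review" corpus here is invented for illustration; as noted above, continuous features would call for `GaussianNB` instead:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up review corpus: 1 = positive, 0 = negative
texts = ["great food loved it",
         "terrible service never again",
         "loved the tacos great place",
         "awful food terrible"]
labels = [1, 0, 1, 0]

# Bag-of-words counts are the discrete features MultinomialNB expects
vec = CountVectorizer()
X = vec.fit_transform(texts)

clf = MultinomialNB().fit(X, labels)  # Laplace smoothing on by default
pred = clf.predict(vec.transform(["great tacos", "terrible awful place"]))
```

Under the hood this is just Bayes' rule with the "naive" conditional-independence assumption over words, which is why it trains in a single pass over the counts.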

Decision Trees Resources

Ensembling Resources:


Class 8: Clustering

Clustering Resources:

**Dimensionality Reduction Resources:**


Class 9: Advanced scikit-learn and Kaggle Competition

scikit-learn Resources:


Class 10: Databases and Final Projects

Databases and SQL

Resources:

Additional Resources

Tidy Data

Regular Expressions Resources:
