Course Project

Overview

The final project is an opportunity to apply what you learn in this course to a problem of your choosing. Not only is the final project an interesting learning opportunity, it will also become a valuable part of your data science portfolio. For many employers, real code applied to a real problem and available on GitHub is more persuasive than any bullet point on a written resume.

Note also that completion of a final project, including presenting it to the class, is a requirement for a Certificate of Completion for the course.

The best final projects tend to be driven by curiosity. Therefore, we recommend that you pick a topic that you're passionate about. If you have a strong interest in the domain, then it will be more engaging and fun for you. As a result, you'll produce a better project!

In the past, students have completed final projects related to their current role, to a field they want to move into, and to areas of personal interest such as fantasy sports and cooking.

Sources of potential inspiration include:

Publicly available data sets are another great source of inspiration. You may also use a private data set that is available to you. You are not required to post a private data set publicly, but the instructor team will usually need access to it in order to evaluate your final project. To provide that access, you may want to pay the $7/month for a private GitHub repo for your final project code, or upload the data set to Dropbox and share it with the instructor team. Also, remember that you must present your project and what you learned in class.

Feel free to reach out to the instructor team and to consult each other when considering different project options. We are happy to chat about different ideas.

Project Deliverables

You are responsible for creating a project paper and a project presentation. The paper should be written with a technical audience in mind, while the presentation should target a more general audience. You will deliver your presentation (including slides) during the final class, on September 3.

Your paper should cover the following:

  • Problem statement and hypothesis
  • Description of your data set and how it was obtained
  • Description of any pre-processing / data-munging steps
  • What you learned from exploring the data, including visualizations
  • How you chose which features to use in your analysis
  • Details of your modeling process, including how you selected your models and validated them
  • Your challenges and successes
  • Possible extensions or business applications of your project
  • Conclusions and key learnings

Your presentation should cover these components with less breadth and less depth. Focus on creating an engaging, clear, and informative presentation that tells the story of your project.

Create a GitHub repository for your project -- under your userid and separate from your homework repo -- that contains the following:

  • Project paper: any format (PDF, Markdown, etc.)
  • Presentation slides: any format (PDF, PowerPoint, Google Slides, etc.)
  • Code: commented Python code and any other code you used in the project
  • Data: data files (see below for more info)
  • Data dictionary: description of each variable, including units

If it's not possible or practical to include your entire data set, link to your data source and provide a sample of the data. (GitHub has a size limit of 100 MB per file and 1 GB per repository.) If your data is private, you can include an "anonymized" version of your data, create a private GitHub repository, or upload your data set to Dropbox or Google Drive and share it with the instructor team.
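If the full data set is too large to commit, one way to produce the sample mentioned above is to draw a fixed number of random rows from the source file. The following is a minimal sketch using only the Python standard library. The function names, the default seed, and the choice to treat 100 MB as 100 * 1024 * 1024 bytes are assumptions made for illustration, not course requirements.

```python
import csv
import os
import random

# Assumption: the 100 MB per-file limit is approximated as 100 * 1024 * 1024 bytes.
GITHUB_FILE_LIMIT_BYTES = 100 * 1024 * 1024


def sample_rows(rows, k, seed=0):
    """Reservoir-sample up to k rows from any iterable in a single pass."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = row
    return reservoir


def write_sample_csv(src_path, dst_path, k, seed=0):
    """Copy the header plus k randomly sampled rows from src_path to dst_path."""
    with open(src_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        sample = sample_rows(reader, k, seed)
    with open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(sample)
    return len(sample)


def fits_github_file_limit(path):
    """Return True if the file is small enough to commit as a single file."""
    return os.path.getsize(path) <= GITHUB_FILE_LIMIT_BYTES
```

For example, `write_sample_csv("full_data.csv", "sample_data.csv", k=1000)` would write a 1,000-row sample next to a hypothetical `full_data.csv`. Reservoir sampling reads the file once and never holds more than `k` rows in memory, which helps when the source file is much larger than the sample.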

Project Milestones

July 16: Final Project Elevator Pitch

The Final Project Elevator Pitch includes both a short (one paragraph) write-up of your proposed project topic and a concise (<90 seconds) presentation to the class. The elevator pitch should include:

  • A concise statement of the goal of your project
  • What question or questions you hope to answer
  • What data set you plan to use and how you will obtain the data
  • What type of machine learning problem this is (from our 2x2 matrix)
  • Why you chose this project

July 28: Data Ready

By July 28, you should have your data set available for mining and modeling. The steps may include:

  • data retrieval
  • data cleaning
  • data transformation and filtering
  • feature scaling
  • missing data handling
  • noise removal
  • feature engineering
  • ...
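Two of the steps listed above, missing data handling and feature scaling, can be sketched in a few lines. This is an illustrative example only: median imputation and z-score standardization are just one common choice for each step, and the function names and the `ages` values are made up for the example. It uses the Python standard library so that it runs without extra packages.

```python
import statistics


def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant feature carries no spread, so map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Hypothetical feature column with two missing entries.
ages = [23, None, 31, 45, None, 38]
filled = impute_median(ages)   # missing entries become the median of 23, 31, 45, 38
scaled = standardize(filled)   # z-scores of the imputed column
```

Whether to impute at all, and with which statistic, depends on why the values are missing, so it is worth noting your reasoning in the project paper.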

For this milestone, create at least three plots from your data set and commit them to your GitHub repo. You will give a short (<90 seconds) presentation to the class, using the plots to demonstrate one or more of the following:

  • the distribution(s) of the data points
  • whether there are any outliers
  • whether scaling or normalization is required, and why
  • how you plan to handle missing data
  • which features are significant
  • any other insights and fun facts that you discovered from the data set
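Before plotting, a quick numeric check can help decide what the distribution and outlier plots should show. The sketch below bins a feature into equal-width intervals and flags outliers with the 1.5 × IQR rule. The rule, the function names, and the sample values are assumptions chosen for illustration, and other outlier definitions may suit your data better.

```python
import statistics


def bin_counts(values, bins=5):
    """Count how many values fall into each of `bins` equal-width intervals."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / bins or 1  # fall back to 1 when all values are equal
    counts = [0] * bins
    for v in values:
        # The maximum value belongs in the last bin, not one past it.
        idx = min(int((v - lo) / step), bins - 1)
        counts[idx] += 1
    return counts


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the first or third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical measurements with one suspicious value.
readings = [10, 12, 11, 13, 12, 11, 10, 95]
counts = bin_counts(readings)      # [7, 0, 0, 0, 1]
outliers = iqr_outliers(readings)  # [95]
```

A histogram with most of its mass in one bin, as here, is a hint that a single extreme value is stretching the axis and that the plot may be more readable with the outlier handled separately.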

August 11: First Draft Due Before Class

Your peers and instructors will provide feedback by August 18, according to these guidelines.

At a minimum, you should include:

  • Narrative of what you have done so far and what you are still planning to do, ideally in a format similar to the format of your final project paper
  • Code, with lots of comments

Ideally, you would also include:

  • Visualizations you have done
  • Slides (if you have started making them)
  • Data and data dictionary

Tips for success:

  • The work should stand "on its own", and should not depend upon the reader remembering anything you might have previously said in class about your project.
  • Organize your narrative and files so that the reader can easily follow along.
  • The better you explain your project, and the easier it is to follow, the more useful feedback you will receive!
  • If your reviewers can actually run your code on the provided data, they will be able to give you more useful feedback on your code. (It can be very hard to make useful code suggestions on code that can't be run!)

September 3: Presentation

Deliver your project presentation in class and submit all required deliverables (paper, slides, code, data, and data dictionary).