The course site for the Data Processing in Python course at IES. See information on SIS. The course is taught by Martin Hronec and Vítek Macháček.
The aim of the course is to provide hands-on experience with data-manipulation techniques in Python. Special emphasis is put on standard libraries such as Pandas, Numpy and Matplotlib, and on collecting web data with requests and BeautifulSoup. The students will also be guided through modern social-coding and open-source technologies such as GitHub, Jupyter and Open Data.
The students will gain experience by working with data from the IES website and subject evaluation protocols.
The course makes use of DataCamp online resources to provide the students with reliable yet simple material for learning Python programming.
After passing the course, the students will be able to download data from APIs or directly from the web, pre-process it, analyze it and visualize it.
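As a rough illustration of the pre-process and analyze steps, here is a minimal sketch with pandas. The records, column names and course labels are hypothetical and only stand in for data that would normally come from an API or a web page.

```python
import pandas as pd

# Hypothetical raw records, standing in for data downloaded from an API.
# Values arrive as text and one rating is missing, as is common in practice.
raw = [
    {"course": "course_a", "year": "2018", "rating": "1.8"},
    {"course": "course_a", "year": "2019", "rating": "1.6"},
    {"course": "course_b", "year": "2018", "rating": "2.4"},
    {"course": "course_b", "year": "2019", "rating": None},
]

df = pd.DataFrame(raw)

# Pre-process: convert text columns to numbers and drop incomplete rows.
df["year"] = df["year"].astype(int)
df["rating"] = pd.to_numeric(df["rating"], errors="coerce")
clean = df.dropna(subset=["rating"])

# Analyze: mean rating per course.
mean_rating = clean.groupby("course")["rating"].mean()

# Visualize (requires Matplotlib to be installed):
# mean_rating.plot(kind="bar")
```

The plotting call is left as a comment so that the sketch runs with pandas alone.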
Econometrics II. (JEB110) is an explicit prerequisite for bachelor students.
The course is designed for students who have at least some basic coding experience. It does not need to be very advanced, but they should be aware of concepts such as the `for` loop, `if` and `else`, a variable or a function.
No knowledge of Python is required for entering the course.
Pro Git book, Atlassian Git tutorials, GitHub resources for learning Git
Resources from the official Python webpage
Python, Pandas, Numpy, requests, BeautifulSoup and Matplotlib.
Introduction to Git for Data Science
Intermediate Python for Data Science
Manipulating DataFrames with pandas
Merging DataFrames with pandas
Importing Data in Python (Part 1)
Importing Data in Python (Part 2)
Introduction to Data Visualization
Interactive Data Visualization in Bokeh
Introduction to SQL for Data Science
Introduction to Databases in Python
Practical Introduction to Web Scraping in Python
Passing the course is rewarded with 5 ECTS credits.
The requirements for passing the course are the DataCamp assignments (0pts but compulsory), the midterm (30pts) and the final project (70pts).
At least 4 of assignments 1-6 must be submitted on time.
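The way these components combine can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and reading the rule as "at least four" on-time assignments is an assumption.

```python
MIDTERM_MAX = 30        # maximum midterm points
PROJECT_MAX = 70        # maximum final project points
REQUIRED_ON_TIME = 4    # assumed minimum of on-time assignments among 1-6


def total_points(midterm: float, project: float) -> float:
    """Sum midterm and project points after checking their ranges."""
    if not 0 <= midterm <= MIDTERM_MAX:
        raise ValueError("midterm points must be between 0 and 30")
    if not 0 <= project <= PROJECT_MAX:
        raise ValueError("project points must be between 0 and 70")
    return midterm + project


def assignments_ok(on_time: set[int]) -> bool:
    """Check the compulsory part: only assignments 1-6 count, not assignment 0."""
    counted = {number for number in on_time if 1 <= number <= 6}
    return len(counted) >= REQUIRED_ON_TIME
```

For example, `assignments_ok({0, 1, 2, 3})` is false because assignment 0 does not count towards the four.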
Assignment 0 - (Introduction to Git)
- not compulsory but strongly recommended. Git is hard and you will need it throughout the course.
Assignment 1 - Submission on 8/10 (Introduction to Python Course)
- Python Lists
- Python Basics
- Functions and Packages
Assignment 2 - Submission on 15/10 (Manipulating DataFrames with pandas)
- Numpy
- Extracting and Transforming Data
- Advanced Indexing
Assignment 3 - Submission on 22/10 (Object-Oriented Programming in Python)
- Getting ready for object-oriented programming
- Deep dive into classes and objects
- Fancy classes, fancy objects
Assignment 4 - Submission on 29/10 (Web Scraping in Python Course)
- Introduction to HTML
- XPaths and Selectors
- CSS Locators, Chaining, and Responses
Assignment 5 - Submission on 12/11 (Importing Data in Python (Part 2) Course)
- The Intro to SQL for Data Science (full course)
Assignment 6 - Submission on 19/11 (Merging DataFrames with pandas Course)
- Concatenating and merging data
- Rearranging and reshaping data
- Grouping data
Description:
- November 26th
Description:
- Students work in teams of two.
- The task is to download any data from an API or directly from the web. The data should be processed and visualized in a Jupyter Notebook, with auxiliary scripts as .py files. The project should be submitted as a GitHub repository.
- The selection of the data is entirely up to the students.
- More details during the lecture.
See an example project from last year.
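To give an idea of what an auxiliary .py script in such a project might look like, here is a minimal sketch using requests and BeautifulSoup. The file name `scraper.py`, the function names and the sample HTML are all hypothetical, and a real page will usually need more careful parsing.

```python
"""scraper.py: a hypothetical auxiliary script for a project repository."""
import requests
from bs4 import BeautifulSoup


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its HTML, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def parse_table(html: str) -> list[dict[str, str]]:
    """Turn the first HTML table into a list of row dictionaries."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        return []
    rows = table.find_all("tr")
    if not rows:
        return []
    # The first row is assumed to hold the column names.
    header = [cell.get_text(strip=True) for cell in rows[0].find_all(["th", "td"])]
    records = []
    for row in rows[1:]:
        cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
        if len(cells) == len(header):  # skip malformed rows
            records.append(dict(zip(header, cells)))
    return records


# Inline sample so the parser can be tried without network access.
SAMPLE_HTML = """
<table>
  <tr><th>name</th><th>credits</th></tr>
  <tr><td>Data Processing in Python</td><td>5</td></tr>
</table>
"""
sample_records = parse_table(SAMPLE_HTML)
```

Keeping the download (`fetch_html`) separate from the parsing (`parse_table`) makes the parsing step testable on saved HTML, which helps with reproducibility. In a notebook, the two would typically be chained as `parse_table(fetch_html(url))`.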
Deadlines:
November 12th: Project Topic First Submission
November 26th: Midterm Exam
December 3rd: Project Topic Final Submission
January 21st: Project Submission (to be confirmed)
Evaluation Criteria:
- The project uses data correctly downloaded from a public API or website.
- The download is easily reproducible
- The data were cleaned appropriately
- The data are visualized
- The project is submitted as a public GitHub repository
- All team members visibly collaborated on the GitHub repository
- The code is readable and well documented
- The code is object-oriented
- The project's summary is submitted as a Jupyter notebook.
- The project is distributed as a Python package
- A: above 90 (not inclusive)
- B: between 80 (not inclusive) and 90 (inclusive)
- C: between 70 (not inclusive) and 80 (inclusive)
- D: between 60 (not inclusive) and 70 (inclusive)
- E: between 50 (not inclusive) and 60 (inclusive)
- F: below 50 (inclusive)
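The scale above can be written as a small function, which makes the boundary handling explicit: each lower bound is exclusive, so exactly 90 points is a B and exactly 50 points is an F. The function name is hypothetical.

```python
def letter_grade(points: float) -> str:
    """Map total course points (0-100) to a letter grade.

    Lower bounds are exclusive: a grade is awarded only when the points
    are strictly above its threshold.
    """
    thresholds = ((90, "A"), (80, "B"), (70, "C"), (60, "D"), (50, "E"))
    for threshold, grade in thresholds:
        if points > threshold:
            return grade
    return "F"
```

For instance, `letter_grade(90)` returns `"B"`, while `letter_grade(90.5)` returns `"A"`.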
Jupyter and GitHub intro here
The Jupyter notebook with IES web parser
Date | Topic | Who | Project | HW
---|---|---|---|---
1/10 | Intro, Jupyter, Git (+ GitHub) | Martin | | HW 0
8/10 | Strings, Floats, Lists, Dictionaries, Functions | Vítek | | HW 1
15/10 | Numpy, Pandas, Matplotlib | Martin | | HW 2
22/10 | Object-Oriented Programming | Martin | | HW 3
29/10 | HTML, XML, JSON, requests, APIs, BeautifulSoup | Vítek | | HW 4
5/11 | IES Web Scraper | Vítek | |
12/11 | Introduction to Databases | Vítek | Project Topic Proposal | HW 5
19/11 | Advanced Pandas | Martin | | HW 6
26/11 | MIDTERM | Vítek | |
3/12 | Project Work 1 | | Project Topic Approval |
10/12 | Guest Lecture (TBA) | Guest | |
17/12 | Efficient Computing / Parallelization | Martin | |
7/1 | Project Work 2 | | |