Big Data Computing (2020-2021)

News

July 2021 Exam Session
Registration for the July 2021 exam session is now open on Infostud (id 765581) and will remain open until July 8, 2021. The project submission week opens on July 1, 2021 at 00:00 CEST (Central European Summer Time) and closes on July 7, 2021 at 23:59 CEST.
June 2021 Exam Session
Registration for the June 2021 exam session is now open on Infostud (id 765579) and will remain open until June 17, 2021. The project submission week opens on June 11, 2021 at 00:00 CEST (Central European Summer Time) and closes on June 17, 2021 at 23:59 CEST.

Project Guidelines: A document containing the main guidelines for the final project is available here.
April 2021 Exam Session (Extra): Project Presentation Schedule
The presentation of the project accepted for oral discussion will take place remotely via Google Meet on April 16, 2021 at 12:00PM CEST. Everyone is welcome to join!
IMPORTANT: Back to In-Person Classes
Starting from April 12, 2021, classes will take place in blended mode again (with a 30% attendance limit).
April 2021 Exam Session (Extra)
Registration for the April 2021 exam session (extra) is now open on Infostud (id 762156) and will remain open until April 4, 2021. The project submission week opens on April 5, 2021 at 00:00 CEST (Central European Summer Time) and closes on April 11, 2021 at 23:59 CEST.
NOTE: This extra session is reserved for part-time or working students, students with learning disabilities, students who have not completed their university exams within the set time period, and students who are about to graduate.
IMPORTANT: In-Person Classes Suspended
Starting from Monday, March 15, 2021, all educational activities for all of Sapienza's degree programmes will be held remotely only. Therefore, our classes will continue in the same Zoom meeting room, as per the original schedule.
For any further information, please refer to this link on the Sapienza website.
2020-21 classes are starting!
Classes start on February 23, 2021 at 5:00PM CET and will be held in blended mode. To attend classes either in person or remotely, please check out the instructions below.
February 2021 Exam Session: Final Grades
Final grades are available at this link.
February 2021 Exam Session: Project Presentation Schedule
Presentations of the projects that have been accepted for oral discussion will take place remotely via Google Meet in a one-day session on February 10, 2021 at 9:00AM CET. Everyone is welcome to join!
February 2021 Exam Session
Registration for the February 2021 exam session is now open on Infostud (id 752692) and will remain open until February 7, 2021. The project submission week opens on February 1, 2021 at 00:00 CET (Central European Time) and closes on February 7, 2021 at 23:59 CET.
(Please see the announcement below for additional details on how to submit your project during this session, which is the first one of the academic year 2020-21.)
Students who are planning to submit their project after the January 2021 session should refer to the Big Data Computing 2020-21 Moodle page, rather than the current one (i.e., Big Data Computing 2019-20). This is to align exam sessions with the correct academic year, since academic year 2019-20 formally ends on January 31, 2021. As such, from February 2021 until January 2022, all exam sessions will be listed on the newly created Moodle page indicated above, where students will be able to submit their work during the corresponding Project Submission Week, which will be opened along the way as usual.
(NOTE: Only students who expect to complete the exam in one of the upcoming 2020-21 sessions must subscribe to the Big Data Computing 2020-21 Moodle page!)

General Information

Welcome to the Big Data Computing class!

This is a first-year, second-semester course of the MSc in Computer Science of Sapienza University of Rome.

This repository contains class material along with any useful information for the 2020-2021 academic year.

Class Schedule

Tuesday from 5:00PM to 7:00PM
Wednesday from 4:00PM to 7:00PM

How to Attend Classes

According to the guidelines provided by Sapienza University to counter the COVID-19 pandemic, the course will be held both in person and remotely. For any further information, students must refer to the official documentation available on the Sapienza website.

Attending Classes in Presence: Room G50 - Building G, Viale Regina Elena 295

Students who wish to attend classes in person must book their seat through the Infostud Lab App or the Prodigit Sapienza online booking system, according to the established rules (please see here). Once the booking is confirmed, students must go to Room G50, located on the 3rd floor of Building G at viale Regina Elena 295, according to the class schedule above.

Attending Classes Remotely: Zoom

Students who wish to attend classes remotely need to register for the dedicated Zoom meeting using the following link: https://uniroma1.zoom.us/meeting/register/tZUtd-mupz8rGt3uK2Mz_cKmOGDyVQpNmMfm

Moodle Web Page

Students must subscribe to the Moodle web page, using the same credentials (username/password) they use to access the Wi-Fi network and Infostud services, at the following link: https://elearning.uniroma1.it/course/view.php?id=12771

Office Hours

Tuesday from 2:00PM to 4:00PM, Room 106, located on the 1st floor of Building E at viale Regina Elena 295.
(NOTE: Due to the COVID-19 emergency, office hours will be held exclusively online via Google Meet or Zoom, upon request by email to the following address: tolomei@di.uniroma1.it)

Contacts

Email: tolomei@di.uniroma1.it
Website: https://www.di.uniroma1.it/~tolomei
Bacheca Sapienza: https://corsidilaurea.uniroma1.it/it/users/gabrieletolomeiuniroma1it

Description and Goals

The amount, variety, and rate at which data is being generated nowadays, both by humans and machines, are unprecedented. This opens up a number of challenges on how to deal with those data, as traditional computing paradigms were not conceived to operate at such a scale.

"Big Data" is the umbrella term that has rapidly become popular to describe methodologies and tools specifically designed for collecting, storing, and processing very large or complex data sets. In addition to addressing foundational computer science problems, such as searching and sorting, big data computing mainly focuses on extracting knowledge - thereby value - from large-scale data sets using advanced data analysis techniques, such as machine learning.

This course is intended to provide graduate-level students with a deep understanding of programming models and tools that are suitable for the large-scale analysis of data distributed across clusters of computers. More specifically, the course will give students the ability to proficiently develop big data/machine learning solutions on top of industry standard frameworks, such as Hadoop and Spark, to tackle real-world problems faced by the so-called "Big Five" tech companies (i.e., Apple, Amazon, Google, Microsoft, and Facebook): text/graph analysis, classification/regression, and recommendation, just to name a few.

Prerequisites

The course assumes that students are familiar with the basics of data analysis and machine learning, supported by a strong knowledge of the foundational concepts of calculus, linear algebra, and probability and statistics. In addition, students must have non-trivial computer programming skills (preferably in the Python programming language). Previous experience with Hadoop, Spark, or distributed computing is not required.

Exams

Students must prove their level of comprehension of the subject by developing a software project, leveraging the set of methodologies and tools introduced during classes. Projects must refer to typical Big Data tasks (e.g., clustering, prediction, recommendation) using very large datasets in any application domain of interest. The topic of the project must in any case be agreed upon with the professor in advance; sources from which interesting projects can be selected (e.g., Kaggle) will be suggested throughout the course. Projects can be done either individually or in groups of at most 2 students, and they should be accompanied by a brief presentation written in English (e.g., a few PowerPoint slides). Finally, there will be an oral exam where submitted projects will be discussed in English; other questions on any topic addressed during the course may also be asked, but those can be answered either in English or in Italian, as the student prefers.
A document containing the main guidelines for the final project is available here.

Recommended Textbooks

Mining of Massive Datasets [Leskovec, Rajaraman, Ullman], available online.
Big Data Analysis with Python [Marin, Shukla, VK]
Large Scale Machine Learning with Python [Sjardin, Massaron, Boschetti]
Spark: The Definitive Guide [Chambers, Zaharia]
Learning Spark: Lightning-Fast Big Data Analysis [Karau, Konwinski, Wendell, Zaharia]
Hadoop: The Definitive Guide [White]
Python for Data Analysis [McKinney]

Syllabus

Introduction

The Big Data Phenomenon
The Big Data Infrastructure
- Distributed File Systems (HDFS)
- MapReduce (Hadoop)
- Spark
PySpark + Databricks
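
As a rough illustration of the MapReduce programming model listed above, the following is a minimal single-machine sketch of the classic word count in plain Python. It is not Hadoop or Spark code: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample documents are hypothetical and chosen only for this example, and a real framework would run the map and reduce steps in parallel across a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    """Run map, shuffle, and reduce sequentially on a list of strings."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return reduce_phase(shuffle_phase(mapped))


# Hypothetical toy input, for illustration only.
docs = ["big data computing", "big data with Spark", "Spark and Hadoop"]
counts = word_count(docs)
print(counts)
```

The shuffle step is what a distributed framework performs over the network between mappers and reducers; here it is just an in-memory dictionary.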

Unsupervised Learning: Clustering

Similarity Measures
Algorithms: K-means
Example: Document Clustering
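
To give a flavour of the K-means algorithm mentioned above, here is a minimal plain-Python sketch of Lloyd's iteration on 2D points using Euclidean distance. It is only an illustration under simplifying assumptions: the initial centroids are naively taken as the first `k` points, the helper names and the toy data are hypothetical, and the class itself relies on PySpark rather than on code like this.

```python
import math


def closest(point, centroids):
    """Return the index of the centroid nearest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, k, max_iters=100):
    """Alternate assignment and centroid update until centroids stop moving."""
    # Assumption: naive initialization with the first k points.
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(max_iters):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[closest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    labels = [closest(p, centroids) for p in points]
    return centroids, labels


# Hypothetical toy data: two well-separated groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

In practice the result depends on the initialization, which is why library implementations typically use randomized or k-means++-style seeding.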

Dimensionality Reduction

Feature Extraction
Algorithms: Principal Component Analysis (PCA)
Example: PCA + Handwritten Digit Recognition

Supervised Learning

Basics of Machine Learning
Regression/Classification
Algorithms: Linear Regression/Logistic Regression/Random Forest
Examples:
- Linear Regression -> House Pricing Prediction (i.e., predict the price at which a house will be sold)
- Logistic Regression/Random Forest -> Marketing Campaign Prediction (i.e., predict whether a customer will subscribe to a bank term deposit)

Recommender Systems

Content-based vs. Collaborative filtering
Algorithms: k-NN, Matrix Factorization (MF)
Example: Movie Recommender System (MovieLens)

Graph Analysis

Link Analysis
Algorithms: PageRank
Example: Ranking (a sample of) the Google Web Graph
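
The PageRank algorithm listed above can be sketched with a simple power iteration in plain Python. This is a minimal illustration under stated assumptions, not the material used in class: the graph is a small hypothetical dictionary in which every link target also appears as a key, the damping factor 0.85 is the commonly used default, and the rank of nodes with no outgoing links is spread uniformly over all nodes.

```python
def pagerank(graph, damping=0.85, tol=1e-10, max_iters=200):
    """Power iteration on a dict mapping each node to the list of nodes it links to."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iters):
        # Rank held by dangling nodes (no out-links) is redistributed uniformly.
        dangling = sum(rank[node] for node in nodes if not graph[node])
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        # Each node passes an equal share of its rank along every out-link.
        for node in nodes:
            for target in graph[node]:
                new_rank[target] += damping * rank[node] / len(graph[node])
        delta = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical toy web graph: node -> pages it links to.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
print(sorted(ranks.items(), key=lambda item: item[1], reverse=True))
```

The ranks form a probability distribution over the nodes, so they sum to 1; pages with many or highly ranked in-links end up with the largest values.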

Environment Setup

PySpark + Databricks

In this course, we will be using the Python application programming interface to the Apache Spark framework (a.k.a. PySpark), in combination with Databricks. This will allow you to write and execute PySpark code (as well as pure Python, for that matter) in your browser, with:

Zero configuration required;
Free access to Databricks' powerful cloud infrastructure (including GPUs);
Easy sharing.

Why Databricks?

Starting from this year, our Big Data Computing class at Sapienza has joined the Databricks University Alliance. This is a very active community of educators and faculty members who collaboratively share ideas, thoughts, and actual material on how to improve the teaching of data-science-oriented classes, which ultimately allows students to learn the latest data science tools used in industry.

Where Should I Start with Databricks?

The first thing you have to do in order to start using Databricks is to set up a personal account. Databricks accounts come in two flavours:

Full Platform (paid, 14-day trial)
Community Edition (free)

The former is the standard paid account, which gives you access to the full-fledged Databricks data analytics platform, based on either Microsoft Azure or Amazon AWS computational resources. The latter, instead, allows you to use Databricks on Amazon AWS for free (with some limitations, of course).

For the purposes of our class, all students must sign up for a personal Databricks Community Edition account using this link. Please be sure to select the correct type of account, as highlighted in the snapshot below:

For any further information, please follow the instructions provided in the documentation.

What Databricks Resources Should I Use?

Many big companies have started relying on the Databricks platform to run their data analytics tasks. As such, Databricks is very well documented and provides a lot of useful material to consult. Among such material, I suggest you check out the following:

A self-paced training course; instructions on how to access it are available here
A four-part tutorial on data analytics with Databricks
The official Databricks documentation

Optionally, you may also want to install PySpark on your own local machine.

(NOTE: This step is not required for passing this class)

Local Mode Setup [Optional]

If you would also like to install and configure PySpark on your local machine, please follow the instructions described here. Note that those guidelines may refer to older (or, even worse, deprecated) versions of the required installation packages; please see the official PySpark documentation for the most up-to-date installation instructions.
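
After a local installation, one quick sanity check is to ask the Python interpreter whether the `pyspark` package is visible to it. The following is a minimal sketch assuming PySpark was installed as a regular Python package in the active environment; the helper names `pyspark_available` and `describe_pyspark` are hypothetical and not part of PySpark.

```python
import importlib.util


def pyspark_available():
    """Return True if the pyspark package can be found by this interpreter."""
    return importlib.util.find_spec("pyspark") is not None


def describe_pyspark():
    """Return a short, human-readable status message about the local setup."""
    if not pyspark_available():
        return "PySpark not found in this Python environment."
    import pyspark  # imported lazily, only when the package is present

    return f"PySpark {pyspark.__version__} is installed."


print(describe_pyspark())
```

This only confirms that the package can be imported; running Spark jobs locally also requires a compatible Java runtime, as covered by the official installation instructions.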

Class Schedules

Lecture #	Date	Topic	Material
Lecture 1	02/23/2021	Introduction to Big Data: Motivations and Challenges	[slides: PDF]
Lecture 2	02/24/2021	MapReduce Programming Model	[slides: PDF]
Lecture 3	03/03/2021	Apache Spark	[slides: PDF]
Lecture 4	03/09/2021	PySpark Tutorial (with Databricks)	[notebook: ipynb]
Lecture 5-6	03/10/2021 - 03/16/2021	Clustering	[slides: PDF]
Lecture 7-8	03/17/2021 - 03/23/2021	Clustering Algorithms: K-means	[slides: PDF]
Lecture 9	03/24/2021	Document Clustering with PySpark	[slides: PDF, notebook: ipynb]
Lecture 10-11	03/30/2021 - 03/31/2021	Dimensionality Reduction (Principal Component Analysis)	[slides: PDF, notes: PDF]
Lecture 12	04/07/2021	Principal Component Analysis with PySpark	[notebook: ipynb]
Lecture 13	04/13/2021	Supervised Learning	[slides: PDF]
Lecture 14-15	04/14/2021 - 04/20/2021	Linear Regression	[slides: PDF]
Lecture 16	04/21/2021	Linear Regression with PySpark	[notebook: ipynb]
Lecture 17-18	04/27/2021 - 04/28/2021	Logistic Regression	[slides: PDF, notes: PDF]
Lecture 19	05/04/2021	Decision Trees and Ensembles	[slides: PDF]
Lecture 20	05/05/2021	Evaluation Metrics for Classification	[slides: PDF, notebook: ipynb]
Lecture 21	05/11/2021	Recommender Systems (Part I)	[slides: PDF]
Lecture 22	05/12/2021	Recommender Systems (Part II)	[slides: PDF, notebook: ipynb]
Lecture 23	05/18/2021	Graph Link Analysis	[slides: PDF]
Lecture 24	05/19/2021	PageRank	[slides: PDF, notes: PDF, notebook: ipynb]
----------	05/19/2021	The Last Take Home Message	[slides: PDF]

Previous Years

In the following, you can quickly navigate through Big Data Computing class information and material from previous years.

NOTE: There is a single folder containing the class material, and it is subject to changes and/or updates; as such, there may be differences between the content displayed on this website and what has been shown in class in the past.

2019-20

