- Graduate School, Leavey School of Business
- Department of Information Systems & Analytics
- Course MSIS 2627: Big Data Modeling & Analytics
- Big-Data-MapReduce Course @ Santa Clara University
- Class meeting dates:
- Start: September 9, 2022
- End: December 9, 2022
- Class hours:
- Tuesday 5:45 PM - 7:20 PM PST (TBDL/online/via Zoom)
- Thursday 5:45 PM - 7:20 PM PST (TBDL/online/via Zoom)
- Instructor: Mahmoud Parsian
- Classroom: Lucas Hall 210
- Office: 216AA, 2nd Floor, Lucas Hall (not used due to COVID-19)
- Office Hours: TBDL (or by appointment)
- Office Hours etiquette: if you plan to attend an office hour, please send me an email.
1. Data Algorithms with Spark by Mahmoud Parsian
2. Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer
- 1. A Very Brief Introduction to MapReduce by Diana MacLean
- 2. Introduction to MapReduce by Mahmoud Parsian
- 3. MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat
- 1. PySpark Algorithms by Mahmoud Parsian
- 2. Source code @github.com -- PySpark Algorithms by Mahmoud Parsian
- 3. Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman
- 4. Big Data Now -- book
- 5. Designing Good MapReduce Algorithms by Ullman
- 6. Bigtable: A Distributed Storage System for Structured Data
- 7. Relational Algebra and MapReduce
- 8. MapReduce examples
- 9. MapReduce and relational algebra
- 10. Spark Streaming Tutorial
- 11. Billion Taxi Rides on Amazon Athena
- Apache Spark Site
- Apache Spark Download (use version 3.2.1)
The main focus of this class is to cover the following concepts:
- Concepts of Big Data
- Distributed File Systems
- Distributed Computing
- Distributed and Parallel Algorithms
- MapReduce Paradigm
- MapReduce Algorithms
- Scale-out Architectures (using Hadoop, Spark, PySpark)
- Apache Spark
- Using Spark, PySpark, and Python to teach MapReduce and distributed computing
- How to Use SQL on NoSQL Data
- Amazon Athena
- Amazon Athena, S3, Data Partitioning