GitHub - abhisheksunny/ETL-Data-Processing-IMDb-highest-rated-movie: An ETL pipeline written with Apache Spark to find the highest rated movie as per a calculated metric for a given crew member and genre as per IMDb.

Background Information & Purpose

This project uses openly available data of IMDb (https://www.imdb.com/). The intent of this project is to find the best movie and rank them based on a calculated metric or on ratings naively for any given crew member (directors/writers). Finally aggregating this data in a reporting table side by side.
The top most movies from calculated metric and from ratings are combined in a single row. This data is further divided by genres and profession of crew members.
According to IMDb, the data is modified every day and hence a data_date partition is associated with every table.

Technology Stack

Apache Spark
AWS S3
AWS EMR

Running Instruction

Update properties.cfg file.
Execute sh execute.sh.

For testing purposes, execute python code/main.py instead of execute.sh.

properties.cfg usage notes:

Partition - Data Date for the run, if not defined code will consider execution date.
run_mode - Decides which set of property to choose. Currently supported S3|local. If undefined or incorrect, local is set by default.
SOURCE_CODE_S3 - Supported in S3 mode. Intermediate source code location.
AWS - Connection details for AWS.
Directories - Ending with _dir. Intermediate location for storing parquet files.

Source Code Files

execute.sh - Shell Script for initiating the entire process.
properties.cfg - Configuration file, containing AWS Keys and all required params.
code/ - Directory containing the entire code that will be executed from EMR. Below are files within this dir.

aws_script.sh - Script that is executed as a step within EMR, that pushes code to EMR and runs main.py.
main.py - Python driver script, controls the ETL execution flow.
args.py - Arguments population script. Takes property values from config file and instantiate variables.
load_staging_table.py - Load the .tsv.gz files downloaded from IMDb as parquet files with appropriate schema and data structure.
dimension_table_creation.py - Python script for converting the staging tables to multiple intermediate dimension tables.
fact_table_creation.py - Python script for joining multiple dimension tables to create a single fact table.
reporting_table_creation.py - Python script for extracting only the reportable data from Fact table
data_quality.py - Data Quality Check script. Running multiple ad-hoc data analysis checks.
custom_logger.py - Python script for logging. Any changes for logging here will easily be reflected on entire project.
source_data_copy.sh - Runs only in S3 mode, downloads data from IMDb into HDFS for processing and then archives to S3.

ETL Flow

LoadStagingTable → DimensionTableCreation → FactTableCreation → ReportingTableCreation

LoadStagingTable [load_staging_table.py] -

First step in the ETL, conversion of source .tsv.gz files to parquet files with appropriate schema and data structure.

DimensionTableCreation [dimension_table_creation.py] -

Converts each table into multiple intermediate tables to be further used in the creation of the Fact table.

FactTableCreation [fact_table_creation.py] -

Creates fact tables by joining and filtering data from dimension tables.

ReportingTableCreation [reporting_table_creation.py] -

Last step of ETL conversion, transforms the fact data into the reportable format.

Source Files

Subsets of IMDb data are available for access to customers for personal and non-commercial use.
• Data Location
The dataset files can be accessed and downloaded from https://datasets.imdbws.com/. The data is refreshed daily.
• IMDb Dataset Details
Each dataset is contained in a gzipped, tab-separated-values (TSV) formatted file in the UTF-8 character set. The first line in each file contains headers that describe what is in each column. A ‘\N’ is used to denote that a particular field is missing or null for that title/name.

Below are the file name and structures with counts as on December 2019.

title.basics.tsv.gz (Counts- 6,316,473)

Column	Data Format	Description
tconst	string	alphanumeric unique identifier of the title
titleType	string	the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc)
primaryTitle	string	the more popular title / the title used by the filmmakers
originalTitle	string	original title, in the original language
isAdult	boolean	0: non-adult title; 1: adult title
startYear	YYYY	represents the release year of a title. In the case of TV Series, it is the series start year
endYear	YYYY	TV Series end year. ‘\N’ for all other title types
runtimeMinutes	numeric	primary runtime of the title, in minutes
genres	string array	includes up to three genres associated with the title

title.crew.tsv.gz (Counts- 6,316,473)

Column	Data Format	Description
tconst	string	alphanumeric unique identifier of the title
directors	array of nconsts	director(s) of the given title
writers	array of nconsts	writer(s) of the given title

title.ratings.tsv.gz (Counts- 992,341)

Column	Data Format	Description
tconst	string	alphanumeric unique identifier of the title
averageRating	decimal	weighted average of all the individual user ratings
numVotes	numeric	number of votes the title has received

name.basics.tsv.gz (Counts- 9,702,445)

Column	Data Format	Description
nconst	string	alphanumeric unique identifier of the name/person
primaryName	string	name by which the person is most often credited
birthYear	YYYY
deathYear	YYYY	if applicable, else '\N'
primaryProfession	array of strings	the top-3 professions of the person
knownForTitles	array of tconsts	titles the person is known for

Fact Table

movie_rating_fact

Column	Data Type	Description	Data Range/Format	Partition
name_const	string	IMDb generated unique ID for name/person	nm[0-9]*
primary_name	string	Name by which the person is most often credited
title_const	string	IMDb generated unique ID for movies	tt[0-9]*
movie_name	string	Popular title of the movie
movie_rating	double	average of all the individual user ratings for that movie	0.0 - 10.0
num_votes	integer	number of votes the movie has received
calculated_metric_movie	double	Calculated Metric of the movie, to be further used for ranking
data_date	date	Date of data on which ETL was executed	YYYY-MM-DD	✓
profession	string	Contribution role of crew member in the title	directors / writers	✓
genres	string	Category of the movie		✓

• Sample Data:

Reporting Table

highest_rated_movie_report

Column	Data Type	Description	Partition
name	string	Crew member name, name of directors/writers
total_movies	long	Total number of movies accounted for
name_HRM	string	Name of the highest rated movie(HRM) as per calculated metric
rating_HRM	double	Rating of the highest rated movie
votes_HRM	integer	Number of votes accumulated for the highest rated movie
calculated_metric_HRM	double	Calculated metric of the highest rated movie
name_AHRM	string	Name of actual highest rated movie(AHRM) based on ratings naively
rating_AHRM	double	Rating of the actual highest rated movie
is_same	string	Values - Yes / No, If both HRM and AHRM movie name are same
genres	string	Category in which calculation of movies is done
data_date	date	Date of data on which ETL was executed, for daily changing data	✓
profession	string	Values - directors / writers, profession of crew member	✓

• Sample Data:

Future Scope / Scenarios

If data was increased by 100x
Processing clusters in this project are created using Amazon EMR, which provides an easy scalability solution with respect to number of nodes and specification of systems. More details could be found here, https://aws.amazon.com/emr/. With respect to this project, changing instance-count within execute.sh while creating an EMR cluster would increase the number of nodes.

If pipelines would be run on a daily basis by 7 am every day
As of now there is no out of the box solution but the project is modular and integrating with Airflow or converting it to a Cron job would be fairly easy. For reference of airflow, check here, https://github.com/apache/airflow.
If database needed to be accessed by 100+ people
Having multiple nodes within the EMR cluster increases concurrency. EMR also provides access via SSH and native support for jupyter notebooks.
Additional reporting tables
Current ETL pipeline filters all non-movie data. An extension of current ETL would be, to include the remaining titleType like TV Series with this as a separate partition.
Just like the existing Fact table, code for tv_fact_table could be added which would report, highest rated episode and highest rated season for TV titles.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Background Information & Purpose

Technology Stack

Running Instruction

Source Code Files

ETL Flow

LoadStagingTable → DimensionTableCreation → FactTableCreation → ReportingTableCreation

Source Files

Fact Table

movie_rating_fact

Reporting Table

highest_rated_movie_report

Future Scope / Scenarios

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
code		code
images		images
README.md		README.md
execute.sh		execute.sh
properties.cfg		properties.cfg

abhisheksunny/ETL-Data-Processing-IMDb-highest-rated-movie

Folders and files

Latest commit

History

Repository files navigation

Background Information & Purpose

Technology Stack

Running Instruction

Source Code Files

ETL Flow

LoadStagingTable → DimensionTableCreation → FactTableCreation → ReportingTableCreation

Source Files

Fact Table

movie_rating_fact

Reporting Table

highest_rated_movie_report

Future Scope / Scenarios

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages