Skip to content

An ETL pipeline written with Apache Spark to find the highest rated movie as per a calculated metric for a given crew member and genre as per IMDb.

Notifications You must be signed in to change notification settings

abhisheksunny/ETL-Data-Processing-IMDb-highest-rated-movie

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Background Information & Purpose

This project uses openly available data of IMDb (https://www.imdb.com/). The intent of this project is to find the best movie and rank them based on a calculated metric or on ratings naively for any given crew member (directors/writers). Finally aggregating this data in a reporting table side by side.
The top most movies from calculated metric and from ratings are combined in a single row. This data is further divided by genres and profession of crew members.
According to IMDb, the data is modified every day and hence a data_date partition is associated with every table.


Technology Stack

  • Apache Spark
  • AWS S3
  • AWS EMR

Running Instruction

Update properties.cfg file.
Execute sh execute.sh.

For testing purposes, execute python code/main.py instead of execute.sh.

properties.cfg usage notes:

  • Partition - Data Date for the run, if not defined code will consider execution date.
  • run_mode - Decides which set of property to choose. Currently supported S3|local. If undefined or incorrect, local is set by default.
  • SOURCE_CODE_S3 - Supported in S3 mode. Intermediate source code location.
  • AWS - Connection details for AWS.
  • Directories - Ending with _dir. Intermediate location for storing parquet files.

Source Code Files

Code Structure

    • execute.sh - Shell Script for initiating the entire process.
    • properties.cfg - Configuration file, containing AWS Keys and all required params.
    • code/ - Directory containing the entire code that will be executed from EMR. Below are files within this dir.
      • aws_script.sh - Script that is executed as a step within EMR, that pushes code to EMR and runs main.py.
      • main.py - Python driver script, controls the ETL execution flow.
      • args.py - Arguments population script. Takes property values from config file and instantiate variables.
      • load_staging_table.py - Load the .tsv.gz files downloaded from IMDb as parquet files with appropriate schema and data structure.
      • dimension_table_creation.py - Python script for converting the staging tables to multiple intermediate dimension tables.
      • fact_table_creation.py - Python script for joining multiple dimension tables to create a single fact table.
      • reporting_table_creation.py - Python script for extracting only the reportable data from Fact table
      • data_quality.py - Data Quality Check script. Running multiple ad-hoc data analysis checks.
      • custom_logger.py - Python script for logging. Any changes for logging here will easily be reflected on entire project.
      • source_data_copy.sh - Runs only in S3 mode, downloads data from IMDb into HDFS for processing and then archives to S3.

ETL Flow

LoadStagingTableDimensionTableCreationFactTableCreationReportingTableCreation


  • LoadStagingTable [load_staging_table.py] -
    • First step in the ETL, conversion of source .tsv.gz files to parquet files with appropriate schema and data structure.
  • DimensionTableCreation [dimension_table_creation.py] -
    • Converts each table into multiple intermediate tables to be further used in the creation of the Fact table.
  • FactTableCreation [fact_table_creation.py] -
    • Creates fact tables by joining and filtering data from dimension tables.
  • ReportingTableCreation [reporting_table_creation.py] -
    • Last step of ETL conversion, transforms the fact data into the reportable format.

Source Files

Subsets of IMDb data are available for access to customers for personal and non-commercial use.
• Data Location
The dataset files can be accessed and downloaded from https://datasets.imdbws.com/. The data is refreshed daily.
• IMDb Dataset Details
Each dataset is contained in a gzipped, tab-separated-values (TSV) formatted file in the UTF-8 character set. The first line in each file contains headers that describe what is in each column. A ‘\N’ is used to denote that a particular field is missing or null for that title/name.

Below are the file name and structures with counts as on December 2019.

  • title.basics.tsv.gz (Counts- 6,316,473)
Column Data Format Description
tconst string alphanumeric unique identifier of the title
titleType string the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc)
primaryTitle string the more popular title / the title used by the filmmakers
originalTitle string original title, in the original language
isAdult boolean 0: non-adult title; 1: adult title
startYear YYYY represents the release year of a title. In the case of TV Series, it is the series start year
endYear YYYY TV Series end year. ‘\N’ for all other title types
runtimeMinutes numeric primary runtime of the title, in minutes
genres string array includes up to three genres associated with the title
  • title.crew.tsv.gz (Counts- 6,316,473)
Column Data Format Description
tconst string alphanumeric unique identifier of the title
directors array of nconsts director(s) of the given title
writers array of nconsts writer(s) of the given title
  • title.ratings.tsv.gz (Counts- 992,341)
Column Data Format Description
tconst string alphanumeric unique identifier of the title
averageRating decimal weighted average of all the individual user ratings
numVotes numeric number of votes the title has received
  • name.basics.tsv.gz (Counts- 9,702,445)
Column Data Format Description
nconst string alphanumeric unique identifier of the name/person
primaryName string name by which the person is most often credited
birthYear YYYY
deathYear YYYY if applicable, else '\N'
primaryProfession array of strings the top-3 professions of the person
knownForTitles array of tconsts titles the person is known for

Fact Table

movie_rating_fact

Column Data Type Description Data Range/Format Partition
name_const string IMDb generated unique ID for name/person nm[0-9]*
primary_name string Name by which the person is most often credited
title_const string IMDb generated unique ID for movies tt[0-9]*
movie_name string Popular title of the movie
movie_rating double average of all the individual user ratings for that movie 0.0 - 10.0
num_votes integer number of votes the movie has received
calculated_metric_movie double Calculated Metric of the movie, to be further used for ranking
data_date date Date of data on which ETL was executed YYYY-MM-DD
profession string Contribution role of crew member in the title directors / writers
genres string Category of the movie

• Sample Data:
fact_table_data


Reporting Table

highest_rated_movie_report

Column Data Type Description Partition
name string Crew member name, name of directors/writers
total_movies long Total number of movies accounted for
name_HRM string Name of the highest rated movie(HRM) as per calculated metric
rating_HRM double Rating of the highest rated movie
votes_HRM integer Number of votes accumulated for the highest rated movie
calculated_metric_HRM double Calculated metric of the highest rated movie
name_AHRM string Name of actual highest rated movie(AHRM) based on ratings naively
rating_AHRM double Rating of the actual highest rated movie
is_same string Values - Yes / No, If both HRM and AHRM movie name are same
genres string Category in which calculation of movies is done
data_date date Date of data on which ETL was executed, for daily changing data
profession string Values - directors / writers, profession of crew member

• Sample Data:
reporting_table_data


Future Scope / Scenarios

  • If data was increased by 100x
    Processing clusters in this project are created using Amazon EMR, which provides an easy scalability solution with respect to number of nodes and specification of systems. More details could be found here, https://aws.amazon.com/emr/. With respect to this project, changing instance-count within execute.sh while creating an EMR cluster would increase the number of nodes.

  • If pipelines would be run on a daily basis by 7 am every day
    As of now there is no out of the box solution but the project is modular and integrating with Airflow or converting it to a Cron job would be fairly easy. For reference of airflow, check here, https://github.com/apache/airflow.

  • If database needed to be accessed by 100+ people
    Having multiple nodes within the EMR cluster increases concurrency. EMR also provides access via SSH and native support for jupyter notebooks.

  • Additional reporting tables
    Current ETL pipeline filters all non-movie data. An extension of current ETL would be, to include the remaining titleType like TV Series with this as a separate partition.
    Just like the existing Fact table, code for tv_fact_table could be added which would report, highest rated episode and highest rated season for TV titles.


About

An ETL pipeline written with Apache Spark to find the highest rated movie as per a calculated metric for a given crew member and genre as per IMDb.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published