This project uses openly available data of IMDb (https://www.imdb.com/). The intent of this project is to find the best movie and rank them based on a calculated metric or on ratings naively for any given crew member (directors/writers). Finally aggregating this data in a reporting table side by side.
The top most movies from calculated metric and from ratings are combined in a single row. This data is further divided by genres and profession of crew members.
According to IMDb, the data is modified every day and hence a data_date
partition is associated with every table.
- Apache Spark
- AWS S3
- AWS EMR
Update properties.cfg
file.
Execute sh execute.sh
.
For testing purposes, execute python code/main.py
instead of execute.sh.
properties.cfg usage notes:
- Partition - Data Date for the run, if not defined code will consider execution date.
- run_mode - Decides which set of property to choose. Currently supported S3|local. If undefined or incorrect, local is set by default.
- SOURCE_CODE_S3 - Supported in S3 mode. Intermediate source code location.
- AWS - Connection details for AWS.
- Directories - Ending with
_dir
. Intermediate location for storing parquet files.
- execute.sh - Shell Script for initiating the entire process.
- properties.cfg - Configuration file, containing AWS Keys and all required params.
- code/ - Directory containing the entire code that will be executed from EMR. Below are files within this dir.
- aws_script.sh - Script that is executed as a step within EMR, that pushes code to EMR and runs main.py.
- main.py - Python driver script, controls the ETL execution flow.
- args.py - Arguments population script. Takes property values from config file and instantiate variables.
- load_staging_table.py - Load the .tsv.gz files downloaded from IMDb as parquet files with appropriate schema and data structure.
- dimension_table_creation.py - Python script for converting the staging tables to multiple intermediate dimension tables.
- fact_table_creation.py - Python script for joining multiple dimension tables to create a single fact table.
- reporting_table_creation.py - Python script for extracting only the reportable data from Fact table
- data_quality.py - Data Quality Check script. Running multiple ad-hoc data analysis checks.
- custom_logger.py - Python script for logging. Any changes for logging here will easily be reflected on entire project.
- source_data_copy.sh - Runs only in S3 mode, downloads data from IMDb into HDFS for processing and then archives to S3.
- LoadStagingTable
[load_staging_table.py]
- - First step in the ETL, conversion of source .tsv.gz files to parquet files with appropriate schema and data structure.
- DimensionTableCreation
[dimension_table_creation.py]
- - Converts each table into multiple intermediate tables to be further used in the creation of the Fact table.
- FactTableCreation
[fact_table_creation.py]
- - Creates fact tables by joining and filtering data from dimension tables.
- ReportingTableCreation
[reporting_table_creation.py]
- - Last step of ETL conversion, transforms the fact data into the reportable format.
Subsets of IMDb data are available for access to customers for personal and non-commercial use.
• Data Location
The dataset files can be accessed and downloaded from https://datasets.imdbws.com/. The data is refreshed daily.
• IMDb Dataset Details
Each dataset is contained in a gzipped, tab-separated-values (TSV) formatted file in the UTF-8 character set. The first line in each file contains headers that describe what is in each column. A ‘\N’ is used to denote that a particular field is missing or null for that title/name.
Below are the file name and structures with counts as on December 2019.
- title.basics.tsv.gz (Counts- 6,316,473)
Column | Data Format | Description |
---|---|---|
tconst | string | alphanumeric unique identifier of the title |
titleType | string | the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc) |
primaryTitle | string | the more popular title / the title used by the filmmakers |
originalTitle | string | original title, in the original language |
isAdult | boolean | 0: non-adult title; 1: adult title |
startYear | YYYY | represents the release year of a title. In the case of TV Series, it is the series start year |
endYear | YYYY | TV Series end year. ‘\N’ for all other title types |
runtimeMinutes | numeric | primary runtime of the title, in minutes |
genres | string array | includes up to three genres associated with the title |
- title.crew.tsv.gz (Counts- 6,316,473)
Column | Data Format | Description |
---|---|---|
tconst | string | alphanumeric unique identifier of the title |
directors | array of nconsts | director(s) of the given title |
writers | array of nconsts | writer(s) of the given title |
- title.ratings.tsv.gz (Counts- 992,341)
Column | Data Format | Description |
---|---|---|
tconst | string | alphanumeric unique identifier of the title |
averageRating | decimal | weighted average of all the individual user ratings |
numVotes | numeric | number of votes the title has received |
- name.basics.tsv.gz (Counts- 9,702,445)
Column | Data Format | Description |
---|---|---|
nconst | string | alphanumeric unique identifier of the name/person |
primaryName | string | name by which the person is most often credited |
birthYear | YYYY | |
deathYear | YYYY | if applicable, else '\N' |
primaryProfession | array of strings | the top-3 professions of the person |
knownForTitles | array of tconsts | titles the person is known for |
Column | Data Type | Description | Data Range/Format | Partition |
---|---|---|---|---|
name_const | string | IMDb generated unique ID for name/person | nm[0-9]* | |
primary_name | string | Name by which the person is most often credited | ||
title_const | string | IMDb generated unique ID for movies | tt[0-9]* | |
movie_name | string | Popular title of the movie | ||
movie_rating | double | average of all the individual user ratings for that movie | 0.0 - 10.0 | |
num_votes | integer | number of votes the movie has received | ||
calculated_metric_movie | double | Calculated Metric of the movie, to be further used for ranking | ||
data_date | date | Date of data on which ETL was executed | YYYY-MM-DD | ✓ |
profession | string | Contribution role of crew member in the title | directors / writers | ✓ |
genres | string | Category of the movie | ✓ |
Column | Data Type | Description | Partition |
---|---|---|---|
name | string | Crew member name, name of directors/writers | |
total_movies | long | Total number of movies accounted for | |
name_HRM | string | Name of the highest rated movie(HRM) as per calculated metric | |
rating_HRM | double | Rating of the highest rated movie | |
votes_HRM | integer | Number of votes accumulated for the highest rated movie | |
calculated_metric_HRM | double | Calculated metric of the highest rated movie | |
name_AHRM | string | Name of actual highest rated movie(AHRM) based on ratings naively | |
rating_AHRM | double | Rating of the actual highest rated movie | |
is_same | string | Values - Yes / No, If both HRM and AHRM movie name are same | |
genres | string | Category in which calculation of movies is done | |
data_date | date | Date of data on which ETL was executed, for daily changing data | ✓ |
profession | string | Values - directors / writers, profession of crew member | ✓ |
- If data was increased by 100x
Processing clusters in this project are created using Amazon EMR, which provides an easy scalability solution with respect to number of nodes and specification of systems. More details could be found here, https://aws.amazon.com/emr/. With respect to this project, changinginstance-count
within execute.sh while creating an EMR cluster would increase the number of nodes.
-
If pipelines would be run on a daily basis by 7 am every day
As of now there is no out of the box solution but the project is modular and integrating with Airflow or converting it to a Cron job would be fairly easy. For reference of airflow, check here, https://github.com/apache/airflow. -
If database needed to be accessed by 100+ people
Having multiple nodes within the EMR cluster increases concurrency. EMR also provides access via SSH and native support for jupyter notebooks. -
Additional reporting tables
Current ETL pipeline filters all non-movie data. An extension of current ETL would be, to include the remaining titleType like TV Series with this as a separate partition.
Just like the existing Fact table, code fortv_fact_table
could be added which would report, highest rated episode and highest rated season for TV titles.