- API source: TMDB's API.
- Thanks to TMDB for the free API data.
This project applies data science methodologies to process and analyze a comprehensive dataset sourced from IMDB, enhanced with financial information retrieved through TMDB's API. The first goal is to perform Extract, Transform, Load (ETL) operations on the raw data, build a MySQL database, and export it into a set of CSV files. The second goal is to use machine learning models and hypothesis testing to extract insights for stakeholders and to recommend strategies for making successful films.
[image source](https://www.istockphoto.com/photo/35mm-film-strip-gm1298343176-391220638?phrase=film/)

The following data is downloaded from IMDB's public datasets and processed based on stakeholders' requirements. In particular, it includes information about movies and their ratings, along with details such as MPAA rating and genre and financial information such as budget and revenue.
- tconst: alphanumeric unique identifier of the title
- titleType: the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc)
- primaryTitle: the more popular title / the title used by the filmmakers on promotional materials at the point of release
- originalTitle: original title, in the original language
- isAdult: 0: non-adult title; 1: adult title
- startYear (YYYY): represents the release year of a title. In the case of TV Series, it is the series start year
- endYear (YYYY): TV Series end year. ‘\N’ for all other title types
- runtimeMinutes: primary runtime of the title, in minutes
- genres: includes up to three genres associated with the title
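As a rough illustration of how rows with these fields might be cleaned, the sketch below reads a tiny tab-separated sample, maps the `\N` marker to `None`, and keeps only titles of type `movie`. The sample rows, the helper names `clean_value` and `load_movies`, and the filtering rules (drop rows without a runtime or genres) are assumptions made for this example, not the project's actual pipeline.

```python
import csv
import io

# Made-up sample rows in the same tab-separated layout as the fields above.
SAMPLE_TSV = (
    "tconst\ttitleType\tprimaryTitle\toriginalTitle\tisAdult\t"
    "startYear\tendYear\truntimeMinutes\tgenres\n"
    "tt0000001\tmovie\tExample Feature\tExample Feature\t0\t2005\t\\N\t110\tDrama,Romance\n"
    "tt0000002\tshort\tExample Short\tExample Short\t0\t2010\t\\N\t12\tComedy\n"
    "tt0000003\tmovie\tNo Runtime\tNo Runtime\t0\t2012\t\\N\t\\N\tAction\n"
)


def clean_value(value):
    """Map the '\\N' missing-value marker to None."""
    return None if value == r"\N" else value


def load_movies(tsv_text):
    """Return cleaned rows for 'movie' titles that have a runtime and genres."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    movies = []
    for raw in reader:
        row = {key: clean_value(val) for key, val in raw.items()}
        if row["titleType"] != "movie":
            continue  # assumption: only feature films are of interest
        if row["runtimeMinutes"] is None or row["genres"] is None:
            continue  # assumption: incomplete rows are dropped
        row["startYear"] = int(row["startYear"]) if row["startYear"] else None
        row["runtimeMinutes"] = int(row["runtimeMinutes"])
        row["genres"] = row["genres"].split(",")
        movies.append(row)
    return movies


movies = load_movies(SAMPLE_TSV)
print([m["tconst"] for m in movies])
```

Quoting is disabled in the reader because titles may contain literal quote characters that should not be interpreted as CSV quoting.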

- tconst: alphanumeric unique identifier of the title
- averageRating: weighted average of all the individual user ratings
- numVotes: number of votes the title has received

- titleId: a tconst, an alphanumeric unique identifier of the title
- ordering: a number to uniquely identify rows for a given titleId
- title: the localized title
- region: the region for this version of the title
- language: the language of the title
- types: enumerated set of attributes for this alternative title. One or more of the following: "alternative", "dvd", "festival", "tv", "video", "working", "original", "imdbDisplay". New values may be added in the future without warning
- attributes: additional terms to describe this alternative title, not enumerated
- isOriginalTitle: 0: not original title; 1: original title

- imdb_id: unique identifier used by IMDb
- adult: indicates if the content is adult-oriented
- backdrop_path: path to the backdrop image
- belongs_to_collection: information about the collection the movie belongs to, if any
- budget: budget of the movie
- genres: genres associated with the movie
- homepage: official homepage of the movie
- id: unique identifier used by TMDB
- original_language: original language of the movie
- original_title: original title of the movie
- overview: brief summary of the movie
- popularity: measure of the movie's popularity
- poster_path: path to the poster image
- production_companies: production companies involved in the movie
- production_countries: countries where the movie was produced
- release_date: release date of the movie
- revenue: revenue generated by the movie
- runtime: runtime of the movie in minutes
- spoken_languages: languages spoken in the movie
- status: current status of the movie (e.g., Released, Post Production)
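To show how the financial fields above could be pulled out of such records, here is a minimal sketch that works on already-downloaded JSON. The sample records are invented, the helpers `has_financials` and `summarize` are hypothetical, and treating a value of 0 as "not reported" is an assumption of this example rather than something the project states.

```python
import json

# Invented records shaped like the fields listed above (not real API output).
SAMPLE_JSON = """
[
  {"imdb_id": "tt0000001", "id": 1, "budget": 1000000, "revenue": 2500000,
   "release_date": "2005-06-01", "status": "Released"},
  {"imdb_id": "tt0000003", "id": 3, "budget": 0, "revenue": 0,
   "release_date": "2012-03-15", "status": "Released"}
]
"""


def has_financials(record):
    """True when the record reports a positive budget or revenue.

    Assumption: a value of 0 means the figure was not reported.
    """
    return record.get("budget", 0) > 0 or record.get("revenue", 0) > 0


def summarize(record):
    """Keep the identifier, the financial fields, and the release year."""
    release_date = record.get("release_date") or ""
    return {
        "imdb_id": record["imdb_id"],
        "budget": record["budget"],
        "revenue": record["revenue"],
        "release_year": int(release_date[:4]) if release_date else None,
    }


records = json.loads(SAMPLE_JSON)
financials = [summarize(r) for r in records if has_financials(r)]
print(financials)
```

Keeping `imdb_id` in the summary is what would allow these rows to be joined back to the IMDB tables through `tconst`.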
- Hypothesis
  - ($H_0$) Null Hypothesis: The MPAA rating of a movie does not affect its revenue.
  - ($H_A$) Alternative Hypothesis: The MPAA rating of a movie does affect its revenue.
- Select the right test based on data
  - 'revenue' (numeric)
  - 'certification' (categorical)
  - More than 2 groups
- Test
  - One Way ANOVA and/or Tukey
- ANOVA Assumptions
  - No significant outliers
  - Normality
  - Equal variance
- Significance level 0.05
Outliers - Removed from each group
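The text does not say how outliers were identified. One common approach, sketched here purely as an assumption, is to drop values whose absolute z-score within their own group exceeds 3. The revenue figures and the helper `remove_outliers` are made up for illustration.

```python
from statistics import mean, stdev


def remove_outliers(values, z_threshold=3.0):
    """Drop values whose absolute z-score exceeds z_threshold."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return list(values)  # all values identical, nothing to remove
    return [v for v in values if abs(v - mu) / sigma <= z_threshold]


# Made-up revenue figures (in millions) keyed by certification.
revenue_by_rating = {
    "PG": [100, 102, 98, 101, 99, 103, 97, 100, 102, 98, 101, 99, 1000],
    "R": [48, 52, 50, 49, 51, 50, 47, 53, 50, 49, 51, 50],
}

# Outliers are judged within each group, not across the pooled data.
cleaned = {
    rating: remove_outliers(values)
    for rating, values in revenue_by_rating.items()
}
print({rating: len(values) for rating, values in cleaned.items()})
```

Applying the rule per group matters because a revenue that is extreme for one certification may be ordinary for another.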
Normality Test
- The assumption of normality was not met. However, with sufficient samples in each group, we can proceed with the test.
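A normality check of this kind could look like the sketch below, which uses `scipy.stats.normaltest` on one group and then falls back on group size when the test fails. The sample values are invented, and the cutoff `MIN_GROUP_SIZE = 20` is a hypothetical placeholder: the text does not state what sample size was considered sufficient.

```python
from scipy import stats

ALPHA = 0.05
MIN_GROUP_SIZE = 20  # hypothetical cutoff, not taken from the project


def check_normality(values, alpha=ALPHA, min_group_size=MIN_GROUP_SIZE):
    """Run a normality test and report whether the analysis can proceed."""
    result = stats.normaltest(values)
    looks_normal = bool(result.pvalue >= alpha)
    return {
        "n": len(values),
        "pvalue": float(result.pvalue),
        "looks_normal": looks_normal,
        # Proceed if the data look normal or the group is large enough.
        "can_proceed": looks_normal or len(values) >= min_group_size,
    }


# Made-up, strongly right-skewed revenue figures for a single group.
skewed_revenue = [1, 1, 1, 2, 2, 2, 3, 3, 4, 5,
                  6, 8, 10, 15, 25, 40, 80, 150, 300, 600]

report = check_normality(skewed_revenue)
print(report)
```

In practice the same check would be run once per certification group.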
Equal Variance
Because the assumption of equal variance was not met, we use the nonparametric Kruskal-Wallis test instead of one-way ANOVA.
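The decision described here can be sketched with SciPy: run Levene's test for equal variance, and switch from one-way ANOVA to Kruskal-Wallis when it fails. The revenue samples are invented for illustration, and the use of `scipy.stats.levene` as the equal-variance test is an assumption, since the text does not name the test that was used.

```python
from scipy import stats

ALPHA = 0.05

# Made-up revenue samples (in millions) per certification.
groups = {
    "PG": [10, 200, 400, 50, 350, 120, 500, 30, 280, 450],
    "R": [20, 22, 21, 19, 23, 20, 21, 22, 18, 24],
    "NC-17": [1, 2, 1.5, 3, 2.5, 1, 2, 3.5, 1.5, 2],
}
samples = list(groups.values())

# Levene's test: the null hypothesis is that all groups share one variance.
levene_result = stats.levene(*samples)

if levene_result.pvalue < ALPHA:
    # Equal variance rejected, so use the nonparametric alternative.
    test_name = "Kruskal-Wallis"
    result = stats.kruskal(*samples)
else:
    test_name = "one-way ANOVA"
    result = stats.f_oneway(*samples)

reject_null = bool(result.pvalue < ALPHA)
print(test_name, result.pvalue, reject_null)
```

Rejecting the null hypothesis here only says that at least one group differs; it does not say which one.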
Post-Hoc Tukey Comparisons
Based on Tukey's comparisons, we can conclude that PG movies make the most revenue. NC-17 and R-rated movies make significantly less revenue than the others.
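A pairwise comparison of this kind might be run as in the sketch below. It assumes SciPy 1.8 or later, where `scipy.stats.tukey_hsd` is available; the project may well have used a different library. The revenue samples are invented and do not reproduce the results reported above.

```python
from itertools import combinations

from scipy import stats

ALPHA = 0.05

# Made-up revenue samples (in millions) per certification.
groups = {
    "PG": [120, 135, 128, 140, 132, 125, 138, 130],
    "R": [60, 72, 65, 70, 68, 63, 75, 66],
    "NC-17": [5, 8, 6, 9, 7, 4, 10, 6],
}
labels = list(groups)

# result.pvalue[i, j] is the p-value for comparing group i with group j.
result = stats.tukey_hsd(*groups.values())

significant_pairs = []
for i, j in combinations(range(len(labels)), 2):
    if result.pvalue[i, j] < ALPHA:
        significant_pairs.append((labels[i], labels[j]))

print(significant_pairs)
```

Each pair in `significant_pairs` is a pair of certifications whose mean revenues differ at the 0.05 level in this toy data.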