We often hear that the second movie in a franchise is always worse than the first. However, human memories tend to be influenced by nostalgia, and it remains uncertain if this classic dinner table debate stands on tangible evidence. In this project, we aim to settle this debate by analyzing the movie data from the CMU Movie Summary Corpus supplemented by TMDB dataset. The preliminary questions to this analysis are, "What makes a good franhcise movie?" and “Are franchise movies more profitable than non-franchise movies?”. In search of answers to these questions, we investigate various metrics such as box office revenue, viewer rating, diversity representation etc., for franchise movies and make a contrast to non-franchise movies.
Our blog posts are available in the following link: https://clementloyer.github.io/ada-website.github.io/
Following the objectives discussed in the abstract, here is the list of concrete questions to tackle with:
-
Do franchise movies degrade in quality and box office revenue as the sequel continues?
-
Is there an underlying pattern of features that makes franchise movies successful? If so, what makes franchise movies different from non-franchise movies?
-
Do some movie genres achieve higher box office revenue in movie franchise than the others? Is the trend consistent in non-franchise movies as well?
-
Does the length and the number of movies impact the success of a franchise and how does it evolves in the franchise?
-
Do actors of certain ethnicity/gender groups play particular personas more frequently? Are they depicted positively (hero/heroine) or negatively (villain) in the movie?
-
From which regions of the world do most franchises come from, and what are the dominant collaboration fluxes between countries? Does it differ a lot from non-franchised movies? Are there parts of the world that mostly create single movies instead of movie series? Are there some parts of the world that interact more often when creating sagas of movies? Are there recurrent bonds that can be identified? And finally, do some features regarding countries of origin have a link with movie revenue and reviews ?
-
We complement our movie data by merging data from the movie database (TMDB). This community-based movie database offers free API for non-commercial use, and we used the database to identify franchise movies in the CMU dataset. We also queried additional features for each movie to utilize them in our analysis. The table below shows the summary of newly added features:
Feature | Description |
---|---|
tmdb_id |
Unique movie ID for TMDB |
collection_name |
Franchise name |
collection_id |
Unique ID for the franchise |
vote_count |
Total number of votes the movie received on TMDB |
vote_average |
Average rating score based on user votes on TMDB |
genres |
Genre(s) the movie belongs to |
budget |
Production budget of the movie |
revenue |
Revenue generated by the movie |
run_time |
Total runtime of the movie in minutes |
tmdb_origin_country |
Country where the movie was produced |
tmdb_original_language |
Primary language of the movie's production |
We decided to add this dataset and the mentionned features as they seem relevant to answer our research questions, and that the given data set had some issues, such as the genres proposed, which were very specific (this is shown in results.ipynb), not usable, and that needed to be grouped together.
Considering the inflation rate, all manetary features such as revenue, budget and profit were adjusted for analysis usng the following formula: $$ \text{Real Price} = \frac{\text{Nominal Price in Year X} \times \text{CPI in Base Year}}{\text{CPI in Year X}} $$ The chosen base year is 2024. The CPI data are taken from the Federal Reserve Bank of Minneapolis. We merged the revenue and the budget data of the CMU and TMBD to maximize our number of data on those features. Franchise movies have higher box office revenue and higher budget than non franchise ones and we examined its statistical significance.
To know which underlying pattern of features makes franchise movies successful, we first want to look at the parameters that are at play: we want to see if they are usable for our analysis, and if they have any influence on reviews or revenues at all. After this initial stage of data exploration, we train a decision tree (HistGradientBoostingClassifier
) to predict whether the subsequent movie exists for a given movie based on features such as genre, language and gender ratio of actors. Shapley values are calculated to identify features that contribute the best to the prediction and therefore, illustrating patterns common to franchise movies.
We can look at violin plots of revenue normalized by budget (Q1) depending on the genres, and do the same for non-franchise movies. To know better how the genres interacts, and how their interactions affect the revenue, we could plot a heatmap of how often genres are paired together. We can then see if the ones more frequent in franchise movies perform better or not: if an under-represented genre in the franchise movies subset performs better than others, are they less frequent because they are paired? This can be answered by looking at the plots mentionned above.
We could also do the same for movie reviews to answer these questions: which genres are more appreciated? Are they the same for franchise and non-franchise movies? Is there a link between genre movie production and review?
To identify patterns in the size and duration of a franchise, a timeline plot was created for all franchises. Due to the large number of franchises, sorting, filtering, and coloring options have been provided.
Another goal is to subcategorise franchises for more detailed analysis, as examining all franchises together tends to yield less clear results. A clustering algorithm was then used to group franchise that have similar features, then by examining the result, we have observed some common relation and general behaviour.
We used a K-means model, for both the franchise as a whole and then at a deeper level with the 1st and 2nd movie of the franchise. For the second part, the franchise where the ratio increased from the first to the second movie were separated from the ones that decrease to make the analysis a bit easier.
The features used for both analysis:
-
at the franchise level : runtime_avg
movie_count
average_years_bt_movies
franchise_length_years revenue_avg budget_avg ratio_rb average_score
genres (is vectorized) country (is vecotrized) -
at the first and second movie level: vote_count_1 vote_count_2 vote_average_1 vote_average_2 run_time_1 run_time_2 release_year_1 release_year_2 years_diff_bt_pre_movies_2 real_revenue_1 real_revenue_2 real_budget_1 real_budget_2 real_profit_1 real_profit_2 ratio_revenue_budget_1 ratio_revenue_budget_2 num_genres_1 num_genres_2 collection_size_1 genres_1 (is vectorized) genres_2 (is vectorized) country_1 (is vectorized) country_2 (is vectorized)
By cross-referencing Actor_ethnicity_Freebase_ID
with Wikidata, we restored >400 unique ethnicity categories. Inspired by racial groups used in the British and the USA census, we came up with the following 7 racial groups into which we manually map these ethnicity categories:
Hispanic, White, Black, Asian, Native American, Middle Eastern, Others
With these racial groups, we first looked into the representation of each group in franchise and non-franchise movies. Second, adjectives describing movie characters were extracted from corresponding movie plots, and mean sentiment scores of these adjectives were assigned to each character using TextBlob. For adjective extraction, we relied on GPT-4o mini by providing the following prompts via Open AI API. The resulting json files are saved in data/character_kws
.
System prompt (affects all responses from the model):
Given a list of character names and a movie plot summary, return a JSON object where each character name is a key, and the value is a list of adjectives that describe the character. Do not repeat the same words in the list. If a character is not mentioned or described in the plot, return an empty list for that character. The output should be directly loaded by json.loads() in Python.
Individual prompt (given for each movie plot)
Character names: {characters} \nMovie plot summary: {plot}
The idea is to compare two networked maps of the world; one considering only franchise movies, the other considering all movies in the dataset.
The maps show one node for each country of the dataset (or regions for more clarity) and the connections between them. For each movie with a pair of origin countries, a connection is created. When a movie has multiple countries of origin, multiple pairs are created.
Do connections increase the box office revenue of the movies? Is the effect significant? And significantly different from non-franchise movies?
- Takuya was in charge of downloading TMDB dataset and analysing character data as well as building a decision tree and conducting Shapley value analysis.
- Maylis worked on the movie genres and their subsequent analysis.
- Salomé worked on analyzin the correlation between box office revenue, budget, its ratio and other feature of a movie. She was jointly involved in builidng a decision tree model with Takuya.
- Clément analyzed dataset from geographical dimension such as countries and regions, as well as leading the website creation using Jekyll.
- Pierre worked on the timeline visualisation and the research of the significant parameter and their influence.
From the project root, please run:
conda install --file requirements.txt
-
Make sure that
movie.metadata.tsv
is indata/
. The CMU dataset can be downloaded from this link. -
Obtain an API key from TMDB. Please follow the instruction on this webpage.
-
Create
data/constants.py
and add the following:
API_KEY = "YOUR_API_KEY"
- From the root, run
python fetch_data_from_tmdb.py
. This will createdata/movie_metadata_with_tmdb.csv
. Note that the run will take 2-3 hours, depending on your Internet connection.
-
Make sure that
plot_summaries.txt
andcharacter.metadata.tsv
are indata/
. -
Obtain an API key from Open AI API.
-
Create
data/constants.py
and add the following:
OPENAI_API_KEY = "YOUR_API_KEY"
- From the root, run
python query_chatgpt.py
. This will save json and pkl files indata/character_kws
. The code is already optimized not to exceed the rate limit for Tier 1 users. We queried 3000 plots at a time to follow this limit.