The Business of Show Business: A Deep Dive into the Film Industry
Fatima W. | 10/18/2024
What comes to mind when you’re picking a movie to watch? Do you look for a certain genre, actors/actresses, or directors/producers? There are hundreds of thousands of movies in existence, and thousands more to be made. One important fact about the entertainment industry is that its consumers have an insatiable demand for content, but not just any content. Surely, there are thousands of lesser-known films/television shows that flopped or didn’t receive the recognition they may have deserved. At the same time, there are hundreds of movies deemed successful in their own ways, whether due to box office sales, critic reviews, award nominations, or average rating. With that being said, what determines the success of a movie? The motivation of this project is not only to understand what makes a film successful, but also to develop a non-financial success metric that can be used in deciding what kind of movie should be remade, and how. So grab some popcorn and get comfy, you’re in for a treat!
The Data Explained
To begin the analysis, we’ll be using a collection of non-commercial datasets from IMDb, the Internet Movie Database, which provides information on millions of films/television shows. These datasets include information about movies/TV series, the cast and crew members, average ratings submitted by users, the number of people who have rated each title, and more. More detailed information can be found here; however, I’ve included a mapped graphic to help visualize how the datasets I’ll be using relate to one another. This mapping will be useful to refer to when reading the code in this analysis.
# load in all our data and packages
library(dplyr)
library(ggplot2)
library(stringr)
library(DT)
library(tidyr)

# download an IMDb file once (if not already on disk) and read it into a data frame
get_imdb_file <- function(fname){
  BASE_URL <- "https://datasets.imdbws.com/"
  fname_ext <- paste0(fname, ".tsv.gz")
  if(!file.exists(fname_ext)){
    FILE_URL <- paste0(BASE_URL, fname_ext)
    download.file(FILE_URL, destfile = fname_ext)
  }
  as.data.frame(readr::read_tsv(fname_ext, lazy = FALSE))
}
Because the data sets are so large, we’ll restrict our attention to people with at least two “known for” credits within the NAME_BASICS table. Since there is also a long list of obscure records, we’ll remove titles with fewer than 100 ratings. We can see that these titles make up about 75% of the entire data set. The same filtering will be performed on our other data sets.
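The filtering described above can be sketched roughly as follows. This is a minimal illustration on made-up toy tables (the `_toy` data frames and their rows are hypothetical), and it assumes that knownForTitles is a comma-separated string, so that two or more credits means at least one comma. It is one plausible way to do the step, not necessarily the exact code behind the numbers quoted here.

```r
library(dplyr)
library(stringr)

# Toy stand-ins for the IMDb tables (rows are invented for illustration)
NAME_BASICS_toy <- data.frame(
  nconst = c("nm1", "nm2", "nm3"),
  knownForTitles = c("tt1,tt2", "tt3", "tt1,tt2,tt3")
)
TITLE_RATINGS_toy <- data.frame(
  tconst = c("tt1", "tt2", "tt3"),
  averageRating = c(8.1, 6.4, 7.0),
  numVotes = c(2500, 40, 100)
)
TITLE_BASICS_toy <- data.frame(
  tconst = c("tt1", "tt2", "tt3"),
  primaryTitle = c("A", "B", "C")
)

# Two or more credits means at least one comma in knownForTitles
names_kept <- NAME_BASICS_toy |>
  filter(str_count(knownForTitles, ",") >= 1)

# Keep titles with at least 100 ratings
ratings_kept <- TITLE_RATINGS_toy |>
  filter(numVotes >= 100)

# Apply the same restriction to another table by keeping only matching titles
basics_kept <- TITLE_BASICS_toy |>
  semi_join(ratings_kept, by = "tconst")
```

Using `semi_join()` keeps only the rows of the other table whose tconst survived the ratings filter, without adding any rating columns to it.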
After extracting our data, our first task is to ensure each variable is assigned its proper data type or mode. Most columns in these datasets are read in as character vectors; however, some should be classified as numeric or logical, so we’ll clean these columns in each table.
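As a hedged sketch of what this cleaning can look like, the snippet below converts two character columns on a made-up slice of TITLE_BASICS (the `_toy` table is hypothetical). It relies on the fact that the IMDb files mark missing values with the literal string `\N`; which columns actually need converting in each table is an assumption here.

```r
library(dplyr)

# Toy slice of TITLE_BASICS as read in: every column is character,
# and missing values appear as the literal string "\N"
TITLE_BASICS_toy <- data.frame(
  tconst = c("tt1", "tt2"),
  startYear = c("1977", "\\N"),
  isAdult = c("0", "1")
)

TITLE_BASICS_clean <- TITLE_BASICS_toy |>
  mutate(
    # Turn "\N" into NA first so as.numeric() does not warn
    startYear = as.numeric(na_if(startYear, "\\N")),
    # as.logical("0") is NA in R, so go through numeric first
    isAdult = as.logical(as.numeric(isAdult))
  )
```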
An Exploratory Analysis of the TV Production & Film Industry
Now that we’ve gathered and cleaned our data, we’ll begin analyzing it and uncovering insights. First, let’s find out how many movies, TV series, and TV episodes are present in our data set. Shown below, there are just over 130K movies, roughly 155K TV episodes, and nearly 30K TV series.
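Counts like these can be obtained with a simple tally over the titleType column. The sketch below uses a tiny made-up table (`TITLE_BASICS_toy` is hypothetical), so its numbers are not the ones quoted above.

```r
library(dplyr)

# Toy stand-in for TITLE_BASICS (rows are invented for illustration)
TITLE_BASICS_toy <- data.frame(
  tconst = paste0("tt", 1:6),
  titleType = c("movie", "tvEpisode", "tvEpisode", "tvSeries", "movie", "short")
)

# Count titles for the three types of interest
type_counts <- TITLE_BASICS_toy |>
  filter(titleType %in% c("movie", "tvSeries", "tvEpisode")) |>
  count(titleType, name = "n_titles")
```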
With this many titles across so many title types, I wanted to know who the oldest living person in our data set is and what their profession is. One important thing to consider, however, is that NA values in the deathYear column of the NAME_BASICS data set have a double meaning: they may indicate that the year of death is “unknown” or that the person is “still alive/not dead yet.” According to Guinness World Records, the oldest person alive in the world is Tomiko Itooka, who was born in 1908. Therefore, we’ll keep only people who were born in 1908 or later and whose deathYear is NA. As a result, the oldest living person in our data set is Angel Acciaresi, who was an assistant director, director, and writer. Seeing this person’s age is a nice reminder of how long the entertainment industry has existed and how far it’s come.
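A minimal sketch of this lookup, on made-up rows (the names and years in `NAME_BASICS_toy` are hypothetical), assuming birthYear and deathYear have already been converted to numeric:

```r
library(dplyr)

# Toy stand-in for NAME_BASICS (rows are invented for illustration)
NAME_BASICS_toy <- data.frame(
  primaryName = c("Person A", "Person B", "Person C", "Person D"),
  birthYear = c(1895, 1908, 1930, NA),
  deathYear = c(NA, NA, 2001, NA)
)

oldest_living <- NAME_BASICS_toy |>
  # Born in 1908 or later, with no recorded year of death;
  # rows with a missing birthYear are dropped by the comparison
  filter(birthYear >= 1908, is.na(deathYear)) |>
  arrange(birthYear) |>
  slice_head(n = 1)
```

Here Person A is excluded even though deathYear is NA, because a birth year before 1908 makes the "unknown" reading of NA far more likely than the "still alive" one.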
I’m also interested in uncovering some insights about TV productions and their ratings, given a baseline of at least 200,000 IMDb ratings. Exactly one TV episode in this data set fits this criterion: Ozymandias, from the American crime drama television series Breaking Bad. In fact, this episode ranks number one among the “Best TV Episodes” for its perfect 10/10 rating.
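One way to express this query is sketched below, assuming the 200,000 threshold applies to numVotes and using made-up rows (both `_toy` tables are hypothetical):

```r
library(dplyr)

# Toy stand-ins for the IMDb tables (rows are invented for illustration)
TITLE_BASICS_toy <- data.frame(
  tconst = c("tt1", "tt2", "tt3"),
  primaryTitle = c("Episode X", "Episode Y", "Movie Z"),
  titleType = c("tvEpisode", "tvEpisode", "movie")
)
TITLE_RATINGS_toy <- data.frame(
  tconst = c("tt1", "tt2", "tt3"),
  averageRating = c(9.8, 8.9, 9.0),
  numVotes = c(250000, 50000, 2000000)
)

# TV episodes with at least 200,000 ratings, best rated first
popular_episodes <- TITLE_BASICS_toy |>
  filter(titleType == "tvEpisode") |>
  inner_join(TITLE_RATINGS_toy, by = "tconst") |>
  filter(numVotes >= 200000) |>
  arrange(desc(averageRating))
```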
I also wanted to know which TV series has the highest average rating. To answer this, I’ll restrict the comparison to series with more than 12 episodes, and I’ll use the average rating of the TV series as a whole rather than averaging the ratings of its individual episodes. As a result, the highest average rating of a TV series is 9.71, which belongs to “The Youth Memories.”
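The approach described above could be sketched as follows: count episodes per series, keep series with more than 12, then attach the series-level rating. The toy tables and their rows are hypothetical, and the join keys assume the usual IMDb layout in which each episode row carries a parentTconst pointing at its series.

```r
library(dplyr)

# Toy stand-ins for the IMDb tables (rows are invented for illustration)
TITLE_BASICS_toy <- data.frame(
  tconst = c("s1", "s2"),
  primaryTitle = c("Series One", "Series Two"),
  titleType = "tvSeries"
)
TITLE_EPISODES_toy <- data.frame(
  tconst = paste0("e", 1:18),
  parentTconst = c(rep("s1", 13), rep("s2", 5))
)
TITLE_RATINGS_toy <- data.frame(
  tconst = c("s1", "s2"),
  averageRating = c(8.2, 9.5),
  numVotes = c(1200, 300)
)

# Series with more than 12 episodes
episode_counts <- TITLE_EPISODES_toy |>
  count(parentTconst, name = "n_episodes") |>
  filter(n_episodes > 12)

# Use the rating of the series itself, not the mean of its episode ratings
top_series <- TITLE_BASICS_toy |>
  filter(titleType == "tvSeries") |>
  inner_join(episode_counts, join_by(tconst == parentTconst)) |>
  inner_join(TITLE_RATINGS_toy, by = "tconst") |>
  slice_max(averageRating, n = 1)
```

In the toy data, Series Two has the higher rating but only five episodes, so it is excluded by the episode benchmark.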
Of all the TV series that exist across cultures worldwide in multiple languages and genres, a Chinese romance drama earned the highest average rating. What made this specific drama stand out? Could it be due to its cultural appeal? For example, are Chinese dramas more successful/popular than Turkish dramas? Or maybe it’s the series’ romance elements that attract viewers. As someone who occasionally watches C-dramas, I think this finding is interesting and worth keeping in mind as we dig deeper into what makes a movie/series successful.
Specific Movie Observations
Next, I wanted to take a look at well-known actors/actresses and the projects they’re known for. More specifically, I’ll look at American actor Mark Hamill and the average ratings of the movies he took part in. Unsurprisingly, Hamill is known for his role in the Star Wars original and sequel trilogies. However, take a look at the average ratings and the ranking of each film. Some may assume that movie sequels are generally worse than the originals, which is, of course, subjective and opinion-based. Keeping this in mind, we see that’s not exactly the case here: Star Wars: Episode V - The Empire Strikes Back receives a higher ranking than Star Wars: Episode IV - A New Hope, which was released three years prior.
# split Mark Hamill's "known for" titles into rows, attach titles and ratings,
# and keep his four highest-rated projects
mark_hamill_top4_projects <- NAME_BASICS |>
  filter(primaryName == 'Mark Hamill') |>
  separate_longer_delim(knownForTitles, ',') |>
  rename(tconst = knownForTitles) |>
  left_join(TITLE_BASICS, by = "tconst") |>
  left_join(TITLE_RATINGS, by = "tconst") |>
  arrange(desc(averageRating), desc(numVotes)) |>
  slice_head(n = 4) |>
  select(primaryTitle, startYear, averageRating, numVotes)

colnames(mark_hamill_top4_projects) <- c('Movie Title', 'Year of Release', 'Avg. Rating', 'Number of Votes')

mark_hamill_top4_projects |> DT::datatable()
The Rise & Fall of TV Series: Happy Days
Have you ever heard the phrase “jump the shark”? This common idiom describes the moment when a once-great show becomes ridiculous and rapidly loses viewers as its quality declines. The idiom originated from “Happy Days,” an American sitcom that premiered in 1974 and ran for 11 seasons. In season 5, episode 3, one of the show’s characters, Fonzie (Henry Winkler), takes on a challenge to prove his bravery by water-skiing over a confined shark. As the series continued, viewers grew tired of the show and pointed to this episode as the moment where the entire series began to go downhill, hence the phrase “jump the shark.” I bring this up to see how the show performed before and after this episode. More specifically, is it true that episodes from later seasons of Happy Days have lower average ratings than those from the early seasons?
Because there are 11 seasons, we’ll treat seasons 1 through 5 as “early seasons” and seasons 6 through 11 as “later seasons.” As shown below, the series indeed had a higher average rating in its earlier seasons than in its later seasons.
# Is it true that episodes from later seasons of Happy Days have lower average ratings than the early seasons?
happydays <- TITLE_BASICS |>
  filter(primaryTitle == 'Happy Days', titleType == 'tvSeries') |>
  select(startYear, endYear, tconst)
# Happy Days tconst = tt0070992

# attach each episode's rating to its season and episode number
happydays_getavg <- TITLE_EPISODES |>
  inner_join(happydays, join_by(parentTconst == tconst)) |>
  left_join(TITLE_RATINGS, join_by(tconst == tconst)) |>
  select(seasonNumber, episodeNumber, averageRating)

# average rating for seasons 1-5
happydays_earlyavg <- happydays_getavg |>
  filter(seasonNumber <= 5) |>
  summarize(avg_rating1 = round(mean(averageRating, na.rm = TRUE), 2))

# average rating for seasons 6-11
happydays_lateravg <- happydays_getavg |>
  filter(seasonNumber > 5) |>
  summarize(avg_rating2 = round(mean(averageRating, na.rm = TRUE), 2))

happydays_avg <- cbind(happydays_earlyavg, happydays_lateravg)
colnames(happydays_avg) <- c('Avg Rating Seasons 1-5', 'Avg Rating Seasons 6-11')

happydays_avg |> DT::datatable()
Quantifying Success – Development of a Success Metric
Now that we’ve explored some of our data, our main goal is to propose new movies deemed to be successful. In order to do that, however, we need to come up with a way of measuring the success of a movie given our non-financial data sets, or in other words, IMDb ratings and votes. And while there is no right way to measure success, we’ll assume that successful projects will have both a high average IMDb rating and a large number of ratings, which would indicate quality and broad awareness, respectively.
I tried a few approaches in developing a success metric. Initially, I thought about adding averageRating to the log of numVotes, where taking the log of the vote count helps compress large values. However, this calculation seemed too simple to me because I felt that some kind of weight needed to be applied to each factor. I then thought about weighting the average rating and the number of votes equally.
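These two early ideas can be illustrated with a short sketch on made-up titles (`ratings_toy` is hypothetical). The weights of 0.5 in the second score are an assumption chosen only to show what “equal weighting” could look like; the exact weights used in the original calculation are not given here.

```r
library(dplyr)

# Toy ratings table (rows are invented for illustration)
ratings_toy <- data.frame(
  primaryTitle = c("Niche Gem", "Blockbuster"),
  averageRating = c(9.0, 8.0),
  numVotes = c(150, 1500000)
)

ratings_scored <- ratings_toy |>
  mutate(
    # First idea: rating plus the natural log of the vote count
    score_sum = averageRating + log(numVotes),
    # Equal-weight variant (0.5 / 0.5 weights are an illustrative assumption)
    score_equal = 0.5 * averageRating + 0.5 * log(numVotes)
  )
```

With equal weights, the second score is just half of the first, so both produce the same ranking of titles.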
This method may have been satisfactory; however, I still felt that there was a better way to develop a success metric, because it rested on my assumption that average rating and number of votes should be weighted equally. Although our data sets contain all votes for each title, I felt that not all votes have the same impact/weight on the final rating. During this process, I wanted to remain mindful of the possibility that people tend to rate titles they have strong feelings about, whether good or bad. For example, a movie may have been so horrible for someone that they went out of their way to submit a low rating, whereas someone who felt indifferent to a movie didn’t bother to submit a rating at all. Had they submitted one, though, maybe it would’ve been an average rating like 5/10. I gave it one more try using a weighted average, with a benchmark of 10,000 votes as the minimum number of votes that makes a movie successful. To put it into a simple mathematical formula:
Using weighted average to find the success score
Success Score = [(R * v) + (C * m)] / (v + m)
where
R = average rating of each title
v = number of votes of each title
C = the mean rating of all titles
m = minimum number of votes for a movie to be considered successful (10,000 votes)
This ended up being my chosen method, where I also applied a log transformation to the number of votes.
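As a minimal sketch, the weighted-average formula above can be computed directly with dplyr. The toy titles below are made up, and the snippet implements the formula exactly as written with raw vote counts; the log adjustment to the number of votes is not reproduced, since how it enters the formula is not spelled out here.

```r
library(dplyr)

# Toy ratings table (rows are invented for illustration)
ratings_toy <- data.frame(
  primaryTitle = c("Niche Gem", "Blockbuster", "Average Hit"),
  averageRating = c(9.6, 8.4, 6.0),
  numVotes = c(200, 900000, 10000)
)

m <- 10000                             # vote benchmark from the formula
C <- mean(ratings_toy$averageRating)   # mean rating of all titles

# Success Score = [(R * v) + (C * m)] / (v + m)
scored <- ratings_toy |>
  mutate(success_score = (averageRating * numVotes + C * m) / (numVotes + m)) |>
  arrange(desc(success_score))
```

A title with few votes is pulled toward the overall mean C, while a title with many votes keeps a score close to its own rating. That is why the highly rated but barely voted "Niche Gem" ends up below "Blockbuster" in this toy ranking.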