Lucas Index of Tolerance - on quantifying creative impact

Can we predict the audience reception for a movie based on the level of involvement by the movie's director?

This is the question I set out to answer for my second project at Metis. I drew inspiration from the impossibly terrible series of Star Wars movies from the early 2000s (AKA 'the Jar Jar Binks Index of Tolerance'). I noticed that one difference between the first set of films in the franchise (late 1970s and early 1980s) and the second set (2000s) was that George Lucas was heavily credited on the second set, but much less so on the first. I wanted to see whether this was a Lucas-centric phenomenon or an industry-wide trend.

Beautiful Ugly Soup

I started by getting a sense of what movie data is publicly available for scraping. I scraped Rotten Tomatoes and Box Office Mojo, and used the OMDb API. This led me to a relatively comprehensive source for the features I was looking to use.

Hold on... These three sentences grossly underrepresent the arduous task that is Beautiful Soup. I have to give a major shoutout :loudspeaker: to the makers of OMDb, because without them, I would have been lost in the quagmire of IMDb's JavaScript forest with no map.
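For anyone curious what the OMDb piece looks like in practice, here is a minimal sketch of a single title lookup. The endpoint and the `t`/`type`/`apikey` parameters are real OMDb conventions, but `YOUR_KEY` is a placeholder and this is not the exact code from my notebook:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

OMDB_URL = "http://www.omdbapi.com/"

def omdb_query_url(title, api_key="YOUR_KEY"):
    # 't' asks OMDb for an exact-title match; a (free) API key is required.
    return OMDB_URL + "?" + urlencode({"t": title, "type": "movie", "apikey": api_key})

def fetch_movie(title, api_key="YOUR_KEY"):
    # Returns a dict with fields like Title, Director, Writer, BoxOffice.
    with urlopen(omdb_query_url(title, api_key), timeout=10) as resp:
        return json.load(resp)
```

One call per title, looped over the scraped movie list, is all it takes to fill in the credit fields.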

"That would take some sort of super DataFrame"

My target variable for this project was the audience rating for each movie (starting with roughly 1,500 movies). It remains my opinion that this is the best indicator of a movie's goodness. For instance, everyone knew that Batman vs. Superman was going to be awful, but everyone still spent their money to see it and left the theater disappointed. Because of its huge marketing budget and visibility as a superhero movie, it did immensely well at the box office, but its Rotten Tomatoes audience score is an abysmal 68%.

I started with each director's body of work and weeded out movies they didn't direct. Unfortunately, this whittled my dataset down substantially, to roughly 200 movies. After this, I created the features for the Lucas Index of Tolerance (hereafter referred to as LIT). If you are interested in seeing how I did this, my IPython notebook is available on GitHub.
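The director filter itself is simple once everything lives in one DataFrame. A sketch with toy data (the column names are my stand-ins, not the notebook's actual schema):

```python
import pandas as pd

# Toy stand-in for the merged scrape; the real frame has many more columns.
films = pd.DataFrame({
    "title": ["Inception", "Man of Steel", "Following"],
    "director": ["Christopher Nolan", "Zack Snyder", "Christopher Nolan"],
})

def directed_by(df, name):
    """Keep only the rows where `name` is a credited director."""
    return df[df["director"].str.contains(name, regex=False)]

nolan = directed_by(films, "Christopher Nolan")  # drops Man of Steel
```

Running this per director over the full scrape is what shrank ~1,500 rows to ~200.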

A note about LIT

Before I go on, I'll break down what I considered for a LIT score. I made these categorical dummy variables to minimize my bias about the weight of each role. I'm using Inception (Christopher Nolan) as an example.

Item             Scorecard Score
Screenwriter     1
Producer         1
Exec. Producer   0
Actor            0
Total LIT        2

Assumptions / Notes

  • The actor column excludes cameos.
  • I intentionally kept Producer and Executive Producer separate. After research, I determined that they are vastly different.
  • The max LIT is 3, since no one both produces and executive produces the same film.
  • This current scheme doesn't take into account when there are multiple producers or screenwriters. I decided that this didn't matter for the first iteration of this project.
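As a sanity check, the scorecard above reduces to a tiny function over the four role dummies. This is just an illustration of the scheme, not the notebook code:

```python
def lit_score(screenwriter, producer, exec_producer, actor):
    """Sum the 0/1 role dummies for a director on a given film.
    In practice the total tops out at 3, since (per the assumption
    above) no one both produces and executive produces the same film."""
    return int(screenwriter) + int(producer) + int(exec_producer) + int(actor)

# Inception / Christopher Nolan from the scorecard above:
lit_score(screenwriter=1, producer=1, exec_producer=0, actor=0)  # -> 2
```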

I Must Regress

I was stoked when I got everything ready to analyze - this is the part of projects I most look forward to. I excitedly awaited the groundbreaking work I was surely about to discover, and then this popped up ---

Model Item       Value
LIT P-Value      😭
R²               😑

You're staring into the abyss of broken modeling dreams.



A kernel density plot makes the issue a bit more obvious:
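A kernel density estimate like that one takes only a couple of lines with scipy; this sketch uses made-up audience scores in place of the real data:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
ratings = np.clip(rng.normal(70, 12, size=200), 0, 100)  # fake audience scores

kde = gaussian_kde(ratings)      # bandwidth picked by Scott's rule
grid = np.linspace(0, 100, 200)

plt.plot(grid, kde(grid))
plt.xlabel("Audience rating")
plt.ylabel("Density")
```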

This was a clear no-go. I decided to take a different angle, looking at the domestic gross of each film.





But you said...

I know. I know. To quote myself just moments ago:

everyone knew that Batman vs. Superman was going to be awful, but everyone still spent their money to see it and left the theater disappointed

But anyone who models knows that an intelligent, iterative design process involves recognizing that ideas you thought were gold yesterday are garbage today if they don't hold up. So I tried domestic gross:

Model Item       Value
R²               0.495
LIT P-Value      0.02

In Conclusion: Inconclusive

While my current results are inconclusive, they are nonetheless fascinating. Reflecting on the lack of statistical significance I turned up, it occurs to me that attempting to explain audience reception with my scorecard isn't comprehensive enough - with any regression of this kind, there is a multitude of lurking variables out there that I either did not consider or did not have the data for.

Conceptually, it makes sense that I wouldn't see a high R² using just a few inherently human categorical variables as my only features. The fact that I turned up more predictive power against domestic gross with LIT + budget as the features than with budget alone leads me to believe there is value in my hypothesis. For now, I will continue to explore the possibilities and keep an open mind about what variables I should consider.
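That budget-alone vs. budget-plus-LIT comparison is easy to reproduce in spirit. Here is a numpy sketch on entirely synthetic data (the coefficients are invented); note that adding a regressor to a nested OLS can only raise in-sample R², so the interesting question is by how much:

```python
import numpy as np

def r_squared(X, y):
    """In-sample R^2 of an OLS fit of y on X (intercept added)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

rng = np.random.default_rng(2)
budget = rng.uniform(5, 200, size=200)            # invented budgets, $M
lit = rng.integers(0, 4, size=200).astype(float)  # 0-3 LIT scores
gross = 1.5 * budget + 20 * lit + rng.normal(0, 40, size=200)

r2_budget = r_squared(budget[:, None], gross)
r2_both = r_squared(np.column_stack([budget, lit]), gross)
# r2_both >= r2_budget always holds in-sample for nested models.
```

The size of that gap, not its mere existence, is what suggests LIT carries signal.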

Credits and thanks: Rumman (Sr. Data Scientist at Metis), Joel (Sr. Data Scientist at Metis), and Rafa, Janine, and Nate for pickle assistance.