Lights! Camera! Action! Data Mining the IMDb Top 250

Hi ho!

As discussed in the previous post, data mining is the analysis of a dataset for the purpose of identifying patterns. In turn, these patterns give us a better understanding as to the nature of the data we are recording. Mining a weather dataset can help predict long term weather forecasts and plan accordingly; mining a sales dataset can help identity best selling products / sales groups and address stock levels / marketing strategy accordingly.

In this post, I will be mining the IMDb Top 250 dataset from the previous post. Whilst the information obtained herein is unlikely to be of any financial benefit to the reader, the purpose will be to impart practical examples of data mining techniques that the reader may then apply to their own situation.

Querying and Filtering a Single Table

A lot can be learned from a few queries on a single table. Let’s look at the Film table and find all those films in the top 250 that have been release from the year 2000 onwards:

top250

How many films is that though?

The answer here is 80 films. COUNT() is an example of an aggregate function – that is, one that returns a single value based on rows of results, rather than a value per row. Some other handy aggregate functions include:

  • MIN() – the lowest value of a column
  • MAX() – the maximum value of a column
  • AVG() – the mean average value of a column
  • SUM() – the sum of all values of a column

I haven’t included a SUM() example above (as summing ratings is pretty meaningless). That said, if our Film table had a “Total Ticket Sales” column, it would be interesting to sum that and compare different years to see the rise and fall of sales over the years, and potentially chart that against the state of the economy at that time. Just saying…

Going Wild!

If we are searching by text criteria, we can specify a generic data range using the % wildcard. The following query lists all those films that begin with the word “The”:

TheFilm

However, we may wish to search for any film where a specific word appears anywhere in the title. The following query lists all those films with the word “City” in the title:

CityFilm

A word of warning here though – adding the percentage wildcard to the beginning of a string search will negate any indexing done on this column. This will result in queries that take much longer to return than expected, especially if we query a large dataset!

Cross Table Queries

A lot more can be learned by interrogating multiple tables at once using joins. By joining the following three tables:

  • Actor
  • lnk_Actor_Film
  • Film

We can determine the cast of a film:

lebowski2

Or find out what films an actor has been in that are in the top 250:

ellenpage

Grouping Films by Genre

I’m a big Sci-Fi fan. Using the following query, I can find the list of Sci-Fi films in the top 250, sorted by descending rating:

scifi

What’s also interesting would be to find what the most popular genres are – that is, the number of films in the top 250 per genre. This query will combine a row value (Genre.GenreName) with an aggregate value (the number of films that belong to the genre specified in Genre.GenreName). To do this, we use GROUP BY thus:

genres

Note that the sum of GenreCount will exceed 250 here, as films can be of more than one genre. Speaking multiple genres, it’s also useful to know what sort of films a specified actor normally appears in:

portmangenres

Robert De Niro and Joe Pesci explain Unions, Intersects and Exceptions

Robert De Niro has quite a few films in the top 250, and so does Joe Pesci.

JoeBob

Here’s how to find a list of all films in which either of them have appeared, using UNION:

DeNiroPesci1

Worth noting here is that UNION will eliminate duplicates – i.e. films that both actors have appeared in together (UNION ALL lists duplicates). Speaking of films starring both actors, we can find this out using INTERSECT:

DeNiroPesci2

Also, we might want to find those films that Robert De Niro is in, but Joe Pesci is not – we can do this using EXCEPT:

DeNiroPesci3

Clint Eastwood and Morgan Freeman are Cannier than Robert De Niro and Joe Pesci

The trouble with UNION, INTERSECT AND EXCEPT above is that we end up scanning the same table (Film) twice – once for Robert De Niro, and once for Joe Pesci. It’s not efficient, especially if we are scanning a large table. Clint Eastwood and Morgan Freeman also have a lot of films in the top 250, and having been around longer, they know a few things.

MorganClint

To get a list of films that either Clint Eastwood or Morgan Freeman have appeared in, we can use the following query:

ClintMorgan1

Here, we use an “OR” condition in our “WHERE” clause, so the Film table is only scanned once. The addition of the DISTINCT keyword ensures that we remove duplicate films (i.e. films that Clint Eastwood and Morgan Freeman have appeared in together will only be listed once).

Speaking of films that star both actors, consider the following query:

ClintMorgan2

This is interesting because we use aliasing to work with two copies of the Actor and lnk_Actor_Film tables. Moreover, we join the lnk_Actor_Film table to a copy of itself! Conceptually, it looks like this:

dbstuff2

Whilst we are scanning the Actor table twice, we are doing that with the INTERSECT version of this query anyhow.

Lastly, we consider the films that Clint Eastwood is in, but Morgan Freeman is not. Have a look at the following query:

ClintMorgan3

There are a few interesting points here. Firstly, we create a temporary dataset FreemanFilmIds that only exists for the execution of the query. This table stores the Ids of Films that Morgan Freeman has appeared in, without having to scan the Film table at all (we’re only interested in the ID here).

The next interesting point is that we left join our Clint Eastwood film data with our FreemanFildIds dataset. Conceptually, it looks like this:

LeftJoin

We then filter on FreemanFildIds.FilmId is null – that is, those films that Clint Eastwood is in, but Morgan Freeman is not.

Filtering by Aggregate Functions

Consider the following query to find the top 20 Actors by highest average film rating of films they have appeared in:

Here are the results:

top20

Who are these people? And all the same score? Where is Morgan Freeman?

On investigation, it turns out that all of the above people have only appeared in 1 film in the top 250 – The Shawshank Redemption – which has a score of 9.2 (the highest score) meaning that film extras or minor part actors have a very high average rating. I did like Clancy Brown in this film, and he’s great in The Highlander (not in the top 250 tho!), but he’s no Morgan Freeman.

As such, I rewrite the above query with the stipulation that an actor must appear in at least 5 films in the top 250. To filter on an aggregate function (namely COUNT()), I must use the HAVING keyword thus:

Here are the results:

top20-5

 

A lot of familiar names there.

I Didn’t Like Seeing Donny Go…

And so we reach the final curtain. I hope I’ve provided ample examples of  functions and techniques for your benefit. If so, please comment, like, favourite, retweet, +1, share…etc. Full script of SQL queries available here.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s