Lights! Camera! Action! Data Mining the IMDb Top 250

Hi ho!

As discussed in the previous post, data mining is the analysis of a dataset for the purpose of identifying patterns. In turn, these patterns give us a better understanding as to the nature of the data we are recording. Mining a weather dataset can help predict long term weather forecasts and plan accordingly; mining a sales dataset can help identity best selling products / sales groups and address stock levels / marketing strategy accordingly.

In this post, I will be mining the IMDb Top 250 dataset from the previous post. Whilst the information obtained herein is unlikely to be of any financial benefit to the reader, the purpose will be to impart practical examples of data mining techniques that the reader may then apply to their own situation.

Querying and Filtering a Single Table

A lot can be learned from a few queries on a single table. Let’s look at the Film table and find all those films in the top 250 that have been release from the year 2000 onwards:

top250

How many films is that though?

The answer here is 80 films. COUNT() is an example of an aggregate function – that is, one that returns a single value based on rows of results, rather than a value per row. Some other handy aggregate functions include:

  • MIN() – the lowest value of a column
  • MAX() – the maximum value of a column
  • AVG() – the mean average value of a column
  • SUM() – the sum of all values of a column

I haven’t included a SUM() example above (as summing ratings is pretty meaningless). That said, if our Film table had a “Total Ticket Sales” column, it would be interesting to sum that and compare different years to see the rise and fall of sales over the years, and potentially chart that against the state of the economy at that time. Just saying…

Going Wild!

If we are searching by text criteria, we can specify a generic data range using the % wildcard. The following query lists all those films that begin with the word “The”:

TheFilm

However, we may wish to search for any film where a specific word appears anywhere in the title. The following query lists all those films with the word “City” in the title:

CityFilm

A word of warning here though – adding the percentage wildcard to the beginning of a string search will negate any indexing done on this column. This will result in queries that take much longer to return than expected, especially if we query a large dataset!

Cross Table Queries

A lot more can be learned by interrogating multiple tables at once using joins. By joining the following three tables:

  • Actor
  • lnk_Actor_Film
  • Film

We can determine the cast of a film:

lebowski2

Or find out what films an actor has been in that are in the top 250:

ellenpage

Grouping Films by Genre

I’m a big Sci-Fi fan. Using the following query, I can find the list of Sci-Fi films in the top 250, sorted by descending rating:

scifi

What’s also interesting would be to find what the most popular genres are – that is, the number of films in the top 250 per genre. This query will combine a row value (Genre.GenreName) with an aggregate value (the number of films that belong to the genre specified in Genre.GenreName). To do this, we use GROUP BY thus:

genres

Note that the sum of GenreCount will exceed 250 here, as films can be of more than one genre. Speaking multiple genres, it’s also useful to know what sort of films a specified actor normally appears in:

portmangenres

Robert De Niro and Joe Pesci explain Unions, Intersects and Exceptions

Robert De Niro has quite a few films in the top 250, and so does Joe Pesci.

JoeBob

Here’s how to find a list of all films in which either of them have appeared, using UNION:

DeNiroPesci1

Worth noting here is that UNION will eliminate duplicates – i.e. films that both actors have appeared in together (UNION ALL lists duplicates). Speaking of films starring both actors, we can find this out using INTERSECT:

DeNiroPesci2

Also, we might want to find those films that Robert De Niro is in, but Joe Pesci is not – we can do this using EXCEPT:

DeNiroPesci3

Clint Eastwood and Morgan Freeman are Cannier than Robert De Niro and Joe Pesci

The trouble with UNION, INTERSECT AND EXCEPT above is that we end up scanning the same table (Film) twice – once for Robert De Niro, and once for Joe Pesci. It’s not efficient, especially if we are scanning a large table. Clint Eastwood and Morgan Freeman also have a lot of films in the top 250, and having been around longer, they know a few things.

MorganClint

To get a list of films that either Clint Eastwood or Morgan Freeman have appeared in, we can use the following query:

ClintMorgan1

Here, we use an “OR” condition in our “WHERE” clause, so the Film table is only scanned once. The addition of the DISTINCT keyword ensures that we remove duplicate films (i.e. films that Clint Eastwood and Morgan Freeman have appeared in together will only be listed once).

Speaking of films that star both actors, consider the following query:

ClintMorgan2

This is interesting because we use aliasing to work with two copies of the Actor and lnk_Actor_Film tables. Moreover, we join the lnk_Actor_Film table to a copy of itself! Conceptually, it looks like this:

dbstuff2

Whilst we are scanning the Actor table twice, we are doing that with the INTERSECT version of this query anyhow.

Lastly, we consider the films that Clint Eastwood is in, but Morgan Freeman is not. Have a look at the following query:

ClintMorgan3

There are a few interesting points here. Firstly, we create a temporary dataset FreemanFilmIds that only exists for the execution of the query. This table stores the Ids of Films that Morgan Freeman has appeared in, without having to scan the Film table at all (we’re only interested in the ID here).

The next interesting point is that we left join our Clint Eastwood film data with our FreemanFildIds dataset. Conceptually, it looks like this:

LeftJoin

We then filter on FreemanFildIds.FilmId is null – that is, those films that Clint Eastwood is in, but Morgan Freeman is not.

Filtering by Aggregate Functions

Consider the following query to find the top 20 Actors by highest average film rating of films they have appeared in:

Here are the results:

top20

Who are these people? And all the same score? Where is Morgan Freeman?

On investigation, it turns out that all of the above people have only appeared in 1 film in the top 250 – The Shawshank Redemption – which has a score of 9.2 (the highest score) meaning that film extras or minor part actors have a very high average rating. I did like Clancy Brown in this film, and he’s great in The Highlander (not in the top 250 tho!), but he’s no Morgan Freeman.

As such, I rewrite the above query with the stipulation that an actor must appear in at least 5 films in the top 250. To filter on an aggregate function (namely COUNT()), I must use the HAVING keyword thus:

Here are the results:

top20-5

 

A lot of familiar names there.

I Didn’t Like Seeing Donny Go…

And so we reach the final curtain. I hope I’ve provided ample examples of  functions and techniques for your benefit. If so, please comment, like, favourite, retweet, +1, share…etc. Full script of SQL queries available here.

Advertisements

Everything in its Right Place – Database Normalisation

Flat File Tables – Everything at Once

I’m currently working on a project to replace an Excel based filesystem with a .NET / SQL Server solution. Part of the development effort necessitates the conversion of Excel lookup tables to relational database tables. Often, the Excel lookup tables are “flat file” in nature, and require normalisation.

To illustrate the point without discussing project data, consider the following dataset (data taken from IMDb Top 250 as of date of posting):

dataset

This dataset represents a list of films and the actors who were in them. If I wanted to find the cast of a specific film using Excel, I can use a filter on the film name:

lebowski

The problem here is that a lot of data is repeated unnecessarily – if a film name was entered incorrectly, it would need to be changed for all occurrences in the film column. A similar issue arises for actor names.

There are clearly 2 distinct datasets here – Films and Actors. The link between these two datasets is known as a relationship; namely what films an actor has appeared in and vice versa.

Combining all aspects into a single dataset (i.e. the Cartesian product) produces what is known as a “flat file” database – a single table with all info.

Why do people store data like this unnecessarily? Well, it usually starts with a need to record all data about a relationship – the film details, the actor details…etc – but the subsequent analysis of the data into distinct groups is never done, and the table grows in the fashion because “that’s the way the data has always been stored”. The process of splitting the datasets up and identifying the relationships between them is known as normalisation. Here are some of the techniques I have used in my project work applied to the above dataset.

Importing Excel Data

The first thing to to is to import my Excel flat file dataset into a similar flat file table in my database. Normally, I would save my dataset off as a comma delimited values file, but there is an issue here as film titles can include commas as well, which would break a subsequent data import. To get around this, I use the method defined here to save my file as pipe delimited instead.

I then create a database table that will store all our flat file data thus:

Lastly, I run a bulk insert to add all our CSV data to the database:

Deriving our Film Dataset

Using the flat file table as a base, I create a new table of the list of distinct films:

I populate the film table thus:

I then modify the flat file table to add a link to the film table:

And populate this column thus:

Furthermore, we index this column as follows:

I no longer have need for the film columns on my flat file table, so those column are dropped:

Deriving our Actor Dataset

A similar series of scripts is applied to create an Actor table:

We are left with FilmId, ActorId and the property of this relationship (i.e. the character that the actor played in the film). Just to tidy things up, we’ll rename the table from FilmData to lnk_Actor_Film:

The Film Genres table – Getting Clever with Functions and Cursors

To create a Genre table (and the relationship between Film and Genre),  a little more work is required due to the way film genres are recorded in the flat file.

That is, a film has a comma delimited string of genres stored against it; ideally, we wish to get the complete list of distinct genres. To do this, we need the following custom function, which will split a single comma delimited string up and return a table of its constituent parts:

Next, we will use a cursor to iterate over all our films and add genres to new table thus:

Now we want the a table of the distinct genres, which we create and populate as follows:

Now for the relationship table to link Films and Genre. Again, we use another cursor to iterate over our films – this time we find all GenreIds that apply to the film, and add entries in the link table, and then drop the “Genres” column from the Film table:

So our Entity-Relationship Model looks like this:

ERDiagram

Hi Ho! Data Mining… Coming Soon

Normalised databases are perfect for Data mining i.e. the analysis of a dataset for the purpose of identifying patterns. These patterns are used to give us a better understanding as to the nature of the data we are recording. I’m going to follow up with another post on this topic in a few days. In the meantime, the both the CSV file and my MS SQL scripts are attached if you’d like to play around with the above dataset: