Lights! Camera! Action! Data Mining the IMDb Top 250

Hi ho!

As discussed in the previous post, data mining is the analysis of a dataset for the purpose of identifying patterns. In turn, these patterns give us a better understanding as to the nature of the data we are recording. Mining a weather dataset can help predict long term weather forecasts and plan accordingly; mining a sales dataset can help identity best selling products / sales groups and address stock levels / marketing strategy accordingly.

In this post, I will be mining the IMDb Top 250 dataset from the previous post. Whilst the information obtained herein is unlikely to be of any financial benefit to the reader, the purpose will be to impart practical examples of data mining techniques that the reader may then apply to their own situation.

Querying and Filtering a Single Table

A lot can be learned from a few queries on a single table. Let’s look at the Film table and find all those films in the top 250 that have been release from the year 2000 onwards:

top250

How many films is that though?

The answer here is 80 films. COUNT() is an example of an aggregate function – that is, one that returns a single value based on rows of results, rather than a value per row. Some other handy aggregate functions include:

  • MIN() – the lowest value of a column
  • MAX() – the maximum value of a column
  • AVG() – the mean average value of a column
  • SUM() – the sum of all values of a column

I haven’t included a SUM() example above (as summing ratings is pretty meaningless). That said, if our Film table had a “Total Ticket Sales” column, it would be interesting to sum that and compare different years to see the rise and fall of sales over the years, and potentially chart that against the state of the economy at that time. Just saying…

Going Wild!

If we are searching by text criteria, we can specify a generic data range using the % wildcard. The following query lists all those films that begin with the word “The”:

TheFilm

However, we may wish to search for any film where a specific word appears anywhere in the title. The following query lists all those films with the word “City” in the title:

CityFilm

A word of warning here though – adding the percentage wildcard to the beginning of a string search will negate any indexing done on this column. This will result in queries that take much longer to return than expected, especially if we query a large dataset!

Cross Table Queries

A lot more can be learned by interrogating multiple tables at once using joins. By joining the following three tables:

  • Actor
  • lnk_Actor_Film
  • Film

We can determine the cast of a film:

lebowski2

Or find out what films an actor has been in that are in the top 250:

ellenpage

Grouping Films by Genre

I’m a big Sci-Fi fan. Using the following query, I can find the list of Sci-Fi films in the top 250, sorted by descending rating:

scifi

What’s also interesting would be to find what the most popular genres are – that is, the number of films in the top 250 per genre. This query will combine a row value (Genre.GenreName) with an aggregate value (the number of films that belong to the genre specified in Genre.GenreName). To do this, we use GROUP BY thus:

genres

Note that the sum of GenreCount will exceed 250 here, as films can be of more than one genre. Speaking multiple genres, it’s also useful to know what sort of films a specified actor normally appears in:

portmangenres

Robert De Niro and Joe Pesci explain Unions, Intersects and Exceptions

Robert De Niro has quite a few films in the top 250, and so does Joe Pesci.

JoeBob

Here’s how to find a list of all films in which either of them have appeared, using UNION:

DeNiroPesci1

Worth noting here is that UNION will eliminate duplicates – i.e. films that both actors have appeared in together (UNION ALL lists duplicates). Speaking of films starring both actors, we can find this out using INTERSECT:

DeNiroPesci2

Also, we might want to find those films that Robert De Niro is in, but Joe Pesci is not – we can do this using EXCEPT:

DeNiroPesci3

Clint Eastwood and Morgan Freeman are Cannier than Robert De Niro and Joe Pesci

The trouble with UNION, INTERSECT AND EXCEPT above is that we end up scanning the same table (Film) twice – once for Robert De Niro, and once for Joe Pesci. It’s not efficient, especially if we are scanning a large table. Clint Eastwood and Morgan Freeman also have a lot of films in the top 250, and having been around longer, they know a few things.

MorganClint

To get a list of films that either Clint Eastwood or Morgan Freeman have appeared in, we can use the following query:

ClintMorgan1

Here, we use an “OR” condition in our “WHERE” clause, so the Film table is only scanned once. The addition of the DISTINCT keyword ensures that we remove duplicate films (i.e. films that Clint Eastwood and Morgan Freeman have appeared in together will only be listed once).

Speaking of films that star both actors, consider the following query:

ClintMorgan2

This is interesting because we use aliasing to work with two copies of the Actor and lnk_Actor_Film tables. Moreover, we join the lnk_Actor_Film table to a copy of itself! Conceptually, it looks like this:

dbstuff2

Whilst we are scanning the Actor table twice, we are doing that with the INTERSECT version of this query anyhow.

Lastly, we consider the films that Clint Eastwood is in, but Morgan Freeman is not. Have a look at the following query:

ClintMorgan3

There are a few interesting points here. Firstly, we create a temporary dataset FreemanFilmIds that only exists for the execution of the query. This table stores the Ids of Films that Morgan Freeman has appeared in, without having to scan the Film table at all (we’re only interested in the ID here).

The next interesting point is that we left join our Clint Eastwood film data with our FreemanFildIds dataset. Conceptually, it looks like this:

LeftJoin

We then filter on FreemanFildIds.FilmId is null – that is, those films that Clint Eastwood is in, but Morgan Freeman is not.

Filtering by Aggregate Functions

Consider the following query to find the top 20 Actors by highest average film rating of films they have appeared in:

Here are the results:

top20

Who are these people? And all the same score? Where is Morgan Freeman?

On investigation, it turns out that all of the above people have only appeared in 1 film in the top 250 – The Shawshank Redemption – which has a score of 9.2 (the highest score) meaning that film extras or minor part actors have a very high average rating. I did like Clancy Brown in this film, and he’s great in The Highlander (not in the top 250 tho!), but he’s no Morgan Freeman.

As such, I rewrite the above query with the stipulation that an actor must appear in at least 5 films in the top 250. To filter on an aggregate function (namely COUNT()), I must use the HAVING keyword thus:

Here are the results:

top20-5

 

A lot of familiar names there.

I Didn’t Like Seeing Donny Go…

And so we reach the final curtain. I hope I’ve provided ample examples of  functions and techniques for your benefit. If so, please comment, like, favourite, retweet, +1, share…etc. Full script of SQL queries available here.

Advertisements

Everything in its Right Place – Database Normalisation

Flat File Tables – Everything at Once

I’m currently working on a project to replace an Excel based filesystem with a .NET / SQL Server solution. Part of the development effort necessitates the conversion of Excel lookup tables to relational database tables. Often, the Excel lookup tables are “flat file” in nature, and require normalisation.

To illustrate the point without discussing project data, consider the following dataset (data taken from IMDb Top 250 as of date of posting):

dataset

This dataset represents a list of films and the actors who were in them. If I wanted to find the cast of a specific film using Excel, I can use a filter on the film name:

lebowski

The problem here is that a lot of data is repeated unnecessarily – if a film name was entered incorrectly, it would need to be changed for all occurrences in the film column. A similar issue arises for actor names.

There are clearly 2 distinct datasets here – Films and Actors. The link between these two datasets is known as a relationship; namely what films an actor has appeared in and vice versa.

Combining all aspects into a single dataset (i.e. the Cartesian product) produces what is known as a “flat file” database – a single table with all info.

Why do people store data like this unnecessarily? Well, it usually starts with a need to record all data about a relationship – the film details, the actor details…etc – but the subsequent analysis of the data into distinct groups is never done, and the table grows in the fashion because “that’s the way the data has always been stored”. The process of splitting the datasets up and identifying the relationships between them is known as normalisation. Here are some of the techniques I have used in my project work applied to the above dataset.

Importing Excel Data

The first thing to to is to import my Excel flat file dataset into a similar flat file table in my database. Normally, I would save my dataset off as a comma delimited values file, but there is an issue here as film titles can include commas as well, which would break a subsequent data import. To get around this, I use the method defined here to save my file as pipe delimited instead.

I then create a database table that will store all our flat file data thus:

Lastly, I run a bulk insert to add all our CSV data to the database:

Deriving our Film Dataset

Using the flat file table as a base, I create a new table of the list of distinct films:

I populate the film table thus:

I then modify the flat file table to add a link to the film table:

And populate this column thus:

Furthermore, we index this column as follows:

I no longer have need for the film columns on my flat file table, so those column are dropped:

Deriving our Actor Dataset

A similar series of scripts is applied to create an Actor table:

We are left with FilmId, ActorId and the property of this relationship (i.e. the character that the actor played in the film). Just to tidy things up, we’ll rename the table from FilmData to lnk_Actor_Film:

The Film Genres table – Getting Clever with Functions and Cursors

To create a Genre table (and the relationship between Film and Genre),  a little more work is required due to the way film genres are recorded in the flat file.

That is, a film has a comma delimited string of genres stored against it; ideally, we wish to get the complete list of distinct genres. To do this, we need the following custom function, which will split a single comma delimited string up and return a table of its constituent parts:

Next, we will use a cursor to iterate over all our films and add genres to new table thus:

Now we want the a table of the distinct genres, which we create and populate as follows:

Now for the relationship table to link Films and Genre. Again, we use another cursor to iterate over our films – this time we find all GenreIds that apply to the film, and add entries in the link table, and then drop the “Genres” column from the Film table:

So our Entity-Relationship Model looks like this:

ERDiagram

Hi Ho! Data Mining… Coming Soon

Normalised databases are perfect for Data mining i.e. the analysis of a dataset for the purpose of identifying patterns. These patterns are used to give us a better understanding as to the nature of the data we are recording. I’m going to follow up with another post on this topic in a few days. In the meantime, the both the CSV file and my MS SQL scripts are attached if you’d like to play around with the above dataset:

 

 

 

 

Hello World – An Autopsy

No need to send flowers.

I’ve opted to write an Android “Hello World” tutorial in reverse – rather than start from scratch and progress along to a final product, this tutorial is a teardown of a finished application (albeit a simple one). The goal is to cover a range of fundamental concepts with practical examples.

Installing the app

Firstly, let’s take a look at the app. I’ve made the app publicly available: HelloWorld.apk – download and copy to your phone. You can copy via USB cable, email it to your phone, drop it in via Dropbox / Google Drive… etc.

You’ll need to enable application installs from non-market sources. On your phone, go to Settings -> Security -> Device Administration and click on “Allow installation of apps from sources other than the Play Store”.

wpid-Screenshot_2013-08-16-11-41-25.png

Next, navigation to where you’ve saved the file on your phone. Click to install. Done.

Hello World! – We knew him so well

This is a very simple app consisting of 2 screens. The first screen has an editable textbox and two buttons.

wpid-Screenshot_2013-08-16-11-42-56.png

Clicking the button “Open Dialog” will launch an alert dialog with the inputted text:

wpid-Screenshot_2013-08-16-11-43-09.png

Clicking the button “Open Activity” will open a new screen displaying the inputted text:

wpid-Screenshot_2013-08-16-11-43-15.png

Simples.

What can we learn from this?

In tearing down this app, there are a number of concepts we can cover, such as:

  • Designing a front end and the controls available
  • Designing a back end that references UI controls
  • Simple Navigation
  • Multi language support
  • Multi resolution support

Better get started then.

Code Teardown

I’ve made my source available on github – Android Hello World

Download the file and unzip. Launch Eclipse (see previous blog post), create a new Android application (package com.example.helloworld), and drop the source code in.

MainActivity.java – Summary

This class inherits from the Activity class. An Activity class is the back end code for a front end screen. That is, all screens have an Activity class associated with them.

The method onCreate sets the layout of the screen with a call to setContentView, input parameter R.layout.activity_main.

This input value is easily explained – in the project hierarchy, there is a res/layout folder. In that folder, there is a file: activity_main.xml.

This xml file defines the screen layout. Any xml file dropped into this directory can be referenced in code using the R.layout. prefix.

Before moving on to activity_main.xml, we can see that MainActivity instantiates two buttons and assigns onClick events to them.

To instantiate the buttons, we grab them from the front end using the findViewById method. In the first case, we use the parameter value R.id.btnDialog – this is a unique ID assigned to a control in the XML layout – more on that next.

activity_main.xml

Screen design can be done by dragging and dropping controls onto a canvas.  If you look at the raw XML, you can see that this screen has:

  • A TextView (label) with the id “tvQuestion”
  • An EditText (editable textbox) with the id “etAnswer”
  • A Button with the id “btnActivity”
  • A Button with the id “btnDialog”

When you need to reference a control at the back end code, do so via its ID.

There are several styling attributes – these can be set in the raw XML, or in the canvas editor.

Worth noting is the attribute “android:text”. This can be a text value, but in each case, I’ve done something like android:text=”@string/lblOpenActivity”.

This again is an external reference – it says that the text value we want to use is in res/values/strings.xml under the heading lblOpenActivity. Let’s take a look.

res/values/strings.xml

This file holds all string variables we wish to use. For the example above, we have the following key-value pair:

<string name=”lblOpenActivity”>Open Activity</string>

If I wanted to provide localisation into Polish, it’s quite straightforward – I just need to create the folder res/values-pl, and a strings.xml file in that folder with the necessary translations:

<string name=”lblOpenActivity”>Otwórz aktywny</string>

Android knows which file to load based on the locale of the device. If an appropriate locale file or key/value pair does not exist, the system will default to the value in the vales/strings.xml file.

Drawable folders

In addition to responding to user locale, the application will automatically respond to device resolutions.

The source tree has a number of drawables folders, each one suffixed with a device resolution.

The concept here is each folder contains icons for the appropriate device resolution.

For example, the folder drawable-mdpi has an icon ic_launcher.png. The folder drawable-hdpi has an icon with the same name, but higher resolution.

In code, we merely refer to drawable.ic_launcher – the system will know which resolution to load based on the device.

AndroidManifest.xml

This file is important in that it defines:

  • All Activities in the application (here we have MainActivity and ChildActivity).
  • Which Activity is the starting activity (the launcher).
  • What is the minimum version of the operating system supported by the app
  • The app icon and label
  • The version number and name

Note that the app icon id defined as per the drawables section above.

Version number and name – these will be used if you submit to the Play store. The number is an integer and has to increment with each new version you release, and the name is merely a label.

MainActivity.java – Opening Dialogs

Back to MainActivity.java. Have a look at the launchDialog() method. You can see we are extracting the value of the editable textbox and displaying it in a dialog. Nothing complex at all here. We could add additional functionality to the onClick method of the “OK” button if we really wanted to.

MainActivity.java – Opening Activities

Have a look at the launchActivity() method.  Here we are getting the value of the editable textbox and passing that as an input parameter to a new screen.

As an analogy, think of a hyperlink with a  querystring – not only do we define where to go to, but also the state of the screen.

To launch a new screen, we do so using the Intent class. In the code, we launch an intent for the ChildActivity screen.

We add a key-value pair to the intent using the “putExtra” method, before starting the activity.

ChildActivity.java

In the onCreate method of this Activity, we grab the key-value pairs from launching intent (akin to reading the querystring).

We do so using the getExtras() method:

Bundle extras = getIntent().getExtras();

Then we do a lookup in the bundle:

String input = extras.getString(“input”).

Where “input” was the name of the key used in the previous screen.

This Activity also has a button – clicking it calls the finish() method, which dismisses the activity and returns the user to the previous screen.

Final Summary

Thank you all for coming. Tea and sandwiches will be served back at the family home.

Using the Factory Design Pattern

It’s sorta, kinda, like…

I’ve always been vaguely aware of the factory design pattern, but I’ve actually only started using it in earnest in the last few months. The factory design pattern is used whenever you need to create an instance of a class, but the specific class you require an instance of is not known until runtime. But what exactly does that mean? I’m a great believer in learning by example, so I took the opportunity to write this blog in an attempt to demystify what I’ve found to be a really useful way of compartmentalising my code.

Abstract Classes

At the heart of the factory model is an abstract class. This is a template class from which a number of variants are derived (it’s the job of the factory to provide the correct variant at runtime, but more on that in a bit). Let’s assume we are designing a system for an insurance company. They may have use of an abstract class such as the one below:

1_policy

That is, an insurance policy has an ID, an insured party, start and end date and a property indicating the total value of goods insured by the policy. Now let’s assume that our insurance company offer both motor and property insurance. Consequently, they may have classes that look like those below:

These classes are variants on the Policy class, each with properties relevant to the variant:

  • the motor policy maintains a list of insured vehicles, and defines our TotalInsuredValue property as the sum of the value of all insured vehicles
  • the property policy maintains a list of insured properties, and defined our TotalInsuredValue property as the sum of the value of all insured properties, including contents value if the property has contents insured also.

The Factory

Imagine the first screen of our application – what type of policy do you require? Our factory comes into play whenever we need to provide the correct variant thus:

3_factory

With common fields gathered, the user is then redirected the the appropriate area of the application – that is, the user will be redirected to an area where they can enter the details of the vehicles or properties to be insured, dependent upon the policy type.

Listing Customer Policies

And now the beauty of the abstract class approach – we can write code such as the following:

4_customerpolicies

Note that each Policy will call the appropriate method to get the value of TotalInsuredValue dependent on the specific Policy class. In this way, we can extend the system and offer new policy types without interfering with existing types – let’s say we now offer insurance for mobile phones:

5_mobile

Compartmental Development

The above model allows new products to be added into the system in isolation. This is especially useful when you have multiple development teams, each working on a different product to be released in the same development cycle, not to mention production support catering for existing products that require bug fixes after go-live.

Of course, this is not the only applicable design pattern for such a system, but I hope I’ve made this one clearer.

Setting up an Android Development Environment on Windows

So you want to give Android development a go, but don’t know where to start? That’s easy – start by setting up a development environment of course!

STEP 1 – Install Java Development Kit

Android development requires at least JDK 6 (note JRE alone is insufficient). If you need to install the JDK, you can grab it from Oracle’s website (link to JDK7: http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html).

STEP 2 – Add your Java directory to system Path

First, make note of the JDK binary directory (e.g. C:\Program Files\Java\jdk1.7.0_21\bin).

Next go to Control Panel > System and Security > System. Click on “Advanced system settings”:

1_systemproperties

From the “System Properties” popup, click on the “Advanced” tab; from there, click the “Environment Variables” button:

2_systemproperties

The “Environment Variables” popup has 2 lists of variables, and we are interested in the second list (“System Variables”). Scroll down in the list and highlight the “Path” variable before clicking the “Edit…” button:

3_environmentvariables

The “Edit System Variable” popup will appear. Add a semi-colon to the end of “Variable Value” and then append the directory path for the JDK binary. Note, “append” – do not overwrite! Then click OK.

4_editpath

STEP 3 – Download the Android SDK

Go to http://developer.android.com/sdk/index.html and click on the “Download the SDK” button.

Unzip the downloaded file and run eclipse.exe.

That’s it!