Lights! Camera! Action! Data Mining the IMDb Top 250

Hi ho!

As discussed in the previous post, data mining is the analysis of a dataset for the purpose of identifying patterns. In turn, these patterns give us a better understanding as to the nature of the data we are recording. Mining a weather dataset can help predict long term weather forecasts and plan accordingly; mining a sales dataset can help identity best selling products / sales groups and address stock levels / marketing strategy accordingly.

In this post, I will be mining the IMDb Top 250 dataset from the previous post. Whilst the information obtained herein is unlikely to be of any financial benefit to the reader, the purpose will be to impart practical examples of data mining techniques that the reader may then apply to their own situation.

Querying and Filtering a Single Table

A lot can be learned from a few queries on a single table. Let’s look at the Film table and find all those films in the top 250 that have been release from the year 2000 onwards:

top250

How many films is that though?

The answer here is 80 films. COUNT() is an example of an aggregate function – that is, one that returns a single value based on rows of results, rather than a value per row. Some other handy aggregate functions include:

  • MIN() – the lowest value of a column
  • MAX() – the maximum value of a column
  • AVG() – the mean average value of a column
  • SUM() – the sum of all values of a column

I haven’t included a SUM() example above (as summing ratings is pretty meaningless). That said, if our Film table had a “Total Ticket Sales” column, it would be interesting to sum that and compare different years to see the rise and fall of sales over the years, and potentially chart that against the state of the economy at that time. Just saying…

Going Wild!

If we are searching by text criteria, we can specify a generic data range using the % wildcard. The following query lists all those films that begin with the word “The”:

TheFilm

However, we may wish to search for any film where a specific word appears anywhere in the title. The following query lists all those films with the word “City” in the title:

CityFilm

A word of warning here though – adding the percentage wildcard to the beginning of a string search will negate any indexing done on this column. This will result in queries that take much longer to return than expected, especially if we query a large dataset!

Cross Table Queries

A lot more can be learned by interrogating multiple tables at once using joins. By joining the following three tables:

  • Actor
  • lnk_Actor_Film
  • Film

We can determine the cast of a film:

lebowski2

Or find out what films an actor has been in that are in the top 250:

ellenpage

Grouping Films by Genre

I’m a big Sci-Fi fan. Using the following query, I can find the list of Sci-Fi films in the top 250, sorted by descending rating:

scifi

What’s also interesting would be to find what the most popular genres are – that is, the number of films in the top 250 per genre. This query will combine a row value (Genre.GenreName) with an aggregate value (the number of films that belong to the genre specified in Genre.GenreName). To do this, we use GROUP BY thus:

genres

Note that the sum of GenreCount will exceed 250 here, as films can be of more than one genre. Speaking multiple genres, it’s also useful to know what sort of films a specified actor normally appears in:

portmangenres

Robert De Niro and Joe Pesci explain Unions, Intersects and Exceptions

Robert De Niro has quite a few films in the top 250, and so does Joe Pesci.

JoeBob

Here’s how to find a list of all films in which either of them have appeared, using UNION:

DeNiroPesci1

Worth noting here is that UNION will eliminate duplicates – i.e. films that both actors have appeared in together (UNION ALL lists duplicates). Speaking of films starring both actors, we can find this out using INTERSECT:

DeNiroPesci2

Also, we might want to find those films that Robert De Niro is in, but Joe Pesci is not – we can do this using EXCEPT:

DeNiroPesci3

Clint Eastwood and Morgan Freeman are Cannier than Robert De Niro and Joe Pesci

The trouble with UNION, INTERSECT AND EXCEPT above is that we end up scanning the same table (Film) twice – once for Robert De Niro, and once for Joe Pesci. It’s not efficient, especially if we are scanning a large table. Clint Eastwood and Morgan Freeman also have a lot of films in the top 250, and having been around longer, they know a few things.

MorganClint

To get a list of films that either Clint Eastwood or Morgan Freeman have appeared in, we can use the following query:

ClintMorgan1

Here, we use an “OR” condition in our “WHERE” clause, so the Film table is only scanned once. The addition of the DISTINCT keyword ensures that we remove duplicate films (i.e. films that Clint Eastwood and Morgan Freeman have appeared in together will only be listed once).

Speaking of films that star both actors, consider the following query:

ClintMorgan2

This is interesting because we use aliasing to work with two copies of the Actor and lnk_Actor_Film tables. Moreover, we join the lnk_Actor_Film table to a copy of itself! Conceptually, it looks like this:

dbstuff2

Whilst we are scanning the Actor table twice, we are doing that with the INTERSECT version of this query anyhow.

Lastly, we consider the films that Clint Eastwood is in, but Morgan Freeman is not. Have a look at the following query:

ClintMorgan3

There are a few interesting points here. Firstly, we create a temporary dataset FreemanFilmIds that only exists for the execution of the query. This table stores the Ids of Films that Morgan Freeman has appeared in, without having to scan the Film table at all (we’re only interested in the ID here).

The next interesting point is that we left join our Clint Eastwood film data with our FreemanFildIds dataset. Conceptually, it looks like this:

LeftJoin

We then filter on FreemanFildIds.FilmId is null – that is, those films that Clint Eastwood is in, but Morgan Freeman is not.

Filtering by Aggregate Functions

Consider the following query to find the top 20 Actors by highest average film rating of films they have appeared in:

Here are the results:

top20

Who are these people? And all the same score? Where is Morgan Freeman?

On investigation, it turns out that all of the above people have only appeared in 1 film in the top 250 – The Shawshank Redemption – which has a score of 9.2 (the highest score) meaning that film extras or minor part actors have a very high average rating. I did like Clancy Brown in this film, and he’s great in The Highlander (not in the top 250 tho!), but he’s no Morgan Freeman.

As such, I rewrite the above query with the stipulation that an actor must appear in at least 5 films in the top 250. To filter on an aggregate function (namely COUNT()), I must use the HAVING keyword thus:

Here are the results:

top20-5

 

A lot of familiar names there.

I Didn’t Like Seeing Donny Go…

And so we reach the final curtain. I hope I’ve provided ample examples of  functions and techniques for your benefit. If so, please comment, like, favourite, retweet, +1, share…etc. Full script of SQL queries available here.

Advertisements

Upcycling – A New Use for an Old Smartphone

Hoard Mode

I confess, I’m a hoarder. It’s a habit that baffles family and friends; why do I retain items that other (saner) people would have thrown out as junk long ago?

In most cases, there is a misguided belief that an item will become useful again at some point in the future. In some cases, there is a sentimental attachment that prevents me disposing of said item. A few favourite exhibits from my extensive collection / pile are shown below.

L-R: Megadrive circa 1990; assorted F1 memorabilia from races attended last decade;  the mighty Scorpionok; one of many old phones

L-R: Sega Megadrive circa 1990; assorted F1 memorabilia from races attended last decade; the mighty Scorpionok; one of many old phones

The last item in the above picture is a Samsung Galaxy Ace, my first smartphone. I have really fond memories of this phone – it could do so much without costing a fortune, and it started my interest in Android, but it’s light years behind my Nexus 4. How could it be put to use? A backup / emergency phone? I have plenty of those already…

Things to do with an old Droid

Consider this: a smartphone is a mini computer, with built in WiFi, camera, Bluetooth, GPS, sensors, solid state storage – it’s a digital Swiss army knife.

The protocol droid is fluent in over six million forms of communication, despite having its mind wiped

The protocol droid is fluent in over six million forms of communication, despite having its mind wiped

There are several articles detailing tonnes of uses for an old Android, such as:

  • Media server
  • Internet connected security camera
  • In car GPS and navigation system
  • Pedometer
  • Torrentbox
  • VPN Server
  • Bitcoin miner
  • Internet radio

In time, I think I’ll attempt all of the above, but for today I’m writing about a torrentbox / media server. The instructions below show how to create either, or indeed both (which is what I did).

Torrenting for Beginners through the media of Tetris and Food

For those not in the know, torrenting is a very popular way of downloading larges files on the internet. Instead of downloading a large file from a single server, a user downloads file fragments from multiple servers asynchronously, which are then assembled in the correct order on the client machine. A separate, smaller file (the torrent file) tells the users computer which fragments to fetch, and the order in which they are to be assembled.

Notice that there is only one straight line piece in this diagram. Entirely intentional.

Notice that there is only one straight line piece in this diagram. Entirely intentional.

In essence, a torrent file is like a recipe – it details the ingredients (fragments) required to make a meal (large file) and the ordered steps to recreate it.

The advantage of the multi-server download approach presented by torrenting is that if one server goes down, the user just gets the fragments from somewhere else (a bit like buying some ingredients from Sainsbury’s if Tesco are out of stock, rather than waiting for Tesco to restock).

Disclaimer – you should not download copyrighted material. Please see this helpful video explaining why.

Creating a Torrentbox

Torrent users will search for torrent files on the internet. Those files are then opened with a torrent client program, and downloading begins. The download is left to progress in the background while the user does something else on their computer, or maybe just walks off and does something else entirely.

A torrentbox is a computer used only for the purpose of downloading files in this manner. Lots of old computers find new life as a torrentbox, why not old smartphones? They have disk storage, internet access, and require a minimal amount of power. By nature, they are highly portable as well.

And so I installed tTorrent on my Galaxy Ace.

I chose tTorrent because it’s one of the few torrent clients that has a “watch directory” function – that is, any torrent files dropped into a specified watch folder will automatically be opened and downloading will begin.

And so I installed Dropsync on my Galaxy Ace.

Dropsync allows me to synchronise a folder within my Dropbox account with a local folder on the phone. The idea being that I search for torrent files on my Nexus 4, my Chromebook, my Nexus 7 – any device with a browser – and drop them into my Dropbox folder. The Galaxy Ace syncs the folder, picks up the torrent file and begins downloading.

And all ready by the time I get home

And all ready by the time I get home

Unfortunately, the Galaxy Ace does not support USB On-The-Go, although most old phones running Android 4+ should (try it out if you have an old Galaxy S2 lying around). With USB OTG, I could attach an external hard drive to the phone for much more storage than an SD card could provide. That said, I’d recommend plugging both the smartphone and the external hard drive into an externally powered USB hub – that way the phone can access the hard disk without having to sacrifice an external power source and resort to the battery.

Creating a Media Server

The purpose of a media server will be to share out content stored on the phone with other devices on the same network. This may be content downloaded via torrenting, or any other media accessible to the device (via SD card or connected hard disk, say).

Just one app to install here – BubbleUPnP has a free version with limitations on how much video content may be streamed per day, but the licence is a little over £3 at present. This will make the phone and its content accessible to any other device on the network with a DLNA client:

  • If you have an Android phone or tablet, install BubbleUPnP on it (disable local content share in settings)
  • If you have an iPhone / iPad, install GoodPlayer
  • Windows phone / tablet, install MediaMonkey
  • Windows computers, Macs, XBox 360, XBox One, PS3 and PS4 already have built in clients
From has-been to centre of attention

From has-been to centre of attention

A New Life for an Old Friend

Okay, this setup will never rival a NAS box for performance, but it does give pretty good functionality for essentially nothing given that I had the parts lying around anyway, and power requirements are very low (it uses less energy in a day than it takes to boil an average kettle).

If I did want to spend money, these old phones will only get cheaper as new ones are released. In fact, I’ll probably end up inheriting a lot of them when family members upgrade to new phones.

But I’ll probably use those phones to try out some of the other ideas above.

It’s good to hoard after all.

Nexus 5 goodness on the Nexus 4

KitKat is here! Is that it?

If you’re a Nexus 4 owner who has recently updated to Android 4.4 KitKat, you may be a bit dismayed to find out that the home screen and app drawer hasn’t changed a great deal (save a few accent changes here and there). If you were expecting the Nexus 5 look and feel, with the “OK Google” and “Swipe left for Google Now” features – brace yourself – they aren’t there. So do you need to buy a Nexus 5 just to get your fix? Thankfully not!

KitKat on the Nexus 4. Apparently.

KitKat on the Nexus 4. Apparently.

Application Launchers

The look and feel of your Android device is determined by the application launcher you are using. Most users use the default launcher that came with the phone, and most equipment manufacturers will create a default launcher that is distinct for their brand – it’s why HTC Androids look different from Samsung Androids (they use launchers called “Sense” AND “TouchWiz” respectively), even though they are running the same OS underneath.

HTC Sense on left; Samsung TouchWiz on right.

HTC Sense on left; Samsung TouchWiz on right.

Application launchers are just another type of app, and you can install as many of them as you like. It’s one of the lauded features of Android over iOS – the ability to skin the device as you’d like. If you want your KitKat Nexus 4 to look like the Nexus 5, you just need to install the Google Experience Launcher that comes with the Nexus 5.

Installing the Google Experience Launcher

First, download the launcher app. It’s available to download from Android File Host here.

You’ll need to enable application installs from non-market sources. On your phone, go to Settings -> Security -> Device Administration and click on “Allow installation of apps from sources other than the Play Store”:

Enabling installation of non market applications. Disable this afterward.

Enabling installation of non market applications. Disable this afterward.

Click on file to install.

After install, I’d advise disabling install from unknown locations.

Vive la Difference!

A quick run down on the launcher. Firstly, the home screen now has translucent icons and folders:

Looking a bit different...

Looking a bit different…

Long press on the home screen to change wallpapers or add widgets (which are no longer in the app drawer):

Wallpaper, Widgets and Settings for Google Now

Wallpaper, Widgets and Settings for Google Now

The app drawer itself now has larger icons, displaying 20 icons per page rather than 25 from the Nexus 4 launcher. Note the translucent background rather than plain black also:

Nexus 4 launcher on left; Nexus 5 on right.

Nexus 4 launcher on left; Nexus 5 on right.

That “Race Results” app is great – the developer is a really cool guy. You can download the app here.

Another nice touch – if Google Music is running, the lock screen will use the album cover art of the track currently playing:

Mellon Collie. Awesome album.

Mellon Collie. Awesome album.

Swipe left from home screen for Google Now works as well:

Go Go Black and Gold!

Go Go Black and Gold!

Enabling “OK Google”

One last hurdle. Long press on the home screen, and choose “Settings”. From the screen that appears, select “Voice”, then select language as “English (US)”. For some reason this version of the launcher only works when US English is used, but don’t worry, you don’t need to put on a fake accent (try the veal, I’m here all week):

Yeeee haw!

Yeeee haw!

Next, give it a try:

Nifty

Nifty

Not for you?

If you’re not fussed with this new launcher and want to revert, doing so is a piece of cake. Launch the Settings app from the app drawer and tap on “Home” under the device menu:

Back in time with no need for a DeLorean.

Back in time with no need for a DeLorean.

The home screen lists all the launchers you have installed on the phone. Tap on the top launcher to revert back to default Nexus 4 launcher. You can also delete any launchers you have installed by tapping the bin icon.

Easy!