Flat File Tables – Everything at Once
I’m currently working on a project to replace an Excel based filesystem with a .NET / SQL Server solution. Part of the development effort necessitates the conversion of Excel lookup tables to relational database tables. Often, the Excel lookup tables are “flat file” in nature, and require normalisation.
To illustrate the point without discussing project data, consider the following dataset (data taken from IMDb Top 250 as of date of posting):
This dataset represents a list of films and the actors who were in them. If I wanted to find the cast of a specific film using Excel, I can use a filter on the film name:
The problem here is that a lot of data is repeated unnecessarily – if a film name was entered incorrectly, it would need to be changed for all occurrences in the film column. A similar issue arises for actor names.
There are clearly 2 distinct datasets here – Films and Actors. The link between these two datasets is known as a relationship; namely what films an actor has appeared in and vice versa.
Combining all aspects into a single dataset (i.e. the Cartesian product) produces what is known as a “flat file” database – a single table with all info.
Why do people store data like this unnecessarily? Well, it usually starts with a need to record all data about a relationship – the film details, the actor details…etc – but the subsequent analysis of the data into distinct groups is never done, and the table grows in the fashion because “that’s the way the data has always been stored”. The process of splitting the datasets up and identifying the relationships between them is known as normalisation. Here are some of the techniques I have used in my project work applied to the above dataset.
Importing Excel Data
The first thing to to is to import my Excel flat file dataset into a similar flat file table in my database. Normally, I would save my dataset off as a comma delimited values file, but there is an issue here as film titles can include commas as well, which would break a subsequent data import. To get around this, I use the method defined here to save my file as pipe delimited instead.
I then create a database table that will store all our flat file data thus:
Lastly, I run a bulk insert to add all our CSV data to the database:
Deriving our Film Dataset
Using the flat file table as a base, I create a new table of the list of distinct films:
I populate the film table thus:
I then modify the flat file table to add a link to the film table:
And populate this column thus:
Furthermore, we index this column as follows:
I no longer have need for the film columns on my flat file table, so those column are dropped:
Deriving our Actor Dataset
A similar series of scripts is applied to create an Actor table:
We are left with FilmId, ActorId and the property of this relationship (i.e. the character that the actor played in the film). Just to tidy things up, we’ll rename the table from FilmData to lnk_Actor_Film:
The Film Genres table – Getting Clever with Functions and Cursors
To create a Genre table (and the relationship between Film and Genre), a little more work is required due to the way film genres are recorded in the flat file.
That is, a film has a comma delimited string of genres stored against it; ideally, we wish to get the complete list of distinct genres. To do this, we need the following custom function, which will split a single comma delimited string up and return a table of its constituent parts:
Next, we will use a cursor to iterate over all our films and add genres to new table thus:
Now we want the a table of the distinct genres, which we create and populate as follows:
Now for the relationship table to link Films and Genre. Again, we use another cursor to iterate over our films – this time we find all GenreIds that apply to the film, and add entries in the link table, and then drop the “Genres” column from the Film table:
So our Entity-Relationship Model looks like this:
Hi Ho! Data Mining… Coming Soon
Normalised databases are perfect for Data mining i.e. the analysis of a dataset for the purpose of identifying patterns. These patterns are used to give us a better understanding as to the nature of the data we are recording. I’m going to follow up with another post on this topic in a few days. In the meantime, the both the CSV file and my MS SQL scripts are attached if you’d like to play around with the above dataset: