2015 UK General Election Normalising Data

Jump to navigation Jump to search

The unnormalised data from the CSV file looks like this:


Notice that it includes redundancy - constituency names are repeated for example. We can see that link between the ons_id and the constituency_name is repeated over and over.

Pick a primary key

Our first job is to identify a primary key. There are several options to consider:

  • First we notice that there is no single column that is unique and so we will need to use at least a pair of columns.
  • (firstname,surname) would be good
    • The candidates are unique. By law, no one is allowed to stand as a candidate in two constituencies
    • Unfortunately candidate names are not unique - there are two candidates called "Alan Johnson" for example. You can confirm this with a query such as
      SELECT firstname,surname COUNT(1)<br>  FROM ge<br> HAVING COUNT(1)>1
  • The combination (ons_id,party) is also tempting
    • No party will put up two candidates in the same constituency, that would be self defeating and against the rules.
    • Unfortunately there are independent candidates with a NULL party and we cannot have NULL values in the primary key
  • It turns out that the triple (ons_id,firstname,surname) is unique. You may not have more than one candidate in a constituency with the same first name and surname. This would be confusing for voters. We can verify that this is a safe choice with a query such as
    SELECT firstname,surname,ons_id

FROM ge GROUP BY firstname,surname,ons_id HAVING COUNT(1)>1;

Identifying dependencies

Having decided on our primary key as (ons_id, firstname, surname) we notice the following dependencies:

ons_id                   -> ons_region_id
ons_id                   -> consitituency_name
ons_id                   -> county_name
ons_id                   -> constituency_type
ons_region_id            -> region_name
ons_region_id            -> country_name
party_abbreviation       -> party_name
ons_id,firstname,surname -> gender
ons_id,firstname,surname -> party_abbreviation
ons_id,firstname,surname -> sitting_mp
ons_id,firstname,surname -> former_mp
ons_id,firstname,surname -> votes
ons_id,firstname,surname -> share
ons_id,firstname,surname -> change

Decide on tables

Each distinct determiner (the left hand side of the -> above) will be a table. The determiner will be the primary key in each case.

  • constituency(ons_id, region_id, constituency, county_name, contituency_type)
  • county(county_name, region_id)
  • region(region_id, region_name, country_name)
  • party(party_id, party_name)
  • candidate(ons_id, firstname, surname, gender, party_id, sitting_mp, former_mp, votes, share, change)