2015 UK General Election Normalising Data

The unnormalised data from the CSV file looks like this:


Notice that it includes redundancy: constituency names are repeated, for example. The link between ons_id and constituency_name is repeated on every row for that constituency.

Pick a primary key

Our first job is to identify a primary key. There are several options to consider:

  • First we notice that there is no single column that is unique and so we will need to use at least a pair of columns.
  • (firstname, surname) looks promising
    • The candidates are unique. By law, no one is allowed to stand as a candidate in two constituencies
    • Unfortunately candidate names are not unique - there are two candidates called "Alan Johnson" for example. You can confirm this with a query such as
      SELECT firstname, surname, COUNT(1) FROM ge GROUP BY firstname, surname HAVING COUNT(1) > 1;
  • The combination (ons_id,party) is also tempting
    • No party will put up two candidates in the same constituency; that would be self-defeating and against the rules.
    • Unfortunately there are independent candidates with a NULL party and we cannot have NULL values in the primary key
  • It turns out that the triple (ons_id, firstname, surname) is unique. You may not have more than one candidate in a constituency with the same first name and surname, as this would be confusing for voters. We can verify that this is a safe choice with a query such as the following, which returns no rows when the triple is unique:
    SELECT firstname, surname, ons_id FROM ge GROUP BY firstname, surname, ons_id HAVING COUNT(1) > 1;
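The uniqueness checks above can be tried out on a small scale. The following is a minimal sketch using Python's built-in sqlite3 module, assuming a cut-down ge table with only four columns. The rows and the ons_id values are invented for illustration and are not taken from the real data, and the helper name duplicates is hypothetical.

```python
import sqlite3

# Invented sample rows: (ons_id, party_abbreviation, firstname, surname).
ROWS = [
    ("E001", "Lab", "Alan", "Johnson"),
    ("E002", "Con", "Alan", "Johnson"),  # same name, different constituency
    ("E001", "Con", "Mary", "Smith"),
    ("E001", None, "Tom", "Brown"),      # independent candidate: NULL party
    ("E001", None, "Sue", "Green"),      # second independent in the same seat
]


def duplicates(conn, columns):
    """Return the value combinations of `columns` that occur more than once in ge."""
    cols = ", ".join(columns)
    sql = f"SELECT {cols}, COUNT(1) FROM ge GROUP BY {cols} HAVING COUNT(1) > 1"
    return conn.execute(sql).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ge (ons_id TEXT, party_abbreviation TEXT, firstname TEXT, surname TEXT)"
)
conn.executemany("INSERT INTO ge VALUES (?, ?, ?, ?)", ROWS)

# A candidate key is usable only if its duplicate list is empty.
name_dups = duplicates(conn, ["firstname", "surname"])
party_dups = duplicates(conn, ["ons_id", "party_abbreviation"])
triple_dups = duplicates(conn, ["ons_id", "firstname", "surname"])

print("name duplicates:  ", name_dups)
print("party duplicates: ", party_dups)
print("triple duplicates:", triple_dups)
```

On this toy data only the triple produces an empty list. An empty result shows that the key is unique in the rows we have; it is the rule about candidate names that makes us expect it to stay unique.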

Identifying dependencies

The column headings are:

ons_id	ons_region_id	constituency_name	county_name	region_name	country_name	constituency_type	party_name	party_abbreviation	firstname	surname	gender	sitting_mp	former_mp	votes	share	change


Having decided on our primary key as (ons_id, firstname, surname) we notice the following dependencies:

ons_id                   -> ons_region_id
ons_id                   -> constituency_name
ons_id                   -> county_name
ons_id                   -> constituency_type
county_name              -> ons_region_id
ons_region_id            -> region_name
ons_region_id            -> country_name
party_abbreviation       -> party_name
ons_id,firstname,surname -> gender
ons_id,firstname,surname -> party_abbreviation
ons_id,firstname,surname -> sitting_mp
ons_id,firstname,surname -> former_mp
ons_id,firstname,surname -> votes
ons_id,firstname,surname -> share
ons_id,firstname,surname -> change
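A dependency such as ons_id -> constituency_name can be tested against the data: if it holds, no ons_id appears with two different constituency names. The sketch below shows one way to run that test with sqlite3, assuming a cut-down ge table. The sample values and the helper name violations are invented for illustration.

```python
import sqlite3

# Invented sample rows: (ons_id, constituency_name, county_name, ons_region_id).
ROWS = [
    ("E001", "Northtown", "Northshire", "R1"),
    ("E001", "Northtown", "Northshire", "R1"),  # second candidate, same seat
    ("E002", "Southtown", "Southshire", "R2"),
    ("E003", "Easttown", "Northshire", "R1"),
]


def violations(conn, lhs, rhs):
    """Return each lhs value that is paired with more than one distinct rhs value."""
    sql = (
        f"SELECT {lhs}, COUNT(DISTINCT {rhs}) FROM ge "
        f"GROUP BY {lhs} HAVING COUNT(DISTINCT {rhs}) > 1"
    )
    return conn.execute(sql).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ge (ons_id TEXT, constituency_name TEXT, county_name TEXT, ons_region_id TEXT)"
)
conn.executemany("INSERT INTO ge VALUES (?, ?, ?, ?)", ROWS)

# No violations: the dependency is consistent with the data.
print(violations(conn, "ons_id", "constituency_name"))
print(violations(conn, "county_name", "ons_region_id"))

# A county contains several constituencies, so this is not a dependency.
print(violations(conn, "county_name", "ons_id"))
```

A query like this can only refute a dependency. An empty result means the data is consistent with it, and we still rely on our knowledge of the domain to decide that the dependency really holds.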

Decide on tables

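Each distinct determinant (the left-hand side of the -> above) will be a table. The determinant will be the primary key in each case. In the tables below party_abbreviation is stored as party_id and constituency_name as constituency. The dependency ons_id -> ons_region_id does not need its own column because it follows from ons_id -> county_name and county_name -> ons_region_id.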

Bold indicates a primary key, italics indicates a foreign key.

  • constituency(ons_id, constituency, county_name, constituency_type)
  • county(county_name, ons_region_id)
  • region(ons_region_id, region_name, country_name)
  • party(party_id, party_name)
  • candidate(ons_id, firstname, surname, gender, party_id, sitting_mp, former_mp, votes, share, change)
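The step from dependencies to tables can be sketched mechanically. The Python below is an illustration and not part of the original exercise. It first drops any dependency that already follows from the others (here ons_id -> ons_region_id, which follows via county_name), then groups the remaining dependencies by determinant. The function names closure, drop_redundant and group_by_determinant are hypothetical, and the original column names are kept, so party_abbreviation and constituency_name appear instead of party_id and constituency.

```python
from collections import defaultdict

KEY = ("ons_id", "firstname", "surname")

# The dependencies listed earlier, as (determinant columns, dependent column).
DEPENDENCIES = [
    (("ons_id",), "ons_region_id"),
    (("ons_id",), "constituency_name"),
    (("ons_id",), "county_name"),
    (("ons_id",), "constituency_type"),
    (("county_name",), "ons_region_id"),
    (("ons_region_id",), "region_name"),
    (("ons_region_id",), "country_name"),
    (("party_abbreviation",), "party_name"),
] + [
    (KEY, col)
    for col in ("gender", "party_abbreviation", "sitting_mp",
                "former_mp", "votes", "share", "change")
]


def closure(attrs, deps):
    """Return every column that can be derived from attrs using deps."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in deps:
            if set(lhs) <= result and rhs not in result:
                result.add(rhs)
                changed = True
    return result


def drop_redundant(deps):
    """Remove, one at a time, each dependency implied by the ones still kept."""
    kept = list(deps)
    for dep in deps:
        others = [d for d in kept if d != dep]
        if dep[1] in closure(dep[0], others):
            kept = others
    return kept


def group_by_determinant(deps):
    """Map each determinant (future primary key) to its dependent columns."""
    tables = defaultdict(list)
    for lhs, rhs in deps:
        tables[lhs].append(rhs)
    return dict(tables)


result = group_by_determinant(drop_redundant(DEPENDENCIES))
for key, columns in result.items():
    print(key, "->", columns)
```

Each key of result corresponds to one of the five tables listed above, and its columns are the non-key columns of that table.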

Implement the tables

We need to start with the tables that do not have outgoing foreign keys: party and region.


CREATE TABLE party(
  party_id VARCHAR(50) PRIMARY KEY,
  party_name VARCHAR(50)
);


CREATE TABLE region(
  ons_region_id VARCHAR(10) PRIMARY KEY,
  region_name VARCHAR(50),
  country_name VARCHAR(50)
);

Now that the foreign key target is in place, we can introduce county, which refers to ons_region_id:

CREATE TABLE county(
  county_name VARCHAR(50) PRIMARY KEY,
  ons_region_id VARCHAR(10),
  FOREIGN KEY (ons_region_id) REFERENCES region(ons_region_id)
);

And now constituency which references county_name:

CREATE TABLE constituency(
  ons_id VARCHAR(10) PRIMARY KEY,
  constituency VARCHAR(50),
  county_name VARCHAR(50),
  constituency_type VARCHAR(10),
  FOREIGN KEY (county_name) REFERENCES county(county_name)
);

And finally candidate:

CREATE TABLE candidate(
  ons_id VARCHAR(10),
  firstname VARCHAR(50),
  surname VARCHAR(50),
  gender VARCHAR(10),
  party_id VARCHAR(50),
  sitting_mp VARCHAR(3),
  former_mp VARCHAR(3),
  votes INT,
  share FLOAT,
  change FLOAT,
  PRIMARY KEY (ons_id, firstname, surname),
  FOREIGN KEY (ons_id) REFERENCES constituency(ons_id),
  FOREIGN KEY (party_id) REFERENCES party(party_id)
);
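
Once the tables exist they can be filled from ge with INSERT ... SELECT DISTINCT, in the same order as they were created so that every foreign key target is present before it is referenced. The sketch below is one possible way to do this and is not taken from the original page. It assumes SQLite (through Python's sqlite3 module), uses TEXT columns, leaves out gender, sitting_mp, former_mp, share and change to keep the example short, and uses invented sample rows.

```python
import sqlite3

SCHEMA = """
CREATE TABLE ge (
  ons_id TEXT, ons_region_id TEXT, constituency_name TEXT, county_name TEXT,
  region_name TEXT, country_name TEXT, constituency_type TEXT,
  party_name TEXT, party_abbreviation TEXT, firstname TEXT, surname TEXT, votes INT
);
CREATE TABLE party (party_id TEXT PRIMARY KEY, party_name TEXT);
CREATE TABLE region (ons_region_id TEXT PRIMARY KEY, region_name TEXT, country_name TEXT);
CREATE TABLE county (
  county_name TEXT PRIMARY KEY, ons_region_id TEXT,
  FOREIGN KEY (ons_region_id) REFERENCES region(ons_region_id)
);
CREATE TABLE constituency (
  ons_id TEXT PRIMARY KEY, constituency TEXT, county_name TEXT, constituency_type TEXT,
  FOREIGN KEY (county_name) REFERENCES county(county_name)
);
CREATE TABLE candidate (
  ons_id TEXT, firstname TEXT, surname TEXT, party_id TEXT, votes INT,
  PRIMARY KEY (ons_id, firstname, surname),
  FOREIGN KEY (ons_id) REFERENCES constituency(ons_id),
  FOREIGN KEY (party_id) REFERENCES party(party_id)
);
"""

# Parents first, then the tables that refer to them.
LOAD = """
INSERT INTO party
  SELECT DISTINCT party_abbreviation, party_name FROM ge
  WHERE party_abbreviation IS NOT NULL;  -- independents have no party row
INSERT INTO region
  SELECT DISTINCT ons_region_id, region_name, country_name FROM ge;
INSERT INTO county
  SELECT DISTINCT county_name, ons_region_id FROM ge;
INSERT INTO constituency
  SELECT DISTINCT ons_id, constituency_name, county_name, constituency_type FROM ge;
INSERT INTO candidate
  SELECT ons_id, firstname, surname, party_abbreviation, votes FROM ge;
"""

# Invented sample rows in the column order of the cut-down ge table.
ROWS = [
    ("E001", "R1", "Northtown", "Northshire", "North", "England", "Borough",
     "Labour", "Lab", "Alan", "Johnson", 20000),
    ("E001", "R1", "Northtown", "Northshire", "North", "England", "Borough",
     "Conservative", "Con", "Mary", "Smith", 15000),
    ("E001", "R1", "Northtown", "Northshire", "North", "England", "Borough",
     None, None, "Tom", "Brown", 500),
    ("E002", "R1", "Easttown", "Northshire", "North", "England", "County",
     "Labour", "Lab", "Alan", "Johnson", 12000),
]

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.executescript(SCHEMA)
conn.executemany("INSERT INTO ge VALUES (?,?,?,?,?,?,?,?,?,?,?,?)", ROWS)
conn.executescript(LOAD)

TABLES = ("party", "region", "county", "constituency", "candidate")
counts = {t: conn.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0] for t in TABLES}
print(counts)
```

The four rows of ge collapse to two parties, one region, one county and two constituencies, while candidate keeps one row per original row. The independent candidate is stored with a NULL party_id, which a foreign key permits.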