Difference between revisions of "2015 UK General Election Normalising Data"

From SQLZOO
Jump to navigation Jump to search
Line 39: Line 39:
 
==Decide on tables==
 
==Decide on tables==
 
Each distinct determiner (the left hand side of the -> above) will be a table. The determiner will be the primary key in each case.
 
Each distinct determiner (the left hand side of the -> above) will be a table. The determiner will be the primary key in each case.
*ons_id identifies a constituency such as "Edinburgh South" or S14000024 as it is known in the office of national statistics. We will call this table '''seat''' as that is a short word used in parliamentary circles.
 
  
*seat(<b>seat_id</b>, <i>region_id</i>, constituency_name, county_name, contituency_type)
+
*constituency(<b>ons_id</b>, <i>region_id</i>, constituency, county_name, contituency_type)
 
*county(<b>county_name</b>, <i>region_id</i>)
 
*county(<b>county_name</b>, <i>region_id</i>)
 
*region(<b>region_id</b>, region_name, country_name)
 
*region(<b>region_id</b>, region_name, country_name)
 
*party(<b>party_id</b>, party_name)
 
*party(<b>party_id</b>, party_name)
*candidate(<b>seat_id</b>, <b>firstname</b>, <b>surname</b>, gender, <i>party_id</i>, sitting_mp, former_mp, votes, share, change)
+
*candidate(<b>ons_id</b>, <b>firstname</b>, <b>surname</b>, gender, <i>party_id</i>, sitting_mp, former_mp, votes, share, change)

Revision as of 22:33, 19 October 2016

The unnormalised data from the CSV file looks like this:

Ge2015.png

Notice that it includes redundancy - constituency names are repeated for example. We can see that link between the ons_id and the constituency_name is repeated over and over.

Pick a primary key

Our first job is to identify a primary key. There are several options to consider:

  • First we notice that there is no single column that is unique and so we will need to use at least a pair of columns.
  • (firstname,surname) would be good
    • The candidates are unique. By law, no one is allowed to stand as a candidate in two constituencies
    • Unfortunately candidate names are not unique - there are two candidates called "Alan Johnson" for example. You can confirm this with a query such as
      SELECT firstname,surname COUNT(1)

FROM ge HAVING COUNT(1)>1

  • The combination (ons_id,party) is also tempting
    • No party will put up two candidates in the same constituency, that would be self defeating and against the rules.
    • Unfortunately there are independent candidates with a NULL party and we cannot have NULL values in the primary key
  • It turns out that the triple (ons_id,firstname,surname) is unique. You may not have more than one candidate in a constituency with the same first name and surname. This would be confusing for voters. We can verify that this is a safe choice with a query such as
    SELECT firstname,surname,ons_id

FROM ge GROUP BY firstname,surname,ons_id HAVING COUNT(1)>1;

Identifying dependencies

Having decided on our primary key as (ons_id, firstname, surname) we notice the following dependencies:

ons_id                   -> ons_region_id
ons_id                   -> consitituency_name
ons_id                   -> county_name
ons_id                   -> constituency_type
ons_region_id            -> region_name
ons_region_id            -> country_name
party_abbreviation       -> party_name
ons_id,firstname,surname -> gender
ons_id,firstname,surname -> party_abbreviation
ons_id,firstname,surname -> sitting_mp
ons_id,firstname,surname -> former_mp
ons_id,firstname,surname -> votes
ons_id,firstname,surname -> share
ons_id,firstname,surname -> change

Decide on tables

Each distinct determiner (the left hand side of the -> above) will be a table. The determiner will be the primary key in each case.

  • constituency(ons_id, region_id, constituency, county_name, contituency_type)
  • county(county_name, region_id)
  • region(region_id, region_name, country_name)
  • party(party_id, party_name)
  • candidate(ons_id, firstname, surname, gender, party_id, sitting_mp, former_mp, votes, share, change)