Difference between revisions of "2015 UK General Election Normalising Data"

From SQLZOO
Jump to navigation Jump to search
(Created page with "The unnormalised data from the CSV file looks like this: File:ge2015.png Notice that it includes redundancy - constituency names are repeated for example. We can see that...")
 
Line 6: Line 6:
 
*(firstname,surname) would be good
 
*(firstname,surname) would be good
 
**The candidates are unique. By law, no one is allowed to stand as a candidate in two constituencies
 
**The candidates are unique. By law, no one is allowed to stand as a candidate in two constituencies
**Unfortunately candidate names are not unique - there are two candidates called "Alan Johnson" for example. You can confirm this with a query such as<pre>SELECT firstname,surname COUNT(1) FROM ge HAVING COUNT(1)>1</pre>
+
**Unfortunately candidate names are not unique - there are two candidates called "Alan Johnson" for example. You can confirm this with a query such as<pre>SELECT firstname,surname COUNT(1)
 +
  FROM ge
 +
HAVING COUNT(1)>1</pre>
 +
*The combination (ons_id,party) is also tempting
 +
**No party will put up two candidates in the same constituency, that would be self defeating and against the rules.
 +
**Unfortunately there are independent candidates with a NULL party and we cannot have NULL values in the primary key
 +
*It turns out that the triple (ons_id,firstname,surname) is unique. You may not have more than one candidate in a constituency with the same first name and surname. This would be confusing for voters. We can verify that this is a safe choice with a query such as <pre>SELECT firstname,surname,ons_id
 +
  FROM ge
 +
GROUP BY firstname,surname,ons_id
 +
HAVING COUNT(1)>1;</pre>
 +
==Identifying dependencies==
 +
Having decided on our primary key as (ons_id,firstname,surname) we notice the following dependencies:
 +
ons_id                  -> ons_region_id
 +
ons_id                  -> consitituency_name
 +
ons_id                  ->county_name
 +
ons_id                  -> constituency_type
 +
ons_region_id            -> region_name
 +
ons_region_id            -> country_name
 +
party_abbreviation      -> party_name
 +
ons_id,firstname,surname -> gender
 +
ons_id,firstname,surname -> party_abbreviation
 +
ons_id,firstname,surname -> sitting_mp
 +
ons_id,firstname,surname -> former_mp
 +
ons_id,firstname,surname -> votes
 +
ons_id,firstname,surname -> share
 +
ons_id,firstname,surname -> change

Revision as of 21:51, 19 October 2016

The unnormalised data from the CSV file looks like this: Ge2015.png Notice that it includes redundancy - constituency names are repeated for example. We can see that link between the ons_id and the constituency_name is repeated over and over. Our first job is to identify a primary key. There are several options to consider:

  • First we notice that there is no single column that is unique and so we will need to use at least a pair of columns.
  • (firstname,surname) would be good
    • The candidates are unique. By law, no one is allowed to stand as a candidate in two constituencies
    • Unfortunately candidate names are not unique - there are two candidates called "Alan Johnson" for example. You can confirm this with a query such as
      SELECT firstname,surname COUNT(1)

FROM ge HAVING COUNT(1)>1

  • The combination (ons_id,party) is also tempting
    • No party will put up two candidates in the same constituency, that would be self defeating and against the rules.
    • Unfortunately there are independent candidates with a NULL party and we cannot have NULL values in the primary key
  • It turns out that the triple (ons_id,firstname,surname) is unique. You may not have more than one candidate in a constituency with the same first name and surname. This would be confusing for voters. We can verify that this is a safe choice with a query such as
    SELECT firstname,surname,ons_id

FROM ge GROUP BY firstname,surname,ons_id HAVING COUNT(1)>1;

Identifying dependencies

Having decided on our primary key as (ons_id,firstname,surname) we notice the following dependencies:

ons_id                   -> ons_region_id
ons_id                   -> consitituency_name
ons_id                   ->county_name
ons_id                   -> constituency_type
ons_region_id            -> region_name
ons_region_id            -> country_name
party_abbreviation       -> party_name
ons_id,firstname,surname -> gender
ons_id,firstname,surname -> party_abbreviation
ons_id,firstname,surname -> sitting_mp
ons_id,firstname,surname -> former_mp
ons_id,firstname,surname -> votes
ons_id,firstname,surname -> share
ons_id,firstname,surname -> change