2015 UK General Election Normalising Data
The unnormalised data from the CSV file looks like this: Notice that it includes redundancy - constituency names are repeated for example. We can see that link between the ons_id and the constituency_name is repeated over and over. Our first job is to identify a primary key. There are several options to consider:
- First we notice that there is no single column that is unique and so we will need to use at least a pair of columns.
- (firstname,surname) would be good
- The candidates are unique. By law, no one is allowed to stand as a candidate in two constituencies
- Unfortunately candidate names are not unique - there are two candidates called "Alan Johnson" for example. You can confirm this with a query such as
SELECT firstname,surname COUNT(1) FROM ge HAVING COUNT(1)>1