2015 UK General Election using SQL Server
This tutorial assumes you have access to powershell and sqlcmd or Microsoft SQL Server Management Studio
Contents
Get to the powershell prompt and download the csv
(new-object system.net.webclient).downloadfile("http://researchbriefings.files.parliament.uk/documents/CBP-7186/hocl-ge2015-results-full.csv","$pwd\e2015.csv")
Go into sqlcmd or SSMS and create a table for the results
Here you create a single flat table that can store all of the unnormalised data.
The first line of the CSV file contains the column headings (ons_id,ons_region_id,constituency_name,county_name,region_name,country_name,constituency_type,party_name,party_abbreviation,firstname,surname,gender,sitting_mp,former_mp,votes,share,change). We create a table with a column for each of these:
sqlcmd -S .\sqlexpress -E
The create table statement could be:
CREATE DATABASE gisq GO USE gisq GO CREATE TABLE ge( ons_id VARCHAR(10), ons_region_id VARCHAR(10), constituency_name VARCHAR(50), county_name VARCHAR(50), region_name VARCHAR(50), country_name VARCHAR(50), constituency_type VARCHAR(10), party_name VARCHAR(50), party_abbreviation VARCHAR(50), firstname VARCHAR(50), surname VARCHAR(50), gender VARCHAR(6), sitting_mp VARCHAR(3), former_mp VARCHAR(3), votes INT, share FLOAT, change VARCHAR(20), PRIMARY KEY(ons_id,firstname,surname) ) GO
Import the csv file into your flat table
You can import the data using this line
BULK INSERT ge FROM 'C:\db\ge2015.csv' WITH (FIELDTERMINATOR=',', ROWTERMINATOR='\n', FIRSTROW=2) GO
- Each line is ended with carriage return \n
- The first row contains column headings not data so we start at row 2
You will most likely get error messages like this:
Msg 4863, Level 16, State 1, Server ME1C039-130368\SQLEXPRESS, Line 3 Bulk load data conversion error (truncation) for row 105, column 13 (sitting_mp). Msg 4863, Level 16, State 1, Server ME1C039-130368\SQLEXPRESS, Line 3 Bulk load data conversion error (truncation) for row 106, column 12 (gender). Msg 4863, Level 16, State 1, Server ME1C039-130368\SQLEXPRESS, Line 3 Bulk load data conversion error (truncation) for row 107, column 13 (sitting_mp). ...
Identifying the problem rows
You can look at the data with a text editor or use powershell. We load all the raw text into $ge1 - we plan to fix the data stage by stage.
$ge1 = get-content "$pwd\ge2015.csv" $ge[104]
Which shows
S14000006,S92000003,"Ayr, Carrick and Cumnock",Scotland,Scotland,Scotland,County,Scottish National Party,SNP,Corri,Wilson,Female,No,No,25492,0.488268306,0.307935794
(The error was reported on 105 but because we skipped row 1 the actual first problem is on 104) Observe that the constituency name has a comma in it - BULK INSERT is interpreting this as a field delimiter.
Fixing the problem rows
The best way to fix this is to use someone else's code to deal with CSV - but if we are careful you can do this yourself with regular expressions. Later versions of SQL Server allow text delimiters to be defined, python has a csv module.
You can use replace - we can replace the , inside a double quote using a regular expression:
$ge1[104] -replace ',"(.*),(.*)"', ',"$1|$2"'
shows this
S14000006,S92000003,"Ayr| Carrick and Cumnock",Scotland,Scotland,Scotland,County,Scottish National Party,SNP,Corri,Wilson,Female,No,No,25492,0.488268306,0.307935794
You can apply this to the whole file using
$ge1 = $ge1 -replace ',"(.*),(.*)"', ',"$1|$2"'
Then write the file out to ge2.csv
$ge1 > "$pwd\ge2.csv"
You can try loading this into the table ge again - but take care to delete any rows you created last time.
More problems
You will probably find further problems:
- there is a constituency with two commas in it
- your data still includes double quote marks that should be replaced
- you have | marks in your data that should be replaced with commas
You can deal with these problems either in powershell, a capable text editor, excel or in SQL
Here is a solution:
$ge1 = get-content "$pwd\ge2015.csv" $ge1 = $ge1 -replace ',"(.*),(.*)"', ',"$1|$2"' $ge1 = $ge1 -replace ',"(.*),(.*)"', ',"$1|$2"' $ge1 = $ge1 -replace '"', $ge1 = $ge1 -replace ",", "`t" $ge1 > "$pwd\ge2.csv" sqlcmd -S .\sqlexpress -E -d gisq -Q "BULK INSERT ge FROM '$pwd\ge5.csv' WITH (FIELDTERMINATOR='\t', ROWTERMINATOR='\n', FIRSTROW=2)"
Run some queries
Now let's look at some data. How many female candidates were there?
select count(1) from ge where gender='female';
Who stood in Edinburgh South?
select surname,votes from ge where constituency_name='Edinburgh South';
You can now move to the next stage: 2015 UK General Election Normalising Data