Difference between revisions of "2015 UK General Election using SQL Server"
Line 83: | Line 83: | ||
Here is a solution: | Here is a solution: | ||
+ | $ge1 = get-content "$pwd\ge2015.csv" | ||
$ge1 = $ge1 -replace ',"(.*),(.*)"', ',"$1|$2"' | $ge1 = $ge1 -replace ',"(.*),(.*)"', ',"$1|$2"' | ||
$ge1 = $ge1 -replace ',"(.*),(.*)"', ',"$1|$2"' | $ge1 = $ge1 -replace ',"(.*),(.*)"', ',"$1|$2"' |
Revision as of 14:48, 16 June 2017
This tutorial assumes you have access to powershell and sqlcmd or Microsoft SQL Server Management Studio
Contents
Get to the powershell prompt and download the csv
(new-object system.net.webclient).downloadfile("http://researchbriefings.files.parliament.uk/documents/CBP-7186/hocl-ge2015-results-full.csv","$pwd\e2015.csv")
Go into sqlcmd or SSMS and create a table for the results
Here you create a single flat table that can store all of the unnormalised data.
The first line of the CSV file contains the column headings (ons_id,ons_region_id,constituency_name,county_name,region_name,country_name,constituency_type,party_name,party_abbreviation,firstname,surname,gender,sitting_mp,former_mp,votes,share,change). We create a table with a column for each of these:
sqlcmd -S .\sqlexpress -E
The create table statement could be:
CREATE DATABASE gisq GO USE gisq GO CREATE TABLE ge( ons_id VARCHAR(10), ons_region_id VARCHAR(10), constituency_name VARCHAR(50), county_name VARCHAR(50), region_name VARCHAR(50), country_name VARCHAR(50), constituency_type VARCHAR(10), party_name VARCHAR(50), party_abbreviation VARCHAR(50), firstname VARCHAR(50), surname VARCHAR(50), gender VARCHAR(6), sitting_mp VARCHAR(3), former_mp VARCHAR(3), votes INT, share FLOAT, change VARCHAR(20), PRIMARY KEY(ons_id,firstname,surname) ) GO
Import the csv file into your flat table
You can import the data using this line
BULK INSERT ge FROM 'C:\db\ge2015.csv' WITH (FIELDTERMINATOR=',', ROWTERMINATOR='\n', FIRSTROW=2) GO
- Each line is ended with carriage return \n
- The first row contains column headings not data so we start at row 2
You will most likely get error messages like this:
Msg 4863, Level 16, State 1, Server ME1C039-130368\SQLEXPRESS, Line 3 Bulk load data conversion error (truncation) for row 105, column 13 (sitting_mp). Msg 4863, Level 16, State 1, Server ME1C039-130368\SQLEXPRESS, Line 3 Bulk load data conversion error (truncation) for row 106, column 12 (gender). Msg 4863, Level 16, State 1, Server ME1C039-130368\SQLEXPRESS, Line 3 Bulk load data conversion error (truncation) for row 107, column 13 (sitting_mp). ...
Identifying the problem rows
You can look at the data with a text editor or use powershell. We load all the raw text into $ge1 - we plan to fix the data stage by stage.
$ge1 = get-content "$pwd\ge2015.csv" $ge[104]
Which shows
S14000006,S92000003,"Ayr, Carrick and Cumnock",Scotland,Scotland,Scotland,County,Scottish National Party,SNP,Corri,Wilson,Female,No,No,25492,0.488268306,0.307935794
(The error was reported on 105 but because we skipped row 1 the actual first problem is on 104) Observe that the constituency name has a comma in it - BULK INSERT is interpreting this as a field delimiter.
Fixing the problem rows
The best way to fix this is to use someone else's code to deal with CSV - but if we are careful you can do this yourself with regular expressions. Later versions of SQL Server allow text delimiters to be defined, python has a csv module.
You can use replace - we can replace the , inside a double quote using a regular expression:
$ge1[104] -replace ',"(.*),(.*)"', ',"$1|$2"'
shows this
S14000006,S92000003,"Ayr| Carrick and Cumnock",Scotland,Scotland,Scotland,County,Scottish National Party,SNP,Corri,Wilson,Female,No,No,25492,0.488268306,0.307935794
You can apply this to the whole file using
$ge1 = $ge1 -replace ',"(.*),(.*)"', ',"$1|$2"'
Then write the file out to ge2.csv
$ge1 > "$pwd\ge2.csv"
You can try loading this into the table ge again - but take care to delete any rows you created last time.
More problems
You will probably find further problems:
- there is a constituency with two commas in it
- your data still includes double quote marks that should be replaced
- you have | marks in your data that should be replaced with commas
You can deal with these problems either in powershell, a capable text editor, excel or in SQL
Here is a solution:
$ge1 = get-content "$pwd\ge2015.csv" $ge1 = $ge1 -replace ',"(.*),(.*)"', ',"$1|$2"' $ge1 = $ge1 -replace ',"(.*),(.*)"', ',"$1|$2"' $ge1 = $ge1 -replace '"', $ge1 = $ge1 -replace ",", "`t" $ge1 > "$pwd\ge2.csv" sqlcmd -S .\sqlexpress -E -d gisq -Q "BULK INSERT ge FROM '$pwd\ge5.csv' WITH (FIELDTERMINATOR='\t', ROWTERMINATOR='\n', FIRSTROW=2)"
Run some queries
Now let's look at some data. How many female candidates were there?
select count(1) from ge where gender='female';
Who stood in Edinburgh South?
select surname,votes from ge where constituency_name='Edinburgh South';
You can now move to the next stage: 2015 UK General Election Normalising Data