Difference between revisions of "2015 UK General Election using SQL Server"
(16 intermediate revisions by the same user not shown) | |||
Line 2: | Line 2: | ||
==Get to the powershell prompt and download the csv== | ==Get to the powershell prompt and download the csv== | ||
− | + | Invoke-WebRequest http://researchbriefings.files.parliament.uk/documents/CBP-7979/hocl-ge2017-results-full.csv -OutFile ge2017.csv | |
==Go into sqlcmd or SSMS and create a table for the results== | ==Go into sqlcmd or SSMS and create a table for the results== | ||
Line 9: | Line 9: | ||
The first line of the CSV file contains the column headings (ons_id,ons_region_id,constituency_name,county_name,region_name,country_name,constituency_type,party_name,party_abbreviation,firstname,surname,gender,sitting_mp,former_mp,votes,share,change). We create a table with a column for each of these: | The first line of the CSV file contains the column headings (ons_id,ons_region_id,constituency_name,county_name,region_name,country_name,constituency_type,party_name,party_abbreviation,firstname,surname,gender,sitting_mp,former_mp,votes,share,change). We create a table with a column for each of these: | ||
sqlcmd -S .\sqlexpress -E | sqlcmd -S .\sqlexpress -E | ||
+ | We do not know the size of each column yet, too small and the import statement will give errors or warnings. It is best to be generous - memory is cheap. | ||
+ | |||
The create table statement could be: | The create table statement could be: | ||
− | |||
− | |||
− | |||
− | |||
CREATE TABLE ge( | CREATE TABLE ge( | ||
ons_id VARCHAR(10), | ons_id VARCHAR(10), | ||
Line 34: | Line 32: | ||
PRIMARY KEY(ons_id,firstname,surname) | PRIMARY KEY(ons_id,firstname,surname) | ||
) | ) | ||
− | |||
− | == | + | ==Attempt import the csv file into your flat table== |
− | You can import the data using this line | + | You can import the data using this line (from SQLCMD or SQL Server Management Studio) |
− | BULK INSERT ge FROM 'C:\db\ | + | BULK INSERT ge FROM 'C:\db\ge2017.csv' WITH (FIELDTERMINATOR=',', ROWTERMINATOR='\n', FIRSTROW=2) |
GO | GO | ||
*Each line is ended with carriage return \n | *Each line is ended with carriage return \n | ||
Line 51: | Line 48: | ||
Bulk load data conversion error (truncation) for row 107, column 13 (sitting_mp). | Bulk load data conversion error (truncation) for row 107, column 13 (sitting_mp). | ||
... | ... | ||
− | + | Unfortunately BULK INSERT cannot deal with the CSV format so we need to pre-process the file. Some solutions: | |
− | + | *Use Excel - you can load CSV and save as TXT | |
− | + | *Use regular expressions in powershell: [[convert CSV to TXT]] | |
− | + | *Use python | |
− | + | *Find a converter online | |
− | + | With the conversion complete you can import the data - you will have too use the full path name and you may have permissions problems | |
− | + | BULK INSERT ge FROM 'c:\path\ge2017.txt' WITH (FIRSTROW=2) | |
− | + | You may get a few errors - it is probably easiest to fix these "by hand" using Excel or a text editor. The BULK INSERT command should show you the line numbers. Check that you have the correct number of rows - there should be 3304 data rows. | |
− | + | SELECT COUNT(*) FROM ge | |
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | You | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
==Run some queries== | ==Run some queries== | ||
Line 99: | Line 64: | ||
Who stood in Edinburgh South? | Who stood in Edinburgh South? | ||
select surname,votes from ge where constituency_name='Edinburgh South'; | select surname,votes from ge where constituency_name='Edinburgh South'; | ||
+ | |||
+ | ==Who won the election?== | ||
+ | The important question is which party got the most seats. Which party leader should the queen invite to form a government? | ||
+ | We can use the [[RANK]] function for this but we prefer to use a slower, simpler solution: | ||
+ | |||
+ | First calculate the max votes per constituency: | ||
+ | SELECT ons_id,MAX(votes) AS mv | ||
+ | FROM ge | ||
+ | GROUP BY ons_id | ||
+ | |||
+ | Now JOIN that to the original table to find the party that won each seat: | ||
+ | SELECT ge.ons_id, ge.party_name | ||
+ | FROM ge JOIN | ||
+ | (SELECT ons_id,MAX(votes) AS mv | ||
+ | FROM ge | ||
+ | GROUP BY ons_id) AS ms ON ge.ons_id=ms.ons_id AND ge.votes=ms.mv | ||
+ | We can now count the seats by party: | ||
+ | SELECT party_name,COUNT(1) TotSeat | ||
+ | FROM ( | ||
+ | SELECT ge.ons_id, ge.party_name | ||
+ | FROM ge JOIN | ||
+ | (SELECT ons_id,MAX(votes) AS mv | ||
+ | FROM ge | ||
+ | GROUP BY ons_id) AS ms ON ge.ons_id=ms.ons_id AND ge.votes=ms.mv | ||
+ | ) AS mp | ||
+ | GROUP BY party_name | ||
You can now move to the next stage: [[2015 UK General Election Normalising Data]] | You can now move to the next stage: [[2015 UK General Election Normalising Data]] |
Latest revision as of 08:33, 5 July 2017
This tutorial assumes you have access to powershell and sqlcmd or Microsoft SQL Server Management Studio
Contents
Get to the powershell prompt and download the csv
Invoke-WebRequest http://researchbriefings.files.parliament.uk/documents/CBP-7979/hocl-ge2017-results-full.csv -OutFile ge2017.csv
Go into sqlcmd or SSMS and create a table for the results
Here you create a single flat table that can store all of the unnormalised data.
The first line of the CSV file contains the column headings (ons_id,ons_region_id,constituency_name,county_name,region_name,country_name,constituency_type,party_name,party_abbreviation,firstname,surname,gender,sitting_mp,former_mp,votes,share,change). We create a table with a column for each of these:
sqlcmd -S .\sqlexpress -E
We do not know the size of each column yet, too small and the import statement will give errors or warnings. It is best to be generous - memory is cheap.
The create table statement could be:
CREATE TABLE ge( ons_id VARCHAR(10), ons_region_id VARCHAR(10), constituency_name VARCHAR(50), county_name VARCHAR(50), region_name VARCHAR(50), country_name VARCHAR(50), constituency_type VARCHAR(10), party_name VARCHAR(50), party_abbreviation VARCHAR(50), firstname VARCHAR(50), surname VARCHAR(50), gender VARCHAR(6), sitting_mp VARCHAR(3), former_mp VARCHAR(3), votes INT, share FLOAT, change VARCHAR(20), PRIMARY KEY(ons_id,firstname,surname) )
Attempt import the csv file into your flat table
You can import the data using this line (from SQLCMD or SQL Server Management Studio)
BULK INSERT ge FROM 'C:\db\ge2017.csv' WITH (FIELDTERMINATOR=',', ROWTERMINATOR='\n', FIRSTROW=2) GO
- Each line is ended with carriage return \n
- The first row contains column headings not data so we start at row 2
You will most likely get error messages like this:
Msg 4863, Level 16, State 1, Server ME1C039-130368\SQLEXPRESS, Line 3 Bulk load data conversion error (truncation) for row 105, column 13 (sitting_mp). Msg 4863, Level 16, State 1, Server ME1C039-130368\SQLEXPRESS, Line 3 Bulk load data conversion error (truncation) for row 106, column 12 (gender). Msg 4863, Level 16, State 1, Server ME1C039-130368\SQLEXPRESS, Line 3 Bulk load data conversion error (truncation) for row 107, column 13 (sitting_mp). ...
Unfortunately BULK INSERT cannot deal with the CSV format so we need to pre-process the file. Some solutions:
- Use Excel - you can load CSV and save as TXT
- Use regular expressions in powershell: convert CSV to TXT
- Use python
- Find a converter online
With the conversion complete you can import the data - you will have too use the full path name and you may have permissions problems
BULK INSERT ge FROM 'c:\path\ge2017.txt' WITH (FIRSTROW=2)
You may get a few errors - it is probably easiest to fix these "by hand" using Excel or a text editor. The BULK INSERT command should show you the line numbers. Check that you have the correct number of rows - there should be 3304 data rows.
SELECT COUNT(*) FROM ge
Run some queries
Now let's look at some data. How many female candidates were there?
select count(1) from ge where gender='female';
Who stood in Edinburgh South?
select surname,votes from ge where constituency_name='Edinburgh South';
Who won the election?
The important question is which party got the most seats. Which party leader should the queen invite to form a government? We can use the RANK function for this but we prefer to use a slower, simpler solution:
First calculate the max votes per constituency:
SELECT ons_id,MAX(votes) AS mv FROM ge GROUP BY ons_id
Now JOIN that to the original table to find the party that won each seat:
SELECT ge.ons_id, ge.party_name FROM ge JOIN (SELECT ons_id,MAX(votes) AS mv FROM ge GROUP BY ons_id) AS ms ON ge.ons_id=ms.ons_id AND ge.votes=ms.mv
We can now count the seats by party:
SELECT party_name,COUNT(1) TotSeat FROM ( SELECT ge.ons_id, ge.party_name FROM ge JOIN (SELECT ons_id,MAX(votes) AS mv FROM ge GROUP BY ons_id) AS ms ON ge.ons_id=ms.ons_id AND ge.votes=ms.mv ) AS mp GROUP BY party_name
You can now move to the next stage: 2015 UK General Election Normalising Data