Difference between revisions of "2015 UK General Election using SQL Server"

From SQLZOO
Jump to navigation Jump to search
 
(24 intermediate revisions by the same user not shown)
Line 2: Line 2:
 
==Get to the powershell prompt and download the csv==
 
==Get to the powershell prompt and download the csv==
  
  (new-object system.net.webclient).downloadfile("http://researchbriefings.files.parliament.uk/documents/CBP-7186/hocl-ge2015-results-full.csv","$pwd\e2015.csv")
+
  Invoke-WebRequest http://researchbriefings.files.parliament.uk/documents/CBP-7979/hocl-ge2017-results-full.csv -OutFile ge2017.csv
  
 
==Go into sqlcmd or SSMS and create a table for the results==
 
==Go into sqlcmd or SSMS and create a table for the results==
Line 9: Line 9:
 
The first line of the CSV file contains the column headings (ons_id,ons_region_id,constituency_name,county_name,region_name,country_name,constituency_type,party_name,party_abbreviation,firstname,surname,gender,sitting_mp,former_mp,votes,share,change). We create a table with a column for each of these:
 
The first line of the CSV file contains the column headings (ons_id,ons_region_id,constituency_name,county_name,region_name,country_name,constituency_type,party_name,party_abbreviation,firstname,surname,gender,sitting_mp,former_mp,votes,share,change). We create a table with a column for each of these:
 
  sqlcmd -S .\sqlexpress -E
 
  sqlcmd -S .\sqlexpress -E
 +
We do not know the size of each column yet, too small and the import statement will give errors or warnings. It is best to be generous - memory is cheap.
 +
 
The create table statement could be:
 
The create table statement could be:
CREATE DATABASE gisq
 
GO
 
USE gisq
 
GO
 
 
  CREATE TABLE ge(
 
  CREATE TABLE ge(
 
   ons_id VARCHAR(10),
 
   ons_id VARCHAR(10),
Line 34: Line 32:
 
   PRIMARY KEY(ons_id,firstname,surname)
 
   PRIMARY KEY(ons_id,firstname,surname)
 
  )
 
  )
GO
 
  
==Import the csv file into your flat table==
+
==Attempt import the csv file into your flat table==
You can import the data using this line
+
You can import the data using this line (from SQLCMD or SQL Server Management Studio)
  BULK INSERT ge FROM 'C:\db\ge2015.csv' WITH (FIELDTERMINATOR=',', ROWTERMINATOR='\n', FIRSTROW=2)
+
  BULK INSERT ge FROM 'C:\db\ge2017.csv' WITH (FIELDTERMINATOR=',', ROWTERMINATOR='\n', FIRSTROW=2)
 
  GO
 
  GO
 
*Each line is ended with carriage return \n
 
*Each line is ended with carriage return \n
Line 51: Line 48:
 
  Bulk load data conversion error (truncation) for row 107, column 13 (sitting_mp).
 
  Bulk load data conversion error (truncation) for row 107, column 13 (sitting_mp).
 
  ...
 
  ...
 
+
Unfortunately BULK INSERT cannot deal with the CSV format so we need to pre-process the file. Some solutions:
==Identifying the problem rows==
+
*Use Excel - you can load CSV and save as TXT
You can look at the data with a text editor or use powershell. We load all the raw text into $ge1 - we plan to fix the data stage by stage.
+
*Use regular expressions in powershell: [[convert CSV to TXT]]
$ge1 = get-content "$pwd\ge2015.csv"
+
*Use python
$ge[104]
+
*Find a converter online
Which shows
+
With the conversion complete you can import the data - you will have too use the full path name and you may have permissions problems
S14000006,S92000003,"Ayr, Carrick and Cumnock",Scotland,Scotland,Scotland,County,Scottish National Party,SNP,Corri,Wilson,Female,No,No,25492,0.488268306,0.307935794
+
BULK INSERT ge FROM 'c:\path\ge2017.txt' WITH (FIRSTROW=2)
(The error was reported on 105 but because we skipped row 1 the actual first problem is on 104)
+
You may get a few errors - it is probably easiest to fix these "by hand" using Excel or a text editor. The BULK INSERT command should show you the line numbers. Check that you have the correct number of rows - there should be 3304 data rows.
Observe that the constituency name has a comma in it - BULK INSERT is interpreting this as a field delimiter.
+
  SELECT COUNT(*) FROM ge
 
 
==Fixing the problem rows==
 
The best way to fix this is to use someone else's code to deal with CSV - but if we are careful you can do this yourself with regular expressions. Later versions of SQL Server allow text delimiters to be defined, python has a csv module.
 
 
 
You can use replace - we can replace the , inside a double quote using a regular expression:
 
 
$ge1[104] -replace ',"(.*),(.*)"', ',"$1|$2"'
 
shows this
 
S14000006,S92000003,"Ayr| Carrick and Cumnock",Scotland,Scotland,Scotland,County,Scottish National Party,SNP,Corri,Wilson,Female,No,No,25492,0.488268306,0.307935794
 
You can apply this to the whole file using
 
$ge1 = $ge1 -replace ',"(.*),(.*)"', ',"$1|$2"'
 
Then write the file out to ge2.csv
 
$ge1 > "$pwd\ge2.csv"
 
You can try loading this into the table ge again - but take care to delete any rows you created last time.
 
 
 
==More problems==
 
You will probably find further problems:
 
*there is a constituency with two commas in it
 
*your data still includes double quote marks that should be replaced
 
*you have | marks in your data that should be replaced with commas
 
You can deal with these problems either in powershell, a capable text editor, excel or in SQL
 
 
 
Here is a solution:
 
$ge1 = get-content "$pwd\ge2015.csv"
 
$ge1 = $ge1 -replace ',"(.*),(.*)"', ',"$1|$2"'
 
$ge1 = $ge1 -replace ',"(.*),(.*)"', ',"$1|$2"'
 
$ge1 = $ge1 -replace '"', ""
 
$ge1 = $ge1 -replace ",", "`t"
 
$ge1 > "$pwd\ge2.csv"
 
  sqlcmd -S .\sqlexpress -E -d gisq -Q "BULK INSERT ge FROM '$pwd\ge5.csv' WITH (FIELDTERMINATOR='\t', ROWTERMINATOR='\n', FIRSTROW=2)"
 
  
 
==Run some queries==
 
==Run some queries==
Line 97: Line 64:
 
Who stood in Edinburgh South?
 
Who stood in Edinburgh South?
 
  select surname,votes from ge where constituency_name='Edinburgh South';
 
  select surname,votes from ge where constituency_name='Edinburgh South';
 +
 +
==Who won the election?==
 +
The important question is which party got the most seats. Which party leader should the queen invite to form a government?
 +
We can use the [[RANK]] function for this but we prefer to use a slower, simpler solution:
 +
 +
First calculate the max votes per constituency:
 +
SELECT ons_id,MAX(votes) AS mv
 +
  FROM ge
 +
GROUP BY ons_id
 +
 +
Now JOIN that to the original table to find the party that won each seat:
 +
SELECT ge.ons_id, ge.party_name
 +
  FROM ge JOIN
 +
        (SELECT ons_id,MAX(votes) AS mv
 +
          FROM ge
 +
        GROUP BY ons_id) AS ms ON ge.ons_id=ms.ons_id AND ge.votes=ms.mv
 +
We can now count the seats by party:
 +
SELECT party_name,COUNT(1) TotSeat
 +
  FROM (
 +
    SELECT ge.ons_id, ge.party_name
 +
      FROM ge JOIN
 +
          (SELECT ons_id,MAX(votes) AS mv
 +
              FROM ge
 +
            GROUP BY ons_id) AS ms ON ge.ons_id=ms.ons_id AND ge.votes=ms.mv
 +
        ) AS mp
 +
GROUP BY party_name
  
 
You can now move to the next stage: [[2015 UK General Election Normalising Data]]
 
You can now move to the next stage: [[2015 UK General Election Normalising Data]]

Latest revision as of 08:33, 5 July 2017

This tutorial assumes you have access to powershell and sqlcmd or Microsoft SQL Server Management Studio

Get to the powershell prompt and download the csv

Invoke-WebRequest http://researchbriefings.files.parliament.uk/documents/CBP-7979/hocl-ge2017-results-full.csv -OutFile ge2017.csv

Go into sqlcmd or SSMS and create a table for the results

Here you create a single flat table that can store all of the unnormalised data.

The first line of the CSV file contains the column headings (ons_id,ons_region_id,constituency_name,county_name,region_name,country_name,constituency_type,party_name,party_abbreviation,firstname,surname,gender,sitting_mp,former_mp,votes,share,change). We create a table with a column for each of these:

sqlcmd -S .\sqlexpress -E

We do not know the size of each column yet, too small and the import statement will give errors or warnings. It is best to be generous - memory is cheap.

The create table statement could be:

CREATE TABLE ge(
  ons_id VARCHAR(10),
  ons_region_id VARCHAR(10),
  constituency_name VARCHAR(50),
  county_name VARCHAR(50),
  region_name VARCHAR(50),
  country_name VARCHAR(50),
  constituency_type VARCHAR(10),
  party_name VARCHAR(50),
  party_abbreviation VARCHAR(50),
  firstname VARCHAR(50),
  surname VARCHAR(50),
  gender VARCHAR(6),
  sitting_mp VARCHAR(3),
  former_mp VARCHAR(3),
  votes INT,
  share FLOAT,
  change VARCHAR(20),
  PRIMARY KEY(ons_id,firstname,surname)
)

Attempt import the csv file into your flat table

You can import the data using this line (from SQLCMD or SQL Server Management Studio)

BULK INSERT ge FROM 'C:\db\ge2017.csv' WITH (FIELDTERMINATOR=',', ROWTERMINATOR='\n', FIRSTROW=2)
GO
  • Each line is ended with carriage return \n
  • The first row contains column headings not data so we start at row 2

You will most likely get error messages like this:

Msg 4863, Level 16, State 1, Server ME1C039-130368\SQLEXPRESS, Line 3
Bulk load data conversion error (truncation) for row 105, column 13 (sitting_mp).
Msg 4863, Level 16, State 1, Server ME1C039-130368\SQLEXPRESS, Line 3
Bulk load data conversion error (truncation) for row 106, column 12 (gender).
Msg 4863, Level 16, State 1, Server ME1C039-130368\SQLEXPRESS, Line 3
Bulk load data conversion error (truncation) for row 107, column 13 (sitting_mp).
...

Unfortunately BULK INSERT cannot deal with the CSV format so we need to pre-process the file. Some solutions:

  • Use Excel - you can load CSV and save as TXT
  • Use regular expressions in powershell: convert CSV to TXT
  • Use python
  • Find a converter online

With the conversion complete you can import the data - you will have too use the full path name and you may have permissions problems

BULK INSERT ge FROM 'c:\path\ge2017.txt' WITH (FIRSTROW=2)

You may get a few errors - it is probably easiest to fix these "by hand" using Excel or a text editor. The BULK INSERT command should show you the line numbers. Check that you have the correct number of rows - there should be 3304 data rows.

SELECT COUNT(*) FROM ge

Run some queries

Now let's look at some data. How many female candidates were there?

select count(1) from ge where gender='female';

Who stood in Edinburgh South?

select surname,votes from ge where constituency_name='Edinburgh South';

Who won the election?

The important question is which party got the most seats. Which party leader should the queen invite to form a government? We can use the RANK function for this but we prefer to use a slower, simpler solution:

First calculate the max votes per constituency:

SELECT ons_id,MAX(votes) AS mv
  FROM ge
GROUP BY ons_id

Now JOIN that to the original table to find the party that won each seat:

SELECT ge.ons_id, ge.party_name
  FROM ge JOIN
       (SELECT ons_id,MAX(votes) AS mv
          FROM ge
        GROUP BY ons_id) AS ms ON ge.ons_id=ms.ons_id AND ge.votes=ms.mv

We can now count the seats by party:

SELECT party_name,COUNT(1) TotSeat
  FROM (
    SELECT ge.ons_id, ge.party_name
      FROM ge JOIN
          (SELECT ons_id,MAX(votes) AS mv
             FROM ge
           GROUP BY ons_id) AS ms ON ge.ons_id=ms.ons_id AND ge.votes=ms.mv
       ) AS mp
GROUP BY party_name

You can now move to the next stage: 2015 UK General Election Normalising Data