Difference between revisions of "2015 UK General Election using SQL Server"

From SQLZOO
Jump to navigation Jump to search
(Created page with "This tutorial assumes you have access to powershell and sqlcmd or Microsoft SQL Server Management Studio ==Get to the powershell prompt and download the csv== (new-object sy...")
 
 
(34 intermediate revisions by the same user not shown)
Line 2: Line 2:
 
==Get to the powershell prompt and download the csv==
 
==Get to the powershell prompt and download the csv==
  
  (new-object system.net.webclient).downloadfile("http://researchbriefings.files.parliament.uk/documents/CBP-7186/hocl-ge2015-results-full.csv","ge2015.csv")
+
  Invoke-WebRequest http://researchbriefings.files.parliament.uk/documents/CBP-7979/hocl-ge2017-results-full.csv -OutFile ge2017.csv
  
 
+
==Go into sqlcmd or SSMS and create a table for the results==
==Go into mysql and create a table for the results==
 
 
Here you create a single flat table that can store all of the unnormalised data.
 
Here you create a single flat table that can store all of the unnormalised data.
  
 
The first line of the CSV file contains the column headings (ons_id,ons_region_id,constituency_name,county_name,region_name,country_name,constituency_type,party_name,party_abbreviation,firstname,surname,gender,sitting_mp,former_mp,votes,share,change). We create a table with a column for each of these:
 
The first line of the CSV file contains the column headings (ons_id,ons_region_id,constituency_name,county_name,region_name,country_name,constituency_type,party_name,party_abbreviation,firstname,surname,gender,sitting_mp,former_mp,votes,share,change). We create a table with a column for each of these:
  mysql -u 40000036 -ptiger 40000036
+
  sqlcmd -S .\sqlexpress -E
 +
We do not know the size of each column yet, too small and the import statement will give errors or warnings. It is best to be generous - memory is cheap.
 +
 
 
The create table statement could be:
 
The create table statement could be:
 
  CREATE TABLE ge(
 
  CREATE TABLE ge(
Line 27: Line 28:
 
   former_mp VARCHAR(3),
 
   former_mp VARCHAR(3),
 
   votes INT,
 
   votes INT,
   `share` FLOAT,
+
   share FLOAT,
   `change` VARCHAR(20),
+
   change VARCHAR(20),
 
   PRIMARY KEY(ons_id,firstname,surname)
 
   PRIMARY KEY(ons_id,firstname,surname)
  );
+
  )
==Import the csv file into your flat table==
+
 
You can import the data using this line
+
==Attempt import the csv file into your flat table==
  LOAD DATA INFILE '/tmp/hocl-ge2015-results-full.csv' INTO TABLE ge COLUMNS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\r\n' IGNORE 1 LINES;
+
You can import the data using this line (from SQLCMD or SQL Server Management Studio)
The above line is complicated but it instructs MySQL about the format of the file.
+
  BULK INSERT ge FROM 'C:\db\ge2017.csv' WITH (FIELDTERMINATOR=',', ROWTERMINATOR='\n', FIRSTROW=2)
*CSV stands for "comma separated values", each field is terminated by a comma
+
GO
*Each line is ended with the newline, carriage return \r\n
+
*Each line is ended with carriage return \n
*Sometimes a value includes a comma in such cases the field is enclosed by quotes.
+
*The first row contains column headings not data so we start at row 2
**The constituency "Ayr, Carrick and Cumnock" is an example.
+
You will most likely get error messages like this:
*The first row contains column headings not data and so we "IGNORE 1 LINES"
+
 
 +
Msg 4863, Level 16, State 1, Server ME1C039-130368\SQLEXPRESS, Line 3
 +
Bulk load data conversion error (truncation) for row 105, column 13 (sitting_mp).
 +
Msg 4863, Level 16, State 1, Server ME1C039-130368\SQLEXPRESS, Line 3
 +
Bulk load data conversion error (truncation) for row 106, column 12 (gender).
 +
Msg 4863, Level 16, State 1, Server ME1C039-130368\SQLEXPRESS, Line 3
 +
Bulk load data conversion error (truncation) for row 107, column 13 (sitting_mp).
 +
...
 +
Unfortunately BULK INSERT cannot deal with the CSV format so we need to pre-process the file. Some solutions:
 +
*Use Excel - you can load CSV and save as TXT
 +
*Use regular expressions in powershell: [[convert CSV to TXT]]
 +
*Use python
 +
*Find a converter online
 +
With the conversion complete you can import the data - you will have too use the full path name and you may have permissions problems
 +
BULK INSERT ge FROM 'c:\path\ge2017.txt' WITH (FIRSTROW=2)
 +
You may get a few errors - it is probably easiest to fix these "by hand" using Excel or a text editor. The BULK INSERT command should show you the line numbers. Check that you have the correct number of rows - there should be 3304 data rows.
 +
SELECT COUNT(*) FROM ge
 +
 
 
==Run some queries==
 
==Run some queries==
 
Now let's look at some data.
 
Now let's look at some data.
Line 46: Line 64:
 
Who stood in Edinburgh South?
 
Who stood in Edinburgh South?
 
  select surname,votes from ge where constituency_name='Edinburgh South';
 
  select surname,votes from ge where constituency_name='Edinburgh South';
 +
 +
==Who won the election?==
 +
The important question is which party got the most seats. Which party leader should the queen invite to form a government?
 +
We can use the [[RANK]] function for this but we prefer to use a slower, simpler solution:
 +
 +
First calculate the max votes per constituency:
 +
SELECT ons_id,MAX(votes) AS mv
 +
  FROM ge
 +
GROUP BY ons_id
 +
 +
Now JOIN that to the original table to find the party that won each seat:
 +
SELECT ge.ons_id, ge.party_name
 +
  FROM ge JOIN
 +
        (SELECT ons_id,MAX(votes) AS mv
 +
          FROM ge
 +
        GROUP BY ons_id) AS ms ON ge.ons_id=ms.ons_id AND ge.votes=ms.mv
 +
We can now count the seats by party:
 +
SELECT party_name,COUNT(1) TotSeat
 +
  FROM (
 +
    SELECT ge.ons_id, ge.party_name
 +
      FROM ge JOIN
 +
          (SELECT ons_id,MAX(votes) AS mv
 +
              FROM ge
 +
            GROUP BY ons_id) AS ms ON ge.ons_id=ms.ons_id AND ge.votes=ms.mv
 +
        ) AS mp
 +
GROUP BY party_name
  
 
You can now move to the next stage: [[2015 UK General Election Normalising Data]]
 
You can now move to the next stage: [[2015 UK General Election Normalising Data]]

Latest revision as of 08:33, 5 July 2017

This tutorial assumes you have access to powershell and sqlcmd or Microsoft SQL Server Management Studio

Get to the powershell prompt and download the csv

Invoke-WebRequest http://researchbriefings.files.parliament.uk/documents/CBP-7979/hocl-ge2017-results-full.csv -OutFile ge2017.csv

Go into sqlcmd or SSMS and create a table for the results

Here you create a single flat table that can store all of the unnormalised data.

The first line of the CSV file contains the column headings (ons_id,ons_region_id,constituency_name,county_name,region_name,country_name,constituency_type,party_name,party_abbreviation,firstname,surname,gender,sitting_mp,former_mp,votes,share,change). We create a table with a column for each of these:

sqlcmd -S .\sqlexpress -E

We do not know the size of each column yet, too small and the import statement will give errors or warnings. It is best to be generous - memory is cheap.

The create table statement could be:

CREATE TABLE ge(
  ons_id VARCHAR(10),
  ons_region_id VARCHAR(10),
  constituency_name VARCHAR(50),
  county_name VARCHAR(50),
  region_name VARCHAR(50),
  country_name VARCHAR(50),
  constituency_type VARCHAR(10),
  party_name VARCHAR(50),
  party_abbreviation VARCHAR(50),
  firstname VARCHAR(50),
  surname VARCHAR(50),
  gender VARCHAR(6),
  sitting_mp VARCHAR(3),
  former_mp VARCHAR(3),
  votes INT,
  share FLOAT,
  change VARCHAR(20),
  PRIMARY KEY(ons_id,firstname,surname)
)

Attempt import the csv file into your flat table

You can import the data using this line (from SQLCMD or SQL Server Management Studio)

BULK INSERT ge FROM 'C:\db\ge2017.csv' WITH (FIELDTERMINATOR=',', ROWTERMINATOR='\n', FIRSTROW=2)
GO
  • Each line is ended with carriage return \n
  • The first row contains column headings not data so we start at row 2

You will most likely get error messages like this:

Msg 4863, Level 16, State 1, Server ME1C039-130368\SQLEXPRESS, Line 3
Bulk load data conversion error (truncation) for row 105, column 13 (sitting_mp).
Msg 4863, Level 16, State 1, Server ME1C039-130368\SQLEXPRESS, Line 3
Bulk load data conversion error (truncation) for row 106, column 12 (gender).
Msg 4863, Level 16, State 1, Server ME1C039-130368\SQLEXPRESS, Line 3
Bulk load data conversion error (truncation) for row 107, column 13 (sitting_mp).
...

Unfortunately BULK INSERT cannot deal with the CSV format so we need to pre-process the file. Some solutions:

  • Use Excel - you can load CSV and save as TXT
  • Use regular expressions in powershell: convert CSV to TXT
  • Use python
  • Find a converter online

With the conversion complete you can import the data - you will have too use the full path name and you may have permissions problems

BULK INSERT ge FROM 'c:\path\ge2017.txt' WITH (FIRSTROW=2)

You may get a few errors - it is probably easiest to fix these "by hand" using Excel or a text editor. The BULK INSERT command should show you the line numbers. Check that you have the correct number of rows - there should be 3304 data rows.

SELECT COUNT(*) FROM ge

Run some queries

Now let's look at some data. How many female candidates were there?

select count(1) from ge where gender='female';

Who stood in Edinburgh South?

select surname,votes from ge where constituency_name='Edinburgh South';

Who won the election?

The important question is which party got the most seats. Which party leader should the queen invite to form a government? We can use the RANK function for this but we prefer to use a slower, simpler solution:

First calculate the max votes per constituency:

SELECT ons_id,MAX(votes) AS mv
  FROM ge
GROUP BY ons_id

Now JOIN that to the original table to find the party that won each seat:

SELECT ge.ons_id, ge.party_name
  FROM ge JOIN
       (SELECT ons_id,MAX(votes) AS mv
          FROM ge
        GROUP BY ons_id) AS ms ON ge.ons_id=ms.ons_id AND ge.votes=ms.mv

We can now count the seats by party:

SELECT party_name,COUNT(1) TotSeat
  FROM (
    SELECT ge.ons_id, ge.party_name
      FROM ge JOIN
          (SELECT ons_id,MAX(votes) AS mv
             FROM ge
           GROUP BY ons_id) AS ms ON ge.ons_id=ms.ons_id AND ge.votes=ms.mv
       ) AS mp
GROUP BY party_name

You can now move to the next stage: 2015 UK General Election Normalising Data