Convert CSV to TXT

From SQLZOO
Jump to navigation Jump to search

Identifying the problem rows

You can look at the data with a text editor or use powershell. We load all the raw text into $ge1 - we plan to fix the data stage by stage.

$ge = get-content "$pwd\ge2015.csv"
$ge[104]

Which shows

S14000006,S92000003,"Ayr, Carrick and Cumnock",Scotland,Scotland,Scotland,County,Scottish National Party,SNP,Corri,Wilson,Female,No,No,25492,0.488268306,0.307935794

(The error was reported on 105 but because we skipped row 1 the actual first problem is on 104) Observe that the constituency name has a comma in it - BULK INSERT is interpreting this as a field delimiter.

Fixing the problem rows

The best way to fix this is to use someone else's code to deal with CSV - but if we are careful you can do this yourself with regular expressions. Later versions of SQL Server allow text delimiters to be defined, python has a csv module.

You can use replace - we can replace the , inside a double quote using a regular expression:

$ge[104] -replace ',"(.*),(.*)"', ',"$1|$2"'

shows this

S14000006,S92000003,"Ayr| Carrick and Cumnock",Scotland,Scotland,Scotland,County,Scottish National Party,SNP,Corri,Wilson,Female,No,No,25492,0.488268306,0.307935794

You can apply this to the whole file using

$ge = $ge -replace ',"(.[^"]),(.[^"])"', ',"$1|$2"'

Then write the file out to ge2.csv

$ge > "$pwd\ge2.csv"

You can try loading this into the table ge again - but take care to delete any rows you created last time.

sqlcmd -S .\sqlexpress -E -d gisq -Q "BULK INSERT ge FROM '$pwd\ge2.csv' WITH (FIELDTERMINATOR=',', ROWTERMINATOR='\n', FIRSTROW=2)"

More problems

You will probably find further problems:

  • there is a constituency with two commas in it
  • your data still includes double quote marks that should be replaced
  • you have | marks in your data that should be replaced with commas

You can deal with these problems either in powershell, a capable text editor, excel or in SQL

Here is a solution in powershell:

$ge1 = get-content "$pwd\ge2015.csv"
$ge1 = $ge1 -replace ',"([^"]*),([^"]*)"', ',"$1|$2"'
$ge1 = $ge1 -replace ',"([^"]*),([^"]*)"', ',"$1|$2"'
$ge1 = $ge1 -replace '"', ""
$ge1 = $ge1 -replace ",", "`t"
$ge1 > "$pwd\ge2.csv"
DataWars, Data Science Practice Projects - LogoDataWars: Practice Data Science/Analysis with +100 Real Life Projects