CSV2SQL
The CSV (“Comma Separated Values”) file format is often used to exchange data between dissimilar applications. For a programmer, parsing CSV files is not a rare task. If a CSV file contains no more than a hundred lines, it's no problem to parse it with your language's built-in tools or libraries. But when you are using a very high-level language and the file size grows, you need something more powerful.
For example, we have a file file.csv that is 496M in size. Can you guess how long it will take to parse this file on my MacBookPro11,2 with the default Ruby csv library?
Oops, first some info about the file:
$ wc -l file.csv
28897456 file.csv
$ head -n 2 file.csv
1,2,3,4,5,6,7,8,9
1,2,3,4,5,6,7,8,9
OK, now let's try to simply read it with Ruby:
require 'csv'
CSV.read("file.csv")
$ time ruby script.rb
ruby script.rb 1834,16s user 2502,81s system 87% cpu 1:22:57,79 total
1:22:57? Not good. But if you think about the CSV format, you'll notice that it's very similar to a relational database table with columns. Could PostgreSQL help us here?
Let's write a script that creates a temp table and saves each record from the file into that table. We will use the PostgreSQL COPY command to parse the CSV file.
CREATE TEMP TABLE digits_table (a1 int, a2 int, a3 int, a4 int, a5 int, a6 int, a7 int, a8 int, a9 int);
COPY digits_table FROM '/path/to/file.csv' WITH DELIMITER ',';
$ time psql -U postgres -d database -a -f script.sql
CREATE TEMP TABLE digits_table (a1 int, a2 int, a3 int, a4 int, a5 int, a6 int, a7 int, a8 int, a9 int);
CREATE TABLE
COPY digits_table FROM '/path/to/file.csv' WITH DELIMITER ',';
COPY 28897456
psql -Upostgres -dengine_temp -a -f script.sql 0,01s user 0,01s system 0% cpu 44,399 total
Just 44 seconds and we have all this stuff in the DB. Pretty cool. But how can you use it for real-world problems? Let's assume you have a list of abbreviations and a list of doctors, and you need to save all of it into a table that contains each doctor's full name and the doctor's high school. The users table has the following structure:
   Column    | Type | Modifiers
-------------+------+-----------
 full_name   | text |
 high_school | text |
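If you want to follow along, a minimal definition of such a table might look like this (the exact DDL is my assumption; only the column names and types come from the structure above):
CREATE TABLE users (
    full_name   text,
    high_school text
);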
We have a file with abbreviations (I used this) that is separated with tabs and looks like:
$ head -n 3 list_of_abbrs.csv
AA Alcoholics Anonymous
AABB AABB, formerly known as the American Association of Blood Banks
AACN American Association of Critical-Care Nurses
And a list of doctors that looks like:
$ head -n 3 doctors.csv
Christophe,Bartell,AAP
Aaliyah,Altenwerth,AAOS
Albert,Collins,AAO
Let's put it all together. PostgreSQL uses a tab as the default delimiter for the text format, which is the default format of the COPY command, so the abbreviations file can be loaded as is. Let's start by parsing the list of abbreviations:
# CREATE TEMP TABLE abbreviations (abbreviation text, organisation_or_personnel text);
# COPY abbreviations FROM '/path/to/file/list_of_abbrs.csv';
COPY 57
# SELECT * FROM abbreviations LIMIT 5;
 abbreviation |                    organisation_or_personnel
--------------+-----------------------------------------------------------------
 AA           | Alcoholics Anonymous
 AABB         | AABB, formerly known as the American Association of Blood Banks
 AACN         | American Association of Critical-Care Nurses
 AAD          | American Association of Dermatology
 AADGP        | American Academy of Dental Group Practice
In PostgreSQL a comma is the default delimiter for the csv format, so we can replace WITH DELIMITER ',' with WITH (FORMAT "csv"). Then let's retrieve the doctors from their file:
# CREATE TEMP TABLE doctors (first_name text, second_name text, high_school text);
# COPY doctors FROM '/path/to/file/doctors.csv' WITH (FORMAT "csv");
# SELECT * FROM doctors;
 first_name | second_name | high_school
------------+-------------+-------------
 Christophe | Bartell     | AAP
 Aaliyah    | Altenwerth  | AAOS
 Albert     | Collins     | AAO
 Alice      | Lind        | AAN
And let's try to write a query that retrieves the data we need:
# SELECT "doctors"."first_name" || ' ' || "doctors"."second_name" as "full_name",
# "abbreviations"."organisation_or_personnel" as "high_school"
# FROM doctors
# INNER JOIN "abbreviations" ON "abbreviations"."abbreviation" = "doctors"."high_school";
     full_name      |               high_school
--------------------+------------------------------------------
 Alice Lind         | American Academy of Nursing
 Albert Collins     | American Academy of Ophthalmology
 Albert Collins     | American Academy of Optometry
 Aaliyah Altenwerth | American Academy of Orthopaedic Surgeons
 Christophe Bartell | American Academy of Pediatrics
Oh no, there can be more than one definition of an abbreviation. In my case I will just keep one definition when there is more than one possible variant. DISTINCT ON will help (strictly speaking, you should also add an ORDER BY if you care which row is kept, since without it PostgreSQL is free to keep any of the matching rows):
# SELECT DISTINCT ON("doctors"."high_school")
# "doctors"."first_name" || ' ' || "doctors"."second_name" as "full_name",
# "abbreviations"."organisation_or_personnel" as "high_school"
# FROM doctors
# INNER JOIN "abbreviations" ON "abbreviations"."abbreviation" = "doctors"."high_school";
     full_name      |               high_school
--------------------+------------------------------------------
 Alice Lind         | American Academy of Nursing
 Albert Collins     | American Academy of Ophthalmology
 Aaliyah Altenwerth | American Academy of Orthopaedic Surgeons
 Christophe Bartell | American Academy of Pediatrics
Much better. Now you just need to add an INSERT to put all this data into the users table, as sketched below.
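Here is a rough sketch of that final step; it assumes the users table defined earlier already exists and simply wraps the query above:
-- insert the joined, de-duplicated rows into users
INSERT INTO users (full_name, high_school)
SELECT DISTINCT ON ("doctors"."high_school")
       "doctors"."first_name" || ' ' || "doctors"."second_name",
       "abbreviations"."organisation_or_personnel"
FROM doctors
INNER JOIN "abbreviations" ON "abbreviations"."abbreviation" = "doctors"."high_school";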
For a more complex example you can combine all of this with Ruby's ActiveRecord, and it will save a lot of time. But you need to understand that the time spent on coding should be commensurate with the time you save by using it. If the pure Ruby version runs for an hour and you only need to parse one document per week, all this overhead hardly seems reasonable. This XKCD comic can be used as a landmark and should help you decide.