Previously, we set up a local Postgres development environment with Docker. We also set up a couple of client tools to connect to it and we also ran some SQL to make sure everything was set up correctly.
In this article we are going to import the title.basics.tsv from the
IMDB Non-commercial data sets.
This data set is large enough to have a decent workload and interesting enough
to not be bored.
Download title.basics.tsv.gz, unzip it and have a look. You’ll see the is tab
separated and it contains all the film titles that are available on
IMDB.
What we’re implementing is:
- Create a schema with two identical tables. One table is only used during
import to stage that date (the
staging table) any data from the TSV file. The second table is the main one that contains all the current data (theproduction table). - Import data into a staging table
- Update any existing items in the production table
- Add any new items to the production table
The method described in this post is not the fastest in the world, but it shows the basic techniques and none of the possible optimizations.
Creating the schema
Run psql to connect to the database and run the following queries:
-- Create the database first
CREATE DATABASE imdbimport;
-- Switch to the database
\c imdbimport;
-- Create the production table
CREATE TABLE titles
(
tconst VARCHAR(10),
titleType VARCHAR(50),
primaryTitle VARCHAR(50),
originalTitles VARCHAR(50),
isAdult BOOLEAN,
startYear INT,
endYear INT,
runtimeMinutes INT,
genres VARCHAR(50)
)
-- Create the staging table. This table has the same schema in this case
CREATE TABLE staging_titles
(
tconst VARCHAR(10),
titleType VARCHAR(50),
primaryTitle VARCHAR(50),
originalTitles VARCHAR(50),
isAdult BOOLEAN,
startYear INT,
endYear INT,
runtimeMinutes INT,
genres VARCHAR(50)
)
For us SQL Server people:
- The
USEstatement does not do anything- It’s important the database name is lowercase here. otherwise you’d get a “Database does not exist”-error
Import data into the temporary table with the COPY command (and fail!)
The following command only works when the file is accessible by the Postgres server:
COPY Titles FROM '~/Downloads/title.basics.tsv'
WITH
DELIMITER E'\t'
CSV
HEADER
In our case this does not work because the Postgres server is running inside a Docker container, and it is not possible with the current setup to import the file this way.
Postgres gives us a helpful hint on how it can be done:
HINT: COPY FROM instructs the PostgreSQL server process to read a file. You may want a client-side facility such as psql's \copy.
Import data into the temporary table with psql (and succeed!)
Run psql to connect to the database and run the following commands:
Use the psql client command \copy to copy the data into the staging table:
\copy staging_titles from '~/Downloads/title.basics.tsv' (DELIMITER E'\t', HEADER, ON_ERROR ignore)
This works even though the Postgres server does not have access to the file
directly because the \copy command is a client command. The local psql
process does have access to the file, reads the file and blasts it into the
staging table.
In this example, the
ON_ERRORclause is set toignoreall import errors. In a production setting this should not be specified as it indicates a problem with the data source.
Merge the staging table into the main table
Once the data is in the staging table, the data in the production table can be
updated. One of the ways is to use the
MERGE command:
MERGE INTO titles t
USING staging_titles st
ON t.tconst = st.tconst
WHEN MATCHED THEN
UPDATE SET titleType = st.titleType,
primaryTitle = st.primaryTitle,
originalTitles = st.originalTitles,
isAdult = st.isAdult,
startYear = st.startYear,
endYear = st.endYear,
runtimeMinutes = st.runtimeMinutes,
genres = st.genres
WHEN NOT MATCHED THEN
INSERT (tconst, titleType, primaryTitle, originalTitles, isAdult, startYear, endYear, runtimeMinutes, genres)
VALUES (st.tconst, st.titleType, st.primaryTitle, st.originalTitles, st.isAdult, st.startYear, st.endYear, st.runtimeMinutes, st.genres)
The MERGE command above is your one stop shop for INSERT, UPDATE and
DELETE statements when copying data from one place to the other. The command
tries to match the records on the tconst column values. If a value is matched,
the column values in the record from the titles table are updated with the
values for the record from the staging_table. If a record is not matched, then
the record is inserted.
The MERGE command has a lot more options (like DELETE or specify additional
criteria to match records), but those go beyond the scope of this post. For
advanced uses of the command, please check the
docs
Things to consider
- From a maintenance perspective it is a good idea to keep some kind of ledger that keeps track of what has been imported, how many rows have been inserted and how many have been updated.
