Import data into a database (2/2)

Import data into a local development environment with PostgreSQL

Previously, we set up a local Postgres development environment with Docker, installed a couple of client tools to connect to it, and ran some SQL to make sure everything was set up correctly.

In this article we are going to import title.basics.tsv from the IMDB Non-commercial data sets. This data set is large enough to provide a decent workload and interesting enough not to be boring.

Download title.basics.tsv.gz, unzip it and have a look. You’ll see the file is tab-separated and contains all the film titles that are available on IMDB.

What we’re implementing is:

  • Create a schema with two identical tables. One table is used only during import, to stage the data from the TSV file (the staging table). The second is the main table that contains all the current data (the production table).
  • Import data into a staging table
  • Update any existing items in the production table
  • Add any new items to the production table

The method described in this post is not the fastest in the world, but it shows the basic technique without any of the possible optimizations.

Creating the schema

Run psql to connect to the database and run the following queries:


-- Create the database first
CREATE DATABASE imdbimport;

-- Switch to the database
\c imdbimport

-- Create the production table
CREATE TABLE titles
(
    tconst VARCHAR(10),
    titleType VARCHAR(50),
    primaryTitle VARCHAR(50),
    originalTitles VARCHAR(50),
    isAdult BOOLEAN,
    startYear INT,
    endYear INT,
    runtimeMinutes INT,
    genres VARCHAR(50)
);

-- Create the staging table. This table has the same schema in this case
CREATE TABLE staging_titles
(
    tconst VARCHAR(10),
    titleType VARCHAR(50),
    primaryTitle VARCHAR(50),
    originalTitles VARCHAR(50),
    isAdult BOOLEAN,
    startYear INT,
    endYear INT,
    runtimeMinutes INT,
    genres VARCHAR(50)
);
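Neither table defines a key yet. If you want the merge step later in this post to match rows quickly, a primary key on tconst is an easy, optional addition (not part of the original setup):

```sql
-- Optional: tconst uniquely identifies a title in the IMDB data,
-- so it can serve as the primary key; this also gives the merge
-- step an index to match rows on.
ALTER TABLE titles ADD PRIMARY KEY (tconst);
```

The staging table is deliberately left without indexes, so the bulk load into it stays fast.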

For us SQL Server people:

  • Postgres has no USE statement; switch databases with psql’s \c meta-command instead
  • It’s important that the database name is lowercase here, otherwise you’d get a “database does not exist” error
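The lowercase requirement comes from how Postgres handles identifiers; the following commented sketch illustrates the behavior:

```sql
-- Unquoted identifiers are folded to lowercase, so this statement
-- would create a database named "imdbimport", not "IMDBImport":
--   CREATE DATABASE IMDBImport;
-- psql's \c meta-command takes the name literally, so this fails:
--   \c IMDBImport   -- FATAL: database "IMDBImport" does not exist
-- while the all-lowercase name connects fine:
--   \c imdbimport
```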

Import data into the temporary table with the COPY command (and fail!)

The following command only works when the file is accessible by the Postgres server:

COPY titles FROM '~/Downloads/title.basics.tsv'
WITH
    DELIMITER E'\t'
    CSV
    HEADER;

In our case this does not work because the Postgres server is running inside a Docker container, and it is not possible with the current setup to import the file this way.

Postgres gives us a helpful hint on how it can be done:

HINT: COPY FROM instructs the PostgreSQL server process to read a file. You may want a client-side facility such as psql's \copy.

Import data into the temporary table with psql (and succeed!)

Run psql to connect to the database and use the client command \copy to copy the data into the staging table:


\copy staging_titles from '~/Downloads/title.basics.tsv' (DELIMITER E'\t', HEADER, ON_ERROR ignore)

This works even though the Postgres server does not have access to the file, because \copy is a client-side command: the local psql process does have access to the file, reads it and blasts its contents into the staging table.

In this example, the ON_ERROR clause (available since PostgreSQL 17) is set to ignore, which skips any rows that cannot be imported. In a production setting this should not be specified, as import errors indicate a problem with the data source.
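Before merging, it is worth a quick sanity check that the staging table actually received data; for example:

```sql
-- How many rows made it into the staging table?
SELECT COUNT(*) FROM staging_titles;

-- Eyeball a few rows to verify the columns line up
SELECT tconst, titleType, primaryTitle, startYear
FROM staging_titles
LIMIT 5;
```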

Merge the staging table into the main table

Once the data is in the staging table, the data in the production table can be updated. One of the ways is to use the MERGE command:


MERGE INTO titles t
USING staging_titles st
ON t.tconst = st.tconst
WHEN MATCHED THEN
    UPDATE SET titleType = st.titleType,
        primaryTitle = st.primaryTitle,
        originalTitles = st.originalTitles,
        isAdult = st.isAdult,
        startYear = st.startYear,
        endYear = st.endYear,
        runtimeMinutes = st.runtimeMinutes,
        genres = st.genres
WHEN NOT MATCHED THEN
    INSERT (tconst, titleType, primaryTitle, originalTitles, isAdult, startYear, endYear, runtimeMinutes, genres)
    VALUES (st.tconst, st.titleType, st.primaryTitle, st.originalTitles, st.isAdult, st.startYear, st.endYear, st.runtimeMinutes, st.genres);

The MERGE command above (available since PostgreSQL 15) is your one-stop shop for INSERT, UPDATE and DELETE statements when copying data from one place to another. The command tries to match records on the tconst column. When a record matches, the columns of the record in the titles table are updated with the values from the record in the staging_titles table. When a record does not match, it is inserted.

The MERGE command has a lot more options (such as DELETE actions or additional criteria for matching records), but those go beyond the scope of this post. For advanced uses of the command, please check the docs.
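As a taste of those extra options, here is a hypothetical variation (not part of the import above) that adds an extra match condition and a DELETE action:

```sql
-- Illustration only: delete matched rows that are flagged as adult,
-- update the title of the remaining matches, and insert anything new.
MERGE INTO titles t
USING staging_titles st
ON t.tconst = st.tconst
WHEN MATCHED AND st.isAdult THEN
    DELETE
WHEN MATCHED THEN
    UPDATE SET primaryTitle = st.primaryTitle
WHEN NOT MATCHED THEN
    INSERT (tconst, titleType, primaryTitle)
    VALUES (st.tconst, st.titleType, st.primaryTitle);
```

The WHEN clauses are evaluated in order, so the conditional DELETE takes precedence over the unconditional UPDATE for rows where both would match.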

Things to consider

  • From a maintenance perspective it is a good idea to keep some kind of ledger that tracks what has been imported, how many rows were inserted and how many were updated.
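A minimal sketch of such a ledger (table and column names are illustrative, not part of the setup above) could look like this:

```sql
-- Hypothetical import ledger: one row per import run
CREATE TABLE import_ledger
(
    id            BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    source_file   TEXT NOT NULL,
    imported_at   TIMESTAMPTZ NOT NULL DEFAULT now(),
    rows_staged   BIGINT,
    rows_inserted BIGINT,
    rows_updated  BIGINT
);
```

On PostgreSQL 17 and later, MERGE supports a RETURNING clause with the merge_action() function, which can be used to derive the inserted and updated counts for the ledger.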

Let's Connect!

Curious what I can do for your business? Reach out and let's discuss how I can help you achieve your goals.

Send me an email or find me on LinkedIn.