Import data into a database (1/2)

Set up local development environment for PostgreSQL

Importing large datasets into a PostgreSQL database is a common task for developers and data engineers. In this two-part post, we’ll walk through one of the many approaches to import data into PostgreSQL. We’ll be using a local setup using Docker and client tools and import one of the IMDB Non-Commercial data downloads.

We’ll set up a development environment in this post. The second post we’ll and get some data to play with, stage and merge it.

Setting up the dev environment

In a dev environment it should be possible to have a short Code -> Run -> Check Results -> Repeat-feedback loop. It should also be possible to quickly reset the state - in this case a clean PostgreSQL server.

I like to use Docker to run a server locally for this reason. Restart the container and all databases are wiped again! Client tools like psql and pgcli I do usually install through brew. Usually, these do not need to be reset and can easily be upgraded/downgraded through brew.

To get started, make sure you have Docker desktop running and that you are logged in. This is necessary to be able to download and run images.

Run the following commands in a terminal to get the Postgres Docker image locally and start up a container:

# Get the image from the hub locally
docker pull postgres

# Use the image to start a container
docker run --name dev-postgres -e "POSTGRES_PASSWORD=Hello@W0rld" -p 5432:5432 -d postgres

This starts a container called dev-postgres running a Postgres server. On startup a user postgres was created with the password specified by POSTGRES_PASSWORD. Furthermore, the internal port 5432 has been exposed to the host machine so that Postgres is available on localhost:5432

Next up, the client tools

The following installs libpq. This gives you a whole bunch of client utilities without having to install the full Postgres:

# Installs psql among other things.
brew install libpq

# Links all binaries
# There is a caveat though: https://stackoverflow.com/a/49689589
brew link --force libpq

# Installs pgcli
brew install pgcli

Checking the setup

Once the Docker container is running and client tools are installed, we can connect to it:

pgcli --host localhost --port 5432 --user postgres

# or

psql --host=localhost --port=5432 --username=postgres

You’ll be asked for a password. This is the password that was specified above in the docker run command. In my case it’s Hello@W0rld.

This opens an interactive shell (aka REPL). It allows you to run SQL against the database. For example:

SELECT datname FROM pg_database

Shows a list of databases that you have access to. In my case there were three (postgres, template0, template1).

In case of an error, check the logs in the Docker dashboard. Postgres is surprisingly friendly in the error messages there.

Notes

  • If you do not like command line tools, the Postgres Docker Hub page has Docker compose snippet that runs an adminer container next to the Postgres container. This provides a web management interface to Postgres.
  • Always a good idea to have a peek in the documentation

Next up

Next step is to import some data, but we’ll do that in second part of this post!

Logo

Let's Connect!

Curious what I can do for your business? Reach out and let's discuss how I can help you achieve your goals.

Send me an email or find me on LinkedIn.