Analyze Large CSV Files in Minutes With Docker Compose


I recently had to compare two huge CSV files. Instead of installing a database, I used Docker.

By Stephan Schulze

Imagine you have to analyze a massive CSV file. What do you do when Excel or similar tools have a nervous breakdown over the sheer file size? With small files, investigating or comparing CSV files is a walk in the park. With huge files, it’s a pain, if not impossible.

Recently, I had to compare two CSV files with different information but a common identifier, each with ~500k entries.

My developer’s heart said: Shouldn’t I use a database?

My head said: Installing a full DB just for that use case?

I mean, I just wanted to check a few small things and then get rid of it again.

The solution — Docker

If you need to set up a minimal, temporary infrastructure quickly, Docker might be your friend. To get started, pull the latest database image (Postgres, in my case), start a container from it, and you are ready to go.

Part 1 — Install Docker

Download and install Docker (Compose is built in).

Save the following snippet as docker-compose.yml.

# Use postgres/example user/password credentials
version: '3.1'
services:
  db:
    image: postgres
    container_name: postgres-test
    environment:
      POSTGRES_PASSWORD: example
    ports:
      # host port 15432 -> container port 5432
      - "15432:5432"

Part 2 — Start the Container

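Run the command docker-compose up in the folder that contains docker-compose.yml (with newer Docker versions, the equivalent command is docker compose up).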

On the first run, Docker downloads the Postgres image:

Screenshot: A CLI log showing Docker downloading the image

Once the download has finished, the container starts and the database is running:

Screenshot: A CLI log showing the Docker container started
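If you would rather not rely on reading the log output, a quick check like the following can tell you whether something is listening on the mapped port. This is only a minimal sketch: port_is_open is a hypothetical helper name, and it assumes the container runs on your local machine with the port 15432 from docker-compose.yml. An open port does not prove that Postgres has finished initializing, only that the port mapping is in place.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host not reachable.
        return False


if __name__ == "__main__":
    # 15432 is the host port mapped in docker-compose.yml (assumed to run locally).
    print("Postgres port reachable:", port_is_open("localhost", 15432))
```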

Part 3 — Access the Database and Load Your CSV

I recommend DBeaver, a free database client, for this task.

Create a new connection to your dockerized database.

  • Important: Use the port number defined in docker-compose.yml: 15432
  • The username is: postgres
  • The password is (see docker-compose.yml): example

Screenshot: Connection settings of PostgreSQL in DBeaver
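For tools other than DBeaver, the same settings can be combined into a single connection URL. The sketch below assumes the host is localhost (the container runs on your own machine) and the database name is postgres (the default of the official image when no other name is configured). postgres_url is a hypothetical helper, not part of any library.

```python
from urllib.parse import quote


def postgres_url(user: str, password: str, host: str, port: int, database: str) -> str:
    """Build a PostgreSQL connection URL, percent-encoding special characters."""
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{quote(database, safe='')}"
    )


# Values taken from docker-compose.yml; host and database name are assumptions.
url = postgres_url("postgres", "example", "localhost", 15432, "postgres")
print(url)            # postgresql://postgres:example@localhost:15432/postgres
print("jdbc:" + f"postgresql://localhost:15432/postgres")  # JDBC-style URL, as used by Java clients
```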

Part 4 — Import Your CSV Files

DBeaver offers a handy connection type called “Flat files”.

Create one by pointing to the folder that contains your CSV files.

Screenshot: New “Flat files CSV Connection” in DBeaver

In the DBeaver UI, you can see the connection and all CSV files listed under “Tables”. It should look similar to this:

Screenshot: DBeaver UI listing the connection and CSV files under "Tables"

To import these files into your Docker database:

  1. Create a new table without any columns in the “public” schema, and save it so that it is actually persisted in the database.
  2. Right-click on that table and choose “Import Table Data”.
  3. Select the “Flat files” connection and choose your file(s) from there.
  4. In the next step, adjust the column mapping if necessary.
  5. Finish the import.
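If you prefer a scripted route over the DBeaver wizard, one possible alternative is to generate the SQL yourself and run it with the psql command-line client. This is not the approach described above, just a sketch under a few assumptions: psql is available, every column is imported as TEXT, and the table name customers, the file name customers.csv, and the column names are made up for illustration.

```python
import csv
import io


def quote_ident(name: str) -> str:
    """Quote a SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def read_header(csv_file) -> list[str]:
    """Return the first row of an open CSV file as the list of column names."""
    return next(csv.reader(csv_file))


def create_table_sql(table: str, header: list[str]) -> str:
    """Build a CREATE TABLE statement with one TEXT column per CSV column."""
    cols = ",\n  ".join(f"{quote_ident(col)} TEXT" for col in header)
    return f"CREATE TABLE {quote_ident(table)} (\n  {cols}\n);"


def copy_command(table: str, csv_path: str) -> str:
    """Build a psql \\copy command; the path must not contain single quotes."""
    return f"\\copy {quote_ident(table)} FROM '{csv_path}' WITH (FORMAT csv, HEADER true)"


# Hypothetical sample data standing in for the first lines of a real file.
sample = io.StringIO("customer_id,name,city\n1,Ada,Berlin\n")
header = read_header(sample)
print(create_table_sql("customers", header))
print(copy_command("customers", "customers.csv"))
```

Importing everything as TEXT keeps the import forgiving; values can still be cast to numbers or dates later in the queries.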

That’s it. Enjoy analyzing your data!
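As a starting point, comparing two files that share a common identifier usually comes down to two kinds of joins. The sketch below uses Python's built-in sqlite3 module with an in-memory database purely as a stand-in, so that it runs without the container. The table names file_a and file_b, the columns, and the rows are invented for illustration. The two SELECT statements use only standard SQL and should work unchanged when pasted into a DBeaver SQL editor connected to Postgres.

```python
import sqlite3

# In-memory SQLite as a stand-in for the Postgres container (assumption for runnability).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE file_a (id TEXT, name TEXT);
    CREATE TABLE file_b (id TEXT, city TEXT);
    INSERT INTO file_a VALUES ('1', 'Ada'), ('2', 'Bob'), ('3', 'Cy');
    INSERT INTO file_b VALUES ('2', 'Berlin'), ('3', 'Paris'), ('4', 'Rome');
""")

# Identifiers that appear in file_a but have no match in file_b.
ONLY_IN_A = """
    SELECT a.id
    FROM file_a AS a
    LEFT JOIN file_b AS b ON b.id = a.id
    WHERE b.id IS NULL
    ORDER BY a.id
"""

# Rows present in both files, with the information from each side combined.
COMBINED = """
    SELECT a.id, a.name, b.city
    FROM file_a AS a
    JOIN file_b AS b ON b.id = a.id
    ORDER BY a.id
"""

only_in_a = conn.execute(ONLY_IN_A).fetchall()
combined = conn.execute(COMBINED).fetchall()
print(only_in_a)
print(combined)
```

Swapping file_a and file_b in the first query gives the identifiers that exist only in the second file.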