DuckDB for Ad-Hoc Analytics: Turning Thousands of CSVs into a Dashboard
Amazon Seller Central generates data at a pace that outstrips most analysts’ tooling. Traffic reports, conversion data, inventory health summaries, business performance exports, advertising campaign breakdowns – each delivered as a CSV or TSV file, accumulating over months and years into directories containing thousands of files. The standard response is to import everything into a database. But in reality, standing up, maintaining, and querying a PostgreSQL instance for what is fundamentally a read-heavy, schema-variable analytics workload introduces more operational overhead than the problem itself warrants.
We built a reporting pipeline for an Amazon seller operating across five European marketplaces. The raw data: six years of Seller Central exports, 4,200 files, covering business reports, advertising performance, inventory snapshots, and FBA fee ledgers. The initial approach – a PostgreSQL schema with ETL scripts – took three weeks to stabilise and required constant attention as Amazon revised column names and report structures across API versions. The queries were fast. The maintenance was not.
DuckDB changes that calculus. It is an embedded analytical database – it runs in-process alongside your analysis code, requires no server, no daemon, no connection strings, and reads directly from files on disk. The query interface is standard SQL. The performance characteristics are those of a columnar engine optimised for aggregation at scale. For a workload like this, DuckDB is not just adequate – it is the correct tool.
What DuckDB actually is
The distinction that matters most is columnar storage versus row storage. PostgreSQL organises data in rows: every field for a given record is stored contiguously on disk. When you run SELECT revenue, units_sold FROM orders WHERE marketplace = 'DE', the database reads every field for every row, discards the ones you did not request, and returns the two you did. This is efficient for transactional workloads – inserting, updating, and fetching individual records. It is largely unsuited to analytical queries that aggregate across millions of rows while touching only a handful of columns.
DuckDB organises data by column. Every value for revenue is stored together; every value for units_sold is stored together. The same query reads only the two column chunks it needs, skipping the rest entirely. On a dataset with forty columns, this is not a minor optimisation – it cuts I/O by roughly 95 percent for a query that touches two of them. Combined with vectorised execution – processing data in fixed-size batches of values (2,048 per vector in current DuckDB releases) that exploit SIMD CPU instructions, rather than one row at a time – DuckDB achieves query performance on analytical workloads that regularly exceeds what a well-tuned PostgreSQL instance can deliver on identical hardware.
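The I/O difference is easy to see in a toy sketch – plain Python, nothing like DuckDB's actual implementation – where the same table is held column-wise and a two-column query never touches the rest:

```python
# Toy illustration of row vs. columnar layout (not DuckDB internals).
# The same three-column table, stored two ways.
rows = [
    {"marketplace": "DE", "revenue": 120.0, "units_sold": 4},
    {"marketplace": "FR", "revenue": 80.0,  "units_sold": 2},
    {"marketplace": "DE", "revenue": 50.0,  "units_sold": 1},
]

# Columnar layout: one contiguous list per column.
columns = {
    "marketplace": [r["marketplace"] for r in rows],
    "revenue":     [r["revenue"] for r in rows],
    "units_sold":  [r["units_sold"] for r in rows],
}

# SELECT SUM(revenue) WHERE marketplace = 'DE' against the columnar layout
# reads exactly two columns; the row layout forces a scan of every field.
de_revenue = sum(
    rev
    for mkt, rev in zip(columns["marketplace"], columns["revenue"])
    if mkt == "DE"
)
print(de_revenue)  # 170.0
```

With forty columns instead of three, the row layout still drags every field through memory while the columnar query reads the same two lists.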
The embedded architecture matters equally. There is no network round-trip, no authentication overhead, no connection pool to manage. DuckDB opens a file or directory, builds a query plan, and executes. The binary is under 100MB. It runs on a laptop, inside a Lambda function, or within a Python script invoked from a cron job. For self-hosted analytical pipelines, the operational surface area is close to zero.
SQL directly on files
The feature that makes DuckDB practical for Seller Central data specifically is glob-based file scanning. A single SQL statement queries thousands of CSVs simultaneously:
SELECT
    date_trunc('month', "Date") AS month,
    SUM("Units Ordered") AS units,
    SUM("Ordered Product Sales") AS revenue
FROM read_csv_auto('/data/seller-central/business-reports/*.csv')
GROUP BY 1
ORDER BY 1;
read_csv_auto infers schema from file contents – it samples rows, detects column types, and handles different delimiters; with the union_by_name=true option, it also merges files whose column sets differ, aligning them by name. For Seller Central data, where Amazon has revised column names at least four times over the past three years, this matters. The query runs against whatever files exist in the directory at execution time, with no import step, no schema migration, and no stale cache to invalidate.
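For intuition, the merge behaviour can be emulated in a few lines of stdlib Python – a toy sketch, not DuckDB's code – aligning rows from files with different headers by column name, roughly what the union_by_name=true option does:

```python
import csv
import io

# Two exports with overlapping but unequal column sets, the way Amazon's
# report revisions produce them (toy data, stdlib only).
old_export = "Date,Units Ordered\n2023-01-01,5\n"
new_export = "Date,Units Ordered,Ordered Product Sales\n2024-01-01,3,42.5\n"

def read_union(*files):
    """Merge rows by column name; columns absent from a file become None."""
    all_rows, all_cols = [], []
    for f in files:
        reader = csv.DictReader(io.StringIO(f))
        for col in reader.fieldnames:
            if col not in all_cols:
                all_cols.append(col)
        all_rows.extend(reader)
    return [{c: row.get(c) for c in all_cols} for row in all_rows]

merged = read_union(old_export, new_export)
print(merged[0])  # {'Date': '2023-01-01', 'Units Ordered': '5', 'Ordered Product Sales': None}
```

DuckDB does the same alignment across thousands of files in one scan, with proper type inference on top.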
For TSV exports from the advertising console, the same approach applies with an explicit format hint:
FROM read_csv('/data/advertising/*.tsv', delim='\t', header=true)
This is not a convenience feature. It is a different mental model for working with file-based data: the file system becomes the database. Schema is inferred, not declared. Queries compose across file boundaries without any preprocessing step. For ad-hoc analysis, where the question changes faster than an ETL pipeline can be updated, this architecture is the correct default.
Performance characteristics
On our reference dataset – 4,200 files, approximately 2.8GB uncompressed – a full revenue aggregation grouped by marketplace, ASIN, and month completes in 4.1 seconds on a standard laptop (Apple M2, 16GB RAM). The equivalent PostgreSQL query, on imported and indexed data, completes in 2.8 seconds. But that comparison excludes the forty-minute initial import, the schema migrations required each time Amazon changes a column name, and the ongoing ETL maintenance cost. On total operational cost, DuckDB wins decisively for this workload.
If query performance becomes a bottleneck – typically at multi-terabyte scale – DuckDB reads Parquet natively, and converting existing CSV files to Parquet is a one-time operation:
COPY (FROM read_csv_auto('/data/**/*.csv'))
TO '/data/consolidated.parquet'
(FORMAT PARQUET);
Subsequent queries against the Parquet file run 8 to 12 times faster than against the original CSVs. Parquet stores column statistics – minimum and maximum values per row group – which DuckDB uses to skip entire sections of data without reading them. A filter on marketplace = 'DE' over a sorted Parquet file may read less than ten percent of the data on disk. This predicate pushdown is not available against flat CSV files.
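The pruning logic itself is simple enough to sketch in plain Python – a toy model of per-row-group min/max statistics, not the actual Parquet format:

```python
# Toy sketch of min/max pruning over sorted row groups (not real Parquet).
# Each group carries statistics; groups whose value range excludes the
# predicate are skipped without being "read" at all.
row_groups = [
    {"min": "AT", "max": "BE", "rows": [("AT", 10), ("BE", 20)]},
    {"min": "DE", "max": "DE", "rows": [("DE", 30), ("DE", 40)]},
    {"min": "ES", "max": "FR", "rows": [("ES", 50), ("FR", 60)]},
]

target = "DE"
groups_read = 0
total = 0
for group in row_groups:
    # Skip the whole group if the predicate cannot match its range.
    if not (group["min"] <= target <= group["max"]):
        continue
    groups_read += 1
    total += sum(value for mkt, value in group["rows"] if mkt == target)

print(groups_read, total)  # 1 70
```

One of three groups is actually scanned – the same mechanism that lets a filter over a sorted Parquet file read a fraction of the bytes on disk.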
Building a standalone dashboard with Chart.js
The output of DuckDB queries is data. The practical question is what to do with it. For internal reporting at small to medium Amazon operations, a standalone HTML file powered by Chart.js is often more useful than a hosted dashboard – no server to run, no credentials to manage, no Superset or Metabase instance to maintain.
The pipeline works in three steps. First, DuckDB exports query results to JSON:
COPY (
    SELECT
        strftime(date_trunc('week', "Date"), '%Y-%m-%d') AS week,
        SUM("Units Ordered") AS units,
        SUM("Ordered Product Sales") AS revenue
    FROM read_csv_auto('/data/**/*.csv')
    GROUP BY 1 ORDER BY 1
) TO 'weekly.json' (FORMAT JSON, ARRAY true);
Second, a Python script reads those JSON files and embeds the data directly into an HTML template referencing Chart.js via CDN. Third, the HTML file opens in a browser with no web server required.
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Seller Dashboard</title>
  <script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
</head>
<body>
  <canvas id="revenue"></canvas>
  <script>
    const rows = /* JSON_DATA_HERE */;
    new Chart(document.getElementById('revenue'), {
      type: 'line',
      data: {
        labels: rows.map(r => r.week),
        datasets: [{ label: 'Revenue', data: rows.map(r => r.revenue), tension: 0.3 }]
      }
    });
  </script>
</body>
</html>
The result is a self-contained file that can be shared via email, placed on a private S3 bucket, or opened directly from a local directory. For operational teams that need weekly reporting without the overhead of live dashboard infrastructure, this approach is pragmatic and maintainable. The dashboard regenerates by running a shell script: DuckDB queries the CSVs, Python builds the HTML, done. No infrastructure, no recurring costs, no maintenance surface.
When DuckDB beats PostgreSQL
DuckDB is not a replacement for PostgreSQL. The comparison is only meaningful when the workload is genuinely analytical – large scans, aggregations, group-bys – and the data is primarily read rather than written. PostgreSQL wins decisively on concurrent writes, referential integrity, row-level access control, and the full operational requirements of a production application. It is the correct choice for a multi-user system where records are inserted, updated, and queried by application code.
DuckDB wins when:
- The data already exists as files on disk, with no import step planned
- The schema is not fully known in advance, or changes frequently
- Queries are ad-hoc, with access patterns that cannot be predicted
- The operational overhead of a database server is not justified by project longevity or scale
Amazon Seller Central data satisfies all four criteria. The schema changes with every Seller Central release. The files already exist locally after export. The business questions change week to week depending on what is being optimised. And the alternative – a maintained PostgreSQL ETL pipeline – represents weeks of engineering time for a problem that DuckDB solves in an afternoon.
The practical rule: if you would ordinarily reach for Excel or a Jupyter notebook to analyse this data, reach for DuckDB instead. The SQL is more expressive, the performance is orders of magnitude better for anything beyond a few thousand rows, and the file-based architecture means your data stays where it already is.