Building an Amazon Data Warehouse with FastAPI and TimescaleDB
Amazon's SP-API provides access to a substantial volume of operational data — orders, inventory, settlements, advertising, catalog performance. But data is only useful in a form you can query. Amazon's own reports are transient, rate-limited, and not designed for the operational questions that matter most to a seller managing a multi-marketplace business. You can see what happened. You cannot easily ask what is happening across all marketplaces at once, or what the demand trend was over the last ninety days for a specific ASIN in a specific fulfillment region.
This post describes the architecture we use for an Amazon data warehouse: FastAPI as the API layer, TimescaleDB as the time-series store, and Celery for the asynchronous SP-API report pipeline. The goal is a system where every piece of data available through Amazon's API is stored in a queryable form that you control, with enough structure to support demand forecasting and operational analytics without requiring a data engineering team to maintain it.
Why Amazon's Own Tools Are Insufficient
Seller Central reports cover most operational needs for a single marketplace with a small catalog. The gaps become apparent at scale. Reports are generated on demand and expire. There is no persistent query layer — you run a report, download a file, and the file is the data. Running the same report six months from now gives you the data as of that later date; today's snapshot cannot be recovered retroactively.
This means that any historical analysis requires either that someone downloaded and stored every relevant report at the right interval, or that you reconstruct history from transaction-level exports — which are themselves rate-limited, paginated over limited time windows, and structured for individual record retrieval rather than bulk analysis.
Amazon's Business Reports dashboard provides aggregate trend data, but the granularity and the retention window are fixed by Amazon's product decisions, not by your analytical needs. The advertising console has its own data model that does not join cleanly to the fulfillment data. Settlement data arrives on a two-week cycle and requires non-trivial parsing to reconcile against order data.
A custom warehouse solves the persistence and queryability problems simultaneously. You own the storage, you define the retention, and you write the queries that answer the questions your business actually needs to answer — not the questions Amazon's dashboards were designed to address.
Technology Choices
We chose TimescaleDB as the primary store because most Amazon operational data is inherently time-series. Orders have timestamps. Inventory snapshots are point-in-time. Settlement periods have start and end dates. Advertising spend and impressions accumulate over time. TimescaleDB is PostgreSQL with hypertable extensions that make time-range queries over large datasets substantially more efficient than standard PostgreSQL, while maintaining full SQL compatibility. There is no new query language to learn and no operational model that differs materially from a standard Postgres installation.
FastAPI provides the API layer between the Celery workers that pull data from SP-API and the internal consumers that query the warehouse. It handles authentication, request validation, and response serialization with minimal boilerplate. For an internal service that is not exposed to the public internet, FastAPI's automatic OpenAPI documentation is particularly useful: it provides a queryable interface definition that makes it straightforward to write new consumers without needing to reverse-engineer the data model from the database schema.
Celery manages the asynchronous SP-API report pipeline. SP-API report generation is asynchronous by design — you request a report, poll for its status, download the result when it is ready, and process the file. This maps naturally onto a Celery task chain: a task to initiate the report request, a periodic task to poll for report status, a task to download and store the raw file, and a task to parse and load the structured data into TimescaleDB. Redis serves as the Celery broker and result backend.
Data Model
The core tables reflect the primary SP-API data domains. The design principle is that each table should answer a specific class of operational question without requiring joins across multiple report types.
Orders is a hypertable partitioned on purchase_date. Each row represents a single order item rather than a single order, because item-level analysis — by ASIN, by fulfillment channel, by marketplace — is the most common analytical grain. Order-level aggregates are derived from item-level records. The hypertable partition interval is set to one week, which provides good query performance for the rolling window queries used in demand forecasting.
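Because the stored grain is the order item, order-level totals are derived at query time rather than stored. A minimal sketch of that roll-up, with illustrative field names rather than the actual schema:

```python
from collections import defaultdict

def order_totals(order_items):
    """Roll item-level rows up to order-level aggregates.

    Each row is a dict with amazon_order_id, quantity_ordered, and
    item_price_amount -- illustrative field names, not the real schema.
    """
    totals = defaultdict(lambda: {"units": 0, "revenue": 0.0})
    for item in order_items:
        agg = totals[item["amazon_order_id"]]
        agg["units"] += item["quantity_ordered"]
        agg["revenue"] += item["item_price_amount"]
    return dict(totals)

items = [
    {"amazon_order_id": "111-1", "quantity_ordered": 2, "item_price_amount": 19.98},
    {"amazon_order_id": "111-1", "quantity_ordered": 1, "item_price_amount": 5.49},
    {"amazon_order_id": "222-2", "quantity_ordered": 1, "item_price_amount": 12.00},
]
print(order_totals(items))
```

In the warehouse itself this aggregation happens in SQL, but the grain decision is the same: items are the stored unit, orders are a view over them.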
Inventory snapshots stores the output of the inventory reports at the time they are pulled. FBA inventory is not an event stream — it is a state that changes continuously and is not fully reconstructible from order records alone (returns, adjustments, and transfers affect inventory without corresponding orders). We pull inventory snapshots twice daily and store them as point-in-time records. TimescaleDB's continuous aggregate functionality computes rolling daily and weekly averages from these snapshots without requiring the query to scan the full snapshot history.
Settlements stores parsed settlement reports with each line item as a row. The settlement data model is one of the more complex parsing challenges in the SP-API ecosystem — Amazon uses different transaction type codes across marketplaces, and the fee line items use naming conventions that are not fully consistent across regions. We maintain a mapping table that normalizes fee type names across markets to a consistent internal taxonomy.
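The shape of the normalization mapping is simple even though populating it is not. A sketch, with example fee names rather than Amazon's actual codes:

```python
# Illustrative mapping from marketplace-specific settlement fee names to one
# internal taxonomy. The raw names below are examples, not Amazon's codes.
FEE_NORMALIZATION = {
    "FBAPerUnitFulfillmentFee": "fba_fulfillment",
    "FBAPerOrderFulfillmentFee": "fba_fulfillment",
    "Commission": "referral_fee",
    "RefundCommission": "referral_fee_refund",
}

def normalize_fee_type(raw_name: str) -> str:
    """Map a raw settlement fee name to the internal taxonomy.

    Unknown names are flagged rather than silently passed through, so new
    marketplace-specific codes surface during review instead of vanishing
    into an "other" bucket.
    """
    return FEE_NORMALIZATION.get(raw_name, f"unmapped:{raw_name}")
```

In production the mapping lives in a database table rather than a constant, so new fee codes can be added without a deploy.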
Catalog stores a snapshot of the product catalog — ASINs, titles, categories, variation relationships — updated weekly. This table is primarily a lookup reference for the other tables and does not use a hypertable structure, since the data does not have meaningful time-series characteristics.
TimescaleDB in Practice
Hypertables are the core TimescaleDB feature. Creating a hypertable from a standard PostgreSQL table requires a single command specifying the time column and the partition interval. After that, standard SQL queries work without modification, and TimescaleDB's query planner automatically applies chunk exclusion — skipping partitions that fall outside the query's time range — to make time-bounded queries fast even on large datasets.
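For the orders table described above, that single command looks roughly like this (table and column names are illustrative):

```sql
-- Convert a plain table into a hypertable partitioned on purchase_date,
-- using the one-week chunk interval chosen for order data.
SELECT create_hypertable(
    'order_items',
    'purchase_date',
    chunk_time_interval => INTERVAL '1 week'
);
```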
Continuous aggregates are the second feature we use extensively. A continuous aggregate is a materialized view that TimescaleDB updates incrementally as new data arrives in the hypertable. For demand forecasting, we maintain continuous aggregates for rolling 7, 30, and 90-day sales totals by ASIN and marketplace. These aggregates are updated automatically on a defined schedule, and queries against them run against pre-computed data rather than the full hypertable, which makes dashboard-level analytics fast enough for interactive use.
```sql
CREATE MATERIALIZED VIEW daily_sales_by_asin
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('1 day', purchase_date) AS day,
    asin,
    marketplace_id,
    SUM(quantity_ordered) AS units_sold,
    SUM(item_price_amount) AS revenue
FROM order_items
GROUP BY day, asin, marketplace_id;
```
This view is refreshed automatically by TimescaleDB's background policy, which we configure to run hourly. The 90-day rolling sums used in forecasting are computed as window function queries over this materialized view, not over the raw order items table.
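The trailing-window computation itself is easy to state. In production it is a SQL window function over the materialized view; the same logic sketched in Python over an ordered list of daily unit counts:

```python
def rolling_sum(daily_units, window):
    """Trailing-window sums over an ordered series of daily unit counts.

    Mirrors the SQL window-function query: each output position holds the
    sum of the last `window` days, including the current day.
    """
    out = []
    running = 0
    for i, units in enumerate(daily_units):
        running += units
        if i >= window:
            running -= daily_units[i - window]
        out.append(running)
    return out

# Seven days of unit sales and the 3-day trailing sum at each day:
print(rolling_sum([5, 3, 0, 7, 2, 4, 1], window=3))  # [5, 8, 8, 10, 9, 13, 7]
```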
The Celery Pipeline
The SP-API report pipeline runs as a set of Celery tasks on a schedule managed by Celery Beat. Each report type has its own schedule: orders are pulled hourly, inventory twice daily, settlements on a two-week cycle that mirrors Amazon's settlement periods.
The pipeline for each report type follows the same structure: a scheduled task initiates a report request via SP-API, storing the request ID in Redis. A polling task runs every few minutes, checks the report status via SP-API, and when the report is ready, enqueues a download task. The download task fetches the report document, stores the raw file to disk, and enqueues a processing task. The processing task parses the file and loads structured records into TimescaleDB.
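The control flow of those four stages can be sketched as plain functions. This is a simplified illustration: in production each stage is a Celery task and the SP-API calls are real, while here they are stubbed and the function and state names are invented for clarity.

```python
# Four-stage report pipeline, stubbed so the control flow is visible.
REQUESTED, READY, DONE = "REQUESTED", "READY", "DONE"

def request_report(state, report_type):
    # Stage 1: initiate the report request; SP-API would return a report ID.
    state[report_type] = {"status": REQUESTED, "raw": None, "rows": None}

def poll_report(state, report_type):
    # Stage 2: periodic status check; flip to READY when Amazon finishes.
    if state[report_type]["status"] == REQUESTED:
        state[report_type]["status"] = READY  # stubbed: always ready

def download_report(state, report_type):
    # Stage 3: fetch the report document and store the raw file.
    if state[report_type]["status"] == READY:
        state[report_type]["raw"] = "sku\tunits\nA1\t3\n"  # stubbed TSV

def process_report(state, report_type):
    # Stage 4: parse the raw file into structured rows for TimescaleDB.
    header, *lines = state[report_type]["raw"].strip().split("\n")
    keys = header.split("\t")
    state[report_type]["rows"] = [dict(zip(keys, l.split("\t"))) for l in lines]
    state[report_type]["status"] = DONE

state = {}
for step in (request_report, poll_report, download_report, process_report):
    step(state, "orders")
print(state["orders"]["rows"])  # [{'sku': 'A1', 'units': '3'}]
```

In the real pipeline the shared state lives in Redis rather than a dict, and the polling stage re-enqueues itself until the report is actually ready.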
Deduplication is handled at the processing step rather than the database level. SP-API occasionally returns duplicate records across overlapping report windows. We maintain a hash of each record's identifying fields and check for existing records before insert, which prevents duplicate rows from accumulating in the warehouse over time.
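A minimal sketch of that hash-based check, with illustrative field names:

```python
import hashlib

def record_key(record, identifying_fields):
    """Stable hash of a record's identifying fields, used as the dedupe key."""
    canonical = "|".join(str(record[f]) for f in sorted(identifying_fields))
    return hashlib.sha256(canonical.encode()).hexdigest()

def filter_new(records, identifying_fields, seen=None):
    """Drop records whose identifying-field hash has already been seen.

    `seen` persists across report windows; in production it is backed by a
    table of hashes rather than an in-memory set.
    """
    seen = set() if seen is None else seen
    fresh = []
    for record in records:
        key = record_key(record, identifying_fields)
        if key not in seen:
            seen.add(key)
            fresh.append(record)
    return fresh
```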
Error handling follows a simple policy: transient SP-API errors (rate limit responses, temporary service unavailability) trigger a Celery retry with exponential backoff. Persistent errors — malformed report documents, unexpected schema changes, parsing failures — are logged with the full context and halt processing for that report type without affecting other pipelines. We review the error log as part of the weekly operational review rather than treating these as urgent incidents unless they affect settlement reconciliation data.
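The policy reduces to a small classification function. A sketch, with illustrative status codes and backoff parameters:

```python
TRANSIENT_STATUS = {429, 500, 503}  # rate limits, temporary unavailability

def retry_delay(attempt, base=2.0, cap=300.0):
    """Exponential backoff: base^attempt seconds, capped at five minutes.

    The base and cap are illustrative defaults, not the actual tuning.
    """
    return min(base ** attempt, cap)

def handle_error(status_code, attempt):
    """Return ('retry', delay) for transient errors, ('halt', None) otherwise.

    'halt' stops processing for that report type only; other pipelines
    continue, and the failure is logged for the weekly review.
    """
    if status_code in TRANSIENT_STATUS:
        return ("retry", retry_delay(attempt))
    return ("halt", None)
```

In the real pipeline Celery's built-in retry mechanism applies the computed delay; the sketch just shows the classification and the backoff curve.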
Demand Forecasting Foundation
The data warehouse does not implement demand forecasting directly. It provides the data foundation that makes forecasting possible: clean time-series sales data at the ASIN-marketplace-day grain, with enough history to identify seasonal patterns and compute statistically meaningful trend estimates.
The continuous aggregates described above produce the rolling 7, 30, and 90-day demand signals. These signals feed into simple forecasting models — weighted moving averages and linear trend extrapolation for stable products, with manual adjustment factors for products with known seasonality. The models run as FastAPI endpoints that query the continuous aggregates and return point forecasts with confidence intervals.
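The simplest of those models fits in a few lines. A sketch of a weighted-moving-average point forecast blending the three rolling signals, with illustrative weights rather than the production values:

```python
def wma_forecast(avg_7d, avg_30d, avg_90d, weights=(0.5, 0.3, 0.2)):
    """Point forecast of daily demand as a weighted blend of the rolling
    7-, 30-, and 90-day average daily sales.

    Heavier weight on the short window makes the forecast responsive to
    recent shifts; the long windows damp noise. Weights are illustrative.
    """
    w7, w30, w90 = weights
    return w7 * avg_7d + w30 * avg_30d + w90 * avg_90d
```

The FastAPI endpoint wraps this with a query against the continuous aggregates and attaches a confidence interval derived from the variance of the daily series.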
More sophisticated forecasting approaches become tractable once the data foundation is in place. We have experimented with Holt-Winters exponential smoothing for products with clear seasonal cycles, and with simple regression against external signals (advertising spend, price changes) for products where those variables are meaningful demand drivers. In practice, though, the improvement in forecast accuracy from these more complex approaches over well-tuned weighted moving averages is modest for most products — the data quality and history length matter more than the model complexity.
What Becomes Possible
The specific capabilities that the warehouse enables are worth naming, because they represent a qualitative change in what can be done operationally rather than just an incremental improvement in reporting speed.
Cross-marketplace inventory planning becomes tractable. With sales velocity data at the ASIN-marketplace level and current FBA inventory from the snapshot table, we can compute days-of-supply for every SKU in every marketplace and generate reorder recommendations automatically rather than through per-marketplace manual review.
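The days-of-supply arithmetic behind those recommendations is straightforward once both inputs live in one system. A sketch, with an illustrative safety buffer rather than the actual policy:

```python
def days_of_supply(on_hand_units, avg_daily_units):
    """Current inventory divided by average daily sales velocity."""
    if avg_daily_units <= 0:
        return float("inf")  # no recent sales: effectively unlimited cover
    return on_hand_units / avg_daily_units

def needs_reorder(on_hand_units, avg_daily_units, lead_time_days, safety_days=14):
    """Flag a SKU when cover drops below lead time plus a safety buffer.

    safety_days is an illustrative default, not the actual reorder policy.
    """
    cover = days_of_supply(on_hand_units, avg_daily_units)
    return cover < lead_time_days + safety_days
```

Running this per SKU per marketplace over the snapshot table and the velocity aggregates produces the automatic reorder list.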
Settlement reconciliation becomes auditable. With order data and settlement data stored in the same system, it becomes possible to verify that every order item was settled at the expected rate — identifying fee discrepancies, incorrect reimbursements, or settlement errors that are otherwise visible only to sellers who manually reconcile line items. This check runs as a weekly report against the warehouse rather than requiring manual work.
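The core of that weekly check is a per-item comparison of expected versus settled fees. A sketch, with illustrative input shapes (both maps keyed by order item ID):

```python
def fee_discrepancies(expected_fees, settled_fees, tolerance=0.01):
    """Compare expected vs settled fee totals per order item.

    Returns items whose settled fee differs from expectation by more than
    the tolerance, plus items missing from the settlement entirely. The
    tolerance absorbs rounding differences in Amazon's line items.
    """
    problems = {}
    for item_id, expected in expected_fees.items():
        settled = settled_fees.get(item_id)
        if settled is None:
            problems[item_id] = ("missing", expected)
        elif abs(settled - expected) > tolerance:
            problems[item_id] = ("mismatch", settled - expected)
    return problems
```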
Advertising attribution becomes possible at the ASIN level. By joining advertising data against order data on shared identifiers, we can compute the contribution of advertising spend to organic rank improvements over time — a question that Amazon's advertising console does not answer directly, because it shows advertising-attributed sales but not the downstream organic sales effect of the rank changes that advertising drove.
Related Insights
- Flatfile Automation: Managing Product Listings Across Multiple Amazon Marketplaces — the catalog management layer that sits alongside the data warehouse
- Distribution Engineering: Building Systems That Sell So You Don't Have To — the strategic framework the data warehouse supports
- Over a Hundred Docker Containers: Our Monthly Health Check Routine — the infrastructure maintenance discipline that keeps this stack healthy