
Disaster Recovery for Self-Hosted Services: A Backup Strategy

Self-hosting gives you control that managed services do not offer: you choose the hardware, the network, the data residency, and the cost structure. But that control comes with a responsibility that managed services handle quietly: when something goes wrong, it is entirely your problem. For a single-server setup running multiple production services – Supabase, a CRM, a DAM, a Git server – the backup and recovery strategy is not a background concern. It is the difference between a recoverable incident and permanent data loss.

This post documents the backup strategy we run for a production self-hosted environment. The services involved are a Supabase stack (PostgreSQL as the primary data store), a self-hosted CRM (Twenty, which also runs PostgreSQL), a Gitea Git server, and a Digital Asset Management system. All of them run as Docker containers on a single Hetzner VPS. The strategy has to account for all of them without creating unacceptable load windows or complex orchestration dependencies.

Why a Single Server Raises the Risk

A single-server setup has no built-in redundancy. There is no standby replica, no multi-availability-zone failover, no automatic database restore triggered by a health check. When the server fails – whether due to hardware failure, a botched upgrade, a storage issue, or a security incident – the only recovery path is a backup. If the backup is incomplete, stale, or untested, the recovery path is limited accordingly.

This is a different threat model than a managed cloud service, where hardware failure is largely invisible because the infrastructure provider handles failover. On a single VPS, you plan explicitly for the failure modes that managed services abstract away. The planning is not particularly complex, but it requires doing it before the incident rather than during it.

What Needs Backing Up

Each service has its own data profile. Understanding what data is irreplaceable and where it lives is the prerequisite for designing the backup strategy.

Supabase stores its data in a PostgreSQL instance managed by the Supabase stack. The database contains all application data, user records, and Supabase’s own auth and storage metadata. The storage buckets (file uploads) are stored on disk and need separate handling from the database backup.

Twenty CRM is backed by its own PostgreSQL instance, separate from Supabase’s. It contains contact records, opportunity data, and workflow state. This database is smaller in volume but highly sensitive – losing CRM data is more disruptive operationally than losing application data that can be recreated.

Gitea stores repositories on disk as bare Git objects. The repository data is in principle reconstructable from developer workstations (since every clone is a full backup), but the issue tracker data, pull request comments, and team configuration live only in Gitea’s database and are not present in any clone. Both the Git objects and the Gitea database need backing up.

DAM (Digital Asset Management) stores original files, processed derivatives, and metadata. The original files are the irreplaceable part; derivatives can be regenerated. The metadata lives in a database that records file relationships, tags, and usage rights.

Orchestrating pg_dump

PostgreSQL’s pg_dump is the right tool for logical database backups. It exports a SQL representation of the database that can be restored to the same or a newer PostgreSQL version, which makes it more portable than physical backups (which require a matching PostgreSQL version and binary layout).

The backup script for each PostgreSQL instance follows the same pattern:

#!/usr/bin/env bash
set -euo pipefail

TIMESTAMP=$(date +%Y%m%d_%H%M%S)
DB_NAME="$1"
CONTAINER_NAME="$2"
BACKUP_DIR="/opt/backups/postgres"
# "_daily_" in the name lets the retention filters target daily backups
OUTPUT_FILE="${BACKUP_DIR}/${DB_NAME}_daily_${TIMESTAMP}.sql.gz"

mkdir -p "${BACKUP_DIR}"

docker exec "${CONTAINER_NAME}" \
  pg_dump -U postgres -d "${DB_NAME}" \
  | gzip -9 > "${OUTPUT_FILE}"

echo "Backup complete: ${OUTPUT_FILE} ($(du -sh "${OUTPUT_FILE}" | cut -f1))"

The set -euo pipefail at the top is important: it causes the script to exit immediately if any command fails, including commands in a pipeline. Without pipefail, a failed pg_dump followed by a successful gzip would produce a compressed file containing an error message, which the restore attempt would treat as a corrupt backup.
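The effect is easy to demonstrate in isolation. This is a minimal illustration of the shell behavior, not part of the backup script:

```shell
#!/usr/bin/env bash
# Without pipefail, a pipeline's exit status is the status of its LAST
# command, so a failing producer is masked by a succeeding consumer.
false | gzip -9 > /dev/null
echo "default: pipeline exit status is $?"        # prints 0

# With pipefail, the pipeline fails if any stage fails.
set -o pipefail
false | gzip -9 > /dev/null || echo "pipefail: failure detected"
```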

The output is compressed with gzip -9 inline rather than as a post-processing step. This keeps the disk footprint small and avoids writing an uncompressed dump to disk first, which matters on servers where free space is a managed constraint.

Staggered Scheduling

Running all backup jobs simultaneously would create an I/O contention window that could degrade the services being backed up. A Supabase instance under load does not benefit from competing with a pg_dump for disk I/O. The solution is staggered scheduling – each backup job starts at a different time, with enough spacing to allow the previous job to complete before the next begins.

The schedule we use, expressed as cron entries:

# Supabase PostgreSQL backup — 02:00 daily
0 2 * * * /opt/scripts/pg-backup.sh supabase supabase-db

# Twenty CRM PostgreSQL backup — 02:30 daily
30 2 * * * /opt/scripts/pg-backup.sh twenty twenty-db

# Gitea database backup — 03:00 daily
0 3 * * * /opt/scripts/pg-backup.sh gitea gitea-db

# DAM database backup — 03:30 daily
30 3 * * * /opt/scripts/pg-backup.sh dam dam-db

# Gitea repository objects — 04:00 daily
0 4 * * * /opt/scripts/git-objects-backup.sh

# Storage files (Supabase + DAM originals) — 04:30 daily
30 4 * * * /opt/scripts/files-backup.sh

The 30-minute gaps between jobs are conservative – most database backups at this scale complete in a few minutes. But the gaps also account for the upload step that follows each backup: the compressed file is uploaded to object storage before the next job begins, so the local disk never accumulates multiple days’ worth of backups simultaneously.

Object Storage Upload and Retention

Backups that exist only on the same server as the services they protect are not backups – they are snapshots that will be lost in the same incident that destroys the data they protect. Every backup must be uploaded to a separate storage location before the local copy can be considered complete.

We use an S3-compatible object storage provider (separate from the VPS provider) and upload with rclone, which handles retries, resumable transfers, and verification automatically:

rclone copy "${OUTPUT_FILE}" "backup-remote:tva-backups/postgres/${DB_NAME}/" \
  --checksum \
  --transfers 1 \
  --log-level INFO

The --checksum flag verifies the transfer using MD5 checksums rather than just modification time and size, which catches any corruption during transfer.

Retention is enforced by a separate cleanup job that runs weekly. The retention policy is: daily backups retained for seven days, weekly backups (Sunday’s daily backup, renamed by a weekly job) retained for four weeks, monthly backups (first Sunday of each month) retained for six months. This gives a reasonable window for detecting data loss that is not immediately obvious – a corrupted row introduced three weeks ago can still be recovered from a weekly backup.

The retention cleanup uses rclone delete with a filter on modification time rather than deleting by file name pattern, which is more reliable when file naming conventions are not perfectly consistent:

rclone delete "backup-remote:tva-backups/postgres/" \
  --min-age 7d \
  --include "*_daily_*" \
  --dry-run

The --dry-run flag is used during testing of the cleanup job. Remove it only after confirming that the filter patterns match exactly what should be deleted.

Restore Testing

A backup that has never been tested is a hypothesis. The only way to know that a backup is restorable is to restore it. We run a monthly restore test for each database, using a temporary Docker container that is isolated from the production stack:

#!/usr/bin/env bash
set -euo pipefail

BACKUP_FILE="$1"
TEST_CONTAINER="restore-test-$(date +%s)"

# Start a temporary PostgreSQL container
docker run -d \
  --name "${TEST_CONTAINER}" \
  -e POSTGRES_PASSWORD=testpass \
  -e POSTGRES_DB=testdb \
  postgres:15-alpine

# Wait until PostgreSQL accepts connections (a fixed sleep is unreliable)
until docker exec "${TEST_CONTAINER}" pg_isready -U postgres >/dev/null 2>&1; do
  sleep 1
done

# Restore the backup
zcat "${BACKUP_FILE}" | docker exec -i "${TEST_CONTAINER}" \
  psql -U postgres -d testdb

# Verify row counts match expectations
docker exec "${TEST_CONTAINER}" \
  psql -U postgres -d testdb \
  -c "SELECT schemaname, tablename, n_live_tup FROM pg_stat_user_tables ORDER BY n_live_tup DESC LIMIT 10;"

# Clean up
docker stop "${TEST_CONTAINER}" && docker rm "${TEST_CONTAINER}"

The row count verification is not exhaustive – it checks that the main tables have plausible row counts, not that every row is correct. A more thorough test would run the application’s own health checks against the restored database, but the row count check catches the most common failure modes: a truncated backup, a failed restore that produced an empty database, or a version incompatibility that caused silent data loss.

What an Actual Recovery Looks Like

The backup strategy is only as useful as the recovery procedure that uses it. The recovery runbook – written as a Markdown document in the server’s configuration repository – documents the exact sequence of steps to restore each service on a fresh VPS.

The sequence for a full-server failure is: provision a new VPS, install Docker and the application stack from the configuration repository, download the most recent backup files from object storage, restore each database in dependency order (Supabase first, then CRM, then the others), restore file storage, update DNS, and verify each service. The estimated time for a full recovery is under two hours if all backups are current and the runbook is followed without improvisation.
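As an illustration of that register, a database-restore step in a runbook might read like the fragment below; the remote path, container name, and database name are assumptions, not quotes from the actual document:

```shell
# Runbook step: restore the Supabase database on the new VPS.
# (Remote path, container name, and database name are assumptions.)

# 1. Locate and download the most recent backup
mkdir -p /tmp/restore
LATEST=$(rclone lsf backup-remote:tva-backups/postgres/supabase/ | sort | tail -n 1)
rclone copyto "backup-remote:tva-backups/postgres/supabase/${LATEST}" \
  "/tmp/restore/${LATEST}"

# 2. Restore into the freshly started container
zcat "/tmp/restore/${LATEST}" | docker exec -i supabase-db \
  psql -U postgres -d supabase

# 3. Verify: the count should be non-zero and match expectations
docker exec supabase-db psql -U postgres -d supabase -tA \
  -c "SELECT count(*) FROM pg_stat_user_tables;"
```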

The key word in that estimate is “followed without improvisation.” The runbook should be specific enough that a person who has never touched the system before could execute it successfully. Commands should be copy-pasteable, not described in prose. Every step should have a verification check. Ambiguity in a recovery runbook is a liability that compounds under the stress of an actual incident.

We test the runbook annually by performing a full recovery to a staging environment. The test reliably surfaces at least one step that has changed since the runbook was last written – a Docker image version that moved, a configuration key that was renamed, an environment variable that was added. Catching these in a scheduled test is far preferable to discovering them during an actual recovery.
