
More Than a Hundred Docker Containers: Our Monthly Health Check Routine

Running more than a hundred Docker containers in production is not unusual if you have built up services gradually over several years. A self-hosted Supabase stack accounts for thirteen containers on its own. Add a web frontend, multiple API services, background workers, a monitoring stack, log aggregation, and the number climbs faster than most plans anticipate. What becomes difficult is not deploying these containers — it is maintaining them.

Most documentation covers getting containers running. Very little covers what happens six months later when disk fills on a Sunday, or when you discover that a third of your containers have no restart policy, or when an SSL certificate expired quietly because the monitoring alert was muted. This post documents our monthly routine for a production server carrying over a hundred containers.

Starting With Disk

Disk is the most common acute failure mode in long-running Docker environments. Docker accumulates data in ways that are not visible to standard system tools. Running df -h shows filesystem usage, but it does not tell you that Docker is holding fifty gigabytes of stopped container layers, dangling images, and build cache from six months of iterative deploys.

The correct starting point is docker system df, which breaks down disk usage by images, containers, local volumes, and build cache. The output is often surprising. We have seen servers where the build cache alone exceeded all running container layers combined — accumulated silently from months of CI-triggered builds that never got cleaned up.

In practice, the number that matters most is the reclaimable column. Before touching anything, we establish a baseline: how much space is currently reclaimable, and what the month-over-month trajectory looks like. If reclaimable space is growing, the pruning schedule needs to become more aggressive or more frequent.
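As a sketch of that baseline step, the figures can be appended to a running log. The log path and the stub demonstration are assumptions; in production the stub is replaced with the docker system df command shown in the comment.

```shell
#!/bin/sh
# Append this month's reclaimable figures to a running log so the
# month-over-month trajectory is visible at a glance.
LOGFILE="/tmp/docker-reclaimable.log"   # hypothetical path -- adjust

record_baseline() {
    # $1: a command printing "Type: Reclaimable" lines. In production:
    #     docker system df --format '{{.Type}}: {{.Reclaimable}}'
    {
        printf '== %s ==\n' "$(date +%F)"
        sh -c "$1"
    } >>"$LOGFILE"
}

# Stub demonstration so the sketch runs without a Docker daemon:
record_baseline "printf 'Images: 3.2GB\nBuild Cache: 12GB\n'"
```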

Our cleaning sequence runs in this order. First, volumes no longer attached to any container:

docker volume prune -f

Then dangling images — layers that carry no tag and are not referenced by any running or stopped container:

docker image prune -f

And finally, if we have confirmed that a full rebuild is feasible within our recovery window, all unused images older than seven days:

docker image prune -a --filter "until=168h" -f

The -a flag removes all unused images, not only dangling ones. We run this only after verifying that all services can be rebuilt from registry within our acceptable recovery time. That verification happens before the command, not after.

Restart Policies

Restart policies determine what happens when a container exits unexpectedly or when the Docker daemon restarts after a host reboot. Most deployment guides mention this briefly. But in reality, an incorrectly set restart policy is how you end up with services that were silently stopped for two weeks — and no one noticed because the monitoring alert was pointed at the wrong endpoint.

Docker provides four restart policies. no is the default: the container does not restart under any circumstance. always restarts the container whenever it stops, including on daemon restart, regardless of exit code. unless-stopped behaves like always but respects explicit stops — if you run docker stop before a reboot, the container stays stopped after the reboot. on-failure[:max-retries] restarts only on non-zero exit codes, with an optional retry limit before giving up.

For stateless web services and API workers, we use unless-stopped. If we deliberately stop a container during a maintenance window, it should remain stopped after the next reboot rather than coming back unexpectedly. always would restart it regardless of why it stopped.

For database migration containers or one-shot initialization jobs, the correct policy is no. A migration that fails should not loop. on-failure:3 is appropriate for containers that should retry briefly against a dependency that may be temporarily unavailable — an external queue consumer waiting for a broker to become reachable, for example — but should not run indefinitely.
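In compose terms, that split might look like the following. Service names and image tags are illustrative; note that no must be quoted, or YAML parses it as the boolean false.

```yaml
services:
  api:
    image: registry.example.com/api:1.4.2
    restart: unless-stopped      # survives reboots, respects docker stop
  queue-consumer:
    image: registry.example.com/consumer:2.0.1
    restart: on-failure:3        # retry a flaky dependency, then give up
  migrate:
    image: registry.example.com/migrate:1.4.2
    restart: "no"                # one-shot job; a failed migration must not loop
```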

Our monthly check runs a single command against all containers:

docker inspect --format '{{.Name}} {{.HostConfig.RestartPolicy.Name}}' $(docker ps -aq)

Any container with policy no that is not an intentional one-shot job gets reviewed. In most cases it means a service was started with docker run during an incident and was never formally added to the compose configuration with a proper restart policy.
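A sketch that narrows that output to the containers worth reviewing. The filter is plain text processing, so it is demonstrated here against stubbed inspect output; in production the stub is replaced by the docker inspect command above.

```shell
#!/bin/sh
# Print only containers whose restart policy is "no" (or unset, as some
# older daemons report an empty string). In production, feed this:
#   docker inspect --format '{{.Name}} {{.HostConfig.RestartPolicy.Name}}' $(docker ps -aq)
flag_no_policy() {
    awk '$2 == "no" || $2 == "" { print $1 }'
}

# Stub demonstration:
printf '/api unless-stopped\n/migrate no\n/legacy-worker no\n' | flag_no_policy
```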

Log Rotation

The default Docker logging driver is json-file. By default it imposes no size limits. A container emitting a modest stream of log lines can produce hundreds of gigabytes over several months. This is not a theoretical concern — it is one of the more common causes of disk exhaustion on production servers that were set up without deliberate log management.

The fix is a global policy in /etc/docker/daemon.json:

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m",
    "max-file": "5"
  }
}

This caps each container's logs at five files of one hundred megabytes each — five hundred megabytes maximum per container. The Docker daemon must be restarted after this change, and critically, containers must be recreated, not merely restarted, for the new log settings to take effect.

Note, however, that the daemon.json setting applies only to containers created after the change. Existing containers retain their original log configuration indefinitely. This is the most common mistake we encounter: the policy is set, the daemon is restarted, and the assumption is that all containers now comply. They do not. Our monthly check verifies the log configuration per-container:

docker inspect --format '{{.Name}} {{.HostConfig.LogConfig}}' $(docker ps -q)

Containers without explicit size limits get recreated with the updated configuration during the next maintenance window. The recreation order matters — stateful services need their data volumes to remain in place, and dependent services need to come up in the right sequence.
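One way to take daemon defaults out of the equation during that recreation is to pin the limit per service in the compose file as well. A sketch; the service name is illustrative.

```yaml
services:
  api:
    logging:
      driver: json-file
      options:
        max-size: "100m"
        max-file: "5"
```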

SSL Certificate Expiry

SSL certificates expire. Automated monitoring catches most cases. But automated monitoring also gets misconfigured, produces alert fatigue, or fails silently alongside the service it is meant to watch. Our monthly routine includes a manual verification pass independent of any automated system.

For each public-facing domain, we check the certificate directly:

echo | openssl s_client -connect domain.com:443 -servername domain.com 2>/dev/null | openssl x509 -noout -enddate

This outputs the notAfter date. Anything expiring within thirty days enters the renewal queue immediately, regardless of what any monitoring dashboard says. The manual check is the backstop.

For self-managed certificate infrastructure — which we operate for several internal services — we check intermediate certificates separately from leaf certificates. An expired intermediate causes full chain validation failure even when the leaf certificate itself is still valid. This failure mode is less visible than an expired leaf: browsers and clients may report confusing errors rather than the clear "certificate expired" message most engineers expect.

We maintain a shell script that iterates over a list of domains, extracts the expiry date via openssl, and prints a warning for anything within thirty days and a critical alert for anything within seven days. This script runs as a cron job, but we also run it manually during the monthly review as a secondary confirmation that cron output has been accurate. Cron jobs fail silently more often than most people expect.
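A minimal sketch of such a script. The domain list and thresholds are illustrative, the date arithmetic assumes GNU date, and the network loop at the bottom is left commented so the sketch stands alone.

```shell
#!/bin/sh
# Warn on certificates expiring within 30 days, alert within 7.
DOMAINS="example.com example.org"   # illustrative list -- replace

days_until() {
    # $1: a "notAfter=..." line from openssl; $2: "now" as a Unix epoch
    expiry=$(date -d "${1#notAfter=}" +%s)   # GNU date
    echo $(( (expiry - $2) / 86400 ))
}

check_domain() {
    notafter=$(echo | openssl s_client -connect "$1:443" -servername "$1" 2>/dev/null \
        | openssl x509 -noout -enddate)
    left=$(days_until "$notafter" "$(date +%s)")
    if   [ "$left" -le 7 ];  then echo "CRITICAL: $1 expires in $left days"
    elif [ "$left" -le 30 ]; then echo "WARNING: $1 expires in $left days"
    fi
}

# Uncomment to run against the real domain list:
# for d in $DOMAINS; do check_domain "$d"; done
```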

Container Resource Limits

Without memory limits, a misbehaving container can exhaust host RAM and trigger the kernel's OOM killer on unrelated processes. Without CPU limits, a runaway process can starve neighboring containers for long enough to cause cascading failures. Neither of these is a rare edge case.

The monthly review checks resource limits on all running containers:

docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}"

Containers whose memory column shows the host's total RAM as the limit have no constraint set. We review each one and determine an appropriate limit. For stateless HTTP services, a memory limit of two to four times the observed working set is a reasonable starting point. The goal is not to be precise — it is to prevent unlimited growth from taking down co-located services.
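The limit check can also be done directly from container metadata rather than by eyeballing the stats column. A sketch against stubbed output; in production the stub is replaced by the docker inspect command in the comment.

```shell
#!/bin/sh
# List containers with no memory limit. HostConfig.Memory is 0 (bytes)
# when no limit is set. In production, feed this:
#   docker inspect --format '{{.Name}} {{.HostConfig.Memory}}' $(docker ps -q)
unlimited_memory() {
    awk '$2 == 0 { print $1 }'
}

# Stub demonstration (values in bytes):
printf '/api 536870912\n/worker 0\n' | unlimited_memory
```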

We also look at CPU percentage during the stats pass. A container consistently near one hundred percent CPU on a multi-core host suggests either a runaway process or a container that is under-provisioned for its workload. Both conditions warrant investigation before the next month.

Image Update Checks

Base images receive security patches on irregular schedules. A container running an image that was current six months ago may be running against an nginx or PostgreSQL version with known vulnerabilities. We do not automatically pull and redeploy every container each month — that creates more risk than it mitigates. But we do check what is running against what is current.

The practical approach: for each service with a pinned image version, we verify the pinned version against the upstream changelog once per month. For services using a floating tag like latest or 16-alpine, we pull and diff the image digest to determine whether anything changed. If it changed, we review what changed before deploying.
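A sketch of that digest comparison for a floating tag. The image name in the comments is illustrative; the comparison itself is plain string equality, shown as a small function so it can run without a daemon.

```shell
#!/bin/sh
# Decide whether a pulled image actually changed. In production:
#   before=$(docker inspect --format '{{index .RepoDigests 0}}' postgres:16-alpine)
#   docker pull -q postgres:16-alpine
#   after=$(docker inspect --format '{{index .RepoDigests 0}}' postgres:16-alpine)
digest_changed() {
    # $1: digest before pull, $2: digest after pull
    if [ "$1" = "$2" ]; then
        echo "unchanged"
    else
        echo "changed -- review the upstream changelog before deploying"
    fi
}

# Stub demonstration:
digest_changed "sha256:aaa" "sha256:bbb"
```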

The more important discipline, though, is moving away from floating tags. A service that quietly redeployed with a breaking change because its latest tag pointed to a new major version is a harder problem to diagnose than a service that is running a known-old image. Pin versions, then update them deliberately.

The Discipline

Taken together, the monthly health check covers six areas: disk usage and pruning, restart policy verification, log rotation settings, SSL certificate expiry, container resource limits, and image currency. None of these tasks requires more than ninety minutes in total on a well-documented server. The value is not in the individual checks — it is in doing them on a fixed schedule, before something breaks rather than after.

Production systems degrade gradually. Disk accumulates. Logs grow. Certificates age. Images fall behind. None of these processes generates an alert until the threshold is crossed. The monthly check moves the maintenance burden from reactive to predictable, which is a different operational posture entirely.

