Building a Production Self-Healing Container Platform

Production infrastructure fails. Containers crash, processes die, disks fill up. The question isn't whether something will break — it's whether your system recovers automatically or waits for a human to notice at 2am. This project builds a self-healing container platform: a production-style environment where services detect failures and restart themselves, with full observability to prove it's working.

What "Self-Healing" Actually Means

Self-healing infrastructure has three layers:

Detection — a health check that knows the difference between "process running" and "service healthy"
Recovery — automatic restart when the health check fails
Observability — metrics and logs that show what failed, when, and how long recovery took

Most tutorials cover layer one. This project implements all three.

Architecture

Vagrant VM → isolated Linux environment (mirrors a real server)
└── Docker Compose → orchestrates all services
    ├── App Container → web service with HTTP health endpoint
    ├── Watchdog → detects unhealthy containers, triggers recovery
    ├── Prometheus → scrapes health metrics
    └── Grafana → real-time recovery dashboards

Docker Health Checks

A Docker health check runs a command inside the container at a defined interval. If it fails a set number of consecutive times (retries), Docker marks the container unhealthy. That status is visible in docker ps and through the Docker API, so other tooling can act on it.

healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost/health"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 40s
restart: unless-stopped

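The key detail: restart: unless-stopped makes Docker restart the container automatically whenever its main process exits, unless the container was stopped explicitly, without any human intervention. On plain Docker and Docker Compose, an unhealthy status alone does not trigger the restart policy; something has to act on that status, which is the watchdog's job.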
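To make the difference between "process running" and "service healthy" concrete, here is a minimal sketch of what a /health handler might do, using only the Python standard library. The check names and the functions evaluate_health, make_handler, and serve are hypothetical, not taken from the project. The point is that the endpoint answers 503 when a dependency check fails even though the process is still up, which is what makes curl -f exit non-zero.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple

# Hypothetical checks: each callable returns True when that dependency is usable.
Checks = Dict[str, Callable[[], bool]]


def evaluate_health(checks: Checks) -> Tuple[int, dict]:
    """Run every check; return 200 only if all pass, otherwise 503."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises is treated as a failed check.
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, {"healthy": status == 200, "checks": results}


def make_handler(checks: Checks):
    """Build a request handler class that serves /health for the given checks."""
    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/health":
                self.send_error(404)
                return
            status, body = evaluate_health(checks)
            payload = json.dumps(body).encode()
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    return HealthHandler


def serve(checks: Checks, port: int = 8080) -> None:
    """Serve the health endpoint until interrupted (port is an assumed default)."""
    HTTPServer(("0.0.0.0", port), make_handler(checks)).serve_forever()
```

Calling serve({"database": lambda: True}) would start the endpoint with a placeholder check; a real service would probe its own dependencies there.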

The Watchdog Service

Docker's built-in restart policy handles process crashes. The watchdog handles application-level failures — situations where the process is running but the service is broken. It polls each container's health endpoint, and when it detects an unhealthy state, it triggers a controlled container restart via the Docker API.

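# Watchdog logic (simplified); needs the requests, docker, and prometheus_client packages
import logging, time
import docker, requests
from prometheus_client import Counter, start_http_server
recoveries = Counter("app_recovery_events_total", "Recoveries triggered by the watchdog")
start_http_server(9100)  # metrics port for Prometheus to scrape (assumed)
while True:
    try:
        status = requests.get("http://app:80/health", timeout=5).status_code
    except requests.RequestException:
        status = None  # an unreachable service counts as unhealthy
    if status != 200:
        logging.warning("health check failed (status=%s); restarting", status)
        docker.from_env().containers.get("app-container").restart()
        recoveries.inc()
    time.sleep(30)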
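One refinement the simplified loop leaves out: restarting on the very first failed poll can cause restart loops during a slow startup or a brief network blip. The sketch below shows one possible guard, which requires several consecutive failures and enforces a cooldown between restarts. The RestartGate class and its default thresholds are illustrative assumptions, not the project's actual watchdog.

```python
from dataclasses import dataclass


@dataclass
class RestartGate:
    """Decides when a failed health poll should actually trigger a restart."""
    failure_threshold: int = 3      # consecutive failures required (assumed value)
    cooldown_seconds: float = 60.0  # minimum gap between restarts (assumed value)
    consecutive_failures: int = 0
    last_restart_at: float = float("-inf")

    def record(self, healthy: bool, now: float) -> bool:
        """Record one poll result; return True if a restart should happen now."""
        if healthy:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures < self.failure_threshold:
            return False
        if now - self.last_restart_at < self.cooldown_seconds:
            return False  # still cooling down after the previous restart
        self.consecutive_failures = 0
        self.last_restart_at = now
        return True
```

Inside the polling loop, the restart would then be guarded by something like gate.record(status == 200, time.monotonic()).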

Observability — Proving It Works

Self-healing is only trustworthy if you can observe it. I wired Prometheus to scrape a custom metric: app_recovery_events_total — a counter that increments every time the watchdog triggers a recovery. Grafana shows this as a timeline, so you can see exactly when failures occurred and how long recovery took.

# PromQL: per-second rate of recovery events, averaged over the last 5 minutes
rate(app_recovery_events_total[5m])
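A counter records that a recovery happened, but not how long it took. As a hedged sketch, a watchdog could also time the gap between the first failed poll and the next healthy one, and expose both values in the Prometheus text exposition format. The RecoveryTracker class and the gauge name app_last_recovery_duration_seconds are assumptions for illustration; only app_recovery_events_total comes from the project.

```python
from typing import Optional


class RecoveryTracker:
    """Tracks recovery count and the duration of the most recent outage."""

    def __init__(self) -> None:
        self.recovery_events_total = 0
        self.last_recovery_duration: Optional[float] = None
        self._outage_started_at: Optional[float] = None

    def observe(self, healthy: bool, now: float) -> None:
        """Feed in one health poll result with its timestamp in seconds."""
        if not healthy and self._outage_started_at is None:
            self._outage_started_at = now  # outage begins at the first failed poll
        elif healthy and self._outage_started_at is not None:
            self.last_recovery_duration = now - self._outage_started_at
            self._outage_started_at = None

    def recovery_triggered(self) -> None:
        """Call whenever the watchdog restarts the container."""
        self.recovery_events_total += 1

    def render(self) -> str:
        """Render the metrics in Prometheus text exposition format."""
        lines = [
            "# HELP app_recovery_events_total Recoveries triggered by the watchdog.",
            "# TYPE app_recovery_events_total counter",
            f"app_recovery_events_total {self.recovery_events_total}",
        ]
        if self.last_recovery_duration is not None:
            lines += [
                "# HELP app_last_recovery_duration_seconds Duration of the most recent outage.",
                "# TYPE app_last_recovery_duration_seconds gauge",
                f"app_last_recovery_duration_seconds {self.last_recovery_duration}",
            ]
        return "\n".join(lines) + "\n"
```

The measured duration is only as precise as the polling interval: with a 30-second poll, a 5-second outage and a 25-second outage can look the same.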

Vagrant — Clean Test Environment

The entire platform runs inside a Vagrant VM, making it fully reproducible. Destroy it, rebuild it, the platform behaves identically. This is intentional — the goal is to simulate a real server without touching your host machine.

vagrant up
# → Grafana:    http://localhost:3000
# → Prometheus: http://localhost:9090
# → App:        http://localhost:8080
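Once vagrant up finishes, a quick way to confirm the stack is reachable is to poll the three forwarded ports. Here is a small sketch using only the standard library. The URLs are the ones listed above; the function names http_status and smoke_check are hypothetical. The fetch function is injectable so the logic can be exercised without a running VM.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict

ENDPOINTS = {
    "grafana": "http://localhost:3000",
    "prometheus": "http://localhost:9090",
    "app": "http://localhost:8080",
}


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if it is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0


def smoke_check(endpoints: Dict[str, str],
                fetch: Callable[[str], int] = http_status) -> Dict[str, bool]:
    """Map each service name to True if it answered with a 2xx or 3xx status."""
    return {name: 200 <= fetch(url) < 400 for name, url in endpoints.items()}
```

Running smoke_check(ENDPOINTS) from the host returns a dictionary of service names to booleans, which is easy to turn into a pass or fail exit code.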

Key Lessons

Health checks must test application behavior, not just process state — a running container can still be serving errors
Restart policies alone aren't enough for application-level failures — you need a watchdog
You can't trust self-healing without observability — metrics prove recovery happened
Vagrant gives you a real Linux environment for free — use it instead of testing on your host

What's Next

Kubernetes migration — replacing Docker Compose with K8s liveness/readiness probes
Slack alerting when recovery events exceed a threshold
Multi-service recovery with dependency ordering
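For the dependency-ordering item, one possible starting point is a topological sort, so that a service is only restarted after the services it depends on. The sketch below uses graphlib from the standard library (Python 3.9+). The dependency map is a made-up example built from the service names in this post, not the project's real dependency graph.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def recovery_order(depends_on: Dict[str, Set[str]]) -> List[str]:
    """Return services ordered so each comes after everything it depends on."""
    return list(TopologicalSorter(depends_on).static_order())


# Hypothetical dependencies: Grafana needs Prometheus, Prometheus needs the app.
EXAMPLE_DEPENDENCIES = {
    "grafana": {"prometheus"},
    "prometheus": {"app"},
    "app": set(),
}
```

TopologicalSorter raises graphlib.CycleError if the map contains a circular dependency, which is a useful early failure for a recovery plan.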
