Production infrastructure fails. Containers crash, processes die, disks fill up. The question isn't whether something will break; it's whether your system recovers automatically or waits for a human to notice at 2 a.m. This project builds a self-healing container platform: a production-style environment where services detect failures and restart themselves, with full observability to prove it's working.
What "Self-Healing" Actually Means
Self-healing infrastructure has three layers:
- Detection: a health check that knows the difference between "process running" and "service healthy"
- Recovery: automatic restart when the health check fails
- Observability: metrics and logs that show what failed, when, and how long recovery took
Most tutorials cover layer one. This project implements all three.
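To make the detection layer concrete, here is a minimal sketch of the logic behind a /health endpoint that reports "service healthy" rather than "process running". The function name evaluate_health and the individual checks are hypothetical, not taken from the project; a real service would plug in its own database, cache, or upstream checks.

```python
import json
from typing import Callable, Dict, Tuple


def evaluate_health(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    """Return (HTTP status, JSON body) for a /health endpoint.

    `checks` maps a dependency name to a zero-argument callable that
    returns True when that dependency is usable. The names are illustrative.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises counts as a failed dependency, not a crash.
            results[name] = False

    healthy = all(results.values())
    body = json.dumps(
        {"status": "ok" if healthy else "degraded", "checks": results},
        sort_keys=True,
    )
    # 503 tells curl -f (and any other prober) that the service is not healthy.
    return (200 if healthy else 503), body


if __name__ == "__main__":
    status, body = evaluate_health({"database": lambda: True, "cache": lambda: False})
    print(status, body)
```

The process answering the request is clearly running in both cases; only the status code says whether it can do useful work.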
Architecture
Docker Health Checks
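A Docker health check runs a command inside the container at a defined interval. If it fails a set number of consecutive times, Docker marks the container unhealthy. On a standalone Docker engine that status is only a signal: the restart policy reacts to the container exiting, not to an unhealthy state, which is why this platform adds the watchdog described below.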
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost/health"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 40s
restart: unless-stopped
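The key detail: restart: unless-stopped means Docker restarts the container automatically whenever it exits, unless it was stopped manually, without any human intervention. It does not restart a container that is still running but marked unhealthy; that case is left to the watchdog.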
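The health check settings also determine how long a broken service can go unnoticed. The sketch below is a rough estimate, assuming each check starts one interval after the previous one finishes and that a failing check may run until the timeout; the function name detection_window is hypothetical.

```python
from typing import Tuple


def detection_window(interval: float, timeout: float, retries: int) -> Tuple[float, float]:
    """Estimate (best, worst) seconds from a failure until the container is unhealthy."""
    # Best case: the service breaks just before a check and every check fails
    # instantly, so only the gaps between the `retries` checks count.
    best = (retries - 1) * interval
    # Worst case: the service breaks right after a passing check and every
    # failing check runs until it hits the timeout.
    worst = retries * (interval + timeout)
    return best, worst


best, worst = detection_window(interval=30, timeout=10, retries=3)
print(f"unhealthy after roughly {best:.0f}s to {worst:.0f}s")
# prints: unhealthy after roughly 60s to 120s
```

With the values above, Docker may take up to about two minutes to report the container as unhealthy, so shorter intervals are worth considering if faster detection matters more than probe overhead.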
The Watchdog Service
Docker's built-in restart policy handles process crashes. The watchdog handles application-level failures: situations where the process is running but the service is broken. It polls each container's health endpoint, and when it detects an unhealthy state, it triggers a controlled container restart via the Docker API.
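# Watchdog logic (simplified)
import time

while True:
    status = check_health("http://app:80/health")  # HTTP status code of the probe
    if status != 200:
        log_incident(time.time(), status)   # record when the failure was seen
        docker.restart("app-container")     # controlled restart via the Docker API
        alert_prometheus(incident=True)     # record the recovery for Prometheus
    time.sleep(30)  # poll every 30 seconds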
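The loop above can be fleshed out into something runnable. The following is a sketch under stated assumptions, not the project's actual watchdog: it uses only the standard library for the probe, takes the restart action as a callable so it can be tested without Docker, and adds a consecutive-failure threshold (a variation on the simplified loop, which restarts on the first failed probe). The names watch, failure_threshold, and max_polls are hypothetical.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Optional


def check_health(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status code, or None if the service is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with a 4xx/5xx status
    except (urllib.error.URLError, OSError):
        return None  # connection refused, DNS failure, or timeout


def watch(
    probe: Callable[[], Optional[int]],
    restart: Callable[[], None],
    failure_threshold: int = 3,
    poll_seconds: float = 30.0,
    max_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll probe() and call restart() after failure_threshold consecutive failures.

    Returns the number of restarts triggered. max_polls=None runs forever.
    """
    failures = 0
    restarts = 0
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        if probe() == 200:
            failures = 0
        else:
            failures += 1
            if failures >= failure_threshold:
                restart()
                restarts += 1
                failures = 0  # give the restarted service a clean slate
        sleep(poll_seconds)
    return restarts
```

Assuming the Docker SDK for Python is installed, the restart callable could be `lambda: docker.from_env().containers.get("app-container").restart()`, and the probe `lambda: check_health("http://app:80/health")`. The threshold avoids restarting on a single slow response, at the cost of slower recovery.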
Observability: Proving It Works
Self-healing is only trustworthy if you can observe it. I wired Prometheus to scrape a custom metric, app_recovery_events_total: a counter that increments every time the watchdog triggers a recovery. Grafana shows this as a timeline, so you can see exactly when failures occurred and how long recovery took.
# PromQL: recovery events over time
rate(app_recovery_events_total[5m])
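The post does not show how the counter is exported, so the following is only an illustration of what Prometheus would scrape. In practice the prometheus_client library is the usual choice; this standard-library sketch uses a hypothetical RecoveryCounter class to show the counter semantics and the text exposition format.

```python
import threading


class RecoveryCounter:
    """Thread-safe counter that only ever goes up, as Prometheus counters must."""

    def __init__(self, name: str, help_text: str) -> None:
        self.name = name
        self.help_text = help_text
        self._value = 0
        self._lock = threading.Lock()

    def inc(self) -> None:
        """Record one recovery event."""
        with self._lock:
            self._value += 1

    def render(self) -> str:
        """Render the metric in the Prometheus text exposition format."""
        with self._lock:
            value = self._value
        return (
            f"# HELP {self.name} {self.help_text}\n"
            f"# TYPE {self.name} counter\n"
            f"{self.name} {value}\n"
        )


recoveries = RecoveryCounter(
    "app_recovery_events_total", "Recoveries triggered by the watchdog."
)
recoveries.inc()  # the watchdog would call this after each restart
print(recoveries.render(), end="")
```

Served from a /metrics endpoint, this text is what the rate() query above operates on: Prometheus stores each scraped value and computes the per-second increase over the 5-minute window.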
Vagrant: Clean Test Environment
The entire platform runs inside a Vagrant VM, making it fully reproducible. Destroy it, rebuild it, and the platform behaves identically. This is intentional: the goal is to simulate a real server without touching your host machine.
vagrant up
# → Grafana: http://localhost:3000
# → Prometheus: http://localhost:9090
# → App: http://localhost:8080
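After a rebuild, a quick smoke test can confirm that the three forwarded ports respond. This is a sketch assuming the URLs listed above and a host with Python 3; the probe and smoke_test helpers are hypothetical and not part of the project.

```python
import urllib.error
import urllib.request
from typing import Dict, Optional

# URLs taken from the port list above.
ENDPOINTS = {
    "Grafana": "http://localhost:3000",
    "Prometheus": "http://localhost:9090",
    "App": "http://localhost:8080",
}


def probe(url: str, timeout: float = 3.0) -> Optional[int]:
    """Return the HTTP status code, or None if nothing answers."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # reachable, but answered with an error status
    except (urllib.error.URLError, OSError):
        return None  # port closed, VM not up yet, or request timed out


def smoke_test(endpoints: Dict[str, str]) -> Dict[str, Optional[int]]:
    """Probe every endpoint and map its name to the observed status."""
    return {name: probe(url) for name, url in endpoints.items()}


if __name__ == "__main__":
    for name, status in smoke_test(ENDPOINTS).items():
        print(f"{name:<11} {'unreachable' if status is None else status}")
```

Running it right after vagrant up, and again after a destroy and rebuild, gives a simple check that the environment comes back the same way each time.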
Key Lessons
- Health checks must test application behavior, not just process state: a running container can still be serving errors
- Restart policies alone aren't enough for application-level failures; you need a watchdog
- You can't trust self-healing without observability; metrics prove recovery happened
- Vagrant gives you a real Linux environment for free; use it instead of testing on your host
What's Next
- Kubernetes migration: replacing Docker Compose with K8s liveness/readiness probes
- Slack alerting when recovery events exceed threshold
- Multi-service recovery with dependency ordering