WhatsApp

๐Ÿณ Docker ยท DevOps

Building a Production Self-Healing Container Platform

Production infrastructure fails. Containers crash, processes die, disks fill up. The question isn't whether something will break โ€” it's whether your system recovers automatically or waits for a human to notice at 2am. This project builds a self-healing container platform: a production-style environment where services detect failures and restart themselves, with full observability to prove it's working.

What "Self-Healing" Actually Means

Self-healing infrastructure has three layers:

Most tutorials cover layer one. This project implements all three.

Architecture

Vagrant VM โ†’ isolated Linux environment (mirrors a real server) โ””โ”€โ”€ Docker Compose โ†’ orchestrates all services โ”œโ”€โ”€ App Container โ†’ web service with HTTP health endpoint โ”œโ”€โ”€ Watchdog โ†’ detects unhealthy containers, triggers recovery โ”œโ”€โ”€ Prometheus โ†’ scrapes health metrics โ””โ”€โ”€ Grafana โ†’ real-time recovery dashboards

Docker Health Checks

A Docker health check runs a command inside the container at a defined interval. If it fails a set number of times, Docker marks the container unhealthy โ€” and the restart policy kicks in.

healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost/health"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 40s
restart: unless-stopped

The key detail: restart: unless-stopped means Docker restarts the container automatically when it exits or goes unhealthy โ€” without any human intervention.

The Watchdog Service

Docker's built-in restart policy handles process crashes. The watchdog handles application-level failures โ€” situations where the process is running but the service is broken. It polls each container's health endpoint, and when it detects an unhealthy state, it triggers a controlled container restart via the Docker API.

# Watchdog logic (simplified)
while True:
    status = check_health("http://app:80/health")
    if status != 200:
        log_incident(timestamp, status)
        docker.restart("app-container")
        alert_prometheus(incident=True)
    time.sleep(30)

Observability โ€” Proving It Works

Self-healing is only trustworthy if you can observe it. I wired Prometheus to scrape a custom metric: app_recovery_events_total โ€” a counter that increments every time the watchdog triggers a recovery. Grafana shows this as a timeline, so you can see exactly when failures occurred and how long recovery took.

# PromQL โ€” recovery events over time
rate(app_recovery_events_total[5m])

Vagrant โ€” Clean Test Environment

The entire platform runs inside a Vagrant VM, making it fully reproducible. Destroy it, rebuild it, the platform behaves identically. This is intentional โ€” the goal is to simulate a real server without touching your host machine.

vagrant up
# โ†’ Grafana:    http://localhost:3000
# โ†’ Prometheus: http://localhost:9090
# โ†’ App:        http://localhost:8080

Key Lessons

What's Next

Work With Me

Need help implementing any of this?

I offer consulting for Linux, AWS, CI/CD, Docker, Terraform, and monitoring setup. Let's talk.

Book Free Consultation WhatsApp โ†’