Self-Hosted Monitoring Stack with Prometheus, Grafana, Loki & Ansible

Modern DevOps is built on one principle: if you do it twice, automate it. This project turns a standard Xubuntu laptop into a fully monitored infrastructure node — deploying Prometheus, Grafana, Loki, and Alertmanager end-to-end using Vagrant, Docker Compose, and Ansible. No manual steps, no clicking through UIs — one command and the entire stack is live.

The Problem With Manual Monitoring Setup

Every team eventually builds a monitoring stack. Most build it manually: install Prometheus here, configure Grafana there, wire Alertmanager separately, then document none of it. The result is a stack nobody can reproduce, nobody can debug, and nobody trusts. When the monitoring server dies, the monitoring setup dies with it.

This project solves that. Every component is defined as code. The entire stack is reproducible in minutes on any machine.

Architecture — Three-Tier Observability

Node Exporter → exposes host metrics (CPU, RAM, disk I/O) on :9100
Prometheus → scrapes exporters, stores time series, published on :9091
Grafana → queries Prometheus, renders dashboards on :3000
Loki → collects and indexes log streams
Alertmanager → routes alerts to Slack / Email / PagerDuty

Each component runs as a Docker container. Docker Compose defines their relationships, volumes, and restart policies. Ansible provisions the host, installs dependencies, copies configs, and fires up the stack — idempotently.
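As a rough illustration of what "relationships, volumes, and restart policies" can look like in Compose, here is a minimal sketch. It is not the project's actual file: the service names, named volumes, and the alertmanager.yml path are assumptions, while the images and ports are the upstream defaults or the ones used elsewhere in this post.

```yaml
# docker-compose.yml (illustrative sketch, not the project's actual file)
services:
  prometheus:
    image: prom/prometheus:latest
    restart: unless-stopped
    ports:
      - "9091:9090"          # host 9091 -> container 9090, as used in this post
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./alert_rules.yml:/etc/prometheus/alert_rules.yml:ro
      - prometheus_data:/prometheus        # assumed named volume for the TSDB

  grafana:
    image: grafana/grafana:latest
    restart: unless-stopped
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana      # assumed named volume for dashboards
    depends_on:
      - prometheus

  loki:
    image: grafana/loki:latest
    restart: unless-stopped
    ports:
      - "3100:3100"          # Loki's default HTTP port

  alertmanager:
    image: prom/alertmanager:latest
    restart: unless-stopped
    ports:
      - "9093:9093"          # Alertmanager's default port
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro   # assumed path

volumes:
  prometheus_data:
  grafana_data:
```

With a layout like this, containers on the same Compose network can address each other by service name, so Grafana's data source URL would be http://prometheus:9090 rather than a host port.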

The Networking Challenge

The biggest technical hurdle: when Prometheus runs inside Docker, localhost refers to the container itself, not the host machine. Node Exporter runs on the host and exposes metrics on port 9100 — but the container can't reach it via localhost:9100.

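The fix: use the Docker bridge gateway IP (172.17.0.1 by default; docker network inspect bridge shows the actual value) as the scrape target in prometheus.yml. From inside the container, that address is the host's end of the bridge, so traffic sent to it reaches services listening on the host. UFW also needed an explicit allow rule for port 9100 from the Docker bridge subnet: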

ufw allow from 172.17.0.0/16 to any port 9100
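A minimal prometheus.yml that applies this fix might look like the sketch below. The job names and intervals are assumptions; the gateway address and the rule file path are the ones used in this post.

```yaml
# prometheus.yml (sketch; job names and intervals are assumptions)
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - /etc/prometheus/alert_rules.yml     # path the playbook mounts into the container

scrape_configs:
  # Prometheus scraping itself; here localhost correctly means the container
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]

  # Node Exporter on the host, reached through the Docker bridge gateway
  - job_name: node
    static_configs:
      - targets: ["172.17.0.1:9100"]
```

If the target shows as DOWN on the Prometheus targets page, the firewall rule above is the first thing to check.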

The Ansible Playbook

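The full stack (Node Exporter, Prometheus, Grafana, Loki, Alertmanager) is deployed through a single Ansible playbook. The modules are declarative, so the playbook is meant to be idempotent: run it once or run it ten times, and the end state is the same. The excerpt below shows the Node Exporter, Prometheus, and Grafana tasks.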

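- name: Deploy Monitoring Stack
  hosts: localhost
  connection: local
  become: true
  tasks:
    # Host metrics exporter, published on host port 9100
    - name: Run Node Exporter
      community.docker.docker_container:
        name: node-exporter
        image: prom/node-exporter:latest
        state: started
        restart_policy: always
        ports: ["9100:9100"]

    # Prometheus is published on host port 9091 (container port 9090)
    - name: Run Prometheus
      community.docker.docker_container:
        name: prometheus
        image: prom/prometheus:latest
        state: started
        # Forces a fresh container on every run so config edits are picked up;
        # as a result this task reports "changed" each time
        recreate: true
        volumes:
          # Absolute host paths; assumes the config files sit next to the playbook
          - "{{ playbook_dir }}/prometheus.yml:/etc/prometheus/prometheus.yml:ro"
          - "{{ playbook_dir }}/alert_rules.yml:/etc/prometheus/alert_rules.yml:ro"
        ports: ["9091:9090"]

    # Dashboards on host port 3000
    - name: Run Grafana
      community.docker.docker_container:
        name: grafana
        image: grafana/grafana:latest
        state: started
        ports: ["3000:3000"]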
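The remaining pieces could follow the same pattern as the tasks above. The sketch below is an assumption about how they might look, not the project's actual tasks: it expresses the firewall rule with the community.general.ufw module (restricted to TCP, which the original command does not specify) and starts Loki and Alertmanager on their default ports.

```yaml
# Illustrative sketch; task names and the proto restriction are assumptions
- name: Supporting tasks for the monitoring stack
  hosts: localhost
  connection: local
  become: true
  tasks:
    # Same rule as the ufw command earlier, but managed by Ansible
    - name: Allow the Docker bridge subnet to reach Node Exporter
      community.general.ufw:
        rule: allow
        from_ip: 172.17.0.0/16
        to_port: "9100"
        proto: tcp

    - name: Run Loki
      community.docker.docker_container:
        name: loki
        image: grafana/loki:latest
        state: started
        restart_policy: always
        ports: ["3100:3100"]     # Loki's default HTTP port

    - name: Run Alertmanager
      community.docker.docker_container:
        name: alertmanager
        image: prom/alertmanager:latest
        state: started
        restart_policy: always
        ports: ["9093:9093"]     # Alertmanager's default port
```

Managing the firewall rule in the playbook keeps it reproducible along with the rest of the stack instead of leaving it as a one-off shell command.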

Proactive Alerting

Monitoring without alerting is just a dashboard you have to stare at. I added a custom Prometheus alerting rule: if CPU usage stays above 85% for 2 minutes, Prometheus fires a critical alert and hands it to Alertmanager for routing. This moves the stack from passive observation to active incident response.

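groups:
  - name: host_alerts   # group name is a placeholder
    rules:
      - alert: HighCPUUsage
        # Non-idle CPU percentage per instance, averaged over the last 5 minutes
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "CPU usage above 85% for 2 minutes"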
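The rule above only decides when an alert fires; where it goes is up to Alertmanager. One possible shape for that routing is sketched below, assuming a Slack incoming webhook. The receiver names, channel, timing values, and webhook URL are all placeholders, not values from this project.

```yaml
# alertmanager.yml (sketch; receiver names, channel, and URL are placeholders)
route:
  receiver: default                  # fallback for anything not matched below
  group_by: ["alertname", "instance"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Send alerts labelled severity=critical (like HighCPUUsage) to Slack
    - matchers:
        - severity="critical"
      receiver: slack-critical

receivers:
  - name: default                    # no integrations: alerts stay visible in the UI only
  - name: slack-critical
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE/WITH/WEBHOOK
        channel: "#alerts"
        send_resolved: true          # also notify when the alert clears
```

Prometheus also needs an alerting section in prometheus.yml pointing at the Alertmanager address before any of this routing takes effect.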

Vagrant — Reproducible Test Environment

The stack runs inside a Vagrant-managed VM, giving a clean, reproducible environment that mirrors a real server. Spin it up, test the stack, destroy it, spin it up again — zero residue on your host machine.

vagrant up          # provision the VM
ansible-playbook monitoring.yml   # deploy the stack
# → Grafana live at http://localhost:3000

Key Takeaways

Infrastructure is code. Never manually configure what you can define in a playbook.
Docker networking has gotchas. Always check gateway IPs when containers need to reach the host.
Firewalls matter. UFW blocking port 9100 cost the first 30 minutes of debugging, so check it early.
You don't need a cloud budget. This entire stack runs on a Dell Latitude with 8 GB of RAM.

What's Next

Loki + Promtail for centralized log aggregation
Slack webhook integration for alert delivery
Terraform provisioning for cloud deployment
Multi-node scraping across a homelab cluster
