infrastructure

Observability Stack

Designer & operator · 2023–Present · 49 containers, bare metal
Grafana · Prometheus · Loki · Alloy · node_exporter · process_exporter · smartctl_exporter

Why It Matters

Running 49 containers without observability is just guessing. You don’t know what’s wrong until it’s been wrong for a while. At this scale, the observability stack isn’t a nice-to-have — it’s the difference between knowing your system is healthy and finding out after it’s been down for hours.

The Stack

Three pillars, one dashboard:

  • Prometheus — metrics. CPU, memory, disk, network, container restarts, custom application metrics. Stored on disk in the local TSDB with a 30-day retention window.
  • Loki — logs. Structured log aggregation from all containers. Label-based indexing instead of full-text search — faster, smaller storage.
  • Grafana — the view layer. Dashboards for everything: infrastructure health, per-container metrics, log streams, alert status.
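The retention window is a Prometheus launch flag, not a config-file setting. A minimal invocation might look like this (the config and data paths are placeholders, not the actual deployment):

```
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/prometheus \
  --storage.tsdb.retention.time=30d
```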

Alloy is the collector. It replaces Prometheus’ native scrape config and Pushgateway, pulling metrics from exporters and forwarding them to the right backend.
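A minimal Alloy pipeline for the metrics side could be sketched like this, using Alloy's embedded node_exporter component and remote-writing into Prometheus. The endpoint URL is a placeholder, and Prometheus must be started with --web.enable-remote-write-receiver for the write path to work:

```alloy
// Embedded node_exporter (host CPU, memory, disk, network).
prometheus.exporter.unix "node" { }

// Scrape the exporter's targets and hand samples to remote_write.
prometheus.scrape "node" {
  targets    = prometheus.exporter.unix.node.targets
  forward_to = [prometheus.remote_write.local.receiver]
}

// Remote-write into the local Prometheus instance.
prometheus.remote_write "local" {
  endpoint {
    url = "http://prometheus:9090/api/v1/write"
  }
}
```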

Exporters

  • node_exporter — host-level metrics (CPU, memory, disk I/O, network)
  • process_exporter — per-process resource usage
  • smartctl_exporter — disk health (SMART attributes, temperature, reallocated sectors)

The last one is the most important. A failing disk doesn’t give you a graceful degradation — it gives you silence, then data loss. smartctl_exporter catches it before that happens.
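Two alert expressions that capture this, assuming the metric names exposed by the prometheus-community smartctl_exporter (verify against your own /metrics endpoint before relying on them):

```promql
# SMART overall health check no longer passing (1 = passed).
smartctl_device_smart_status == 0

# Raw count of reallocated sectors is nonzero — replace the disk.
smartctl_device_attribute{attribute_name="Reallocated_Sector_Ct",
                          attribute_value_type="raw"} > 0
```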

Alerting

Grafana alerts on threshold breaches: disk usage above 85%, container restart loops, exporters that stop responding. The alerts go to my phone through a simple webhook. It’s not PagerDuty, but it works.
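The receiving side of a webhook like this can be a few lines of code. As a sketch: the payload fields below follow Grafana's documented unified-alerting webhook format (a top-level "status" plus a list of "alerts" with their labels), but the function itself is illustrative, not the actual receiver:

```python
def format_alert(payload: dict) -> str:
    """Condense a Grafana webhook payload into a short phone notification.

    Grafana's unified alerting posts a JSON body with a top-level
    "status" ("firing" or "resolved") and a list of "alerts", each
    carrying its labels. Only those two fields are used here.
    """
    lines = [f"[{payload.get('status', 'unknown').upper()}]"]
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        name = labels.get("alertname", "unnamed alert")
        instance = labels.get("instance", "unknown host")
        lines.append(f"{name} on {instance}")
    return "\n".join(lines)
```

Keeping the formatting in a pure function like this makes it trivial to test without standing up an HTTP server.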

What I’d Add Next

Long-term storage (Thanos or Cortex) for historical analysis. Right now 30 days is enough, but patterns only show up over months.

contact

Pick a channel.