infrastructure

Observability Stack

Designer & operator · 2023–Present · 49 containers, bare metal
Grafana · Prometheus · Loki · Alloy · node_exporter · process_exporter · smartctl_exporter

Why It Matters

Running 49 containers without observability is just guessing. You don’t know what’s wrong until it’s been wrong for a while. At this scale, the observability stack isn’t a nice-to-have — it’s the difference between knowing your system is healthy and finding out after it’s been down for hours.

The Stack

Three pillars, one dashboard:

  • Prometheus — metrics. CPU, memory, disk, network, container restarts, custom application metrics. Stored on disk in the local TSDB with a 30-day retention window.
  • Loki — logs. Structured log aggregation from all containers. Label-based indexing instead of full-text search — faster, smaller storage.
  • Grafana — the view layer. Dashboards for everything: infrastructure health, per-container metrics, log streams, alert status.
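The retention window is a Prometheus launch flag, not a config-file setting. A minimal invocation might look like this (the config and data paths are placeholders, not the actual deployment):

```
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/prometheus \
  --storage.tsdb.retention.time=30d
```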

Alloy is the collector. It replaces Prometheus’ native scrape config and Pushgateway, pulling metrics from exporters and forwarding them to the right backend.
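A minimal Alloy pipeline for the metrics side could be sketched like this, using Alloy's embedded node_exporter component and remote-writing into Prometheus. The endpoint URL is a placeholder, and Prometheus must be started with --web.enable-remote-write-receiver for the write path to work:

```alloy
// Embedded node_exporter (host CPU, memory, disk, network).
prometheus.exporter.unix "node" { }

// Scrape the exporter's targets and hand samples to remote_write.
prometheus.scrape "node" {
  targets    = prometheus.exporter.unix.node.targets
  forward_to = [prometheus.remote_write.local.receiver]
}

// Remote-write into the local Prometheus instance.
prometheus.remote_write "local" {
  endpoint {
    url = "http://prometheus:9090/api/v1/write"
  }
}
```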

Exporters

  • node_exporter — host-level metrics (CPU, memory, disk I/O, network)
  • process_exporter — per-process resource usage
  • smartctl_exporter — disk health (SMART attributes, temperature, reallocated sectors)

The last one is the most important. A failing disk doesn’t give you a graceful degradation — it gives you silence, then data loss. smartctl_exporter catches it before that happens.
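Two alert expressions that capture this, assuming the metric names exposed by the prometheus-community smartctl_exporter (verify against your own /metrics endpoint before relying on them):

```promql
# SMART overall health check no longer passing (1 = passed).
smartctl_device_smart_status == 0

# Raw count of reallocated sectors is nonzero — replace the disk.
smartctl_device_attribute{attribute_name="Reallocated_Sector_Ct",
                          attribute_value_type="raw"} > 0
```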

Alerting

Grafana alerts on threshold breaches: disk usage above 85%, container restart loops, exporters that stop responding. The alerts go to my phone through a simple webhook. It’s not PagerDuty, but it works.
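The receiving side of a webhook like this can be a few lines of code. As a sketch: the payload fields below follow Grafana's documented unified-alerting webhook format (a top-level "status" plus a list of "alerts" with their labels), but the function itself is illustrative, not the actual receiver:

```python
def format_alert(payload: dict) -> str:
    """Condense a Grafana webhook payload into a short phone notification.

    Grafana's unified alerting posts a JSON body with a top-level
    "status" ("firing" or "resolved") and a list of "alerts", each
    carrying its labels. Only those two fields are used here.
    """
    lines = [f"[{payload.get('status', 'unknown').upper()}]"]
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        name = labels.get("alertname", "unnamed alert")
        instance = labels.get("instance", "unknown host")
        lines.append(f"{name} on {instance}")
    return "\n".join(lines)
```

Keeping the formatting in a pure function like this makes it trivial to test without standing up an HTTP server.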

What I’d Add Next

Long-term storage (Thanos or Cortex) for historical analysis. Right now 30 days is enough, but patterns only show up over months.

contact

Pick a channel.