
AI Node

Builder and operator, 2024–Present. Local-first, zero telemetry.
CUDA · llama.cpp · Open WebUI · Python

Why Local

The line between convenience and surveillance is thinner than most people think. Every prompt you send to a cloud LLM is logged, analyzed, and potentially used for training. For work-related queries — architecture decisions, code review, process improvements — that’s a real concern.

Modern GPUs are capable enough, and models like Llama have made local inference practical. The setup isn’t complicated.

Setup

A llama.cpp server running on a CUDA GPU, with Open WebUI as the frontend. The stack is intentionally simple: one inference server, one UI, zero orchestration. No vLLM, no Triton, no Ray. That machinery earns its keep when you're serving a cluster, not when you're serving yourself.
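
Concretely, a minimal sketch of how that wiring looks, assuming a current llama.cpp build whose server binary is llama-server; the model path, port, and layer count are placeholders rather than the exact config:

    # Sketch: start llama-server and leave Open WebUI pointed at it.
    # Model path, port, and layer count are placeholders, not the real setup.
    import subprocess

    server = subprocess.Popen([
        "llama-server",
        "-m", "/models/example-8b-q4_k_m.gguf",  # any GGUF quant
        "--host", "127.0.0.1",
        "--port", "8080",
        "-ngl", "99",    # offload all layers to the CUDA GPU
        "-c", "8192",    # context window
    ])
    # Open WebUI then talks to the OpenAI-compatible endpoint at
    # http://127.0.0.1:8080/v1; no other services required.
    server.wait()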

The models are quantized to fit in VRAM with acceptable quality loss. GGUF format, mostly Q4_K_M quantization. Good enough for reasoning, code review, and drafting. Not good enough for anything that needs precision, but that’s not the use case.
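
The fit question is simple arithmetic: Q4_K_M comes out to roughly 4.8 bits per weight, plus a KV cache that grows with context length. A rough sketch of the estimate, with placeholder cache and card sizes rather than measured figures:

    # Back-of-envelope VRAM estimate; all numbers are approximations.
    def estimate_vram_gb(params_billion, bits_per_weight=4.8, kv_cache_gb=1.5):
        """Weights at the quantized bit-width plus a KV-cache allowance."""
        # (params_billion * 1e9 * bits / 8) bytes == params_billion * bits / 8 GB
        weights_gb = params_billion * bits_per_weight / 8
        return weights_gb + kv_cache_gb

    # e.g. an 8B model at Q4_K_M: ~4.8 GB of weights, ~6.3 GB total,
    # which leaves headroom on a 24 GB card (the card size is a placeholder).
    print(estimate_vram_gb(8))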

What Works

  • Code review and refactoring suggestions (see the sketch after this list)
  • Drafting technical documents and emails
  • Explaining unfamiliar codebases
  • Brainstorming architecture approaches
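
To make the first item concrete, this is roughly the request Open WebUI ends up sending; the same OpenAI-compatible endpoint can also be called directly, assuming llama-server's default port of 8080 and a placeholder diff:

    # Sketch: send a diff to the local server for review. Endpoint and port
    # assume llama-server defaults; the diff content is a placeholder.
    import requests

    diff = """--- a/retry.py
    +++ b/retry.py
    -    time.sleep(1)
    +    time.sleep(2 ** attempt)
    """

    resp = requests.post(
        "http://127.0.0.1:8080/v1/chat/completions",
        json={
            "model": "local",  # llama-server serves whatever model it loaded
            "messages": [
                {"role": "system", "content": "You are a terse code reviewer."},
                {"role": "user", "content": f"Review this diff:\n{diff}"},
            ],
            "temperature": 0.2,
        },
        timeout=120,
    )
    print(resp.json()["choices"][0]["message"]["content"])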

What Doesn’t

  • Anything requiring factual accuracy without verification
  • Long-context tasks beyond the model’s window
  • Anything you’d trust without reading the output first

The local-first constraint means I can experiment freely. If a model hallucinates or gives bad advice, the exchange never leaves the machine. That freedom is the whole point.
