I built local-llm because I wanted Pi to feel local, usable, and not annoyingly fragile.
More specifically, I wanted a coding-agent setup that could run against a local llama.cpp server on my 3090, come up inside a Nix shell, reuse one shared model server across terminals, and still keep the nicer Pi ergonomics I actually wanted: wrappers, model config, wiki support, context-mode, subagents, and a few task-specific local helpers.
That sounds tidy when I say it like that now. It was not tidy while I was putting it together.
The repo is small, but the job it is doing is pretty specific: take a general agent stack and turn it into a repeatable local workstation setup with predictable runtime behavior.
The first commit landed on 2026-05-13, and the whole visible history here runs through 2026-05-27. In that short window I went from basic Pi config cleanup to local wrappers, model sync, a shared llama-server, LLM Wiki integration, context-mode, subagents, and project-specific skills. The shape of the work is pretty clear in hindsight: first get the environment under control, then make local model execution practical, then reduce the amount of repeated fuss around sessions, tooling, and repo-specific workflows.
At a practical level, this repo gives me a local coding-agent environment with a few important properties:
- models.json defines the local model, server settings, and llama arguments
- bin/pi-sync downloads GGUF weights and writes Pi provider config
- bin/pi-local makes Pi talk to a shared local llama-server
- bin/pi-status shows whether that shared server and its client sessions are alive

The center of gravity is flake.nix, which is very blunt about the goal: “Local coding environment for pi + llama.cpp on a 3090.” That is exactly what this repo is.
The shell pins the environment I actually needed:
- nodejs_24, deno, jq, curl
- poppler-utils and docx2txt for document extraction work
- pi from llm-agents
- llama.cpp
- LD_LIBRARY_PATH so the NVIDIA driver libraries are usable at runtime

That last part is a good example of the whole project. This is not a generic “install some packages” repo. It is a pile of decisions aimed at making the local setup actually run on the machine I have.
What I wanted was not just “an LLM running locally.” That part is easy to say and surprisingly fussy to live with.
I wanted a setup where:
- multiple terminals share one llama-server instead of each starting their own copy

That is the useful part of this repo to me. It is not just “run a model.” It is “make a local model part of a coding workflow that I would actually keep using.”
After a few iterations, I settled on a few bounded layers.
flake.nix handles the shell, binary availability, CUDA-enabled llama.cpp, and the environment variables Pi expects.
That includes things like:
- LLAMA_CPP_API_BASE
- PI_SKIP_VERSION_CHECK
- IN_LOCAL_LLM
- LOCAL_LLM_PI_BIN
- .npm

I wanted the shell to be opinionated enough that entering it told me what mattered: Pi version, llama-server version, where auth comes from, and which wrapper scripts control the local model setup.
models.json is the source of truth for the local model.
Right now it defines:
- llama-server arguments I wanted for this machine

bin/pi-sync turns that into two concrete outputs:
- ./llama.cpp/models
- .pi/agent/models.json

That matters because drifting config is an easy way to waste time. If the server thinks one thing, Pi thinks another thing, and the filesystem contains a third thing, local inference gets annoying fast.
This is the part that made the setup feel less dumb.
bin/pi-local and bin/_pi-shared manage one shared llama-server across multiple sessions using a pid file, per-session client files, a lock directory, and stale-session pruning.
I like this part because it solves a very practical problem in a very small amount of shell. I did not want every terminal to boot its own model server, and I did not want shutdown behavior to depend on exiting in the right order like some cursed little ritual.
bin/pi-status gives me the quick read: root, state dir, API base, log file, server pid, readiness, and active sessions.
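To make that concrete, here is a minimal sketch of how a status read like this can work. It is not the actual bin/pi-status: the state directory layout (a server.pid file plus one file per session under clients/) and the function name print_status are assumptions for illustration, and it only covers the pid, liveness, and session-count parts of the report.

```shell
# print_status STATE_DIR
# Hypothetical layout: STATE_DIR/server.pid holds the server pid,
# STATE_DIR/clients/ holds one file per active session.
print_status() {
  local dir=$1 pid="" state="stopped" sessions=0 f

  if [ -f "$dir/server.pid" ]; then
    pid=$(cat "$dir/server.pid")
  fi

  # kill -0 sends no signal; it only checks that the pid exists.
  if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
    state="running"
  fi

  # Count session files; an unmatched glob stays literal, hence the -e test.
  for f in "$dir"/clients/*; do
    if [ -e "$f" ]; then
      sessions=$((sessions + 1))
    fi
  done

  printf 'state dir:  %s\nserver pid: %s (%s)\nsessions:   %d\n' \
    "$dir" "${pid:-none}" "$state" "$sessions"
}
```

Keeping the whole check to files and kill -0 means the status command never has to talk to the server to answer the basic question of whether anything is alive.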
The repo also configures Pi itself instead of stopping at infrastructure.
.pi/agent/settings.json shows the broader shape: remote models are still enabled, but llama-cpp/qwen3.6-27b-q5_k_m is part of the model set alongside Copilot and Gemini options. The package list adds the things I clearly wanted available in this project:
Then there are local agents and skills layered on top, like local-edit, local-compress, tone, and git-narrative.
That is an important part of the repo’s identity. I was not just trying to host inference locally. I was trying to make local inference live inside the same actual working environment I use for coding and writing.
The history is short, but it is very readable.
The first commit is “Tidy pi config” on 2026-05-13.
That is a good starting signal. Before adding cleverer local-model behavior, I first tightened the Pi side of the configuration. That tracks with how I usually end up doing this kind of work: make the defaults less sloppy before adding more moving pieces.
The next clear step is “Adding local wrappers to manage pi.”
This looks like the point where the project became a real local environment instead of just a configured agent directory. bin/pi, bin/pi-sync, models.json, and flake.nix all move together. That tells the story pretty well: I needed a controlled entrypoint, a model source of truth, and a way to generate the Pi-facing config from it.
A few quick commits then flesh out the setup:
- Adding llm wiki
- Switching to qwen 3.6
- De-weaving llama
- Run pi-local from cwd

This is the stage where the repo stops being just a launcher and starts becoming an actual working environment. The model choice changes, the wrappers get less awkward, and the wiki shows up as part of the workflow rather than as an afterthought.
The biggest functional shift comes on 2026-05-26:
- Adding shared server
- Ensuring llama-server is shared between sessions
- Adding pi wrapper, adding context-mode, adding sub-agents

This is the point where I stopped tolerating wasteful or brittle session behavior and made the server lifecycle explicit.
That shows up directly in the shell code. pi-local registers a session file, reuses or starts the shared server under a lock, and removes its session file on exit. The last exiting session shuts the server down. Crashed shells get cleaned up by checking recorded pids.
That is not glamorous work, but it is the difference between a local setup I occasionally demo and one I would actually leave running across multiple terminals.
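The lifecycle described above (register a session, reuse or start the server under a lock, stop it when the last session leaves, prune sessions whose shell died) can be sketched in a few functions. This is an illustration, not the code in bin/pi-local or bin/_pi-shared: STATE_DIR, SERVER_CMD, the clients/ directory, server.pid, and the lock/ directory are all names I made up for the example, and sleep stands in for the real llama-server command.

```shell
# Hypothetical names throughout; not the repo's actual layout.
STATE_DIR="${STATE_DIR:-${TMPDIR:-/tmp}/pi-shared-demo}"
SERVER_CMD="${SERVER_CMD:-sleep 3600}"   # stand-in for the llama-server command

with_lock() {
  # mkdir is atomic, so a directory works as a simple mutex.
  # (A real script would also need to handle a lock left by a crashed holder.)
  mkdir -p "$STATE_DIR"
  until mkdir "$STATE_DIR/lock" 2>/dev/null; do sleep 0.1; done
  "$@"; local rc=$?
  rmdir "$STATE_DIR/lock"
  return "$rc"
}

prune_stale_clients() {
  # Each client file is named after the pid of the shell that registered it.
  local f
  for f in "$STATE_DIR"/clients/*; do
    [ -e "$f" ] || continue
    kill -0 "$(basename "$f")" 2>/dev/null || rm -f "$f"
  done
}

server_alive() {
  [ -f "$STATE_DIR/server.pid" ] &&
    kill -0 "$(cat "$STATE_DIR/server.pid")" 2>/dev/null
}

_attach() {
  mkdir -p "$STATE_DIR/clients"
  prune_stale_clients
  if ! server_alive; then
    # Intentionally unquoted so the command string splits into words.
    $SERVER_CMD >"$STATE_DIR/server.log" 2>&1 &
    echo $! >"$STATE_DIR/server.pid"
  fi
  : >"$STATE_DIR/clients/$1"
}

_detach() {
  rm -f "$STATE_DIR/clients/$1"
  prune_stale_clients
  # Last session out stops the server.
  if [ -z "$(ls -A "$STATE_DIR/clients" 2>/dev/null)" ] && server_alive; then
    kill "$(cat "$STATE_DIR/server.pid")"
    rm -f "$STATE_DIR/server.pid"
  fi
}

attach_session() { with_lock _attach "${1:-$$}"; }
detach_session() { with_lock _detach "${1:-$$}"; }
```

A wrapper would call attach_session on startup and register detach_session with trap ... EXIT, so a normal exit unregisters the session, while a crashed shell is caught the next time any other session prunes stale pids.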
The last visible pass adds more project-specific behavior:
- Fix local-compress agent
- Adding local-edit bot, adding startup timer
- Adding tone skill, additional doc extraction tools

This part is small but telling. Once the environment and server lifecycle were mostly under control, I started shaping the local model toward actual tasks: targeted edits, dense-note compression, tone matching, and repo-history narrative work.
That is where the project stops feeling like “local inference infrastructure” and starts feeling like a local assistant environment.
A few friction points show up pretty clearly in the repo.
This was one of the main practical problems.
Getting a local model to answer one request is easy enough. Getting it to behave well across multiple shells without spawning duplicate servers or leaving zombie state behind is the more annoying problem. The pid file, client files, lock directory, stale-session pruning, readiness checks, and startup timeout logic are all there because the naive version of this setup is fragile.
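The readiness check and startup timeout are the easiest of those pieces to show in isolation. The sketch below is a generic poll-until-ready helper, not the repo's actual logic: wait_until, server_ready, and HEALTH_URL are names invented for the example, and the default address is an assumption. The one thing it relies on from llama.cpp is that llama-server exposes a /health endpoint that only returns success once the model has loaded.

```shell
# wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND about once per second until it succeeds or the timeout elapses.
wait_until() {
  local timeout=$1
  shift
  local deadline=$(( $(date +%s) + timeout ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep 1
  done
}

# Probe for a running llama-server. HEALTH_URL is a hypothetical variable;
# curl -f turns an HTTP error (such as 503 while loading) into a failure.
server_ready() {
  curl -fsS --max-time 2 \
    "${HEALTH_URL:-http://127.0.0.1:8080/health}" >/dev/null 2>&1
}
```

A wrapper could then run something like `wait_until 120 server_ready || echo "llama-server did not become ready" >&2` before handing control to Pi, which turns a slow model load into a clear error instead of a hung first request.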
I really did not want to hand-edit the same truth in three different places.
That is why models.json exists as a source of truth and why pi-sync writes .pi/agent/models.json. Local model setups get weird quickly when the downloaded file, the running server alias, and the agent config stop agreeing.
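The general pattern is to derive the agent-facing file from the source of truth instead of editing both. Here is a minimal sketch of that idea using jq. It is not the real bin/pi-sync, and neither the input fields (.alias, .port) nor the output shape are the actual models.json or Pi schemas; they are placeholders chosen to show the derivation.

```shell
# sync_provider_config SRC DEST
# Derives DEST from SRC so the two cannot drift apart.
# Field names and output shape are hypothetical, not Pi's real schema.
sync_provider_config() {
  local src=$1 dest=$2 tmp
  mkdir -p "$(dirname "$dest")"
  tmp=$(mktemp "$dest.XXXXXX") || return 1
  if jq '{
        provider: "llama-cpp",
        baseUrl: "http://127.0.0.1:\(.port)/v1",
        models: [.alias]
      }' "$src" >"$tmp"; then
    # Rename into place so readers never see a half-written file.
    mv "$tmp" "$dest"
  else
    rm -f "$tmp"
    return 1
  fi
}
```

Because the output is regenerated from one input every time, changing the model alias or port is a single edit followed by a re-run, and a malformed source file leaves the previous output untouched.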
There is a quiet architectural tension in .pi/agent/settings.json: remote models are still enabled.
I think that is honest.
The local Qwen path has to be useful enough that I actually pick it for some tasks, not just present enough that I can say I run something locally. The task-specific agents and skills are part of how I pushed in that direction. If the local model is best at narrower jobs, then the environment should make those jobs easy to hand off.
By the end of this short burst of work, I had a setup I would actually use:
- llama-server

That is the part I like. The repo is small, but it is doing real integration work. It turns local inference from a one-off experiment into part of a working coding environment.
I bought some convenience here by accepting a little wrapper complexity.
The trade looks like this:
- llama.cpp

I still think that is the right trade.
If I were summarizing this as a portfolio piece, the main point would be that I took a vague goal (“make a local LLM setup for coding”) and pushed it into the unglamorous details that make the setup actually usable.
That includes:
None of that is especially flashy. It is just the real work required to make a local toolchain stop fighting back.
The visible history here is short and very concentrated. It still looks scrappy, which is fine. But it is scrappy in a useful direction: fewer manual steps, better session behavior, better task shaping, and a clearer boundary between infrastructure, model config, and project-level agent behavior.
For a repo this small, that is a pretty good return.