Local LLM


2026-05-12

I built local-llm because I wanted Pi to feel local, usable, and not annoyingly fragile.

More specifically, I wanted a coding-agent setup that could run against a local llama.cpp server on my 3090, come up inside a Nix shell, reuse one shared model server across terminals, and still keep the nicer Pi ergonomics I actually wanted: wrappers, model config, wiki support, context-mode, subagents, and a few task-specific local helpers.

That sounds tidy when I say it like that now. It was not tidy while I was putting it together.

The repo is small, but the job it is doing is pretty specific: take a general agent stack and turn it into a repeatable local workstation setup with predictable runtime behavior.

The first commit landed on 2026-05-13, and the whole visible history here runs through 2026-05-27. In that short window I went from basic Pi config cleanup to local wrappers, model sync, a shared llama-server, LLM Wiki integration, context-mode, subagents, and project-specific skills. The shape of the work is pretty clear in hindsight: first get the environment under control, then make local model execution practical, then reduce the amount of repeated fuss around sessions, tooling, and repo-specific workflows.

What it does

At a practical level, this repo gives me a local coding-agent environment with a few important properties:

Nix provides the shell and core toolchain
models.json defines the local model, server settings, and llama arguments
bin/pi-sync downloads GGUF weights and writes Pi provider config
bin/pi-local makes Pi talk to a shared local llama-server
bin/pi-status shows whether that shared server and its client sessions are alive
project-local Pi config adds packages, agents, skills, and workflow glue on top

The center of gravity is flake.nix, which is very blunt about the goal: “Local coding environment for pi + llama.cpp on a 3090.” That is exactly what this repo is.

The shell pins the environment I actually needed:

nodejs_24, deno, jq, curl
poppler-utils and docx2txt for document extraction work
pi from llm-agents
CUDA-enabled llama.cpp
a WSL-specific LD_LIBRARY_PATH so the NVIDIA driver libraries are usable at runtime

That last part is a good example of the whole project. This is not a generic “install some packages” repo. It is a pile of decisions aimed at making the local setup actually run on the machine I have.
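
As a rough illustration of that kind of machine-specific decision, here is a minimal sketch of a guard that could live in a shell hook. It assumes the usual WSL2 layout, where the Windows NVIDIA driver libraries are exposed under /usr/lib/wsl/lib. The function name prepend_lib_dir is hypothetical and not taken from the repo.

```shell
#!/usr/bin/env bash
# Sketch: prepend a driver-library directory to LD_LIBRARY_PATH only when it exists.
# prepend_lib_dir is a hypothetical helper name, not something from the repo.
prepend_lib_dir() {
  local dir="$1"
  [ -d "$dir" ] || return 0             # directory missing (e.g. not on WSL): do nothing
  case ":${LD_LIBRARY_PATH:-}:" in
    *":$dir:"*) ;;                      # already on the path, avoid duplicates
    *) export LD_LIBRARY_PATH="$dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
  esac
}

# Assumed WSL2 location of the NVIDIA driver libraries.
prepend_lib_dir /usr/lib/wsl/lib
```

Guarding on the directory keeps the same shell usable on a non-WSL machine, where the path simply does not exist.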

Why I made it

What I wanted was not just “an LLM running locally.” That part is easy to say and surprisingly fussy to live with.

I wanted a setup where:

the environment is reproducible enough that I do not have to remember every incantation
model downloads and Pi provider config stay in sync
multiple shells can share one llama-server instead of each starting their own copy
Pi can still use project-local packages and skills instead of collapsing into a bare wrapper around a local endpoint
I can switch between remote models and the local Qwen model without the rest of the setup breaking or behaving differently

That is the useful part of this repo to me. It is not just “run a model.” It is “make a local model part of a coding workflow that I would actually keep using.”

Architecture

After a few iterations, I settled on a few bounded layers.

1. Environment and runtime provisioning

flake.nix handles the shell, binary availability, CUDA-enabled llama.cpp, and the environment variables Pi expects.

That includes things like:

LLAMA_CPP_API_BASE
PI_SKIP_VERSION_CHECK
IN_LOCAL_LLM
LOCAL_LLM_PI_BIN
a project-local npm prefix inside .npm

I wanted the shell to be opinionated enough that entering it told me what mattered: Pi version, llama-server version, where auth comes from, and which wrapper scripts control the local model setup.

2. Model definition and sync

models.json is the source of truth for the local model.

Right now it defines:

host and port for the server
provider compatibility details for Pi
the default model id
GGUF source URL
context window and token settings
the exact llama-server arguments I wanted for this machine

bin/pi-sync turns that into two concrete outputs:

the model file in ./llama.cpp/models
Pi provider config in .pi/agent/models.json

That matters because drifting config is an easy way to waste time. If the server thinks one thing, Pi thinks another thing, and the filesystem contains a third thing, local inference gets annoying fast.
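
To make the "one source, two outputs" idea concrete, here is a minimal sketch of a sync step built on jq and curl. Everything specific in it is an assumption: the JSON field names (.server.host, .server.port, .model.id, .model.contextWindow, .model.url), the function names, and the shape of the generated config are illustrative and do not reflect the repo's real schema or Pi's actual provider format.

```shell
#!/usr/bin/env bash
# Sketch only: field names and output shape are hypothetical.

# Render agent-facing provider config from the single source-of-truth file.
render_provider_config() {
  local src="$1"
  jq '{
    providers: {
      "llama-cpp": {
        baseUrl: "http://\(.server.host):\(.server.port)/v1",
        models: [ { id: .model.id, contextWindow: .model.contextWindow } ]
      }
    }
  }' "$src"
}

# Download the weights only if the file is not already present.
# Writing to a .part file first avoids leaving a truncated model behind.
fetch_model() {
  local src="$1" dest_dir="$2" url file
  url="$(jq -r '.model.url' "$src")"
  file="$dest_dir/$(basename "$url")"
  mkdir -p "$dest_dir"
  if [ ! -f "$file" ]; then
    curl -fL --retry 3 -o "$file.part" "$url" && mv "$file.part" "$file"
  fi
}

# Possible wiring (paths as described in the text):
#   fetch_model models.json ./llama.cpp/models
#   render_provider_config models.json > .pi/agent/models.json
```

Because both functions read the same file, changing the port or the model id in one place updates the download target and the agent config together.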

3. Wrapper and shared-server control

This is the part that made the setup feel less dumb.

bin/pi-local and bin/_pi-shared manage one shared llama-server across multiple sessions using:

a shared state directory
a server pid file
one client file per active wrapper session
a lock directory for mutations
stale-session pruning based on shell pids

I like this part because it solves a very practical problem in a very small amount of shell. I did not want every terminal to boot its own model server, and I did not want shutdown behavior to depend on exiting in the right order like some cursed little ritual.
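
The lock directory and stale-session pruning can be sketched in a few lines of bash. This is an illustration under assumptions, not the contents of bin/_pi-shared: the state directory layout, the with_lock and prune_stale_clients names, and the convention of naming each client file after its shell pid are all hypothetical.

```shell
#!/usr/bin/env bash
# Sketch with hypothetical names and layout.
STATE_DIR="${STATE_DIR:-${TMPDIR:-/tmp}/local-llm-state}"

# mkdir either creates the directory or fails, atomically,
# so a directory can serve as a cross-process mutex.
with_lock() {
  local lock="$STATE_DIR/lock" tries=0
  mkdir -p "$STATE_DIR"
  until mkdir "$lock" 2>/dev/null; do
    tries=$((tries + 1))
    [ "$tries" -lt 100 ] || { echo "lock timeout: $lock" >&2; return 1; }
    sleep 0.1
  done
  "$@"                                  # run the protected command
  local rc=$?
  rmdir "$lock"
  return "$rc"
}

# Assumes each client file is named after the shell pid that created it.
# kill -0 sends no signal; it only checks that the pid exists and can be signaled.
prune_stale_clients() {
  local f pid
  mkdir -p "$STATE_DIR/clients"
  for f in "$STATE_DIR"/clients/*; do
    [ -e "$f" ] || continue             # glob matched nothing
    pid="${f##*/}"
    kill -0 "$pid" 2>/dev/null || rm -f "$f"
  done
}

# Usage: with_lock prune_stale_clients
```

The retry cap matters: without it, a lock directory left behind by a killed process would block every later session forever.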

bin/pi-status gives me the quick read: root, state dir, API base, log file, server pid, readiness, and active sessions.

4. Project-local Pi behavior

The repo also configures Pi itself instead of stopping at infrastructure.

.pi/agent/settings.json shows the broader shape: remote models are still enabled, but llama-cpp/qwen3.6-27b-q5_k_m is part of the model set alongside Copilot and Gemini options. The package list adds the things I wanted available in this project:

LLM Wiki
context-mode
subagents
todo support
pi-lens

Then there are local agents and skills layered on top, like local-edit, local-compress, tone, and git-narrative.

That is an important part of the repo’s identity. I was not just trying to host inference locally. I was trying to make local inference live inside the same actual working environment I use for coding and writing.

Development arc

The history is short, but it is very readable.

Stage 1 — Clean up the base config

The first commit is “Tidy pi config” on 2026-05-13.

That is a good starting signal. Before adding cleverer local-model behavior, I first tightened the Pi side of the configuration. That tracks with how I usually end up doing this kind of work: make the defaults less sloppy before adding more moving pieces.

Stage 2 — Add local wrappers and model sync

The next clear step is “Adding local wrappers to manage pi.”

This is the point where the project became a real local environment instead of just a configured agent directory. bin/pi, bin/pi-sync, models.json, and flake.nix all move together. That tells the story pretty well: I needed a controlled entrypoint, a model source of truth, and a way to generate the Pi-facing config from it.

Stage 3 — Expand the environment around the local core

A few quick commits then flesh out the setup:

Adding llm wiki
Switching to qwen 3.6
De-weaving llama
Run pi-local from cwd

This is the stage where the repo stops being just a launcher and starts becoming an actual working environment. The model choice changes, the wrappers get less awkward, and the wiki shows up as part of the workflow rather than as an afterthought.

Stage 4 — Make the shared server behave like infrastructure, not ceremony

The biggest functional shift comes on 2026-05-26:

Adding shared server
Ensuring llama-server is shared between sessions
Adding pi wrapper, adding context-mode, adding sub-agents

This is the point where I stopped tolerating wasteful or brittle session behavior and made the server lifecycle explicit.

That shows up directly in the shell code. pi-local registers a session file, reuses or starts the shared server under a lock, and removes its session file on exit. The last exiting session shuts the server down. Crashed shells get cleaned up by checking recorded pids.
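
The register, reuse-or-start, and last-one-out behavior described above can be sketched as follows. The names (join_session, leave_session, server_alive) and the file layout are hypothetical, and the sketch leaves out the locking that the real wrapper does around these mutations. The server command is passed in as arguments so the logic can be tried with any long-running process.

```shell
#!/usr/bin/env bash
# Sketch with hypothetical names and layout; locking is omitted for brevity.
STATE_DIR="${STATE_DIR:-${TMPDIR:-/tmp}/local-llm-state}"

# True if the pid file exists and the recorded process can still be signaled.
server_alive() {
  local pid
  pid="$(cat "$STATE_DIR/server.pid" 2>/dev/null)" || return 1
  [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null
}

# Register this session, then start the server only if none is running.
join_session() {
  local me="$1"; shift
  mkdir -p "$STATE_DIR/clients"
  : > "$STATE_DIR/clients/$me"
  if ! server_alive; then
    "$@" >"$STATE_DIR/server.log" 2>&1 &
    echo "$!" > "$STATE_DIR/server.pid"
  fi
}

# Drop this session; the last one out stops the server.
leave_session() {
  local me="$1"
  rm -f "$STATE_DIR/clients/$me"
  if [ -z "$(ls -A "$STATE_DIR/clients" 2>/dev/null)" ] && server_alive; then
    kill "$(cat "$STATE_DIR/server.pid")" 2>/dev/null
    rm -f "$STATE_DIR/server.pid"
  fi
}

# Typical wiring inside a wrapper (the port is an assumption):
#   join_session "$$" llama-server --port 8080
#   trap 'leave_session "$$"' EXIT
```

Tying cleanup to an EXIT trap is what removes the ordering problem: whichever terminal closes last does the shutdown, and no session needs to know about the others.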

That is not glamorous work, but it is the difference between a local setup I occasionally demo and one I would actually leave running across multiple terminals.

Stage 5 — Add task-shaped local helpers

The last visible pass adds more project-specific behavior:

Fix local-compress agent
Adding local-edit bot, adding startup timer
Adding tone skill, additional doc extraction tools

This part is small but telling. Once the environment and server lifecycle were mostly under control, I started shaping the local model toward actual tasks: targeted edits, dense-note compression, tone matching, and repo-history narrative work.

That is where the project stops feeling like “local inference infrastructure” and starts feeling like a local assistant environment.

What got hard

A few friction points show up pretty clearly in the repo.

Shared server lifecycle

This was one of the main practical problems.

Getting a local model to answer one request is easy enough. Getting it to behave well across multiple shells without spawning duplicate servers or leaving zombie state behind is the more annoying problem. The pid file, client files, lock directory, stale-session pruning, readiness checks, and startup timeout logic are all there because the naive version of this setup is fragile.
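
A readiness check with a startup timeout is the simplest of those pieces to show in isolation. The sketch below polls an arbitrary probe command until it succeeds or a deadline passes. The name wait_until_ready is hypothetical, and the host and port in the example probe are assumptions. llama-server does expose a /health endpoint, which makes it a natural probe target.

```shell
#!/usr/bin/env bash
# Sketch: poll a probe command until it succeeds or the timeout (in seconds) expires.
# wait_until_ready is a hypothetical name, not taken from the repo.
wait_until_ready() {
  local timeout="$1"; shift
  local deadline=$(( SECONDS + timeout ))   # SECONDS is bash's elapsed-time counter
  until "$@" >/dev/null 2>&1; do
    [ "$SECONDS" -lt "$deadline" ] || return 1
    sleep 1
  done
}

# Example probe (host and port are assumptions):
#   wait_until_ready 120 curl -fsS http://127.0.0.1:8080/health || echo "server did not come up" >&2
```

Loading a large GGUF file can take a while, so the timeout needs to be generous. The point of having one at all is that a failed start produces an error instead of a wrapper that hangs.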

Config drift between model, server, and agent

I really did not want to hand-edit the same truth in three different places.

That is why models.json exists as a source of truth and why pi-sync writes .pi/agent/models.json. Local model setups get weird quickly when the downloaded file, the running server alias, and the agent config stop agreeing.

Keeping the local path useful enough to choose

There is a quiet architectural tension in .pi/agent/settings.json: remote models are still enabled.

I think that is honest.

The local Qwen path has to be useful enough that I actually pick it for some tasks, not just present enough that I can say I run something locally. The task-specific agents and skills are part of how I pushed in that direction. If the local model is best at narrower jobs, then the environment should make those jobs easy to hand off.

Result

By the end of this short burst of work, I had a setup I would actually use:

enter a Nix shell and get the right toolchain
sync a defined GGUF model into a local workspace
generate Pi provider config from the same source of truth
start or reuse one shared llama-server
run Pi against that local endpoint from the current working directory
inspect server/session state quickly
layer wiki, context-mode, subagents, and local task helpers on top

That is the part I like. The repo is small, but it is doing real integration work. It turns local inference from a one-off experiment into part of a working coding environment.

Trade-offs

I bought some convenience here by accepting a little wrapper complexity.

The trade looks like this:

more shell glue, but less manual startup and shutdown fuss
a pinned local environment, but more Nix and machine-specific setup detail
project-local model sync, but another config transformation step to maintain
one shared server, but some bookkeeping around locks, pids, and session files
local model support inside Pi, but explicit compatibility shims for llama.cpp

I still think that is the right trade.

What I think this shows

If I were summarizing this as a portfolio piece, the main point is that I took a vague goal, “make a local LLM setup for coding,” and pushed it into the unglamorous details that make the setup actually usable.

That includes:

reproducible environment design with Nix
local model packaging and sync workflows
shared service lifecycle management
agent and provider configuration for llama.cpp's partial OpenAI API compatibility
layering higher-level tooling on top of a local inference path instead of treating the model as the whole product

None of that is especially flashy. It is just the real work required to make a local toolchain stop fighting back.

Status

The visible history here is short and very concentrated. It still looks scrappy, which is fine. But it is scrappy in a useful direction: fewer manual steps, better session behavior, better task shaping, and a clearer boundary between infrastructure, model config, and project-level agent behavior.

For a repo this small, that is a pretty good return.
