Local LLM

Software Ai Nix Developer Tools Infrastructure Project
2026-05-12

I built local-llm because I wanted Pi to feel local, usable, and not annoyingly fragile.

More specifically, I wanted a coding-agent setup that could run against a local llama.cpp server on my 3090, come up inside a Nix shell, reuse one shared model server across terminals, and still keep the nicer Pi ergonomics I actually wanted: wrappers, model config, wiki support, context-mode, subagents, and a few task-specific local helpers.

That sounds tidy when I put it that way now. It was not tidy while I was putting it together.

The repo is small, but the job it is doing is pretty specific: take a general agent stack and turn it into a repeatable local workstation setup with predictable runtime behavior.

The first commit landed on 2026-05-13, and the whole visible history here runs through 2026-05-27. In that short window I went from basic Pi config cleanup to local wrappers, model sync, a shared llama-server, LLM Wiki integration, context-mode, subagents, and project-specific skills. The shape of the work is pretty clear in hindsight: first get the environment under control, then make local model execution practical, then reduce the amount of repeated fuss around sessions, tooling, and repo-specific workflows.

What it does

At a practical level, this repo gives me a local coding-agent environment with a few important properties:

  1. Nix provides the shell and core toolchain
  2. models.json defines the local model, server settings, and llama arguments
  3. bin/pi-sync downloads GGUF weights and writes Pi provider config
  4. bin/pi-local makes Pi talk to a shared local llama-server
  5. bin/pi-status shows whether that shared server and its client sessions are alive
  6. project-local Pi config adds packages, agents, skills, and workflow glue on top

The center of gravity is flake.nix, which is very blunt about the goal: “Local coding environment for pi + llama.cpp on a 3090.” That is exactly what this repo is.

The shell pins the environment I actually needed:

  • nodejs_24, deno, jq, curl
  • poppler-utils and docx2txt for document extraction work
  • pi from llm-agents
  • CUDA-enabled llama.cpp
  • a WSL-specific LD_LIBRARY_PATH so the NVIDIA driver libraries are usable at runtime

That last part is a good example of the whole project. This is not a generic “install some packages” repo. It is a pile of decisions aimed at making the local setup actually run on the machine I have.

Why I made it

What I wanted was not just “an LLM running locally.” That part is easy to say and surprisingly fussy to live with.

I wanted a setup where:

  • the environment is reproducible enough that I do not have to remember every incantation
  • model downloads and Pi provider config stay in sync
  • multiple shells can share one llama-server instead of each starting their own copy
  • Pi can still use project-local packages and skills instead of collapsing into a bare wrapper around a local endpoint
  • I can switch between remote models and the local Qwen model without the whole setup turning spooky

That is the useful part of this repo to me. It is not just “run a model.” It is “make a local model part of a coding workflow that I would actually keep using.”

Architecture

After a few iterations, I settled on a few bounded layers.

1. Environment and runtime provisioning

flake.nix handles the shell, binary availability, CUDA-enabled llama.cpp, and the environment variables Pi expects.

That includes things like:

  • LLAMA_CPP_API_BASE
  • PI_SKIP_VERSION_CHECK
  • IN_LOCAL_LLM
  • LOCAL_LLM_PI_BIN
  • a project-local npm prefix inside .npm

I wanted the shell to be opinionated enough that entering it told me what mattered: Pi version, llama-server version, where auth comes from, and which wrapper scripts control the local model setup.
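
For illustration only, here is a small Python sketch of the kind of check that banner implies: report which of the variables listed above are set. The variable names come from the list; what they hold is not shown here, and the function names shell_report and missing_vars are made up. The real hook lives in flake.nix, not in Python.

```python
import os

# Names are taken from the list above. Their values are not documented in
# this post, so the sketch only reports presence and the raw value.
EXPECTED_VARS = (
    "LLAMA_CPP_API_BASE",
    "PI_SKIP_VERSION_CHECK",
    "IN_LOCAL_LLM",
    "LOCAL_LLM_PI_BIN",
)


def shell_report(env=None):
    """Return one 'NAME=value' line per expected variable ('<unset>' if absent)."""
    env = os.environ if env is None else env
    return [f"{name}={env.get(name, '<unset>')}" for name in EXPECTED_VARS]


def missing_vars(env=None):
    """Return the expected variable names that are absent or empty."""
    env = os.environ if env is None else env
    return [name for name in EXPECTED_VARS if not env.get(name)]


if __name__ == "__main__":
    print("\n".join(shell_report()))
```

Run inside the shell, this prints four lines; run outside it, most of them would presumably read `<unset>`, which is the quick signal the banner is meant to give.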

2. Model definition and sync

models.json is the source of truth for the local model.

Right now it defines:

  • host and port for the server
  • provider compatibility details for Pi
  • the default model id
  • GGUF source URL
  • context window and token settings
  • the exact llama-server arguments I wanted for this machine

bin/pi-sync turns that into two concrete outputs:

  • the model file in ./llama.cpp/models
  • Pi provider config in .pi/agent/models.json

That matters because drifting config is an easy way to waste time. If the server thinks one thing, Pi thinks another thing, and the filesystem contains a third thing, local inference gets annoying fast.
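
The sync step can be sketched in Python as two derivations from one spec. This is not the actual bin/pi-sync, which is a shell script. Every field name in EXAMPLE_SPEC, the port, the context size, the placeholder URL, and the shape of the generated provider entry (baseUrl, models) are assumptions invented for illustration. Only the two output paths, the provider name, and the model id come from the text. The download itself is left out.

```python
import json
from pathlib import Path

# Hypothetical schema: the real models.json fields are not shown in this post.
EXAMPLE_SPEC = {
    "host": "127.0.0.1",
    "port": 8080,
    "provider": "llama-cpp",
    "default_model": "qwen3.6-27b-q5_k_m",
    "gguf_url": "https://example.invalid/qwen3.6-27b-q5_k_m.gguf",  # placeholder
    "context_window": 32768,  # assumed value
}


def model_path(spec, root):
    """Where the GGUF file should land, named after the URL's last segment."""
    filename = spec["gguf_url"].rsplit("/", 1)[-1]
    return Path(root) / "llama.cpp" / "models" / filename


def provider_config(spec):
    """Derive the agent-facing provider entry from the same spec."""
    return {
        spec["provider"]: {
            "baseUrl": f"http://{spec['host']}:{spec['port']}/v1",
            "models": [
                {
                    "id": spec["default_model"],
                    "contextWindow": spec["context_window"],
                }
            ],
        }
    }


def write_provider_config(spec, root):
    """Write the derived entry to .pi/agent/models.json under root."""
    target = Path(root) / ".pi" / "agent" / "models.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(provider_config(spec), indent=2) + "\n")
    return target
```

The design point is that neither output is edited by hand. Both are functions of the spec, so changing the model means changing one file and re-running the sync.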

3. Wrapper and shared-server control

This is the part that made the setup feel less dumb.

bin/pi-local and bin/_pi-shared manage one shared llama-server across multiple sessions using:

  • a shared state directory
  • a server pid file
  • one client file per active wrapper session
  • a lock directory for mutations
  • stale-session pruning based on shell pids

I like this part because it solves a very practical problem in a very small amount of shell. I did not want every terminal to boot its own model server, and I did not want shutdown behavior to depend on exiting in the right order like some cursed little ritual.
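
The post describes this bookkeeping but does not show it, so here is a hedged Python sketch of the same ideas: a directory used as a mutex (creating a directory is atomic, so only one caller succeeds), one file per client, and pruning by pid. The real scripts are shell. The names state_lock, register_client, and prune_stale, and the lock and clients paths, are hypothetical. The pid probe assumes a POSIX system.

```python
import os
import time
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def state_lock(state_dir, timeout=5.0):
    """Hold a lock by creating a directory; release it by removing the directory."""
    lock = Path(state_dir) / "lock"
    deadline = time.monotonic() + timeout
    while True:
        try:
            lock.mkdir(parents=True)  # fails if another holder already made it
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire {lock}")
            time.sleep(0.05)
    try:
        yield
    finally:
        lock.rmdir()


def pid_alive(pid):
    """True if a process with this pid exists (signal 0 probes without signalling)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def register_client(state_dir, pid):
    """Record one wrapper session as a file named after its shell pid."""
    clients = Path(state_dir) / "clients"
    clients.mkdir(parents=True, exist_ok=True)
    (clients / str(pid)).write_text(f"{pid}\n")


def prune_stale(state_dir, alive=pid_alive):
    """Delete client files whose recorded pid is gone; return the live pids."""
    live = []
    for entry in sorted((Path(state_dir) / "clients").glob("*")):
        pid = int(entry.read_text().strip())
        if alive(pid):
            live.append(pid)
        else:
            entry.unlink()
    return live
```

A wrapper would do its mutations inside `with state_lock(state_dir):`, so two terminals starting at once cannot both decide they are the first client.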

bin/pi-status gives me the quick read: root, state dir, API base, log file, server pid, readiness, and active sessions.
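
As a rough picture of that readout, the sketch below gathers the same seven fields from a state directory and formats them. It is a Python illustration, not the actual bin/pi-status. The file names server.pid, server.log, and clients, and the output layout, are assumptions. Readiness is passed in as a plain boolean here rather than probed.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional


def read_pid(pid_file):
    """Return the pid stored in pid_file, or None if it is missing or malformed."""
    try:
        return int(Path(pid_file).read_text().strip())
    except (FileNotFoundError, ValueError):
        return None


@dataclass
class Status:
    root: Path
    state_dir: Path
    api_base: str
    log_file: Path
    server_pid: Optional[int]
    ready: bool
    sessions: List[int]

    def lines(self):
        """Format the status as aligned 'label: value' lines."""
        pid = "not running" if self.server_pid is None else str(self.server_pid)
        ids = ", ".join(str(s) for s in self.sessions) or "none"
        return [
            f"root:      {self.root}",
            f"state dir: {self.state_dir}",
            f"api base:  {self.api_base}",
            f"log file:  {self.log_file}",
            f"server:    {pid}",
            f"ready:     {'yes' if self.ready else 'no'}",
            f"sessions:  {len(self.sessions)} ({ids})",
        ]


def collect_status(root, state_dir, api_base, ready):
    """Read the pid file and client files under state_dir into a Status."""
    state = Path(state_dir)
    sessions = sorted(
        int(p.name) for p in (state / "clients").glob("*") if p.name.isdigit()
    )
    return Status(
        root=Path(root),
        state_dir=state,
        api_base=api_base,
        log_file=state / "server.log",
        server_pid=read_pid(state / "server.pid"),
        ready=ready,
        sessions=sessions,
    )
```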

4. Project-local Pi behavior

The repo also configures Pi itself instead of stopping at infrastructure.

.pi/agent/settings.json shows the broader shape: remote models are still enabled, but llama-cpp/qwen3.6-27b-q5_k_m is part of the model set alongside Copilot and Gemini options. The package list adds the things I clearly wanted available in this project:

  • LLM Wiki
  • context-mode
  • subagents
  • todo support
  • pi-lens

Then there are local agents and skills layered on top, like local-edit, local-compress, tone, and git-narrative.

That is an important part of the repo’s identity. I was not just trying to host inference locally. I was trying to make local inference live inside the same working environment I actually use for coding and writing.

Development arc

The history is short, but it is very readable.

Stage 1 — Clean up the base config

The first commit is “Tidy pi config” on 2026-05-13.

That is a good starting signal. Before adding cleverer local-model behavior, I first tightened the Pi side of the configuration. That tracks with how I usually end up doing this kind of work: make the defaults less sloppy before adding more moving pieces.

Stage 2 — Add local wrappers and model sync

The next clear step is “Adding local wrappers to manage pi”.

This looks like the point where the project became a real local environment instead of just a configured agent directory. bin/pi, bin/pi-sync, models.json, and flake.nix all move together. That tells the story pretty well: I needed a controlled entrypoint, a model source of truth, and a way to generate the Pi-facing config from it.

Stage 3 — Expand the environment around the local core

A few quick commits then flesh out the setup:

  • Adding llm wiki
  • Switching to qwen 3.6
  • De-weaving llama
  • Run pi-local from cwd

This is the stage where the repo stops being just a launcher and starts becoming an actual working environment. The model choice changes, the wrappers get less awkward, and the wiki shows up as part of the workflow rather than as an afterthought.

Stage 4 — Make the shared server behave like infrastructure, not ceremony

The biggest functional shift comes on 2026-05-26:

  • Adding shared server
  • Ensuring llama-server is shared between sessions
  • Adding pi wrapper, adding context-mode, adding sub-agents

This is the point where I stopped tolerating wasteful or brittle session behavior and made the server lifecycle explicit.

That shows up directly in the shell code. pi-local registers a session file, reuses or starts the shared server under a lock, and removes its session file on exit. The last exiting session shuts the server down. Crashed shells get cleaned up by checking recorded pids.
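
That lifecycle is essentially reference counting with files. A minimal Python sketch, assuming a clients directory and a server.pid file (both hypothetical names), with the server start and stop passed in as callables so the example stays self-contained. The real pi-local is shell and does this under a lock. Locking and stale-pid cleanup are deliberately left out here to keep the counting logic visible.

```python
import os
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def shared_server_session(state_dir, start_server, stop_server, session_id=None):
    """Register a session, start or reuse the server, and clean up on exit.

    start_server() must return the new server's pid; stop_server(pid) stops it.
    """
    state = Path(state_dir)
    clients = state / "clients"
    clients.mkdir(parents=True, exist_ok=True)
    pid_file = state / "server.pid"
    me = clients / str(os.getpid() if session_id is None else session_id)

    me.write_text("")  # register this session
    try:
        if not pid_file.exists():  # first client in: start the server
            pid_file.write_text(f"{start_server()}\n")
        yield int(pid_file.read_text())
    finally:
        me.unlink(missing_ok=True)  # unregister this session
        if pid_file.exists() and not any(clients.iterdir()):
            # last client out: stop the server and forget its pid
            stop_server(int(pid_file.read_text()))
            pid_file.unlink()
```

Because cleanup runs in the finally block, the order in which terminals exit does not matter: whichever session leaves last finds the clients directory empty and stops the server.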

That is not glamorous work, but it is the difference between a local setup I occasionally demo and one I would actually leave running across multiple terminals.

Stage 5 — Add task-shaped local helpers

The last visible pass adds more project-specific behavior:

  • Fix local-compress agent
  • Adding local-edit bot, adding startup timer
  • Adding tone skill, additional doc extraction tools

This part is small but telling. Once the environment and server lifecycle were mostly under control, I started shaping the local model toward actual tasks: targeted edits, dense-note compression, tone matching, and repo-history narrative work.

That is where the project stops feeling like “local inference infrastructure” and starts feeling like a local assistant environment.

What got hard

A few friction points show up pretty clearly in the repo.

Shared server lifecycle

This was one of the main practical problems.

Getting a local model to answer one request is easy enough. Getting it to behave well across multiple shells without spawning duplicate servers or leaving zombie state behind is the more annoying problem. The pid file, client files, lock directory, stale-session pruning, readiness checks, and startup timeout logic are all there because the naive version of this setup is fragile.
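
The readiness check and startup timeout follow a familiar pattern: poll a probe until it succeeds or a deadline passes. The sketch below is a Python illustration under assumptions, not the repo's shell code. The 120-second default, the polling interval, and the function names are invented. llama-server does expose a /health endpoint, but the post does not say which URL the script actually polls.

```python
import time
import urllib.error
import urllib.request


def http_ready(url, timeout=1.0):
    """True if GET url answers 200; connection errors and other statuses are False."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe, timeout=120.0, interval=0.5,
                     clock=time.monotonic, sleep=time.sleep):
    """Call probe() until it returns True; return the elapsed seconds.

    Raises TimeoutError if probe() is still False once timeout seconds have passed.
    clock and sleep are injectable so the loop can be tested without waiting.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= timeout:
            raise TimeoutError(f"server not ready after {timeout:.0f}s")
        sleep(interval)
```

A caller might use it as `wait_until_ready(lambda: http_ready("http://127.0.0.1:8080/health"))`, where the host and port are placeholders. Returning the elapsed time also gives a startup timer for free.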

Config drift between model, server, and agent

I really did not want to hand-edit the same truth in three different places.

That is why models.json exists as a source of truth and why pi-sync writes .pi/agent/models.json. Local model setups get weird quickly when the downloaded file, the running server alias, and the agent config stop agreeing.

Keeping the local path useful enough to choose

There is a quiet architectural tension in .pi/agent/settings.json: remote models are still enabled.

I think that is honest.

The local Qwen path has to be useful enough that I actually pick it for some tasks, not just present enough that I can say I run something locally. The task-specific agents and skills are part of how I pushed in that direction. If the local model is best at narrower jobs, then the environment should make those jobs easy to hand off.

Result

By the end of this short burst of work, I had a setup I would actually use:

  • enter a Nix shell and get the right toolchain
  • sync a defined GGUF model into a local workspace
  • generate Pi provider config from the same source of truth
  • start or reuse one shared llama-server
  • run Pi against that local endpoint from the current working directory
  • inspect server/session state quickly
  • layer wiki, context-mode, subagents, and local task helpers on top

That is the part I like. The repo is small, but it is doing real integration work. It turns local inference from a one-off experiment into part of a working coding environment.

Trade-offs

I bought some convenience here by accepting a little wrapper complexity.

The trade looks like this:

  • more shell glue, but less manual startup and shutdown fuss
  • a pinned local environment, but more Nix and machine-specific setup detail
  • project-local model sync, but another config transformation step to maintain
  • one shared server, but some bookkeeping around locks, pids, and session files
  • local model support inside Pi, but explicit compatibility shims for llama.cpp

I still think that is the right trade.

What I think this shows

If I were summarizing this as a portfolio piece, the main point would be that I took a vague goal, “make a local LLM setup for coding,” and pushed it into the unglamorous details that make the setup actually usable.

That includes:

  • reproducible environment design with Nix
  • local model packaging and sync workflows
  • shared service lifecycle management
  • agent and provider configuration for partial OpenAI compatibility
  • layering higher-level tooling on top of a local inference path instead of treating the model as the whole product

None of that is especially flashy. It is just the real work required to make a local toolchain stop fighting back.

Status

The visible history here is short and very concentrated. It still looks scrappy, which is fine. But it is scrappy in a useful direction: fewer manual steps, better session behavior, better task shaping, and a clearer boundary between infrastructure, model config, and project-level agent behavior.

For a repo this small, that is a pretty good return.

Specific Solutions LLC
Portland, OR