Local LLM UI Roadmap

TL;DR

The thing that separates a toy local LLM UI from one that feels like a real product isn't features — it's capability surface. A bored friend looks at a chat box and thinks "I could do that on chatgpt.com." What makes them lean in: it does something the cloud can't (your private docs, your own voice, your own files) — or it does something cloud charges $20/month for (vision, tool calling, RAG with citations) running entirely on your hardware.

This plan prioritises those two things over "more chat features." Each phase's headline feature is meant to be jaw-droppable in one demo move.

Each phase's headline feature is meant to be jaw-droppable in one demo move.

design principle

Phase 1 — Make it feel like a product

~3-4 hours, autonomous. Headline demo: paste an image, vision model describes it. Cmd-K finds anything. A sidebar of past chats appears.

60 Min

Multi-conversation sidebar

Search, rename, pin. Closes the "where do my chats go" gap the current single-thread UI has.

30 Min

Cmd-K command palette

Fuzzy-find commands, models, past chats. One keystroke and you're a power user — à la Linear, VSCode, Raycast.

45 Min

Image paste & drop → vision model

Ollama already supports images: [] on /api/chat. ollama pull qwen2.5vl:7b once and the UI auto-routes attachments to the vision model.

30 Min

System prompts / personas

"Coder," "Reviewer," "Brain dump," "Therapist." Biggest UX leap for least code. Per-thread, swap mid-conversation.

20 Min

Export conversation

Markdown file or shareable HTML, one click. Friend asks "can I send this to a colleague" — yes.

20 Min

Token / cost dashboard

Cumulative tokens, energy at $/kWh, "you saved $X this week vs cloud." Frames the local-LLM value in a number people understand.

End of Phase 1: the UI looks like a small startup built it. The friend stops being bored and starts asking "wait, can it look at images?"

Phase 2 — Make it actually do things

~1 day. Headline demo: "Hey, summarise the top 3 articles about X this week and save the notes to my Obsidian vault" — and it does.

3-4 Hrs

Tool calling

Via Ollama's tool_calls: web search (firecrawl), file read/write, calculator, shell. Visible reasoning: "Model wants to search the web for X — allow?" Watching the model use tools is the demo.

2 Hrs

Voice in (push-to-talk)

Browser MediaRecorder → your existing whisper-ptt. Hold spacebar to talk. Half the infra exists already.

3 Hrs

RAG over a folder

Drag a folder, embed locally with nomic-embed-text, ask cited questions. Drop your Obsidian vault on it. Cloud LLMs can't do this with your private files.

1 Hr

Multi-model comparison

Same prompt → 2-4 models side-by-side, streaming in parallel. M3 Ultra can run several at once. Visually impressive AND useful for real evals.

Architectural decision in Phase 2

The Python launcher graduates from "thin proxy" to actual backend (FastAPI). Tool calling, RAG, and voice all need server-side work the browser can't do alone. Two-file simplicity dies here — but the gain is enormous, and almost every Phase 3 feature reuses the same backend.

Phase 3 — Make it agentic

~1 week. Headline demo: type a sentence. Model browses the web via Playwright, takes screenshots, fills a form, reports back — all local.

2 Days

MCP bridge

Ollama models can call your existing MCP servers — Gmail, Calendar, Drive, Playwright, Firecrawl, context7. Local model gets the exact same toolset Claude Code has. Privacy preserved end-to-end.

1 Day

In-browser Python sandbox

Pyodide. Model writes Python → runs → sees output → corrects. The "code interpreter" demo, but no backend round-trip needed.

2 Days

Vision + browser control

Playwright screenshots → vision model → click/type actions. "Find a Bangkok restaurant under ฿800 with good reviews" → watch it browse.

1 Day

Cross-thread memory

Claude-style fact extraction with retrieval. The thing that makes a local assistant feel like yours. "You mentioned last week that…"

1 Day

Plugin / skill system

Claude Code-style. Drop a markdown file into skills/, it becomes a slash command with behaviours. Friend can extend it without touching the codebase.

Trade-offs to know

Take now unblocks everything later Phase 2 when capability starts to matter Manage ongoing discipline needed Defer only if you'll really use it

Choice	Cost	Verdict
localStorage → IndexedDB	~30 min refactor; slight code complexity. Same browser-only storage model, just bigger and queryable.	Take now
Stdlib proxy → FastAPI backend	One more process to run, `pip install` step. Unlocks tool calling, RAG, MCP, scheduled tasks, real auth.	Take now
Two-file simplicity → multi-file build (Vite + TS)	Lose `python3 launch.py` magic. Gain hot-reload, TS types, real testing, proper component reuse.	Phase 2
Pure local → MCP bridge	Each MCP server is a trust boundary that wants thinking-through. Local model becomes a real assistant.	Phase 3
More features → less "calm minimal"	Visual surface grows; the original design language starts drifting. Needs intentional pruning.	Manage
Browser-only → desktop wrapper (Tauri / Electron)	Build complexity, packaging story, code-signing. Gains: real FS access, true PWA, dock icon, native notifications.	Defer

Quick wins — small things that unlock big options

5–30 minute moves that turn Phase 2 and 3 from "weeks of work" into "weekend of work."

5 Min

Pull a vision model

ollama pull qwen2.5vl:7b. Phase 1's image-paste is then one HTML tweak away.

5 Min

Pull an embedding model

ollama pull nomic-embed-text. RAG becomes a 200-line addition, not "build embeddings infra first."

30 Min

Move launcher to FastAPI

Even before adding capability. Phase 2 then doesn't require rewriting infrastructure — features just bolt on.

10 Min

Inventory your MCP servers

Write down which existing servers you'd actually want a local model to use. Phase 3 MCP bridge maps directly to that list.

30 Min

Chat storage → IndexedDB

Unblocks unlimited threads, attachments, full-text search. Single biggest leverage move in Phase 1.

5 Min

Decide on native wrapper

Stay browser-only, or open the door to Tauri later? Affects Phase 3 architecture if "native eventually = yes."

Start here

Phase 1 in this order, today

IndexedDB migration · 30 min · no visible change, but unblocks everything that follows
Multi-conversation sidebar · 60 min · immediate visible win, closes the "where do chats go" question
Cmd-K command palette · 30 min · the UI suddenly feels high-end
Image paste → vision model · 45 min · the "ohhh" moment for the friend

Total ~2.5–3 hours wall clock. By the end, the friend stops complaining and starts asking "what else can it do?" — exactly the state you want to be in to start Phase 2.

Time estimates assume autonomous execution by an AI coding agent with the existing ollama-ui codebase as starting point. Human-pair estimates scale roughly 2–3×.
Phase 3 MCP-bridge complexity depends on which servers you wire up — some are 10-line bridges, others are full proxy reimplementations.
Vision model quality varies a lot at Ollama-pullable sizes. qwen2.5vl:7b is decent for general scenes; llava:13b is older but stable. Don't expect GPT-4V quality at 7B.
Voice-in latency depends on the local Whisper variant: whisper-tiny is essentially instant, whisper-large-v3 is more accurate but 2–3× slower on Apple Silicon.
RAG quality is dominated by embedding model choice and chunking strategy, not the LLM. Phase 2 RAG will be roughly 80% of state-of-art with nomic-embed-text and naive chunking; closing that last 20% is its own project.
Plan written for a Mac Studio M3 Ultra with 256 GB unified memory. On a smaller machine, multi-model comparison and Phase 3's parallel-agents need rethinking.