The thing that separates a toy local LLM UI from one that feels like a real product isn't features — it's capability surface. A bored friend looks at a chat box and thinks "I could do that on chatgpt.com." What makes them lean in: it does something the cloud can't (your private docs, your own voice, your own files) — or it does something cloud charges $20/month for (vision, tool calling, RAG with citations) running entirely on your hardware.
This plan prioritises those two things over "more chat features." Each phase's headline feature is meant to be jaw-droppable in one demo move.
Each phase's headline feature is meant to be jaw-droppable in one demo move.
Phase 1 — Make it feel like a product
~3-4 hours, autonomous. Headline demo: paste an image, vision model describes it. Cmd-K finds anything. A sidebar of past chats appears.
Multi-conversation sidebar
Search, rename, pin. Closes the "where do my chats go" gap the current single-thread UI has.
Cmd-K command palette
Fuzzy-find commands, models, past chats. One keystroke and you're a power user — à la Linear, VSCode, Raycast.
Image paste & drop → vision model
Ollama already supports images: [] on /api/chat. ollama pull qwen2.5vl:7b once and the UI auto-routes attachments to the vision model.
System prompts / personas
"Coder," "Reviewer," "Brain dump," "Therapist." Biggest UX leap for least code. Per-thread, swap mid-conversation.
Export conversation
Markdown file or shareable HTML, one click. Friend asks "can I send this to a colleague" — yes.
Token / cost dashboard
Cumulative tokens, energy at $/kWh, "you saved $X this week vs cloud." Frames the local-LLM value in a number people understand.
End of Phase 1: the UI looks like a small startup built it. The friend stops being bored and starts asking "wait, can it look at images?"
Phase 2 — Make it actually do things
~1 day. Headline demo: "Hey, summarise the top 3 articles about X this week and save the notes to my Obsidian vault" — and it does.
Tool calling
Via Ollama's tool_calls: web search (firecrawl), file read/write, calculator, shell. Visible reasoning: "Model wants to search the web for X — allow?" Watching the model use tools is the demo.
Voice in (push-to-talk)
Browser MediaRecorder → your existing whisper-ptt. Hold spacebar to talk. Half the infra exists already.
RAG over a folder
Drag a folder, embed locally with nomic-embed-text, ask cited questions. Drop your Obsidian vault on it. Cloud LLMs can't do this with your private files.
Multi-model comparison
Same prompt → 2-4 models side-by-side, streaming in parallel. M3 Ultra can run several at once. Visually impressive AND useful for real evals.
Architectural decision in Phase 2
The Python launcher graduates from "thin proxy" to actual backend (FastAPI). Tool calling, RAG, and voice all need server-side work the browser can't do alone. Two-file simplicity dies here — but the gain is enormous, and almost every Phase 3 feature reuses the same backend.
Phase 3 — Make it agentic
~1 week. Headline demo: type a sentence. Model browses the web via Playwright, takes screenshots, fills a form, reports back — all local.
MCP bridge
Ollama models can call your existing MCP servers — Gmail, Calendar, Drive, Playwright, Firecrawl, context7. Local model gets the exact same toolset Claude Code has. Privacy preserved end-to-end.
In-browser Python sandbox
Pyodide. Model writes Python → runs → sees output → corrects. The "code interpreter" demo, but no backend round-trip needed.
Vision + browser control
Playwright screenshots → vision model → click/type actions. "Find a Bangkok restaurant under ฿800 with good reviews" → watch it browse.
Cross-thread memory
Claude-style fact extraction with retrieval. The thing that makes a local assistant feel like yours. "You mentioned last week that…"
Plugin / skill system
Claude Code-style. Drop a markdown file into skills/, it becomes a slash command with behaviours. Friend can extend it without touching the codebase.
Trade-offs to know
| Choice | Cost | Verdict |
|---|---|---|
| localStorage → IndexedDB | ~30 min refactor; slight code complexity. Same browser-only storage model, just bigger and queryable. | Take now |
| Stdlib proxy → FastAPI backend | One more process to run, pip install step. Unlocks tool calling, RAG, MCP, scheduled tasks, real auth. |
Take now |
| Two-file simplicity → multi-file build (Vite + TS) | Lose python3 launch.py magic. Gain hot-reload, TS types, real testing, proper component reuse. |
Phase 2 |
| Pure local → MCP bridge | Each MCP server is a trust boundary that wants thinking-through. Local model becomes a real assistant. | Phase 3 |
| More features → less "calm minimal" | Visual surface grows; the original design language starts drifting. Needs intentional pruning. | Manage |
| Browser-only → desktop wrapper (Tauri / Electron) | Build complexity, packaging story, code-signing. Gains: real FS access, true PWA, dock icon, native notifications. | Defer |
Quick wins — small things that unlock big options
5–30 minute moves that turn Phase 2 and 3 from "weeks of work" into "weekend of work."
Pull a vision model
ollama pull qwen2.5vl:7b. Phase 1's image-paste is then one HTML tweak away.
Pull an embedding model
ollama pull nomic-embed-text. RAG becomes a 200-line addition, not "build embeddings infra first."
Move launcher to FastAPI
Even before adding capability. Phase 2 then doesn't require rewriting infrastructure — features just bolt on.
Inventory your MCP servers
Write down which existing servers you'd actually want a local model to use. Phase 3 MCP bridge maps directly to that list.
Chat storage → IndexedDB
Unblocks unlimited threads, attachments, full-text search. Single biggest leverage move in Phase 1.
Decide on native wrapper
Stay browser-only, or open the door to Tauri later? Affects Phase 3 architecture if "native eventually = yes."
Start here
Phase 1 in this order, today
- IndexedDB migration · 30 min · no visible change, but unblocks everything that follows
- Multi-conversation sidebar · 60 min · immediate visible win, closes the "where do chats go" question
- Cmd-K command palette · 30 min · the UI suddenly feels high-end
- Image paste → vision model · 45 min · the "ohhh" moment for the friend
Total ~2.5–3 hours wall clock. By the end, the friend stops complaining and starts asking "what else can it do?" — exactly the state you want to be in to start Phase 2.
- Time estimates assume autonomous execution by an AI coding agent with the existing ollama-ui codebase as starting point. Human-pair estimates scale roughly 2–3×.
- Phase 3 MCP-bridge complexity depends on which servers you wire up — some are 10-line bridges, others are full proxy reimplementations.
- Vision model quality varies a lot at Ollama-pullable sizes.
qwen2.5vl:7bis decent for general scenes;llava:13bis older but stable. Don't expect GPT-4V quality at 7B. - Voice-in latency depends on the local Whisper variant:
whisper-tinyis essentially instant,whisper-large-v3is more accurate but 2–3× slower on Apple Silicon. - RAG quality is dominated by embedding model choice and chunking strategy, not the LLM. Phase 2 RAG will be roughly 80% of state-of-art with
nomic-embed-textand naive chunking; closing that last 20% is its own project. - Plan written for a Mac Studio M3 Ultra with 256 GB unified memory. On a smaller machine, multi-model comparison and Phase 3's parallel-agents need rethinking.