Build a privacy-first mobile search assistant using Puma and open LLMs — local-only, offline-capable, and secure
Hook: You're tired of handing every personal search, note, and document to cloud AI. You want a fast, reliable mobile assistant that runs in the browser, works offline, and never sends your data to a third party. In 2026 this is achievable: modern mobile browsers (Puma included), WebAssembly and WebGPU/WebNN runtimes, and quantized open models let you run inference and store embeddings entirely on-device. This guide shows how.
Why this matters in 2026
Late 2025 and early 2026 saw two shifts that make local, privacy-first assistants practical:
- Broad mobile support for WebAssembly with threads and SIMD, and widespread WebGPU/WebNN implementations in major browsers and in privacy-first alternatives like Puma Browser.
- Open LLMs released in compact, well-quantized formats (GGUF, the successor to the older GGML format), plus browser-targeted tooling (wasm builds of llama.cpp, WebLLM, and hnswlib-wasm for vector search), enabling reasonable latency on modern phones.
“Puma Browser is an example of a mobile-first browser that embraces local AI — letting you run models and keep data on-device.”
High-level architecture — what you’ll build
Goal: a mobile web app (runs in Puma or a modern browser) that:
- Runs a client-side LLM runtime for answers and query understanding (WebAssembly or WebGPU-backed).
- Creates and caches embeddings locally for documents, pages, and notes.
- Performs local vector search (HNSW or brute-force) against the cached embeddings.
- Syncs encrypted data across devices without exposing plaintext — using end-to-end encryption and sealed storage / relay-only servers.
- Works offline and degrades gracefully when models can’t be loaded.
Components
- Client-side LLM runtime: llama.cpp WASM, WebLLM, or another WebAssembly/WebGPU/WebNN runtime that supports inference and embeddings.
- Embedding pipeline: local embedder (model or embed-only network) that turns text into vectors, cached in IndexedDB.
- Vector DB: small in-browser HNSW or brute-force index (hnswlib-wasm or custom lightweight index).
- Secure sync: end-to-end encryption of blobs + relay or peer-to-peer (WebRTC) for sync without server-side decryption.
- UI/UX: fast search UI with incremental indexing, feedback loop for re-ranking, and offline-first UX. See the principles from UX Design for Conversational Interfaces when designing prompts and responses.
Step-by-step implementation
The steps below assume a modern mobile browser with WebAssembly plus WebGPU or WebNN support (Puma, Chrome/Chromium, or Safari where those APIs are enabled). The snippets are minimal sketches and design patterns you can adapt.
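Before committing to a runtime it helps to probe what the browser actually offers. The following is a minimal sketch, not a definitive check: `detectCapabilities` and `pickBackend` are hypothetical helper names, and it assumes WebGPU is exposed as `navigator.gpu` and WebNN as `navigator.ml`. The environment object is passed in so the logic can also be exercised outside a browser.

```javascript
// Probe for the browser features the rest of this guide relies on.
// `env` defaults to the global object; pass a stub to test the logic elsewhere.
function detectCapabilities(env = globalThis) {
  const nav = env.navigator ?? {};
  return {
    wasm: typeof env.WebAssembly === 'object',
    // Shared memory (needed for WASM threads) requires cross-origin isolation.
    wasmThreads: typeof env.SharedArrayBuffer === 'function' && env.crossOriginIsolated === true,
    webgpu: 'gpu' in nav,
    webnn: 'ml' in nav,
    indexedDb: 'indexedDB' in env,
  };
}

// Choose the fastest available backend, or null when inference cannot run locally.
function pickBackend(caps) {
  if (caps.webgpu) return 'webgpu';
  if (caps.webnn) return 'webnn';
  if (caps.wasm) return caps.wasmThreads ? 'wasm-threads' : 'wasm';
  return null;
}
```

A `null` result is the cue to degrade gracefully, for example by offering keyword search over cached notes instead of LLM answers.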
1) Choose the model and runtime
Tradeoffs:
- Smaller quantized models (7B–13B) can produce good embeddings, but even at 4-bit precision their weights occupy several gigabytes, so expect acceptable latency only on flagship phones.
- If you only need embeddings, use a dedicated lightweight embedder (128–1024 dims) — fewer cycles and smaller memory.
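A back-of-the-envelope calculation makes the size tradeoff concrete. The sketch below only multiplies parameter count by bits per weight; the function names are illustrative, and real quantized files run somewhat larger because they also store scales and metadata.

```javascript
// Rough weight-memory estimate: parameters x bits per weight / 8 bytes.
// Ignores the KV cache, activations, and runtime overhead, so treat it as a lower bound.
function weightBytes(paramCount, bitsPerWeight) {
  return (paramCount * bitsPerWeight) / 8;
}

function toGiB(bytes) {
  return bytes / 2 ** 30;
}

const candidates = [
  { name: '7B @ 4-bit', params: 7e9, bits: 4 },
  { name: '7B @ 8-bit', params: 7e9, bits: 8 },
  { name: '13B @ 4-bit', params: 13e9, bits: 4 },
];

for (const c of candidates) {
  console.log(`${c.name}: ~${toGiB(weightBytes(c.params, c.bits)).toFixed(2)} GiB`);
}
```

By this estimate a 7B model at 4 bits needs about 3.26 GiB for weights alone, which is why a dedicated embedder is the cheaper choice when generation is not required.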
Practical picks in 2026:
- Embedding-only: a compact open embedder exported to GGUF/ONNX for wasm.
- Full LLM for generation: a small Llama-family or Mistral-derived model in the 6–8B range, quantized and run through a llama.cpp WASM build or WebLLM (each runtime expects its own model format, such as GGUF for llama.cpp).
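Example: load a WASM LLM runtime (pseudocode; 'webllm-wasm' and its options are placeholders, so substitute the import and API of the runtime you pick)
// Initialize the runtime (pseudocode: module name and option names are illustrative)
import WebLLM from 'webllm-wasm';
// Leave one core for the UI; hardwareConcurrency may be undefined or 1
const cores = navigator.hardwareConcurrency || 2;
const runtime = await WebLLM.create({
  modelPath: '/models/gguf/llama3-small.gguf',
  useWebGPU: 'gpu' in navigator, // fall back to CPU-only WASM when WebGPU is missing
  threads: Math.max(1, cores - 1)
});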
2) Embeddings: compute, quantize, and cache
Compute embeddings on-device and store them in IndexedDB as encrypted blobs. For long-term performance use a compact numeric format (Float32Array or quantized Int8) and an efficient index.
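Simple embedding example (assumes the runtime exposes an embed method):
async function embedText(text) {
  // runtime could be a dedicated embedder or an LLM that supports .embed()
  const vec = await runtime.embed(text);
  // Normalize the result to a Float32Array regardless of what the runtime returns
  return vec instanceof Float32Array ? vec : Float32Array.from(vec);
}
// Store embeddings in IndexedDB (using idb or a small wrapper).
// getDb() is an app-level helper that opens the database and its 'embeddings' store (keyPath 'id').
async function storeEmbedding(id, vec) {
  const db = await getDb();
  // Stored in the clear here for brevity; encrypt vec first if the local store must be protected
  await db.put('embeddings', { id, vec });
}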
Optimization: quantize embeddings to Int8 or Float16 to shrink storage. Use a simple linear quantization if you don't want complex tooling.
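The simple linear quantization mentioned above can be sketched as follows. This is one possible scheme (symmetric, one scale per vector), shown under the assumption that embeddings arrive as a Float32Array; the function names are illustrative.

```javascript
// Symmetric linear quantization: map [-maxAbs, maxAbs] onto the Int8 range [-127, 127].
function quantizeInt8(vec) {
  let maxAbs = 0;
  for (const v of vec) maxAbs = Math.max(maxAbs, Math.abs(v));
  const scale = maxAbs > 0 ? maxAbs / 127 : 1; // avoid dividing by zero for all-zero vectors
  const q = new Int8Array(vec.length);
  for (let i = 0; i < vec.length; i++) q[i] = Math.round(vec[i] / scale);
  return { q, scale };
}

// Approximate reconstruction; the per-component error is about scale / 2 at most.
function dequantizeInt8({ q, scale }) {
  const out = new Float32Array(q.length);
  for (let i = 0; i < q.length; i++) out[i] = q[i] * scale;
  return out;
}
```

Storing `q` plus the single `scale` value takes roughly a quarter of the space of the original Float32Array, at the cost of a small loss in similarity precision that is worth measuring on your own corpus.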
3) Build a local vector index
Options:
- For small collections (<= 10k vectors): a brute-force cosine search with WebAssembly-accelerated dot product is fine.
- For larger sets: use HNSW via hnswlib-wasm or a JS port. HNSW gives sublinear search time on-device.
Brute-force search example (cosine):
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  // The small epsilon guards against division by zero for all-zero vectors
  return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-10);
}
async function search(queryVec, topK = 10) {
  const db = await getDb(); // same app-level helper used when storing embeddings
  const all = await db.getAll('embeddings');
  const scored = all.map(x => ({ id: x.id, score: cosine(queryVec, x.vec) }));
  scored.sort((a, b) => b.score - a.score);
  return scored.slice(0, topK);
}
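The brute-force approach can be made cheaper without changing its results much. The sketch below is one variation, assuming vectors are normalized once at indexing time so that cosine similarity reduces to a dot product, and keeping only the best hits instead of sorting every score. The names `normalize`, `dot`, and `searchNormalized` are illustrative, and the items are passed in as an array rather than read from IndexedDB.

```javascript
// Normalize once at indexing time so cosine similarity reduces to a dot product.
function normalize(vec) {
  let norm = 0;
  for (const v of vec) norm += v * v;
  norm = Math.sqrt(norm) || 1; // leave all-zero vectors unchanged
  return Float32Array.from(vec, v => v / norm);
}

function dot(a, b) {
  let s = 0;
  for (let i = 0; i < a.length; i++) s += a[i] * b[i];
  return s;
}

// Keep only the best topK hits in a small sorted array instead of sorting every score.
function searchNormalized(queryVec, items, topK = 10) {
  if (topK <= 0) return [];
  const q = normalize(queryVec);
  const best = []; // sorted by descending score, length <= topK
  for (const { id, vec } of items) {
    const score = dot(q, vec);
    if (best.length === topK && score <= best[best.length - 1].score) continue;
    let i = best.length;
    while (i > 0 && best[i - 1].score < score) i--;
    best.splice(i, 0, { id, score });
    if (best.length > topK) best.pop();
  }
  return best;
}
```

For small values of topK the insertion step is negligible, so the cost per query is dominated by one dot product per stored vector.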
4) Assemble the prompt and run local inference
After retrieving the top-K relevant items, build a context-aware prompt. Keep the prompt concise and use instruction templates to reduce token usage — critical for mobile.
// documents is assumed to be an in-memory map from id to { text }
const queryVec = await embedText(userQuery);
const topDocs = await search(queryVec, 5);
const context = topDocs.map(d => documents[d.id].text).join('\n---\n');
const prompt = `You are a privacy-first mobile assistant. Using the following context, answer concisely.\n\nContext:\n${context}\n\nQuestion: ${userQuery}`;
const resp = await runtime.generate(prompt, { maxTokens: 256 });
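Because prompt length drives both latency and memory on a phone, it can help to cap the context explicitly. The following sketch shows one way to do that; `buildPrompt` is a hypothetical helper, and it counts characters as a crude stand-in for tokens, which is an assumption you would replace with the runtime's tokenizer if it exposes one.

```javascript
// Build a prompt whose context stays within a character budget.
// Characters are a crude stand-in for tokens; swap in the runtime's tokenizer if it has one.
function buildPrompt(question, docs, maxContextChars = 2000) {
  const sep = '\n---\n';
  const parts = [];
  let used = 0;
  for (const text of docs) { // docs are assumed to be ordered best match first
    const cost = text.length + (parts.length ? sep.length : 0);
    if (used + cost > maxContextChars) break; // stop at the first passage that does not fit
    parts.push(text);
    used += cost;
  }
  const context = parts.length ? parts.join(sep) : '(no matching context found)';
  return [
    'You are a privacy-first mobile assistant. Using the following context, answer concisely.',
    '',
    'Context:',
    context,
    '',
    `Question: ${question}`,
  ].join('\n');
}
```

Stopping at the first passage that does not fit keeps the highest-ranked passages intact rather than truncating them mid-sentence.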
5) Secure sync without cloud decryption
The key idea: servers only store encrypted blobs. Devices hold the keys (or derive them using a passphrase or hardware-backed credential). Sync can use:
- Relay storage: a simple HTTP server/Cloud Storage that stores encrypted blobs; it cannot decrypt them.
- Peer-to-peer via WebRTC: direct device-to-device sync with signaling and end-to-end encryption.
Encryption pattern (recommended):
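- Derive a symmetric key per user from a passphrase using Argon2 (via a WASM library, since WebCrypto does not implement it) or PBKDF2 (built into WebCrypto). Alternatively, generate a random key and protect it with a platform authenticator: WebAuthn never exposes private keys for wrapping, but its PRF extension, where supported, can supply a secret from which to derive a wrapping key.
- Encrypt each document+embedding blob with AES-GCM, using a fresh IV per blob, and upload the ciphertext to the relay.
- Store metadata and plaintext indices only locally: the server sees only opaque blob IDs and timestamps.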
Example: encrypt a blob with WebCrypto
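async function deriveKeyFromPassphrase(pass, salt) {
  // salt: random per-user bytes, e.g. crypto.getRandomValues(new Uint8Array(16)).
  // It is not secret, so store it next to the ciphertext, but never hard-code it.
  const enc = new TextEncoder();
  const base = await crypto.subtle.importKey('raw', enc.encode(pass), 'PBKDF2', false, ['deriveKey']);
  // 600,000 iterations follows current OWASP guidance for PBKDF2-HMAC-SHA256; profile on target devices
  return crypto.subtle.deriveKey({
    name: 'PBKDF2', salt, iterations: 600_000, hash: 'SHA-256'
  }, base, { name: 'AES-GCM', length: 256 }, false, ['encrypt', 'decrypt']);
}
async function encryptBlob(key, data) {
  // data must be an ArrayBuffer or typed array. Use a fresh random 96-bit IV for
  // every encryption and never reuse an IV with the same key.
  const iv = crypto.getRandomValues(new Uint8Array(12));
  const ct = await crypto.subtle.encrypt({ name: 'AES-GCM', iv }, key, data);
  // The IV is not secret; keep it with the ciphertext so the blob can be decrypted later
  return { iv: Array.from(iv), ct: new Uint8Array(ct) };
}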
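For multi-device access, give each device its own key pair (for example a non-extractable RSA-OAEP or ECDH key generated with WebCrypto), wrap the symmetric key with each device's public key, and store the wrapped copies on the server. WebAuthn credentials are signing keys, so use them to authenticate a device rather than to wrap keys directly. That way the server holds only wrapped keys and ciphertext, never plaintext.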
6) Offline-first UX and background sync
Design patterns:
- Index and embed on content ingestion (background task). Use Web Workers to avoid blocking UI.
- Use a service worker for sync attempts: when online, upload encrypted blobs; otherwise queue them in IndexedDB and retry later.
- Show clear indicators: model loaded, embeddings current, sync status (queued / synced).
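The queue-then-upload pattern in the list above can be sketched as a small outbox. This is an illustrative sketch, not a complete sync layer: the `Outbox` class is hypothetical, `upload` is an injected async function (for example a fetch() PUT to your relay), and the queue lives in memory here where a real app would persist it in IndexedDB.

```javascript
// Minimal outbox: queue encrypted blobs while offline and flush them in order when possible.
class Outbox {
  constructor(upload) {
    this.upload = upload; // injected async function, assumed to reject on network failure
    this.pending = []; // in a real app, persist this queue in IndexedDB
  }

  enqueue(blob) {
    this.pending.push(blob);
  }

  // Try to upload everything; stop at the first failure so ordering is preserved.
  async flush() {
    let sent = 0;
    while (this.pending.length > 0) {
      try {
        await this.upload(this.pending[0]);
      } catch {
        break; // still offline or relay unavailable: keep the rest queued
      }
      this.pending.shift();
      sent++;
    }
    return { sent, queued: this.pending.length };
  }
}
```

The returned counts map directly onto the queued / synced indicator: call `flush()` when the browser fires its `online` event or from a service worker, and show `queued` in the UI.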
Performance, storage, and battery tradeoffs
Practical constraints:
- Model size vs accuracy: smaller models use less RAM and battery but may hallucinate more. Consider a hybrid: a small local model plus an optional, opt-in cloud model for difficult tasks.
- Embedding dimension: 512–1024 dimensions is a good middle ground. Quantize if storage is tight.
- Indexing cost: building HNSW can be CPU-heavy; build incrementally and offload to a Web Worker.
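To see how embedding dimension and quantization affect storage, a quick estimate is enough. The sketch below counts raw vector bytes only; the function name and the 10,000 x 768 example are illustrative assumptions, and IndexedDB overhead and any HNSW graph come on top.

```javascript
const BYTES_PER_COMPONENT = { float32: 4, float16: 2, int8: 1 };

// Raw vector storage in bytes, excluding IndexedDB overhead and any index graph.
function embeddingBytes(count, dims, format) {
  return count * dims * BYTES_PER_COMPONENT[format];
}

const toMB = bytes => (bytes / 1e6).toFixed(1);
for (const format of Object.keys(BYTES_PER_COMPONENT)) {
  console.log(`10k x 768 dims as ${format}: ${toMB(embeddingBytes(10_000, 768, format))} MB`);
}
```

For 10,000 vectors of 768 dimensions this comes to roughly 30.7 MB as float32 and 7.7 MB as int8, which is the kind of difference that matters under mobile storage quotas.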
Advanced strategies & tricks
Client-side model cascade
Run a fast, tiny model first to handle most queries. If it fails or is unsure, escalate to a larger local model (if available). This reduces latency and battery use.
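A cascade like this can be expressed in a few lines. The sketch below assumes both models are injected async functions that resolve to `{ text, confidence }` with confidence between 0 and 1; how a runtime reports confidence (for example mean token probability) is an assumption, and the 0.6 threshold is a placeholder to tune.

```javascript
// Two-stage cascade: answer with the small model unless its confidence is too low.
async function cascade(prompt, smallModel, largeModel, threshold = 0.6) {
  let first = null;
  try {
    first = await smallModel(prompt);
  } catch {
    // A failed small model counts as "unsure" and falls through to escalation.
  }
  if (first && first.confidence >= threshold) return { ...first, model: 'small' };
  // No larger model loaded: return the low-confidence answer (or null) rather than nothing.
  if (!largeModel) return first ? { ...first, model: 'small' } : null;
  return { ...(await largeModel(prompt)), model: 'large' };
}
```

Returning which model answered lets the UI flag low-confidence responses and lets you measure how often escalation actually happens on real queries.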
Hybrid embeddings
Keep two embedding types: cheap semantic embeddings for most search tasks and expensive, higher-dimensional embeddings for specialized corpora. Use the cheap embeddings to filter candidates and the expensive ones to re-rank them.
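The filter-then-re-rank idea can be sketched as a two-stage search. This sketch assumes each item carries both embeddings, already normalized, under the hypothetical field names `cheap` and `rich`; the candidate and topK defaults are placeholders.

```javascript
function dotProduct(a, b) {
  let s = 0;
  for (let i = 0; i < a.length; i++) s += a[i] * b[i];
  return s;
}

// Stage 1 filters with cheap low-dimensional vectors; stage 2 re-ranks survivors with expensive ones.
// Items are assumed to look like { id, cheap, rich } with normalized vectors.
function twoStageSearch(query, items, { candidates = 50, topK = 5 } = {}) {
  const shortlist = items
    .map(item => ({ item, score: dotProduct(query.cheap, item.cheap) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, candidates);
  return shortlist
    .map(({ item }) => ({ id: item.id, score: dotProduct(query.rich, item.rich) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```

Only the shortlisted items need their expensive vectors loaded, so those can stay in IndexedDB until stage 2 asks for them.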
Progressive quantization
Compress older, rarely-accessed vectors more aggressively. Keep recent vectors high-fidelity. This reduces storage while maintaining relevance.
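One way to express such a policy is a small planning function that runs during idle maintenance. The sketch below is illustrative only: the record shape (`id`, `precision`, `lastAccessMs`) is assumed, and the 7-day and 90-day cut-offs are placeholders rather than tuned values.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Pick a storage precision from how recently a vector was used.
function targetPrecision(lastAccessMs, nowMs = Date.now()) {
  const ageDays = (nowMs - lastAccessMs) / DAY_MS;
  if (ageDays < 7) return 'float32';
  if (ageDays < 90) return 'float16';
  return 'int8';
}

// List the records whose stored precision should be lowered on the next maintenance pass.
function planRecompression(records, nowMs = Date.now()) {
  const rank = { float32: 0, float16: 1, int8: 2 };
  return records
    .map(r => ({ id: r.id, from: r.precision, to: targetPrecision(r.lastAccessMs, nowMs) }))
    .filter(p => rank[p.to] > rank[p.from]); // never "upgrade": lost precision cannot be recovered
}
```

Separating the plan from the actual re-encoding keeps the expensive work in a Web Worker and makes the policy easy to test.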
Sealed-key sync using WebAuthn
Use WebAuthn to bind each device to the account and, where the PRF extension is available, to derive a key that seals the per-user symmetric key. When adding a device, use a short QR/OTP flow to hand the new device a wrapped copy of that key, which avoids exposing the passphrase.
Security checklist
- Use AES-GCM (or XChaCha20-Poly1305 via libsodium) for blob encryption.
- Derive keys using Argon2 or PBKDF2 with high iteration counts; prefer platform-backed keys (WebAuthn) when possible.
- Never log plaintext data in analytics.
- Design your relay to be zero-knowledge: it stores only ciphertexts and metadata like last-modified timestamps.
2026 Trends & future predictions
Expect these developments to further empower local mobile assistants:
- Better mobile WebNN and WebGPU drivers: optimized GPU kernels reduce inference time on phones.
- Standardized client-side model packaging: easier distribution of GGUF/ONNX/flatbuffers for browser runtimes.
- More compact open models: sub-4B models with near-LLM performance for embedding and retrieval.
- Privacy regulations: stronger legal demand for data-minimizing local AI will accelerate adoption of zero-knowledge sync patterns.
Full example: minimal flow
End-to-end flow summary you can prototype today:
- Load a wasm embedder at app start (lazy load to save RAM).
- User highlights text: compute embedding -> quantize -> store encrypted blob in IndexedDB and add to HNSW index (Web Worker).
- User queries: compute query embedding -> search local index -> assemble context -> run local LLM -> show answer.
- On network: upload encrypted blobs to your relay; store wrapped symmetric keys for each device public key.
Common pitfalls and how to avoid them
- Model memory spikes: lazy-load models and release memory when idle; use streaming inference if supported.
- Slow cold starts: persist a tiny cached model or warm the model during idle time.
- Sync conflicts: use vector clocks (version vectors) and CRDT-like merge rules for local metadata.
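For the sync-conflict case, a vector clock comparison plus a deterministic merge rule is often enough for small metadata records. The sketch below is one simple approach, not a full CRDT: the record shape (`deviceId`, `clock`) is assumed, and concurrent edits are resolved by a lossy tie-break on device id.

```javascript
// Compare two vector clocks (maps of deviceId -> counter).
// Returns 'equal', 'before' (a happened before b), 'after', or 'concurrent'.
function compareClocks(a, b) {
  let aAhead = false, bAhead = false;
  for (const id of new Set([...Object.keys(a), ...Object.keys(b)])) {
    const x = a[id] ?? 0, y = b[id] ?? 0;
    if (x > y) aAhead = true;
    if (y > x) bAhead = true;
  }
  if (aAhead && bAhead) return 'concurrent';
  if (aAhead) return 'after';
  if (bAhead) return 'before';
  return 'equal';
}

// Merge two versions of a metadata record: causal order wins; concurrent edits
// are resolved by device id so that every device picks the same winner.
function mergeRecords(local, remote) {
  const order = compareClocks(local.clock, remote.clock);
  let winner = local;
  if (order === 'before') winner = remote;
  else if (order === 'concurrent') winner = local.deviceId >= remote.deviceId ? local : remote;
  const clock = {};
  for (const id of new Set([...Object.keys(local.clock), ...Object.keys(remote.clock)])) {
    clock[id] = Math.max(local.clock[id] ?? 0, remote.clock[id] ?? 0);
  }
  return { ...winner, clock };
}
```

Because the tie-break depends only on the records themselves, two devices merging the same pair in opposite roles converge on the same result. Fields that must not lose concurrent edits (tags, for instance) would need a proper set-based merge instead.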
Actionable takeaways
- Start with a separate local embedder — it’s cheaper than running a full LLM for every query.
- Use IndexedDB + Web Worker + HNSW (or brute-force for small sets) to keep searches snappy on-device.
- Encrypt everything before sync. Use WebAuthn to avoid password copying and to bind devices securely.
- Profile on target devices (mid-range Android, iPhone) — CPU, RAM, and battery behavior vary widely.
Where to prototype & test
Tools and libs to explore in 2026:
- llama.cpp WASM / WebLLM for browser inference.
- hnswlib-wasm or lightweight JS HNSW implementations for vector search.
- idb (IndexedDB helper), WebCrypto, and WebAuthn for storage and keys.
- Puma Browser as a test platform for mobile-first local-AI behavior.
Final notes — why build this now
Privacy-first, local mobile assistants are no longer a niche. In 2026 the stack matured: low-level browser APIs, quantized open models, and portable runtimes make it practical to ship user-first experiences that keep data on-device. The approach in this guide balances privacy, performance, and usability, letting you ship a real product without depending on third-party inference APIs.
Call to action
Ready to prototype? Start by packaging a compact embedder for browser use, wire up IndexedDB + a small HNSW index, and experiment with AES-GCM encrypted blob sync to a relay server. Build a minimal MVP in a week and iterate on model size and sync UX. Share your progress, benchmarks, and device profiles — privacy-first mobile assistants will improve fastest when engineers publish real-world measurements.