Build a Privacy-First Mobile Search Assistant Using Puma and Open LLMs
Build a privacy-first mobile assistant that runs fully in-browser using Puma, WebAssembly LLMs, cached embeddings, and end-to-end encrypted sync.
You're tired of handing every personal search, note, and document to cloud AI. You want a fast, reliable mobile assistant that runs in the browser, works offline, and never sends your data to a third party. In 2026 this is achievable: modern mobile browsers (Puma included), WebAssembly runtimes with WebGPU/WebNN acceleration, and quantized open models let you run inference and store embeddings entirely on-device. This guide shows how.
Why this matters in 2026
Late 2025 and early 2026 saw two shifts that make local, privacy-first assistants practical:
- Broad mobile support for WebAssembly with threads and SIMD, and widespread WebGPU/WebNN implementations in major browsers and in privacy-first alternatives like Puma Browser.
- Open LLMs released in compact, well-quantized formats (GGUF, the successor to the older GGML format, with modern quantization schemes), plus browser-targeted runtimes (WASM builds of llama.cpp, WebLLM, hnswlib-wasm), enabling reasonable latency on modern phones.
“Puma Browser is an example of a mobile-first browser that embraces local AI — letting you run models and keep data on-device.”
High-level architecture — what you’ll build
Goal: a mobile web app (runs in Puma or a modern browser) that:
- Runs a client-side LLM runtime for answers and query understanding (WebAssembly or WebGPU-backed).
- Creates and caches embeddings locally for documents, pages, and notes.
- Performs local vector search (HNSW or brute-force) against the cached embeddings.
- Syncs encrypted data across devices without exposing plaintext — using end-to-end encryption and sealed storage / relay-only servers.
- Works offline and degrades gracefully when models can’t be loaded.
Components
- Client-side LLM runtime: llama.cpp WASM, WebLLM, or other WebAssembly/WebNN runtime that supports inference and embeddings.
- Embedding pipeline: local embedder (model or embed-only network) that turns text into vectors, cached in IndexedDB.
- Vector DB: small in-browser HNSW or brute-force index (hnswlib-wasm or custom lightweight index).
- Secure sync: end-to-end encryption of blobs + relay or peer-to-peer (WebRTC) for sync without server-side decryption.
- UI/UX: fast search UI with incremental indexing, feedback loop for re-ranking, and offline-first UX. See the principles from UX Design for Conversational Interfaces when designing prompts and responses.
Step-by-step implementation
The steps below assume a modern mobile browser (Puma, Chrome/Chromium with WebGPU/WebAssembly support, or Safari with WebNN). We'll include minimal code snippets and design patterns you can adapt.
1) Choose the model and runtime
Tradeoffs:
- Smaller quantized models (7B–13B) can produce usable embeddings and run locally with acceptable latency on flagship phones, though the upper end of that range strains mobile RAM.
- If you only need embeddings, use a dedicated lightweight embedder (128–1024 dims) — fewer cycles and smaller memory.
Practical picks in 2026:
- Embedding-only: a compact open embedder exported to GGUF/ONNX for wasm.
- Full LLM for generation: a compact Llama- or Mistral-derived model in the 6–8B range, quantized to GGUF and loaded through llama.cpp WASM or WebLLM.
Example: load a WASM LLM runtime (pseudo-code; replace with the runtime you pick)
// init a WASM LLM runtime (pseudocode; adapt to the runtime you choose)
import WebLLM from 'webllm-wasm';

const runtime = await WebLLM.create({
  modelPath: '/models/gguf/llama3-small.gguf',
  useWebGPU: true, // fall back to CPU/WASM inference if WebGPU is unavailable
  threads: Math.max(1, navigator.hardwareConcurrency - 1) // leave one core for the UI
});
2) Embeddings: compute, quantize, and cache
Compute embeddings on-device and store them in IndexedDB as encrypted blobs. For long-term performance use a compact numeric format (Float32Array or quantized Int8) and an efficient index.
Simple embedding code example (assumes the runtime exposes an embed method):
async function embedText(text) {
  // runtime could be a dedicated embedder or an LLM that supports .embed()
  const vec = await runtime.embed(text); // Float32Array
  return vec;
}

// Store embeddings in IndexedDB (using idb or a small wrapper).
// In production, encrypt the record before writing it; see step 5.
async function storeEmbedding(id, vec) {
  const db = await getDb(); // getDb() opens and caches the IndexedDB connection
  await db.put('embeddings', { id, vec });
}
Optimization: quantize embeddings to Int8 or Float16 to shrink storage. Use a simple linear quantization if you don't want complex tooling.
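A minimal sketch of that linear approach, using a symmetric per-vector scale (the helper names are illustrative, not from a specific library):
// Quantize a Float32Array to Int8 with a per-vector scale (symmetric linear quantization).
function quantizeInt8(vec) {
  let maxAbs = 0;
  for (let i = 0; i < vec.length; i++) maxAbs = Math.max(maxAbs, Math.abs(vec[i]));
  const scale = maxAbs / 127 || 1; // guard against all-zero vectors
  const q = new Int8Array(vec.length);
  for (let i = 0; i < vec.length; i++) q[i] = Math.round(vec[i] / scale);
  return { q, scale };
}

// Dequantize before computing cosine similarity (or score directly on Int8 for speed).
function dequantizeInt8({ q, scale }) {
  const vec = new Float32Array(q.length);
  for (let i = 0; i < q.length; i++) vec[i] = q[i] * scale;
  return vec;
}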
3) Build a local vector index
Options:
- For small collections (<= 10k vectors): a brute-force cosine search with WebAssembly-accelerated dot product is fine.
- For larger sets: use HNSW via hnswlib-wasm or a JS port. HNSW gives sublinear search time on-device (see the sketch after the brute-force example below).
Brute-force search example (cosine):
function cosine(a, b) {
let dot = 0, na = 0, nb = 0;
for (let i = 0; i < a.length; i++) {
dot += a[i] * b[i];
na += a[i] * a[i];
nb += b[i] * b[i];
}
return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-10);
}
async function search(queryVec, topK = 10) {
  const db = await getDb();
  const all = await db.getAll('embeddings');
  const scored = all.map(x => ({ id: x.id, score: cosine(queryVec, x.vec) }));
  scored.sort((a, b) => b.score - a.score);
  return scored.slice(0, topK);
}
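For larger collections, the HNSW route looks roughly like the sketch below. The names here (loadHnswlib, HierarchicalNSW, initIndex, addPoint, searchKnn) follow hnswlib-style bindings and are assumptions; check the exact API of the wasm port you pick.
// Build and query an HNSW index (API names assumed from hnswlib-style bindings).
const lib = await loadHnswlib();                      // async wasm module init
const index = new lib.HierarchicalNSW('cosine', 512); // metric and embedding dimension (512 as an example)
index.initIndex(50_000);                              // max elements; grow or rebuild as the corpus expands

function addToIndex(label, vec) {
  index.addPoint(vec, label);                         // label is a numeric id mapped to your document id
}

function searchIndex(queryVec, topK = 10) {
  const { neighbors, distances } = index.searchKnn(queryVec, topK);
  // hnswlib's cosine space returns distance = 1 - similarity
  return neighbors.map((label, i) => ({ label, score: 1 - distances[i] }));
}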
4) Assemble the prompt and run local inference
After retrieving the top-K relevant items, build a context-aware prompt. Keep the prompt concise and use instruction templates to reduce token usage — critical for mobile.
const queryVec = await embedText(userQuery); // embed the query with the same embedder as the corpus
const topDocs = await search(queryVec, 5);
const context = topDocs.map(d => documents[d.id].text).join('\n---\n'); // documents: local id -> text map
const prompt = `You are a privacy-first mobile assistant. Using the following context, answer concisely.\n\nContext:\n${context}\n\nQuestion: ${userQuery}`;
const resp = await runtime.generate(prompt, { maxTokens: 256 });
5) Secure sync without cloud decryption
The key idea: servers only store encrypted blobs. Devices hold the keys (or derive them using a passphrase or hardware-backed credential). Sync can use:
- Relay storage: a simple HTTP server/Cloud Storage that stores encrypted blobs; it cannot decrypt them.
- Peer-to-peer via WebRTC: direct device-to-device sync with signaling and end-to-end encryption.
Encryption pattern (recommended):
- Derive a symmetric key per user using Argon2/PBKDF2, or use a hardware-backed credential (for example the WebAuthn PRF extension on a platform authenticator) to derive or wrap the symmetric key.
- Encrypt each document+embedding blob with AES-GCM and store the ciphertext on the relay.
- Store metadata/plain indices only locally — the server only sees opaque blob IDs and timestamps.
Example: encrypt a blob with WebCrypto
async function deriveKeyFromPassphrase(pass, saltBytes) {
  // saltBytes should be random per user (crypto.getRandomValues) and stored alongside the account,
  // never a hardcoded string.
  const enc = new TextEncoder();
  const base = await crypto.subtle.importKey('raw', enc.encode(pass), 'PBKDF2', false, ['deriveKey']);
  return crypto.subtle.deriveKey({
    name: 'PBKDF2', salt: saltBytes, iterations: 600_000, hash: 'SHA-256' // OWASP-recommended minimum for PBKDF2-SHA256
  }, base, { name: 'AES-GCM', length: 256 }, false, ['encrypt', 'decrypt']);
}

async function encryptBlob(key, data) {
  const iv = crypto.getRandomValues(new Uint8Array(12)); // fresh IV per blob; never reuse with the same key
  const ct = await crypto.subtle.encrypt({ name: 'AES-GCM', iv }, key, data);
  return { iv: Array.from(iv), ct: new Uint8Array(ct) };
}
For multi-device access, wrap the symmetric key for each device and store only the wrapped keys on the server. WebAuthn credentials are signing keys, so they cannot encrypt directly; either derive a wrapping key via the WebAuthn PRF extension or generate a separate per-device WebCrypto keypair for wrapping. Either way the server never sees plaintext.
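A minimal sketch of the per-device keypair option, using WebCrypto RSA-OAEP to wrap a randomly generated data key (dataKey, deviceKeys, and the helper names are illustrative):
// Generate a random data-encryption key (extractable so it can be wrapped for other devices).
const dataKey = await crypto.subtle.generateKey(
  { name: 'AES-GCM', length: 256 }, true, ['encrypt', 'decrypt']
);

// Each device generates its own wrapping keypair once and publishes the public key.
const deviceKeys = await crypto.subtle.generateKey(
  { name: 'RSA-OAEP', modulusLength: 2048, publicExponent: new Uint8Array([1, 0, 1]), hash: 'SHA-256' },
  false, ['wrapKey', 'unwrapKey']
);

// Wrap the data key for a given device; the relay stores only this opaque wrapped blob.
async function wrapForDevice(devicePublicKey) {
  return crypto.subtle.wrapKey('raw', dataKey, devicePublicKey, { name: 'RSA-OAEP' });
}

// On the receiving device, unwrap with its private key to recover the same AES-GCM data key.
async function unwrapOnDevice(wrapped, devicePrivateKey) {
  return crypto.subtle.unwrapKey(
    'raw', wrapped, devicePrivateKey, { name: 'RSA-OAEP' },
    { name: 'AES-GCM', length: 256 }, false, ['encrypt', 'decrypt']
  );
}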
6) Offline-first UX and background sync
Design patterns:
- Index and embed on content ingestion (background task). Use Web Workers to avoid blocking UI.
- Use service workers for network sync attempts; if online, upload encrypted blobs, otherwise queue them in IndexedDB (a minimal queue sketch follows this list).
- Show clear indicators: model loaded, embeddings current, sync status (queued / synced).
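A minimal sketch of the queue-and-flush pattern, assuming an 'outbox' object store in IndexedDB and a hypothetical relay endpoint at /blobs/:id:
// Queue an encrypted blob locally; flush whenever connectivity returns.
async function queueForSync(id, encryptedBlob) {
  const db = await getDb();
  await db.put('outbox', { id, blob: encryptedBlob, queuedAt: Date.now() });
  if (navigator.onLine) await flushOutbox();
}

async function flushOutbox() {
  const db = await getDb();
  const pending = await db.getAll('outbox');
  for (const item of pending) {
    try {
      // Prepend the IV so other devices can decrypt; the relay only ever sees ciphertext and an opaque id.
      await fetch(`/blobs/${encodeURIComponent(item.id)}`, {
        method: 'PUT',
        headers: { 'Content-Type': 'application/octet-stream' },
        body: new Blob([new Uint8Array(item.blob.iv), item.blob.ct])
      });
      await db.delete('outbox', item.id);
    } catch {
      break; // still offline or relay unreachable; retry on the next flush
    }
  }
}

window.addEventListener('online', flushOutbox);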
Performance, storage, and battery tradeoffs
Practical constraints:
- Model size vs accuracy: smaller models use less RAM and battery but may hallucinate more. Use hybrid: small local model + optional cloud-only heavy model (opt-in) for difficult tasks.
- Embedding dimension: 512–1024 dims is a good middle ground. Quantize if storage is tight.
- Indexing cost: building HNSW can be CPU-heavy; build incrementally and offload to a Web Worker, as sketched below.
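A rough sketch of the worker hand-off; the file path and message shapes are illustrative, and the worker is assumed to load its own embedder plus the storeEmbedding/addToIndex helpers from earlier steps:
// main thread: hand indexing work to a dedicated worker so the UI stays responsive
const indexWorker = new Worker('/workers/indexer.js', { type: 'module' });

function indexDocument(id, text) {
  indexWorker.postMessage({ type: 'index', id, text });
}

indexWorker.onmessage = (e) => {
  if (e.data.type === 'indexed') {
    // mark document e.data.id as searchable in the UI
  }
};

// /workers/indexer.js: runs off the main thread with its own embedder instance
self.onmessage = async (e) => {
  if (e.data.type !== 'index') return;
  const vec = await runtime.embed(e.data.text); // runtime: the worker's own embedder
  await storeEmbedding(e.data.id, vec);         // IndexedDB is accessible from workers
  addToIndex(labelFor(e.data.id), vec);         // update the in-worker vector index
  self.postMessage({ type: 'indexed', id: e.data.id });
};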
Advanced strategies & tricks
Client-side model cascade
Run a fast, tiny model first to handle most queries. If it fails or is unsure, escalate to a larger local model (if available). This reduces latency and battery use.
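A sketch of the cascade, assuming two loaded runtimes (tinyRuntime, largeRuntime) and that generate() resolves to a string; the escalation check is a deliberately crude heuristic, not a standard runtime API:
// Try the tiny model first; escalate to the larger local model only when the answer looks weak.
async function cascadeGenerate(prompt) {
  const draft = await tinyRuntime.generate(prompt, { maxTokens: 128 });

  // Crude escalation heuristic: empty, very short, or explicitly uncertain answers get retried.
  const weak = !draft || draft.trim().length < 20 || /i (don't|do not) know/i.test(draft);
  if (!weak || !largeRuntime) return draft;

  return largeRuntime.generate(prompt, { maxTokens: 256 });
}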
Hybrid embeddings
Keep two embedding types: cheap semantic embeddings for most search tasks and expensive embeddings (higher-dim) for specialized corpora. Use the cheap embedding to filter candidates and expensive to re-rank.
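A sketch of filter-then-rerank, assuming a cheapEmbed/expensiveEmbed pair, the search and cosine helpers from step 3 running over the cheap index, and a getExpensiveVector cache (all illustrative names):
// Retrieve broadly with cheap embeddings, then re-rank the short list with higher-dimensional ones.
async function hybridSearch(query, topK = 10) {
  const cheapQuery = await cheapEmbed(query);
  const candidates = await search(cheapQuery, 50); // wide net over the cheap index

  const richQuery = await expensiveEmbed(query);
  const reranked = await Promise.all(candidates.map(async (c) => {
    const richVec = await getExpensiveVector(c.id); // precomputed and cached for specialized corpora
    return { id: c.id, score: cosine(richQuery, richVec) };
  }));

  reranked.sort((a, b) => b.score - a.score);
  return reranked.slice(0, topK);
}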
Progressive quantization
Compress older, rarely-accessed vectors more aggressively. Keep recent vectors high-fidelity. This reduces storage while maintaining relevance.
Sealed-key sync using WebAuthn
Use WebAuthn to store device public keys and seal a symmetric key per user. When adding devices, use a short QR/OTP flow to grant the new device access to the encrypted key — avoids exposing the passphrase.
Security checklist
- Use AES-GCM (or XChaCha20-Poly1305 via libsodium) for blob encryption.
- Derive keys using Argon2 or PBKDF2 with high iteration counts; prefer platform-backed keys (WebAuthn) when possible.
- Never log plaintext data in analytics.
- Design your relay to be zero-knowledge: it stores only ciphertexts and metadata like last-modified timestamps.
2026 Trends & future predictions
Expect these developments to further empower local mobile assistants:
- Better mobile WebNN and WebGPU drivers: optimized GPU kernels reduce inference time on phones.
- Standardized client-side model packaging: easier distribution of GGUF/ONNX/flatbuffers for browser runtimes.
- More compact open models: sub-4B models with near-LLM performance for embedding and retrieval.
- Privacy regulations: stronger legal demand for data-minimizing local AI will accelerate adoption of zero-knowledge sync patterns.
Full example: minimal flow
End-to-end flow summary you can prototype today:
- Load a wasm embedder at app start (lazy load to save RAM).
- User highlights text: compute embedding -> quantize -> store encrypted blob in IndexedDB and add to HNSW index (Web Worker).
- User queries: compute query embedding -> search local index -> assemble context -> run local LLM -> show answer.
- On network: upload encrypted blobs to your relay; store wrapped symmetric keys for each device public key.
Common pitfalls and how to avoid them
- Model memory spikes: lazy-load models and release memory when idle; use streaming inference if supported.
- Slow cold starts: persist a tiny cached model or warm the model during idle time.
- Sync conflicts: use vector clocks (version vectors) and CRDT-like merge rules for local metadata.
Actionable takeaways
- Start with a separate local embedder — it’s cheaper than running a full LLM for every query.
- Use IndexedDB + Web Worker + HNSW (or brute-force for small sets) to keep searches snappy on-device.
- Encrypt everything before sync. Use WebAuthn to avoid password copying and to bind devices securely.
- Profile on target devices (mid-range Android, iPhone) — CPU, RAM, and battery behavior vary widely.
Where to prototype & test
Tools and libs to explore in 2026:
- llama.cpp WASM / WebLLM for browser inference.
- hnswlib-wasm or lightweight JS HNSW implementations for vector search.
- idb (IndexedDB helper), WebCrypto, and WebAuthn for storage and keys.
- Puma Browser as a test platform for mobile-first local-AI behavior.
Final notes — why build this now
Privacy-first, local mobile assistants are no longer a niche. In 2026 the stack matured: low-level browser APIs, quantized open models, and portable runtimes make it practical to ship user-first experiences that keep data on-device. The approach in this guide balances privacy, performance, and usability, letting you ship a real product without depending on third-party inference APIs.
Call to action
Ready to prototype? Start by packaging a compact embedder for browser use, wire up IndexedDB + a small HNSW index, and experiment with AES-GCM encrypted blob sync to a relay server. Build a minimal MVP in a week and iterate on model size and sync UX. Share your progress, benchmarks, and device profiles — privacy-first mobile assistants will improve fastest when engineers publish real-world measurements.
Related Reading
- How to Design Cache Policies for On-Device AI Retrieval (2026 Guide)
- Legal & Privacy Implications for Cloud Caching in 2026: A Practical Guide
- Integrating On-Device AI with Cloud Analytics: Feeding ClickHouse from Raspberry Pi Micro Apps
- Observability for Edge AI Agents in 2026: Queryable Models, Metadata Protection and Compliance-First Patterns