Build a Privacy-First Mobile Search Assistant Using Puma and Open LLMs
Build a privacy-first mobile assistant that runs fully in-browser using Puma, WebAssembly LLMs, cached embeddings, and end-to-end encrypted sync.
You're tired of handing every personal search, note, and document to cloud AI. You want a fast, reliable mobile assistant that runs in the browser, works offline, and never sends your data to a third party. In 2026 this is achievable: modern mobile browsers (Puma included), WebAssembly runtimes with WebGPU/WebNN acceleration, and quantized open models let you run inference and store embeddings entirely on-device. This guide shows how.
Why this matters in 2026
Late 2025 and early 2026 saw two shifts that make local, privacy-first assistants practical:
- Broad mobile support for WebAssembly with threads and SIMD, and widespread WebGPU/WebNN implementations in major browsers and in privacy-first alternatives like Puma Browser.
- Open LLMs released in compact, well-quantized formats (GGUF, the successor to the older GGML format, with modern quantization schemes), plus browser-targeted runtimes (WASM builds of llama.cpp, WebLLM, hnswlib-wasm), enabling reasonable latency on modern phones.
“Puma Browser is an example of a mobile-first browser that embraces local AI — letting you run models and keep data on-device.”
High-level architecture — what you’ll build
Goal: a mobile web app (runs in Puma or a modern browser) that:
- Runs a client-side LLM runtime for answers and query understanding (WebAssembly or WebGPU-backed).
- Creates and caches embeddings locally for documents, pages, and notes.
- Performs local vector search (HNSW or brute-force) against the cached embeddings.
- Syncs encrypted data across devices without exposing plaintext — using end-to-end encryption and sealed storage / relay-only servers.
- Works offline and degrades gracefully when models can’t be loaded.
Components
- Client-side LLM runtime: llama.cpp WASM, WebLLM, or other WebAssembly/WebNN runtime that supports inference and embeddings.
- Embedding pipeline: local embedder (model or embed-only network) that turns text into vectors, cached in IndexedDB.
- Vector DB: small in-browser HNSW or brute-force index (hnswlib-wasm or custom lightweight index).
- Secure sync: end-to-end encryption of blobs + relay or peer-to-peer (WebRTC) for sync without server-side decryption.
- UI/UX: fast search UI with incremental indexing, feedback loop for re-ranking, and offline-first UX. See the principles from UX Design for Conversational Interfaces when designing prompts and responses.
Step-by-step implementation
The steps below assume a modern mobile browser (Puma, Chrome/Chromium with WebGPU/WebAssembly support, or Safari with WebNN). We'll include minimal code snippets and design patterns you can adapt.
1) Choose the model and runtime
Tradeoffs:
- Smaller quantized models (7B–13B) can produce usable embeddings and run locally with acceptable latency on flagship phones, though the upper end of that range strains mobile RAM.
- If you only need embeddings, use a dedicated lightweight embedder (128–1024 dims) — fewer cycles and smaller memory.
Practical picks in 2026:
- Embedding-only: a compact open embedder exported to GGUF/ONNX for wasm.
- Full LLM for generation: a compact Llama- or Mistral-derived model in the 6–8B range, quantized to GGUF and loaded through llama.cpp WASM or WebLLM.
Example: load a WASM LLM runtime (pseudo-code; replace with the runtime you pick)
// init a WASM LLM runtime (pseudocode; adapt to the runtime you choose)
import WebLLM from 'webllm-wasm';

const runtime = await WebLLM.create({
  modelPath: '/models/gguf/llama3-small.gguf',
  useWebGPU: true, // fall back to CPU/WASM inference if WebGPU is unavailable
  threads: Math.max(1, navigator.hardwareConcurrency - 1) // leave one core for the UI
});
2) Embeddings: compute, quantize, and cache
Compute embeddings on-device and store them in IndexedDB as encrypted blobs. For long-term performance use a compact numeric format (Float32Array or quantized Int8) and an efficient index.
Simple embedding code example (assumes the runtime exposes an embed method):
async function embedText(text) {
  // runtime could be a dedicated embedder or an LLM that supports .embed()
  const vec = await runtime.embed(text); // Float32Array
  return vec;
}

// Store embeddings in IndexedDB (using idb or a small wrapper).
// In production, encrypt the record before writing it; see step 5.
async function storeEmbedding(id, vec) {
  const db = await getDb(); // getDb() opens and caches the IndexedDB connection
  await db.put('embeddings', { id, vec });
}
Optimization: quantize embeddings to Int8 or Float16 to shrink storage. Use a simple linear quantization if you don't want complex tooling.
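A minimal sketch of that linear approach, using a symmetric per-vector scale (the helper names are illustrative, not from a specific library):
// Quantize a Float32Array to Int8 with a per-vector scale (symmetric linear quantization).
function quantizeInt8(vec) {
  let maxAbs = 0;
  for (let i = 0; i < vec.length; i++) maxAbs = Math.max(maxAbs, Math.abs(vec[i]));
  const scale = maxAbs / 127 || 1; // guard against all-zero vectors
  const q = new Int8Array(vec.length);
  for (let i = 0; i < vec.length; i++) q[i] = Math.round(vec[i] / scale);
  return { q, scale };
}

// Dequantize before computing cosine similarity (or score directly on Int8 for speed).
function dequantizeInt8({ q, scale }) {
  const vec = new Float32Array(q.length);
  for (let i = 0; i < q.length; i++) vec[i] = q[i] * scale;
  return vec;
}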
3) Build a local vector index
Options:
- For small collections (<= 10k vectors): a brute-force cosine search with WebAssembly-accelerated dot product is fine.
- For larger sets: use HNSW via hnswlib-wasm or a JS port. HNSW gives sublinear search time on-device (see the sketch after the brute-force example below).
Brute-force search example (cosine):
function cosine(a, b) {
let dot = 0, na = 0, nb = 0;
for (let i = 0; i < a.length; i++) {
dot += a[i] * b[i];
na += a[i] * a[i];
nb += b[i] * b[i];
}
return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-10);
}
async function search(queryVec, topK = 10) {
  const db = await getDb();
  const all = await db.getAll('embeddings');
  const scored = all.map(x => ({ id: x.id, score: cosine(queryVec, x.vec) }));
  scored.sort((a, b) => b.score - a.score);
  return scored.slice(0, topK);
}
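For larger collections, the HNSW route looks roughly like the sketch below. The names here (loadHnswlib, HierarchicalNSW, initIndex, addPoint, searchKnn) follow hnswlib-style bindings and are assumptions; check the exact API of the wasm port you pick.
// Build and query an HNSW index (API names assumed from hnswlib-style bindings).
const lib = await loadHnswlib();                      // async wasm module init
const index = new lib.HierarchicalNSW('cosine', 512); // metric and embedding dimension (512 as an example)
index.initIndex(50_000);                              // max elements; grow or rebuild as the corpus expands

function addToIndex(label, vec) {
  index.addPoint(vec, label);                         // label is a numeric id mapped to your document id
}

function searchIndex(queryVec, topK = 10) {
  const { neighbors, distances } = index.searchKnn(queryVec, topK);
  // hnswlib's cosine space returns distance = 1 - similarity
  return neighbors.map((label, i) => ({ label, score: 1 - distances[i] }));
}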
4) Assemble the prompt and run local inference
After retrieving the top-K relevant items, build a context-aware prompt. Keep the prompt concise and use instruction templates to reduce token usage — critical for mobile.
const queryVec = await embedText(userQuery); // embed the query with the same embedder as the corpus
const topDocs = await search(queryVec, 5);
const context = topDocs.map(d => documents[d.id].text).join('\n---\n'); // documents: local id -> text map
const prompt = `You are a privacy-first mobile assistant. Using the following context, answer concisely.\n\nContext:\n${context}\n\nQuestion: ${userQuery}`;
const resp = await runtime.generate(prompt, { maxTokens: 256 });
5) Secure sync without cloud decryption
The key idea: servers only store encrypted blobs. Devices hold the keys (or derive them using a passphrase or hardware-backed credential). Sync can use:
- Relay storage: a simple HTTP server/Cloud Storage that stores encrypted blobs; it cannot decrypt them.
- Peer-to-peer via WebRTC: direct device-to-device sync with signaling and end-to-end encryption.
Encryption pattern (recommended):
- Derive a symmetric key per user using Argon2/PBKDF2, or use a hardware-backed credential (for example the WebAuthn PRF extension on a platform authenticator) to derive or wrap the symmetric key.
- Encrypt each document+embedding blob with AES-GCM and store the ciphertext on the relay.
- Store metadata/plain indices only locally — the server only sees opaque blob IDs and timestamps.
Example: encrypt a blob with WebCrypto
async function deriveKeyFromPassphrase(pass, saltBytes) {
  // saltBytes should be random per user (crypto.getRandomValues) and stored alongside the account,
  // never a hardcoded string.
  const enc = new TextEncoder();
  const base = await crypto.subtle.importKey('raw', enc.encode(pass), 'PBKDF2', false, ['deriveKey']);
  return crypto.subtle.deriveKey({
    name: 'PBKDF2', salt: saltBytes, iterations: 600_000, hash: 'SHA-256' // OWASP-recommended minimum for PBKDF2-SHA256
  }, base, { name: 'AES-GCM', length: 256 }, false, ['encrypt', 'decrypt']);
}

async function encryptBlob(key, data) {
  const iv = crypto.getRandomValues(new Uint8Array(12)); // fresh IV per blob; never reuse with the same key
  const ct = await crypto.subtle.encrypt({ name: 'AES-GCM', iv }, key, data);
  return { iv: Array.from(iv), ct: new Uint8Array(ct) };
}
For multi-device access, wrap the symmetric key for each device and store only the wrapped keys on the server. WebAuthn credentials are signing keys, so they cannot encrypt directly; either derive a wrapping key via the WebAuthn PRF extension or generate a separate per-device WebCrypto keypair for wrapping. Either way the server never sees plaintext.
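A minimal sketch of the per-device keypair option, using WebCrypto RSA-OAEP to wrap a randomly generated data key (dataKey, deviceKeys, and the helper names are illustrative):
// Generate a random data-encryption key (extractable so it can be wrapped for other devices).
const dataKey = await crypto.subtle.generateKey(
  { name: 'AES-GCM', length: 256 }, true, ['encrypt', 'decrypt']
);

// Each device generates its own wrapping keypair once and publishes the public key.
const deviceKeys = await crypto.subtle.generateKey(
  { name: 'RSA-OAEP', modulusLength: 2048, publicExponent: new Uint8Array([1, 0, 1]), hash: 'SHA-256' },
  false, ['wrapKey', 'unwrapKey']
);

// Wrap the data key for a given device; the relay stores only this opaque wrapped blob.
async function wrapForDevice(devicePublicKey) {
  return crypto.subtle.wrapKey('raw', dataKey, devicePublicKey, { name: 'RSA-OAEP' });
}

// On the receiving device, unwrap with its private key to recover the same AES-GCM data key.
async function unwrapOnDevice(wrapped, devicePrivateKey) {
  return crypto.subtle.unwrapKey(
    'raw', wrapped, devicePrivateKey, { name: 'RSA-OAEP' },
    { name: 'AES-GCM', length: 256 }, false, ['encrypt', 'decrypt']
  );
}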
6) Offline-first UX and background sync
Design patterns:
- Index and embed on content ingestion (background task). Use Web Workers to avoid blocking UI.
- Use service workers for network sync attempts; if online, upload encrypted blobs, otherwise queue them in IndexedDB (a minimal queue sketch follows this list).
- Show clear indicators: model loaded, embeddings current, sync status (queued / synced).
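A minimal sketch of the queue-and-flush pattern, assuming an 'outbox' object store in IndexedDB and a hypothetical relay endpoint at /blobs/:id:
// Queue an encrypted blob locally; flush whenever connectivity returns.
async function queueForSync(id, encryptedBlob) {
  const db = await getDb();
  await db.put('outbox', { id, blob: encryptedBlob, queuedAt: Date.now() });
  if (navigator.onLine) await flushOutbox();
}

async function flushOutbox() {
  const db = await getDb();
  const pending = await db.getAll('outbox');
  for (const item of pending) {
    try {
      // Prepend the IV so other devices can decrypt; the relay only ever sees ciphertext and an opaque id.
      await fetch(`/blobs/${encodeURIComponent(item.id)}`, {
        method: 'PUT',
        headers: { 'Content-Type': 'application/octet-stream' },
        body: new Blob([new Uint8Array(item.blob.iv), item.blob.ct])
      });
      await db.delete('outbox', item.id);
    } catch {
      break; // still offline or relay unreachable; retry on the next flush
    }
  }
}

window.addEventListener('online', flushOutbox);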
Performance, storage, and battery tradeoffs
Practical constraints:
- Model size vs accuracy: smaller models use less RAM and battery but may hallucinate more. Use hybrid: small local model + optional cloud-only heavy model (opt-in) for difficult tasks.
- Embedding dimension: 512–1024 dims is a good middle ground. Quantize if storage is tight.
- Indexing cost: building HNSW can be CPU-heavy; build incrementally and offload to a Web Worker, as sketched below.
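A rough sketch of the worker hand-off; the file path and message shapes are illustrative, and the worker is assumed to load its own embedder plus the storeEmbedding/addToIndex helpers from earlier steps:
// main thread: hand indexing work to a dedicated worker so the UI stays responsive
const indexWorker = new Worker('/workers/indexer.js', { type: 'module' });

function indexDocument(id, text) {
  indexWorker.postMessage({ type: 'index', id, text });
}

indexWorker.onmessage = (e) => {
  if (e.data.type === 'indexed') {
    // mark document e.data.id as searchable in the UI
  }
};

// /workers/indexer.js: runs off the main thread with its own embedder instance
self.onmessage = async (e) => {
  if (e.data.type !== 'index') return;
  const vec = await runtime.embed(e.data.text); // runtime: the worker's own embedder
  await storeEmbedding(e.data.id, vec);         // IndexedDB is accessible from workers
  addToIndex(labelFor(e.data.id), vec);         // update the in-worker vector index
  self.postMessage({ type: 'indexed', id: e.data.id });
};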
Advanced strategies & tricks
Client-side model cascade
Run a fast, tiny model first to handle most queries. If it fails or is unsure, escalate to a larger local model (if available). This reduces latency and battery use.
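A sketch of the cascade, assuming two loaded runtimes (tinyRuntime, largeRuntime) and that generate() resolves to a string; the escalation check is a deliberately crude heuristic, not a standard runtime API:
// Try the tiny model first; escalate to the larger local model only when the answer looks weak.
async function cascadeGenerate(prompt) {
  const draft = await tinyRuntime.generate(prompt, { maxTokens: 128 });

  // Crude escalation heuristic: empty, very short, or explicitly uncertain answers get retried.
  const weak = !draft || draft.trim().length < 20 || /i (don't|do not) know/i.test(draft);
  if (!weak || !largeRuntime) return draft;

  return largeRuntime.generate(prompt, { maxTokens: 256 });
}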
Hybrid embeddings
Keep two embedding types: cheap semantic embeddings for most search tasks and expensive embeddings (higher-dim) for specialized corpora. Use the cheap embedding to filter candidates and expensive to re-rank.
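A sketch of filter-then-rerank, assuming a cheapEmbed/expensiveEmbed pair, the search and cosine helpers from step 3 running over the cheap index, and a getExpensiveVector cache (all illustrative names):
// Retrieve broadly with cheap embeddings, then re-rank the short list with higher-dimensional ones.
async function hybridSearch(query, topK = 10) {
  const cheapQuery = await cheapEmbed(query);
  const candidates = await search(cheapQuery, 50); // wide net over the cheap index

  const richQuery = await expensiveEmbed(query);
  const reranked = await Promise.all(candidates.map(async (c) => {
    const richVec = await getExpensiveVector(c.id); // precomputed and cached for specialized corpora
    return { id: c.id, score: cosine(richQuery, richVec) };
  }));

  reranked.sort((a, b) => b.score - a.score);
  return reranked.slice(0, topK);
}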
Progressive quantization
Compress older, rarely-accessed vectors more aggressively. Keep recent vectors high-fidelity. This reduces storage while maintaining relevance.
Sealed-key sync using WebAuthn
Use WebAuthn to store device public keys and seal a symmetric key per user. When adding devices, use a short QR/OTP flow to grant the new device access to the encrypted key — avoids exposing the passphrase.
Security checklist
- Use AES-GCM (or XChaCha20-Poly1305 via libsodium) for blob encryption.
- Derive keys using Argon2 or PBKDF2 with high iteration counts; prefer platform-backed keys (WebAuthn) when possible.
- Never log plaintext data in analytics.
- Design your relay to be zero-knowledge: it stores only ciphertexts and metadata like last-modified timestamps.
2026 Trends & future predictions
Expect these developments to further empower local mobile assistants:
- Better mobile WebNN and WebGPU drivers: optimized GPU kernels reduce inference time on phones.
- Standardized client-side model packaging: easier distribution of GGUF/ONNX/flatbuffers for browser runtimes.
- More compact open models: sub-4B models with near-LLM performance for embedding and retrieval.
- Privacy regulations: stronger legal demand for data-minimizing local AI will accelerate adoption of zero-knowledge sync patterns.
Full example: minimal flow
End-to-end flow summary you can prototype today:
- Load a wasm embedder at app start (lazy load to save RAM).
- User highlights text: compute embedding -> quantize -> store encrypted blob in IndexedDB and add to HNSW index (Web Worker).
- User queries: compute query embedding -> search local index -> assemble context -> run local LLM -> show answer.
- On network: upload encrypted blobs to your relay; store wrapped symmetric keys for each device public key.
Common pitfalls and how to avoid them
- Model memory spikes: lazy-load models and release memory when idle; use streaming inference if supported.
- Slow cold starts: persist a tiny cached model or warm the model during idle time.
- Sync conflicts: use vector clocks (version vectors) and CRDT-like merge rules for local metadata.
Actionable takeaways
- Start with a separate local embedder — it’s cheaper than running a full LLM for every query.
- Use IndexedDB + Web Worker + HNSW (or brute-force for small sets) to keep searches snappy on-device.
- Encrypt everything before sync. Use WebAuthn to avoid password copying and to bind devices securely.
- Profile on target devices (mid-range Android, iPhone) — CPU, RAM, and battery behavior vary widely.
Where to prototype & test
Tools and libs to explore in 2026:
- llama.cpp WASM / WebLLM for browser inference.
- hnswlib-wasm or lightweight JS HNSW implementations for vector search.
- idb (IndexedDB helper), WebCrypto, and WebAuthn for storage and keys.
- Puma Browser as a test platform for mobile-first local-AI behavior.
Final notes — why build this now
Privacy-first, local mobile assistants are no longer a niche. In 2026 the stack matured: low-level browser APIs, quantized open models, and portable runtimes make it practical to ship user-first experiences that keep data on-device. The approach in this guide balances privacy, performance, and usability, letting you ship a real product without depending on third-party inference APIs.
Call to action
Ready to prototype? Start by packaging a compact embedder for browser use, wire up IndexedDB + a small HNSW index, and experiment with AES-GCM encrypted blob sync to a relay server. Build a minimal MVP in a week and iterate on model size and sync UX. Share your progress, benchmarks, and device profiles — privacy-first mobile assistants will improve fastest when engineers publish real-world measurements.
Related Reading
- How to Design Cache Policies for On-Device AI Retrieval (2026 Guide)
- Legal & Privacy Implications for Cloud Caching in 2026: A Practical Guide
- Integrating On-Device AI with Cloud Analytics: Feeding ClickHouse from Raspberry Pi Micro Apps
- Observability for Edge AI Agents in 2026: Queryable Models, Metadata Protection and Compliance-First Patterns