I built a proof-of-concept for running RAG (Retrieval-Augmented Generation) entirely in the browser using WebGPU.

You can chat with PDF documents using models like Phi-3, Llama 3, or Mistral 7B - all running locally with zero backend. Documents never leave your device.

Tech stack:
- WebLLM + WeInfer (an optimized fork with a ~3.76x speedup)
- Transformers.js for embeddings (all-MiniLM-L6-v2) - embedding sketch below
- IndexedDB as the vector store
- PDF.js for parsing
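
For anyone curious how the embedding side fits together, here's a minimal sketch of generating chunk embeddings with the Transformers.js feature-extraction pipeline and all-MiniLM-L6-v2. The function name and caching scheme are illustrative, not the repo's actual code:

```ts
import { pipeline } from "@xenova/transformers";

// Cache the pipeline so the model weights are only fetched on first use
// (the browser cache serves them after that).
let embedder: Awaited<ReturnType<typeof pipeline>> | null = null;

export async function embed(text: string): Promise<Float32Array> {
  embedder ??= await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2");
  // Mean-pooled and L2-normalized, so cosine similarity at query time
  // reduces to a dot product. all-MiniLM-L6-v2 yields 384-dim vectors.
  const output = await embedder(text, { pooling: "mean", normalize: true });
  return output.data as Float32Array;
}
```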

The main challenges were:
1. Getting esbuild to bundle without choking on onnxruntime-node
2. Managing COOP/COEP headers for SharedArrayBuffer (example below)
3. Keeping the bundle reasonable (Angular + models = ~11 MB base)
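
For context on challenge 2: SharedArrayBuffer is only available on cross-origin-isolated pages, which requires two response headers on every document the app serves. A minimal sketch of a static server that sets them, not the project's actual hosting setup (the dist path and port are assumptions):

```ts
import { createServer } from "node:http";
import { readFile } from "node:fs/promises";
import { extname, join } from "node:path";

const MIME: Record<string, string> = {
  ".html": "text/html",
  ".js": "text/javascript",
  ".wasm": "application/wasm",
};

createServer(async (req, res) => {
  // Naive static file lookup; no path sanitization, sketch only.
  const path = join("dist", req.url === "/" ? "index.html" : req.url ?? "");
  try {
    const body = await readFile(path);
    res.writeHead(200, {
      "Content-Type": MIME[extname(path)] ?? "application/octet-stream",
      // The two headers that turn on cross-origin isolation,
      // which SharedArrayBuffer requires:
      "Cross-Origin-Opener-Policy": "same-origin",
      "Cross-Origin-Embedder-Policy": "require-corp",
    });
    res.end(body);
  } catch {
    res.writeHead(404).end("not found");
  }
}).listen(4200);
```

On Vercel the same two headers can instead be declared in vercel.json rather than a custom server.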

Performance is surprisingly decent on modern hardware:
- Phi-3 Mini: 3-6 tokens/sec (WebLLM) → 12-20 tokens/sec (WeInfer)
- Llama 3.2 1B: 8-12 tokens/sec

Demo: https://webpizza-ai-poc.vercel.app/
Code: https://github.com/stramanu/webpizza-ai-poc

This is experimental - I'm sure there are better ways to do this. Would appreciate feedback, especially on:
- Bundle optimization strategies
- Better vector search algorithms for IndexedDB (baseline sketch below)
- Memory management for large documents
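
To make the vector-search question concrete: the obvious baseline is a brute-force cosine scan over everything in IndexedDB, which works for a few thousand chunks but won't scale. A rough sketch under assumed names (the "rag-store" database, "chunks" store, and record shape are not the repo's actual schema):

```ts
interface ChunkRecord {
  id: number;
  text: string;
  embedding: number[]; // unit-length vector from the embedder
}

// Open (or create) the database; names here are illustrative.
function openDb(): Promise<IDBDatabase> {
  return new Promise((resolve, reject) => {
    const req = indexedDB.open("rag-store", 1);
    req.onupgradeneeded = () =>
      req.result.createObjectStore("chunks", { keyPath: "id" });
    req.onsuccess = () => resolve(req.result);
    req.onerror = () => reject(req.error);
  });
}

// Embeddings are L2-normalized, so the dot product equals cosine similarity.
const dot = (a: number[], b: Float32Array) =>
  a.reduce((sum, v, i) => sum + v * b[i], 0);

// Brute-force top-k: load every chunk and sort by similarity to the query.
export async function topK(query: Float32Array, k = 4): Promise<ChunkRecord[]> {
  const db = await openDb();
  const chunks = await new Promise<ChunkRecord[]>((resolve, reject) => {
    const req = db
      .transaction("chunks", "readonly")
      .objectStore("chunks")
      .getAll();
    req.onsuccess = () => resolve(req.result as ChunkRecord[]);
    req.onerror = () => reject(req.error);
  });
  return chunks
    .map((c) => ({ score: dot(c.embedding, query), c }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((s) => s.c);
}
```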

Happy to answer questions!
