I built a small teleprompter that scrolls based on your voice instead of a timer or continuous scroll. You just paste a script, press "start speaking", and it highlights the current word as you speak. If you pause, it waits; if you skip lines, it finds its place again.

The interesting part was getting the whole pipeline to run entirely in the browser:

Mic → VAD → Moonshine ONNX ASR → fuzzy script matching → scroll

Speech recognition runs locally using WebGPU or WASM, and the page works offline after the first load (the model is cached).

The tricky part turned out to be tracking the current spoken word in the script given messy ASR output (the current Moonshine model produces a lot of noise when used in this latency-sensitive way). The matcher uses token indexing, banded Levenshtein distance, and phonetic normalization to stay aligned; rough sketches of the ASR step and the matcher are at the end of this post.

Repo: [https://github.com/larsbaunwall/promptme-ai](https://github.com/larsbaunwall/promptme-ai)

I'm looking forward to trying the new Moonshine streaming variants when they're available in Transformers.js (in the browser). That should stabilize inference speed and transcription quality.

What I found especially interesting was designing a responsive experience despite the low-quality transcription I could get, which meant lots of speculative prediction.

Curious to hear if anyone has tried something similar?
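For context, loading Moonshine in the browser via Transformers.js looks roughly like this. This is a hedged sketch, not the repo's actual code: the model id, `dtype`, and device option are my assumptions, and Transformers.js won't silently fall back from WebGPU to WASM, so a real app would pick the device itself.

```typescript
import { pipeline } from "@huggingface/transformers";

// Load Moonshine once; Transformers.js caches the downloaded ONNX weights,
// which is what makes the page work offline after the first load.
const asr = await pipeline(
  "automatic-speech-recognition",
  "onnx-community/moonshine-tiny-ONNX", // assumed model id
  { device: "webgpu", dtype: "q8" }     // assumed options; use "wasm" if WebGPU is unavailable
);

// Transcribe one VAD-gated chunk of 16 kHz mono audio.
async function transcribe(chunk: Float32Array): Promise<string> {
  const out = await asr(chunk);
  return (out as { text: string }).text;
}
```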
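And here's a minimal sketch of the matching idea: normalize tokens phonetically, then search a small band of script positions around the last match using Levenshtein distance. This illustrates the technique rather than reproducing the repo's code; the normalization rules, band size, and rejection threshold are made up.

```typescript
// Crude phonetic normalization: lowercase, strip punctuation, collapse a few
// common ASR confusions. A real matcher would use something like Metaphone.
function normalize(token: string): string {
  return token
    .toLowerCase()
    .replace(/[^a-z0-9]/g, "")
    .replace(/ph/g, "f")
    .replace(/(.)\1+/g, "$1"); // collapse doubled letters
}

// Levenshtein distance between two normalized tokens (1-row DP).
function levenshtein(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) => i);
  for (let j = 1; j <= b.length; j++) {
    let prev = dp[0]; // dist(a[0..i-1], b[0..j-1]) from the previous row
    dp[0] = j;
    for (let i = 1; i <= a.length; i++) {
      const tmp = dp[i];
      dp[i] = Math.min(
        dp[i] + 1,     // edit against b[0..j-1]
        dp[i - 1] + 1, // edit against a[0..i-1]
        prev + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution or match
      );
      prev = tmp;
    }
  }
  return dp[a.length];
}

// Find the best match for a heard token inside a band around the current
// script position, so a skipped line can still be re-acquired.
function matchInBand(
  script: string[], // pre-normalized script tokens
  heard: string,    // one token from the ASR output
  cursor: number,   // index of the last matched script token
  band = 12         // how far ahead to search
): number {
  const h = normalize(heard);
  let best = -1;
  let bestDist = Infinity;
  const start = Math.max(0, cursor - 2);
  const end = Math.min(script.length, cursor + band);
  for (let i = start; i < end; i++) {
    const d = levenshtein(script[i], h);
    // Reject candidates that differ in more than half their characters.
    if (d < bestDist && d <= Math.floor(script[i].length / 2)) {
      bestDist = d;
      best = i;
    }
  }
  return best; // -1 means "no confident match, keep waiting"
}
```

The band is what keeps this cheap enough to run on every ASR chunk: instead of aligning against the whole script, you only compare against a dozen or so tokens near the cursor, and fall back to a wider search only when the band comes up empty.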