Unyap - On-Device Voice Note Transcription

Some people send voice notes. Some people would rather read them. I am firmly in the second camp, and “let me find somewhere quiet to listen to a 90-second WhatsApp ramble” is not a thing I enjoy doing. The obvious fix is transcription, but every easy option ships your audio off to someone else’s server. A private voice note from a friend is exactly the kind of thing I do not want to hand to a cloud API. So I built Unyap: an Android app that transcribes voice notes to text entirely on the phone, using whisper.cpp. No audio and no text ever leaves the device.

Screenshots: words streaming in as they decode; the finished transcript, ready to copy or share; the one-time model download on first launch.

How it works

You share a voice note to Unyap from WhatsApp (or any app, or pick a file in-app), the words stream in as they are transcribed, and then you select, copy, or share the result. That is the whole product. The interesting part is everything it does not do: there is no account, no upload, no “processing on our servers”. The app requests a single permission, INTERNET, and uses it exactly once - to download the Whisper model on first launch. After that it never touches the network again.

flowchart LR
  Share[Shared voice note] --> Dec[MediaCodec decode]
  Dec --> PCM[16 kHz mono PCM]
  PCM --> W[whisper.cpp on CPU]
  W -- token by token --> UI[Streaming transcript]

The model is Whisper Small, quantized to q8_0 - about 190 MB, downloaded once from Hugging Face and cached in app storage. It runs on the CPU through a small JNI bridge, with flash_attn on and the GPU explicitly disabled (on phones the CPU path is both faster to spin up and far less of a compatibility minefield than trying to get a GPU backend working across a thousand SoCs).
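The "downloaded once and cached" behaviour can be sketched roughly like this. It is a minimal illustration, not the app's actual code: `ensureModel`, the `.part` temp-file convention, and the `fetch` callback (standing in for whatever performs the HTTP download) are all assumptions.

```kotlin
import java.io.File
import java.io.OutputStream

// Hypothetical helper: returns the cached model file, fetching it only if it
// is missing. `fetch` stands in for the real download and is an assumption.
fun ensureModel(dir: File, name: String, fetch: (OutputStream) -> Unit): File {
    val model = File(dir, name)
    if (model.isFile && model.length() > 0) return model  // cached: no network needed

    dir.mkdirs()
    // Write to a temp file first so an interrupted download is never
    // mistaken for a complete model on the next launch.
    val tmp = File(dir, "$name.part")
    tmp.outputStream().use { out -> fetch(out) }
    check(tmp.renameTo(model)) { "could not move ${tmp.name} into place" }
    return model
}
```

Calling it a second time with the same directory returns the existing file without invoking `fetch`, which is the property that lets the app stay offline after first launch.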

Streaming words as they decode

This is my favourite detail. whisper.cpp’s normal API hands you text one segment at a time - roughly a sentence - which means you stare at a spinner and then a whole line appears at once. I wanted the word-by-word teletype feel instead, where the transcript visibly grows as the model thinks.

There is no official “give me each token” callback, but there is a logits_filter_callback - a hook meant for biasing the logits during sampling. It fires on every decoding step, and the most recently sampled token is sitting right there in the token buffer. So I repurpose it: grab the last token, skip the special ones (anything at or above the end-of-text token), turn the rest into a string, and push it straight across JNI into Kotlin.

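if (n_tokens <= 0) return;             // guard: the buffer can be empty before anything is sampled

whisper_token eot = whisper_token_eot(ctx);
const whisper_token_data &last = tokens[n_tokens - 1];
if (last.id >= eot) return;            // skip special tokens (EOT, timestamps, ...)

// The token text is a raw piece, not necessarily a whole word, and may be empty.
const char *text = whisper_token_to_str(ctx, last.id);
if (text && text[0] != '\0') {
    jstring jtext = env->NewStringUTF(text);
    env->CallVoidMethod(cb->listener, cb->onSegment, jtext);  // -> Kotlin
    env->DeleteLocalRef(jtext);
}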

On the Kotlin side the transcription runs on a background thread and the tokens come out as a Flow<StreamEvent> built with callbackFlow, buffered so the decoder never blocks on the UI. Compose just collects the flow and appends. The decode finishes with a Done event carrying the full text, the elapsed time, and the auto-detected language - language detection is left on (language = nullptr), so German and English voice notes both just work, along with the 90-odd other languages Whisper knows.
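Here is a rough sketch of what that Kotlin side could look like. Only `Flow<StreamEvent>`, `callbackFlow`, and the `Done` event come from the description above; the `Decoder` interface (a stand-in for the JNI bridge), the `Token` event, the field names, and the unlimited buffer size are assumptions made for illustration. It needs the kotlinx.coroutines library.

```kotlin
import kotlin.concurrent.thread
import kotlinx.coroutines.channels.Channel
import kotlinx.coroutines.channels.awaitClose
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.buffer
import kotlinx.coroutines.flow.callbackFlow

sealed interface StreamEvent {
    data class Token(val text: String) : StreamEvent
    data class Done(val fullText: String, val elapsedMs: Long, val language: String) : StreamEvent
}

// Hypothetical stand-in for the JNI bridge: calls onToken for each decoded
// token, blocks until decoding finishes, and returns the detected language.
fun interface Decoder {
    fun decode(pcm: FloatArray, onToken: (String) -> Unit): String
}

fun transcribe(decoder: Decoder, pcm: FloatArray): Flow<StreamEvent> = callbackFlow {
    // Run the blocking decode on its own thread, off the collector's thread.
    thread(name = "whisper-decode") {
        val fullText = StringBuilder()
        val start = System.nanoTime()
        try {
            val language = decoder.decode(pcm) { token ->
                fullText.append(token)
                trySend(StreamEvent.Token(token))  // non-suspending, safe from any thread
            }
            val elapsedMs = (System.nanoTime() - start) / 1_000_000
            trySend(StreamEvent.Done(fullText.toString(), elapsedMs, language))
            close()
        } catch (t: Throwable) {
            close(t)  // surface decoder failures to the collector
        }
    }
    // This sketch has no way to abort a running decode, so there is nothing
    // to clean up; awaitClose is still required to keep the flow open.
    awaitClose { }
}.buffer(Channel.UNLIMITED)  // the decoder never waits on a slow collector
```

Because `trySend` does not suspend and the buffer is unbounded, the decoder thread is never held up by the UI; the trade-off is that a stalled collector lets events pile up in memory, which is harmless for a transcript of a few hundred tokens.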

Decoding the audio first

Whisper wants 16 kHz mono float PCM, and a WhatsApp voice note is none of those things - it is usually Opus in an OGG container. Rather than pull in a heavy audio library, I let Android do it: MediaExtractor + MediaCodec decode whatever audio/* you throw at it, and I downmix and resample to 16 kHz on the way out.
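The downmix and resample step, once MediaCodec has produced interleaved 16-bit PCM, can be sketched as below. The function names are hypothetical, and linear interpolation is an assumption chosen for brevity; the post does not say which resampler the app uses, and a production resampler would low-pass filter before downsampling to limit aliasing.

```kotlin
/** Averages interleaved channels (L R L R ...) down to a single mono channel. */
fun downmixToMono(interleaved: ShortArray, channels: Int): ShortArray {
    require(channels >= 1) { "channels must be positive" }
    val frames = interleaved.size / channels
    val mono = ShortArray(frames)
    for (f in 0 until frames) {
        var sum = 0
        for (c in 0 until channels) sum += interleaved[f * channels + c]
        mono[f] = (sum / channels).toShort()
    }
    return mono
}

/** Linear-interpolation resample to float PCM scaled into [-1, 1). */
fun resampleToFloat(mono: ShortArray, srcRate: Int, dstRate: Int = 16_000): FloatArray {
    require(srcRate > 0 && dstRate > 0) { "sample rates must be positive" }
    if (mono.isEmpty()) return FloatArray(0)
    val outLen = (mono.size.toLong() * dstRate / srcRate).toInt()
    val step = srcRate.toDouble() / dstRate   // source samples per output sample
    val out = FloatArray(outLen)
    for (i in 0 until outLen) {
        val pos = i * step
        val i0 = pos.toInt().coerceAtMost(mono.size - 1)
        val i1 = (i0 + 1).coerceAtMost(mono.size - 1)  // clamp at the final sample
        val frac = (pos - i0).toFloat()
        val a = mono[i0] / 32768f
        val b = mono[i1] / 32768f
        out[i] = a + (b - a) * frac
    }
    return out
}
```

A 48 kHz stereo decode would go through `downmixToMono(pcm, 2)` and then `resampleToFloat(mono, 48_000)`, ending up as one float per 1/16000 s, which is the shape Whisper expects.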

The one trap worth mentioning is memory. A long recording is a lot of samples, and the obvious MutableList<Short> boxes every single one into a heap object - roughly 16 bytes to store 2 bytes of audio, which OOM-crashes the app on anything lengthy. The fix is unglamorous but necessary: a plain growable ShortArray that doubles when full, keeping it at 2 bytes per sample.

// A MutableList<Short> boxes every sample (~16 bytes each) and
// OOM-crashes on long recordings; a raw ShortArray stays at 2 bytes.
var pcm = ShortArray(1 shl 18)   // ~256K samples to start
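The doubling buffer described above can be filled out along these lines. This is a sketch: the `PcmBuffer` class and its method names are hypothetical, and only the starting capacity and the double-when-full rule come from the text.

```kotlin
/** Hypothetical growable PCM buffer: primitive storage, capacity doubles when full. */
class PcmBuffer(initialCapacity: Int = 1 shl 18) {
    private var data = ShortArray(initialCapacity)

    /** Number of valid samples written so far. */
    var size = 0
        private set

    /** Appends the first `count` samples of `chunk`, growing the array if needed. */
    fun append(chunk: ShortArray, count: Int = chunk.size) {
        val needed = size + count
        if (needed > data.size) {
            var capacity = maxOf(data.size, 1)
            while (capacity < needed) capacity *= 2   // amortised O(1) per sample
            data = data.copyOf(capacity)
        }
        chunk.copyInto(data, destinationOffset = size, startIndex = 0, endIndex = count)
        size = needed
    }

    /** Returns a copy trimmed to the samples actually written. */
    fun toShortArray(): ShortArray = data.copyOf(size)
}
```

Each decoded MediaCodec output chunk would be passed to `append`, and `toShortArray()` is called once at the end. For scale, one minute of 16 kHz mono audio is 960,000 samples, so just under 2 MB in a `ShortArray`.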

Not crashing on older phones

This is the part I expected to be trivial and absolutely was not. whisper.cpp is fast on ARM because it leans on modern instruction-set extensions - dotprod, fp16 arithmetic, i8mm. If you build with a high -march to get those, the resulting binary contains instructions that simply do not exist on older CPUs, and the app dies with a SIGILL (illegal instruction) the moment it hits one. Build at the safe baseline instead and you leave a big chunk of performance on the table for everyone on a recent phone.

The clean way out is to not choose at build time at all. I compile the JNI core at the plain arm64-v8a baseline (ARMv8.0-A), and let ggml build every CPU variant from armv8.0 up to armv9.2 as separate, dlopen-able backend libraries:

set(GGML_BACKEND_DL       ON  CACHE BOOL "" FORCE)  # each variant a loadable .so
set(GGML_CPU_ALL_VARIANTS ON  CACHE BOOL "" FORCE)  # armv8.0 .. armv9.2
set(GGML_NATIVE           OFF CACHE BOOL "" FORCE)  # never -march=native

At startup the app loads all the variants and ggml scores them against what the CPU actually advertises through getauxval(AT_HWCAP) and getauxval(AT_HWCAP2), then picks the best one the device genuinely supports. A Pixel 8 gets the i8mm kernels; a five-year-old phone quietly falls back to the baseline and still runs. Same APK, no SIGILL, no feature detection code of my own.
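The selection rule is easy to picture with a toy model. The sketch below is not ggml's code: the variant names, feature strings, and scoring formula are illustrative assumptions, and only the general shape (a variant that needs a missing feature is ruled out, and the most capable remaining one wins) follows the description above.

```kotlin
// Toy model only. Variant names and feature sets are illustrative assumptions.
data class CpuVariant(val name: String, val requires: Set<String>)

/** Score 0 means "cannot run on this CPU"; otherwise more required features scores higher. */
fun score(variant: CpuVariant, advertised: Set<String>): Int =
    if (advertised.containsAll(variant.requires)) 1 + variant.requires.size else 0

/** Picks the highest-scoring variant that the CPU can actually execute. */
fun pickVariant(variants: List<CpuVariant>, advertised: Set<String>): CpuVariant? =
    variants
        .filter { score(it, advertised) > 0 }
        .maxByOrNull { score(it, advertised) }
```

With a baseline variant that requires nothing, `pickVariant` always has a fallback, which is why the old phone still runs instead of hitting an illegal instruction.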

There was also a Play Store hoop: apps targeting SDK 35+ must support 16 KB memory pages, so the native libraries need their ELF load segments aligned to 16 KB. That is a couple of linker flags set before any target is defined, so every library - mine and all the ggml variants - inherits the alignment. With -O3, LTO, dead-code stripping and hidden visibility on top, the native side stays lean.

The result

On a Pixel 8 it runs at roughly real time - about 15 seconds of audio transcribed in about 15 seconds - which is plenty for the voice-note use case, and it gets there without a single packet leaving the phone. The whole thing is Kotlin + Jetpack Compose over a thin JNI layer, targeting arm64-v8a, min SDK 29.

Unyap is on Google Play, and the signed APK is on the GitHub releases page if you would rather sideload it. It is open source under MIT, with whisper.cpp bundled as a submodule. Code is on GitHub.