How do you use this tool?
- Click the upload area or drag and drop an audio file (MP3, WAV, M4A, OGG, WebM).
- Pick a quality tier: Fast (~152 MB, mobile-friendly), Accurate (~291 MB, default) or Precise (~968 MB, desktop).
- Force a language if auto-detect struggles with short clips or strong accents — otherwise leave it on auto.
- Click Transcribe. The model downloads into your browser cache on first use; after that, transcription works offline for as long as the cached model stays in place.
- Copy the transcript or download it as TXT (plain text) or SRT (with timestamps for subtitles).
What does this tool do?
This tool converts spoken audio into a plain-text transcript — without uploading anything. It uses a compact speech-recognition model compiled to WebAssembly, running directly inside your browser tab. You get the full transcript in a scrollable, editable panel that you can copy or download as TXT or SRT.
Supported input formats include MP3, WAV, M4A (AAC), OGG Vorbis, and WebM Opus — the most common formats produced by phones, voice recorders, video editors, and meeting apps.
How Does It Work?
The pipeline runs in two stages, both on your device:
- Decode and normalize. The Web Audio API decodes your file and resamples it to 16 kHz mono, the input format speech-recognition models expect. Stereo channels are averaged to a single mono signal.
- Inference. A compact transformer model converts each 30-second window of audio into text tokens, then merges the windows into a continuous transcript with timestamps. Everything runs inside your browser tab: no cloud API, no third-party service.
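To make the preprocessing concrete, here is a minimal TypeScript sketch of the steps described above: averaging stereo to mono, resampling to 16 kHz, and cutting the signal into 30-second windows. It is an illustration under stated assumptions, not the tool's actual code. The function names are hypothetical, the resampler uses naive linear interpolation (the tool itself relies on the Web Audio API for this), and the windows here do not overlap, which may differ from how the real pipeline merges them.

```typescript
// Illustrative sketch only; names and the resampling method are assumptions.
const TARGET_RATE = 16000;  // 16 kHz mono, as stated in the text
const WINDOW_SECONDS = 30;  // 30-second inference windows

/** Average the left and right channels into a single mono signal. */
function downmixToMono(left: Float32Array, right: Float32Array): Float32Array {
  const n = Math.min(left.length, right.length);
  const mono = new Float32Array(n);
  for (let i = 0; i < n; i++) {
    mono[i] = (left[i] + right[i]) / 2;
  }
  return mono;
}

/** Naive linear-interpolation resampler (a production resampler also low-pass filters). */
function resampleLinear(
  input: Float32Array,
  fromRate: number,
  toRate: number = TARGET_RATE
): Float32Array {
  if (fromRate === toRate) return input.slice();
  const outLength = Math.floor((input.length * toRate) / fromRate);
  const out = new Float32Array(outLength);
  const step = fromRate / toRate;
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, input.length - 1);
    const frac = pos - i0;
    out[i] = input[i0] * (1 - frac) + input[i1] * frac;
  }
  return out;
}

/** Split a mono signal into consecutive, non-overlapping windows of at most 30 seconds. */
function splitIntoWindows(
  samples: Float32Array,
  sampleRate: number = TARGET_RATE,
  windowSeconds: number = WINDOW_SECONDS
): Float32Array[] {
  const size = sampleRate * windowSeconds;
  const windows: Float32Array[] = [];
  for (let start = 0; start < samples.length; start += size) {
    windows.push(samples.subarray(start, Math.min(start + size, samples.length)));
  }
  return windows;
}
```

With these helpers, a 70-second clip at 16 kHz yields three windows: two full 30-second windows and one 10-second remainder.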
The model is downloaded once on first use and cached in your browser. After that, transcription works fully offline.
Three Quality Tiers — Which to Pick?
The choice trades download size and speed against recognition accuracy:
| Tier | Model size | Speed | Best for |
|---|---|---|---|
| Fast | ~152 MB | very fast | Mobile, short voice memos, quick notes |
| Accurate | ~291 MB | balanced | Default for meetings, interviews, podcasts |
| Precise | ~968 MB | slower | Studio recordings, lectures, accented speech |
Pick a tier in the model selector below the upload area. Each tier is cached separately, so you can switch back and forth.
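If you want to reason about the trade-off programmatically, the table can be expressed as data. The sketch below is hypothetical: the sizes come from the table above, while the `pickTier` helper and the idea of choosing by download budget are assumptions made for illustration, not a feature of the tool.

```typescript
// Hypothetical helper; only the tier names and sizes come from the table.
type Tier = { name: "Fast" | "Accurate" | "Precise"; approxSizeMB: number };

const TIERS: Tier[] = [
  { name: "Fast", approxSizeMB: 152 },
  { name: "Accurate", approxSizeMB: 291 },
  { name: "Precise", approxSizeMB: 968 },
];

/** Pick the largest tier whose download fits the budget; fall back to the smallest tier. */
function pickTier(maxDownloadMB: number): Tier {
  let choice = TIERS[0];
  for (const tier of TIERS) {
    if (tier.approxSizeMB <= maxDownloadMB) choice = tier;
  }
  return choice;
}
```

For example, a 300 MB budget selects Accurate, while anything below 152 MB still falls back to Fast.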
How does language detection work?
The model auto-detects the spoken language from the first 30 seconds of audio. If detection misfires — common with short clips or heavy accents — use the language dropdown to force a specific language before transcribing.
| Setting | When to use |
|---|---|
| Auto-detect | Monolingual recordings ≥ 30 seconds |
| Force language | Short clips, strong regional accents |
| English | Podcasts, meetings, dictation |
| German | German-language interviews, lectures |
| French/Spanish | Native-speaker recordings |
TXT or SRT — Which Export?
The download dialog offers two formats:
- TXT: plain running text, one paragraph. Best for meeting minutes, blog drafts, research notes.
- SRT: SubRip subtitle format with start/end timestamps per block (00:01:23,456 --> 00:01:28,910). Imports directly into YouTube, Premiere Pro, DaVinci Resolve, CapCut, VLC, and most editors that handle captions.
For social-video subtitles, download SRT and import it into your editor — font, size and position are rendered by the player.
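For readers who post-process transcripts themselves, the SRT block layout can be sketched in a few lines of TypeScript. This is a minimal illustration, not the tool's exporter: the `Segment` shape and the function names are assumptions, and the only format details used are the standard SRT ones (a running index, an `HH:MM:SS,mmm --> HH:MM:SS,mmm` line, the text, and a blank line between blocks).

```typescript
// Hypothetical segment shape; the tool's internal representation is not documented here.
interface Segment {
  startSec: number;
  endSec: number;
  text: string;
}

/** Left-pad a non-negative integer with zeros to the given width. */
function pad(n: number, width: number): string {
  let s = String(n);
  while (s.length < width) s = "0" + s;
  return s;
}

/** Format seconds as an SRT timestamp: HH:MM:SS,mmm */
function toSrtTimestamp(seconds: number): string {
  const totalMs = Math.max(0, Math.round(seconds * 1000));
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms, 3)}`;
}

/** Build an SRT document: numbered blocks separated by blank lines. */
function toSrt(segments: Segment[]): string {
  return segments
    .map(
      (seg, i) =>
        `${i + 1}\n${toSrtTimestamp(seg.startSec)} --> ${toSrtTimestamp(seg.endSec)}\n${seg.text.trim()}\n`
    )
    .join("\n");
}
```

A segment running from 83.456 s to 88.91 s produces the timestamp line `00:01:23,456 --> 00:01:28,910`, matching the example above.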
What are common use cases?
Meeting notes. Drop a recorded Zoom or Teams call and get a rough transcript to clean up into minutes. A 1-hour meeting typically produces a 5,000–8,000-word transcript.
Podcast show notes. Transcribe an episode to pull quotes, build timestamped chapters, or generate an SEO-friendly description.
Video captions. Extract dialogue, format it as SRT, and drop it into your video editor for closed captions. Captions improve accessibility for deaf and hard-of-hearing viewers and keep videos understandable during silent autoplay on social platforms.
Dictation cleanup. iPhone or Android voice memos transcribed in seconds, then edited as plain text.
Academic research. Qualitative researchers transcribe interview recordings without sending sensitive data to a third-party transcription service — GDPR-friendly, no DPA required.
What are the best-practice tips?
- Quiet environment beats post-processing filters every time.
- Mic distance 20–30 cm reduces plosives and distortion.
- Speak deliberately — slow, clear delivery boosts recognition, especially for technical vocabulary.
- 128 kbps MP3 is plenty for transcription. Higher bitrates don’t improve accuracy.
- Split long recordings into 30–60-minute chunks before transcribing. Shorter jobs run more stably and give you natural break points.
- Force the language for clips under 30 seconds or with strong accents.
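The chunking tip above can be planned ahead of time. The following TypeScript sketch computes evenly sized chunk boundaries for a recording of known duration. It is a hypothetical helper, not part of the tool: `planChunks` and the `Chunk` shape are illustrative names, and cutting at fixed times is an assumption. In practice you may prefer to nudge each boundary to a nearby pause so that no word is split.

```typescript
// Hypothetical planning helper; boundaries are fixed times, not detected pauses.
interface Chunk {
  startSec: number;
  endSec: number;
}

/** Plan consecutive, equal-length chunks of at most maxChunkMinutes covering the recording. */
function planChunks(durationSec: number, maxChunkMinutes: number = 30): Chunk[] {
  if (durationSec <= 0 || maxChunkMinutes <= 0) return [];
  const maxChunkSec = maxChunkMinutes * 60;
  const count = Math.ceil(durationSec / maxChunkSec);
  // Equal-length chunks avoid a very short trailing piece.
  const chunkSec = durationSec / count;
  const chunks: Chunk[] = [];
  for (let i = 0; i < count; i++) {
    chunks.push({
      startSec: i * chunkSec,
      endSec: i === count - 1 ? durationSec : (i + 1) * chunkSec,
    });
  }
  return chunks;
}
```

For a 2-hour recording with a 45-minute limit, this yields three 40-minute chunks rather than two 45-minute chunks and a 30-minute remainder.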
Which related tools help next?
From the kittokit ecosystem for the full audio-to-text workflow:
- Speech Enhancer — Remove noise, echo and background hum before transcribing. Translates directly into higher word accuracy.
- Character Counter — Count words, characters and reading time of your transcript. Handy for trimming meeting minutes into newsletter or blog length.
- Text Diff — Compare two transcript versions, e.g. raw vs. proofread. Highlights changes word-by-word.