Does my audio get sent to a server?

No. The transcription model runs entirely inside your browser via WebAssembly. There is no backend, no API key, no logging. Your audio file never leaves your device — suitable for confidential medical, legal or HR recordings.

Which audio formats are supported?

Officially MP3, WAV, M4A (AAC), OGG Vorbis, and WebM Opus — those cover phone voice recorders, Zoom/Teams exports, podcast files and most DAW outputs. Other formats your browser can decode (FLAC, AIFF on Safari, AMR) often work too. If a file is rejected, convert to MP3 at 128 kbps — higher bitrates won't improve accuracy.

How long does transcription take?

On a modern laptop, processing runs roughly 2–4× faster than real time. A 5-minute recording typically finishes in 1–2 minutes; a 30-minute meeting in 8–15 minutes. Older devices and phones are proportionally slower.
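The figures above come down to one division: audio length divided by a speed factor. A minimal sketch, assuming the 2–4× figure means "that many times faster than real time"; the function name is hypothetical and not part of the tool:

```typescript
// Hypothetical helper: estimate processing time from the audio length and a
// speed factor (how many times faster than real time the model runs).
function estimateProcessingMinutes(
  audioMinutes: number,
  speedFactor: number,
): number {
  if (audioMinutes < 0 || speedFactor <= 0) {
    throw new RangeError("audioMinutes must be >= 0 and speedFactor > 0");
  }
  return audioMinutes / speedFactor;
}

// A 30-minute meeting at the slow (2x) and fast (4x) ends of the range.
const slowestMinutes = estimateProcessingMinutes(30, 2); // 15
const fastestMinutes = estimateProcessingMinutes(30, 4); // 7.5
console.log(`${fastestMinutes}-${slowestMinutes} minutes`);
```

On an older device, plug in a speed factor below 2 to get a more realistic upper bound.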

Can I export SRT subtitles with timestamps?

Yes. The download dialog offers TXT (plain text) or SRT (SubRip format, HH:MM:SS,mmm timestamps). SRT imports directly into Premiere Pro, DaVinci Resolve, CapCut, VLC and YouTube's caption editor.

How accurate is the transcription?

Clear, close-mic English speech in a quiet environment typically reaches 90–95% word accuracy. Background music, strong regional accents, technical jargon and overlapping speakers all increase error rates. Always proofread before publishing.
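If you want to check accuracy on your own recordings, compare the output against a hand-corrected reference. The sketch below assumes the conventional definition (word accuracy = 1 minus word error rate, where errors are substituted, inserted and deleted words); this page does not state how its 90–95% figure was measured, and the function names and ASCII-only text normalization are illustrative choices:

```typescript
// Word-level edit distance: the minimum number of substituted, inserted and
// deleted words needed to turn the reference into the hypothesis.
function wordErrors(reference: string[], hypothesis: string[]): number {
  let prev: number[] = [];
  for (let j = 0; j <= hypothesis.length; j++) prev.push(j);
  for (let i = 1; i <= reference.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hypothesis.length; j++) {
      const cost = reference[i - 1] === hypothesis[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // reference word missing from the hypothesis
        curr[j - 1] + 1,    // extra word in the hypothesis
        prev[j - 1] + cost, // match (cost 0) or substitution (cost 1)
      );
    }
    prev = curr;
  }
  return prev[hypothesis.length];
}

// Assumption: lowercase and strip ASCII punctuation so "Friday." equals "friday".
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9'\s]/g, "")
    .split(/\s+/)
    .filter((word) => word.length > 0);
}

function wordAccuracy(referenceText: string, hypothesisText: string): number {
  const reference = tokenize(referenceText);
  if (reference.length === 0) {
    throw new RangeError("reference must contain at least one word");
  }
  return 1 - wordErrors(reference, tokenize(hypothesisText)) / reference.length;
}

// One wrong word out of six.
const accuracy = wordAccuracy(
  "Please send the report by Friday.",
  "please send a report by friday",
);
console.log(accuracy.toFixed(3));
```

The normalization step only handles unaccented Latin text; for other languages you would need a tokenizer that suits the script.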

Can I transcribe multiple speakers?

Speaker diarization (labeling who said what) is not currently supported. The output is a single continuous transcript. You can manually insert speaker names after copying.

Is there a file size limit?

No hard cap, but in-browser processing of files over ~200 MB can exhaust browser memory on older devices. For recordings over 2 hours, splitting into 30-minute segments before uploading is strongly recommended.
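To plan the split, you only need the cut points. A small sketch that computes them for a given recording length; the names are hypothetical, and the actual cutting still has to happen in an audio editor or command-line tool of your choice:

```typescript
interface Segment {
  startSec: number;
  endSec: number;
}

// Compute cut points for splitting a recording into fixed-length segments.
// The last segment is shorter when the length is not an exact multiple.
function planSegments(totalSec: number, segmentSec: number = 30 * 60): Segment[] {
  if (totalSec < 0 || segmentSec <= 0) {
    throw new RangeError("totalSec must be >= 0 and segmentSec > 0");
  }
  const segments: Segment[] = [];
  for (let start = 0; start < totalSec; start += segmentSec) {
    segments.push({
      startSec: start,
      endSec: Math.min(start + segmentSec, totalSec),
    });
  }
  return segments;
}

// A recording of 2 h 15 min (8100 s) yields four full segments and one of 15 min.
const plan = planSegments(8100);
console.log(plan.length, plan[plan.length - 1]);
```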

Does it work offline?

Yes. Once the model has loaded into the browser cache, transcription runs fully offline. Useful on flights, train rides, or when handling confidential audio without external connections.

Audio Transcription — Speech to Text in Your Browser

What does this tool do?

This tool converts spoken audio into a plain-text transcript — without uploading anything. It uses a compact speech-recognition model compiled to WebAssembly, running directly inside your browser tab. You get the full transcript in a scrollable, editable panel that you can copy or download as TXT or SRT.

Supported input formats include MP3, WAV, M4A (AAC), OGG Vorbis, and WebM Opus — the most common formats produced by phones, voice recorders, video editors, and meeting apps.

How Does It Work?

The pipeline runs in two stages, both on your device:

Decode and normalize. The Web Audio API decodes your file and resamples it to 16 kHz mono — the input format speech-recognition models expect. Stereo channels are averaged to a single mono signal.
Inference. A compact transformer model converts each 30-second window of audio into text tokens, then merges the windows into a continuous transcript with timestamps. Everything runs inside your browser tab — no cloud API, no third-party service.
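The preprocessing half of that pipeline can be sketched in a few lines. This is an illustration under stated assumptions, not the tool's own code: the tool relies on the Web Audio API for decoding and resampling, whereas the hand-rolled linear-interpolation resampler below is a stand-in to show the idea, and all function names are hypothetical. A production resampler would also low-pass filter before downsampling to avoid aliasing.

```typescript
// Average the left and right channels into one mono signal.
function downmixToMono(left: Float32Array, right: Float32Array): Float32Array {
  const length = Math.min(left.length, right.length);
  const mono = new Float32Array(length);
  for (let i = 0; i < length; i++) {
    mono[i] = (left[i] + right[i]) / 2;
  }
  return mono;
}

// Resample by linear interpolation between neighbouring input samples.
// Simplification: no anti-aliasing filter is applied.
function resampleLinear(
  input: Float32Array,
  fromRate: number,
  toRate: number,
): Float32Array {
  if (fromRate <= 0 || toRate <= 0) {
    throw new RangeError("sample rates must be positive");
  }
  const outLength = Math.floor((input.length * toRate) / fromRate);
  const out = new Float32Array(outLength);
  const step = fromRate / toRate;
  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, input.length - 1);
    const frac = pos - i0;
    out[i] = input[i0] * (1 - frac) + input[i1] * frac;
  }
  return out;
}

// Cut the signal into fixed windows; the last one may be shorter.
function splitIntoWindows(
  samples: Float32Array,
  sampleRate: number = 16000,
  windowSec: number = 30,
): Float32Array[] {
  const size = sampleRate * windowSec;
  const windows: Float32Array[] = [];
  for (let start = 0; start < samples.length; start += size) {
    windows.push(samples.subarray(start, Math.min(start + size, samples.length)));
  }
  return windows;
}

// 70 seconds of silent 48 kHz stereo, reduced to 16 kHz mono and windowed.
const stereoLeft = new Float32Array(48000 * 70);
const stereoRight = new Float32Array(48000 * 70);
const mono16k = resampleLinear(downmixToMono(stereoLeft, stereoRight), 48000, 16000);
console.log(splitIntoWindows(mono16k).map((w) => w.length));
```

Each window would then be handed to the model, which is outside the scope of this sketch.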

The model is downloaded once on first use and cached in your browser. After that, transcription works fully offline.

Three Quality Tiers — Which to Pick?

The choice trades download size and speed against recognition accuracy:

Tier	Model size	Speed	Best for
Fast	~152 MB	very fast	Mobile, short voice memos, quick notes
Accurate	~291 MB	balanced	Default for meetings, interviews, podcasts
Precise	~968 MB	slower	Studio recordings, lectures, accented speech

Pick a tier in the model selector below the upload area. Each tier is cached separately, so you can switch back and forth.

How does language detection work?

The model auto-detects the spoken language from the first 30 seconds of audio. If detection misfires — common with short clips or heavy accents — use the language dropdown to force a specific language before transcribing.

Setting	When to use
Auto-detect	Monolingual recordings ≥ 30 seconds
Force language	Short clips, strong regional accents
English	Podcasts, meetings, dictation
German	German-language interviews, lectures
French/Spanish	Native-speaker recordings

TXT or SRT — Which Export?

The download dialog offers two formats:

TXT — plain running text, one paragraph. Best for meeting minutes, blog drafts, research notes.
SRT — SubRip subtitle format with start/end timestamps per block (00:01:23,456 --> 00:01:28,910). Imports directly into YouTube, Premiere Pro, DaVinci Resolve, CapCut, VLC, and most editors that handle captions.

For social-video subtitles, download the SRT file and import it into your editor. SRT carries only text and timing, so font, size and position are set by the editor or player that renders the captions.
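For reference, an SRT file is a sequence of numbered blocks, each with a timing line and the caption text, separated by blank lines. A minimal sketch of how timestamped transcript segments could be serialized; the `Cue` shape and function names are assumptions for illustration, not the tool's internal format:

```typescript
// Hypothetical shape of one timestamped transcript segment.
interface Cue {
  startSec: number;
  endSec: number;
  text: string;
}

// Left-pad a non-negative integer with zeros.
function pad(value: number, width: number): string {
  let s = String(value);
  while (s.length < width) s = "0" + s;
  return s;
}

// Format seconds as HH:MM:SS,mmm (SRT uses a comma before the milliseconds).
function formatSrtTimestamp(seconds: number): string {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms, 3)}`;
}

// Serialize cues as numbered SRT blocks separated by blank lines.
function toSrt(cues: Cue[]): string {
  return cues
    .map((cue, index) => {
      const timing =
        `${formatSrtTimestamp(cue.startSec)} --> ${formatSrtTimestamp(cue.endSec)}`;
      return `${index + 1}\n${timing}\n${cue.text}\n`;
    })
    .join("\n");
}

console.log(
  toSrt([
    { startSec: 83.456, endSec: 88.91, text: "Welcome back to the show." },
    { startSec: 89.2, endSec: 92, text: "Today we talk about captions." },
  ]),
);
```

Rounding to whole milliseconds before splitting into hours, minutes and seconds avoids floating-point artifacts such as 455 ms where 456 ms was intended.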

What are common use cases?

Meeting notes. Drop a recorded Zoom or Teams call and get a rough transcript to clean up into minutes. A 1-hour meeting typically produces a 5,000–8,000-word transcript.

Podcast show notes. Transcribe an episode to pull quotes, build timestamped chapters, or generate an SEO-friendly description.

Video captions. Export the transcript as SRT and drop it into your video editor for closed captions. Captions improve accessibility for deaf and hard-of-hearing viewers and keep videos understandable during silent autoplay on social platforms.

Dictation cleanup. iPhone or Android voice memos transcribed in seconds, then edited as plain text.

Academic research. Qualitative researchers transcribe interview recordings without sending sensitive data to a third-party transcription service — GDPR-friendly, no DPA required.

What are the best-practice tips?

Quiet environment beats post-processing filters every time.
Mic distance 20–30 cm reduces plosives and distortion.
Speak deliberately — slow, clear delivery boosts recognition, especially for technical vocabulary.
128 kbps MP3 is plenty for transcription. Higher bitrates don’t improve accuracy.
Split long recordings into 30–60-minute chunks before transcribing — more stable and gives natural break points.
Force the language for clips under 30 seconds or with strong accents.

Which related tools help next?

From the kittokit ecosystem, for the full audio-to-text workflow:

Speech Enhancer — Remove noise, echo and background hum before transcribing. Translates directly into higher word accuracy.
Character Counter — Count words, characters and reading time of your transcript. Handy for trimming meeting minutes into newsletter or blog length.
Text Diff — Compare two transcript versions, e.g. raw vs. proofread. Highlights changes word-by-word.
