How do you use this tool?
- Type or paste your text into the input field — or drop a .txt/.md/.json file (up to 10 MB) onto it.
- Pick a tokenizer family (BPE / WordPiece / Unigram) — token tiles update live.
- Switch views: Token stream · 3-family compare · Multilingual · Algorithm steps.
Why a tokenizer playground at all?
If you work with language models, sooner or later you bump into a number that matters more than the word count: the token count. Tokens are the units models perceive — usually smaller than a word, larger than a letter. The Tokenizer Playground breaks your text into exactly those units, live, and shows every token as a colour-coded tile with token ID, byte length, and byte offset. Hover over any token to see precisely which vocabulary entry matched.
Unlike a plain word counter, a tokenizer playground also illuminates the algorithm. Three families are included: Byte-Pair Encoding, WordPiece and Unigram. These are the three dominant schools from which most contemporary multilingual language models derive their tokenizers. The playground shows them not as a black box but as a traceable step-by-step algorithm.
Three families, three strategies — what sets them apart?
The BPE family (Byte-Pair Encoding) starts every run from single characters. Adjacent pairs are then merged in the order learned during training (most frequent pairs first) until no rule applies. The signature trait: the space before a word start is preserved as a special character (Ġ in the display), which makes “the” and “ the” two different tokens. This family is the most common choice for decoder-only generation models. Algorithm reference: Sennrich, Haddow, Birch 2016.
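The space-marker behaviour can be sketched in a few lines. This is a minimal illustration, not the playground's own code: the function name `mark_word_starts` and the simple regex split are assumptions chosen for the example.

```python
import re

def mark_word_starts(text):
    """Split text into chunks, keeping a leading space visible as 'Ġ'.

    Hypothetical helper: real BPE pre-tokenizers use more elaborate rules.
    """
    # " ?\S+" matches an optional single space followed by non-space characters.
    return [chunk.replace(" ", "Ġ") for chunk in re.findall(r" ?\S+", text)]

chunks = mark_word_starts("the cat saw the dog")
print(chunks)  # ['the', 'Ġcat', 'Ġsaw', 'Ġthe', 'Ġdog']
```

The first “the” has no marker while the second one carries Ġ, so the two would map to different vocabulary entries.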
The WordPiece family picks the longest sub-word that matches the vocabulary, starting from the left. Continuation pieces inside the same word get the marker “##”, so “playing” becomes play + ##ing. It typically lowercases input first, so the same word in different capitalisations produces identical tokens. This family appears in classic bidirectional encoder models for classification and understanding.
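A minimal sketch of greedy longest-match segmentation for a single word follows. The tiny vocabulary `VOCAB` and the function name `wordpiece` are made up for illustration, and lowercasing is applied unconditionally here as an assumption.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of one word (toy sketch)."""
    word = word.lower()  # assumption: the tokenizer lowercases its input
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # nothing matches at this position
        pieces.append(match)
        start = end
    return pieces

VOCAB = {"play", "##ing", "##ed", "the"}  # hypothetical sample vocabulary
print(wordpiece("Playing", VOCAB))  # ['play', '##ing']
print(wordpiece("xyz", VOCAB))      # ['[UNK]']
```

Because of the lowercasing step, `wordpiece("PLAYING", VOCAB)` returns the same pieces as `wordpiece("playing", VOCAB)`.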
Unigram family (SentencePiece-style) treats tokenisation as an optimisation problem. Each vocabulary entry has a log-probability; Viterbi finds the segmentation with the highest total. Word starts carry a Unicode marker (▁). This family is the standard choice for multilingual encoder-decoder models and is preferred when a mix of Latin script, Asian scripts and special characters is the rule. Algorithm reference: Kudo 2018.
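The Viterbi search can be sketched as a small dynamic program. The log-probabilities in `LOGP` are invented for the example, and `viterbi_segment` is a hypothetical name; a real SentencePiece model has a far larger vocabulary and its own handling of unknown characters.

```python
import math

def viterbi_segment(text, logp):
    """Return the segmentation of text with the highest total log-probability."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in logp and best[start] + logp[piece] > best[end]:
                best[end] = best[start] + logp[piece]
                back[end] = start
    if best[n] == -math.inf:
        return None  # no segmentation exists with this vocabulary
    # Backtrack from the end to recover the pieces in reading order.
    pieces, i = [], n
    while i > 0:
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1]

# Hypothetical log-probabilities for a handful of vocabulary entries.
LOGP = {"▁play": -2.0, "ing": -1.5, "▁": -3.0, "play": -2.5,
        "i": -5.0, "n": -5.0, "g": -5.0}
print(viterbi_segment("▁playing", LOGP))  # ['▁play', 'ing']
```

Here “▁play” + “ing” scores -3.5, which beats every alternative such as “▁” + “play” + “ing” at -7.0.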
What does the multilingual heatmap show?
The heatmap tab takes one sentence, the classic pangram “the quick brown fox jumps over the lazy dog”, and shows it in six languages: English, German, French, Spanish, Japanese and Arabic. For each language the playground counts words (by Unicode word boundary) and tokens (by the currently selected family), and computes the ratio of tokens per word. A ratio above 2 gets expensive; above 4 the language is structurally penalised.
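The ratio and its two thresholds can be sketched as follows. Two loud assumptions: the regex `\w+` is only a rough stand-in for proper Unicode word-boundary segmentation (it would treat unspaced Japanese text as one word), and `toy_tokenize` is a hypothetical tokenizer that simply cuts each word into chunks of up to four characters.

```python
import re

def tokens_per_word(text, tokenize):
    """Tokens divided by words; words are approximated with a regex."""
    words = re.findall(r"\w+", text)
    if not words:
        return 0.0
    return len(tokenize(text)) / len(words)

def rating(ratio):
    """Map a ratio to the two thresholds named in the text."""
    if ratio > 4:
        return "structurally penalised"
    if ratio > 2:
        return "expensive"
    return "ok"

def toy_tokenize(text):
    # Hypothetical tokenizer: chunks of up to 4 characters per word.
    return [w[i:i + 4] for w in re.findall(r"\w+", text)
            for i in range(0, len(w), 4)]

ratio = tokens_per_word("the quick brown fox jumps over the lazy dog", toy_tokenize)
print(f"{ratio:.2f}", rating(ratio))  # 1.33 ok
```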
The phenomenon is well-documented. A 2023 study showed that the same translation across 22 languages produces 1.5× to 14× token differences — with English always at the cheaper end. The playground heatmap shows that effect immediately: English typically lands at 1.0–1.3 tokens per word, German at 1.5–2.0 because of compounds, Japanese at 2.5 and above because of word-boundary ambiguity. You can see at a glance why multilingual applications have a cost problem that pure word-counting would hide.
How does the algorithm trace show the tokenisation step by step?
A black box is hard to learn from. So the algorithm-steps tab shows the full run of tokenisation on your current input. For BPE the input is first broken into single characters. Then in each step the highest-ranked merge pair is fused and the intermediate state shown. You see, for example: “merge ‘t’ + ‘h’ → ‘th’ (rule #1)”, then “merge ‘th’ + ‘e’ → ‘the’ (rule #22)”, then “no more merges apply — done”.
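A BPE trace of this kind can be sketched with a short merge loop. The merge table `MERGES` holds only two made-up rules, so the rule numbers differ from the example above, and `bpe_trace` is a hypothetical name rather than the playground's own function.

```python
def bpe_trace(word, merges):
    """Apply ranked merges to one word and record each step.

    merges: list of (left, right) pairs; index 0 is rule #1.
    """
    rank = {pair: i + 1 for i, pair in enumerate(merges)}
    symbols = list(word)  # start from single characters
    trace = []
    while True:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        candidates = [p for p in pairs if p in rank]
        if not candidates:
            trace.append("no more merges apply: done")
            break
        left, right = min(candidates, key=rank.get)  # highest-ranked rule
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (left, right):
                merged.append(left + right)  # fuse the pair
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
        trace.append(f"merge '{left}' + '{right}' -> '{left + right}' "
                     f"(rule #{rank[(left, right)]})")
    return symbols, trace

MERGES = [("t", "h"), ("th", "e")]  # hypothetical two-rule merge table
symbols, trace = bpe_trace("then", MERGES)
print(symbols)  # ['the', 'n']
for line in trace:
    print(line)
```

The trace for “then” has three lines: the t + h merge, the th + e merge, and the final “no more merges” message.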
For WordPiece each step looks different. The trace shows the left cursor and the sub-word match: “match ‘play’ (4 chars)”, then “match ‘##ing’ (3 chars)”. If no sub-word is found, “no match at position N — [UNK]” appears and the run ends. For Unigram the trace shows the Viterbi backtrack: every position gets a log-probability, the path with the highest sum is chosen, and tokens are shown in the order they appear.
What does the token ID next to each piece mean?
Every tokenizer has a fixed mapping: token string → token ID. The ID is the number the model actually receives — the string is just the human-readable form. The playground shows the ID next to each token tile. Token ID = -1 means: this piece is not in the bundled sample vocabulary. The tokenizer then falls back to single characters, which inflates the token count.
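The lookup and its fallback can be sketched like this. The vocabulary `VOCAB_IDS` and the function `ids_with_fallback` are hypothetical, and the exact fallback rule (split into characters, each shown with -1 unless the character itself is in the vocabulary) is an assumption about how such a display might work.

```python
def ids_with_fallback(pieces, vocab):
    """Pair each piece with its ID; unknown pieces fall back to characters."""
    out = []
    for piece in pieces:
        if piece in vocab:
            out.append((piece, vocab[piece]))
        else:
            # Fallback: one entry per character, -1 when it is not in the vocab.
            out.extend((ch, vocab.get(ch, -1)) for ch in piece)
    return out

VOCAB_IDS = {"the": 0, "dog": 1}  # hypothetical sample vocabulary
result = ids_with_fallback(["the", "Đorđević"], VOCAB_IDS)
print(len(result))  # 9: one known token plus eight single characters
```

One unknown name turns into eight entries here, which is how out-of-vocabulary input inflates the token count.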
This is exactly the out-of-vocabulary behaviour of real tokenizers in practice. You see it with rare proper names, technical terms, foreign-language interjections, or emoji. If you test a prompt containing “Đorđević” you will see that this one name costs 8–12 tokens, while a common English given name fits in one or two.
Why the three-family comparison?
In the compare tab your input runs through all three families in parallel. Right above the table sit three token counters and a Δ value — the difference to the most efficient family. This answers a recurring question in practice: “Would I save tokens by switching families?” The answer depends on the input. English plain prose is very similar across all three. Source code is significantly tighter in BPE because merge rules learn common code sequences. CJK text is more efficient in Unigram (with a good SentencePiece vocabulary) because multi-character tokens for common syllables exist.
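The Δ value is plain arithmetic: each family's count minus the smallest count. A minimal sketch, with invented token counts and a hypothetical function name `deltas`:

```python
def deltas(counts):
    """Difference between each family's token count and the smallest count."""
    best = min(counts.values())
    return {family: n - best for family, n in counts.items()}

# Hypothetical counts for one input; real numbers depend on the vocabularies.
COUNTS = {"BPE": 12, "WordPiece": 14, "Unigram": 13}
print(deltas(COUNTS))  # {'BPE': 0, 'WordPiece': 2, 'Unigram': 1}
```

A Δ of 0 marks the most efficient family for that particular input.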
The comparison also shows a second dimension: not just the count, but where the boundaries fall. BPE marks every word start with a space marker, WordPiece marks every continuation piece with ##, Unigram marks every word start with a Unicode ▁. These three markers look different but conceptually cost the same.
What is deliberately NOT in the playground?
Three jobs sit outside the scope on purpose. First: no cost calculator with dollar or euro prices — those change often and are vendor-specific. Second: no chat-template builder (system + user + assistant tags). Different models use different conventions, and a template builder couples you too tightly to one model. Third: no vocabulary uploads — that would be a security surface (manipulated vocab files could carry unsafe pieces).
These gaps are deliberate. They keep the playground lightweight, fast, and vendor-neutral. If you need an exact token-cost estimate for a specific model, use that model’s official tokenizer. If you want to understand the algorithm and compare three families in parallel — that is what the playground is built for.
What about privacy and GDPR?
Your text stays in browser memory. There is no server, no localStorage, no cookies, no analytics, no network request. The dev-tools Network panel stays empty after the page loads. Close the tab and everything is gone — by design. The “Copy statistics” button also makes only a local Clipboard API call, no network traffic.
Because processing happens entirely on your device and nothing is transmitted, the site operator processes no personal data in the sense of Art. 4 GDPR. Even sensitive prompts (code snippets, personal notes, legal text) stay in the tab. This architecture makes the playground usable in enterprise environments with strict compliance requirements.