What is a tokenizer and why do I need a playground?

A tokenizer breaks text into the units a language model operates on — tokens. A token is usually smaller than a word and larger than a single character. Knowing the token count of your prompt tells you how much context-window budget you are using, whether an output will fit, and where multilingual inputs sneak in unexpected costs. The playground makes that visible: every token is a tile with its ID, byte length, and byte offset.

Which three algorithm families are compared?

Three families that dominate modern language models. BPE-style (Byte-Pair Encoding) starts from characters and merges pairs by learned frequency — typical for decoder-only models. WordPiece-style picks the longest matching sub-word from the left — typical for bidirectional encoder-only models. Unigram-style (SentencePiece) computes the most likely segmentation via log-probabilities — typical for multilingual encoder-decoder models. All three run locally in your browser, without loading any model weights.

Why do different languages need different token counts?

Tokenizers are trained on a corpus that is usually English-dominant. Common English words like "the" or "and" become one token. German compounds like "Krankenversicherungsvertrag" can fragment into eight or ten pieces. Japanese and other scripts written without spaces between words, as well as scripts such as Arabic that are thinly represented in the training corpus, often get split character by character, which multiplies the token count per word. This is the documented tokenizer-unfairness phenomenon (Petrov et al., 2023). The heatmap tab shows this with the same sentence in six languages.

Where does my input go — is anything stored or uploaded?

Your text never leaves this browser tab. No server request, no cookies, no analytics, no `localStorage`. You can verify it in dev tools: open the Network panel, type into the input field, and watch that not a single HTTP request is fired. The three tokenizer families run as pure JavaScript modules in the browser. Because processing happens locally on your device, the operator does not receive or process your text as personal data in the sense of Art. 4 GDPR.

How accurate is the token count in the playground?

The playground implements all three algorithms with a representative sample vocabulary per family (about 600 entries each). That is enough to demonstrate the algorithm mechanics correctly — merge order in BPE, greedy longest match in WordPiece, Viterbi backtrack in Unigram. For an exact prediction of token costs for a specific commercial model, you should use that model's official tokenizer; the playground is a learning and comparison platform, not a cost calculator.

What exactly does the algorithm trace show me?

The step-by-step run of tokenisation on your input. For BPE you see each merge step: "merge 'th' + 'e' → 'the' (rule #22)". For WordPiece you see the greedy longest match for each sub-word: "match 'play' (4 chars)". For Unigram you see the Viterbi backtrack with log-probabilities. The trace is capped at 40 steps so the panel stays readable — for long texts, trace the first pre-token unit.

Can I drop in whole files?

Yes, up to 10 MB. Drag and drop a .txt, .md, .json, .csv, .log, .html, .xml or .yaml file onto the input field. Bigger files are rejected with a message — the limit is not algorithm speed (a 10 MB file tokenises in under a second), but the browser's render performance when displaying tens of thousands of tokens. If you need more, trim the text before dropping.

Why do some tokens show "id = -1"?

-1 means: this piece is not in the bundled sample vocabulary for the selected family. The playground falls back to character-level tokens, which mirrors the character- or byte-level fallback that many real tokenizers apply to out-of-vocabulary pieces. You see this for rare words, exotic Unicode characters, or languages not covered by the demo vocab. The point: out-of-vocabulary text costs more tokens than text the vocabulary covers well.

Tokenizer Playground — three families side by side

Why a tokenizer playground at all?

If you work with language models, sooner or later you bump into a number that matters more than the word count: the token count. Tokens are the units models perceive — usually smaller than a word, larger than a letter. The Tokenizer Playground breaks your text into exactly those units, live, and shows every token as a colour-coded tile with token ID, byte length, and byte offset. Hover over any token to see precisely which vocabulary entry matched.
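The tile metadata can be reproduced by hand. The following is a minimal sketch, assuming UTF-8 encoding and tokens that are plain substrings of the input; `token_tiles` is a hypothetical helper name, not the playground's own code.

```python
def token_tiles(tokens):
    """Return (token, byte_length, byte_offset) for each token.

    Assumes the tokens concatenate back to the original input and that
    lengths and offsets are measured in UTF-8 bytes.
    """
    tiles = []
    offset = 0
    for tok in tokens:
        length = len(tok.encode("utf-8"))  # "ü" is 2 bytes, an emoji is 4
        tiles.append((tok, length, offset))
        offset += length
    return tiles


tiles = token_tiles(["Zür", "ich", " ", "🙂"])
for tok, length, offset in tiles:
    print(repr(tok), length, offset)
```

Byte offsets and character offsets diverge as soon as the text contains non-ASCII characters, which is why a tile can report a byte length larger than the number of visible characters.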

Unlike a plain word counter, a tokenizer playground also illuminates the algorithm. Three families are included: Byte-Pair Encoding, WordPiece and Unigram. These are the three dominant schools from which most contemporary language models derive their tokenizers. The playground shows them not as a black box but as a traceable step-by-step algorithm.

Three families, three strategies — what sets them apart?

BPE family (Byte-Pair Encoding) starts every run at single characters. Pairs are then merged in a trained frequency order until no rule applies. The signature trait: a space before each word start is preserved as a special character (Ġ in the display), which makes “the” and “ the” two different tokens. This family is the most common choice for decoder-only generation models. Algorithm reference: Sennrich, Haddow, Birch 2016.
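The space marker is easiest to see in the pre-tokenisation step that runs before any merge. Here is a rough sketch, assuming a simplified regex split; `pre_tokenize` and the regex are illustrative assumptions, and real byte-level BPE tokenizers use a more elaborate pattern.

```python
import re

SPACE_MARKER = "\u0120"  # "Ġ", the visible stand-in for a leading space


def pre_tokenize(text):
    """Split text into pre-tokens, folding one leading space into each as Ġ.

    Simplification: runs of several spaces are collapsed, which real
    byte-level tokenizers do not do.
    """
    pieces = re.findall(r" ?\S+", text)
    return [p.replace(" ", SPACE_MARKER) for p in pieces]


print(pre_tokenize("the cat saw the dog"))
```

The first "the" has no marker and the second one does, so the two occurrences are looked up as different vocabulary entries.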

WordPiece family picks the longest sub-word that matches the vocabulary, starting from the left. Continuation pieces inside the same word get the marker “##”, so “playing” becomes play + ##ing. It typically lowercases input first, so the same word in different capitalisations produces identical tokens. This family appears in classic bidirectional encoder models for classification and understanding.
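Greedy longest match fits in a few lines. This is a minimal sketch for a single word, assuming a toy vocabulary and the common convention of returning `[UNK]` for the whole word when any position fails to match; the function name and vocabulary are made up for illustration.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of one word."""
    word = word.lower()  # typical uncased behaviour
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # nothing matches at this position
        pieces.append(match)
        start = end
    return pieces


VOCAB = {"play", "##ing", "##ed", "un"}
print(wordpiece("Playing", VOCAB))
print(wordpiece("PLAYED", VOCAB))
print(wordpiece("xyz", VOCAB))
```

Because the match is greedy, the result is not guaranteed to be the segmentation with the fewest pieces; it is simply the first one the left-to-right scan finds.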

Unigram family (SentencePiece-style) treats tokenisation as an optimisation problem. Each vocabulary entry has a log-probability; Viterbi finds the segmentation with the highest total. Word starts carry a Unicode marker (▁). This family is the standard choice for multilingual encoder-decoder models and is preferred when a mix of Latin script, Asian scripts and special characters is the rule. Algorithm reference: Kudo 2018.
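The Viterbi search can be sketched as a small dynamic programme. The log-probabilities below are invented for illustration, and `unigram_segment` is a hypothetical name; a trained SentencePiece model would supply real scores.

```python
import math


def unigram_segment(text, logp):
    """Return (pieces, score) for the highest-scoring segmentation."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0
    max_len = max(len(p) for p in logp)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in logp and best[start] + logp[piece] > best[end]:
                best[end] = best[start] + logp[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("no segmentation with this vocabulary")
    # Backtrack from the end of the text to recover the pieces.
    pieces = []
    end = n
    while end > 0:
        start = back[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1], best[n]


LOGP = {
    "▁playing": -6.0, "▁play": -2.0, "▁": -1.0, "play": -3.0,
    "ing": -1.5, "i": -4.0, "n": -4.0, "g": -4.0,
}
pieces, score = unigram_segment("▁playing", LOGP)
print(pieces, score)
```

In this toy vocabulary the whole word exists as a single entry, yet the two-piece split wins because its summed log-probability is higher. That is the difference from a greedy longest match.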

What does the multilingual heatmap show?

The heatmap tab takes one sentence, the classic pangram “the quick brown fox jumps over the lazy dog”, and presents it in six languages: English, German, French, Spanish, Japanese and Arabic. For each language the playground counts words (by Unicode word boundary) and tokens (by the currently selected family), and computes the ratio of tokens per word. A ratio above 2 gets expensive; above 4 the language is structurally penalised.
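The ratio and its two thresholds can be sketched as follows. This is an illustration under assumptions: the regex `\w+` is only a rough stand-in for Unicode word-boundary segmentation (it does not split Japanese into words), and `chunk4` is a toy tokenizer, not one of the three families.

```python
import re


def tokens_per_word(text, tokenize):
    """Token count divided by word count; 0.0 for text without words."""
    words = re.findall(r"\w+", text)
    return len(tokenize(text)) / len(words) if words else 0.0


def cost_band(ratio):
    """Map a ratio to the bands named in the text above."""
    if ratio > 4:
        return "structurally penalised"
    if ratio > 2:
        return "expensive"
    return "ok"


def chunk4(text):
    """Toy tokenizer: cut every word into chunks of at most 4 characters."""
    return [w[i:i + 4] for w in re.findall(r"\w+", text)
            for i in range(0, len(w), 4)]


ratio = tokens_per_word("the quick brown fox jumps over the lazy dog", chunk4)
print(round(ratio, 2), cost_band(ratio))
```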

The phenomenon is well-documented. A 2023 study showed that the same translation across 22 languages produces 1.5× to 14× token differences — with English always at the cheaper end. The playground heatmap shows that effect immediately: English typically lands at 1.0–1.3 tokens per word, German at 1.5–2.0 because of compounds, Japanese at 2.5 and above because of word-boundary ambiguity. You can see at a glance why multilingual applications have a cost problem that pure word-counting would hide.

How does the algorithm trace show the tokenisation step by step?

A black box is hard to learn from. So the algorithm-steps tab shows the full run of tokenisation on your current input. For BPE the input is first broken into single characters. Then in each step the highest-ranked merge pair is fused and the intermediate state shown. You see, for example: “merge ‘t’ + ‘h’ → ‘th’ (rule #1)”, then “merge ‘th’ + ‘e’ → ‘the’ (rule #22)”, then “no more merges apply — done”.
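A trace like the one quoted above can be produced by recording each merge as it happens. The sketch below assumes merge rules stored as a pair-to-rank mapping, where a lower rank merges first; the rule numbers, the function name and the wording of the final trace line are illustrative, not taken from the playground's source.

```python
def bpe_trace(word, ranks, max_steps=40):
    """Apply ranked merges to one pre-token and record each step."""
    symbols = list(word)  # start from single characters
    trace = []
    while len(symbols) > 1:
        # Collect every adjacent pair that has a merge rule.
        candidates = [
            (ranks[(a, b)], i)
            for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
            if (a, b) in ranks
        ]
        if not candidates:
            break
        rank, i = min(candidates)  # lowest rank wins, leftmost on ties
        a, b = symbols[i], symbols[i + 1]
        symbols[i:i + 2] = [a + b]
        if len(trace) < max_steps:  # cap applies to recorded merge lines
            trace.append(f"merge '{a}' + '{b}' → '{a + b}' (rule #{rank})")
    trace.append("no more merges apply")
    return symbols, trace


RANKS = {("t", "h"): 1, ("i", "n"): 5, ("th", "e"): 22}
tokens, trace = bpe_trace("then", RANKS)
print(tokens)
for line in trace:
    print(line)
```

The merge order depends only on rule rank, not on position in the word, which is why the trace can jump back and forth across the input.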

For WordPiece each step looks different. The trace shows the left cursor and the sub-word match: “match ‘play’ (4 chars)”, then “match ‘##ing’ (3 chars)”. If no sub-word is found, “no match at position N — [UNK]” appears and the run ends. For Unigram the trace shows the Viterbi backtrack: every position gets a log-probability, the path with the highest sum is chosen, and tokens are shown in the order they appear.

What does the token ID next to each piece mean?

Every tokenizer has a fixed mapping: token string → token ID. The ID is the number the model actually receives — the string is just the human-readable form. The playground shows the ID next to each token tile. Token ID = -1 means: this piece is not in the bundled sample vocabulary. The tokenizer then falls back to single characters, which inflates the token count.
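The lookup and its fallback can be sketched like this, assuming a plain dictionary as the vocabulary and a character-level fallback that assigns -1 to anything still unknown. The IDs and the `lookup_ids` name are invented for the example.

```python
def lookup_ids(pieces, vocab):
    """Map pieces to (piece, id) pairs.

    A piece that is missing from the vocabulary is split into single
    characters; a character that is also missing gets the ID -1.
    """
    tiles = []
    for piece in pieces:
        if piece in vocab:
            tiles.append((piece, vocab[piece]))
        else:
            tiles.extend((ch, vocab.get(ch, -1)) for ch in piece)
    return tiles


VOCAB = {"play": 17, "ing": 42, "e": 5}
tiles = lookup_ids(["play", "zeđ"], VOCAB)
print(tiles)
```

One unknown three-character piece turns into three tiles here, which is the inflation described above.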

This mirrors the out-of-vocabulary behaviour of real tokenizers in practice. You see it with rare proper names, technical terms, foreign-language interjections, or emoji. If you test a prompt containing “Đorđević” you will see that this one name costs 8–12 tokens, while a common English given name fits in one or two.

Why the three-family comparison?

In the compare tab your input runs through all three families in parallel. Right above the table sit three token counters and a Δ value — the difference to the most efficient family. This answers a recurring question in practice: “Would I save tokens by switching families?” The answer depends on the input. English plain prose is very similar across all three. Source code is significantly tighter in BPE because merge rules learn common code sequences. CJK text is more efficient in Unigram (with a good SentencePiece vocabulary) because multi-character tokens for common syllables exist.
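The counters and the Δ value amount to a small calculation. In this sketch the three callables are toy stand-ins chosen only to give different counts; they are not the BPE, WordPiece and Unigram implementations, and `compare_counts` is a hypothetical name.

```python
def compare_counts(text, tokenizers):
    """Return {name: (token_count, delta_to_best)} for each tokenizer."""
    counts = {name: len(tok(text)) for name, tok in tokenizers.items()}
    best = min(counts.values())  # the most efficient family sets the baseline
    return {name: (n, n - best) for name, n in counts.items()}


TOY_TOKENIZERS = {
    "whitespace": str.split,
    "chars": list,
    "pairs": lambda s: [s[i:i + 2] for i in range(0, len(s), 2)],
}
result = compare_counts("to be", TOY_TOKENIZERS)
print(result)
```

A Δ of 0 marks the most efficient entry for this particular input; a different input can move the 0 to another row.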

The comparison also shows a second dimension: not just the count, but where the boundaries fall. BPE marks every word start with a space marker, WordPiece marks every continuation piece with ##, Unigram marks every word start with a Unicode ▁. The three markers look different but serve the same purpose, recording where words begin, and none of them is inherently cheaper than the others.

What is deliberately NOT in the playground?

Three jobs sit outside the scope on purpose. First: no cost calculator with dollar or euro prices, because those change often and are vendor-specific. Second: no chat-template builder (system + user + assistant tags). Different models use different conventions, and a template builder couples you too tightly to one model. Third: no vocabulary uploads, because that would open an attack surface (manipulated vocab files could carry unsafe pieces).

These gaps are deliberate. They keep the playground lightweight, fast, and vendor-neutral. If you need an exact token-cost estimate for a specific model, use that model’s official tokenizer. If you want to understand the algorithm and compare three families in parallel — that is what the playground is built for.

What about privacy and GDPR?

Your text stays in browser memory. There is no server-side processing, no localStorage, no cookies, no analytics, no network request. The dev-tools Network panel stays empty after the page loads. Close the tab and everything is gone, by design. The “Copy statistics” button also makes only a local Clipboard API call, with no network traffic.

Because processing happens locally on your device, the operator of the playground does not receive or process your text as personal data in the sense of Art. 4 GDPR. Even sensitive prompts (code snippets, personal notes, legal text) stay in the tab. This architecture makes the playground suitable for enterprise environments with strict compliance requirements.
