Tokenization, Cost, and Linguistic Inequality

(In a hurry? Visit https://ml-workbench.onrender.com/)

Several weeks ago, I started evaluating multilingual LLMs for a project. I expected to spend time comparing deployable models, price tables, and context windows and ended up digging into how tokenization behaves across natural languages.

I think tokenization sits in an awkward spot in most product conversations. It's low-level enough to get ignored, but it quietly rewrites cost, latency, and context-window assumptions for any product that's planning to serve more multilingual users as LLM usage grows (70.2% of Claude.ai usage in 2025).

I'd like to share a framework for evaluating multilingual tokenizers, and also show how tokenizers help explain:

  • Why the same feature can have very different economics across languages
  • Why some model families feel much more multilingual than others before you ever run a task benchmark
  • Why "best tokenizer" and "best deployed option" are not the same question
  • Why controlled tests (strict aligned evidence) and real-life observations (naturalistic exploratory evidence) should not be mixed together
diagram of an llm

Decoding Text by Language

Understanding tokenization starts with how computers actually "see" text as a string of bytes. Even though we use a universal standard called UTF-8, it doesn't treat every language the same. Some language scripts take up much more digital room than others to say the same thing. While English is quite compact, languages like Arabic, Chinese, or Japanese have a much larger byte footprint from the very start. So, before the tokenizer even begins its work, different languages are already arriving with completely different levels of bulk. Grammar makes it even trickier to compare. Some languages pack a ton of information into just one word or a short string of characters, while others spread it out. For example:

The equivalent of 'The sun rises in the east.' (6 words) in Hindi is 'सूरज पूर्व में उगता है' (5 words).

This means that even when two phrases have the "same meaning," they can look totally different to a computer, making it really hard to treat them as equals.

Tokenization

Instead of seeing a string of bytes, LLMs see tokens, which are learned symbols representing everything from full words to small fragments of script. Most tokenizers used in modern LLMs are subword tokenizers. They try to balance compression and flexibility by assigning tokens to the most frequent reusable chunks of text. When that works well, tokens line up reasonably closely with meaningful units. When it works poorly, text gets broken into many smaller and less useful pieces.

For example, the tokenizer o200k_base (used in ChatGPT 5.x) breaks the same sentence into very different pieces depending on the language:

English7 tokens

The sun rises in the east.

The sun rises in the east .
Hindi9 tokens (1.29x more tokens than the English equivalent)

सूरज पूर्व में उगता है

ूरपूर्वमेंताहै

To understand why models struggle more with some languages than others, you have to look at the relationship between tokenizer design and training data. Tokenizers are usually built from the same broad data distribution approach used to train the model. That means high-resource languages like English tend to get better vocabulary coverage, while lower-resource languages like Hindi, Bengali, Navajo and Swahili are more likely to be represented by smaller, less efficient fragments.

That imbalance matters during both model training and inference. In English, tokens often capture whole words or large subword units. In Hindi, Arabic, or other under-covered languages, tokens are more likely to capture partial pieces. Ten English tokens can represent much more usable meaning than ten Hindi tokens.

This fragmentation creates several downstream consequences:

  • more tokens means more processing cost
  • more tokens means less usable context
  • more tokens means more opportunities for sentence generation to drift token by token
  • more fragmentation usually means weaker practical coverage for that script or language

ML Workbench

I built ML Workbench to explore how multilingual tokenizer families behave across a fixed set of languages and how those differences carry through to real model pricing. My goal is not to answer every multilingual question at once, rather to keep one claim clean: when the same meaning is aligned, how differently do tokenizer families behave, and what happens when those differences hit real model pricing and context limits?

Open-source repository: You can clone it and run benchmarks locally: https://github.com/707/ml-workbench

Interface: In addition to the code for running the benchmarks, I also made a live interface: https://ml-workbench.onrender.com/

The workbench combines a few different views of 10 natural languages in 3 LLM model families (6 tokenizers in total).

The app has four main views:

  • Benchmark measures tokenization behavior under either strict aligned evidence or exploratory streaming text.
  • Catalog maps tokenizer families to deployable models.
  • Scenario Lab translates tokenizer-led inflation into production-facing inference metrics like cost and context loss.
  • Audit which records formulas, sources, and provenance rules.

Datasets and Methodology

The workbench uses two benchmark lanes, and I think keeping them separate is important.

  1. Strict Evidence uses a snapshot of FLORES 200 dataset, a multilingual benchmark built by Meta from aligned translations. Because sentence meanings are aligned across languages, it is the safest option to compute Relative Token Cost and compare tokenizer families. Scenario Lab is also tied to this lane.
  2. Streaming Exploration uses FineWeb2 dataset derived from CommonCrawl. This exists because real user written text is messier and I wanted to compare how they perform against stricter data. This lane is commonly used as a diagnostic lens to help production teams with heuristic signals on a model.

The tool also maps tokenizer evidence to deployable models through OpenRouter's live model metadata, which lets the workbench map tokenizer behavior to real-world-pricing and model performance.

Let's use the workbench to evaluate languages across the GPT-2 legacy, OpenAI o200k, Llama 3 and Qwen 2.5 family of tokenizers using the strict evidence approach.

Relative Token Cost (vs English)

Relative Token Cost, or RTC, is the clearest metric in the workbench.

It is simply:

RTC = source_tokens / english_tokens
RTC

The lower the better. A value of 1.0x means the language takes the same number of tokens as aligned English. A value of 2.0x means it needs twice as many tokens to say the same thing.

Hindi is the harshest case with gpt2 landing at 7.13x, qwen-2.5 at 4.5x, llama-3 at 2.5x, and o200k_base at 1.6x. That is a wide spread for equivalent meaning, and it immediately tells you that tokenizer choice can dominate multilingual cost before model selection even starts.

Chinese Mandarin is where the Qwen result becomes striking, qwen-2.5 comes in at 1.0227x, very close to English while o200k_base, which was released by OpenAI in the same year lags behind at 1.23x. This fits what you would expect from a Chinese tokenizer with stronger Chinese coverage

Efficiency (Bytes per Token)

Bytes per token is the next most useful metric, but it is easier to misread. It asks how much raw UTF-8 text gets compressed into each token:

bytes_per_token = total_bytes / total_tokens
BPT

Higher is usually better. If a tokenizer packs more bytes into each token, it is usually compressing that language more efficiently. But this metric requires caution because high byte-per-character scripts, such as Hindi, can artificially inflate efficiency scores without actually being well aligned to meaningful units. That is why I treat bytes per token as a companion to RTC, not as a replacement for it.

Coverage (Unique Tokens)

Coverage is one of the more revealing metrics in the app, especially when read cautiously.

coverage

The workbench computes unique_tokens as the number of distinct token IDs used when encoding the selected benchmark text. Higher values often suggest broader practical coverage of that language's script or writing patterns.

Japanese and Chinese are good examples of taking caution. gpt2 shows the highest unique-token count for both languages but it does not win on RTC. qwen-2.5 is still the most efficient tokenizer for both. So treat coverage as informative, but it is not a universal proxy for overall tokenization quality.

Fragmentation Signals

The next set of metrics are best read together.

fertility

The word split rate view asks a simple question: how often does the tokenizer break a natural unit into multiple pieces? This is surfaced through continued_word_rate, and lower is generally better.

Subword fertility asks the related question: how many token pieces do I need per word or character on average? A value close to 1.0 is a good sign. Higher values mean the text is being chopped up more aggressively.

In the exploratory lane, GPT-2 legacy becomes the obvious outlier on fertility, reporting a range of 1.33 to 7.93 with a much wider spread than the newer families. These are the kinds of relationships I was hoping the dual-lane approach would surface. The strict lane tells me what is safely true under aligned conditions. The streaming lane tells me whether that same rough pattern still shows up once the text gets messier. In this case, it does.

observed

The Observed Composition tab helps explain why. It shows which scripts the tokenizer pieces actually came from on the selected rows. I use it less as a ranking view and more as a diagnostic sanity check. If a tokenizer looks good on a headline metric but the observed composition suggests weak or awkward script-specific coverage, that is worth paying attention to.

Scenario Lab for Models

I wanted a Scenario Lab to go beyond simple tokenizer tests. Focusing only on a model’s 'taste' ignores the structural costs of scaling that model across real users. Starting with tokenizer evidence gives a clearer picture of where multilingual pressure enters the system.

For this scenario, I used:

  • languages: English, Arabic, Hindi, Japanese
  • benchmark tokenizers: Llama 3 family, Mistral family, Qwen 2.5 family
  • monthly requests: 100,000
  • average input tokens: 600
  • average output tokens: 250
  • reasoning share: 0.1

This is where things started to get more nuanced.

lab

The first chart makes the main deployment lesson obvious: the best tokenizer is not always the cheapest deployed option. In this slice, Qwen 2.5 often looks strong at the tokenizer level, but it ends up with the highest average monthly cost. Hindi is the main reason. It contributes roughly $45 per month on its own at about 4.5 RTC, which pushes the family average sharply upward. Why does optimising for this matter? Because Hindi is the third most spoken language 609 million.

lab

The second chart shows the same story from the context side. Hindi is again the clearest stress case, with context loss approaching 78% in the worst case shown here. Arabic (335M speakers) is the next important language to watch: not as severe as Hindi, but still meaningfully more expensive and more context-hungry than English or Japanese.

That is the decision trap I wanted the tool to surface. "Best tokenizer" and "best model to ship" are often different questions. If your product depends on Hindi or Arabic, the economics and context fit of the deployed model can change materially even when the underlying task is the same.

Closing

If I had to turn this into a practical workflow for model selection, it would look like this:

  1. Benchmark tokenizer families on aligned text
  2. Inspect fragmentation when a chart looks suspicious
  3. Sanity-check the pattern on more naturalistic text
  4. Map tokenizer families to deployable models
  5. Translate token inflation into cost and context consequences
  6. Then argue about which model to ship

I still think the most useful part of the workbench is that it makes a boring but consequential layer legible. Tokenization is easy to ignore right up until it starts quietly rewriting your multilingual cost model.

There are still several open questions I find interesting:

  • which tokenization metric matters most for downstream multilingual quality
  • how much tokenizer coverage and efficiency can be improved at the same time
  • whether tokenizer metrics alone can predict enough multilingual capability to make model screening faster

I hope you've found the ml-workbench as useful as I did. I’ve tested ml-workbench quite a bit, but to get this right, the tool needs more eyes and more testing across all kinds of different languages and use cases so jump in and help out if you can! Whether you’re reporting a bug, suggesting a new metric, or adding a tokenizer I missed, your input makes the whole thing better for everyone.

Also, a quick note to the AI labs: if you’re working on a private model but are okay with sharing your tokenizer's performance data, please get in touch. The rest of us would learn a lot from seeing how the biggest systems actually handle different languages.

Appendix:

  1. Tokenization fragmentation framing was sharpened by papers like Tokenization Disparities as Infrastructure Bias and Reducing Tokenization Premiums for Low-Resource Languages in LLMs. Both are useful because they treat tokenization not as trivia, but as a system-level input into fairness, access, cost, and quality.