The Token Tax on Indic Language Prompts

April 12, 2026 · 8 min read · AI Development
Research Report · April 2026

A comparative analysis of tokenization overhead across OpenAI, Anthropic, Google Gemini, and Sarvam AI — demonstrated using Hindi and Marathi prompts, with English as the baseline. The findings apply broadly to any Devanagari-script language on these platforms.

All 4 platforms confirmed · 3 languages tested · Devanagari script · English baseline = 100%

01 — Executive Summary

Tokenizer vocabulary budget determines token cost

On three of the four platforms tested, Hindi and Marathi prompts consume more tokens than the English prompt. The exception is Sarvam AI, where the Indic prompts cost fewer tokens than English. Across the dataset, the range runs from a 22% saving to a 156% overhead.

This difference is not a property of Hindi or Marathi being more complex languages. It is a consequence of tokenizer vocabulary allocation: how much of the tokenizer's vocabulary budget a platform has reserved for Devanagari script. A tokenizer trained natively on Indic languages encodes a full syllable cluster — such as "मॉड" — as a single token. A tokenizer without dedicated Indic vocabulary falls back to byte-level fragments, splitting the same syllable into three or four tokens. Same script, same characters, radically different token count — determined entirely by the tokenizer architecture, not the language.
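
To see why a byte-level fallback inflates counts, it helps to look at the raw size of a Devanagari syllable. The minimal Python sketch below uses only the standard library. The helper name `utf8_footprint` is hypothetical, and it measures code points and UTF-8 bytes, not the tokens of any particular platform.

```python
def utf8_footprint(text: str) -> tuple[int, int]:
    """Return (code points, UTF-8 bytes) for a string."""
    return len(text), len(text.encode("utf-8"))


syllable = "मॉड"   # three code points: म + ॉ + ड
ascii_text = "mod"  # three code points, one byte each

for label, sample in (("Devanagari", syllable), ("ASCII", ascii_text)):
    code_points, n_bytes = utf8_footprint(sample)
    print(f"{label}: {code_points} code points, {n_bytes} UTF-8 bytes")
```

Each Devanagari code point occupies three bytes in UTF-8, so the same three code points carry three times the byte load of ASCII text. A tokenizer with few learned merges for the script has more raw material to split, which is consistent with the fragment counts described above.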

0.78× · Sarvam AI · Hindi: Indic prompts are cheaper than English
1.30–1.44× · OpenAI & Anthropic · Hindi: mild overhead range on frontier labs
2.56× · Anthropic · Marathi: highest confirmed overhead in dataset

Headline finding: Sarvam AI produces fewer tokens for Hindi and Marathi than for English, which amounts to a token credit rather than a tax. On every other tested platform a tax does exist, ranging from +30% (OpenAI, Hindi) to +156% (Anthropic, Marathi). The token-tax claim holds for the three frontier labs tested (OpenAI, Anthropic, Google) but is disproved for purpose-built Indic AI.

Unexpected finding: OpenAI's GPT-5 series (including O1 and O3) tokenizes Marathi at 1.70×, compared to legacy tiktoken estimates of ~3.4×. The new model family ships dramatically improved multilingual vocabulary — architectural estimates from earlier tokenizer research are now significantly stale.

02 — Methodology

Test design

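Three short prompts about AI models, one per language, were submitted to each platform's official tokenizer tool or chat interface, and counts were recorded exactly as reported. The prompts are thematically related rather than literal translations of one another, so the multipliers describe these specific prompts. English serves as the 100% baseline per platform.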

Hindi
मल्टीमोडल मॉडल्स का क्या उपयोग है?
34 Unicode code points
Marathi
कोणते मॉडेल्स मराठीत ट्रेन केले गेलेले आहेत?
46 Unicode code points
English (baseline)
Which multimodal models are available in the market?
52 characters (ASCII)

| Platform | Tokenizer | Tool used | Data status |
| --- | --- | --- | --- |
| OpenAI | GPT-5 series / O1·O3 tokenizer | platform.openai.com/tokenizer | Confirmed |
| Anthropic Claude | Claude BPE tokenizer | Uploaded token_usage_report.pdf | Confirmed |
| Google Gemini | SentencePiece (multilingual) | gemini.google.com/app | Confirmed |
| Sarvam AI | Sarvam Indic-native tokenizer | dashboard.sarvam.ai/chat | Confirmed |

03 — Platform Reports

Token counts by platform

Each block shows confirmed token counts, cost multiplier relative to that platform's English baseline, and a proportional bar.

Sarvam AI
Indic-native tokenizer · 10 Indian languages in vocabulary · Status: Confirmed

| Prompt | Chars | Tokens | vs English |
| --- | --- | --- | --- |
| English | 52 | 9 | 1.00× (baseline) |
| Hindi | 34 | 7 | 0.78× (savings) |
| Marathi | 46 | 8 | 0.89× (savings) |

Devanagari syllables (akshara) are first-class vocabulary entries. Hindi and Marathi prompts are more compact than English on this platform — not a lower tax, an inverted one.
[Bar chart: token count relative to English]

OpenAI
GPT-5 series (including O1, O3) · Improved multilingual vocabulary · English baseline: 10 tokens · Status: Confirmed

| Prompt | Chars | Tokens | vs English |
| --- | --- | --- | --- |
| English | 52 | 10 | 1.00× (baseline) |
| Hindi | 34 | 13 | 1.30× (overhead) |
| Marathi | 46 | 17 | 1.70× (overhead) |

The English baseline is 10 tokens here (vs 9 on all other platforms), and multipliers are calculated against this platform's own English baseline. In absolute terms, OpenAI's Hindi count (13) matches Anthropic's, while its Marathi count (17) is significantly lower than Anthropic's (23).
[Bar chart: token count relative to English]

Anthropic Claude
BPE tokenizer · Source: uploaded token_usage_report.pdf · Status: Confirmed

| Prompt | Chars | Tokens | vs English |
| --- | --- | --- | --- |
| English | 52 | 9 | 1.00× (baseline) |
| Hindi | 34 | 13 | 1.44× (overhead) |
| Marathi | 46 | 23 | 2.56× (overhead) |

Second-lowest Hindi overhead among the frontier labs (+44%, behind OpenAI's +30%). Marathi overhead (+156%) is the highest confirmed value in this dataset, likely due in part to the longer agglutinative word-forms in the Marathi prompt.
[Bar chart: token count relative to English]

Google Gemini
SentencePiece multilingual tokenizer · Source: gemini.google.com/app · Status: Confirmed

| Prompt | Chars | Tokens | vs English |
| --- | --- | --- | --- |
| English | 52 | 9 | 1.00× (baseline) |
| Hindi | 34 | 18 | 2.00× (overhead) |
| Marathi | 46 | 16 | 1.78× (overhead) |

Marathi (16 tokens) is cheaper than Hindi (18 tokens) on Gemini, an inversion of the pattern seen on the other platforms. A plausible explanation is that Gemini's vocabulary covers the particular Marathi word-forms in this test better than the Hindi ones.
[Bar chart: token count relative to English]

04 — Consolidated Comparison

All platforms · all languages

Heatmap key: token savings · mild overhead (<1.5×) · moderate (1.5–2×) · high (2–3×)

| Platform | English tokens (baseline) | Hindi tokens | Hindi multiplier | Marathi tokens | Marathi multiplier |
| --- | --- | --- | --- | --- | --- |
| Sarvam AI | 9 | 7 | 0.78× ↓ | 8 | 0.89× ↓ |
| OpenAI | 10 (differs) | 13 | 1.30× ↑ | 17 | 1.70× ↑ |
| Anthropic Claude | 9 | 13 | 1.44× ↑ | 23 | 2.56× ↑ |
| Google Gemini | 9 | 18 | 2.00× ↑ | 16 | 1.78× ↑ |

OpenAI baseline note: OpenAI's tokenizer returns 10 tokens for the English prompt, while all other platforms return 9. Multipliers are calculated against each platform's own English baseline to measure the overhead each platform imposes on its own users.
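
The per-platform multipliers can be reproduced from the raw counts with a few lines of Python. This is a minimal sketch: the token counts are the ones reported in this article, while the names `TOKEN_COUNTS`, `multiplier`, and `multipliers_by_platform` are hypothetical helpers introduced for illustration.

```python
# Token counts as reported in this article: (English, Hindi, Marathi).
TOKEN_COUNTS = {
    "Sarvam AI": (9, 7, 8),
    "OpenAI": (10, 13, 17),
    "Anthropic Claude": (9, 13, 23),
    "Google Gemini": (9, 18, 16),
}


def multiplier(tokens: int, english_tokens: int) -> float:
    """Token cost relative to the same platform's English prompt."""
    return round(tokens / english_tokens, 2)


def multipliers_by_platform(counts=TOKEN_COUNTS) -> dict[str, dict[str, float]]:
    """Compute Hindi and Marathi multipliers against each platform's own baseline."""
    table = {}
    for platform, (english, hindi, marathi) in counts.items():
        table[platform] = {
            "hindi": multiplier(hindi, english),
            "marathi": multiplier(marathi, english),
        }
    return table


for platform, row in multipliers_by_platform().items():
    print(f"{platform:17s} Hindi {row['hindi']:.2f}x  Marathi {row['marathi']:.2f}x")
```

Dividing by each platform's own English count is what keeps OpenAI's 10-token baseline from distorting its multipliers. The trade-off is that multipliers are not directly comparable as absolute cost across platforms.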

Key findings from the consolidated data
Sarvam AI
The only platform where Indic prompts cost fewer tokens than English. Hindi at 0.78× and Marathi at 0.89× represent a net token saving — a function of its Indic-native vocabulary rather than a lower overhead rate.
OpenAI
Best-performing frontier platform for both languages. Hindi at 1.30× falls in the mild band (<1.5×), and Marathi at 1.70× falls in the moderate band (1.5–2×). Its English baseline of 10 tokens (vs 9 elsewhere) does not affect relative comparisons within the platform.
Anthropic Claude
Competitive for Hindi (1.44×, close to OpenAI) but the highest confirmed overhead for Marathi (2.56×). A bilingual app running on Anthropic pays over 2.5× the token cost for Marathi content compared to equivalent English — a significant context window and cost penalty.
Google Gemini
The only platform where Marathi (1.78×) tokenizes more efficiently than Hindi (2.00×) — an inversion of the expected pattern. Gemini is competitive for Marathi but the weakest frontier option for Hindi, costing twice the English token count.
[Bar chart: cost multiplier vs English baseline, all platforms]

05 — Analysis

Why the gaps exist

"The token tax on Devanagari is not a property of the language — it is a compiler flag that was never turned on."

Sarvam: vocabulary-native advantage

Sarvam trained its tokenizer natively on 10 Indian languages, giving Devanagari aksharas their own vocabulary slots. A full syllable like "मॉड" maps to one token, not three byte-level fragments. This is the difference between first-class language support and a retrofit.

OpenAI: strongest frontier performance on both languages

OpenAI returns 13 tokens for Hindi and 17 for Marathi against an English baseline of 10 — a 1.30× and 1.70× multiplier respectively. This is the lowest overhead of any frontier platform tested for both languages, and places OpenAI significantly ahead of Anthropic for Marathi (1.70× vs 2.56×) and modestly ahead for Hindi (1.30× vs 1.44×).

Anthropic: Hindi strong, Marathi costly

Claude's tokenizer handles Hindi competitively (1.44× vs OpenAI's 1.30×), but Marathi overhead climbs to 2.56×. The gap likely reflects training corpus distribution: Hindi has greater web representation in multilingual datasets than Marathi, meaning more vocabulary slots were allocated to it.

Gemini: the Hindi–Marathi inversion

Gemini returns 18 tokens for Hindi but only 16 for Marathi, despite Marathi having more Unicode code points. This inversion — confirmed by test — shows tokenization efficiency is prompt-specific. The specific Marathi word-forms tested happen to be better represented in Gemini's SentencePiece vocabulary than their Hindi counterparts.

06 — Conclusions & Routing Guide

What this means for your app

The token tax is real on every frontier Western platform — but it is not uniform, and it is not fixed. Platform choice is the single biggest variable under a developer's control.

Hindi-first workloads
Most economical: Sarvam AI, 0.78×
Best frontier model: OpenAI, 1.30×
Competitive alternative: Anthropic, 1.44×
Highest overhead: Gemini, 2.00×
Context window impact: −23% on OpenAI · −31% on Claude

Marathi-first workloads
Most economical: Sarvam AI, 0.89×
Best frontier model: OpenAI, 1.70×
Competitive alternative: Gemini, 1.78×
Highest overhead: Anthropic, 2.56×
Context window impact: −41% on OpenAI · −61% on Claude

Context window impact is the share of content capacity lost relative to English, computed as 1 − 1/multiplier.
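
One way to reproduce context-window figures like these is sketched below. It assumes the impact is defined as the share of content capacity lost in a fixed token window: if content costs `multiplier` times as many tokens, only 1/multiplier as much of it fits. The function name `context_loss_pct` and that definition are assumptions made for illustration.

```python
def context_loss_pct(multiplier: float) -> int:
    """Percent of effective context capacity lost at a given token multiplier.

    Assumes a fixed token window: content that costs `multiplier` times as
    many tokens fits only 1 / multiplier as much text.
    """
    return round((1 - 1 / multiplier) * 100)


# Multipliers built from the token counts reported in this article.
cases = (
    ("Hindi on OpenAI", 13 / 10),
    ("Hindi on Claude", 13 / 9),
    ("Marathi on OpenAI", 17 / 10),
    ("Marathi on Claude", 23 / 9),
)

for label, m in cases:
    print(f"{label}: -{context_loss_pct(m)}%")
```

Under this definition the four cases come out to roughly 23%, 31%, 41%, and 61%. Note that the loss is not the same as the overhead percentage: a +30% overhead removes about 23% of capacity, not 30%.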

Claim verdict — confirmed with data: The token tax is real on OpenAI (+30–70%), Anthropic (+44–156%), and Gemini (+78–100%). It is disproved for Sarvam AI, which shows a token credit of −11% to −22%. The tax is a function of tokenizer design, not of linguistic complexity. It can be engineered away.

Architecture implication: For a bilingual Hindi–Marathi app, OpenAI is the best frontier platform for both languages, but the runner-up differs: Anthropic for Hindi (1.44×) and Gemini for Marathi (1.78×). A language-aware routing layer at the API gateway can therefore cut Indic token counts. In this dataset, moving from the highest-overhead frontier platform to the lowest reduces the count by about 28% for Hindi (18 → 13 tokens) and 26% for Marathi (23 → 17 tokens). Model quality was not part of this comparison and should be evaluated separately.
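
A language-aware router of this kind could be sketched as follows. This is an illustration under stated assumptions, not a production design: the multipliers are the ones reported in this article, the function `pick_platform` is hypothetical, and the language code is assumed to be supplied by the caller, since Hindi and Marathi share the Devanagari script and cannot be told apart by script detection alone. The sketch also ignores per-token pricing and model quality.

```python
# Multipliers as reported in this article, keyed by ISO 639-1 language code.
MULTIPLIERS = {
    "OpenAI": {"hi": 1.30, "mr": 1.70},
    "Anthropic Claude": {"hi": 1.44, "mr": 2.56},
    "Google Gemini": {"hi": 2.00, "mr": 1.78},
}


def pick_platform(lang: str, available: set[str]) -> str:
    """Return the available platform with the lowest multiplier for `lang`."""
    candidates = {
        platform: by_lang[lang]
        for platform, by_lang in MULTIPLIERS.items()
        if platform in available and lang in by_lang
    }
    if not candidates:
        raise ValueError(f"no platform available for language {lang!r}")
    return min(candidates, key=candidates.get)


everything = set(MULTIPLIERS)
without_openai = everything - {"OpenAI"}

print(pick_platform("hi", everything))
print(pick_platform("mr", everything))
print(pick_platform("hi", without_openai))
print(pick_platform("mr", without_openai))
```

With all three platforms available, both languages route to OpenAI. When OpenAI is excluded, the fallback splits by language, which is the case where a routing layer adds something beyond a single default provider.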

Final ranking — confirmed platforms

| Rank | Platform | Hindi | Marathi | Assessment |
| --- | --- | --- | --- | --- |
| 1 | Sarvam AI | 0.78× | 0.89× | Token credit. Native Indic tokenizer. |
| 2 | OpenAI | 1.30× | 1.70× | Best frontier for both languages. GPT-5 improved significantly. |
| 3 | Anthropic Claude | 1.44× | 2.56× | Strong for Hindi. High Marathi overhead; avoid for Marathi-heavy apps. |
| 4 | Google Gemini | 2.00× | 1.78× | Highest Hindi overhead. Competitive for Marathi only. |

Sources: token_usage_report.pdf · platform.openai.com/tokenizer · gemini.google.com/app · dashboard.sarvam.ai · Token Tax Report v3 · April 2026