The Token Tax on Indic Language Prompts


April 12, 2026 · 8 min read · AI Development

Research Report · April 2026

A comparative analysis of tokenization overhead across OpenAI, Anthropic, Google Gemini, and Sarvam AI — demonstrated using Hindi and Marathi prompts, with English as the baseline. The findings apply broadly to any Devanagari-script language on these platforms.

All 4 platforms confirmed · 3 languages tested · Devanagari script · English baseline = 100%

01 — Executive Summary

Tokenizer vocabulary budget determines token cost

On three of the four platforms tested, Hindi and Marathi prompts consume more tokens than the English prompt. The exception is Sarvam AI, where Indic prompts cost fewer tokens than English. Multipliers range from 0.78× (a 22% saving) to 2.56× (a 156% overhead).

This difference is not a property of Hindi or Marathi being more complex languages. It is a consequence of tokenizer vocabulary allocation: how much of its vocabulary budget a platform's tokenizer reserves for Devanagari script. A tokenizer trained natively on Indic languages can encode a full syllable cluster, such as "मॉड", as a single token. A tokenizer without dedicated Indic vocabulary falls back to byte-level fragments and splits the same cluster into three or four tokens. Same script, same characters, radically different token count: the difference comes from the tokenizer's vocabulary, not from the language.
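To make the byte-level fallback concrete, the short Python sketch below inspects the example cluster using only the standard library. It does not run any platform's tokenizer; it only shows the raw material a byte-level tokenizer starts from. How many of these bytes end up merged into a single token depends on each tokenizer's learned vocabulary, which this sketch does not model.

```python
# Inspect the example cluster from the text; no tokenizer is involved here.
cluster = "मॉड"

# Unicode code points that make up the cluster.
code_points = [f"U+{ord(ch):04X}" for ch in cluster]

# UTF-8 bytes: the units a byte-level fallback would start from.
utf8_bytes = cluster.encode("utf-8")

# ASCII text for comparison: one byte per character.
ascii_bytes = "mod".encode("utf-8")

print("code points:", len(code_points), code_points)
print("UTF-8 bytes:", len(utf8_bytes))
print("ASCII bytes for 'mod':", len(ascii_bytes))
```

Each Devanagari code point occupies three bytes in UTF-8, so a tokenizer with no Devanagari merges has several times more units to cover than it would for ASCII text of similar length.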

0.78×

Sarvam AI · Hindi
Indic prompts are cheaper than English

1.30–1.44×

OpenAI & Anthropic · Hindi
Mild overhead range on frontier labs

2.56×

Anthropic · Marathi
Highest confirmed overhead in dataset

Headline finding: Sarvam AI produces fewer tokens for Hindi and Marathi than for English: a token credit, not a tax. On every other tested platform a tax does exist, ranging from +30% (OpenAI, Hindi) to +156% (Anthropic, Marathi). The token-tax claim therefore holds for the three frontier Western labs tested, but is disproved for a purpose-built Indic tokenizer.

Unexpected finding: OpenAI's GPT-5 series tokenizer (including O1 and O3) tokenizes the Marathi prompt at 1.70×, compared to legacy tiktoken estimates of ~3.4×. The newer tokenizer appears to carry a substantially improved multilingual vocabulary, so estimates based on earlier tokenizer research are now significantly out of date.

02 — Methodology

Test design

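Three short prompts on the same topic, one per language, were submitted to each platform's official tokenizer tool or chat interface, and token counts were recorded exactly as reported. The prompts are comparable in length and subject but are not literal translations of one another (the Hindi prompt asks what multimodal models are used for, and the Marathi prompt asks which models have been trained in Marathi), so each multiplier reflects the specific wording as well as the tokenizer. English serves as the 100% baseline on each platform.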

Hindi

मल्टीमोडल मॉडल्स का क्या उपयोग है?

34 Unicode code points

Marathi

कोणते मॉडेल्स मराठीत ट्रेन केले गेलेले आहेत?

46 Unicode code points

English (baseline)

Which multimodal models are available in the market?

52 characters (ASCII)

Platform	Tokenizer	Tool used	Data status
OpenAI	GPT-5 series / O1·O3 tokenizer	platform.openai.com/tokenizer	Confirmed
Anthropic Claude	Claude BPE tokenizer	Uploaded token_usage_report.pdf	Confirmed
Google Gemini	SentencePiece (multilingual)	gemini.google.com/app	Confirmed
Sarvam AI	Sarvam Indic-native tokenizer	dashboard.sarvam.ai/chat	Confirmed

03 — Platform Reports

Token counts by platform

Each block shows confirmed token counts, cost multiplier relative to that platform's English baseline, and a proportional bar.

Sarvam AI

Indic-native tokenizer · 10 Indian languages in vocabulary

Confirmed

Prompt	Chars	Tokens	vs English
English	52	9	1.00× baseline
Hindi	34	7	0.78× savings
Marathi	46	8	0.89× savings

Devanagari syllables (akshara) are first-class vocabulary entries. Hindi and Marathi prompts are more compact than English on this platform — not a lower tax, an inverted one.

Chart: token count relative to English (Sarvam AI)

OpenAI

GPT-5 series (including O1, O3) · Improved multilingual vocabulary · English baseline: 10 tokens

Confirmed

Prompt	Chars	Tokens	vs English
English	52	10	1.00× baseline
Hindi	34	13	1.30× overhead
Marathi	46	17	1.70× overhead

English baseline is 10 tokens (vs 9 on all other platforms). Multipliers calculated against this platform's own English baseline. In absolute terms, OpenAI's Hindi count (13) matches Anthropic's, while Marathi (17) is significantly lower than Anthropic's (23).

Chart: token count relative to English (OpenAI)

Anthropic Claude

BPE tokenizer · Source: uploaded token_usage_report.pdf

Confirmed

Prompt	Chars	Tokens	vs English
English	52	9	1.00× baseline
Hindi	34	13	1.44× overhead
Marathi	46	23	2.56× overhead

Second-best among frontier labs for Hindi (+44%), behind OpenAI (+30%). Marathi overhead (+156%) is the highest confirmed value in this dataset, possibly because the Marathi prompt contains longer inflected word-forms that the vocabulary does not cover well.

Chart: token count relative to English (Anthropic Claude)

Google Gemini

SentencePiece multilingual tokenizer · Source: gemini.google.com/app

Confirmed

Prompt	Chars	Tokens	vs English
English	52	9	1.00× baseline
Marathi	46	16	1.78× overhead
Hindi	34	18	2.00× overhead

Marathi (16 tokens) is cheaper than Hindi (18 tokens) on Gemini, an inversion of the pattern seen on the other platforms. Gemini's vocabulary appears to cover the particular Marathi word-forms in this test better than the Hindi ones.

Chart: token count relative to English (Google Gemini)

04 — Consolidated Comparison

All platforms · all languages

Heatmap key: Token savings · Mild overhead (<1.5×) · Moderate (1.5–2×) · High (2–3×)

Platform	English (baseline)	Hindi tokens	Hindi multiplier	Marathi tokens	Marathi multiplier
Sarvam AI	9	7	0.78× ↓	8	0.89× ↓
OpenAI	10 (differs)	13	1.30× ↑	17	1.70× ↑
Anthropic Claude	9	13	1.44× ↑	23	2.56× ↑
Google Gemini	9	18	2.00× ↑	16	1.78× ↑

OpenAI baseline note: OpenAI's tokenizer returns 10 tokens for the English prompt, while all other platforms return 9. Multipliers are calculated against each platform's own English baseline to measure the overhead each platform imposes on its own users.
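The per-platform baseline can be reproduced with a few lines of Python. The sketch below uses the token counts from the table above; the function names and the band boundaries are illustrative assumptions. In particular, treating exactly 2.00× as "High" is a choice made here, since the heatmap key does not say which band a boundary value belongs to.

```python
# Token counts copied from the consolidated table above.
TOKEN_COUNTS = {
    "Sarvam AI": {"english": 9, "hindi": 7, "marathi": 8},
    "OpenAI": {"english": 10, "hindi": 13, "marathi": 17},
    "Anthropic Claude": {"english": 9, "hindi": 13, "marathi": 23},
    "Google Gemini": {"english": 9, "hindi": 18, "marathi": 16},
}


def multiplier(tokens: int, english_tokens: int) -> float:
    """Token count relative to the same platform's English baseline."""
    return tokens / english_tokens


def band(m: float) -> str:
    """Map a multiplier to a heatmap band.

    Assumption: upper bounds are exclusive, so 2.00 counts as "High".
    Anything at or above 2.00 is reported as "High" in this sketch.
    """
    if m < 1.0:
        return "Token savings"
    if m < 1.5:
        return "Mild overhead"
    if m < 2.0:
        return "Moderate"
    return "High"


for platform, counts in TOKEN_COUNTS.items():
    baseline = counts["english"]  # each platform is compared to its own baseline
    for language in ("hindi", "marathi"):
        m = multiplier(counts[language], baseline)
        print(f"{platform:17} {language:8} {m:.2f}x  {band(m)}")
```

Dividing by each platform's own English count is what keeps OpenAI's 10-token baseline from distorting its multipliers.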

Key findings from the consolidated data

Sarvam AI

The only platform where Indic prompts cost fewer tokens than English. Hindi at 0.78× and Marathi at 0.89× represent a net token saving — a function of its Indic-native vocabulary rather than a lower overhead rate.

OpenAI

Best-performing frontier platform for both languages. Hindi stays in the mild band at 1.30×, and Marathi sits in the moderate band at 1.70×, below the 2× high-overhead threshold. Its English baseline of 10 tokens (vs 9 elsewhere) does not affect relative comparisons within the platform.

Anthropic Claude

Competitive for Hindi (1.44×, close to OpenAI) but the highest confirmed overhead for Marathi (2.56×). A bilingual app running on Anthropic pays over 2.5× the token cost for Marathi content compared to equivalent English — a significant context window and cost penalty.

Google Gemini

The only platform where Marathi (1.78×) tokenizes more efficiently than Hindi (2.00×) — an inversion of the expected pattern. Gemini is competitive for Marathi but the weakest frontier option for Hindi, costing twice the English token count.
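To translate a multiplier such as Anthropic's 2.56× for Marathi into a budget figure, a rough projection can be sketched as follows. The monthly volume and the price per million tokens are hypothetical placeholders, not published rates, and the function name is illustrative. The comparison is within one platform, because each multiplier is relative to that platform's own English baseline.

```python
def projected_cost(english_equivalent_tokens: int,
                   multiplier: float,
                   price_per_million: float) -> float:
    """Estimate spend for content that would need `english_equivalent_tokens` in English."""
    actual_tokens = english_equivalent_tokens * multiplier
    return actual_tokens * price_per_million / 1_000_000


# Hypothetical workload: 50M English-equivalent tokens per month
# at a placeholder price of 1.00 per million tokens.
VOLUME = 50_000_000
PRICE = 1.00

english_cost = projected_cost(VOLUME, 1.00, PRICE)
marathi_cost = projected_cost(VOLUME, 2.56, PRICE)  # Anthropic Marathi multiplier from this report

print(f"English: {english_cost:.2f}")
print(f"Marathi: {marathi_cost:.2f}")
```

Because the multiplier scales the token count directly, the same factor applies to any per-token price on that platform.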

Chart: cost multiplier vs English baseline, all platforms

05 — Analysis

Why the gaps exist

"The token tax on Devanagari is not a property of the language — it is a compiler flag that was never turned on."

Sarvam: vocabulary-native advantage

Sarvam trained its tokenizer natively on 10 Indian languages, giving Devanagari aksharas their own vocabulary slots. A full syllable like "मॉड" maps to one token, not three or four byte-level fragments. This is the difference between first-class language support and a retrofit.

OpenAI: strongest frontier performance on both languages

OpenAI returns 13 tokens for Hindi and 17 for Marathi against an English baseline of 10 — a 1.30× and 1.70× multiplier respectively. This is the lowest overhead of any frontier platform tested for both languages, and places OpenAI significantly ahead of Anthropic for Marathi (1.70× vs 2.56×) and modestly ahead for Hindi (1.30× vs 1.44×).

Anthropic: Hindi strong, Marathi costly

Claude's tokenizer handles Hindi competitively (1.44× vs OpenAI's 1.30×), but Marathi overhead climbs to 2.56×. The gap likely reflects training corpus distribution: Hindi has greater web representation in multilingual datasets than Marathi, so more vocabulary slots were probably allocated to it.

Gemini: the Hindi–Marathi inversion

Gemini returns 18 tokens for Hindi but only 16 for Marathi, despite the Marathi prompt having more Unicode code points. This inversion, confirmed by test, suggests that tokenization efficiency is prompt-specific. The specific Marathi word-forms tested appear to be better represented in Gemini's SentencePiece vocabulary than their Hindi counterparts.

06 — Conclusions & Routing Guide

What this means for your app

The token tax is real on every frontier Western platform tested, but it is not uniform, and it is not fixed. Platform choice is the single biggest variable under a developer's control.

Hindi-first workloads

Most economical: Sarvam AI (0.78×)

Best frontier model: OpenAI (1.30×)

Competitive alternative: Anthropic (1.44×)

Highest overhead: Gemini (2.00×)

Context window impact: −23% on OpenAI · −31% on Claude

Marathi-first workloads

Most economical: Sarvam AI (0.89×)

Best frontier model: OpenAI (1.70×)

Competitive alternative: Gemini (1.78×)

Highest overhead: Anthropic (2.56×)

Context window impact: −41% on OpenAI · −61% on Claude
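One way to estimate context window impact from token counts: if a prompt needs m times as many tokens as its English counterpart, a fixed window holds 1/m as much content. The sketch below applies that arithmetic to the counts reported earlier; the function name is illustrative.

```python
def capacity_change_pct(indic_tokens: int, english_tokens: int) -> float:
    """Percent change in content that fits in a fixed window, relative to English."""
    return (english_tokens / indic_tokens - 1.0) * 100.0


# (platform, language) -> (Indic tokens, English tokens), from the platform reports.
COUNTS = {
    ("OpenAI", "Hindi"): (13, 10),
    ("Anthropic Claude", "Hindi"): (13, 9),
    ("OpenAI", "Marathi"): (17, 10),
    ("Anthropic Claude", "Marathi"): (23, 9),
}

for (platform, language), (indic, english) in COUNTS.items():
    change = capacity_change_pct(indic, english)
    print(f"{platform} {language}: {change:+.0f}%")
```

Working from raw counts rather than rounded multipliers avoids compounding rounding error in the percentages.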

Claim verdict — confirmed with data: The token tax is real on OpenAI (+30–70%), Anthropic (+44–156%), and Gemini (+78–100%). It is disproved for Sarvam AI, which shows a token credit of −11% to −22%. The tax is a function of tokenizer design, not of linguistic complexity. It can be engineered away.

Architecture implication: For a bilingual Hindi–Marathi app, OpenAI is the best frontier platform for both languages, but the runner-up differs: Gemini is the closer second for Marathi, while Anthropic is closer for Hindi. A language-aware routing layer at the API gateway level can reduce Indic token spend by an estimated 20–40% on frontier models, relative to sending all traffic to the highest-overhead platform for each language, provided model quality is comparable for the task.
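A language-aware routing layer of this kind could be sketched as follows. This is a minimal illustration, not a production gateway: the function name, the `unavailable` parameter, and the default platform are hypothetical, and the caller is assumed to supply the language tag, because Hindi and Marathi share the Devanagari script and cannot be told apart by script detection alone. A real router would also weigh per-token price and model quality, which this sketch ignores.

```python
from typing import Optional, Set

# Multipliers for the three frontier platforms, from this report's tables.
FRONTIER_MULTIPLIERS = {
    "hindi": {"OpenAI": 1.30, "Anthropic Claude": 1.44, "Google Gemini": 2.00},
    "marathi": {"OpenAI": 1.70, "Google Gemini": 1.78, "Anthropic Claude": 2.56},
}

# Hypothetical fallback for languages that are not in the table.
DEFAULT_PLATFORM = "OpenAI"


def route(language: str, unavailable: Optional[Set[str]] = None) -> str:
    """Pick the lowest-multiplier platform for a language, skipping unavailable ones."""
    unavailable = unavailable or set()
    candidates = FRONTIER_MULTIPLIERS.get(language.lower())
    if candidates is None:
        return DEFAULT_PLATFORM
    usable = {name: m for name, m in candidates.items() if name not in unavailable}
    if not usable:
        raise RuntimeError(f"no platform available for {language!r}")
    return min(usable, key=usable.get)


print(route("hindi"))
print(route("marathi", unavailable={"OpenAI"}))
```

Keeping the multipliers in a plain lookup table means the routing decision can be updated whenever token counts are re-measured, without touching the routing logic.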

Final ranking — confirmed platforms

Rank	Platform	Hindi	Marathi	Assessment
1	Sarvam AI	0.78×	0.89×	Token credit. Native Indic tokenizer.
2	OpenAI	1.30×	1.70×	Best frontier for both languages. GPT-5 improved significantly.
3	Anthropic Claude	1.44×	2.56×	Strong for Hindi. High Marathi overhead — avoid for Marathi-heavy apps.
4	Google Gemini	2.00×	1.78×	Highest Hindi overhead. Competitive for Marathi only.

Sources: token_usage_report.pdf · platform.openai.com/tokenizer · gemini.google.com/app · dashboard.sarvam.ai · Token Tax Report v3 · April 2026