Let's Break a Myth First: Token ≠ Character Count
When most people first see "this model supports 128K tokens," instinct kicks in: 128K tokens = 128,000 characters? That's almost a novel's worth of text?
If that's your thinking, you've just stepped into one of the most common misconceptions in AI.
A token is not a character, not a word, not a sentence. It's a unit of segmentation that sits somewhere between characters and words, and exactly where the cuts fall depends on which tokenizer the model uses.
What Is a Token: Starting from a Single Letter
Let's start with English to build intuition.
For GPT-series models: "hello" is 1 token, "world" is 1 token, "tokenization" gets split into "token" + "ization" = 2 tokens.
Why? Because these models use BPE — Byte Pair Encoding.
BPE Algorithm: AI's "Compression Dictionary"
BPE's core idea is simple: find the most common character combinations in text, assign each one a number.
During training, the algorithm starts from individual characters and repeatedly merges the most frequent adjacent pairs:
- Round 1: "e" + "s" appears most frequently → merged into "es", assigned #256
- Round 2: "t" + "h" appears most frequently → merged into "th", assigned #257
- Round 3: "th" + "e" appears most frequently → merged into "the", assigned #258
- … continues until the vocabulary reaches its target size (typically 50K–100K tokens)
The result: common words become single tokens, rare words get split into multiple tokens.
This is why "tokenization" gets split: the full word isn't common enough, but "token" as a prefix is very common (preserved as one token), and "ization" is a common suffix (also preserved as one token).
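The merge loop described above can be sketched as a toy BPE trainer in pure Python. This is a minimal illustration, not a production tokenizer: the corpus, word frequencies, and helper names are invented for the example.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus and return the most
    frequent one. Each word is a tuple of current symbols (initially
    single characters); `words` maps word -> corpus frequency."""
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Rewrite every word, replacing each occurrence of `pair` with the
    concatenated symbol (e.g. ('t', 'h') -> 'th')."""
    a, b = pair
    merged = {}
    for word, freq in words.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Invented mini-corpus: word -> frequency
corpus = {tuple("the"): 5, tuple("then"): 2, tuple("these"): 1, tuple("this"): 2}
merges = []
for _ in range(2):
    pair = most_frequent_pair(corpus)
    merges.append(pair)
    corpus = merge_pair(corpus, pair)

print(merges)  # [('t', 'h'), ('th', 'e')]
```

On this corpus the algorithm learns "th" first and then "the", mirroring the merge rounds described above: frequent sequences collapse into single symbols, while rarer words like "these" stay split into more pieces.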
Why Does Chinese Cost More Tokens Than English?
This is something many people miss, but it matters a lot: communicating with AI in Chinese typically consumes 2–3x more tokens than English.
The reason lies in BPE training data. Major models (GPT-4, Claude, Gemini) are trained primarily on English text, so the English vocabulary is highly optimized — common English words often map to a single token.
Chinese? Each character typically maps to 1–2 tokens (varies by model), and many Chinese words lack corresponding merge rules, causing them to split into more fragments.
Real-world test data:
- "Hello, how are you?" → approximately 5 tokens
- "你好,你怎么样?" → approximately 10–14 tokens
Same meaning, Chinese consumes 2–3x the tokens of English.
For SmallFireDragon Lab, this difference is very real. Our science articles are written in Chinese — when translated into English, the English version actually uses fewer tokens. That's why using English prompts is sometimes a practical cost-optimization strategy when calling AI APIs.
Practical Token Estimation Rules
Without precise tools, these rules of thumb work well:
| Language | Estimation Rule | Example |
|---|---|---|
| English | 1 token ≈ 4 characters / ~0.75 words | "Hello world" = 2 tokens |
| Chinese | 1 character ≈ 1–2 tokens | "你好世界" ≈ 4–8 tokens |
| Code | 1 token ≈ 3–4 characters (lots of punctuation) | Code is usually more expensive than prose |
| Numbers | Digits are split into short chunks (often 1–3 digits per token, varying by tokenizer) | "1234567" ≈ 3–7 tokens |
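The table's rules of thumb are easy to turn into a quick estimator. The sketch below is a hypothetical helper built purely from those heuristics (the function name and per-category rates are assumptions, and digits are counted conservatively at one token each), not a real tokenizer:

```python
def estimate_tokens(text):
    """Rough token estimate from rules of thumb:
    - CJK characters: ~1.5 tokens each
    - digits: 1 token each (conservative upper end)
    - everything else: ~4 characters per token"""
    cjk = sum(1 for ch in text if '\u4e00' <= ch <= '\u9fff')
    digits = sum(ch.isdigit() for ch in text)
    other = len(text) - cjk - digits
    return round(cjk * 1.5 + digits + other / 4)

print(estimate_tokens("Hello world"))  # 3
print(estimate_tokens("你好世界"))      # 6
print(estimate_tokens("1234567"))      # 7
```

Good enough for budgeting a prompt; for billing-accurate numbers, use the precise tools covered below in "How to Precisely Count Tokens".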
Real-World Impact of Token Limits
Understanding tokens matters because it directly affects two things about how you use AI:
1. Context Capacity Limits
Every model's Context Window is measured in tokens. Claude 3.5 Sonnet's 200K token window translates to roughly 100,000–150,000 Chinese characters. Sounds like a lot — but once you start pasting long documents, codebases, and conversation history, you hit the ceiling fast.
When you hit the ceiling, the model starts "forgetting" the earliest content — which brings us back to the memory limitations discussed in our last article.
2. API Call Costs
Commercial AI APIs charge by token, split between input tokens and output tokens (output is typically more expensive).
For Claude 3.5 Sonnet:
- Input: approximately $3 / million tokens
- Output: approximately $15 / million tokens
Processing a 10,000-character Chinese document consumes roughly 15,000–20,000 input tokens, costing about $0.045–0.06. That sounds small — but running hundreds of calls daily, costs add up quickly.
One lesson from SmallFireDragon Lab: compressing your System Prompt is more impactful than compressing user input. The System Prompt is attached to every request — a 1,000-token System Prompt at 100 calls/day wastes 100,000 input tokens daily.
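Putting the prices and the System Prompt lesson together, a back-of-the-envelope cost calculator might look like this. The function name, the 1,000-token reply size, and the call volume are assumptions for illustration; the prices are the Claude 3.5 Sonnet figures quoted above:

```python
def api_cost_usd(input_tokens, output_tokens,
                 input_price_per_m=3.0, output_price_per_m=15.0):
    """Cost of one API call in USD, given per-million-token prices
    (defaults: the Claude 3.5 Sonnet prices quoted above)."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# A 10,000-character Chinese document (~15,000 input tokens)
# plus an assumed 1,000-token reply:
print(f"per call: ${api_cost_usd(15_000, 1_000):.3f}")   # $0.060

# The System Prompt overhead: a 1,000-token prompt attached
# to 100 calls per day costs this much in input tokens alone:
print(f"prompt overhead/day: ${api_cost_usd(1_000, 0) * 100:.3f}")
```

Note how the output side dominates the per-call cost even at 1/15 the token count, which is why trimming verbose model replies often saves more than trimming input.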
How to Precisely Count Tokens
Useful tools:
- OpenAI Tokenizer (platform.openai.com/tokenizer): Free, visualizes tokenization for GPT-series models
- tiktoken (Python library): Open-sourced by OpenAI, precise token counting in code
- Anthropic Console: Claude's API playground, shows actual tokens consumed
- Rule-of-thumb estimates: English 1 token ≈ 4 characters; Chinese 1 character ≈ 1.5 tokens
An Often-Overlooked Detail: Special Tokens
Beyond regular text, tokenizers also have special tokens:
- <|system|>, <|user|>, <|assistant|>: mark conversation roles
- <|endoftext|>: text end marker
- <|pad|>: padding for alignment
These special tokens count toward the total token count too. So the tokens you actually use are typically a bit more than you'd expect.
Three Core Insights About Tokens
- Token ≠ character ≠ word: It's a semantic unit cut by BPE — common sequences are short, rare ones are long
- Chinese costs more than English: Same content in Chinese consumes ~2–3x more tokens than English — factor this into API cost optimization
- Tokens determine two things: How much context you can fit (capacity) + how much each call costs (cost)
Once you truly understand tokens, you can read AI product specs with real comprehension: whether that 128K context window is enough for your use case is just a quick calculation away.