Let's Break a Myth First: Token ≠ Character Count
When most people first see "this model supports 128K tokens," instinct kicks in: 128K tokens = 128,000 characters? That's almost a novel's worth of text?
If that's your thinking, you've just stepped into one of the most common misconceptions in AI.
A token is not a character, not a word, not a sentence. It's a unit of segmentation that sits somewhere between characters and words, and exactly where the cuts fall depends on which tokenizer the model uses.
What Is a Token: Starting from a Single Letter
Let's start with English to build intuition.
For GPT-series models: "hello" is 1 token, "world" is 1 token, "tokenization" gets split into "token" + "ization" = 2 tokens.
Why? Because these models use BPE — Byte Pair Encoding.
BPE Algorithm: AI's "Compression Dictionary"
BPE's core idea is simple: find the most common character combinations in text, assign each one a number.
During training, the algorithm starts from individual characters and repeatedly merges the most frequent adjacent pairs:
- Round 1: "e" + "s" appears most frequently → merged into "es", assigned #256
- Round 2: "t" + "h" appears most frequently → merged into "th", assigned #257
- Round 3: "th" + "e" appears most frequently → merged into "the", assigned #258
- … continues until the vocabulary reaches its target size (typically 50K–100K tokens)
The result: common words become single tokens, rare words get split into multiple tokens.
This is why "tokenization" gets split: the full word isn't common enough, but "token" as a prefix is very common (preserved as one token), and "ization" is a common suffix (also preserved as one token).
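The merge loop described above can be sketched as a toy BPE trainer in pure Python. This is a minimal illustration, not a production tokenizer: the corpus, word frequencies, and helper names are invented for the example.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus and return the most
    frequent one. Each word is a tuple of current symbols (initially
    single characters); `words` maps word -> corpus frequency."""
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Rewrite every word, replacing each occurrence of `pair` with the
    concatenated symbol (e.g. ('t', 'h') -> 'th')."""
    a, b = pair
    merged = {}
    for word, freq in words.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Invented mini-corpus: word -> frequency
corpus = {tuple("the"): 5, tuple("then"): 2, tuple("these"): 1, tuple("this"): 2}
merges = []
for _ in range(2):
    pair = most_frequent_pair(corpus)
    merges.append(pair)
    corpus = merge_pair(corpus, pair)

print(merges)  # [('t', 'h'), ('th', 'e')]
```

On this corpus the algorithm learns "th" first and then "the", mirroring the merge rounds described above: frequent sequences collapse into single symbols, while rarer words like "these" stay split into more pieces.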
Why Does Chinese Cost More Tokens Than English?
This is something many people miss, but it matters a lot: communicating with AI in Chinese typically consumes 2–3x more tokens than English.
The reason lies in BPE training data. Major models (GPT-4, Claude, Gemini) are trained primarily on English text, so the English vocabulary is highly optimized — common English words often map to a single token.
Chinese? Each character typically maps to 1–2 tokens (varies by model), and many Chinese words lack corresponding merge rules, causing them to split into more fragments.
Real-world test data:
- "Hello, how are you?" → approximately 5 tokens
- "你好,你怎么样?" → approximately 10–14 tokens
Same meaning, Chinese consumes 2–3x the tokens of English.
For SmallFireDragon Lab, this difference is very real. Our science articles are written in Chinese — when translated into English, the English version actually uses fewer tokens. That's why using English prompts is sometimes a practical cost-optimization strategy when calling AI APIs.
Practical Token Estimation Rules
Without precise tools, these rules of thumb work well:
| Language | Estimation Rule | Example |
|---|---|---|
| English | 1 token ≈ 4 characters / ~0.75 words | "Hello world" = 2 tokens |
| Chinese | 1 character ≈ 1–2 tokens | "你好世界" ≈ 4–8 tokens |
| Code | 1 token ≈ 3–4 characters (lots of punctuation) | Code is usually more expensive than prose |
| Numbers | Digits are split into short chunks (often 1–3 digits per token, varying by tokenizer) | "1234567" ≈ 3–7 tokens |
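The table's rules of thumb are easy to turn into a quick estimator. The sketch below is a hypothetical helper built purely from those heuristics (the function name and per-category rates are assumptions, and digits are counted conservatively at one token each), not a real tokenizer:

```python
def estimate_tokens(text):
    """Rough token estimate from rules of thumb:
    - CJK characters: ~1.5 tokens each
    - digits: 1 token each (conservative upper end)
    - everything else: ~4 characters per token"""
    cjk = sum(1 for ch in text if '\u4e00' <= ch <= '\u9fff')
    digits = sum(ch.isdigit() for ch in text)
    other = len(text) - cjk - digits
    return round(cjk * 1.5 + digits + other / 4)

print(estimate_tokens("Hello world"))  # 3
print(estimate_tokens("你好世界"))      # 6
print(estimate_tokens("1234567"))      # 7
```

Good enough for budgeting a prompt; for billing-accurate numbers, use the precise tools covered below in "How to Precisely Count Tokens".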
Real-World Impact of Token Limits
Understanding tokens matters because it directly affects two things about how you use AI:
1. Context Capacity Limits
Every model's Context Window is measured in tokens. Claude 3.5 Sonnet's 200K token window translates to roughly 100,000–150,000 Chinese characters. Sounds like a lot — but once you start pasting long documents, codebases, and conversation history, you hit the ceiling fast.
When you hit the ceiling, the model starts "forgetting" the earliest content — which brings us back to the memory limitations discussed in our last article.
2. API Call Costs
Commercial AI APIs charge by token, split between input tokens and output tokens (output is typically more expensive).
For Claude 3.5 Sonnet:
- Input: approximately $3 / million tokens
- Output: approximately $15 / million tokens
Processing a 10,000-character Chinese document consumes roughly 15,000–20,000 input tokens, costing about $0.045–0.06. That sounds small — but running hundreds of calls daily, costs add up quickly.
One lesson from SmallFireDragon Lab: compressing your System Prompt is more impactful than compressing user input. The System Prompt is attached to every request — a 1,000-token System Prompt at 100 calls/day wastes 100,000 input tokens daily.
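Putting the prices and the System Prompt lesson together, a back-of-the-envelope cost calculator might look like this. The function name, the 1,000-token reply size, and the call volume are assumptions for illustration; the prices are the Claude 3.5 Sonnet figures quoted above:

```python
def api_cost_usd(input_tokens, output_tokens,
                 input_price_per_m=3.0, output_price_per_m=15.0):
    """Cost of one API call in USD, given per-million-token prices
    (defaults: the Claude 3.5 Sonnet prices quoted above)."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# A 10,000-character Chinese document (~15,000 input tokens)
# plus an assumed 1,000-token reply:
print(f"per call: ${api_cost_usd(15_000, 1_000):.3f}")   # $0.060

# The System Prompt overhead: a 1,000-token prompt attached
# to 100 calls per day costs this much in input tokens alone:
print(f"prompt overhead/day: ${api_cost_usd(1_000, 0) * 100:.3f}")
```

Note how the output side dominates the per-call cost even at 1/15 the token count, which is why trimming verbose model replies often saves more than trimming input.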
How to Precisely Count Tokens
Useful tools:
- OpenAI Tokenizer (platform.openai.com/tokenizer): Free, visualizes tokenization for GPT-series models
- tiktoken (Python library): Open-sourced by OpenAI, precise token counting in code
- Anthropic Console: Claude's API playground, shows actual tokens consumed
- Rule-of-thumb estimates: English 1 token ≈ 4 characters; Chinese 1 character ≈ 1.5 tokens
An Often-Overlooked Detail: Special Tokens
Beyond regular text, tokenizers also have special tokens:
- <|system|>, <|user|>, <|assistant|>: mark conversation roles
- <|endoftext|>: text end marker
- <|pad|>: padding for alignment
These special tokens count toward the total token count too. So the tokens you actually use are typically a bit more than you'd expect.
Three Core Insights About Tokens
- Token ≠ character ≠ word: It's a semantic unit cut by BPE — common sequences are short, rare ones are long
- Chinese costs more than English: Same content in Chinese consumes ~2–3x more tokens than English — factor this into API cost optimization
- Tokens determine two things: How much context you can fit (capacity) + how much each call costs (cost)
Once you truly understand tokens, you can read AI product specs with real comprehension: whether that 128K context window is enough for your use case is just a quick calculation away.