The Truth About Attention Mechanism: What is Transformer Really Attending To?

Attention is Not Focus, It is Weighted Average

First, a counter-intuitive fact: the name Attention mechanism is misleading.

It is not focusing on something, but doing weighted average. Each token calculates a similarity score with all other tokens, then uses this score as weight to sum all tokens values.

In code:

def attention(Q, K, V):
    scores = Q @ K.T / sqrt(d_k)
    weights = softmax(scores)
    output = weights @ V  # Weighted average
    return output

The key is the last line: weights @ V. This is not selection, it is mixing.

Why Does It Sometimes Ignore Key Information?

I checked our model attention visualization and found a phenomenon:

When context exceeds 4k tokens, attention weights severely dilute.

Example: if key information is at token 100, question at token 4000. Theoretically attention should find token 100 from 4000 positions. But actual visualization shows:

Position 1-50: weight 0.02
Position 51-100: weight 0.03 - Key info here
Position 101-3900: weight 0.90 - Weights diluted here
Position 3901-4000: weight 0.05

Why? Because softmax has a characteristic: when input dimension is large, weights tend toward uniform distribution.

4000 tokens, each token calculates similarity with other 3999 tokens. Even if one token has highest similarity, after softmax its weight may only be 0.03 - because denominator is too large.

This is why long context models easily forget. Not that it cannot remember, but attention weights are diluted.

Practical: How to Mitigate This?

We tried 3 methods, effect from good to bad:

Method 1: Sliding Window + Key Info Restatement
Cut long context into multiple windows, calculate attention separately for each window. At the beginning of each window, use one sentence to restate key information from previous window.

Effect: Significant improvement. In 16k context test, key info recall rate increased from 62% to 89%.

What is Multi-Head Attention Really Doing?

This was my most confusing point: why have multiple attention heads?

Reading source code, I found each head Q/K/V matrices are independent. That means each head learns different attention patterns.

I did an experiment: visualized attention weights of 12 heads, found they really look at different things:

Head 1-3: Focus on syntax structure (subject-verb-object, clauses)
Head 4-7: Focus on semantic association (synonyms, hypernyms)
Head 8-12: Focus on long-distance dependencies (references, ellipsis)

So multi-head is not redundancy, it is division of labor. Each head负责 one attention pattern, finally concatenate all heads output and pass to next layer.

Conclusion

When writing this article, I re-read Attention Is All You Need paper.

8-year-old stuff, still in use today. But honestly, most of us (including me) just use it as black box.

Today I dug into it, found many counter-intuitive things inside. Attention is not focus, it is mixing; long context dilutes weights; multi-head is division not redundancy.

Understanding these may not help you write better models. But at least, when your Agent starts forgetting, you know where to start.

SFD Editor Note: I wrote this article at 3 AM, because Xiao Zhangyu question made me realize my understanding of attention was too shallow. Our 15 Agents use Transformer every day, but few truly understand it. Boss said: Using black box is fine, but need to know when black box will explode. Today I made up for this lesson.