GUIDE

How Attention Works in Transformers

A precise, example-driven explanation of self-attention: queries, keys, values, scaled dot-product scores, softmax weighting, and multi-head attention — with an interactive explorer.

Attention is the mechanism that lets a model decide, for every word it processes, which other words matter. It replaced the recurrent and convolutional machinery that dominated sequence modeling before 2017, and it is the reason large language models can hold a coherent thread across thousands of tokens. This guide explains attention from the ground up: the problem it solves, the exact arithmetic of scaled dot-product attention, why multiple heads help, and where the costs lie. Every claim here is something you can watch happen in the interactive explorer below.

The problem: representing a word in context

The meaning of a word is not fixed; it depends on the words around it. The "bank" in river bank and the "bank" in savings bank are spelled identically but mean different things, and the disambiguating evidence ("river", "savings") can sit several words away. A model that produces one vector per word needs a way to fold in that surrounding evidence.

Earlier architectures did this sequentially. A recurrent network read the sentence left to right, maintaining a hidden state that it updated at each step. In principle the state carried everything seen so far; in practice it was a fixed-size bottleneck. Information from twenty words back had to survive twenty noisy updates to influence the current word, and gradients had to travel the same path during training. Long-range dependencies — agreement between a subject and a verb separated by a relative clause, a pronoun and its distant antecedent — were exactly the cases recurrence handled worst.

Attention removes the bottleneck. Instead of routing information through a sequential state, it lets each token look directly at every other token and pull in what it needs in a single step. The distance between two words no longer determines how hard it is for one to influence the other.

Queries, keys, and values

The core idea is a soft, differentiable lookup. Think of a Python dictionary: you present a key, and you get back the matching value. Attention generalizes this so that the match is not exact but graded — a query is compared against all keys, and the result is a weighted blend of all values, weighted by how well each key matched.

For every token, the model computes three vectors by multiplying the token's embedding by three separate learned weight matrices:

The query (Q) represents what this token is looking for in its context.
The key (K) represents what this token offers to others that look at it.
The value (V) represents the actual content this token contributes when it is attended to.

Queries and keys are used only to compute compatibility scores; values carry the information that actually flows. Separating "how relevant is this token?" (keys) from "what does it contribute?" (values) is what gives attention its flexibility — a token can be highly relevant for routing purposes while contributing a value tuned for a different job.

Because Q, K, and V are produced by learned matrices, the model discovers during training what each token should ask for, advertise, and contribute. None of this is hand-specified.
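The projection step can be sketched in a few lines of plain Python. This is a minimal illustration, not a real model: the embeddings in `X` and the matrices `W_q`, `W_k`, and `W_v` are made-up numbers chosen so the arithmetic is easy to check, and the helper name `project` is hypothetical.

```python
def project(X, W):
    """Multiply each row of X (n x d_in) by the matrix W (d_in x d_out)."""
    return [
        [sum(x * W[i][j] for i, x in enumerate(row)) for j in range(len(W[0]))]
        for row in X
    ]

# Toy embeddings: 3 tokens, 4 dimensions each (made-up numbers).
X = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 2.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 0.0],
]

# Three separate 4 x 2 projection matrices. In a trained model these are
# learned; here they are hand-picked so the arithmetic is easy to follow.
W_q = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
W_k = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
W_v = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.0], [0.0, 0.5]]

Q = project(X, W_q)  # what each token is looking for
K = project(X, W_k)  # what each token offers
V = project(X, W_v)  # what each token contributes

print(Q)  # [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
print(K)  # [[0.0, 2.0], [3.0, 0.0], [1.0, 1.0]]
print(V)  # [[1.0, 0.0], [0.0, 1.5], [0.5, 0.5]]
```

The same embedding yields three different vectors because each matrix reads it for a different purpose.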

Scaled dot-product attention, step by step

With queries, keys, and values in hand, a single attention layer computes its output in four steps. The compact formula is:

$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$

Unpacked, that is:

1. Score every pair

For a chosen token's query, take the dot product with the key of every token (including itself). A dot product is large when two vectors point in similar directions, so it measures compatibility: how well does this query match that key? Repeating this for every token's query in a sequence of n tokens yields an n × n matrix of raw scores, with one row per query and one column per key.

2. Scale by the square root of the key dimension

Divide every score by $\sqrt{d_k}$, where $d_k$ is the dimension of the key vectors. This is not cosmetic. If the components of a query and a key are independent with zero mean and unit variance, their dot product has variance $d_k$, so for large key dimensions the raw scores swing far from zero. Feeding scores of very large magnitude into softmax pushes it into a near-one-hot, saturated regime where its gradients almost vanish and training stalls. Dividing by $\sqrt{d_k}$ brings the variance of the scores back to about one, a range where softmax stays responsive and gradients flow.

3. Normalize with softmax

Apply softmax along each query's row. Softmax exponentiates the scores and divides by their sum, turning each row into a probability distribution: all weights are positive and sum to one. These are the attention weights. A weight of 0.6 from token A to token B means token A draws 60% of its mixed-in content from token B's value.

4. Mix the values

Multiply the attention weights by the value matrix. Each token's output is the weighted sum of all tokens' value vectors, with the weights from step 3. A token that attends strongly to two others ends up with an output that is mostly a blend of those two values.

The result is a new representation for every token that has incorporated, in one parallel operation, evidence from the entire sequence — with the amount of evidence from each source learned and content-dependent.
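The four steps can be put together in a short, dependency-free sketch. It assumes `Q`, `K`, and `V` have already been produced by the projections; the numbers are made up, and plain lists are used instead of a tensor library only to keep the example self-contained.

```python
import math

def softmax(row):
    """Softmax over a list of floats, shifted by the max for numerical stability."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Return (outputs, weights) for lists of query, key, and value vectors."""
    d_k = len(K[0])
    weights = []
    for q in Q:
        # Steps 1 and 2: dot product with every key, divided by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        # Step 3: softmax along this query's row.
        weights.append(softmax(scores))
    # Step 4: each output is the weighted sum of all value vectors.
    outputs = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
        for row in weights
    ]
    return outputs, weights

# Made-up, already-projected vectors for a 3-token sequence.
Q = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
K = [[0.0, 2.0], [3.0, 0.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.5], [0.5, 0.5]]

outputs, weights = scaled_dot_product_attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row is one query's attention distribution over the three keys, and each entry of `outputs` is the corresponding blend of value vectors.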

Fig. 02. Transformer Architecture Explorer (interactive): a comprehensive tool for exploring transformer architectures.

Open the explorer above: pick a token, and the highlighted connections show where its query found matching keys. Switch heads (covered next) and the pattern changes — the same sentence, attended to in a different way.

Self-attention versus cross-attention

When the queries, keys, and values are all derived from the same sequence, the mechanism is self-attention: every token attends to its own context, building context-aware representations of its own sentence. This is what runs inside each layer of an encoder, and inside the decoder over the tokens generated so far.

When the queries come from one sequence and the keys and values from another, it is cross-attention. The classic case is translation: the decoder's queries (the sentence being generated) attend to the encoder's keys and values (the source sentence), so each output word can focus on the relevant input words. Modern decoder-only language models lean almost entirely on self-attention, but cross-attention remains central to encoder–decoder and multimodal systems.
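The difference between the two comes down to which sequence supplies each argument. The sketch below is an illustration under simplifying assumptions: the vectors are made up, the learned projections are omitted, and the helper `attend` is a hypothetical name.

```python
import math

def attend(queries, keys, values):
    """Scaled dot-product attention; returns one weight row and one output per query."""
    d_k = len(keys[0])
    weights, outputs = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps]
        weights.append(row)
        outputs.append([sum(w * v[j] for w, v in zip(row, values))
                        for j in range(len(values[0]))])
    return weights, outputs

# Made-up vectors standing in for already-projected representations.
source = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 encoder tokens
target = [[2.0, 0.0], [0.0, 2.0]]               # 2 decoder tokens

# Self-attention: queries, keys, and values all come from the same sequence.
self_weights, _ = attend(source, source, source)

# Cross-attention: queries from the target, keys and values from the source.
cross_weights, cross_out = attend(target, source, source)

print(len(self_weights), "x", len(self_weights[0]))     # 3 x 3
print(len(cross_weights), "x", len(cross_weights[0]))   # 2 x 3
```

The cross-attention weight matrix has one row per target token and one column per source token, so it need not be square.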

Causal masking: not peeking at the future

A language model trained to predict the next token must never see tokens that come after the position it is predicting; otherwise it could trivially copy the answer. Attention enforces this with a causal mask: before softmax, every score from a query to a key at a later position is set to negative infinity. After exponentiation those entries become zero, so each token attends only to itself and earlier tokens. This single masking step is what makes the same attention machinery usable for autoregressive generation.
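A minimal sketch of the mask, assuming the n × n score matrix has already been computed and scaled. The scores are made up and the function name is hypothetical.

```python
import math

def causal_attention_weights(scores):
    """Apply a causal mask to an n x n score matrix, then softmax each row."""
    weights = []
    for i, row in enumerate(scores):
        # Keys at positions j > i lie in the future of query i.
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        m = max(masked)                            # finite: position i itself is never masked
        exps = [math.exp(s - m) for s in masked]   # exp(-inf) is exactly 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [
    [1.0, 2.0, 3.0],
    [1.0, 2.0, 3.0],
    [1.0, 2.0, 3.0],
]
for row in causal_attention_weights(scores):
    print([round(w, 3) for w in row])
```

The printed matrix is lower-triangular: the first token can attend only to itself, while the last token attends to all three.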

Multi-head attention

A single attention computation produces one set of weights — one way of relating tokens. But relationships in language are not one-dimensional. One token might need to track grammatical subject–verb agreement, another the referent of a pronoun, another simple adjacency. Forcing all of these through a single weighted average blurs them together.

Multi-head attention runs several attention computations in parallel. The model projects the tokens into h separate lower-dimensional query/key/value sets — one per head — and computes scaled dot-product attention independently in each. Because each head has its own projection matrices, each can specialize: empirically, some heads track syntactic dependencies, some attend to the previous token, some learn to follow coreference. The heads' outputs are concatenated and passed through a final learned projection that mixes them back into a single vector per token.

Crucially, the per-head dimension is the model dimension divided by the number of heads, so multi-head attention costs roughly the same as single-head attention of the full width — you get multiple specialized views for about the price of one. This division of labor is a large part of why attention is so effective in practice, and it is the most visible thing in the explorer: switch heads on the same sentence and watch entirely different connection patterns light up.
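The split, attend, concatenate, and project sequence can be sketched as follows. This is a toy illustration, not a trained model: the input `X` is made up, each head's projection is hand-picked to read one half of the dimensions, and the output projection `W_o` is the identity so the result is easy to inspect.

```python
import math

def matmul(A, B):
    return [[sum(a * B[k][j] for k, a in enumerate(row)) for j in range(len(B[0]))]
            for row in A]

def attention(Q, K, V):
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        out.append([sum((e / total) * v[j] for e, v in zip(exps, V))
                    for j in range(len(V[0]))])
    return out

def multi_head_attention(X, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) triples, each d_model x d_head."""
    head_outputs = [
        attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
        for W_q, W_k, W_v in heads
    ]
    # Concatenate the heads' outputs token by token, then mix with W_o.
    concat = [[x for head in head_outputs for x in head[t]] for t in range(len(X))]
    return matmul(concat, W_o)

d_model, n_heads = 4, 2
d_head = d_model // n_heads   # 2

X = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 2.0]]

# Hand-picked projections: head 0 reads dimensions 0-1, head 1 reads dimensions 2-3.
first_half = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
second_half = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
heads = [(first_half,) * 3, (second_half,) * 3]
W_o = [[1.0 if i == j else 0.0 for j in range(d_model)] for i in range(d_model)]

out = multi_head_attention(X, heads, W_o)

# Projection parameters: n_heads * 3 * d_model * d_head equals 3 * d_model ** 2.
print(n_heads * 3 * d_model * d_head, 3 * d_model ** 2)   # 48 48
```

The final line checks the cost argument: two heads of width 2 use the same number of query, key, and value parameters as one head of width 4.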

Where position information comes from

Attention as described is permutation-equivariant: shuffle the input tokens and the outputs shuffle the same way, because the dot products do not encode order. Yet word order obviously matters: "dog bites man" is not "man bites dog". Transformers therefore inject position separately. Sinusoidal encodings and learned position embeddings are added to the token embeddings before the first layer, while rotary position embeddings (RoPE) rotate the queries and keys inside each attention layer. In every case the dot-product machinery itself stays order-agnostic; position is information it consumes, not something it computes.
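The equivariance claim is easy to check numerically. The sketch below assumes unmasked self-attention with the projections omitted (so Q = K = V = X) and uses made-up vectors; per-token projections would not change the conclusion.

```python
import math

def self_attention(X):
    """Self-attention with Q = K = V = X (projections omitted for brevity)."""
    d_k = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in X]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        out.append([sum((e / total) * v[j] for e, v in zip(exps, X))
                    for j in range(d_k)])
    return out

X = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]   # stand-ins for "dog", "bites", "man"
perm = [2, 1, 0]                            # reorder to "man", "bites", "dog"

out = self_attention(X)
out_shuffled = self_attention([X[i] for i in perm])

# Each token gets the same output vector; only its position in the list moves.
same = all(
    math.isclose(a, b, abs_tol=1e-12)
    for i, p in enumerate(perm)
    for a, b in zip(out_shuffled[i], out[p])
)
print(same)   # True
```

Without position information, the two word orders produce the same set of token representations, which is why position has to be supplied separately.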

The cost: quadratic attention

The elegance of letting every token see every token has a price. The score matrix is n × n for a sequence of length n, so both the computation and the memory of a single attention layer grow with the square of the sequence length. Doubling the context roughly quadruples the attention cost. For short sequences this is irrelevant; for documents, codebases, or long chat histories it dominates.
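A back-of-the-envelope calculation makes the growth concrete. It counts only the score matrix for one head in one layer, and the 2 bytes per score is an assumption (16-bit floats); the function name is hypothetical.

```python
def score_matrix_bytes(n_tokens, bytes_per_score=2):
    """Bytes needed to hold one n x n score matrix (one head, one layer)."""
    return n_tokens * n_tokens * bytes_per_score

for n in (1024, 2048, 4096, 8192):
    mib = score_matrix_bytes(n) / 2**20
    print(f"{n:>5} tokens -> {n * n:>10,} scores, {mib:>6.0f} MiB")
```

Under these assumptions the matrix grows from 2 MiB at 1,024 tokens to 128 MiB at 8,192 tokens, a 64-fold increase for an 8-fold longer sequence.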

This single fact drives a large branch of modern research. Sparse attention restricts each token to a subset of others (local windows, strided patterns). Linear attention reformulates the math to avoid materializing the full matrix. FlashAttention keeps the exact computation but reorganizes it to be IO-aware, never writing the giant matrix to slow memory, which makes long contexts practical on real hardware. Understanding why attention is quadratic is the prerequisite for understanding every one of these.

Why softmax, and not a plain average

It is reasonable to ask why attention bothers with learned queries and keys at all — why not just average every token's value, or average over a fixed window? The answer is that a plain average is content-blind: it gives "the" and "cat" the same say in representing "sat", regardless of what the sentence needs. The query–key dot product makes the mixing data-dependent — the same layer routes information differently for every input, because the weights are computed from the tokens themselves rather than fixed in advance.

Softmax specifically buys two things. It guarantees the weights are non-negative and sum to one, so the output stays a genuine convex blend of values rather than an unbounded sum that could explode in magnitude. And it is smooth and differentiable, so the model can learn, via gradient descent, to sharpen toward one token or spread across many, sliding continuously between "look hard at the subject" and "average broadly over the clause". A hard, argmax-style lookup would keep the output bounded but is piecewise constant in the scores, so it would pass no usable gradient back to the queries and keys.

A concrete walk-through

It helps to trace one token through the mechanism with small numbers. Suppose we are processing the sentence the cat sat and computing the new representation for "sat". Embeddings are tiny — say two dimensions — and the model has already learned its projection matrices.

First, "sat" is projected into its query vector, and every token (including "sat" itself) is projected into a key vector. We take three dot products: query("sat") · key("the"), query("sat") · key("cat"), query("sat") · key("sat"). Imagine these come out as 1.0, 4.0, and 2.0. The middle score is highest because, for predicting or representing a verb, the subject "cat" is the most informative neighbor — that relationship is exactly what the query and key projections were trained to surface.

Next we scale. With a key dimension of two, we divide by $\sqrt{2} \approx 1.41$, giving roughly 0.71, 2.83, 1.41. Softmax exponentiates and normalizes these into weights of approximately 0.09, 0.73, 0.18. They sum to one, and "cat" dominates.

Finally we mix the value vectors: the output for "sat" is $0.09 \times v(\text{the}) + 0.73 \times v(\text{cat}) + 0.18 \times v(\text{sat})$. The new "sat" vector is now mostly "cat"'s content blended with some of its own, so it has become subject-aware. Every other token is updated the same way, in parallel, in the same layer. Nothing about this depends on how far apart the tokens sit; a subject ten words away would be reached by the same single dot product.
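The arithmetic of the walk-through can be reproduced directly from the raw scores 1.0, 4.0, and 2.0. The value vectors below are hypothetical, since the text does not specify them; they are there only so the final mixing step has something to blend.

```python
import math

raw_scores = {"the": 1.0, "cat": 4.0, "sat": 2.0}   # raw dot products from the walk-through
d_k = 2

scaled = {tok: s / math.sqrt(d_k) for tok, s in raw_scores.items()}
exps = {tok: math.exp(s) for tok, s in scaled.items()}
total = sum(exps.values())
weights = {tok: e / total for tok, e in exps.items()}

for tok in weights:
    print(f"{tok}: scaled={scaled[tok]:.2f} weight={weights[tok]:.2f}")
# the: scaled=0.71 weight=0.09
# cat: scaled=2.83 weight=0.73
# sat: scaled=1.41 weight=0.18

# Hypothetical 2-dimensional value vectors (not given in the text).
values = {"the": [0.0, 1.0], "cat": [1.0, 0.0], "sat": [0.5, 0.5]}
output = [sum(weights[t] * values[t][j] for t in weights) for j in range(2)]
print([round(x, 2) for x in output])
```

With these made-up values the output lands close to the value vector of "cat", as the weights suggest it should.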

Attention is not the whole transformer

Attention mixes information across positions, but a transformer block does more. After attention, each token's vector passes through a small position-wise feed-forward network — two linear layers with a nonlinearity — applied independently to every position. If attention decides what to gather, the feed-forward layer decides what to do with it, and it holds a large share of the model's parameters and learned knowledge.

Two more pieces make deep stacks trainable. Residual connections add each sublayer's input to its output, so information and gradients have a direct path around every sublayer; without them, stacking dozens of attention layers would be unstable. Layer normalization keeps activations at a consistent scale between sublayers. A real model stacks this block — attention, then feed-forward, each wrapped in a residual and a norm — dozens of times. Early layers tend to capture local, surface patterns; later layers compose them into abstract, task-relevant features. Attention is the part that moves information between tokens, but it always operates inside this larger, repeated structure.
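The wiring of one block can be sketched without committing to any particular attention or feed-forward implementation. Several choices here are assumptions: normalization is applied before each sublayer (a pre-norm arrangement, one of two common orderings), `layer_norm` has no learned gain or bias, and `mean_mixing` and `relu` are placeholder sublayers rather than real attention or a real feed-forward network.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]

def add(A, B):
    """Element-wise sum of two lists of vectors (the residual connection)."""
    return [[a + b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(A, B)]

def transformer_block(X, attention_fn, feed_forward_fn):
    # Sublayer 1: attention moves information between positions.
    X = add(X, attention_fn([layer_norm(row) for row in X]))
    # Sublayer 2: the feed-forward network is applied to each position independently.
    X = add(X, [feed_forward_fn(layer_norm(row)) for row in X])
    return X

# Placeholder sublayers, just to exercise the wiring.
def mean_mixing(rows):
    """Stand-in for attention: every position receives the average of all rows."""
    n = len(rows)
    avg = [sum(col) / n for col in zip(*rows)]
    return [list(avg) for _ in rows]

def relu(vec):
    """Stand-in for the feed-forward network: just the nonlinearity."""
    return [max(0.0, x) for x in vec]

X = [[1.0, 2.0], [3.0, 5.0]]
for _ in range(3):            # a real model repeats the block dozens of times
    X = transformer_block(X, mean_mixing, relu)
print(X)
```

Because each sublayer's result is added to its input, a sublayer that outputs zeros leaves the representation untouched, which is the direct path around every sublayer that the residual connections provide.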

What attention patterns look like in practice

Because each head learns its own projections, probing trained models reveals recognizable specializations. Some heads are previous-token heads: nearly all their weight falls on the immediately preceding token, effectively reconstructing a local, recurrence-like signal. Some are positional heads that attend by offset. Most interesting are induction heads, which implement a copy-and-continue pattern: having seen "...A B ... A", they attend from the second "A" back to the first and predict "B". Induction heads are widely believed to underlie much of a model's in-context learning — its ability to pick up and continue a pattern shown only within the current prompt.
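One simple way to spot a previous-token head is to measure how much weight each query places on the position just before it. The function below is a hypothetical diagnostic written for this illustration, not a standard library routine, and both weight matrices are made up.

```python
def previous_token_score(weights):
    """Average weight that queries at positions 1..n-1 place on the position just before them."""
    n = len(weights)
    if n < 2:
        return 0.0
    return sum(weights[i][i - 1] for i in range(1, n)) / (n - 1)

# Made-up causal attention matrices (each row sums to one).
previous_token_like = [
    [1.00, 0.00, 0.00, 0.00],
    [0.90, 0.10, 0.00, 0.00],
    [0.05, 0.90, 0.05, 0.00],
    [0.00, 0.05, 0.90, 0.05],
]
uniform_causal = [
    [1.00, 0.00, 0.00, 0.00],
    [0.50, 0.50, 0.00, 0.00],
    [1 / 3, 1 / 3, 1 / 3, 0.00],
    [0.25, 0.25, 0.25, 0.25],
]

print(round(previous_token_score(previous_token_like), 2))  # 0.9
print(round(previous_token_score(uniform_causal), 2))       # 0.36
```

A score near one indicates that the head concentrates on the preceding token, while a head that spreads its weight evenly scores much lower.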

These patterns are not designed; they emerge from training and can be read directly off the attention weights. That is what makes the explorer above instructive: the connections you see are the literal weights the model would use, not an illustration of them.

Common misconceptions

A few ideas trip people up. Attention weights are not the same as importance or explanation. A high weight means a value was mixed in heavily at that layer, but with dozens of layers and the feed-forward transformations in between, a single layer's weights are a weak guide to what ultimately drives an output. Attention is not memory. Within one forward pass it has full access to the context window, but it stores nothing between calls; "memory" features are built on top of it (caches, retrieval), not inside it. And more heads is not strictly better: heads share the model width, and beyond a point extra heads add redundancy rather than capability. Knowing what the mechanism does — and does not — claim keeps these straight.

What to take away

Attention is a learned, content-addressable, fully parallel way to mix information across a sequence. Each token forms a query and compares it against every key to decide where to look; softmax turns those comparisons into weights; the weights blend the values into a new, context-aware representation. Multiple heads provide multiple relationship views; masking adapts the mechanism to generation; positional encodings supply the order that the dot products discard; and the quadratic cost explains the entire ecosystem of efficient-attention variants. The mechanism is simple enough to write in a few lines and expressive enough to underpin nearly every modern language model, which is exactly why it is worth understanding by manipulation, not memorization.

FREQUENTLY ASKED

What problem does attention solve?

It lets a model weigh every other token when representing a given token, so meaning that depends on far-apart words (long-range dependencies) is captured directly instead of being squeezed through a fixed-size recurrent state.

What are queries, keys, and values?

Three learned linear projections of each token. The query asks "what am I looking for?", the key advertises "what do I offer?", and the value is the content actually mixed in. A token attends to others by matching its query against their keys.

Why divide the dot product by the square root of the key dimension?

For large key dimensions the raw dot products grow large in magnitude, pushing softmax into saturated regions with tiny gradients. Dividing by $\sqrt{d_k}$ keeps the scores at a scale where softmax stays sensitive and training is stable.
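As a rough numerical illustration of this answer, the sketch below compares softmax on unscaled and scaled scores. The key dimension of 64 and the three raw scores are assumptions chosen for the example; the gradient shown is the derivative of the top weight with respect to its own score, which for softmax is w × (1 − w).

```python
import math

def softmax(scores):
    """Softmax over a list of floats, shifted by the max for numerical stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64                  # assumed key dimension
raw = [8.0, 0.0, -8.0]    # made-up scores, about sqrt(d_k) = 8 apart
scaled = [s / math.sqrt(d_k) for s in raw]   # [1.0, 0.0, -1.0]

w_raw = softmax(raw)
w_scaled = softmax(scaled)

# Derivative of the top weight with respect to its own score: w * (1 - w).
grad_raw = w_raw[0] * (1.0 - w_raw[0])
grad_scaled = w_scaled[0] * (1.0 - w_scaled[0])

print(f"unscaled: top weight {w_raw[0]:.4f}, gradient {grad_raw:.4f}")
print(f"scaled:   top weight {w_scaled[0]:.4f}, gradient {grad_scaled:.4f}")
```

Unscaled, the top weight is almost exactly one and its gradient is close to zero; after scaling, the weight is about two thirds and the gradient is several hundred times larger.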

What does multi-head attention add?

Several attention computations run in parallel on different learned projections, so each head can specialize (syntax, coreference, position). Their outputs are concatenated and projected, giving the model multiple relationship "views" at once.

How is self-attention different from cross-attention?

In self-attention queries, keys, and values come from the same sequence (a token attends to its own context). In cross-attention the queries come from one sequence and the keys/values from another — e.g. a decoder attending to the encoder output.

Why is attention quadratic in sequence length?

Every query is compared with every key, so an n-token sequence produces an n×n score matrix, which costs O(n²) time and memory. This is why long-context models invest in sparse attention, linear attention, or FlashAttention.
