The Transformer Stack
> At its heart, an LLM processes text
by converting tokens (words or subwords) into numerical vectors,
then passing them through a stack of layers with identical structure.
Each layer contains two key components:
> 1. Self-Attention
— lets every token "look at" every other token in the context and decide what's relevant.
This is what allows the model to understand relationships like pronoun references,
subject-verb agreement, and long-range dependencies.
> 2. Feed-Forward Network (FFN)
— a two-layer neural network applied independently to each token position,
which stores and retrieves factual associations learned during training.
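The two components above can be sketched in a few lines of NumPy. This is a minimal illustration, not a faithful implementation: it uses a single attention head, random weights, and omits layer normalization and masking, and every function and parameter name here is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating, for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    # Project each token's vector into query, key, and value vectors.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    # Scaled dot-product scores: how strongly each token attends to each other token.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Softmax turns scores into attention weights; mix the values accordingly.
    return softmax(scores) @ v

def ffn(x, W1, b1, W2, b2):
    # Two-layer network with a ReLU, applied to every position independently.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def transformer_block(x, p):
    # Residual connections around each sub-layer (layer norm omitted for brevity).
    x = x + self_attention(x, p["Wq"], p["Wk"], p["Wv"])
    x = x + ffn(x, p["W1"], p["b1"], p["W2"], p["b2"])
    return x

rng = np.random.default_rng(0)
d, d_ff, seq_len = 8, 32, 5
p = {
    "Wq": rng.normal(size=(d, d)) * 0.1,
    "Wk": rng.normal(size=(d, d)) * 0.1,
    "Wv": rng.normal(size=(d, d)) * 0.1,
    "W1": rng.normal(size=(d, d_ff)) * 0.1,
    "b1": np.zeros(d_ff),
    "W2": rng.normal(size=(d_ff, d)) * 0.1,
    "b2": np.zeros(d),
}
x = rng.normal(size=(seq_len, d))   # 5 token vectors of width 8
out = transformer_block(x, p)
print(out.shape)  # (5, 8): each layer maps token vectors to same-shaped token vectors
```

Because input and output shapes match, dozens of such blocks can be stacked, which is what "many layers" means in practice.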
> After all layers, a final projection maps each token's vector
to a score (a logit) for every entry in the vocabulary;
a softmax then turns those scores into a probability distribution,
from which the model picks the next token.
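That last step can also be sketched directly; again a minimal illustration with random weights and made-up names, shown here with greedy decoding (always taking the most probable token), which is only one of several sampling strategies.

```python
import numpy as np

def next_token_distribution(h, W_vocab):
    # Project the final hidden vector to one logit per vocabulary entry,
    # then apply a softmax to get a probability distribution.
    logits = h @ W_vocab
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(1)
d, vocab_size = 8, 100
h = rng.normal(size=d)                 # last token's vector after all layers
W_vocab = rng.normal(size=(d, vocab_size))
probs = next_token_distribution(h, W_vocab)
next_id = int(np.argmax(probs))        # greedy decoding: pick the most likely token
print(round(probs.sum(), 6))           # 1.0 (valid probability distribution)
```

In a real model the chosen token id is mapped back to its text by the tokenizer, appended to the context, and the whole stack runs again to pick the token after that.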