LLM Architecture

LLMs = Large Language Models 
> built on the Transformer architecture
  > introduced in the 2017 paper "Attention Is All You Need." 

The Transformer Stack

> At its heart, an LLM processes text 
  by converting tokens (words or subwords) into numerical vectors, 
  then passing them through many repeated layers. 
  Each layer contains two key components:

  > 1. Self-Attention 
    — lets each token "look at" other tokens in the context and decide what's relevant 
      (in decoder-only LLMs, a causal mask limits this to earlier tokens). 
      This is what allows the model to capture relationships like pronoun references, 
      subject-verb agreement, and long-range dependencies.

  > 2. Feed-Forward Network (FFN) 
    — a small neural network applied independently to each token position, 
      which stores and retrieves factual associations learned during training.
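The two components can be sketched in miniature. Everything below (dimensions, weights, a single attention head) is invented for illustration; real layers use learned weight matrices, multiple heads, residual connections, and layer normalization:

```python
import math

def softmax(xs):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(vectors):
    """Each token's output is a weighted average of the token vectors,
    weighted by scaled dot-product similarity (one head, no learned
    query/key/value projections, to keep the sketch readable)."""
    d = len(vectors[0])
    out = []
    for q in vectors:
        scores = [dot(q, k) / math.sqrt(d) for k in vectors]
        weights = softmax(scores)  # non-negative, sums to 1
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors))
                 for i in range(d)]
        out.append(mixed)
    return out

def feed_forward(vec):
    """Tiny position-wise FFN: expand, apply ReLU, project back to
    dimension 4. All weights are arbitrary constants, purely
    illustrative; a trained model learns them."""
    hidden = [max(0.0, sum(vec) * w) for w in (0.5, -0.3, 0.8)]
    return [sum(hidden) * w for w in (0.1, -0.2, 0.3, 0.05)]

# Three "token" vectors of dimension 4 (in a real model these come from
# an embedding table plus positional information).
tokens = [[1.0, 0.0, 0.5, 0.2],
          [0.1, 0.9, 0.0, 0.4],
          [0.3, 0.3, 0.3, 0.3]]

attended = self_attention(tokens)
layer_out = [feed_forward(v) for v in attended]
print(len(layer_out), len(layer_out[0]))  # 3 tokens, dimension 4
```

A real layer also wraps each component in a residual connection (output = input + sublayer(input)), which is what makes stacking dozens of these layers trainable.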

> After all layers, a final linear projection (followed by a softmax) maps each token's vector 
  to a probability distribution over the vocabulary 
  — this is how the model picks the next token.
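That final step can be sketched as follows; the tiny vocabulary, hidden vector, and weight rows are all invented for illustration:

```python
import math

vocab = ["the", "cat", "sat", "mat"]
hidden = [0.2, -0.4, 0.9]                 # final-layer vector for one token
proj = [[0.5, 0.1, -0.2],                 # one weight row per vocab word
        [0.3, -0.6, 0.8],
        [-0.1, 0.4, 0.2],
        [0.0, 0.2, -0.5]]

# Linear projection: one logit (raw score) per vocabulary word.
logits = [sum(w * h for w, h in zip(row, hidden)) for row in proj]

# Softmax turns logits into a probability distribution.
m = max(logits)
exps = [math.exp(l - m) for l in logits]
probs = [e / sum(exps) for e in exps]

# Greedy decoding picks the highest-probability token; real systems
# often sample instead (temperature, top-p, etc.).
next_token = vocab[probs.index(max(probs))]
print(next_token)
```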

Training

> LLMs are trained via next-token prediction 
  (autoregressive training) on massive text corpora. 
> The model learns to compress statistical regularities of language 
  — grammar, facts, reasoning patterns 
  — into its billions of parameters. 
  This is where its "knowledge" lives, frozen in the weights.
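Next-token prediction can be shown as a toy loss computation: the target at each position is simply the following token, and the loss is the cross-entropy of the model's predicted distribution against that target. The `fake_model` stand-in and tiny vocabulary are assumptions for the sketch:

```python
import math

tokens = ["the", "cat", "sat"]
vocab = ["the", "cat", "sat", "mat"]

def fake_model(prefix):
    """Stand-in for an LLM: returns a uniform distribution over the
    vocabulary regardless of the prefix. A trained model would
    concentrate probability mass on the tokens likely to come next."""
    return {w: 1.0 / len(vocab) for w in vocab}

# Inputs are tokens[:-1]; targets are the same tokens shifted left by one.
loss = 0.0
for i in range(len(tokens) - 1):
    probs = fake_model(tokens[: i + 1])
    target = tokens[i + 1]
    loss += -math.log(probs[target])   # cross-entropy for this position
loss /= len(tokens) - 1                # average over positions
print(round(loss, 4))  # uniform over 4 tokens -> -log(1/4) ≈ 1.3863
```

Training nudges the parameters (via gradient descent) so that this loss falls, which is exactly how regularities of language end up encoded in the weights.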

Key limitations of the base architecture:

> Knowledge is static — frozen at training cutoff
> No access to private or proprietary data
> Context window is finite (though modern models have very large windows)
> Can hallucinate facts it doesn't actually know