I asked "on the other platform" what were the most important improvements to the original 2017 transformer.
That was quite popular and here is a synthesis of the responses:
- Prenorm: normalization in the residual blocks, before the attention operation and the FFN respectively (sketched below, together with RMSNorm)
- GQA (Grouped-Query Attention): more query heads than K/V heads, each K/V head being shared by a group of query heads (sketch below)
- RMSNorm instead of LayerNorm: normalize only the scaling, with no mean subtraction
- MLA (Multi-head Latent Attention): stores a low-rank projection of the attention block input and computes the K and V from it
- SwiGLU: non-linearity for the FFN block with per-component gating (sketch below)
- RoPE (Rotary Positional Embedding): makes the attention depend only on the relative Q/K positions (sketch below)
- MoE (Mixture of Experts): the FFN block is implemented with multiple MLPs, and a gating mechanism selects which ones process each token (sketch below)
- Warmup: very short ramping-up of the learning rate, starting from 0 (sketched below, together with the cosine schedule)
- Cosine schedule: the learning rate varies less at the beginning and end of the schedule
- AdamW: decouples the weight decay from Adam's gradient-based update
- Multi-token prediction: sums the training loss over multiple future tokens, possibly with additional readout heads (sketch below)
- FlashAttention: computes the attention on the fly, block by block, avoiding a memory footprint in O(T^2) (+ optimizes very carefully for the GPU!); the streaming-softmax idea is sketched below
- Ring Attention: takes advantage of multi-node hardware by passing the K/V blocks around a ring of devices, so the computation scales with the sequence length
- Speculative decoding: a cheaper model generates tokens, and a rejection process corrects this generation to match the full-model distribution (the acceptance rule is sketched below)
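To make some of these concrete, here are a few minimal PyTorch sketches. Everything below (module names, dimensions, hyper-parameters) is an illustrative choice of mine, not something taken from the responses. First, the pre-norm residual structure, using RMSNorm for the normalization:

```python
import torch
from torch import nn

class RMSNorm(nn.Module):
    # Rescales each token vector to unit RMS with a learned gain;
    # unlike LayerNorm there is no mean subtraction and no bias.
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        return x * x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt() * self.weight

class PreNormBlock(nn.Module):
    # Pre-norm: the normalization sits inside the residual branch,
    # before the sub-layer, so the residual path itself is untouched.
    def __init__(self, dim, num_heads):
        super().__init__()
        self.norm_attn = RMSNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_ffn = RMSNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        h = self.norm_attn(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.ffn(self.norm_ffn(x))
```

The original 2017 post-norm layout would instead normalize x + sublayer(x) after each residual addition.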
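A possible GQA layer; the ratio of query heads to K/V heads below is arbitrary:

```python
import torch
from torch import nn
from torch.nn import functional as F

class GQA(nn.Module):
    # n_q_heads query heads share n_kv_heads key/value heads
    # (n_q_heads must be a multiple of n_kv_heads), which shrinks
    # the KV cache by the ratio n_q_heads / n_kv_heads.
    def __init__(self, dim, n_q_heads=8, n_kv_heads=2):
        super().__init__()
        assert n_q_heads % n_kv_heads == 0
        self.n_q, self.n_kv = n_q_heads, n_kv_heads
        self.d_head = dim // n_q_heads
        self.w_q = nn.Linear(dim, n_q_heads * self.d_head, bias=False)
        self.w_k = nn.Linear(dim, n_kv_heads * self.d_head, bias=False)
        self.w_v = nn.Linear(dim, n_kv_heads * self.d_head, bias=False)
        self.w_o = nn.Linear(n_q_heads * self.d_head, dim, bias=False)

    def forward(self, x):
        B, T, _ = x.shape
        q = self.w_q(x).view(B, T, self.n_q, self.d_head).transpose(1, 2)
        k = self.w_k(x).view(B, T, self.n_kv, self.d_head).transpose(1, 2)
        v = self.w_v(x).view(B, T, self.n_kv, self.d_head).transpose(1, 2)
        # Each K/V head is broadcast to its group of query heads
        k = k.repeat_interleave(self.n_q // self.n_kv, dim=1)
        v = v.repeat_interleave(self.n_q // self.n_kv, dim=1)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.w_o(y.transpose(1, 2).reshape(B, T, -1))
```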
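A SwiGLU FFN; in practice `hidden` is often reduced to roughly 2/3 of the usual 4x width so the parameter count matches a plain MLP:

```python
from torch import nn
from torch.nn import functional as F

class SwiGLU(nn.Module):
    # The "up" branch is gated component-wise by SiLU(x W_gate).
    def __init__(self, dim, hidden):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```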
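A sketch of RoPE in its interleaved-pairs form, to be applied to Q and K before the dot products:

```python
import torch

def rope(x, base=10000.0):
    # x: (..., T, d) with d even. Rotates each consecutive pair of
    # channels by an angle proportional to the position, so that the
    # dot product <rope(q)[i], rope(k)[j]> depends only on i - j.
    T, d = x.shape[-2], x.shape[-1]
    pos = torch.arange(T, dtype=x.dtype, device=x.device)
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=x.dtype, device=x.device) / d)
    angles = pos[:, None] * inv_freq[None, :]          # (T, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Applied to queries and keys, per head, before the attention scores:
# q, k = rope(q), rope(k)
```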
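A deliberately naive, loop-based MoE FFN with one common top-k routing variant; real implementations batch tokens per expert and add a load-balancing loss, which I leave out:

```python
import torch
from torch import nn

class MoEFFN(nn.Module):
    # The FFN is a set of expert MLPs; a learned router keeps the
    # top-k experts per token and mixes their outputs with the
    # renormalized routing weights.
    def __init__(self, dim, hidden, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        )

    def forward(self, x):
        B, T, D = x.shape
        flat = x.reshape(-1, D)
        weights, idx = self.router(flat).softmax(dim=-1).topk(self.k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(flat)
        for e, expert in enumerate(self.experts):
            for slot in range(self.k):
                mask = idx[:, slot] == e     # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(flat[mask])
        return out.reshape(B, T, D)
```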
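Warmup followed by a cosine decay, together with AdamW; the step counts and learning rates are placeholders:

```python
import math
import torch

def lr_at(step, total_steps, warmup_steps=2_000, lr_max=3e-4, lr_min=3e-5):
    # Linear warmup from 0 to lr_max, then cosine decay to lr_min;
    # the cosine makes the decay flat near its beginning and its end.
    if step < warmup_steps:
        return lr_max * step / warmup_steps
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))

model = torch.nn.Linear(16, 16)  # placeholder for the real model
# AdamW applies the weight decay directly to the parameters at each
# step instead of folding it into the gradient as an L2 penalty.
optimizer = torch.optim.AdamW(model.parameters(), lr=lr_at(0, 10_000), weight_decay=0.1)

for step in range(10_000):
    for group in optimizer.param_groups:
        group["lr"] = lr_at(step, 10_000)
    # ... forward pass, loss.backward(), optimizer.step(), optimizer.zero_grad() ...
```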
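A sketch of multi-token prediction with independent readout heads on top of the trunk's hidden states; published variants may use more elaborate per-offset modules:

```python
from torch import nn
from torch.nn import functional as F

class MultiTokenHeads(nn.Module):
    # One readout head per future offset; the training loss is the
    # sum of the per-offset cross-entropies.
    def __init__(self, dim, vocab_size, n_future=2):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(dim, vocab_size) for _ in range(n_future))

    def loss(self, hidden, tokens):
        # hidden: (B, T, dim) trunk output, tokens: (B, T) token ids
        total = 0.0
        for d, head in enumerate(self.heads, start=1):
            logits = head(hidden[:, :-d])          # position t predicts token t + d
            targets = tokens[:, d:]
            total = total + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
            )
        return total
```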
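FlashAttention itself is a carefully fused GPU kernel, so the sketch below only illustrates the underlying streaming-softmax idea: the scores are processed block by block with a running max and normalizer, so the full T x T matrix is never materialized (single head, non-causal):

```python
import torch

def streaming_attention(q, k, v, block=128):
    # q, k, v: (T, d). Keys/values are consumed block by block with a
    # running max m, normalizer l and un-normalized output o, so only
    # a (T, block) slice of scores exists at any time instead of (T, T).
    T, d = q.shape
    scale = d ** -0.5
    m = torch.full((T, 1), float("-inf"))
    l = torch.zeros(T, 1)
    o = torch.zeros(T, d)
    for s in range(0, T, block):
        scores = (q @ k[s:s + block].T) * scale
        m_new = torch.maximum(m, scores.max(dim=-1, keepdim=True).values)
        p = torch.exp(scores - m_new)
        correction = torch.exp(m - m_new)
        l = l * correction + p.sum(dim=-1, keepdim=True)
        o = o * correction + p @ v[s:s + block]
        m = m_new
    return o / l

# Agrees (up to float error) with the naive version:
# torch.softmax(q @ k.T * q.shape[-1] ** -0.5, dim=-1) @ v
```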
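Finally, the per-token acceptance rule of speculative decoding, assuming `p` and `q` are the target-model and draft-model next-token distributions at the same position, and `x` the token sampled from the draft. The full procedure drafts several tokens, scores them with the target model in a single forward pass, and keeps the longest accepted prefix:

```python
import torch

def speculative_accept(p, q, x):
    # p, q: target and draft next-token distributions (1D, same vocab),
    # x: the token index sampled from q. Accept x with probability
    # min(1, p[x] / q[x]); otherwise resample from max(0, p - q),
    # renormalized. The returned token is distributed exactly as p.
    if torch.rand(()) < (p[x] / q[x]).clamp(max=1.0):
        return x
    residual = (p - q).clamp(min=0.0)
    return torch.multinomial(residual / residual.sum(), 1).item()
```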