I asked "on the other platform" what were the most important improvements to the original 2017 transformer.
That was quite popular and here is a synthesis of the responses:
- Prenorm: normalization in the residual blocks, before the attention operation and the FFN respectively (sketched below, together with RMSNorm)
- GQA (Grouped-Query Attention): more query heads than K/V heads, each K/V head being shared by a group of query heads (sketch below)
- RMSNorm instead of LayerNorm: normalize only the scaling, with no mean subtraction
- MLA (Multi-head Latent Attention): stores a low-rank projection of the attention block input and computes the K and V from it
- SwiGLU: non-linearity for the FFN block with per-component gating (sketch below)
- RoPE (Rotary Positional Embedding): makes the attention depend only on the relative Q/K positions (sketch below)
- MoE (Mixture of Experts): the FFN block is implemented with multiple MLPs, and a gating mechanism selects which ones process each token (sketch below)
- Warmup: very short ramping-up of the learning rate, starting from 0 (sketched below, together with the cosine schedule)
- Cosine schedule: the learning rate varies less at the beginning and end of the schedule
- AdamW: decouples the weight decay from Adam's gradient-based update
- Multi-token prediction: sums the training loss over multiple future tokens, possibly with additional readout heads (sketch below)
- FlashAttention: computes the attention on the fly, block by block, avoiding a memory footprint in O(T^2) (+ optimizes very carefully for the GPU!); the streaming-softmax idea is sketched below
- Ring Attention: takes advantage of multi-node hardware by passing the K/V blocks around a ring of devices, so the computation scales with the sequence length
- Speculative decoding: a cheaper model generates tokens, and a rejection process corrects this generation to match the full-model distribution (the acceptance rule is sketched below)
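To make some of these concrete, here are a few minimal PyTorch sketches. Everything below (module names, dimensions, hyper-parameters) is an illustrative choice of mine, not something taken from the responses. First, the pre-norm residual structure, using RMSNorm for the normalization:

```python
import torch
from torch import nn

class RMSNorm(nn.Module):
    # Rescales each token vector to unit RMS with a learned gain;
    # unlike LayerNorm there is no mean subtraction and no bias.
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        return x * x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt() * self.weight

class PreNormBlock(nn.Module):
    # Pre-norm: the normalization sits inside the residual branch,
    # before the sub-layer, so the residual path itself is untouched.
    def __init__(self, dim, num_heads):
        super().__init__()
        self.norm_attn = RMSNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_ffn = RMSNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        h = self.norm_attn(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.ffn(self.norm_ffn(x))
```

The original 2017 post-norm layout would instead normalize x + sublayer(x) after each residual addition.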
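A possible GQA layer; the ratio of query heads to K/V heads below is arbitrary:

```python
import torch
from torch import nn
from torch.nn import functional as F

class GQA(nn.Module):
    # n_q_heads query heads share n_kv_heads key/value heads
    # (n_q_heads must be a multiple of n_kv_heads), which shrinks
    # the KV cache by the ratio n_q_heads / n_kv_heads.
    def __init__(self, dim, n_q_heads=8, n_kv_heads=2):
        super().__init__()
        assert n_q_heads % n_kv_heads == 0
        self.n_q, self.n_kv = n_q_heads, n_kv_heads
        self.d_head = dim // n_q_heads
        self.w_q = nn.Linear(dim, n_q_heads * self.d_head, bias=False)
        self.w_k = nn.Linear(dim, n_kv_heads * self.d_head, bias=False)
        self.w_v = nn.Linear(dim, n_kv_heads * self.d_head, bias=False)
        self.w_o = nn.Linear(n_q_heads * self.d_head, dim, bias=False)

    def forward(self, x):
        B, T, _ = x.shape
        q = self.w_q(x).view(B, T, self.n_q, self.d_head).transpose(1, 2)
        k = self.w_k(x).view(B, T, self.n_kv, self.d_head).transpose(1, 2)
        v = self.w_v(x).view(B, T, self.n_kv, self.d_head).transpose(1, 2)
        # Each K/V head is broadcast to its group of query heads
        k = k.repeat_interleave(self.n_q // self.n_kv, dim=1)
        v = v.repeat_interleave(self.n_q // self.n_kv, dim=1)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.w_o(y.transpose(1, 2).reshape(B, T, -1))
```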
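A SwiGLU FFN; in practice `hidden` is often reduced to roughly 2/3 of the usual 4x width so the parameter count matches a plain MLP:

```python
from torch import nn
from torch.nn import functional as F

class SwiGLU(nn.Module):
    # The "up" branch is gated component-wise by SiLU(x W_gate).
    def __init__(self, dim, hidden):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```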
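A sketch of RoPE in its interleaved-pairs form, to be applied to Q and K before the dot products:

```python
import torch

def rope(x, base=10000.0):
    # x: (..., T, d) with d even. Rotates each consecutive pair of
    # channels by an angle proportional to the position, so that the
    # dot product <rope(q)[i], rope(k)[j]> depends only on i - j.
    T, d = x.shape[-2], x.shape[-1]
    pos = torch.arange(T, dtype=x.dtype, device=x.device)
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=x.dtype, device=x.device) / d)
    angles = pos[:, None] * inv_freq[None, :]          # (T, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Applied to queries and keys, per head, before the attention scores:
# q, k = rope(q), rope(k)
```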
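A deliberately naive, loop-based MoE FFN with one common top-k routing variant; real implementations batch tokens per expert and add a load-balancing loss, which I leave out:

```python
import torch
from torch import nn

class MoEFFN(nn.Module):
    # The FFN is a set of expert MLPs; a learned router keeps the
    # top-k experts per token and mixes their outputs with the
    # renormalized routing weights.
    def __init__(self, dim, hidden, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        )

    def forward(self, x):
        B, T, D = x.shape
        flat = x.reshape(-1, D)
        weights, idx = self.router(flat).softmax(dim=-1).topk(self.k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(flat)
        for e, expert in enumerate(self.experts):
            for slot in range(self.k):
                mask = idx[:, slot] == e     # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(flat[mask])
        return out.reshape(B, T, D)
```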
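Warmup followed by a cosine decay, together with AdamW; the step counts and learning rates are placeholders:

```python
import math
import torch

def lr_at(step, total_steps, warmup_steps=2_000, lr_max=3e-4, lr_min=3e-5):
    # Linear warmup from 0 to lr_max, then cosine decay to lr_min;
    # the cosine makes the decay flat near its beginning and its end.
    if step < warmup_steps:
        return lr_max * step / warmup_steps
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))

model = torch.nn.Linear(16, 16)  # placeholder for the real model
# AdamW applies the weight decay directly to the parameters at each
# step instead of folding it into the gradient as an L2 penalty.
optimizer = torch.optim.AdamW(model.parameters(), lr=lr_at(0, 10_000), weight_decay=0.1)

for step in range(10_000):
    for group in optimizer.param_groups:
        group["lr"] = lr_at(step, 10_000)
    # ... forward pass, loss.backward(), optimizer.step(), optimizer.zero_grad() ...
```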
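A sketch of multi-token prediction with independent readout heads on top of the trunk's hidden states; published variants may use more elaborate per-offset modules:

```python
from torch import nn
from torch.nn import functional as F

class MultiTokenHeads(nn.Module):
    # One readout head per future offset; the training loss is the
    # sum of the per-offset cross-entropies.
    def __init__(self, dim, vocab_size, n_future=2):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(dim, vocab_size) for _ in range(n_future))

    def loss(self, hidden, tokens):
        # hidden: (B, T, dim) trunk output, tokens: (B, T) token ids
        total = 0.0
        for d, head in enumerate(self.heads, start=1):
            logits = head(hidden[:, :-d])          # position t predicts token t + d
            targets = tokens[:, d:]
            total = total + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
            )
        return total
```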
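FlashAttention itself is a carefully fused GPU kernel, so the sketch below only illustrates the underlying streaming-softmax idea: the scores are processed block by block with a running max and normalizer, so the full T x T matrix is never materialized (single head, non-causal):

```python
import torch

def streaming_attention(q, k, v, block=128):
    # q, k, v: (T, d). Keys/values are consumed block by block with a
    # running max m, normalizer l and un-normalized output o, so only
    # a (T, block) slice of scores exists at any time instead of (T, T).
    T, d = q.shape
    scale = d ** -0.5
    m = torch.full((T, 1), float("-inf"))
    l = torch.zeros(T, 1)
    o = torch.zeros(T, d)
    for s in range(0, T, block):
        scores = (q @ k[s:s + block].T) * scale
        m_new = torch.maximum(m, scores.max(dim=-1, keepdim=True).values)
        p = torch.exp(scores - m_new)
        correction = torch.exp(m - m_new)
        l = l * correction + p.sum(dim=-1, keepdim=True)
        o = o * correction + p @ v[s:s + block]
        m = m_new
    return o / l

# Agrees (up to float error) with the naive version:
# torch.softmax(q @ k.T * q.shape[-1] ** -0.5, dim=-1) @ v
```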
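Finally, the per-token acceptance rule of speculative decoding, assuming `p` and `q` are the target-model and draft-model next-token distributions at the same position, and `x` the token sampled from the draft. The full procedure drafts several tokens, scores them with the target model in a single forward pass, and keeps the longest accepted prefix:

```python
import torch

def speculative_accept(p, q, x):
    # p, q: target and draft next-token distributions (1D, same vocab),
    # x: the token index sampled from q. Accept x with probability
    # min(1, p[x] / q[x]); otherwise resample from max(0, p - q),
    # renormalized. The returned token is distributed exactly as p.
    if torch.rand(()) < (p[x] / q[x]).clamp(max=1.0):
        return x
    residual = (p - q).clamp(min=0.0)
    return torch.multinomial(residual / residual.sum(), 1).item()
```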