- It is hard to overstate how cool and powerful FlexAttention is. @chhillee.bsky.social pytorch.org/blog/flexatten… TL;DR: it is an implementation of the attention operator in PyTorch that, in particular, lets you efficiently "carve" the attention matrix. 1/3
- It does this by generating an optimized CUDA kernel on the fly. So it's cool for causal masks, but it also allows an amazing trick to deal with batches of sequences of various lengths *without padding*! 2/3
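Roughly what the "carving" looks like for a plain causal mask (a minimal sketch using the `torch.nn.attention.flex_attention` API from the blog post; the tensor shapes are made up for illustration):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

# Illustrative shapes: batch, heads, sequence length, head dim.
B, H, S, D = 4, 8, 1024, 64
q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) for _ in range(3))

# The mask_mod returns True wherever attention is allowed: here, a standard causal mask.
def causal_mask(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

# B=None / H=None broadcasts the same mask over all batches and heads.
block_mask = create_block_mask(causal_mask, B=None, H=None, Q_LEN=S, KV_LEN=S)
out = flex_attention(q, k, v, block_mask=block_mask)
```

(In practice you would typically wrap `flex_attention` with `torch.compile` to get the fused kernel.)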
- To do so, you concatenate all the sequences into one long sequence (a batch of size 1) and carve the attention matrix into a block-diagonal one (possibly with a causal structure in each block) so that sequences cannot attend to each other. Magic! 3/3
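A sketch of that padding-free trick, under the assumption of three hypothetical sequences of lengths 5, 3, and 7: a document-id lookup turns the mask into a block-diagonal, per-sequence causal mask.

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

lengths = [5, 3, 7]          # hypothetical sequence lengths, packed back to back
total = sum(lengths)

# document_id[i] tells which original sequence token i belongs to.
document_id = torch.cat(
    [torch.full((l,), i, dtype=torch.long) for i, l in enumerate(lengths)]
).cuda()

def doc_causal_mask(b, h, q_idx, kv_idx):
    same_doc = document_id[q_idx] == document_id[kv_idx]
    return same_doc & (q_idx >= kv_idx)   # block-diagonal AND causal inside each block

H, D = 8, 64
q, k, v = (torch.randn(1, H, total, D, device="cuda", dtype=torch.float16) for _ in range(3))
block_mask = create_block_mask(doc_causal_mask, B=None, H=None, Q_LEN=total, KV_LEN=total)
out = flex_attention(q, k, v, block_mask=block_mask)
```

Because the mask is materialized only at the block level, the fully masked-out cross-sequence blocks are skipped entirely, which is what makes this competitive with dedicated varlen kernels.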