- It is hard to overstate how cool and powerful FlexAttention is. @chhillee.bsky.social pytorch.org/blog/flexatten… TL;DR: it is an implementation of the attention operator in PyTorch that, in particular, lets you efficiently "carve" the attention matrix. 1/3
- It does this by generating an optimized CUDA kernel on the fly. So it's cool for causal masks, but it also allows an amazing trick to deal with batches of sequences of various lengths *without padding*! 2/3
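Roughly what the "carving" looks like for a plain causal mask (a minimal sketch using the `torch.nn.attention.flex_attention` API from the blog post; the tensor shapes are made up for illustration):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

# Illustrative shapes: batch, heads, sequence length, head dim.
B, H, S, D = 4, 8, 1024, 64
q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) for _ in range(3))

# The mask_mod returns True wherever attention is allowed: here, a standard causal mask.
def causal_mask(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

# B=None / H=None broadcasts the same mask over all batches and heads.
block_mask = create_block_mask(causal_mask, B=None, H=None, Q_LEN=S, KV_LEN=S)
out = flex_attention(q, k, v, block_mask=block_mask)
```

(In practice you would typically wrap `flex_attention` with `torch.compile` to get the fused kernel.)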
- To do so, you concatenate all the sequences into one long sequence (a batch of size 1) and carve the attention matrix into a block-diagonal one (possibly with a causal structure in each block) so that sequences cannot attend to each other. Magic! 3/3
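A sketch of that padding-free trick, under the assumption of three hypothetical sequences of lengths 5, 3, and 7: a document-id lookup turns the mask into a block-diagonal, per-sequence causal mask.

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

lengths = [5, 3, 7]          # hypothetical sequence lengths, packed back to back
total = sum(lengths)

# document_id[i] tells which original sequence token i belongs to.
document_id = torch.cat(
    [torch.full((l,), i, dtype=torch.long) for i, l in enumerate(lengths)]
).cuda()

def doc_causal_mask(b, h, q_idx, kv_idx):
    same_doc = document_id[q_idx] == document_id[kv_idx]
    return same_doc & (q_idx >= kv_idx)   # block-diagonal AND causal inside each block

H, D = 8, 64
q, k, v = (torch.randn(1, H, total, D, device="cuda", dtype=torch.float16) for _ in range(3))
block_mask = create_block_mask(doc_causal_mask, B=None, H=None, Q_LEN=total, KV_LEN=total)
out = flex_attention(q, k, v, block_mask=block_mask)
```

Because the mask is materialized only at the block level, the fully masked-out cross-sequence blocks are skipped entirely, which is what makes this competitive with dedicated varlen kernels.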