First project of 2025: Vision Transformer Explorer
I built a web app for interactively exploring the self-attention maps produced by ViTs. It reveals what the model focuses on when making predictions and provides insight into its inner workings! 🤯
Try it out yourself! 👇
The app loads a small DINOv2 model into the user's browser and runs it locally using Transformers.js! 🤗
This means you can analyze your own images for free: simply click the image to open the file dialog.
For example, the model recognizes that long necks and fluffy ears are defining features of llamas! 🦙
Vision Transformers work by dividing an image into fixed-size patches (e.g., 14 × 14 pixels), flattening each patch into a vector, and treating each patch as a token.
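To make the patch-to-token step concrete, here's a minimal NumPy sketch (not the app's code; the 224 × 224 image size and the `patchify` helper are illustrative assumptions, only the 14-pixel patch size comes from the post):

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int = 14) -> np.ndarray:
    """Split an (H, W, C) image into flattened patch tokens.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    i.e. one flattened vector ("token") per patch.
    """
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0, "image must be divisible by patch size"
    # Reshape into a grid of patches, then flatten each patch into a vector.
    patches = image.reshape(
        H // patch_size, patch_size,
        W // patch_size, patch_size, C,
    )
    patches = patches.transpose(0, 2, 1, 3, 4)                # (grid_h, grid_w, p, p, C)
    return patches.reshape(-1, patch_size * patch_size * C)   # (grid_h * grid_w, p*p*C)

# Example: a 224x224 RGB image with 14x14 patches -> 16*16 = 256 tokens of length 588
image = np.random.rand(224, 224, 3).astype(np.float32)
tokens = patchify(image)
print(tokens.shape)  # (256, 588)
```

In an actual ViT, a learned linear projection (usually implemented as a strided convolution) then maps each flattened patch to the model's embedding dimension before the transformer layers see it.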
It's fascinating to see what each attention head learns to "focus on". For example, layer 11, head 1 seems to identify eyes. Spooky! 👀
Another interesting thing to see is how the attention maps become far more refined in later layers of the transformer (a short code sketch for extracting these maps follows below). For example:
First layer (1) – noisy and diffuse, capturing broad general patterns.
Last layer (12) – focused and precise, highlighting specific features.
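The app pulls these attention tensors out in the browser with Transformers.js, but the same idea is easy to sketch in Python with the transformers library. A minimal, hedged example: the facebook/dinov2-small checkpoint, the local image path, and the `cls_attention_map` helper are assumptions for illustration, not the demo's exact code.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Checkpoint is illustrative; any DINOv2-style ViT checkpoint should work similarly.
checkpoint = "facebook/dinov2-small"
processor = AutoImageProcessor.from_pretrained(checkpoint)
# "eager" attention so that per-head attention weights can be returned.
model = AutoModel.from_pretrained(checkpoint, attn_implementation="eager")

image = Image.open("llama.jpg")  # path is illustrative: use any local image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions is a tuple with one tensor per layer,
# each of shape (batch, num_heads, seq_len, seq_len).
attentions = outputs.attentions
num_layers = len(attentions)        # 12 for the small model
num_heads = attentions[0].shape[1]  # 6 for the small model

def cls_attention_map(layer: int, head: int) -> torch.Tensor:
    """Attention from the [CLS] token to every image patch, as a 2D grid."""
    attn = attentions[layer][0, head]  # (seq_len, seq_len)
    cls_to_patches = attn[0, 1:]       # drop the [CLS] -> [CLS] entry
    side = int(cls_to_patches.numel() ** 0.5)
    return cls_to_patches[: side * side].reshape(side, side)

early = cls_attention_map(layer=0, head=0)               # noisy and diffuse
late = cls_attention_map(layer=num_layers - 1, head=0)   # sharp and object-focused
eyes = cls_attention_map(layer=10, head=0)               # "layer 11, head 1" if counting from 1
```

Plotting `early` next to `late` (e.g., with matplotlib's `imshow`) reproduces the noisy-to-focused progression described above.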
This project was greatly inspired by Brendan Bycroft's amazing LLM Visualization tool – check it out if you haven't already! Also, thanks to Niels Rogge for adding DINOv2 w/ Registers to transformers! 🤗
Source code:
github.com/huggingface/...
Online demo:
huggingface.co/spaces/webml...
Jan 1, 2025 15:37