Let me continue the "or not" with this recent paper:
arxiv.org/abs/2602.02493
Simple but very effective idea building on top of JiT: because you're predicting x directly, you can add perceptual losses on top of flow matching. In the paper, they use a "DINO perceptual loss", and I'm going to argue...

PixelGen: Pixel Diffusion Beats Latent Diffusion with Perceptual Loss
Pixel diffusion generates images directly in pixel space in an end-to-end manner, avoiding the artifacts and bottlenecks introduced by VAEs in two-stage latent diffusion. However, it is challenging to...
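
To make the idea in the post concrete, here is a rough sketch of why x-prediction makes a perceptual term easy to bolt on: the network output is already an image, so you can push it through a frozen feature extractor and compare against the clean image's features. Everything here is an assumption for illustration, not the paper's code: the interpolation convention (z_t = t*x + (1-t)*eps, with t=1 being clean data), the v-space flow matching loss, the weight `lam`, and especially `make_frozen_features`, which is a hypothetical random-projection stand-in for a real frozen encoder such as DINO.

```python
import numpy as np

def make_frozen_features(dim_in, dim_out, seed=0):
    """Hypothetical stand-in for a frozen encoder (NOT DINO): fixed random projection + tanh."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((dim_in, dim_out)) / np.sqrt(dim_in)

    def features(images):
        flat = images.reshape(images.shape[0], -1)  # (batch, C*H*W)
        return np.tanh(flat @ w)                    # (batch, dim_out)

    return features

def combined_loss(x_pred, x, eps, t, features, lam=0.1, t_eps=1e-3):
    """Flow matching loss in v-space plus a feature-space (perceptual) loss.

    x_pred: the network's direct prediction of the clean image, shape (B, C, H, W)
    x, eps: clean images and Gaussian noise, same shape
    t:      per-sample times in [0, 1), shape (B,)
    lam:    assumed weight on the perceptual term
    """
    t = t.reshape(-1, 1, 1, 1)
    z_t = t * x + (1.0 - t) * eps          # assumed noising convention
    denom = np.maximum(1.0 - t, t_eps)     # avoid dividing by ~0 near t = 1
    v_pred = (x_pred - z_t) / denom        # velocity implied by the x-prediction
    v_target = x - eps                     # true velocity under this convention
    fm = np.mean((v_pred - v_target) ** 2)
    # Only possible because x_pred lives in image space:
    perc = np.mean((features(x_pred) - features(x)) ** 2)
    return fm + lam * perc, fm, perc

# Tiny usage example with random data standing in for images and a model output.
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 8, 8))
eps = rng.standard_normal((2, 3, 8, 8))
t = np.array([0.3, 0.8])
feats = make_frozen_features(3 * 8 * 8, 16)
x_pred = x + 0.1 * rng.standard_normal(x.shape)  # pretend model output
total, fm, perc = combined_loss(x_pred, x, eps, t, feats)
print(total, fm, perc)
```

With a velocity- or noise-predicting network you would first have to convert the output back to an image estimate before computing features. Predicting x directly skips that step, which is the point the post is making.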