
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
Papers Read on AI
Efficient Patch-Level Video Modeling with Transformer Architectures
The chapter explores the methodology of modeling visual data at the patch level without resizing videos, focusing on spatial-temporal patchification of low-dimensional latent representations with injected noise. It discusses cost-efficient solutions for video compression and representation, and the integration of transformer architectures into diffusion models. The chapter also compares various models for image and text-to-image generation, highlighting the roles of masked modeling and context-aware positional embeddings.
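The spatial-temporal patchification described above can be sketched as follows. This is a minimal, hypothetical illustration (not the paper's or Sora's actual implementation): a latent video tensor is split into non-overlapping spacetime patches, each patch is flattened into a token, and Gaussian noise is injected per a standard diffusion forward process. The function names and the patch sizes are illustrative assumptions.

```python
import numpy as np

def patchify(latent, pt, ph, pw):
    """Split a latent video (T, H, W, C) into non-overlapping
    spacetime patches of size (pt, ph, pw) and flatten each
    patch into one token vector. Illustrative sketch only."""
    T, H, W, C = latent.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    x = latent.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # Group the patch-grid axes together, then the within-patch axes.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(-1, pt * ph * pw * C)

def add_noise(tokens, alpha_bar, rng):
    """Standard diffusion forward step: mix clean tokens with
    Gaussian noise according to the noise schedule value alpha_bar."""
    eps = rng.standard_normal(tokens.shape)
    return np.sqrt(alpha_bar) * tokens + np.sqrt(1.0 - alpha_bar) * eps

# Example: an 8-frame, 16x16 latent video with 4 channels,
# cut into 2x4x4 spacetime patches -> 64 tokens of dimension 128.
rng = np.random.default_rng(0)
latent = rng.standard_normal((8, 16, 16, 4))
tokens = patchify(latent, pt=2, ph=4, pw=4)
noisy = add_noise(tokens, alpha_bar=0.9, rng=rng)
print(tokens.shape)  # (64, 128)
```

Because the token count depends only on the video's own dimensions, videos of different durations and resolutions yield different-length token sequences without any resizing, which a transformer can consume directly.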