Developing large video models poses significant challenges compared to text-based models because of the complexity and richness of video data. Unlike text, where a model predicts a distribution over a finite vocabulary of discrete tokens, video prediction requires modeling distributions over all possible future frames: high-dimensional continuous spaces whose fine-grained detail makes accurate prediction difficult. Models that incorporate latent variables to capture unobserved information have largely failed, as have attempts to use neural networks, GANs, VAEs, and related methods to predict missing parts of videos or images. These failures contrast with the success of analogous prediction methods in text-based models such as LLMs, underscoring how much harder it is to learn effective representations for video data.
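To make the dimensionality gap concrete, here is a back-of-the-envelope sketch comparing the output space of next-token prediction with that of next-frame prediction. The vocabulary size and frame resolution below are illustrative assumptions, not figures from the text or from any specific model.

```python
# Back-of-the-envelope comparison of output spaces for next-token vs
# next-frame prediction. Values are illustrative assumptions.

VOCAB_SIZE = 50_000                     # assumed LLM vocabulary (discrete tokens)
HEIGHT, WIDTH, CHANNELS = 256, 256, 3   # assumed resolution of one RGB frame

# A text model outputs a single categorical distribution over the vocabulary:
text_output_dim = VOCAB_SIZE

# A video model must describe a distribution over every pixel value of the
# next frame, a continuous space with this many dimensions:
frame_dim = HEIGHT * WIDTH * CHANNELS

print(text_output_dim)  # 50000
print(frame_dim)        # 196608
```

Even at this modest resolution, a single frame is a continuous space of roughly 200,000 dimensions, and that is before accounting for temporal dependencies across many frames, which is one way to see why distribution modeling that works for discrete text does not transfer directly to video.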