Multimodal LM roundup: Unified IO 2, inputs and outputs, Gemini, LLaVA-RLHF, and RLHF questions
Jan 10, 2024
This podcast covers recent developments in the multimodal space: the Unified IO 2 model, collecting preference data for images versus text, LLaVA-RLHF experiments, and the open challenges in multimodal RLHF. The hosts explore the architecture choices behind multimodal models, the potential role of GPT-4V in multimodal RLHF pipelines, the need for clearer terminology, and the growing adoption of synthetic data in this context.
Multimodal models enable large language models to understand visual information, and encoder-decoder designs offer more versatility than decoder-only models.
Unified IO 2 is the first auto-regressive multimodal model capable of understanding and generating images, text, audio, and action.
Deep dives
Multimodal Models and Their Importance
Multimodal models aim to let large language models understand visual information. With the increasing prevalence of visual media, image inputs provide a richer training set. On the output side, models like Gemini can natively generate images, which opens up new possibilities for creative work and information processing. By separating generation from information processing, multimodal models can follow an encoder-decoder architecture, offering more versatility than decoder-only models.
Unified IO2 and Tokenization Challenges
Unified IO 2 is the first auto-regressive multimodal model capable of understanding and generating images, text, audio, and action. To unify modalities, inputs and outputs are tokenized into a shared semantic space and processed by a single encoder-decoder transformer model. Tokenization is a core challenge for multimodal models, particularly sharing one token space across very different signals. Comparing Unified IO 2 to models like Flamingo and Gemini highlights differences in architecture and training methods.
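The shared-token-space idea can be sketched in a few lines. This is a simplified illustration of the general technique, not Unified IO 2's actual tokenizer: each modality is first discretized by its own tokenizer or codebook (the vocabulary sizes below are assumptions), and the modality-local ids are offset into disjoint ranges of one shared vocabulary so a single transformer can consume any mix of modalities as one token sequence.

```python
# Assumed vocabulary sizes for this sketch (not Unified IO 2's real values).
TEXT_VOCAB = 32_000   # text subword vocabulary
IMAGE_VOCAB = 8_192   # image VQ codebook
AUDIO_VOCAB = 4_096   # audio codebook

# Offset each modality into a disjoint slice of the shared id space.
TEXT_OFFSET = 0
IMAGE_OFFSET = TEXT_OFFSET + TEXT_VOCAB
AUDIO_OFFSET = IMAGE_OFFSET + IMAGE_VOCAB

def to_shared_ids(modality: str, local_ids: list[int]) -> list[int]:
    """Map modality-local token ids into the shared vocabulary."""
    offset = {"text": TEXT_OFFSET, "image": IMAGE_OFFSET, "audio": AUDIO_OFFSET}[modality]
    return [offset + i for i in local_ids]

# A mixed sequence as the transformer would see it: text tokens, then image codes.
sequence = to_shared_ids("text", [5, 17]) + to_shared_ids("image", [0, 1])
# Image code 0 lands at id 32_000, disjoint from every text id.
```

In a real system each slice of the shared vocabulary maps back to its own decoder (detokenizer, image decoder, vocoder), which is what lets one auto-regressive model both understand and generate across modalities.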
Multimodal RLHF and Data Challenges
Multimodal RLHF, as in LLaVA-RLHF and RLHF-V, involves fine-tuning large language-and-vision models with reinforcement learning from human feedback. The challenge lies in aligning modalities and avoiding hallucination, where textual outputs are not grounded in the multimodal context. Generating high-quality datasets for multimodal RLHF is difficult, and collecting preference data over images is harder still than collecting it over text. Synthetic data generation may offer a cost-effective alternative to querying large-scale models like GPT-4 for RLHF data.
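The preference data discussed above is typically consumed by a reward model trained with a pairwise (Bradley-Terry style) loss, the same objective used in text-only RLHF. A minimal sketch, with placeholder scalar rewards; in a multimodal setup these scores would come from a reward model conditioned on both the image and the candidate text response:

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log sigmoid(r_chosen - r_rejected): small when the chosen
    response scores well above the rejected one."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the chosen response's reward pulls ahead.
confident = preference_loss(2.0, -1.0)  # chosen clearly preferred
uncertain = preference_loss(0.1, 0.0)   # nearly tied pair
```

Whether the preference pairs come from human annotators, from a stronger model like GPT-4V, or from synthetic generation only changes how `reward_chosen`/`reward_rejected` pairs are labeled, not this objective, which is why data collection cost dominates the discussion.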