Fatih Porikli from Qualcomm AI Research discusses the team's research at CVPR, covering topics like text-to-image generation, video reasoning using language models, real-time 360° image generation, visual reasoning for mathematical plots, and more. They also touch on demos like multi-modal vision-language models and parameter-efficient fine-tuning on mobile phones.
Podcast summary created with Snipd AI
Quick takeaways
Efficient text-to-image generation with diffusion models for increased speed and lower computational costs.
Grounding video-language models in object-specific information (location, type, movement) to improve reasoning accuracy and efficiency.
Utilizing text-conditioned 360-degree HDR image generation for real-time portrait relighting at the edge, achieving natural lighting effects.
Deep dives
Qualcomm's Contribution to GenAI Solutions and CVPR Research
Qualcomm AI Research focuses on advancing core AI capabilities like perception, reasoning, and action across devices. The team is at the forefront of research in generative AI, working on solutions like LLMs, LVMs, Stable Diffusion image generators, and video question answering for mobile applications, autonomous driving, XR, robotics, and more.
Improving Image Generation with the Clockwork Diffusion Paper
The Clockwork Diffusion paper presents a text-to-image diffusion model that saves computation by periodically replacing the UNet's expensive low-resolution middle layers with a lightweight approximating network that reuses features from earlier denoising steps. This yields roughly 30% savings in computation and accelerates image generation while maintaining output quality.
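To make the clockwork pattern concrete, here is a minimal PyTorch sketch, assuming a toy UNet-like model (the layer sizes and module names are hypothetical, not the paper's actual architecture): the expensive middle block runs only every few denoising steps, while a cheap adaptor reuses cached features in between.

```python
import torch
import torch.nn as nn

class ClockworkUNet(nn.Module):
    """Toy UNet-like model: the costly middle block runs every `clock`
    steps; in-between steps approximate it from the cached features."""
    def __init__(self, dim: int = 64, clock: int = 4):
        super().__init__()
        self.encoder = nn.Conv2d(4, dim, 3, padding=1)      # high-res stage
        self.middle = nn.Sequential(                        # expensive low-res stage
            nn.Conv2d(dim, dim, 3, padding=1), nn.SiLU(),
            nn.Conv2d(dim, dim, 3, padding=1), nn.SiLU(),
        )
        self.adaptor = nn.Conv2d(dim, dim, 1)               # cheap approximator
        self.decoder = nn.Conv2d(dim, 4, 3, padding=1)
        self.clock = clock
        self._cache = None

    def forward(self, x, step):
        h = self.encoder(x)
        if step % self.clock == 0 or self._cache is None:
            mid = self.middle(h)                # full computation
        else:
            mid = self.adaptor(self._cache)     # approximate from cached features
        self._cache = mid.detach()
        return self.decoder(mid)

unet = ClockworkUNet()
x = torch.randn(1, 4, 32, 32)                   # toy latent
with torch.no_grad():
    for t in range(20):
        x = x - 0.1 * unet(x, t)                # placeholder denoising update
```

With `clock=4`, only every fourth step pays for the middle block, which is the spirit of the reported ~30% compute savings.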
Enhancing Video Reasoning with Grounded Reasoning Models
The 'Look, Remember and Reason' paper introduces a video-language model that grounds language prompts in object-specific information like object location, type, and movement. By interleaving stochastic probing questions about these low-level details during training, the model learns grounded representations and provides efficient, accurate responses at inference.
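As a rough illustration of those probing questions (the field names and answer format here are hypothetical, not the paper's data pipeline), low-level grounding prompts can be randomly interleaved with the main reasoning question during training:

```python
import random

def interleave_probes(sample, p=0.5):
    """Randomly prepend low-level grounding probes (object location/type)
    to the main reasoning question. All field names are hypothetical."""
    probes = []
    for obj in sample["objects"]:
        if random.random() < p:
            probes.append(f"Where is the {obj['name']}? -> bbox {obj['bbox']}")
            probes.append(f"What is at {obj['bbox']}? -> {obj['name']}")
    return probes + [sample["question"]]

sample = {
    "objects": [{"name": "red ball", "bbox": (40, 60, 90, 110)}],
    "question": "Which object moved left after the collision?",
}
print(interleave_probes(sample))
```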
Generating High Dynamic Range Images for Real-Time Portrait Relighting
The EdgeRelight360 paper focuses on text-conditioned 360-degree HDR image generation for real-time portrait relighting on device. Using the generated high-dynamic-range environment maps as lighting, the system can rotate the environment around the subject, producing natural and visually appealing relighting effects.
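One reason the rotation is cheap enough for real-time use on device: rotating an equirectangular HDR map about the vertical axis is just a horizontal pixel shift. A minimal NumPy sketch, illustrative only and not the paper's pipeline:

```python
import numpy as np

def rotate_env_map(env: np.ndarray, degrees: float) -> np.ndarray:
    """Rotate an (H, W, 3) equirectangular environment map about the
    vertical axis by shifting pixels along the azimuth (width) axis."""
    shift = int(round(env.shape[1] * degrees / 360.0))
    return np.roll(env, shift, axis=1)

# toy HDR map: a bright 'sun' patch on a dim background
env = np.full((64, 128, 3), 0.05, dtype=np.float32)
env[20:28, 10:18] = 50.0                   # HDR radiance can far exceed 1.0
relit_env = rotate_env_map(env, 90.0)      # light now arrives from the side
```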
Enhancing Multimodal Language Models with Speculative Decoding
The Speculative Decoding for Multimodal Language Models paper introduces a method that accelerates inference by pairing the full-size model with a small 'draft' model: the draft cheaply proposes several candidate tokens, and the full model verifies them all in a single parallel pass. This speeds up token generation and extends to various multimodal applications while preserving prediction quality.
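Here is a minimal greedy sketch of that draft-then-verify loop, using toy stand-in models (the paper's actual method handles multimodal inputs, and standard speculative decoding uses probabilistic acceptance rather than the exact greedy matching shown here):

```python
import torch
import torch.nn as nn

class ToyLM(nn.Module):
    """Stand-in 'language model': embedding + linear head, illustrative only."""
    def __init__(self, vocab: int = 100):
        super().__init__()
        self.emb = nn.Embedding(vocab, 32)
        self.head = nn.Linear(32, vocab)
    def forward(self, ids):                 # (B, T) -> (B, T, vocab) logits
        return self.head(self.emb(ids))

def speculative_decode(target, draft, tokens, k=4, rounds=8):
    """Greedy speculative decoding: the draft proposes k tokens, the target
    scores the whole extension in one pass, and the longest agreeing prefix
    is kept, plus one corrected token from the target."""
    for _ in range(rounds):
        proposal = tokens
        for _ in range(k):                  # cheap sequential drafting
            nxt = draft(proposal).argmax(-1)[:, -1:]
            proposal = torch.cat([proposal, nxt], dim=1)
        verified = target(proposal).argmax(-1)   # one expensive parallel pass
        n = tokens.shape[1]
        accepted = 0
        for i in range(k):
            # the target's prediction at position n+i-1 is its choice for n+i
            if proposal[0, n + i] == verified[0, n + i - 1]:
                accepted += 1
            else:
                break
        fix = verified[:, n + accepted - 1 : n + accepted]
        tokens = torch.cat([proposal[:, : n + accepted], fix], dim=1)
    return tokens

with torch.no_grad():
    out = speculative_decode(ToyLM(), ToyLM(), torch.zeros(1, 1, dtype=torch.long))
```

The speedup comes from the verification step: the expensive model runs once per round over all k drafted tokens instead of once per token.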
Today we’re joined by Fatih Porikli, senior director of technology at Qualcomm AI Research. In our conversation, we covered several of the Qualcomm team’s 16 accepted main-track and workshop papers at this year’s CVPR conference. The papers span a variety of generative AI and traditional computer vision topics, with an emphasis on increased training and inference efficiency for mobile and edge deployment. We explore efficient diffusion models for text-to-image generation, grounded reasoning in videos using language models, real-time on-device 360° image generation for video portrait relighting, a unique video-language model for situated interactions like fitness coaching, a visual reasoning model and benchmark for interpreting complex mathematical plots, and more! We also touched on several of the demos the team will be presenting at the conference, including multi-modal vision-language models (LLaVA) and parameter-efficient fine-tuning (LoRA) on mobile phones.