AI Breakdown

agibreakdown
Jan 29, 2025 • 4min

Arxiv paper - Improving Video Generation with Human Feedback

In this episode, we discuss Improving Video Generation with Human Feedback by Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Wenyu Qin, Menghan Xia, Xintao Wang, Xiaohong Liu, Fei Yang, Pengfei Wan, Di Zhang, Kun Gai, Yujiu Yang, Wanli Ouyang. The paper introduces a pipeline that utilizes human feedback to enhance video generation, addressing issues like unsmooth motion and prompt-video misalignment. It presents **VideoReward**, a multi-dimensional reward model trained on a large-scale human preference dataset, and develops three alignment algorithms—Flow-DPO, Flow-RWR, and Flow-NRG—to optimize flow-based video models. Experimental results show that VideoReward outperforms existing models, Flow-DPO achieves superior performance over other methods, and Flow-NRG allows for personalized video quality adjustments during inference.
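As a rough illustration of the Flow-DPO idea, here is a minimal sketch of a Diffusion-DPO-style preference loss adapted to a flow-matching model: the policy is rewarded for fitting the human-preferred video better than a frozen reference model, and the rejected video worse. The model interface, the shared-noise setup, and the beta value are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def flow_dpo_loss(model, ref_model, x_win, x_lose, cond, beta=500.0):
    """DPO-style preference loss for a flow-matching video model (sketch).

    x_win / x_lose: latents of the preferred / rejected video for one prompt.
    model(x_t, t, cond) and ref_model(x_t, t, cond) predict the velocity field.
    """
    def fm_errors(x1):
        # Shared timestep and noise so policy and reference see the same input.
        t = torch.rand(x1.shape[0], device=x1.device)
        x0 = torch.randn_like(x1)
        t_ = t.view(-1, *([1] * (x1.dim() - 1)))
        x_t = (1 - t_) * x0 + t_ * x1               # linear interpolation path
        target = x1 - x0                             # straight-line velocity target
        err = lambda net: ((net(x_t, t, cond) - target) ** 2).flatten(1).mean(1)
        with torch.no_grad():
            ref_err = err(ref_model)
        return err(model), ref_err

    win_err, win_ref = fm_errors(x_win)
    lose_err, lose_ref = fm_errors(x_lose)

    # A negative margin means the policy fits the winner better (relative to
    # the reference) than it fits the loser, which the loss encourages.
    margin = (win_err - win_ref) - (lose_err - lose_ref)
    return -F.logsigmoid(-beta * margin).mean()
```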
Jan 28, 2025 • 6min

Arxiv paper - Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

In this episode, we discuss Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling by Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan. The paper introduces Janus-Pro, an enhanced version of the original Janus model that features an optimized training strategy, expanded training data, and a larger model size. These improvements lead to significant advancements in multimodal understanding, text-to-image instruction-following capabilities, and the stability of text-to-image generation. Additionally, the authors have made the code and models publicly available to encourage further research and exploration in the field.
Jan 27, 2025 • 5min

Arxiv paper - DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

In this episode, we discuss DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning by DeepSeek-AI. The paper introduces DeepSeek-R1-Zero, a reasoning model trained solely with large-scale reinforcement learning, which exhibits strong reasoning abilities but struggles with readability and language mixing. To overcome these limitations, the authors developed DeepSeek-R1 by adding multi-stage training and cold-start data, achieving performance on par with OpenAI’s models. Additionally, they open-sourced DeepSeek-R1-Zero, DeepSeek-R1, and six distilled dense models to support the research community.
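To make the "large-scale reinforcement learning" ingredient concrete, here is a minimal sketch of two pieces the R1 recipe builds on: a rule-based reward (format plus answer correctness) and group-relative, GRPO-style advantages that avoid training a separate value model. The tag names, bonus values, and normalization constant are illustrative assumptions rather than the paper's exact rules.

```python
import re
import torch

def rule_based_reward(completion: str, reference_answer: str) -> float:
    """Toy rule-based reward: a small format bonus plus answer correctness."""
    reward = 0.0
    if re.search(r"<think>.*</think>\s*<answer>.*</answer>", completion, re.S):
        reward += 0.1                       # followed the reasoning format
    m = re.search(r"<answer>(.*?)</answer>", completion, re.S)
    if m and m.group(1).strip() == reference_answer.strip():
        reward += 1.0                       # final answer matches the reference
    return reward

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """GRPO-style advantages: normalize rewards within the group of samples
    drawn for the same prompt, so no learned critic is needed."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

# Example: four sampled completions for one prompt, scored by the rule above.
rewards = torch.tensor([1.1, 0.1, 0.0, 1.0])
print(group_relative_advantages(rewards))
```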
Jan 24, 2025 • 4min

Arxiv paper - Can We Generate Images with CoT? Let’s Verify and Reinforce Image Generation Step by Step

In this episode, we discuss Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step by Ziyu Guo, Renrui Zhang, Chengzhuo Tong, Zhizheng Zhao, Peng Gao, Hongsheng Li, Pheng-Ann Heng. The paper investigates the use of Chain-of-Thought (CoT) reasoning to improve autoregressive image generation through techniques like test-time computation scaling, Direct Preference Optimization (DPO), and their integration. The authors introduce the Potential Assessment Reward Model (PARM) and an enhanced version, PARM++, which evaluate and refine image generation for better performance, showing significant improvements over baseline models in benchmarks. The study offers insights into applying CoT reasoning to image generation, achieving notable advancements and releasing code and models for further research.
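The test-time computation scaling mentioned above can be pictured as branch-and-select search guided by a reward model. The sketch below assumes placeholder step_fn and reward_fn callables and does not model PARM's actual potential assessment stages or PARM++'s reflection mechanism.

```python
from typing import Callable, TypeVar

S = TypeVar("S")

def stepwise_generate(init: S, step_fn: Callable[[S], S],
                      reward_fn: Callable[[S], float],
                      n_branch: int = 4, n_steps: int = 3) -> S:
    """Step-level test-time scaling: at each generation stage, branch the
    partial result several ways, score candidates with a reward model, and
    continue only from the best one."""
    state = init
    for _ in range(n_steps):
        branches = [step_fn(state) for _ in range(n_branch)]
        state = max(branches, key=reward_fn)
    return state

def best_of_n(generate: Callable[[], S], reward_fn: Callable[[S], float],
              n: int = 8) -> S:
    """Simpler variant: sample N complete images and keep the top-scoring one."""
    return max((generate() for _ in range(n)), key=reward_fn)
```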
Jan 23, 2025 • 5min

Arxiv paper - Improving Factuality with Explicit Working Memory

In this episode, we discuss Improving Factuality with Explicit Working Memory by Mingda Chen, Yang Li, Karthik Padthe, Rulin Shao, Alicia Sun, Luke Zettlemoyer, Gargi Ghosh, Wen-tau Yih. The paper presents Ewe, a novel method that incorporates explicit working memory into large language models to improve factuality in long-form text generation by updating memory in real-time based on feedback from external resources. Ewe demonstrates superior performance over existing approaches across four datasets, boosting the VeriScore metric without compromising response helpfulness. The study highlights the significance of memory update rules, configuration, and retrieval datastore quality in enhancing the model's accuracy.
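The control flow behind the idea can be sketched as a draft-verify-refresh loop: generate a chunk, check it against retrieved evidence, and refresh an explicit memory (then redraft) when a check fails. The lm_step, retrieve, and verify interfaces are placeholders; the actual Ewe design updates memory units the model attends to in real time, rather than this simplified outer loop.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class WorkingMemory:
    """Explicit working memory: evidence passages the model conditions on."""
    passages: List[str] = field(default_factory=list)

    def refresh(self, retrieve: Callable[[str], List[str]], query: str, k: int = 4):
        # Replace memory with the freshest evidence for the questionable claim.
        self.passages = retrieve(query)[:k]

def generate_with_memory(lm_step, retrieve, verify, prompt, max_chunks=10):
    """Draft-verify-refresh loop (sketch): lm_step drafts the next chunk given
    the prompt, text so far, and memory; verify checks it against the memory."""
    memory, text = WorkingMemory(), ""
    for _ in range(max_chunks):
        chunk = lm_step(prompt, text, memory.passages)
        if not chunk:                                      # model signals done
            break
        if not verify(chunk, memory.passages):             # factuality check failed
            memory.refresh(retrieve, chunk)                # pull in fresh evidence
            chunk = lm_step(prompt, text, memory.passages) or chunk
        text += chunk
    return text
```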
Jan 17, 2025 • 4min

Arxiv paper - Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control

In this episode, we discuss Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control by Zekai Gu, Rui Yan, Jiahao Lu, Peng Li, Zhiyang Dou, Chenyang Si, Zhen Dong, Qifeng Liu, Cheng Lin, Ziwei Liu, Wenping Wang, Yuan Liu. The paper introduces "Diffusion as Shader" (DaS), a novel approach that supports various video control tasks within a unified framework by utilizing 3D control signals, overcoming the limitations of existing methods which are typically restricted to 2D signals. DaS achieves precise video manipulation, such as camera control and content editing, by employing 3D tracking videos, resulting in enhanced temporal consistency. The approach was fine-tuned within three days using 8 H800 GPUs and demonstrates strong performance in tasks like mesh-to-video generation and motion transfer, with further resources available online.
Jan 13, 2025 • 4min

Arxiv paper - FaceLift: Single Image to 3D Head with View Generation and GS-LRM

In this episode, we discuss FaceLift: Single Image to 3D Head with View Generation and GS-LRM by Weijie Lyu, Yi Zhou, Ming-Hsuan Yang, Zhixin Shu. FaceLift is a feed-forward approach for rapid and high-quality 360-degree head reconstruction using a single image, utilizing a multi-view latent diffusion model followed by a GS-LRM reconstructor to create 3D representations from generated views. It is trained primarily on synthetic datasets, showing strong real-world generalization, and outperforms existing 3D head reconstruction methods. Additionally, FaceLift enables 4D novel view synthesis for video inputs and can be integrated with 2D reanimation techniques for 3D facial animation.
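The described pipeline is a straightforward two-stage composition, sketched below with placeholder view_diffusion and gs_lrm callables standing in for the multi-view latent diffusion model and the GS-LRM reconstructor.

```python
def facelift_pipeline(image, camera_poses, view_diffusion, gs_lrm):
    """Two-stage sketch: single image -> synthesized surrounding views ->
    feed-forward 3D Gaussian reconstruction."""
    # Stage 1: hallucinate consistent views around the head from one photo.
    views = view_diffusion(image, camera_poses)
    # Stage 2: regress 3D Gaussian splat parameters from the generated views.
    gaussians = gs_lrm(views, camera_poses)
    return gaussians  # render novel views with any Gaussian-splatting rasterizer
```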
Jan 8, 2025 • 4min

Arxiv paper - GenHMR: Generative Human Mesh Recovery

In this episode, we discuss GenHMR: Generative Human Mesh Recovery by Muhammad Usama Saleem, Ekkasit Pinyoanuntapong, Pu Wang, Hongfei Xue, Srijan Das, Chen Chen. The paper introduces GenHMR, a novel generative framework for human mesh recovery (HMR) that addresses uncertainties in lifting 2D images to 3D meshes. It employs a pose tokenizer and an image-conditional masked transformer to learn distributions of pose tokens, improving upon deterministic and probabilistic approaches. The model also includes a 2D pose-guided refinement technique and demonstrates superior performance compared to current methods.
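One way to picture the image-conditional masked transformer is MaskGIT-style iterative decoding over discrete pose tokens: start fully masked, then repeatedly commit the most confident predictions. The interfaces, the commit schedule, and the omission of the paper's 2D pose-guided refinement are simplifying assumptions in this sketch.

```python
import torch

@torch.no_grad()
def iterative_masked_decode(transformer, image_feat, seq_len, mask_id, steps=8):
    """Confidence-based iterative decoding of discrete pose tokens (sketch).

    transformer(tokens, image_feat) returns logits of shape (1, seq_len, vocab).
    """
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long)
    for step in range(steps):
        masked = tokens == mask_id
        if not masked.any():
            break
        logits = transformer(tokens, image_feat)
        conf, pred = logits.softmax(-1).max(-1)          # per-position confidence
        conf = conf.masked_fill(~masked, -1.0)           # only fill masked slots
        k = max(1, int(masked.sum()) // (steps - step))  # how many to commit now
        idx = conf.topk(k, dim=-1).indices
        tokens.scatter_(1, idx, pred.gather(1, idx))
    return tokens  # decode with the pose tokenizer to obtain mesh parameters
```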
Jan 6, 2025 • 4min

Arxiv paper - Video Creation by Demonstration

In this episode, we discuss Video Creation by Demonstration by Yihong Sun, Hao Zhou, Liangzhe Yuan, Jennifer J. Sun, Yandong Li, Xuhui Jia, Hartwig Adam, Bharath Hariharan, Long Zhao, Ting Liu. The paper introduces Video Creation by Demonstration, utilizing a method called 𝛿-Diffusion to generate videos that smoothly continue from a given context image, integrating actions from a demonstration video. This approach relies on self-supervised learning for future frame prediction in unlabeled videos, using implicit latent control for flexible video generation. The proposed method surpasses current baselines in both human and machine evaluations, showcasing potential for interactive world simulations.
Jan 2, 2025 • 5min

Arxiv paper - Byte Latent Transformer: Patches Scale Better Than Tokens

In this episode, we discuss Byte Latent Transformer: Patches Scale Better Than Tokens by Artidoro Pagnoni, Ram Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman, Srinivasan Iyer. The Byte Latent Transformer (BLT) presents a novel approach to large language models by processing data at the byte level, eliminating the need for traditional tokenization. It maintains performance comparable to tokenization-based models while offering improvements in efficiency, robustness, and scaling capability. BLT's dynamic encoding of bytes into variable-sized patches allows more efficient utilization of computational resources and successful scaling to larger model sizes, showcasing its potential in handling raw byte data without a fixed vocabulary.
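The dynamic encoding of bytes into variable-sized patches can be illustrated with entropy-based segmentation: a small byte-level model scores how surprising each next byte is, and a new patch starts wherever that entropy crosses a threshold, so predictable spans become long patches and hard spans become short ones. The next_byte_probs callable, the threshold, and the patch-size cap are placeholder assumptions, not BLT's exact segmentation rule.

```python
import math
from typing import Callable, List, Sequence

def entropy_patches(data: bytes,
                    next_byte_probs: Callable[[bytes], Sequence[float]],
                    threshold: float = 2.0,
                    max_patch: int = 16) -> List[bytes]:
    """Split a byte string into variable-sized patches at high-entropy points."""
    patches, start = [], 0
    for i in range(1, len(data)):
        probs = next_byte_probs(data[:i])        # distribution over 256 next bytes
        entropy = -sum(p * math.log(p) for p in probs if p > 0)
        if entropy > threshold or i - start >= max_patch:
            patches.append(data[start:i])        # close the current patch
            start = i
    patches.append(data[start:])
    return patches
```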
