RoboPapers

Ep#5: R+X: Retrieval & Execution from Human Videos

Apr 24, 2025
Norman Di Palo, a robotics research scientist at Google DeepMind, and Georgios Papagiannis, a PhD candidate at Imperial College, dive into groundbreaking advancements in robotics. They discuss how everyday human videos can train robots, emphasizing in-context learning and the challenges of using wearable cameras. The pair explore video retrieval systems, highlighting keyframe extraction and the fusion of vision with language models to improve task execution. Their insights illuminate the ongoing innovations in imitation learning and the potential for real-time knowledge adaptation in robotics.
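The keyframe extraction they mention can be illustrated with a simple frame-differencing heuristic. This is a minimal sketch of one generic approach, not necessarily the method used in R+X; the threshold value is an arbitrary illustration.

```python
# Minimal keyframe-extraction sketch (illustrative; not necessarily the
# heuristic used in R+X). A frame is kept whenever it differs enough from
# the last kept frame, measured by mean absolute pixel difference.
import cv2
import numpy as np

def extract_keyframes(video_path: str, diff_threshold: float = 30.0) -> list[np.ndarray]:
    cap = cv2.VideoCapture(video_path)
    keyframes, last_kept = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
        if last_kept is None or np.abs(gray - last_kept).mean() > diff_threshold:
            keyframes.append(frame)  # keep the full-color frame
            last_kept = gray
    cap.release()
    return keyframes
```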
ANECDOTE

Unlabeled Human Videos for Robots

  • They collect uninterrupted, unlabeled human videos by wearing a camera while performing everyday tasks in lab environments that mimic homes.
  • Given a natural language command, the robot retrieves relevant clips from this data and executes the task without training a new policy; a retrieval sketch follows below.
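A minimal sketch of language-conditioned clip retrieval using off-the-shelf CLIP embeddings, a common vision-language retrieval pattern. The model choice and the mean-pooling of keyframe embeddings are assumptions for illustration, not necessarily R+X's pipeline.

```python
# Sketch: retrieve the clips whose keyframes best match a language command,
# by cosine similarity between CLIP text and image embeddings.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_clip(keyframes) -> torch.Tensor:
    """Represent one clip as the mean of its keyframe embeddings (PIL images in)."""
    inputs = processor(images=keyframes, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return feats.mean(dim=0)

def retrieve(command: str, clip_embeddings: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Return indices of the k clips most similar to the command."""
    inputs = processor(text=[command], return_tensors="pt", padding=True)
    with torch.no_grad():
        text = model.get_text_features(**inputs)
    text = text / text.norm(dim=-1, keepdim=True)
    scores = clip_embeddings @ text.squeeze(0)  # cosine similarity per clip
    return scores.topk(k).indices
```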
ADVICE

Map Gripper to Human Hand

  • Use hand-tracking models like HaMeR to fit the robot's gripper to human hand poses in video.
  • Apply different heuristics per task class to accurately map gripper motions from the human demonstrations; one simple mapping is sketched below.
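One way such a heuristic might look for a parallel-jaw gripper, assuming 3D hand keypoints (wrist, thumb tip, index tip) from a tracker like HaMeR. The specific frame construction is illustrative, not the paper's exact mapping.

```python
# Illustrative mapping from tracked 3D hand keypoints to a gripper pose.
# One simple heuristic; per the episode, different task classes get
# different heuristics.
import numpy as np

def hand_to_gripper(wrist: np.ndarray, thumb_tip: np.ndarray, index_tip: np.ndarray):
    """All inputs are 3D points in the camera frame; returns pose + opening width."""
    grasp_center = (thumb_tip + index_tip) / 2.0   # gripper position
    approach = grasp_center - wrist                # z-axis: wrist toward fingertips
    approach /= np.linalg.norm(approach)
    closing = index_tip - thumb_tip                # x-axis: jaw closing direction
    closing -= approach * (closing @ approach)     # orthogonalize against approach
    closing /= np.linalg.norm(closing)
    normal = np.cross(approach, closing)           # y-axis completes the frame
    rotation = np.stack([closing, normal, approach], axis=1)  # 3x3 rotation matrix
    width = np.linalg.norm(index_tip - thumb_tip)  # gripper opening width
    return grasp_center, rotation, width
```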
INSIGHT

Keypoint Bottleneck Boosts Generalization

  • An information bottleneck built from keypoints improves spatial generalization: the policy attends to object locations rather than full RGB images.
  • Language models extrapolate action trajectories numerically from a few in-context examples, enabling robust manipulation across varied object poses (see the sketch below).
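A sketch of that idea: serialize demonstration (keypoints → trajectory) pairs as numbers in a prompt and ask a language model to continue the pattern for a new keypoint configuration. The prompt format here is hypothetical, and `llm_complete` is a stand-in for whatever LLM API is used, not an R+X function.

```python
# Sketch: keypoints as an information bottleneck + in-context trajectory
# generation by an LLM. Prompt format is illustrative only.
import numpy as np

def format_demo(keypoints: np.ndarray, trajectory: np.ndarray) -> str:
    """Serialize one demo as text: Nx3 keypoints and Mx3 gripper waypoints."""
    kp = ", ".join(f"({x:.3f}, {y:.3f}, {z:.3f})" for x, y, z in keypoints)
    tr = "; ".join(f"({x:.3f}, {y:.3f}, {z:.3f})" for x, y, z in trajectory)
    return f"keypoints: [{kp}]\ntrajectory: [{tr}]"

def build_prompt(demos, new_keypoints: np.ndarray) -> str:
    """demos: list of (keypoints, trajectory) arrays from retrieved clips."""
    header = ("Each example maps object keypoints (3D, meters) to a gripper "
              "trajectory. Complete the trajectory for the final example.\n\n")
    body = "\n\n".join(format_demo(kp, tr) for kp, tr in demos)
    kp = ", ".join(f"({x:.3f}, {y:.3f}, {z:.3f})" for x, y, z in new_keypoints)
    return header + body + f"\n\nkeypoints: [{kp}]\ntrajectory:"

# Usage (llm_complete is a hypothetical stand-in for an LLM call):
# trajectory_text = llm_complete(build_prompt(demos, new_keypoints))
```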