
Generative AI in the Real World: Robert Nishihara on AI and the Future of Data
Aug 28, 2025
29:58
Robert Nishihara is one of the creators of Ray and a cofounder of Anyscale, a platform for high-performance distributed data analysis and artificial intelligence. Ben Lorica and Robert discuss the need for data for the next generation of AI, which will be multimodal. What kinds of data will we need to develop models for video and multimodal data? And what kinds of tools will we use to prepare that data?
Points of Interest
- 1:06: Are we running out of data?
- 1:35: There is a paradigm shift in how the ML community thinks about AI. The innovation is on the data side: finding data, evaluating data sources, curating data, creating synthetic data, and filtering out low-quality data. Increasingly, people curate and process data using AI itself; filtering out low-quality or unimportant image data is an AI task.
- 5:02: A lot of existing tools were aimed at data warehouses and lakehouses. Now we increasingly have unstructured, multimodal data. What's the challenge for tooling?
- 5:44: Lots of companies have lots of data. They get value from it by running SQL queries on structured data, but structured data is limited. The real insight is in unstructured data, which will be analyzed using AI. Data processing will shift from SQL-centric to AI-centric, and tooling for multimodal data processing is almost nonexistent.
- 8:23: In part of the pipeline, you might be able to use CPUs instead of GPUs.
- 8:44: Data processing is not just running inference with an LLM. You might want to decompress video, re-encode video, find scene changes, transcribe, or classify. Some stages will be GPU bound, some will be memory bound, some will be CPU bound. You will want to be able to aggregate these different resources.
- 10:03: With this kind of data, it's assumed you will have to go distributed and scale out; there is no choice but to scale the computation.
- 10:46: In the past, we were only using structured data; now we have multimodal data. We're only scratching the surface of what we can do with video, so people weren't collecting as much of it. We will now collect more data.
- 11:41: We need to enable training on 100 times more data.
- 12:43: ML infrastructure teams are now on the critical path.
- 13:52: Companies at the cutting edge have been doing this, but nearly every company has its own data about its specific business that it can use to improve its platform. The value is there. The challenge is the tooling and the infrastructure.
- 15:15: There's another interesting angle around data and scale: experimentation. You will have to run experiments, and data processing is part of experimentation.
- 16:18: Customization isn't just at the level of the model. There are decisions to be made at every stage of the pipeline: what to collect, how to chunk, how to embed, how to do retrieval, what model to use, what data to use for fine-tuning. There are so many decisions to make. To iterate quickly, you need to try different choices and evaluate how they work. Companies should overinvest in evals early.
- 17:29: If you don't have the right foundation, these experiments will be impossible.
- 18:23: What's the next data type to get popular?
- 18:42: Image data will be ubiquitous. People will do a lot with PDFs. Video will be the most challenging: it combines images and audio, and text can appear in video too. But the data sizes are enormous, and there are modeling challenges around video understanding. There's so much information in video that isn't being mined.
- 22:50: Companies aren't saying that scaling laws are over, but scaling is slowing down. What's happening?
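The heterogeneous-pipeline point at 8:44 can be sketched in a few lines of Python. This is a minimal illustration, not Ray's actual API: the `Stage` class and resource fields here are hypothetical, but they capture the idea that a multimodal pipeline mixes CPU-bound, GPU-bound, and memory-bound stages, so a scheduler should provision each resource type independently rather than sizing the whole cluster for the most expensive stage.

```python
from dataclasses import dataclass

# Hypothetical sketch: each pipeline stage declares the resource it is
# bound by, so the cluster's CPUs, GPUs, and memory can be aggregated
# and provisioned independently.

@dataclass
class Stage:
    name: str
    num_cpus: float = 0.0
    num_gpus: float = 0.0
    memory_gb: float = 0.0

# A video-processing pipeline like the one described in the episode.
pipeline = [
    Stage("decode_video",     num_cpus=4),               # CPU bound
    Stage("detect_scenes",    num_cpus=1, memory_gb=8),  # memory bound
    Stage("transcribe_audio", num_gpus=0.5),             # GPU bound
    Stage("classify_frames",  num_gpus=1),               # GPU bound
]

def aggregate(stages):
    """Total per-resource demand across all pipeline stages."""
    return {
        "cpus": sum(s.num_cpus for s in stages),
        "gpus": sum(s.num_gpus for s in stages),
        "memory_gb": sum(s.memory_gb for s in stages),
    }

print(aggregate(pipeline))
# → {'cpus': 5.0, 'gpus': 1.5, 'memory_gb': 8.0}
```

Note that no single stage needs all three resources at once; aggregating them lets a cluster run the CPU-heavy decode stage and the GPU-heavy inference stages concurrently instead of leaving GPUs idle during decoding.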
