Benchmarking AI Performance

This chapter examines various metrics for evaluating AI model performance, including cross-entropy loss and notable benchmarks like HeddaSwag and MMLU. It also discusses the replicability of experiment results, addressing skepticism, and the importance of model capacity for compression in relation to learning outcomes.

Play episode from 31:15

chevron_right

Transcript

chevron_right

Transcript

Episode notes

In this episode of AI + a16z, Bowen Peng and Jeffrey Quesnelle of Nous Research join a16z General Partner Anjney Midha to discuss their mission to keep open source AI research alive and activate the community of independent builders. The focus is on a recent project called DisTrO, which demonstrates it's possible to train AI models across the public internet much faster than previously thought possible. However, Nous is behind a number of other successful open source AI projects, including the popular Hermes family of "neutral" and guardrail-free language models.

Here's an excerpt of Jeffrey explaining how DisTrO was inspired by the possibility that major open source AI providers could turn their efforts back inward:

"What if we don't get Llama 4? That's like an actual existential threat because the closed providers will continue to get better and we would be dead in the water, in a sense.

"So we asked, 'Is there any real reason we can't make Llama 4 ourselves?' And there is a real reason, which is that we don't have 20,000 H100s. . . . God willing and the creek don't rise, maybe we will one day, but we don't have that right now.

"So we said, 'But what do we have?' We have a giant activated community who's passionate about wanting to do this and would be willing to contribute their GPUs, their power, to it, if only they could . . . but we don't have the ability to activate that willingness into actual action. . . . The only way people are connected is over the internet, and so anything that isn't sharing over the internet is not gonna work.

"And so that was the initial premise: What if we don't get Llama 4? And then, what do we have that we could use to create Llama 4? And, if we can't, what are the technical problems that, if only we slayed that one technical problem, the dam of our community can now flow and actually solve the problem?"

Learn more:

DisTrO paper

Nous Research

Nous Research GitHub

Follow everyone on X:

Bowen Peng

Jeffrey Quesnelle

Anjney Midha

Check out everything a16z is doing with artificial intelligence here, including articles, projects, and more podcasts.

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.

Hosted by Simplecast, an AdsWizz company. See pcm.adswizz.com for information about our collection and use of personal data for advertising.

The AI-powered Podcast Player

Save insights by tapping your headphones, chat with episodes, discover the best highlights - and more!

Get the app

Home Top podcasts Popular guests Top books