This podcast discusses the challenges of ML compute and evaluation, including AI2's transition beyond a purely academic role, the minimum viable resources needed for that work, and how nonprofits can earn government trust. It examines the debate over a regulatory body for algorithms, including the need for clear communication standards and well-defined evaluation tasks. It also covers the evolution of AI, the uncomfortable choices it forces, and the need for new approaches to bring more truth to public discourse.
AI2 aims to move beyond a purely academic role by establishing itself as a trusted venue for multi-organizational evaluation, an effort that requires substantial compute and therefore the acquisition of GPUs.
As a standard-setting organization, NIST can facilitate third-party assessment of AI systems, addressing concerns about cheating and contamination in evaluations and supporting the responsible and safe development and deployment of AI systems.
Deep dives
Focus on AI2's transition beyond a purely academic organization
AI2 is in the process of transitioning from being a solely academic organization to proving itself as a unique place for multi-organizational evaluation to happen. It aims to compare models across organizations and to earn the government's trust as a nonprofit focused on understanding language models scientifically. This ambitious goal requires a significant amount of compute, making the acquisition of GPUs crucial. While there is debate about which institutions are best placed to manage GPUs, AI2 emphasizes the need for clear communication and standards in the evaluation space, particularly for open-ended evaluations.
The importance of third-party evaluation oversight
There is a growing recognition that robust and rigorous scientific evaluation of AI systems requires third-party assessment or oversight, whether through federal agencies such as NIST or through other entities. NIST's role as a standard-setting organization positions it as a potential facilitator and translator between academia, industry, and government agencies. Its expertise in evaluation, along with its ability to create an evaluation taxonomy, can help address concerns about cheating and contamination in evaluations. Establishing clear evaluation standards and a public evaluation infrastructure is crucial to the responsible and safe development and deployment of AI systems.
Challenges in understanding AI's impact on elections and misinformation
The emerging challenges around misinformation and disinformation in the context of election integrity raise questions about the role of federal and state entities in mitigating these concerns. The perception is that these entities often react to public anxiety rather than proactively shaping the conversation. The responsibility to bring more truth and understanding to the discourse falls on researchers, government agencies, and private citizens. Requests for information (RFIs) and similar initiatives from organizations like NIST can help structure and inform the evaluation and regulation of AI systems. However, translating the complex and nuanced evaluation of language models into terms that sustain public trust and accountability remains a challenge.
The need for translation and well-defined evaluation standards
As AI systems, especially language models, become more advanced and capable, translation is needed between academia, industry, and government agencies. NIST and similar entities can play a crucial role in defining evaluation taxonomies and translating evaluations into meaningful standards and guidelines. The goal is to ensure that evaluation efforts are not confined to a single organization or skewed by narrow incentives, such as the temptation to game evaluations. A collaborative effort among stakeholders is necessary to establish a public infrastructure that can handle the challenges of evaluating AI systems effectively and responsibly.
This week Tom and Nate catch up on two everlasting themes of ML: compute and evaluation. We chat about AI2, Zuck's GPUs, evaluation as procurement, NIST comments, neglecting reward models, and plenty of other topics. We're on the tracks for 2024 and waiting for some things to happen. Links for what we covered this week: