📆 ThursdAI - Jan 2 - is 25' the year of AI agents?

ThursdAI - The top AI news from the past week

00:00

Evaluating AI Agents: Challenges and Innovations

This chapter explores the complexities of assessing AI agents compared to traditional applications, focusing on the challenges of measuring output reliability, particularly hallucinations. It highlights advanced evaluation methods and the importance of continuous assessments in selecting effective models for real-world applications. Additionally, the discussion covers recent advancements in AI tools and the integration of external systems, stressing their implications for future capabilities.

Play episode from 38:06

chevron_right

Transcript

chevron_right

Transcript

Episode notes

Hey folks, Alex here 👋 Happy new year!

On our first episode of this year, and the second quarter of this century, there wasn't a lot of AI news to report on (most AI labs were on a well deserved break). So this week, I'm very happy to present a special ThursdAI episode, an interview with Joāo Moura, CEO of Crew.ai all about AI agents!

We first chatted with Joāo a year ago, back in January of 2024, as CrewAI was blowing up but still just an open source project, it got to be the number 1 trending project on Github, and #1 project on Product Hunt. (You can either listen to the podcast or watch it in the embedded Youtube above)

00:36 Introduction and New Year Greetings

02:23 Updates on Open Source and LLMs

03:25 Deep Dive: AI Agents and Reasoning

03:55 Quick TLDR and Recent Developments

04:04 Medical LLMs and Modern BERT

09:55 Enterprise AI and Crew AI Introduction

10:17 Interview with João Moura: Crew AI

25:43 Human-in-the-Loop and Agent Evaluation

33:17 Evaluating AI Agents and LLMs

44:48 Open Source Models and Fin to OpenAI

45:21 Performance of Claude's Sonnet 3.5

48:01 Different parts of an agent topology, brain, memory, tools, caching

53:48 Tool Use and Integrations

01:04:20 Removing LangChain from Crew

01:07:51 The Year of Agents and Reasoning

01:18:43 Addressing Concerns About AI

01:24:31 Future of AI and Agents

01:28:46 Conclusion and Farewell

---

Is 2025 "the year of AI agents"?

AI agents as I remember them as a concept started for me a few month after I started ThursdAI ,when AutoGPT exploded. Was such a novel idea at the time, run LLM requests in a loop,

(In fact, back then, I came up with a retry with AI concept and called it TrAI/Catch, where upon an error, I would feed that error back into the GPT api and ask it to correct itself. it feels so long ago!)

AutoGPT became the fastest ever Github project to reach 100K stars, and while exciting, it did not work.

Since then we saw multiple attempts at agentic frameworks, like babyAGI, autoGen. Crew AI was one of them that keeps being the favorite among many folks.

So, what is an AI agent? Simon Willison, friend of the pod, has a mission, to ask everyone who announces a new agent, what they mean when they say it because it seems that everyone "shares" a common understanding of AI agents, but it's different for everyone.

We'll start with Joāo's explanation and go from there. But let's assume the basic, it's a set of LLM calls, running in a self correcting loop, with access to planning, external tools (via function calling) and a memory or sorts that make decisions.

Though, as we go into detail, you'll see that since the very basic "run LLM in the loop" days, the agents in 2025 have evolved and have a lot of complexity.

My takeaways from the conversation

I encourage you to listen / watch the whole interview, Joāo is deeply knowledgable about the field and we go into a lot of topics, but here are my main takeaways from our chat

* Enterprises are adopting agents, starting with internal use-cases

* Crews have 4 different kinds of memory, Long Term (across runs), short term (each run), Entity term (company names, entities), pre-existing knowledge (DNA?)

* TIL about a "do all links respond with 200" guardrail

* Some of the agent tools we mentioned

* Stripe Agent API - for agent payments and access to payment data (blog)

* Okta Auth for Gen AI - agent authentication and role management (blog)

* E2B - code execution platform for agents (e2b.dev)

* BrowserBase - programmatic web-browser for your AI agent

* Exa - search grounding for agents for real time understanding

* Crew has 13 crews that run 24/7 to automate their company

* Crews like Onboarding User Enrichment Crew, Meetings Prep, Taking Phone Calls, Generate Use Cases for Leads

* GPT-4o mini is the most used model for 2024 for CrewAI with main factors being speed / cost

* Speed of AI development makes it hard to standardize and solidify common integrations.

* Reasoning models like o1 still haven't seen a lot of success, partly due to speed, partly due to different way of prompting required.

This weeks Buzz

We've just opened up pre-registration for our upcoming FREE evaluations course, featuring Paige Bailey from Google and Graham Neubig from All Hands AI (previously Open Devin). We've distilled a lot of what we learned about evaluating LLM applications while building Weave, our LLM Observability and Evaluation tooling, and are excited to share this with you all! Get on the list

Also, 2 workshops (also about Evals) from us are upcoming, one in SF on Jan 11th and one in Seattle on Jan 13th (which I'm going to lead!) so if you're in those cities at those times, would love to see you!

And that's it for this week, there wasn't a LOT of news as I said. The interesting thing is, even in the very short week, the news that we did get were all about agents and reasoning, so it looks like 2025 is agents and reasoning, agents and reasoning!

See you all next week 🫡

TL;DR with links:

* Open Source LLMs

* HuatuoGPT-o1 - medical LLM designed for medical reasoning (HF, Paper, Github, Data)

* Nomic - modernbert-embed-base - first embed model on top of modernbert (HF)

* HuggingFace - SmolAgents lib to build agents (Blog)

* SmallThinker-3B-Preview - a QWEN 2.5 3B "reasoning" finetune (HF)

* Wolfram new Benchmarks including DeepSeek v3 (X)

* Big CO LLMs + APIs

* Newcomer Rubik's AI Sonus-1 family - Mini, Air, Pro and Reasoning (X, Chat)

* Microsoft "estimated" GPT-4o-mini is a ~8B (X)

* Meta plans to bring AI profiles to their social networks (X)