The discussion highlights the anticipation surrounding 2025 being recognized as the year of agents, with significant contributions from key industry players like Sam Altman and Satya Nadella. The narrative emphasizes the emergence of competitive open weights reasoning models, particularly from China, signifying a shift in the landscape of artificial intelligence. This excitement is further fueled by a viral blog post detailing a price war in January 2025, where OpenAI strategically reduced pricing on its models, improving performance while making advanced AI resources more accessible. Such developments illustrate a broader trend towards enhancing the capabilities and affordability of AI solutions in a rapidly evolving tech environment.
The conversation features insights from Karina Nguyen, who transitioned from Anthropic to OpenAI, where she drives innovations focused on reasoning interfaces. Nguyen describes her team's work on defining new interaction paradigms that enhance user engagement with AI, particularly through features like chat capabilities and task management systems. A notable example is the introduction of the Canvas tool, which allows users to engage in more dynamic and effective writing and coding processes. By refining these tools, the team aims to create a seamless integration of artificial intelligence in day-to-day tasks, pushing the boundaries on how users can interact with AI systems.
Karina emphasizes a user-centric approach in the development of AI tools, arguing for iterative deployment based on real user feedback. This method involves testing new features, gathering insights, and rapidly making adjustments to better align with user needs and preferences. The discussion reveals how initial versions of tools like Canvas were refined through continuous interaction with users, enabling the AI to evolve in a way that feels more intuitive. This feedback loop is essential in ensuring that AI systems not only function efficiently but also enhance the user experience by understanding and adapting to human behaviors.
Nguyen shares insights into the complexities associated with evaluating AI models and the training processes required to improve their performance. She notes that issues such as handling contradictory datasets during training can lead to unexpected 'side effects' in model behavior. The discussion emphasizes the delicate balance between improving model capabilities while managing existing functions to prevent regression in performance. By applying software engineering techniques to AI model training, developers can better address these challenges and ensure that models continually progress without sacrificing other essential functionalities.
Looking ahead, the conversation touches on the evolution of AI assistants and their potential to function as trusted collaborators in various tasks. Karina envisions a future where agents can seamlessly manage more complex tasks, including personalized interactions that reflect user preferences and habits. As AI technology advances, the expectation is that these agents will transition from simple task performers to proactive collaborators in users' daily activities, requiring a foundation of trust built through consistent performance. This progression signifies a shift towards a more integrated role of AI, where it becomes an integral part of users' lifestyles rather than a separate tool.
Sponsorships and tickets for the AI Engineer Summit are selling fast! See the new website with speakers and schedules live!
If you are building AI agents or leading teams of AI Engineers, this will be the single highest-signal conference of the year for you, this Feb 20-22nd in NYC.
We’re pleased to share that Karina will be presenting OpenAI’s closing keynote at the AI Engineer Summit. We were fortunate to get some time with her today to introduce some of her work, and hope this serves as nice background for her talk!
There are very few early AI careers that have been as impactful as Karina Nguyen’s. After stints at Notion, Square, Dropbox, Primer, the New York Times, and UC Berkeley, she joined Anthropic as employee ~60 and worked on a wide range of research/product roles for Claude 1, 2, and 3. We’ll just let her LinkedIn speak for itself:
Now, as Research manager and Post-training lead in Model Behavior at OpenAI, she creates new interaction paradigms for reasoning interfaces and capabilities, like ChatGPT Canvas, Tasks, SimpleQA, streaming chain-of-thought for o1 models, and more via novel synthetic model training.
Ideal AI Research+Product Process
In the podcast we got a sense of what Karina has found works for her and her team to be as productive as they have been:
* Write PRD (Define what you want)
* Funding (Get resources)
* Prototype Prompted Baseline (See what’s possible)
* Write and Run Evals (Get failures to hillclimb)
* Model training (Exceed baseline without overfitting)
* Bugbash (Find bugs and solve them)
* Ship (Get users!)
We could turn this into a snazzy viral graphic but really this is all it is. Simple to say, difficult to do well. Hopefully it helps you define your process if you do similar product-research work.
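As a minimal sketch of the eval-centric middle of that list (prototype a prompted baseline, write evals it fails, then hill-climb with training), here is what the loop can look like in Python. This is our illustration, not OpenAI's tooling; the eval cases, grader, and `model_fn` interface are all placeholder assumptions:

```python
# Hypothetical sketch of the "prompted baseline -> evals -> hill-climb" loop.
# The cases, grader, and model interface are all placeholder assumptions.

EVAL_CASES = [
    {"prompt": "Rewrite this email to be more concise: ...",
     "must_include": ["meeting", "Friday"]},
    {"prompt": "Draft a cover letter for a data analyst role: ...",
     "must_include": ["data analyst"]},
]

def grade(output: str, case: dict) -> bool:
    """Toy grader: in practice, a rubric-based human eval or an LLM judge."""
    return all(term.lower() in output.lower() for term in case["must_include"])

def pass_rate(model_fn) -> float:
    """model_fn: prompt -> completion string. Returns pass rate on the set."""
    passed = sum(grade(model_fn(c["prompt"]), c) for c in EVAL_CASES)
    return passed / len(EVAL_CASES)

# Confirm the prompted baseline *fails* (so there is room to hill-climb),
# then train, re-run, and also watch held-out evals for regressions:
# assert pass_rate(prompted_baseline) < pass_rate(trained_candidate)
```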
Show Notes
* Our Reasoning Price War post
* Karina LinkedIn, Website, Twitter
* Karina on Claude Artifacts
* Karina on Claude 3 Benchmarks
* Inspiration for Artifacts / Canvas from early UX work she did on GPT-3
* “i really believe that things like canvas and tasks should and could have happened like 2 yrs ago, idk why we are lagging in the form factors” (tweet)
* Our article on prompting o1 vs Karina’s Claude prompting principles
* Canvas: https://openai.com/index/introducing-canvas/
* We trained GPT-4o to collaborate as a creative partner. The model knows when to open a canvas, make targeted edits, and fully rewrite. It also understands broader context to provide precise feedback and suggestions.
To support this, our research team developed the following core behaviors:
* Triggering the canvas for writing and coding
* Generating diverse content types
* Making targeted edits
* Rewriting documents
* Providing inline critique
We measured progress with over 20 automated internal evaluations. We used novel synthetic data generation techniques, such as distilling outputs from OpenAI o1-preview, to post-train the model for its core behaviors. This approach allowed us to rapidly address writing quality and new user interactions, all without relying on human-generated data.
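For readers, "distilling outputs" from a stronger reasoning model generally means using the teacher model's completions as training targets for the production model. A rough sketch of that general pattern, under our own assumptions (the prompt, data shape, and use of the public API are illustrative, not OpenAI's actual pipeline):

```python
# Hypothetical sketch of teacher-model distillation for a Canvas-style
# behavior ("make a targeted edit"). Not OpenAI's actual pipeline.
from openai import OpenAI

client = OpenAI()

def make_distilled_example(document: str, instruction: str) -> dict:
    """Have a stronger teacher model demonstrate the behavior, then keep the
    (input, output) pair as a post-training example for the smaller model."""
    teacher = client.chat.completions.create(
        model="o1-preview",  # teacher model; an assumption for illustration
        messages=[{
            "role": "user",
            "content": (
                "Apply the edit instruction to the document, changing only "
                f"the relevant section.\n\nInstruction: {instruction}\n\n"
                f"Document:\n{document}"
            ),
        }],
    )
    return {
        "input": {"document": document, "instruction": instruction},
        "target": teacher.choices[0].message.content,  # training label
    }
```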
* Tasks: https://www.theverge.com/2025/1/14/24343528/openai-chatgpt-repeating-tasks-agent-ai
* Agents and Operator
* What are agents? “Agents are a gradual progression of tasks: starting with one-off actions, moving to collaboration, and ultimately fully trustworthy long-horizon delegation in complex envs like multi-player/multiagents.” (tweet)
* tasks and canvas fall within the first two, and we are def. marching towards the third—though the form factor for 3 will take time to develop
* Operator/Computer Use Agents
* https://openai.com/index/introducing-operator/
* Misc:
* Prediction: Personal AI Consumer playbook
* ChatGPT as generative OS
Timestamps
* 00:00 Welcome to the Latent Space Podcast
* 00:11 Introducing Karina Nguyen
* 02:21 Karina's Journey to OpenAI
* 04:45 Early Prototypes and Projects
* 05:25 Joining Anthropic and Early Work
* 07:16 Challenges and Innovations at Anthropic
* 11:30 Launching Claude 3
* 21:57 Behavioral Design and Model Personality
* 27:37 The Making of ChatGPT Canvas
* 34:34 Canvas Update and Initial Impressions
* 34:46 Differences Between Canvas and API Outputs
* 35:50 Core Use Cases of Canvas
* 36:35 Canvas as a Writing Partner
* 36:55 Canvas vs. Google Docs and Future Improvements
* 37:35 Canvas for Coding and Executing Code
* 38:50 Challenges in Developing Canvas
* 41:45 Introduction to Tasks
* 41:53 Developing and Iterating on Tasks
* 46:27 Future Vision for Tasks and Proactive Models
* 52:23 Computer Use Agents and Their Potential
* 01:00:21 Cultural Differences Between OpenAI and Anthropic
* 01:03:46 Call to Action and Final Thoughts
Transcript
Alessio [00:00:04]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel, and I'm joined by my usual co-host, Swyx.
swyx [00:00:11]: Hey, and today we're very, very blessed to have Karina Nguyen in the studio. Welcome.
Karina [00:00:15]: Nice to meet you.
swyx [00:00:16]: We finally made it happen. We finally made it happen. First time we tried this, you were working at a different company, and now we're here. Fortunately, you had some time, so thank you so much for joining us. Yeah, thank you for inviting me. Karina, your website says you lead a research team in OpenAI, creating new interaction paradigms for reasoning interfaces and capabilities like ChatGPT Canvas, and most recently, ChatGPT Tasks. I don't know, is that what we're calling it? Streaming chain of thought for o1 models and more via novel synthetic model training. What is this research team?
Karina [00:00:45]: Yeah, I need to clarify this a little bit more. I think it changed a lot since the last time we launched. So we launched Canvas, and it was the first project. I was a tech lead, basically, and then I think over time I was trying to refine what my team is, and I feel like it's at the intersection of human-computer interaction, defining what the next interaction paradigms might look like with some of the most recent reasoning models, as well as actually trying to come up with novel methods for how to improve those models for certain tasks if you want to. So for Canvas, for example, one of the most common use cases is basically writing and coding. And we're continually working on, okay, how do we make Canvas coding go beyond what is possible right now? And that requires us to actually do our own training and come up with new methods of synthetic data generation. The way I'm thinking about it is that my team is very full stack, going from training models all the way up to deployment and making sure that we create novel product features that are coherent with what we're doing. So we're really working on that.
swyx [00:02:08]: So it's, it's a lot of work to do right now. And I think that's why I think it's such a great opportunity. You know, how could something this big work in like an industrial space and in the things that we're doing, you know, it's a really exciting time for us. And it's just, you know, it's a lot of work, but what I really like about working in digital space is the, you know, the visual space is always the best place to stay. It's not just the skill sets that need to be done.
Alessio [00:02:17]: Like we have, like, a lot of things to be done, but like, we've got a lot of different, you know, things to come up with. I know you have some early UX prototypes with GPT-3 as well, and kind of like maybe how that has informed the way you build products.
Karina [00:02:32]: I think my background was mostly like working on computer vision applications for like investigative journalism. Back when I was like at school at Berkeley, and I was working a lot with like Human Rights Center and like investigative journalists from various media. And that's how I learned more about like AI, like with vision transformers. And at that time, I was working with some of the professors at Berkeley AI Research.
swyx [00:03:00]: There are some Pulitzer Prize winning professors, right, that teach there?
Karina [00:03:04]: No, so it's mostly like was reporting for like teams like the New York Times, like the AP, the Associated Press. So it was like all in the context of like Human Rights Center. Got it. Yeah. So that was like in computer vision. And then I saw... I saw Chris Olah's work around, you know, like interpretability from Google. And that's how I found out about like Anthropic. And at that time, I was just like, I think it was like the year when like Ukraine's war happened. And I was like trying to find a full-time job. And it was kind of like all got distracted. It was like kind of like spring. And I was like very focused on like figuring out like what to do. And then my best option at that time was just like continue my internship at the New York Times and convert to like full-time. At the New York Times, it was just like working on like mostly like product engineering work around like R&D prototypes, kind of like storytelling features on the mobile experience. So kind of like storytelling experiences. And like at that time, we were like thinking about like how do we employ like NLP techniques to like scrape some of the archives from the New York Times or something. But then I always wanted to like get into like AI. And like I knew OpenAI for a while. So I kind of like applied to Anthropic just on the website. And I was rejected the first time. But then at that time, they were not hiring for like anything like product engineering or front-end engineering, which was something I was like, at that time, I was like interested in. And then there was like a new opening at Anthropic that was like kind of like a front-end engineer role. And so I applied. And that's how my journey began. But like the earlier prototypes were mostly like, I used like CLIP.
swyx [00:05:13]: We'll briefly mention that the Ukrainian crisis actually hit home more for you than most people, because you're from Ukraine and you moved here for school, I guess. Yeah.
Karina [00:05:23]: Yeah.
swyx [00:05:23]: We'll come back to that if it comes up. But then you joined Anthropic, not just as a front-end engineer. You were the first designer. Is that true? Yeah.
Karina [00:05:32]: Yes. I think like I did both product design and front-end engineering together. And like at that time it was like pre-ChatGPT. It was like, I think August 2022. And that was a time when Anthropic really decided to like do more product-y related things. And the vision was like, we need to like fund research, and like building product is like the best way to like fund safety research, which I find quite admirable. So the really first product that Anthropic built was like Claude in Slack. And it was sunsetted not long after, but it was like one of the first. I think I still come back to that idea of like Claude operating inside some of the organizational workplace like Slack, and something magical in there. And I remember we built like ideas like summarize the thread, but you can like imagine having automated like ways of like, maybe Claude should like summarize multiple channels every week, custom for what you like or for what you want. And then we built some like really cool features like this. So we could like tag Claude and then ask it to summarize what happened in the thread. So just like new ideas, but we didn't quite double down, because you could like imagine like Claude having access to like the files or like Google Drive that you can upload, and just connectors, like connections in the Slack. Also the UX was kind of constraining at that time. I was thinking like, oh, we wanted to do this feature, but like the Slack interface kind of constrained us to like do that. And we didn't want to like be dependent on the platform, like Slack. And then after like ChatGPT came out, I remember the first two weeks, my manager made me this challenge, like, can I like reproduce kind of like a similar interface in like two weeks? And one of the early mistakes being in engineering is like, I said yes, instead I should have said like, you know, double it, 2x the time. Sure. Um, and this is how like Claude.ai was kind of like born.
swyx [00:07:39]: Oh, so you actually wrote Claude.ai? Yeah. As your first job. Yeah.
Karina [00:07:43]: Like, I think like the first like 50,000 lines of code without any reviews at that time, because there was no one. Um, yeah, it was like a very small team. It was like a six, seven person team, we were called like the deployment team. Yeah.
swyx [00:07:59]: Oh, man, I actually interviewed at Anthropic around that time. I was given Claude in Sheets and that was my other form factor. I was like, oh yeah, this needs to be in a table so we can, we can just copy paste and just fan it out. Uh, which is kind of cool. The other rumor that, um, we might as well just mention this, um, Raza Habib from HumanLoop, uh, often says that, uh, you know, there was some, there's some version of ChatGPT in Anthropic, like you had the chat interface already, like you had Slack, why not launch a web UI? Like basically how did OpenAI beat Anthropic to ChatGPT, basically? Um, well, it seems kind of obvious to have it.
Karina [00:08:35]: I think the ChatGPT model itself came out way before we decided to like launch Claude 2 necessarily. And I think like at that time, Claude 1.3 had a lot of hallucinations actually. So I think there was like, one of the concerns is like, I don't think the leadership was convinced, had the conviction that this is the model that you want to like deploy or something. So there were a lot of discussions around, around that time. But Claude 1.3 was like, I don't know if you played with that, but it's like extremely creative and it was like really cool.
swyx [00:09:07]: Nice.
Alessio [00:09:08]: It's still creative. And you had a tweet recently where you said things like Canvas and Tasks could have happened two years ago, but they were not. Do you know why they were not? Was it too many researchers at the labs not focused on UX? Was it just not a priority for the labs?
Karina [00:09:24]: Yeah. I come back to that question a lot. I guess like I was working on something similar to like Canvas-y, but for Claude at that time in like 2023. It was the same similar idea of like a Claude workspace where a human and Claude could have like a shared workspace. Yeah. And that's Artifacts. Which is like a document. Right.
swyx [00:09:44]: No, no, no. This is Claude Projects.
Karina [00:09:46]: I don't know. I think it kind of evolved. I think like at that time I was like in the product engineering team and then I switched to like the research team, and the product engineering team grew so much. They had their own ideas of like Artifacts and like Projects. So not necessarily, maybe they had, they looked at my like previous explorations, but like, you know, when I was exploring like Claude documents or like Claude workspace. Yeah. I don't think anybody was thinking about UX as much, or like not many like researchers understood that. And I think the inspiration actually for, I still have like all the sketches, but the inspiration was like from Harry Potter, like Tom Riddle's diary. That was an inspiration of like having Claude writing into the document or something and communicate back.
swyx [00:10:34]: So like in the movie you write a little bit and then it answers you. Yeah.
Karina [00:10:37]: Okay.
swyx [00:10:38]: Interesting.
Karina [00:10:39]: But that was like only in the context of like writing. I think Canvas like also serves coding, one of the most common use cases. But yeah, I think like those, those ideas could have happened like two years ago. Just like maybe, I don't think it was like a priority at that time. It was like very unclear. I think like the AI landscape at that time was very nascent, if that makes sense. Like nobody, like, even when I would talk to like some of the designers at that time, like product designers, they were not even thinking about that at all. They did not have like AI in mind. And like, it's kind of interesting, except for one of my designer friends. His name is Jason Yuan. Yeah. Who was thinking about that.
swyx [00:11:19]: And Jason now is at New Computer. Yes. We'll have them on at some point. I had him speak at my first summit and you're speaking at the second one, which will be really fun. Nice. We'll stay on Anthropic for a bit and then we'll move on to more recent things. I think the other big project that you were, you were involved with was just Claude 3. Just tell us the story. Like, what was it like to launch one of the biggest launches of the year? Yeah.
Karina [00:11:39]: I think like I was, so Claude 3.
swyx [00:11:43]: This is Haiku, Sonnet, Opus all at once, right? Yes. Yeah.
Karina [00:11:46]: It was the Claude 3 family. I was a part of the post-training fine-tuning team. We only had like, what, like 10, 12 people involved. And it was really, really fun to like work together as friends. So yeah, I was mostly involved in like the Claude 3 Haiku post-training side and then evaluations, like developing new evaluations. And like literally writing the entire like model card. And I had a lot of fun. I think like the way you train the model is like very different, obviously. But I think what I've learned is that like you will end up with like, I don't know, like 70 models and every model will have its own like brain damage. And like, so it's just like, like kind of just bugs.
swyx [00:12:28]: Like personality wise or performance benchmarks?
Karina [00:12:31]: I think every model is very different. And I think like, it's like one of the interesting like research questions is like, how do you understand like the data interface? How do you understand the interactions as you like train the model? It's like, if you train the model on like contradictory data sets, how can you make sure that there won't be like any like weird like side effects? And sometimes you get like side effects. And like the learning is that you have to like iterate very rapidly and like have to like debug and detect it and make like address it with like interventions. And actually some of the techniques from like software engineering is very like useful here. It's like, how do you- Yeah, exactly.
swyx [00:13:09]: So I really empathize with this because data sets, if you put in the wrong one, you can basically kind of screw up like the past month of training. The problem with this for me is the existence of YOLO runs. I cannot square this with YOLO runs. If you're telling me like you're taking such care about data sets, then every day I'm going to check in, run evals and do that stuff. But then we also know that YOLO runs exist. Yes. So how do you square that?
Karina [00:13:32]: Well, I think it's like dependent on how much compute you have. Right? So it's like, it's actually a lot of questions and like researchers are like, how do you most effectively use the compute that you have? And maybe you can have like two to three runs that is only like YOLO runs. But if you don't have a luxury of that, like you kind of need to like prioritize ruthlessly. Like what are the experiments that are most important to like run? Yeah. I think this is what like research management is basically. It's like, how do you-
swyx [00:14:04]: Funding efforts. Yeah. Yeah. Prioritizing.
Karina [00:14:07]: Take like research bets and make sure that you build the conviction and those bets rapidly such that if they work out, you like double down on them. Yeah.
swyx [00:14:15]: You almost have to like kind of ablate data sets too and like do it on the side channel and then merge it in. Yeah. It's kind of super interesting. Tell us more, like what's your favorite? So you, I have this in front of me, the model card. You say constructing this painful, this table was slightly painful. Just pick a benchmark and what's an interesting story behind one of them?
Karina [00:14:33]: I would say GPQA was kind of interesting. I think it was like the first, I think we were the first lab, like Anthropic was the first lab to like run.
swyx [00:14:42]: Oh, because it was like relatively new after NeurIPS? Yeah.
Karina [00:14:45]: Yeah. Okay. Published GPQA numbers. And I think one of the things that we've learned, that I personally learned, is that like some evals are like very high variance. And GPQA happened to be like a hugely high variance evaluation. So one thing that we did is like run it like five times and take the average. But like the hardest thing about like the model card is like none of the numbers are like apples to apples. Yes. Will knows this. So you actually need to like go back to like, I don't know, the GPT-4 model card and like read the appendix just to like make sure that the settings are the same as the settings you're running too. So it's like never apples to apples. Yeah. But it's interesting how like, you know, when you market models as products, like customers don't necessarily know. Yeah. Like.
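A high-variance eval is one whose score moves noticeably between identical runs, so a single number is misleading. A minimal sketch of the average-of-N mitigation Karina describes, with a simulated scorer standing in for a real GPQA pass:

```python
# Sketch of averaging a high-variance eval like GPQA. run_once is simulated;
# in practice it would be one full scored pass over the benchmark.
import random
import statistics

def run_once() -> float:
    # Noise stands in for sampling-induced variance in the real eval.
    return 0.50 + random.uniform(-0.04, 0.04)

def run_averaged(n: int = 5) -> tuple[float, float]:
    """Report mean and spread so one lucky run doesn't end up in a model card."""
    scores = [run_once() for _ in range(n)]
    return statistics.mean(scores), statistics.stdev(scores)
```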
swyx [00:15:44]: They're just like, my MMLU is 99. What do you mean? Yeah, exactly. Why isn't there an industry standard harness, right? There's this EleutherAI thing, which it seems like none of the model labs use. And then OpenAI put out simple evals and nobody uses that. Why isn't there just one standard way everyone runs this? Because the alternative approach is you rerun your evals on their models. And obviously the numbers, your numbers will be lower. Yeah. And they'll be unhappy. So that's why you don't do that.
Karina [00:16:12]: I think it operates on an assumption that like the models, the next generation of the model or the model that you produce next, is going to behave the same. So for example, like I think the way you prompt o1 or like a Claude 3 is going to be very different from each other. I feel like there's a lot of like prompting that you need to do to get the evals to run correctly. So sometimes the model will just like output like new lines and the way it's parsed will be like incorrect or something. This has happened with like Stanford. I remember when Stanford had this also, like they were like running benchmarks. HELM? Yeah, HELM. And somehow like Claude was like always like not performing well. And that's because like the way they prompted it was kind of wrong. So it's like a lot of like techniques. Yeah. It's just like very hard because like nobody even knows.
swyx [00:17:00]: Has that gone away with chat models instead of, you know, just raw completion models?
Karina [00:17:05]: Yeah, I guess like each eval also can be run in a very different way. Sometimes you can like ask the model to output in like XML tags, but some models are not really good at XML tags. So it's like, do you change the formatting per model or like do you run the same format across all models? And then like the metrics themselves, right? Like maybe, you know, accuracy is like one thing, but maybe you care about like some other metrics like F score or like some other like things. Yeah. It's like hard. I don't know.
Alessio [00:17:36]: And talking about o1 prompting, we just had an o1 prompting post on the newsletter, which I think was...
swyx [00:17:42]: Apparently it went viral within OpenAI. Yeah. I don't know. I got pinged by other OpenAI people. They were like, is this helpful to us? I'm like, okay. Oh, nice. Yeah.
Alessio [00:17:50]: I think it's like maybe one of the top three most read posts now. Yeah. Cool. And I didn't write it. Okay. Exactly.
swyx [00:17:57]: Anyway, go ahead.
Alessio [00:17:57]: What are your tips on o1 versus like Claude prompting, or like what are things that you took away from that experience? And especially now, I know that with 4o for Canvas, you've done RL after on the model. So yeah, just general learnings. So how to think about prompting these models differently.
Karina [00:18:12]: I actually think like o1, I did not even harness the magic of like o1 prompting. But like one thing that I found is that like, if you give o1 like hard constraints of like what you're looking for, basically the model will have a much easier time to like kind of like select the candidates and match the candidate that most fulfills the criteria that you gave. And I think there's a class of problems like this that o1 excels at. For example, if you have a question, like a bio question or like in chemistry, right? Like if you have like very specific criteria with the protein or like some of the chemical bindings or something, then the model will be really, really good at like determining the exact candidate that will match the certain criteria.
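The "hard constraints" pattern Karina describes looks roughly like this as a prompt. The chemistry criteria here are invented for illustration; the point is the structure: enumerate checkable constraints, then ask for the candidate that passes all of them:

```python
# Illustrative "hard constraints" prompt for a reasoning model like o1.
prompt = """Select the candidate that satisfies ALL of these constraints:
1. Binds the target protein at the specified site
2. Molecular weight under 500 Da
3. At most 5 hydrogen-bond donors

Candidates:
A) ...
B) ...
C) ...

Check each candidate against every constraint before answering."""
```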
swyx [00:19:04]: I have often thought that we need a new IF eval for this. Because this is basically kind of instruction following, isn't it? Yes. But I don't think IF eval has like multi-step instruction following. Yeah. So that's what basically I use AI News for. I have a lot of prompts and a lot of steps and a lot of criteria and o1 just kind of checks through each kind of systematically. And we don't have any evals like that.
Karina [00:19:24]: Yeah.
Alessio [00:19:25]: Does OpenAI know how to prompt o1? I think that's kind of like the, that's the, you know, Sam is always talking about incremental deployments and kind of like having people get used to it. When you release a model, you obviously do all the safety testing, but do you feel like people internally know how to get a hundred percent out of the model? Or like, are you also spending a lot of time learning from like the outside on how to better prompt o1 and like all these things? Yeah.
Karina [00:19:50]: I certainly think that you learn so much from like external feedback too. Yeah. I feel like I don't fully know how people use like o1. I think like a lot of people use o1 for like really hardcore like coding questions. I feel like I don't fully know how to best use o1. You release the model. Except for like, I use o1 to just like do some like synthetic data explorations. But that's it.
Alessio [00:20:16]: Do people inside of OpenAI, once the model is coming out, do you get like a company-wide memo of like, hey, this is how you should try and prompt this? Yes. Especially for people that might not be close to it during development, you know, or I don't know if you can share anything, but I'm curious how internally these things kind of get shared.
Karina [00:20:34]: I feel like I'm like in my own little corner in like research. I don't really like to look at some of the Slack channels.
swyx [00:20:40]: It's very, very big.
Karina [00:20:41]: So I actually don't know if something like this exists. Probably. It might exist because we need to share to like customers or like, you know, like some of the guides. I'm like, how do you use this model? So probably there is.
swyx [00:20:56]: I often say this. The reason that AI engineering can exist outside of the model labs is because the model labs release models with capabilities that they don't even fully know because you never trained specifically for it. It's emergent. And you can rely on basically crowdsourcing the search of that space or the behavior space to the rest of us. Yeah. So like, you don't have to know. That's what I'm saying. Yeah.
Karina [00:21:20]: I think like an interesting thing about like o1 is like that it's really hard for like the average human. Sometimes I don't even know whether the model like produced the correct output or not. Like it's really hard for me to like verify even like hard like STEM questions. If I'm not an expert, like I usually don't know. So the question of like alignment is actually more important for these like complex reasoning models, like how do we help humans to like verify the outputs of these models, is quite important. And I feel like. Yeah. Like learning from external feedback is kind of cool.
swyx [00:21:56]: For sure. One last thing on Claude 3. You had a section on behavioral design. Yes. Anthropic's very famous for the HHH goals. What was your insights there? Or, you know, maybe just talk a little bit about what you explored. Yeah.
Karina [00:22:09]: I think like behavioral design is like a really cool. I'm glad that I made it like a section around this. And it's like really cool. I think like.
swyx [00:22:17]: Like you weren't going to publish one and then you insisted on it or what?
Karina [00:22:20]: I think like I just like put the section. Yeah. I think like I put the section inside it and like, yeah, Jared, my like one of my most favorite researchers like, yeah, that's cool. Let's, let's do that. I guess. Yeah. Like nobody had this like term like behavioral design necessarily for the models. It's kind of like a new little field of like extending like product design into like the model design. Right. Like, so how do you create a behavior for the model in certain contexts? So as for example, like in Canvas, right. Like one of the things that we had to like think about is like, okay, like now the model enters like more collaborative environment, more collaborative context. So like what's the most appropriate behavior for the model to act like as a collaborator? Should it ask like more follow up questions? Should it like change? What's the tone should be? Like what is the collaborator's tone? It's different from like a chat, like conversationalist versus like collaborator. So how do you shape the perspective? Like, you know, like the persona and the personality around that is it has like some philosophical questions too. Like, yeah. Behavioral. I mean, like, I guess like I can talk more about like the methods of like creating the personality. Please. It's the same thing as like you would create like a character in a video game or something. It's kind of like...
swyx [00:23:39]: Charisma, intelligence. Yeah, exactly. Wisdom.
Karina [00:23:42]: What are the core principles? Helpful, harmless, honest. Yeah. And obviously for Claude, this was, is much easier than I would say like for ChatGPT. For Claude, it's like baked in the mission, right? It's like honest, harmless, helpful. But the most complicated thing about the model behavior or the behavioral design is that sometimes two values would contradict each other. I think this happened in Claude 3. One of the main things that we were thinking about was like, how do we balance this like honesty versus like harmlessness or like helpfulness? And it's like, we don't want the model to always like refuse even like innocuous queries, like some like creative writing prompts, but also you don't want the model to act like, be harmful or something. So it's like, there's always a balance between those two. And it's more like art than science necessarily. And this is what data set craft is, it's like more of an art than a literal science. You can definitely do like empirical research on this, but it's actually like, like this is the idea of like synthetic data. Like if you look back to like the constitutional AI paper, it's around like, how do you create completions such that you would agree to certain like principles that you want your model to agree on? So it's like, if you create the core values of the models, how do you decompose those core values into like specific scenarios, or like, so how does the model need to express its honesty in a variety of kind of like scenarios? And this is where like generalization happens when you craft the persona of the model. Yeah.
swyx [00:25:22]: It seems like what you described, behavior modification or shaping, was a side job that was done. I mean, I think Anthropic has always focused on it the first and the most. But now it's like every lab has sort of. It's like a vibes officer: for you guys it's Amanda, for OpenAI it's Roon, and then for Google, it's Steven Johnson and Raiza, who we had on the podcast. Do you think this is like a job? Like, it's like a, like every, every company needs a tastemaker.
Karina [00:25:50]: I think the model's personality is actually the reflection of the company or the reflection of the people who create that model. So like for Claude, I think Amanda was doing a lot of like Claude character work and I was working with her at the time.
swyx [00:26:04]: But there's no team, right? Claude character work. Now there's a little bit of a team. Isn't that cool?
Karina [00:26:09]: But before that there was none. I think with Claude 3, we kind of doubled down on the feedback from Claude 2. Like, we didn't even like think about it, but like people said like Claude 2 is like so much better at like writing and like has a certain personality, even though it was like unintentional at all. And we did not pay that much attention and didn't even know how to like productionize this property of the model being better, like personality. And with Claude 3, we kind of like had to like double down, because we knew that we would launch it in chat, and we wanted, like, Claude honesty is like really good for like enterprise customers. So we kind of wanted to like make sure the hallucinations went down, like factuality would like go up or something. We didn't have a team until after like Claude 3, I guess. Yeah.
swyx [00:26:58]: I mean, it's, it's growing now. And I think anyway, everyone's taking it seriously.
Karina [00:27:00]: I think at OpenAI there was a team called Model Design. It's Joanne, the PM. She's leading that team and I work very closely with those teams. We were working on, like, actually the writing improvements that we did with ChatGPT last year. And then I was working on like this collaboration, like how do you make ChatGPT act like a collaborator for like Canvas? And then, yeah, we worked together on some of the projects.
swyx [00:27:25]: I don't think it's publicly known, his actual name other than Roon, but he's, he's mostly doxxed.
Alessio [00:27:32]: We'll beep it and then people can guess. Yeah. Do we want to move on to OpenAI and some of the recent work, especially you mentioned Canvas. So the first thing about Canvas is like, it's not just a UX thing. You have a different model in the backend, which you post-trained on o1-preview distilled data, which was pretty interesting. Can you maybe just run people through: you come up with a feature idea, maybe then how do you decide what goes in the model, what goes in the product, and just that, that process? Yeah.
Karina [00:28:03]: I think the most unique thing about ChatGPT Canvas, what I really liked about it, was that the team also formed out of thin air. So it was like July 4th or something... Wow. during the break. Like on Independence Day.
swyx [00:28:17]: They just like, okay.
Karina [00:28:18]: I think it was, there was some like company break or something. I remember I was just like taking a break and then I was like pitching this idea to like Barret Zoph. Barret Zoph, yeah. Who was my manager at that time. Just like, I just want to like create this like Canvas or something. And I really didn't know how to like navigate OpenAI. It was like my first, like, I don't know, like first month at OpenAI and I really didn't know how to like navigate, how do I get product to work with me or like some of the ideas, like some of the things like this. So I'm really grateful for like actually Barret and Mira who helped me to like staff this project basically. And I think that was really cool. And it was like this 4th of July and like Barret, who's like an engineering manager, was like, yeah, actually, we should like staff this project with like five, six engineers or something. And then Karina can be a researcher on this project. And I think like, this is how the team was formed, kind of like out of thin air. And so like, I didn't know anyone there at that time, except for Thomas Dimson. He did like the first like initial like engineering prototype of the canvas and it kind of like reshaped. But I think the first, we learned a lot on the way how to work together as product and research. And I think this is one of the first projects at OpenAI where research and product work together from the very beginning. And we just made it like a successful project in my opinion, because like designers, engineers, PM and research team were all together. And we would like push back on each other. Like if it doesn't make sense to do it on the model side, we'd like collaborate with like applied engineers to like make sure this is being handled on the applied side. But the idea is you can only go so far with like a prompted baseline. Prompting ChatGPT was kind of like the first thing that we tried, like Canvas as a tool or something. So how do we define the behavior of the canvas? But then we found like different edge cases that we wanted to like fix, and the only way to fix some of these edge cases was actually through post-training. So what we did was actually retrain the entire 4o plus our Canvas stuff. And there are like two reasons why we did this. The first one is that we wanted to ship this as a beta model in the dropdown menu. We could like rapidly iterate on users' feedback as we ship it and not go through the entire like integration process into like the main model or something, which took some time. Right. So from beta to like GA, it took, I think, three months. So we kind of wanted to like ship our own model with that feature to like learn from the user feedback very quickly. So that was like one of the decisions we made. And then with Canvas itself, we just like had a lot of like different like behavioral, it's again, like it's behavioral engineering. It's kind of like various behavioral craft around: when does Canvas need to write comments? When does it need to like update or like edit the document? When does it need to rewrite the entire document versus like edit the very specific section the user asks for? And when does it need to like trigger the Canvas itself?
It was one of those like behavioral engineering questions that we had. At that time, I was also working on like writing quality. So that was like the perfect way for us to like literally both teach the model how to use Canvas, but also like improve writing quality, since writing was like one of the main use cases for ChatGPT. So I think that was like the reasoning around that.
swyx [00:31:55]: There's so many questions. Oh my God. Quick one. What does improved writing quality mean? What are the evals?
Karina [00:32:01]: What are the evals? Yeah. So the way I'm thinking about it is like there are two various directions. The first direction is like, how do you improve the quality of the writing of the current use cases of ChatGPT? And those, most of the use cases are mostly like nonfiction writing. It's like email writing or like some of the, maybe like blog posts, cover letters, is like one of the main use cases. But then the second one is like, how do we teach the model to literally think more creatively or like write in a more creative manner, such that it will like just create novel forms of writing. And I think the second one is much more of a longer-term, like, research question. While the first one is more like, okay, we just need to improve data quality for the writing use cases that people bring to the models. It is a more straightforward question. Okay. But the way we evaluated the writing quality, so actually I worked with Joanne's team on the model design. So they had a team of like model writers and we would work together, and it's just like a human eval. It's like an internal human eval on the prompt distribution that we cared about. Like we want to make sure that the models that we trained were always like better or something. Yeah.
swyx [00:33:20]: So like some test set of like a hundred prompts that you want to make sure you're good on. I don't know. I don't know how big the prompt distribution needs to be because you are literally catering to everyone. Right.
Karina [00:33:32]: Yeah. I think it was much more opinionated way of like improving writing quality because we worked together with like model designers to like come up with like core principles of what makes this particular writing good. Like what does make email writing good? And we had to like craft like some of the literally like rubric on like what makes it good and then make sure during the eval, we check the marks on this like rubric. Yeah.
swyx [00:33:58]: That's what I do. Yeah. That's what school teachers do. Yeah.
Karina [00:34:02]: Yeah. It's really funny.
swyx [00:34:03]: Like, yeah, that's exactly how we grade essays. Yes.
Karina [00:34:06]: Yeah.
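A minimal sketch of the rubric-style eval Karina describes. The rubric items below are invented examples (the real principles came from the model designers); the `judge` callable stands in for a human rater or an LLM grader:

```python
# Sketch of a rubric-based writing eval. Rubric items are invented examples.
EMAIL_RUBRIC = [
    "States its purpose in the first sentence",
    "Matches the requested tone",
    "Avoids filler ('I hope this email finds you well')",
    "Ends with a clear next step",
]

def rubric_score(output: str, judge) -> float:
    """judge(output, criterion) -> bool: a human rater or an LLM grader.
    Returns the fraction of rubric items the draft satisfies."""
    return sum(judge(output, item) for item in EMAIL_RUBRIC) / len(EMAIL_RUBRIC)
```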
Alessio [00:34:06]: I guess my question is, when do you work the improvements back into the model? So the canvas model is better at writing. Why not just make the core model better too? So for example, I built this small podcasting thing for a podcast and I have the 4o API, and I asked it to write a write-up about the episode based on the transcript. And then I've done the same in canvas. The canvas one is a lot better. Like the one from the raw 4o, it starts 'the podcast delves' and I was like, no, I don't want 'delves' as the third word. Why not put them back in 4o core, or is there just like.
Karina [00:34:38]: I think we put it back in the core now.
Alessio [00:34:40]: Yeah. So like, so the 4o canvas now is the same as 4o. Yeah. You, you must've missed that update. Yeah. What's the process to, I think it's just like an AB test almost. Right. To me, it feels, I mean, I've only tried it like three times. But it feels the canvas, the canvas output feels very different than the API output.
Karina [00:35:01]: Yeah, yeah. I think like, there's always like a difference in the model quality. I would say like the original beta model that we released with canvas was actually much more creative than even right now when I use like 4o with canvas. I think it's just like the complexity of like the data and the complexity of the, it's kind of like versioning issues right here. It's like, okay, like your version 11 will be very different from like version 8, right? It's like, even though like the stuff that you put in is like the same or something.
swyx [00:35:32]: It's a good time to, to say that I have used it a lot more than three times. I'm a huge fan of canvas. I think it is, um, yeah, like it's weird when I talk to my other friends, they, they don't really get it yet or they don't really use it yet. I think because it's maybe sold as like sort of writing help when really like it's kind of, it's the scratch pad. Yeah. What are the core use cases or like, yeah.
Karina [00:35:53]: Oh yeah. I'm curious.
swyx [00:35:54]: Literally drafting anything. Like I want to draft copy for my conference that I'm running, like I'll put it there first, and then I like, it'll just have the canvas up and I'll just say what I don't like about it and it changes. I will maybe edit stuff here and paste it in. So, for example, I wanted to draft a brainstormed list of reasons, of signs that you may be an NPC, just for fun, just like a blog post for fun. Nice. And I was like, okay, I'll do 10 of these and then I want you to generate the next 10. So I wrote 10. I pasted it into ChatGPT. Okay. And it generated the next 10 and they all sucked, all horrible. But it also spun up the canvas with the blog post, and I was like, okay, self-critique why your output sucks and then try again. And it just kind of iterates on the blog post with me as a writing partner, and it is so much better than, I don't know, like intermediate steps. I was like, that would be my primary use case: literally drafting anything. I think the other way that I'll put it, and I'm not putting words in your mouth, this is how I view what canvas is and why it's so important: it's basically an inversion of what Google Docs wants to do with Gemini. It's like Google Docs on the main screen and then Gemini on the side, and right now what ChatGPT has done is do the chat thing first and then the docs on the side. But it's kind of like a reversal of what is the main thing. Like Google Docs starts with the canvas first that you can edit and whatever, and then maybe sometimes you call in the AI assistant, but ChatGPT, what you are now is you're kind of AI first, with the side output being Google Docs.
Karina [00:37:22]: I think we definitely want to improve the writing use case in terms of like, how do we make it easier for people to format or like do some of the editing? I think there is still a lot of room for improvement, to be honest. I think the other thing is like coding, right? I feel like one of the things that we'd be like doubling down on is actually like executing code inside the canvas. And there's a lot of questions like, how do you evolve this? It's kind of like an IDE for both. And I feel like where I'm coming from is like ChatGPT evolves into this blank canvas. It's kind of like the interface, which can morph itself into whatever you're trying to do. Like the model should try to like derive your true intent and then modify the interface based on your intent. And then if you like writing, it should become like the most powerful, like, writing IDE possible. If it's like coding, it should become like a coding IDE or something.
swyx [00:38:14]: I think it's a little bit of a odd decision for me to call those two things, the same product name, because they're basically two different UIs. Like one is code interpreter plus plus. The other one is canvas. Yes. I don't know if you have other thoughts on canvas.
Alessio [00:38:27]: No, I'm just curious, maybe some of the harder things. So when I was reading, for example, forcing the model to do targeted edits versus like full rewrites, it sounds like it was like really hard. In the AI engineer mind, maybe sometimes it's like, just pass one sentence in the prompt and it's just going to rewrite that sentence. Right. But obviously it's harder than that. What are maybe some of the like hard things that people don't understand from the outside when building products like this?
Karina [00:38:50]: I think it's always hard with any new like product feature, like Canvas or tasks or like any other new features, that you don't know how people would use this feature. And so how do you even like build evaluations that would simulate how people would use this feature? And it's always like really hard for us. Therefore, we try to like lean onto like iterative deployment in order to like learn from user feedback as much as possible. Again, it's like we didn't know that like code diffs were very difficult for a model, for example. Again, it's like, do we go back to like fundamentally improve like code diffs as a model capability, or do you like do a workaround where the model will just like rewrite the entire document, which yields like higher accuracy? And so those are like some of the decisions that we had to like make, as, yeah, how do you like raise the bar on the product quality, but also make sure the model quality is also a part of it. And like, what kind of like trade-offs you're okay to do? Again, I think, I think this is like a new way of product development, it's more like product research, model training and like product development go like together hand in hand. This is like one of the hardest things, like defining the entire like model behaviors. I think just like, there's so many edge cases that might happen, especially when you like use canvas with like other tools, right? Like Canvas plus DALL·E, Canvas plus search. If you like select a certain section and then like ask for search, like how do you build such evals? Like what kind of like features or like behaviors do you care the most about? And this is how you build evals.
swyx [00:40:35]: You tested against every feature of ChatGPT? No. Oh, okay. I mean, I don't think there's that many that you can. Right. It will take forever.
Karina [00:40:44]: But it's the same. It's the decision boundary between like Python, ADA, advanced data analysis, versus canvas. It's one of the trickiest like decision boundary behaviors that we had to like figure out, like how do you derive the intent from the human user query? Yeah. And how do I say this? Deriving the intent, meaning does the user expect canvas or some other tool, and then like making sure that the intent is like maximally satisfied, is like actually still one of the hardest problems. Yeah. Especially with like agents, right? Like you don't want like agents to go for like five minutes and do something in the background and then come back with like some mid answer that you could have gotten from like a normal model, or like answers that you didn't even want because it didn't have enough context. It didn't like follow up correctly.
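A decision-boundary eval of the kind Karina mentions can be sketched as a labeled set of intent cases. The queries and labels below are invented examples, not OpenAI's eval set:

```python
# Sketch of a tool-triggering eval: did the model route the query to the
# tool the user intended? Queries and labels are invented examples.
BOUNDARY_CASES = [
    ("Plot monthly revenue from this CSV", "python"),   # advanced data analysis
    ("Help me draft a cover letter", "canvas"),
    ("Debug this script and rerun it", "python"),
    ("Tighten the second paragraph of my essay", "canvas"),
]

def trigger_accuracy(choose_tool) -> float:
    """choose_tool(query) -> 'canvas' | 'python' | ... (the model under test)."""
    hits = sum(choose_tool(q) == label for q, label in BOUNDARY_CASES)
    return hits / len(BOUNDARY_CASES)
```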
swyx [00:41:40]: You said the magic word. We have to take a shot every time you say it. You said agents.
swyx [00:41:46]: So let's move to tasks. You just launched tasks. What was that like? What was the story? I mean, it's, it's your, it's your baby. So
Karina [00:41:52]: Now that I have a team, I actually, like, tasks was purely like my resident's project. I was mostly a supervisor. So I kind of like delegated a lot of things to my resident. His name is Vivek. And I think this is like one of the projects where I learned management, I would say. Yeah. But it was really cool. I think it's a very similar model. I'm trying to replicate the Canvas operational model, how do we operate with product people or like product applied orgs with research, and the same happened. I was trying to replicate like the methods and replicate the operational process with tasks. And actually tasks was developed in less than like two months. So if canvas took like, I don't know, four months, then tasks took like two months. And I think again, it's kind of a very similar process of like, how do we build evals? You know, some people like ask for like reminders in actual ChatGPT, even though they know it doesn't work. Yeah. So like there is some like demand or like desire from users to like do this. And actually I feel like tasks is like a simple feature, in my opinion, is something that you would want from any model. Right. But then the magic is like, because the model is so general, it knows how to use search or like canvas or like code interpreter. You know, you can like write stories and create Python puzzles, and when coupled with tasks it actually becomes like really, really powerful. It was like the same ideas of like, how do we shape the behavior of the model? Again, we shipped it as like a beta model in the model dropdown. And then we are working towards like making that feature integrated in like the core model. So I feel like the principle is that like everything should be like in one model, but because of some of the operational difficulties, it's much easier to like deploy a separate model first, to like learn from the user feedback, and then iterate very quickly and then improve the core model basically. Again, this project was also like together from the very beginning, designers, engineers, researchers were working all together, and together with model designers we were like trying to like come up with evaluations and like testing and like bug bashing. And it's like a lot of cool like synergy.
swyx [00:44:12]: Evals, bug bashing. I'm trying to distill. Okay. I would love a canvas for this, to distill what the ideal product management or research management process is. Right. Start from like, do you have a PRD? Do you have a doc that lists these, these things? Yes. And then from PRD, you get funding maybe or like, you know, staffing resources, whatever. Yes. And then prototype maybe. Yeah. Prototype.
Karina [00:44:37]: I would say like prototype as a prompted baseline. Everything starts with like a prompted baseline. Yeah. And then we craft like certain evaluations that you want to like capture, okay, that you want to like measure progress on, at least for the model, and then make sure that the evals are good and make sure that the prompted baseline actually fails on those evals, because then you have something to like hill climb on. And then once you start iterating on the model training, it's actually very iterative. So like every time you train the model, you like look at the benchmark or like look at your evals, and if it like goes up, it's like good. But then also you don't want it to be like super overfitting. That's where you run on other evals, right? Like intelligence evals or something. And then like. Yeah.
swyx [00:45:20]: You don't want regressions on the other stuff. Right. Yes. Okay. Is that your job or is that like the rest of the company's job to do?
Karina [00:45:26]: I think it's mainly my job. Really? The job of the people who built the feature.
swyx [00:45:30]: Because regressions are going to happen and you don't necessarily own the data for the other stuff.
Karina [00:45:34]: What's happening right now is that you basically only update your own datasets. So you compare against the baseline: you compare the regressions relative to the baseline model.
swyx [00:45:47]: Model training, and then bug bash. And that's about it. And then ship.
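(To make that loop concrete, here is a hypothetical harness for the process just described: a prompted baseline, target evals the baseline must fail, hill climbing during training, and a regression check against held-out evals. The names `train_step`, the eval sets, and the thresholds are illustrative, not OpenAI's actual tooling.)

```python
def evaluate(model, eval_set) -> float:
    """Fraction of eval cases passed. `model` is any callable prompt -> response;
    `eval_set` is a list of (prompt, checker) pairs -- hypothetical names."""
    passed = sum(checker(model(prompt)) for prompt, checker in eval_set)
    return passed / len(eval_set)

def iterate_on_model(baseline, train_step, target_evals, holdout_evals,
                     rounds=5, min_headroom=0.2):
    # The prompted baseline must actually fail the target evals;
    # otherwise there is nothing to hill climb on.
    base_target = evaluate(baseline, target_evals)
    assert base_target <= 1.0 - min_headroom, "evals too easy: no headroom to climb"

    base_holdout = evaluate(baseline, holdout_evals)  # regression reference point
    model = baseline
    for round_num in range(rounds):
        model = train_step(model)                  # one training iteration
        target = evaluate(model, target_evals)     # did we climb?
        holdout = evaluate(model, holdout_evals)   # did anything else regress?
        print(f"round {round_num}: target={target:.2f} holdout={holdout:.2f}")
        if holdout < base_holdout:
            print("  regression vs. baseline -- investigate before shipping")
    return model
```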
Karina [00:45:50]: Actually, I did the course with Andrew Ng. Yes. There was one little lesson around this. Okay.
swyx [00:45:57]: I haven't seen it. Product research? You tweeted a picture with him and it wasn't clear if you were working on a course. I mean, it looked like the standard course picture with Andrew Ng. Yes. Okay, so there was a course with him. What was it like working with him?
Karina [00:46:08]: No, I'm not working with him. I just did the course with him. Yeah.
Alessio [00:46:11]: How do you think about tasks? I started creating a bunch of them. Going back to composability: do you see these being composable together later? Like you schedule one task that does multiple tasks chained together. What's the vision?
Karina [00:46:27]: I would say tasks is a foundational module, obviously, to generalize to all sorts of behaviors that you want. Sometimes I see people who have, like, three tasks.
Karina [00:46:41]: And right now I don't think the model handles that very well. Ideally, we learn from user behavior and the model becomes more proactive in suggesting things: "I can do this for you every day, because I've observed that you do it every day." It becomes proactive behavior. Right now you have to be explicit: "every day, remind me of this." But ideally the model will always be thinking about you in the background and suggesting: "I noticed you've been reading these particular how-to articles; maybe I can suggest some every day." It becomes much more of a natural friend, I think.
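(As a toy illustration of that "notice a pattern, then offer to automate it" behavior; this is purely hypothetical and not how ChatGPT actually decides:)

```python
from datetime import date

def suggest_recurring_tasks(activity_log, min_days=3):
    """activity_log: (day, action) pairs observed from user behavior.
    If the same action recurs on enough distinct days, propose automating it,
    a crude stand-in for learning implicit preferences in the background."""
    days_per_action: dict[str, set[date]] = {}
    for day, action in activity_log:
        days_per_action.setdefault(action, set()).add(day)
    return [f"I noticed you do '{action}' most days -- want me to do it daily?"
            for action, days in days_per_action.items() if len(days) >= min_days]

log = [(date(2025, 1, d), "summarize AI news") for d in (13, 14, 15)]
log.append((date(2025, 1, 14), "book a flight"))
print(suggest_recurring_tasks(log))
# -> suggests automating the daily news summary, but not the one-off flight
```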
swyx [00:47:35]: Well, there is an actual startup called Friend that is trying to do that. Oh, yes. We'll interview Avi at some point. But it sounds like the guiding principle is just what is useful to you. It's a little bit B2C, you know. Is there any B2B push at all, or do you not think about that?
Karina [00:47:51]: I personally don't think about it as much, but I definitely feel like B2B is cool. Again, I come back to Claude in Slack. It was one of the first interfaces where the model was operating inside your organization, right? It would be very cool for the model to handle that, to become a productive member of your organization. Even processing user feedback: right now I'm thinking about processing user feedback, and it would be very cool if the model would just start doing this for us so we don't have to hire a new person just for that. Or very simple data analysis or data analytics on how a feature is doing.
swyx [00:48:36]: Do you do this analysis yourself? Or do you have a data science team that tells you insights?
Karina [00:48:40]: I think there are some data scientists. Okay.
swyx [00:48:43]: I've often wondered, I think there should be some startup that does automated data insights. I just throw you my data, you tell me. Yeah, exactly. Because that's what the data team at any company does, right? Give us your data, we'll make PowerPoints.
Karina [00:48:59]: That'd be very cool.
swyx [00:49:00]: I think that's a really good vision. You have thoughts on agents in general; there's some more proactive stuff. You actually tweeted a definition, which is kind of interesting.
Karina [00:49:09]: I did.
swyx [00:49:10]: Well, I'll read it out to you, and you tell me if you still agree with yourself. This is from five days ago. I know it's uncomfortable to have your tweets read to you; I've had this done to me. "Agents are a gradual progression of tasks, starting off with one-off actions, moving to collaboration, ultimately fully trustworthy long-horizon delegation in complex environments like multiplayer, multi-agents." Tasks and canvases fall within the first two. What is the third one?
Karina [00:49:34]: One of my weaknesses is that I like writing long sentences. I feel like that's a good thing, but I need to learn how to...
swyx [00:49:39]: That's fine. Is that your definition of agents? What are you looking for?
Karina [00:49:43]: I'm not sure if this is my definition of agents, but it's how I think it makes sense. For me to trust an agent with my passwords or my credit card, I actually need to build trust with that agent that it will handle my tasks correctly and reliably. And the way I would go about this is how I would naturally collaborate with other people. On any project, when we first meet, we don't know each other: we don't know each other's working styles, what I prefer, what they prefer, how they prefer to communicate, et cetera. So you spend the first, I don't know, two weeks just learning their style of working, and over time you adapt to it, and that's how you create the collaboration. At the beginning you don't have much trust, so how do you build more? It's the same thing as with a manager: how do you build trust with your manager? What do they need to know about you? What do you need to know about them? Trust builds through collaboration, which is why I feel building Canvas was the first step towards more collaborative agents. With humans, you need to show consistency: consistent effort towards each other, showing that you care and that you work together well. Consistency and collaboration are what create trust. And then I will naturally try to delegate tasks to a model because I know the model will not fail me. So it's about building the intuition for the form factor of new agents. Sometimes I feel a lot of researchers, or people in the AI community, are so into "agents, delegate everything," but on the way towards that, collaboration is actually one of the main milestones to get over, because that's where you learn the implicit preferences that will help towards this full-delegation model.
swyx [00:51:55]: Trust is very important. I have an AGI working for me and we're still working on the trust issues. Okay. We are recording this just before the launch of the podcast. So we have the collaborative side; the other side of agents that is very topical recently is computer use. Anthropic launched computer use recently, and, you know, you're not saying this, but OpenAI is rumored to be working on things, and a lot of labs are exploring this sort of "drive a computer" idea generally. How important is that for agents?
Karina [00:52:23]: I think it will be one of the core capabilities of agents. Yeah. Agents using your desktop or your computer is the delegation part: you might want to delegate to an agent, "order a book for me," or "search for a flight and then book it." This idea has been flying around for a long time, since at least 2022, and finally we're here. There's just a lot of lag between an idea and full execution, on the order of two to three years.
swyx [00:53:01]: The vision models had to get better. Yeah. A lot better.
Karina [00:53:04]: The perception and so on. But I think it's really cool. It has implications for consumers, definitely, for delegation. But again, I think latency is one of the most important factors here. You want to make sure the model correctly understands what you want, and if it doesn't understand, or doesn't have the full context, it should ask a follow-up question and then use that to perform the task. The agent should know whether it has enough information to complete the task with maximal success, and I think that's still an open research question. And the second idea is that it enables a new class of research questions around computer use agents. Can we use it in RL? That's a very cool, nascent area of research.
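(One way to picture that "ask before acting" step is a check for missing context before execution. A minimal sketch with a hand-written slot list; judging sufficiency in general is, as she says, an open research question:)

```python
# Hypothetical slot requirements per task type; a real agent would infer these.
REQUIRED_SLOTS = {"book_flight": ["origin", "destination", "date"]}

def missing_info(task: str, known: dict) -> list[str]:
    """Which required details are still unknown for this task?"""
    return [slot for slot in REQUIRED_SLOTS.get(task, []) if slot not in known]

def next_action(task: str, known: dict) -> str:
    """Ask a follow-up question while context is incomplete; act only once the
    agent judges it has enough information to succeed."""
    gaps = missing_info(task, known)
    if gaps:
        return f"Follow-up question: what is the {gaps[0]}?"
    return f"Executing {task} with {known}"

print(next_action("book_flight", {"destination": "NYC"}))
# -> asks for the origin instead of guessing and failing the delegation
```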
swyx [00:53:59]: What's one thing that you think, by the end of this year, people will be using computer use agents a lot for?
Karina [00:54:05]: I don't know. It's really hard to predict. I'm trying to look for.
swyx [00:54:09]: Maybe for coding.
Karina [00:54:11]: I don't know.
swyx [00:54:11]: For coding?
Karina [00:54:12]: Right now with Canvas we're thinking about this paradigm of moving from real-time collaboration to asynchronous collaboration. It would be cool if I could just delegate to a model, "can you figure out how to build this feature?", and the model could test that feature in its own virtual environment or something. I don't know, maybe this is a weird idea. Obviously, there will also be a lot of consumer use cases, like "hey, shop for me."
swyx [00:54:43]: I was going to say, everyone goes to booking plane tickets. That's the worst example, because you only book plane tickets what, two or three times a year? Or concert tickets.
Karina [00:54:50]: I don't know. Yeah.
swyx [00:54:51]: Concert tickets. Yeah.
Karina [00:54:51]: Like Taylor Swift.
swyx [00:54:52]: I want a Facebook Marketplace bot that just scrolls Facebook Marketplace for free stuff. Yeah. And then I just go and get it. Yeah.
Karina [00:55:00]: I have a question. I don't know. What do you think?
swyx [00:55:01]: I have been very bearish on computer use because the agents are slow, they're expensive, they're imprecise; the accuracy is horrible, still, even with Anthropic's new stuff. I'm really waiting to see what OpenAI might do to change my opinion. And really what I'm trying to do is: between January last year and December last year I changed a lot of opinions, so what am I wrong about today? Computer use is probably one of them, where I don't know if by the end of the year we'll still be using them. Will my ChatGPT, will every ChatGPT instance, have a virtual computer? Maybe? I don't know. Coding? Yes, because he invested in a company that does that, for the code sandboxes there. There are a bunch of code sandbox companies; E2B is the name. And in browsers, yes. But computer use is coding plus browsers plus everything else. There's a whole operating system, and you have to be pixel-precise, you have to OCR. Well, I think OCR is basically solved, but pixel-precise, and understanding the UI of whatever you're operating. And I don't know if the models are there. I don't know. There you go.
Karina [00:56:01]: Yeah. Two questions. Do you think about the progress of mini models, like o3-mini or o1-mini? It goes back to Claude 3 Haiku and Claude 1.2 Instant: this gradual progression of small models becoming really powerful while also being very fast. I'm sure computer use agents coupled with those small models would solve some of the latency issues, in my opinion. And in terms of the operating system, I think a lot about this these days. We're entering this task-oriented operating system, a generative OS. In my opinion, in a few years people will click on websites way less. I want to see the plot of website clicks over time, but my prediction is that it will go down, and people's access to the internet will be through the model's lens. Either you see what the model is doing on the internet, or you don't.
Alessio [00:57:10]: I think my personal benchmark for computer use this year is expense reports. I have to do my expense report every month. For example, if I expense a lunch, I have to go back to the calendar and see who I was having lunch with, then I need to upload the receipt of the lunch and tag the person on the expense report, and so on. It's very simple on a task-by-task basis, but you have to go to every app: the Uber app, the camera roll to get the photo of the receipt, all these things. You cannot actually do it today, but it feels like a tractable problem; probably by the end of the year we should be able to do it.
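(Alessio's benchmark decomposes naturally into a chain of per-app steps, which is exactly what makes it hard for today's agents: each hop is trivial, but the agent must carry context across apps. A hypothetical sketch of that decomposition, with a stub executor:)

```python
# Each step names the app a computer use agent must drive and what to do there.
EXPENSE_REPORT_PLAN = [
    ("calendar",    "find who I had lunch with on the expense date"),
    ("camera_roll", "locate the photo of the lunch receipt"),
    ("expense_app", "create a line item, attach the receipt, tag the person"),
]

def run_plan(plan, drive_app):
    """drive_app(app, instruction) -> result stands in for an agent operating
    each app's UI; earlier results stay available as context for later steps."""
    context = {}
    for app, instruction in plan:
        context[app] = drive_app(app, instruction)
    return context

# Stub executor, just to show the cross-app control flow end to end.
print(run_plan(EXPENSE_REPORT_PLAN,
               lambda app, instruction: f"[{app}] done: {instruction}"))
```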
Karina [00:57:49]: Yeah. This reminds me of the idea that you kind of want to show computer use agents how you want things done, how you like booking your flights. It's kind of like few-shot... Yeah.
swyx [00:58:03]: Demonstration.
Karina [00:58:04]: Demonstrations. Maybe there's a more efficient way you do things, and the model should learn to do it that way. So again, it comes back to personalized tasks too: right now tasks are rudimentary, but in the future they should become much more personalized to your preferences.
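(In prompting terms, "show the agent how you book flights" is just few-shot conditioning on your own recorded demonstrations. A hypothetical sketch; the trace format and names are made up for illustration:)

```python
def build_fewshot_prompt(demonstrations, new_request):
    """Prepend recorded demonstrations (the user's own action traces) so the
    model imitates *your* way of doing a task -- a hypothetical few-shot sketch."""
    shots = "\n\n".join(f"Task: {d['task']}\nSteps I took: {d['steps']}"
                        for d in demonstrations)
    return f"{shots}\n\nTask: {new_request}\nSteps to take:"

demos = [{
    "task": "book a flight",
    "steps": "open Google Flights; filter to nonstop; sort by price; pick an aisle seat",
}]
print(build_fewshot_prompt(demos, "book a flight to NYC next Friday"))
```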
swyx [00:58:27]: Okay. Well, we mentioned that. I'll also say that one takeaway I got from this conversation is that ChatGPT will have to integrate a lot more with my life. You will need my calendar, you will need my email. Yes, for sure. And maybe you'll use MCP, I don't know. Have you looked at MCP?
Karina [00:58:43]: No, I haven't.
swyx [00:58:44]: It's good. It's got a lot of adoption. Okay.
Alessio [00:58:47]: Anything else that we're forgetting, or maybe something that people should use more, before we wrap on the OpenAI side of things?
Karina [00:58:56]: I think the search product is kind of cool, ChatGPT search. Right now I think a lot about the magic of ChatGPT when it first came out: you'd give it any instruction, like "write a poem," and it would follow the instruction and give you a poem. But the magic of the next generation of ChatGPT, and we're marching towards that, is that when you ask a question, the ideal output isn't just text. It might be some form of a React app generated on the fly. This is already happening with search: ask for Apple stock and it gives you the chart, this generative UI. And this is what I mean by the evolution of ChatGPT into more of a generative OS with a task orientation. The UI will adapt to what you like: if you really like 3D visualizations, the model should give you as much visualization as possible; if you like a certain kind of UI, maybe rounded corners, certain color schemes, the UI becomes more dynamic, a custom, personal model. From the personal computer to the personal model, I think. Yeah.
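(The "generative UI" idea, the model returning a renderable component rather than prose, can be pictured as the model emitting a structured spec that the client then draws, with user preferences folded in. A toy sketch, not ChatGPT's actual search pipeline:)

```python
import json

def render_answer(query: str, prefs: dict) -> str:
    """Choose an output component instead of defaulting to prose -- a toy
    stand-in for a model emitting a UI spec that the client then renders."""
    if "stock" in query.lower():
        component = {"type": "chart", "series": "AAPL",
                     "style": {"corners": prefs.get("corners", "square")}}
    else:
        component = {"type": "text", "body": f"Answer to: {query}"}
    return json.dumps(component)

print(render_answer("Apple stock today", {"corners": "rounded"}))
# -> a chart spec shaped by user preferences, not a paragraph of text
```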
swyx [01:00:20]: Overall takes: you are one of the rare few people, actually, maybe not that rare, to work at both OpenAI and Anthropic.
Karina [01:00:28]: Not anymore. Yeah.
swyx [01:00:31]: Cultural differences. What are general takes that only people like you get to see?
Karina [01:00:35]: I love both places. I learned so much at Anthropic, I'm really, really grateful to the people, and I'm still friends with a lot of people there. And I was really sad when John left OpenAI, because I came to OpenAI partly because I wanted to work with him the most. (What's he doing now?) But I think it changed a lot. When I first joined Anthropic, they were, I don't know, 60 or 70 people; when I left, they were around 700. Massive growth. OpenAI and Anthropic are different in terms of product mindset: OpenAI is maybe much more willing to take product risks and explore different bets, and Anthropic is much more focused. I think that's fine, they have to prioritize, but they definitely double down on enterprise more than consumers, or something like that. Some of the product mindsets are different. On research, I've enjoyed both research cultures, at Anthropic and at OpenAI. On a daily basis, they're more similar than different.
swyx [01:01:50]: I mean, no surprise.
Karina [01:01:52]: How you run experiments is very similar. I'm sure Anthropic...
swyx [01:01:55]: I mean, you know, Dario used to be VP of research at OpenAI, right? So he set the culture there. So yeah, it makes sense. Maybe quick takes on people that you mentioned: Barret, Mira. What's one thing you learned from Barret, Mira, Sam, maybe? One lesson you would share with others.
Karina [01:02:13]: I wish I had worked with them much longer. I think what I learned from Mira is her interdisciplinary mindset. She's really good at connecting the dots between product and research, balancing them into a comprehensive, coherent story. Sometimes there are researchers who really hate doing product and researchers who really love it, a kind of dichotomy between the two, and safety is part of this process as well. You want to create something coherent, to think from a systems perspective, about the bigger picture, and I learned a lot from her on that. I also definitely feel I have much more creative freedom at OpenAI, and that's because of the environment the leaders set. So if I have an idea, if I want...
swyx [01:03:10]: Propose it. Yeah, exactly. On your first month.
Karina [01:03:11]: There's more creative freedom and resource reallocation. Especially in research, it's about being adaptable to new technologies and changing your views based on them. I've seen a lot of researchers who change research directions based on empirical results. And I've also seen researchers who get stuck on the same direction for two or three years; it never works out, but they stay stubborn. So adaptability to new directions and new paradigms is one of those things that...
Alessio [01:03:42]: Is this a Barret thing, or is this a general culture thing?
Karina [01:03:45]: A general kind of culture, I think. Cool.
Alessio [01:03:46]: Yeah. And just to wrap up, we usually have a call to action.
Alessio [01:03:52]: Do you want people to give you feedback? Do you want people to join your team?
Karina [01:03:56]: Oh yeah, of course. I'm definitely hiring research engineers who are more product-minded: people who know how to train the models but are also interested in deploying them into products and developing new product features. I'm definitely looking for those archetypes of research engineers or research scientists. So yeah, if you're looking for a job, or interested in joining my team, I'm really looking forward to that. Please feel free to reach out.
swyx [01:04:24]: And then just generally, what do you want people to do more of in the world, whether or not they work with you? A call to action, as in, everyone should be doing this.
Karina [01:04:32]: This is something I tell a lot of designers: people should spend more time just playing around with the models. The more you play with a model, the more creative ideas you'll get about potential new product features, or new kinds of interaction paradigms you might want to create with those models. I feel like we're bottlenecked by human creativity in completely changing the way we think about the internet, or about software. AI right now pushes us to rethink everything we've done before, in my view. And I feel like not enough people are doubling down on those ideas; I'm just not seeing a lot of human creativity in interface design or product design mindsets. So it would be really great for people to just do that. Especially right now, some research is becoming much more product-oriented, so you can actually train the models for the things you want to do in a product.
swyx [01:05:41]: And you've defined the process. This is now my go-to for how to manage that process. I think it's pretty common sense, but it's nice to hear it from you, because you actually did it. Thank you for driving innovation in interface design and new models at OpenAI and Anthropic, and we're looking forward to what you're going to talk about in New York.
Karina [01:06:01]: Thank you so much for inviting me. I hope my job will not be automated by then.
swyx [01:06:06]: Well, I hope you automate yourself so you can do whatever else you want to do. That's it. Thank you. Awesome. Thanks.