
ThursdAI - Jan 15 - Agent Skills Deep Dive, GPT-5.2 Codex Builds a Browser, Claude Cowork for the Masses, and the Era of Personalized AI!
ThursdAI - The top AI news from the past week
What are skills and why they matter
Eleanor Berger defines skills, their standard, and how they customize general-purpose agents.
Hey y'all, Alex here, and this week I was especially giddy to record the show! Mostly because when a thing clicks for me that hasn't clicked before, I can't wait to tell you all about it!
This week, that thing is Agent Skills! Currently the best way to customize your AI agents with domain expertise, in a simple, repeatable way that doesn't blow up the context window! We mentioned skills when Anthropic first released them (Oct 16) and when they became an open standard, but it didn't really click until last week! So more on that below.
Also this week, Anthropic released a research preview of Claude Cowork, an agentic tool for non-coders, OpenAI finally let loose GPT-5.2 Codex in the API (it was previously available only via Codex), Apple announced a deal with Google for Gemini to power Siri, OpenAI and Anthropic both doubled down on healthcare, and much more! We had an incredible show, with Agent Skills expert Eleanor Berger and the usual gang of co-hosts; I strongly recommend watching the show in addition to the newsletter!
Also, I vibe coded skills support for all LLMs into Chorus, and promised folks a link to download it, so look for that in the footer. Let's dive in!
ThursdAI is where you stay up to date! Subscribe to keep us going!
Big Company LLMs + APIs: Cowork, Codex, and a Browser in a Week
Anthropic launches Claude Cowork: Agentic AI for Non-Coders (research preview)
Anthropic announced Claude Cowork, which is basically Claude Code wrapped in a friendly UI for people who don't want to touch a terminal. It's a research preview available on the Max tier, and it gives Claude read/write access to a folder on your Mac so it can do real work without you caring about diffs, git, or the command line.
The wild bit is that Cowork was built in a week and a half, and according to the Anthropic team it was 100% written using Claude Code. This feels like a "we've crossed a threshold" moment. If you're wondering why this matters, it's because coding agents are general agents. If a model can write code to do tasks, it can do taxes, clean your desktop, or orchestrate workflows, and that means non-developers can now access the same leverage developers have been enjoying for a year.
It also isn't just for files: it comes with a Chrome connector, meaning it can navigate the web to gather info, download receipts, or do research, and it uses skills (more on those later).
Earlier this week I recorded this first reactions video about Cowork and I've been testing it ever since. It's a very interesting approach to coding agents that "hide the coding" to just... do things. Will this become as big as Claude Code for Anthropic (which is reportedly a $1B business for them)? Let's see!
There are real security concerns here, especially if you're not in the habit of backing up or using git. Cowork sandboxes a folder, but it can still delete things in that folder, so don't let it loose on your whole drive unless you like chaos.
GPT-5.2 Codex: Long-Running Agents Are Here
OpenAI finally shipped GPT-5.2 Codex into the API, after announcing it as the answer to Opus 4.5 and keeping it available only in Codex until now. The big headline is SOTA on SWE-Bench and long-running agentic capability. People describe it as methodical. It takes longer, but it's reliable on extended tasks, especially when you let it run without micromanaging.
This model is now integrated into Cursor, GitHub Copilot, VS Code, Factory, and Vercel AI Gateway within hours of launch. It's also state-of-the-art on SWE-Bench Pro and Terminal-Bench 2.0, and it has native context compaction. That last part matters because if you've ever run an agent for long sessions, the context gets bloated and the model gets dumber. Compaction is an attempt to keep it coherent by summarizing old context into fresh threads, and we debated whether it really works. I think it helps, but I also agree that the best strategy is still to run smaller, atomic tasks with clean context.
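To give a rough feel for what compaction does, here's a conceptual sketch in Python. To be clear, this is not how GPT-5.2 Codex implements it; the budget numbers and the summarize_with_llm helper are hypothetical stand-ins, the latter for whatever model call would actually produce the recap.

```python
# Conceptual sketch of context compaction (NOT how GPT-5.2 Codex implements it).
# Once the transcript grows past a budget, fold the older turns into one summary
# message so the live context stays small and the recent turns stay verbatim.

def summarize_with_llm(text: str) -> str:
    # Hypothetical stand-in: in practice this would be an LLM call asking for a
    # terse recap of decisions made, files touched, and open TODOs.
    return "Summary of earlier work: " + text[:500]

def compact(messages: list[dict], max_chars: int = 20_000, keep_recent: int = 6) -> list[dict]:
    """Replace everything but the last few turns with a single summary message."""
    total = sum(len(m["content"]) for m in messages)
    if total <= max_chars or len(messages) <= keep_recent:
        return messages  # still fits, nothing to do
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = summarize_with_llm("\n".join(m["content"] for m in old))
    return [{"role": "system", "content": summary}] + recent
```

The tradeoff we debated on the show is visible right in the sketch: whatever the summary drops is gone for good, which is why small, atomic tasks with clean context still tend to win.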
Cursor vibe-coded browser with GPT-5.2 and 3M lines of code
The most mind-blowing thing we discussed is Cursor letting GPT-5.2 Codex run for a full week to build a browser called FastRenderer. This is not Chromium-based. It's a custom HTML parser, CSS cascade, layout engine, text shaping, paint pipeline, and even a JavaScript VM, written in Rust, from scratch. The codebase is open source on GitHub, and the full story is on Cursor's blog.
It took nearly 30,000 commits and millions of lines of code. The system ran hundreds of concurrent agents with a planner-worker architecture, and GPT-5.2 was the best model for staying on task in that long-running regime. That's the real story, not just "lol a model wrote a browser." This is a stress test for long-horizon agentic software development, and it's a preview of how teams will ship in 2026.
As I said on the show, browsers are REALLY hard; it took the industry two decades to settle and be able to render websites properly, and there's a reason everyone's using Chromium. This is VERY impressive.
Now as for me, I began using Codex again, but I still find Opus better? Not sure if this is just me expecting something that's not there? I'll keep you posted.
Gemini Personal Intelligence: The Data Moat King is back!
What kind of car do you drive? Does ChatGPT know that? Welp, it turns out Google does (based on your emails and Google Photos), and now Gemini can tap into this personal info (if you allow it; they are stressing privacy) and give you much more personalized answers!
Flipping this beta feature on lets Gemini reason across Gmail, YouTube, Photos, and Search with explicit opt-in permissions, and it's rolling out to Pro and Ultra users in the US first.
I got to try it early, and it's uncanny. I asked Gemini what car I drive, and it told me I likely drive a Model Y, but it noticed I recently searched for a Honda Odyssey and asked if I was thinking about switching. It was kinda... freaky, because I forgot I had early access and this was turned on.
Pro tip: if you're brave enough to turn this on, ask it for a complete profile on you.
Now the last piece is for Gemini to become proactive, suggesting things for me based on my needs!
Apple & Google: The Partnership (and Drama Corner)
We touched on this in the intro, but it's official: Apple Intelligence will be powered by Google Gemini for "world knowledge" tasks. Apple stated that after "careful evaluation," Google provided the most capable foundation model for their... Apple Foundation Models. It's confusing, I agree.
Honestly? I got excited about Apple Intelligence, but Siri is still... Siri. It's 2026 and we are still struggling with basic intents. Hopefully, plugging Gemini into the backend changes that?
In other drama: the Silicon Valley carousel continues. Three co-founders (Barret Zoph, Sam Schoenholz, and Luke Metz) from Thinking Machines (and former OpenAI folks) have returned to the mothership (OpenAI), amid some vague tweets about "unethical conduct." It's never a dull week on the timeline.
This Weekâs Buzz: WeaveHacks 3 in SF
I've got one thing in the Buzz corner this week, and it's a big one. WeaveHacks 3 is back in San Francisco, January 31st - February 1st. The theme is self-improving agents, and if you've been itching to build in person, this is it. We've got an amazing judge lineup, incredible sponsors, and a ridiculous amount of agent tooling to play with.
You can sign up here: https://luma.com/weavehacks3
If you're coming, note on the form that you heard about it on ThursdAI and we'll make sure you get in!
Deep Dive: Agent Skills With Eleanor Berger
This was the core of the episode, and I'm still buzzing about it. We brought on Eleanor Berger, who has basically become the skill evangelist for the entire community, and she walked us through why skills are the missing layer in agentic AI.
Skills are simple markdown files with a tiny bit of metadata, sitting in a directory together with optional scripts, references, and assets. The key idea is progressive disclosure. Instead of stuffing your entire knowledge base into the context, the model only sees a small list of skills and loads only what it needs. That means you can have hundreds of skills without blowing your context window (and making the model dumber and slower as a result).
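To make progressive disclosure concrete, here's a minimal sketch of how a harness might handle it, assuming the common skills/<skill-name>/SKILL.md layout with name and description in a frontmatter block; the directory names and helper functions here are just illustrative, not any particular harness's implementation.

```python
from pathlib import Path

def read_frontmatter(skill_md: Path) -> dict:
    """Parse the frontmatter block (between '---' lines) of a SKILL.md.

    Kept dependency-free: we only pull simple 'key: value' pairs like name/description.
    """
    text = skill_md.read_text(encoding="utf-8")
    meta = {}
    if text.startswith("---"):
        header = text.split("---", 2)[1]
        for line in header.strip().splitlines():
            if ":" in line:
                key, value = line.split(":", 1)
                meta[key.strip()] = value.strip()
    return meta

def build_skill_index(skills_dir: str = "skills") -> dict[str, dict]:
    """Scan skills/<name>/SKILL.md and keep only the lightweight metadata in memory."""
    index = {}
    for skill_md in Path(skills_dir).glob("*/SKILL.md"):
        meta = read_frontmatter(skill_md)
        name = meta.get("name", skill_md.parent.name)
        index[name] = {"description": meta.get("description", ""), "path": skill_md}
    return index

def skills_summary_for_prompt(index: dict[str, dict]) -> str:
    """The only thing that goes into the system prompt: one line per skill."""
    return "\n".join(f"- {name}: {info['description']}" for name, info in index.items())

def load_skill_body(index: dict[str, dict], name: str) -> str:
    """Called only when the agent decides a skill is relevant: the full instructions."""
    return index[name]["path"].read_text(encoding="utf-8")

if __name__ == "__main__":
    index = build_skill_index("skills")
    print(skills_summary_for_prompt(index))        # cheap: a few tokens per skill
    # load_skill_body(index, "some-skill")         # expensive: only when actually needed
```

The point of the sketch is the asymmetry: the prompt only ever carries one line per skill, and the full markdown (plus any scripts or references it points to) is pulled in on demand.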
The technical structure is dead simple, but the implications are huge. Skills create a portable, reusable, composable way to give agents domain expertise, and they now work across most major harnesses. That means you can build a skill once and use it in Claude, Cursor, AMP, or any other agent tool that supports the standard.
Eleanor made the point that skills are an admission that we now have general-purpose agents. The model can do the work, but it doesn't know your preferences, your domain, your workflows. Skills are how you teach it those things. We also talked about how scripts inside skills reduce variance because you're not asking the model to invent code every time; you're just invoking trusted tools.
What really clicked for me this week is how easy it is to create skills using an agent. You don't need to hand-craft directories. You can describe your workflow, or even just do the task once in chat, and then ask the agent to turn it into a skill. It really is very, very simple! And that's likely the reason everyone is adopting this simple format for extending their agents' knowledge.
Get started with skills
If you use Claude Chat, the simplest way to get started is to ask Claude to review your previous conversations and suggest a skill for you. Or, at the end of a long chat where you went back and forth with Claude on a task, ask it to distill the important parts into a skill. If you want to use other people's skills, and you are using Claude Code or any of the supported IDEs/agents, here's where to download the folders and install them:
If you aren't a developer and don't subscribe to Claude, well, I've got good news for you! I vibecoded skill support for every LLM.
The Skills Demo That Changed My Mind
I was resistant to skills at first, mostly because I wanted them inside my chat interface and not just in CLI tools. And I wasn't subscribed to Claude for a while. Then I realized I could add skill support directly to Chorus, the open-source multi-model chat app, and I used Claude Code plus Ralph loops to vibe code it in a few hours. Now I can run skills with GPT-5.2 Codex, Claude Opus, and Gemini from the same chat interface. That was my "I know kung fu" moment.
If you want to try Chorus with skills enabled, you can download my release here! It's Mac-only, and the builds are unsigned, so macOS will complain, but you can run them anyway.
And if you want to explore more awesome skills, check out Vercel's React Best Practices skills and UI Skills. It's the beginning of a new kind of distribution: knowledge packaged as skills, shared like open source libraries (or paid for!).
Open Source AI
Baichuan-M3 is a 235B medical LLM fine-tuned from Qwen3, released under Apache 2.0. The interesting claim here is that it beats GPT-5.2 on OpenAI's HealthBench, including a remarkably low 3.5% hallucination rate.
What makes it different from typical medical models is that it's trained to run actual clinical consultations, asking follow-up questions and reasoning through differential diagnoses rather than just spitting out answers. Nisten pointed out that if you're going to fine-tune something for healthcare, Qwen3 MoE is an excellent base because of its multilingual capabilities, which matters a lot in clinical settings. You can run it with vLLM or SGLang if you've got the hardware. (HF)
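If you do have the hardware, a minimal vLLM sketch might look like the following. The repo id below is a placeholder (grab the real one from the HF link above), and the tensor-parallel setting is just an example for a multi-GPU box; a 235B model needs serious hardware.

```python
from vllm import LLM, SamplingParams

# Placeholder: replace with the actual Baichuan-M3 repository id from Hugging Face.
MODEL_ID = "<baichuan-m3-repo-id>"

# tensor_parallel_size spreads the weights across GPUs; adjust to your machine.
llm = LLM(model=MODEL_ID, tensor_parallel_size=8, trust_remote_code=True)

prompt = (
    "A 54-year-old presents with chest pain radiating to the left arm. "
    "What follow-up questions would you ask before suggesting a differential?"
)
outputs = llm.generate([prompt], SamplingParams(temperature=0.6, max_tokens=512))
print(outputs[0].outputs[0].text)
```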
LongCat-Flash-Thinking-2601 from Meituan is a 560B MoE (27B active) released fully MIT-licensed. It's specifically built for agentic tasks, scoring well on tool-use benchmarks like τ²-Bench and BrowseComp.
There's a "Heavy Thinking" mode that pushes AIME-25 to 100%. What I like about this one is the training philosophy: they inject noise and broken tools during RL to simulate messy real-world conditions, which is exactly what production agents deal with. You can try it at longcat.chat and on GitHub.
We also saw Google release MedGemma this week (blog), a 4B model optimized for medical imaging like X-rays and CT scans, and TranslateGemma (X), a family of on-device translation models (4B, 12B and 27B), which both seem kind of cool! Didn't have tons of time to dive into them unfortunately.
Vision, Voice & Art (Rapid Fire)
* Veo 3.1 adds native vertical video, 4K output, and better consistency in the Gemini API. Huge for creators (blog)
* Viral Kling motion-transfer vids are breaking people's brains about what AI video pipelines will look like.
* Pocket TTS from Kyutai Labs: a 100M-parameter open-source TTS model that runs on CPU and clones voices from seconds of audio (X)
* GLM-Image drops as an open-source hybrid AR + diffusion image model with genuinely excellent text rendering, but it's pretty weak at everything else
* Black Forest Labs drops open source Flux.2 [Klein] 4B and 9B small models that create images super fast! (X, Fal, HF)
Phew, ok. I was super excited about this one and I'm really, really happy with the result. I was joking on the pod that to prepare for this podcast, I not only had to collect all the news, I also had to ramp up on Agent Skills, and I wish we had the ability to upload information like in the Matrix, but alas we don't. I also really enjoyed vibecoding a whole feature into Chorus just to explore skills fully; my mind was absolutely blown when it worked after 3 hours of Ralphing!
See you next week! I think I have one more super exciting thing to play with this week before I talk about it!
TL;DR and Show Notes
* Hosts & Guests
* Alex Volkov - AI Evangelist & Weights & Biases (@altryne)
* Co-Hosts: Wolfram Ravenwolf (@WolframRvnwlf), Yam Peleg (@yampeleg), Nisten Tahiraj (@nisten), LDJ (@ldjconfirmed)
* Guest: Eleanor Berger (@intellectronica)
* Open Source LLMs
* Baichuan-M3 - A 235B open-source medical LLM that beats GPT-5.2 on HealthBench with a 3.5% hallucination rate, featuring full clinical consultation capabilities. (HF, Blog, X Announcement)
* LongCat-Flash-Thinking-2601 - Meituan's 560B MoE (27B active) agentic reasoning model, fully MIT licensed. Features "Heavy Thinking" mode scoring 100% on AIME-25. (GitHub, Demo, X Announcement)
* TranslateGemma - Googleâs open translation family (4B, 12B, 27B) supporting 55 languages. The 4B model runs entirely on-device. (Arxiv, Kaggle, X Announcement)
* MedGemma 1.5 & MedASR - Native 3D imaging support (CT/MRI) and a speech model that beats Whisper v3 by 82% on clinical dictation error rates. (MedGemma HF, MedASR HF, Arxiv)
* Big CO LLMs + APIs
* Claude Cowork - Anthropicâs new desktop agent allows non-coders to give Claude file system and browser access to perform complex tasks. (TechCrunch, X Coverage)
* GPT-5.2 Codex - Now in the API ($1.75/1M input). Features native context compaction and state-of-the-art performance for long-running agentic loops. (Blog, Pricing)
* Cursor & FastRenderer - Cursor used GPT-5.2 Codex to build a 3M+ line Rust browser from scratch in one week of autonomous coding. (Blog, GitHub, X Thread)
* Gemini Personal Intelligence - Google leverages its data moat, letting Gemini reason across Gmail, Photos, and Search for hyper-personalized proactive help. (Blog, X Announcement)
* Partnerships & Drama
* Apple + Gemini - Apple officially selects Gemini to power Siri backend capabilities.
* OpenAI + Cerebras - A $10B deal for 750MW of high-speed compute through 2028. (Announcement)
* Thinking Machines - Co-founders and CTO return to OpenAI amidst drama; Soumith Chintala named new CTO.
* This Weekâs Buzz
* WeaveHacks 3 - Self-Improving Agents Hackathon in SF (Jan 31-Feb 1). (Sign Up Here)
* Vision, Voice & Audio
* Veo 3.1 - Native 9:16 vertical video, 4K resolution, and reference image support in Gemini API. (Docs)
* Pocket TTS - A 100M parameter CPU-only model from Kyutai Labs that clones voices from 5s of audio. (GitHub, HF)
* GLM-Image - Hybrid AR + Diffusion model with SOTA text rendering. (HF, GitHub)
* FLUX.2 [klein] - Black Forest Labs releases fast 4B (Apache 2.0) and 9B models for sub-second image gen. (HF Collection, X Announcement)
* Kling Motion Transfer - Viral example of AI video pipelines changing Hollywood workflows. (X Thread)
* Deep Dive: Agent Skills
* Vercel React Best Practices - Pre-packaged skills for agents. (Blog)
* UI Skills - Documentation and skill standards. (Docs)
* Chorus with Skills - My fork of Chorus enabling skills for all LLMs. (Release)
This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe


