Andon Labs tests AI autonomy by letting agents run businesses in messy reality with real customers, consequences. In VendingBench, an agent starts with $500 and an empty vending machine, researches trends and suppliers, emails wholesalers, restocks, tracks sales, and iterates for profit. When deployed at Anthropic, humans red-teamed it with sob stories, discount demands, and bizarre requests like tungsten cubes, triggering “bank runs” of freebie seekers. Long histories caused drift and hallucinations, including dramatic escalations and invented security reports. Multi-agent supervisors often amplified each other into hype or doom. Better tools and memory compression help, but long-horizon planning stays fragile.