

India's Big Indic Data Chase
Sep 26, 2025
In this discussion, Abhishek Upperwal, founder of Soket AI and an expert in Indic-language AI, argues that India urgently needs its own AI models that understand native languages. He describes the data bottleneck facing startups that want to build these models: high-quality Indic-language datasets are scarce. Abhishek explains why translation alone falls short and discusses strategies for crowdsourcing data and drawing on government archives. The conversation also covers the economic and security implications of a sovereign AI voice for India.
AI Snips
Why India Needs Its Own AI Voice
- Western-trained LLMs embed cultural and legal biases that don't map to India.
- Sovereign, Indic-first models are needed for defense, policy and culturally aligned outputs.
Hallucinations From Western Models
- Abhishek described a Western legal model fine-tuned on Indian law that hallucinated hybrid US-India laws.
- He also recalled models misclassifying Indian territories, which made defense agencies wary.
Data Is The Real Bottleneck
- Data scarcity in Indic languages is the single biggest bottleneck for Indian LLMs.
- Startups combine licensing, crowdsourcing and synthetic generation to assemble usable corpora.