India's Big Indic Data Chase

Sep 26, 2025

In this discussion, Abhishek Upperwal, founder of Soket AI and an expert in Indic-language AI, highlights the urgent need for India to develop its own AI models that understand native languages. He addresses the significant data bottleneck facing startups aiming to build these models, noting the scarcity of high-quality Indic language datasets. Abhishek emphasizes the limitations of mere translation and describes strategies for crowdsourcing data and drawing on government archives. The conversation covers the economic and security implications of a sovereign AI voice for India.

INSIGHT

Why India Needs Its Own AI Voice

Western-trained LLMs embed cultural and legal biases that don't map to India.
Sovereign, Indic-first models are needed for defense, policy and culturally aligned outputs.

ANECDOTE

Hallucinations From Western Models

Abhishek described a Western legal model fine-tuned on Indian law that hallucinated hybrid US-India laws.
He also recalled models misclassifying Indian territories, which made defense agencies wary.

INSIGHT

Data Is The Real Bottleneck

Data scarcity in Indic languages is the single biggest bottleneck for Indian LLMs.
Startups combine licensing, crowdsourcing and synthetic generation to assemble usable corpora.

A quiet race is on to give India its own AI voice. From call-centre automation to defence and legal systems, Abhishek Upperwal of Soket Labs and journalist Swathi Moorthy tell us why AI trained only on Western, English-heavy data cannot meet India’s needs. Translation isn’t enough; models must “think” in Hindi, Tamil or Marathi to capture nuance and reduce bias. The government’s IndiaAI mission, with nearly ₹10,000 crore in funding, is catalysing startups to build these Indic models. But their biggest bottleneck is data. Only a sliver of the world’s open datasets is in Indian languages, and even public archives like Doordarshan take time to unlock. Startups are scrambling: crowdsourcing voices, licensing publishing-house content, generating synthetic text and negotiating with ministries to reach the 15–20 trillion high-quality tokens needed for a world-class foundation model. In this episode, host Anirban Chowdhury, ET’s Swathi Moorthy and Soket AI’s founder, Abhishek Upperwal, try to answer the following questions:

What makes sovereign, Indic-first AI critical for India’s economy and security?
How are innovators overcoming the huge shortage of quality language data?
Can low-cost, DeepSeek-style methods help India build frugal yet powerful models?
Where will the commercial payoffs (voice AI, regional apps, enterprise tools) arrive first?

Tune in.

Catch the latest episode of ‘The Morning Brief’ on ET Play, The Economic Times Online, Spotify, Apple Podcasts, JioSaavn, Amazon Music and YouTube.
