AI's Dark Data Problem In India

If data is food for AI, feeding it junk will only lead to bad outcomes.

LLM startups in India are facing roadblocks for high-quality regional language data (AI Generated Image)

From just-born AI unicorns to LLM (Large Language Model) upstarts, there's a mindless rush to get AI models into the market as fast as possible. In this race, some startups, and even bigger companies, are cutting corners by buying illegal data from dark marketplaces. They want to train their AI models quickly and launch them faster, overlooking the reliability and legality of that data.

With its massive population and an ocean of data, India is becoming a hub for dark marketplaces of training data.

One startup employee told NDTV AI that no one in their company knows where the founders got the data to train their algorithm.

"We don't dare ask because the deadlines are crazy. The founder wants models ready within weeks," he said, requesting anonymity.

An AI researcher from one of the top IITs shared a similar concern: he and a group of researchers quit a startup after realizing how it was acquiring its training data. "The founder wanted an LLM in a few weeks!" he said, clearly disillusioned.

What AI Feeds On

Think of data as food for AI. Just like a student gets better at learning by reading more books and essays, AI models improve as they are fed more data. The more information they consume, the more accurate and human-like they become in their responses. In the same way that a well-read person is more knowledgeable, an AI trained on vast amounts of diverse data can perform better and understand more nuances.

But if that data is flawed, the AI won't learn properly, just like a student reading bad material. In India, where data is often sourced from dark, unregulated marketplaces, these shortcuts can lead to unreliable AI systems full of biases.

How Bad Data Hurts

As Umakant Soni of AIfoundry says, "With LLMs, we're seeing AI develop general intelligence, but unlike humans, who don't scale their biases easily, AI will scale biases exponentially."

He draws a powerful analogy: "Training foundational models is like sending humans to schools and colleges. We regulate curriculums for humans to ensure education is sound, and we need a similar approach for AI: a societal-regulated, open-source dataset marketplace. Otherwise, we're setting ourselves up for dystopian outcomes."

And it's not just an Indian problem. Across the globe, AI companies are facing backlash over how they source training data. In the U.S. and Europe, companies like OpenAI, Google, and Meta have used vast amounts of data from sources like Common Crawl, an open repository of scraped web data, to train their models.

Back in India, the demand for training data is fuelled by the rise of LLMs. For instance, a basic AI prototype needs at least 1 million data points, each costing up to INR 6,000. But no one knows how reliable that data is, and that's the problem.

As Jibu Elias, an AI ethicist and researcher, rightly points out, "Open-source models like LLaMA (Large Language Model Meta AI) have democratized access to powerful AI technologies." The flip side of this accessibility is a spike in the demand for diverse datasets. India's linguistic diversity adds to the challenge: most datasets don't even cover regional languages, pushing many developers to unregulated markets.

LLM startups in India are facing the same roadblocks. They're hungry for high-quality data in regional languages, but it's hard to come by. Elias sums it up: "Despite India's connectivity and diversity, machine-usable data, particularly in Indian languages, remains limited." This shortage reinforces biases in AI models, many of which still need to rely heavily on English.

But it gets worse.

"These platforms operate in legal gray areas, bypassing consent and privacy norms," Elias explains. The risk is even greater when the data involves sensitive information like healthcare records or financial details. And when that data is incomplete or flawed, it leads to unreliable AI models in critical areas like healthcare and finance.

Even more concerning is the damage this is doing to public trust. "The societal impact could include an erosion of public trust in AI technologies," Elias warns, "stifling innovation because of biased, unreliable solutions."

How Can India Shape Responsible AI?

India must establish strong regulations and ethical frameworks to shut down this shady data trade. Globally, companies are already facing harsh legal consequences. Europe's GDPR (General Data Protection Regulation) is a perfect example of how regulation can keep AI data usage in check.

Elias argues, "Stringent regulatory oversight and ethical frameworks are essential." There are better ways to source data, like community-driven data gathering or AI-assisted translation for oral languages. These solutions will help ensure AI systems are built on solid foundations.

Because if data is food for AI, feeding it junk will only lead to bad outcomes.

(Pankaj Mishra is the Editor, AI at NDTV. He shapes stories and conversations that make sense of AI in India and how it influences people's lives and work)
