This Article is From Sep 06, 2018

Microsoft Releases Speech Dataset For Three Indian Languages To Aid Researchers

The largest publicly available Indian language speech data for use in research and building models

NDTV Education Team
Education
Sep 06, 2018 21:08 pm IST

Bengaluru:

Microsoft India today announced the availability of the Microsoft Indian Language Speech Corpus, offering speech training and test data for Telugu, Tamil and Gujarati. In a statement, the software giant's Indian arm called it the largest publicly available Indian language speech dataset, which includes audio and corresponding transcripts.

Microsoft India also said the dataset is aimed at helping researchers and academia build Indian language speech recognition for all applications where speech is used.

The Indian Language Speech Corpus is provided through the Microsoft Research Open Data initiative, a collection of free datasets from Microsoft Research to advance state-of-the-art research in areas such as natural language processing, computer vision, and domain-specific sciences, the statement said.

According to Microsoft India, many vernacular languages across the world lack adequate digital text, speech and linguistic resources, which are imperative for building large machine learning models. Microsoft, the statement said, is working to address this lack of data and catalyze the development of machine learning-based models that can help in building systems for low-resource languages, thus supporting the ecosystem of researchers, academia and tech companies working on Indian language models and helping to meet the needs of Indian users.

"The launch of Microsoft Indian Language Speech Corpus is a part of this effort," the statement added.

"Using our technology expertise, we want to accelerate innovation in voice based computing for India by supporting researchers and academia," said Sundar Srinivasan, General Manager, Artificial Intelligence & Research, Microsoft India.

Microsoft's Indian Language Speech Corpus was tested at Interspeech 2018, the world's largest and most comprehensive conference on the science and technology of spoken language processing.

In a Low Resource Speech Recognition Challenge, participants used data from the Microsoft Indian Language Speech Corpus to build Automatic Speech Recognition (ASR) systems. They were able to create high-quality speech recognition models using this data, thus validating the efficacy of the corpus.

Microsoft India, Microsoft Indian Language Speech Corpus