Sanjib Chowdhary came across StoryWeaver, a multilingual children’s story platform, while looking for books to read to his 7-year-old daughter. Chowdhary’s mother tongue is Kochila Tharu, which is spoken by about 250,000 people in eastern Nepal. (Nepali, the official language of Nepal, has 16 million speakers.) Languages with relatively small numbers of speakers, such as Kochila Tharu, have little digital content for their communities to use: no Google Translate, no subtitles in movies or TV, no online newspapers. In industry terms, these languages are “underserved” and “under-resourced.”
That’s where StoryWeaver comes in. Founded by Pratham Books, an Indian nonprofit educational organization, StoryWeaver currently hosts over 50,000 openly licensed stories in more than 300 languages from around the world. Users can browse the repository by reading level, language, and topic, and once they have selected a story, they can click through its illustrated plates in the chosen language, each resembling a page of the book. (There are also bilingual options showing two languages side by side, as well as options to download stories or read along with audio.) “Smile Please!”, a story about a deer wandering through the woods, is currently the most-read story: originally written in Hindi for beginners, it has been translated into 147 languages and read over 281,000 times.
Most of the languages represented on the platform come from Africa and Asia, and many are indigenous languages in danger of losing their speakers in an English-dominated world. Chowdhary’s experience as a father reflects this tension. “The problem with children is that they prefer to read storybooks in English rather than in their own language, because English is so accessible. With Kochila Tharu, the spelling is difficult, the words are difficult, and they see English all the time, in school, on TV, so they stay connected to English,” Chowdhary explains.
AI-assisted translation tools like those behind StoryWeaver can help more languages thrive online at once. But the technology is still new, and it relies on data that only speakers of these overlooked languages can provide. That raises questions about how the work of the native speakers powering these artificial intelligence tools will be valued, and about how repositories of language data will be commercialized.
To understand how AI-assisted translation tools work, it helps to look at what is happening in India: with 22 official languages and over 780 spoken languages, it is no coincidence that this multilingual country has become a center of innovation for language technology. At StoryWeaver’s core is a prediction technology inspired by a natural language processing tool developed by Microsoft Research India, called Interactive Neural Machine Translation (INMT).
Unlike most commercial AI-based translation tools, INMT does not remove the human intermediary entirely. Instead, it assists humans by offering suggestions in the language they are translating into. For example, if you start typing “It’s raining” in the target language, the model presents likely continuations, such as “tonight” or “heavily,” as options, based on the context and the preceding word or words. During translation, the tool weighs the meaning of the source language against what the target language allows, and then generates possibilities for the translator to choose from, explains Kalika Bali, a principal researcher at Microsoft Research and one of the architects of INMT.
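The interactive-prediction idea can be illustrated with a toy next-word suggester. This is only a sketch: the tiny corpus, the bigram counts, and the `suggest` function are illustrative stand-ins, since the real INMT system uses a neural model rather than word-frequency tables.

```python
from collections import Counter, defaultdict

# Toy corpus of target-language sentences (illustrative only; a real
# INMT model is trained on far larger parallel data).
corpus = [
    "it is raining tonight",
    "it is raining heavily",
    "it is raining tonight again",
    "it is cold tonight",
]

# Build bigram counts: for each word, count the words that follow it.
bigrams = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        bigrams[prev][nxt] += 1

def suggest(prefix: str, k: int = 3) -> list[str]:
    """Given the words typed so far, return up to k likely next words."""
    last_word = prefix.split()[-1]
    return [word for word, _ in bigrams[last_word].most_common(k)]

print(suggest("it is raining"))  # continuations seen after "raining"
```

As the translator accepts or types each word, the suggestions are recomputed from the new prefix, which is the basic interaction loop the article describes.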
Tools like INMT let StoryWeaver’s volunteers quickly generate translations of existing stories. The user interface is easy to master, even for novice translators, many of whom, like Chowdhary, volunteer for or work at early-childhood nonprofits. That is the case for Churki Hansda, who is employed by the Suchana Uttor Chandipur Community Society, one of StoryWeaver’s many partner organizations around the world. She brings a knowledge of Kora and Santali, two underserved indigenous languages spoken in eastern India. “When we were little, we didn’t have storybooks. Our textbooks were in Bengali [the dominant regional language], and we ended up memorizing everything because we couldn’t understand what we were reading,” Hansda tells me. “It’s nice to be able to make books in our own languages for our kids.”
Aamna Singh, director of content and partnerships at Pratham Books, estimates that 58% of the languages represented on StoryWeaver are being lost, a status quo with implications for early-childhood education. But the effort to lift underserved language communities out of invisibility is also closely tied to their potential as consumers, and AI-powered translation technology is an important part of that shift. Voice recognition tools and chatbots in Indian regional languages aim to appeal to customers outside the big metropolitan cities, a market expected to grow as mobile data access becomes even cheaper.
The quality of these tools depends on the data they are trained on, and sourcing that data is a major challenge. To keep up with multilingualism on the Internet, machine translation models need large amounts of parallel training data: the same text in two languages. Parliamentary records and media publications are common sources of public data for building these tools. According to Bali, the Microsoft researcher, however, these two sources are too specialized; they do not cover a wide enough range of topics and vocabulary to be representative of human speech. (That is also why StoryWeaver isn’t a good source of training data: the sentences in children’s books are much simpler, and the content only goes up to a fourth-grade reading level.)
Technical requirements aside, data work is often performed in invisible, poorly paid, and unregulated conditions. There is growing concern about how much we owe the human workers who assemble data sets behind the scenes to train artificial intelligence systems. Known as crowd workers, these people perform repetitive, systematic tasks ranging from tagging images of trees and pedestrians for self-driving cars to flagging signs of disease in medical scans.
Such monotonous “ghost work” takes on an emotional dimension in the context of language preservation. Language data workers who contribute to machine translation models are so driven by the idea of linguistic dignity on the Internet that questions of fair pay and data governance get sidelined by debates about the work’s cultural importance.
And that cultural value is, after all, enormous: Sanjib Chowdhary’s daughter understands Kochila Tharu better than she did a few years ago, and Chowdhary’s involvement with StoryWeaver has only grown since. Over the past year and a half, he and two friends have worked to generate the Kochila Tharu equivalents of about 40,000 English words. But he was paid only $243 for the project: less than a cent per English word, split three ways. And according to Microsoft’s Bali, models need around 100,000 paired sentences to generate acceptable translations.
Despite the repetitive, low-paying nature of the work, Chowdhary doesn’t see himself as a crowd worker, but as a linguistic curator. “We have many homophones in Kochila Tharu that do not exist in English. For example, the names of different fishes… We have many words for fish, fishing gear, and fish-based dishes that cannot be found in other languages,” he explains. “If our language dies, we will lose them. I want to collect these words before they disappear.”
The hope of a future in which marginalized linguistic identities can flourish on the Internet is a powerful incentive for people like Chowdhary and Hansda. Hansda’s stint with StoryWeaver led to a paid opportunity at AI4Bharat (Artificial Intelligence for India), an initiative of the Indian Institute of Technology in Chennai that collects tagged data pairs for English and 12 Indian languages. Over 18 months, Hansda will add 100,000 sentences in Santali to the AI4Bharat dataset, spanning indigenous oral history, folklore, literature, phrases, and words. She is paid $1.66 an hour for this work as a “language specialist.”
To be truly innovative and responsible, AI-assisted language research must ensure that native speakers and their communities not only contribute data but also help determine how it is used. For now, AI4Bharat aims to “bring parity with respect to English for Indian languages in the environment of Artificial Intelligence technologies through open source contributions,” which assumes that openness will automatically lead to inclusion. In practice, however, there are almost no clear guidelines preventing companies that develop AI technologies from exploiting datasets collected and curated by non-commercial research entities such as universities or nonprofit organizations.
AI4Bharat, for example, classifies its dataset as open source, which means Hansda’s contributions could be commercialized for profit in the future. There is precedent: Make-A-Video, Meta’s AI video-generation tool, which was announced last fall but has not yet been released to the public, was trained on datasets gathered from publicly available video clips on YouTube and Shutterstock. Technologist Andy Baio calls this practice “AI data laundering,” writing that “outsourcing the heavy lifting of data collection and model training to non-commercial entities allows companies to avoid accountability and potential legal liability.”
For now, the drive toward language inclusion, whether motivated by commercial gain, social impact, technological innovation, or a mix of all three, is exciting to speakers of minority languages. Hansda hopes the day will come when her grandchildren can live their lives online in Santali. “They’ll say: ‘This was made by our grandmothers,’” she says.
This article is published in collaboration with Letras Libres and Future Tense, a project of Slate, New America, and Arizona State University.