• Millions of people worldwide routinely interact online in languages that are not their native tongue.
• This has created a pressing need for artificial-intelligence (AI) systems that can translate both written and spoken language.
• However, many existing models handle only text, or use text as an intermediate step when translating speech, and many cover just a narrow slice of the world’s languages.
• Writing in Nature, the SEAMLESS Communication Team1 tackles these problems with key technologies that could make fast, universal translation a reality.
The SEAMLESS authors built an AI model that uses a neural-network approach to translate directly between roughly 100 languages (Fig. 1a). The model accepts text or speech input in any of these languages and translates it into text, and it can also translate speech directly into speech for 36 languages. The speech-to-speech translation is particularly remarkable because it is ‘end-to-end’: the model can, for example, translate spoken English directly into spoken German, without first transcribing the English speech into text and translating that text into German (Fig. 1b).
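To make the contrast with cascaded systems concrete, the sketch below shows what end-to-end speech-to-speech translation looks like in code. It is a minimal example that assumes the publicly released SeamlessM4T checkpoint and the Hugging Face transformers interface available at the time of the model’s release; the checkpoint name and exact signatures should be checked against current documentation.

```python
# A minimal sketch, assuming the publicly released SeamlessM4T checkpoint on
# Hugging Face and the `transformers` API at the time of writing; names and
# signatures may differ in current releases.
import torchaudio
from transformers import AutoProcessor, SeamlessM4TModel

processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-medium")
model = SeamlessM4TModel.from_pretrained("facebook/hf-seamless-m4t-medium")

# Load spoken English (the model expects 16-kHz audio); note that no English
# or German transcript is ever produced along the way.
waveform, sample_rate = torchaudio.load("english_speech.wav")
inputs = processor(audios=waveform.squeeze(), sampling_rate=sample_rate,
                   return_tensors="pt")

# tgt_lang="deu" requests spoken German; generate() returns the raw waveform.
german_speech = model.generate(**inputs, tgt_lang="deu")[0]
torchaudio.save("german_speech.wav", german_speech.cpu(), 16000)
```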
To train the AI model, the researchers used techniques called self-supervised and semi-supervised learning. These approaches allow a model to learn from huge volumes of raw data, such as text, audio and video, without requiring human annotators to label the data with tags or categories that supply context. Such labels might include, for example, accurate transcripts or translations.
The part of the model responsible for speech translation was pre-trained on a data set comprising 4.5 million hours of multilingual spoken audio. Training of this kind helps the model to learn the patterns in data, which makes it easier to fine-tune the model for specific tasks without needing large amounts of bespoke training data.
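As an illustration of the principle, the toy example below trains a tiny encoder with a masked-prediction objective, the flavour of self-supervised learning typically used for speech encoders: hide parts of the unlabelled input and ask the model to reconstruct them. Everything here (the architecture, feature sizes and loss) is invented for the sketch and is far simpler than the real system.

```python
# A toy illustration of self-supervised pretraining via masked prediction;
# the real speech encoder and its objective are far more elaborate.
import torch
import torch.nn as nn

features = torch.randn(8, 100, 80)      # a batch of unlabelled audio features
mask = torch.rand(8, 100) < 0.15        # hide ~15% of the time steps

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=80, nhead=8, batch_first=True),
    num_layers=2,
)
masked = features.masked_fill(mask.unsqueeze(-1), 0.0)  # zero out hidden frames
predicted = encoder(masked)

# The loss scores reconstruction of the hidden frames only, so the
# supervision comes from the raw audio itself, not from human labels.
loss = nn.functional.mse_loss(predicted[mask], features[mask])
loss.backward()
```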
One of the SEAMLESS team’s smartest strategies was to ‘mine’ the Internet for training pairs that align across languages, such as audio clips in one language that match subtitles in another. Starting from data known to be reliable, the authors trained the model to recognize when two pieces of content (a video clip and a subtitle, say) actually match in meaning. Applying this technique to swathes of Internet data, they collected about 443,000 hours of audio with matching text, and aligned roughly 30,000 hours of speech pairs with which to further train their model.
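The sketch below illustrates the mining step with made-up numbers: audio clips and candidate subtitles are encoded into one shared multilingual embedding space (the authors built dedicated encoders for this purpose), and only pairs whose similarity clears a calibrated threshold are kept as training data.

```python
# A toy sketch of embedding-based mining; the embeddings here are random
# stand-ins for the output of real multilingual audio and text encoders.
import numpy as np

rng = np.random.default_rng(0)
audio_emb = rng.normal(size=(1000, 512))   # stand-ins for encoded audio clips
text_emb = rng.normal(size=(5000, 512))    # stand-ins for encoded subtitles

# Normalize so that the dot product is cosine similarity.
audio_emb /= np.linalg.norm(audio_emb, axis=1, keepdims=True)
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)

scores = audio_emb @ text_emb.T            # similarity of every audio/text pair
best = scores.argmax(axis=1)               # most similar subtitle per clip

# Keep only pairs that plausibly match in meaning; in practice the threshold
# would be calibrated on data known to be reliable.
threshold = 0.9
pairs = [(i, j) for i, j in enumerate(best) if scores[i, j] > threshold]
```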
In spite of these advances, I would argue that the most laudable aspect of this work is not the idea or the method itself. Rather, it is the fact that all of the data and code needed to run and optimize the technology are publicly available, although the model itself is licensed for non-commercial use only. The authors describe their translation model as ‘foundational’ (see go.nature.com/3teaxvx), meaning that it can be fine-tuned on carefully curated data sets for specific purposes, such as improving translation quality for certain language pairs or for technical jargon.
Meta has become one of the most prominent champions of open-source language technology. Its research team was instrumental in developing PyTorch, a software library for training AI models that is used widely by companies such as OpenAI and Tesla, as well as by many researchers around the world. The model presented here adds to Meta’s family of foundational language-technology models, which includes the Llama family of large language models2 that underpins applications similar to ChatGPT. This openness is a huge benefit for researchers who lack the massive computational resources needed to build these models from scratch.
As exciting as this technology is, many challenges remain. The SEAMLESS model’s ability to translate up to 100 languages is impressive, but around 7,000 languages are spoken worldwide. The tool also struggles in many situations that humans handle with relative ease, such as conversations in noisy places or between people with strong accents. Nevertheless, the authors’ methods for harnessing real-world data chart a promising path towards speech technology worthy of science fiction.
The problems associated with existing speech technologies are well documented. Transcription tends to be less accurate for English dialects considered non-‘standard’, such as African American English, than for more widely used varieties3. Translation to and from a language is of poor quality if that language is under-represented in the data used to train the model, which affects any language that appears relatively rarely on the Internet, from Afrikaans to Zulu4.
Some transcription models have even been found to ‘hallucinate’5, producing whole phrases that were never spoken in the audio input, and this happens more often for speakers with speech impairments than for those without (Fig. 1c). Machine-generated errors of this kind could cause real harm, such as a medication being prescribed incorrectly or a person being wrongly accused in a legal setting. And the harms disproportionately affect marginalized communities, whose speech is more likely to be misinterpreted.
The SEAMLESS researchers quantified the toxicity associated with their model (the extent to which its translations introduce harmful or offensive language)6. This is a step forward, providing a baseline against which future models can be compared. However, given that the performance of existing models varies considerably between languages, extra care is needed to ensure that a model can faithfully translate or transcribe specific terms in specific languages. Such efforts should proceed in parallel with those of computer-vision researchers, who are working to improve the poor performance of image-recognition models for under-represented groups and to stop the models from making offensive predictions7.
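A check of this kind can be expressed in a few lines. In the sketch below, `toxicity_score` is a hypothetical stand-in for a real toxicity classifier; the idea is to flag ‘added toxicity’, in which a translation is markedly more toxic than its source.

```python
# A minimal sketch of an "added toxicity" check; `toxicity_score` is a
# hypothetical stand-in for a real toxicity classifier.
def toxicity_score(text: str) -> float:
    """Hypothetical: return a toxicity probability in [0, 1]."""
    return float(any(w in text.lower() for w in ["insult", "slur"]))

def added_toxicity(source: str, translation: str, margin: float = 0.5) -> bool:
    # Flag translations that are markedly more toxic than their source:
    # those are cases in which the model itself introduced offensive wording.
    return toxicity_score(translation) - toxicity_score(source) > margin

print(added_toxicity("You are a teacher.", "You are an insult."))  # True
```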
The authors also examined gender bias in the translations produced by their model. Their analysis assessed whether the model over-represents one gender when translating gender-neutral phrases into gendered languages: does “I am a teacher” in English become the masculine “Soy profesor” or the feminine “Soy profesora” in Spanish? But analyses of this type are restricted to languages that have strictly masculine or feminine forms, and future evaluations should broaden the scope of the linguistic biases considered8.
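The sketch below shows what such a probe might look like, with `translate` as a hypothetical stand-in for the model under test: feed in gender-neutral English sentences and count how often the Spanish output takes the masculine or the feminine form.

```python
# A minimal sketch of a gender-bias probe; `translate` is a hypothetical
# stand-in for the translation model under test.
def translate(text: str, tgt_lang: str) -> str:
    """Hypothetical model that always guesses the masculine form."""
    return "Soy profesor."

neutral_sources = ["I am a teacher.", "I am a doctor.", "I am a lawyer."]
masculine = ("profesor.", "doctor.", "abogado.")
feminine = ("profesora.", "doctora.", "abogada.")

masc = sum(translate(s, "spa").endswith(masculine) for s in neutral_sources)
fem = sum(translate(s, "spa").endswith(feminine) for s in neutral_sources)

# A heavy skew towards one form on gender-neutral inputs signals bias.
print(f"masculine: {masc}/{len(neutral_sources)}, "
      f"feminine: {fem}/{len(neutral_sources)}")
```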
Looking ahead, design thinking will be needed to ensure that users can appropriately contextualize the translations these models produce, which will vary in quality. Beyond the toxicity labels explored by the SEAMLESS authors, developers should consider how to present translations in ways that communicate a model’s limitations, indicating, for instance, when an output involves the model simply guessing a gender. That could mean withholding an output altogether when its accuracy is in doubt, or accompanying low-quality outputs with written caveats or visual signals9. Perhaps most important, users should be able to opt out of using speech technologies, in medical or legal settings for example, should they wish to.
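One way to implement the ‘withhold or flag’ behaviour described above is a simple gate on an estimated quality score. In this sketch, the quality score is assumed to come from a separate quality-estimation model, which is not shown.

```python
# A minimal sketch of confidence gating; `quality` is assumed to come from a
# separate quality-estimation model (not shown).
from typing import Optional

def present(translation: str, quality: float) -> Optional[str]:
    if quality < 0.3:
        return None                      # withhold: too unreliable to show
    if quality < 0.7:
        return "[low-confidence translation] " + translation  # flag with caveat
    return translation                   # show as-is

print(present("Soy profesora.", 0.55))   # "[low-confidence translation] Soy profesora."
```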
Although speech technologies can transcribe and translate more cheaply and quickly than humans can (and humans are also prone to bias and error10), it is crucial to understand the ways in which these technologies fail, and for whom they fail disproportionately. Future work must ensure both that speech-technology researchers close these performance gaps and that users are made aware of the potential benefits and harms associated with these models.