Exploring Speech to Speech Translation with SeamlessM4T v2


Speech-to-Speech (S2S) Translation is a rapidly evolving technology designed to convert speech from one language to another while maintaining the natural flow and intent of the original message. This process goes beyond mere translation; it involves retaining the speaker's tone, prosody, and sometimes even their vocal characteristics.

A Common Problem: Dubbing Challenges in Movies

For instance, imagine watching a Marvel movie dubbed in another language. You might find that:

  1. Humour doesn’t translate well: Cultural references and jokes may lose their punch when converted into a different language.
  2. Voice mismatch: The dubbed voice often doesn't match the original character’s tone or personality, causing a disconnect in the viewing experience.

These issues illustrate a core challenge in speech translation—maintaining the naturalness of the original performance.

Enter SeamlessM4T and SeamlessExpressive

Meta’s SeamlessM4T v2 comes to the rescue, addressing such challenges with SeamlessExpressive. This model is capable of translating speech while preserving vocal styles and prosody. In other words, it can keep the original speaker's rhythm, pauses, and tone, making the translation more natural and emotionally resonant. Imagine watching that same Marvel movie but in another language, where the translated dialogue is delivered in a voice that sounds almost identical to the original actor, maintaining the same energy and humor.

Including a GAN-based vocoder adds another layer of authenticity by ensuring that even subtle voice characteristics are preserved in the dubbed version, so the character's voice sounds true to the original portrayal even in translation.

Real-Time Streaming with SeamlessStreaming

In addition to expressiveness, Meta's SeamlessStreaming technology enables low-latency speech-to-speech translation, meaning the system can start translating before the speaker has finished. This feature is groundbreaking for live events like conferences or conversations, making translation more natural and fluid.

How does Seamless work?

The SeamlessM4T v2 model is designed for speech-to-speech translation, with several components that handle different aspects of the task.

Transformer Text Encoder and Transformer Text Decoder

The Transformer Text Encoder and Decoder form the backbone of the text processing pipeline in the SeamlessM4T v2 model. These components are based on the NLLB (No Language Left Behind) architecture, designed to handle multiple languages effectively.

The encoder processes the input text (x_text) by tokenizing it into subword units. It then applies a series of self-attention and feed-forward layers to capture the linguistic information in the input text. This process transforms the input into a high-dimensional representation that encodes local and global context.

On the other hand, the decoder generates the target text (y_text) based on the encoded linguistic information. It uses a similar architecture to the encoder but includes cross-attention layers that allow it to attend to the encoded input while generating the output. The decoder operates autoregressively, generating one token at a time while considering the previously generated tokens.

These components work in tandem with the speech processing modules to enable seamless translation between text and speech in multiple languages. The encoded representations from the text encoder can be passed to the Length Adaptor and subsequent components for speech synthesis, while the text decoder can generate translations based on encoded speech input from the Conformer Speech Encoder.
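
To make the encoder-decoder interaction concrete, here is a minimal PyTorch sketch of a text encoder and an autoregressive decoder with cross-attention. The sizes, the TinyTextEncoderDecoder name, and the omission of positional embeddings are illustrative simplifications, not the actual NLLB/SeamlessM4T v2 implementation.

import torch
import torch.nn as nn

VOCAB_SIZE, D_MODEL, N_HEADS, N_LAYERS = 32000, 512, 8, 6   # illustrative sizes only

class TinyTextEncoderDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.src_embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        self.tgt_embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D_MODEL, N_HEADS, batch_first=True), N_LAYERS)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(D_MODEL, N_HEADS, batch_first=True), N_LAYERS)
        self.lm_head = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, src_tokens, tgt_tokens):
        memory = self.encoder(self.src_embed(src_tokens))        # self-attention over x_text
        tgt_len = tgt_tokens.size(1)
        causal_mask = torch.triu(                                # autoregressive (causal) masking
            torch.full((tgt_len, tgt_len), float("-inf")), diagonal=1)
        hidden = self.decoder(self.tgt_embed(tgt_tokens), memory,
                              tgt_mask=causal_mask)              # cross-attention to the encoder output
        return self.lm_head(hidden)                              # logits over the y_text vocabulary

model = TinyTextEncoderDecoder()
src = torch.randint(0, VOCAB_SIZE, (1, 12))                      # source subword IDs
tgt_prefix = torch.randint(0, VOCAB_SIZE, (1, 9))                # previously generated target tokens
print(model(src, tgt_prefix).shape)                              # torch.Size([1, 9, 32000])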

Mel-Filterbanks Extractor and Conformer Speech Encoder (w2v-BERT 2.0)

The Mel-Filterbanks Extractor is the first step in processing raw speech input (x_speech). It converts the time-domain audio signal into a time-frequency representation using a short-time Fourier transform (STFT). The resulting spectrogram is then passed through a bank of 80 mel-scaled filters, which approximate the human auditory system's frequency response. This process results in mel-filterbank features that capture the essential characteristics of speech while reducing the dimensionality of the input.
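
As a rough illustration, the snippet below extracts 80-dimensional log mel-filterbank features with torchaudio. The 25 ms window and 10 ms hop are common defaults and an assumption here, not necessarily the exact configuration used by SeamlessM4T v2.

import torch
import torchaudio

sample_rate = 16_000
waveform = torch.randn(1, sample_rate * 2)           # stand-in for 2 seconds of 16 kHz speech (x_speech)
mel_extractor = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate,
    n_fft=400,                                       # 25 ms analysis window
    hop_length=160,                                  # 10 ms frame shift
    n_mels=80,                                       # 80 mel-scaled filters
)
log_mel = torch.log(mel_extractor(waveform) + 1e-6)  # log compression of the mel energies
print(log_mel.shape)                                 # (1, 80, num_frames), here (1, 80, 201)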

The Conformer Speech Encoder, based on the w2v-BERT 2.0 architecture, takes these mel-filterbank features as input and processes them to extract high-level speech representations. The Conformer architecture combines the strengths of Transformers and Convolutional Neural Networks (CNNs): the self-attention mechanism captures long-range dependencies, while the convolution layers model local features effectively.

The encoder consists of multiple Conformer blocks, each containing a multi-head self-attention layer, a convolution module, and a feed-forward network. The w2v-BERT 2.0 pre-training approach involves both contrastive learning and masked language modeling tasks on large speech datasets, which helps the model learn robust speech representations across different languages and accents.

The output of the Conformer Speech Encoder is a sequence of feature vectors that capture the content of the speech input. These features are then passed to the Length Adaptor for further processing, enabling the model to perform tasks such as speech-to-text translation or speech-to-speech translation.
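
For intuition, here is a heavily simplified Conformer block in PyTorch, combining half-step feed-forward layers, multi-head self-attention, and a depthwise convolution module. The layer sizes and the ConformerBlockSketch name are assumptions for illustration; this is not the w2v-BERT 2.0 implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConformerBlockSketch(nn.Module):
    def __init__(self, dim=512, heads=8, kernel_size=31):
        super().__init__()
        self.ffn1 = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                  nn.SiLU(), nn.Linear(4 * dim, dim))
        self.norm_attn = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_conv = nn.LayerNorm(dim)
        self.pointwise_in = nn.Conv1d(dim, 2 * dim, kernel_size=1)       # feeds a GLU gate
        self.depthwise = nn.Conv1d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)
        self.pointwise_out = nn.Conv1d(dim, dim, kernel_size=1)
        self.ffn2 = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                  nn.SiLU(), nn.Linear(4 * dim, dim))
        self.final_norm = nn.LayerNorm(dim)

    def forward(self, x):                                   # x: (batch, frames, dim)
        x = x + 0.5 * self.ffn1(x)                          # first half-step feed-forward
        a = self.norm_attn(x)
        attn_out, _ = self.attn(a, a, a)                    # long-range context via self-attention
        x = x + attn_out
        c = self.norm_conv(x).transpose(1, 2)               # (batch, dim, frames) for Conv1d
        values, gates = self.pointwise_in(c).chunk(2, dim=1)
        c = self.depthwise(values * torch.sigmoid(gates))   # GLU, then local patterns via depthwise conv
        x = x + self.pointwise_out(F.silu(c)).transpose(1, 2)
        x = x + 0.5 * self.ffn2(x)                          # second half-step feed-forward
        return self.final_norm(x)

block = ConformerBlockSketch()
speech_frames = torch.randn(1, 200, 512)                    # e.g. projected mel-filterbank frames
print(block(speech_frames).shape)                           # torch.Size([1, 200, 512])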

Length Adaptor

The Length Adaptor is a crucial component in the SeamlessM4T v2 architecture, serving as a bridge between the speech and text processing pipelines. Its primary function is to adjust the length of the feature sequences extracted by the Conformer Speech Encoder to match the length of the text sequences processed by the Transformer Text Decoder.

This adaptation is necessary because speech and text typically have different sequence lengths. Speech features are often much longer, with multiple frames corresponding to a single character or subword in text. The Length Adaptor uses a combination of convolutional layers and downsampling techniques to reduce the length of the speech features while preserving the essential information.

The adapted features maintain temporal alignment with the original speech input while having a length more suitable for text-based processing. This alignment is crucial for tasks like speech-to-text translation, where the model needs to map speech segments to corresponding text elements accurately.

Moreover, the Length Adaptor enables bidirectional processing between speech and text modalities. When translating from speech to text, it compresses the speech features. Conversely, when generating speech from text, it works in conjunction with the Unit Duration Predictor to expand the text representations to match the required speech length.

By ensuring proper sequence alignment, the Length Adaptor significantly improves the model's ability to learn the relationships between speech and text, leading to more accurate translations and higher-quality speech synthesis.
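
A minimal sketch of such downsampling with strided 1-D convolutions is shown below; the number of layers and the stride are assumptions for illustration, not the exact SeamlessM4T v2 configuration.

import torch
import torch.nn as nn

class LengthAdaptorSketch(nn.Module):
    def __init__(self, dim=512, num_layers=2, stride=2):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Conv1d(dim, dim, kernel_size=3, stride=stride, padding=1)
             for _ in range(num_layers)])

    def forward(self, speech_features):                 # (batch, frames, dim)
        x = speech_features.transpose(1, 2)             # Conv1d expects (batch, dim, frames)
        for conv in self.layers:
            x = torch.relu(conv(x))                     # each strided layer halves the time axis
        return x.transpose(1, 2)                        # (batch, frames / stride**num_layers, dim)

adaptor = LengthAdaptorSketch()
speech_feats = torch.randn(1, 200, 512)                 # 200 speech-encoder frames
print(adaptor(speech_feats).shape)                      # torch.Size([1, 50, 512])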

Subword-Length T2U Encoder and Subword-to-Character Upsampler

The Subword-Length T2U (Text-to-Unit) Encoder and Subword-to-Character Upsampler work together to handle the transition between different levels of linguistic representation, from text to speech units.

The Subword-Length T2U Encoder takes the output from either the Transformer Text Encoder (for text input) or the Length Adaptor (for speech input) and converts it into a sequence of subword units. Subwords are fragments of words that balance the flexibility of character-level processing with the semantic meaning captured by complete words. This approach is efficient for languages with rich morphology, where words can have multiple meaningful parts.

The encoder uses a learned mapping to convert the input representations into a fixed set of subword units. This process helps manage vocabulary size and handle out-of-vocabulary words, as new words can often be represented as combinations of known subwords.

Following the T2U Encoder, the Subword-to-Character Upsampler increases the granularity of the representation by expanding subwords into individual characters. This upsampling process is crucial for several reasons:

  1. It allows for finer control over the generated speech, as character-level representations can capture more nuanced pronunciations.
  2. It facilitates better alignment between the linguistic and speech units in the later stages of the model.
  3. It enables the model to handle languages with complex orthographies more effectively.

The upsampler typically uses a combination of learned embeddings and neural network layers to expand each subword into its constituent characters while maintaining the contextual information from the subword level.

These components play a vital role in preparing the linguistic representations for the subsequent stages of speech synthesis, ensuring that the model can generate accurate and natural-sounding speech across various languages and speaking styles.
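
The following toy example illustrates the basic idea of subword-to-character upsampling: each subword vector is repeated once per character it covers, giving downstream layers a character-level sequence to refine. The tensor shapes and example tokenization are hypothetical.

import torch

subword_reprs = torch.randn(3, 512)          # e.g. the subwords "trans", "lat", "ion"
chars_per_subword = torch.tensor([5, 3, 3])  # number of characters covered by each subword
char_reprs = torch.repeat_interleave(subword_reprs, chars_per_subword, dim=0)
print(char_reprs.shape)                      # torch.Size([11, 512]) -- one vector per character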

Unit Duration Predictor and Character-to-Unit Upsampler

The Unit Duration Predictor and Character-to-Unit Upsampler are essential components in the SeamlessM4T v2 model that bridge the gap between linguistic representations and speech units.

The Unit Duration Predictor is responsible for estimating the duration of each speech unit corresponding to the input characters or subwords. It takes the output from the Subword-to-Character Upsampler and predicts how long each unit should be when spoken. This prediction is crucial for generating natural-sounding speech with appropriate rhythm and pacing.

The predictor typically uses a neural network architecture, such as a feed-forward or recurrent network, to learn the relationship between linguistic features and unit durations. It considers factors like phoneme identity, stress patterns, and surrounding context to make accurate predictions. The output of this component is a sequence of duration values, one for each character or speech unit.

Working with the Unit Duration Predictor, the Character-to-Unit Upsampler takes the character-level representations and expands them into speech units based on the predicted durations. This process involves:

  1. Mapping characters to their corresponding speech units (phonemes, diphones, or other acoustic units).
  2. Replicating or interpolating the unit representations to match the predicted durations.
  3. Applying any necessary smoothing or adjustment to ensure continuous and natural transitions between units.

The upsampler's output is a sequence of speech units temporally aligned with the input text and ready for further processing by the NAR Unit Decoder. This careful expansion and alignment process is crucial for maintaining the timing and prosody of the generated speech.

Together, these components enable the model to generate speech that conveys the correct linguistic content and sounds natural in terms of rhythm, stress, and intonation. They form a critical link in the text-to-speech synthesis pipeline, translating abstract linguistic representations into concrete acoustic targets.
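
The sketch below illustrates this duration-based expansion under simplified assumptions: a small feed-forward network (the DurationPredictorSketch name is hypothetical) predicts one duration per character, and torch.repeat_interleave then expands each character representation to its predicted length.

import torch
import torch.nn as nn

class DurationPredictorSketch(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, char_reprs):                        # char_reprs: (num_chars, dim)
        log_durations = self.net(char_reprs).squeeze(-1)  # predict durations in log space
        return torch.clamp(torch.round(torch.exp(log_durations)), min=1).long()

char_reprs = torch.randn(11, 512)                  # e.g. output of the subword-to-character upsampler
durations = DurationPredictorSketch()(char_reprs)  # one integer duration per character
unit_reprs = torch.repeat_interleave(char_reprs, durations, dim=0)  # expand to unit length
print(durations.sum().item(), unit_reprs.shape)    # total units == length of the expanded sequence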

NAR Unit Decoder and HiFi-GAN Unit-Vocoder

The NAR (Non-Autoregressive) Unit Decoder and HiFi-GAN Unit-Vocoder are the final stages in the speech synthesis pipeline of the SeamlessM4T v2 model, responsible for generating the actual speech waveform.

The NAR Unit Decoder takes the upsampled and duration-adjusted speech units from the previous stages and processes them to predict the final sequence of acoustic units. Unlike autoregressive models that generate outputs sequentially, the NAR approach allows for parallel processing of the entire sequence, significantly speeding up the generation process.

The decoder typically consists of a stack of Transformer layers with masked self-attention mechanisms. It refines the input representations, considering the global context of the entire sequence to produce more coherent and natural-sounding speech units. The non-autoregressive nature of this component allows for faster inference times, making it suitable for real-time applications.

Once the NAR Unit Decoder has produced the final sequence of acoustic units, the HiFi-GAN Unit-Vocoder converts these units into a continuous audio waveform. HiFi-GAN (High-Fidelity Generative Adversarial Network) is a state-of-the-art neural vocoder known for generating high-quality speech with fine acoustic details.

The HiFi-GAN architecture consists of a generator and a discriminator:

  1. The generator takes the acoustic units as input and uses a series of transposed convolutions and residual blocks to upsample and refine the signal into a full audio waveform.
  2. The discriminator, trained adversarially, helps ensure that the generated speech sounds realistic and natural by providing feedback to the generator during training.

This vocoder can capture subtle speech characteristics, such as breathiness, pitch variations, and other nuances that contribute to the naturalness and expressiveness of the synthesized voice.

The NAR Unit Decoder and HiFi-GAN Unit-Vocoder combination enables the SeamlessM4T v2 model to produce high-fidelity, natural-sounding speech across multiple languages and speakers while maintaining the efficiency required for practical applications.
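
As a rough sketch of the vocoder side, the toy generator below upsamples unit-level features to a waveform with stacked transposed convolutions, the core idea behind HiFi-GAN's generator. The channel sizes, upsampling factors, and the TinyUnitVocoder name are assumptions; the real unit-vocoder also uses multi-receptive-field residual blocks and adversarial discriminators, omitted here for brevity.

import torch
import torch.nn as nn

class TinyUnitVocoder(nn.Module):
    def __init__(self, unit_dim=256, upsample_factors=(8, 8, 4)):
        super().__init__()
        layers, channels = [], unit_dim
        for factor in upsample_factors:
            layers += [nn.ConvTranspose1d(channels, channels // 2,
                                          kernel_size=2 * factor, stride=factor,
                                          padding=factor // 2),
                       nn.LeakyReLU(0.1)]
            channels //= 2
        layers.append(nn.Conv1d(channels, 1, kernel_size=7, padding=3))  # single waveform channel
        self.net = nn.Sequential(*layers, nn.Tanh())

    def forward(self, units):                    # units: (batch, unit_dim, num_units)
        return self.net(units)                   # waveform: (batch, 1, num_units * 256)

units = torch.randn(1, 256, 100)                 # 100 acoustic-unit embeddings
waveform = TinyUnitVocoder()(units)
print(waveform.shape)                            # torch.Size([1, 1, 25600])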

Aligner

The Aligner is a crucial component in the training process of the SeamlessM4T v2 model, although it is not directly involved in the inference pipeline. Its primary function is to provide supervision during training by ensuring that the generated speech units align correctly with the character-level representation of the original text.

The Aligner creates a mapping between the input text (or its character-level representation) and the corresponding speech units. This mapping is essential for several reasons:

  1. It helps the model learn the correct timing and duration for each linguistic unit in the spoken form.
  2. It enables the model to capture the relationship between written text and its pronunciation, which can vary significantly across languages.
  3. It provides a mechanism for the model to learn prosodic features such as stress and intonation patterns.

During training, the Aligner computes an alignment score or loss that measures how well the generated speech units match the expected alignment based on the input text. This alignment loss is then used as part of the overall training objective, encouraging the model to produce speech that sounds natural and accurately reflects the input text's content and structure.

The Aligner typically employs dynamic time warping (DTW) or attention-based mechanisms to compute the optimal alignment between the text and speech representations. It may also incorporate linguistic knowledge, such as phoneme dictionaries or language-specific rules, to improve the accuracy of the alignment.
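
To build intuition for the DTW option, here is a toy dynamic-programming alignment between character-level and unit-level feature sequences. It is a deliberately naive sketch (dense Python loops, Euclidean cost), not the aligner actually used to train SeamlessM4T v2.

import torch

def dtw_alignment_cost(char_feats, unit_feats):
    cost = torch.cdist(char_feats, unit_feats)        # pairwise distances: (num_chars, num_units)
    acc = torch.full_like(cost, float("inf"))
    acc[0, 0] = cost[0, 0]
    for i in range(cost.size(0)):
        for j in range(cost.size(1)):
            if i == 0 and j == 0:
                continue
            best_prev = min(
                acc[i - 1, j] if i > 0 else float("inf"),                 # advance the character axis
                acc[i, j - 1] if j > 0 else float("inf"),                 # advance the unit axis
                acc[i - 1, j - 1] if i > 0 and j > 0 else float("inf"),   # advance both
            )
            acc[i, j] = cost[i, j] + best_prev
    return acc[-1, -1]                                 # lower cost = better-aligned sequences

char_feats = torch.randn(11, 512)                      # character-level representations
unit_feats = torch.randn(40, 512)                      # generated speech-unit representations
print(dtw_alignment_cost(char_feats, unit_feats))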

By providing this detailed alignment information during training, the Aligner helps refine various components of the model:

  • It improves the Length Adaptor's ability to accurately match speech and text sequence lengths.
  • It enhances the Unit Duration Predictor's performance in estimating the correct duration for each speech unit.
  • It helps the NAR Unit Decoder learn to generate speech units that correspond closely to the input text.

While not used during inference, the Aligner's impact on the trained model is significant, contributing to the overall quality, accuracy, and naturalness of the generated speech across different languages and speaking styles.

Demonstration

%%capture
!pip install fairseq2
!pip install pydub sentencepiece
!pip install git+https://github.com/facebookresearch/seamless_communication.git

Set up seamless_communication, fairseq2 and some utilities.

import io
import json
import matplotlib as mpl
import matplotlib.pyplot as plt
import mmap
import numpy
import soundfile
import torchaudio
import torch

from collections import defaultdict
from IPython.display import Audio, display
from pathlib import Path
from pydub import AudioSegment

from seamless_communication.inference import Translator
from seamless_communication.streaming.dataloaders.s2tt import SileroVADSilenceRemover


SeamlessM4T Inference:

Initialize the models:

# Initialize a Translator object with a multitask model and vocoder on the GPU.

model_name = "seamlessM4T_v2_large"
vocoder_name = "vocoder_v2" if model_name == "seamlessM4T_v2_large" else "vocoder_36langs"

translator = Translator(
    model_name,
    vocoder_name,
    device=torch.device("cuda:0"),
    dtype=torch.float16,
)

S2ST Inference:

# README:  https://github.com/facebookresearch/seamless_communication/tree/main/src/seamless_communication/cli/m4t/predict
# Please use audios with duration under 20 seconds for optimal performance.

# Resample the audio in 16khz if sample rate is not 16khz already.
# torchaudio.functional.resample(audio, orig_freq=orig_freq, new_freq=16_000)

print("English audio:")
in_file = "/content/audio-defailt.wav"
display(Audio(in_file, rate=16000, autoplay=False, normalize=True))

tgt_langs = ("spa", "fra", "deu", "ita", "hin", "cmn")

for tgt_lang in tgt_langs:
  text_output, speech_output = translator.predict(
      input=in_file,
      task_str="s2st",
      tgt_lang=tgt_lang,
  )

  print(f"Translated text in {tgt_lang}: {text_output[0]}")
  print()

  out_file = f"/content/translated_LJ_{tgt_lang}.wav"

  torchaudio.save(out_file, speech_output.audio_wavs[0][0].to(torch.float32).cpu(), speech_output.sample_rate)

  print(f"Translated audio in {tgt_lang}:")
  audio_play = Audio(out_file, rate=speech_output.sample_rate, autoplay=False, normalize=True)
  display(audio_play)
  print()

English audio:

Translated text in spa: Oye, ¿cómo estás? Hace mucho tiempo.

Translated audio in spa:

Translated text in ita: Ehi, come va, e' da molto tempo?

Translated audio in ita:

Translated text in hin: अरे, आप कैसे कर रहे हैं. यह एक लंबे समय के लिए किया गया है?

Translated audio in hin:

Translated text in fra: Ça fait longtemps que tu te débrouilles ?

Translated audio in fra:

Translated text in deu: Hey, wie geht's dir? Es ist schon lange her?

Translated audio in deu:

Translated text in cmn: <unk>,你怎么样了?已经很久了.

Translated audio in cmn:

SeamlessExpressive Inference:

# Please follow instructions to download SeamlessExpressive here: https://ai.meta.com/resources/models-and-libraries/seamless-downloads/

!wget "<Check your mail to get the model weights>" -O /content/SeamlessExpressive.tar.gz

!tar -xzvf /content/SeamlessExpressive.tar.gz
in_file = '/content/audio-defailt.wav'
out_file = '/content/spa-default.wav'

!expressivity_predict {in_file} --tgt_lang spa \
    --model_name seamless_expressivity --vocoder_name vocoder_pretssel \
    --gated-model-dir SeamlessExpressive --output_path {out_file}
print('Input Audio: English')
audio_input = Audio(in_file, rate=16000, autoplay=False, normalize=True)
display(audio_input)

print('Output Audio: Spanish')
audio_output = Audio(out_file, rate=16000, autoplay=False, normalize=True)
display(audio_output)

Input Audio: English

audio-defailt.wav

Output Audio: Spanish

spa-audio-default.wav

The Impact of SeamlessExpressive

SeamlessExpressive allows for expressive speech-to-speech translations from and into multiple languages, including English, French, Spanish, and Mandarin. By focusing on preserving natural prosody and vocal style, the model enables a much more immersive and personalized experience.

Imagine using this technology in a global meeting. SeamlessExpressive would allow someone to present in their native language, which would be translated into your language while preserving their speaking style. This could significantly improve cross-cultural communication, especially when tone and expression are crucial—such as negotiations or sensitive conversations.

As the world becomes increasingly interconnected, even regional businesses are facing the need to communicate with diverse audiences. Effective communication is crucial, whether it's expanding customer bases, working with international suppliers, or offering tourist services. Speech-to-Speech (S2S) translation technologies, such as SeamlessM4T and SeamlessExpressive, offer a game-changing solution for regional businesses looking to break language barriers and expand their reach.

1. Enhancing Customer Experience

Imagine a local hotel in Paris catering to tourists from China, Brazil, and Germany. With SeamlessExpressive, the hotel staff could communicate with guests in real-time, translating conversations into the guests' preferred language while maintaining a natural tone and vocal style. This technology would make visitors feel more welcomed and understood, enhancing their overall experience.

For example:

  • Restaurant Orders: A restaurant in Italy could use S2S translation to take orders from foreign tourists, ensuring the nuances of their requests—such as dietary restrictions—are accurately understood and catered to, while keeping the friendly tone of the waiter.
  • Tour Guides: Regional tour guides can offer personalized tours in different languages without losing their unique style and expression, allowing tourists to enjoy the full experience of the tour guide's charisma and expertise, no matter their language.

2. Expanding into New Markets

For regional businesses looking to expand internationally, seamless communication is key. SeamlessM4T allows business owners and employees to hold multilingual conversations in real-time, from business negotiations to customer support. For instance, a family-owned manufacturer in Mexico looking to export products to French and Japanese markets can now use low-latency translations to communicate smoothly with new partners. This opens doors to international trade without needing a large multilingual staff.

3. Customer Support in Multiple Languages

Many regional businesses, especially those selling products or services online, receive queries from non-local customers. A small e-commerce company based in India might get support requests from customers in the US, Japan, or Spain. Using SeamlessExpressive, the business can offer multilingual customer support in real-time, where the personality and tone of the support agent’s voice are retained even in translation. This ensures that customer satisfaction remains high and the brand’s image remains consistent across languages.

For instance:

  • Tech Support: A regional tech company could use S2S translation to provide multilingual support, making it easier to resolve technical issues for international clients while preserving the supportive tone of the agent.

4. Overcoming Dubbing Issues for Local Content Creators

Regional businesses involved in content creation—such as local filmmakers or influencers—can benefit from SeamlessExpressive for creating multilingual content. Suppose a regional fitness trainer in Japan wants to produce workout videos for a global audience. Using SeamlessM4T, they can easily dub their videos into multiple languages while maintaining their original voice, energy, and enthusiasm. This gives them a global reach without hiring different voice actors or compromising on vocal expression.

5. Training and Development Across Borders

Companies that conduct training sessions for employees in different regions can use these S2S technologies to bridge the language gap. A regional business that provides skilled labor training could expand its reach by offering translated training sessions to international clients or remote workers. For instance, a farming cooperative in Brazil can train new employees from Spanish-speaking countries in real-time, ensuring the knowledge is accurately conveyed without waiting for subtitles or translations.

Conclusion

The advent of advanced Speech-to-Speech (S2S) translation technologies like SeamlessM4T and SeamlessExpressive represents a significant leap forward in breaking down language barriers for businesses and individuals alike. These innovations offer unprecedented opportunities for regional companies to expand their reach, enhance customer experiences, and compete globally.

By preserving the nuances of speech, including tone, prosody, and vocal characteristics, these technologies enable more natural and effective communication across languages. This is particularly valuable when emotional context and personal expression are crucial, such as in customer service, negotiations, or content creation.

As these technologies evolve and improve, we can anticipate even more seamless multilingual interactions. This could lead to a more interconnected global marketplace where language differences no longer hinder business growth or cultural exchange. For regional businesses, embracing these tools could be the key to unlocking new markets, improving customer satisfaction, and fostering international collaborations.

While challenges such as accuracy and ethical considerations in AI-driven translations remain, the potential benefits of S2S translation technologies are immense. As we progress, it will be exciting to see how businesses and individuals leverage these tools to create a more inclusive and connected world where language is no longer a barrier but a bridge to new opportunities.

Ready to Transform Your Global Communications?

Curious about integrating SeamlessM4T and SeamlessExpressive technology into your business? Book a consultation with us! You'll meet with Rohan Sawant and our AI Research team, who can help tailor a Speech-to-Speech translation solution that fits your organization's unique needs. Together, we'll explore how these advanced technologies can break down language barriers, enhance global communication, and drive your business growth.

Book an AI consultation

Behind the Blog 👀
Aditya Narayan Patro
Writer

Pranav Patel
Editor

Good boi. He is a good boi & does ML/AI. AI Lead.