Hibiki: Real-time Voice-Preserving Language Translation

Hands-on with Open Source Real-Time Translation Models

Remember Star Trek's Universal Translator? Science fiction is becoming reality with two groundbreaking open-source models: Meta's Seamless Streaming and Hibiki. These models are revolutionizing real-time speech translation, and I've had the opportunity to test one firsthand.

The Open Source Revolution in Real-Time Translation

Until recently, real-time speech translation was dominated by closed systems from tech giants. Google's Interpreter Mode, Microsoft's Skype Translator, and DeepL Voice showcased impressive capabilities but kept their technology under wraps. That's changing with the release of two powerful open-source alternatives.

Seamless Streaming: A Universal Translator

Meta's Seamless Streaming is a multilingual powerhouse that I've tested firsthand. To challenge it, I even tried Armenian; let's just say that dataset still needs some work. The model supports:

Speech recognition in 96 languages

Speech-to-text translation from 101 source languages into 96 target languages

Speech-to-speech translation for 36 target languages

Using the model through Hugging Face's interface (try it yourself at huggingface.co/spaces/facebook/seamless-streaming), I experienced near real-time translation with impressively natural output. The latency is typically under 200 milliseconds, comparable to the delay on a mobile phone call.

What sets Seamless Streaming apart is its intelligent handling of language differences. The model dynamically decides when to start translating based on the sentence structure of both languages, ensuring natural-sounding output even between very different languages like Japanese and English.
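To build intuition for that read/write decision, here's a sketch of wait-k, a much simpler fixed policy from the simultaneous-translation literature: read k source tokens up front, then alternate between writing one target token and reading one more source token. This is an illustrative toy, not Meta's actual learned policy, and the function name is my own.

```python
def wait_k_actions(src_len, tgt_len, k):
    """Toy wait-k policy for simultaneous translation.

    Returns the READ/WRITE action sequence: read k source tokens
    up front, then alternate writing one target token and reading
    one more source token until the target is complete.
    """
    actions, read, written = [], 0, 0
    while written < tgt_len:
        # Read ahead while we are still under the k-token lag
        # and source tokens remain; otherwise emit a target token.
        if read < min(k + written, src_len):
            actions.append("READ")
            read += 1
        else:
            actions.append("WRITE")
            written += 1
    return actions
```

For k=2 with four source and four target tokens, this produces two READs followed by alternating WRITEs and READs. A learned policy like Seamless Streaming's can instead delay or hasten each WRITE based on the actual sentence structure, which is exactly what makes pairs like Japanese-English (verb-final vs. verb-early) work.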

While the model is open-source, it comes with a CC BY-NC 4.0 license, limiting it to non-commercial use. This makes it perfect for research and personal projects but requires licensing for business applications.

Hibiki: High-Fidelity Voice-Preserving Translation

Hibiki takes a different approach, focusing on high-quality translation between specific language pairs (currently French-to-English) while preserving the speaker's voice characteristics. Its key features include:

Decoder-only transformer architecture for efficient processing

Real-time translation with minimal delay

Voice preservation technology that maintains speaker identity

MIT/Apache-2.0 licensed code and CC-BY 4.0 licensed models

What makes Hibiki particularly exciting is its commercial-friendly licensing and efficient architecture. It's designed to run on consumer hardware, potentially enabling offline translation devices or apps that don't require cloud connectivity.

There’s more to learn about this model, but since it was released only about six hours ago, I have yet to play with it. I will definitely be putting it through its paces, and for my own purposes I’ll be testing it with Armenian to see how it does.

Sign up below to get an update when I’ve put this model through its paces.

The Road Ahead

While these models represent significant progress, challenges remain:

Handling complex accents and dialects

Managing context and cultural nuances

Balancing latency with translation quality

Scaling to more language pairs
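The latency half of that trade-off is usually quantified with Average Lagging (AL), a standard metric introduced with the wait-k (STACL) work: it averages how far the model's reading position runs ahead of an ideal, evenly paced translator. A minimal sketch of the computation:

```python
def average_lagging(g, src_len, tgt_len):
    """Average Lagging (AL) for simultaneous translation.

    g[t] is the number of source tokens read before emitting
    target token t+1. Lower AL means lower latency; an offline
    system that reads everything first scores AL = src_len.
    """
    gamma = tgt_len / src_len  # target-to-source length ratio
    # tau: 1-based index of the first target token emitted only
    # after the entire source has been read
    tau = next(t for t, g_t in enumerate(g, start=1) if g_t == src_len)
    return sum(g[t - 1] - (t - 1) / gamma for t in range(1, tau + 1)) / tau
```

For a wait-k system with equal source and target lengths, AL works out to exactly k, which makes the metric easy to sanity-check; a fully offline translator scores the whole source length.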

However, the open-source nature of these projects means the entire community can contribute to solving these challenges. We're likely to see rapid improvements and new language pairs added as researchers and developers build upon these foundations.

Conclusion

The release of Seamless Streaming and Hibiki marks a turning point in speech translation technology. Their open-source nature democratizes access to advanced translation capabilities, enabling innovation beyond what any single company could achieve. Whether you're a developer, researcher, or just curious about the technology, these models provide an exciting glimpse into the future of human communication.