DeepL has made a name for itself with online text translation it claims is more nuanced and precise than services from the likes of Google — a pitch that has catapulted the German startup to a valuation of $2 billion and more than 100,000 paying customers.
Now, as the hype for AI services continues to grow, DeepL is adding another mode to the platform: audio. With DeepL Voice, users can listen to someone speaking in one language and have it automatically translated into another, in real time.
Today, DeepL can “hear” English, German, Japanese, Korean, Swedish, Dutch, French, Turkish, Polish, Portuguese, Russian, Spanish, and Italian. Translated captions are available in all 33 languages currently supported by DeepL Translator.
For now, DeepL Voice stops short of delivering the result as audio or video itself: The service is aimed at real-time, live conversations and video conferencing, and its output comes through as text, not audio.
For live conversations, you can set up your translations to appear as “mirrors” on a smartphone — the idea being that you place the phone between you on a meeting table so each side can see the words translated — or as a transcription that you share side by side with someone. In the videoconferencing service, the translations appear as subtitles.
That could be something that changes over time, Jarek Kutylowski, the company’s founder and CEO (pictured above), hinted in an interview. This is DeepL’s first product for voice, but it’s unlikely to be its last. “[Voice] is where translation is going to play out in the next year,” he added.
There is other evidence to support that statement. Google — one of DeepL’s biggest competitors — has also started incorporating real-time translated captions into its Meet video conferencing service. And a multitude of AI startups are building voice translation services, such as AI voice specialist ElevenLabs (ElevenLabs Dubbing) and Panjaya, which creates translations using “deepfake” voices and video matched to the audio.
The latter uses ElevenLabs’ API, and according to Kutylowski, ElevenLabs itself is using tech from DeepL to power its translation service.
Audio output is not the only feature yet to launch.
There is also no API for the voice product right now. DeepL’s main business is focused on B2B and Kutylowski said the company is working with partners and customers directly.
Nor is there a wide choice of integrations: The only video calling service that currently supports DeepL’s subtitles is Microsoft Teams, which “covers most of our customers,” Kutylowski said. There’s no word on when or whether Zoom or Google Meet will incorporate DeepL Voice down the line.
The product will feel like a long time coming for DeepL users, and not just because we’ve been awash in other AI voice services aimed at translation. Kutylowski said this has been the No. 1 request from customers since 2017, the year DeepL launched.
Part of the reason for the wait is that DeepL has been taking a pretty deliberate approach to building its product. Unlike many others in the world of AI applications that lean on and tweak other companies’ large language models (LLMs), DeepL’s aim is to build its service from the ground up. In July, the company released a new LLM optimized for translations that it says outperforms GPT-4, and those from Google and Microsoft, not least because its primary purpose is for translation. The company has also continued to enhance the quality of its written output and glossary.
Similarly, one of DeepL Voice’s unique selling points is that it works in real time. That matters because many “AI translation” services on the market actually work on a delay, making them hard or impossible to use in live situations, the very use case DeepL is addressing.
Kutylowski hinted that this was another reason the new voice-processing product focuses on text-based translations: Text can be computed and produced very quickly, while AI processing and architecture still have a way to go before they can produce audio and video at the same speed.
Video conferencing and meetings are likely use cases for DeepL Voice, but Kutylowski noted that another major one the company envisions is in the service industry, where front-line workers at, say, restaurants could use the service to help communicate with customers more easily.
This could be useful, but it also highlights one of the rougher points of the service. In a world where we are all suddenly a lot more aware of data protection and concerns about how new services and platforms are co-opting private or proprietary information, it remains to be seen how keen people will be to have their voices picked up and used in this way.
Kutylowski insisted that although voices will be traveling to its servers to be translated (the processing does not happen on-device), nothing is retained by its systems, nor used for training its LLMs. Ultimately, DeepL will work with its customers to make sure that they do not violate GDPR or any other data protection regulations.