I Paid $5 to Clone My Voice Using AI. It Was Pretty Trippy [CNET]

I have a funny relationship with my voice. I’ve been called everything from Minnie Mouse to Squeaks. The worst part is I feel like I sound like a child — although my aunt, who is similarly high-pitched, used this to her advantage when telemarketers called and she simply said her mom wasn’t home.

Nevertheless, I wanted to see what an AI-generated version of my voice would sound like, so I signed up for an account with AI audio startup ElevenLabs.

If you remember the deepfake robocall purporting to be from President Joe Biden in January, you’re already familiar with this company. It reportedly responded to the incident by banning the account behind the AI-generated audio, and it is now working with a deepfake detection company called Reality Defender to develop audio deepfake detection models.

This goes to show the double-edged sword that is generative AI. Thanks to tools like OpenAI’s ChatGPT, Google’s Gemini, Anthropic’s Claude and Meta’s namesake Meta AI, our creative potential in text, images, videos and audio has skyrocketed. Simple text or voice prompts can generate never-before-seen content and lifelike audio and video. At the same time, the potential for misuse is ever-present, which requires ever-evolving guardrails.

A better movie experience

ElevenLabs’ founders, Piotr Dabkowski and Mati Staniszewski, are childhood friends from Poland who grew up watching movies dubbed over by a Polish speaker.

“You could imagine watching Dune — it’s Timothée Chalamet and Zendaya talking and then over that is a Polish man reading both of their lines,” said ElevenLabs spokesperson Sam Sklar. “It’s not the ideal movie experience.”

And so the genesis of ElevenLabs, which was founded in 2022, was a desire to translate languages while preserving emotion, tone and delivery.

According to Sklar, Dabkowski figured out how to do this by creating a text-to-speech model that retains context around what it is saying. That means lighthearted, happy text is delivered by a lighthearted, happy voice and ominous, scary text is delivered by — you guessed it — a scary voice.

From there, London-based ElevenLabs created a system through which you can create realistic replicas of voices. To do it yourself, you upload clips of your own speech and the system breaks them down into the component parts of each word and scores your speech based on factors like expressiveness and speed. When your voice clone is ready, you can use it for AI dubbing as in the movie example or to translate content and regenerate it in one of the 32 languages ElevenLabs supports.

My voice, but AI

ElevenLabs’ free plan gives you 10,000 characters, which ElevenLabs says is roughly 10 minutes of audio.

Unfortunately, you can’t clone a voice without a subscription. Plans run from $5 to $330 a month. I coughed up $5 for the Starter Plan, which gives you 30,000 characters (or 30 minutes of audio).
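The character quotas are easier to picture as listening time. Here is a minimal Python sketch of that arithmetic, assuming roughly 1,000 characters per minute, which is only inferred from the "10,000 characters is roughly 10 minutes" figure above. The function names are made up for illustration and are not part of any ElevenLabs tool.

```python
# Assumption: about 1,000 characters per minute of audio, inferred from the
# "10,000 characters is roughly 10 minutes" figure. Real speech rates vary.
CHARS_PER_MINUTE = 1_000


def estimated_minutes(text: str) -> float:
    """Return a rough estimate of audio length, in minutes, for the text."""
    return len(text) / CHARS_PER_MINUTE


def fits_in_quota(text: str, quota_chars: int) -> bool:
    """Return True if the text uses no more characters than the quota allows."""
    return len(text) <= quota_chars


if __name__ == "__main__":
    script = "a" * 30_000  # stand-in for a script that uses a full Starter quota
    print(estimated_minutes(script))
    print(fits_in_quota(script, 30_000))
```

Treat the result as a ballpark only, since pauses, pacing and language all change how long a given number of characters takes to read.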

I could only record 30 seconds of my voice at a time. The system needs at least 60 seconds, so I had to submit two voice samples. Then I checked a box to confirm I have the rights to my voice and will not use the AI version for anything illegal, fraudulent or harmful. So it appears, at least, that people who use this app are on the honor system.

The ElevenLabs system generated my voice almost instantaneously. It was weirdly accurate, although arguably more confident — and slower — than I sound in real life. I’m not sure it picked up on all the anxiety swarming within me, but I suppose that’s hard to get across in a 60-second clip.

I had my AI voice read my CNET bio. The one little whoopsie was it read “Beverly Hills 90210” as “nine, oh, two, 10” instead of “nine, oh, two, one, oh.”
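One way around a slip like this is to spell out the digits before handing text to a speech tool. The Python sketch below is a hypothetical pre-processing step, not a description of how ElevenLabs handles numbers. It assumes that any standalone five-digit number should be read one digit at a time and that zero should be read as "oh."

```python
import re

# Assumption: zero is read as "oh", matching how the ZIP code is usually said.
DIGIT_WORDS = {
    "0": "oh", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}


def spell_five_digit_numbers(text: str) -> str:
    """Replace each standalone five-digit number with its digits as words."""

    def spell(match: re.Match) -> str:
        # Join the word for each digit with commas so each is read separately.
        return ", ".join(DIGIT_WORDS[digit] for digit in match.group())

    return re.sub(r"\b\d{5}\b", spell, text)


if __name__ == "__main__":
    print(spell_five_digit_numbers("Beverly Hills 90210"))
    # prints: Beverly Hills nine, oh, two, one, oh
```

The five-digit rule is deliberately blunt. It would also rewrite a number like 10000, so a real script would need a smarter check for what counts as a ZIP code.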

I also wanted to test the translation capability, which means clicking on the Dubbing Studio tab. All you need to do is add an audio file, select a target language and click Create.

It took about 60 seconds each time, but it was pretty wild to hear myself speak fluent Japanese, Greek and Finnish.

When I was done, I scanned the list of choices in a voice library and saw mine listed alongside a bunch of others, so I quickly deleted it lest some stranger use my voice for an unknown purpose.

How it’s being used

In addition to voice replicas and AI dubbing, ElevenLabs technology is used to create audiobooks and voiceovers. In July, the startup partnered with the estates of actors Judy Garland, James Dean, Burt Reynolds and Laurence Olivier to use their voices in its Reader app, which lets you listen to digital text — like articles, PDFs, newsletters and e-books — read by an AI voice.

While the Reader app is currently available only in English in the US, UK and Canada, ElevenLabs plans to expand it to 32 languages and a global audience soon.

The technology is also used in conversational AI agents. One example is a US hospital that has Spanish-speaking clients but no Spanish speaker to answer phones.

It has also been used by patients with degenerative diseases like ALS who lose their voices over time. ElevenLabs can recreate their voices from recordings of their speech so they can continue to communicate in their own voice.

What’s next

ElevenLabs raised $80 million in a Series B round in January. It was co-led by venture capital firm Andreessen Horowitz, former GitHub CEO Nat Friedman and former Apple Director Daniel Gross. VC firms Sequoia Capital, Smash Capital, SV Angel, BroadLight Capital and Credo Ventures also participated.

Proceeds are funding research and development.

The startup also plans to add support for additional languages and to give you more control over how speech is read with new functionality called Director’s Mode.

“Many other AI companies are trying to lead in all the different modalities — image, text, video,” Sklar said. “We just went all in on audio. So a little more focus there in terms of the set we’re playing in.”