Who is recognized as having the best deepfake audio software?
December 9, 2021 8:27 PM

Who is recognized as having the best deepfake audio software, or performing the most sophisticated research in this area?

I'm thinking of material on the level of the Tom Cruise videos like this one: https://www.tiktok.com/@deeptomcruise?lang=en&is_copy_url=1&is_from_webapp=v1

I'm looking for something bleeding-edge, that may not yet be available commercially.
Best answer: A few years back, comedians Key and Peele helped demo (in a highly paid corporate gig, I'm guessing) Adobe's experimental tech for this, called VoCo. The Wikipedia page for the app mentions that the lack of public updates has allowed alternatives Resemble AI and WaveNet to gain momentum. I'm guessing it's one of these?
Best answer: You might find some useful background in this CJR article on the deepfake industry. (I was surprised to hear that Sassy Trump was voiced by Peter Serafinowicz.)
For what it's worth, the Tom Cruise video you linked to wasn't fake audio at all. It was actually a live audio impersonation performed by actor Miles Fisher, who also served as the body and stand-in face before his face was replaced with Tom Cruise's via deep learning video synthesis technology.

If you wanted to synthesize audio that sounds like Tom Cruise today, you basically have two options: text-to-speech and "speech-to-speech".

Text-to-speech is a well-developed technology at this point, with a wide range of commercial offerings, but it is not appropriate for recreating dramatic human performances. It could never convincingly render the last line of your video above (when Tom says "Hehe... that's crazy!"), because the variety of human performance cannot be captured in text alone. Making text-to-speech voices sound more realistic (or even identical to known voices) is possible today (example), but it falls short of being a reasonable way to create arbitrary human dialogue, because the "realistic human voice" problem is actually quite different from the "realistic human actor" problem. I doubt text-to-speech will make much progress in this area in the short term, because it has hit a local optimum of usability that may thwart further research. For the wide range of text-to-speech use cases deployed today, it doesn't much matter if the voice sounds a bit robotic (your car doesn't need to dramatically read you the driving directions). The commercial applications are already viable, so refinement is not strictly necessary.

More relevant to your interests is what I would call "speech-to-speech" synthesis, where instead of rendering text to a synthesized voice, you render an actual human voice to another human voice. I don't think "speech-to-speech" is the right terminology (I don't know what is), but it does accurately capture what's done and how it's different from text-to-speech. Speech-to-speech is directly analogous to what's done to the actor's face in the Tom Cruise video. The computer is not generating a purely synthesized face but rather starting with one face and transforming it into another.
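
To make the distinction concrete, here is a toy Python sketch. It is not how any real product works, and every name in it (Frame, text_to_speech, speech_to_speech, the numbers) is made up for illustration. The only point is what each approach gets to start from: text-to-speech sees words and has to invent the delivery, while speech-to-speech sees a real performance and only swaps whose voice it is. (For searching purposes, I believe the research literature usually files the second approach under "voice conversion".)

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Frame:
    """One slice of a spoken performance (toy representation, not real audio)."""
    word: str         # what is being said
    duration_ms: int  # how long the speaker held it
    pitch_hz: float   # how high the speaker's voice was
    timbre: str       # whose voice it sounds like


def text_to_speech(text: str, voice: str) -> List[Frame]:
    """Start from text only, so timing and pitch have to be invented.

    The flat 300 ms / 120 Hz defaults are arbitrary placeholders.
    """
    return [Frame(w, 300, 120.0, voice) for w in text.split()]


def speech_to_speech(performance: List[Frame], voice: str) -> List[Frame]:
    """Start from a real performance: keep the delivery, swap the voice.

    Simplification: a real system would likely also adapt pitch to the
    target speaker's range; this toy copies it unchanged.
    """
    return [Frame(f.word, f.duration_ms, f.pitch_hz, voice) for f in performance]


# A made-up "performance" of the last line of the Tom Cruise video.
actor = [
    Frame("Hehe...", 650, 180.0, "impersonator"),
    Frame("that's", 200, 140.0, "impersonator"),
    Frame("crazy!", 520, 210.0, "impersonator"),
]

tts = text_to_speech("Hehe... that's crazy!", "target")
sts = speech_to_speech(actor, "target")

for label, frames in (("text-to-speech", tts), ("speech-to-speech", sts)):
    print(label, [(f.word, f.duration_ms, f.pitch_hz) for f in frames])
```

Both outputs carry the target voice and the same words, but only the speech-to-speech one still has the actor's timing and pitch, which is the part that text alone cannot carry.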

For an idea of what state-of-the-art commercially available "speech-to-speech" sounds like, check out Respeecher. They call their technique "voice cloning", but that overlaps with a term used in the text-to-speech community, so I don't know if that will help you investigate further (it's a start, I guess). Their technology is impressive, but it still sounds synthetic to my ear. I doubt you could produce something as convincing as a human doing an impression, but maybe there are incremental improvements on the horizon.

The speech-to-speech approach is limited in that you have to record a person saying something first. You couldn't use it to generate driving directions unless you pre-recorded someone speaking all possible driving directions. So it probably has fewer commercial applications than text-to-speech, and less visibility when you do a search for "fake speech" or "speech synthesis". Respeecher is just a random company I found with a Google search, but their tech is likely to represent the cutting edge of commercially available technology today. Moreover, I doubt there are academic research initiatives that are much beyond this, as the groundwork of deep learning has already been laid, and improvements in speech synthesis are starting to look more like engineering problems than basic science problems. As a result, any new or interesting developments are probably already hidden behind pre-commercial products (if they exist at all).
