WaveNet

  1. History

  2. Design

  3. Applications

  4. References

WaveNet is a deep neural network for generating raw audio. It was created by researchers at London-based artificial intelligence firm DeepMind. The technique, outlined in a paper in September 2016,[1] is able to generate relatively realistic-sounding human-like voices by directly modelling waveforms using a neural network method trained with recordings of real speech. Tests with US English and Mandarin reportedly showed that the system outperforms Google's best existing text-to-speech (TTS) systems, although it is still less convincing than actual human speech.[2] WaveNet’s ability to generate raw waveforms means that it can model any kind of audio, including music.[3]

History

Generating speech from text is an increasingly common task thanks to the popularity of software such as Apple's Siri, Microsoft’s Cortana, Amazon Alexa and the Google Assistant.[4]

Most such systems use a variation of a technique that involves concatenating sound fragments together to form recognisable sounds and words.[5] The most common of these is called concatenative TTS.[6] It relies on a large library of speech fragments recorded from a single speaker, which are then concatenated to produce complete words and sounds. The result can sound unnatural, with an odd cadence and tone.[7] The reliance on a recorded library also makes it difficult to modify or change the voice.[8]

Another technique, known as parametric TTS,[9] uses mathematical models to recreate sounds that are then assembled into words and sentences. The information required to generate the sounds is stored in the parameters of the model. The characteristics of the output speech are controlled via the inputs to the model, while the speech itself is typically created using a voice synthesiser known as a vocoder. This can also result in unnatural-sounding audio.

Design

WaveNet is a type of feedforward neural network known as a deep convolutional neural network (CNN). In WaveNet, the CNN takes a raw signal as an input and synthesises an output one sample at a time. It does so by sampling from a softmax (i.e. categorical) distribution over signal values that have been encoded with a μ-law companding transformation and quantized to 256 possible values.[10]
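The μ-law companding and 256-level quantization described above can be sketched as follows. This is a minimal illustration using NumPy, not the implementation from the paper; the function names are illustrative.

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Compress a waveform with values in [-1, 1] using mu-law
    companding, then quantize to mu + 1 (here 256) integer bins."""
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    # Map [-1, 1] to integer bins {0, ..., mu}.
    return ((compressed + 1) / 2 * mu + 0.5).astype(np.int32)

def mu_law_decode(q, mu=255):
    """Invert the quantization and companding (lossy)."""
    compressed = 2 * (q.astype(np.float64) / mu) - 1
    return np.sign(compressed) * np.expm1(np.abs(compressed) * np.log1p(mu)) / mu
```

The companding step allocates more of the 256 bins to quiet signal values, where human hearing is most sensitive, so a coarse 8-bit quantization still yields intelligible audio.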

In the 2016 paper (arXiv:1609.03499), the network was fed real waveforms of speech in English and Mandarin. As these pass through the network, it learns a set of rules to describe how the audio waveform evolves over time. The trained network can then be used to create new speech-like waveforms at 16,000 samples per second. These waveforms include realistic breaths and lip smacks, but do not conform to any language.[11]
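The one-sample-at-a-time generation described above is autoregressive: each new sample is drawn from the network's predicted distribution over the 256 quantized values, then appended to the context for the next prediction. A minimal sketch, in which `model` stands in for a hypothetical trained network mapping a context window of quantized samples to 256 logits:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_next(logits):
    """Draw one quantized sample from the softmax over 256 classes."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return int(rng.choice(256, p=p))

def generate(model, n_samples=16000, receptive_field=1024):
    """Autoregressive generation: each sample is conditioned on the
    previously generated ones. `model` is a hypothetical callable
    returning 256 logits for a context of quantized samples."""
    audio = [128]  # start from silence (the mu-law code for 0)
    for _ in range(n_samples - 1):
        context = np.array(audio[-receptive_field:])
        audio.append(sample_next(model(context)))
    return np.array(audio)
```

Because every sample depends on thousands of previous ones, naive generation at 16,000 samples per second is expensive, which is consistent with DeepMind's initial statement that WaveNet required too much computation for real-world use.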

WaveNet is able to accurately model different voices, with the accent and tone of the input correlating with the output. For example, if it is trained with German, it produces German speech.[12] This ability to clone voices has raised ethical concerns about WaveNet's ability to mimic the voices of living and dead persons.

The capability also means that if WaveNet is fed other inputs, such as music, its output will be musical. At the time of its release, DeepMind showed that WaveNet could produce waveforms that sound like classical music.[13]

Applications

At the time of its release, DeepMind said that WaveNet required too much computational processing power to be used in real-world applications.[14] In October 2017, Google announced a 1,000-fold performance improvement along with better voice quality. WaveNet was then used to generate Google Assistant voices for US English and Japanese across all Google platforms.[15] At the annual I/O developer conference in May 2018, it was announced that new Google Assistant voices were available and made possible by WaveNet; by modelling the raw audio of the voice actors' samples, WaveNet greatly reduced the number of recordings required to create a voice model.[16]

References

1. van den Oord, Aäron; Dieleman, Sander; Zen, Heiga; Simonyan, Karen; Vinyals, Oriol; Graves, Alex; Kalchbrenner, Nal; Senior, Andrew; Kavukcuoglu, Koray (2016-09-12). "WaveNet: A Generative Model for Raw Audio". arXiv:1609.03499.
2. Kahn, Jeremy (2016-09-09). "Google's DeepMind Achieves Speech-Generation Breakthrough". Bloomberg. https://www.bloomberg.com/news/articles/2016-09-09/google-s-ai-brainiacs-achieve-speech-generation-breakthrough
3. Meyer, David (2016-09-09). "Google's DeepMind Claims Massive Progress in Synthesized Speech". Fortune. http://fortune.com/2016/09/09/google-deepmind-wavenet-ai/
4. Kahn, Jeremy (2016-09-09). "Google's DeepMind Achieves Speech-Generation Breakthrough". Bloomberg. https://www.bloomberg.com/news/articles/2016-09-09/google-s-ai-brainiacs-achieve-speech-generation-breakthrough
5. Condliffe, Jamie (2016-09-09). "When this computer talks, you may actually want to listen". MIT Technology Review. https://www.technologyreview.com/s/602343/face-of-a-robot-voice-of-an-angel/
6. Hunt, A. J.; Black, A. W. (May 1996). "Unit selection in a concatenative speech synthesis system using a large speech database". 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, vol. 1, pp. 373–376. doi:10.1109/ICASSP.1996.541110.
7. Coldewey, Devin (2016-09-09). "Google's WaveNet uses neural nets to generate eerily convincing speech and music". TechCrunch. https://techcrunch.com/2016/09/09/googles-wavenet-uses-neural-nets-to-generate-eerily-convincing-speech-and-music/
8. van den Oord, Aäron; Dieleman, Sander; Zen, Heiga (2016-09-08). "WaveNet: A Generative Model for Raw Audio". DeepMind. https://deepmind.com/blog/wavenet-generative-model-raw-audio/
9. Zen, Heiga; Tokuda, Keiichi; Black, Alan W. (2009). "Statistical parametric speech synthesis". Speech Communication, 51(11): 1039–1064. doi:10.1016/j.specom.2009.04.004.
10. van den Oord, Aäron; Dieleman, Sander; Zen, Heiga; Simonyan, Karen; Vinyals, Oriol; Graves, Alex; Kalchbrenner, Nal; Senior, Andrew; Kavukcuoglu, Koray (2016-09-12). "WaveNet: A Generative Model for Raw Audio". arXiv:1609.03499.
11. Gershgorn, Dave (2016-09-09). "Are you sure you're talking to a human? Robots are starting to sound eerily lifelike". Quartz. https://qz.com/778056/google-deepminds-wavenet-algorithm-can-accurately-mimic-human-voices/
12. Coldewey, Devin (2016-09-09). "Google's WaveNet uses neural nets to generate eerily convincing speech and music". TechCrunch. https://techcrunch.com/2016/09/09/googles-wavenet-uses-neural-nets-to-generate-eerily-convincing-speech-and-music/
13. van den Oord, Aäron; Dieleman, Sander; Zen, Heiga (2016-09-08). "WaveNet: A Generative Model for Raw Audio". DeepMind. https://deepmind.com/blog/wavenet-generative-model-raw-audio/
14. "Adobe Voco 'Photoshop-for-voice' causes concern" (2016-11-07). BBC News. https://www.bbc.co.uk/news/technology-37899902
15. "WaveNet launches in the Google Assistant". DeepMind. https://deepmind.com/blog/wavenet-launches-google-assistant/
16. Martin, Taylor (2018-05-09). "Try the all-new Google Assistant voices right now". CNET. https://www.cnet.com/how-to/how-to-get-all-google-assistants-new-voices-right-now/

Categories: Speech synthesis | Artificial neural networks | Google acquisitions | Google
