How does your assistant device work, based on Text-to-Speech technology?

Speech synthesis

Speech synthesis is the artificial production of human speech. Text-to-Speech (TTS) is a way to convert written language into a human voice (speech). The goal of TTS is to render natural-sounding speech signals for downstream applications such as assistant devices (Google’s Assistant, Amazon’s Echo, Apple’s Siri). This story explains how a human-like voice can be generated. Concatenative TTS and Parametric TTS are the traditional ways to generate audio, but both have limitations. Google released a generative model, WaveNet, which was a breakthrough in TTS: it generates high-quality audio and overcomes limitations of the traditional approaches.

This story discusses WaveNet: A Generative Model for Raw Audio (van den Oord et al., 2016) and covers the following:

  • Text-to-Speech
  • Technique of Classical Speech Synthesis
  • WaveNet
  • Experiment

Text-to-Speech (TTS)

Technically, we can treat TTS as a sequence-to-sequence problem. It has two major stages: text analysis and speech synthesis. Text analysis is quite similar to generic natural language processing (NLP) steps such as sentence segmentation, word segmentation, and part-of-speech (POS) tagging (although we may not need heavy preprocessing when using a deep neural network). The first stage ends with grapheme-to-phoneme (G2P) conversion, and the resulting phoneme sequence is the input of the second stage. Speech synthesis takes that output and generates a waveform.
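To make the first stage more concrete, here is a minimal sketch of text analysis in Python. It is only an illustration under strong assumptions: the lexicon `TOY_LEXICON`, the function names, and the fallback of spelling out unknown words letter by letter are all hypothetical, and a real system would use a full pronunciation dictionary or a learned G2P model. The second stage (turning phonemes into a waveform) is deliberately left out.

```python
import re

# Hypothetical toy lexicon with ARPAbet-style symbols, for illustration only.
TOY_LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}


def split_sentences(text):
    """Sentence segmentation: naive split on whitespace after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def split_words(sentence):
    """Word segmentation: lowercase the sentence and keep alphabetic tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


def g2p(word):
    """Grapheme-to-phoneme lookup; unknown words are spelled out letter by letter."""
    return TOY_LEXICON.get(word, list(word.upper()))


def text_analysis(text):
    """Stage 1: text -> one phoneme sequence per sentence."""
    return [
        [phoneme for word in split_words(sentence) for phoneme in g2p(word)]
        for sentence in split_sentences(text)
    ]


if __name__ == "__main__":
    # Each inner list would be the input of the speech synthesis stage.
    print(text_analysis("Hello world. Hello!"))
```

Each phoneme sequence produced here is what the speech synthesis stage would consume to generate audio.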

Technique of Classical Speech Synthesis

Concatenative TTS and Parametric TTS are the traditional ways to generate audio from text. As the name suggests, Concatenative TTS concatenates short clips to form speech. As short…
