How does your assistant device work based on Text-to-Speech technology?

Speech synthesis

Published in

Becoming Human: Artificial Intelligence Magazine

5 min read · Jul 29, 2019

Speech synthesis is the artificial production of human speech. Text-to-Speech (TTS) is a way to convert written language into human voice (speech). The goal of TTS is to render natural-sounding speech signals for downstream applications such as assistant devices (Google’s Assistant, Amazon’s Echo, Apple’s Siri). This story will talk about how we can generate a human-like voice. Concatenative TTS and parametric TTS are the traditional ways to generate audio, but they have some limitations. Google released a generative model, WaveNet, which is a breakthrough in TTS. It can generate very good audio and overcomes the limitations of the traditional approaches.

This story discusses WaveNet: A Generative Model for Raw Audio (van den Oord et al., 2016) and covers the following:

Text-to-Speech
Technique of Classical Speech Synthesis
WaveNet
Experiment

Text-to-Speech (TTS)

Technically, we can treat TTS as a sequence-to-sequence problem. It includes two major stages: text analysis and speech synthesis. Text analysis is quite similar to generic natural language processing (NLP) steps such as sentence segmentation, word segmentation, and part-of-speech (POS) tagging (although we may not need heavy preprocessing when using a deep neural network). The first stage ends with grapheme-to-phoneme (G2P) conversion, and its output is the input of the second stage. Speech synthesis takes the output of the first stage and generates the waveform.
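To make the first stage more concrete, here is a minimal sketch of a text analysis front end in Python. It is only an illustration under simplifying assumptions: the tiny lexicon `TOY_LEXICON`, the helper names (`split_sentences`, `split_words`, `to_phonemes`, `analyze`) and the letter-by-letter fallback are hypothetical, and a real system would use a full pronunciation dictionary or a trained G2P model. The second stage (waveform generation) is not shown.

```python
import re

# Hypothetical toy pronunciation lexicon using ARPAbet-style symbols.
# A real front end would rely on a full dictionary or a trained G2P model.
TOY_LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
    "good": ["G", "UH", "D"],
    "morning": ["M", "AO", "R", "N", "IH", "NG"],
}


def split_sentences(text):
    """Naive sentence segmentation: split after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def split_words(sentence):
    """Naive word segmentation: lowercase alphabetic tokens only."""
    return re.findall(r"[a-z']+", sentence.lower())


def to_phonemes(word):
    """Toy G2P: dictionary lookup, with unknown words spelled out
    letter by letter and marked with angle brackets."""
    return TOY_LEXICON.get(word, ["<" + ch + ">" for ch in word])


def analyze(text):
    """Stage 1 of the pipeline: text -> sentences -> words -> phonemes."""
    return [
        [to_phonemes(word) for word in split_words(sentence)]
        for sentence in split_sentences(text)
    ]


if __name__ == "__main__":
    for sentence in analyze("Hello world. Good morning!"):
        print(sentence)
```

The nested list returned by `analyze` (one list of phoneme sequences per sentence) stands in for the representation that would be handed to the speech synthesis stage.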

Technique of Classical Speech Synthesis

Concatenative TTS and parametric TTS are the traditional ways to generate audio from text. As the name suggests, concatenative TTS concatenates short clips to form speech. As short…

Written by Edward Ma