Generating neural speech synthesis voice acting using xVASynth

--

Voice acting is an integral component of modern video games, adding a great deal of realism and immersion. This comes at a cost, however: tinkerers (modders) don’t have access to the original professional voice actors used in the games’ creation. As a result, user-generated content often suffers from a lack of voiced characters, or from sub-par recordings made with poor equipment.

This is a companion post for the v1.0 release of xVASynth, a tool I’ve been working on since 2018 which aims to tackle this problem using established neural speech synthesis techniques. I’ll go into more technical detail here regarding the development process and the models used. Watch this quick intro video for a preview/summary of the app, narrated by some of the trained voices:

V1.0 trailer/demo/overview video going over the main features

Goals and requirements

The main goal of this app is to provide content creators with a way to generate new voice acted lines for their projects. As such, audio quality is the top priority. A secondary objective is to provide a way to exert artistic control over the generated audio.

The Data

The models are built around datasets like LJSpeech, which is formatted as a folder full of .wav files and an accompanying metadata.csv file containing one line per .wav file, in the following format:
<filename>|<text transcript>

Extracting audio/transcript data from the games was quite easy, thanks to the very open and moddable nature of Bethesda games, and to great tools like BAE, LazyAudio, and Bethesda’s Creation Kit. To start preparing the data for training, the audio files were first extracted from the game archives, then decomposed into .lip and .wav files. The transcripts were extracted using the Creation Kit. Following this, it was just a matter of matching up the audio files to the transcripts, which was easy to do with a Python script. To finalize the transcripts, a number of filtering passes were run to exclude invalid data such as screams, shouts, music, and lines with extra unspoken text.
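
As an illustration, the matching step might look something like the sketch below. The transcript dump format, file names, and filter markers here are hypothetical; the actual scripts differ, but the idea is the same.

import csv
from pathlib import Path

# Hypothetical transcript dump from the Creation Kit: one
# "<file id>|<spoken line>" entry per row (format assumed for illustration).
transcript = {}
with open("ck_dump.txt", encoding="utf8") as f:
    for row in f:
        file_id, text = row.rstrip("\n").split("|", 1)
        transcript[file_id.lower()] = text

# Simple filter pass: drop non-speech lines such as screams or music
# (the marker strings are made up for this example).
BAD_MARKERS = ("<scream>", "<shout>", "<music>")

# Write the LJSpeech-style metadata.csv: <filename>|<text transcript>
with open("metadata.csv", "w", newline="", encoding="utf8") as f:
    writer = csv.writer(f, delimiter="|")
    for wav in Path("extracted_audio").glob("*.wav"):
        text = transcript.get(wav.stem.lower())
        if text and not any(m in text.lower() for m in BAD_MARKERS):
            writer.writerow([wav.stem, text])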

For the actual audio files, a couple of additional pre-processing steps were required, starting with down-sampling the audio to 22050Hz mono. Using pydub, silence was trimmed from either end of each file, as well as from the middle where long pauses were present (sox can also do this, but it introduced audio artifacts).
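
A minimal sketch of that pre-processing with pydub might look like this (the silence thresholds are illustrative guesses, not the values used in xVASynth):

from pydub import AudioSegment
from pydub.silence import split_on_silence

def preprocess(in_path, out_path):
    # Down-sample to 22050Hz mono, as expected by the models.
    audio = AudioSegment.from_wav(in_path).set_frame_rate(22050).set_channels(1)

    # Split on silent stretches, then re-join the voiced chunks; this trims
    # silence from both ends and collapses long pauses in the middle.
    chunks = split_on_silence(
        audio,
        min_silence_len=300,   # ms of silence before cutting (illustrative)
        silence_thresh=-45,    # dBFS level treated as silence (illustrative)
        keep_silence=100,      # ms of padding kept around each chunk
    )
    trimmed = sum(chunks, AudioSegment.empty())
    trimmed.export(out_path, format="wav")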

Models

I started early experiments on this project in 2018, using Tacotron. At that point, the project was a proof of concept and was riddled with issues, as can be seen in this video I made showcasing progress at the v0.1 pre-release version.

Demo video for early experiments in xVASynth v0.1

Though it somewhat worked, the audio quality was terrible (with very heavy reverb), and the output was quite unstable. The model was also very slow to load. Additionally, it required very large datasets, which limited me to voice actors who had also recorded audiobooks (the data pre-processing for which was a whole other messy can of worms). Finally, any artistic control was limited to clever use of punctuation.

Fast-forward to 2020, when NVIDIA released the PyTorch FastPitch model. Several of its key features are useful for this project.

One of the main selling points of this model is its explicit per-character values for pitch and duration. This means that by hijacking these intermediate model values, a user can exert artistic control over how the audio is generated - a great plus for the acting part of voice acting.

Pitch sliders and duration editing tools in xVASynth
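
Conceptually, applying the user’s slider edits amounts to replacing the model’s predicted per-character values just before decoding. Here is a hedged sketch of that idea; the names and shapes are illustrative, not xVASynth’s actual code:

import torch

def apply_user_edits(pitch_pred, dur_pred, pitch_deltas, dur_scales):
    # pitch_pred:   (1, T) predicted pitch, one value per input character
    # dur_pred:     (1, T) predicted duration in mel frames per character
    # pitch_deltas: T slider offsets added to the predicted pitch
    # dur_scales:   T multipliers stretching/squashing each character
    pitch = pitch_pred + torch.tensor(pitch_deltas).view_as(pitch_pred)
    durs = dur_pred.float() * torch.tensor(dur_scales).view_as(dur_pred)
    durs = durs.round().clamp(min=1).long()  # at least one frame each
    return pitch, durs

The decoder then expands each character’s encoding to its (edited) number of frames, so the change is reflected directly in the output audio.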

Another plus is the support for multiple speakers per model. The best quality was still achieved when training a single speaker at a time. However, this meant that training (or at least pre-training) voices with only a small amount of data became possible, by grouping them together.
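
The mechanism is simple: a learned embedding per speaker is mixed into the shared encoder representation, so one set of acoustic weights can serve several voices. A rough sketch of the idea (a simplification of FastPitch’s actual conditioning):

import torch
import torch.nn as nn

class SpeakerConditioning(nn.Module):
    # One learned vector per speaker, added to the text encoder's output.
    def __init__(self, n_speakers, d_model):
        super().__init__()
        self.spk_emb = nn.Embedding(n_speakers, d_model)

    def forward(self, enc_out, speaker_id):
        # enc_out: (B, T, d_model) encoded text; speaker_id: (B,) int ids
        return enc_out + self.spk_emb(speaker_id).unsqueeze(1)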

Image source: FastPitch paper on arxiv (https://arxiv.org/pdf/2006.06873.pdf)

However, the most important point for this project is that training is bootstrapped by a pre-trained Tacotron 2 model, instead of learning everything from scratch. As a pre-processing step, the Tacotron 2 model aligns the mel spectrograms with the text, yielding the character durations used to compute the loss, as shown in the diagram above from the paper (the pitch information is extracted using a different method).
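
Concretely, the paper derives each character’s duration by counting how many mel frames attend most strongly to it in Tacotron 2’s attention matrix. A small sketch of that extraction (shapes assumed for illustration):

import torch

def durations_from_attention(attn):
    # attn: (n_mel_frames, n_chars) Tacotron 2 alignment; each row is the
    # attention distribution over input characters for one mel frame.
    owners = attn.argmax(dim=1)  # most-attended character per frame
    # A character's duration = how many frames it "owns".
    return torch.bincount(owners, minlength=attn.size(1))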

Just like Tacotron and Tacotron 2, the FastPitch model generates mel spectrograms, so another model is needed to generate actual .wav audio files from these spectrograms. The FastPitch repo uses WaveGlow for this, which works really well.
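
For reference, NVIDIA publishes a pre-trained WaveGlow on torch.hub, so a minimal vocoding pass might look like the sketch below (the hub entry-point name is taken from NVIDIA’s DeepLearningExamples listing and may change between releases):

import torch
from scipy.io.wavfile import write

# Load NVIDIA's pre-trained WaveGlow vocoder from torch.hub.
waveglow = torch.hub.load("NVIDIA/DeepLearningExamples:torchhub",
                          "nvidia_waveglow")
waveglow = waveglow.remove_weightnorm(waveglow).eval()

# Placeholder mel spectrogram; in practice this comes from FastPitch.
mel = torch.zeros(1, 80, 200)

with torch.no_grad():
    audio = waveglow.infer(mel, sigma=0.666)

# 22050Hz matches the sample rate the models were trained on.
write("line.wav", 22050, audio[0].cpu().numpy())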

This dependency on Tacotron 2 has made training far quicker, simpler, and more successful. However, an issue persists when the speaker’s style is very different from what the pre-trained Tacotron 2 was trained on: LJSpeech, a female speaker dataset. This means deep male voices cannot be converted very well into the data FastPitch requires.

xVASynth right now

At the time of writing, xVASynth v1.0 comes with 34 models trained for 53 voice sets across 6 games, with many more planned for future releases. The issue with Tacotron 2 persists for most male voices, as I don’t currently have the hardware required to train/fine-tune a Tacotron 2 model well enough, though this is the next step after my next hardware upgrade.

The video at the top of this post details usage examples for the app, which can be downloaded from either GitHub or the Nexus.

I additionally added HiFi-GAN as an alternative vocoder to WaveGlow, which is orders of magnitude faster on the CPU, albeit at lower audio quality.

I also decided to get the ball rolling and make an example mod using this tool. People who have played Oblivion may have been haunted by the infamous “Stop right there, criminal scum” lines uttered by the seemingly psychic guards. Well, with xVASynth, the nightmare can continue in Skyrim!

Future plans

Until my next hardware upgrade, the plan is to work through the very lengthy list of remaining voices for the currently supported (and soon to be released) Bethesda games, with a larger focus on the female voices.

Once I can fine-tune Tacotron 2, I’ll go back through the list, and then have a go at the really difficult voices, such as robots, ghouls, and other creatures, to see how far this can be pushed.

You can track development progress on the GitHub project page, or on one of the Nexus pages, where the app and models can be downloaded. There is also a Discord server set up for discussion around the project, which you can join here: https://discord.gg/nv7c6E2TzV
