Machine Learning WAVE Files with TensorFlow

--

I’ve been working on a second book focused on Machine Learning (ML) and Artificial Intelligence (AI), but this time it’s project based. The book is a collection of distinct small projects that add up to larger ones. This post is inspired by a small intermediate project that plays hide and seek in music using a method called backmasking. The book doesn’t touch on TensorFlow core, so I used this post as a starting point for exploring it.

The goal is to parse a WAVE file with TensorFlow while skimming the surface of how TensorFlow operates. The details here also help in getting started with the latest Kaggle competition from the Google Brain team, which involves creating predictive models from a large training set of WAVE files.

By the end of this post you will have completed two small projects related to WAVE files. The second project works with TensorFlow and skims the surface of the code used by the core and its APIs. You’ll walk away with a better understanding of:

  1. WAVE file format
  2. Loading WAVE files in TensorFlow
  3. Basics of TensorFlow core

If you aren’t comfortable with TensorFlow, this isn’t a gentle introduction and I’d recommend trying the official beginner tutorial instead. The basics of TensorFlow are skipped over in these projects.

What you’ll need to have set up

You’ve likely been experimenting with TensorFlow and already have these requirements installed. The requirements are common except for the WAVE files provided by the Kaggle competition.

  1. Python version 3. Installation instructions.
  2. TensorFlow version r1.4 on Linux/Mac and nightly build on Windows (due to a missing cmake entry in r1.4). Installation instructions.
  3. Jupyter Notebook. Installation instructions.
  4. Example WAVE files provided in the Google Brain Kaggle competition training data. The example implementations expect that you’ve downloaded and extracted the training data into ../train, one directory above where this code is executed.

Project 1–1: Parse a WAVE file purely in Python

The first micro project is to parse a WAVE file purely in Python. From the WAVE file you need to generate this output, replacing the <Fake words> with the actual values parsed from the file:

Parsed ../train/audio/bed/00176480_nohash_0.wav
-----------------------------------------------
Channels: <Fake words>
Sample Rate: <Fake words>
First Sample: <Fake words>
Second Sample: <Fake words>
Length in Seconds: <Fake words>

To ease into this project, here are a few links and terms you might find useful.

Project 1–1: Implementation

WAVE is a format that follows a simple specification shared across multiple different file formats. It is an implementation of the RIFF container format, which was inspired by IFF, another file format ahead of its time. These formats introduced a way of thinking in which a file format can be generic and still fast to work with. Thanks to RIFF, many prominent file formats exist today in a structure that is shared across different platforms.

The first step in working with a WAVE file is to look at its header. The header is typically the first section of a file which describes how the rest of the file is stored. For WAVE files, the header describes exactly how the sound data is stored to make it easier to parse.

In order to look at the header, open the file in binary mode and read a few bytes.

with open('../train/audio/bed/00176480_nohash_0.wav', 'rb') as wav_file:

Normally, files are opened in text mode, that means, you read and write strings from and to the file, which are encoded in a specific encoding (the default being UTF-8). ‘b’ appended to the mode opens the file in binary mode: now the data is read and written in the form of bytes objects. This mode should be used for all files that don’t contain text. [Python 3 docs]
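To see the difference concretely, here is a small standalone demo (not part of the WAVE parser) that writes four bytes to a temporary file and reads them back in binary mode versus text mode:

```python
import os
import tempfile

# Write the same four bytes, then read them back in each mode.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(path, 'wb') as f:
    f.write(b'RIFF')

with open(path, 'rb') as f:
    binary_data = f.read(4)   # b'RIFF' -- a bytes object
with open(path, 'r') as f:
    text_data = f.read(4)     # 'RIFF' -- a decoded str

print(binary_data, text_data)  # b'RIFF' RIFF
```

Binary mode hands back the raw bytes untouched, which is exactly what parsing a WAVE header needs.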

chunk_id = wav_file.read(4)  # See the WAVE file format docs
print(chunk_id)

The chunk_id should be b'RIFF'; the b prefix is Python’s way of marking a bytes literal, a sequence of raw bytes rather than a string. In this case, the printed bytes map to the ASCII codes for the string RIFF, matching what is expected as the first piece of a WAVE header. The number 4 is the number of bytes to read.

4 (Bytes) ChunkID Contains the letters “RIFF” in ASCII form [WAVE format]

Bytes in Python are not the same as strings; when printing, Python renders the bytes in a format that is easier for the programmer to understand, displaying recognized byte values as their ASCII characters where it can. In this case Python displayed the ASCII word RIFF. That particular ChunkID also tells you the file’s multi-byte fields are stored little-endian (a big-endian RIFF file would start with RIFX instead). To convert these bytes to a string, you can tell Python to decode each byte using a certain character encoding.

b'RIFF'.decode('ascii') == 'RIFF'
'RIFF'.encode('ascii') == b'RIFF'

After the ChunkID, there’s a section of the file that is stored as a number. Converting bytes to a number isn’t as straightforward as the string example because numbers can be stored in different formats, each using a different number of bytes.

import struct

chunk_size = struct.unpack('<I', wav_file.read(4))[0]

Reading the chunk size requires a few pieces of information in advance. The first piece is that the chunk size is stored in 4 bytes; this is read with wav_file.read(4).

Using struct.unpack allows converting these bytes into an integer that can be further worked with. To use struct.unpack you need to know what type of integer these bytes represent. The documentation on the WAVE file format is clear that the chunk size is an unsigned integer (a non-negative whole number).

The first parameter to unpack is '<I'; this format string gives two pieces of information. The < tells unpack that the bytes are in little-endian format, which is known because the ChunkID was RIFF instead of RIFX. The I tells unpack that the bytes form a four-byte unsigned integer, matching the four bytes read with read(4).
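The behavior of '<I' can be checked in isolation on a hand-crafted byte string (the bytes below are made up for the demo, not taken from a real file):

```python
import struct

# '<I' = little-endian unsigned 32-bit integer.
# Little-endian means the least significant byte comes first,
# so b'\x24\x08\x00\x00' is the integer 0x00000824 == 2084.
(value,) = struct.unpack('<I', b'\x24\x08\x00\x00')
print(value)  # 2084
```

Note that unpack always returns a tuple, which is why the parser indexes the result with [0].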

The last field of a WAVE file is the actual data you care about. The header gave all the information required to parse it; the most important field is BitsPerSample (16 in this case). The final field holds the actual WAVE data stored as short signed integers, the 'h' struct format.

16-bit samples are stored as 2's-complement signed integers, ranging from -32768 to 32767.
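The 'h' format can be tried out on the two extreme values before using it in the sample loop (the byte strings are crafted for the demo):

```python
import struct

# '<h' = little-endian signed 16-bit ("short") integer.
# 0x8000 is the most negative value; 0x7FFF is the most positive.
lowest = struct.unpack('<h', b'\x00\x80')[0]   # -32768
highest = struct.unpack('<h', b'\xff\x7f')[0]  # 32767
print(lowest, highest)  # -32768 32767
```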

samples = []
bytes_per_sample = bits_per_sample // 8
sample_count = sub_chunk_2_size // bytes_per_sample

for _ in range(sample_count):
    samples.append(struct.unpack('<h', wav_file.read(2))[0])

Most WAVE files can be read using the logic you now know from doing this project.
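Pulling the fragments above together, here is one way they could be assembled into a complete parser. This is a sketch that assumes a canonical PCM WAVE file with a 44-byte header and no extra chunks; the function name parse_wave_header and the returned dictionary keys are invented for this example.

```python
import struct

def parse_wave_header(filename):
    """Parse a canonical PCM WAVE file (44-byte header, no extra chunks)."""
    with open(filename, 'rb') as wav_file:
        assert wav_file.read(4) == b'RIFF'            # ChunkID
        struct.unpack('<I', wav_file.read(4))         # ChunkSize (unused here)
        assert wav_file.read(4) == b'WAVE'            # Format
        assert wav_file.read(4) == b'fmt '            # Subchunk1ID
        struct.unpack('<I', wav_file.read(4))         # Subchunk1Size (16 for PCM)
        audio_format, num_channels = struct.unpack('<HH', wav_file.read(4))
        assert audio_format == 1                      # 1 == PCM
        sample_rate, _byte_rate = struct.unpack('<II', wav_file.read(8))
        _block_align, bits_per_sample = struct.unpack('<HH', wav_file.read(4))
        assert wav_file.read(4) == b'data'            # Subchunk2ID
        sub_chunk_2_size = struct.unpack('<I', wav_file.read(4))[0]

        bytes_per_sample = bits_per_sample // 8
        sample_count = sub_chunk_2_size // bytes_per_sample
        first_sample = struct.unpack('<h', wav_file.read(2))[0]
        second_sample = struct.unpack('<h', wav_file.read(2))[0]
        return {
            'num_channels': num_channels,
            'sample_rate': sample_rate,
            'first_sample': first_sample,
            'second_sample': second_sample,
            'length_in_seconds': sample_count / num_channels / sample_rate,
        }
```

Printing these values in the format shown earlier reproduces the project’s expected output.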

Parse a WAVE file in Python, the harder way.

But this isn’t the easiest way to parse a WAVE file in Python. In fact, Python has a built-in library designed for parsing WAVE files because the format is so easy to work with. The library removes all that code and replaces it with a simple set of library calls.

import struct
import wave

def parse_wave_python(filename):
    with wave.open(filename, 'rb') as wave_file:
        sample_rate = wave_file.getframerate()
        length_in_seconds = wave_file.getnframes() / sample_rate

        first_sample = struct.unpack(
            '<h', wave_file.readframes(1))[0]
        second_sample = struct.unpack(
            '<h', wave_file.readframes(1))[0]
        print('''
Parsed {filename}
-----------------------------------------------
Channels: {num_channels}
Sample Rate: {sample_rate}
First Sample: {first_sample}
Second Sample: {second_sample}
Length in Seconds: {length_in_seconds}'''.format(
            filename=filename,
            num_channels=wave_file.getnchannels(),
            sample_rate=wave_file.getframerate(),
            first_sample=first_sample,
            second_sample=second_sample,
            length_in_seconds=length_in_seconds))

parse_wave_python('../train/audio/bed/00176480_nohash_0.wav')

So why would I suggest building your own WAVE file parser when there is already a library to do it? TensorFlow can be fairly complicated under the hood when working with its core logic. While learning TensorFlow’s core implementation details, I’ve found it easier to begin with subjects I know well. Now that you know how a WAVE file is stored, I hope you’ll find it easier to read over the code TensorFlow uses to do the same work.

Project 1–2: Parse a WAVE file with TensorFlow

The goal of this project is to load a WAVE file with TensorFlow, output the same information displayed in Project 1–1, and then answer the following questions about TensorFlow’s WAVE file decode operation, explaining why in each case.

  1. Could wav_io load an 8 bit WAVE file?
  2. Could wav_io load a PCM WAVE file?
  3. What would happen if wav_io tried to load a non PCM WAVE file?
  4. What would happen to the bytes found in a WAVE file’s extra parameters section?
  5. How does TensorFlow convert the integer WAVE data to a float?

Note: It’s best to avoid using TensorFlow’s FFmpeg module for this project due to instability on certain platforms (Windows).

Loading files in TensorFlow is fairly straightforward and well documented. WAVE files are simple to work with but require all the samples to be loaded at once.

Project 1–2: Implementation

Working with WAVE files in TensorFlow requires the DecodeWavOp, which is located in the tensorflow.contrib.framework.python.ops.audio_ops Python module (Windows r1.4 won’t have these generated ops due to a missing line in a CMake file; use tf-nightly instead). An example of using this operation can be found in the speech commands example.

import tensorflow as tf
from tensorflow.contrib.framework.python.ops import audio_ops as contrib_audio

audio_binary = tf.read_file(filename)
desired_channels = 1
wav_decoder = contrib_audio.decode_wav(
    audio_binary,
    desired_channels=desired_channels)

In TensorFlow, contrib marks functionality that is experimental and may not be ready for production use. In r1.4 the required operation lives in a contrib module, while in r1.5 it has been promoted to core. An important note: tf.read_file reads the entire WAVE file at once, while the pure Python code implemented previously reads one sample at a time. For small WAVE files this doesn’t matter much, but with large WAVE files it’s possible that your system runs out of memory due to the size of the Tensor being loaded.

The decode operation accepts a parameter that reduces the number of channels returned. For this example, setting it to 1 makes sense because the original WAVE file has a single channel, so any further channels would be blank anyway. This parameter is useful if you have more channels of audio than you need and would like to reduce the resulting Tensor’s size.

with tf.Session() as sess:
    sample_rate, audio = sess.run([
        wav_decoder.sample_rate,
        wav_decoder.audio])

The DecodeWavOp gives access to two pieces of information about the WAVE file, sample_rate and audio. Sample rate is exactly the same as in your WAVE file parser: the number of samples per second in the data. The audio tensor contains all the samples; it’s the same as the samples array used in your implementation.

first_sample = audio[0][0] * (1 << 15)
second_sample = audio[1][0] * (1 << 15)

OK, this seems odd, but it highlights a big difference when dealing with TensorFlow’s DecodeWavOp. The audio[0][0] sample is a float, while the first sample pulled from the raw WAVE file was a short signed integer. The DecodeWavOp performs a useful normalization step to make the data easier to work with: each sample is now a float between -1 and 1. DecodeWavOp does this by dividing each short signed integer by 32,768, the magnitude of the most negative value a 16-bit signed integer can hold. That constant is written here as 1 << 15: a 1 bit shifted left 15 places, landing in the 16th bit position. Multiplying by 1 << 15 simply reverses the division to recover the original integer sample.
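The normalization and its reversal can be checked with plain Python arithmetic (the sample value 16384 is an arbitrary example, not taken from the file):

```python
# A raw 16-bit sample divided by 2**15 lands in [-1, 1),
# and multiplying back by 2**15 recovers the integer.
raw_sample = 16384                     # short signed integer from the data
normalized = raw_sample / (1 << 15)    # 16384 / 32768 == 0.5
restored = int(normalized * (1 << 15))
print(normalized, restored)  # 0.5 16384
```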

import tensorflow as tf
# On Windows, requires tf-nightly (see the r1.4 CMake note above)
from tensorflow.contrib.framework.python.ops import audio_ops as contrib_audio

def parse_wave_tf(filename):
    audio_binary = tf.read_file(filename)
    desired_channels = 1
    wav_decoder = contrib_audio.decode_wav(
        audio_binary,
        desired_channels=desired_channels)
    with tf.Session() as sess:
        sample_rate, audio = sess.run([
            wav_decoder.sample_rate,
            wav_decoder.audio])
    first_sample = audio[0][0] * (1 << 15)
    second_sample = audio[1][0] * (1 << 15)
    print('''
Parsed {filename}
-----------------------------------------------
Channels: {desired_channels}
Sample Rate: {sample_rate}
First Sample: {first_sample}
Second Sample: {second_sample}
Length in Seconds: {length_in_seconds}'''.format(
        filename=filename,
        desired_channels=desired_channels,
        sample_rate=sample_rate,
        first_sample=first_sample,
        second_sample=second_sample,
        length_in_seconds=len(audio) / sample_rate))

parse_wave_tf('../train/audio/bed/004ae714_nohash_1.wav')
  1. Could wav_io load an 8 bit WAVE file? No, there is code in wav_io.cc that explicitly checks that every WAVE file loaded is 16 bits per sample. The wav_io.cc file is a great place to start looking at TensorFlow’s logic for dealing with WAVE files. The logic is nearly identical to the Python implementation you created before.
  2. Could wav_io load a PCM WAVE file? Yes; PCM is all that wav_io.cc is set up to work with. There’s an explicit check that returns an error if the format isn’t PCM.
  3. What would happen if wav_io tried to load a non PCM WAVE file? It’d return an error, based on the check noted in #2.
  4. What would happen to the bytes found in a WAVE file’s extra parameters section? Nothing, they’re silently ignored by moving the offset variable by two bytes. The offset variable isn’t used in the Python implementation because Python’s read method moves the offset in the file automatically each time it’s called.
  5. How does TensorFlow convert the integer WAVE data to a float? It converts every sample to a float by dividing it by the maximum short signed integer size.

Next

In my book, I’ve been focusing on the Python and C++ API primarily for implementing ML and AI projects. Digging into the design principles behind TensorFlow is something I enjoy and would like to share. Unfortunately, it’s hard to dig deep into TensorFlow’s core in a single post but I’d like to go further, especially with the recent Fourier transform additions. If you found this useful, please let me know and provide criticism so that I may improve.

Dig deeper on your own to see how these operations are created; I recommend looking at how the following files contribute to the DecodeWavOp. Reading over these files highlights how TensorFlow communicates across its APIs.

Bazel build files:

Changed CMake file including the audio_ops:

The file used in Project 1–2:

Connecting the Python method call to the TensorFlow op:

The actual TensorFlow op:

Test files for wav_io:
