Representation learning and language

Can machines develop their own language?

--

Do machines have their own language? [SOURCE]

Despite the title, this post is not about natural language processing with deep models. That topic has been discussed in depth by many smart people (e.g. Sebastian Ruder) and is rather application-oriented. Instead, we are going to take a more philosophical look at deep learning, and particularly at deep generative models, from the perspective of language.

Let us begin by understanding what we mean by “language”. And, more importantly, what we mean by “mean”.
There are many different theories that try to explain language at various levels, but fundamentally we can understand language as a way to communicate concepts using symbols. A nice way to visualize this idea is the semiotic triangle, developed by Charles Kay Ogden and Ivor Armstrong Richards in their work The Meaning of Meaning.

The semiotic triangle [SOURCE]

The corners of the triangle are the referent, the thought or reference, and the symbol. The referent is an entity in the real world, for instance a cat. The reference would be my mental image or concept that gets evoked when I see that particular cat. The symbol, finally, is an abstract and communicable representation of that reference which I choose in order to share my mental state with others. In this case, I could choose the symbol “CAT” and say “Look at that cute cat”.
The key insight is that there is no direct relation between the symbol “CAT” and the actual cat. There is only an imputed relation, which holds if the reference is adequate and the symbol is correct. An adequate reference means that seeing the cat actually evokes my mental concept of a cat and not, for instance, that of a dog. Given that reference, I then have to choose a correct symbol to communicate, meaning that the symbol should indeed symbolize that reference and therefore be suitable to evoke the same reference in your mind.
From this definition, it immediately becomes clear that in order for successful communication to happen, we need to share a system in which the same symbols are correct for the same references. This system would then be called a common language.

Now let’s apply our new semiotic knowledge to representation learning algorithms. Despite some reports equating the hidden representations in deep neural networks to a language of their own, it has to be noted that these representations are usually vectors in continuous spaces, not discrete symbols as in our semiotic model.
One of the main difficulties in finding a common language to communicate in is precisely the fact that we can only have a finite number of symbols (which moreover have to be learnable in a realistic amount of time) and that you cannot easily interpolate between symbols. A good example of this property is color names: while you might have a symbol called “CYAN” for a color between blue and green, and maybe even a few more fine-grained descriptors (if you work in graphic design), you won’t have a symbol for every wavelength in the visible electromagnetic spectrum, and there is no easy way to generate new ones.
Hidden representations in neural networks, by contrast, span a continuous space, such that there are potentially infinitely many of them and interpolation between any two is often as trivial as taking the arithmetic mean.
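
To make this contrast concrete, here is a minimal sketch in PyTorch; the random vectors are merely stand-ins for encoder outputs:

```python
import torch

z_a = torch.randn(32)        # stand-in for the latent code of one input
z_b = torch.randn(32)        # stand-in for the latent code of another input
z_mid = 0.5 * (z_a + z_b)    # a perfectly valid point "between" the two
```

There is no comparably simple operation on discrete symbols like “CYAN” and “BLUE”.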

The question is therefore whether there are deep models whose latent space resembles our semiotic definition of language. The answer, and also the main reason for this post, is: yes, there are!
The model is called Vector Quantised Variational Autoencoder (VQ-VAE) and has recently been published by some researchers at DeepMind.

Schematic mechanism of the VQ-VAE [SOURCE]

Let’s take the name of this model apart to see what it means. An autoencoder is a neural network consisting of an encoder, a decoder and a latent space. The encoder maps a given input into the latent space (which is usually much smaller than the input space) and the decoder tries to reconstruct the input from this representation. This can generally be seen as a form of nonlinear dimensionality reduction.
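
As a minimal sketch (in PyTorch, with illustrative layer sizes and flattened inputs rather than any particular published architecture), an autoencoder might look like this:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: maps the input into a much smaller latent space.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Decoder: tries to reconstruct the input from the latent code.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)      # nonlinear dimensionality reduction
        return self.decoder(z)   # reconstruction of the input

# Training simply minimizes the reconstruction error, e.g.
# loss = ((model(x) - x) ** 2).mean()
```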
A variational autoencoder (VAE) is a special type of autoencoder, where the mappings are not deterministic, but probabilistic. This means that one can feed the same input into the model several times and the output will look slightly different. At the same time, the latent representations are regularized to follow a certain distribution, such that we can simply pick a point in the latent space at random and use the decoder part of the network as a generative model.
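
A rough sketch of this probabilistic mapping, assuming the standard Gaussian formulation of the VAE (layer sizes again purely illustrative):

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.hidden = nn.Linear(input_dim, 256)
        self.to_mu = nn.Linear(256, latent_dim)      # mean of the latent Gaussian
        self.to_logvar = nn.Linear(256, latent_dim)  # log-variance of the latent Gaussian
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        h = torch.relu(self.hidden(x))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Probabilistic encoding: sample z around mu, so feeding the same
        # input several times gives slightly different reconstructions.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        x_hat = self.decoder(z)
        # Regularize the latent codes towards a standard normal distribution,
        # so that random points z ~ N(0, I) can later be decoded generatively.
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return x_hat, kl
```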
Finally, vector quantization (VQ) is a signal processing technique in which a probability distribution over a space is modeled by a number of prototypical points from that space. It can also be used for clustering, by assigning every point in the space to its closest prototype. VQ has the interesting property of being density matching, which means that there will be more prototypes in parts of the space that carry more probability mass.
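
Stripped of any learning, the core operation of vector quantization is just a nearest-prototype lookup; here is a small sketch (the function name and tensor shapes are my own choices, not from the paper):

```python
import torch

def quantize(z, prototypes):
    """Assign each latent vector to its nearest prototype.

    z:          (batch, dim) points in the latent space
    prototypes: (K, dim)     the K prototype vectors
    """
    distances = torch.cdist(z, prototypes)   # pairwise Euclidean distances, (batch, K)
    indices = distances.argmin(dim=1)        # index of the closest prototype
    return prototypes[indices], indices      # quantized vectors and their "symbols"
```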

Given these intuitions, we can try to understand what the VQ-VAE does: it is a variational autoencoder, meaning it maps inputs into a latent space and reconstructs them probabilistically, but at the same time it also performs vector quantization on that latent space!
That means that the VQ-VAE learns a mapping from inputs to latent representations and then in turn maps these representations to their respective nearest prototype representation. The decoder then has to learn to reconstruct the input from this prototype.
Encoder, decoder and prototypes (called embeddings in the paper) are learned more or less simultaneously, such that the encoder can learn to map similar inputs to the same prototype and the prototypes can match the density of latent representations.
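
Putting the pieces together, a strongly simplified sketch of the VQ-VAE idea could look as follows. This is not the architecture from the paper (which uses convolutional encoders and decoders plus additional codebook and commitment loss terms); it only illustrates the encode, snap-to-prototype, decode mechanism:

```python
import torch
import torch.nn as nn

class VQVAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32, num_prototypes=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))
        # The "vocabulary": a table of learnable prototype vectors
        # (called embeddings in the paper).
        self.codebook = nn.Embedding(num_prototypes, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, input_dim))

    def forward(self, x):
        z = self.encoder(x)                               # continuous latent code
        distances = torch.cdist(z, self.codebook.weight)  # (batch, num_prototypes)
        indices = distances.argmin(dim=1)                 # discrete "symbol" per input
        z_q = self.codebook(indices)                      # nearest prototype vector
        # Straight-through trick: the argmin is not differentiable, so gradients
        # are copied from the quantized code back to the encoder output.
        z_q = z + (z_q - z).detach()
        x_hat = self.decoder(z_q)                         # reconstruct from the prototype
        return x_hat, indices
```

The returned indices are the discrete symbols of our semiotic analogy: a finite vocabulary from which the decoder has to reconstruct a reference.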

Going back to our semiotic model, we now start to see many parallels. We could for instance take a photo of a dog (referent) with our smartphone, which would yield a digital encoding of what that dog looks like (reference).
We can now view the VQ-VAE as an act of communication between the encoder and decoder networks. The encoder gets the image of the dog (which is arguably as adequate a reference as any digital photo) and has to find a correct symbol to facilitate successful communication. It therefore has to map the input (reference) to the correct prototype (symbol) in the latent space (vocabulary), such that the decoder can map it back to an image (reference) which is as close as possible to the input.
It should be emphasized that the encoder and decoder learn the symbols and the mappings to the references at the same time, much like two children inventing their own words for objects and sharing them by pointing at the respective things. If news about neural networks inventing their own secret language were ever appropriate, it should be about the VQ-VAE!
Notably, because of the vector quantization’s density matching property, the learned symbols depend on the input distribution: a VQ-VAE trained with many pictures of dogs and only a few pictures of cats might invent numerous symbols for different kinds of dogs, but only a handful of symbols for cats, much like human dog breeders with their specialized vocabulary or graphic designers who have many names for different color nuances, because they encounter these kinds of referents and references more often.

We can conclude that the vector quantised variational autoencoder, with its discrete latent representations, can be described in the terminology of the semiotic triangle and is therefore arguably the deep learning model that comes closest to human communication. This suggests many exciting avenues for research, for instance regarding interpolation between different prototypes in the VQ-VAE’s latent space and how this operation could be understood semiotically.
Another interesting idea would be to use the VQ-VAE for human language instead of images. This has indeed been tried by the authors of the paper and led to an unsupervised discovery of certain phonemes in speech data.
Future research regarding the VQ-VAE and similar models promises deeper insights into semiotically plausible data encoding and generative modeling.

If you liked this story, you can follow me on Medium and Twitter.
