An intuitive, high-level framework to understand the technical trends in Artificial Intelligence

--

I was recently reading this article titled “AI Is About to Learn More Like Humans — with a Little Uncertainty”, which I found very interesting because it tackles a core debate in AI today. However, I did not find it entirely straightforward, so I wanted to take a step back and explain what is at stake behind it, connecting ideas to provide more background and more concrete examples. I am suggesting a framework that I find relevant, although it necessarily contains simplifications, and I would be happy to receive constructive feedback on the article. Finally, note that I have tried to include many references to good resources for curious readers who want to dig further.

You might want to read the article before going further, but you can also continue with this very short summary of it: basically, Deep Learning (DL) has been the technology enabling most of the recent advances in AI. However, it has several limitations that make an increasing number of people think it might not be the right path to take artificial intelligence to the next level. We will try to elaborate on these different trends with the following steps.

As an introduction, we will define Artificial Intelligence a bit more precisely and see that it can be used for different purposes. In a first part, we will detail how the different applications of AI imply different constraints. In a second part, we will detail the main technical views in AI and relate them to our topic. In a third part, we will try to provide concrete examples of how these different views approach a similar problem differently.

Introduction — Defining AI

For a start, we will define AI, as it can be a confusing term. I like to consider a broad definition of AI, such as the “simulation of intelligent behavior in computers”, with intelligent meaning “able to vary its state or action in response to varying situations”. This means that a mere “IF THIS DO THAT” rule would count as AI, although a very low-level one. It is also useful to distinguish narrow AI from general AI. Narrow AI can only achieve the specific task it was programmed for, such as playing Go, and will be totally incapable of anything else. General AI, which doesn’t exist yet, should be able to perform multiple different tasks without being reprogrammed.

There are so many applications of AI that it is hard to make a simple yet relevant segmentation, but the simplest one distinguishes between perception and cognitive tasks:

  • Perception: This boils down to mimicking our five senses, in order to receive information from our environment just as humans do. The main applications today include computer vision and speech recognition.
  • Cognition: This is about mimicking our mental processes. It gathers a wide range of decision-making applications that will often mix reflection (e.g. precise calculations) and more intuitive judgments (e.g. making forecasts).

In between perception and cognition lies, in my opinion, a step I would call “Understanding”: the most basic cognitive task of making sense of what you perceive, without yet making any decision that would require more complex mechanisms.

Let me quickly illustrate these three aspects with driving. Let’s say you’re in your car and come across a dog on the side of the road:

Perception would be to identify that the thing standing is a dog. Understanding would be to infer that a dog is an animal, that it can be unpredictable and might jump on the road without prior warning. (Reasonable) decision making would be to slow down.

These three steps are seamless to a human, but each of them implies quite different goals and constraints. Let’s try to identify them.

Why does such a distinction matter?

In this part, we will intuitively qualify the constraints and requirements of each task, at a very high level. This will prove enlightening when we start discussing the technical trends in the next part.

Perception

You expect Perception algorithms to be very accurate. The first reason is that human intelligence is most often the benchmark for AI performance (which could, by the way, be questioned), and we are very accurate at perception. Moreover, wrong perception is just like low-quality data: it will almost certainly mess up the whole decision-making process downstream. A fatal accident last year involved a Tesla in autopilot mode, which reportedly did not see a white truck pulling out of a side road against a bright sky, and did not stop.
To give you a sense of what very accurate means, Google’s FaceNet algorithm reached 99.96% accuracy at recognizing human faces in 2015, even surpassing human capabilities.

Understanding

Just like perception, you would expect understanding to be very accurate. In a sense, understanding is just a more complex form of perception, and we humans are also very good at it. In a Quora answer, Peter Norvig explains Google’s “AI first” strategy and mentions the following:

“With information retrieval, anything over 80% recall and precision is pretty good — not every suggestion has to be perfect, since the user can ignore the bad suggestions. With assistance, there is a much higher barrier. You wouldn’t use a service that booked the wrong reservation 20% of the time, or even 2% of the time. So, an assistant needs to be much more accurate, and thus more intelligent, more aware of the situation.”

Acquiring this common sense is all about dealing with the ambiguity of our world, thanks to contextual awareness and reasoning that are intuitive for any human but too large and unstructured for an AI to grasp today. This is the reason why any chatbot you talk to will turn out to be not so smart after a few exchanges: they use a massive amount of text data to produce answers that only fake intelligence, to a small extent. Microsoft learned this the hard way with their chatbot Tay, which started tweeting nazi ideas as it merely repeated what cunning followers were telling it. Even the fathers of Siri and M for Messenger acknowledge that their systems are nowhere near intelligence (article in French).

Cognition / Decision making

Decision making is very different, as the concept of accuracy becomes fuzzy. Good decisions strike a balance between the risks taken and the potential gains, and there are often several good and bad decisions that rely on both reflection (available data / precise calculations) and intuition (unavailable data / lack of time), with a blurred line in between. In this context, accuracy is not entirely relevant. You will rather expect decision making to be robust, in the sense that it keeps a reasonable balance between risks and potential gains even with deviations in the data, but also explainable, as you will be held accountable for it, and unbiased.

AlphaGo is a very interesting example of a decision-making system. First, because it has a genuinely robust decision-making strategy of maximizing the chance of winning rather than the winning margin, as was illustrated during the game against the world’s best Go player in May 2017. There were certainly multiple moves that would have led AlphaGo to victory, but it chose them according to this specific strategy. Second, it is interesting because both aspects of decision making (reflection and intuition) are present in its play. Indeed, AlphaGo relies on two scores, with equal weights, to assess the next best move (a minimal sketch of this combination follows the list below):

  • The first score comes from a value network which simply gives, for a given state of the game, the probability to win the game. That’s the intuition part. It’s like when you pop up in the middle of a game between two friends and after a quick look at the board you are able to tell one of them “It looks like you’re having a hard time.”
  • The second score comes from a Monte Carlo Tree Search algorithm that simulates thousands of potential outcomes. This is the reflection part.
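
To make this concrete, here is a minimal Python sketch of how such an equal-weight combination could look. It is not AlphaGo’s actual implementation: value_net, simulate_rollout and state.play are hypothetical interfaces standing in for the value network, the Monte Carlo simulations and the game engine.

```python
def evaluate_move(state, move, value_net, simulate_rollout, n_rollouts=100):
    """Score a candidate move by mixing 'intuition' and 'reflection'.

    value_net(state) -> estimated probability of winning from `state` (intuition).
    simulate_rollout(state) -> 1 if a simulated playout from `state` ends in a
    win, else 0 (reflection, repeated to approximate a win rate).
    """
    next_state = state.play(move)

    # Intuition: a single evaluation by the value network.
    intuition = value_net(next_state)

    # Reflection: average outcome of many simulated playouts.
    reflection = sum(simulate_rollout(next_state)
                     for _ in range(n_rollouts)) / n_rollouts

    # Equal weighting of the two scores, as described above.
    return 0.5 * intuition + 0.5 * reflection


def choose_move(state, legal_moves, value_net, simulate_rollout):
    # Maximize the probability of winning, not the expected winning margin.
    return max(legal_moves,
               key=lambda m: evaluate_move(state, m, value_net, simulate_rollout))
```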

However, AlphaGo can be seen as a rather simple decision-making system, because it does not have to make its decisions with hidden information (unlike Libratus, the AI playing poker!). Also, no one can really explain AlphaGo’s moves. When DeepMind retired AlphaGo from international competition, they said they would work on a “teaching tool […] providing an insight into how the program thinks, and hopefully giving all players and fans the opportunity to see the game through the lens of AlphaGo.”

Conclusion

Intuitively, we see that these three aspects of AI are under very different constraints, which are summarized below.

As we will see, adapting to these constraints can lead to different technical approaches that we are now going to detail.

What have been the main technical trends in AI?

Most of the recent accomplishments in AI have been enabled by the development of Deep Learning (DL), a technique that exposes multilayered neural networks to vast amounts of data. The expansion of the internet has been a precious source of data, which new-generation chips suited for DL are able to process. AlphaGo, for instance, could not have existed without the roughly 160,000 human games available on the KGS Go server, from which it learned in the first place.

But many researchers actually see this need for large datasets as a genuine limitation of DL. To illustrate their point of view, they usually refer to the way toddlers learn. A child might be told a cat is a cat a few times, but not a million times. In a similar way, a child will easily understand that if one bird can fly, other birds can fly too, unless they have a broken wing. This simple piece of reasoning is anything but obvious for a DL system to figure out by itself. As we reach “the end of the beginning” for AI, as Fei-Fei Li, chief scientist for Google Cloud, puts it, these experts think that building systems that simply process even more data is not the right way to come closer to human intelligence, and that other approaches are required.

We can say that there have traditionally been three approaches in AI: the logical approach, also called symbolic; the deterministic/frequentist approach; and the probabilistic/Bayesian approach.

The logical/symbolic approach

Symbolic artificial intelligence is the collective name for all methods in AI research that are based on high-level, human-readable representations of problems, logic and search. Symbolic AI was the dominant paradigm of AI research from the mid-1950s until the late 1980s. Note that logic is a vast field that can model more complex situations than you may think, as it includes many different kinds of logic: first-order logic, propositional logic, modal logic, temporal logic, and so on.

The most successful form of symbolic AI is expert systems, which use a network of production rules hard coded by experts in the form of “IF THEN” statements, following the best practices acquired through their domain knowledge. Because they are easily understandable by humans, these systems have a genuine advantage when it comes to building AI for decision making.
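
To give a feel for what such a system looks like in code, here is a minimal sketch of a forward-chaining production-rule engine in Python. The facts and rules are purely illustrative (they encode the dog-on-the-roadside example from the introduction), and real expert-system shells are of course far more sophisticated.

```python
# A minimal sketch of a production-rule ("IF THEN") system.
# The rules and facts are hard coded the way a domain expert would write
# them; nothing here is learned from data.

facts = {"animal_on_roadside": True, "animal_is_unpredictable": True}

rules = [
    # (condition over the current facts, fact to add when the rule fires)
    (lambda f: f.get("animal_on_roadside") and f.get("animal_is_unpredictable"),
     ("risk_of_sudden_crossing", True)),
    (lambda f: f.get("risk_of_sudden_crossing"),
     ("recommended_action", "slow_down")),
]

# Forward chaining: keep applying rules until no new fact is produced.
changed = True
while changed:
    changed = False
    for condition, (key, value) in rules:
        if condition(facts) and facts.get(key) != value:
            facts[key] = value
            changed = True

print(facts["recommended_action"])  # -> slow_down
```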

These systems are designed so that all the necessary information is provided by humans. They learn nothing from data, because the experts assess that they already have all the required knowledge at hand to build systems that create value. In return, they are entirely transparent to humans. These hard-coded systems may be considered AI under a broad definition, but certainly not machine learning, in which the idea of a machine improving itself with experience (i.e. data) is central.

The machine learning approach

Before elaborating on the two different approaches in machine learning, let’s quickly explain the difference between the frequentist and the Bayesian philosophies.

Frequentists use probabilities as frequencies to describe past events. They will say: “Let’s toss a coin 100 times to approximate the true, unknown probability of heads.” And after it has come up heads 48 times, they will tell you: “I am 95% confident that the interval [0.38–0.58] contains the true probability of heads.” Because it relies on past data only, the frequentist approach is entirely objective. Randomness in this setup is merely due to incomplete sampling.
Bayesians use probability to express beliefs about future events, including one’s uncertainty. They will say: “I think the probability of heads follows a normal distribution centered on 0.5 with a standard deviation of, say, 0.01.” They hold this prior, subjective belief about the probability, which they consider not as an unknown single value but as a distribution reflecting uncertainty. Then they will toss the coin 100 times and, with similar results, say: “Looks like my belief was not bad; maybe my distribution should be centered slightly below 0.5.” They update the belief based on the observed data using Bayes’ theorem.
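
As a minimal numerical sketch of this update, here is the coin example in Python. I use a Beta prior (the standard conjugate choice for a probability) as a stand-in for the normal distribution mentioned above; the 50/50 prior strength is an arbitrary illustration.

```python
from scipy.stats import beta

# Prior belief about the probability of heads: Beta(50, 50), i.e. roughly
# "centered on 0.5 with a small spread", standing in for the normal
# distribution mentioned above.
prior_heads, prior_tails = 50, 50

# Observed data: 48 heads out of 100 tosses.
heads, tosses = 48, 100

# Bayes' theorem with a conjugate prior reduces to simple counting.
post_heads = prior_heads + heads
post_tails = prior_tails + (tosses - heads)

posterior_mean = post_heads / (post_heads + post_tails)
low, high = beta.interval(0.95, post_heads, post_tails)
print(f"posterior mean for P(heads): {posterior_mean:.3f}")   # ~0.49
print(f"95% credible interval: [{low:.2f}, {high:.2f}]")
```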

There is an excellent article that explains the practical implications in detail, but I will stay at a higher level in the explanations below, and explain the difference between the two philosophies along with the distinction between deterministic and probabilistic models.

The deterministic/frequentist approach

Deterministic models assume that there is no randomness in the phenomenon they are trying to model, or at least that it is negligible. Indeed, the deterministic approach assumes that most hidden variables are either a cause or a correlated effect of the observed ones, which mathematical techniques such as Factor Analysis can approximate. It means that the model will be made of observed variables only, or, most likely, a combination of them. The probabilities that such a model outputs are comparable to frequencies computed from the past data it is fed with.

Although all kinds of neural networks exist today, much of their success has come as deterministic models trained in a supervised way. They are capable of modeling extremely complex patterns in the data (i.e. reconstructing latent variables that humans could not) but, because of the number of parameters they have to optimize (one per connection between neurons, also called “units”, that is one per gray line in the image below), they usually require huge amounts of data before reaching satisfying levels of accuracy.

A neural network representation (source)
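
To see why the amount of data matters, here is a small back-of-the-envelope sketch that counts the parameters of a fully connected network (one weight per connection, plus biases). The layer sizes are just an illustrative example.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    `layer_sizes` lists the number of units per layer, e.g. [784, 512, 512, 10].
    Each connection between two consecutive layers carries one weight
    (one per gray line in the figure), plus one bias per non-input unit.
    """
    weights = sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
    biases = sum(layer_sizes[1:])
    return weights + biases

# Even a modest image classifier has hundreds of thousands of parameters:
print(count_parameters([784, 512, 512, 10]))  # 669706
```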

This is the reason why DL experts are convinced that, with enough data, enough computation power and the right algorithms, they can build systems that achieve human-level intelligence, while other experts would rather count on more probabilistic approaches.

For perception, DL (and more precisely Convolutional Neural Networks) is a very good approach, because there is no hidden information in a perception task: all the information needed is what you get to see, which makes it an essentially deterministic problem. DL can count on the huge amount of data available on the internet, sometimes hand-labeled, to learn these complex patterns. For the same reasons, Deep Learning has also shown promising results in Understanding tasks, notably with LSTM neural networks. However, because understanding often has to deal with ambiguity, the probabilistic approach can also perform very well there.
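
As an illustration of this deterministic, supervised recipe, here is a minimal convolutional classifier in PyTorch. It is only a sketch under simple assumptions (tiny 32x32 RGB images, 10 classes, random tensors standing in for a real labeled dataset), not any production perception system.

```python
import torch
import torch.nn as nn

# A minimal convolutional classifier in the spirit of the deterministic
# perception models discussed above.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # learn local visual patterns
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 16x16 -> 8x8
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                   # 10 object classes
)

# One supervised training step on a (stand-in) batch of labeled images.
images = torch.randn(8, 3, 32, 32)               # placeholder for real labeled data
labels = torch.randint(0, 10, (8,))
loss = nn.CrossEntropyLoss()(model(images), labels)
loss.backward()
```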

The probabilistic/Bayesian approach

Probabilistic models, on the other hand, assume there is randomness in the phenomenon they are trying to model, because of hidden information (hidden variables or missing data).

Most often, the concepts of randomness and probability are used as convenient tools to model deterministic phenomena and simplify the problem at hand, because getting access to enough information to make effectively deterministic predictions is usually uneconomical. In a nutshell, probabilistic modeling is often a heuristic approach to deal with uncertainty.

To model this randomness, probabilistic models will infer either:

  • A probability distribution for the variable Y given the observed variables X, i.e. P(Y|X). Below are illustrated different possible probability distributions vs. a categorical forecast, according to the level of uncertainty.
(source)
  • A joint probability distribution for both X and Y, i.e. P(X,Y), in a supervised learning problem or for X only, i.e. P(X), in unsupervised learning configurations. In this case, we talk about full probabilistic models.

These full probabilistic models are called generative models, because they not only have the ability to classify new unseen examples but can also generate some. They are opposed to discriminative models, which are only suitable for supervised learning problems and can only classify new unseen examples, because they only learn the decision boundary, not the whole distribution (see graph below).

For instance, let’s say you want to learn to tell Spanish and Portuguese apart. You have two ways to do so. Either you learn how to make the difference between the two languages (the pronunciation, …), which is the discriminative approach, or you learn Spanish and Portuguese, which is the generative approach. The latter is more difficult, but you will then be able to create your own sentences in either of the two languages.

(source)
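
Here is a small sketch of this contrast with scikit-learn on synthetic two-dimensional data: logistic regression (discriminative) only learns a decision boundary, while a naive Bayes classifier (generative) models how each class generates its points, which also lets you sample new examples. The data and numbers are purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression   # discriminative: learns P(Y|X)
from sklearn.naive_bayes import GaussianNB             # generative: models P(X|Y) P(Y)

# Synthetic 2-D data for two classes (purely illustrative).
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[-2, 0], scale=1.0, size=(200, 2))
X1 = rng.normal(loc=[+2, 0], scale=1.0, size=(200, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)

# Both can classify new, unseen points...
disc = LogisticRegression().fit(X, y)
gen = GaussianNB().fit(X, y)
print(disc.predict([[1.5, 0.3]]), gen.predict([[1.5, 0.3]]))

# ...but only the generative view lets you create new examples, e.g. by
# sampling from the per-class Gaussians it assumes (done by hand here).
class_mean = X1.mean(axis=0)
class_std = X1.std(axis=0)
new_example_of_class_1 = rng.normal(class_mean, class_std)
```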

Inferring the probability distribution is the key (and tricky!) part of probabilistic models. It gives you the opportunity to insert into your model the prior knowledge that you may have from your experience or domain expertise. This is where Bayesian inference, which we described earlier, and probabilistic programming come into play.

Probabilistic models are considered more explainable and also more robust, as they are designed to deal with uncertainty. They are models that “know when they don’t know”, whereas classical deterministic models might output results that are completely off the mark when the input is very different from the data they were trained on.
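
One simple way to get a model that “knows when it doesn’t know” is to train several models and measure how much they disagree on a new input; disagreement grows as the input moves away from the training data. Here is a toy sketch of that idea with a small ensemble of linear fits (this is one heuristic among many, not a full Bayesian treatment).

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data: y = 2x + noise, observed only for x in [0, 1].
x_train = rng.uniform(0, 1, size=50)
y_train = 2 * x_train + rng.normal(0, 0.1, size=50)

# A small "ensemble" of linear fits, each on a bootstrap resample.
slopes, intercepts = [], []
for _ in range(20):
    idx = rng.integers(0, len(x_train), size=len(x_train))
    slope, intercept = np.polyfit(x_train[idx], y_train[idx], deg=1)
    slopes.append(slope)
    intercepts.append(intercept)

def predict_with_uncertainty(x):
    preds = np.array([a * x + b for a, b in zip(slopes, intercepts)])
    return preds.mean(), preds.std()   # mean prediction and disagreement

print(predict_with_uncertainty(0.5))   # inside the training range: small spread
print(predict_with_uncertainty(10.0))  # far outside: much larger spread
```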

Conclusion

From the explanations above, we conclude that certain approaches are intrinsically more suitable for certain tasks. You will find below a summary of these findings. Please consider this framework with care: each of the three approaches (except maybe the symbolic one) has shown interesting results in virtually every AI task, and since they are often mixed, the reality is actually very complex today, as AI research is abundant. But I believe the general idea remains true and is a good framework to keep in mind to understand the current debates in AI.

How are these different approaches applied to real-world issues?

In this part, I will try to give a simple understanding of how the three approaches introduced in the previous part tackle the problems of Perception and Cognition differently, with examples as concrete as possible.

Perception

I won’t mention the symbolic approach for perception tasks, as I am not aware of any noteworthy perception system that relied on this technique.

Deterministic / DL approach
We already know how perception works with deterministic deep learning systems: neural networks are fed with huge amounts of labeled data in order to learn complex patterns and classify each object. The main issue is that labeled data can be costly to gather. For instance, one hour of video used to train autonomous vehicles reportedly takes 800 man-hours to label. Many businesses have emerged to fuel AI, sometimes based on crowdsourcing: thousands of Americans make a bit of money labeling videos for autonomous driving on mighty.ai.

Probabilistic approach
To explain how the probabilistic approach can help with perception tasks, I want to use Gamalon’s lamp example to illustrate how deep learning and probabilistic programs differ on such tasks. Gamalon is a start-up that boasts a new proprietary technology called “Bayesian Program Synthesis” (BPS), which learns generative Bayesian models from data. You can see a few talks and videos explaining their vision on their website; I’ll just try to summarize one of the examples below.

The example is based on Google’s Quick, Draw! app, which challenges you to draw a given object in under 20 seconds. While you’re drawing, a neural network tries to recognize the object you drew. You can try it here, it’s pretty fun!

Examples of lamps drawn by users

The deep learning technology learns from an ever-increasing number of labeled training examples provided by the users. It only gets good because it sees pretty much every possible kind of lamp drawing there is in the world; this is what it takes to optimize the millions of parameters of the network. In the end, it is very good at recognizing lamps, but only because it got to see a lot of them beforehand. It was, however, able to learn from scratch.

By contrast, here is how Gamalon presents their technology:

1- They teach their program what a lamp looks like. They give two examples, a lamp with a tall lamppost and one with a short lamppost, and enter “lamp” in the system. Because the program is probabilistic, it assumes that there is a probability distribution over the size of the lamppost (say, a normal distribution), rather than just these two examples. This is what Ben Vigoda, Gamalon’s CEO, calls “machines having ideas”. Thanks to these two examples (see first drawing), it understands that there is a high variance in the size of the lamppost, so the probability distribution might look something like this one:

2- Thanks to this distribution, it already generalizes the idea of a lamppost pretty well: it can come in many different sizes. This is the reason why, in the second drawing (see above), it is able to recognize lamps with different sizes of lamppost, which a neural network would be completely unable to do with only two training examples. This is how probabilistic programming is better at generalizing.

3- However, it did not learn any variation regarding the size of the bottom rectangle, nor the position of the lamppost (which is always at the center in the two training examples), so when you draw lamps that depart from this scheme, it won’t recognize them. This shows that the algorithm is still robust and does not over-generalize on elements it should not generalize.

In this case, Gamalon’s technology performs much better because it is not starting from scratch: it is provided with the idea of a lamp, made of elements that can vary in size according to a certain probability distribution.
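
To make the idea tangible, here is a toy sketch in the same spirit (it is in no way Gamalon’s actual BPS technology): a lamp is described by a base width and a post height, a distribution over the post height is estimated from the two examples, and new drawings are scored by their likelihood under this tiny generative model.

```python
import numpy as np
from scipy.stats import norm

# Two training examples of "lamp": (base_width, post_height), purely illustrative.
examples = np.array([
    [1.0, 5.0],   # tall lamppost
    [1.0, 2.0],   # short lamppost
])

# From only two examples, assume post height follows a normal distribution
# (high variance), while base width has shown no variation at all.
post_mu, post_sigma = examples[:, 1].mean(), examples[:, 1].std() + 0.1
base_mu, base_sigma = examples[:, 0].mean(), examples[:, 0].std() + 0.1

def lamp_score(base_width, post_height):
    """Likelihood that a drawing is a lamp under this tiny generative model."""
    return (norm.pdf(base_width, base_mu, base_sigma)
            * norm.pdf(post_height, post_mu, post_sigma))

print(lamp_score(1.0, 3.5))   # new, in-between post height: recognized as a lamp
print(lamp_score(1.0, 8.0))   # unusually tall post: lower but non-zero score
print(lamp_score(4.0, 3.5))   # very wide base, never seen: score close to zero
```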

Understanding

Before discussing the difference between deep learning and probabilistic programming, we will see how the issue of understanding has been approached with the symbolic/logic approach.

Symbolic approach
When it comes to common sense, humans are experts at it, so why not build an expert system of common sense? Put simply, this is the rationale behind the very ambitious Cyc project, started in 1984. With a team of PhDs in philosophy, Cyc set out to build a knowledge base with millions of rules that include simple truths like “you can’t be in two places at the same time” or “you can’t pick something up unless you’re near it”. The project now lives on in the start-up Lucid.ai, which was launched… in 2016, more than 30 years after the beginning of the project. This gives you a sense of the magnitude of the task, and this is the main issue with hard-coded logic. As techniques have improved, Cyc has grown quite lonely in this approach, as learning from data appears to be far more scalable. Although the Cyc project had some real-world applications, reportedly at Goldman Sachs or the NSA, we don’t hear much about it today. Learn more in this article.

Neural network approach with word embedding
Deterministic methods rely on a technique called “word embedding” for understanding tasks, and I found it quite well described in this article, so I am going to be lazy and just quote it:

“A neural network can “learn” words by spooling through text and calculating how each word it encounters could have been predicted from the words before or after it. By doing this, the software learns to represent every word as a vector that indicates its relationship to other words — a process that uncannily captures concepts in language. The difference between the vectors for “king” and “queen” is the same as for “husband” and “wife,” for example. The vectors for “paper” and “cardboard” are close together, and those for “large” and “big” are even closer.”

The most efficient implementations of word embedding are called “Word2Vec”, a group of algorithms based on shallow (two-layer) neural networks. It is important to understand that word embeddings can be learned from scratch from a large corpus of text.
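
For the curious, here is roughly what training such embeddings looks like with the gensim library. The corpus is a placeholder (a real run needs a very large text collection), and the parameter names assume gensim 4.x.

```python
from gensim.models import Word2Vec

# Placeholder corpus: in practice this would be a large collection of
# tokenized sentences scraped from the web or another text source.
corpus = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["paper", "and", "cardboard", "are", "similar", "materials"],
    # ... millions more sentences in a realistic setting
]

# A shallow (two-layer) network learns one vector per word by predicting
# words from their neighbors (skip-gram variant, sg=1).
model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=1)

# Words end up as dense vectors; with a large corpus, arithmetic such as
# king - man + woman ≈ queen emerges. On this toy corpus we can only
# inspect raw similarities:
print(model.wv.most_similar("king", topn=2))
```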

Probabilistic modeling
Gamalon’s technology, which we illustrated on a perception task, is also used to structure knowledge from unstructured data, reportedly for e-commerce and manufacturing players. One example they give, also on their website, is how their technology is able to correctly guess the meaning of abbreviations with significantly better accuracy than deep learning. Guessing whether “MA” means “Massachusetts” or “Moving Average” typically relies on contextual information and inference, which makes it a good candidate for probabilistic approaches.

Decision making

When it comes to decision making, autonomous driving is a very good illustration, because all three approaches have been used by the major players.

Symbolic approach
The symbolic approach has prevailed so far at most companies, such as Google, Tesla or General Motors. These companies use deep learning systems for perception tasks, but they subsequently decompose the decision-making process into different parts, such as lane-marking detection, path planning and control. With this traditional robotics approach, they are able to understand exactly what is going on in the system, analyze what goes wrong and fix it. When it can be a matter of life and death, explainability cannot be ignored. But you can imagine the limitations of such human-defined decision rules when there are so many possible situations.

Deep learning approach
Nvidia was the first company to develop an end-to-end deep learning system for steering autonomous cars (you can learn more in their research paper), with no symbolic rules between perception and decision making. Because of the explainability issue, they also built a system that highlights the inputs the network takes into account (more explanations here). This is obviously not a sufficient explanation of what actually made the system decide what to do, but it is a good start with interesting results, as you can see in the illustration below: the inputs that most influenced the decision are highlighted in green.

The deep learning approach can also rely on reinforcement learning (RL) and virtual reality (VR), which are used as state-of-the-art tools to train autonomous cars. In a word, RL is a technique that mimics how animals learn. It works well in situations where a positive outcome is frequent enough that, from a succession of random trials, an agent can progressively learn the actions that most likely lead to this positive outcome. It is at the core of AlphaGo’s success, which played millions of games against itself, and DeepMind has developed several other systems based on this approach. A car is actually able to train its system in virtual reality with reinforcement learning: it will crash millions of times in simulations while trying to go from point A to point B (reaching point B being the positive outcome), gradually learning what is right and wrong to do. How fantastic is that?

A car training in simulation (source: TechCrunch)
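
Here is a toy sketch of that trial-and-error idea: tabular Q-learning on a tiny grid where one cell plays the role of a crash and another the role of point B. Real self-driving systems use far richer simulators and deep networks instead of a table, so this only shows the learning loop.

```python
import random

# A tiny grid "town": start at (0, 0), point B at (2, 2), an obstacle at (1, 1).
GOAL, OBSTACLE = (2, 2), (1, 1)
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]        # right, left, down, up

Q = {((x, y), a): 0.0 for x in range(3) for y in range(3) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.2               # learning rate, discount, exploration

for episode in range(2000):                         # thousands of simulated "drives"
    state = (0, 0)
    for step in range(50):
        # Sometimes explore at random, otherwise take the best known action.
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        nxt = (min(max(state[0] + action[0], 0), 2),
               min(max(state[1] + action[1], 0), 2))
        if nxt == OBSTACLE:
            reward, done = -10.0, True              # "crash" in simulation
        elif nxt == GOAL:
            reward, done = +10.0, True              # reached point B
        else:
            reward, done = -0.1, False              # small cost per move
        best_next = max(Q[(nxt, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        if done:
            break
        state = nxt
```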

End-to-end deep learning for autonomous driving with reinforcement learning is quite promising. One of the most famous startups working on it is drive.ai, which just welcomed Andrew Ng, former Chief Scientist at Baidu, to its board, and big companies like Tesla are definitely starting to go down this path too.

So what about the probabilistic approach?
We can easily see why the probabilistic approach matters in driving. This is typically a situation where you constantly deal with uncertainty and beliefs about what is coming next. When you look around and see a car behind or next to you, you assess the probability of the different actions it might take, based on the behaviour of the driver, how fast the car is going, … But you are sure of nothing, and while deterministic models might tell you to take an action with erroneously high confidence, probabilistic models “know when they don’t know” and will give you a set of actions that reflects this higher uncertainty.
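
Here is a minimal sketch of that kind of reasoning: a probability distribution over the other car’s possible maneuvers, updated with each new observation using Bayes’ rule. The maneuvers, observations and likelihood numbers are all invented for illustration.

```python
# Belief over what the car next to us is about to do (prior, purely illustrative).
belief = {"keep_lane": 0.7, "change_lane": 0.2, "brake": 0.1}

# How likely each observation is under each hypothesis (also illustrative).
likelihood = {
    "turn_signal_on": {"keep_lane": 0.05, "change_lane": 0.90, "brake": 0.10},
    "slowing_down":   {"keep_lane": 0.20, "change_lane": 0.40, "brake": 0.90},
}

def update(belief, observation):
    """One step of Bayes' rule: posterior is proportional to likelihood * prior."""
    posterior = {h: likelihood[observation][h] * p for h, p in belief.items()}
    total = sum(posterior.values())
    return {h: p / total for h, p in posterior.items()}

belief = update(belief, "turn_signal_on")
belief = update(belief, "slowing_down")
print(belief)   # probability mass shifts toward "change_lane", uncertainty is kept
```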

As good evidence of how important probabilistic modeling is in autonomous driving, Uber acquired Geometric Intelligence last year. Geometric Intelligence was one of the well-known startups working on new approaches to AI based on probabilistic programming as opposed to pure deep learning, and they joined Uber to form its AI lab.

Conclusion — What is AI’s next frontier?

I hope I have been able to convey the difference between the symbolic approach, deep learning and probabilistic methods.

I should now conclude with the fact that most experts agree that the algorithms of the future will most likely rely on a clever combination of the three approaches. This is what Pedro Domingos explains in his book “The Master Algorithm”, and as underlined in this great article, the unfortunate Cyc project, which does not seem to have been very successful, is actually not that far from Google’s Knowledge Graph, which contains vast amounts of relationship knowledge. Yoshua Bengio, a renowned deep learning expert from the University of Montreal, believes that neural nets will reach “common sense” by analyzing enormous amounts of data, and mentions: “We can take something like Cyc and we can treat it like data. We can learn from that.” As for the start-ups working on the probabilistic approach, which include Gamalon but also Vicarious (backed by prestigious investors such as Mark Zuckerberg, Elon Musk and Jeff Bezos) or prowler.io, they all insist on the fact that their approach still mixes the different methods.

More generally, AI’s next technological frontier will be about data efficiency, generalization and conceptual understanding, and it can count not only on these different modeling approaches but also on techniques such as transfer learning, which takes a narrow AI trained on a certain task and transfers part of what it learned to train another AI faster on a similar — yet different — task (learn more in this excellent blog post). This is notably how self-driving cars trained in virtual reality can eventually work in the real world.
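
As a small illustration of transfer learning, here is a common recipe with PyTorch and torchvision: reuse a backbone pretrained on ImageNet, freeze it, and retrain only a new final layer for the narrow task at hand. The 3-class road-sign task is a made-up example, and note that newer torchvision versions use a `weights=` argument instead of `pretrained=True`.

```python
import torch.nn as nn
from torchvision import models

# Start from a network already trained on a broad image task (ImageNet).
backbone = models.resnet18(pretrained=True)

# Freeze what it has already learned about edges, textures and shapes...
for param in backbone.parameters():
    param.requires_grad = False

# ...and replace only the last layer for the new, related task
# (say, 3 classes of road signs). Only this small head gets trained.
backbone.fc = nn.Linear(backbone.fc.in_features, 3)
```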

AI’s next frontier will also be about tackling its current issues. We mentioned how decision-making algorithms should be explainable and unbiased. Perception systems also suffer from weaknesses: they can easily be fooled by so-called “adversarial examples” (learn more here), which could have dramatic consequences if hackers decided to exploit such a flaw.

A lot of work remains to be done to reach AI that starts looking like human intelligence — or even better, if we manage to avoid replicating our own flaws in the process — but progress is being made every day. We are definitely entering exciting times!
