How Are Machine Learning and Artificial Intelligence Used in Cybersecurity?

Artificial intelligence (AI), machine learning (ML), and deep neural networks (DNNs) are the talk of the town these days. However, few people understand the difference between these innovative technologies.

Artificial intelligence is an overarching concept that comprises several fields of computer science. It is geared toward solving tasks intrinsic to the human mind, such as speech recognition and object classification. Machine learning is part of the artificial intelligence ecosystem. Rather than following explicitly programmed rules, it builds patterns by learning from data collected in the course of solving many similar tasks.

Machine learning includes different algorithms, such as random forest, decision trees, Naive Bayes classifiers, gradient boosting, and more. Neural networks, in general, and deep neural networks, in particular, also fall under the category of machine learning algorithms.

The goals of artificial intelligence

When working with data, artificial intelligence is intended to solve four basic tasks:

● Classification

● Regression

● Ranking

● Clustering

Machine learning, in turn, involves two stages. The first is training, where a human collects an array of data, uses it to train a model, and eventually obtains a classifier. The second is the use of ML in practice, where the trained classifier is introduced into a system and supplied with new data. As a result, the classifier generates predictions.
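
To make the two stages concrete, here is a minimal sketch in Python with scikit-learn; the feature vectors and labels are invented for illustration.

```python
from sklearn.linear_model import LogisticRegression

# Stage 1: training. A human collects labeled data and fits a model,
# eventually obtaining a classifier. These values are toy data.
X_train = [[0.1, 200], [0.9, 15], [0.2, 180], [0.8, 30]]
y_train = [0, 1, 0, 1]  # 0 = benign, 1 = malicious
clf = LogisticRegression().fit(X_train, y_train)

# Stage 2: practical use. The trained classifier is embedded in a system,
# new data is supplied, and the classifier generates predictions.
print(clf.predict([[0.85, 20]]))
```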

How can machine learning help in cybersecurity?

To demonstrate the benefits of leveraging ML to thwart malicious activity and, for example, prevent ransomware infections, let’s consider several email usage scenarios. In this area, we can distinguish four patterns of human behavior. Analyzing them helps predict the user’s actions and pinpoint anomalies.

1. What time of day the person uses email: in the morning, in the afternoon, or in the evening.

2. How many devices they use to access email: a smartphone, a computer, or several devices simultaneously.

3. Where the person is located when using email.

4. What approach the user follows when checking emails: top-down or bottom-up. We can tell this by the way they reply to emails and delete them.

The answers to these questions help create a profile of the person. For machine learning, these routine behaviors are predictable as long as they occur without particular deviations.

Now, let’s imagine that a hacker has gained unauthorized access to the user’s email account. The crook’s behavior would likely differ from that of the original account owner, producing unnatural spikes of activity along the way. The task of the algorithm is to determine the moment when the person’s behavior changed.
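
As a hedged sketch of what such an algorithm might look like, the snippet below trains an anomaly detector (scikit-learn’s IsolationForest) on the owner’s routine sessions; the features (hour of access, device count, location code) and all numbers are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# The legitimate owner's routine sessions: [hour, devices, location code].
normal_sessions = np.array([
    [8, 1, 0], [9, 1, 0], [13, 2, 0], [19, 1, 0],
    [8, 1, 0], [14, 2, 0], [20, 1, 0], [9, 1, 0],
])
model = IsolationForest(contamination=0.1, random_state=0).fit(normal_sessions)

# A new session at 3 a.m., from three devices, in an unfamiliar location.
suspicious = np.array([[3, 3, 5]])
print(model.predict(suspicious))  # -1 means the session looks anomalous
```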

Another good example is the “Catch Me If You Can” contest described on Kaggle.com. The task is to distinguish the behavior of a malicious actor from that of a regular user based on the visited sites and the time the person spent on them. In other words, you can identify the attacker through website session tracking.
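
One common baseline for that setup is to treat each session’s sequence of visited sites as a “document” and train a text classifier on it; the site names and labels below are invented for the sketch.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

sessions = [
    "mail.com news.com mail.com docs.com",  # regular user
    "bank.com mail.com news.com",           # regular user
    "darkpool.io cc-dump.biz darkpool.io",  # intruder
]
labels = [0, 0, 1]  # 1 = intruder session

# Bag-of-sites representation: which sites appear in a session, and how often.
vec = CountVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(sessions), labels)
print(clf.predict(vec.transform(["cc-dump.biz mail.com darkpool.io"])))
```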

The building blocks of a machine learning workflow

Machine learning combines the following trio of components: data, attributes, and algorithms. Let’s look at each one to get the big picture.

Data

There are many datasets in the public domain you can use to train algorithms. However, such datasets have their disadvantages. For example, they may be incomplete, poorly marked up, or outright inaccurate.

If you want to implement an effective ML-based solution, you need to assemble a dataset that ideally fits the context of a particular task. Researchers willingly post their algorithms on publicly available resources and explain what they do, but few of them share their datasets.

A data scientist’s job is to create a dataset ready for use by integrating different pieces of relevant information, performing markup, and tidying it all up. This is a very tedious process that takes about 50% to 70% of the entire effort.
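
Here is a rough sketch of those chores with pandas; the sources, column names, and crude markup rule are all hypothetical.

```python
import pandas as pd

# Integrate different pieces of relevant information from separate sources.
requests = pd.DataFrame({"ip": ["1.2.3.4", "5.6.7.8"], "url_len": [42, 310]})
reputation = pd.DataFrame({"ip": ["5.6.7.8"], "blacklisted": [True]})
data = requests.merge(reputation, on="ip", how="left")

# Tidy up: fill the gaps, then perform the markup (labeling).
data["blacklisted"] = data["blacklisted"].fillna(False)
data["label"] = data["blacklisted"].astype(int)
print(data)
```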

Attributes

Consider a simple web request. It has about 600 attributes in total. The request length, response code, URL, context, and domain authority are a few examples. This raises two important questions:

1. Which of these attributes to use and which ones to ignore?

2. How will the solution be used — in real-time or offline?

You will have to find a trade-off between these parameters. For example, if you are going to use the solution in real-time, you want the model to calculate quickly, so you should use fewer attributes and a simpler model. In offline mode, you can download, analyze, and categorize data at leisure. In this case, you can leverage a model of any complexity because the algorithm can do the math for as long as you need.
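
The trade-off can be sketched as follows: a light linear model on a handful of attributes for real-time scoring versus a heavier ensemble on the full attribute set for offline analysis. The 600-attribute dataset here is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=600, random_state=0)

# Real-time: fewer attributes and a light model that calculates quickly.
fast_model = LogisticRegression(max_iter=1000).fit(X[:, :20], y)

# Offline: no latency budget, so use all attributes and a heavier model.
slow_model = GradientBoostingClassifier().fit(X, y)

print(fast_model.predict(X[:1, :20]), slow_model.predict(X[:1]))
```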

Here is another example. Suppose you need to figure out if the file in front of you is dangerous or not. To do this, you should figure out the answers to the following questions first:

1. Does the file require Internet access?

2. Does it do anything similar to scanning?

3. What IPs is the file using?

4. Does it want to access the system registry?

5. Does the file interact with memory?

6. Does it try to change the file system?

7. Does the file have the ability to self-replicate or capture other files?

The answers to these questions will help identify attributes that can be used to solve your problem.
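
For illustration, the answers can be encoded as a binary feature vector for a classifier to consume; the attribute names and the file_clf model referenced in the comment are hypothetical.

```python
# Answers to the seven questions, encoded as 0/1 attributes (made-up values).
file_attributes = {
    "needs_internet": 1,
    "scanning_behavior": 0,
    "suspicious_ips": 1,
    "touches_registry": 1,
    "reads_memory": 0,
    "modifies_fs": 1,
    "self_replicates": 0,
}
x = [list(file_attributes.values())]
# verdict = file_clf.predict(x)  # hypothetical trained model: 1 = dangerous
```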

Algorithms

Machine learning algorithms can be split into several categories:

● Unsupervised learning (without a “teacher”)

● Supervised learning (with a “teacher”)

● Semi-supervised learning (with partial involvement of a “teacher”)

● Reinforcement learning

In unsupervised learning, a person feeds unlabeled data into the algorithm and waits for it to produce the desired result. In supervised learning, the data is already marked up, so it is possible to distinguish an intruder’s actions from those of a regular user. This is the most convenient and widespread option. Unsupervised learning is not used in security at this point.

Semi-supervised learning is something in between the first two types. You train an algorithm on largely unlabeled data and then test its accuracy using labeled data. For example, you can use 90% of your dataset to train a model and a labeled 10% to test its accuracy. This kind of algorithm saves a good deal of time.
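
Here is a minimal sketch of that setup on synthetic data: most training labels are hidden, scikit-learn’s self-training wrapper is used as one concrete semi-supervised technique, and a labeled 10% is held out for testing.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0)

# Pretend most training labels are unknown: -1 marks unlabeled samples.
y_partial = y_train.copy()
y_partial[100:] = -1

model = SelfTrainingClassifier(SVC(probability=True)).fit(X_train, y_partial)
print(model.score(X_test, y_test))  # accuracy on the held-out labeled 10%
```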

In reinforcement learning, there is an agent and an environment. When interacting with the environment, the agent receives either a reward or a penalty. It has to adjust its strategy so that it always receives the reward. For example, it could be a walking robot trying to learn how to take steps or a wheeled mobile robot attempting to get from point A to point B without bumping into anything.

The plus side of this algorithm is that it works in real-time when interacting with the system. You don’t have to assemble a dataset in advance, as the data is collected and labeled on the fly. This type of algorithm is not used in cybersecurity, though.
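
Purely for illustration, since the text above notes this type isn’t used in cybersecurity, here is a toy Q-learning loop: an agent walks a one-dimensional corridor from point A (state 0) to point B (state 4), collecting a reward or a penalty at each step.

```python
import random

n_states, actions = 5, [-1, +1]  # move left or move right
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}

for _ in range(500):  # episodes of interaction with the environment
    s = 0
    while s != n_states - 1:
        # Mostly exploit the best-known action, occasionally explore.
        if random.random() < 0.1:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[(s, act)])
        s2 = min(max(s + a, 0), n_states - 1)
        r = 1.0 if s2 == n_states - 1 else -0.01  # reward at B, penalty elsewhere
        Q[(s, a)] += 0.5 * (r + 0.9 * max(Q[(s2, b)] for b in actions) - Q[(s, a)])
        s = s2

print(max(actions, key=lambda act: Q[(0, act)]))  # learned first move: +1, toward B
```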

How do ML algorithms differ from one another?

It’s worth considering seven machine learning algorithms:

● Linear regression

● Logistic regression

● Decision trees

● Support-vector machine

● Naive Bayes classifier

● Random forest

● Gradient boosting

The gradient boosting algorithm and random forest are powerful spin-offs of one of the fundamental algorithms called decision trees. For instance, in a random forest algorithm, up to 1,000 decision trees can run in parallel to achieve the best possible result.

In the case of gradient boosting, the ensemble is built differently: trees are added sequentially, each one correcting the errors of the previous ones, which can yield a more accurate result than any single algorithm, and sometimes even a neural network. Let’s now go over the functional peculiarities of some popular algorithms.
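
A quick sketch of the contrast on synthetic data: the forest trains its trees independently (and can do so in parallel), while boosting grows them one after another.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Random forest: many independent trees, trainable in parallel (n_jobs=-1).
forest = RandomForestClassifier(n_estimators=1000, n_jobs=-1).fit(X, y)

# Gradient boosting: trees built sequentially, each correcting its predecessors.
boosting = GradientBoostingClassifier(n_estimators=100).fit(X, y)

print(forest.score(X, y), boosting.score(X, y))
```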

Naive Bayes classifier

The Naive Bayes classifier had been heavily used in spam filters until around 2010, but spammers learned to craft messages that evade it, making it virtually unusable in that area. Nevertheless, let’s see how it works.

Imagine that your task is to determine whether an email is spam or not. As you train the classifier, you get two lists: one with “good” words and the other with “bad” words that are very often encountered in spam messages. These lists include the words themselves along with the frequency at which they typically occur.

Suppose the email contains the word “dog.” The algorithm will tell you that it occurs much more often in benign emails than in spam. Next, information about each word is entered into the Naive Bayes formula, which calculates the probability that the message is spam.
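
The word-frequency idea can be sketched with scikit-learn’s multinomial Naive Bayes; the tiny corpus below is invented.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["win free money now", "meeting about the dog park",
          "free prize claim now", "lunch with the dog walker"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = benign

vec = CountVectorizer()
model = MultinomialNB().fit(vec.fit_transform(emails), labels)

# "dog" pulls the estimate toward the benign class, as described above.
print(model.predict_proba(vec.transform(["free dog prize"])))
```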

Decision Trees

Here, again, we can use the example of analyzing a specific file for malicious traits. At each step, the algorithm will decide what question to ask in the process of learning. Suppose it asks if the file requires an internet connection. The answer can be either positive or negative. This determines what the next question will be.

For example, if the answer is “No,” the algorithm asks if the file requires access to the registry. If the answer to this question is “Yes,” it means that the file is dangerous. Otherwise, the algorithm asks whether the file works with memory directly. If it doesn’t, then the file is harmless. If the answer is “Yes,” a new question must be asked. This process goes on until it becomes clear whether the file is harmful or not.
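
Written out literally, that question chain looks like the hard-coded function below; a real decision tree learns such a structure from data instead of having it spelled out by hand.

```python
def is_dangerous(needs_internet: bool, touches_registry: bool,
                 reads_memory: bool) -> bool:
    """Hard-coded mirror of the question chain described in the text."""
    if not needs_internet:
        if touches_registry:
            return True      # accesses the registry -> dangerous
        return reads_memory  # works with memory directly -> flag for more checks
    return True              # the "Yes" branch would ask further questions

print(is_dangerous(needs_internet=False, touches_registry=False,
                   reads_memory=False))  # False: the file is harmless
```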

Differences Between Neural Networks and Other Algorithms

The classic machine learning scheme is based on the following workflow: a person manually extracts data attributes, selects the important ones, and builds a classifier that will produce a result.

Neural networks select the relevant attributes and perform the classification on their own. However, they are more sensitive to tuning and quite resource-intensive. Furthermore, depending on the number of layers in a deep neural network, the accuracy of the result can vary greatly, for example, from 80% to 99%.

Neural networks can solve a task end to end without human involvement, but it’s necessary to understand their scope of use. When it comes to Internet security, they work best in speech recognition or in processing images and video. For instance, neural networks power the Face ID feature on iPhones.
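
The contrast with the classic scheme can be sketched with a small multilayer perceptron that receives raw features directly and learns its own internal representation; the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=50, random_state=0)

# No manual attribute selection: the hidden layers learn intermediate
# representations. Depth and width are the sensitive "adjustments" noted above.
net = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500,
                    random_state=0).fit(X, y)
print(net.score(X, y))
```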

It’s noteworthy that neural networks can be susceptible to exploitation. In 2018, researchers at Google found a way to inject specially crafted “noise” into input data and thereby make neural networks generate inaccurate results.

The place of machine learning in the information security paradigm

As far as security goes, here is what the use of machine learning looks like in practice:

1. Loading, collecting, and processing data.

2. Singling out the attributes.

3. Training several models, comparing those models, and selecting the most suitable one.

4. Introducing the best model into a system.

5. Injecting data into the model and analyzing the output.
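
Compressed into code, the five steps might look like this sketch; the data and candidate models are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Steps 1-2: load/process data and single out the attributes (synthetic here).
X, y = make_classification(n_samples=1000, n_features=30, random_state=0)

# Step 3: train several models, compare them, and select the most suitable one.
candidates = [LogisticRegression(max_iter=1000), RandomForestClassifier()]
best = max(candidates, key=lambda m: cross_val_score(m, X, y).mean())

best.fit(X, y)              # step 4: introduce the best model into the system
print(best.predict(X[:5]))  # step 5: inject data and analyze the output
```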

Some people think that in a security solution based on machine learning, the algorithm does everything autonomously: learns, detects anomalies, and identifies intruders. However, this is a misconception. Machine learning involves two steps, learning and using the model, and these stages don’t overlap: you train the model first, and once you start using it, the model no longer undergoes training.

For example, suppose you have taught a network to recognize cats and dogs. If this network is shown a walrus, it will keep comparing it against the predefined attributes of a cat or a dog. But a walrus is obviously neither, so the solution will misclassify it. The same thing can happen when we trust a network to keep track of some aspect of security.

Let’s say you have an array of information about how hackers operate. You need to somehow turn that into a dataset and mark it up in the system. When hackers act according to the scheme from your dataset, the classifier will instantly recognize the threat and respond appropriately. If hackers come up with a new vector of compromise that is not in the dataset, the output can be unpredictable.

Summary

1. There is no universal algorithm that solves all problems. Different algorithms help solve problems in different scenarios.

2. The data needs to be constantly updated. You can’t train an algorithm once and rely on it indefinitely. As previously mentioned, the Naive Bayes classifier used to do a great job filtering spam, but threat actors found a way to bypass it. Therefore, you can’t use the algorithm for that purpose anymore.

3. You need to choose the complexity of a model depending on where and how it will be used. For example, if you need the solution to work in real-time, consider choosing a “lighter” model that operates fast and doesn’t gobble up too many resources. In offline mode, the time is not limited, so the model can perform analysis for as long as you want.

4. If a dataset is incomplete or inaccurate, no algorithm will be effective.

Therefore, you need to constantly monitor the situation in your area of expertise, enrich the datasets, mark them up, and keep training the classifier. This is the only way to minimize the number of system flaws.

David Balaban is a computer security researcher with over 15 years of experience in malware analysis and antivirus software.