A Deep Dive into ONNX & ONNX Runtime (Part 1)

--

The rise of deep learning began in the early 2010s, driven by advances in hardware and accelerators, and with this support researchers and engineers introduced ever larger and more complex models. However, limitations such as memory consumption and execution time remain a challenge, and in engineering and business settings these challenges become even more prominent because computing resources are limited.

With the spread of larger and more complex models in recent years, optimizing them so that they deliver the highest possible quality with the minimum consumption of resources has become a critical issue. This matters twice over in the inference phase of deep learning models: beyond runtime optimization (speed and memory consumption), there are also the differences between the environment in which a model is trained and the environment in which it performs inference. In general, it is not always possible to train and serve a model in the same environment. Therefore, separating the development and training process from the deployment and inference process has been one of the ongoing concerns of machine learning engineers and researchers.

In this series of articles, we examine the challenges of optimizing neural network models with the help of ONNX and ONNX Runtime, going into their low-level details.

In part 1, we introduce model optimization and acceleration methods and explain why a common intermediate representation, ONNX, is needed.

In part 2, we will take an in-depth look at the building blocks of ONNX Runtime.

Below is an overview of the optimization and acceleration methods available in the inference phase:

Inference acceleration stack
Source: https://deci.ai/blog/graph-compilers/

This stack consists of different levels, and at least one of them must be used to achieve faster inference. It can be divided into three kinds of levels:

1. Hardware: Improve computation with parallelization

Hardware devices sit at the lowest level of the acceleration stack. Units such as the CPU, GPU, and TPU perform the actual calculations, and despite their differences, faster calculations lead to faster model inference. However, hardware acceleration faces fundamental limits, for example those implied by the slowing of Moore’s Law.

Moore’s law
Source: https://en.wikipedia.org/wiki/Moore%27s_law

2. Software: Acceleration without changing the model

Any acceleration that does not change the model falls into this category. These methods optimize the computational graph while leaving the model itself unchanged.

These methods are divided into two groups:

  1. Low-level libraries: hardware-specific optimizations

These libraries (e.g., cuDNN, MKL-DNN, etc.) are hardware-specific and provide highly tuned implementations of common routines such as forward and backward convolution, pooling, normalization, and activation layers, largely exploiting the parallelism of the underlying device (a small dispatch sketch follows below).

Low-level libraries & Graph compilers
Source: https://www.sodalite.eu/content/graph-compilers-ai-training-and-inference

  2. Graph compilers: Optimizing forward or backward paths in the computational graph

We will talk about graph compilers in more depth later in this article.
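
Meanwhile, as a quick illustration of the first group, here is a minimal sketch (assuming PyTorch; the layer sizes and input shape are made up for illustration) of how a framework hands convolutions off to a low-level library and lets it auto-tune kernel selection:

# PyTorch dispatches convolutions to low-level libraries such as cuDNN (GPU)
# or oneDNN/MKL-DNN (CPU); the benchmark flag lets cuDNN time the available
# convolution kernels and cache the fastest one for each input shape.
import torch
import torch.nn as nn

torch.backends.cudnn.benchmark = True  # auto-tune cuDNN kernel selection (GPU only)

device = "cuda" if torch.cuda.is_available() else "cpu"
conv = nn.Conv2d(3, 16, kernel_size=3, padding=1).to(device)

with torch.no_grad():
    out = conv(torch.randn(1, 3, 224, 224, device=device))
print(out.shape)  # torch.Size([1, 16, 224, 224])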

3. Algorithmic: Acceleration by changing the model

These methods speed up inference by changing the model or its architecture. Their focus is on removing possible redundancies in the model, or on keeping the model’s important information while discarding the less important parts.

These methods can be classified into three categories (a minimal quantization sketch follows this list):

  1. Pruning: Filtering out the less important weights of the network
  2. Network quantization: Replacing floating-point weights or activations with compact, lower-precision representations
  3. Neural Architecture Search (NAS): Automating network architecture engineering by choosing the right architecture from a space of allowed architectures
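
As a small example of the second category, the following is a minimal sketch (assuming PyTorch; the toy model and layer sizes are made up) of post-training dynamic quantization, which swaps floating-point Linear weights for int8 representations:

# Post-training dynamic quantization: weights are stored in int8 and
# activations are quantized on the fly at inference time.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).eval()

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
with torch.no_grad():
    print(quantized(x).shape)  # torch.Size([1, 10])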

Graph Compilers

Earlier in the Inference acceleration stack, graph compilers were mentioned as one of the software accelerators. In this section, we will discuss them in more depth.

Most deep learning architectures can be described by a directed acyclic graph (DAG) in which each node represents a neuron, and two nodes are connected by an edge if the output of one is the input of the other. Similarly, the nodes of a computational graph represent operators and its edges represent the data dependencies between them.

Forward and backward pass in computational graph
Source: https://pytorch.org/blog/computational-graphs-constructed-in-pytorch/

When we define a neural network in TensorFlow or PyTorch, it is turned into a computational graph that is then executed on the target hardware. The computational graph can therefore be regarded as an Intermediate Representation (IR), which is very useful for optimization and for execution on different devices.
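
As a small illustration, here is a minimal sketch (assuming PyTorch and its torch.fx tracer; the toy module is made up) of inspecting the graph that the framework builds from model code:

# torch.fx symbolically traces the module and records its computational graph:
# each node is an operator, and edges are the data dependencies between them.
import torch
import torch.nn as nn
from torch.fx import symbolic_trace

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

    def forward(self, x):
        return torch.relu(self.fc(x))

traced = symbolic_trace(TinyNet())
print(traced.graph)  # placeholder -> call_module (fc) -> call_function (relu) -> output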

The complexity of a computational graph grows with its size, and this is where graph compilers come into play: their goal is to optimize the generated computational graph for inference on a given piece of hardware.

However, the biggest challenge in using graph compilers stems from the fact that frameworks and compilers are usually developed independently of each other:

  1. A framework may implement a new operator that is not yet implemented by the compiler.
  2. Some basic layers (such as convolution) have different implementations.
  3. Some compilers only work with certain frameworks; for example, OpenVINO only works with TensorFlow and ONNX, so a PyTorch model that we want to compile with OpenVINO must first be converted to ONNX and then compiled with OpenVINO (we’ll talk about ONNX later).

Graph compilers (e.g., TVM, TensorRT, OpenVINO, etc.) map the high-level computational graph produced by a deep learning framework to operations that can be executed on a given piece of hardware. While compiling the computational graph, or mapping it onto the hardware, they perform hardware-specific optimizations to speed up inference. These optimizations include:

Graph Rewriting

The structure of the graph determines the order in which operations are executed, and job scheduling is concerned with finding the optimal order in which to execute a sequence of operations.

This optimal order can usually be reached by applying a few basic rewriting actions (a sketch of triggering such rewrites in a runtime follows this list), such as:

  1. Removing or adding a node or edge
  2. Node fusion
  3. Replacing a subgraph with another subgraph
  4. Removing layers whose output is unused
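
For instance, here is a minimal sketch (assuming onnxruntime is installed; the model file names are hypothetical) of asking a runtime's graph optimizer to apply basic rewrites such as constant folding and redundant-node elimination before execution:

# Graph-level rewrites applied by ONNX Runtime before the model is executed.
import onnxruntime as ort

opts = ort.SessionOptions()
# ORT_ENABLE_BASIC applies basic graph rewrites (e.g. constant folding and
# redundant node elimination); ORT_ENABLE_ALL adds extended fusions and
# layout optimizations on top of that.
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_BASIC
opts.optimized_model_filepath = "model_rewritten.onnx"  # save the rewritten graph

session = ort.InferenceSession("model.onnx", sess_options=opts)
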
Operation Fusion
Source: https://deci.ai/blog/graph-compilers/

Operation Fusion

Computational graphs often contain sequences of operations that occur relatively frequently or for which specialized hardware kernels exist.

Many compilers exploit this fact by fusing such operations (when possible) and eliminating the unnecessary memory accesses between them.

Operation fusion appears in many places; for example, convolution, ReLU, and BatchNorm are usually fused into a single operation.
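
As a small sketch (assuming PyTorch; the block and its module names are made up for illustration), this kind of fusion can be requested explicitly with the fuse_modules utility:

# Conv2d + BatchNorm2d + ReLU are folded into a single fused module, removing
# the intermediate memory traffic between the three separate kernels.
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(8)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))

m = Block().eval()  # conv-bn folding requires eval mode
fused = torch.quantization.fuse_modules(m, [["conv", "bn", "relu"]])
print(fused)  # the three modules are replaced by a fused ConvReLU2d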

Assignment of Operations / Operation Scheduling

Part of the optimization is determining the best assignment of operations to the target hardware, especially when inference runs across multiple devices.

Graph compilers provide an additional hardware abstraction layer that accelerates the inference process on various devices. Operations are scheduled based on different policies.

When several different devices are available, each device has its own queue of ready-to-run operations. In this case, graph compilers optimize execution by choosing the scheduling strategy needed to assign priorities to the different nodes of the graph, while taking cross-device dependencies into account.
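
As a concrete sketch of this idea (assuming onnxruntime with GPU support is installed; the model file name is hypothetical), a runtime can be given a priority-ordered list of devices and will assign each graph node to the first device that supports it, falling back to the CPU otherwise:

# Execution providers are tried in priority order; nodes that the GPU provider
# cannot run are assigned to the CPU provider.
import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(session.get_providers())  # the providers actually used by this session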

Today, each framework has its own representation of the computational graph. In addition, frameworks are usually optimized for specific goals (such as fast training, support for complex architectures, or inference on mobile devices), so developers pick a framework according to whichever of these goals matters most to them.

Given all of this, models developed with different frameworks should be able to run optimally in any environment, according to that environment’s configuration. In other words, we need a common intermediate representation.

ONNX, a common IR

ONNX is a common Intermediate Representation (IR) that helps build a powerful ecosystem around model deployment. By providing a common representation of the computational graph, ONNX helps developers choose the framework that fits their purpose, allows framework authors to focus on creative improvements, and gives hardware vendors a single format against which to optimize their platforms.

ONNX format, a common IR
Source: https://bleedai.com/training-a-custom-image-classifier-with-tensorflow-converting-to-onnx-and-using-it-in-opencv-dnn-module/

The three main tasks of ONNX can be listed as follows (a short export-and-run sketch appears after the list):

  1. Convert the model from any framework to ONNX format
  2. Convert ONNX format to any desired framework
  3. Faster inference with ONNX model on supported runtime engines
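
The sketch below (assuming PyTorch, onnx, and onnxruntime are installed; the toy model and file name are made up) illustrates tasks 1 and 3: exporting a model to ONNX and running it on a runtime engine:

# Task 1: convert a (toy) PyTorch model to ONNX by tracing it with dummy input.
# Task 3: run the exported graph on a supported runtime engine (ONNX Runtime).
import torch
import torch.nn as nn
import onnxruntime as ort

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2)).eval()
dummy = torch.randn(1, 4)

torch.onnx.export(model, dummy, "tiny.onnx", input_names=["x"], output_names=["y"])

session = ort.InferenceSession("tiny.onnx", providers=["CPUExecutionProvider"])
outputs = session.run(["y"], {"x": dummy.numpy()})
print(outputs[0].shape)  # (1, 2)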

ONNX File Format

An ONNX file is actually an extensible specification consisting of three parts:

  1. Definition of an extensible computational graph model
  2. Standard data type definition
  3. Definitions of the built-in operators

The first two parts together form the intermediate representation (IR) itself. The full list of built-in operators is also available here and here.
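
The built-in operators can also be enumerated programmatically; a minimal sketch using the onnx Python package:

# Every built-in operator ships with a schema describing its name, inputs,
# outputs, attributes, and the opset version it was introduced or changed in.
import onnx.defs

schemas = onnx.defs.get_all_schemas()
print(len(schemas), "built-in operators")
print(sorted(s.name for s in schemas)[:5])  # e.g. ['Abs', 'Acos', 'Acosh', 'Add', 'And']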

ONNX stores data in the Protocol Buffers format, a data serialization method with its own IR and compiler. In the Protocol Buffers IR, the messages to be exchanged are defined, and each field of a message is given a unique number; only these numbers (rather than field names) are transferred when information is exchanged, which keeps messages small. The format fixes only the data types and the order of the data; each piece of data is interpreted by the software that consumes it.

For example, in the Protocol Buffers format, a node with the specification

Y = Conv[kernel=1, pad=1, stride=1](X, W, B)

will be defined as shown below:

A Protocol Buffers example

Similar structures exist for the other components of a computational graph, such as the graph itself, nodes, attributes, tensors, etc.
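
A minimal sketch (using the onnx Python package; the attribute names below are the standard ONNX Conv attributes corresponding to the kernel, pad, and stride above) of building such a node and printing its Protocol Buffers message:

# helper.make_node builds a NodeProto; printing it shows the protobuf text form
# with op_type, inputs/outputs, and attributes.
from onnx import helper

node = helper.make_node(
    "Conv",
    inputs=["X", "W", "B"],
    outputs=["Y"],
    kernel_shape=[1, 1],
    pads=[1, 1, 1, 1],
    strides=[1, 1],
)
print(node)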

The highest-level data structure in ONNX is a “model”, defined in Protocol Buffers as ModelProto:

ModelProto structure

For example, opset_import is the collection of “operator set” identifiers made available to the model; an implementation must support all operators in a set or reject the model. Adding, removing, or changing operators results in a new opset version, which is why operators in ONNX are versioned.
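
These fields can be read directly from a serialized model; a minimal sketch (reusing the hypothetical "tiny.onnx" file from the earlier export example):

# Load a ModelProto and inspect its version fields and opset imports; the
# checker validates the model against the operator sets it declares.
import onnx

model = onnx.load("tiny.onnx")
print("IR version:", model.ir_version)
for opset in model.opset_import:
    print("opset domain:", opset.domain or "ai.onnx", "version:", opset.version)
onnx.checker.check_model(model)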

A model developed in a given framework is converted to ONNX format by running the model on (usually random) sample data. During this run, the operations that are performed are mapped to ONNX operators, and in the end the entire model graph is mapped to the ONNX format.

Since an ONNX file is a binary file, its contents can be inspected after decoding it with the Protocol Buffers compiler. The ONNX source code lays out this structure and how it is interpreted (in the onnx.proto definition), which can be used to encode and decode the binary model.

To decode an ONNX model into the Protocol Buffers text format, the following command can be used:

protoc --decode=onnx.ModelProto onnx.proto < yourfile.onnx > yourfile.onnx.txt

and to encode it back into an ONNX model:

protoc --encode=onnx.ModelProto onnx.proto < yourfile.onnx.txt > yourfile.onnx

Tools such as Netron, VisualDL, and Zetane have also been developed for visualizing the computational graph of an ONNX model at an abstract level; the graph can be inspected visually by loading the ONNX model into one of them.

In the next article, we’ll discuss the building blocks of ONNX Runtime.
