
Nasscom Community

Hardware Acceleration of Deep Neural Network Models on FPGA (Part 2 of 2)

Ignitarium
Mon, 06/28/2021 – 13:49

While Part 1 of this 2-part blog series covered Deep Neural Networks and the different accelerators for implementing Deep Neural Network Models, Part 2 will talk about different Deep Learning Frameworks and hardware frameworks provided by FPGA Vendors.

Deep Learning Frameworks:

A deep learning framework is a tool or library that helps us build DNN models quickly and easily, without in-depth knowledge of the underlying algorithms. It provides a condensed way of defining models using pre-built and optimized components. Some of the important deep learning frameworks are Caffe, TensorFlow, PyTorch and Keras.

Caffe is a deep neural network framework designed with speed and modularity in mind. It is developed by Berkeley AI Research. Caffe mainly focuses on image processing applications involving convolutional neural networks (CNNs), but it also supports Region-based CNN (R-CNN), RNN, Long Short-Term Memory (LSTM) and fully connected neural network designs. It supports CPU and GPU acceleration libraries such as NVIDIA cuDNN and Intel MKL, and it provides C++, Python and MATLAB interfaces.

TensorFlow is an open-source deep learning framework which includes pre-written code for deep learning models like RCNN and CNN. It was developed by researchers at Google and supports the R, C++ and Python languages. Its flexible architecture allows models to be deployed across different platforms such as CPUs and GPUs. TensorFlow works well on sequence-based data as well as on images. At the time of writing, the current major version is the TensorFlow 2.x series, which brought significant improvements in GPU performance.

Keras is an open-source framework that can run on top of TensorFlow. It is a high-level API that enables fast experimentation with neural network models. Keras supports both CNNs and RNNs. It was developed by François Chollet, a Google engineer. Keras is written in Python and runs on both CPUs and GPUs.

PyTorch is an open-source machine learning library. It is developed by Facebook’s AI research lab and is used for applications like computer vision and natural language processing. It has both Python and C++ interfaces.

Hardware Frameworks for DNN:

Using an FPGA as a hardware accelerator for Deep Neural Networks has its own advantages and disadvantages. One of the main challenges is that an FPGA is programmed by describing its functionality in a Hardware Description Language (HDL) such as VHDL or Verilog, which is quite different from regular programming in C or C++. To reduce this complexity, High-Level Synthesis (HLS) tools synthesize high-level languages into HDL code. Even so, implementing neural network models defined in frameworks such as Caffe or TensorFlow remains complex, because designers need in-depth knowledge of both the machine learning framework and the FPGA hardware. FPGA vendors and third-party companies have therefore developed hardware frameworks that significantly reduce this complexity.

The hardware frameworks that we cover here are OpenCL, Intel’s OpenVINO, Xilinx DNNDK, Xilinx Vitis AI and the Lattice sensAI stack.

Open Computing Language (OpenCL) is a framework for writing and executing programs on heterogeneous computing platforms, including CPUs, GPUs, FPGAs, Digital Signal Processors (DSPs) and other hardware accelerators. It was originally developed by Apple, which launched it in 2009 to utilise the acceleration possibilities of on-board GPUs, and it is now maintained as an open standard by the Khronos Group. The newest version is 3.0, which incorporated more C++ features into the language.

The OpenCL framework officially supports C and C++, but unofficial bindings are available for Python, Java, Perl and .NET. An OpenCL program is built around a host connected to one or more compute devices, such as a CPU and a GPU, each of which is further divided into multiple processing elements. A function executed using OpenCL is called a kernel and can run in parallel on all processing elements. A programmer can utilise the acceleration capabilities available on a system by querying the device information of the computer the program is running on.
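The execution model described above can be pictured with a plain-Python emulation. This is only a sketch and not OpenCL code: the names `vector_add_kernel` and `enqueue_nd_range` are hypothetical, and the work-items run one after another here, whereas a real device would spread them across its processing elements.

```python
from typing import Callable, List


def vector_add_kernel(global_id: int, a: List[float], b: List[float],
                      out: List[float]) -> None:
    """Body executed by one work-item; it touches only its own index."""
    out[global_id] = a[global_id] + b[global_id]


def enqueue_nd_range(kernel: Callable[..., None], global_size: int, *args) -> None:
    """Hypothetical stand-in for launching a kernel over a 1-D index space.

    A real OpenCL runtime would distribute these work-items over the
    processing elements of a device; this emulation runs them sequentially.
    """
    for global_id in range(global_size):
        kernel(global_id, *args)


a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
result = [0.0] * len(a)

# One work-item per output element.
enqueue_nd_range(vector_add_kernel, len(a), a, b, result)
print(result)
```

Because each work-item writes only to its own output index, the order in which the work-items run does not matter, which is what allows a device to execute them in parallel.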

While OpenCL provides good possibilities for acceleration and resource usage, it is limited by its low-level nature. Although libraries exist for standard operations like FFT, neural networks have to be declared manually unless the framework used to generate the network has an OpenCL branch. At the time of writing, Caffe has such a branch under development, and TensorFlow has an OpenCL branch on its roadmap. This lack of neural network framework support limits OpenCL’s adoption. A similar and more widely supported framework is Nvidia’s CUDA, although it only runs on Nvidia GPUs.

The OpenVINO toolkit is provided by Intel for running neural networks on its hardware, including FPGAs, and aims to simplify the process compared to existing solutions. Launched in 2018, it allows users to build applications in which neural networks are accelerated on Intel processors, GPUs, FPGAs and Vision Processing Units (VPUs). The toolkit supports several inference targets, and the level of support varies between platforms.

OpenVINO is mainly used for accelerating image recognition CNNs but can be used for other purposes such as speech recognition. It supports frameworks such as Caffe and TensorFlow and deep learning architectures such as AlexNet and GoogLeNet. It supports a fixed set of layers for each framework out of the box, with custom layer support available for developers.

In the OpenVINO toolkit, neural network models are optimised using the Model Optimizer, which takes the model files produced by the neural network framework, such as a caffemodel file (from Caffe) containing the trained weights. The default model precision is single-precision floating-point, while quantisation to half-precision floating-point is available in the Optimizer. 8-bit integer quantisation is also available.
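To make these precision options concrete, the sketch below shows what reducing precision does to a handful of weights. It is not the Optimizer's algorithm: the function names are hypothetical, the half-precision step simply round-trips values through IEEE 754 binary16 using Python's `struct` module, and the 8-bit step assumes a simple symmetric per-tensor scale.

```python
import struct
from typing import List, Tuple


def to_fp16(values: List[float]) -> List[float]:
    """Round each value to the nearest half-precision float and back."""
    return [struct.unpack("<e", struct.pack("<e", v))[0] for v in values]


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantisation to the integer range [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized: List[int], scale: float) -> List[float]:
    """Map the integers back to approximate floating-point values."""
    return [q * scale for q in quantized]


weights = [0.1, -0.6, 0.25, 1.0]

fp16_weights = to_fp16(weights)
q_weights, scale = quantize_int8(weights)
restored = dequantize_int8(q_weights, scale)

print("fp16:    ", fp16_weights)
print("int8:    ", q_weights, "scale:", scale)
print("restored:", restored)
```

In this scheme the 8-bit representation stores one small integer per weight plus a single shared scale, and the rounding error per weight is bounded by half the scale.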

The Optimizer produces an optimised intermediate representation, which is loaded into the application using the Inference Engine API. The API loads the network onto the target device and runs it with the supplied input data. All pre-processing and post-processing is done in C++, so the only part which has to be replaced is the inference or prediction process.
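The division of labour described above can be sketched as follows. Nothing here is the Inference Engine API: `InferenceBackend`, `FrameworkBackend`, `AcceleratorBackend` and `classify` are hypothetical names, the "network" is a toy that scores average brightness, and Python is used only to keep the example short. The point is that pre-processing and post-processing stay the same while the object performing inference is swapped.

```python
from typing import List, Protocol


class InferenceBackend(Protocol):
    """Anything that turns an input tensor into a list of class scores."""

    def infer(self, tensor: List[float]) -> List[float]: ...


class FrameworkBackend:
    """Hypothetical stand-in for running the model in the original framework."""

    def infer(self, tensor: List[float]) -> List[float]:
        mean = sum(tensor) / len(tensor)
        return [mean, 1.0 - mean]  # scores for ["bright", "dark"]


class AcceleratorBackend:
    """Hypothetical stand-in for an accelerated runtime.

    A real backend would hand the tensor to the accelerator; this one
    reuses the same toy model so that the results are comparable.
    """

    def __init__(self) -> None:
        self._model = FrameworkBackend()

    def infer(self, tensor: List[float]) -> List[float]:
        return self._model.infer(tensor)


def preprocess(pixels: List[int]) -> List[float]:
    """Scale 8-bit pixel values to the range [0, 1]."""
    return [p / 255.0 for p in pixels]


def postprocess(scores: List[float], labels: List[str]) -> str:
    """Pick the label with the highest score."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return labels[best]


def classify(backend: InferenceBackend, pixels: List[int], labels: List[str]) -> str:
    """Only the backend changes between the two deployment options."""
    return postprocess(backend.infer(preprocess(pixels)), labels)


labels = ["bright", "dark"]
print(classify(FrameworkBackend(), [250, 240, 230], labels))
print(classify(AcceleratorBackend(), [10, 20, 30], labels))
```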

On an FPGA, OpenVINO uses a pre-loaded bitstream programmed onto the FPGA to accelerate instructions. It does not utilise HLS, but uses the FPGA as a specialised processor for performing mathematical operations found in neural networks, such as convolutions and activations. The OpenVINO bitstreams are fixed for an FPGA and do not allow customizations like adding other IO functions.
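For reference, the two operations named above can be written out directly. The sketch below is a naive pure-Python version: a "valid" 2-D cross-correlation (which is what deep learning frameworks usually call convolution) followed by a ReLU activation. It only illustrates the arithmetic and says nothing about how the FPGA bitstream implements it.

```python
from typing import List

Matrix = List[List[float]]


def conv2d_valid(image: Matrix, kernel: Matrix) -> Matrix:
    """Slide the kernel over the image without padding and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output: Matrix = []
    for row in range(out_h):
        out_row = []
        for col in range(out_w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    acc += image[row + i][col + j] * kernel[i][j]
            out_row.append(acc)
        output.append(out_row)
    return output


def relu(feature_map: Matrix) -> Matrix:
    """Replace negative values with zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]


image = [
    [5.0, 1.0, 2.0],
    [0.0, 3.0, 4.0],
    [2.0, 0.0, 1.0],
]
kernel = [
    [1.0, 0.0],
    [0.0, -1.0],
]

feature_map = conv2d_valid(image, kernel)
activated = relu(feature_map)
print(feature_map)
print(activated)
```

Every output element is an independent multiply-accumulate over a small window, which is why these operations map well onto parallel hardware.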

To compete with OpenVINO, Xilinx acquired the Chinese developer DeePhi in 2018, and with it DeePhi’s Deep Neural Network Development Kit (DNNDK), an SDK for accelerating neural networks on FPGAs. DNNDK performs model pruning, quantisation and deployment on Xilinx FPGA development kits such as the Xilinx ZCU102, ZCU104 and Avnet Ultra96, along with some of DeePhi’s own development kits.
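As an illustration of what model pruning means, the sketch below zeroes out the weights with the smallest magnitudes. This is a generic magnitude-pruning example and not DeePhi's method, which the text above does not describe; the function name `prune_by_magnitude` and the `sparsity` parameter are hypothetical.

```python
from typing import List


def prune_by_magnitude(weights: List[float], sparsity: float) -> List[float]:
    """Zero out the given fraction of weights with the smallest absolute value."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by weight magnitude, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_indices = set(order[:n_prune])
    return [0.0 if i in pruned_indices else w for i, w in enumerate(weights)]


layer_weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02]
pruned = prune_by_magnitude(layer_weights, sparsity=0.5)
print(pruned)
```

Zeroed weights need neither storage nor multiplications, which is where the savings on constrained hardware come from. In practice a pruned network is usually fine-tuned afterwards to recover accuracy.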

Alongside the FPGA fabric, these systems have embedded processors. Xilinx calls such devices Multi-Processor System-on-Chip (MPSoC), with the FPGA fabric referred to as the Programmable Logic (PL) and the processor side as the Processing System (PS). DeePhi claims that the SDK is capable of accelerating CNNs as well as RNNs, achieving speedups of 1.8x and 19x compared to Application Specific Integrated Circuit (ASIC) and HLS implementations of the same network, respectively, while using 56x less power than the HLS implementation.

The DNNDK toolkit uses a soft-core processor, the Deep Learning Processor Unit (DPU), to accelerate the computationally intensive tasks of DNN algorithms. The DPU is designed to support and accelerate common neural network designs, such as VGG, ResNet, GoogLeNet, YOLO, AlexNet, SSD and SqueezeNet, as well as custom networks. In contrast to OpenVINO, the FPGA image does not occupy the whole FPGA, leaving space for custom HDL code to run alongside the SDK. DNNDK has not been available as a separate tool since September 2020, and there will be no further releases. Xilinx has introduced a new tool, Vitis AI, for the deployment of DNN models.

Vitis AI is Xilinx’s latest development platform for DNN inference on Xilinx hardware such as edge devices and Alveo cards. It comprises tools, well-optimized IP cores, models, libraries and example designs, and follows the same development flow as DNNDK. It is designed with ease of use and efficiency in mind. Vitis AI also uses the Deep Learning Processor Unit (DPU) for AI acceleration. The DPU can be scaled to fit different Xilinx hardware, including Zynq®-7000 devices, Zynq UltraScale+ MPSoCs and Alveo boards, from edge to cloud, to meet the requirements of many diverse applications.

Lattice sensAI is a full-featured stack from Lattice Semiconductor that helps to evaluate, develop and deploy machine learning models on Lattice FPGAs. It supports popular frameworks like Caffe, TensorFlow and Keras, and includes IP cores specially designed to accelerate CNN models. The stack is aimed at machine learning solutions that are easy to implement, highly flexible, small and low power.

FPGA Families Targeted for AI Acceleration:

FPGA vendors have optimized their FPGA families to specifically target AI Acceleration.

  • Intel® Stratix® 10 NX FPGA is Intel’s first AI-optimized FPGA. It embeds a new type of AI-optimized block, the AI Tensor Block, tuned for common matrix-matrix or vector-matrix multiplications.
  • Intel® Agilex™ FPGAs and SoCs deliver up to 40 percent higher performance or up to 40 percent lower power for applications in the data centre, networking, and edge compute.
  • Xilinx SoCs are well suited to AI applications. They integrate a processor for software programmability and an FPGA for hardware programmability, providing scalability, flexibility and performance. They include the cost-effective Zynq 7000 SoC and the high-end Zynq UltraScale+ MPSoC and Zynq UltraScale+ RFSoC.
  • Lattice Semiconductor provides FPGAs for machine learning applications which are easy to implement, low power and highly flexible. Their hardware platforms include iCE40 UltraPlus FPGA, ECP5 FPGA and CrossLink-NX.
  • Microchip has the PolarFire SoC, which is suitable for reliable, secure and power-efficient computation in Artificial Intelligence/Machine Learning (AI/ML), industrial automation, imaging and the Internet of Things (IoT).

Summary: 

FPGAs are now widely used in data centres for offloading GPU-based and CPU-based inference engines. These are early days in the definition, expansion and deployment of such capabilities, from targeted FPGAs through model development and optimization frameworks to the ecosystem of supported libraries. A rapid acceleration of FPGA capabilities is envisaged over the next five years to tackle a plethora of applications that could be deployed in the real world.

Read Part 1 here…

This blog originally appeared on Ignitarium.com’s Blog Page.
