Part 8: Hardware Accelerator for Neural Networks
Objective
This tutorial describes how to build a neural network accelerator. The design process starts with modeling in MATLAB, continues with building the RTL modules, and ends with integration into the SoC.
Source Code
This repository contains all of the code required to follow this tutorial.
References
A high Artificial Neural Network: modeling the brain’s exposure to psychedelics, https://becominghuman.ai/a-high-artificial-neural-network-modelling-the-brains-exposure-to-psychedelics-5fa9fb13fa51
Overview of a Neural Network’s Learning Process, https://medium.com/data-science-365/overview-of-a-neural-networks-learning-process-61690a502fa
21st LSI Design Contests-in Okinawa, http://www.lsi-contest.com/2018/shiyou_3-2e.html
Why Systolic Architectures?, https://www.eecs.harvard.edu/~htk/publication/1982-kung-why-systolic-architecture.pdf
Learn More
This project-based online course offers practical insights into designing AI accelerators, specifically a CNN for handwritten digit classification. The course focuses on the system design level, showing how to integrate a CNN module (written in Verilog RTL) with an application processor running Linux. The final result is a web application that takes a handwritten digit and sends it to the CNN accelerator on the FPGA for processing. On average, the accelerator achieves a 12x speedup over the CPU.

FPGA Project: CNN Accelerator for Digit Recognition: https://www.udemy.com/course/fpga-project-cnn-accelerator-for-digit-recognition/?referralCode=60E47BBAD02232833118
1. Overview
1.1. What is a Neural Network?
In machine learning, a neural network (also called an artificial neural network, abbreviated ANN or NN) is a mathematical model inspired by the structure and function of biological neural networks in human brains.
A NN consists of connected units or nodes called artificial neurons, which loosely model the neurons in a brain. Figure 1(a) shows a neuron in the human brain, and Figure 1(b) shows an artificial neuron. An artificial neuron consists of inputs, weights, and an output.
In the human brain, a neuron can connect to more than one neuron as shown in Figure 1(c). This is the same for the artificial neuron in a NN as shown in Figure 1(d). A NN consists of multiple layers, and each layer consists of multiple neurons.

Every neuron in a NN performs a mathematical computation, expressed by the following equation.
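The computation in question is the standard weighted sum followed by an activation function. Written with generic placeholder symbols (the original equation is shown as an image), it is:

```latex
y = f\left(\sum_{i} w_i x_i + b\right)
```

where the $x_i$ are the inputs, the $w_i$ are the weights, $b$ is the bias, and $f$ is the activation function.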
For the whole NN, there are two main steps, which are forward propagation and backward propagation.
In forward propagation, we carry out the mathematical calculation from the input layer through to the output layer to get a prediction. The forward propagation process is also called inference.
In backward propagation, we compare the prediction from forward propagation with the true values and calculate a loss score using a loss function. After that, we use this loss score to update the weights with an optimizer. The backward propagation process is also called training.
An untrained NN starts with random weights and can't make any predictions. So, the goal of training is to obtain trained weights that can predict the output correctly in the inference process.

In this tutorial, we are going to focus on how to accelerate the forward propagation process on the FPGA as a matrix multiplication process.
1.2. Hardware Accelerator
Hardware accelerators are purpose-built designs that accompany a processor to accelerate specific computations. Since processors are designed to handle a wide range of workloads, their architectures are rarely optimal for any one specific computation.
One example of a hardware accelerator for NNs is the Google Tensor Processing Unit (TPU), shown in Figure 3. The TPU is an application-specific integrated circuit (ASIC) built to accelerate NN workloads using Google's own TensorFlow software.

Figure 4 shows the block diagram of the TPU. Its main processing unit is a matrix multiplication unit. It uses a systolic array mechanism that contains 256x256 processing elements (total 65536 ALUs). In this tutorial, we are going to do something similar in concept to this TPU on a smaller scale, with only a 4x4 systolic array.

2. NN Model
2.1. Simple NN Example
In this tutorial, we are going to use an example of a simple NN model from this reference. Let's consider the following example:
There are four types of fruits: orange, lemon, pineapple, and Japanese persimmon.
A man eats these four types of fruits and rates the sweetness and sourness of each on a scale of 0 to 9.
After rating the sweetness and sourness, he decides which fruits he likes and which fruits he dislikes.
So let's encode the fruits he likes as [1, 0] and the fruits he dislikes as [0, 1].

| Fruit | Sweetness | Sourness | Output | Decision |
| --- | --- | --- | --- | --- |
| Orange | 8 | 8 | [1, 0] | Like |
| Lemon | 8 | 5 | [0, 1] | Dislike |
| Pineapple | 5 | 8 | [0, 1] | Dislike |
| Persimmon | 5 | 5 | [0, 1] | Dislike |
So, this is a classification problem. The goal is to classify whether the man will like the fruit or not.
2.2. NN Parameters
In this tutorial, we are only considering the design of forward propagation in hardware as matrix multiplications. Figure 5 shows the neural network model with each parameter in its respective layer. It consists of an input layer, a hidden layer, and an output layer.

The definitions of the parameters are as described below (the symbol names follow the accompanying MATLAB code):
a1 is the input signal
w1 is the weight from the input layer to the hidden layer
b1 is the bias for the hidden layer
z2 is the input for the hidden layer
a2 is the output of the hidden layer
w2 is the weight from the hidden layer to the output layer
b2 is the bias for the output layer
z3 is the input for the output layer
a3 is the output of the output layer
The z2 and z3 values are each the sum of the bias and the products of the weights with their corresponding inputs. The calculations are as below:
The a2 and a3 values are the activation function applied to z2 and z3, respectively. We can use any function that is differentiable and normalizing as the activation function. For example, if we use the sigmoid function, the equations for a2 and a3 are as follows:
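Written out with the parameter names above (a reconstruction from the definitions, since the original equations are shown as images), the two layers compute:

```latex
z_2 = w_1 a_1 + b_1, \qquad a_2 = f(z_2) = \frac{1}{1 + e^{-z_2}} \\
z_3 = w_2 a_2 + b_2, \qquad a_3 = f(z_3) = \frac{1}{1 + e^{-z_3}}
```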
For the backpropagation (training) process, please refer to this file.
2.3. Calculation with Matrix Multiplications
The calculation for every layer can be done at once with matrix operations. The following matrices are constructed from the model in Figure 5. It turns out that the bias values can be folded into the weight matrices, provided we augment the input matrices with an extra row of all ones.
The following are the trained weight values (w1 and w2) and bias values (b1 and b2) obtained from the MATLAB program after training, together with the input values (a1).
Padding the input a1 with a row of ones.
The following are the calculations for the hidden layer.
Padding the a2 with a row of ones.
The following are the calculations for the output layer.
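As a concrete check of the bias-folding trick described above, here is a small NumPy sketch. The weight and bias values are made up for illustration; they are not the trained values from the MATLAB program:

```python
import numpy as np

# Hypothetical 3x2 weight matrix and 3x1 bias (not the trained values)
W = np.array([[0.5, -0.2],
              [0.1,  0.4],
              [-0.3, 0.7]])
b = np.array([[0.1], [0.2], [0.3]])
x = np.array([[8.0], [8.0]])          # one input sample (sweetness, sourness)

# Standard affine form: z = W @ x + b
z_affine = W @ x + b

# Folded form: append b as an extra weight column, pad x with a row of ones
W_aug = np.hstack([W, b])             # 3x3 bias-augmented weight matrix
x_aug = np.vstack([x, [[1.0]]])       # 3x1 input padded with a one
z_folded = W_aug @ x_aug

assert np.allclose(z_affine, z_folded)
```

The extra row of ones multiplies the bias column by exactly 1, so a single matrix multiplication reproduces the affine computation.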
2.4. Comparison with MATLAB
The following is the MATLAB code for the calculations. You can run this file on the online MATLAB or Octave compiler.
The result of a3_rounded is the same as the result in the previous table.
3. Matrix Multiplication Module
3.1. Systolic Array Matrix Multiplication
The multiplication of matrices is a very common operation in engineering and scientific problems. A sequential implementation of this operation is very time-consuming for large matrices. In hardware design, matrix multiplication can be processed efficiently using a systolic array. The 2D systolic array forms the heart of the Matrix Multiply Unit (MXU) on the Google TPU and of recent deep-learning FPGAs from Xilinx.

A systolic array consists of multiple processing elements (PEs) and registers. A PE consists of a multiplier and an adder, as shown in Figure 6. An example of a 4x4 systolic array is shown in Figure 7. This illustration shows how to multiply two 4x4 matrices.
One operand is the moving input, and the other is the stationary input. Every clock cycle, the moving input enters the systolic array diagonally, and the outputs also come out of the systolic array diagonally, one per column per clock cycle.
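To build intuition for this dataflow, here is a cycle-level Python model of a weight-stationary systolic array. This is a sketch for understanding only; the Verilog design in this tutorial differs in detail (for example, it adds input registers so the input pattern need not be diagonal):

```python
import numpy as np

def systolic_matmul(A, B):
    """Cycle-level model of an n x n weight-stationary systolic array.

    PE (k, j) holds the stationary value B[k][j]. The moving input A
    streams in from the left with a diagonal skew (A[i][k] enters row k
    at cycle i + k), and partial sums flow downward, so the results
    emerge from the bottom row diagonally, one column per cycle."""
    n = A.shape[0]
    a_reg = np.zeros((n, n))  # A value latched in each PE this cycle
    p_reg = np.zeros((n, n))  # partial sum latched in each PE
    C = np.zeros((n, n))
    for t in range(3 * n):    # 3n cycles are enough to drain the array
        new_a = np.zeros((n, n))
        new_p = np.zeros((n, n))
        for k in range(n):        # PE row (holds B[k][j])
            for j in range(n):    # PE column
                if j == 0:
                    i = t - k     # skewed injection at the left edge
                    a_val = A[i][k] if 0 <= i < n else 0.0
                else:
                    a_val = a_reg[k][j - 1]   # value from the left neighbour
                p_val = p_reg[k - 1][j] if k > 0 else 0.0  # from the PE above
                new_a[k][j] = a_val
                new_p[k][j] = p_val + a_val * B[k][j]
        for j in range(n):        # results leave the bottom row diagonally
            i = t - (n - 1) - j
            if 0 <= i < n:
                C[i][j] = new_p[n - 1][j]
        a_reg, p_reg = new_a, new_p
    return C
```

For any two 4x4 matrices `A` and `B`, `systolic_matmul(A, B)` matches `A @ B` once the array has fully drained.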

This animation shows how the systolic array multiplies these 4x4 matrices step-by-step.

This is the Verilog implementation of the 4x4 systolic array. In this implementation, I also added registers before the inputs, arranged so that the input pattern no longer needs to be diagonal. This simplifies the control process.
To test the systolic module, we can use this testbench.
Figure 9 shows the simulation waveform of the systolic computation for the 4x4 matrix multiplication.

3.2. Process NN with Systolic Array
Our NN works with decimal numbers. In the hardware, we use a fixed-point representation for these decimal numbers. Q notation is used to specify the parameters of a binary fixed-point number format.
In this design, we use Q5.10. It means that each fixed-point number has:
5 bits for the integer part,
10 bits for the fraction part, and
1 bit for the sign.
So, the total number of bits is 16.
In the Verilog module for the systolic, we can change how many fraction bits the module works with by changing the FRAC_BIT parameter.
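The Q5.10 arithmetic can be modelled in a few lines of Python. This is a behavioural sketch; the rounding and saturation behaviour of the actual Verilog may differ:

```python
FRAC_BIT = 10  # number of fraction bits, as in the Verilog FRAC_BIT parameter

def to_q5_10(x):
    """Quantize a real number to a signed 16-bit Q5.10 word (two's complement)."""
    v = int(round(x * (1 << FRAC_BIT)))
    v = max(-(1 << 15), min((1 << 15) - 1, v))  # saturate to the 16-bit range
    return v & 0xFFFF

def from_q5_10(v):
    """Interpret a 16-bit word as a signed Q5.10 value."""
    if v & 0x8000:               # negative: undo the two's complement
        v -= 1 << 16
    return v / (1 << FRAC_BIT)

def q_mul(a, b):
    """Multiply two Q5.10 words: the raw product has 20 fraction bits,
    so shift right by FRAC_BIT to return to Q5.10."""
    sa = a - (1 << 16) if a & 0x8000 else a
    sb = b - (1 << 16) if b & 0x8000 else b
    return ((sa * sb) >> FRAC_BIT) & 0xFFFF
```

For example, 1.5 encodes to 1536 (1.5 x 1024), and multiplying 1.5 by -2.0 in Q5.10 yields -3.0 after the shift.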
We can test the systolic module with the real matrix value from the NN model with this testbench.
Here is the result of the matrix multiplication for the hidden layer (z2).

Then, this is the result of the matrix multiplication for the output layer (z3).

4. Sigmoid Module
To calculate the sigmoid function, we use an approximation with a look-up table (LUT) module. The module can calculate the sigmoid for inputs ranging from -8.00000 to 7.93750. Any input outside this range is saturated to the nearest end of the range. This is the sigmoid LUT module in Verilog.
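The input range -8.00000 to 7.93750 suggests a step of 1/16, i.e. a 256-entry table. Assuming that (the entry count and indexing are assumptions; check the Verilog for the actual table), a behavioural model of the LUT looks like this:

```python
import math

STEP = 0.0625  # assumed input step: -8.00000 .. 7.93750 -> 256 entries
LUT = [1.0 / (1.0 + math.exp(-(-8.0 + i * STEP))) for i in range(256)]

def sigmoid_lut(x):
    """Approximate the sigmoid by table lookup, saturating out-of-range inputs."""
    x = max(-8.0, min(7.9375, x))   # clamp to the supported input range
    return LUT[int((x + 8.0) / STEP)]
```

With this step size, the worst-case quantization error of the input is half a step (1/32), which is small relative to the Q5.10 output precision.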
5. NN Module
5.1. Control and Datapath
Now, we already have systolic and sigmoid blocks. The next step is to connect these blocks to the memory input and output. We also need a control module to control the flow of data in this system. This is how the data flows:
Input data are read from the BRAM input, then sent to the stationary input of the systolic.
Weight and bias for the hidden layer are read from the BRAM weight, then sent to the moving input of the systolic.
Output from the systolic is processed by the sigmoid module, and then the result is sent to the stationary input of the systolic.
Weight and bias for the output layer are read from the BRAM weight, then sent to the moving input of the systolic.
Output from the systolic is processed by the sigmoid module, and then the result is sent to the BRAM output.
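The five steps above amount to two bias-augmented matrix multiplications with a sigmoid after each. Here is a behavioural NumPy sketch; the function name, weight values, and matrix shapes are illustrative, not the module's actual interface:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nn_forward(x, w1_aug, w2_aug):
    """Behavioural model of the NN module's dataflow: a bias-augmented
    matrix multiplication plus sigmoid for the hidden layer, then the
    same again for the output layer."""
    a1 = np.vstack([x, np.ones((1, x.shape[1]))])     # pad input with ones
    a2 = sigmoid(w1_aug @ a1)                         # hidden layer
    a2 = np.vstack([a2, np.ones((1, a2.shape[1]))])   # pad again for the bias
    a3 = sigmoid(w2_aug @ a2)                         # output layer
    return a3

# Example with made-up weights: 2 inputs, 3 hidden neurons, 2 outputs
x = np.array([[8.0, 8.0, 5.0, 5.0],    # sweetness of the four fruits
              [8.0, 5.0, 8.0, 5.0]])   # sourness of the four fruits
w1_aug = np.full((3, 3), 0.1)          # 3 hidden neurons x (2 inputs + bias)
w2_aug = np.full((2, 4), 0.1)          # 2 outputs x (3 hidden + bias)
a3 = nn_forward(x, w1_aug, w2_aug)
```

Note that in the hardware the second multiplication reuses the same systolic array, with the sigmoid output fed back to its stationary input.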

A start signal is used to start the NN module. A done signal is used to indicate that the NN computation is finished. During NN computation, the ready signal will be zero, which gives an indication that the NN module is busy.
This is the Verilog implementation of the NN module.
5.2. BRAM Data Map
The following figures show how the data is stored inside the BRAM. The BRAM data width is 64-bit. For weight and bias, the data depth is 8, while for input and output, it is 4.
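With 16-bit values and a 64-bit BRAM word, four values fit in each word. Here is a sketch of the packing, assuming the first value occupies the least-significant 16 bits (an assumption; check the data map figures for the actual ordering):

```python
def pack64(vals):
    """Pack four 16-bit words into one 64-bit BRAM word.
    Assumes vals[0] occupies the least-significant 16 bits."""
    assert len(vals) == 4
    word = 0
    for i, v in enumerate(vals):
        word |= (v & 0xFFFF) << (16 * i)
    return word

def unpack64(word):
    """Split a 64-bit BRAM word back into four 16-bit values."""
    return [(word >> (16 * i)) & 0xFFFF for i in range(4)]
```

This is why the input and output BRAMs only need a depth of 4: each 64-bit word holds one full row of a 4x4 matrix.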



The following figures show the BRAM value in the Vivado simulation.



5.3. Timing Diagram
The following figure shows the timing diagram of the NN module. The module starts when the start signal is one. For the duration of the computation, the ready signal is zero, indicating that the NN module is busy.

First, the NN module reads the input and weight for the hidden layer. Then, it is processed with the systolic and sigmoid modules. The result is processed again with the output layer's weight. Finally, the final result is stored in the output BRAM.
6. AXI-Stream Module
6.1. Control and Datapath
Now we have the NN module. The next step is to connect it to a standard interface that the CPU can work with. In this case, we use the AXI-Stream interface. This is how the data flows:
Both the weight and the input data are streamed in through the S_AXIS port. The demultiplexer separates which data goes to the weight port and which data goes to the input port of the NN module.
The control unit starts the NN module and waits until it is finished.
The output data is streamed out through the M_AXIS port.

This is the Verilog implementation of the AXIS NN module.
6.2. Timing Diagram
The following figure shows the timing diagram of the AXIS NN module. The control unit starts once the number of received stream data words reaches 7.

The AXIS NN control module reads the data that is temporarily stored in the FIFO and sends it to the NN module. It then starts the module and waits until it is finished. Finally, the output data is temporarily stored in the FIFO before being sent out through the M_AXIS port.
7. System Design
The following figure shows the overall SoC system design for the NN accelerator. We use the AXI Streaming FIFO IP that converts memory-mapped data to stream data and vice versa. This method is the most basic conversion. Another method that can be used is the AXI DMA IP.

The following figure shows the block design in Vivado.

8. Software Design
For the software design, we use the SDK library for AXI Streaming FIFO IP. We need to declare an array for the source and destination. Then, we define TxSend() to send weight and input data to the NN module and RxReceive() to receive output data from the NN module.
The output data is still in fixed-point format, so we convert it to a float by dividing it by 1024 (10 fraction bits).
9. Result
The following figure shows the result on the serial terminal. The output-layer result matches our earlier calculation:

10. Conclusion
In this tutorial, we covered the complete design of a NN accelerator: modeling the network, building the systolic-array matrix multiplier and sigmoid modules in Verilog, wrapping them with an AXI-Stream interface, and integrating the design into an SoC with supporting software.
Last updated