Part 8: Hardware Accelerator for Neural Networks

Objective

This tutorial describes how to build a neural network accelerator. The design process starts with modeling in MATLAB, continues with building the RTL modules, and ends with integration into the SoC.

Source Code

This repository contains all the code required to follow this tutorial.

References

Learn More

This project-based online course offers practical insights into designing AI accelerators, specifically a CNN for handwritten digit classification. The focus of the course is on the system design level, showing how to integrate a CNN module (written in Verilog RTL) with an application processor running Linux. The final result of this project is a web application that captures a handwritten digit and sends the data to the CNN accelerator on the FPGA for processing. On average, a speedup factor of 12x is achieved by using this accelerator compared to the CPU.

FPGA Project: CNN Accelerator for Digit Recognition: https://www.udemy.com/course/fpga-project-cnn-accelerator-for-digit-recognition/?referralCode=60E47BBAD02232833118

1. Overview

1.1. What is a Neural Network?

In machine learning, a neural network (also called an artificial neural network, abbreviated ANN or NN) is a mathematical model inspired by the structure and function of biological neural networks in human brains.

A NN consists of connected units or nodes called artificial neurons, which loosely model the neurons in a brain. Figure 1(a) shows a neuron in the human brain, and Figure 1(b) shows an artificial neuron. An artificial neuron consists of inputs $x$, weights $w$, and an output $y$.

In the human brain, a neuron can connect to more than one neuron as shown in Figure 1(c). This is the same for the artificial neuron in a NN as shown in Figure 1(d). A NN consists of multiple layers, and each layer consists of multiple neurons.

Every neuron in a NN performs the same mathematical computation, expressed by the following equation.

$$y_j=f\left(\sum_{i=1}^{n} x_iw_i\right)$$
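As a concrete illustration, this neuron computation can be sketched in a few lines of Python (a hypothetical example for clarity; it is not part of the tutorial's source code):

```python
import math

def neuron(x, w, f=lambda s: 1.0 / (1.0 + math.exp(-s))):
    """One artificial neuron: weighted sum of the inputs followed by an
    activation function f (here a sigmoid, as used later in this tutorial)."""
    s = sum(xi * wi for xi, wi in zip(x, w))
    return f(s)

# Two inputs whose weighted contributions cancel: the sum is 0, so the
# sigmoid returns exactly 0.5.
y = neuron([1.0, -1.0], [0.5, 0.5])
```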

For the whole NN, there are two main steps, which are forward propagation and backward propagation.

  • In forward propagation, we do the mathematical calculation from the input layer until the output layer to get a prediction. The forward propagation process is also called inference.

  • In backward propagation, we compare the prediction from forward propagation with the true values and calculate a loss score using a loss function. We then use this loss score to update the weights with an optimizer. The backward propagation process is also called training.

An untrained NN starts with random weights and can't make any predictions. So, the goal of training is to obtain trained weights that can predict the output correctly in the inference process.

In this tutorial, we are going to focus on how to accelerate the forward propagation process on the FPGA as a matrix multiplication process.

1.2. Hardware Accelerator

Hardware accelerators are purpose-built designs that accompany a processor to accelerate specific computations. Since processors are designed to handle a wide range of workloads, processor architectures are rarely optimal for any specific computation.

One example of a hardware accelerator for NNs is the Google Tensor Processing Unit (TPU), shown in Figure 3. The TPU is an application-specific integrated circuit (ASIC) accelerator for NNs, designed for Google's own TensorFlow software.

Figure 4 shows the block diagram of the TPU. Its main processing unit is a matrix multiplication unit, a systolic array containing 256x256 processing elements (65,536 ALUs in total). In this tutorial, we are going to build something similar in concept to this TPU at a much smaller scale, with only a 4x4 systolic array.

2. NN Model

2.1. Simple NN Example

In this tutorial, we are going to use an example of a simple NN model from this reference. Let's consider the following example:

  • There are four types of fruits: orange, lemon, pineapple, and Japanese persimmon.

  • A man eats these four types of fruits and rates each fruit's sweetness and sourness on a scale from 0 to 9.

  • After rating the sweetness $k_1$ and sourness $k_2$, he decides which fruits he likes and which fruits he dislikes.

  • So let’s encode the fruits he likes as $[t_1,t_2]=[1,0]$ and the fruits he dislikes as $[t_1,t_2]=[0,1]$.

| Fruits | Sweetness | Sourness | Supervisor | Taste |
| --- | --- | --- | --- | --- |
| Orange | 8 | 8 | [1,0] | Like |
| Lemon | 8 | 5 | [0,1] | Dislike |
| Pineapple | 5 | 8 | [0,1] | Dislike |
| Persimmon | 5 | 5 | [0,1] | Dislike |

So, this is a classification problem. The goal is to classify whether the man will like the fruit or not.

2.2. NN Parameters

In this tutorial, we are only considering the design of forward propagation in hardware as matrix multiplications. Figure 5 shows the neural network model with each parameter in its respective layer. It consists of an input layer, a hidden layer, and an output layer.

Figure 5. Three-layer model with parameters. Source: http://www.lsi-contest.com/2018/shiyou_3-2e.html

The parameters are defined as follows:

  • $k_i$ is the input signal

  • $w^2_{i\;j}$ is the weight from the input layer to the hidden layer

  • $b^2_i$ is the bias for the hidden layer

  • $z^2_i$ is the input to the hidden layer

  • $a^2_i$ is the output of the hidden layer

  • $w^3_{i\;j}$ is the weight from the hidden layer to the output layer

  • $b^3_i$ is the bias for the output layer

  • $z^3_i$ is the input to the output layer

  • $a^3_i$ is the output of the output layer

Each $z^2_i$ and $z^3_i$ is the sum of the bias and the products of the weights with the corresponding layer inputs. The calculations are as follows:

$$z^2_i=w^2_{1\;i}k_1+w^2_{2\;i}k_2+b^2_i$$
$$z^3_i=w^3_{1\;i}a^2_1+w^3_{2\;i}a^2_2+w^3_{3\;i}a^2_3+b^3_i$$

$a^2_i$ and $a^3_i$ apply the activation function to $z^2_i$ and $z^3_i$, respectively. Any differentiable function that normalizes its input can be used as the activation function. For example, if we use the sigmoid function, the equations for $a^2_i$ and $a^3_i$ are as follows:

$$a^2_i=\sigma(z^2_i)=\frac{1}{1+e^{-z^2_i}}$$
$$a^3_i=\sigma(z^3_i)=\frac{1}{1+e^{-z^3_i}}$$

For the backpropagation (training) process, please refer to this file.

2.3. Calculation with Matrix Multiplications

The calculation for every layer can be done at once with matrix operations. The following matrices are constructed from the model in Figure 5. It turns out that the bias values can be folded into the weight matrices, as long as we extend the input matrices with an extra row of all ones.

$$\mathbf{WB_2}=\begin{bmatrix}w^2_{1\;1} & w^2_{2\;1} & b^2_{1}\\ w^2_{1\;2} & w^2_{2\;2} & b^2_{2}\\ w^2_{1\;3} & w^2_{2\;3} & b^2_{3}\end{bmatrix},\quad \mathbf{WB_3}=\begin{bmatrix}w^3_{1\;1} & w^3_{2\;1} & w^3_{3\;1} & b^3_{1}\\ w^3_{1\;2} & w^3_{2\;2} & w^3_{3\;2} & b^3_{2}\end{bmatrix}$$

The following are the trained weight values ($\mathbf{W_2}$ and $\mathbf{W_3}$) and bias values ($\mathbf{B_2}$ and $\mathbf{B_3}$) obtained from the MATLAB program after training, combined into $\mathbf{WB_2}$ and $\mathbf{WB_3}$, together with the input values ($\mathbf{K}$).

$$\mathbf{WB_2}=\begin{bmatrix}1.37 & 1.37 & -19.88\\ 0.77 & 0.97 & -0.90\\ 1.05 & 0.64 & -0.89\end{bmatrix},\quad \mathbf{WB_3}=\begin{bmatrix}7.11 & -1.31 & 0.08 & -2.59\\ -7.10 & 1.63 & 1.97 & 0.20\end{bmatrix}$$
$$\mathbf{K}=\begin{bmatrix}8 & 8 & 5 & 5\\ 8 & 5 & 8 & 5\end{bmatrix}$$

Padding the input with a row of ones gives $\mathbf{K_p}$:

$$\mathbf{K_{p}}=\begin{bmatrix}8 & 8 & 5 & 5\\ 8 & 5 & 8 & 5\\ 1 & 1 & 1 & 1\end{bmatrix}$$

The following are the calculations for the hidden layer.

$$\mathbf{Z_2=WB_2*K_p}$$
$$\mathbf{Z_2}=\begin{bmatrix}1.37 & 1.37 & -19.88\\ 0.77 & 0.97 & -0.90\\ 1.05 & 0.64 & -0.89\end{bmatrix} * \begin{bmatrix}8 & 8 & 5 & 5\\ 8 & 5 & 8 & 5\\ 1 & 1 & 1 & 1\end{bmatrix}$$
$$\mathbf{Z_2}=\begin{bmatrix}2.04 & -2.07 & -2.07 & -6.18\\ 13.02 & 10.11 & 10.71 & 7.80\\ 12.63 & 10.71 & 9.48 & 7.56\end{bmatrix}$$
$$\mathbf{A_2}=\frac{1}{1+e^{-\mathbf{Z_2}}}=\begin{bmatrix}0.884 & 0.112 & 0.112 & 0.002\\ 1.000 & 0.999 & 0.999 & 0.999\\ 1.000 & 0.999 & 0.999 & 0.999\end{bmatrix}$$

Padding $\mathbf{A_2}$ with a row of ones gives:

$$\mathbf{A_{2p}}=\begin{bmatrix}0.884 & 0.112 & 0.112 & 0.002\\ 1.000 & 0.999 & 0.999 & 0.999\\ 1.000 & 0.999 & 0.999 & 0.999\\ 1 & 1 & 1 & 1\end{bmatrix}$$

The following are the calculations for the output layer.

$$\mathbf{Z_3=WB_3*A_{2p}}$$
$$\mathbf{Z_3}=\begin{bmatrix}7.11 & -1.31 & 0.08 & -2.59\\ -7.10 & 1.63 & 1.97 & 0.20\end{bmatrix} * \begin{bmatrix}0.884 & 0.112 & 0.112 & 0.002\\ 1.000 & 0.999 & 0.999 & 0.999\\ 1.000 & 0.999 & 0.999 & 0.999\\ 1 & 1 & 1 & 1\end{bmatrix}$$
$$\mathbf{Z_3}=\begin{bmatrix}2.47 & -3.02 & -3.02 & -3.80\\ -2.48 & 3.00 & 3.00 & 3.78\end{bmatrix}$$
$$\mathbf{A_3}=\frac{1}{1+e^{-\mathbf{Z_3}}}=\begin{bmatrix}0.922 & 0.046 & 0.046 & 0.021\\ 0.077 & 0.952 & 0.952 & 0.977\end{bmatrix}$$
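The whole forward pass above can be checked with a short NumPy sketch (an illustration for verifying the numbers; the variable names are mine, not from the repository):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Trained weights with the bias folded into the last column (Section 2.3)
WB2 = np.array([[1.37, 1.37, -19.88],
                [0.77, 0.97,  -0.90],
                [1.05, 0.64,  -0.89]])
WB3 = np.array([[ 7.11, -1.31, 0.08, -2.59],
                [-7.10,  1.63, 1.97,  0.20]])
K = np.array([[8.0, 8.0, 5.0, 5.0],
              [8.0, 5.0, 8.0, 5.0]])

ones = np.ones((1, K.shape[1]))
A2 = sigmoid(WB2 @ np.vstack([K, ones]))   # hidden layer
A3 = sigmoid(WB3 @ np.vstack([A2, ones]))  # output layer
```

Rounded to three decimals, this A3 agrees with the matrix above to within the rounding of the intermediate values.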

2.4. Comparison with MATLAB

The following is the MATLAB code for the calculations. You can run this file with online MATLAB or Octave compilers.

The result of a3_rounded is the same as the result of the calculation above.

3. Matrix Multiplication Module

3.1. Systolic Array Matrix Multiplication

Matrix multiplication is a very common operation in engineering and scientific problems. A sequential implementation of this operation is very time-consuming for large matrices. In hardware design, matrix multiplication can be parallelized using a systolic array. The 2D systolic array forms the heart of the Matrix Multiply Unit (MXU) on the Google TPU and of Xilinx's deep-learning FPGAs.

Figure 6. A processing element (PE)

A systolic array consists of multiple processing elements (PEs) and registers. A PE consists of a multiplier and an adder, as shown in Figure 6. An example of a 4x4 systolic array is shown in Figure 7. The illustration shows how to multiply two 4x4 matrices, $\mathbf{A}$ and $\mathbf{B}$.

$$\mathbf{Y=A*B}=\begin{bmatrix}1 & 2 & 3 & 4\\ 5 & 6 & 7 & 8\\ 9 & 10 & 11 & 12\\ 13 & 14 & 15 & 16\end{bmatrix} * \begin{bmatrix}1 & 2 & 3 & 4\\ 5 & 6 & 7 & 8\\ 9 & 10 & 11 & 12\\ 13 & 14 & 15 & 16\end{bmatrix}=\begin{bmatrix}90 & 100 & 110 & 120\\ 202 & 228 & 254 & 280\\ 314 & 356 & 398 & 440\\ 426 & 484 & 542 & 600\end{bmatrix}$$

The input $\mathbf{A}$ is called the moving input, and the input $\mathbf{B}$ is called the stationary input. Every clock cycle, the elements of $\mathbf{A}$ enter the systolic array diagonally. The elements of the output $\mathbf{Y}$ then leave the systolic array diagonally, one step per clock cycle.

Figure 7. An example of a 4x4 systolic array

This animation shows how the systolic array multiplies these 4x4 matrices step-by-step.

Figure 8. The step-by-step processing of 4x4 matrices in the systolic array
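The same step-by-step behavior can be modeled in software. The following Python sketch is a behavioral model of the array, with $\mathbf{B}$ held stationary and $\mathbf{A}$ moving in diagonally (an illustration of the concept, not a translation of the Verilog):

```python
def systolic_matmul(A, B):
    """Cycle-by-cycle model of an n x n systolic array: B is held
    stationary in the PEs, rows of A stream in diagonally from the left,
    partial sums flow down each column, and the results leave the bottom
    edge diagonally, one column per cycle."""
    n = len(A)
    a_reg = [[0] * n for _ in range(n)]  # A values moving right
    s_reg = [[0] * n for _ in range(n)]  # partial sums moving down
    Y = [[0] * n for _ in range(n)]
    for t in range(3 * n - 1):
        # results leave the bottom row, skewed one cycle per column
        for j in range(n):
            i = t - n - j
            if 0 <= i < n:
                Y[i][j] = s_reg[n - 1][j]
        # update PEs bottom-right first so each read sees last cycle's value
        for k in reversed(range(n)):
            for j in reversed(range(n)):
                if j > 0:
                    a_in = a_reg[k][j - 1]
                else:  # diagonal injection: A[i][k] enters row k at t = i + k
                    a_in = A[t - k][k] if 0 <= t - k < n else 0
                s_in = s_reg[k - 1][j] if k > 0 else 0
                s_reg[k][j] = s_in + a_in * B[k][j]  # the PE's multiply-add
                a_reg[k][j] = a_in
    return Y

A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
Y = systolic_matmul(A, A)  # reproduces the Y matrix computed above
```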

This is the Verilog implementation of the 4x4 systolic array. In this implementation, registers are also placed in front of the inputs so that the input pattern no longer needs to be applied diagonally. This simplifies the control process.

To test the systolic module, we can use this testbench.

Figure 9 shows the simulation waveform of the systolic computation for the 4x4 matrix multiplication.

Figure 9. Simulation waveform of the systolic computation for the 4x4 matrix multiplication

3.2. Process NN with Systolic Array

Our NN works with decimal numbers. In the hardware, we use a fixed-point representation for these decimal numbers. Q notation is used to specify the parameters of a binary fixed-point number format.

In this design, we use Q5.10. It means that the fixed-point numbers have:

  • 5 bits for the integer part,

  • 10 bits for the fraction part, and

  • 1 bit for the sign.

So, the total word width is 16 bits.

In the systolic Verilog module, the number of fraction bits can be changed through the FRAC_BIT parameter.
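The quantization itself can be sketched in Python (the helper names are hypothetical, not functions from the repository):

```python
FRAC_BIT = 10  # number of fraction bits, as in the Verilog parameter

def to_q5_10(x):
    """Quantize a real number to signed Q5.10 (1 sign + 5 integer +
    10 fraction bits = 16 bits), saturating at the format's limits."""
    q = round(x * (1 << FRAC_BIT))
    return max(-(1 << 15), min((1 << 15) - 1, q))

def from_q5_10(q):
    """Recover the real value represented by a Q5.10 integer."""
    return q / (1 << FRAC_BIT)

# -19.88 is the largest-magnitude weight in the model; it still fits,
# since Q5.10 covers roughly -32.0 to +31.999.
q = to_q5_10(-19.88)
```

After quantization, -19.88 is stored as -20357 and reads back as about -19.8799, an error below the 1/1024 resolution.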

We can test the systolic module with the actual matrix values from the NN model using this testbench.

Here is the result of the matrix multiplication for the hidden layer, $\mathbf{Z_2=WB_2*K_p}$.

Figure 10. Simulation waveform for the hidden layer

Then, this is the result of the matrix multiplication for the output layer, $\mathbf{Z_3=WB_3*A_{2p}}$.

Figure 11. Simulation waveform for the output layer

4. Sigmoid Module

To calculate the sigmoid function, we use a look-up table (LUT) approximation. The module accepts inputs ranging from -8.00000 to 7.93750; any input outside this range is saturated to the nearest end of the range. This is the sigmoid LUT module in Verilog.
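In software terms, such a LUT-based approximation might look like the sketch below (the 256-entry table and the 1/16 input step are assumptions for illustration; the actual Verilog module defines its own table contents):

```python
import math

STEP = 1.0 / 16  # assumed input resolution: 256 entries over [-8.0, 7.9375]
LUT = [1.0 / (1.0 + math.exp(-(-8.0 + i * STEP))) for i in range(256)]

def sigmoid_lut(x):
    """Approximate sigmoid: saturate x into [-8.0, 7.9375], then look up
    the precomputed entry for the nearest step at or below x."""
    x = max(-8.0, min(7.9375, x))
    idx = int((x + 8.0) / STEP)
    return LUT[min(idx, 255)]
```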

5. NN Module

5.1. Control and Datapath

Now, we already have systolic and sigmoid blocks. The next step is to connect these blocks to the memory input and output. We also need a control module to control the flow of data in this system. This is how the data flows:

  1. Input data are read from the BRAM input, then sent to the stationary input of the systolic.

  2. Weight and bias for the hidden layer are read from the BRAM weight, then sent to the moving input of the systolic.

  3. Output from the systolic is processed by the sigmoid module, and then the result is sent to the stationary input of the systolic.

  4. Weight and bias for the output layer are read from the BRAM weight, then sent to the moving input of the systolic.

  5. Output from the systolic is processed by the sigmoid module, and then the result is sent to the BRAM output.

Figure 12. NN module block diagram

A start signal is used to start the NN module. A done signal is used to indicate that the NN computation is finished. During NN computation, the ready signal will be zero, which gives an indication that the NN module is busy.

This is the Verilog implementation of the NN module.

5.2. BRAM Data Map

The following figures show how the data is stored inside the BRAM. The BRAM data width is 64-bit. For weight and bias, the data depth is 8, while for input and output, it is 4.

Figure 13. BRAM weight and bias
Figure 14. BRAM input
Figure 15. BRAM output
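As an illustration of how four 16-bit Q5.10 values can share one 64-bit BRAM word (the lane order chosen here, element 0 in the least-significant bits, is an assumption for the sketch, not read off the figures):

```python
def pack64(values, frac_bit=10):
    """Pack four Q5.10 numbers into a 64-bit word, element 0 in the
    least-significant 16 bits (lane order is an assumption)."""
    word = 0
    for lane, v in enumerate(values):
        q = round(v * (1 << frac_bit)) & 0xFFFF  # 16-bit two's complement
        word |= q << (16 * lane)
    return word

def unpack64(word, frac_bit=10):
    """Split a 64-bit word back into four Q5.10 numbers."""
    out = []
    for lane in range(4):
        q = (word >> (16 * lane)) & 0xFFFF
        if q & 0x8000:       # sign-extend negative values
            q -= 1 << 16
        out.append(q / (1 << frac_bit))
    return out
```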

The following figures show the BRAM value in the Vivado simulation.

Figure 16. BRAM weight and bias in simulation
Figure 17. BRAM input in simulation
Figure 18. BRAM output in simulation

5.3. Timing Diagram

The following figure shows the timing diagram of the NN module. The module starts when the start signal is one. For the duration of the computation, the ready signal is zero, indicating that the NN module is busy.

Figure 19. Timing diagram of NN module

First, the NN module reads the input and the weights for the hidden layer. These are processed by the systolic and sigmoid modules. The result is then processed again with the output layer's weights. Finally, the result is stored in the output BRAM.

6. AXI-Stream Module

6.1. Control and Datapath

Now, we already have the NN module. The next step is to connect it to a standard interface that the CPU can understand. In this case, we use the AXI-Stream interface. This is how the data flows:

  1. Both the weight and the input data are streamed through the S_AXIS port.

  2. The demultiplexer separates which data goes to the weight port and which data goes to the input port of the NN module.

  3. The control unit starts the NN module and waits until it is finished.

  4. The output data is streamed out to the M_AXIS port.

Figure 20. AXIS NN module block diagram

This is the Verilog implementation of the AXIS NN module.

6.2. Timing Diagram

The following figure shows the timing diagram of the AXIS NN module. The control unit starts once seven stream data words have been received.

Figure 21. Timing diagram of AXIS NN module

The AXIS NN control module reads the data temporarily stored in the FIFO and sends it to the NN module. It then starts the NN module and waits until it finishes. The output data is stored temporarily in the FIFO before being sent out through the M_AXIS port.

7. System Design

The following figure shows the overall SoC system design for the NN accelerator. We use the AXI Streaming FIFO IP, which converts memory-mapped data to stream data and vice versa. This is the most basic conversion method. Alternatively, the AXI DMA IP can be used.

Figure 22. Block diagram of NN system

The following figure shows the block design in Vivado.

Figure 23. Block diagram of NN system in Vivado block design

8. Software Design

For the software design, we use the SDK library for AXI Streaming FIFO IP. We need to declare an array for the source and destination. Then, we define TxSend() to send weight and input data to the NN module and RxReceive() to receive output data from the NN module.

The output data is still in fixed-point format, so we convert it to a float by dividing it by 1024 (10 fraction bits).
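For example, the conversion can be sketched as follows (the function name is hypothetical; the tutorial's software simply performs the same division):

```python
def fixed_to_float(q, frac_bit=10):
    """Convert one signed 16-bit Q5.10 sample received from the
    accelerator into a float by dividing by 2**frac_bit = 1024."""
    if q & 0x8000:            # interpret the 16-bit word as two's complement
        q -= 1 << 16
    return q / (1 << frac_bit)

# 0.922 from the output layer is stored as round(0.922 * 1024) = 944
value = fixed_to_float(944)   # approximately 0.9219
```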

9. Result

The following figure shows the result on the serial terminal. The result of the output layer matches our earlier calculation:

$$\mathbf{A_3}=\begin{bmatrix}0.922 & 0.046 & 0.046 & 0.021\\ 0.077 & 0.952 & 0.952 & 0.977\end{bmatrix}$$
Figure 24. NN module output printed on the serial terminal

10. Conclusion

In this tutorial, we covered the complete design of a NN accelerator: modeling the network in MATLAB, building the systolic, sigmoid, and NN modules in Verilog, wrapping them with an AXI-Stream interface, and integrating the accelerator into an SoC together with the corresponding software.
