Part 4: ANN Processor

Objective

After you complete this tutorial, you should be able to:

  • Understand how to design a simple ANN accelerator from HW to SW.

Source Code

This repository contains all of the code required to follow this tutorial.

Learn More

This project-based online course offers practical insights into designing AI accelerators, specifically a CNN for handwritten digit classification. The course focuses on the system design level, showing how to integrate a CNN module (written in Verilog RTL) with an application processor running Linux. The final result is a web application that takes a handwritten digit and sends the data to the CNN accelerator on the FPGA for processing. On average, the accelerator achieves a 12x speedup over the CPU.

FPGA Project: CNN Accelerator for Digit Recognition: https://www.udemy.com/course/fpga-project-cnn-accelerator-for-digit-recognition/?referralCode=60E47BBAD02232833118

1. Introduction

1.1. What is ANN

In machine learning, a neural network (also called an artificial neural network, abbreviated ANN or NN) is a mathematical model inspired by the structure and function of biological neural networks in human brains.

A NN consists of connected units or nodes called artificial neurons, which loosely model the neurons in a brain. Figure 1(a) shows a neuron in the human brain, and Figure 1(b) shows an artificial neuron. An artificial neuron consists of inputs x, weights w, and an output y.

In the human brain, a neuron can connect to more than one neuron as shown in Figure 1(c). This is the same for the artificial neuron in a NN as shown in Figure 1(d). A NN consists of multiple layers, and each layer consists of multiple neurons.

Every neuron in a NN performs a mathematical computation, expressed by the following equation, where f is an activation function:

y_j=f\left(\sum_{i=1}^{n} x_iw_i\right)
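As an illustration, the neuron equation can be sketched in a few lines of Python. This is a behavioral model only, not part of the Verilog design; the sigmoid activation and the example input/weight values are assumptions for demonstration:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation f(z) = 1 / (1 + e^-z), used later in this tutorial."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w):
    """One artificial neuron: weighted sum of inputs, then activation."""
    return sigmoid(np.dot(x, w))

# Example: four inputs x_1..x_4 with illustrative weights (not from the model).
y = neuron(np.array([2.0, 10.0, 5.0, 3.0]), np.array([0.1, 0.2, -0.3, 0.05]))
```

The same dot-product structure is what the systolic array in Section 3 computes in hardware, one multiply-accumulate per processing element.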

For the whole NN, there are two main steps, which are forward propagation and backward propagation.

  • In forward propagation, we do the mathematical calculation from the input layer through to the output layer to get a prediction. The forward propagation process is also called inference.

  • In backward propagation, we compare the prediction from the forward propagation process with the true values, calculate a loss score using a loss function, and then use this loss score to update the weights with an optimizer. The backward propagation process is also called training.

An untrained NN starts with random weights and can't make any predictions. So, the goal of training is to obtain trained weights that can predict the output correctly in the inference process.

In this tutorial, we are going to focus on how to accelerate the forward propagation process on the FPGA as a matrix multiplication process.

1.2. Hardware Accelerator

Hardware accelerators are purpose-built designs that accompany a processor to accelerate specific computations. Since processors are designed to handle a wide range of workloads, their architectures are rarely optimal for any one specific computation.

One example of a hardware accelerator for NNs is the Google Tensor Processing Unit (TPU), shown in Figure 3. The TPU is an application-specific integrated circuit (ASIC) accelerator for NNs, built for Google's own TensorFlow software.

Figure 4 shows the block diagram of the TPU. Its main processing unit is a matrix multiplication unit, which uses a systolic array of 256x256 processing elements (65,536 ALUs in total). In this tutorial, we are going to build something similar in concept to the TPU, but on a much smaller scale, with only a 4x4 systolic array.

2. ANN Model

2.1. Simple ANN Example

In this tutorial, we are going to use a simple NN model as an example: classifying someone's taste in Indonesian food.

The following table shows the dataset of someone's taste in Indonesian food.

| Name | Sour | Sweet | Salty | Spicy | Taste | Label |
|---|---|---|---|---|---|---|
| Sate Maranggi | 2 | 10 | 5 | 3 | Like | [1,0] |
| Soto | 7 | 2 | 3 | 3 | Dislike | [0,1] |
| Karedok | 6 | 8 | 1 | 6 | Dislike | [0,1] |
| Gudeg | 3 | 10 | 3 | 1 | Like | [1,0] |
| Ikan Bakar | 6 | 9 | 5 | 6 | Like | [1,0] |
| Rendang | 3 | 2 | 6 | 10 | Dislike | [0,1] |

  • There are six types of food.

  • A man eats these six foods and rates the level of sourness, sweetness, saltiness, and spiciness of each.

  • After rating the sourness k_1, sweetness k_2, saltiness k_3, and spiciness k_4, he decides which foods he likes and which he dislikes.

  • The foods he likes are labeled [t_1, t_2] = [1, 0] and the foods he dislikes are labeled [t_1, t_2] = [0, 1].
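The dataset can be written down directly as NumPy arrays, with one column per food, matching the matrix layout used in the hand calculation of Section 2.2:

```python
import numpy as np

# Feature matrix K: one column per food, rows are [sour, sweet, salty, spicy].
# Column order: Sate Maranggi, Soto, Karedok, Gudeg, Ikan Bakar, Rendang.
K = np.array([[2, 7, 6, 3, 6, 3],
              [10, 2, 8, 10, 9, 2],
              [5, 3, 1, 3, 5, 6],
              [3, 3, 6, 1, 6, 10]], dtype=float)

# Target matrix T: one column [t1, t2] per food; [1, 0] = like, [0, 1] = dislike.
T = np.array([[1, 0, 0, 1, 1, 0],
              [0, 1, 1, 0, 0, 1]], dtype=float)
```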

2.2. ANN Computation

This is the neural network architecture for this application. It consists of one input layer, one hidden layer, and one output layer. It takes an input matrix K_p and produces the final output matrix A_3.

Calculating the NN inference by hand as a sequence of matrix multiplications takes six steps. The ANN processor implements these same steps on the FPGA, so this hand calculation is useful for the verification process.

  1. Padding input:

K_p=\begin{bmatrix} 2 & 7 & 6 & 3 & 6 & 3\\ 10 & 2 & 8 & 10 & 9 & 2\\ 5 & 3 & 1 & 3 & 5 & 6\\ 3 & 3 & 6 & 1 & 6 & 10\\ 1 & 1 & 1 & 1 & 1 & 1\\ \end{bmatrix}

  2. Matrix multiplication hidden layer 1:

WB_2*K_p=Z_2
\begin{bmatrix} -1.2 & 1.3 & 1.7 & -1.3 & -1.3\\ 0.3 & 0.5 & 0.2 & 1 & -1\\ 0.6 & 0.1 & 0.8 & 1.5 & -1\\ 1.3 & -1.2 & -1.4 & 1.3 & -0.9\\ 1.3 & 0.3 & 0.5 & 0.4 & -1\\ \end{bmatrix}* \begin{bmatrix} 2 & 7 & 6 & 3 & 6 & 3\\ 10 & 2 & 8 & 10 & 9 & 2\\ 5 & 3 & 1 & 3 & 5 & 6\\ 3 & 3 & 6 & 1 & 6 & 10\\ 1 & 1 & 1 & 1 & 1 & 1\\ \end{bmatrix}= \begin{bmatrix} 13.9 & -5.9 & -4.2 & 11.9 & 3.9 & -5.1\\ 8.6 & 5.7 & 11 & 6.5 & 12.3 & 12.1\\ 9.7 & 10.3 & 13.2 & 5.7 & 16.5 & 20.8\\ -13.4 & 5.5 & 3.7 & -11.9 & -3.1 & 5.2\\ 8.3 & 11.4 & 12.1 & 7.8 & 14.4 & 10.5\\ \end{bmatrix}

  3. Activation hidden layer 1:

\sigma(Z_2)=A_2
\sigma( \begin{bmatrix} 13.9 & -5.9 & -4.2 & 11.9 & 3.9 & -5.1\\ 8.6 & 5.7 & 11 & 6.5 & 12.3 & 12.1\\ 9.7 & 10.3 & 13.2 & 5.7 & 16.5 & 20.8\\ -13.4 & 5.5 & 3.7 & -11.9 & -3.1 & 5.2\\ 8.3 & 11.4 & 12.1 & 7.8 & 14.4 & 10.5\\ \end{bmatrix} )= \begin{bmatrix} 0.99 & 0.00 & 0.01 & 0.99 & 0.98 & 0.00\\ 0.99 & 0.99 & 0.99 & 0.99 & 0.99 & 0.99\\ 0.99 & 0.99 & 0.99 & 0.99 & 0.99 & 0.99\\ 0.00 & 0.99 & 0.97 & 0.00 & 0.04 & 0.99\\ 0.99 & 0.99 & 0.99 & 0.99 & 0.99 & 0.99\\ \end{bmatrix}

  4. Padding output of hidden layer 1:

A_{2p}= \begin{bmatrix} 0.99 & 0.00 & 0.01 & 0.99 & 0.98 & 0.00\\ 0.99 & 0.99 & 0.99 & 0.99 & 0.99 & 0.99\\ 0.99 & 0.99 & 0.99 & 0.99 & 0.99 & 0.99\\ 0.00 & 0.99 & 0.97 & 0.00 & 0.04 & 0.99\\ 0.99 & 0.99 & 0.99 & 0.99 & 0.99 & 0.99\\ 1 & 1 & 1 & 1 & 1 & 1\\ \end{bmatrix}

  5. Matrix multiplication hidden layer 2:

WB_3*A_{2p}=Z_3
\begin{bmatrix} 5.2 & -0.3 & 0.8 & -3.5 & 0.1 & -1.5\\ -4.8 & 0.1 & 0.7 & 4 & 0.9 & -1.4\\ \end{bmatrix}* \begin{bmatrix} 0.99 & 0.00 & 0.01 & 0.99 & 0.98 & 0.00\\ 0.99 & 0.99 & 0.99 & 0.99 & 0.99 & 0.99\\ 0.99 & 0.99 & 0.99 & 0.99 & 0.99 & 0.99\\ 0.00 & 0.99 & 0.97 & 0.00 & 0.04 & 0.99\\ 0.99 & 0.99 & 0.99 & 0.99 & 0.99 & 0.99\\ 1 & 1 & 1 & 1 & 1 & 1\\ \end{bmatrix}= \begin{bmatrix} 4.29 & -4.37 & -4.23 & 4.29 & 4.04 & -4.34\\ -4.50 & 4.27 & 4.13 & -4.50 & -4.23 & 4.24\\ \end{bmatrix}

  6. Activation hidden layer 2:

\sigma(Z_3)=A_3
\sigma( \begin{bmatrix} 4.29 & -4.37 & -4.23 & 4.29 & 4.04 & -4.34\\ -4.50 & 4.27 & 4.13 & -4.50 & -4.23 & 4.24\\ \end{bmatrix} )= \begin{bmatrix} 0.98 & 0.01 & 0.01 & 0.98 & 0.98 & 0.01\\ 0.01 & 0.98 & 0.98 & 0.01 & 0.01 & 0.98\\ \end{bmatrix}

You can compare the result matrix A_3 with the labels from the dataset: after rounding, each column of A_3 matches the corresponding label [t_1, t_2].

3. Hardware Design

3.1. Basic Processing Elements

There are two basic processing elements for ANN computation: a register and a multiply-accumulate (MAC) unit, also called a processing element (PE).

This code is an implementation of a 16-bit register in Verilog.

This code is an implementation of a MAC or PE in Verilog.
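Since the Verilog sources live in the repository rather than inline here, the behavior of the two elements can be sketched in Python, ignoring the 16-bit width and fixed-point details of the actual RTL (the interface names are assumptions):

```python
class Register:
    """Behavioral model of a register: the output follows the input one clock later."""
    def __init__(self):
        self.q = 0
    def clock(self, d):
        q, self.q = self.q, d  # output the old value, latch the new one
        return q

class MacPE:
    """Processing element: multiply-accumulate into a registered accumulator."""
    def __init__(self):
        self.acc = 0
    def clock(self, a, b, clear=False):
        if clear:              # synchronous clear before accumulating
            self.acc = 0
        self.acc += a * b
        return self.acc
```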

3.2. Systolic Matrix Multiplication

From the basic register and PE modules, we can construct a matrix multiplication module using a systolic architecture.

This code is an implementation of a systolic module in Verilog.

You can verify the matrix multiplication operation of the systolic module using the testbench.
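The same computation can also be cross-checked with a cycle-level Python model of an output-stationary systolic array, where operands enter skewed from the left and top edges and each PE multiply-accumulates while forwarding its inputs right and down. This is a conceptual sketch; the RTL's exact dataflow and skewing may differ:

```python
import numpy as np

def systolic_matmul(A, B):
    """Simulate an N x N output-stationary systolic array computing C = A @ B.
    Rows of A enter from the left and columns of B from the top, each skewed
    by one cycle per row/column so matching operands meet in every PE."""
    N = A.shape[0]
    acc = np.zeros((N, N))    # accumulator held in each PE
    a_reg = np.zeros((N, N))  # horizontally forwarded operand registers
    b_reg = np.zeros((N, N))  # vertically forwarded operand registers
    for t in range(3 * N - 2):  # enough cycles to drain the array
        new_a = np.zeros((N, N))
        new_b = np.zeros((N, N))
        for i in range(N):
            for j in range(N):
                # Edge PEs take skewed inputs; inner PEs take neighbors' registers.
                a_in = a_reg[i, j - 1] if j > 0 else (A[i, t - i] if 0 <= t - i < N else 0.0)
                b_in = b_reg[i - 1, j] if i > 0 else (B[t - j, j] if 0 <= t - j < N else 0.0)
                acc[i, j] += a_in * b_in
                new_a[i, j] = a_in
                new_b[i, j] = b_in
        a_reg, b_reg = new_a, new_b
    return acc
```

For a 4x4 array, `systolic_matmul(A, B)` finishes in 3N-2 = 10 cycles and agrees with `A @ B`.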

Matrix multiplication hidden layer 1:

Verify the result against the model; the decimals may differ slightly due to rounding in the fixed-point implementation.

Z_2= \begin{bmatrix} 13.9 & -5.9 & -4.2 & 11.9 & 3.9 & -5.1\\ 8.6 & 5.7 & 11 & 6.5 & 12.3 & 12.1\\ 9.7 & 10.3 & 13.2 & 5.7 & 16.5 & 20.8\\ -13.4 & 5.5 & 3.7 & -11.9 & -3.1 & 5.2\\ 8.3 & 11.4 & 12.1 & 7.8 & 14.4 & 10.5\\ \end{bmatrix}

Matrix multiplication hidden layer 2:

Verify the result against the model; the decimals may differ slightly due to rounding in the fixed-point implementation.

Z_3= \begin{bmatrix} 4.29 & -4.37 & -4.23 & 4.29 & 4.04 & -4.34\\ -4.50 & 4.27 & 4.13 & -4.50 & -4.23 & 4.24\\ \end{bmatrix}

3.3. Sigmoid LUT

To calculate the sigmoid function, we can use the lookup table method. The following figure illustrates a basic LUT implementation of sigmoid.

This code is an implementation of a sigmoid module in Verilog.
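The LUT approach can be prototyped in Python before committing to RTL: precompute sigmoid over a clamped input range and index the table with the quantized input. The range [-8, 8) and 256 entries here are illustrative assumptions, not the actual module's parameters:

```python
import numpy as np

# Precompute sigmoid over [-8, 8) with 256 entries (table step = 0.0625).
LUT_SIZE = 256
X_MIN, X_MAX = -8.0, 8.0
_lut_x = np.linspace(X_MIN, X_MAX, LUT_SIZE, endpoint=False)
SIGMOID_LUT = 1.0 / (1.0 + np.exp(-_lut_x))

def sigmoid_lut(x):
    """Approximate sigmoid by indexing the precomputed table; inputs are clamped."""
    x = np.clip(x, X_MIN, X_MAX - 1e-9)
    idx = ((x - X_MIN) / (X_MAX - X_MIN) * LUT_SIZE).astype(int)
    return SIGMOID_LUT[idx]
```

With this table size the worst-case error is bounded by the table step times the maximum slope of sigmoid (0.25), i.e. about 0.016, which is plenty for the two-decimal comparison in Section 2.2.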

3.4. ANN Core

We already have the systolic module and the sigmoid module. The next step is to construct the ANN core computation. Additionally, we need block memories to store the weights, biases, inputs, and outputs.

This code is an implementation of the ANN core in Verilog.

You can simulate it with the testbench to obtain the ANN core timing diagram.

How it works:

  1. Start the controller FSM

  2. Read input from memory followed by weight and bias 2

  3. Systolic input stream hidden layer 1

  4. Output stream from sigmoid hidden layer 1

  5. Read weight and bias 3 from memory

  6. Systolic input stream hidden layer 2

  7. Output from sigmoid hidden layer 2

  8. Write output to memory

  9. Done signal
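The nine FSM phases above can be condensed into a behavioral model in which the block memories are plain arrays and each phase becomes one line. This is a conceptual sketch of the dataflow, not the RTL's cycle-accurate behavior; the padding and sigmoid follow the hand calculation of Section 2.2:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ann_core(input_mem, wb2_mem, wb3_mem):
    """Behavioral model of the ANN core: two systolic passes with sigmoid
    activation, mirroring FSM steps 1-9 (read, stream, activate, write)."""
    ones = np.ones((1, input_mem.shape[1]))
    kp = np.vstack([input_mem, ones])   # steps 1-2: read and pad the input
    a2 = sigmoid(wb2_mem @ kp)          # steps 3-4: layer 1 systolic + sigmoid
    a2p = np.vstack([a2, ones])         # steps 5-6: pad, layer 2 systolic input
    a3 = sigmoid(wb3_mem @ a2p)         # step 7: layer 2 sigmoid
    return a3                           # steps 8-9: write output, assert done
```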

3.5. AXI Stream Module

Once we have the ANN core, we need to wrap it with the AXI stream module so that it is compatible with the AXI stream protocol.

This code is an implementation of the AXIS ANN in Verilog.

You can simulate it with the testbench to obtain the AXIS ANN module timing diagram.

3.6. SoC Design

At this point, we already have the AXIS ANN module. Next, create a block design that consists of the Zynq IP, AXI DMA, and AXIS ANN.

Configure the AXI DMA stream width to 128 bits as shown in the following figure.

The AXI DMA will read the weights, biases, and inputs from specific locations in DDR memory, and write the outputs back to DDR as well.

Data mapping inside the DDR memory for weight, bias, and input.

Data mapping inside the DDR memory for output.

4. Software Design

At this point, the required files to program the FPGA are already on the board. The next step is to create Jupyter Notebook files.

  • Open a web browser and open Jupyter Notebook on the board. Create a new file from the New menu: Python 3 (ipykernel).

  • Write the following code to test the design.
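The notebook code itself is in the repository. As a hedged sketch, a typical PYNQ test of such a design looks like the following; the overlay file name `design.bit`, the DMA instance name `axi_dma_0`, the Q8.8 fixed-point format, and the buffer sizes are all assumptions for illustration, not confirmed by this design:

```python
import numpy as np

FRAC_BITS = 8  # assumed fixed-point format: Q8.8

def to_fixed(x, frac_bits=FRAC_BITS):
    """Quantize floats to 16-bit fixed point as the accelerator would consume them."""
    return np.round(np.asarray(x) * (1 << frac_bits)).astype(np.int16)

def from_fixed(x, frac_bits=FRAC_BITS):
    """Convert 16-bit fixed-point results back to floats."""
    return np.asarray(x).astype(np.float64) / (1 << frac_bits)

def run_on_board():
    """Runs only on the PYNQ board; overlay and DMA names are assumptions."""
    from pynq import Overlay, allocate
    ol = Overlay("design.bit")
    dma = ol.axi_dma_0
    in_buf = allocate(shape=(64,), dtype=np.int16)   # packed weights, biases, inputs
    out_buf = allocate(shape=(16,), dtype=np.int16)  # A3 results
    in_buf[:] = 0  # fill with data packed per the DDR mapping figures
    dma.sendchannel.transfer(in_buf)
    dma.recvchannel.transfer(out_buf)
    dma.sendchannel.wait()
    dma.recvchannel.wait()
    print(from_fixed(out_buf))
```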

5. Performance

We can compare the performance of the HW-based ANN computation with SW computation using this code.
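As a sketch, the CPU-side baseline can be timed like this; the matrices mirror the shapes of Section 2.2 (timing the accelerator would bracket the DMA transfer calls in the same way):

```python
import time
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sw_inference(wb2, wb3, k):
    """Pure-software forward pass: pad, matmul, sigmoid, twice."""
    ones = np.ones((1, k.shape[1]))
    a2 = sigmoid(wb2 @ np.vstack([k, ones]))
    return sigmoid(wb3 @ np.vstack([a2, ones]))

# Shapes as in Section 2.2; values here are placeholders for timing only.
wb2 = np.ones((5, 5))
wb3 = np.ones((2, 6))
k = np.ones((4, 6))

# Average over many repetitions so the measurement dominates timer overhead.
n = 1000
start = time.perf_counter()
for _ in range(n):
    a3 = sw_inference(wb2, wb3, k)
sw_time = (time.perf_counter() - start) / n
print(f"SW inference: {sw_time * 1e6:.1f} us per run")
```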
