This tutorial describes how to build a neural network accelerator. The design process starts with modeling in MATLAB, then building the RTL modules, and finally integrating them with the SoC.
Source Code
This repository contains all the code required to follow this tutorial.
In machine learning, a neural network (also called an artificial neural network, abbreviated ANN or NN) is a mathematical model inspired by the structure and function of biological neural networks in human brains.
A NN consists of connected units or nodes called artificial neurons, which loosely model the neurons in a brain. Figure 1(a) shows a neuron in the human brain, and Figure 1(b) shows an artificial neuron. An artificial neuron consists of inputs x, weights w, and an output y.
In the human brain, a neuron can connect to more than one neuron as shown in Figure 1(c). This is the same for the artificial neuron in a NN as shown in Figure 1(d). A NN consists of multiple layers, and each layer consists of multiple neurons.
Every neuron in a NN performs a mathematical computation, expressed by the following equation:
$$y_j = f\left(\sum_{i=1}^{n} x_i w_i\right)$$
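To make this concrete, here is a minimal C sketch of a single neuron's computation, using the sigmoid function (introduced later in this tutorial) as an example for $f$. The names are illustrative:

```c
#include <stdio.h>
#include <math.h>

/* One artificial neuron: y = f(sum of x[i] * w[i]).
 * Here f is the sigmoid, used later in this tutorial. */
static double neuron(const double *x, const double *w, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += x[i] * w[i];            /* weighted sum of the inputs */
    return 1.0 / (1.0 + exp(-sum));    /* activation function f      */
}

int main(void)
{
    double x[3] = {0.5, -1.0, 2.0};    /* example inputs  */
    double w[3] = {0.8,  0.2, -0.5};   /* example weights */
    printf("y = %f\n", neuron(x, w, 3));
    return 0;
}
```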
For the whole NN, there are two main steps, which are forward propagation and backward propagation.
In forward propagation, we do the mathematical calculation from the input layer to the output layer to get a prediction. The forward propagation process is also called inference.
In backward propagation, we compare the prediction from the forward propagation process with the true values. Then, we calculate a loss score using a loss function. After that, we use this loss score to update the weights using an optimizer. The backpropagation process is also called training.
An untrained NN starts with random weights and cannot make meaningful predictions. So, the goal of training is to obtain trained weights that can predict the output correctly in the inference process.
In this tutorial, we are going to focus on how to accelerate the forward propagation process on the FPGA as a matrix multiplication process.
1.2. Hardware Accelerator
Hardware accelerators are purpose-built designs that accompany a processor to accelerate specific computations. Since processors are designed to handle a wide range of workloads, their architectures are rarely optimal for a specific computation.
One example of a hardware accelerator for NNs is the Google Tensor Processing Unit (TPU), shown in Figure 3. The TPU is an application-specific integrated circuit (ASIC) accelerator for NNs, built for Google's own TensorFlow software.
Figure 4 shows the block diagram of the TPU. Its main processing unit is a matrix multiplication unit. It uses a systolic array mechanism that contains 256x256 processing elements (total 65536 ALUs). In this tutorial, we are going to do something similar in concept to this TPU on a smaller scale, with only a 4x4 systolic array.
2. NN Model
2.1. Simple NN Example
In this tutorial, we are going to use an example of a simple NN model from this reference. Let's consider the following example:
There are four types of fruits: orange, lemon, pineapple, and Japanese persimmon.
A man eats these four types of fruits and rates the sweetness and sourness of each fruit on a scale of 0 to 9.
After rating the sweetness $k_1$ and sourness $k_2$, he decides which fruits he likes and which he dislikes.
So let's consider the fruits he likes as $[t_1, t_2] = [1, 0]$ and the fruits he dislikes as $[t_1, t_2] = [0, 1]$.
| Fruits | Sweetness | Sourness | Supervisor | Taste |
|---|---|---|---|---|
| Orange | 8 | 8 | [1, 0] | Like |
| Lemon | 8 | 5 | [0, 1] | Dislike |
| Pineapple | 5 | 8 | [0, 1] | Dislike |
| Persimmon | 5 | 5 | [0, 1] | Dislike |
So, this is a classification problem. The goal is to classify whether the man will like the fruit or not.
2.2. NN Parameters
In this tutorial, we are only considering the design of forward propagation in hardware as matrix multiplications. Figure 5 shows the neural network model with each parameter in its respective layer. It consists of an input layer, a hidden layer, and an output layer.
The definitions of the parameters are as follows:
$k_i$ is the input signal
$w_{ij}^2$ is the weight from the input layer to the hidden layer
$b_i^2$ is the bias for the hidden layer
$z_i^2$ is the input for the hidden layer
$a_i^2$ is the output for the hidden layer
$w_{ij}^3$ is the weight from the hidden layer to the output layer
$b_i^3$ is the bias for the output layer
$z_i^3$ is the input for the output layer
$a_i^3$ is the output for the output layer
The values $z_i^2$ and $z_i^3$ are each the sum of the bias and the weighted inputs from the previous layer. The calculations are as below:

$$z_i^2 = \sum_j w_{ij}^2 k_j + b_i^2 \qquad z_i^3 = \sum_j w_{ij}^3 a_j^2 + b_i^3$$
The values $a_i^2$ and $a_i^3$ are the activation-function outputs for $z_i^2$ and $z_i^3$, respectively. We can use any differentiable, normalizing function as the activation function. For example, if we use the sigmoid function, the equations for $a_i^2$ and $a_i^3$ are as follows:
$$a_i^2 = \sigma(z_i^2) = \frac{1}{1 + e^{-z_i^2}}$$

$$a_i^3 = \sigma(z_i^3) = \frac{1}{1 + e^{-z_i^3}}$$
For the backpropagation (training) process, please refer to these files.
The calculation for every layer can be done at once with matrix operations. The following matrices are constructed from the model in Figure 5. It turns out that the bias values can be folded into the weight matrices, provided we augment the input matrix with an extra row of all ones.
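For example, for the hidden layer, folding the bias into the weight matrix works out as follows (the names $WB^2$ and $K_p$ match the ones used in Section 3.2):

$$Z^2 = W^2 K + B^2 = \underbrace{\begin{bmatrix} W^2 & B^2 \end{bmatrix}}_{WB^2} \underbrace{\begin{bmatrix} K \\ \mathbf{1} \end{bmatrix}}_{K_p} = WB^2 \, K_p$$

where $\mathbf{1}$ is the row of all ones appended to $K$.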
The following are the trained weight values ($W^2$ and $W^3$) and bias values ($B^2$ and $B^3$) obtained from the MATLAB program after training, together with the input values ($K$).
The result of a3_rounded is the same as the result in the previous table.
3. Matrix Multiplication Module
3.1. Systolic Array Matrix Multiplication
The multiplication of matrices is a very common operation in engineering and scientific problems. The sequential implementation of this operation is very time consuming for large matrices. In hardware design, there is an algorithm for processing matrix multiplication using a systolic array. The 2D systolic array forms the heart of the Matrix Multiplier Unit (MXU) on the Google TPU and the new deep learning FPGAs from Xilinx.
A systolic array consists of multiple processing elements (PEs) and registers. A PE consists of a multiplier and an adder, as shown in Figure 6. An example of a 4x4 systolic array is shown in Figure 7. This illustration shows how to multiply two 4x4 matrices, A and B.
The input A is called the moving input, and the input B is called the stationary input. Every clock cycle, the elements of A enter the systolic array diagonally, and the outputs Y likewise come out of the systolic array diagonally.
This animation shows how the systolic array multiplies these 4x4 matrices step-by-step.
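For readers who prefer code to animations, below is a small cycle-based C model of the same idea: a 4x4 systolic array with B (the stationary input) held in the PEs, A (the moving input) streamed in diagonally, and Y drained from the bottom. This is a behavioral sketch for understanding the dataflow, not a model of the exact Verilog module:

```c
#include <stdio.h>
#include <string.h>

#define N 4

int main(void)
{
    /* Example operands; any two 4x4 integer matrices work. */
    int A[N][N] = {{1,2,3,4},{5,6,7,8},{9,10,11,12},{13,14,15,16}};
    int B[N][N] = {{1,0,0,1},{0,1,0,0},{0,0,1,0},{1,0,0,1}};
    int Y[N][N] = {0};

    int a_reg[N][N] = {0}, s_reg[N][N] = {0};  /* per-PE registers */
    int a_nxt[N][N], s_nxt[N][N];

    /* One iteration = one clock cycle; 3N-2 cycles drain the array. */
    for (int t = 0; t <= 3 * (N - 1); t++) {
        for (int k = 0; k < N; k++) {       /* array row, holds B[k][j] */
            for (int j = 0; j < N; j++) {   /* array column             */
                /* Row k receives A[i][k] at cycle t = i + k (diagonal skew). */
                int i = t - k;
                int a_in = (j == 0)
                         ? ((i >= 0 && i < N) ? A[i][k] : 0)
                         : a_reg[k][j - 1];
                int s_in = (k == 0) ? 0 : s_reg[k - 1][j];
                a_nxt[k][j] = a_in;                  /* pass A to the right */
                s_nxt[k][j] = s_in + a_in * B[k][j]; /* accumulate downward */
            }
        }
        memcpy(a_reg, a_nxt, sizeof a_reg);
        memcpy(s_reg, s_nxt, sizeof s_reg);

        /* Column j emits Y[i][j] at the bottom at cycle t = i + (N-1) + j. */
        for (int j = 0; j < N; j++) {
            int i = t - (N - 1) - j;
            if (i >= 0 && i < N)
                Y[i][j] = s_reg[N - 1][j];
        }
    }

    /* Print Y; it equals the ordinary matrix product A*B. */
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++)
            printf("%6d", Y[i][j]);
        printf("\n");
    }
    return 0;
}
```

Running it prints the ordinary matrix product A*B, confirming that the skewed dataflow computes the same result in 3N-2 cycles.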
This is the Verilog implementation of the 4x4 systolic array. In this implementation, I also added registers before the inputs, arranged in such a way that the input pattern no longer needs to be diagonal. This is done to simplify the control process.
Figure 9 shows the simulation waveform of the systolic computation for the 4x4 matrix multiplication.
3.2. Process NN with Systolic Array
Our NN works with decimal numbers. In the hardware, we are going to use fixed-point representation to represent the decimal number. The Q notation is used to specify the parameters of a binary fixed-point number format.
In this design, we use Q5.10. It means that the fixed-point numbers have:
5 bits for the integer part,
10 bits for the fraction part, and
1 bit for the sign.
So, the total width is 16 bits.
In the Verilog module for the systolic array, we can change how many fraction bits the module works with by changing the FRAC_BIT parameter.
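To see what Q5.10 arithmetic does numerically, here is a short C sketch of the conversion and multiplication steps (FRAC_BIT mirrors the module parameter; the actual hardware's rounding/truncation behavior may differ in the last bit):

```c
#include <stdio.h>
#include <stdint.h>

#define FRAC_BIT 10    /* matches the systolic module parameter */

typedef int16_t q5_10; /* 1 sign + 5 integer + 10 fraction bits */

static q5_10 to_fixed(float x) { return (q5_10)(x * (1 << FRAC_BIT)); }
static float to_float(q5_10 x) { return (float)x / (1 << FRAC_BIT); }

static q5_10 fixed_mul(q5_10 a, q5_10 b)
{
    /* A 16x16-bit product has 20 fraction bits; shift back to 10. */
    return (q5_10)(((int32_t)a * (int32_t)b) >> FRAC_BIT);
}

int main(void)
{
    q5_10 a = to_fixed(1.375f); /* 0x0580 */
    q5_10 b = to_fixed(-2.5f);  /* 0xF600 */
    printf("%.4f\n", to_float(fixed_mul(a, b))); /* -3.4375 */
    return 0;
}
```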
We can test the systolic module with the actual matrix values from the NN model using this testbench.
Here is the result of the matrix multiplication for the hidden layer, $Z^2 = WB^2 \cdot K_p$.
Then, this is the result of the matrix multiplication for the output layer, $Z^3 = WB^3 \cdot A_p^2$.
4. Sigmoid Module
To calculate the sigmoid function, we use an approximation with a look-up table (LUT) module. The module can calculate the sigmoid for inputs ranging from -8.00000 to 7.93750. Any input outside this range is saturated to the nearest boundary of the range. This is the sigmoid LUT module in Verilog.
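The LUT contents can be generated offline. Below is a sketch of such a generator in C. It assumes the table is indexed by an 8-bit signed value in Q4.4 format, which gives exactly the stated range of -8.00000 to +7.93750 in steps of 1/16, with Q5.10 outputs; the actual module may organize its table differently:

```c
#include <stdio.h>
#include <stdint.h>
#include <math.h>

#define FRAC_BIT 10   /* output fraction bits (Q5.10) */

int main(void)
{
    for (int idx = 0; idx < 256; idx++) {
        int8_t q44 = (int8_t)idx;         /* reinterpret index as signed Q4.4 */
        double x = q44 / 16.0;            /* fixed-point -> real value        */
        double y = 1.0 / (1.0 + exp(-x)); /* sigmoid                          */
        long   q = lround(y * (1 << FRAC_BIT));
        printf("lut[%3d] = 16'h%04lX; // sigmoid(%+7.4f) = %.4f\n",
               idx, (unsigned long)(q & 0xFFFF), x, y);
    }
    return 0;
}
```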
5. NN Module
Now we have the systolic and sigmoid blocks. The next step is to connect these blocks to the input and output BRAMs. We also need a control module to control the flow of data in this system. This is how the data flows (a C sketch of this sequence appears below):
Input data are read from the BRAM input, then sent to the stationary input of the systolic.
Weight and bias for the hidden layer are read from the BRAM weight, then sent to the moving input of the systolic.
Output from the systolic is processed by the sigmoid module, and then the result is sent to the stationary input of the systolic.
Weight and bias for the output layer are read from the BRAM weight, then sent to the moving input of the systolic.
Output from the systolic is processed by the sigmoid module, and then the result is sent to the BRAM output.
A start signal is used to start the NN module. A done signal is used to indicate that the NN computation is finished. During NN computation, the ready signal will be zero, which gives an indication that the NN module is busy.
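As an illustration of this control flow and handshake, here is a hypothetical C sketch of the sequence as a state machine. The state names are invented for illustration; the actual Verilog control module's states and timing differ:

```c
#include <stdio.h>

typedef enum {
    S_IDLE,           /* wait for the start signal                        */
    S_LOAD_INPUT,     /* input BRAM -> systolic stationary input          */
    S_HIDDEN_LAYER,   /* weight BRAM -> systolic moving input             */
    S_HIDDEN_SIGMOID, /* systolic -> sigmoid -> systolic stationary input */
    S_OUTPUT_LAYER,   /* weight BRAM -> systolic moving input             */
    S_OUTPUT_SIGMOID, /* systolic -> sigmoid -> output BRAM               */
    S_DONE            /* assert done, then return to idle                 */
} nn_state_t;

int main(void)
{
    nn_state_t state = S_IDLE;
    int start = 1, done = 0, ready = 1;

    while (!done) {
        switch (state) {
        case S_IDLE:           if (start) { ready = 0; state = S_LOAD_INPUT; } break;
        case S_LOAD_INPUT:     state = S_HIDDEN_LAYER;   break;
        case S_HIDDEN_LAYER:   state = S_HIDDEN_SIGMOID; break;
        case S_HIDDEN_SIGMOID: state = S_OUTPUT_LAYER;   break;
        case S_OUTPUT_LAYER:   state = S_OUTPUT_SIGMOID; break;
        case S_OUTPUT_SIGMOID: state = S_DONE;           break;
        case S_DONE:           done = 1; ready = 1;      break;
        }
    }
    printf("done=%d, ready=%d\n", done, ready);
    return 0;
}
```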
This is the Verilog implementation of the NN module.
The following figures show how the data is stored inside the BRAM. The BRAM data width is 64-bit. For weight and bias, the data depth is 8, while for input and output, it is 4.
The following figures show the BRAM value in the Vivado simulation.
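Since each 64-bit BRAM word packs four 16-bit Q5.10 values, it can help to see the packing in code. The sketch below assumes element 0 occupies the least-significant 16 bits, which is consistent with how helloworld.c in Section 8 decodes the output words:

```c
#include <stdio.h>
#include <stdint.h>

/* Pack four 16-bit Q5.10 values into one 64-bit BRAM word,
 * least-significant element first (an assumption consistent
 * with the decoding in helloworld.c). */
static uint64_t pack4(uint16_t v0, uint16_t v1, uint16_t v2, uint16_t v3)
{
    return ((uint64_t)v3 << 48) | ((uint64_t)v2 << 32) |
           ((uint64_t)v1 << 16) |  (uint64_t)v0;
}

int main(void)
{
    /* 8.0 and 5.0 in Q5.10 are 0x2000 and 0x1400, so the sweetness
     * row [8, 8, 5, 5] packs into one input word. */
    uint64_t word = pack4(0x2000, 0x2000, 0x1400, 0x1400);
    printf("0x%016llX\n", (unsigned long long)word); /* 0x1400140020002000 */
    return 0;
}
```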
5.3. Timing Diagram
The following figure shows the timing diagram of the NN module. The module starts when the start signal is one. For the duration of the computation, the ready signal is zero, indicating that the NN module is busy.
First, the NN module reads the input and the hidden layer's weights, which are processed by the systolic and sigmoid modules. The result is then processed again with the output layer's weights. Finally, the final result is stored in the output BRAM.
6. AXI-Stream Module
6.1. Control and Datapath
Now we have the NN module. The next step is to connect it to a standard interface that the CPU can understand. In this case, we use the AXI-Stream interface. This is how the data flows:
Both the weight and the input data are streamed through the S_AXIS port.
The demultiplexer separates which data goes to the weight port and which data goes to the input port of the NN module (a sketch of this rule follows the list).
The control unit starts the NN module and waits until it is finished.
The output data is streamed out to the M_AXIS port.
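As a rough model of the routing rule, the sketch below splits a seven-word packet into five weight/bias words followed by two input words; these counts are taken from the software side (INPUT_LEN = 7 in helloworld.c), while the actual demultiplexer implements the same rule with a counter in hardware:

```c
#include <stdio.h>
#include <stdint.h>

#define WEIGHT_WORDS 5   /* weight + bias words per packet */
#define INPUT_WORDS  2   /* input words per packet         */

/* Route the first five words to the weight BRAM and the rest
 * to the input BRAM (a hypothetical model of the demultiplexer). */
static void demux(const uint64_t *stream,
                  uint64_t *weight_bram, uint64_t *input_bram)
{
    for (int i = 0; i < WEIGHT_WORDS + INPUT_WORDS; i++) {
        if (i < WEIGHT_WORDS)
            weight_bram[i] = stream[i];
        else
            input_bram[i - WEIGHT_WORDS] = stream[i];
    }
}

int main(void)
{
    uint64_t stream[7] = {10, 11, 12, 13, 14, 20, 21};
    uint64_t wbram[WEIGHT_WORDS], ibram[INPUT_WORDS];
    demux(stream, wbram, ibram);
    printf("weight[4]=%llu, input[0]=%llu\n",
           (unsigned long long)wbram[4], (unsigned long long)ibram[0]);
    return 0;
}
```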
This is the Verilog implementation of the AXIS NN module.
The following figure shows the timing diagram of the AXIS NN module. The control unit starts when the number of received data words reaches 7 (five weight/bias words plus two input words).
The AXIS NN control module reads the data temporarily stored in the FIFO and sends it to the NN module. It then starts the NN module and waits until it finishes. The output data is then temporarily stored in the FIFO before being sent to the M_AXIS port.
7. System Design
The following figure shows the overall SoC system design for the NN accelerator. We use the AXI Streaming FIFO IP that converts memory-mapped data to stream data and vice versa. This method is the most basic conversion. Another method that can be used is the AXI DMA IP.
The following figure shows the block design in Vivado.
8. Software Design
For the software design, we use the SDK library for AXI Streaming FIFO IP. We need to declare an array for the source and destination. Then, we define TxSend() to send weight and input data to the NN module and RxReceive() to receive output data from the NN module.
The output data is still in fixed-point format, so we convert it to a float by dividing it by 1024 (10 fraction bits).
helloworld.c
#include <stdio.h>
#include "xparameters.h"
#include "xllfifo.h"
#include "xstatus.h"
#define WORD_SIZE 8 // Size of words in bytes
#define INPUT_LEN 7 // Weight and input
#define OUTPUT_LEN 2 // Output
int Init_XLlFifo(XLlFifo *InstancePtr, u16 DeviceId);
int TxSend(XLlFifo *InstancePtr, u64 *SourceAddr);
int RxReceive(XLlFifo *InstancePtr, u64 *DestinationAddr);
XLlFifo FifoInstance;
u64 SourceBuffer[INPUT_LEN];
u64 DestinationBuffer[OUTPUT_LEN];
float out0[4], out1[4];
int main()
{
// Initialize AXI Stream FIFO IP
Init_XLlFifo(&FifoInstance, XPAR_AXI_FIFO_0_DEVICE_ID);
printf("Initialization success\n");
// Weight
SourceBuffer[0] = 0x0000B07A057A057A;
SourceBuffer[1] = 0x0000FC6603E10314;
SourceBuffer[2] = 0x0000FC70028F0433;
SourceBuffer[3] = 0xF5A30051FAC21C70;
SourceBuffer[4] = 0x00CC07E10685E399;
// Input
SourceBuffer[5] = 0x1400140020002000;
SourceBuffer[6] = 0x1400200014002000;
// Check weight and input
printf("SourceBuffer:\n");
for (int i = 0; i < INPUT_LEN; i++)
printf(" 0x%016llX\n", SourceBuffer[i]);
// Send to NN core
TxSend(&FifoInstance, SourceBuffer);
// Read from NN core
RxReceive(&FifoInstance, DestinationBuffer);
// Check output
printf("DestinationBuffer:\n");
for (int i = 0; i < OUTPUT_LEN; i++)
printf(" 0x%016llX\n", DestinationBuffer[i]);
// Decode output (We use 10 fraction bits, so we divide by 2^10 = 1024)
out0[0] = (u16)((DestinationBuffer[0] & 0x000000000000FFFF)) / 1024.0;
out1[0] = (u16)((DestinationBuffer[1] & 0x000000000000FFFF)) / 1024.0;
out0[1] = (u16)((DestinationBuffer[0] & 0x00000000FFFF0000) >> 16) / 1024.0;
out1[1] = (u16)((DestinationBuffer[1] & 0x00000000FFFF0000) >> 16) / 1024.0;
out0[2] = (u16)((DestinationBuffer[0] & 0x0000FFFF00000000) >> 32) / 1024.0;
out1[2] = (u16)((DestinationBuffer[1] & 0x0000FFFF00000000) >> 32) / 1024.0;
out0[3] = (u16)((DestinationBuffer[0] & 0xFFFF000000000000) >> 48) / 1024.0;
out1[3] = (u16)((DestinationBuffer[1] & 0xFFFF000000000000) >> 48) / 1024.0;
// Print final output
printf("Final NN output:\n");
printf(" [%.3f, %.3f]\n", out0[0], out1[0]);
printf(" [%.3f, %.3f]\n", out0[1], out1[1]);
printf(" [%.3f, %.3f]\n", out0[2], out1[2]);
printf(" [%.3f, %.3f]\n", out0[3], out1[3]);
return 0;
}
int Init_XLlFifo(XLlFifo *InstancePtr, u16 DeviceId)
{
XLlFifo_Config *Config;
int Status;
Config = XLlFfio_LookupConfig(DeviceId); // "XLlFfio" (sic) is the spelling used by the Xilinx driver API
if (!Config)
{
printf("No config found for %d\n", DeviceId);
return XST_FAILURE;
}
Status = XLlFifo_CfgInitialize(InstancePtr, Config, Config->BaseAddress);
if (Status != XST_SUCCESS)
{
printf("Initialization failed\n");
return XST_FAILURE;
}
XLlFifo_IntClear(InstancePtr, 0xffffffff);
Status = XLlFifo_Status(InstancePtr);
if (Status != 0x0)
{
printf("Reset failed\n");
return XST_FAILURE;
}
return XST_SUCCESS;
}
int TxSend(XLlFifo *InstancePtr, u64 *SourceAddr)
{
// Writing into the FIFO transmit buffer
for(int i = 0; i < INPUT_LEN; i++)
if (XLlFifo_iTxVacancy(InstancePtr))
Xil_Out64(InstancePtr->Axi4BaseAddress + XLLF_AXI4_TDFD_OFFSET, *(SourceAddr+i));
// Start transmission by writing transmission length into the TLR
XLlFifo_iTxSetLen(InstancePtr, (INPUT_LEN * WORD_SIZE));
// Check for transmission completion
while (!(XLlFifo_IsTxDone(InstancePtr)));
return XST_SUCCESS;
}
int RxReceive(XLlFifo *InstancePtr, u64* DestinationAddr)
{
u32 ReceiveLength;
u64 RxWord;
int Status;
while (XLlFifo_iRxOccupancy(InstancePtr))
{
// Read receive length
ReceiveLength = XLlFifo_iRxGetLen(InstancePtr) / WORD_SIZE;
// Reading from the FIFO receive buffer
for (int i = 0; i < ReceiveLength; i++)
{
RxWord = Xil_In64(InstancePtr->Axi4BaseAddress + XLLF_AXI4_RDFD_OFFSET);
*(DestinationAddr+i) = RxWord;
}
}
// Check for receive completion
Status = XLlFifo_IsRxDone(InstancePtr);
if (Status != TRUE)
{
printf("Failing in receive complete\n");
return XST_FAILURE;
}
return XST_SUCCESS;
}
9. Result
The following figure shows the result on the serial terminal. The result of the output layer matches our earlier calculation: