Part 4: ANN Processor
Objective
After you complete this tutorial, you should be able to:
Understand how to design a simple ANN accelerator from HW to SW.
Source Code
This repository contains all of the code required in order to follow this tutorial.
1. Introduction
1.1. What is ANN
In machine learning, a neural network (also called an artificial neural network, abbreviated ANN or NN) is a mathematical model inspired by the structure and function of biological neural networks in human brains.
A NN consists of connected units or nodes called artificial neurons, which loosely model the neurons in a brain. Figure 1(a) shows a neuron in the human brain, and Figure 1(b) shows an artificial neuron. An artificial neuron consists of inputs x, weights w, and an output y.
In the human brain, a neuron can connect to more than one neuron as shown in Figure 1(c). This is the same for the artificial neuron in a NN as shown in Figure 1(d). A NN consists of multiple layers, and each layer consists of multiple neurons.
Every neuron in a NN performs a mathematical computation, expressed as the following equation.
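A common way to write this computation is shown below. It assumes a bias term b and an activation function f; neither symbol is introduced in the text above, so treat both as assumptions:

```latex
y = f\left(\sum_{i=1}^{n} w_i x_i + b\right), \qquad f(z) = \frac{1}{1 + e^{-z}}
```

Here n is the number of inputs of the neuron. The sigmoid shown for f is the activation function that Section 3.3 implements in hardware.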
For the whole NN, there are two main steps: forward propagation and backward propagation.
In forward propagation, we carry out the calculation from the input layer to the output layer to get a prediction. The forward propagation process is also called inference.
In backward propagation, we compare the prediction from the forward propagation process with the true values. Then, we calculate the loss score using a loss function. After that, we use this loss score to update the weights using an optimizer. The backward propagation process is also called training.
An untrained NN starts with random weights and cannot make useful predictions. The goal of training is therefore to obtain trained weights that predict the output correctly during inference.
In this tutorial, we are going to focus on how to accelerate the forward propagation process on the FPGA as a matrix multiplication process.
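To see why forward propagation reduces to matrix multiplication, the following minimal Python sketch evaluates one layer for several samples at once. The weights, inputs, and function names are made up for illustration and are not taken from the tutorial's repository.

```python
import math

def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))

def matmul(w, x):
    """Multiply matrix w (rows x inner) by matrix x (inner x cols)."""
    return [[sum(w[r][k] * x[k][c] for k in range(len(x)))
             for c in range(len(x[0]))]
            for r in range(len(w))]

def layer_forward(w, x):
    """One layer for all samples at once: activation(W * X)."""
    return [[sigmoid(z) for z in row] for row in matmul(w, x)]

# Hypothetical layer: 2 neurons with 3 inputs each; 2 samples stored as columns.
W = [[0.5, -1.0, 0.25],
     [1.0,  0.0, -0.5]]
X = [[1.0, 2.0],
     [0.0, 1.0],
     [2.0, 4.0]]

A = layer_forward(W, X)
print(A)
```

Each row of W holds the weights of one neuron, so a single matrix product evaluates every neuron for every sample. That product is the operation the accelerator in this tutorial speeds up.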
1.2. Hardware Accelerator
Hardware accelerators are purpose-built designs that accompany a processor to accelerate specific computations. Since processors are designed to handle a wide range of workloads, processor architectures are rarely optimal for any one specific computation.
One example of a hardware accelerator for NNs is the Google Tensor Processing Unit (TPU), shown in Figure 3. The TPU is an application-specific integrated circuit (ASIC) that accelerates NN workloads and is programmed through Google's own TensorFlow software.
Figure 4 shows the block diagram of the TPU. Its main processing unit is a matrix multiplication unit. It uses a systolic array that contains 256x256 processing elements (65,536 ALUs in total). In this tutorial, we are going to build something similar in concept to the TPU on a much smaller scale, with only a 6x6 systolic array.
2. ANN Model
2.1. Simple ANN Example
In this tutorial, we are going to use a simple NN model as an example. The task is to classify a person's taste in Indonesian food.
The following table shows the dataset of that person's taste in Indonesian food.
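Food          | Sourness (k1) | Sweetness (k2) | Saltiness (k3) | Spiciness (k4) | Preference | Label [t1, t2]
Sate Maranggi | 2             | 10             | 5              | 3              | Like       | [1,0]
Soto          | 7             | 2              | 3              | 3              | Dislike    | [0,1]
Karedok       | 6             | 8              | 1              | 6              | Dislike    | [0,1]
Gudeg         | 3             | 10             | 3              | 1              | Like       | [1,0]
Ikan Bakar    | 6             | 9              | 5              | 6              | Like       | [1,0]
Rendang       | 3             | 2              | 6              | 10             | Dislike    | [0,1]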
The dataset covers six foods.
A person tastes each food and rates its sourness, sweetness, saltiness, and spiciness.
After rating the sourness k1, sweetness k2, saltiness k3, and spiciness k4 of each food, the person decides which foods they like and which they dislike.
We encode the foods the person likes as [t1, t2] = [1, 0] and the foods they dislike as [t1, t2] = [0, 1].
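As an illustration, the dataset can be written down in Python as follows. This is a sketch that assumes the four numeric columns follow the order k1 to k4 given above and that samples are stored as matrix columns. The names DATASET, K, and T are chosen here for illustration.

```python
# Taste levels (sourness, sweetness, saltiness, spiciness) and label [t1, t2].
DATASET = {
    "Sate Maranggi": ([2, 10, 5, 3], [1, 0]),
    "Soto":          ([7, 2, 3, 3],  [0, 1]),
    "Karedok":       ([6, 8, 1, 6],  [0, 1]),
    "Gudeg":         ([3, 10, 3, 1], [1, 0]),
    "Ikan Bakar":    ([6, 9, 5, 6],  [1, 0]),
    "Rendang":       ([3, 2, 6, 10], [0, 1]),
}

# Transpose so that every food becomes one column.
# K is 4 x 6 (taste levels x foods), T is 2 x 6 (label entries x foods).
K = [list(col) for col in zip(*(features for features, _ in DATASET.values()))]
T = [list(col) for col in zip(*(label for _, label in DATASET.values()))]

for row in K:
    print(row)
for row in T:
    print(row)
```

Storing one food per column means one matrix product can process all six foods at once.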
2.2. ANN Computation
This is the neural network architecture for this application. It consists of one input layer, one hidden layer, and one output layer. It takes an input matrix Kp and produces the final output matrix A3.
Calculating NN inference by hand with matrix multiplication takes six steps. The ANN processor design implements these steps on the FPGA. The hand calculation is also useful as a reference during verification.
1. Padding the input
2. Matrix multiplication for hidden layer 1
3. Activation for hidden layer 1
4. Padding the output of hidden layer 1
5. Matrix multiplication for hidden layer 2
6. Activation for hidden layer 2
You can compare the result matrix A3 with the labels from the dataset. After rounding, the entries of A3 should match the labels.
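The six steps can be sketched in Python as shown below. The weight values are placeholders, not the trained weights, so A3 will not reproduce the labels here. The layer sizes (5 hidden neurons, 2 outputs) and the use of a row of ones as padding are inferred from the controller in the ANN core listing in Section 3.4. The names WB2, WB3, Z2, A2, A2p, and Z3 are chosen for illustration.

```python
import math

def matmul(w, x):
    """Multiply w (rows x inner) by x (inner x cols)."""
    return [[sum(w[r][k] * x[k][c] for k in range(len(x)))
             for c in range(len(x[0]))] for r in range(len(w))]

def pad_ones(m):
    """Append a row of ones so the last weight column acts as the bias."""
    return [row[:] for row in m] + [[1.0] * len(m[0])]

def activate(z):
    """Apply the sigmoid to every element."""
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in z]

# Input matrix K: one row per taste level (k1..k4), one column per food.
K = [[2, 7, 6, 3, 6, 3],
     [10, 2, 8, 10, 9, 2],
     [5, 3, 1, 3, 5, 6],
     [3, 3, 6, 1, 6, 10]]

# Placeholder weights, NOT the trained ones; the last column is the bias.
WB2 = [[0.1 * ((r + c) % 3 - 1) for c in range(5)] for r in range(5)]
WB3 = [[0.2 * ((r + c) % 2) - 0.1 for c in range(6)] for r in range(2)]

Kp = pad_ones(K)        # 1. padding the input          -> 5 x 6
Z2 = matmul(WB2, Kp)    # 2. matrix multiplication      -> 5 x 6
A2 = activate(Z2)       # 3. activation
A2p = pad_ones(A2)      # 4. padding the hidden output  -> 6 x 6
Z3 = matmul(WB3, A2p)   # 5. matrix multiplication      -> 2 x 6
A3 = activate(Z3)       # 6. activation

for name, m in [("Kp", Kp), ("Z2", Z2), ("A2p", A2p), ("A3", A3)]:
    print(name, len(m), "x", len(m[0]))
```

With the trained weights in place of the placeholders, the same six lines give a floating-point reference for checking the hardware results.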
3. Hardware Design
3.1. Basic Processing Elements
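ANN computation relies on two basic building blocks: a register and a multiply-accumulate (MAC) unit, which serves as the processing element (PE) of the systolic array.
This code is an implementation of a register in Verilog. The width is set by the WIDTH parameter and defaults to 16 bits.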
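module register
    #(
        parameter WIDTH = 16                   // Register width in bits
    )
    (
        input wire                    clk,
        input wire                    rst_n,   // Active-low synchronous reset
        input wire                    en,      // Load enable
        input wire                    clr,     // Active-high synchronous clear
        input wire signed [WIDTH-1:0] d,
        output reg signed [WIDTH-1:0] q
    );

    // Reset and clear take priority over enable; otherwise q holds its value.
    always @(posedge clk)
    begin
        if (!rst_n || clr)
        begin
            q <= 0;
        end
        else if (en)
        begin
            q <= d;
        end
    end

endmodule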
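This code is an implementation of the MAC unit, or PE, in Verilog. With WIDTH = 16 and FRAC_BIT = 10, values are 16-bit signed fixed-point numbers with 10 fractional bits.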
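module pe
    #(
        parameter WIDTH = 16,      // Word width in bits
        parameter FRAC_BIT = 10    // Number of fractional bits
    )
    (
        input wire signed [WIDTH-1:0]  a_in,   // Moving input
        input wire signed [WIDTH-1:0]  y_in,   // Partial sum from the previous PE
        input wire signed [WIDTH-1:0]  b,      // Stationary input
        output wire signed [WIDTH-1:0] a_out,  // Moving input, passed through
        output wire signed [WIDTH-1:0] y_out   // Updated partial sum
    );

    // Full-precision product: 2*WIDTH bits with 2*FRAC_BIT fractional bits
    wire signed [WIDTH*2-1:0] y_out_i;

    assign a_out = a_in;
    assign y_out_i = a_in * b;
    // Drop the FRAC_BIT lowest bits to rescale the product, then accumulate
    assign y_out = y_in + y_out_i[WIDTH+FRAC_BIT-1:FRAC_BIT];

endmodule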
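The fixed-point arithmetic of the PE can be reproduced in software to predict what the hardware should output. The following Python sketch mirrors the pe module bit for bit for the default parameters. The helper names wrap, to_fixed, and to_float are introduced here for illustration only.

```python
WIDTH = 16
FRAC_BIT = 10

def wrap(v, width=WIDTH):
    """Wrap an integer to a signed two's-complement value of `width` bits."""
    v &= (1 << width) - 1
    return v - (1 << width) if v >= (1 << (width - 1)) else v

def to_fixed(x):
    """Convert a real number to a signed fixed-point integer."""
    return wrap(int(round(x * (1 << FRAC_BIT))))

def to_float(v):
    """Convert a signed fixed-point integer back to a real number."""
    return v / (1 << FRAC_BIT)

def pe(a_in, y_in, b):
    """Model of the pe module: returns (a_out, y_out)."""
    product = a_in * b                        # 2*WIDTH-bit product
    term = wrap(product >> FRAC_BIT)          # bits [WIDTH+FRAC_BIT-1:FRAC_BIT]
    return a_in, wrap(y_in + term)

# 1.0 is the constant 16'b0000010000000000 used in the ANN core.
print(bin(to_fixed(1.0)))

# 0.5 + 1.5 * (-2.25) is exactly representable.
_, y = pe(to_fixed(1.5), to_fixed(0.5), to_fixed(-2.25))
print(to_float(y))   # -2.875

# 0.1 * 0.1 is not: quantization and truncation give 10/1024, not 0.01.
_, y = pe(to_fixed(0.1), 0, to_fixed(0.1))
print(to_float(y))   # 0.009765625
```

The last case shows why hardware results can differ from a floating-point model in the low fractional digits.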
3.2. Systolic Matrix Multiplication
From the basic register and PE modules, we can construct a matrix multiplication module using a systolic architecture.
This code is an implementation of a systolic module in Verilog.
module systolic
#(
parameter WIDTH = 16,
parameter FRAC_BIT = 10
)
(
input wire clk,
input wire rst_n,
input wire en,
input wire clr,
input wire signed [WIDTH-1:0] a0, a1, a2, a3, a4, a5,
input wire in_valid,
input wire signed [WIDTH-1:0] b00, b01, b02, b03, b04, b05,
input wire signed [WIDTH-1:0] b10, b11, b12, b13, b14, b15,
input wire signed [WIDTH-1:0] b20, b21, b22, b23, b24, b25,
input wire signed [WIDTH-1:0] b30, b31, b32, b33, b34, b35,
input wire signed [WIDTH-1:0] b40, b41, b42, b43, b44, b45,
input wire signed [WIDTH-1:0] b50, b51, b52, b53, b54, b55,
output wire signed [WIDTH-1:0] y0, y1, y2, y3, y4, y5,
output wire out_valid
);
// *** Input registers ***
wire signed [WIDTH-1:0] a0_reg0;
wire signed [WIDTH-1:0] a1_reg0, a1_reg1;
wire signed [WIDTH-1:0] a2_reg0, a2_reg1, a2_reg2;
wire signed [WIDTH-1:0] a3_reg0, a3_reg1, a3_reg2, a3_reg3;
wire signed [WIDTH-1:0] a4_reg0, a4_reg1, a4_reg2, a4_reg3, a4_reg4;
wire signed [WIDTH-1:0] a5_reg0, a5_reg1, a5_reg2, a5_reg3, a5_reg4, a5_reg5;
// *** a in ***
wire signed [WIDTH-1:0] a00_in, a01_in, a02_in, a03_in, a04_in, a05_in,
a10_in, a11_in, a12_in, a13_in, a14_in, a15_in,
a20_in, a21_in, a22_in, a23_in, a24_in, a25_in,
a30_in, a31_in, a32_in, a33_in, a34_in, a35_in,
a40_in, a41_in, a42_in, a43_in, a44_in, a45_in,
a50_in, a51_in, a52_in, a53_in, a54_in, a55_in;
// *** y in ***
wire signed [WIDTH-1:0] y00_in, y01_in, y02_in, y03_in, y04_in, y05_in,
y10_in, y11_in, y12_in, y13_in, y14_in, y15_in,
y20_in, y21_in, y22_in, y23_in, y24_in, y25_in,
y30_in, y31_in, y32_in, y33_in, y34_in, y35_in,
y40_in, y41_in, y42_in, y43_in, y44_in, y45_in,
y50_in, y51_in, y52_in, y53_in, y54_in, y55_in;
// *** a out ***
wire signed [WIDTH-1:0] a00_out, a01_out, a02_out, a03_out, a04_out, a05_out,
a10_out, a11_out, a12_out, a13_out, a14_out, a15_out,
a20_out, a21_out, a22_out, a23_out, a24_out, a25_out,
a30_out, a31_out, a32_out, a33_out, a34_out, a35_out,
a40_out, a41_out, a42_out, a43_out, a44_out, a45_out,
a50_out, a51_out, a52_out, a53_out, a54_out, a55_out;
// *** y out ***
wire signed [WIDTH-1:0] y00_out, y01_out, y02_out, y03_out, y04_out, y05_out,
y10_out, y11_out, y12_out, y13_out, y14_out, y15_out,
y20_out, y21_out, y22_out, y23_out, y24_out, y25_out,
y30_out, y31_out, y32_out, y33_out, y34_out, y35_out,
y40_out, y41_out, y42_out, y43_out, y44_out, y45_out,
y50_out, y51_out, y52_out, y53_out, y54_out, y55_out;
// *** Output registers ***
wire signed [WIDTH-1:0] y0_tmp, y1_tmp, y2_tmp, y3_tmp, y4_tmp, y5_tmp;
wire signed [WIDTH-1:0] y0_reg0, y0_reg1, y0_reg2, y0_reg3, y0_reg4, y0_reg5;
wire signed [WIDTH-1:0] y1_reg0, y1_reg1, y1_reg2, y1_reg3, y1_reg4;
wire signed [WIDTH-1:0] y2_reg0, y2_reg1, y2_reg2, y2_reg3;
wire signed [WIDTH-1:0] y3_reg0, y3_reg1, y3_reg2;
wire signed [WIDTH-1:0] y4_reg0, y4_reg1;
wire signed [WIDTH-1:0] y5_reg0;
// *** Valid registers ***
wire in_valid_reg0, in_valid_reg1, in_valid_reg2, in_valid_reg3, in_valid_reg4, in_valid_reg5, in_valid_reg6, in_valid_reg7, in_valid_reg8, in_valid_reg9, in_valid_reg10, in_valid_reg11, in_valid_reg12;
// *** Input registers for systolic data setup ***
register #(WIDTH) reg_a0_0(clk, rst_n, en, clr, a0, a0_reg0);
register #(WIDTH) reg_a1_0(clk, rst_n, en, clr, a1, a1_reg0);
register #(WIDTH) reg_a1_1(clk, rst_n, en, clr, a1_reg0, a1_reg1);
register #(WIDTH) reg_a2_0(clk, rst_n, en, clr, a2, a2_reg0);
register #(WIDTH) reg_a2_1(clk, rst_n, en, clr, a2_reg0, a2_reg1);
register #(WIDTH) reg_a2_2(clk, rst_n, en, clr, a2_reg1, a2_reg2);
register #(WIDTH) reg_a3_0(clk, rst_n, en, clr, a3, a3_reg0);
register #(WIDTH) reg_a3_1(clk, rst_n, en, clr, a3_reg0, a3_reg1);
register #(WIDTH) reg_a3_2(clk, rst_n, en, clr, a3_reg1, a3_reg2);
register #(WIDTH) reg_a3_3(clk, rst_n, en, clr, a3_reg2, a3_reg3);
register #(WIDTH) reg_a4_0(clk, rst_n, en, clr, a4, a4_reg0);
register #(WIDTH) reg_a4_1(clk, rst_n, en, clr, a4_reg0, a4_reg1);
register #(WIDTH) reg_a4_2(clk, rst_n, en, clr, a4_reg1, a4_reg2);
register #(WIDTH) reg_a4_3(clk, rst_n, en, clr, a4_reg2, a4_reg3);
register #(WIDTH) reg_a4_4(clk, rst_n, en, clr, a4_reg3, a4_reg4);
register #(WIDTH) reg_a5_0(clk, rst_n, en, clr, a5, a5_reg0);
register #(WIDTH) reg_a5_1(clk, rst_n, en, clr, a5_reg0, a5_reg1);
register #(WIDTH) reg_a5_2(clk, rst_n, en, clr, a5_reg1, a5_reg2);
register #(WIDTH) reg_a5_3(clk, rst_n, en, clr, a5_reg2, a5_reg3);
register #(WIDTH) reg_a5_4(clk, rst_n, en, clr, a5_reg3, a5_reg4);
register #(WIDTH) reg_a5_5(clk, rst_n, en, clr, a5_reg4, a5_reg5);
// *** First a inputs ***
assign a00_in = a0_reg0;
assign a10_in = a1_reg1;
assign a20_in = a2_reg2;
assign a30_in = a3_reg3;
assign a40_in = a4_reg4;
assign a50_in = a5_reg5;
// *** First y inputs ***
assign y00_in = 0;
assign y01_in = 0;
assign y02_in = 0;
assign y03_in = 0;
assign y04_in = 0;
assign y05_in = 0;
// *** 6x6 systolic array ***
// *** Row 0 from bottom ***
pe #(WIDTH, FRAC_BIT) pe00(a00_in, y00_in, b00, a00_out, y00_out);
pe #(WIDTH, FRAC_BIT) pe01(a01_in, y01_in, b01, a01_out, y01_out);
pe #(WIDTH, FRAC_BIT) pe02(a02_in, y02_in, b02, a02_out, y02_out);
pe #(WIDTH, FRAC_BIT) pe03(a03_in, y03_in, b03, a03_out, y03_out);
pe #(WIDTH, FRAC_BIT) pe04(a04_in, y04_in, b04, a04_out, y04_out);
pe #(WIDTH, FRAC_BIT) pe05(a05_in, y05_in, b05, a05_out, y05_out);
// *** Row 1 from bottom ***
pe #(WIDTH, FRAC_BIT) pe10(a10_in, y10_in, b10, a10_out, y10_out);
pe #(WIDTH, FRAC_BIT) pe11(a11_in, y11_in, b11, a11_out, y11_out);
pe #(WIDTH, FRAC_BIT) pe12(a12_in, y12_in, b12, a12_out, y12_out);
pe #(WIDTH, FRAC_BIT) pe13(a13_in, y13_in, b13, a13_out, y13_out);
pe #(WIDTH, FRAC_BIT) pe14(a14_in, y14_in, b14, a14_out, y14_out);
pe #(WIDTH, FRAC_BIT) pe15(a15_in, y15_in, b15, a15_out, y15_out);
// *** Row 2 from bottom ***
pe #(WIDTH, FRAC_BIT) pe20(a20_in, y20_in, b20, a20_out, y20_out);
pe #(WIDTH, FRAC_BIT) pe21(a21_in, y21_in, b21, a21_out, y21_out);
pe #(WIDTH, FRAC_BIT) pe22(a22_in, y22_in, b22, a22_out, y22_out);
pe #(WIDTH, FRAC_BIT) pe23(a23_in, y23_in, b23, a23_out, y23_out);
pe #(WIDTH, FRAC_BIT) pe24(a24_in, y24_in, b24, a24_out, y24_out);
pe #(WIDTH, FRAC_BIT) pe25(a25_in, y25_in, b25, a25_out, y25_out);
// *** Row 3 from bottom ***
pe #(WIDTH, FRAC_BIT) pe30(a30_in, y30_in, b30, a30_out, y30_out);
pe #(WIDTH, FRAC_BIT) pe31(a31_in, y31_in, b31, a31_out, y31_out);
pe #(WIDTH, FRAC_BIT) pe32(a32_in, y32_in, b32, a32_out, y32_out);
pe #(WIDTH, FRAC_BIT) pe33(a33_in, y33_in, b33, a33_out, y33_out);
pe #(WIDTH, FRAC_BIT) pe34(a34_in, y34_in, b34, a34_out, y34_out);
pe #(WIDTH, FRAC_BIT) pe35(a35_in, y35_in, b35, a35_out, y35_out);
// *** Row 4 from bottom ***
pe #(WIDTH, FRAC_BIT) pe40(a40_in, y40_in, b40, a40_out, y40_out);
pe #(WIDTH, FRAC_BIT) pe41(a41_in, y41_in, b41, a41_out, y41_out);
pe #(WIDTH, FRAC_BIT) pe42(a42_in, y42_in, b42, a42_out, y42_out);
pe #(WIDTH, FRAC_BIT) pe43(a43_in, y43_in, b43, a43_out, y43_out);
pe #(WIDTH, FRAC_BIT) pe44(a44_in, y44_in, b44, a44_out, y44_out);
pe #(WIDTH, FRAC_BIT) pe45(a45_in, y45_in, b45, a45_out, y45_out);
// *** Row 5 from bottom ***
pe #(WIDTH, FRAC_BIT) pe50(a50_in, y50_in, b50, a50_out, y50_out);
pe #(WIDTH, FRAC_BIT) pe51(a51_in, y51_in, b51, a51_out, y51_out);
pe #(WIDTH, FRAC_BIT) pe52(a52_in, y52_in, b52, a52_out, y52_out);
pe #(WIDTH, FRAC_BIT) pe53(a53_in, y53_in, b53, a53_out, y53_out);
pe #(WIDTH, FRAC_BIT) pe54(a54_in, y54_in, b54, a54_out, y54_out);
pe #(WIDTH, FRAC_BIT) pe55(a55_in, y55_in, b55, a55_out, y55_out);
// *** Internal registers ***
// *** Row 0 from bottom ***
register #(WIDTH) reg_a00(clk, rst_n, en, clr, a00_out, a01_in);
register #(WIDTH) reg_a01(clk, rst_n, en, clr, a01_out, a02_in);
register #(WIDTH) reg_a02(clk, rst_n, en, clr, a02_out, a03_in);
register #(WIDTH) reg_a03(clk, rst_n, en, clr, a03_out, a04_in);
register #(WIDTH) reg_a04(clk, rst_n, en, clr, a04_out, a05_in);
// *** Row 1 from bottom ***
register #(WIDTH) reg_a10(clk, rst_n, en, clr, a10_out, a11_in);
register #(WIDTH) reg_a11(clk, rst_n, en, clr, a11_out, a12_in);
register #(WIDTH) reg_a12(clk, rst_n, en, clr, a12_out, a13_in);
register #(WIDTH) reg_a13(clk, rst_n, en, clr, a13_out, a14_in);
register #(WIDTH) reg_a14(clk, rst_n, en, clr, a14_out, a15_in);
// *** Row 2 from bottom ***
register #(WIDTH) reg_a20(clk, rst_n, en, clr, a20_out, a21_in);
register #(WIDTH) reg_a21(clk, rst_n, en, clr, a21_out, a22_in);
register #(WIDTH) reg_a22(clk, rst_n, en, clr, a22_out, a23_in);
register #(WIDTH) reg_a23(clk, rst_n, en, clr, a23_out, a24_in);
register #(WIDTH) reg_a24(clk, rst_n, en, clr, a24_out, a25_in);
// *** Row 3 from bottom ***
register #(WIDTH) reg_a30(clk, rst_n, en, clr, a30_out, a31_in);
register #(WIDTH) reg_a31(clk, rst_n, en, clr, a31_out, a32_in);
register #(WIDTH) reg_a32(clk, rst_n, en, clr, a32_out, a33_in);
register #(WIDTH) reg_a33(clk, rst_n, en, clr, a33_out, a34_in);
register #(WIDTH) reg_a34(clk, rst_n, en, clr, a34_out, a35_in);
// *** Row 4 from bottom ***
register #(WIDTH) reg_a40(clk, rst_n, en, clr, a40_out, a41_in);
register #(WIDTH) reg_a41(clk, rst_n, en, clr, a41_out, a42_in);
register #(WIDTH) reg_a42(clk, rst_n, en, clr, a42_out, a43_in);
register #(WIDTH) reg_a43(clk, rst_n, en, clr, a43_out, a44_in);
register #(WIDTH) reg_a44(clk, rst_n, en, clr, a44_out, a45_in);
// *** Row 5 from bottom ***
register #(WIDTH) reg_a50(clk, rst_n, en, clr, a50_out, a51_in);
register #(WIDTH) reg_a51(clk, rst_n, en, clr, a51_out, a52_in);
register #(WIDTH) reg_a52(clk, rst_n, en, clr, a52_out, a53_in);
register #(WIDTH) reg_a53(clk, rst_n, en, clr, a53_out, a54_in);
register #(WIDTH) reg_a54(clk, rst_n, en, clr, a54_out, a55_in);
// *** Column 0 from left ***
register #(WIDTH) reg_y00(clk, rst_n, en, clr, y00_out, y10_in);
register #(WIDTH) reg_y10(clk, rst_n, en, clr, y10_out, y20_in);
register #(WIDTH) reg_y20(clk, rst_n, en, clr, y20_out, y30_in);
register #(WIDTH) reg_y30(clk, rst_n, en, clr, y30_out, y40_in);
register #(WIDTH) reg_y40(clk, rst_n, en, clr, y40_out, y50_in);
register #(WIDTH) reg_y50(clk, rst_n, en, clr, y50_out, y0_tmp);
// *** Column 1 from left ***
register #(WIDTH) reg_y01(clk, rst_n, en, clr, y01_out, y11_in);
register #(WIDTH) reg_y11(clk, rst_n, en, clr, y11_out, y21_in);
register #(WIDTH) reg_y21(clk, rst_n, en, clr, y21_out, y31_in);
register #(WIDTH) reg_y31(clk, rst_n, en, clr, y31_out, y41_in);
register #(WIDTH) reg_y41(clk, rst_n, en, clr, y41_out, y51_in);
register #(WIDTH) reg_y51(clk, rst_n, en, clr, y51_out, y1_tmp);
// *** Column 2 from left ***
register #(WIDTH) reg_y02(clk, rst_n, en, clr, y02_out, y12_in);
register #(WIDTH) reg_y12(clk, rst_n, en, clr, y12_out, y22_in);
register #(WIDTH) reg_y22(clk, rst_n, en, clr, y22_out, y32_in);
register #(WIDTH) reg_y32(clk, rst_n, en, clr, y32_out, y42_in);
register #(WIDTH) reg_y42(clk, rst_n, en, clr, y42_out, y52_in);
register #(WIDTH) reg_y52(clk, rst_n, en, clr, y52_out, y2_tmp);
// *** Column 3 from left ***
register #(WIDTH) reg_y03(clk, rst_n, en, clr, y03_out, y13_in);
register #(WIDTH) reg_y13(clk, rst_n, en, clr, y13_out, y23_in);
register #(WIDTH) reg_y23(clk, rst_n, en, clr, y23_out, y33_in);
register #(WIDTH) reg_y33(clk, rst_n, en, clr, y33_out, y43_in);
register #(WIDTH) reg_y43(clk, rst_n, en, clr, y43_out, y53_in);
register #(WIDTH) reg_y53(clk, rst_n, en, clr, y53_out, y3_tmp);
// *** Column 4 from left ***
register #(WIDTH) reg_y04(clk, rst_n, en, clr, y04_out, y14_in);
register #(WIDTH) reg_y14(clk, rst_n, en, clr, y14_out, y24_in);
register #(WIDTH) reg_y24(clk, rst_n, en, clr, y24_out, y34_in);
register #(WIDTH) reg_y34(clk, rst_n, en, clr, y34_out, y44_in);
register #(WIDTH) reg_y44(clk, rst_n, en, clr, y44_out, y54_in);
register #(WIDTH) reg_y54(clk, rst_n, en, clr, y54_out, y4_tmp);
// *** Column 5 from left ***
register #(WIDTH) reg_y05(clk, rst_n, en, clr, y05_out, y15_in);
register #(WIDTH) reg_y15(clk, rst_n, en, clr, y15_out, y25_in);
register #(WIDTH) reg_y25(clk, rst_n, en, clr, y25_out, y35_in);
register #(WIDTH) reg_y35(clk, rst_n, en, clr, y35_out, y45_in);
register #(WIDTH) reg_y45(clk, rst_n, en, clr, y45_out, y55_in);
register #(WIDTH) reg_y55(clk, rst_n, en, clr, y55_out, y5_tmp);
// *** Output registers ***
register #(WIDTH) reg_y0_0(clk, rst_n, en, clr, y0_tmp, y0_reg0);
register #(WIDTH) reg_y0_1(clk, rst_n, en, clr, y0_reg0, y0_reg1);
register #(WIDTH) reg_y0_2(clk, rst_n, en, clr, y0_reg1, y0_reg2);
register #(WIDTH) reg_y0_3(clk, rst_n, en, clr, y0_reg2, y0_reg3);
register #(WIDTH) reg_y0_4(clk, rst_n, en, clr, y0_reg3, y0_reg4);
register #(WIDTH) reg_y0_5(clk, rst_n, en, clr, y0_reg4, y0_reg5);
register #(WIDTH) reg_y1_0(clk, rst_n, en, clr, y1_tmp, y1_reg0);
register #(WIDTH) reg_y1_1(clk, rst_n, en, clr, y1_reg0, y1_reg1);
register #(WIDTH) reg_y1_2(clk, rst_n, en, clr, y1_reg1, y1_reg2);
register #(WIDTH) reg_y1_3(clk, rst_n, en, clr, y1_reg2, y1_reg3);
register #(WIDTH) reg_y1_4(clk, rst_n, en, clr, y1_reg3, y1_reg4);
register #(WIDTH) reg_y2_0(clk, rst_n, en, clr, y2_tmp, y2_reg0);
register #(WIDTH) reg_y2_1(clk, rst_n, en, clr, y2_reg0, y2_reg1);
register #(WIDTH) reg_y2_2(clk, rst_n, en, clr, y2_reg1, y2_reg2);
register #(WIDTH) reg_y2_3(clk, rst_n, en, clr, y2_reg2, y2_reg3);
register #(WIDTH) reg_y3_0(clk, rst_n, en, clr, y3_tmp, y3_reg0);
register #(WIDTH) reg_y3_1(clk, rst_n, en, clr, y3_reg0, y3_reg1);
register #(WIDTH) reg_y3_2(clk, rst_n, en, clr, y3_reg1, y3_reg2);
register #(WIDTH) reg_y4_0(clk, rst_n, en, clr, y4_tmp, y4_reg0);
register #(WIDTH) reg_y4_1(clk, rst_n, en, clr, y4_reg0, y4_reg1);
register #(WIDTH) reg_y5_0(clk, rst_n, en, clr, y5_tmp, y5_reg0);
// *** Valid registers ***
register #(1) reg_valid_0(clk, rst_n, en, clr, in_valid, in_valid_reg0);
register #(1) reg_valid_1(clk, rst_n, en, clr, in_valid_reg0, in_valid_reg1);
register #(1) reg_valid_2(clk, rst_n, en, clr, in_valid_reg1, in_valid_reg2);
register #(1) reg_valid_3(clk, rst_n, en, clr, in_valid_reg2, in_valid_reg3);
register #(1) reg_valid_4(clk, rst_n, en, clr, in_valid_reg3, in_valid_reg4);
register #(1) reg_valid_5(clk, rst_n, en, clr, in_valid_reg4, in_valid_reg5);
register #(1) reg_valid_6(clk, rst_n, en, clr, in_valid_reg5, in_valid_reg6);
register #(1) reg_valid_7(clk, rst_n, en, clr, in_valid_reg6, in_valid_reg7);
register #(1) reg_valid_8(clk, rst_n, en, clr, in_valid_reg7, in_valid_reg8);
register #(1) reg_valid_9(clk, rst_n, en, clr, in_valid_reg8, in_valid_reg9);
register #(1) reg_valid_10(clk, rst_n, en, clr, in_valid_reg9, in_valid_reg10);
register #(1) reg_valid_11(clk, rst_n, en, clr, in_valid_reg10, in_valid_reg11);
register #(1) reg_valid_12(clk, rst_n, en, clr, in_valid_reg11, in_valid_reg12);
// *** Outputs ***
assign y0 = y0_reg5;
assign y1 = y1_reg4;
assign y2 = y2_reg3;
assign y3 = y3_reg2;
assign y4 = y4_reg1;
assign y5 = y5_reg0;
assign out_valid = in_valid_reg12;
endmodule
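To follow how data moves through the array, it can help to model the registers in software. The Python sketch below is a behavioural model written for this explanation, not the tutorial's testbench. It mirrors the register structure of the systolic module (input skew registers, registers between PEs, output deskew registers) for any array size n, and assumes that en is held high and clr is held low.

```python
def wrap(v, width=16):
    """Wrap an integer to a signed two's-complement value of `width` bits."""
    v &= (1 << width) - 1
    return v - (1 << width) if v >= (1 << (width - 1)) else v

def pe_y(a_in, y_in, b, frac_bit=10):
    """y_out of one PE: y_in + ((a_in * b) >> frac_bit), truncated."""
    return wrap(y_in + wrap((a_in * b) >> frac_bit))

class Systolic:
    """Register-level model of an n x n array with stationary b and moving a."""

    def __init__(self, b):
        n = self.n = len(b)
        self.b = b
        self.in_skew = [[0] * (i + 1) for i in range(n)]    # input skew registers
        self.a_reg = [[0] * (n - 1) for _ in range(n)]      # a registers between PEs
        self.y_reg = [[0] * n for _ in range(n)]            # y registers below PEs
        self.out_skew = [[0] * (n - j) for j in range(n)]   # output deskew registers

    def step(self, a):
        """Apply one rising clock edge with input vector a; return y after it."""
        n = self.n
        # Combinational values seen by the PEs before the edge.
        a_in = [[self.in_skew[i][-1]] + self.a_reg[i] for i in range(n)]
        y_out = [[pe_y(a_in[i][j], self.y_reg[i - 1][j] if i else 0, self.b[i][j])
                  for j in range(n)] for i in range(n)]
        y_tmp = self.y_reg[n - 1]
        # All registers update together on the clock edge.
        self.out_skew = [[y_tmp[j]] + self.out_skew[j][:-1] for j in range(n)]
        self.in_skew = [[a[i]] + self.in_skew[i][:-1] for i in range(n)]
        self.a_reg = [row[:-1] for row in a_in]
        self.y_reg = y_out
        return [col[-1] for col in self.out_skew]

def stream(b, rows):
    """Feed rows back to back, then zeros; collect one output row per input row."""
    n = len(b)
    latency = 2 * n + 1                 # 13 clock edges for the 6x6 array
    array = Systolic(b)
    outs = []
    for t in range(len(rows) + latency - 1):
        y = array.step(rows[t] if t < len(rows) else [0] * n)
        if t >= latency - 1:
            outs.append(y)
    return outs

def to_fx(x):
    """Real number to fixed point with 10 fractional bits."""
    return int(round(x * 1024))

B = [[to_fx((i - j) / 4) for j in range(6)] for i in range(6)]
ROWS = [[to_fx(0.5 * (k + i)) for i in range(6)] for k in range(2)]
Y = stream(B, ROWS)
print([v / 1024 for v in Y[0]])
```

Each input row a produces the row vector a x B, and it appears 2n + 1 clock edges after it was applied. For n = 6 this matches the 13 valid registers in the Verilog module.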
You can verify the matrix multiplication performed by the systolic module with the testbench.
Matrix multiplication for hidden layer 1:
Verification against the model. The fractional digits may differ slightly because of rounding in the fixed-point implementation.
Matrix multiplication for hidden layer 2:
Verification against the model. The same small fixed-point differences apply.
3.3. Sigmoid LUT
To calculate the sigmoid function, we can use the lookup table method. The following figure illustrates a basic LUT implementation of sigmoid.
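To make the lookup-table idea concrete, here is a minimal Python sketch. The table size (256 entries), the input range (-8 to 8), and the step between entries (1/16) are assumptions chosen for illustration, so the tutorial's sigmoid module may use different values. Only the fixed-point format with 10 fractional bits comes from the PE parameters above.

```python
import math

FRAC_BIT = 10      # fractional bits, as in the PE
LUT_SIZE = 256     # number of table entries (assumed)
X_MIN = -8.0       # inputs below this saturate to the first entry (assumed)
STEP_SHIFT = 6     # one entry per 2**6 LSBs, that is 1/16 (assumed)

def to_fixed(x):
    """Real number to fixed point with FRAC_BIT fractional bits."""
    return int(round(x * (1 << FRAC_BIT)))

# Precompute sigmoid at the left edge of every input interval.
SIGMOID_LUT = [
    to_fixed(1.0 / (1.0 + math.exp(-(X_MIN + i / 16))))
    for i in range(LUT_SIZE)
]

def sigmoid_lut(x_fx):
    """Look up sigmoid for a fixed-point input, saturating outside the table."""
    index = (x_fx >> STEP_SHIFT) - (to_fixed(X_MIN) >> STEP_SHIFT)
    index = max(0, min(LUT_SIZE - 1, index))
    return SIGMOID_LUT[index]

for x in (-20.0, -1.0, 0.0, 2.0, 20.0):
    print(x, sigmoid_lut(to_fixed(x)) / (1 << FRAC_BIT))
```

In hardware the same idea becomes a small ROM addressed by the upper bits of the input, which avoids computing an exponential and a division.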
This code is an implementation of a sigmoid module in Verilog.
3.4. ANN Core
We already have the systolic module and the sigmoid module. The next step is to construct the ANN core. In addition, we need block memories (BRAMs) to store the weights, inputs, and outputs.
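The block memories in the ANN core use 128-bit words, and the core reads six 16-bit values from bits [95:0] of each word. The Python sketch below shows how software could pack and unpack such a word. The lane order (value 0 in bits [15:0]) follows the slicing in the ANN core listing. The function names, and the assumption that one input word holds one taste level for all six foods, are illustrative.

```python
def pack_word(values_fx):
    """Pack up to eight signed 16-bit values into one 128-bit integer.

    Value 0 goes to bits [15:0], value 1 to bits [31:16], and so on.
    """
    word = 0
    for lane, value in enumerate(values_fx):
        word |= (value & 0xFFFF) << (16 * lane)
    return word

def unpack_word(word, lanes=6):
    """Extract `lanes` signed 16-bit values, starting from bits [15:0]."""
    values = []
    for lane in range(lanes):
        raw = (word >> (16 * lane)) & 0xFFFF
        values.append(raw - 0x10000 if raw & 0x8000 else raw)
    return values

# Sourness (k1) of the six foods, scaled to 10 fractional bits.
sourness_fx = [v * 1024 for v in [2, 7, 6, 3, 6, 3]]
word = pack_word(sourness_fx)
print(hex(word))
print(unpack_word(word))
```

Masking with 0xFFFF before shifting keeps negative values in two's-complement form, so a negative weight does not spill into the neighbouring lanes.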
This code is an implementation of the ANN core in Verilog.
module ann
(
input wire clk,
input wire rst_n,
input wire en,
input wire clr,
// *** Control and status port ***
output wire ready,
input wire start,
output wire done,
// *** Weight port ***
input wire wb_ena,
input wire [2:0] wb_addra,
input wire [127:0] wb_dina,
input wire [15:0] wb_wea,
// *** Data input port ***
input wire k_ena,
input wire [1:0] k_addra,
input wire [127:0] k_dina,
input wire [15:0] k_wea,
// *** Data output port ***
input wire a_enb,
input wire [1:0] a_addrb,
output wire [127:0] a_doutb
);
// Weight BRAM
wire wb_enb;
wire [2:0] wb_addrb;
wire [127:0] wb_doutb;
wire [15:0] wb_doutb_0;
wire [15:0] wb_doutb_1;
wire [15:0] wb_doutb_2;
wire [15:0] wb_doutb_3;
wire [15:0] wb_doutb_4;
wire [15:0] wb_doutb_5;
// Input BRAM
wire k_enb;
wire [1:0] k_addrb;
wire [127:0] k_doutb;
wire [15:0] k_doutb_0;
wire [15:0] k_doutb_1;
wire [15:0] k_doutb_2;
wire [15:0] k_doutb_3;
wire [15:0] k_doutb_4;
wire [15:0] k_doutb_5;
// Counter for main controller
reg [5:0] cnt_main_reg;
// Multiplexer and register for systolic moving input
wire [0:0] a0_sel, a1_sel, a2_sel, a3_sel, a4_sel, a5_sel;
wire [15:0] a0, a1, a2, a3, a4, a5;
// Multiplexer and register for systolic stationary input
wire [1:0] b00_sel, b01_sel, b02_sel, b03_sel, b04_sel, b05_sel;
wire [1:0] b10_sel, b11_sel, b12_sel, b13_sel, b14_sel, b15_sel;
wire [1:0] b20_sel, b21_sel, b22_sel, b23_sel, b24_sel, b25_sel;
wire [1:0] b30_sel, b31_sel, b32_sel, b33_sel, b34_sel, b35_sel;
wire [1:0] b40_sel, b41_sel, b42_sel, b43_sel, b44_sel, b45_sel;
wire [1:0] b50_sel, b51_sel, b52_sel, b53_sel, b54_sel, b55_sel;
wire [15:0] b00_next, b01_next, b02_next, b03_next, b04_next, b05_next;
wire [15:0] b10_next, b11_next, b12_next, b13_next, b14_next, b15_next;
wire [15:0] b20_next, b21_next, b22_next, b23_next, b24_next, b25_next;
wire [15:0] b30_next, b31_next, b32_next, b33_next, b34_next, b35_next;
wire [15:0] b40_next, b41_next, b42_next, b43_next, b44_next, b45_next;
wire [15:0] b50_next, b51_next, b52_next, b53_next, b54_next, b55_next;
wire [15:0] b00_reg, b01_reg, b02_reg, b03_reg, b04_reg, b05_reg;
wire [15:0] b10_reg, b11_reg, b12_reg, b13_reg, b14_reg, b15_reg;
wire [15:0] b20_reg, b21_reg, b22_reg, b23_reg, b24_reg, b25_reg;
wire [15:0] b30_reg, b31_reg, b32_reg, b33_reg, b34_reg, b35_reg;
wire [15:0] b40_reg, b41_reg, b42_reg, b43_reg, b44_reg, b45_reg;
wire [15:0] b50_reg, b51_reg, b52_reg, b53_reg, b54_reg, b55_reg;
// Systolic
wire sys_in_valid;
wire [15:0] y0, y1, y2, y3, y4, y5;
wire sys_out_valid;
// Sigmoid
wire [15:0] s0, s1, s2, s3, s4, s5;
wire sig_out_valid;
wire [15:0] s0_reg0, s0_reg1, s0_reg2, s0_reg3;
wire [15:0] s1_reg0, s1_reg1, s1_reg2, s1_reg3;
wire [15:0] s2_reg0, s2_reg1, s2_reg2, s2_reg3;
wire [15:0] s3_reg0, s3_reg1, s3_reg2, s3_reg3;
wire [15:0] s4_reg0, s4_reg1, s4_reg2, s4_reg3;
wire [15:0] s5_reg0, s5_reg1, s5_reg2, s5_reg3;
wire sig_out_valid_reg0, sig_out_valid_reg1, sig_out_valid_reg2, sig_out_valid_reg3;
// Output BRAM
wire a_ena;
wire [15:0] a_wea;
wire [1:0] a_addra;
wire [127:0] a_dina;
// *** Weight BRAM **********************************************************
// xpm_memory_tdpram: True Dual Port RAM
// Xilinx Parameterized Macro, version 2018.3
xpm_memory_tdpram
#(
// Common module parameters
.MEMORY_SIZE(1024), // DECIMAL, size: 8x128bit= 1024 bits
.MEMORY_PRIMITIVE("auto"), // String
.CLOCKING_MODE("common_clock"), // String, "common_clock"
.MEMORY_INIT_FILE("none"), // String
.MEMORY_INIT_PARAM("0"), // String
.USE_MEM_INIT(1), // DECIMAL
.WAKEUP_TIME("disable_sleep"), // String
.MESSAGE_CONTROL(0), // DECIMAL
.AUTO_SLEEP_TIME(0), // DECIMAL
.ECC_MODE("no_ecc"), // String
.MEMORY_OPTIMIZATION("true"), // String
.USE_EMBEDDED_CONSTRAINT(0), // DECIMAL
// Port A module parameters
.WRITE_DATA_WIDTH_A(128), // DECIMAL, data width: 128-bit
.READ_DATA_WIDTH_A(128), // DECIMAL, data width: 128-bit
.BYTE_WRITE_WIDTH_A(8), // DECIMAL
.ADDR_WIDTH_A(3), // DECIMAL, clog2(1024/128)=clog2(8)= 3
.READ_RESET_VALUE_A("0"), // String
.READ_LATENCY_A(1), // DECIMAL
.WRITE_MODE_A("write_first"), // String
.RST_MODE_A("SYNC"), // String
// Port B module parameters
.WRITE_DATA_WIDTH_B(128), // DECIMAL, data width: 128-bit
.READ_DATA_WIDTH_B(128), // DECIMAL, data width: 128-bit
.BYTE_WRITE_WIDTH_B(8), // DECIMAL
.ADDR_WIDTH_B(3), // DECIMAL, clog2(1024/128)=clog2(8)= 3
.READ_RESET_VALUE_B("0"), // String
.READ_LATENCY_B(1), // DECIMAL
.WRITE_MODE_B("write_first"), // String
.RST_MODE_B("SYNC") // String
)
xpm_memory_tdpram_wb
(
.sleep(1'b0),
.regcea(1'b1), //do not change
.injectsbiterra(1'b0), //do not change
.injectdbiterra(1'b0), //do not change
.sbiterra(), //do not change
.dbiterra(), //do not change
.regceb(1'b1), //do not change
.injectsbiterrb(1'b0), //do not change
.injectdbiterrb(1'b0), //do not change
.sbiterrb(), //do not change
.dbiterrb(), //do not change
// Port A module ports
.clka(clk),
.rsta(~rst_n),
.ena(wb_ena),
.wea(wb_wea),
.addra(wb_addra),
.dina(wb_dina),
.douta(),
// Port B module ports
.clkb(clk),
.rstb(~rst_n),
.enb(wb_enb),
.web(0),
.addrb(wb_addrb),
.dinb(0),
.doutb(wb_doutb)
);
assign wb_doutb_0 = wb_doutb[15:0];
assign wb_doutb_1 = wb_doutb[31:16];
assign wb_doutb_2 = wb_doutb[47:32];
assign wb_doutb_3 = wb_doutb[63:48];
assign wb_doutb_4 = wb_doutb[79:64];
assign wb_doutb_5 = wb_doutb[95:80];
// *** Input BRAM ***********************************************************
// xpm_memory_tdpram: True Dual Port RAM
// Xilinx Parameterized Macro, version 2018.3
xpm_memory_tdpram
#(
// Common module parameters
.MEMORY_SIZE(512), // DECIMAL, size: 4x128bit= 512 bits
.MEMORY_PRIMITIVE("auto"), // String
.CLOCKING_MODE("common_clock"), // String, "common_clock"
.MEMORY_INIT_FILE("none"), // String
.MEMORY_INIT_PARAM("0"), // String
.USE_MEM_INIT(1), // DECIMAL
.WAKEUP_TIME("disable_sleep"), // String
.MESSAGE_CONTROL(0), // DECIMAL
.AUTO_SLEEP_TIME(0), // DECIMAL
.ECC_MODE("no_ecc"), // String
.MEMORY_OPTIMIZATION("true"), // String
.USE_EMBEDDED_CONSTRAINT(0), // DECIMAL
// Port A module parameters
.WRITE_DATA_WIDTH_A(128), // DECIMAL, data width: 128-bit
.READ_DATA_WIDTH_A(128), // DECIMAL, data width: 128-bit
.BYTE_WRITE_WIDTH_A(8), // DECIMAL
.ADDR_WIDTH_A(2), // DECIMAL, clog2(512/128)=clog2(4)= 2
.READ_RESET_VALUE_A("0"), // String
.READ_LATENCY_A(1), // DECIMAL
.WRITE_MODE_A("write_first"), // String
.RST_MODE_A("SYNC"), // String
// Port B module parameters
.WRITE_DATA_WIDTH_B(128), // DECIMAL, data width: 128-bit
.READ_DATA_WIDTH_B(128), // DECIMAL, data width: 128-bit
.BYTE_WRITE_WIDTH_B(8), // DECIMAL
.ADDR_WIDTH_B(2), // DECIMAL, clog2(512/128)=clog2(4)= 2
.READ_RESET_VALUE_B("0"), // String
.READ_LATENCY_B(1), // DECIMAL
.WRITE_MODE_B("write_first"), // String
.RST_MODE_B("SYNC") // String
)
xpm_memory_tdpram_k
(
.sleep(1'b0),
.regcea(1'b1), //do not change
.injectsbiterra(1'b0), //do not change
.injectdbiterra(1'b0), //do not change
.sbiterra(), //do not change
.dbiterra(), //do not change
.regceb(1'b1), //do not change
.injectsbiterrb(1'b0), //do not change
.injectdbiterrb(1'b0), //do not change
.sbiterrb(), //do not change
.dbiterrb(), //do not change
// Port A module ports
.clka(clk),
.rsta(~rst_n),
.ena(k_ena),
.wea(k_wea),
.addra(k_addra),
.dina(k_dina),
.douta(),
// Port B module ports
.clkb(clk),
.rstb(~rst_n),
.enb(k_enb),
.web(0),
.addrb(k_addrb),
.dinb(0),
.doutb(k_doutb)
);
assign k_doutb_0 = k_doutb[15:0];
assign k_doutb_1 = k_doutb[31:16];
assign k_doutb_2 = k_doutb[47:32];
assign k_doutb_3 = k_doutb[63:48];
assign k_doutb_4 = k_doutb[79:64];
assign k_doutb_5 = k_doutb[95:80];
// *** Counter for main controller ******************************************
always @(posedge clk)
begin
if (!rst_n || clr)
begin
cnt_main_reg <= 0;
end
else if (start)
begin
cnt_main_reg <= cnt_main_reg + 1;
end
else if (cnt_main_reg >= 1 && cnt_main_reg <= 49)
begin
cnt_main_reg <= cnt_main_reg + 1;
end
else if (cnt_main_reg >= 50)
begin
cnt_main_reg <= 0;
end
end
// Weight BRAM control
assign wb_enb = ((cnt_main_reg >= 3) && (cnt_main_reg <= 7)) ? 1 :
((cnt_main_reg >= 25) && (cnt_main_reg <= 26)) ? 1 : 0;
assign wb_addrb = (cnt_main_reg == 3) ? 0 :
(cnt_main_reg == 4) ? 1 :
(cnt_main_reg == 5) ? 2 :
(cnt_main_reg == 6) ? 3 :
(cnt_main_reg == 7) ? 4 :
(cnt_main_reg == 25) ? 5 :
(cnt_main_reg == 26) ? 6 : 0;
// Systolic moving input multiplexer control
assign a0_sel = ((cnt_main_reg >= 4) && (cnt_main_reg <= 8)) ? 0 :
((cnt_main_reg >= 26) && (cnt_main_reg <= 27)) ? 0 : 1;
assign a1_sel = ((cnt_main_reg >= 4) && (cnt_main_reg <= 8)) ? 0 :
((cnt_main_reg >= 26) && (cnt_main_reg <= 27)) ? 0 : 1;
assign a2_sel = ((cnt_main_reg >= 4) && (cnt_main_reg <= 8)) ? 0 :
((cnt_main_reg >= 26) && (cnt_main_reg <= 27)) ? 0 : 1;
assign a3_sel = ((cnt_main_reg >= 4) && (cnt_main_reg <= 8)) ? 0 :
((cnt_main_reg >= 26) && (cnt_main_reg <= 27)) ? 0 : 1;
assign a4_sel = ((cnt_main_reg >= 4) && (cnt_main_reg <= 8)) ? 0 :
((cnt_main_reg >= 26) && (cnt_main_reg <= 27)) ? 0 : 1;
assign a5_sel = ((cnt_main_reg >= 4) && (cnt_main_reg <= 8)) ? 0 :
((cnt_main_reg >= 26) && (cnt_main_reg <= 27)) ? 0 : 1;
// Input BRAM control
assign k_enb = ((cnt_main_reg >= 1) && (cnt_main_reg <= 4)) ? 1 : 0;
assign k_addrb = (cnt_main_reg == 1) ? 0 :
(cnt_main_reg == 2) ? 1 :
(cnt_main_reg == 3) ? 2 :
(cnt_main_reg == 4) ? 3 : 0;
// Systolic stationary input multiplexer control
assign b00_sel = (cnt_main_reg == 2) ? 0 :
(cnt_main_reg == 22) ? 1 : 3;
assign b01_sel = (cnt_main_reg == 2) ? 0 :
(cnt_main_reg == 22) ? 1 : 3;
assign b02_sel = (cnt_main_reg == 2) ? 0 :
(cnt_main_reg == 22) ? 1 : 3;
assign b03_sel = (cnt_main_reg == 2) ? 0 :
(cnt_main_reg == 22) ? 1 : 3;
assign b04_sel = (cnt_main_reg == 2) ? 0 :
(cnt_main_reg == 22) ? 1 : 3;
assign b05_sel = (cnt_main_reg == 2) ? 0 :
(cnt_main_reg == 22) ? 1 : 3;
assign b10_sel = (cnt_main_reg == 3) ? 0 :
(cnt_main_reg == 23) ? 1 : 3;
assign b11_sel = (cnt_main_reg == 3) ? 0 :
(cnt_main_reg == 23) ? 1 : 3;
assign b12_sel = (cnt_main_reg == 3) ? 0 :
(cnt_main_reg == 23) ? 1 : 3;
assign b13_sel = (cnt_main_reg == 3) ? 0 :
(cnt_main_reg == 23) ? 1 : 3;
assign b14_sel = (cnt_main_reg == 3) ? 0 :
(cnt_main_reg == 23) ? 1 : 3;
assign b15_sel = (cnt_main_reg == 3) ? 0 :
(cnt_main_reg == 23) ? 1 : 3;
assign b20_sel = (cnt_main_reg == 4) ? 0 :
(cnt_main_reg == 24) ? 1 : 3;
assign b21_sel = (cnt_main_reg == 4) ? 0 :
(cnt_main_reg == 24) ? 1 : 3;
assign b22_sel = (cnt_main_reg == 4) ? 0 :
(cnt_main_reg == 24) ? 1 : 3;
assign b23_sel = (cnt_main_reg == 4) ? 0 :
(cnt_main_reg == 24) ? 1 : 3;
assign b24_sel = (cnt_main_reg == 4) ? 0 :
(cnt_main_reg == 24) ? 1 : 3;
assign b25_sel = (cnt_main_reg == 4) ? 0 :
(cnt_main_reg == 24) ? 1 : 3;
assign b30_sel = (cnt_main_reg == 5) ? 0 :
(cnt_main_reg == 25) ? 1 : 3;
assign b31_sel = (cnt_main_reg == 5) ? 0 :
(cnt_main_reg == 25) ? 1 : 3;
assign b32_sel = (cnt_main_reg == 5) ? 0 :
(cnt_main_reg == 25) ? 1 : 3;
assign b33_sel = (cnt_main_reg == 5) ? 0 :
(cnt_main_reg == 25) ? 1 : 3;
assign b34_sel = (cnt_main_reg == 5) ? 0 :
(cnt_main_reg == 25) ? 1 : 3;
assign b35_sel = (cnt_main_reg == 5) ? 0 :
(cnt_main_reg == 25) ? 1 : 3;
assign b40_sel = (cnt_main_reg == 2) ? 2 :
(cnt_main_reg == 26) ? 1 : 3;
assign b41_sel = (cnt_main_reg == 2) ? 2 :
(cnt_main_reg == 26) ? 1 : 3;
assign b42_sel = (cnt_main_reg == 2) ? 2 :
(cnt_main_reg == 26) ? 1 : 3;
assign b43_sel = (cnt_main_reg == 2) ? 2 :
(cnt_main_reg == 26) ? 1 : 3;
assign b44_sel = (cnt_main_reg == 2) ? 2 :
(cnt_main_reg == 26) ? 1 : 3;
assign b45_sel = (cnt_main_reg == 2) ? 2 :
(cnt_main_reg == 26) ? 1 : 3;
assign b50_sel = (cnt_main_reg == 22) ? 2 : 3;
assign b51_sel = (cnt_main_reg == 22) ? 2 : 3;
assign b52_sel = (cnt_main_reg == 22) ? 2 : 3;
assign b53_sel = (cnt_main_reg == 22) ? 2 : 3;
assign b54_sel = (cnt_main_reg == 22) ? 2 : 3;
assign b55_sel = (cnt_main_reg == 22) ? 2 : 3;
// Systolic control
assign sys_in_valid = ((cnt_main_reg >= 4) && (cnt_main_reg <= 9)) ? 1 :
((cnt_main_reg >= 26) && (cnt_main_reg <= 31)) ? 1 : 0;
// Output BRAM control
assign a_ena = ((cnt_main_reg >= 40) && (cnt_main_reg <= 41)) ? 1 : 0;
assign a_wea = ((cnt_main_reg >= 40) && (cnt_main_reg <= 41)) ? 16'hffff : 0;
assign a_addra = (cnt_main_reg == 40) ? 0 :
(cnt_main_reg == 41) ? 1 : 0;
// Status control
assign ready = (cnt_main_reg == 0) ? 1 : 0;
assign done = (cnt_main_reg == 50) ? 1 : 0;
// *** Multiplexer for systolic moving input *******************
assign a0 = (a0_sel == 0) ? wb_doutb_0 : 0;
assign a1 = (a1_sel == 0) ? wb_doutb_1 : 0;
assign a2 = (a2_sel == 0) ? wb_doutb_2 : 0;
assign a3 = (a3_sel == 0) ? wb_doutb_3 : 0;
assign a4 = (a4_sel == 0) ? wb_doutb_4 : 0;
assign a5 = (a5_sel == 0) ? wb_doutb_5 : 0;
// *** Multiplexer and register for systolic stationary input ***************
assign b00_next = (b00_sel == 0) ? k_doutb_0 :
(b00_sel == 1) ? s0_reg3 :
(b00_sel == 2) ? 16'b0000010000000000 : b00_reg;
assign b01_next = (b01_sel == 0) ? k_doutb_1 :
(b01_sel == 1) ? s1_reg3 :
(b01_sel == 2) ? 16'b0000010000000000 : b01_reg;
assign b02_next = (b02_sel == 0) ? k_doutb_2 :
(b02_sel == 1) ? s2_reg3 :
(b02_sel == 2) ? 16'b0000010000000000 : b02_reg;
assign b03_next = (b03_sel == 0) ? k_doutb_3 :
(b03_sel == 1) ? s3_reg3 :
(b03_sel == 2) ? 16'b0000010000000000 : b03_reg;
assign b04_next = (b04_sel == 0) ? k_doutb_4 :
(b04_sel == 1) ? s4_reg3 :
(b04_sel == 2) ? 16'b0000010000000000 : b04_reg;
assign b05_next = (b05_sel == 0) ? k_doutb_5 :
(b05_sel == 1) ? s5_reg3 :
(b05_sel == 2) ? 16'b0000010000000000 : b05_reg;
register #(16) reg_b00(clk, rst_n, en, clr, b00_next, b00_reg);
register #(16) reg_b01(clk, rst_n, en, clr, b01_next, b01_reg);
register #(16) reg_b02(clk, rst_n, en, clr, b02_next, b02_reg);
register #(16) reg_b03(clk, rst_n, en, clr, b03_next, b03_reg);
register #(16) reg_b04(clk, rst_n, en, clr, b04_next, b04_reg);
register #(16) reg_b05(clk, rst_n, en, clr, b05_next, b05_reg);
assign b10_next = (b10_sel == 0) ? k_doutb_0 :
(b10_sel == 1) ? s0_reg3 :
(b10_sel == 2) ? 16'b0000010000000000 : b10_reg;
assign b11_next = (b11_sel == 0) ? k_doutb_1 :
(b11_sel == 1) ? s1_reg3 :
(b11_sel == 2) ? 16'b0000010000000000 : b11_reg;
assign b12_next = (b12_sel == 0) ? k_doutb_2 :
(b12_sel == 1) ? s2_reg3 :
(b12_sel == 2) ? 16'b0000010000000000 : b12_reg;
assign b13_next = (b13_sel == 0) ? k_doutb_3 :
(b13_sel == 1) ? s3_reg3 :
(b13_sel == 2) ? 16'b0000010000000000 : b13_reg;
assign b14_next = (b14_sel == 0) ? k_doutb_4 :
(b14_sel == 1) ? s4_reg3 :
(b14_sel == 2) ? 16'b0000010000000000 : b14_reg;
assign b15_next = (b15_sel == 0) ? k_doutb_5 :
(b15_sel == 1) ? s5_reg3 :
(b15_sel == 2) ? 16'b0000010000000000 : b15_reg;
register #(16) reg_b10(clk, rst_n, en, clr, b10_next, b10_reg);
register #(16) reg_b11(clk, rst_n, en, clr, b11_next, b11_reg);
register #(16) reg_b12(clk, rst_n, en, clr, b12_next, b12_reg);
register #(16) reg_b13(clk, rst_n, en, clr, b13_next, b13_reg);
register #(16) reg_b14(clk, rst_n, en, clr, b14_next, b14_reg);
register #(16) reg_b15(clk, rst_n, en, clr, b15_next, b15_reg);
assign b20_next = (b20_sel == 0) ? k_doutb_0 :
(b20_sel == 1) ? s0_reg3 :
(b20_sel == 2) ? 16'b0000010000000000 : b20_reg;
assign b21_next = (b21_sel == 0) ? k_doutb_1 :
(b21_sel == 1) ? s1_reg3 :
(b21_sel == 2) ? 16'b0000010000000000 : b21_reg;
assign b22_next = (b22_sel == 0) ? k_doutb_2 :
(b22_sel == 1) ? s2_reg3 :
(b22_sel == 2) ? 16'b0000010000000000 : b22_reg;
assign b23_next = (b23_sel == 0) ? k_doutb_3 :
(b23_sel == 1) ? s3_reg3 :
(b23_sel == 2) ? 16'b0000010000000000 : b23_reg;
assign b24_next = (b24_sel == 0) ? k_doutb_4 :
(b24_sel == 1) ? s4_reg3 :
(b24_sel == 2) ? 16'b0000010000000000 : b24_reg;
assign b25_next = (b25_sel == 0) ? k_doutb_5 :
(b25_sel == 1) ? s5_reg3 :
(b25_sel == 2) ? 16'b0000010000000000 : b25_reg;
register #(16) reg_b20(clk, rst_n, en, clr, b20_next, b20_reg);
register #(16) reg_b21(clk, rst_n, en, clr, b21_next, b21_reg);
register #(16) reg_b22(clk, rst_n, en, clr, b22_next, b22_reg);
register #(16) reg_b23(clk, rst_n, en, clr, b23_next, b23_reg);
register #(16) reg_b24(clk, rst_n, en, clr, b24_next, b24_reg);
register #(16) reg_b25(clk, rst_n, en, clr, b25_next, b25_reg);
assign b30_next = (b30_sel == 0) ? k_doutb_0 :
(b30_sel == 1) ? s0_reg3 :
(b30_sel == 2) ? 16'b0000010000000000 : b30_reg;
assign b31_next = (b31_sel == 0) ? k_doutb_1 :
(b31_sel == 1) ? s1_reg3 :
(b31_sel == 2) ? 16'b0000010000000000 : b31_reg;
assign b32_next = (b32_sel == 0) ? k_doutb_2 :
(b32_sel == 1) ? s2_reg3 :
(b32_sel == 2) ? 16'b0000010000000000 : b32_reg;
assign b33_next = (b33_sel == 0) ? k_doutb_3 :
(b33_sel == 1) ? s3_reg3 :
(b33_sel == 2) ? 16'b0000010000000000 : b33_reg;
assign b34_next = (b34_sel == 0) ? k_doutb_4 :
(b34_sel == 1) ? s4_reg3 :
(b34_sel == 2) ? 16'b0000010000000000 : b34_reg;
assign b35_next = (b35_sel == 0) ? k_doutb_5 :
(b35_sel == 1) ? s5_reg3 :
(b35_sel == 2) ? 16'b0000010000000000 : b35_reg;
register #(16) reg_b30(clk, rst_n, en, clr, b30_next, b30_reg);
register #(16) reg_b31(clk, rst_n, en, clr, b31_next, b31_reg);
register #(16) reg_b32(clk, rst_n, en, clr, b32_next, b32_reg);
register #(16) reg_b33(clk, rst_n, en, clr, b33_next, b33_reg);
register #(16) reg_b34(clk, rst_n, en, clr, b34_next, b34_reg);
register #(16) reg_b35(clk, rst_n, en, clr, b35_next, b35_reg);
assign b40_next = (b40_sel == 0) ? k_doutb_0 :
(b40_sel == 1) ? s0_reg3 :
(b40_sel == 2) ? 16'b0000010000000000 : b40_reg;
assign b41_next = (b41_sel == 0) ? k_doutb_1 :
(b41_sel == 1) ? s1_reg3 :
(b41_sel == 2) ? 16'b0000010000000000 : b41_reg;
assign b42_next = (b42_sel == 0) ? k_doutb_2 :
(b42_sel == 1) ? s2_reg3 :
(b42_sel == 2) ? 16'b0000010000000000 : b42_reg;
assign b43_next = (b43_sel == 0) ? k_doutb_3 :
(b43_sel == 1) ? s3_reg3 :
(b43_sel == 2) ? 16'b0000010000000000 : b43_reg;
assign b44_next = (b44_sel == 0) ? k_doutb_4 :
(b44_sel == 1) ? s4_reg3 :
(b44_sel == 2) ? 16'b0000010000000000 : b44_reg;
assign b45_next = (b45_sel == 0) ? k_doutb_5 :
(b45_sel == 1) ? s5_reg3 :
(b45_sel == 2) ? 16'b0000010000000000 : b45_reg;
register #(16) reg_b40(clk, rst_n, en, clr, b40_next, b40_reg);
register #(16) reg_b41(clk, rst_n, en, clr, b41_next, b41_reg);
register #(16) reg_b42(clk, rst_n, en, clr, b42_next, b42_reg);
register #(16) reg_b43(clk, rst_n, en, clr, b43_next, b43_reg);
register #(16) reg_b44(clk, rst_n, en, clr, b44_next, b44_reg);
register #(16) reg_b45(clk, rst_n, en, clr, b45_next, b45_reg);
assign b50_next = (b50_sel == 0) ? k_doutb_0 :
(b50_sel == 1) ? s0_reg3 :
(b50_sel == 2) ? 16'b0000010000000000 : b50_reg;
assign b51_next = (b51_sel == 0) ? k_doutb_1 :
(b51_sel == 1) ? s1_reg3 :
(b51_sel == 2) ? 16'b0000010000000000 : b51_reg;
assign b52_next = (b52_sel == 0) ? k_doutb_2 :
(b52_sel == 1) ? s2_reg3 :
(b52_sel == 2) ? 16'b0000010000000000 : b52_reg;
assign b53_next = (b53_sel == 0) ? k_doutb_3 :
(b53_sel == 1) ? s3_reg3 :
(b53_sel == 2) ? 16'b0000010000000000 : b53_reg;
assign b54_next = (b54_sel == 0) ? k_doutb_4 :
(b54_sel == 1) ? s4_reg3 :
(b54_sel == 2) ? 16'b0000010000000000 : b54_reg;
assign b55_next = (b55_sel == 0) ? k_doutb_5 :
(b55_sel == 1) ? s5_reg3 :
(b55_sel == 2) ? 16'b0000010000000000 : b55_reg;
register #(16) reg_b50(clk, rst_n, en, clr, b50_next, b50_reg);
register #(16) reg_b51(clk, rst_n, en, clr, b51_next, b51_reg);
register #(16) reg_b52(clk, rst_n, en, clr, b52_next, b52_reg);
register #(16) reg_b53(clk, rst_n, en, clr, b53_next, b53_reg);
register #(16) reg_b54(clk, rst_n, en, clr, b54_next, b54_reg);
register #(16) reg_b55(clk, rst_n, en, clr, b55_next, b55_reg);
// *** Systolic *************************************************************
systolic
#(
.WIDTH(16),
.FRAC_BIT(10)
)
systolic_0
(
.clk(clk),
.rst_n(rst_n),
.en(en),
.clr(clr),
.a0(a0), .a1(a1), .a2(a2), .a3(a3), .a4(a4), .a5(a5),
.in_valid(sys_in_valid),
.b00(b00_reg), .b01(b01_reg), .b02(b02_reg), .b03(b03_reg), .b04(b04_reg), .b05(b05_reg),
.b10(b10_reg), .b11(b11_reg), .b12(b12_reg), .b13(b13_reg), .b14(b14_reg), .b15(b15_reg),
.b20(b20_reg), .b21(b21_reg), .b22(b22_reg), .b23(b23_reg), .b24(b24_reg), .b25(b25_reg),
.b30(b30_reg), .b31(b31_reg), .b32(b32_reg), .b33(b33_reg), .b34(b34_reg), .b35(b35_reg),
.b40(b40_reg), .b41(b41_reg), .b42(b42_reg), .b43(b43_reg), .b44(b44_reg), .b45(b45_reg),
.b50(b50_reg), .b51(b51_reg), .b52(b52_reg), .b53(b53_reg), .b54(b54_reg), .b55(b55_reg),
.y0(y0), .y1(y1), .y2(y2), .y3(y3), .y4(y4), .y5(y5),
.out_valid(sys_out_valid)
);
// *** Sigmoid **************************************************************
sigmoid sigmoid_0(clk, rst_n, en, clr, y0, s0);
sigmoid sigmoid_1(clk, rst_n, en, clr, y1, s1);
sigmoid sigmoid_2(clk, rst_n, en, clr, y2, s2);
sigmoid sigmoid_3(clk, rst_n, en, clr, y3, s3);
sigmoid sigmoid_4(clk, rst_n, en, clr, y4, s4);
sigmoid sigmoid_5(clk, rst_n, en, clr, y5, s5);
register #(16) reg_sig_00(clk, rst_n, en, clr, s0, s0_reg0);
register #(16) reg_sig_01(clk, rst_n, en, clr, s0_reg0, s0_reg1);
register #(16) reg_sig_02(clk, rst_n, en, clr, s0_reg1, s0_reg2);
register #(16) reg_sig_03(clk, rst_n, en, clr, s0_reg2, s0_reg3);
register #(16) reg_sig_10(clk, rst_n, en, clr, s1, s1_reg0);
register #(16) reg_sig_11(clk, rst_n, en, clr, s1_reg0, s1_reg1);
register #(16) reg_sig_12(clk, rst_n, en, clr, s1_reg1, s1_reg2);
register #(16) reg_sig_13(clk, rst_n, en, clr, s1_reg2, s1_reg3);
register #(16) reg_sig_20(clk, rst_n, en, clr, s2, s2_reg0);
register #(16) reg_sig_21(clk, rst_n, en, clr, s2_reg0, s2_reg1);
register #(16) reg_sig_22(clk, rst_n, en, clr, s2_reg1, s2_reg2);
register #(16) reg_sig_23(clk, rst_n, en, clr, s2_reg2, s2_reg3);
register #(16) reg_sig_30(clk, rst_n, en, clr, s3, s3_reg0);
register #(16) reg_sig_31(clk, rst_n, en, clr, s3_reg0, s3_reg1);
register #(16) reg_sig_32(clk, rst_n, en, clr, s3_reg1, s3_reg2);
register #(16) reg_sig_33(clk, rst_n, en, clr, s3_reg2, s3_reg3);
register #(16) reg_sig_40(clk, rst_n, en, clr, s4, s4_reg0);
register #(16) reg_sig_41(clk, rst_n, en, clr, s4_reg0, s4_reg1);
register #(16) reg_sig_42(clk, rst_n, en, clr, s4_reg1, s4_reg2);
register #(16) reg_sig_43(clk, rst_n, en, clr, s4_reg2, s4_reg3);
register #(16) reg_sig_50(clk, rst_n, en, clr, s5, s5_reg0);
register #(16) reg_sig_51(clk, rst_n, en, clr, s5_reg0, s5_reg1);
register #(16) reg_sig_52(clk, rst_n, en, clr, s5_reg1, s5_reg2);
register #(16) reg_sig_53(clk, rst_n, en, clr, s5_reg2, s5_reg3);
register #(1) reg_sig_valid_0(clk, rst_n, en, clr, sys_out_valid, sig_out_valid);
register #(1) reg_sig_valid_1(clk, rst_n, en, clr, sig_out_valid, sig_out_valid_reg0);
register #(1) reg_sig_valid_2(clk, rst_n, en, clr, sig_out_valid_reg0, sig_out_valid_reg1);
register #(1) reg_sig_valid_3(clk, rst_n, en, clr, sig_out_valid_reg1, sig_out_valid_reg2);
register #(1) reg_sig_valid_4(clk, rst_n, en, clr, sig_out_valid_reg2, sig_out_valid_reg3);
// *** Output BRAM **********************************************************
assign a_dina = {32'd0, s5, s4, s3, s2, s1, s0};
// xpm_memory_tdpram: True Dual Port RAM
// Xilinx Parameterized Macro, version 2018.3
xpm_memory_tdpram
#(
// Common module parameters
.MEMORY_SIZE(512), // DECIMAL, size: 4x128bit= 512 bits
.MEMORY_PRIMITIVE("auto"), // String
.CLOCKING_MODE("common_clock"), // String, "common_clock"
.MEMORY_INIT_FILE("none"), // String
.MEMORY_INIT_PARAM("0"), // String
.USE_MEM_INIT(1), // DECIMAL
.WAKEUP_TIME("disable_sleep"), // String
.MESSAGE_CONTROL(0), // DECIMAL
.AUTO_SLEEP_TIME(0), // DECIMAL
.ECC_MODE("no_ecc"), // String
.MEMORY_OPTIMIZATION("true"), // String
.USE_EMBEDDED_CONSTRAINT(0), // DECIMAL
// Port A module parameters
.WRITE_DATA_WIDTH_A(128), // DECIMAL, data width: 128-bit
.READ_DATA_WIDTH_A(128), // DECIMAL, data width: 128-bit
.BYTE_WRITE_WIDTH_A(8), // DECIMAL
.ADDR_WIDTH_A(2), // DECIMAL, clog2(512/128)=clog2(4)= 2
.READ_RESET_VALUE_A("0"), // String
.READ_LATENCY_A(1), // DECIMAL
.WRITE_MODE_A("write_first"), // String
.RST_MODE_A("SYNC"), // String
// Port B module parameters
.WRITE_DATA_WIDTH_B(128), // DECIMAL, data width: 128-bit
.READ_DATA_WIDTH_B(128), // DECIMAL, data width: 128-bit
.BYTE_WRITE_WIDTH_B(8), // DECIMAL
.ADDR_WIDTH_B(2), // DECIMAL, clog2(512/128)=clog2(4)= 2
.READ_RESET_VALUE_B("0"), // String
.READ_LATENCY_B(1), // DECIMAL
.WRITE_MODE_B("write_first"), // String
.RST_MODE_B("SYNC") // String
)
xpm_memory_tdpram_a
(
.sleep(1'b0),
.regcea(1'b1), //do not change
.injectsbiterra(1'b0), //do not change
.injectdbiterra(1'b0), //do not change
.sbiterra(), //do not change
.dbiterra(), //do not change
.regceb(1'b1), //do not change
.injectsbiterrb(1'b0), //do not change
.injectdbiterrb(1'b0), //do not change
.sbiterrb(), //do not change
.dbiterrb(), //do not change
// Port A module ports
.clka(clk),
.rsta(~rst_n),
.ena(a_ena),
.wea(a_wea),
.addra(a_addra),
.dina(a_dina),
.douta(),
// Port B module ports
.clkb(clk),
.rstb(~rst_n),
.enb(a_enb),
.web(0),
.addrb(a_addrb),
.dinb(0),
.doutb(a_doutb)
);
endmodule
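The systolic array above is instantiated with WIDTH = 16 and FRAC_BIT = 10, and the stationary-input multiplexers load the constant 16'b0000010000000000 when a select value is 2. Read as a 16-bit fixed-point number with 10 fractional bits, that constant is 1024 / 2^10 = 1.0, presumably the constant input that multiplies the bias. The following Python sketch shows the conversion in both directions. The helper names to_fixed and from_fixed are hypothetical, and the signed two's-complement interpretation, rounding, and saturation are assumptions rather than behavior confirmed by the Verilog.

```python
WIDTH = 16      # matches .WIDTH(16) in the systolic instance
FRAC_BIT = 10   # matches .FRAC_BIT(10) in the systolic instance


def to_fixed(value, width=WIDTH, frac_bit=FRAC_BIT):
    """Convert a float to a fixed-point word (assumed signed, saturating)."""
    scaled = int(round(value * (1 << frac_bit)))
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    scaled = max(lo, min(hi, scaled))      # saturate instead of wrapping
    return scaled & ((1 << width) - 1)     # two's-complement bit pattern


def from_fixed(word, width=WIDTH, frac_bit=FRAC_BIT):
    """Convert a fixed-point word back to a float."""
    if word & (1 << (width - 1)):          # sign bit set: negative value
        word -= 1 << width
    return word / (1 << frac_bit)


ONE = 0b0000010000000000                   # constant used by the b*_sel == 2 branch
print(from_fixed(ONE))
print(format(to_fixed(1.0), "016b"))
print(from_fixed(to_fixed(-0.5)))
```

With 10 fractional bits the resolution is 1/1024, and the assumed signed 16-bit range is roughly -32 to +32, so weights and activations outside that range would saturate in this sketch.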
You can simulate the ANN core with a testbench to obtain its timing diagram.
How it works:
Start the controller FSM.
Read the input from memory, followed by weight and bias 2.
Stream the inputs into the systolic array for hidden layer 1.
Stream the sigmoid outputs of hidden layer 1.
Read weight and bias 3 from memory.
Stream the inputs into the systolic array for hidden layer 2.
Stream the sigmoid outputs of hidden layer 2.
Write the output to memory.
Assert the done signal.
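The steps above are driven by the main counter cnt_main_reg in the ANN core. As a rough aid for reading the timing diagram, the following Python sketch models a few of the control outputs using the same comparisons as the assign statements in the Verilog (ready, k_enb, sys_in_valid, a_ena, and done). The function names are hypothetical, and the sketch assumes the counter simply steps from 0 to 50. How the counter starts and wraps is not modeled.

```python
def control_signals(cnt):
    """Return selected control outputs for one value of cnt_main_reg."""
    return {
        "ready": int(cnt == 0),                                  # status control
        "k_enb": int(1 <= cnt <= 4),                             # input BRAM read enable
        "sys_in_valid": int(4 <= cnt <= 9 or 26 <= cnt <= 31),   # two systolic input bursts
        "a_ena": int(40 <= cnt <= 41),                           # output BRAM write enable
        "done": int(cnt == 50),                                  # status control
    }


def active_cycles(name, last_cnt=50):
    """List the counter values for which the named signal is high."""
    return [c for c in range(last_cnt + 1) if control_signals(c)[name]]


for name in ("ready", "k_enb", "sys_in_valid", "a_ena", "done"):
    print(f"{name:13s} {active_cycles(name)}")
```

The two sys_in_valid bursts line up with the two systolic input streams in the list above, and the two a_ena cycles correspond to the writes to output BRAM addresses 0 and 1.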
3.5. AXI Stream Module
Once we have the ANN core, we need to wrap it in an AXI-Stream module so that it is compatible with the AXI-Stream protocol.
This code is an implementation of the AXIS ANN in Verilog.
You can simulate the AXIS ANN module with a testbench to obtain its timing diagram.
3.6. SoC Design
At this point, we already have the AXIS ANN module. Next, we need to create a block design that consists of the Zynq IP, the AXI DMA, and the AXIS ANN.
Configure the AXI DMA stream width to 128-bit as shown in the following figure.
The AXI DMA reads the weights, biases, and inputs from DDR memory and writes the outputs back to a specific location in DDR memory.
Data mapping inside the DDR memory for weight, bias, and input.
Data mapping inside the DDR memory for output.
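To make the output mapping concrete: the ANN core writes each 128-bit output word as {32'd0, s5, s4, s3, s2, s1, s0}, so s0 occupies the lowest 16 bits and the top 32 bits are zero. The Python sketch below packs and unpacks one such word. The function names are hypothetical, and the sketch assumes that the word appears in DDR in little-endian byte order and that each field is a signed 16-bit value with 10 fractional bits.

```python
import struct

FRAC_BIT = 10  # fractional bits, as in the systolic instance


def pack_output_word(values):
    """Build the 16 bytes of one 128-bit word laid out as {32'd0, s5, ..., s0}."""
    raw = [int(round(v * (1 << FRAC_BIT))) for v in values]
    # Six signed 16-bit fields (s0 first, i.e. lowest bits), then 32 zero bits.
    return struct.pack("<6h2h", *raw, 0, 0)


def unpack_output_word(word_bytes):
    """Recover s0..s5 as floats from one 128-bit word (assumed little-endian)."""
    fields = struct.unpack("<8h", word_bytes)
    return [f / (1 << FRAC_BIT) for f in fields[:6]]  # drop the zero padding


word = pack_output_word([0.5, 0.25, 0.0, 1.0, 0.75, 0.125])
print(word.hex())
print(unpack_output_word(word))
```

On the board, the same unpacking could be applied to each 128-bit word of the DMA receive buffer, provided the byte order matches the assumption above.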
4. Software Design
At this point, the required files to program the FPGA are already on the board. The next step is to create Jupyter Notebook files.
Open a web browser and open Jupyter Notebook on the board. Create a new notebook from the menu New > Python 3 (ipykernel).
Write the following code to test the design.
5. Performance
We can compare the performance of the HW-based ANN computation with SW computation using this code.
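As a point of reference for the software side of such a comparison, the following is a minimal pure-Python sketch of forward propagation with sigmoid activations, timed with time.perf_counter. It is not the tutorial's benchmark code. The layer sizes and weight values are hypothetical placeholders, the helper names are made up for illustration, and the hardware time would have to be measured separately on the board around the DMA transfers.

```python
import math
import time


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense_sigmoid(weights, bias, inputs):
    """One layer: sigmoid(W @ x + b), with W given as a list of rows."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, bias)]


def forward(layers, inputs):
    """Forward propagation through a list of (weights, bias) layers."""
    out = inputs
    for weights, bias in layers:
        out = dense_sigmoid(weights, bias, out)
    return out


# Hypothetical 3-2-1 network with placeholder values (not the tutorial's trained weights).
layers = [
    ([[0.5, -0.5, 0.25], [-1.0, 1.0, 0.0]], [0.0, 0.5]),
    ([[1.0, -1.0]], [0.0]),
]

runs = 1000
start = time.perf_counter()
for _ in range(runs):
    y = forward(layers, [1.0, 0.0, 1.0])
elapsed = time.perf_counter() - start
print(f"output = {y}")
print(f"mean SW time per inference: {elapsed / runs * 1e6:.2f} us")
```

Averaging over many runs reduces timer noise. For a fair comparison, the hardware measurement should include the time spent moving data to and from DDR memory.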