Part 4: ANN Processor

Objective

After you complete this tutorial, you should be able to:

  • Understand how to design a simple ANN accelerator from HW to SW.

Source Code

This repository contains all of the code required in order to follow this tutorial.

1. Introduction

1.1. What is ANN

In machine learning, a neural network (also called an artificial neural network, abbreviated ANN or NN) is a mathematical model inspired by the structure and function of biological neural networks in human brains.

A NN consists of connected units or nodes called artificial neurons, which loosely model the neurons in a brain. Figure 1(a) shows a neuron in the human brain, and Figure 1(b) shows an artificial neuron. An artificial neuron consists of inputs $x$, weights $w$, and an output $y$.

In the human brain, a neuron can connect to more than one neuron as shown in Figure 1(c). This is the same for the artificial neuron in a NN as shown in Figure 1(d). A NN consists of multiple layers, and each layer consists of multiple neurons.

Every neuron in a NN performs the computation expressed by the following equation, where $f$ is the activation function and $n$ is the number of inputs:

$$y_j=f(\sum_{i=1}^{n} x_iw_i)$$
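
As a small illustration of this equation, the Python sketch below computes the output of one neuron. The sigmoid activation and the sample numbers are choices made for this example (they reuse the first weight row and the first input column from Section 2.2, with a trailing 1 acting as the bias input), and the function names are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic activation, used here as the function f."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, f=sigmoid):
    """Return f(sum_i x_i * w_i) for one artificial neuron."""
    weighted_sum = sum(xi * wi for xi, wi in zip(x, w))
    return f(weighted_sum)

x = [2, 10, 5, 3, 1]               # inputs; the last entry is the constant bias input
w = [-1.2, 1.3, 1.7, -1.3, -1.3]   # weights; the last entry is the bias
z = sum(xi * wi for xi, wi in zip(x, w))
y = neuron(x, w)
print(f"weighted sum = {z:.1f}, output = {y:.4f}")
```

The weighted sum is 13.9, which is the first entry of $Z_2$ in the hand calculation of Section 2.2.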

For the whole NN, there are two main steps, which are forward propagation and backward propagation.

  • In forward propagation, we do the mathematical calculation from the input layer until the output layer to get a prediction. The forward propagation process is also called inference.

  • In backward propagation, we compare the prediction from forward propagation with the true values and calculate a loss score using a loss function. An optimizer then uses this loss score to update the weights. The backward propagation process is also called training.

An untrained NN starts with random weights and cannot make useful predictions. The goal of training is therefore to obtain weights that predict the output correctly during inference.
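
To make the two steps concrete, here is a minimal sketch of one training step for a single sigmoid neuron. It assumes a squared-error loss and plain gradient descent as the optimizer; neither choice comes from this tutorial, and all names and numbers are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w):
    """Forward propagation: weighted sum followed by the activation."""
    return sigmoid(sum(xi * wi for xi, wi in zip(x, w)))

def loss(y, t):
    """Squared-error loss between the prediction y and the true value t."""
    return 0.5 * (y - t) ** 2

def train_step(x, w, t, lr):
    """Backward propagation for one neuron, then a gradient-descent update."""
    y = forward(x, w)
    # dL/dw_i = (y - t) * y * (1 - y) * x_i for a sigmoid neuron with squared error
    delta = (y - t) * y * (1.0 - y)
    return [wi - lr * delta * xi for xi, wi in zip(x, w)]

x = [0.2, 1.0, 0.5]      # example inputs
w = [0.1, -0.2, 0.3]     # example starting weights
t = 1.0                  # true value
loss_before = loss(forward(x, w), t)
w_new = train_step(x, w, t, lr=0.5)
loss_after = loss(forward(x, w_new), t)
print(f"loss before = {loss_before:.4f}, loss after = {loss_after:.4f}")
```

One update moves the weights in the direction that lowers the loss; training repeats this over the whole dataset until the predictions are good enough.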

In this tutorial, we are going to focus on how to accelerate the forward propagation process on the FPGA as a matrix multiplication process.

1.2. Hardware Accelerator

Hardware accelerators are purpose-built designs that accompany a processor to accelerate specific computations. Because processors are designed to handle a wide range of workloads, their architectures are rarely optimal for any one specific computation.

One example of a hardware accelerator for NN is the Google Tensor Processing Unit (TPU), shown in Figure 3. The TPU is an application-specific integrated circuit (ASIC) that accelerates NN workloads built with Google's own TensorFlow software.

Figure 4 shows the block diagram of the TPU. Its main processing unit is a matrix multiplication unit built as a systolic array of 256x256 processing elements (65,536 ALUs in total). In this tutorial, we build something similar in concept on a much smaller scale, with only a 6x6 systolic array.

2. ANN Model

2.1. Simple ANN Example

In this tutorial, we use a simple NN model as an example: classifying someone's taste in Indonesian food.

The following table shows the dataset of that person's taste in Indonesian food.

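| Name          | Sour | Sweet | Salty | Spicy | Taste   | Label |
|---------------|------|-------|-------|-------|---------|-------|
| Sate Maranggi | 2    | 10    | 5     | 3     | Like    | [1,0] |
| Soto          | 7    | 2     | 3     | 3     | Dislike | [0,1] |
| Karedok       | 6    | 8     | 1     | 6     | Dislike | [0,1] |
| Gudeg         | 3    | 10    | 3     | 1     | Like    | [1,0] |
| Ikan Bakar    | 6    | 9     | 5     | 6     | Like    | [1,0] |
| Rendang       | 3    | 2     | 6     | 10    | Dislike | [0,1] |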

  • There are six types of food.

  • A man eats these six foods and rates the sourness, sweetness, saltiness, and spiciness of each one.

  • After rating the sourness $k_1$, sweetness $k_2$, saltiness $k_3$, and spiciness $k_4$, he decides which foods he likes and which foods he dislikes.

  • We encode the foods he likes as $[t_1,t_2]=[1,0]$ and the foods he dislikes as $[t_1,t_2]=[0,1]$.

2.2. ANN Computation

The neural network for this application consists of one input layer, one hidden layer, and one output layer. It takes an input matrix $K_p$ and produces the final output matrix $A_3$.

Computing the NN inference by hand as a sequence of matrix multiplications takes six steps. The ANN processor implements the same steps on the FPGA, so this hand calculation also serves as the reference for verification.

  1. Padding input:

$$K_p=\begin{bmatrix} 2 & 7 & 6 & 3 & 6 & 3\\ 10 & 2 & 8 & 10 & 9 & 2\\ 5 & 3 & 1 & 3 & 5 & 6\\ 3 & 3 & 6 & 1 & 6 & 10\\ 1 & 1 & 1 & 1 & 1 & 1\\ \end{bmatrix}$$

  2. Matrix multiplication hidden layer 1:

$$WB_2*K_p=Z_2$$

$$\begin{bmatrix} -1.2 & 1.3 & 1.7 & -1.3 & -1.3\\ 0.3 & 0.5 & 0.2 & 1 & -1\\ 0.6 & 0.1 & 0.8 & 1.5 & -1\\ 1.3 & -1.2 & -1.4 & 1.3 & -0.9\\ 1.3 & 0.3 & 0.5 & 0.4 & -1\\ \end{bmatrix}* \begin{bmatrix} 2 & 7 & 6 & 3 & 6 & 3\\ 10 & 2 & 8 & 10 & 9 & 2\\ 5 & 3 & 1 & 3 & 5 & 6\\ 3 & 3 & 6 & 1 & 6 & 10\\ 1 & 1 & 1 & 1 & 1 & 1\\ \end{bmatrix}= \begin{bmatrix} 13.9 & -5.9 & -4.2 & 11.9 & 3.9 & -5.1\\ 8.6 & 5.7 & 11 & 6.5 & 12.3 & 12.1\\ 9.7 & 10.3 & 13.2 & 5.7 & 16.5 & 20.8\\ -13.4 & 5.5 & 3.7 & -11.9 & -3.1 & 5.2\\ 8.3 & 11.4 & 12.1 & 7.8 & 14.4 & 10.5\\ \end{bmatrix}$$

  3. Activation hidden layer 1:

$$\sigma(Z_2)=A_2$$

$$\sigma( \begin{bmatrix} 13.9 & -5.9 & -4.2 & 11.9 & 3.9 & -5.1\\ 8.6 & 5.7 & 11 & 6.5 & 12.3 & 12.1\\ 9.7 & 10.3 & 13.2 & 5.7 & 16.5 & 20.8\\ -13.4 & 5.5 & 3.7 & -11.9 & -3.1 & 5.2\\ 8.3 & 11.4 & 12.1 & 7.8 & 14.4 & 10.5\\ \end{bmatrix} )= \begin{bmatrix} 0.99 & 0.00 & 0.01 & 0.99 & 0.98 & 0.00\\ 0.99 & 0.99 & 0.99 & 0.99 & 0.99 & 0.99\\ 0.99 & 0.99 & 0.99 & 0.99 & 0.99 & 0.99\\ 0.00 & 0.99 & 0.97 & 0.00 & 0.04 & 0.99\\ 0.99 & 0.99 & 0.99 & 0.99 & 0.99 & 0.99\\ \end{bmatrix}$$

  4. Padding output hidden 1:

$$A_{2p}= \begin{bmatrix} 0.99 & 0.00 & 0.01 & 0.99 & 0.98 & 0.00\\ 0.99 & 0.99 & 0.99 & 0.99 & 0.99 & 0.99\\ 0.99 & 0.99 & 0.99 & 0.99 & 0.99 & 0.99\\ 0.00 & 0.99 & 0.97 & 0.00 & 0.04 & 0.99\\ 0.99 & 0.99 & 0.99 & 0.99 & 0.99 & 0.99\\ 1 & 1 & 1 & 1 & 1 & 1\\ \end{bmatrix}$$

  5. Matrix multiplication hidden layer 2:

$$WB_3*A_{2p}=Z_3$$

$$\begin{bmatrix} 5.2 & -0.3 & 0.8 & -3.5 & 0.1 & -1.5\\ -4.8 & 0.1 & 0.7 & 4 & 0.9 & -1.4\\ \end{bmatrix}* \begin{bmatrix} 0.99 & 0.00 & 0.01 & 0.99 & 0.98 & 0.00\\ 0.99 & 0.99 & 0.99 & 0.99 & 0.99 & 0.99\\ 0.99 & 0.99 & 0.99 & 0.99 & 0.99 & 0.99\\ 0.00 & 0.99 & 0.97 & 0.00 & 0.04 & 0.99\\ 0.99 & 0.99 & 0.99 & 0.99 & 0.99 & 0.99\\ 1 & 1 & 1 & 1 & 1 & 1\\ \end{bmatrix}= \begin{bmatrix} 4.29 & -4.37 & -4.23 & 4.29 & 4.04 & -4.34\\ -4.50 & 4.27 & 4.13 & -4.50 & -4.23 & 4.24\\ \end{bmatrix}$$

  6. Activation hidden layer 2:

$$\sigma(Z_3)=A_3$$

$$\sigma( \begin{bmatrix} 4.29 & -4.37 & -4.23 & 4.29 & 4.04 & -4.34\\ -4.50 & 4.27 & 4.13 & -4.50 & -4.23 & 4.24\\ \end{bmatrix} )= \begin{bmatrix} 0.98 & 0.01 & 0.01 & 0.98 & 0.98 & 0.01\\ 0.01 & 0.98 & 0.98 & 0.01 & 0.01 & 0.98\\ \end{bmatrix}$$

You can compare the result matrix $A_3$ with the labels in the dataset. Each column of $A_3$ corresponds to one food, and after rounding to the nearest integer it should equal that food's label.
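
The six steps can be reproduced with a short floating-point reference model. The sketch below uses plain Python lists; the helper names (`pad`, `matmul`, `sigmoid_matrix`) are hypothetical, and the matrices are copied from the hand calculation. Padding appends a row of ones so that the last column of each weight matrix acts as the bias.

```python
import math

# Feature rows (sour, sweet, salty, spicy), one column per food, from the dataset table.
K = [[2, 7, 6, 3, 6, 3],
     [10, 2, 8, 10, 9, 2],
     [5, 3, 1, 3, 5, 6],
     [3, 3, 6, 1, 6, 10]]

WB2 = [[-1.2, 1.3, 1.7, -1.3, -1.3],
       [0.3, 0.5, 0.2, 1.0, -1.0],
       [0.6, 0.1, 0.8, 1.5, -1.0],
       [1.3, -1.2, -1.4, 1.3, -0.9],
       [1.3, 0.3, 0.5, 0.4, -1.0]]

WB3 = [[5.2, -0.3, 0.8, -3.5, 0.1, -1.5],
       [-4.8, 0.1, 0.7, 4.0, 0.9, -1.4]]

def pad(m):
    """Append a row of ones (the constant input that multiplies the bias)."""
    return [row[:] for row in m] + [[1.0] * len(m[0])]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def sigmoid_matrix(m):
    """Apply the sigmoid to every element."""
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in m]

Kp = pad(K)                  # step 1
Z2 = matmul(WB2, Kp)         # step 2
A2 = sigmoid_matrix(Z2)      # step 3
A2p = pad(A2)                # step 4
Z3 = matmul(WB3, A2p)        # step 5
A3 = sigmoid_matrix(Z3)      # step 6

predicted = [[round(v) for v in row] for row in A3]
print(predicted)
```

Because this model keeps full floating-point precision, its intermediate values can differ from the two-decimal figures above in the last digit, but the rounded predictions reproduce the labels of all six foods.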

3. Hardware Design

3.1. Basic Processing Elements

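The ANN computation is built from two basic elements: a register and a multiply-accumulate (MAC) unit, which serves as the processing element (PE).

The following Verilog code implements a register with a parameterized width (16 bits by default). It has a synchronous active-low reset (rst_n), a synchronous clear (clr), and an enable (en).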

module register
    #( 
        parameter WIDTH = 16
    )
    (
        input wire                    clk,
        input wire                    rst_n,
        input wire                    en,
        input wire                    clr,
        input wire signed [WIDTH-1:0] d,
        output reg signed [WIDTH-1:0] q
    );
    
    always @(posedge clk)
    begin
        if (!rst_n || clr)
        begin
            q <= 0;
        end
        else if (en)
        begin
            q <= d;
        end
    end
    
endmodule

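The following Verilog code implements the MAC unit, or PE. Operands are signed fixed-point numbers with WIDTH bits in total and FRAC_BIT fractional bits (16 and 10 by default). The full product carries 2*FRAC_BIT fractional bits, so the PE keeps bits [WIDTH+FRAC_BIT-1:FRAC_BIT] of it to return to FRAC_BIT fractional bits before adding the incoming partial sum y_in.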

module pe
    #( 
        parameter WIDTH = 16,
        parameter FRAC_BIT = 10
    )
    (
        input wire signed [WIDTH-1:0]  a_in,
        input wire signed [WIDTH-1:0]  y_in,
        input wire signed [WIDTH-1:0]  b,
        output wire signed [WIDTH-1:0] a_out,
        output wire signed [WIDTH-1:0] y_out
    );
    
    wire signed [WIDTH*2-1:0] y_out_i;
    
    assign a_out = a_in;
    assign y_out_i = a_in * b;
    assign y_out = y_in + y_out_i[WIDTH+FRAC_BIT-1:FRAC_BIT];

endmodule
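
The fixed-point arithmetic of the PE can be mirrored in software to predict what the hardware should output. The following Python sketch is a bit-level model of the pe module with the default parameters. The helper names are hypothetical, and rounding to the nearest code when quantizing is an assumption, since the tutorial does not state how the real values are converted.

```python
WIDTH = 16
FRAC_BIT = 10
MASK = (1 << WIDTH) - 1

def to_signed(v):
    """Interpret the low WIDTH bits of v as a two's-complement number."""
    v &= MASK
    return v - (1 << WIDTH) if v & (1 << (WIDTH - 1)) else v

def to_fixed(x):
    """Quantize a real number to fixed point (round-to-nearest is an assumption)."""
    return to_signed(int(round(x * (1 << FRAC_BIT))))

def to_float(v):
    """Convert a fixed-point code back to a real number."""
    return to_signed(v) / (1 << FRAC_BIT)

def pe(a_in, y_in, b):
    """Model of the pe module: a_out = a_in, y_out = y_in + product[25:10]."""
    product = a_in * b                       # has 2*FRAC_BIT fractional bits
    scaled = (product >> FRAC_BIT) & MASK    # bits [WIDTH+FRAC_BIT-1:FRAC_BIT]
    y_out = to_signed(y_in + scaled)         # WIDTH-bit wrap-around addition
    return a_in, y_out

# Example: 0.25 + 1.5 * (-2.0)
a_out, y_out = pe(to_fixed(1.5), to_fixed(0.25), to_fixed(-2.0))
print(to_float(y_out))
```

Values that are exactly representable, as in this example, come out exact. Other values lose a little precision: 0.1 is stored as 102/1024, so 0.1 * 0.1 evaluates to 10/1024 instead of 0.01.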

3.2. Systolic Matrix Multiplication

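From the register and PE modules, we can build a matrix multiplication module with a systolic architecture.

The following Verilog code implements the systolic module. The a inputs pass through one to six input registers so that they enter the array skewed in time, the b inputs stay fixed at the PEs, the partial sums y move from one PE row to the next through registers, and the output registers realign the results. The valid flag passes through 13 registers to match the total latency.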

module systolic
    #( 
        parameter WIDTH = 16,
        parameter FRAC_BIT = 10
    )
    (
        input wire                     clk,
        input wire                     rst_n,
        input wire                     en,
        input wire                     clr,
        input wire signed [WIDTH-1:0]  a0, a1, a2, a3, a4, a5,
        input wire                     in_valid,
        input wire signed [WIDTH-1:0]  b00, b01, b02, b03, b04, b05,
        input wire signed [WIDTH-1:0]  b10, b11, b12, b13, b14, b15,
        input wire signed [WIDTH-1:0]  b20, b21, b22, b23, b24, b25,
        input wire signed [WIDTH-1:0]  b30, b31, b32, b33, b34, b35,
        input wire signed [WIDTH-1:0]  b40, b41, b42, b43, b44, b45,
        input wire signed [WIDTH-1:0]  b50, b51, b52, b53, b54, b55,
        output wire signed [WIDTH-1:0] y0, y1, y2, y3, y4, y5,  
        output wire                    out_valid  
    );
    
    // *** Input registers ***
    wire signed [WIDTH-1:0] a0_reg0;
    wire signed [WIDTH-1:0] a1_reg0, a1_reg1;
    wire signed [WIDTH-1:0] a2_reg0, a2_reg1, a2_reg2; 
    wire signed [WIDTH-1:0] a3_reg0, a3_reg1, a3_reg2, a3_reg3;
    wire signed [WIDTH-1:0] a4_reg0, a4_reg1, a4_reg2, a4_reg3, a4_reg4;
    wire signed [WIDTH-1:0] a5_reg0, a5_reg1, a5_reg2, a5_reg3, a5_reg4, a5_reg5;
    
    // *** a in ***
    wire signed [WIDTH-1:0] a00_in, a01_in, a02_in, a03_in, a04_in, a05_in,
                            a10_in, a11_in, a12_in, a13_in, a14_in, a15_in,
                            a20_in, a21_in, a22_in, a23_in, a24_in, a25_in,
                            a30_in, a31_in, a32_in, a33_in, a34_in, a35_in,
                            a40_in, a41_in, a42_in, a43_in, a44_in, a45_in,
                            a50_in, a51_in, a52_in, a53_in, a54_in, a55_in;
    // *** y in ***
    wire signed [WIDTH-1:0] y00_in, y01_in, y02_in, y03_in, y04_in, y05_in,
                            y10_in, y11_in, y12_in, y13_in, y14_in, y15_in,
                            y20_in, y21_in, y22_in, y23_in, y24_in, y25_in,
                            y30_in, y31_in, y32_in, y33_in, y34_in, y35_in,
                            y40_in, y41_in, y42_in, y43_in, y44_in, y45_in,
                            y50_in, y51_in, y52_in, y53_in, y54_in, y55_in;
    // *** a out ***
    wire signed [WIDTH-1:0] a00_out, a01_out, a02_out, a03_out, a04_out, a05_out,
                            a10_out, a11_out, a12_out, a13_out, a14_out, a15_out,
                            a20_out, a21_out, a22_out, a23_out, a24_out, a25_out,
                            a30_out, a31_out, a32_out, a33_out, a34_out, a35_out,
                            a40_out, a41_out, a42_out, a43_out, a44_out, a45_out,
                            a50_out, a51_out, a52_out, a53_out, a54_out, a55_out;
    // *** y out ***
    wire signed [WIDTH-1:0] y00_out, y01_out, y02_out, y03_out, y04_out, y05_out,
                            y10_out, y11_out, y12_out, y13_out, y14_out, y15_out,
                            y20_out, y21_out, y22_out, y23_out, y24_out, y25_out,
                            y30_out, y31_out, y32_out, y33_out, y34_out, y35_out,
                            y40_out, y41_out, y42_out, y43_out, y44_out, y45_out,
                            y50_out, y51_out, y52_out, y53_out, y54_out, y55_out;
    
    // *** Output registers ***
    wire signed [WIDTH-1:0] y0_tmp, y1_tmp, y2_tmp, y3_tmp, y4_tmp, y5_tmp; 
    wire signed [WIDTH-1:0] y0_reg0, y0_reg1, y0_reg2, y0_reg3, y0_reg4, y0_reg5;
    wire signed [WIDTH-1:0] y1_reg0, y1_reg1, y1_reg2, y1_reg3, y1_reg4;
    wire signed [WIDTH-1:0] y2_reg0, y2_reg1, y2_reg2, y2_reg3; 
    wire signed [WIDTH-1:0] y3_reg0, y3_reg1, y3_reg2;
    wire signed [WIDTH-1:0] y4_reg0, y4_reg1;
    wire signed [WIDTH-1:0] y5_reg0;
    
    // *** Valid registers ***
    wire in_valid_reg0, in_valid_reg1, in_valid_reg2, in_valid_reg3, in_valid_reg4, in_valid_reg5, in_valid_reg6, in_valid_reg7, in_valid_reg8, in_valid_reg9, in_valid_reg10, in_valid_reg11, in_valid_reg12;
    
    // *** Input registers for systolic data setup ***
    register #(WIDTH) reg_a0_0(clk, rst_n, en, clr, a0,      a0_reg0); 
    
    register #(WIDTH) reg_a1_0(clk, rst_n, en, clr, a1,      a1_reg0); 
    register #(WIDTH) reg_a1_1(clk, rst_n, en, clr, a1_reg0, a1_reg1); 
    
    register #(WIDTH) reg_a2_0(clk, rst_n, en, clr, a2,      a2_reg0);
    register #(WIDTH) reg_a2_1(clk, rst_n, en, clr, a2_reg0, a2_reg1);
    register #(WIDTH) reg_a2_2(clk, rst_n, en, clr, a2_reg1, a2_reg2);
    
    register #(WIDTH) reg_a3_0(clk, rst_n, en, clr, a3,      a3_reg0);
    register #(WIDTH) reg_a3_1(clk, rst_n, en, clr, a3_reg0, a3_reg1);
    register #(WIDTH) reg_a3_2(clk, rst_n, en, clr, a3_reg1, a3_reg2);
    register #(WIDTH) reg_a3_3(clk, rst_n, en, clr, a3_reg2, a3_reg3);
    
    register #(WIDTH) reg_a4_0(clk, rst_n, en, clr, a4,      a4_reg0);
    register #(WIDTH) reg_a4_1(clk, rst_n, en, clr, a4_reg0, a4_reg1);
    register #(WIDTH) reg_a4_2(clk, rst_n, en, clr, a4_reg1, a4_reg2);
    register #(WIDTH) reg_a4_3(clk, rst_n, en, clr, a4_reg2, a4_reg3);
    register #(WIDTH) reg_a4_4(clk, rst_n, en, clr, a4_reg3, a4_reg4);
    
    register #(WIDTH) reg_a5_0(clk, rst_n, en, clr, a5,      a5_reg0);
    register #(WIDTH) reg_a5_1(clk, rst_n, en, clr, a5_reg0, a5_reg1);
    register #(WIDTH) reg_a5_2(clk, rst_n, en, clr, a5_reg1, a5_reg2);
    register #(WIDTH) reg_a5_3(clk, rst_n, en, clr, a5_reg2, a5_reg3);
    register #(WIDTH) reg_a5_4(clk, rst_n, en, clr, a5_reg3, a5_reg4);
    register #(WIDTH) reg_a5_5(clk, rst_n, en, clr, a5_reg4, a5_reg5);
                
    // *** First a inputs (taken from the skewed input registers) ***
    assign a00_in = a0_reg0;
    assign a10_in = a1_reg1;
    assign a20_in = a2_reg2;
    assign a30_in = a3_reg3;
    assign a40_in = a4_reg4;
    assign a50_in = a5_reg5;
    
    // *** First y inputs (partial sums start at zero) ***
    assign y00_in = 0;
    assign y01_in = 0;
    assign y02_in = 0;
    assign y03_in = 0;
    assign y04_in = 0;
    assign y05_in = 0;
    
    // *** 6x6 systolic array ***
    // *** Row 0 from bottom ***
    pe #(WIDTH, FRAC_BIT) pe00(a00_in, y00_in, b00, a00_out, y00_out);
    pe #(WIDTH, FRAC_BIT) pe01(a01_in, y01_in, b01, a01_out, y01_out);
    pe #(WIDTH, FRAC_BIT) pe02(a02_in, y02_in, b02, a02_out, y02_out);
    pe #(WIDTH, FRAC_BIT) pe03(a03_in, y03_in, b03, a03_out, y03_out);
    pe #(WIDTH, FRAC_BIT) pe04(a04_in, y04_in, b04, a04_out, y04_out);
    pe #(WIDTH, FRAC_BIT) pe05(a05_in, y05_in, b05, a05_out, y05_out);
    // *** Row 1 from bottom ***
    pe #(WIDTH, FRAC_BIT) pe10(a10_in, y10_in, b10, a10_out, y10_out);
    pe #(WIDTH, FRAC_BIT) pe11(a11_in, y11_in, b11, a11_out, y11_out);
    pe #(WIDTH, FRAC_BIT) pe12(a12_in, y12_in, b12, a12_out, y12_out);
    pe #(WIDTH, FRAC_BIT) pe13(a13_in, y13_in, b13, a13_out, y13_out);
    pe #(WIDTH, FRAC_BIT) pe14(a14_in, y14_in, b14, a14_out, y14_out);
    pe #(WIDTH, FRAC_BIT) pe15(a15_in, y15_in, b15, a15_out, y15_out);
    // *** Row 2 from bottom ***
    pe #(WIDTH, FRAC_BIT) pe20(a20_in, y20_in, b20, a20_out, y20_out);
    pe #(WIDTH, FRAC_BIT) pe21(a21_in, y21_in, b21, a21_out, y21_out);
    pe #(WIDTH, FRAC_BIT) pe22(a22_in, y22_in, b22, a22_out, y22_out);
    pe #(WIDTH, FRAC_BIT) pe23(a23_in, y23_in, b23, a23_out, y23_out);
    pe #(WIDTH, FRAC_BIT) pe24(a24_in, y24_in, b24, a24_out, y24_out);
    pe #(WIDTH, FRAC_BIT) pe25(a25_in, y25_in, b25, a25_out, y25_out);
    // *** Row 3 from bottom ***
    pe #(WIDTH, FRAC_BIT) pe30(a30_in, y30_in, b30, a30_out, y30_out);
    pe #(WIDTH, FRAC_BIT) pe31(a31_in, y31_in, b31, a31_out, y31_out);
    pe #(WIDTH, FRAC_BIT) pe32(a32_in, y32_in, b32, a32_out, y32_out);
    pe #(WIDTH, FRAC_BIT) pe33(a33_in, y33_in, b33, a33_out, y33_out);
    pe #(WIDTH, FRAC_BIT) pe34(a34_in, y34_in, b34, a34_out, y34_out);
    pe #(WIDTH, FRAC_BIT) pe35(a35_in, y35_in, b35, a35_out, y35_out);
    // *** Row 4 from bottom ***
    pe #(WIDTH, FRAC_BIT) pe40(a40_in, y40_in, b40, a40_out, y40_out);
    pe #(WIDTH, FRAC_BIT) pe41(a41_in, y41_in, b41, a41_out, y41_out);
    pe #(WIDTH, FRAC_BIT) pe42(a42_in, y42_in, b42, a42_out, y42_out);
    pe #(WIDTH, FRAC_BIT) pe43(a43_in, y43_in, b43, a43_out, y43_out);
    pe #(WIDTH, FRAC_BIT) pe44(a44_in, y44_in, b44, a44_out, y44_out);
    pe #(WIDTH, FRAC_BIT) pe45(a45_in, y45_in, b45, a45_out, y45_out);
    // *** Row 5 from bottom ***
    pe #(WIDTH, FRAC_BIT) pe50(a50_in, y50_in, b50, a50_out, y50_out);
    pe #(WIDTH, FRAC_BIT) pe51(a51_in, y51_in, b51, a51_out, y51_out);
    pe #(WIDTH, FRAC_BIT) pe52(a52_in, y52_in, b52, a52_out, y52_out);
    pe #(WIDTH, FRAC_BIT) pe53(a53_in, y53_in, b53, a53_out, y53_out);
    pe #(WIDTH, FRAC_BIT) pe54(a54_in, y54_in, b54, a54_out, y54_out);
    pe #(WIDTH, FRAC_BIT) pe55(a55_in, y55_in, b55, a55_out, y55_out);
    
    // *** Internal registers ***
    // *** Row 0 from bottom ***
    register #(WIDTH) reg_a00(clk, rst_n, en, clr, a00_out, a01_in); 
    register #(WIDTH) reg_a01(clk, rst_n, en, clr, a01_out, a02_in);
    register #(WIDTH) reg_a02(clk, rst_n, en, clr, a02_out, a03_in);
    register #(WIDTH) reg_a03(clk, rst_n, en, clr, a03_out, a04_in);
    register #(WIDTH) reg_a04(clk, rst_n, en, clr, a04_out, a05_in);
    // *** Row 1 from bottom ***
    register #(WIDTH) reg_a10(clk, rst_n, en, clr, a10_out, a11_in); 
    register #(WIDTH) reg_a11(clk, rst_n, en, clr, a11_out, a12_in);
    register #(WIDTH) reg_a12(clk, rst_n, en, clr, a12_out, a13_in);
    register #(WIDTH) reg_a13(clk, rst_n, en, clr, a13_out, a14_in);
    register #(WIDTH) reg_a14(clk, rst_n, en, clr, a14_out, a15_in);
    // *** Row 2 from bottom ***
    register #(WIDTH) reg_a20(clk, rst_n, en, clr, a20_out, a21_in); 
    register #(WIDTH) reg_a21(clk, rst_n, en, clr, a21_out, a22_in);
    register #(WIDTH) reg_a22(clk, rst_n, en, clr, a22_out, a23_in);
    register #(WIDTH) reg_a23(clk, rst_n, en, clr, a23_out, a24_in);
    register #(WIDTH) reg_a24(clk, rst_n, en, clr, a24_out, a25_in);
    // *** Row 3 from bottom ***
    register #(WIDTH) reg_a30(clk, rst_n, en, clr, a30_out, a31_in); 
    register #(WIDTH) reg_a31(clk, rst_n, en, clr, a31_out, a32_in);
    register #(WIDTH) reg_a32(clk, rst_n, en, clr, a32_out, a33_in);
    register #(WIDTH) reg_a33(clk, rst_n, en, clr, a33_out, a34_in);
    register #(WIDTH) reg_a34(clk, rst_n, en, clr, a34_out, a35_in);
    // *** Row 4 from bottom ***
    register #(WIDTH) reg_a40(clk, rst_n, en, clr, a40_out, a41_in); 
    register #(WIDTH) reg_a41(clk, rst_n, en, clr, a41_out, a42_in);
    register #(WIDTH) reg_a42(clk, rst_n, en, clr, a42_out, a43_in);
    register #(WIDTH) reg_a43(clk, rst_n, en, clr, a43_out, a44_in);
    register #(WIDTH) reg_a44(clk, rst_n, en, clr, a44_out, a45_in);
    // *** Row 5 from bottom ***
    register #(WIDTH) reg_a50(clk, rst_n, en, clr, a50_out, a51_in); 
    register #(WIDTH) reg_a51(clk, rst_n, en, clr, a51_out, a52_in);
    register #(WIDTH) reg_a52(clk, rst_n, en, clr, a52_out, a53_in);
    register #(WIDTH) reg_a53(clk, rst_n, en, clr, a53_out, a54_in);
    register #(WIDTH) reg_a54(clk, rst_n, en, clr, a54_out, a55_in);
    
    // *** Column 0 from left ***
    register #(WIDTH) reg_y00(clk, rst_n, en, clr, y00_out, y10_in);
    register #(WIDTH) reg_y10(clk, rst_n, en, clr, y10_out, y20_in);
    register #(WIDTH) reg_y20(clk, rst_n, en, clr, y20_out, y30_in);
    register #(WIDTH) reg_y30(clk, rst_n, en, clr, y30_out, y40_in);
    register #(WIDTH) reg_y40(clk, rst_n, en, clr, y40_out, y50_in);
    register #(WIDTH) reg_y50(clk, rst_n, en, clr, y50_out, y0_tmp);
    // *** Column 1 from left ***
    register #(WIDTH) reg_y01(clk, rst_n, en, clr, y01_out, y11_in);
    register #(WIDTH) reg_y11(clk, rst_n, en, clr, y11_out, y21_in);
    register #(WIDTH) reg_y21(clk, rst_n, en, clr, y21_out, y31_in);
    register #(WIDTH) reg_y31(clk, rst_n, en, clr, y31_out, y41_in);
    register #(WIDTH) reg_y41(clk, rst_n, en, clr, y41_out, y51_in);
    register #(WIDTH) reg_y51(clk, rst_n, en, clr, y51_out, y1_tmp);
    // *** Column 2 from left ***
    register #(WIDTH) reg_y02(clk, rst_n, en, clr, y02_out, y12_in);
    register #(WIDTH) reg_y12(clk, rst_n, en, clr, y12_out, y22_in);
    register #(WIDTH) reg_y22(clk, rst_n, en, clr, y22_out, y32_in);
    register #(WIDTH) reg_y32(clk, rst_n, en, clr, y32_out, y42_in);
    register #(WIDTH) reg_y42(clk, rst_n, en, clr, y42_out, y52_in);
    register #(WIDTH) reg_y52(clk, rst_n, en, clr, y52_out, y2_tmp);
    // *** Column 3 from left ***
    register #(WIDTH) reg_y03(clk, rst_n, en, clr, y03_out, y13_in);
    register #(WIDTH) reg_y13(clk, rst_n, en, clr, y13_out, y23_in);
    register #(WIDTH) reg_y23(clk, rst_n, en, clr, y23_out, y33_in);
    register #(WIDTH) reg_y33(clk, rst_n, en, clr, y33_out, y43_in);
    register #(WIDTH) reg_y43(clk, rst_n, en, clr, y43_out, y53_in);
    register #(WIDTH) reg_y53(clk, rst_n, en, clr, y53_out, y3_tmp);
    // *** Column 4 from left ***
    register #(WIDTH) reg_y04(clk, rst_n, en, clr, y04_out, y14_in);
    register #(WIDTH) reg_y14(clk, rst_n, en, clr, y14_out, y24_in);
    register #(WIDTH) reg_y24(clk, rst_n, en, clr, y24_out, y34_in);
    register #(WIDTH) reg_y34(clk, rst_n, en, clr, y34_out, y44_in);
    register #(WIDTH) reg_y44(clk, rst_n, en, clr, y44_out, y54_in);
    register #(WIDTH) reg_y54(clk, rst_n, en, clr, y54_out, y4_tmp);
    // *** Column 5 from left ***
    register #(WIDTH) reg_y05(clk, rst_n, en, clr, y05_out, y15_in);
    register #(WIDTH) reg_y15(clk, rst_n, en, clr, y15_out, y25_in);
    register #(WIDTH) reg_y25(clk, rst_n, en, clr, y25_out, y35_in);
    register #(WIDTH) reg_y35(clk, rst_n, en, clr, y35_out, y45_in);
    register #(WIDTH) reg_y45(clk, rst_n, en, clr, y45_out, y55_in);
    register #(WIDTH) reg_y55(clk, rst_n, en, clr, y55_out, y5_tmp);

    // *** Output registers ***
    register #(WIDTH) reg_y0_0(clk, rst_n, en, clr, y0_tmp,  y0_reg0); 
    register #(WIDTH) reg_y0_1(clk, rst_n, en, clr, y0_reg0, y0_reg1); 
    register #(WIDTH) reg_y0_2(clk, rst_n, en, clr, y0_reg1, y0_reg2); 
    register #(WIDTH) reg_y0_3(clk, rst_n, en, clr, y0_reg2, y0_reg3);
    register #(WIDTH) reg_y0_4(clk, rst_n, en, clr, y0_reg3, y0_reg4);
    register #(WIDTH) reg_y0_5(clk, rst_n, en, clr, y0_reg4, y0_reg5);
    
    register #(WIDTH) reg_y1_0(clk, rst_n, en, clr, y1_tmp,  y1_reg0);
    register #(WIDTH) reg_y1_1(clk, rst_n, en, clr, y1_reg0, y1_reg1);
    register #(WIDTH) reg_y1_2(clk, rst_n, en, clr, y1_reg1, y1_reg2);
    register #(WIDTH) reg_y1_3(clk, rst_n, en, clr, y1_reg2, y1_reg3);
    register #(WIDTH) reg_y1_4(clk, rst_n, en, clr, y1_reg3, y1_reg4);
    
    register #(WIDTH) reg_y2_0(clk, rst_n, en, clr, y2_tmp,  y2_reg0);
    register #(WIDTH) reg_y2_1(clk, rst_n, en, clr, y2_reg0, y2_reg1);
    register #(WIDTH) reg_y2_2(clk, rst_n, en, clr, y2_reg1, y2_reg2);
    register #(WIDTH) reg_y2_3(clk, rst_n, en, clr, y2_reg2, y2_reg3);
    
    register #(WIDTH) reg_y3_0(clk, rst_n, en, clr, y3_tmp,  y3_reg0);
    register #(WIDTH) reg_y3_1(clk, rst_n, en, clr, y3_reg0, y3_reg1);
    register #(WIDTH) reg_y3_2(clk, rst_n, en, clr, y3_reg1, y3_reg2);

    register #(WIDTH) reg_y4_0(clk, rst_n, en, clr, y4_tmp,  y4_reg0);
    register #(WIDTH) reg_y4_1(clk, rst_n, en, clr, y4_reg0, y4_reg1);

    register #(WIDTH) reg_y5_0(clk, rst_n, en, clr, y5_tmp,  y5_reg0);

    // *** Valid registers ***
    register #(1) reg_valid_0(clk, rst_n, en, clr, in_valid,      in_valid_reg0); 
    register #(1) reg_valid_1(clk, rst_n, en, clr, in_valid_reg0, in_valid_reg1);
    register #(1) reg_valid_2(clk, rst_n, en, clr, in_valid_reg1, in_valid_reg2);
    register #(1) reg_valid_3(clk, rst_n, en, clr, in_valid_reg2, in_valid_reg3);
    register #(1) reg_valid_4(clk, rst_n, en, clr, in_valid_reg3, in_valid_reg4);
    register #(1) reg_valid_5(clk, rst_n, en, clr, in_valid_reg4, in_valid_reg5);
    register #(1) reg_valid_6(clk, rst_n, en, clr, in_valid_reg5, in_valid_reg6);
    register #(1) reg_valid_7(clk, rst_n, en, clr, in_valid_reg6, in_valid_reg7);
    register #(1) reg_valid_8(clk, rst_n, en, clr, in_valid_reg7, in_valid_reg8);
    register #(1) reg_valid_9(clk, rst_n, en, clr, in_valid_reg8, in_valid_reg9);
    register #(1) reg_valid_10(clk, rst_n, en, clr, in_valid_reg9, in_valid_reg10);
    register #(1) reg_valid_11(clk, rst_n, en, clr, in_valid_reg10, in_valid_reg11);
    register #(1) reg_valid_12(clk, rst_n, en, clr, in_valid_reg11, in_valid_reg12);

    // *** Outputs ***
    assign y0 = y0_reg5;
    assign y1 = y1_reg4;
    assign y2 = y2_reg3;
    assign y3 = y3_reg2;
    assign y4 = y4_reg1;
    assign y5 = y5_reg0;
    assign out_valid = in_valid_reg12;

endmodule
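
To see what the array computes and when the result appears, the register structure above can be simulated cycle by cycle. The Python sketch below is a simplified behavioral model: it uses plain integers without the fixed-point rescaling of the PE, assumes en is always high and clr is never asserted, and uses hypothetical names throughout.

```python
N = 6  # array size, matching the 6x6 module above

def systolic_stream(vectors, b):
    """Feed one a-vector per clock; return (cycle, y-vector) pairs seen while out_valid is high."""
    in_skew = [[0] * (r + 1) for r in range(N)]    # input registers: row r is delayed r+1 cycles
    a_reg = [[0] * (N - 1) for _ in range(N)]      # registers between neighbouring PEs in a row
    y_reg = [[0] * N for _ in range(N)]            # register after each PE's partial-sum output
    out_skew = [[0] * (N - c) for c in range(N)]   # output registers: column c is delayed N-c cycles
    valid = [0] * (2 * N + 1)                      # 13 valid registers for N = 6
    results = []
    for cycle in range(len(vectors) + 2 * N + 1):
        if cycle < len(vectors):
            a, in_valid = vectors[cycle], 1
        else:
            a, in_valid = [0] * N, 0
        # Outputs visible during this cycle
        if valid[-1]:
            results.append((cycle, [out_skew[c][-1] for c in range(N)]))
        # Combinational part: every PE computes y_out = y_in + a_in * b
        a_in = [[in_skew[r][-1]] + a_reg[r] for r in range(N)]
        y_out = [[(0 if r == 0 else y_reg[r - 1][c]) + a_in[r][c] * b[r][c]
                  for c in range(N)] for r in range(N)]
        # Clock edge: all registers update together from the old state
        out_skew = [[y_reg[N - 1][c]] + out_skew[c][:-1] for c in range(N)]
        y_reg = y_out
        a_reg = [a_in[r][:N - 1] for r in range(N)]
        in_skew = [[a[r]] + in_skew[r][:-1] for r in range(N)]
        valid = [in_valid] + valid[:-1]
    return results

b = [[r * N + c for c in range(N)] for r in range(N)]    # stationary matrix
vectors = [[1, 2, 3, 4, 5, 6], [0, 1, 0, -1, 2, 3]]      # streamed a-vectors
for cycle, y in systolic_stream(vectors, b):
    print(cycle, y)
```

In this model each output element is y_c = a_0*b_0c + a_1*b_1c + ... + a_5*b_5c, and the result for a vector presented in cycle 0 becomes visible in cycle 13. After that initial latency, one result vector is produced per clock.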

You can verify the matrix multiplication performed by the systolic module using the testbench.

Matrix multiplication hidden layer 1:

Compare the testbench result with the model. The fractional digits may differ slightly because of rounding in the fixed-point implementation.

$$Z_2= \begin{bmatrix} 13.9 & -5.9 & -4.2 & 11.9 & 3.9 & -5.1\\ 8.6 & 5.7 & 11 & 6.5 & 12.3 & 12.1\\ 9.7 & 10.3 & 13.2 & 5.7 & 16.5 & 20.8\\ -13.4 & 5.5 & 3.7 & -11.9 & -3.1 & 5.2\\ 8.3 & 11.4 & 12.1 & 7.8 & 14.4 & 10.5\\ \end{bmatrix}$$

Matrix multiplication hidden layer 2:

Compare the testbench result with the model. The fractional digits may differ slightly because of rounding in the fixed-point implementation.

$$Z_3= \begin{bmatrix} 4.29 & -4.37 & -4.23 & 4.29 & 4.04 & -4.34\\ -4.50 & 4.27 & 4.13 & -4.50 & -4.23 & 4.24\\ \end{bmatrix}$$
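
To get a feeling for how large the fixed-point deviation can be, the sketch below recomputes $Z_2$ with 10 fractional bits and compares it with the exact values. It is an estimate under stated assumptions, not a trace of the actual hardware: weights and inputs are rounded to the nearest fixed-point code, every product is rescaled by a right shift as in the PE, and 16-bit overflow is not modeled. The names are hypothetical.

```python
FRAC_BIT = 10
SCALE = 1 << FRAC_BIT

WB2 = [[-1.2, 1.3, 1.7, -1.3, -1.3],
       [0.3, 0.5, 0.2, 1.0, -1.0],
       [0.6, 0.1, 0.8, 1.5, -1.0],
       [1.3, -1.2, -1.4, 1.3, -0.9],
       [1.3, 0.3, 0.5, 0.4, -1.0]]

KP = [[2, 7, 6, 3, 6, 3],
      [10, 2, 8, 10, 9, 2],
      [5, 3, 1, 3, 5, 6],
      [3, 3, 6, 1, 6, 10],
      [1, 1, 1, 1, 1, 1]]

def quantize(x):
    """Real number to fixed-point code (round-to-nearest is an assumption)."""
    return int(round(x * SCALE))

def fixed_dot(row, col):
    """Dot product with the same per-product rescaling as the PE."""
    acc = 0
    for w, k in zip(row, col):
        acc += (quantize(w) * quantize(k)) >> FRAC_BIT
    return acc / SCALE

max_err = 0.0
for row in WB2:
    for j in range(len(KP[0])):
        col = [KP[i][j] for i in range(len(KP))]
        exact = sum(w * k for w, k in zip(row, col))
        max_err = max(max_err, abs(fixed_dot(row, col) - exact))

print(f"largest deviation in Z2: {max_err:.4f}")
```

Under these assumptions the deviation stays in the second or third decimal place, which is consistent with the small differences mentioned above.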

3.3. Sigmoid LUT

To calculate the sigmoid function, we can use the lookup table (LUT) method. The following figure illustrates a basic LUT implementation of sigmoid.

The Verilog implementation of the sigmoid module is part of the tutorial's source code.
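
Independently of the Verilog module, the lookup-table idea can be sketched in Python. The table size (256 entries), the covered input range ([-8, 8)), and the saturation outside that range are assumptions made for this illustration and are not taken from the tutorial's sigmoid module. Only the fixed-point format with 10 fractional bits follows the PE.

```python
import math

FRAC_BIT = 10            # fixed-point format of the PE, assumed here for the LUT as well
ONE = 1 << FRAC_BIT      # fixed-point code for 1.0
LUT_BITS = 8             # hypothetical table size: 2**8 = 256 entries
IN_MIN = -8 * ONE        # hypothetical input range covered by the table: [-8, 8)
IN_MAX = 8 * ONE
SHIFT = 6                # (IN_MAX - IN_MIN) / 256 = 64 = 2**6 input codes per entry

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Precompute the table once; entry i holds the sigmoid at the lower edge of its input bin.
LUT = [round(sigmoid((IN_MIN + (i << SHIFT)) / ONE) * ONE)
       for i in range(1 << LUT_BITS)]

def sigmoid_lut(x_fixed):
    """Look up the sigmoid of a fixed-point input; saturate outside the table range."""
    if x_fixed < IN_MIN:
        return 0
    if x_fixed >= IN_MAX:
        return ONE
    return LUT[(x_fixed - IN_MIN) >> SHIFT]

for x in (-5.9, 0.0, 3.9, 13.9):
    approx = sigmoid_lut(int(round(x * ONE))) / ONE
    print(f"x = {x:5.1f}  lut = {approx:.3f}  exact = {sigmoid(x):.3f}")
```

The index is obtained with a subtraction and a shift, which maps directly to wiring in hardware. A larger table or a narrower range reduces the approximation error at the cost of more memory.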

3.4. ANN Core

With the systolic and sigmoid modules in place, the next step is to build the ANN core. The core also needs block memories (BRAM) to store the weights, the inputs, and the outputs.
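
The memory geometry used by the core follows from a few numbers: the weight memory holds 1024 bits organized as 128-bit words, and every word carries 16-bit values. The Python sketch below derives the depth, address width, and byte-enable width, and shows one possible way to pack 16-bit values into a 128-bit word. The lane order (value 0 in bits [15:0]) is an assumption for illustration, and the function names are hypothetical.

```python
MEMORY_SIZE = 1024   # bits in the weight memory
WORD_BITS = 128      # width of one memory word
LANE_BITS = 16       # width of one fixed-point value
LANES_USED = 6       # 16-bit values read from each word by the core

DEPTH = MEMORY_SIZE // WORD_BITS          # number of words
ADDR_WIDTH = (DEPTH - 1).bit_length()     # address bits needed for DEPTH words
BYTE_ENABLES = WORD_BITS // 8             # one write-enable bit per byte

def pack_word(values):
    """Pack signed 16-bit values into one word, value 0 in bits [15:0] (assumed order)."""
    word = 0
    for lane, v in enumerate(values):
        word |= (v & 0xFFFF) << (lane * LANE_BITS)
    return word

def unpack_word(word, lanes=LANES_USED):
    """Extract signed 16-bit values from a word, using the same assumed lane order."""
    out = []
    for lane in range(lanes):
        raw = (word >> (lane * LANE_BITS)) & 0xFFFF
        out.append(raw - 0x10000 if raw & 0x8000 else raw)
    return out

row = [-1229, 1331, 1741, -1331, -1331, 0]   # example fixed-point codes
word = pack_word(row)
print(DEPTH, ADDR_WIDTH, BYTE_ENABLES, hex(word))
```

With these numbers the weight memory has 8 words, a 3-bit address, and 16 byte-enable bits, which matches the widths of the wb_addra and wb_wea ports.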

The following code is an implementation of the ANN core in Verilog.

module ann
    (
        input wire          clk,
        input wire          rst_n,
        input wire          en,
        input wire          clr,
        // *** Control and status port ***
        output wire         ready,
        input wire          start,
        output wire         done,
        // *** Weight port ***
        input wire          wb_ena,
        input wire [2:0]    wb_addra,
        input wire [127:0]  wb_dina,
        input wire [15:0]   wb_wea,
        // *** Data input port ***
        input wire          k_ena,
        input wire [1:0]    k_addra,
        input wire [127:0]  k_dina,
        input wire [15:0]   k_wea,
        // *** Data output port ***
        input wire          a_enb,
        input wire [1:0]    a_addrb,
        output wire [127:0] a_doutb
    );

    // Weight BRAM
    wire wb_enb;
    wire [2:0] wb_addrb;
    wire [127:0] wb_doutb;

    wire [15:0] wb_doutb_0;
    wire [15:0] wb_doutb_1;
    wire [15:0] wb_doutb_2;
    wire [15:0] wb_doutb_3;
    wire [15:0] wb_doutb_4;
    wire [15:0] wb_doutb_5;
        
    // Input BRAM
    wire k_enb;
    wire [1:0] k_addrb;
    wire [127:0] k_doutb;
    
    wire [15:0] k_doutb_0;
    wire [15:0] k_doutb_1;
    wire [15:0] k_doutb_2;
    wire [15:0] k_doutb_3;
    wire [15:0] k_doutb_4;
    wire [15:0] k_doutb_5;

    // Counter for main controller 
    reg [5:0] cnt_main_reg;

    // Multiplexer and register for systolic moving input
    wire [0:0] a0_sel, a1_sel, a2_sel, a3_sel, a4_sel, a5_sel;
    wire [15:0] a0, a1, a2, a3, a4, a5;

    // Multiplexer and register for systolic stationary input
    wire [1:0] b00_sel, b01_sel, b02_sel, b03_sel, b04_sel, b05_sel;
    wire [1:0] b10_sel, b11_sel, b12_sel, b13_sel, b14_sel, b15_sel;
    wire [1:0] b20_sel, b21_sel, b22_sel, b23_sel, b24_sel, b25_sel;
    wire [1:0] b30_sel, b31_sel, b32_sel, b33_sel, b34_sel, b35_sel;
    wire [1:0] b40_sel, b41_sel, b42_sel, b43_sel, b44_sel, b45_sel;
    wire [1:0] b50_sel, b51_sel, b52_sel, b53_sel, b54_sel, b55_sel;
    
    wire [15:0] b00_next, b01_next, b02_next, b03_next, b04_next, b05_next;
    wire [15:0] b10_next, b11_next, b12_next, b13_next, b14_next, b15_next;
    wire [15:0] b20_next, b21_next, b22_next, b23_next, b24_next, b25_next;
    wire [15:0] b30_next, b31_next, b32_next, b33_next, b34_next, b35_next;
    wire [15:0] b40_next, b41_next, b42_next, b43_next, b44_next, b45_next;
    wire [15:0] b50_next, b51_next, b52_next, b53_next, b54_next, b55_next;
    
    wire [15:0] b00_reg, b01_reg, b02_reg, b03_reg, b04_reg, b05_reg;
    wire [15:0] b10_reg, b11_reg, b12_reg, b13_reg, b14_reg, b15_reg;
    wire [15:0] b20_reg, b21_reg, b22_reg, b23_reg, b24_reg, b25_reg;
    wire [15:0] b30_reg, b31_reg, b32_reg, b33_reg, b34_reg, b35_reg;
    wire [15:0] b40_reg, b41_reg, b42_reg, b43_reg, b44_reg, b45_reg;
    wire [15:0] b50_reg, b51_reg, b52_reg, b53_reg, b54_reg, b55_reg;
    
    // Systolic
    wire sys_in_valid;
    wire [15:0] y0, y1, y2, y3, y4, y5;
    wire sys_out_valid;

    // Sigmoid
    wire [15:0] s0, s1, s2, s3, s4, s5;
    wire sig_out_valid;
    
    wire [15:0] s0_reg0, s0_reg1, s0_reg2, s0_reg3;
    wire [15:0] s1_reg0, s1_reg1, s1_reg2, s1_reg3;
    wire [15:0] s2_reg0, s2_reg1, s2_reg2, s2_reg3;
    wire [15:0] s3_reg0, s3_reg1, s3_reg2, s3_reg3;
    wire [15:0] s4_reg0, s4_reg1, s4_reg2, s4_reg3;
    wire [15:0] s5_reg0, s5_reg1, s5_reg2, s5_reg3;

    wire sig_out_valid_reg0, sig_out_valid_reg1, sig_out_valid_reg2, sig_out_valid_reg3;

    // Output BRAM
    wire a_ena;
    wire [15:0] a_wea;
    wire [1:0] a_addra;
    wire [127:0] a_dina;
            
    // *** Weight BRAM **********************************************************
    // xpm_memory_tdpram: True Dual Port RAM
    // Xilinx Parameterized Macro, version 2018.3
    xpm_memory_tdpram
    #(
        // Common module parameters
        .MEMORY_SIZE(1024),                  // DECIMAL, size: 8x128bit= 1024 bits
        .MEMORY_PRIMITIVE("auto"),           // String
        .CLOCKING_MODE("common_clock"),      // String, "common_clock"
        .MEMORY_INIT_FILE("none"),           // String
        .MEMORY_INIT_PARAM("0"),             // String      
        .USE_MEM_INIT(1),                    // DECIMAL
        .WAKEUP_TIME("disable_sleep"),       // String
        .MESSAGE_CONTROL(0),                 // DECIMAL
        .AUTO_SLEEP_TIME(0),                 // DECIMAL          
        .ECC_MODE("no_ecc"),                 // String
        .MEMORY_OPTIMIZATION("true"),        // String              
        .USE_EMBEDDED_CONSTRAINT(0),         // DECIMAL
        
        // Port A module parameters
        .WRITE_DATA_WIDTH_A(128),            // DECIMAL, data width: 128-bit
        .READ_DATA_WIDTH_A(128),             // DECIMAL, data width: 128-bit
        .BYTE_WRITE_WIDTH_A(8),              // DECIMAL
        .ADDR_WIDTH_A(3),                    // DECIMAL, clog2(1024/128)=clog2(8)= 3
        .READ_RESET_VALUE_A("0"),            // String
        .READ_LATENCY_A(1),                  // DECIMAL
        .WRITE_MODE_A("write_first"),        // String
        .RST_MODE_A("SYNC"),                 // String
        
        // Port B module parameters  
        .WRITE_DATA_WIDTH_B(128),            // DECIMAL, data width: 128-bit
        .READ_DATA_WIDTH_B(128),             // DECIMAL, data width: 128-bit
        .BYTE_WRITE_WIDTH_B(8),              // DECIMAL
        .ADDR_WIDTH_B(3),                    // DECIMAL, clog2(1024/128)=clog2(8)= 3
        .READ_RESET_VALUE_B("0"),            // String
        .READ_LATENCY_B(1),                  // DECIMAL
        .WRITE_MODE_B("write_first"),        // String
        .RST_MODE_B("SYNC")                  // String
    )
    xpm_memory_tdpram_wb
    (
        .sleep(1'b0),
        .regcea(1'b1), //do not change
        .injectsbiterra(1'b0), //do not change
        .injectdbiterra(1'b0), //do not change   
        .sbiterra(), //do not change
        .dbiterra(), //do not change
        .regceb(1'b1), //do not change
        .injectsbiterrb(1'b0), //do not change
        .injectdbiterrb(1'b0), //do not change              
        .sbiterrb(), //do not change
        .dbiterrb(), //do not change
        
        // Port A module ports
        .clka(clk),
        .rsta(~rst_n),
        .ena(wb_ena),
        .wea(wb_wea),
        .addra(wb_addra),
        .dina(wb_dina),
        .douta(),
        
        // Port B module ports
        .clkb(clk),
        .rstb(~rst_n),
        .enb(wb_enb),
        .web(16'd0),                         // port B is read-only: no byte write enables
        .addrb(wb_addrb),
        .dinb(128'd0),
        .doutb(wb_doutb)
    );
    assign wb_doutb_0 = wb_doutb[15:0];
    assign wb_doutb_1 = wb_doutb[31:16];
    assign wb_doutb_2 = wb_doutb[47:32];
    assign wb_doutb_3 = wb_doutb[63:48];
    assign wb_doutb_4 = wb_doutb[79:64];
    assign wb_doutb_5 = wb_doutb[95:80];

    // *** Input BRAM ***********************************************************  
    // xpm_memory_tdpram: True Dual Port RAM
    // Xilinx Parameterized Macro, version 2018.3
    xpm_memory_tdpram
    #(
        // Common module parameters
        .MEMORY_SIZE(512),                   // DECIMAL, size: 4x128bit= 512 bits
        .MEMORY_PRIMITIVE("auto"),           // String
        .CLOCKING_MODE("common_clock"),      // String, "common_clock"
        .MEMORY_INIT_FILE("none"),           // String
        .MEMORY_INIT_PARAM("0"),             // String      
        .USE_MEM_INIT(1),                    // DECIMAL
        .WAKEUP_TIME("disable_sleep"),       // String
        .MESSAGE_CONTROL(0),                 // DECIMAL
        .AUTO_SLEEP_TIME(0),                 // DECIMAL          
        .ECC_MODE("no_ecc"),                 // String
        .MEMORY_OPTIMIZATION("true"),        // String              
        .USE_EMBEDDED_CONSTRAINT(0),         // DECIMAL
        
        // Port A module parameters
        .WRITE_DATA_WIDTH_A(128),            // DECIMAL, data width: 128-bit
        .READ_DATA_WIDTH_A(128),             // DECIMAL, data width: 128-bit
        .BYTE_WRITE_WIDTH_A(8),              // DECIMAL
        .ADDR_WIDTH_A(2),                    // DECIMAL, clog2(512/128)=clog2(4)= 2
        .READ_RESET_VALUE_A("0"),            // String
        .READ_LATENCY_A(1),                  // DECIMAL
        .WRITE_MODE_A("write_first"),        // String
        .RST_MODE_A("SYNC"),                 // String
        
        // Port B module parameters  
        .WRITE_DATA_WIDTH_B(128),            // DECIMAL, data width: 128-bit
        .READ_DATA_WIDTH_B(128),             // DECIMAL, data width: 128-bit
        .BYTE_WRITE_WIDTH_B(8),              // DECIMAL
        .ADDR_WIDTH_B(2),                    // DECIMAL, clog2(512/128)=clog2(4)= 2
        .READ_RESET_VALUE_B("0"),            // String
        .READ_LATENCY_B(1),                  // DECIMAL
        .WRITE_MODE_B("write_first"),        // String
        .RST_MODE_B("SYNC")                  // String
    )
    xpm_memory_tdpram_k
    (
        .sleep(1'b0),
        .regcea(1'b1), //do not change
        .injectsbiterra(1'b0), //do not change
        .injectdbiterra(1'b0), //do not change   
        .sbiterra(), //do not change
        .dbiterra(), //do not change
        .regceb(1'b1), //do not change
        .injectsbiterrb(1'b0), //do not change
        .injectdbiterrb(1'b0), //do not change              
        .sbiterrb(), //do not change
        .dbiterrb(), //do not change
        
        // Port A module ports
        .clka(clk),
        .rsta(~rst_n),
        .ena(k_ena),
        .wea(k_wea),
        .addra(k_addra),
        .dina(k_dina),
        .douta(),
        
        // Port B module ports
        .clkb(clk),
        .rstb(~rst_n),
        .enb(k_enb),
        .web(16'd0),                         // port B is read-only: no byte write enables
        .addrb(k_addrb),
        .dinb(128'd0),
        .doutb(k_doutb)
    );
    assign k_doutb_0 = k_doutb[15:0];
    assign k_doutb_1 = k_doutb[31:16];
    assign k_doutb_2 = k_doutb[47:32];
    assign k_doutb_3 = k_doutb[63:48];
    assign k_doutb_4 = k_doutb[79:64];
    assign k_doutb_5 = k_doutb[95:80];

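    // *** Counter for main controller ******************************************
    // cnt_main_reg idles at 0, starts counting when start is asserted, runs up
    // to 50, and then returns to 0. All control signals below are decoded from
    // this counter value.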
    always @(posedge clk)
    begin
        if (!rst_n || clr)
        begin
            cnt_main_reg <= 0;
        end
        else if (start)
        begin
            cnt_main_reg <= cnt_main_reg + 1;
        end
        else if (cnt_main_reg >= 1 && cnt_main_reg <= 49)
        begin
            cnt_main_reg <= cnt_main_reg + 1;
        end
        else if (cnt_main_reg >= 50)
        begin
            cnt_main_reg <= 0;
        end
    end

    // Weight BRAM control
    assign wb_enb = ((cnt_main_reg >= 3) && (cnt_main_reg <= 7)) ? 1 :
                    ((cnt_main_reg >= 25) && (cnt_main_reg <= 26)) ? 1 : 0;
    assign wb_addrb = (cnt_main_reg == 3) ? 0 :
                      (cnt_main_reg == 4) ? 1 :
                      (cnt_main_reg == 5) ? 2 :
                      (cnt_main_reg == 6) ? 3 :
                      (cnt_main_reg == 7) ? 4 :
                      (cnt_main_reg == 25) ? 5 :
                      (cnt_main_reg == 26) ? 6 : 0;

    // Systolic moving input multiplexer control 
    assign a0_sel = ((cnt_main_reg >= 4) && (cnt_main_reg <= 8)) ? 0 :
                    ((cnt_main_reg >= 26) && (cnt_main_reg <= 27)) ? 0 : 1;
    assign a1_sel = ((cnt_main_reg >= 4) && (cnt_main_reg <= 8)) ? 0 :
                    ((cnt_main_reg >= 26) && (cnt_main_reg <= 27)) ? 0 : 1;
    assign a2_sel = ((cnt_main_reg >= 4) && (cnt_main_reg <= 8)) ? 0 :
                    ((cnt_main_reg >= 26) && (cnt_main_reg <= 27)) ? 0 : 1;
    assign a3_sel = ((cnt_main_reg >= 4) && (cnt_main_reg <= 8)) ? 0 :
                    ((cnt_main_reg >= 26) && (cnt_main_reg <= 27)) ? 0 : 1;
    assign a4_sel = ((cnt_main_reg >= 4) && (cnt_main_reg <= 8)) ? 0 :
                    ((cnt_main_reg >= 26) && (cnt_main_reg <= 27)) ? 0 : 1;
    assign a5_sel = ((cnt_main_reg >= 4) && (cnt_main_reg <= 8)) ? 0 :
                    ((cnt_main_reg >= 26) && (cnt_main_reg <= 27)) ? 0 : 1;
                    
    // Input BRAM control
    assign k_enb = ((cnt_main_reg >= 1) && (cnt_main_reg <= 4)) ? 1 : 0;
    assign k_addrb = (cnt_main_reg == 1) ? 0 :
                     (cnt_main_reg == 2) ? 1 :
                     (cnt_main_reg == 3) ? 2 :
                     (cnt_main_reg == 4) ? 3 : 0;
                                                  
    // Systolic stationary input multiplexer control                 
    assign b00_sel = (cnt_main_reg == 2) ? 0 :
                     (cnt_main_reg == 22) ? 1 : 3;
    assign b01_sel = (cnt_main_reg == 2) ? 0 :
                     (cnt_main_reg == 22) ? 1 : 3;
    assign b02_sel = (cnt_main_reg == 2) ? 0 :
                     (cnt_main_reg == 22) ? 1 : 3;
    assign b03_sel = (cnt_main_reg == 2) ? 0 :
                     (cnt_main_reg == 22) ? 1 : 3;
    assign b04_sel = (cnt_main_reg == 2) ? 0 :
                     (cnt_main_reg == 22) ? 1 : 3;
    assign b05_sel = (cnt_main_reg == 2) ? 0 :
                     (cnt_main_reg == 22) ? 1 : 3;
    
    assign b10_sel = (cnt_main_reg == 3) ? 0 :
                     (cnt_main_reg == 23) ? 1 : 3;
    assign b11_sel = (cnt_main_reg == 3) ? 0 :
                     (cnt_main_reg == 23) ? 1 : 3;
    assign b12_sel = (cnt_main_reg == 3) ? 0 :
                     (cnt_main_reg == 23) ? 1 : 3;
    assign b13_sel = (cnt_main_reg == 3) ? 0 :
                     (cnt_main_reg == 23) ? 1 : 3;
    assign b14_sel = (cnt_main_reg == 3) ? 0 :
                     (cnt_main_reg == 23) ? 1 : 3;
    assign b15_sel = (cnt_main_reg == 3) ? 0 :
                     (cnt_main_reg == 23) ? 1 : 3;

    assign b20_sel = (cnt_main_reg == 4) ? 0 :
                     (cnt_main_reg == 24) ? 1 : 3;
    assign b21_sel = (cnt_main_reg == 4) ? 0 :
                     (cnt_main_reg == 24) ? 1 : 3;
    assign b22_sel = (cnt_main_reg == 4) ? 0 :
                     (cnt_main_reg == 24) ? 1 : 3;
    assign b23_sel = (cnt_main_reg == 4) ? 0 :
                     (cnt_main_reg == 24) ? 1 : 3;
    assign b24_sel = (cnt_main_reg == 4) ? 0 :
                     (cnt_main_reg == 24) ? 1 : 3;
    assign b25_sel = (cnt_main_reg == 4) ? 0 :
                     (cnt_main_reg == 24) ? 1 : 3;
    
    assign b30_sel = (cnt_main_reg == 5) ? 0 :
                     (cnt_main_reg == 25) ? 1 : 3;
    assign b31_sel = (cnt_main_reg == 5) ? 0 :
                     (cnt_main_reg == 25) ? 1 : 3;
    assign b32_sel = (cnt_main_reg == 5) ? 0 :
                     (cnt_main_reg == 25) ? 1 : 3;
    assign b33_sel = (cnt_main_reg == 5) ? 0 :
                     (cnt_main_reg == 25) ? 1 : 3;
    assign b34_sel = (cnt_main_reg == 5) ? 0 :
                     (cnt_main_reg == 25) ? 1 : 3;
    assign b35_sel = (cnt_main_reg == 5) ? 0 :
                     (cnt_main_reg == 25) ? 1 : 3;
    
    assign b40_sel = (cnt_main_reg == 2) ? 2 :
                     (cnt_main_reg == 26) ? 1 : 3;
    assign b41_sel = (cnt_main_reg == 2) ? 2 :
                     (cnt_main_reg == 26) ? 1 : 3;
    assign b42_sel = (cnt_main_reg == 2) ? 2 :
                     (cnt_main_reg == 26) ? 1 : 3;
    assign b43_sel = (cnt_main_reg == 2) ? 2 :
                     (cnt_main_reg == 26) ? 1 : 3;
    assign b44_sel = (cnt_main_reg == 2) ? 2 :
                     (cnt_main_reg == 26) ? 1 : 3;
    assign b45_sel = (cnt_main_reg == 2) ? 2 :
                     (cnt_main_reg == 26) ? 1 : 3;

    assign b50_sel = (cnt_main_reg == 22) ? 2 : 3;
    assign b51_sel = (cnt_main_reg == 22) ? 2 : 3;
    assign b52_sel = (cnt_main_reg == 22) ? 2 : 3;
    assign b53_sel = (cnt_main_reg == 22) ? 2 : 3;
    assign b54_sel = (cnt_main_reg == 22) ? 2 : 3;
    assign b55_sel = (cnt_main_reg == 22) ? 2 : 3;

    // Systolic control
    assign sys_in_valid = ((cnt_main_reg >= 4) && (cnt_main_reg <= 9)) ? 1 :
                          ((cnt_main_reg >= 26) && (cnt_main_reg <= 31)) ? 1 : 0;
    // Output BRAM control
    assign a_ena = ((cnt_main_reg >= 40) && (cnt_main_reg <= 41)) ? 1 : 0;
    assign a_wea = ((cnt_main_reg >= 40) && (cnt_main_reg <= 41)) ? 16'hffff : 0;
    assign a_addra = (cnt_main_reg == 40) ? 0 :
                     (cnt_main_reg == 41) ? 1 : 0; 

    // Status control
    assign ready = (cnt_main_reg == 0) ? 1 : 0;
    assign done = (cnt_main_reg == 50) ? 1 : 0;

    // *** Multiplexer for systolic moving input *******************
    assign a0 = (a0_sel == 0) ? wb_doutb_0 : 0;
    assign a1 = (a1_sel == 0) ? wb_doutb_1 : 0;
    assign a2 = (a2_sel == 0) ? wb_doutb_2 : 0;
    assign a3 = (a3_sel == 0) ? wb_doutb_3 : 0;
    assign a4 = (a4_sel == 0) ? wb_doutb_4 : 0;
    assign a5 = (a5_sel == 0) ? wb_doutb_5 : 0;

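    // *** Multiplexer and register for systolic stationary input ***************
    // bXY_sel: 0 = load from the input BRAM, 1 = load the delayed sigmoid
    // output (sN_reg3), 2 = load the constant 1.0 (1024 with 10 fractional
    // bits), otherwise = hold the current value.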
    assign b00_next = (b00_sel == 0) ? k_doutb_0 :
                      (b00_sel == 1) ? s0_reg3 :
                      (b00_sel == 2) ? 16'b0000010000000000 : b00_reg;
    assign b01_next = (b01_sel == 0) ? k_doutb_1 :
                      (b01_sel == 1) ? s1_reg3 :
                      (b01_sel == 2) ? 16'b0000010000000000 : b01_reg;
    assign b02_next = (b02_sel == 0) ? k_doutb_2 :
                      (b02_sel == 1) ? s2_reg3 :
                      (b02_sel == 2) ? 16'b0000010000000000 : b02_reg;
    assign b03_next = (b03_sel == 0) ? k_doutb_3 :
                      (b03_sel == 1) ? s3_reg3 :
                      (b03_sel == 2) ? 16'b0000010000000000 : b03_reg;
    assign b04_next = (b04_sel == 0) ? k_doutb_4 :
                      (b04_sel == 1) ? s4_reg3 :
                      (b04_sel == 2) ? 16'b0000010000000000 : b04_reg;
    assign b05_next = (b05_sel == 0) ? k_doutb_5 :
                      (b05_sel == 1) ? s5_reg3 :
                      (b05_sel == 2) ? 16'b0000010000000000 : b05_reg;

    register #(16) reg_b00(clk, rst_n, en, clr, b00_next, b00_reg); 
    register #(16) reg_b01(clk, rst_n, en, clr, b01_next, b01_reg); 
    register #(16) reg_b02(clk, rst_n, en, clr, b02_next, b02_reg); 
    register #(16) reg_b03(clk, rst_n, en, clr, b03_next, b03_reg);
    register #(16) reg_b04(clk, rst_n, en, clr, b04_next, b04_reg);
    register #(16) reg_b05(clk, rst_n, en, clr, b05_next, b05_reg);

    assign b10_next = (b10_sel == 0) ? k_doutb_0 :
                      (b10_sel == 1) ? s0_reg3 :
                      (b10_sel == 2) ? 16'b0000010000000000 : b10_reg;
    assign b11_next = (b11_sel == 0) ? k_doutb_1 :
                      (b11_sel == 1) ? s1_reg3 :
                      (b11_sel == 2) ? 16'b0000010000000000 : b11_reg;
    assign b12_next = (b12_sel == 0) ? k_doutb_2 :
                      (b12_sel == 1) ? s2_reg3 :
                      (b12_sel == 2) ? 16'b0000010000000000 : b12_reg;
    assign b13_next = (b13_sel == 0) ? k_doutb_3 :
                      (b13_sel == 1) ? s3_reg3 :
                      (b13_sel == 2) ? 16'b0000010000000000 : b13_reg;
    assign b14_next = (b14_sel == 0) ? k_doutb_4 :
                      (b14_sel == 1) ? s4_reg3 :
                      (b14_sel == 2) ? 16'b0000010000000000 : b14_reg;
    assign b15_next = (b15_sel == 0) ? k_doutb_5 :
                      (b15_sel == 1) ? s5_reg3 :
                      (b15_sel == 2) ? 16'b0000010000000000 : b15_reg;
                      
    register #(16) reg_b10(clk, rst_n, en, clr, b10_next, b10_reg); 
    register #(16) reg_b11(clk, rst_n, en, clr, b11_next, b11_reg); 
    register #(16) reg_b12(clk, rst_n, en, clr, b12_next, b12_reg); 
    register #(16) reg_b13(clk, rst_n, en, clr, b13_next, b13_reg); 
    register #(16) reg_b14(clk, rst_n, en, clr, b14_next, b14_reg); 
    register #(16) reg_b15(clk, rst_n, en, clr, b15_next, b15_reg); 

    assign b20_next = (b20_sel == 0) ? k_doutb_0 :
                      (b20_sel == 1) ? s0_reg3 :
                      (b20_sel == 2) ? 16'b0000010000000000 : b20_reg;
    assign b21_next = (b21_sel == 0) ? k_doutb_1 :
                      (b21_sel == 1) ? s1_reg3 :
                      (b21_sel == 2) ? 16'b0000010000000000 : b21_reg;
    assign b22_next = (b22_sel == 0) ? k_doutb_2 :
                      (b22_sel == 1) ? s2_reg3 :
                      (b22_sel == 2) ? 16'b0000010000000000 : b22_reg;
    assign b23_next = (b23_sel == 0) ? k_doutb_3 :
                      (b23_sel == 1) ? s3_reg3 :
                      (b23_sel == 2) ? 16'b0000010000000000 : b23_reg;
    assign b24_next = (b24_sel == 0) ? k_doutb_4 :
                      (b24_sel == 1) ? s4_reg3 :
                      (b24_sel == 2) ? 16'b0000010000000000 : b24_reg;
    assign b25_next = (b25_sel == 0) ? k_doutb_5 :
                      (b25_sel == 1) ? s5_reg3 :
                      (b25_sel == 2) ? 16'b0000010000000000 : b25_reg;
                      
    register #(16) reg_b20(clk, rst_n, en, clr, b20_next, b20_reg); 
    register #(16) reg_b21(clk, rst_n, en, clr, b21_next, b21_reg); 
    register #(16) reg_b22(clk, rst_n, en, clr, b22_next, b22_reg); 
    register #(16) reg_b23(clk, rst_n, en, clr, b23_next, b23_reg); 
    register #(16) reg_b24(clk, rst_n, en, clr, b24_next, b24_reg); 
    register #(16) reg_b25(clk, rst_n, en, clr, b25_next, b25_reg); 

    assign b30_next = (b30_sel == 0) ? k_doutb_0 :
                      (b30_sel == 1) ? s0_reg3 :
                      (b30_sel == 2) ? 16'b0000010000000000 : b30_reg;
    assign b31_next = (b31_sel == 0) ? k_doutb_1 :
                      (b31_sel == 1) ? s1_reg3 :
                      (b31_sel == 2) ? 16'b0000010000000000 : b31_reg;
    assign b32_next = (b32_sel == 0) ? k_doutb_2 :
                      (b32_sel == 1) ? s2_reg3 :
                      (b32_sel == 2) ? 16'b0000010000000000 : b32_reg;
    assign b33_next = (b33_sel == 0) ? k_doutb_3 :
                      (b33_sel == 1) ? s3_reg3 :
                      (b33_sel == 2) ? 16'b0000010000000000 : b33_reg;
    assign b34_next = (b34_sel == 0) ? k_doutb_4 :
                      (b34_sel == 1) ? s4_reg3 :
                      (b34_sel == 2) ? 16'b0000010000000000 : b34_reg;
    assign b35_next = (b35_sel == 0) ? k_doutb_5 :
                      (b35_sel == 1) ? s5_reg3 :
                      (b35_sel == 2) ? 16'b0000010000000000 : b35_reg;
                      
    register #(16) reg_b30(clk, rst_n, en, clr, b30_next, b30_reg); 
    register #(16) reg_b31(clk, rst_n, en, clr, b31_next, b31_reg); 
    register #(16) reg_b32(clk, rst_n, en, clr, b32_next, b32_reg); 
    register #(16) reg_b33(clk, rst_n, en, clr, b33_next, b33_reg); 
    register #(16) reg_b34(clk, rst_n, en, clr, b34_next, b34_reg); 
    register #(16) reg_b35(clk, rst_n, en, clr, b35_next, b35_reg); 

    assign b40_next = (b40_sel == 0) ? k_doutb_0 :
                      (b40_sel == 1) ? s0_reg3 :
                      (b40_sel == 2) ? 16'b0000010000000000 : b40_reg;
    assign b41_next = (b41_sel == 0) ? k_doutb_1 :
                      (b41_sel == 1) ? s1_reg3 :
                      (b41_sel == 2) ? 16'b0000010000000000 : b41_reg;
    assign b42_next = (b42_sel == 0) ? k_doutb_2 :
                      (b42_sel == 1) ? s2_reg3 :
                      (b42_sel == 2) ? 16'b0000010000000000 : b42_reg;
    assign b43_next = (b43_sel == 0) ? k_doutb_3 :
                      (b43_sel == 1) ? s3_reg3 :
                      (b43_sel == 2) ? 16'b0000010000000000 : b43_reg;
    assign b44_next = (b44_sel == 0) ? k_doutb_4 :
                      (b44_sel == 1) ? s4_reg3 :
                      (b44_sel == 2) ? 16'b0000010000000000 : b44_reg;
    assign b45_next = (b45_sel == 0) ? k_doutb_5 :
                      (b45_sel == 1) ? s5_reg3 :
                      (b45_sel == 2) ? 16'b0000010000000000 : b45_reg;
                      
    register #(16) reg_b40(clk, rst_n, en, clr, b40_next, b40_reg); 
    register #(16) reg_b41(clk, rst_n, en, clr, b41_next, b41_reg); 
    register #(16) reg_b42(clk, rst_n, en, clr, b42_next, b42_reg); 
    register #(16) reg_b43(clk, rst_n, en, clr, b43_next, b43_reg); 
    register #(16) reg_b44(clk, rst_n, en, clr, b44_next, b44_reg); 
    register #(16) reg_b45(clk, rst_n, en, clr, b45_next, b45_reg); 

    assign b50_next = (b50_sel == 0) ? k_doutb_0 :
                      (b50_sel == 1) ? s0_reg3 :
                      (b50_sel == 2) ? 16'b0000010000000000 : b50_reg;
    assign b51_next = (b51_sel == 0) ? k_doutb_1 :
                      (b51_sel == 1) ? s1_reg3 :
                      (b51_sel == 2) ? 16'b0000010000000000 : b51_reg;
    assign b52_next = (b52_sel == 0) ? k_doutb_2 :
                      (b52_sel == 1) ? s2_reg3 :
                      (b52_sel == 2) ? 16'b0000010000000000 : b52_reg;
    assign b53_next = (b53_sel == 0) ? k_doutb_3 :
                      (b53_sel == 1) ? s3_reg3 :
                      (b53_sel == 2) ? 16'b0000010000000000 : b53_reg;
    assign b54_next = (b54_sel == 0) ? k_doutb_4 :
                      (b54_sel == 1) ? s4_reg3 :
                      (b54_sel == 2) ? 16'b0000010000000000 : b54_reg;
    assign b55_next = (b55_sel == 0) ? k_doutb_5 :
                      (b55_sel == 1) ? s5_reg3 :
                      (b55_sel == 2) ? 16'b0000010000000000 : b55_reg;
                      
    register #(16) reg_b50(clk, rst_n, en, clr, b50_next, b50_reg); 
    register #(16) reg_b51(clk, rst_n, en, clr, b51_next, b51_reg); 
    register #(16) reg_b52(clk, rst_n, en, clr, b52_next, b52_reg); 
    register #(16) reg_b53(clk, rst_n, en, clr, b53_next, b53_reg); 
    register #(16) reg_b54(clk, rst_n, en, clr, b54_next, b54_reg); 
    register #(16) reg_b55(clk, rst_n, en, clr, b55_next, b55_reg); 

    // *** Systolic *************************************************************
    systolic
    #(
        .WIDTH(16),
        .FRAC_BIT(10)
    )
    systolic_0
    (
        .clk(clk),
        .rst_n(rst_n),
        .en(en),
        .clr(clr),
        .a0(a0), .a1(a1), .a2(a2), .a3(a3), .a4(a4), .a5(a5),
        .in_valid(sys_in_valid),
        .b00(b00_reg), .b01(b01_reg), .b02(b02_reg), .b03(b03_reg), .b04(b04_reg), .b05(b05_reg),
        .b10(b10_reg), .b11(b11_reg), .b12(b12_reg), .b13(b13_reg), .b14(b14_reg), .b15(b15_reg),
        .b20(b20_reg), .b21(b21_reg), .b22(b22_reg), .b23(b23_reg), .b24(b24_reg), .b25(b25_reg),
        .b30(b30_reg), .b31(b31_reg), .b32(b32_reg), .b33(b33_reg), .b34(b34_reg), .b35(b35_reg),
        .b40(b40_reg), .b41(b41_reg), .b42(b42_reg), .b43(b43_reg), .b44(b44_reg), .b45(b45_reg),
        .b50(b50_reg), .b51(b51_reg), .b52(b52_reg), .b53(b53_reg), .b54(b54_reg), .b55(b55_reg),
        .y0(y0), .y1(y1), .y2(y2), .y3(y3), .y4(y4), .y5(y5),
        .out_valid(sys_out_valid)
    );

    // *** Sigmoid **************************************************************
    sigmoid sigmoid_0(clk, rst_n, en, clr, y0, s0);
    sigmoid sigmoid_1(clk, rst_n, en, clr, y1, s1);
    sigmoid sigmoid_2(clk, rst_n, en, clr, y2, s2);
    sigmoid sigmoid_3(clk, rst_n, en, clr, y3, s3);
    sigmoid sigmoid_4(clk, rst_n, en, clr, y4, s4);
    sigmoid sigmoid_5(clk, rst_n, en, clr, y5, s5);
    
    register #(16) reg_sig_00(clk, rst_n, en, clr, s0,      s0_reg0);
    register #(16) reg_sig_01(clk, rst_n, en, clr, s0_reg0, s0_reg1);
    register #(16) reg_sig_02(clk, rst_n, en, clr, s0_reg1, s0_reg2);
    register #(16) reg_sig_03(clk, rst_n, en, clr, s0_reg2, s0_reg3);
    register #(16) reg_sig_10(clk, rst_n, en, clr, s1,      s1_reg0);
    register #(16) reg_sig_11(clk, rst_n, en, clr, s1_reg0, s1_reg1);
    register #(16) reg_sig_12(clk, rst_n, en, clr, s1_reg1, s1_reg2);
    register #(16) reg_sig_13(clk, rst_n, en, clr, s1_reg2, s1_reg3);
    register #(16) reg_sig_20(clk, rst_n, en, clr, s2,      s2_reg0);
    register #(16) reg_sig_21(clk, rst_n, en, clr, s2_reg0, s2_reg1);
    register #(16) reg_sig_22(clk, rst_n, en, clr, s2_reg1, s2_reg2);
    register #(16) reg_sig_23(clk, rst_n, en, clr, s2_reg2, s2_reg3);
    register #(16) reg_sig_30(clk, rst_n, en, clr, s3,      s3_reg0);
    register #(16) reg_sig_31(clk, rst_n, en, clr, s3_reg0, s3_reg1);
    register #(16) reg_sig_32(clk, rst_n, en, clr, s3_reg1, s3_reg2);
    register #(16) reg_sig_33(clk, rst_n, en, clr, s3_reg2, s3_reg3);
    register #(16) reg_sig_40(clk, rst_n, en, clr, s4,      s4_reg0);
    register #(16) reg_sig_41(clk, rst_n, en, clr, s4_reg0, s4_reg1);
    register #(16) reg_sig_42(clk, rst_n, en, clr, s4_reg1, s4_reg2);
    register #(16) reg_sig_43(clk, rst_n, en, clr, s4_reg2, s4_reg3);
    register #(16) reg_sig_50(clk, rst_n, en, clr, s5,      s5_reg0);
    register #(16) reg_sig_51(clk, rst_n, en, clr, s5_reg0, s5_reg1);
    register #(16) reg_sig_52(clk, rst_n, en, clr, s5_reg1, s5_reg2);
    register #(16) reg_sig_53(clk, rst_n, en, clr, s5_reg2, s5_reg3);
     
    register #(1) reg_sig_valid_0(clk, rst_n, en, clr, sys_out_valid,      sig_out_valid); 
    register #(1) reg_sig_valid_1(clk, rst_n, en, clr, sig_out_valid,      sig_out_valid_reg0); 
    register #(1) reg_sig_valid_2(clk, rst_n, en, clr, sig_out_valid_reg0, sig_out_valid_reg1); 
    register #(1) reg_sig_valid_3(clk, rst_n, en, clr, sig_out_valid_reg1, sig_out_valid_reg2); 
    register #(1) reg_sig_valid_4(clk, rst_n, en, clr, sig_out_valid_reg2, sig_out_valid_reg3); 

    // *** Output BRAM **********************************************************
    assign a_dina = {32'd0, s5, s4, s3, s2, s1, s0};
    // xpm_memory_tdpram: True Dual Port RAM
    // Xilinx Parameterized Macro, version 2018.3
    xpm_memory_tdpram
    #(
        // Common module parameters
        .MEMORY_SIZE(512),                   // DECIMAL, size: 4x128bit= 512 bits
        .MEMORY_PRIMITIVE("auto"),           // String
        .CLOCKING_MODE("common_clock"),      // String, "common_clock"
        .MEMORY_INIT_FILE("none"),           // String
        .MEMORY_INIT_PARAM("0"),             // String      
        .USE_MEM_INIT(1),                    // DECIMAL
        .WAKEUP_TIME("disable_sleep"),       // String
        .MESSAGE_CONTROL(0),                 // DECIMAL
        .AUTO_SLEEP_TIME(0),                 // DECIMAL          
        .ECC_MODE("no_ecc"),                 // String
        .MEMORY_OPTIMIZATION("true"),        // String              
        .USE_EMBEDDED_CONSTRAINT(0),         // DECIMAL
        
        // Port A module parameters
        .WRITE_DATA_WIDTH_A(128),            // DECIMAL, data width: 128-bit
        .READ_DATA_WIDTH_A(128),             // DECIMAL, data width: 128-bit
        .BYTE_WRITE_WIDTH_A(8),              // DECIMAL
        .ADDR_WIDTH_A(2),                    // DECIMAL, clog2(512/128)=clog2(4)= 2
        .READ_RESET_VALUE_A("0"),            // String
        .READ_LATENCY_A(1),                  // DECIMAL
        .WRITE_MODE_A("write_first"),        // String
        .RST_MODE_A("SYNC"),                 // String
        
        // Port B module parameters  
        .WRITE_DATA_WIDTH_B(128),            // DECIMAL, data width: 128-bit
        .READ_DATA_WIDTH_B(128),             // DECIMAL, data width: 128-bit
        .BYTE_WRITE_WIDTH_B(8),              // DECIMAL
        .ADDR_WIDTH_B(2),                    // DECIMAL, clog2(512/128)=clog2(4)= 2
        .READ_RESET_VALUE_B("0"),            // String
        .READ_LATENCY_B(1),                  // DECIMAL
        .WRITE_MODE_B("write_first"),        // String
        .RST_MODE_B("SYNC")                  // String
    )
    xpm_memory_tdpram_a
    (
        .sleep(1'b0),
        .regcea(1'b1), //do not change
        .injectsbiterra(1'b0), //do not change
        .injectdbiterra(1'b0), //do not change   
        .sbiterra(), //do not change
        .dbiterra(), //do not change
        .regceb(1'b1), //do not change
        .injectsbiterrb(1'b0), //do not change
        .injectdbiterrb(1'b0), //do not change              
        .sbiterrb(), //do not change
        .dbiterrb(), //do not change
        
        // Port A module ports
        .clka(clk),
        .rsta(~rst_n),
        .ena(a_ena),
        .wea(a_wea),
        .addra(a_addra),
        .dina(a_dina),
        .douta(),
        
        // Port B module ports
        .clkb(clk),
        .rstb(~rst_n),
        .enb(a_enb),
        .web(16'd0),                         // port B is read-only: no byte write enables
        .addrb(a_addrb),
        .dinb(128'd0),
        .doutb(a_doutb)
    );

endmodule

You can simulate the design with a testbench to obtain the ANN core timing diagram.
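When reading the waveform, it helps to convert the 16-bit words back to real numbers. The following is a minimal sketch, not part of the original design files. It uses the 16-bit width and the 10 fractional bits from the systolic instance (`.WIDTH(16)`, `.FRAC_BIT(10)`), and it assumes the values are signed two's complement. The helper names `to_fixed` and `from_fixed` are hypothetical.

```python
WIDTH = 16      # word width, matches .WIDTH(16) on the systolic instance
FRAC_BIT = 10   # fractional bits, matches .FRAC_BIT(10)


def to_fixed(value: float) -> int:
    """Encode a real number as a 16-bit word (assumed signed two's complement)."""
    raw = int(round(value * (1 << FRAC_BIT)))
    # Saturate to the representable range instead of wrapping around.
    lowest, highest = -(1 << (WIDTH - 1)), (1 << (WIDTH - 1)) - 1
    raw = max(lowest, min(highest, raw))
    return raw & ((1 << WIDTH) - 1)


def from_fixed(word: int) -> float:
    """Decode a 16-bit word from the waveform back into a real number."""
    word &= (1 << WIDTH) - 1
    if word & (1 << (WIDTH - 1)):   # sign bit set: negative value
        word -= 1 << WIDTH
    return word / (1 << FRAC_BIT)


# The bias constant used by the stationary-input multiplexers.
print(from_fixed(0b0000010000000000))
print(hex(to_fixed(-0.5)))
```

With these assumptions, the constant `16'b0000010000000000` that is loaded when a `bXY_sel` signal equals 2 decodes to 1.0, which is the multiplier applied to the bias terms.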

How it works:

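  1. The controller starts: asserting start moves cnt_main_reg from 0 to 1.

  2. The input is read from the input BRAM (read enable high for counts 1 to 4), followed by weight and bias 2 from the weight BRAM (read enable high for counts 3 to 7).

  3. The systolic array input stream for hidden layer 1 runs (sys_in_valid is high for counts 4 to 9).

  4. The sigmoid outputs of hidden layer 1 stream out and are loaded into the stationary registers (counts 22 to 26).

  5. Weight and bias 3 are read from the weight BRAM (counts 25 and 26).

  6. The systolic array input stream for hidden layer 2 runs (sys_in_valid is high for counts 26 to 31).

  7. The sigmoid outputs of hidden layer 2 stream out.

  8. The output is written to the output BRAM (counts 40 and 41).

  9. The done signal is asserted at count 50, and the counter returns to 0.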
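The sequence above follows from the counter decode in the ANN core. As a rough software model, the sketch below mirrors a few of the `assign` statements in Python and lists the counter values at which each signal is active. The function names `control` and `active_window` are hypothetical, and only a subset of the signals is modelled.

```python
def between(cnt: int, low: int, high: int) -> bool:
    """True when low <= cnt <= high, like the range checks in the Verilog."""
    return low <= cnt <= high


def control(cnt: int) -> dict:
    """Decode some control signals for one value of cnt_main_reg."""
    return {
        "k_enb": between(cnt, 1, 4),                                  # input BRAM read
        "wb_enb": between(cnt, 3, 7) or between(cnt, 25, 26),         # weight BRAM read
        "sys_in_valid": between(cnt, 4, 9) or between(cnt, 26, 31),   # systolic input
        "a_ena": between(cnt, 40, 41),                                # output BRAM write
        "ready": cnt == 0,
        "done": cnt == 50,
    }


def active_window(signal: str) -> list:
    """Counter values (0 to 50) at which the given signal is high."""
    return [cnt for cnt in range(51) if control(cnt)[signal]]


for name in ("k_enb", "wb_enb", "sys_in_valid", "a_ena", "done"):
    print(f"{name:13s} {active_window(name)}")
```

Comparing this table with the simulated waveform is a quick way to check that the testbench drives `start` correctly and that each phase begins at the expected count.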

3.5. AXI Stream Module

Once we have the ANN core, we need to wrap it in an AXI stream module so that it is compatible with the AXI stream protocol.

This code is an implementation of the AXIS ANN in Verilog.

You can simulate the design with a testbench to obtain the AXIS ANN module timing diagram.

3.6. SoC Design

At this point, we already have the AXIS ANN module. Next, create a block design that consists of the Zynq IP, an AXI DMA, and the AXIS ANN module.

Configure the AXI DMA stream width to 128-bit as shown in the following figure.

The AXI DMA reads the weights, biases, and inputs from DDR memory and writes the outputs back to a specific location in DDR memory.

Data mapping inside the DDR memory for weight, bias, and input.

Data mapping inside the DDR memory for output.
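Each 128-bit word carries six 16-bit values. In the ANN core, lane 0 occupies bits [15:0], lane 5 occupies bits [95:80], and the output word is built as `{32'd0, s5, s4, s3, s2, s1, s0}`, so the upper 32 bits are zero. The sketch below shows one way to pack and unpack such words in software. The function names are hypothetical, and the little-endian byte order in memory is an assumption that should be checked against the data mapping figures.

```python
LANES = 6        # 16-bit values per 128-bit word
LANE_BITS = 16
WORD_BYTES = 16  # 128 bits


def pack_word(lanes: list) -> int:
    """Pack six 16-bit values into one 128-bit word, lane 0 in bits [15:0]."""
    if len(lanes) != LANES:
        raise ValueError(f"expected {LANES} lanes, got {len(lanes)}")
    word = 0
    for index, value in enumerate(lanes):
        word |= (value & 0xFFFF) << (LANE_BITS * index)
    return word  # bits [127:96] stay zero


def unpack_word(word: int) -> list:
    """Split a 128-bit word back into its six 16-bit lanes."""
    return [(word >> (LANE_BITS * index)) & 0xFFFF for index in range(LANES)]


def word_to_bytes(word: int) -> bytes:
    """Serialize a word for a DMA buffer (assumed little-endian byte order)."""
    return word.to_bytes(WORD_BYTES, "little")


word = pack_word([1, 2, 3, 4, 5, 6])
print(hex(word))
print(unpack_word(word))
print(word_to_bytes(word).hex())
```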

4. Software Design

At this point, the required files to program the FPGA are already on the board. The next step is to create Jupyter Notebook files.

  • Open a web browser and open Jupyter Notebook on the board. Create a new file from the New menu by selecting Python 3 (ipykernel).

  • Write the following code to test the design.

5. Performance

We can compare the performance of the HW-based ANN computation with SW computation using this code.
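As a rough illustration of the software side of such a comparison, the sketch below times a pure-Python forward pass that applies the neuron equation from the introduction, with sigmoid as the activation function and an added bias term. The layer sizes, weights, biases, and input sample are placeholders chosen for this example, not the trained values used in this tutorial.

```python
import math
import time


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs: list, weights: list, biases: list) -> list:
    """One fully connected layer: y_j = sigmoid(sum_i x_i * w_ji + b_j)."""
    return [
        sigmoid(sum(x * w for x, w in zip(inputs, row)) + bias)
        for row, bias in zip(weights, biases)
    ]


def forward(inputs: list, layers: list) -> list:
    """Forward propagation through a list of (weights, biases) layers."""
    activations = inputs
    for weights, biases in layers:
        activations = dense(activations, weights, biases)
    return activations


# Placeholder network: 3 inputs -> 2 neurons -> 1 neuron.
layers = [
    ([[0.5, -0.25, 0.75], [-0.5, 0.25, 0.125]], [0.1, -0.1]),
    ([[1.0, -1.0]], [0.0]),
]
sample = [1.0, 0.0, 1.0]

runs = 1000
start = time.perf_counter()
for _ in range(runs):
    result = forward(sample, layers)
elapsed = time.perf_counter() - start
print(f"output = {result}")
print(f"mean time per inference = {elapsed / runs * 1e6:.2f} us")
```

The same `time.perf_counter()` pattern can be wrapped around the hardware call on the board, so that both measurements include comparable overhead.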
