Part 6: AXI-Stream GCD with DMA

Objective

This tutorial shows how to create a more complex AXI-Stream IP core in Verilog. The IP core calculates the greatest common divisor (GCD) of two numbers. At the end, we compare the performance of the AXI-Lite GCD from the previous tutorial with the AXI-Stream GCD (with AXI DMA).


1. Hardware Design

1.1. RTL Design of GCD

The GCD core used in this tutorial is the same as in the previous tutorial.

gcd_core.v
module gcd_core
    (
        input wire         clk,
        input wire         rst_n,
        input wire         start,
        input wire [31:0]  a,
        input wire [31:0]  b,
        output wire        ready,
        output wire        done,
        output wire [31:0] r
    );
    
    localparam S_IDLE = 2'h0,
               S_OP = 2'h1;
    
    reg [1:0] state_reg, state_next;
    reg [31:0] a_reg, a_next;
    reg [31:0] b_reg, b_next;
    reg [4:0] n_reg, n_next;
    reg done_reg, done_next;

    always @(posedge clk)
    begin
        if (!rst_n)
        begin
            state_reg <= S_IDLE;
            a_reg <= 0;
            b_reg <= 0;
            n_reg <= 0;
            done_reg <= 0;
        end
        else
        begin
            state_reg <= state_next;
            a_reg <= a_next;
            b_reg <= b_next;
            n_reg <= n_next;
            done_reg <= done_next;
        end
    end
    
    always @(*)
    begin
        state_next = state_reg;
        a_next = a_reg;
        b_next = b_reg;
        n_next = n_reg;
        done_next = 0;
        case(state_reg)
            S_IDLE:
            begin
                if (start)
                begin
                    a_next = a;
                    b_next = b;
                    n_next = 0;
                    state_next = S_OP;
                end
            end
            S_OP:
            begin
                if (a_reg == b_reg)
                begin
                    a_next = a_reg << n_reg;
                    done_next = 1;
                    state_next = S_IDLE;
                end
                else
                begin
                    if (!a_reg[0])       // a even
                    begin
                        a_next = {1'b0, a_reg[31:1]};
                        if (!b_reg[0])   // a and b even
                        begin
                            b_next = {1'b0, b_reg[31:1]};
                            n_next = n_reg + 1;
                        end
                    end
                    else                // a odd
                    begin
                        if (!b_reg[0])  // b even
                        begin
                            b_next = {1'b0, b_reg[31:1]};
                        end
                        else            // a and b odd
                        begin
                            if (a_reg > b_reg)
                                a_next = a_reg - b_reg;
                            else
                                b_next = b_reg - a_reg;
                        end
                    end
                end
            end
        endcase
    end
 
    assign ready = (state_reg == S_IDLE) ? 1 : 0;
    assign done = done_reg; 
    assign r = a_reg;
    
endmodule
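
As a cross-check of the state machine above, the following is a minimal Python sketch of the S_OP loop, not part of the original design. The function name `gcd_core_model` and the iteration counter are assumptions made for illustration. Each loop iteration corresponds to one pass through the S_OP branch where a and b differ. Reading the Verilog, the loop does not appear to terminate when exactly one operand is zero, so the sketch rejects that case.

```python
import math

MASK32 = 0xFFFFFFFF  # the core works on 32-bit registers


def gcd_core_model(a, b):
    """Software model of the S_OP loop; returns (result, iterations)."""
    a &= MASK32
    b &= MASK32
    if (a == 0) != (b == 0):
        # With exactly one zero operand, a and b never become equal
        raise ValueError("exactly one operand is zero")
    n = 0       # number of common factors of two removed (n_reg)
    steps = 0   # iterations spent with a != b
    while a != b:
        if a % 2 == 0:          # a even
            a >>= 1
            if b % 2 == 0:      # a and b even
                b >>= 1
                n += 1
        elif b % 2 == 0:        # a odd, b even
            b >>= 1
        elif a > b:             # a and b odd
            a -= b
        else:
            b -= a
        steps += 1
    return (a << n) & MASK32, steps


result, steps = gcd_core_model(128, 72)
print(result, steps)                 # 8 11
print(result == math.gcd(128, 72))   # True
```

The iteration count depends on the operands, which is why the wrapper in the next section has to wait for the core instead of assuming a fixed latency.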

1.2. AXI-Stream Wrapper

Now, we are going to make a wrapper that gives the GCD core an AXI-Stream interface. Later, this stream interface will be connected to the AXI DMA, so the GCD core can access the PS DRAM through the AXI DMA.

This is the Verilog code for the wrapper module.

`timescale 1ns / 1ps

module axis_gcd
    (
        // ### Clock and reset signals #########################################
        input  wire        aclk,
        input  wire        aresetn,
        // ### AXI4-stream slave signals #######################################
        output wire        s_axis_tready,
        input wire [63:0]  s_axis_tdata,
        input wire         s_axis_tvalid,
        input wire         s_axis_tlast,
        // ### AXI4-stream master signals ######################################
        input wire         m_axis_tready,
        output wire [31:0] m_axis_tdata,
        output wire        m_axis_tvalid,
        output wire        m_axis_tlast
    );
    
    reg [3:0] state_reg, state_next;
    reg s_axis_tready_reg, s_axis_tready_next;
    reg s_axis_tlast_reg, s_axis_tlast_next;
    
    reg start_reg, start_next;
    reg [31:0] a_reg, a_next;
    reg [31:0] b_reg, b_next;
    wire ready;
    wire done;
    wire [31:0] r;
    
    // GCD core
    gcd_core gcd_core_0
    (
        .clk(aclk),
        .rst_n(aresetn),
        .start(start_reg),
        .a(a_reg),
        .b(b_reg),
        .ready(ready),
        .done(done),
        .r(r)
    );
    
    // State machine register
    always @(posedge aclk)
    begin
        if (!aresetn)
        begin
            state_reg <= 0;
            s_axis_tready_reg <= 0;
            s_axis_tlast_reg <= 0;
            start_reg <= 0;
            a_reg <= 0;
            b_reg <= 0;
        end
        else
        begin
            state_reg <= state_next;
            s_axis_tready_reg <= s_axis_tready_next;
            s_axis_tlast_reg <= s_axis_tlast_next;
            start_reg <= start_next;
            a_reg <= a_next;
            b_reg <= b_next;
        end
    end

    // State machine next value    
    always @(*)
    begin
        state_next = state_reg;
        s_axis_tready_next = 0;
        s_axis_tlast_next = s_axis_tlast_reg;
        start_next = 0;
        a_next = a_reg;
        b_next = b_reg;
        case (state_reg)
            0: // Wait for start condition
            begin
                // Input data is available, and output FIFO is able to receive data 
                if (s_axis_tvalid && m_axis_tready)
                begin
                    state_next = 1;
                end
            end
            1: // Read data from AXI stream and write to local register
            begin
                state_next = 2;
                s_axis_tready_next = 1;
                s_axis_tlast_next = s_axis_tlast;
                start_next = 1;
                a_next = s_axis_tdata[31:0];
                b_next = s_axis_tdata[63:32];
            end
            2: // GCD core starts running
            begin
                state_next = 3;
            end
            3: // Wait until GCD is ready
            begin
                // GCD ready condition
                if (ready)
                begin
                    state_next = 0;
                end
            end
        endcase
    end
    
    // Output
    assign s_axis_tready = s_axis_tready_reg;
    assign m_axis_tdata = r;
    assign m_axis_tvalid = done;
    assign m_axis_tlast = s_axis_tlast_reg;
    
endmodule
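
To make the data mapping of the wrapper explicit, here is a small behavioural sketch in Python. It is an illustration only: the function name `axis_gcd_model` is hypothetical, it ignores all handshake timing, and it uses `math.gcd` as a stand-in for the GCD core. It shows that each 64-bit input beat is split into a (bits 31:0) and b (bits 63:32), and that the TLAST flag of the input beat is forwarded with the corresponding 32-bit result.

```python
import math


def axis_gcd_model(beats):
    """Behavioural model: list of (tdata, tlast) in, list of (result, tlast) out."""
    out = []
    for tdata, tlast in beats:
        a = tdata & 0xFFFFFFFF           # s_axis_tdata[31:0]
        b = (tdata >> 32) & 0xFFFFFFFF   # s_axis_tdata[63:32]
        # math.gcd stands in for gcd_core (assumption for this sketch)
        out.append((math.gcd(a, b), tlast))
    return out


stream_in = [((72 << 32) | 128, False), ((9 << 32) | 6, True)]
print(axis_gcd_model(stream_in))  # [(8, False), (3, True)]
```

One output beat is produced per input beat, so the S2MM buffer needs exactly as many 32-bit elements as the MM2S buffer has 64-bit elements.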

1.3. System Design

This diagram shows our system. It consists of an ARM CPU, DRAM, AXI DMA, and our AXI-Stream GCD module. Our AXI-Stream GCD module is connected to the AXI DMA. Between the GCD module and the AXI DMA, we also add an AXI-Stream FIFO IP.

The following figure shows the Zynq IP high-performance port configuration. Two ports are enabled: AXI HP0 and AXI HP2.

The following figure shows the AXI DMA IP configuration. The read channel data width is 64-bit, and the write channel data width is 32-bit.

This is the final block design diagram as shown in Vivado.

2. Software Design

First, we need to create DMA, DMA send channel, and DMA receive channel objects.

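# Imports used below; overlay is assumed to be loaded already
from time import time
import numpy as np
from pynq import allocate
# Access to AXI DMA
dma = overlay.axi_dma_0
dma_send = dma.sendchannel
dma_recv = dma.recvchannel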

Then, we need to allocate the buffers. We use the allocate() function for this, and NumPy data types specify the element type of each buffer: 64-bit unsigned integer for the input and 32-bit unsigned integer for the output.

# Allocate physical memory for AXI DMA
data_size = 1
input_buffer = allocate(shape=(data_size,), dtype=np.uint64)
output_buffer = allocate(shape=(data_size,), dtype=np.uint32)
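
The buffer sizes in bytes follow directly from the element widths. The helper below is a hypothetical illustration (the name `buffer_bytes` is not part of PYNQ) that works out the sizes for a single calculation and for the 1 million calculations used later. It is worth checking these sizes against the maximum transfer length configured in the AXI DMA IP; that limit is an assumption to verify in your own block design, since it is not stated in this tutorial.

```python
def buffer_bytes(num_elements, bits_per_element):
    """Size in bytes of a buffer with the given element count and width."""
    if bits_per_element % 8 != 0:
        raise ValueError("element width must be a whole number of bytes")
    return num_elements * (bits_per_element // 8)


for n in (1, 1_000_000):
    # 64-bit input elements (a and b packed), 32-bit output elements (result)
    print(n, buffer_bytes(n, 64), buffer_bytes(n, 32))
# 1 8 4
# 1000000 8000000 4000000
```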

After that, we write the input to be calculated to the input_buffer.

# Write GCD inputs: b = 72 in the upper 32 bits, a = 128 in the lower 32 bits
input_buffer[0] = (72 << 32) | 128
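
The packing can be wrapped in a small helper so that the operand order is harder to get wrong. This is a sketch; `pack_operands` is a hypothetical name, and the layout (a in the lower word, b in the upper word) is taken from how the wrapper reads s_axis_tdata.

```python
def pack_operands(a, b):
    """Pack two 32-bit operands into one 64-bit stream word."""
    for value in (a, b):
        if not 0 <= value < 2**32:
            raise ValueError("operands must fit in 32 bits")
    # b goes to bits 63:32, a goes to bits 31:0
    return (b << 32) | a


word = pack_operands(128, 72)
print(word == ((72 << 32) | 128))  # True
```

With the helper, the assignment above could be written as input_buffer[0] = pack_operands(128, 72).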

We do the MM2S and S2MM DMA transfers. The MM2S DMA reads the input_buffer and sends it to the GCD module. The S2MM DMA reads the GCD module output and writes it to the output_buffer.

# Do AXI DMA MM2S and S2MM transfer
dma_send.transfer(input_buffer)
dma_recv.transfer(output_buffer)
# Block until both channels are done before reading the result
dma_send.wait()
dma_recv.wait()

We print the GCD result from the output_buffer.

# Print GCD result
print(output_buffer[0])
8

We can create a function to do a GCD calculation like this:

# Function to calculate GCD
def calc_gcd_hw_axi_dma(ab, r):
    dma_send.transfer(ab)
    dma_recv.transfer(r)
    # Block until both transfers have completed
    dma_send.wait()
    dma_recv.wait()

We can measure the time required to do 1 million GCD calculations. Then, we can compare this with the AXI-Lite GCD module.

data_size = 1000000
input_buffer = allocate(shape=(data_size,), dtype=np.uint64)
output_buffer = allocate(shape=(data_size,), dtype=np.uint32)

# Measure the time required for 1 million GCD calculations
t1 = time()
for i in range(data_size):
    input_buffer[i] = (3578129 << 32) | 2391065
calc_gcd_hw_axi_dma(input_buffer, output_buffer)
t2 = time()
t_diff = t2 - t1
print('Time used for HW GCD with AXI DMA: {}s'.format(t_diff))
Time used for HW GCD with AXI DMA: 1.1621017456054688s

The result from AXI-Lite GCD in the previous tutorial:

Time used for HW GCD: 29.42424488067627s

Compared to the AXI-Lite GCD, the AXI-Stream GCD with AXI DMA is about 25.3 times faster for 1 million calculations.
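
The comparison can be reproduced from the two timings printed above. The snippet below is a plain arithmetic sketch; the variable names are chosen here for illustration.

```python
t_axi_lite = 29.42424488067627    # seconds, AXI-Lite GCD (previous tutorial)
t_axi_dma = 1.1621017456054688    # seconds, AXI-Stream GCD with AXI DMA
num_calcs = 1_000_000

speedup = t_axi_lite / t_axi_dma
us_per_gcd_lite = t_axi_lite / num_calcs * 1e6
us_per_gcd_dma = t_axi_dma / num_calcs * 1e6

print(f"Speedup: {speedup:.2f}x")                           # Speedup: 25.32x
print(f"AXI-Lite: {us_per_gcd_lite:.2f} us per GCD")        # AXI-Lite: 29.42 us per GCD
print(f"AXI-Stream + DMA: {us_per_gcd_dma:.2f} us per GCD") # AXI-Stream + DMA: 1.16 us per GCD
```

Note that the AXI DMA measurement above also includes the Python loop that fills the input buffer, so the hardware transfer itself accounts for only part of the 1.16 s.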

Don’t forget to free the memory buffers to avoid memory leaks!

# Delete buffer to prevent memory leak
del input_buffer, output_buffer

3. Full Step-by-Step Tutorial

This video contains detailed steps for making this project.

4. Conclusion

In this tutorial, we created a more complex AXI-Stream IP core, integrated it with the AXI DMA, and compared its performance with the AXI-Lite version.
