Part 4: AXI-Stream Multiplier with DMA

Objective

This tutorial shows how to create a simple AXI-Stream IP core in Verilog. The IP core performs a simple multiplication. We use the AXI DMA to write to and read from the AXI-Stream multiplier, then compare the performance of this AXI DMA multiplier with the AXI-Lite multiplier from the previous tutorial.

1. Hardware Design

1.1. Stream Interface

Unlike a memory-mapped interface, a stream interface has no address. It is mainly used for point-to-point data transfer between IP modules in the FPGA. In the AXI-Stream interface, the sender is known as the master and the receiver as the slave. Data moves in one direction only, from the master port to the slave port.

1.2. RTL Design of Multiplier

This is the same RTL module that did simple multiplication in the previous tutorial. It has one 32-bit input a and one 32-bit output r. Every input will be multiplied by 8 to produce the output.

mult_core.v

module mult_core
    (
        input wire [31:0]  a,
        output wire [31:0] r    
    );
    
    assign r = a * 8;
    
endmodule
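
As a quick software reference, the behaviour of mult_core can be modelled in Python. This is only a sketch: the name mult_core_model is made up for illustration, and it assumes the usual Verilog behaviour that the product is truncated to the 32-bit width of r, so large inputs wrap around.

```python
MASK_32 = 0xFFFFFFFF  # r is 32 bits wide, so only the low 32 bits survive


def mult_core_model(a):
    """Hypothetical software model of mult_core: r = a * 8, truncated to 32 bits."""
    return ((a & MASK_32) * 8) & MASK_32


if __name__ == "__main__":
    print(mult_core_model(10))          # small input, no overflow
    print(mult_core_model(0x20000000))  # product needs 33 bits, so it wraps
```

A model like this is handy later for checking the values that come back from the hardware.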

1.3. AXI-Stream Wrapper

Next, we wrap the multiplier core in an AXI-Stream interface. This stream interface will later be connected to the AXI DMA, which gives the multiplier core access to the PS DRAM.

This is the Verilog code for the wrapper module.

axis_mult.v

module axis_mult
    (
        // ### Clock and reset signals #########################################
        input  wire        aclk,
        input  wire        aresetn,
        // ### AXI4-stream slave signals #######################################
        output wire        s_axis_tready,
        input  wire [31:0] s_axis_tdata,
        input  wire        s_axis_tvalid,
        input  wire        s_axis_tlast,
        // ### AXI4-stream master signals ######################################
        input  wire        m_axis_tready,
        output wire [31:0] m_axis_tdata,
        output wire        m_axis_tvalid,
        output wire        m_axis_tlast
    );
    
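    // Backpressure: the slave side is ready whenever the downstream receiver is
    assign s_axis_tready = m_axis_tready;
    // valid and last pass straight through from the slave side to the master side
    assign m_axis_tvalid = s_axis_tvalid;
    assign m_axis_tlast = s_axis_tlast;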
    
    mult_core mult_core_0
    (
        .a(s_axis_tdata),
        .r(m_axis_tdata)    
    );
   
endmodule

The AXI-Stream slave port is the input port for data going into the multiplier core, and the AXI-Stream master port is the output port for data coming out of it. The s_axis_* prefix marks the slave port signals, and the m_axis_* prefix marks the master port signals. An AXI-Stream port usually has the following signals:

The tready signal indicates that a receiver can accept a transfer.
The tdata is the primary signal used to provide the data that is passing across the interface.
The tvalid signal indicates that the sender is driving a valid transfer. A transfer takes place when both tvalid and tready are high (1) in the same clock cycle.
The tlast signal indicates the boundary of a packet.

Since our multiplier is a simple combinational circuit, we can connect the control signals directly: tvalid and tlast pass from the slave port to the master port, and tready passes back from the master port to the slave port.
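
The handshake rule and the pass-through wiring can be illustrated with a small cycle-by-cycle model in Python. This is a sketch, not a simulation of the real IP: simulate_axis_mult and its arguments are hypothetical names, the source is assumed to hold tvalid high whenever it has data left, and the receiver's tready pattern is supplied as a list with one entry per clock cycle.

```python
def simulate_axis_mult(source_beats, sink_ready):
    """Model the wrapper for one clock cycle per entry in sink_ready.

    source_beats: list of (tdata, tlast) pairs the sender wants to transmit.
    sink_ready:   list of 0/1 values, the receiver's tready in each cycle.
    Returns the (tdata, tlast) pairs the receiver actually accepted.
    """
    received = []
    idx = 0  # index of the next beat the sender is presenting
    for m_axis_tready in sink_ready:
        if idx >= len(source_beats):
            break  # nothing left to send
        s_axis_tdata, s_axis_tlast = source_beats[idx]
        s_axis_tvalid = 1  # assumption: sender always has valid data

        # Same wiring as the wrapper: control signals pass straight through
        s_axis_tready = m_axis_tready
        m_axis_tvalid = s_axis_tvalid
        m_axis_tlast = s_axis_tlast
        m_axis_tdata = (s_axis_tdata * 8) & 0xFFFFFFFF

        # A transfer happens only when tvalid and tready are both 1
        if m_axis_tvalid and s_axis_tready:
            received.append((m_axis_tdata, m_axis_tlast))
            idx += 1
    return received


if __name__ == "__main__":
    beats = [(1, 0), (2, 0), (3, 1)]  # tlast is set on the final beat
    print(simulate_axis_mult(beats, [1, 0, 1, 0, 0, 1]))
```

In the cycles where tready is 0, the sender simply keeps presenting the same beat, so no data is lost or duplicated.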

1.4. System Design

This diagram shows our system. It consists of an ARM CPU, DRAM, AXI DMA, and our AXI-Stream multiplier module. The multiplier module is connected to the AXI DMA, with an AXI-Stream FIFO IP placed between the two.

The following figure shows the Zynq IP high-performance port configuration. Two ports are enabled: AXI HP0 and AXI HP2.

The following figure shows the AXI DMA IP configuration. The read and write channel data width is 32-bit.

This is the final block design diagram as shown in Vivado.

2. Software Design

First, we create objects for the DMA and for its send and receive channels.

# Access to AXI DMA (assumes `overlay` is the Overlay object already loaded for this design)
dma = overlay.axi_dma_0
dma_send = overlay.axi_dma_0.sendchannel
dma_recv = overlay.axi_dma_0.recvchannel

Then, we allocate the buffers. We use the allocate() function to allocate each buffer, and a NumPy dtype to specify the element type, which is a 32-bit unsigned integer in this case.

# Allocate physical memory for AXI DMA
from pynq import allocate
import numpy as np

data_size = 1
input_buffer = allocate(shape=(data_size,), dtype=np.uint32)
output_buffer = allocate(shape=(data_size,), dtype=np.uint32)
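
It can help to know how many bytes a buffer occupies, because the AXI DMA can only move a limited number of bytes per transfer. The sketch below computes the size of a uint32 buffer and checks it against a buffer length register of a given width. The helper names are hypothetical, the register width used in this design is not stated here, and the limit of 2**width - 1 bytes per transfer is an assumption to confirm against the AXI DMA configuration.

```python
BYTES_PER_WORD = 4  # each np.uint32 element is 4 bytes


def buffer_bytes(data_size):
    """Size in bytes of a uint32 buffer with data_size elements."""
    return data_size * BYTES_PER_WORD


def fits_in_one_transfer(data_size, length_width_bits):
    """True if the buffer fits in one transfer for the given register width.

    Assumption: the largest single transfer is 2**length_width_bits - 1 bytes.
    """
    return buffer_bytes(data_size) <= (1 << length_width_bits) - 1


if __name__ == "__main__":
    for size in (1, 1000000):
        for width in (14, 26):  # example widths only
            print(size, width, buffer_bytes(size), fits_in_one_transfer(size, width))
```

If a buffer does not fit, the register width would need to be increased in the IP configuration, or the data split into several smaller transfers.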

Next, write the input to be multiplied to the input_buffer.

# Write input to be multiplied
input_buffer[0] = 10

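We then run the MM2S and S2MM DMA transfers. The MM2S channel reads input_buffer from memory and streams it to the multiplier. The S2MM channel takes the multiplier output and writes it to output_buffer. Calling wait() on each channel blocks until that transfer has finished, so output_buffer is only read once it holds the result.

# Do AXI DMA MM2S and S2MM transfer
dma_send.transfer(input_buffer)
dma_recv.transfer(output_buffer)
# Wait for both channels to finish before reading output_buffer
dma_send.wait()
dma_recv.wait()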

We print the multiplication result from the output_buffer.

# Print multiplication result
print(output_buffer[0])
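
Instead of only printing the value, the result can be checked against a software reference. The sketch below is one way to do that. The function names are hypothetical, the reference assumes the core computes a * 8 truncated to 32 bits, and plain Python lists stand in for the DMA buffers so the example runs without the board.

```python
def expected_outputs(inputs):
    """Software reference: each input times 8, truncated to 32 bits."""
    return [((int(a) & 0xFFFFFFFF) * 8) & 0xFFFFFFFF for a in inputs]


def find_mismatches(inputs, outputs):
    """Return (index, expected, actual) for every element that differs."""
    expected = expected_outputs(inputs)
    return [
        (i, e, int(o))
        for i, (e, o) in enumerate(zip(expected, outputs))
        if e != int(o)
    ]


if __name__ == "__main__":
    # Lists used here in place of input_buffer and output_buffer
    print(find_mismatches([10], [80]))         # matching result
    print(find_mismatches([10, 2], [80, 17]))  # second element is wrong
```

On the board, input_buffer and output_buffer could be passed in directly, since they can be iterated like NumPy arrays.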

We can create a function to do a multiplication like this:

# Function to calculate multiplication
def calc_mult_axi_dma(a, r):
    dma_send.transfer(a)
    dma_recv.transfer(r)
    # Block until both transfers are done so r is valid on return
    dma_send.wait()
    dma_recv.wait()

We can calculate the time required to do 1 million multiplications. Then, we can compare this with the AXI-Lite multiplier module.

from time import time

data_size = 1000000
input_buffer = allocate(shape=(data_size,), dtype=np.uint32)
output_buffer = allocate(shape=(data_size,), dtype=np.uint32)

# Measure the time required to calculate 1 million multiplications
# (the timed region also includes the loop that fills input_buffer)
t1 = time()
for i in range(data_size):
    input_buffer[i] = 3578129
calc_mult_axi_dma(input_buffer, output_buffer)
t2 = time()
t_diff = t2 - t1
print('Time used for AXI DMA multiplier: {}s'.format(t_diff))

Time used for AXI DMA multiplier: 1.1429970264434814s
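
Because the measurement covers both the Python loop that fills the buffer and the DMA transfer, it can be useful to time the two phases separately. The helper below is a minimal sketch of that idea. time_phases is a hypothetical name, and the fill and transfer functions in the demo are software stand-ins, so it runs without the board and says nothing about the real hardware timing.

```python
from time import perf_counter


def time_phases(fill, transfer):
    """Run fill() then transfer() and return (fill_seconds, transfer_seconds)."""
    t0 = perf_counter()
    fill()
    t1 = perf_counter()
    transfer()
    t2 = perf_counter()
    return t1 - t0, t2 - t1


if __name__ == "__main__":
    data = [0] * 100000
    result = []

    def fill():
        # Same style of element-by-element loop as in the measurement
        for i in range(len(data)):
            data[i] = 3578129

    def fake_transfer():
        # Stand-in for calc_mult_axi_dma(input_buffer, output_buffer)
        result.extend((x * 8) & 0xFFFFFFFF for x in data)

    fill_s, transfer_s = time_phases(fill, fake_transfer)
    print('Fill: {}s, transfer: {}s'.format(fill_s, transfer_s))
```

On the board, fill would wrap the loop over input_buffer and transfer would wrap the call to calc_mult_axi_dma.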

The result from the AXI-Lite multiplier in the previous tutorial:

Time used for AXI lite multiplier: 15.767791271209717s

Compared to the AXI-Lite multiplier, the AXI-Stream multiplier (with AXI DMA) is about 13.8 times faster.
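
The speedup is simply the ratio of the two measured times. The short calculation below reproduces it from the figures printed in this tutorial and also derives the average time per multiplication for the AXI DMA run. The variable names are chosen for illustration.

```python
t_axi_lite = 15.767791271209717  # seconds, from the previous tutorial
t_axi_dma = 1.1429970264434814   # seconds, from the measurement in this tutorial
data_size = 1000000              # multiplications in the AXI DMA run

speedup = t_axi_lite / t_axi_dma
dma_time_per_item_us = t_axi_dma / data_size * 1e6  # microseconds

print('Speedup: {:.2f}x'.format(speedup))
print('AXI DMA time per multiplication: {:.3f} us'.format(dma_time_per_item_us))
```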

Don’t forget to free the memory buffers to avoid memory leaks!

# Delete buffer to prevent memory leak
del input_buffer, output_buffer

3. Full Step-by-Step Tutorial

This video contains detailed steps for making this project.

4. Conclusion

In this tutorial, we covered some of the basics of AXI-Stream based IP core creation integrated with AXI DMA.

