This tutorial contains information on how to create a more complex AXI-Stream IP core in Verilog. The IP core does a greatest common divisor (GCD) calculation between two numbers. Later, we can compare the performance between AXI-Lite GCD and AXI-Stream GCD (with AXI DMA).
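For reference, assuming the `gcd_core` from the previous tutorial implements the classic Euclidean algorithm, the computation the hardware performs can be sketched in pure Python (the function name `gcd_sw` is just for illustration):

```python
# Software reference for what the hardware GCD core computes:
# the Euclidean algorithm by repeated remainder.
def gcd_sw(a, b):
    while b != 0:
        a, b = b, a % b
    return a

print(gcd_sw(128, 72))  # → 8 (the example pair used later in this tutorial)
```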
Now, we are going to make a wrapper that gives the GCD core an AXI-Stream interface. Later, this stream interface will be connected to the AXI DMA, so the GCD core can access the PS DRAM through the AXI DMA.
This is the Verilog code for the wrapper module.
```verilog
`timescale 1ns / 1ps

module axis_gcd (
    // ### Clock and reset signals #########################################
    input  wire        aclk,
    input  wire        aresetn,
    // ### AXI4-Stream slave signals #######################################
    output wire        s_axis_tready,
    input  wire [63:0] s_axis_tdata,
    input  wire        s_axis_tvalid,
    input  wire        s_axis_tlast,
    // ### AXI4-Stream master signals ######################################
    input  wire        m_axis_tready,
    output wire [31:0] m_axis_tdata,
    output wire        m_axis_tvalid,
    output wire        m_axis_tlast
);

    reg [3:0]  state_reg, state_next;
    reg        s_axis_tready_reg, s_axis_tready_next;
    reg        s_axis_tlast_reg, s_axis_tlast_next;
    reg        start_reg, start_next;
    reg [31:0] a_reg, a_next;
    reg [31:0] b_reg, b_next;

    wire        ready;
    wire        done;
    wire [31:0] r;

    // GCD core
    gcd_core gcd_core_0 (
        .clk(aclk),
        .rst_n(aresetn),
        .start(start_reg),
        .a(a_reg),
        .b(b_reg),
        .ready(ready),
        .done(done),
        .r(r)
    );

    // State machine register
    always @(posedge aclk)
    begin
        if (!aresetn)
        begin
            state_reg         <= 0;
            s_axis_tready_reg <= 0;
            s_axis_tlast_reg  <= 0;
            start_reg         <= 0;
            a_reg             <= 0;
            b_reg             <= 0;
        end
        else
        begin
            state_reg         <= state_next;
            s_axis_tready_reg <= s_axis_tready_next;
            s_axis_tlast_reg  <= s_axis_tlast_next;
            start_reg         <= start_next;
            a_reg             <= a_next;
            b_reg             <= b_next;
        end
    end

    // State machine next value
    always @(*)
    begin
        state_next         = state_reg;
        s_axis_tready_next = 0;
        s_axis_tlast_next  = s_axis_tlast_reg;
        start_next         = 0;
        a_next             = a_reg;
        b_next             = b_reg;

        case (state_reg)
            0: // Wait for start condition
            begin
                // Input data is available, and output FIFO is able to receive data
                if (s_axis_tvalid && m_axis_tready)
                begin
                    state_next = 1;
                end
            end
            1: // Read data from AXI stream and write to local register
            begin
                state_next         = 2;
                s_axis_tready_next = 1;
                s_axis_tlast_next  = s_axis_tlast;
                start_next         = 1;
                a_next             = s_axis_tdata[31:0];
                b_next             = s_axis_tdata[63:32];
            end
            2: // GCD core starts running
            begin
                state_next = 3;
            end
            3: // Wait until GCD is ready
            begin
                // GCD ready condition
                if (ready)
                begin
                    state_next = 0;
                end
            end
        endcase
    end

    // Output
    assign s_axis_tready = s_axis_tready_reg;
    assign m_axis_tdata  = r;
    assign m_axis_tvalid = done;
    assign m_axis_tlast  = s_axis_tlast_reg;

endmodule
```
1.3. System Design
This diagram shows our system. It consists of an ARM CPU, DRAM, AXI DMA, and our AXI-Stream GCD module. Our AXI-Stream GCD module is connected to the AXI DMA. Between the GCD module and the AXI DMA, we also add an AXI-Stream FIFO IP.
The following figure shows the Zynq IP high-performance port configuration. There are two ports enabled, which are AXI HP0 and AXI HP2.
The following figure shows the AXI DMA IP configuration. The read channel data width is 64-bit, and the write channel data width is 32-bit.
This is the final block design diagram as shown in Vivado.
2. Software Design
First, we need to create DMA, DMA send channel, and DMA receive channel objects.
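A minimal PYNQ sketch of this step is shown below. The bitstream filename and the `axi_dma_0` instance name are assumptions; use the names from your own design and Vivado block diagram.

```python
# Load the overlay and get handles to the AXI DMA channels.
# 'axis_gcd.bit' and 'axi_dma_0' are hypothetical names -- match
# them to your own bitstream and block-design instance names.
from pynq import Overlay

overlay = Overlay('axis_gcd.bit')
dma = overlay.axi_dma_0
dma_send = dma.sendchannel   # MM2S channel: PS DRAM -> GCD core
dma_recv = dma.recvchannel   # S2MM channel: GCD core -> PS DRAM
```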
Then, we need to allocate the buffers. We use the allocate() function to allocate them, and NumPy data types specify the element type: unsigned 64-bit integers for the input buffer and unsigned 32-bit integers for the output buffer.
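A sketch of the allocation, assuming a single-element transfer as in the first example (the benchmark later allocates larger buffers the same way):

```python
# Allocate DMA-accessible buffers. The dtypes match the stream widths:
# 64-bit input words (a and b packed together), 32-bit results.
import numpy as np
from pynq import allocate

input_buffer = allocate(shape=(1,), dtype=np.uint64)
output_buffer = allocate(shape=(1,), dtype=np.uint32)
```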
After that, we write the input to be calculated to the input_buffer.
```python
# Write GCD input a and b to be calculated
input_buffer[0] = (72 << 32) | 128
```
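The packing convention matters here: the Verilog wrapper takes `a` from `s_axis_tdata[31:0]` and `b` from `s_axis_tdata[63:32]`, so the low 32 bits of each 64-bit word carry one operand and the high 32 bits the other. A quick pure-Python check of the packing:

```python
# Pack 72 into the high 32 bits and 128 into the low 32 bits,
# matching the slicing done by the Verilog wrapper.
word = (72 << 32) | 128

a = word & 0xFFFFFFFF           # tdata[31:0]  -> a
b = (word >> 32) & 0xFFFFFFFF   # tdata[63:32] -> b

print(a, b)  # → 128 72
```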
We do the MM2S and S2MM DMA transfers. The MM2S channel reads the input_buffer and streams it to the GCD core. The S2MM channel reads the GCD core output and writes it to the output_buffer.
```python
# Do AXI DMA MM2S and S2MM transfer
dma_send.transfer(input_buffer)
dma_recv.transfer(output_buffer)
```
We print the GCD result from the output_buffer.
```python
# Print GCD result
print(output_buffer[0])
```
8
We can create a function to do a GCD calculation like this:
```python
# Function to calculate GCD
def calc_gcd_hw_axi_dma(ab, r):
    dma_send.transfer(ab)
    dma_recv.transfer(r)
```
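To sanity-check the hardware results on the CPU, Python's built-in `math.gcd` can compute the same value from a packed 64-bit word; the helper name `calc_gcd_sw` is just for illustration:

```python
import math

# Software equivalent of one hardware transfer: unpack the
# 64-bit word the same way the Verilog wrapper does, then
# compute the GCD on the CPU.
def calc_gcd_sw(word):
    a = word & 0xFFFFFFFF           # tdata[31:0]
    b = (word >> 32) & 0xFFFFFFFF   # tdata[63:32]
    return math.gcd(a, b)

print(calc_gcd_sw((72 << 32) | 128))  # → 8
```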
We can measure the time required to do 1 million GCD calculations. Then, we can compare this with the AXI-Lite GCD module.
```python
data_size = 1000000

input_buffer = allocate(shape=(data_size,), dtype=np.uint64)
output_buffer = allocate(shape=(data_size,), dtype=np.uint32)

# Measure the time required to do 1 million GCD calculations
t1 = time()
for i in range(data_size):
    input_buffer[i] = (3578129 << 32) | 2391065
calc_gcd_hw_axi_dma(input_buffer, output_buffer)
t2 = time()

t_diff = t2 - t1
print('Time used for HW GCD with AXI DMA: {}s'.format(t_diff))
```
Time used for HW GCD with AXI DMA: 1.1621017456054688s
The result from AXI-Lite GCD in the previous tutorial:
Time used for HW GCD: 29.42424488067627s
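The quoted speedup is simply the ratio of the two measured times:

```python
# Ratio of the AXI-Lite GCD time to the AXI-Stream (DMA) GCD time
t_axi_lite = 29.42424488067627
t_axi_dma = 1.1621017456054688

speedup = t_axi_lite / t_axi_dma
print('Speedup: {:.1f}x'.format(speedup))  # → Speedup: 25.3x
```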
Compared to the AXI-Lite GCD from the previous tutorial, the AXI-Stream GCD with AXI DMA speeds up the calculation by a factor of about 25.3.
Don’t forget to free the memory buffers to avoid memory leaks!
```python
# Delete buffers to prevent memory leaks
del input_buffer, output_buffer
```
3. Full Step-by-Step Tutorial
This video contains detailed steps for making this project.
4. Conclusion
In this tutorial, we covered the creation of a more complex AXI-Stream IP core and its integration with the AXI DMA.