This tutorial shows how to create a more complex AXI-Stream IP core in Verilog. The IP core calculates the greatest common divisor (GCD) of two numbers. Later, we compare the performance of the AXI-Lite GCD with that of the AXI-Stream GCD (with AXI DMA).
The GCD core used in this tutorial is the same as in the previous tutorial.
gcd_core.v
module gcd_core
(
input wire clk,
input wire rst_n,
input wire start,
input wire [31:0] a,
input wire [31:0] b,
output wire ready,
output wire done,
output wire [31:0] r
);
localparam S_IDLE = 2'h0,
S_OP = 2'h1;
reg [1:0] state_reg, state_next;
reg [31:0] a_reg, a_next;
reg [31:0] b_reg, b_next;
reg [4:0] n_reg, n_next;
reg done_reg, done_next;
always @(posedge clk)
begin
if (!rst_n)
begin
state_reg <= S_IDLE;
a_reg <= 0;
b_reg <= 0;
n_reg <= 0;
done_reg <= 0;
end
else
begin
state_reg <= state_next;
a_reg <= a_next;
b_reg <= b_next;
n_reg <= n_next;
done_reg <= done_next;
end
end
always @(*)
begin
state_next = state_reg;
a_next = a_reg;
b_next = b_reg;
n_next = n_reg;
done_next = 0;
case(state_reg)
S_IDLE:
begin
if (start)
begin
a_next = a;
b_next = b;
n_next = 0;
state_next = S_OP;
end
end
S_OP:
begin
if (a_reg == b_reg)
begin
a_next = a_reg << n_reg;
done_next = 1;
state_next = S_IDLE;
end
else
begin
if (!a_reg[0]) // a even
begin
a_next = {1'b0, a_reg[31:1]};
if (!b_reg[0]) // a and b even
begin
b_next = {1'b0, b_reg[31:1]};
n_next = n_reg + 1;
end
end
else // a odd
begin
if (!b_reg[0]) // b even
begin
b_next = {1'b0, b_reg[31:1]};
end
else // a and b odd
begin
if (a_reg > b_reg)
a_next = a_reg - b_reg;
else
b_next = b_reg - a_reg;
end
end
end
end
endcase
end
assign ready = (state_reg == S_IDLE) ? 1 : 0;
assign done = done_reg;
assign r = a_reg;
endmodule
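To check expected results without hardware, the S_OP state can be mirrored in software. The following is a minimal Python sketch of the core as read from the listing above, not part of the original project. The names gcd_core_model and max_cycles are made up for this illustration. In this model, a zero operand paired with a non-zero one never reaches a == b, so the loop is capped.

```python
MASK32 = 0xFFFFFFFF  # registers a_reg and b_reg are 32 bits wide


def gcd_core_model(a, b, max_cycles=1000):
    """Software model of gcd_core's S_OP state.

    Returns (result, steps), where steps is the number of register
    updates performed before a == b was detected.
    max_cycles is a hypothetical safety cap, not present in the Verilog.
    """
    a &= MASK32
    b &= MASK32
    n = 0  # counts the factors of two shared by a and b
    for step in range(max_cycles):
        if a == b:
            # Restore the shared power of two, truncated to 32 bits
            return (a << n) & MASK32, step
        if a & 1 == 0:              # a even
            a >>= 1
            if b & 1 == 0:          # a and b even
                b >>= 1
                n = (n + 1) & 0x1F  # n_reg is 5 bits wide
        elif b & 1 == 0:            # a odd, b even
            b >>= 1
        elif a > b:                 # a and b odd
            a -= b
        else:
            b -= a
    raise RuntimeError("a == b was not reached within max_cycles")


result, steps = gcd_core_model(128, 72)
print(result)  # 8
```

The halving and subtraction steps follow the binary GCD scheme, so the result for non-zero inputs can be compared against math.gcd.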
1.2. AXI-Stream Wrapper
Now we make a wrapper that gives the GCD core an AXI-Stream interface. Later, this stream interface will be connected to the AXI DMA, so the GCD core can access the PS DRAM through the AXI DMA.
This is the Verilog code for the wrapper module.
`timescale 1ns / 1ps
module axis_gcd
(
// ### Clock and reset signals #########################################
input wire aclk,
input wire aresetn,
// ### AXI4-stream slave signals #######################################
output wire s_axis_tready,
input wire [63:0] s_axis_tdata,
input wire s_axis_tvalid,
input wire s_axis_tlast,
// ### AXI4-stream master signals ######################################
input wire m_axis_tready,
output wire [31:0] m_axis_tdata,
output wire m_axis_tvalid,
output wire m_axis_tlast
);
reg [3:0] state_reg, state_next;
reg s_axis_tready_reg, s_axis_tready_next;
reg s_axis_tlast_reg, s_axis_tlast_next;
reg start_reg, start_next;
reg [31:0] a_reg, a_next;
reg [31:0] b_reg, b_next;
wire ready;
wire done;
wire [31:0] r;
// GCD core
gcd_core gcd_core_0
(
.clk(aclk),
.rst_n(aresetn),
.start(start_reg),
.a(a_reg),
.b(b_reg),
.ready(ready),
.done(done),
.r(r)
);
// State machine register
always @(posedge aclk)
begin
if (!aresetn)
begin
state_reg <= 0;
s_axis_tready_reg <= 0;
s_axis_tlast_reg <= 0;
start_reg <= 0;
a_reg <= 0;
b_reg <= 0;
end
else
begin
state_reg <= state_next;
s_axis_tready_reg <= s_axis_tready_next;
s_axis_tlast_reg <= s_axis_tlast_next;
start_reg <= start_next;
a_reg <= a_next;
b_reg <= b_next;
end
end
// State machine next value
always @(*)
begin
state_next = state_reg;
s_axis_tready_next = 0;
s_axis_tlast_next = s_axis_tlast_reg;
start_next = 0;
a_next = a_reg;
b_next = b_reg;
case (state_reg)
0: // Wait for start condition
begin
// Input data is available, and output FIFO is able to receive data
if (s_axis_tvalid && m_axis_tready)
begin
state_next = 1;
end
end
1: // Read data from AXI stream and write to local register
begin
state_next = 2;
s_axis_tready_next = 1;
s_axis_tlast_next = s_axis_tlast;
start_next = 1;
a_next = s_axis_tdata[31:0];
b_next = s_axis_tdata[63:32];
end
2: // GCD core starts running
begin
state_next = 3;
end
3: // Wait until GCD is ready
begin
// GCD ready condition
if (ready)
begin
state_next = 0;
end
end
endcase
end
// Output
assign s_axis_tready = s_axis_tready_reg;
assign m_axis_tdata = r;
assign m_axis_tvalid = done;
assign m_axis_tlast = s_axis_tlast_reg;
endmodule
1.3. System Design
This diagram shows our system. It consists of an ARM CPU, DRAM, AXI DMA, and our AXI-Stream GCD module. Our AXI-Stream GCD module is connected to the AXI DMA. Between the GCD module and the AXI DMA, we also add the AXI-Stream FIFO IP.
The following figure shows the Zynq IP high-performance port configuration. Two ports are enabled: AXI HP0 and AXI HP2.
The following figure shows the AXI DMA IP configuration. The read channel data width is 64-bit, and the write channel data width is 32-bit.
This is the final block design diagram as shown in Vivado.
2. Software Design
First, we need to create DMA, DMA send channel, and DMA receive channel objects.
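The code for this step is not reproduced here. As a rough sketch: a PYNQ DMA object exposes its two channels as the sendchannel and recvchannel attributes. The helper name get_dma_channels, the instance name axi_dma_0, and the bitstream file name in the comments are assumptions for illustration; the real names depend on the block design.

```python
def get_dma_channels(overlay, dma_name="axi_dma_0"):
    """Return (dma, dma_send, dma_recv) for the DMA named dma_name.

    dma_name is an assumption: it must match the AXI DMA instance
    name used in the Vivado block design.
    """
    dma = getattr(overlay, dma_name)
    # sendchannel is the MM2S side, recvchannel is the S2MM side
    return dma, dma.sendchannel, dma.recvchannel


# On the board (not run here), usage might look like this:
#   from pynq import Overlay
#   overlay = Overlay("axis_gcd.bit")  # hypothetical bitstream file name
#   dma, dma_send, dma_recv = get_dma_channels(overlay)
```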
Then, we need to allocate the buffers. We use the allocate() function to allocate each buffer, and a NumPy dtype to specify its element type: unsigned 64-bit integers for the input and unsigned 32-bit integers for the output.
After that, we write the input to be calculated to the input_buffer.
# Write GCD input a and b to be calculated
input_buffer[0] = (72 << 32) | 128
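The wrapper reads a from s_axis_tdata[31:0] and b from s_axis_tdata[63:32], so each 64-bit input word carries b in its upper half and a in its lower half. A small sketch of that packing follows; the helper names pack_operands and unpack_operands are made up for illustration.

```python
def pack_operands(a, b):
    """Pack two 32-bit operands into one 64-bit stream word.

    a goes to bits [31:0] and b to bits [63:32], matching how the
    wrapper slices s_axis_tdata.
    """
    if not (0 <= a <= 0xFFFFFFFF and 0 <= b <= 0xFFFFFFFF):
        raise ValueError("operands must fit in 32 bits")
    return (b << 32) | a


def unpack_operands(word):
    """Split a 64-bit stream word back into (a, b)."""
    return word & 0xFFFFFFFF, (word >> 32) & 0xFFFFFFFF


word = pack_operands(128, 72)
print(hex(word))              # 0x4800000080
print(unpack_operands(word))  # (128, 72)
```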
Next, we run the MM2S and S2MM DMA transfers. The MM2S DMA reads the input_buffer and sends it to the GCD module. The S2MM DMA takes the GCD module's output and writes it to the output_buffer.
# Do AXI DMA MM2S and S2MM transfer
dma_send.transfer(input_buffer)
dma_recv.transfer(output_buffer)
We print the GCD result from the output_buffer.
# Print GCD result
print(output_buffer[0])
8
We can create a function to do a GCD calculation like this:
# Function to calculate GCD
def calc_gcd_hw_axi_dma(ab, r):
    dma_send.transfer(ab)
    dma_recv.transfer(r)
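In PYNQ, a channel's transfer() call starts the transfer and returns, while wait() blocks until that channel has finished. The listing above shows no wait, so if none is issued elsewhere, a blocking variant could look like the sketch below. The function name calc_gcd_blocking and passing the channels as arguments are choices made for this illustration.

```python
def calc_gcd_blocking(send, recv, ab, r):
    """Start both DMA transfers, then block until both have finished.

    send and recv are expected to behave like PYNQ DMA channels,
    i.e. to provide transfer(buffer) and wait().
    """
    send.transfer(ab)  # MM2S: memory to GCD module
    recv.transfer(r)   # S2MM: GCD module to memory
    send.wait()
    recv.wait()        # after this returns, r holds the results


# On the board (not run here), usage might look like this:
#   calc_gcd_blocking(dma_send, dma_recv, input_buffer, output_buffer)
```

Waiting on the receive channel before reading the output buffer ensures the results have been written, at the cost of including the wait in any timing measurement.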
We can measure the time required to do 1 million GCD calculations. Then, we can compare this with the AXI-Lite GCD module.
from time import time

data_size = 1000000
input_buffer = allocate(shape=(data_size,), dtype=np.uint64)
output_buffer = allocate(shape=(data_size,), dtype=np.uint32)
# Measure the time required for 1 million GCD calculations
t1 = time()
for i in range(data_size):
    input_buffer[i] = (3578129 << 32) | 2391065
# A single DMA transfer processes the whole buffer
calc_gcd_hw_axi_dma(input_buffer, output_buffer)
t2 = time()
t_diff = t2 - t1
print('Time used for HW GCD with AXI DMA: {}s'.format(t_diff))
Time used for HW GCD with AXI DMA: 1.1621017456054688s
For comparison, this is the result from the AXI-Lite GCD in the previous tutorial:
Time used for HW GCD: 29.42424488067627s
Compared to the AXI-Lite GCD, the AXI-Stream GCD with AXI DMA is about 25.3 times faster (29.42 s versus 1.16 s).
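The ratio of the two measured times can be checked directly. The short calculation below uses only the two timings printed above; the variable names are chosen for this illustration.

```python
# Measured times from the two printed results above, in seconds
t_axi_lite = 29.42424488067627
t_axi_dma = 1.1621017456054688
data_size = 1000000

speedup = t_axi_lite / t_axi_dma
print("Speedup: {:.2f}x".format(speedup))  # Speedup: 25.32x

# Average time per GCD calculation, in microseconds
us_lite = t_axi_lite / data_size * 1e6
us_dma = t_axi_dma / data_size * 1e6
print("AXI-Lite: {:.2f} us per GCD".format(us_lite))  # 29.42 us
print("AXI DMA:  {:.2f} us per GCD".format(us_dma))   # 1.16 us
```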
Don’t forget to free the memory buffers to avoid memory leaks!
# Delete buffer to prevent memory leak
del input_buffer, output_buffer
3. Full Step-by-Step Tutorial
This video contains detailed steps for making this project.
4. Conclusion
In this tutorial, we created a more complex AXI-Stream IP core and integrated it with the AXI DMA.