Part 6: AXI-Stream GCD with DMA
Objective
This tutorial shows how to create a more complex AXI-Stream IP core in Verilog. The IP core computes the greatest common divisor (GCD) of two numbers. Afterwards, we compare the performance of the AXI-Lite GCD with that of the AXI-Stream GCD (with AXI DMA).
References
Pong P. Chu, Embedded SoPC Design with Nios II Processor and Verilog Examples, https://onlinelibrary.wiley.com/doi/book/10.1002/9781118309728
Source Code
This repository contains all of the code required in order to follow this tutorial.
1. Hardware Design
1.1. RTL Design of GCD
The GCD core used in this tutorial is the same as in the previous tutorial.
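The RTL itself is not repeated here, so as a reference point the following is a minimal software model of a 32-bit GCD datapath. It assumes a binary (shift-and-subtract) GCD, which is one common hardware-friendly formulation; the actual core from the previous tutorial may use a different algorithm. The name gcd_model is hypothetical.

```python
def gcd_model(a: int, b: int) -> int:
    """Software reference model of a 32-bit GCD datapath.

    Assumption: binary GCD (shifts and subtractions only), which maps
    well to hardware. The real RTL core may be implemented differently.
    """
    a &= 0xFFFFFFFF  # model 32-bit operand registers
    b &= 0xFFFFFFFF
    if a == 0:
        return b
    if b == 0:
        return a
    shift = 0
    # Factor out the powers of two common to both operands.
    while ((a | b) & 1) == 0:
        a >>= 1
        b >>= 1
        shift += 1
    # Remove remaining factors of two from a so that it is odd.
    while (a & 1) == 0:
        a >>= 1
    while b != 0:
        # Remove factors of two from b; they cannot be common any more.
        while (b & 1) == 0:
            b >>= 1
        # Keep a <= b, then subtract the smaller from the larger.
        if a > b:
            a, b = b, a
        b -= a
    # Restore the common powers of two.
    return a << shift
```

A model like this is handy for checking the values read back from the hardware against a known-good software result.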
1.2. AXI-Stream Wrapper
Next, we create a wrapper that gives the GCD core an AXI-Stream interface. Later, this stream interface will be connected to the AXI DMA so that the GCD core can access the PS DRAM through it.
This is the Verilog code for the wrapper module.
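Alongside the Verilog, the data flow through the wrapper can be sketched as a behavioral model in Python. This is only an illustration under stated assumptions: each 64-bit input beat is assumed to carry operand A in bits [31:0] and operand B in bits [63:32], each beat is assumed to produce one 32-bit result, and TLAST is assumed to be asserted on the final result. The function name axis_gcd_model is hypothetical, and math.gcd stands in for the GCD core.

```python
from math import gcd
from typing import Iterable, Iterator, Tuple

MASK32 = 0xFFFFFFFF


def axis_gcd_model(words: Iterable[int]) -> Iterator[Tuple[int, bool]]:
    """Yield one (tdata, tlast) output beat per 64-bit input beat.

    Assumed layout (not confirmed by the tutorial text):
      bits [31:0]  -> operand A
      bits [63:32] -> operand B
    """
    words = list(words)
    for i, word in enumerate(words):
        a = word & MASK32
        b = (word >> 32) & MASK32
        is_last = i == len(words) - 1  # model TLAST on the final beat
        yield gcd(a, b), is_last
```

For example, feeding the model two beats holding the pairs (48, 18) and (17, 5) yields the results 6 and 1, with the last flag set only on the second beat.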
1.3. System Design
This diagram shows our system. It consists of an ARM CPU, DRAM, AXI DMA, and our AXI-Stream GCD module. Our AXI-Stream GCD module is connected to the AXI DMA. Between the GCD module and the AXI DMA, we also add an AXI-Stream FIFO IP.

The following figure shows the Zynq IP high-performance port configuration. Two ports are enabled: AXI HP0 and AXI HP2.

The following figure shows the AXI DMA IP configuration. The read channel data width is 64-bit, and the write channel data width is 32-bit.
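To see what these widths mean for buffer sizes, the short sketch below computes how many bytes move in each direction. It assumes that every GCD operation consumes exactly one 64-bit beat on the read (MM2S) channel and produces exactly one 32-bit beat on the write (S2MM) channel; the function name transfer_sizes is hypothetical.

```python
MM2S_WIDTH_BITS = 64  # read channel: DRAM -> GCD module
S2MM_WIDTH_BITS = 32  # write channel: GCD module -> DRAM


def transfer_sizes(num_ops: int) -> tuple:
    """Return (bytes_read, bytes_written) for num_ops GCD operations.

    Assumption: one input beat and one output beat per operation.
    """
    bytes_read = num_ops * MM2S_WIDTH_BITS // 8
    bytes_written = num_ops * S2MM_WIDTH_BITS // 8
    return bytes_read, bytes_written
```

Under that assumption, a single GCD needs 8 bytes of input and 4 bytes of output.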

This is the final block design diagram as shown in Vivado.

2. Software Design
First, we need to create DMA, DMA send channel, and DMA receive channel objects.
Then, we need to allocate the buffers. We use the allocate() function to allocate them, and NumPy is used to specify the element type of each buffer: 64-bit unsigned integer for the input and 32-bit unsigned integer for the output.
After that, we write the input to be calculated to the input_buffer.
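Because the input buffer holds 64-bit words while the GCD takes two operands, the two values have to be combined into one word before they are written. The helper below is a minimal sketch that assumes operand A sits in the lower 32 bits and operand B in the upper 32 bits; the real layout is defined by the wrapper RTL, and the names pack_operands and unpack_operands are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def pack_operands(a: int, b: int) -> int:
    """Pack two 32-bit operands into one 64-bit word.

    Assumed layout: a in bits [31:0], b in bits [63:32].
    """
    if not (0 <= a <= MASK32 and 0 <= b <= MASK32):
        raise ValueError("operands must fit in 32 unsigned bits")
    return (b << 32) | a


def unpack_operands(word: int) -> tuple:
    """Inverse of pack_operands: return (a, b) from a 64-bit word."""
    return word & MASK32, (word >> 32) & MASK32
```

With such a helper, writing the input could look like input_buffer[0] = pack_operands(48, 18), provided the layout assumption matches the hardware.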
We do the MM2S and S2MM DMA transfers. The MM2S DMA reads the input_buffer and then sends it to the GCD module. The S2MM DMA reads the GCD module output and then sends it to the output_buffer.
We print the GCD result from the output_buffer.
We can create a function to do a GCD calculation like this:
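One possible shape for such a function is sketched below. It is not the tutorial's original code: the name hw_gcd is hypothetical, the operand layout (A in the lower 32 bits, B in the upper 32 bits) is an assumption, and the DMA object and buffers are passed in as parameters. The sendchannel/recvchannel transfer() and wait() calls follow the PYNQ DMA driver interface.

```python
def hw_gcd(dma, input_buffer, output_buffer, a: int, b: int) -> int:
    """Run one GCD on the hardware through the AXI DMA (sketch).

    dma           : PYNQ DMA object exposing sendchannel and recvchannel
    input_buffer  : one-element uint64 buffer from allocate()
    output_buffer : one-element uint32 buffer from allocate()
    Assumed layout: a in bits [31:0], b in bits [63:32] of the input word.
    """
    # Write the packed operands into the DMA-visible input buffer.
    input_buffer[0] = ((b & 0xFFFFFFFF) << 32) | (a & 0xFFFFFFFF)
    # Start the MM2S (send) and S2MM (receive) transfers.
    dma.sendchannel.transfer(input_buffer)
    dma.recvchannel.transfer(output_buffer)
    # Block until both channels report completion.
    dma.sendchannel.wait()
    dma.recvchannel.wait()
    return int(output_buffer[0])
```

Passing the DMA object and buffers as arguments keeps the buffers allocated once and reused across calls, which matters when the function is called many times in a timing loop.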
We can measure the time required to do 1 million GCD calculations. Then, we can compare this with the AXI-Lite GCD module.
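A simple way to take such a measurement is sketched here with time.perf_counter. The helper names time_calls and speedup are hypothetical, and any callable taking two operands can be timed, for example a software GCD or a hardware GCD function.

```python
import time


def time_calls(fn, pairs) -> float:
    """Return the elapsed seconds for calling fn(a, b) on every pair."""
    start = time.perf_counter()
    for a, b in pairs:
        fn(a, b)
    return time.perf_counter() - start


def speedup(baseline_seconds: float, accelerated_seconds: float) -> float:
    """Ratio of baseline time to accelerated time (higher is faster)."""
    return baseline_seconds / accelerated_seconds
```

For instance, time_calls(math.gcd, [(48, 18)] * 1_000_000) would time one million software GCD calls on the same operand pair, and the same list can then be passed to the hardware version for a like-for-like comparison.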
The result from AXI-Lite GCD in the previous tutorial:
Compared to the software GCD, the hardware GCD speeds up the calculation by a factor of 25.31.
Don't forget to free the memory buffers to avoid memory leaks!
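One way to make sure the buffers are always released, even if an exception occurs in between, is to wrap them in a small context manager. This is a sketch rather than part of the original tutorial: the name dma_buffers is hypothetical, and it relies only on the freebuffer() method that PYNQ buffers provide.

```python
from contextlib import contextmanager


@contextmanager
def dma_buffers(*buffers):
    """Yield the given buffers and free every one of them on exit.

    Each buffer is expected to offer freebuffer(), as buffers returned
    by PYNQ's allocate() do.
    """
    try:
        yield buffers
    finally:
        # Runs on normal exit and on exceptions alike.
        for buf in buffers:
            buf.freebuffer()
```

It could be used as with dma_buffers(allocate(shape=(1,), dtype=np.uint64), allocate(shape=(1,), dtype=np.uint32)) as (input_buffer, output_buffer): followed by the transfer code in the body of the with block.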
3. Full Step-by-Step Tutorial
This video contains detailed steps for making this project.
4. Conclusion
In this tutorial, we covered the creation of a more complex AXI-Stream-based IP core integrated with the AXI DMA.