# Part 4: AXI-Stream Multiplier with DMA

## Objective

This tutorial shows how to create a simple AXI-Stream IP core in Verilog. The IP core performs a simple multiplication. We use the AXI DMA to write to and read from the AXI-Stream multiplier, and then compare the performance of this AXI DMA multiplier with the AXI-Lite multiplier from the previous tutorial.

## Source Code

This repository contains all of the code required to follow this tutorial.

{% embed url="<https://github.com/weenslab/pynq102/tree/main>" %}

***

## 1. Hardware Design

### 1.1. Stream Interface

Unlike the memory-mapped interface, the stream interface does not have an address. It is mainly used for point-to-point data transfer between IP modules in the FPGA. In the AXI-Stream interface, the sender is known as the master and the receiver as the slave. Data moves in one direction only, from the master port to the slave port.

<figure><img src="https://4146991827-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FIsb2SAYKLkGlVOGOY0EE%2Fuploads%2F95VwDGBfYS2pK6CQyecO%2Faxi_stream_master_slave.jpg?alt=media&#x26;token=40f3fe13-8f65-48a6-b82e-6616ddc90456" alt="" width="563"><figcaption></figcaption></figure>

### 1.2. RTL Design of Multiplier

This is the same RTL module that did simple multiplication in the previous tutorial. It has one 32-bit input `a` and one 32-bit output `r`. Every input will be multiplied by `8` to produce the output.

{% code title="mult\_core.v" lineNumbers="true" %}

```verilog
module mult_core
    (
        input wire [31:0]  a,
        output wire [31:0] r    
    );
    
    assign r = a * 8;
    
endmodule
```

{% endcode %}
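Because both ports are 32 bits wide, the product is truncated to 32 bits, so inputs of 2^29 or larger wrap around. The following is a minimal Python sketch of that behavior. The name `mult_core_model` is a hypothetical helper used only for illustration and is not part of the tutorial repository.

```python
MASK_32 = 0xFFFFFFFF  # the output port r is 32 bits wide

def mult_core_model(a):
    """Software model of mult_core: r = a * 8, truncated to 32 bits."""
    return (a * 8) & MASK_32

print(mult_core_model(10))       # 80
print(mult_core_model(3578129))  # 28625032
print(mult_core_model(2**29))    # 0, because 2**32 does not fit in 32 bits
```

A model like this can also serve as a software reference when checking the values returned by the hardware.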

### 1.3. AXI-Stream Wrapper

Now, we are going to make a wrapper for the AXI-Stream interface. Later, this stream interface will be connected to the AXI DMA so that the multiplier core can access the PS DRAM.

This is the Verilog code for the wrapper module.

{% code title="axis\_mult.v" lineNumbers="true" %}

```verilog
module axis_mult
    (
        // ### Clock and reset signals #########################################
        input  wire        aclk,
        input  wire        aresetn,
        // ### AXI4-stream slave signals #######################################
        output wire        s_axis_tready,
        input wire [31:0]  s_axis_tdata,
        input wire         s_axis_tvalid,
        input wire         s_axis_tlast,
        // ### AXI4-stream master signals ######################################
        input wire         m_axis_tready,
        output wire [31:0] m_axis_tdata,
        output wire        m_axis_tvalid,
        output wire        m_axis_tlast
    );
    
    assign s_axis_tready = m_axis_tready;
    assign m_axis_tvalid = s_axis_tvalid;
    assign m_axis_tlast = s_axis_tlast;
    
    mult_core mult_core_0
    (
        .a(s_axis_tdata),
        .r(m_axis_tdata)    
    );
   
endmodule
```

{% endcode %}

The AXI-Stream slave port is the input port for the data to the multiplier core, and the AXI-Stream master port is the output port for the data from the multiplier core. The `s_axis_*` indicates the slave port, and the `m_axis_*` indicates the master port. Every AXI-Stream port usually has the following signals:

* The `tready` signal indicates that a receiver can accept a transfer.
* The `tdata` is the primary signal used to provide the data that is passing across the interface.
* The `tvalid` signal indicates the sender is driving a valid transfer. A transfer takes place when both `tvalid` and `tready` are one.
* The `tlast` signal indicates the boundary of a packet.

Since our multiplier is a simple combinational circuit, we can connect the control signals (`tready`, `tvalid`, `tlast`) directly between the slave port and the master port.
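To make the handshake concrete, here is a simplified cycle-by-cycle sketch in Python of the pass-through wrapper. It is an illustrative model, not a replacement for RTL simulation. The function name `simulate_axis_mult` and the way the upstream sender is modeled (it asserts `tvalid` whenever it still has data, and marks the final word with `tlast`) are assumptions made for this example.

```python
from collections import deque

def simulate_axis_mult(words, ready_pattern):
    """Cycle-by-cycle model of the pass-through wrapper.

    words: input words the upstream sender wants to transmit, in order.
    ready_pattern: value of m_axis_tready (0 or 1) for each clock cycle.
    Returns one (cycle, m_axis_tdata, m_axis_tlast) tuple per transfer.
    """
    pending = deque(words)
    transfers = []
    for cycle, m_axis_tready in enumerate(ready_pattern):
        s_axis_tvalid = 1 if pending else 0  # sender offers data while it has any
        m_axis_tvalid = s_axis_tvalid        # assign m_axis_tvalid = s_axis_tvalid
        # A transfer takes place only when tvalid and tready are both 1.
        # s_axis_tready mirrors m_axis_tready, so both ports transfer together.
        if m_axis_tvalid and m_axis_tready:
            m_axis_tlast = 1 if len(pending) == 1 else 0  # last word of the packet
            s_axis_tdata = pending.popleft()
            m_axis_tdata = (s_axis_tdata * 8) & 0xFFFFFFFF
            transfers.append((cycle, m_axis_tdata, m_axis_tlast))
    return transfers

print(simulate_axis_mult([1, 2, 3], [1, 0, 1, 1]))
# [(0, 8, 0), (2, 16, 0), (3, 24, 1)]
```

In cycle 1 the receiver is not ready, so no transfer occurs and the sender keeps the second word until cycle 2.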

### 1.4. System Design

This diagram shows our system. It consists of an ARM CPU, DRAM, AXI DMA, and our AXI-Stream multiplier module. Our AXI-Stream multiplier module is connected to the AXI DMA. Between the multiplier module and AXI DMA, we also add AXI-Stream FIFO IP.

<figure><img src="https://4146991827-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FIsb2SAYKLkGlVOGOY0EE%2Fuploads%2FIsqIq9rtl9l3njtPBle3%2Faxi_dma_multiplier_block_diagram.jpg?alt=media&#x26;token=e78caf5f-a9f3-4271-8897-a216787d6d90" alt=""><figcaption></figcaption></figure>

The following figure shows the Zynq IP high-performance port configuration. Two ports are enabled: AXI HP0 and AXI HP2.

<figure><img src="https://4146991827-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FIsb2SAYKLkGlVOGOY0EE%2Fuploads%2FdVO6KzmndeRnDRIZ6Xbm%2Fzynq-ip-config-axi-dma-multiplier.png?alt=media&#x26;token=f488378c-6a25-4875-b694-e4a1470804b3" alt=""><figcaption></figcaption></figure>

The following figure shows the AXI DMA IP configuration. The data width of both the read and write channels is 32 bits.

<figure><img src="https://4146991827-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FIsb2SAYKLkGlVOGOY0EE%2Fuploads%2FWybbQ9KSexQRcOgAADF4%2Faxi-dma-ip-config-axi-dma-multiplier.png?alt=media&#x26;token=88d3c999-8eee-498b-8c74-2f345471de54" alt=""><figcaption></figcaption></figure>
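One AXI DMA setting worth checking is the width of the buffer length register, because later in this tutorial a single transfer carries one million 32-bit words. The sketch below estimates the width needed, assuming a single transfer is limited to 2^w - 1 bytes for a register width of w bits and that the smallest selectable width is 8. Both are assumptions to verify against your own IP configuration and the AXI DMA product guide. The helper name `min_length_width` is hypothetical.

```python
def min_length_width(num_words, bytes_per_word=4, min_width=8):
    """Smallest width w (in bits) such that the transfer fits in 2**w - 1 bytes.

    Assumption: one DMA transfer can move at most 2**w - 1 bytes, and
    min_width is the smallest width the IP allows.
    """
    nbytes = num_words * bytes_per_word
    # int.bit_length() is the smallest w with nbytes <= 2**w - 1
    return max(min_width, nbytes.bit_length())

print(min_length_width(1))        # 8
print(min_length_width(1000000))  # 22
```

Under these assumptions, a one-million-word transfer (4,000,000 bytes) needs a buffer length register of at least 22 bits.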

This is the final block design diagram as shown in Vivado.

<figure><img src="https://4146991827-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FIsb2SAYKLkGlVOGOY0EE%2Fuploads%2FxVNGIsb1UnZf9fCKOT83%2Fblock-design-axi-dma-multiplier.png?alt=media&#x26;token=ea6d1dcd-ecbf-4870-880f-ddc2a744eef0" alt=""><figcaption></figcaption></figure>

## 2. Software Design

First, we need to create DMA, DMA send channel, and DMA receive channel objects.

```python
# Access to AXI DMA
dma = overlay.axi_dma_0
dma_send = overlay.axi_dma_0.sendchannel
dma_recv = overlay.axi_dma_0.recvchannel
```

Then, we need to allocate the buffers. We use the `allocate()` function to allocate each buffer, and a NumPy dtype to specify the element type, which is a 32-bit unsigned integer in this case.

```python
from pynq import allocate
import numpy as np

# Allocate physical memory for AXI DMA
data_size = 1
input_buffer = allocate(shape=(data_size,), dtype=np.uint32)
output_buffer = allocate(shape=(data_size,), dtype=np.uint32)
```

Next, write the input to be multiplied to the `input_buffer`.

```python
# Write input to be multiplied
input_buffer[0] = 10
```

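We do the MM2S and S2MM DMA transfers. The MM2S channel reads the `input_buffer` from memory and sends it to the multiplier. The S2MM channel receives the multiplier output and writes it to the `output_buffer`. We then wait for both channels to finish before reading the result.

```python
# Do AXI DMA MM2S and S2MM transfer
dma_send.transfer(input_buffer)
dma_recv.transfer(output_buffer)
# Block until both channels have completed
dma_send.wait()
dma_recv.wait()
```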
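The data path of these two transfers can be sketched in plain Python. This is a simplified software model that ignores the FIFOs and timing, and the function names `mm2s`, `axis_mult`, and `s2mm` are hypothetical stand-ins for the hardware blocks, not PYNQ APIs. It shows how the last word of the packet is marked with `tlast`, which the receiving side uses as the end of the packet.

```python
def mm2s(buffer):
    """Model of the MM2S channel: read memory and emit (tdata, tlast) words."""
    last_index = len(buffer) - 1
    for i, word in enumerate(buffer):
        yield word, i == last_index

def axis_mult(stream):
    """Model of the multiplier: scale tdata by 8 and pass tlast through."""
    for tdata, tlast in stream:
        yield (tdata * 8) & 0xFFFFFFFF, tlast

def s2mm(stream, buffer):
    """Model of the S2MM channel: write words to memory until tlast is seen."""
    count = 0
    for i, (tdata, tlast) in enumerate(stream):
        buffer[i] = tdata
        count = i + 1
        if tlast:
            break
    return count

input_buffer = [10, 20, 30]
output_buffer = [0] * len(input_buffer)
words_written = s2mm(axis_mult(mm2s(input_buffer)), output_buffer)
print(words_written, output_buffer)  # 3 [80, 160, 240]
```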

We print the multiplication result from the `output_buffer`.

```python
# Print multiplication result
print(output_buffer[0])
```

```
80
```

We can wrap the transfers in a function like this:

```python
# Function to calculate multiplication
def calc_mult_axi_dma(a, r):
    dma_send.transfer(a)
    dma_recv.transfer(r)
    # Block until both channels have completed
    dma_send.wait()
    dma_recv.wait()
```

We can measure the time required to do 1 million multiplications and then compare it with the AXI-Lite multiplier module. Note that the measured time also includes the Python loop that fills the input buffer.

```python
from time import time

data_size = 1000000
input_buffer = allocate(shape=(data_size,), dtype=np.uint32)
output_buffer = allocate(shape=(data_size,), dtype=np.uint32)

# Measure the time required to calculate 1 million multiplications
t1 = time()
for i in range(data_size):
    input_buffer[i] = 3578129
calc_mult_axi_dma(input_buffer, output_buffer)
t2 = time()
t_diff = t2 - t1
print('Time used for AXI DMA multiplier: {}s'.format(t_diff))
```

```
Time used for AXI DMA multiplier: 1.1429970264434814s
```

For comparison, this is the result from the AXI-Lite multiplier in the previous tutorial:

```
Time used for AXI lite multiplier: 15.767791271209717s
```

Compared to the AXI-Lite multiplier, the AXI-Stream multiplier (with AXI DMA) is faster by a factor of about 13.8 (15.77 s versus 1.14 s).
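The speedup can be reproduced from the two measured times above. This short sketch only repeats the arithmetic; the variable names are chosen for illustration.

```python
t_axi_lite = 15.767791271209717  # seconds, AXI-Lite multiplier (previous tutorial)
t_axi_dma = 1.1429970264434814   # seconds, AXI DMA multiplier (this tutorial)
num_mults = 1000000

speedup = t_axi_lite / t_axi_dma
print('Speedup: {:.1f}x'.format(speedup))  # Speedup: 13.8x

# Average time per multiplication in microseconds
print('AXI-Lite: {:.2f} us'.format(t_axi_lite / num_mults * 1e6))  # AXI-Lite: 15.77 us
print('AXI DMA:  {:.2f} us'.format(t_axi_dma / num_mults * 1e6))   # AXI DMA:  1.14 us
```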

Don’t forget to free the memory buffers to avoid memory leaks!

```python
# Delete buffer to prevent memory leak
del input_buffer, output_buffer
```

## 3. Full Step-by-Step Tutorial

This video contains detailed steps for making this project.

{% embed url="<https://youtu.be/ePo1Rr8jbWA>" %}

## 4. Conclusion

In this tutorial, we covered the basics of creating an AXI-Stream IP core and integrating it with the AXI DMA.


