Part 5: AXI-Lite GCD

Objective

This tutorial shows how to create a more complex AXI-Lite IP core in Verilog. The IP core calculates the greatest common divisor (GCD) of two numbers. Later, we can compare the performance of the AXI-Lite GCD with that of the AXI-Stream GCD (with AXI DMA).

References

Source Code

This repository contains all of the code required in order to follow this tutorial.


1. Hardware Design

1.1. RTL Design of GCD

GCD Algorithm

The GCD algorithm in this tutorial is based on the code from the book Embedded SoPC Design with Nios II Processor and Verilog Examples, pages 663-679, by Pong P. Chu.

The GCD of two numbers a and b, written gcd(a, b), is the largest number that divides both a and b without remainder. For example, gcd(35, 25) is 5, and gcd(128, 72) is 8.

We will use the binary GCD algorithm, shown in the following equations. This algorithm uses only subtraction and divide-by-2 operations. Its six equations are applied repeatedly until a is equal to b (equation 1).
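Reading the cases off the worked example and the software description in this section, the six equations can be written as below. The numbering is inferred from how the example applies equations 1, 3, 4, and 6 and from the description of equation 2; equation 5 is assumed by symmetry with equation 6. Treat this as a reconstruction, not a quotation from the book.

```latex
\gcd(a, b) =
\begin{cases}
a                     & \text{(1) if } a = b \\
2\,\gcd(a/2,\; b/2)   & \text{(2) if } a \neq b \text{ and } a, b \text{ are both even} \\
\gcd(a,\; b/2)        & \text{(3) if } a \text{ is odd and } b \text{ is even} \\
\gcd(a/2,\; b)        & \text{(4) if } a \text{ is even and } b \text{ is odd} \\
\gcd(a - b,\; b)      & \text{(5) if } a > b \text{ and } a, b \text{ are both odd} \\
\gcd(a,\; b - a)      & \text{(6) if } a < b \text{ and } a, b \text{ are both odd}
\end{cases}
```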

For example, gcd(24, 15) is calculated step by step as follows:

  1. gcd(24, 15), apply equation 4, then the result is,

  2. gcd(12, 15), apply equation 4, then the result is,

  3. gcd(6, 15), apply equation 4, then the result is,

  4. gcd(3, 15), apply equation 6, then the result is,

  5. gcd(3, 12), apply equation 3, then the result is,

  6. gcd(3, 6), apply equation 3, then the result is,

  7. gcd(3, 3), apply equation 1, then the result is,

  8. 3

Software Implementation

Now that we have the GCD algorithm, the next step is to implement this in Python code. Later on, we can use this Python code for verification of the hardware implementation and performance comparison between hardware and software.

In equation 2, we should multiply the GCD result by 2. In the Python code, this is implemented by counting the occurrences of this condition in a variable, n. At the end, in line 20, we multiply the result by 2^n. Note that multiplying by 2 is the same as shifting left by one bit, so multiplying by 2^n is the same as shifting left n times.
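As a rough illustration of the counting and shifting described above, here is a minimal Python sketch. It is not the tutorial's original listing, so its line numbers do not correspond to the ones mentioned in the text. The function name gcd_binary is hypothetical, and the sketch assumes both inputs are positive integers.

```python
def gcd_binary(a: int, b: int) -> int:
    """Binary GCD sketch using only subtraction and shifts.

    Assumes a and b are positive integers (an input of 0 would never
    satisfy a == b with this update rule).
    """
    if a <= 0 or b <= 0:
        raise ValueError("a and b must be positive integers")

    n = 0  # how many times both operands were even (equation 2)
    while a != b:
        a_even = (a & 1) == 0
        b_even = (b & 1) == 0
        if a_even and b_even:
            # Both even: halve both and remember one factor of 2.
            a >>= 1
            b >>= 1
            n += 1
        elif b_even:
            # a odd, b even: halve b.
            b >>= 1
        elif a_even:
            # a even, b odd: halve a.
            a >>= 1
        elif a > b:
            # Both odd, a larger: subtract.
            a -= b
        else:
            # Both odd, b larger: subtract.
            b -= a

    # Multiply by 2^n by shifting left n times.
    return a << n
```

Calling gcd_binary(128, 72) halves both operands three times before either becomes odd, so n ends at 3 and the final shift restores the factor of 8.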

Hardware GCD Core

Now that we have the Python program for the GCD algorithm, we can build the Verilog code based on it. The ASMD chart of the GCD algorithm is shown in the following figure.

The state machine has two states: S_IDLE and S_OP. In the S_IDLE state, it waits until the start signal is equal to 1. In this state, the ready signal is also 1, meaning that the core is ready to accept new inputs. In the S_OP state, it calculates the GCD using the same algorithm as in the Python code. After the calculation is finished, it goes back to the S_IDLE state.
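A cycle-level Python model can make the start/ready handshake easier to follow. The sketch below is not the Verilog core. The class name GcdCoreModel, its clock() method, the run_core() helper, and the choice to apply the final shift in a single clock are all assumptions made for illustration; the real core may spend a different number of cycles.

```python
S_IDLE, S_OP = "S_IDLE", "S_OP"


class GcdCoreModel:
    """Hypothetical behavioral model of a two-state GCD state machine."""

    def __init__(self):
        self.state = S_IDLE
        self.a = 0  # working register, also holds the result r
        self.b = 0
        self.n = 0  # count of common factors of 2

    @property
    def ready(self) -> int:
        # ready is 1 only while the machine is idle.
        return 1 if self.state == S_IDLE else 0

    @property
    def r(self) -> int:
        return self.a

    def clock(self, start: int = 0, a_in: int = 0, b_in: int = 0) -> None:
        """Advance the model by one clock edge."""
        if self.state == S_IDLE:
            if start:
                # Latch the inputs and begin the calculation.
                self.a, self.b, self.n = a_in, b_in, 0
                self.state = S_OP
            return

        # S_OP: apply one update per clock.
        if self.a == self.b:
            self.a <<= self.n  # assumption: final shift done in one clock
            self.state = S_IDLE
        elif self.a % 2 == 0 and self.b % 2 == 0:
            self.a >>= 1
            self.b >>= 1
            self.n += 1
        elif self.b % 2 == 0:
            self.b >>= 1
        elif self.a % 2 == 0:
            self.a >>= 1
        elif self.a > self.b:
            self.a -= self.b
        else:
            self.b -= self.a


def run_core(core: GcdCoreModel, a: int, b: int):
    """Pulse start, then clock until ready returns to 1.

    Returns (result, clocks spent in S_OP). Inputs must be positive.
    """
    if a <= 0 or b <= 0:
        raise ValueError("a and b must be positive integers")
    core.clock(start=1, a_in=a, b_in=b)
    cycles = 0
    while not core.ready:
        core.clock()
        cycles += 1
    return core.r, cycles
```

In this model, ready drops to 0 on the clock where start is sampled and returns to 1 on the clock where a equals b, which mirrors the behavior described for the two states.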

This is the Verilog code for the GCD core.

Simulation Result

The timing diagram of the GCD core is shown in the following figure. In the S_IDLE state, the ready signal is 1. It indicates that the GCD core is ready to accept new inputs. When the GCD core is in the process of calculating GCD, the ready signal is 0, and it goes back to 1 when the process is finished.

The testbench file of this simulation is shown below:

1.2. AXI-Lite Wrapper

Now that we have the GCD core module, the next step is to create an AXI-Lite wrapper. The block diagram of the wrapper is shown in the following figure.

This is the Verilog code for the wrapper module.

1.3. System Design

This diagram shows our system. It consists of an ARM CPU, DRAM, and our AXI-Lite GCD module. Our AXI-Lite module is connected to the ARM CPU via the AXI interconnect.

This is the final block design diagram as shown in Vivado.

We can change the memory-mapped base address of this AXI-Lite GCD module in the Address Editor:

2. Software Design

Our AXI-Lite GCD module is connected to the ARM CPU. The CPU can access the module using memory mapping. This is done using the MMIO object from the PYNQ library.

To write data to inputs a and b of the GCD module, we can use the write() method from the MMIO object.

We start the GCD calculation by writing one to the start bit. Then, we read and wait until the ready flag is one, indicating that the calculation is complete.

To read data from output r of the GCD module, we can use the read() method from the MMIO object.

We can create a function to do a GCD calculation like this:
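The write, start, poll, and read sequence described above might be wrapped as in the following sketch. The register offsets (REG_CTRL, REG_STATUS, REG_A, REG_B, REG_R) are hypothetical and must be replaced with the offsets decoded by the actual wrapper. FakeGcdMMIO is a stand-in invented here so the example runs without a board; it only mimics the read(offset) and write(offset, value) calls that the PYNQ MMIO object provides.

```python
import math

# Hypothetical register map; the real offsets depend on the wrapper.
REG_CTRL = 0x00    # bit 0: start
REG_STATUS = 0x04  # bit 0: ready
REG_A = 0x08       # input a
REG_B = 0x0C       # input b
REG_R = 0x10       # output r


class FakeGcdMMIO:
    """Stand-in for a memory-mapped GCD module (not part of PYNQ)."""

    def __init__(self):
        self.regs = {REG_A: 0, REG_B: 0, REG_R: 0}

    def write(self, offset: int, value: int) -> None:
        if offset == REG_CTRL:
            if value & 1:
                # Model the hardware finishing instantly.
                self.regs[REG_R] = math.gcd(self.regs[REG_A], self.regs[REG_B])
        else:
            self.regs[offset] = value & 0xFFFFFFFF

    def read(self, offset: int) -> int:
        if offset == REG_STATUS:
            return 1  # always ready in this simplified model
        return self.regs.get(offset, 0)


def hw_gcd(mmio, a: int, b: int) -> int:
    """Run one GCD calculation through a memory-mapped interface."""
    mmio.write(REG_A, a)     # write input a
    mmio.write(REG_B, b)     # write input b
    mmio.write(REG_CTRL, 1)  # set the start bit
    # Poll until the ready flag is 1.
    while (mmio.read(REG_STATUS) & 1) == 0:
        pass
    return mmio.read(REG_R)  # read output r


result = hw_gcd(FakeGcdMMIO(), 35, 25)
```

On the board, the same hw_gcd function would presumably be called with an MMIO object created from the base address and range shown in the Address Editor. Whether the start bit must be cleared by software afterwards depends on the wrapper, so this sketch leaves it out.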

We can measure the time required to do 1 million calculations. Later, we can compare this with the AXI-Stream GCD (with DMA) module.
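One way to take such a measurement is sketched below, using time.perf_counter from the standard library. The helper names make_pairs and time_gcd are hypothetical, the input range is an arbitrary choice, and math.gcd is used only as a placeholder for whichever GCD function (hardware or software) is being timed.

```python
import math
import random
import time


def make_pairs(count: int, seed: int = 0, max_value: int = 2**16):
    """Generate reproducible positive input pairs (range is an assumption)."""
    rng = random.Random(seed)
    return [(rng.randint(1, max_value), rng.randint(1, max_value))
            for _ in range(count)]


def time_gcd(gcd_fn, pairs) -> float:
    """Return the elapsed seconds for running gcd_fn over all pairs."""
    start = time.perf_counter()
    for a, b in pairs:
        gcd_fn(a, b)
    return time.perf_counter() - start


# Small count so the sketch runs quickly; use 1_000_000 for the real test.
pairs = make_pairs(10_000)
elapsed = time_gcd(math.gcd, pairs)
print(f"{len(pairs)} calculations took {elapsed:.4f} s")
```

Timing both implementations over the same list of pairs keeps the comparison fair, and the speedup is then the software time divided by the hardware time.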

Comparison with software GCD:

Compared to the software GCD, the hardware GCD speeds up the calculation by a factor of 1.84.

3. Full Step-by-Step Tutorial

This video contains detailed steps for making this project.

4. Conclusion

In this tutorial, we covered the creation of a more complex AXI-Lite based IP core.
