This tutorial shows how to create a more complex AXI-Lite IP core in Verilog. The IP core calculates the greatest common divisor (GCD) of two numbers. Later, we can compare the performance of the AXI-Lite GCD with the AXI-Stream GCD (with AXI DMA).
The GCD algorithm in this tutorial is based on the code from the book Embedded SoPC Design with Nios II Processor and Verilog Examples, pages 663-679, by Pong P. Chu.
The GCD of two numbers, written gcd(a, b), is the largest number that divides both a and b without remainder. For example, gcd(35, 25) is 5, and gcd(128, 72) is 8.
We will use the binary GCD algorithm, shown in the following equations. This algorithm uses only subtraction and divide-by-2 operations. Its six equations are applied repeatedly until a is equal to b (equation 1).
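For reference, the six cases can be written as a single piecewise definition. This is a reconstruction, not a copy of the original figure: cases 1, 2, 3, 4 and 6 follow from the worked example and the Python code in this tutorial, and case 5 (both odd, a > b) is assumed by elimination.

```latex
\[
\gcd(a, b) =
\begin{cases}
a & \text{if } a = b \quad (1)\\[2pt]
2\,\gcd\!\left(\tfrac{a}{2}, \tfrac{b}{2}\right) & \text{if } a \text{ and } b \text{ are even} \quad (2)\\[2pt]
\gcd\!\left(a, \tfrac{b}{2}\right) & \text{if } a \text{ is odd and } b \text{ is even} \quad (3)\\[2pt]
\gcd\!\left(\tfrac{a}{2}, b\right) & \text{if } a \text{ is even and } b \text{ is odd} \quad (4)\\[2pt]
\gcd(a - b, b) & \text{if } a > b \text{ and both are odd} \quad (5)\\[2pt]
\gcd(a, b - a) & \text{if } a < b \text{ and both are odd} \quad (6)
\end{cases}
\]
```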
For example, gcd(24, 15) is calculated step by step as follows:
gcd(24, 15), apply equation 4, then the result is,
gcd(12, 15), apply equation 4, then the result is,
gcd(6, 15), apply equation 4, then the result is,
gcd(3, 15), apply equation 6, then the result is,
gcd(3, 12), apply equation 3, then the result is,
gcd(3, 6), apply equation 3, then the result is,
gcd(3, 3), apply equation 1, then the result is,
3
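The trace above can be reproduced with a short Python sketch. The function name gcd_trace is hypothetical, and the equation numbering is inferred from this worked example (equation 2 is taken to be the both-even case and equation 5 the both-odd, a > b case, which the example does not exercise). It assumes positive integer inputs.

```python
def gcd_trace(a, b):
    """Return (gcd, steps); each step is (a, b, equation_number_applied)."""
    if a <= 0 or b <= 0:
        raise ValueError("a and b must be positive integers")
    steps = []
    n = 0  # number of times equation 2 (both even) was applied
    while a != b:
        a_even = (a % 2 == 0)
        b_even = (b % 2 == 0)
        if a_even and b_even:      # equation 2 (assumed numbering)
            eq, nxt = 2, (a // 2, b // 2)
            n += 1
        elif b_even:               # equation 3: a odd, b even
            eq, nxt = 3, (a, b // 2)
        elif a_even:               # equation 4: a even, b odd
            eq, nxt = 4, (a // 2, b)
        elif a > b:                # equation 5 (assumed numbering)
            eq, nxt = 5, (a - b, b)
        else:                      # equation 6: both odd, a < b
            eq, nxt = 6, (a, b - a)
        steps.append((a, b, eq))
        a, b = nxt
    steps.append((a, b, 1))        # equation 1: a == b
    return a << n, steps


result, steps = gcd_trace(24, 15)
for a, b, eq in steps:
    print('gcd({}, {}), apply equation {}'.format(a, b, eq))
print(result)
```

For gcd(24, 15), the printed sequence of equations is 4, 4, 4, 6, 3, 3, 1 and the result is 3, matching the steps listed above.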
Software Implementation
Now that we have the GCD algorithm, the next step is to implement this in Python code. Later on, we can use this Python code for verification of the hardware implementation and performance comparison between hardware and software.
# Function to calculate GCD with SW
def calc_gcd_sw(a, b):
    n = 0
    while True:
        if (a == b):
            break
        if ((a % 2) == 0):  # a even
            a = a >> 1
            if ((b % 2) == 0):  # b even
                b = b >> 1
                n = n + 1
        else:  # a odd
            if ((b % 2) == 0):  # b even
                b = b >> 1
            else:  # b odd
                if (a > b):
                    a = a - b
                else:
                    b = b - a
    a = a << n
    return a
Equation 2 requires multiplying the GCD result by 2. In the Python code, this is implemented by counting the occurrences of this condition in a variable, n. At the end (line 20 of the listing, the statement a = a << n), the result is multiplied by 2^n. Multiplying by 2 is the same as shifting left by one bit, so multiplying by 2^n is the same as shifting left n times.
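To see why the final shift matters, the sketch below checks a GCD function against Python's math.gcd on random positive inputs. The names check_gcd and binary_gcd are hypothetical; binary_gcd is a compact restatement of the same shift-and-subtract idea with a switch to omit the final shift, and is not the tutorial's listing.

```python
import math
import random


def binary_gcd(a, b, apply_shift=True):
    """Shift-and-subtract GCD for positive integers (illustrative restatement)."""
    n = 0
    while a != b:
        if a % 2 == 0 and b % 2 == 0:
            a, b, n = a >> 1, b >> 1, n + 1
        elif a % 2 == 0:
            a >>= 1
        elif b % 2 == 0:
            b >>= 1
        elif a > b:
            a -= b
        else:
            b -= a
    # Restore the common factor 2^n removed by the both-even steps
    return a << n if apply_shift else a


def check_gcd(gcd_fn, trials=1000, max_value=1 << 20, seed=0):
    """Return None if gcd_fn agrees with math.gcd, else the first mismatch."""
    rng = random.Random(seed)
    for _ in range(trials):
        a = rng.randint(1, max_value)
        b = rng.randint(1, max_value)
        expected = math.gcd(a, b)
        got = gcd_fn(a, b)
        if got != expected:
            return (a, b, got, expected)
    return None


print(check_gcd(binary_gcd))              # None when every trial matches
print(binary_gcd(128, 72))                # 8
print(binary_gcd(128, 72, apply_shift=False))  # 1, the factor 2^3 is lost
```

The same check_gcd helper could be pointed at calc_gcd_sw, or at a hardware-backed function, to compare either against a known-good reference.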
Hardware GCD Core
Now that we have the Python program for the GCD algorithm, we can build the Verilog code based on it. The ASMD chart of the GCD algorithm is shown in the following figure.
The state machine has two states: S_IDLE and S_OP. In the S_IDLE state, it waits until the start signal is equal to 1. In this state, the ready signal is also 1, meaning that the core is ready to accept new inputs. In the S_OP state, it calculates the GCD using the same algorithm as in the Python code. After the calculation is finished, it goes back to the S_IDLE state.
The timing diagram of the GCD core is shown in the following figure. In the S_IDLE state, the ready signal is 1. It indicates that the GCD core is ready to accept new inputs. When the GCD core is in the process of calculating GCD, the ready signal is 0, and it goes back to 1 when the process is finished.
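The behaviour described above can be mimicked with a small cycle-by-cycle Python model. This is only a sketch under stated assumptions: one algorithm step per clock cycle, start asserted during the first cycle only, and the final left shift done in a single cycle. The actual Verilog core may differ in all three respects. The function name simulate_gcd_core is hypothetical.

```python
def simulate_gcd_core(a, b, max_cycles=10000):
    """Return (r, trace); trace holds one (cycle, state, ready) tuple per clock."""
    if a <= 0 or b <= 0:
        raise ValueError("a and b must be positive integers")
    state, n, r = "S_IDLE", 0, 0
    trace = []
    for cycle in range(max_cycles):
        ready = 1 if state == "S_IDLE" else 0   # ready is 1 only in S_IDLE
        trace.append((cycle, state, ready))
        if state == "S_IDLE":
            if cycle > 0:
                return r, trace                 # back in S_IDLE: r is valid
            state, n = "S_OP", 0                # assumption: start = 1 in cycle 0
        elif a == b:
            r, state = a << n, "S_IDLE"         # assumption: shift takes one cycle
        elif a % 2 == 0:
            a >>= 1
            if b % 2 == 0:                      # both even: count the factor of 2
                b >>= 1
                n += 1
        elif b % 2 == 0:
            b >>= 1
        elif a > b:
            a -= b
        else:
            b -= a
    raise RuntimeError("simulation did not finish within max_cycles")


r, trace = simulate_gcd_core(35, 25)
for cycle, state, ready in trace:
    print(cycle, state, ready)
print('r =', r)
```

In this model, ready is 1 in the first and last trace entries and 0 for every cycle spent in S_OP, which is the pattern the timing diagram describes.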
The testbench file of this simulation is shown below:
gcd_core_tb.v
`timescale 1ns / 1ps

module gcd_core_tb();

    // Clock period
    localparam T = 10;

    reg clk;
    reg rst_n;
    reg start;
    reg [31:0] a;
    reg [31:0] b;
    wire done;
    wire ready;
    wire [31:0] r;

    gcd_core dut
    (
        .clk(clk),
        .rst_n(rst_n),
        .start(start),
        .a(a),
        .b(b),
        .done(done),
        .ready(ready),
        .r(r)
    );

    always
    begin
        clk = 0;
        #(T/2);
        clk = 1;
        #(T/2);
    end

    initial
    begin
        // Initial value
        a = 0;
        b = 0;
        start = 0;

        // Reset
        rst_n = 0;
        #T;
        rst_n = 1;
        #T;

        // gcd(35, 25)
        a = 35;
        b = 25;
        start = 1;
        #T;
        start = 0;
        #(T*10);

        // gcd(128, 72)
        a = 128;
        b = 72;
        start = 1;
        #T;
        start = 0;
        #(T*15);

        // gcd(24, 15)
        a = 24;
        b = 15;
        start = 1;
        #T;
        start = 0;
        #(T*10);
    end

endmodule
1.2. AXI-Lite Wrapper
Now that we have the GCD core module, the next step is to create an AXI-Lite wrapper. The block diagram of the wrapper is shown in the following figure.
This diagram shows our system. It consists of an ARM CPU, DRAM, and our AXI-Lite GCD module. Our AXI-Lite module is connected to the ARM CPU via the AXI interconnect.
This is the final block design diagram as shown in Vivado.
We can change the memory-mapped base address of this AXI-Lite GCD module in the Address Editor:
2. Software Design
Our AXI-Lite GCD module is connected to the ARM CPU. The CPU can access the module using memory mapping. This is done using the MMIO object from the PYNQ library.
from pynq import MMIO

# Access to memory map of the AXI GCD
ADDR_BASE = 0xA0000000
ADDR_RANGE = 0x80
gcd_obj = MMIO(ADDR_BASE, ADDR_RANGE)
To write data to inputs a and b of the GCD module, we can use the write() method from the MMIO object.
# Write input A and B to the AXI GCD
gcd_obj.write(0x8, 128)
gcd_obj.write(0x10, 72)
We start the GCD calculation by writing one to the start bit. Then, we read and wait until the ready flag is one, indicating that the calculation is complete.
# Start main controller
gcd_obj.write(0x0, 0x1)
# Wait until ready flag is 1
while ((gcd_obj.read(0x0) & (1 << 1)) == 0):
    pass
To read data from output r of the GCD module, we can use the read() method from the MMIO object.
# Read GCD result
gcd_obj.read(0x18)
8
We can create a function to do a GCD calculation like this:
# Function to calculate GCD with HW core
def calc_gcd_hw(a, b):
    gcd_obj.write(0x8, a)
    gcd_obj.write(0x10, b)
    gcd_obj.write(0x0, 0x1)
    r = gcd_obj.read(0x18)
    return r
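The function above reads r right after writing the start bit, without polling the ready flag. Whether polling is needed in practice depends on how the core's latency compares with the MMIO access time, which is not measured here. A more defensive variant might look like the sketch below. The register names, FakeGcdMmio and calc_gcd_hw_polled are hypothetical: the offsets are taken from the calls used in this tutorial, and FakeGcdMmio is only a software stand-in so the example runs without a board. On hardware, the gcd_obj MMIO object would be passed instead.

```python
import math

# Register offsets as used in this tutorial's MMIO calls (names are illustrative)
REG_CTRL = 0x00       # bit 0: start, bit 1: ready
REG_A = 0x08
REG_B = 0x10
REG_R = 0x18
READY_BIT = 1 << 1


class FakeGcdMmio:
    """Hypothetical stand-in with read()/write() methods, for off-board testing."""

    def __init__(self, busy_reads=3):
        self.regs = {REG_A: 0, REG_B: 0, REG_R: 0}
        self.busy_reads = busy_reads   # control reads that report "not ready"
        self._busy = 0

    def write(self, offset, value):
        if offset == REG_CTRL:
            if value & 0x1:            # start bit: compute and become busy
                self.regs[REG_R] = math.gcd(self.regs[REG_A], self.regs[REG_B])
                self._busy = self.busy_reads
        else:
            self.regs[offset] = value

    def read(self, offset):
        if offset == REG_CTRL:
            if self._busy > 0:
                self._busy -= 1
                return 0
            return READY_BIT
        return self.regs[offset]


def calc_gcd_hw_polled(mmio, a, b, max_polls=1000):
    """Write inputs, start the core, poll ready with a bound, then read r."""
    mmio.write(REG_A, a)
    mmio.write(REG_B, b)
    mmio.write(REG_CTRL, 0x1)
    for _ in range(max_polls):
        if mmio.read(REG_CTRL) & READY_BIT:
            return mmio.read(REG_R)
    raise TimeoutError("GCD core did not report ready")


print(calc_gcd_hw_polled(FakeGcdMmio(), 128, 72))
```

Bounding the number of polls avoids hanging the notebook if the bitstream is not loaded or the base address is wrong. Note that the extra reads add AXI-Lite transactions, so timings measured with this variant would not be comparable with the figures reported in this tutorial.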
We can measure the time required to do 1 million calculations. Later, we can compare this with the AXI-Stream GCD (with DMA) module.
from time import time

# Measure the time required to calculate 1 million operations
t1 = time()
for i in range(1000000):
    calc_gcd_hw(2391065, 3578129)
t2 = time()
t_hw = t2 - t1
print('Time used for HW GCD: {}s'.format(t_hw))
Time used for HW GCD: 29.42424488067627s
Comparison with software GCD:
# Measure the time required to calculate 1 million operations
t1 = time()
for i in range(1000000):
    calc_gcd_sw(2391065, 3578129)
t2 = time()
t_sw = t2 - t1
print('Time used for SW GCD: {}s'.format(t_sw))
Time used for SW GCD: 54.3738739490509s
Compared to the software GCD, the hardware GCD speeds up the calculation by a factor of about 1.85 (54.37 s / 29.42 s).
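The speedup is the ratio of the two measured times. The sketch below recomputes it from the figures reported above and wraps the timing loop in a reusable helper. The name time_calls is hypothetical, and perf_counter is used here instead of time() as one reasonable choice for interval timing.

```python
from time import perf_counter


def time_calls(fn, args, repeats):
    """Return the elapsed seconds for calling fn(*args) 'repeats' times."""
    t1 = perf_counter()
    for _ in range(repeats):
        fn(*args)
    return perf_counter() - t1


# Times reported in this tutorial for 1 million calculations
t_hw = 29.42424488067627
t_sw = 54.3738739490509

speedup = t_sw / t_hw
print('Speedup of HW over SW: {:.2f}x'.format(speedup))
```

On the board, time_calls(calc_gcd_hw, (2391065, 3578129), 1000000) and the same call with calc_gcd_sw would give the two times directly.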
3. Full Step-by-Step Tutorial
This video contains detailed steps for making this project.
4. Conclusion
In this tutorial, we covered the creation of a more complex AXI-Lite-based IP core.