This tutorial contains information on how to create a more complex AXI-Lite IP core in Verilog. The IP core does a greatest common divisor (GCD) calculation between two numbers. Later, we can compare the performance between AXI-Lite GCD and AXI-Stream GCD (with AXI DMA).
The GCD algorithm in this tutorial is based on the code from the book Embedded SoPC Design with Nios II Processor and Verilog Examples, page 663-679, by Pong P. Chu.
The GCD between two numbers is the largest number that divides them without remainder. We have gcd(a, b). This gcd() function returns the largest number that divides both a and b without remainder. For example, gcd(35, 25) is 5, and gcd(128, 72) is 8.
We will use the binary GCD algorithm. It is shown in the following equations. This algorithm uses only subtraction and divide-by-2 operations. It has six equations that should be applied repetitively until a is equal to b (equation 1).
For example, this is step-by-step how to calculate gcd(24, 15):
gcd(24, 15), apply equation 4, then the result is,
gcd(12, 15), apply equation 4, then the result is,
gcd(6, 15), apply equation 4, then the result is,
gcd(3, 15), apply equation 6, then the result is,
gcd(3, 12), apply equation 3, then the result is,
gcd(3, 6), apply equation 3, then the result is,
gcd(3, 3), apply equation 1, then the result is,
3
Software Implementation
Now that we have the GCD algorithm, the next step is to implement this in Python code. Later on, we can use this Python code for verification of the hardware implementation and performance comparison between hardware and software.
# Function to calculate GCD with SW
def calc_gcd_sw(a, b):
n = 0
while True:
if (a == b):
break
if ((a % 2) == 0): # a even
a = a >> 1
if ((b % 2) == 0): # b even
b = b >> 1
n = n + 1
else: # a odd
if ((b % 2) == 0): # b even
b = b >> 1
else: # b odd
if (a > b):
a = a - b
else:
b = b - a
a = a << n
return a
In equation 2, we should multiply the GCD result by 2. In Python code, it is implemented by counting the occurrences this condition, saved in a variable, n. At the end, in line 20, we should multiply the result by 2n. Note that multiplying by 2 can be done by shifting it to the left once. Then multiplying by 2n is equal to left shifting n times.
Hardware GCD Core
Now that we have the Python program for the GCD algorithm. Based on this program, we can build the Verilog code. The ASMD chart of the GCD algorithm is shown in the following figure.
The state machine has two states: S_IDLE and S_OP. In S_IDLE state, it waits until the start signal is equal to 1. In this state the ready signal is also 1 meaning that it is ready to accept new inputs. In S_OP state, it calculates the GCD using the same algorithm as in the Python code. After the calculation is finished, it goes back to the S_IDLE state.
This is the Verilog code for the GCD core.
gcd_core.v
module gcd_core
(
input wire clk,
input wire rst_n,
input wire start,
input wire [31:0] a,
input wire [31:0] b,
output wire ready,
output wire done,
output wire [31:0] r
);
localparam S_IDLE = 2'h0,
S_OP = 2'h1;
reg [1:0] state_reg, state_next;
reg [31:0] a_reg, a_next;
reg [31:0] b_reg, b_next;
reg [4:0] n_reg, n_next;
reg done_reg, done_next;
always @(posedge clk)
begin
if (!rst_n)
begin
state_reg <= S_IDLE;
a_reg <= 0;
b_reg <= 0;
n_reg <= 0;
done_reg <= 0;
end
else
begin
state_reg <= state_next;
a_reg <= a_next;
b_reg <= b_next;
n_reg <= n_next;
done_reg <= done_next;
end
end
always @(*)
begin
state_next = state_reg;
a_next = a_reg;
b_next = b_reg;
n_next = n_reg;
done_next = 0;
case(state_reg)
S_IDLE:
begin
if (start)
begin
a_next = a;
b_next = b;
n_next = 0;
state_next = S_OP;
end
end
S_OP:
begin
if (a_reg == b_reg)
begin
a_next = a_reg << n_reg;
done_next = 1;
state_next = S_IDLE;
end
else
begin
if (!a_reg[0]) // a even
begin
a_next = {1'b0, a_reg[31:1]};
if (!b_reg[0]) // a and b even
begin
b_next = {1'b0, b_reg[31:1]};
n_next = n_reg + 1;
end
end
else // a odd
begin
if (!b_reg[0]) // b even
begin
b_next = {1'b0, b_reg[31:1]};
end
else // a and b odd
begin
if (a_reg > b_reg)
a_next = a_reg - b_reg;
else
b_next = b_reg - a_reg;
end
end
end
end
endcase
end
assign ready = (state_reg == S_IDLE) ? 1 : 0;
assign done = done_reg;
assign r = a_reg;
endmodule
Simulation Result
The timing diagram of the GCD core is shown in the following figure. In the S_IDLE state, the ready signal is 1. It indicates that the GCD core is ready to accept new inputs. When the GCD core is in the process of calculating GCD, the ready signal is 0, and it goes back to 1 when the process is finished.
The testbench file of this simulation is shown below:
gcd_core_tb.v
`timescale 1ns / 1ps
module gcd_core_tb();
// Clock period
localparam T = 10;
reg clk;
reg rst_n;
reg start;
reg [31:0] a;
reg [31:0] b;
wire done;
wire ready;
wire [31:0] r;
gcd_core dut
(
.clk(clk),
.rst_n(rst_n),
.start(start),
.a(a),
.b(b),
.done(done),
.ready(ready),
.r(r)
);
always
begin
clk = 0;
#(T/2);
clk = 1;
#(T/2);
end
initial
begin
// Initial value
a = 0;
b = 0;
start = 0;
// Reset
rst_n = 0;
#T;
rst_n = 1;
#T;
// gcd(35, 25)
a = 35;
b = 25;
start = 1;
#T;
start = 0;
#(T*10);
// gcd(128, 72)
a = 128;
b = 72;
start = 1;
#T;
start = 0;
#(T*15);
// gcd(24, 15)
a = 24;
b = 15;
start = 1;
#T;
start = 0;
#(T*10);
end
endmodule
1.2. AXI-Lite Wrapper
Now that we already have the GCD core module. The next step is to create an AXI-Lite wrapper. The block diagram of the wrapper is shown in the following figure.
This is the Verilog code for the wrapper module.
axi_gcd.v
module axi_gcd
(
// ### Clock and reset signals #########################################
input wire aclk,
input wire aresetn,
// ### AXI4-lite slave signals #########################################
// *** Write address signals ***
output wire s_axi_awready,
input wire [31:0] s_axi_awaddr,
input wire s_axi_awvalid,
// *** Write data signals ***
output wire s_axi_wready,
input wire [31:0] s_axi_wdata,
input wire [3:0] s_axi_wstrb,
input wire s_axi_wvalid,
// *** Write response signals ***
input wire s_axi_bready,
output wire [1:0] s_axi_bresp,
output wire s_axi_bvalid,
// *** Read address signals ***
output wire s_axi_arready,
input wire [31:0] s_axi_araddr,
input wire s_axi_arvalid,
// *** Read data signals ***
input wire s_axi_rready,
output wire [31:0] s_axi_rdata,
output wire [1:0] s_axi_rresp,
output wire s_axi_rvalid
// ### User signals ####################################################
);
// ### Register map ########################################################
// 0x00: start and ready
// bit 0 = START (R/W)
// bit 1 = READY (R)
// 0x04: input a
// bit 31~0 = A[31:0] (R/W)
// 0x08: input b
// bit 31~0 = B[31:0] (R/W)
// 0x0c: output r
// bit 31~0 = R[31:0] (R)
localparam C_ADDR_BITS = 8;
// // *** Address (32-bit) ***
// localparam C_ADDR_CTRL = 8'h00,
// C_ADDR_INPA = 8'h04,
// C_ADDR_INPB = 8'h08,
// C_ADDR_OUTR = 8'h0c;
// *** Address (40-bit) ***
localparam C_ADDR_CTRL = 8'h00,
C_ADDR_INPA = 8'h08,
C_ADDR_INPB = 8'h10,
C_ADDR_OUTR = 8'h18;
// *** AXI write FSM ***
localparam S_WRIDLE = 2'd0,
S_WRDATA = 2'd1,
S_WRRESP = 2'd2;
// *** AXI read FSM ***
localparam S_RDIDLE = 2'd0,
S_RDDATA = 2'd1;
// *** AXI write ***
reg [1:0] wstate_cs, wstate_ns;
reg [C_ADDR_BITS-1:0] waddr;
wire [31:0] wmask;
wire aw_hs, w_hs;
// *** AXI read ***
reg [1:0] rstate_cs, rstate_ns;
wire [C_ADDR_BITS-1:0] raddr;
reg [31:0] rdata;
wire ar_hs;
// *** Control registers ***
reg start_reg;
wire ready_w;
reg [31:0] a_reg;
reg [31:0] b_reg;
wire [31:0] r_w;
// ### AXI write ###########################################################
assign s_axi_awready = (wstate_cs == S_WRIDLE);
assign s_axi_wready = (wstate_cs == S_WRDATA);
assign s_axi_bresp = 2'b00; // OKAY
assign s_axi_bvalid = (wstate_cs == S_WRRESP);
assign wmask = {{8{s_axi_wstrb[3]}}, {8{s_axi_wstrb[2]}}, {8{s_axi_wstrb[1]}}, {8{s_axi_wstrb[0]}}};
assign aw_hs = s_axi_awvalid & s_axi_awready;
assign w_hs = s_axi_wvalid & s_axi_wready;
// *** Write state register ***
always @(posedge aclk)
begin
if (!aresetn)
wstate_cs <= S_WRIDLE;
else
wstate_cs <= wstate_ns;
end
// *** Write state next ***
always @(*)
begin
case (wstate_cs)
S_WRIDLE:
if (s_axi_awvalid)
wstate_ns = S_WRDATA;
else
wstate_ns = S_WRIDLE;
S_WRDATA:
if (s_axi_wvalid)
wstate_ns = S_WRRESP;
else
wstate_ns = S_WRDATA;
S_WRRESP:
if (s_axi_bready)
wstate_ns = S_WRIDLE;
else
wstate_ns = S_WRRESP;
default:
wstate_ns = S_WRIDLE;
endcase
end
// *** Write address register ***
always @(posedge aclk)
begin
if (aw_hs)
waddr <= s_axi_awaddr[C_ADDR_BITS-1:0];
end
// ### AXI read ############################################################
assign s_axi_arready = (rstate_cs == S_RDIDLE);
assign s_axi_rdata = rdata;
assign s_axi_rresp = 2'b00; // OKAY
assign s_axi_rvalid = (rstate_cs == S_RDDATA);
assign ar_hs = s_axi_arvalid & s_axi_arready;
assign raddr = s_axi_araddr[C_ADDR_BITS-1:0];
// *** Read state register ***
always @(posedge aclk)
begin
if (!aresetn)
rstate_cs <= S_RDIDLE;
else
rstate_cs <= rstate_ns;
end
// *** Read state next ***
always @(*)
begin
case (rstate_cs)
S_RDIDLE:
if (s_axi_arvalid)
rstate_ns = S_RDDATA;
else
rstate_ns = S_RDIDLE;
S_RDDATA:
if (s_axi_rready)
rstate_ns = S_RDIDLE;
else
rstate_ns = S_RDDATA;
default:
rstate_ns = S_RDIDLE;
endcase
end
// *** Read data register ***
always @(posedge aclk)
begin
if (!aresetn)
rdata <= 0;
else if (ar_hs)
case (raddr)
C_ADDR_CTRL:
rdata <= {{30{1'b0}}, ready_w, start_reg};
C_ADDR_INPA:
rdata <= a_reg[31:0];
C_ADDR_INPB:
rdata <= b_reg[31:0];
C_ADDR_OUTR:
rdata <= r_w[31:0];
endcase
end
// ### User design #########################################################
// *** Start register ***
always @(posedge aclk)
begin
if (!aresetn)
begin
start_reg <= 0;
end
else if (w_hs && waddr == C_ADDR_CTRL && s_axi_wdata[0])
begin
start_reg <= 1;
end
else
begin
start_reg <= 0;
end
end
// *** Register a and b ***
always @(posedge aclk)
begin
if (!aresetn)
begin
a_reg[31:0] <= 0;
b_reg[31:0] <= 0;
end
else if (w_hs && waddr == C_ADDR_INPA)
begin
a_reg[31:0] <= (s_axi_wdata[31:0] & wmask) | (a_reg[31:0] & ~wmask);
end
else if (w_hs && waddr == C_ADDR_INPB)
begin
b_reg[31:0] <= (s_axi_wdata[31:0] & wmask) | (b_reg[31:0] & ~wmask);
end
end
// *** GCD core ***
gcd_core gcd_core_0
(
.clk(aclk),
.rst_n(aresetn),
.start(start_reg),
.a(a_reg),
.b(b_reg),
.done(),
.ready(ready_w),
.r(r_w)
);
endmodule
1.3. System Design
This diagram shows our system. It consists of an ARM CPU, DRAM, and our AXI-Lite GCD module. Our AXI-Lite module is connected to the ARM CPU via the AXI interconnect.
This is the final block design diagram as shown in Vivado.
We can change the memory-mapped base address of this AXI-Lite multiplier in the Address Editor:
2. Software Design
Our AXI-Lite GCD module is connected to the ARM CPU. The CPU can access the module using memory mapping. This is done using the MMIO object from the PYNQ library.
# Access to memory map of the AXI GCD
ADDR_BASE = 0xA0000000
ADDR_RANGE = 0x80
gcd_obj = MMIO(ADDR_BASE, ADDR_RANGE)
To write data to inputs a and b of the GCD module, we can use the write() method from the MMIO object.
# Write input A and B to the AXI GCD
gcd_obj.write(0x8, 128)
gcd_obj.write(0x10, 72)
We start the GCD calculation by writing one to the start bit. Then, we read and wait until the ready flag is one, indicating that the calculation is complete.
# Start main controller
gcd_obj.write(0x0, 0x1)
# Wait until ready flag is 1
while ((gcd_obj.read(0x0) & (1 << 1)) == 0):
pass
To read data from output r of the GCD module, we can use the read() method from the MMIO object.
# Read GCD result
gcd_obj.read(0x18)
80
We can create a function to do a GCD calculation like this:
# Function to calculate GCD with HW core
def calc_gcd_hw(a, b):
gcd_obj.write(0x8, a)
gcd_obj.write(0x10, b)
gcd_obj.write(0x0, 0x1)
r = gcd_obj.read(0x18)
return r
We can calculate the time required to do 1 million calculations. Later, we can compare this with the AXI-Stream multiplier (with DMA) module.
# Measure the time required to calculate 1 million operation
t1 = time()
for i in range(1000000):
calc_gcd_hw(2391065, 3578129)
t2 = time()
t_hw = t2 - t1
print('Time used for HW GCD: {}s'.format(t_hw))
Time used for HW GCD: 29.42424488067627s
Comparison with software GCD:
# Measure the time required to calculate 1 million operation
t1 = time()
for i in range(1000000):
calc_gcd_sw(2391065, 3578129)
t2 = time()
t_sw = t2 - t1
print('Time used for SW GCD: {}s'.format(t_sw))
Time used for SW GCD: 54.3738739490509s
Compared to the software GCD, the hardware GCD speeds up the calculation by 1.84.
3. Full Step-by-Step Tutorial
This video contains detailed steps for making this project.
4. Conclusion
In this tutorial, we covered a more complex AXI-Lite based IP core creation.