Part 5: AXI-Lite GCD

Objective

This tutorial contains information on how to create a more complex AXI-Lite IP core in Verilog. The IP core does a greatest common divisor (GCD) calculation between two numbers. Later, we can compare the performance between AXI-Lite GCD and AXI-Stream GCD (with AXI DMA).

References

Pong P. Chu, Embedded SoPC Design with Nios II Processor and Verilog Examples, https://onlinelibrary.wiley.com/doi/book/10.1002/9781118309728

1. Hardware Design

1.1. RTL Design of GCD

GCD Algorithm

The GCD algorithm in this tutorial is based on the code from the book Embedded SoPC Design with Nios II Processor and Verilog Examples, page 663-679, by Pong P. Chu.

The GCD between two numbers is the largest number that divides them without remainder. We have gcd(a, b). This gcd() function returns the largest number that divides both a and b without remainder. For example, gcd(35, 25) is 5, and gcd(128, 72) is 8.

We will use the binary GCD algorithm. It is shown in the following equations. This algorithm uses only subtraction and divide-by-2 operations. It has six equations that should be applied repetitively until a is equal to b (equation 1).

For example, this is step-by-step how to calculate gcd(24, 15):

gcd(24, 15), apply equation 4, then the result is,
gcd(12, 15), apply equation 4, then the result is,
gcd(6, 15), apply equation 4, then the result is,
gcd(3, 15), apply equation 6, then the result is,
gcd(3, 12), apply equation 3, then the result is,
gcd(3, 6), apply equation 3, then the result is,
gcd(3, 3), apply equation 1, then the result is,
3

Software Implementation

Now that we have the GCD algorithm, the next step is to implement this in Python code. Later on, we can use this Python code for verification of the hardware implementation and performance comparison between hardware and software.

# Function to calculate GCD with SW
def calc_gcd_sw(a, b):
    n = 0
    while True:
        if (a == b):
            break
        if ((a % 2) == 0): # a even
            a = a >> 1
            if ((b % 2) == 0): # b even
                b = b >> 1
                n = n + 1
        else: # a odd
            if ((b % 2) == 0): # b even
                b = b >> 1
            else: # b odd
                if (a > b):
                    a = a - b
                else:
                    b = b - a
    a = a << n
    return a

In equation 2, we should multiply the GCD result by 2. In Python code, it is implemented by counting the occurrences this condition, saved in a variable, n. At the end, in line 20, we should multiply the result by 2n. Note that multiplying by 2 can be done by shifting it to the left once. Then multiplying by 2n is equal to left shifting n times.

Hardware GCD Core

Now that we have the Python program for the GCD algorithm. Based on this program, we can build the Verilog code. The ASMD chart of the GCD algorithm is shown in the following figure.

The state machine has two states: S_IDLE and S_OP. In S_IDLE state, it waits until the start signal is equal to 1. In this state the ready signal is also 1 meaning that it is ready to accept new inputs. In S_OP state, it calculates the GCD using the same algorithm as in the Python code. After the calculation is finished, it goes back to the S_IDLE state.

This is the Verilog code for the GCD core.

gcd_core.v

module gcd_core
    (
        input wire         clk,
        input wire         rst_n,
        input wire         start,
        input wire [31:0]  a,
        input wire [31:0]  b,
        output wire        ready,
        output wire        done,
        output wire [31:0] r
    );
    
    localparam S_IDLE = 2'h0,
               S_OP = 2'h1;
    
    reg [1:0] state_reg, state_next;
    reg [31:0] a_reg, a_next;
    reg [31:0] b_reg, b_next;
    reg [4:0] n_reg, n_next;
    reg done_reg, done_next;

    always @(posedge clk)
    begin
        if (!rst_n)
        begin
            state_reg <= S_IDLE;
            a_reg <= 0;
            b_reg <= 0;
            n_reg <= 0;
            done_reg <= 0;
        end
        else
        begin
            state_reg <= state_next;
            a_reg <= a_next;
            b_reg <= b_next;
            n_reg <= n_next;
            done_reg <= done_next;
        end
    end
    
    always @(*)
    begin
        state_next = state_reg;
        a_next = a_reg;
        b_next = b_reg;
        n_next = n_reg;
        done_next = 0;
        case(state_reg)
            S_IDLE:
            begin
                if (start)
                begin
                    a_next = a;
                    b_next = b;
                    n_next = 0;
                    state_next = S_OP;
                end
            end
            S_OP:
            begin
                if (a_reg == b_reg)
                begin
                    a_next = a_reg << n_reg;
                    done_next = 1;
                    state_next = S_IDLE;
                end
                else
                begin
                    if (!a_reg[0])       // a even
                    begin
                        a_next = {1'b0, a_reg[31:1]};
                        if (!b_reg[0])   // a and b even
                        begin
                            b_next = {1'b0, b_reg[31:1]};
                            n_next = n_reg + 1;
                        end
                    end
                    else                // a odd
                    begin
                        if (!b_reg[0])  // b even
                        begin
                            b_next = {1'b0, b_reg[31:1]};
                        end
                        else            // a and b odd
                        begin
                            if (a_reg > b_reg)
                                a_next = a_reg - b_reg;
                            else
                                b_next = b_reg - a_reg;
                        end
                    end
                end
            end
        endcase
    end
 
    assign ready = (state_reg == S_IDLE) ? 1 : 0;
    assign done = done_reg; 
    assign r = a_reg;
    
endmodule

Simulation Result

The timing diagram of the GCD core is shown in the following figure. In the S_IDLE state, the ready signal is 1. It indicates that the GCD core is ready to accept new inputs. When the GCD core is in the process of calculating GCD, the ready signal is 0, and it goes back to 1 when the process is finished.

The testbench file of this simulation is shown below:

gcd_core_tb.v

`timescale 1ns / 1ps

module gcd_core_tb();
    // Clock period
    localparam T = 10;
    
    reg clk;
    reg rst_n;
    reg start;
    reg [31:0] a;
    reg [31:0] b;
    wire done;
    wire ready;
    wire [31:0] r;
    
    gcd_core dut
    (
        .clk(clk),
        .rst_n(rst_n),
        .start(start),
        .a(a),
        .b(b),
        .done(done),
        .ready(ready),
        .r(r)
    );
    
    always
    begin
        clk = 0;
        #(T/2);
        clk = 1;
        #(T/2);
    end
    
    initial
    begin
        // Initial value
        a = 0;
        b = 0;
        start = 0;
        
        // Reset
        rst_n = 0;
        #T;
        rst_n = 1;
        #T;
        
        // gcd(35, 25)
        a = 35;
        b = 25;
        start = 1;
        #T;
        start = 0;
        #(T*10);

        // gcd(128, 72)
        a = 128;
        b = 72;
        start = 1;
        #T;
        start = 0;
        #(T*15);

        // gcd(24, 15)
        a = 24;
        b = 15;
        start = 1;
        #T;
        start = 0;
        #(T*10);  
    end
    
endmodule

1.2. AXI-Lite Wrapper

Now that we already have the GCD core module. The next step is to create an AXI-Lite wrapper. The block diagram of the wrapper is shown in the following figure.

This is the Verilog code for the wrapper module.

axi_gcd.v

module axi_gcd
    (
        // ### Clock and reset signals #########################################
        input  wire        aclk,
        input  wire        aresetn,
        // ### AXI4-lite slave signals #########################################
        // *** Write address signals ***
        output wire        s_axi_awready,
        input  wire [31:0] s_axi_awaddr,
        input  wire        s_axi_awvalid,
        // *** Write data signals ***
        output wire        s_axi_wready,
        input  wire [31:0] s_axi_wdata,
        input  wire [3:0]  s_axi_wstrb,
        input  wire        s_axi_wvalid,
        // *** Write response signals ***
        input  wire        s_axi_bready,
        output wire [1:0]  s_axi_bresp,
        output wire        s_axi_bvalid,
        // *** Read address signals ***
        output wire        s_axi_arready,
        input  wire [31:0] s_axi_araddr,
        input  wire        s_axi_arvalid,
        // *** Read data signals ***	
        input  wire        s_axi_rready,
        output wire [31:0] s_axi_rdata,
        output wire [1:0]  s_axi_rresp,
        output wire        s_axi_rvalid
        // ### User signals ####################################################
    );

    // ### Register map ########################################################
    // 0x00: start and ready
    //       bit 0 = START (R/W)
    //       bit 1 = READY (R)
    // 0x04: input a
    //       bit 31~0 = A[31:0] (R/W)
    // 0x08: input b
    //       bit 31~0 = B[31:0] (R/W)
    // 0x0c: output r
    //       bit 31~0 = R[31:0] (R)
    localparam C_ADDR_BITS = 8;
//    // *** Address (32-bit) ***
//    localparam C_ADDR_CTRL = 8'h00,
//               C_ADDR_INPA = 8'h04,
//               C_ADDR_INPB = 8'h08,
//               C_ADDR_OUTR = 8'h0c;
    // *** Address (40-bit) ***
    localparam C_ADDR_CTRL = 8'h00,
               C_ADDR_INPA = 8'h08,
               C_ADDR_INPB = 8'h10,
               C_ADDR_OUTR = 8'h18;
    // *** AXI write FSM ***
    localparam S_WRIDLE = 2'd0,
               S_WRDATA = 2'd1,
               S_WRRESP = 2'd2;
    // *** AXI read FSM ***
    localparam S_RDIDLE = 2'd0,
               S_RDDATA = 2'd1;
    
    // *** AXI write ***
    reg [1:0] wstate_cs, wstate_ns;
    reg [C_ADDR_BITS-1:0] waddr;
    wire [31:0] wmask;
    wire aw_hs, w_hs;
    // *** AXI read ***
    reg [1:0] rstate_cs, rstate_ns;
    wire [C_ADDR_BITS-1:0] raddr;
    reg [31:0] rdata;
    wire ar_hs;
    // *** Control registers ***
    reg start_reg;
    wire ready_w;
    reg [31:0] a_reg;
    reg [31:0] b_reg;
    wire [31:0] r_w;
    
    // ### AXI write ###########################################################
    assign s_axi_awready = (wstate_cs == S_WRIDLE);
    assign s_axi_wready = (wstate_cs == S_WRDATA);
    assign s_axi_bresp = 2'b00;    // OKAY
    assign s_axi_bvalid = (wstate_cs == S_WRRESP);
    assign wmask = {{8{s_axi_wstrb[3]}}, {8{s_axi_wstrb[2]}}, {8{s_axi_wstrb[1]}}, {8{s_axi_wstrb[0]}}};
    assign aw_hs = s_axi_awvalid & s_axi_awready;
    assign w_hs = s_axi_wvalid & s_axi_wready;

    // *** Write state register ***
    always @(posedge aclk)
    begin
        if (!aresetn)
            wstate_cs <= S_WRIDLE;
        else
            wstate_cs <= wstate_ns;
    end
    
    // *** Write state next ***
    always @(*)
    begin
        case (wstate_cs)
            S_WRIDLE:
                if (s_axi_awvalid)
                    wstate_ns = S_WRDATA;
                else
                    wstate_ns = S_WRIDLE;
            S_WRDATA:
                if (s_axi_wvalid)
                    wstate_ns = S_WRRESP;
                else
                    wstate_ns = S_WRDATA;
            S_WRRESP:
                if (s_axi_bready)
                    wstate_ns = S_WRIDLE;
                else
                    wstate_ns = S_WRRESP;
            default:
                wstate_ns = S_WRIDLE;
        endcase
    end
    
    // *** Write address register ***
    always @(posedge aclk)
    begin
        if (aw_hs)
            waddr <= s_axi_awaddr[C_ADDR_BITS-1:0];
    end
    
    // ### AXI read ############################################################
    assign s_axi_arready = (rstate_cs == S_RDIDLE);
    assign s_axi_rdata = rdata;
    assign s_axi_rresp = 2'b00;    // OKAY
    assign s_axi_rvalid = (rstate_cs == S_RDDATA);
    assign ar_hs = s_axi_arvalid & s_axi_arready;
    assign raddr = s_axi_araddr[C_ADDR_BITS-1:0];
    
    // *** Read state register ***
    always @(posedge aclk)
    begin
        if (!aresetn)
            rstate_cs <= S_RDIDLE;
        else
            rstate_cs <= rstate_ns;
    end

    // *** Read state next ***
    always @(*) 
    begin
        case (rstate_cs)
            S_RDIDLE:
                if (s_axi_arvalid)
                    rstate_ns = S_RDDATA;
                else
                    rstate_ns = S_RDIDLE;
            S_RDDATA:
                if (s_axi_rready)
                    rstate_ns = S_RDIDLE;
                else
                    rstate_ns = S_RDDATA;
            default:
                rstate_ns = S_RDIDLE;
        endcase
    end
    
    // *** Read data register ***
    always @(posedge aclk)
    begin
        if (!aresetn)
            rdata <= 0;
        else if (ar_hs)
            case (raddr)
                C_ADDR_CTRL: 
                    rdata <= {{30{1'b0}}, ready_w, start_reg};
                C_ADDR_INPA:
                    rdata <= a_reg[31:0];
                C_ADDR_INPB:
                    rdata <= b_reg[31:0];
                C_ADDR_OUTR:
                    rdata <= r_w[31:0];	
            endcase
    end
    
    // ### User design #########################################################
    // *** Start register ***
    always @(posedge aclk)
    begin
        if (!aresetn)
        begin
            start_reg <= 0;
        end
        else if (w_hs && waddr == C_ADDR_CTRL && s_axi_wdata[0])
        begin
            start_reg <= 1;
        end
        else
        begin
            start_reg <= 0;
        end
    end

    // *** Register a and b ***
    always @(posedge aclk)
    begin
        if (!aresetn)
        begin
            a_reg[31:0] <= 0;
            b_reg[31:0] <= 0;
        end
        else if (w_hs && waddr == C_ADDR_INPA)
        begin
            a_reg[31:0] <= (s_axi_wdata[31:0] & wmask) | (a_reg[31:0] & ~wmask);
        end
        else if (w_hs && waddr == C_ADDR_INPB)
        begin
            b_reg[31:0] <= (s_axi_wdata[31:0] & wmask) | (b_reg[31:0] & ~wmask);
        end
    end

    // *** GCD core *** 
    gcd_core gcd_core_0
    (
        .clk(aclk),
        .rst_n(aresetn),
        .start(start_reg),
        .a(a_reg),
        .b(b_reg),
        .done(),
        .ready(ready_w),
        .r(r_w)
    );

endmodule

1.3. System Design

This diagram shows our system. It consists of an ARM CPU, DRAM, and our AXI-Lite GCD module. Our AXI-Lite module is connected to the ARM CPU via the AXI interconnect.

This is the final block design diagram as shown in Vivado.

We can change the memory-mapped base address of this AXI-Lite multiplier in the Address Editor:

2. Software Design

Our AXI-Lite GCD module is connected to the ARM CPU. The CPU can access the module using memory mapping. This is done using the MMIO object from the PYNQ library.

# Access to memory map of the AXI GCD
ADDR_BASE = 0xA0000000
ADDR_RANGE = 0x80
gcd_obj = MMIO(ADDR_BASE, ADDR_RANGE)

To write data to inputs a and b of the GCD module, we can use the write() method from the MMIO object.

# Write input A and B to the AXI GCD
gcd_obj.write(0x8, 128)
gcd_obj.write(0x10, 72)

We start the GCD calculation by writing one to the start bit. Then, we read and wait until the ready flag is one, indicating that the calculation is complete.

# Start main controller
gcd_obj.write(0x0, 0x1)
# Wait until ready flag is 1
while ((gcd_obj.read(0x0) & (1 << 1)) == 0):
    pass

To read data from output r of the GCD module, we can use the read() method from the MMIO object.

# Read GCD result
gcd_obj.read(0x18)

We can create a function to do a GCD calculation like this:

# Function to calculate GCD with HW core
def calc_gcd_hw(a, b):
    gcd_obj.write(0x8, a)
    gcd_obj.write(0x10, b)
    gcd_obj.write(0x0, 0x1)
    r = gcd_obj.read(0x18)
    return r

We can calculate the time required to do 1 million calculations. Later, we can compare this with the AXI-Stream multiplier (with DMA) module.

# Measure the time required to calculate 1 million operation
t1 = time()
for i in range(1000000):
    calc_gcd_hw(2391065, 3578129)
t2 = time()
t_hw = t2 - t1
print('Time used for HW GCD: {}s'.format(t_hw))

Time used for HW GCD: 29.42424488067627s

Comparison with software GCD:

# Measure the time required to calculate 1 million operation
t1 = time()
for i in range(1000000):
    calc_gcd_sw(2391065, 3578129)
t2 = time()
t_sw = t2 - t1
print('Time used for SW GCD: {}s'.format(t_sw))

Time used for SW GCD: 54.3738739490509s

Compared to the software GCD, the hardware GCD speeds up the calculation by 1.84.

3. Full Step-by-Step Tutorial

This video contains detailed steps for making this project.

4. Conclusion

In this tutorial, we covered a more complex AXI-Lite based IP core creation.

PreviousPart 4: AXI-Stream Multiplier with DMA NextPart 6: AXI-Stream GCD with DMA

Last updated 5 months ago