Part 3: AXI-Lite Multiplier

Objective

This tutorial shows how to create a simple AXI-Lite IP core in Verilog. The IP core performs a simple multiplication. We will then compare the performance of this AXI-Lite multiplier with the AXI-Stream multiplier (with AXI DMA) in the next tutorial.


1. Hardware Design

1.1. RTL Design of Multiplier

This RTL module does a simple multiplication. It has one 32-bit input a and one 32-bit output r. Every input will be multiplied by 8 to produce the output.

mult_core.v
module mult_core
    (
        input wire [31:0]  a,
        output wire [31:0] r    
    );
    
    assign r = a * 8;
    
endmodule
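As a quick software cross-check, the behaviour of mult_core can be modelled in Python. This is a minimal sketch, not part of the original design: the name mult_core_model is hypothetical, and it assumes the standard Verilog sizing rule that the product is truncated to the 32 bits of the output r, so large inputs wrap around.

```python
MASK32 = 0xFFFFFFFF  # r is a 32-bit wire


def mult_core_model(a: int) -> int:
    """Hypothetical software model of mult_core: r = a * 8, kept to 32 bits."""
    return ((a & MASK32) * 8) & MASK32


print(mult_core_model(10))               # 80
print(hex(mult_core_model(0x20000000)))  # 0x0, the upper bits are lost
```

A model like this can serve as the reference when checking values read back from the hardware.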

1.2. AXI-Lite Wrapper

This figure shows the Xilinx Zynq UltraScale+ MPSoC block diagram. The RTL design of the multiplier module is implemented in the programmable logic (PL). The question is how the multiplier module communicates with the ARM processor.

To communicate with the ARM processor, the multiplier module must be connected to the AMBA interconnect through the general-purpose AXI ports, which are based on the AXI4 protocol.

The multiplier itself has no AXI interface, so we need a wrapper module that translates between the bus protocol and the multiplier's I/O. In this tutorial the wrapper implements AXI-Lite (AXI4-Lite), the lightweight subset of AXI4 intended for simple register access.

The AXI-Lite bus that connects the ARM processor and the wrapper module can be modeled as a master-slave connection. The ARM processor is the master, and the wrapper module is the slave. The AXI-Lite bus is a collection of I/O signals that are grouped into five channels:

  • Write address channel

  • Write data channel

  • Write response channel

  • Read address channel

  • Read data channel

The sub-module AXI Write is a state machine that translates the AXI-Lite protocol when the ARM processor wants to write data to the addressable register a. The sub-module AXI Read is a state machine that translates the AXI-Lite protocol when the ARM processor wants to read data from the addressable register r.

Every register in the wrapper module has an address offset. In this design the offsets are spaced 8 bytes apart for Zynq UltraScale+ (0x00 for a, 0x08 for r) and 4 bytes apart for Zynq-7000 (0x00 for a, 0x04 for r).

This is the Verilog code for the wrapper module.

axi_mult.v
module axi_mult
    (
        // ### Clock and reset signals #########################################
        input  wire        aclk,
        input  wire        aresetn,
        // ### AXI4-lite slave signals #########################################
        // *** Write address signals ***
        output wire        s_axi_awready,
        input  wire [31:0] s_axi_awaddr,
        input  wire        s_axi_awvalid,
        // *** Write data signals ***
        output wire        s_axi_wready,
        input  wire [31:0] s_axi_wdata,
        input  wire [3:0]  s_axi_wstrb,
        input  wire        s_axi_wvalid,
        // *** Write response signals ***
        input  wire        s_axi_bready,
        output wire [1:0]  s_axi_bresp,
        output wire        s_axi_bvalid,
        // *** Read address signals ***
        output wire        s_axi_arready,
        input  wire [31:0] s_axi_araddr,
        input  wire        s_axi_arvalid,
        // *** Read data signals ***	
        input  wire        s_axi_rready,
        output wire [31:0] s_axi_rdata,
        output wire [1:0]  s_axi_rresp,
        output wire        s_axi_rvalid
        // ### User signals ####################################################
    );

    // ### Register map ########################################################
    // 0x00: input a
    //       bit 31~0 = A[31:0] (R/W)
    // 0x08: output r (0x04 in the commented-out 32-bit variant below)
    //       bit 31~0 = R[31:0] (R)
    localparam C_ADDR_BITS = 8;
//    // *** Address (32-bit) ***
//    localparam C_ADDR_INPA = 8'h00,
//               C_ADDR_OUTR = 8'h04;
    localparam C_ADDR_INPA = 8'h00,
               C_ADDR_OUTR = 8'h08;
    // *** AXI write FSM ***
    localparam S_WRIDLE = 2'd0,
               S_WRDATA = 2'd1,
               S_WRRESP = 2'd2;
    // *** AXI read FSM ***
    localparam S_RDIDLE = 2'd0,
               S_RDDATA = 2'd1;
    
    // *** AXI write ***
    reg [1:0] wstate_cs, wstate_ns;
    reg [C_ADDR_BITS-1:0] waddr;
    wire [31:0] wmask;
    wire aw_hs, w_hs;
    // *** AXI read ***
    reg [1:0] rstate_cs, rstate_ns;
    wire [C_ADDR_BITS-1:0] raddr;
    reg [31:0] rdata;
    wire ar_hs;
    // *** Control registers ***
    reg [31:0] a_reg;
    wire [31:0] r_w;
    
    // ### AXI write ###########################################################
    assign s_axi_awready = (wstate_cs == S_WRIDLE);
    assign s_axi_wready = (wstate_cs == S_WRDATA);
    assign s_axi_bresp = 2'b00;    // OKAY
    assign s_axi_bvalid = (wstate_cs == S_WRRESP);
    assign wmask = {{8{s_axi_wstrb[3]}}, {8{s_axi_wstrb[2]}}, {8{s_axi_wstrb[1]}}, {8{s_axi_wstrb[0]}}};
    assign aw_hs = s_axi_awvalid & s_axi_awready;
    assign w_hs = s_axi_wvalid & s_axi_wready;

    // *** Write state register ***
    always @(posedge aclk)
    begin
        if (!aresetn)
            wstate_cs <= S_WRIDLE;
        else
            wstate_cs <= wstate_ns;
    end
    
    // *** Write state next ***
    always @(*)
    begin
        case (wstate_cs)
            S_WRIDLE:
                if (s_axi_awvalid)
                    wstate_ns = S_WRDATA;
                else
                    wstate_ns = S_WRIDLE;
            S_WRDATA:
                if (s_axi_wvalid)
                    wstate_ns = S_WRRESP;
                else
                    wstate_ns = S_WRDATA;
            S_WRRESP:
                if (s_axi_bready)
                    wstate_ns = S_WRIDLE;
                else
                    wstate_ns = S_WRRESP;
            default:
                wstate_ns = S_WRIDLE;
        endcase
    end
    
    // *** Write address register ***
    always @(posedge aclk)
    begin
        if (aw_hs)
            waddr <= s_axi_awaddr[C_ADDR_BITS-1:0];
    end
    
    // ### AXI read ############################################################
    assign s_axi_arready = (rstate_cs == S_RDIDLE);
    assign s_axi_rdata = rdata;
    assign s_axi_rresp = 2'b00;    // OKAY
    assign s_axi_rvalid = (rstate_cs == S_RDDATA);
    assign ar_hs = s_axi_arvalid & s_axi_arready;
    assign raddr = s_axi_araddr[C_ADDR_BITS-1:0];
    
    // *** Read state register ***
    always @(posedge aclk)
    begin
        if (!aresetn)
            rstate_cs <= S_RDIDLE;
        else
            rstate_cs <= rstate_ns;
    end

    // *** Read state next ***
    always @(*) 
    begin
        case (rstate_cs)
            S_RDIDLE:
                if (s_axi_arvalid)
                    rstate_ns = S_RDDATA;
                else
                    rstate_ns = S_RDIDLE;
            S_RDDATA:
                if (s_axi_rready)
                    rstate_ns = S_RDIDLE;
                else
                    rstate_ns = S_RDDATA;
            default:
                rstate_ns = S_RDIDLE;
        endcase
    end
    
    // *** Read data register ***
    always @(posedge aclk)
    begin
        if (!aresetn)
            rdata <= 0;
        else if (ar_hs)
            case (raddr)
                C_ADDR_INPA:
                    rdata <= a_reg[31:0];
                C_ADDR_OUTR:
                    rdata <= r_w[31:0];	
            endcase
    end
    
    // ### User design #########################################################
    // *** Register a ***
    always @(posedge aclk)
    begin
        if (!aresetn)
        begin
            a_reg[31:0] <= 0;
        end
        else if (w_hs && waddr == C_ADDR_INPA)
        begin
            a_reg[31:0] <= (s_axi_wdata[31:0] & wmask) | (a_reg[31:0] & ~wmask);
        end
    end

    // *** Multiplier core *** 
    mult_core mult_core_0
    (
        .a(a_reg),
        .r(r_w)
    );

endmodule
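The write FSM in the listing above can be sketched in Python as a next-state function. This is only an illustrative model under stated assumptions: the WState enum and write_next_state are hypothetical names that mirror S_WRIDLE, S_WRDATA, and S_WRRESP, and the example drives one write transaction in which the master asserts awvalid, wvalid, and bready in three consecutive cycles.

```python
from enum import Enum


class WState(Enum):
    IDLE = 0  # mirrors S_WRIDLE: waiting for a write address
    DATA = 1  # mirrors S_WRDATA: waiting for write data
    RESP = 2  # mirrors S_WRRESP: waiting for the master to accept the response


def write_next_state(state, awvalid, wvalid, bready):
    """Combinational next-state logic of the write FSM."""
    if state is WState.IDLE:
        return WState.DATA if awvalid else WState.IDLE
    if state is WState.DATA:
        return WState.RESP if wvalid else WState.DATA
    if state is WState.RESP:
        return WState.IDLE if bready else WState.RESP
    return WState.IDLE


# One write transaction: address, then data, then response acceptance.
state = WState.IDLE
trace = [state.name]
for awvalid, wvalid, bready in [(1, 0, 0), (0, 1, 0), (0, 0, 1)]:
    state = write_next_state(state, awvalid, wvalid, bready)
    trace.append(state.name)

print(trace)  # ['IDLE', 'DATA', 'RESP', 'IDLE']
```

The read FSM follows the same pattern with only two states, S_RDIDLE and S_RDDATA.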

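The following excerpt, lines 165–176 of axi_mult.v, implements register a. The register a_reg is updated when the write handshake signal w_hs is 1 and the latched address waddr equals C_ADDR_INPA (0x00). The byte mask wmask, derived from s_axi_wstrb, ensures that only the byte lanes enabled by the master are overwritten.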

    // *** Register a ***
    always @(posedge aclk)
    begin
        if (!aresetn)
        begin
            a_reg[31:0] <= 0;
        end
        else if (w_hs && waddr == C_ADDR_INPA)
        begin
            a_reg[31:0] <= (s_axi_wdata[31:0] & wmask) | (a_reg[31:0] & ~wmask);
        end
    end
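The byte-strobe merge used for a_reg can be illustrated in Python. This is a sketch for explanation only; wstrb_to_mask and merge_write are hypothetical helper names that mirror the wmask expression and the a_reg update in the wrapper.

```python
def wstrb_to_mask(wstrb: int) -> int:
    """Expand a 4-bit write strobe into a 32-bit mask, one byte lane per bit."""
    mask = 0
    for byte in range(4):
        if (wstrb >> byte) & 1:
            mask |= 0xFF << (8 * byte)
    return mask


def merge_write(old: int, wdata: int, wstrb: int) -> int:
    """New register value: enabled bytes come from wdata, the rest from old."""
    mask = wstrb_to_mask(wstrb)
    return ((wdata & mask) | (old & ~mask)) & 0xFFFFFFFF


print(hex(wstrb_to_mask(0b0011)))                        # 0xffff
print(hex(merge_write(0x11223344, 0xAABBCCDD, 0b0101)))  # 0x11bb33dd
```

With a strobe of 0b1111 the whole word is replaced, which is the case for an ordinary aligned 32-bit write.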

The following excerpt, lines 150–162 of axi_mult.v, implements the AXI read data register. When the read address handshake ar_hs occurs, a case block checks the address raddr and loads rdata with the matching value, either input a or result r.

    // *** Read data register ***
    always @(posedge aclk)
    begin
        if (!aresetn)
            rdata <= 0;
        else if (ar_hs)
            case (raddr)
                C_ADDR_INPA:
                    rdata <= a_reg[31:0];
                C_ADDR_OUTR:
                    rdata <= r_w[31:0];	
            endcase
    end
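The read decode can be modelled in Python in the same spirit. This is a sketch under stated assumptions: next_rdata is a hypothetical name, reset is not modelled, and the offsets are the 0x00 and 0x08 values from the listing. It also makes one detail visible: because the case block has no default branch, a read from an unmapped offset leaves rdata at its previous value.

```python
C_ADDR_INPA = 0x00  # offset of input a
C_ADDR_OUTR = 0x08  # offset of output r


def next_rdata(rdata, ar_hs, raddr, a_reg, r_w):
    """Value of rdata after one rising clock edge (reset not modelled)."""
    if not ar_hs:
        return rdata      # no address handshake: rdata holds
    if raddr == C_ADDR_INPA:
        return a_reg
    if raddr == C_ADDR_OUTR:
        return r_w
    return rdata          # unmapped offset: rdata holds


a_reg, r_w = 10, 80
print(next_rdata(0, True, C_ADDR_OUTR, a_reg, r_w))  # 80
print(next_rdata(80, True, 0x10, a_reg, r_w))        # 80, unchanged
```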

The following excerpt, lines 178–183 of axi_mult.v, instantiates the multiplier module.

    // *** Multiplier core *** 
    mult_core mult_core_0
    (
        .a(a_reg),
        .r(r_w)
    );

1.3. System Design

This diagram shows our system. It consists of an ARM CPU, DRAM, and our AXI-Lite multiplier module. The AXI-Lite module is connected to the ARM CPU via the AXI interconnect.

This is the final block design diagram as shown in Vivado.

We can change the memory-mapped base address of this AXI-Lite multiplier in the Address Editor.

2. Software Design

Our AXI-Lite multiplier module is connected to the ARM CPU. The CPU can access the module using memory mapping. This is done using the MMIO object from the PYNQ library.

# Access to memory map of the AXI multiplier
from pynq import MMIO

ADDR_BASE = 0xA0000000
ADDR_RANGE = 0x80
mult_obj = MMIO(ADDR_BASE, ADDR_RANGE)

To write data to input a of the multiplier module, we use the write() method of the MMIO object. To read data from output r, we use the read() method.

# Write input and read multiplication result
mult_obj.write(0x0, 10)
mult_obj.read(0x8)
80

We can create a function to do a multiplication like this:

# Function to calculate multiplication
def calc_mult_axi_lite(a):
    mult_obj.write(0x0, a)
    r = mult_obj.read(0x8)
    return r
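To try the same access pattern without a board, the MMIO object can be replaced by a small software stand-in. The sketch below is an assumption-laden illustration: FakeMultMMIO is a hypothetical class, not part of PYNQ, that exposes write() and read() with the register offsets used above (0x0 for a, 0x8 for r). As a simplification, reads from unmapped offsets return 0 here.

```python
class FakeMultMMIO:
    """Hypothetical software stand-in for the multiplier's MMIO object."""

    ADDR_A = 0x0  # input a (read/write)
    ADDR_R = 0x8  # output r (read-only)

    def __init__(self):
        self._a = 0

    def write(self, offset, value):
        # Only register a is writable; other offsets are ignored.
        if offset == self.ADDR_A:
            self._a = value & 0xFFFFFFFF

    def read(self, offset):
        if offset == self.ADDR_A:
            return self._a
        if offset == self.ADDR_R:
            return (self._a * 8) & 0xFFFFFFFF  # same operation as mult_core
        return 0  # simplification for unmapped offsets


def calc_mult(mmio, a):
    """Same write-then-read sequence, with the MMIO object passed in."""
    mmio.write(0x0, a)
    return mmio.read(0x8)


print(calc_mult(FakeMultMMIO(), 10))  # 80
```

Passing the MMIO object as a parameter lets the same function run against either the stand-in or the real device.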

We can calculate the time required to do 1 million multiplications. Later, we can compare this with the AXI-Stream multiplier (with AXI DMA) module.

# Measure the time required to calculate 1 million multiplications
from time import time

t1 = time()
for i in range(1000000):
    calc_mult_axi_lite(3578129)
t2 = time()
t_diff = t2 - t1
print('Time used for AXI lite multiplier: {}s'.format(t_diff))
Time used for AXI lite multiplier: 15.767791271209717s
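The total time can be converted into a per-call figure, which is easier to compare across designs. The short calculation below uses only the measurement printed above; the variable names are illustrative. Each call includes one register write and one register read, plus the Python call overhead.

```python
t_total = 15.767791271209717  # seconds, measured above
n_calls = 1_000_000

t_per_call_us = t_total / n_calls * 1e6  # microseconds per multiplication
calls_per_second = n_calls / t_total

print('{:.2f} us per multiplication'.format(t_per_call_us))  # 15.77 us per multiplication
print('{:.0f} multiplications per second'.format(calls_per_second))
```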

3. Full Step-by-Step Tutorial

This video contains detailed steps for making this project.

4. Conclusion

In this tutorial, we covered some of the basics of AXI-Lite based IP core creation.
