Part 3: AXI-Lite Multiplier
Objective
This tutorial shows how to create a simple AXI-Lite IP core in Verilog. The IP core performs a simple multiplication. In the next tutorial, we will compare the performance of this AXI-Lite multiplier with an AXI-Stream multiplier (with AXI DMA).
Source Code
This repository contains all of the code required in order to follow this tutorial.
1. Hardware Design
1.1. RTL Design of Multiplier
This RTL module does a simple multiplication. It has one 32-bit input a and one 32-bit output r. Every input is multiplied by 8 to produce the output.
module mult_core
    (
        input wire [31:0]  a,
        output wire [31:0] r
    );
    assign r = a * 8;
endmodule
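Because r is only 32 bits wide, the product wraps modulo 2^32, and multiplying by 8 is equivalent to shifting left by 3 bits. The following is a minimal Python sketch of that behavior for unsigned inputs; the name mult_core_model is hypothetical and is not part of the repository.

```python
MASK32 = 0xFFFFFFFF  # r is a 32-bit wire, so results wrap modulo 2**32


def mult_core_model(a: int) -> int:
    """Behavioral model of `assign r = a * 8;` for a 32-bit unsigned input."""
    return ((a & MASK32) * 8) & MASK32


print(mult_core_model(10))          # 80
print(mult_core_model(0x20000000))  # 0, the upper bits fall off the 32-bit output
```

A model like this is handy for predicting what the hardware should return before running on the board.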
1.2. AXI-Lite Wrapper
This figure shows the Xilinx Zynq UltraScale+ MPSoC block diagram. The RTL design of the multiplier module will be implemented inside the programmable logic. The question now is: how does the multiplier module communicate with the ARM processor?

To communicate with the ARM processor, the multiplier module must be connected to the AMBA interconnect through the General-Purpose AXI ports, which are based on the AXI4 protocol.
The multiplier has no AXI interface of its own, so we need a wrapper module that translates the AXI-Lite protocol to the multiplier's I/O.

The AXI-Lite bus that connects the ARM processor and the wrapper module can be modeled as a master-slave connection. The ARM processor is the master, and the wrapper module is the slave. The AXI-Lite bus is a collection of I/O signals grouped into five channels:
Write address channel
Write data channel
Write response channel
Read address channel
Read data channel

The AXI Write sub-module is a state machine that handles the AXI-Lite protocol when the ARM processor writes data to the addressable register a. The AXI Read sub-module is a state machine that handles the AXI-Lite protocol when the ARM processor reads data from the addressable register r.
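The write path can be pictured as a three-state machine: wait for an address, wait for data, then wait for the master to accept the response. The following is a minimal Python sketch of the next-state logic used by the write FSM in the wrapper listing in this section; the names WState and write_fsm_next are hypothetical, and the sketch models only state transitions, not signal timing.

```python
from enum import Enum


class WState(Enum):
    IDLE = 0  # awready is high: waiting for a write address
    DATA = 1  # wready is high: waiting for write data
    RESP = 2  # bvalid is high: waiting for the master to accept the response


def write_fsm_next(state, awvalid, wvalid, bready):
    """Return the next write state given the master's valid/ready inputs."""
    if state is WState.IDLE:
        return WState.DATA if awvalid else WState.IDLE
    if state is WState.DATA:
        return WState.RESP if wvalid else WState.DATA
    if state is WState.RESP:
        return WState.IDLE if bready else WState.RESP
    return WState.IDLE


# Walk through one complete write transaction, one clock edge per tuple.
state = WState.IDLE
trace = [state.name]
for awvalid, wvalid, bready in [(1, 0, 0), (0, 1, 0), (0, 0, 1)]:
    state = write_fsm_next(state, awvalid, wvalid, bready)
    trace.append(state.name)
print(trace)  # ['IDLE', 'DATA', 'RESP', 'IDLE']
```

The read FSM follows the same pattern with only two states, since a read needs just an address phase and a data phase.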
Every register in the wrapper module has an address. In this design, consecutive registers are spaced 8 bytes apart on Zynq UltraScale+ and 4 bytes apart on Zynq-7000.
This is the Verilog code for the wrapper module.
module axi_mult
    (
        // ### Clock and reset signals #########################################
        input wire         aclk,
        input wire         aresetn,
        // ### AXI4-lite slave signals #########################################
        // *** Write address signals ***
        output wire        s_axi_awready,
        input wire [31:0]  s_axi_awaddr,
        input wire         s_axi_awvalid,
        // *** Write data signals ***
        output wire        s_axi_wready,
        input wire [31:0]  s_axi_wdata,
        input wire [3:0]   s_axi_wstrb,
        input wire         s_axi_wvalid,
        // *** Write response signals ***
        input wire         s_axi_bready,
        output wire [1:0]  s_axi_bresp,
        output wire        s_axi_bvalid,
        // *** Read address signals ***
        output wire        s_axi_arready,
        input wire [31:0]  s_axi_araddr,
        input wire         s_axi_arvalid,
        // *** Read data signals ***
        input wire         s_axi_rready,
        output wire [31:0] s_axi_rdata,
        output wire [1:0]  s_axi_rresp,
        output wire        s_axi_rvalid
        // ### User signals ####################################################
    );
    // ### Register map ########################################################
    // 0x00: input a
    //       bit 31~0 = A[31:0] (R/W)
    // 0x08: output r (0x04 with the 32-bit address map commented out below)
    //       bit 31~0 = R[31:0] (R)
    localparam C_ADDR_BITS = 8;
    // // *** Address (32-bit) ***
    // localparam C_ADDR_INPA = 8'h00,
    //            C_ADDR_OUTR = 8'h04;
    localparam C_ADDR_INPA = 8'h00,
               C_ADDR_OUTR = 8'h08;
    // *** AXI write FSM ***
    localparam S_WRIDLE = 2'd0,
               S_WRDATA = 2'd1,
               S_WRRESP = 2'd2;
    // *** AXI read FSM ***
    localparam S_RDIDLE = 2'd0,
               S_RDDATA = 2'd1;
    // *** AXI write ***
    reg [1:0] wstate_cs, wstate_ns;
    reg [C_ADDR_BITS-1:0] waddr;
    wire [31:0] wmask;
    wire aw_hs, w_hs;
    // *** AXI read ***
    reg [1:0] rstate_cs, rstate_ns;
    wire [C_ADDR_BITS-1:0] raddr;
    reg [31:0] rdata;
    wire ar_hs;
    // *** Control registers ***
    reg [31:0] a_reg;
    wire [31:0] r_w;
    // ### AXI write ###########################################################
    assign s_axi_awready = (wstate_cs == S_WRIDLE);
    assign s_axi_wready = (wstate_cs == S_WRDATA);
    assign s_axi_bresp = 2'b00; // OKAY
    assign s_axi_bvalid = (wstate_cs == S_WRRESP);
    assign wmask = {{8{s_axi_wstrb[3]}}, {8{s_axi_wstrb[2]}}, {8{s_axi_wstrb[1]}}, {8{s_axi_wstrb[0]}}};
    assign aw_hs = s_axi_awvalid & s_axi_awready;
    assign w_hs = s_axi_wvalid & s_axi_wready;
    // *** Write state register ***
    always @(posedge aclk)
    begin
        if (!aresetn)
            wstate_cs <= S_WRIDLE;
        else
            wstate_cs <= wstate_ns;
    end
    // *** Write state next ***
    always @(*)
    begin
        case (wstate_cs)
            S_WRIDLE:
                if (s_axi_awvalid)
                    wstate_ns = S_WRDATA;
                else
                    wstate_ns = S_WRIDLE;
            S_WRDATA:
                if (s_axi_wvalid)
                    wstate_ns = S_WRRESP;
                else
                    wstate_ns = S_WRDATA;
            S_WRRESP:
                if (s_axi_bready)
                    wstate_ns = S_WRIDLE;
                else
                    wstate_ns = S_WRRESP;
            default:
                wstate_ns = S_WRIDLE;
        endcase
    end
    // *** Write address register ***
    always @(posedge aclk)
    begin
        if (aw_hs)
            waddr <= s_axi_awaddr[C_ADDR_BITS-1:0];
    end
    // ### AXI read ############################################################
    assign s_axi_arready = (rstate_cs == S_RDIDLE);
    assign s_axi_rdata = rdata;
    assign s_axi_rresp = 2'b00; // OKAY
    assign s_axi_rvalid = (rstate_cs == S_RDDATA);
    assign ar_hs = s_axi_arvalid & s_axi_arready;
    assign raddr = s_axi_araddr[C_ADDR_BITS-1:0];
    // *** Read state register ***
    always @(posedge aclk)
    begin
        if (!aresetn)
            rstate_cs <= S_RDIDLE;
        else
            rstate_cs <= rstate_ns;
    end
    // *** Read state next ***
    always @(*)
    begin
        case (rstate_cs)
            S_RDIDLE:
                if (s_axi_arvalid)
                    rstate_ns = S_RDDATA;
                else
                    rstate_ns = S_RDIDLE;
            S_RDDATA:
                if (s_axi_rready)
                    rstate_ns = S_RDIDLE;
                else
                    rstate_ns = S_RDDATA;
            default:
                rstate_ns = S_RDIDLE;
        endcase
    end
    // *** Read data register ***
    always @(posedge aclk)
    begin
        if (!aresetn)
            rdata <= 0;
        else if (ar_hs)
            case (raddr)
                C_ADDR_INPA:
                    rdata <= a_reg[31:0];
                C_ADDR_OUTR:
                    rdata <= r_w[31:0];
            endcase
    end
    // ### User design #########################################################
    // *** Register a ***
    always @(posedge aclk)
    begin
        if (!aresetn)
        begin
            a_reg[31:0] <= 0;
        end
        else if (w_hs && waddr == C_ADDR_INPA)
        begin
            a_reg[31:0] <= (s_axi_wdata[31:0] & wmask) | (a_reg[31:0] & ~wmask);
        end
    end
    // *** Multiplier core ***
    mult_core mult_core_0
    (
        .a(a_reg),
        .r(r_w)
    );
endmodule
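This part of the code is taken from lines 165–176. This is the Verilog implementation of register a. The register a_reg is updated when the handshake signal w_hs is 1 and the address waddr equals C_ADDR_INPA (0x00). Only the bytes selected by the write strobe s_axi_wstrb are overwritten; the other bytes keep their previous value.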
// *** Register a ***
always @(posedge aclk)
begin
    if (!aresetn)
    begin
        a_reg[31:0] <= 0;
    end
    else if (w_hs && waddr == C_ADDR_INPA)
    begin
        a_reg[31:0] <= (s_axi_wdata[31:0] & wmask) | (a_reg[31:0] & ~wmask);
    end
end
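The masked update of a_reg can be checked with a small software model. The sketch below mirrors the wmask expansion and the merge expression from the wrapper code; the helper names expand_wstrb and merge_write are hypothetical.

```python
def expand_wstrb(wstrb: int) -> int:
    """Expand a 4-bit byte strobe into a 32-bit mask (one 0xFF per enabled byte)."""
    mask = 0
    for byte in range(4):
        if (wstrb >> byte) & 1:
            mask |= 0xFF << (8 * byte)
    return mask


def merge_write(old: int, wdata: int, wstrb: int) -> int:
    """Model of: a_reg <= (wdata & wmask) | (a_reg & ~wmask)."""
    wmask = expand_wstrb(wstrb)
    return ((wdata & wmask) | (old & ~wmask)) & 0xFFFFFFFF


# Only the two low bytes are enabled, so the two high bytes of the old value survive.
print(hex(merge_write(0x11223344, 0xAABBCCDD, 0b0011)))  # 0x1122ccdd
```

A full 32-bit write corresponds to a strobe of 0b1111, in which case the old value is replaced entirely.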
This part of the code is taken from lines 150–163. This is the Verilog implementation of the AXI read register. The register's address raddr is checked using a case block, and then the register rdata is loaded with the appropriate data, either input a or result r.
// *** Read data register ***
always @(posedge aclk)
begin
    if (!aresetn)
        rdata <= 0;
    else if (ar_hs)
        case (raddr)
            C_ADDR_INPA:
                rdata <= a_reg[31:0];
            C_ADDR_OUTR:
                rdata <= r_w[31:0];
        endcase
end
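The address decode can be modeled in a few lines. The sketch below assumes the 8-byte register spacing used in the wrapper (C_ADDR_OUTR = 0x08) and keeps only the low C_ADDR_BITS = 8 bits of the address, as the Verilog does; the function name read_decode is hypothetical. Note that the case block has no default branch, so an unmapped address leaves rdata unchanged.

```python
C_ADDR_INPA = 0x00  # register a
C_ADDR_OUTR = 0x08  # register r (8-byte spacing, as in the wrapper code)


def read_decode(araddr: int, a_reg: int, r_w: int, rdata_prev: int) -> int:
    """Return the value loaded into rdata on a read address handshake."""
    raddr = araddr & 0xFF  # only the low 8 address bits are decoded
    if raddr == C_ADDR_INPA:
        return a_reg
    if raddr == C_ADDR_OUTR:
        return r_w
    return rdata_prev  # no default case in the Verilog: rdata holds its value


print(read_decode(0xA0000008, a_reg=10, r_w=80, rdata_prev=0))  # 80
```

Because only the low address bits are decoded, the base address assigned to the IP does not affect which register is selected.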
This part of the code is taken from lines 178–183. This is the instantiation of the multiplier module.
// *** Multiplier core ***
mult_core mult_core_0
(
    .a(a_reg),
    .r(r_w)
);
1.3. System Design
This diagram shows our system. It consists of an ARM CPU, DRAM, and our AXI-Lite multiplier module. The AXI-Lite module is connected to the ARM CPU via the AXI interconnect.

This is the final block design diagram as shown in Vivado.

We can change the memory-mapped base address of this AXI-Lite multiplier in the Address Editor:

2. Software Design
Our AXI-Lite multiplier module is connected to the ARM CPU. The CPU can access the module using memory mapping. This is done using the MMIO object from the PYNQ library.
from pynq import MMIO

# Access to memory map of the AXI multiplier
ADDR_BASE = 0xA0000000
ADDR_RANGE = 0x80
mult_obj = MMIO(ADDR_BASE, ADDR_RANGE)
To write data to input a of the multiplier module, we can use the write() method of the MMIO object. To read data from output r of the multiplier module, we can use the read() method of the MMIO object.
# Write input and read multiplication result
mult_obj.write(0x0, 10)
mult_obj.read(0x8)
80
We can create a function to do a multiplication like this:
# Function to calculate multiplication
def calc_mult_axi_lite(a):
    mult_obj.write(0x0, a)
    r = mult_obj.read(0x8)
    return r
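To try this write-then-read flow without a board, the MMIO object can be replaced by a software stand-in. The sketch below is an illustration only: FakeMultMMIO is a hypothetical class that is not part of PYNQ, and calc_mult is a variant of the function above that takes the MMIO-like object as a parameter. Returning 0 for unmapped offsets is a simplification of the hardware behavior.

```python
class FakeMultMMIO:
    """Hypothetical software stand-in for the multiplier's register map."""

    def __init__(self):
        self.a_reg = 0  # models register a at offset 0x0

    def write(self, offset, value):
        if offset == 0x0:
            self.a_reg = value & 0xFFFFFFFF

    def read(self, offset):
        if offset == 0x0:
            return self.a_reg
        if offset == 0x8:
            return (self.a_reg * 8) & 0xFFFFFFFF  # models r = a * 8 on 32 bits
        return 0  # simplification: unmapped offsets read as 0 here


def calc_mult(mmio, a):
    """Same write-then-read sequence, with the MMIO-like object passed in."""
    mmio.write(0x0, a)
    return mmio.read(0x8)


fake = FakeMultMMIO()
print(calc_mult(fake, 10))  # 80
```

Passing the object as an argument keeps the same function usable with the real PYNQ MMIO object on the board and with the stand-in on a development machine.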
We can calculate the time required to do 1 million multiplications. Later, we can compare this with the AXI-Stream multiplier (with AXI DMA) module.
from time import time

# Measure the time required to calculate 1 million multiplications
t1 = time()
for i in range(1000000):
    calc_mult_axi_lite(3578129)
t2 = time()
t_diff = t2 - t1
print('Time used for AXI lite multiplier: {}s'.format(t_diff))
Time used for AXI lite multiplier: 15.767791271209717s
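The measured total can be turned into a per-call figure, which is easier to compare against other interfaces. The sketch below simply reuses the time printed above; each call includes one register write and one register read, and the measurement does not separate Python overhead from bus latency.

```python
t_diff = 15.767791271209717  # seconds, taken from the run above
n_ops = 1_000_000            # number of multiplications in the loop

per_op_us = t_diff / n_ops * 1e6  # average time per multiplication in microseconds
ops_per_s = n_ops / t_diff        # average multiplications per second

print('{:.2f} us per multiplication'.format(per_op_us))
print('{:.0f} multiplications per second'.format(ops_per_s))
```

This works out to roughly 16 microseconds per multiplication, or on the order of 60,000 multiplications per second.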
3. Full Step-by-Step Tutorial
This video contains detailed steps for making this project.
4. Conclusion
In this tutorial, we covered some of the basics of AXI-Lite based IP core creation.