This tutorial explains how to use the Block Memory Generator. As an example, we use the PE module for testing. The input and output data for this module are stored in BRAM.
Source Code
This repository contains all of the code required in order to follow this tutorial.
The Block Memory Generator is an IP core that configures the dedicated block RAM (BRAM) on the FPGA. Because BRAM is dedicated memory, it does not consume flip-flop or LUT resources. The core has two fully independent ports, A and B, that access a shared memory space. Each port has its own write and read interface.
Block memory has a limited size. Even on the high-end Zynq chip, the size is only 38.0 Mb (4.75 MB). On the Z7010, it is only 2.1 Mb (0.2625 MB).
Block memory can be added to a design either through the block design (GUI) or in Verilog/VHDL code using Xilinx Parameterized Macros (XPM).
Block memory has two operating modes as shown in Figure 1.
BRAM controller
Stand Alone
The Block Memory Generator core uses embedded block RAM to generate five types of memories as shown in Figure 2.
Single-port RAM
Simple Dual-port RAM
True Dual-port RAM
Single-port ROM
Dual-port ROM
In this tutorial, we are going to use standalone mode and true dual-port RAM type.
1.2. BRAM Controller Mode
In this mode, the block memory is used together with the AXI BRAM Controller IP, and most of the block memory settings are grayed out, as shown in Figure 3.
The size of the memory is configured in the Address Editor entry of the AXI BRAM Controller IP rather than in the block memory configuration wizard.
The relationship between address range and data depth (32-bit or 64-bit) is shown in the table below.
Range   Depth (32-bit)   Depth (64-bit)
4K      1024             512
8K      2048             1024
16K     4096             2048
1.3. BRAM Standalone Mode
In this mode, we can change the block memory data width and size, as well as other settings. A standalone block memory is suited to internal use within an RTL module, where it is not connected directly to the PS.
1.4. BRAM Timing Diagram
The following figure shows the BRAM write timing diagram for the BRAM controller mode. The address increments by 4 for each word because it is a byte address and each 32-bit word occupies four bytes. Individual bytes within a word can be written separately, as selected by the we (write enable) signal, which has one bit per byte.
The following figure shows the BRAM read timing diagram for the BRAM controller mode. The latency from address to output data is one clock cycle.
The reset type for BRAM is active-high.
1.5. Accessing BRAM from PS
From the software side, we can write and read data to and from the BRAM with simple memory-mapped accesses. We initialize a pointer mem_p to the base address of the BRAM. Then, we use this pointer to write and read the data.
#include <stdio.h>
#include <stdint.h>

#define MEM_BASE 0x40000000

// volatile keeps the compiler from optimizing away accesses to the memory-mapped BRAM
volatile uint32_t *mem_p;

int main()
{
    mem_p = (volatile uint32_t *)MEM_BASE;

    // Write to block memory (five 32-bit words)
    for (int i = 0; i <= 4; i++)
        *(mem_p+i) = 0xFFFFFFFF;

    // Read from block memory and print each word in hexadecimal
    for (int i = 0; i <= 4; i++)
        printf("0x%08X\n", (unsigned int)*(mem_p+i));

    return 0;
}
2. PE Module
2.1. Control and Datapath
The following figure shows the PE module. It consists of a multiplier and an adder, and it is purely combinational.
Next, we design the top module for this PE module, as shown in the following figure. We add pipeline registers to the input and output of the PE module. Then, we add a control unit, which is implemented as a counter. The control unit drives the BRAM input and output ports. For the start signal, we use a rising edge detector that is implemented with a register, a NOT gate, and an AND gate.
This is the Verilog implementation of the PE top module.
pe_top.v
`timescale 1ns / 1ps

module pe_top
(
    input wire clk,
    input wire rst_n,
    // *** Control and status port ***
    output wire ready,
    input wire start,
    // *** Data input port ***
    (* X_INTERFACE_INFO = "xilinx.com:interface:bram:1.0 BRAM_PORTA CLK" *) output wire clka,
    (* X_INTERFACE_INFO = "xilinx.com:interface:bram:1.0 BRAM_PORTA RST" *) output wire rsta,
    (* X_INTERFACE_INFO = "xilinx.com:interface:bram:1.0 BRAM_PORTA EN" *) output wire ena,
    (* X_INTERFACE_INFO = "xilinx.com:interface:bram:1.0 BRAM_PORTA ADDR" *) output wire [31:0] addra,
    (* X_INTERFACE_INFO = "xilinx.com:interface:bram:1.0 BRAM_PORTA DIN" *) output wire [31:0] dina,
    (* X_INTERFACE_INFO = "xilinx.com:interface:bram:1.0 BRAM_PORTA WE" *) output wire [3:0] wea,
    (* X_INTERFACE_INFO = "xilinx.com:interface:bram:1.0 BRAM_PORTA DOUT" *) input wire [31:0] douta,
    // *** Data output port ***
    (* X_INTERFACE_INFO = "xilinx.com:interface:bram:1.0 BRAM_PORTB CLK" *) output wire clkb,
    (* X_INTERFACE_INFO = "xilinx.com:interface:bram:1.0 BRAM_PORTB RST" *) output wire rstb,
    (* X_INTERFACE_INFO = "xilinx.com:interface:bram:1.0 BRAM_PORTB EN" *) output wire enb,
    (* X_INTERFACE_INFO = "xilinx.com:interface:bram:1.0 BRAM_PORTB ADDR" *) output wire [31:0] addrb,
    (* X_INTERFACE_INFO = "xilinx.com:interface:bram:1.0 BRAM_PORTB DIN" *) output wire [31:0] dinb,
    (* X_INTERFACE_INFO = "xilinx.com:interface:bram:1.0 BRAM_PORTB WE" *) output wire [3:0] web,
    (* X_INTERFACE_INFO = "xilinx.com:interface:bram:1.0 BRAM_PORTB DOUT" *) input wire [31:0] doutb
);
    // Main controller
    wire start_reg, start_rising;
    reg [5:0] cnt_main_reg;
    reg [31:0] cnt_addra_reg, cnt_addrb_reg;
    wire start_cnt_addra, start_cnt_addrb;
    // PE datapath
    wire signed [7:0] a_in, a_in_reg;
    wire signed [7:0] y_in, y_in_reg;
    wire signed [7:0] b_in, b_in_reg;
    wire signed [7:0] y_out, y_out_reg;

    // *** Main controller ******************************************************
    register #(1) reg_start_0(clk, rst_n, 1'b1, 1'b0, start, start_reg);
    assign start_rising = start & ~start_reg;

    // Counter for FSM
    always @(posedge clk)
    begin
        if (!rst_n)
        begin
            cnt_main_reg <= 0;
        end
        else if (start_rising)
        begin
            cnt_main_reg <= cnt_main_reg + 1;
        end
        else if (cnt_main_reg >= 1 && cnt_main_reg <= 10)
        begin
            cnt_main_reg <= cnt_main_reg + 1;
        end
        else if (cnt_main_reg >= 11)
        begin
            cnt_main_reg <= 0;
        end
    end

    // Input port control
    assign clka = clk;
    assign rsta = ~rst_n;
    assign ena = ((cnt_main_reg >= 1) && (cnt_main_reg <= 8)) ? 1 : 0;
    assign addra = cnt_addra_reg;
    assign dina = 0;
    assign wea = 0;

    // Address counter for BRAM input
    always @(posedge clk)
    begin
        if (!rst_n)
        begin
            cnt_addra_reg <= 0;
        end
        else if (start_cnt_addra)
        begin
            cnt_addra_reg <= cnt_addra_reg + 4;
        end
        else if (cnt_addra_reg >= 32'h00000004 && cnt_addra_reg <= 32'h00000018)
        begin
            cnt_addra_reg <= cnt_addra_reg + 4;
        end
        else if (cnt_addra_reg >= 32'h0000001c)
        begin
            cnt_addra_reg <= 0;
        end
    end
    assign start_cnt_addra = (cnt_main_reg == 1);

    // Output port control
    assign clkb = clk;
    assign rstb = ~rst_n;
    assign enb = ((cnt_main_reg >= 4) && (cnt_main_reg <= 11)) ? 1 : 0;
    assign addrb = cnt_addrb_reg;
    assign dinb = y_out_reg;
    assign web = ((cnt_main_reg >= 4) && (cnt_main_reg <= 11)) ? 4'hf : 4'h0;

    // Address counter for BRAM output
    always @(posedge clk)
    begin
        if (!rst_n)
        begin
            cnt_addrb_reg <= 0;
        end
        else if (start_cnt_addrb)
        begin
            cnt_addrb_reg <= cnt_addrb_reg + 4;
        end
        else if (cnt_addrb_reg >= 32'h00000004 && cnt_addrb_reg <= 32'h00000018)
        begin
            cnt_addrb_reg <= cnt_addrb_reg + 4;
        end
        else if (cnt_addrb_reg >= 32'h0000001c)
        begin
            cnt_addrb_reg <= 0;
        end
    end
    assign start_cnt_addrb = (cnt_main_reg == 4);

    // Status control
    assign ready = (cnt_main_reg == 0) ? 1 : 0;

    // *** PE datapath **********************************************************
    // Input from BRAM
    assign a_in = douta[7:0];
    assign b_in = douta[15:8];
    assign y_in = douta[23:16];

    // Pipeline registers for input
    register #(8) reg_0(clk, rst_n, 1'b1, 1'b0, a_in, a_in_reg);
    register #(8) reg_1(clk, rst_n, 1'b1, 1'b0, y_in, y_in_reg);
    register #(8) reg_2(clk, rst_n, 1'b1, 1'b0, b_in, b_in_reg);

    // PE
    pe #(8, 0) pe_0
    (
        .a_in(a_in_reg),
        .y_in(y_in_reg),
        .b(b_in_reg),
        .a_out(),
        .y_out(y_out)
    );

    // Pipeline registers for output
    register #(8) reg_3(clk, rst_n, 1'b1, 1'b0, y_out, y_out_reg);

endmodule
2.2. Testbench
Now that we have the PE top module, we can test it with a testbench. The block design under test is shown in the following figure. The PE top module is connected to the input BRAM and the output BRAM.
This is the Verilog testbench for the PE top module.
The following figure shows the timing diagram of the PE top module. First, the data is written to the input BRAM. Then, the module starts on the rising edge of the start signal. During the computation, the ready signal is zero, indicating that the PE top module is busy. Finally, the PE result is stored in the output BRAM.
3. System Design
The following figure shows the overall SoC system design for the PE top module. We use the AXI BRAM Controller IP to connect the PS to the BRAM, and AXI GPIO for the control and status signals (start and ready).
The following figure shows the block design in Vivado.
4. Software Design
For the software design, we first write the input data to the input BRAM. Then, we start the PE module by writing 1 to the first AXI GPIO channel (data register at offset 0x0). After that, we poll the ready signal on the second AXI GPIO channel (data register at offset 0x8) and wait until it is 1. Finally, we read the output data from the output BRAM.
helloworld.c
#include <stdio.h>
#include <stdint.h>

#define MEM_INP_BASE 0x40000000
#define MEM_GPIO_BASE 0x41200000
#define MEM_OUT_BASE 0x42000000

// volatile: these point to memory-mapped hardware, so accesses must not be optimized away
volatile uint32_t *mem_inp_p, *mem_gpio_p, *mem_out_p;

int main()
{
    // Initialization
    mem_inp_p = (volatile uint32_t *)MEM_INP_BASE;
    mem_gpio_p = (volatile uint32_t *)MEM_GPIO_BASE;
    mem_out_p = (volatile uint32_t *)MEM_OUT_BASE;

    // Write input: y in bits [23:16], b in bits [15:8], a in bits [7:0]
    printf("Input:\n");
    for (int i = 0; i <= 7; i++)
    {
        uint8_t a = i + 1;
        uint8_t b = 8 - i;
        uint8_t y = i + 1;
        *(mem_inp_p+i) = ((uint32_t)y << 16) | ((uint32_t)b << 8) | a;
        printf(" a=%d, b=%d, y=%d\n", a, b, y);
    }

    // Start module: pulse the start bit (word offset 0 = byte offset 0x0)
    *(mem_gpio_p+0) = 0x1;
    *(mem_gpio_p+0) = 0x0;

    // Wait until ready (word offset 2 = byte offset 0x8)
    while (!(*(mem_gpio_p+2) & (1 << 0)));

    // Read output
    printf("Output:\n");
    for (int i = 0; i <= 7; i++)
        printf(" %ld\n", (long)(int32_t)*(mem_out_p+i));

    return 0;
}
5. Result
The following figure shows the result on the serial terminal. We can verify the result manually.
6. Conclusion
In this tutorial, we covered how to use BRAM, using a PE module as an example.