This tutorial explains how to use the Block Memory Generator. As an example, we use a PE module for testing. The input and output data for this module are stored in BRAM.
Source Code
This repository contains all of the code required in order to follow this tutorial.
The Block Memory Generator is an IP core that configures the dedicated block RAM (BRAM) on the FPGA. Because BRAM is a dedicated resource, it does not consume flip-flop or LUT resources. The core has two fully independent ports that access a shared memory space. Both port A and port B have a write and a read interface.
Block memory is limited in size. Even on a high-end Zynq chip, the size is only 38.0 Mb (4.75 MB). On the Z7010, it is only 2.1 Mb (0.2625 MB).
Block memory can be added to the design using block design (GUI) or with Verilog/VHDL (Xilinx Parameterized Macros, XPM) code.
Block memory has two operating modes as shown in Figure 1.
BRAM controller
Stand Alone
The Block Memory Generator core uses embedded block RAM to generate five types of memories as shown in Figure 2.
Single-port RAM
Simple Dual-port RAM
True Dual-port RAM
Single-port ROM
Dual-port ROM
In this tutorial, we are going to use standalone mode and true dual-port RAM type.
1.2. BRAM Controller Mode
In this mode, the block memory should be used together with the AXI BRAM controller IP. In this mode, most of the block memory settings are grayed out as shown in Figure 3.
The size of the memory can be configured from Address Editor of the AXI BRAM Controller IP, instead of the block memory configuration wizard.
The relationship between address range and data depth (32-bit or 64-bit) is shown in the table below.
| Range | Depth (32-bit) | Depth (64-bit) |
|-------|----------------|----------------|
| 4K    | 1024           | 512            |
| 8K    | 2048           | 1024           |
| 16K   | 4096           | 2048           |
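The depth values follow from dividing the address range in bytes by the number of bytes per word. A minimal sketch of that arithmetic in C is shown here; the function name `bram_depth` is hypothetical and used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of words that fit in an address range.
 * range_bytes: size of the address range in bytes (e.g. 4K = 4096)
 * width_bits:  data width of the memory port (32 or 64 in the table) */
static uint32_t bram_depth(uint32_t range_bytes, uint32_t width_bits)
{
    assert(width_bits != 0 && width_bits % 8 == 0); /* whole bytes per word */
    return range_bytes / (width_bits / 8);
}
```

For example, `bram_depth(8192, 32)` evaluates to 2048, which matches the 8K row for a 32-bit data width.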
1.3. BRAM Standalone Mode
In this mode, we can change the block memory data width and size, as well as other settings. This mode is suited to memory that is used internally by our RTL module and is not connected directly to the PS.
1.4. BRAM Timing Diagram
The following figure shows the BRAM write timing diagram for the BRAM controller mode. The address increments by 4 on each write because it is a byte address and each data word is 32 bits (4 bytes) wide. Individual bytes of a word can be written separately, as selected by the we (write enable) signal, which has one bit per byte.
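To make the byte address and the byte-wise we signal more concrete, here is a small software model of one write. It is a sketch and not the actual hardware: it assumes a 32-bit data width and a 4-bit we in which bit n enables byte n of the word, and the function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Byte address of the i-th 32-bit word: 0, 4, 8, ... */
static uint32_t word_to_byte_addr(uint32_t word_index)
{
    return word_index * 4u;
}

/* Model of one write with per-byte enables (assumed: we bit n -> byte n).
 * Only the bytes whose enable bit is set take the new data. */
static uint32_t bram_write_model(uint32_t old_word, uint32_t new_word, uint8_t we)
{
    uint32_t mask = 0;
    assert(we <= 0xF); /* 4 enable bits for a 32-bit word */
    for (int byte = 0; byte < 4; byte++) {
        if (we & (1u << byte))
            mask |= 0xFFu << (8 * byte);
    }
    return (old_word & ~mask) | (new_word & mask);
}
```

With we = 0xF the whole word is replaced. With we = 0x1 only the lowest byte changes, so writing 0xAABBCCDD over 0x11223344 yields 0x112233DD in this model.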
The following figure shows the BRAM read timing diagram for the BRAM controller mode. The read latency, from address to data output, is one clock cycle.
The reset type for BRAM is active-high.
1.5. Accessing BRAM from PS
From the software side, we can write and read data to and from BRAM using simple memory-mapped access. We initialize a pointer mem_p to the base address of the BRAM. Then, we use this pointer to write and read the data.
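#include <stdio.h>
#include <stdint.h>
// Base address of the BRAM (must match the address assigned in Vivado)
#define MEM_BASE 0x40000000
// volatile: every access must actually reach the memory-mapped BRAM
volatile uint32_t *mem_p;
int main()
{
    mem_p = (volatile uint32_t *)MEM_BASE;
    // Write to block memory (five 32-bit words)
    for (int i = 0; i <= 4; i++)
        *(mem_p+i) = 0xFFFFFFFF;
    // Read from block memory and print as unsigned values
    for (int i = 0; i <= 4; i++)
        printf("%u\n", (unsigned int)*(mem_p+i));
    return 0;
}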
2. PE Module
2.1. Control and Datapath
The following figure shows the PE module. It consists of a multiplier and an adder. The circuit is purely combinational.
Next, we design the top module for this PE module, as shown in the following figure. We add pipeline registers to the input and output of the PE module. Then, we add a control unit, which is implemented as a counter. This control unit drives the input and output BRAM interfaces. For the start signal, we use a rising-edge detector that is implemented with a register, a NOT gate, and an AND gate.
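The rising-edge detector can be modeled in software to check its logic. The following sketch is a C model and not the tutorial's Verilog; the struct and function names are hypothetical. The register holds the value of start from the previous clock, and the output is start AND (NOT previous start), so it is 1 for exactly one clock cycle after a 0-to-1 transition.

```c
#include <assert.h>
#include <stdint.h>

/* State of the detector: the register that delays start by one clock. */
typedef struct {
    uint8_t start_q; /* start as seen on the previous clock edge */
} edge_detector_t;

/* Evaluate one clock cycle. Returns the one-cycle start pulse. */
static uint8_t edge_detector_clock(edge_detector_t *d, uint8_t start)
{
    assert(start <= 1);                      /* single-bit signal */
    uint8_t not_prev = (uint8_t)(~d->start_q & 1u); /* NOT gate   */
    uint8_t pulse = start & not_prev;        /* AND gate          */
    d->start_q = start;                      /* register update   */
    return pulse;
}
```

In this model, holding start at 1 for many clock cycles still produces a single pulse, which is the reason for using an edge detector on a signal that software sets and clears.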
This is the Verilog implementation of the PE top module.
Now that we have the PE top module, we can test it using a testbench file. The block design that we are going to test is shown in the following figure. The PE top module is connected to the BRAM input and output.
This is the Verilog testbench for the PE top module.
The following figure shows the timing diagram of the PE top module. First, the data is written to the BRAM input. Then, the module starts when the start signal is one. For the duration of the computation, the ready signal is zero, indicating that the PE top module is busy. Finally, the PE result is stored in the output BRAM.
3. System Design
The following figure shows the overall SoC system design for the PE top module. We use the AXI BRAM controller IP to connect the PS and the BRAM. We use AXI GPIO for the control signals: the start signal and the ready signal.
The following figure shows the block design in Vivado.
4. Software Design
For the software design, we first write the input data to the BRAM input. Then, we start the PE module by writing 1 and then 0 to the first AXI GPIO channel (data register at offset 0x0). After that, we poll the ready signal on the second AXI GPIO channel (data register at offset 0x8) and wait until it is 1. Finally, we read the output data from the BRAM output.
helloworld.c
#include <stdio.h>
#include <stdint.h>
#define MEM_INP_BASE 0x40000000
#define MEM_GPIO_BASE 0x41200000
#define MEM_OUT_BASE 0x42000000
// volatile: these point to memory-mapped hardware, so no access may be optimized away
volatile uint32_t *mem_inp_p, *mem_gpio_p, *mem_out_p;
int main()
{
    // Initialization
    mem_inp_p = (volatile uint32_t *)MEM_INP_BASE;
    mem_gpio_p = (volatile uint32_t *)MEM_GPIO_BASE;
    mem_out_p = (volatile uint32_t *)MEM_OUT_BASE;
    // Write input: a in bits 7:0, b in bits 15:8, y in bits 23:16
    printf("Input:\n");
    for (int i = 0; i <= 7; i++)
    {
        uint8_t a = i + 1;
        uint8_t b = 8 - i;
        uint8_t y = i + 1;
        *(mem_inp_p+i) = ((uint32_t)y << 16) | ((uint32_t)b << 8) | a;
        printf(" a=%d, b=%d, y=%d\n", a, b, y);
    }
    // Start module: pulse the start bit (GPIO data register, offset 0x0)
    *(mem_gpio_p+0) = 0x1;
    *(mem_gpio_p+0) = 0x0;
    // Wait until ready: bit 0 of the GPIO2 data register (word 2 = offset 0x8)
    while (!(*(mem_gpio_p+2) & (1 << 0)));
    // Read output
    printf("Output:\n");
    for (int i = 0; i <= 7; i++)
        printf(" %lu\n", (unsigned long)*(mem_out_p+i));
    return 0;
}
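The wait loop in helloworld.c spins forever if the ready bit never becomes 1, for example when the bitstream is not loaded. One possible variation, shown here only as a sketch, is to bound the number of polls. The helper name `wait_for_bit` and the idea of a poll limit are assumptions added for illustration and are not part of the original design.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: poll a memory-mapped register until (reg & mask) != 0.
 * Returns 0 when a masked bit is set, or -1 after max_polls reads without it. */
static int wait_for_bit(volatile uint32_t *reg, uint32_t mask, uint32_t max_polls)
{
    assert(reg != 0 && mask != 0);
    for (uint32_t n = 0; n < max_polls; n++) {
        if (*reg & mask)
            return 0;
    }
    return -1;
}
```

In helloworld.c, a call such as `wait_for_bit(mem_gpio_p + 2, 1u << 0, 1000000)` could stand in for the while loop; the limit of 1000000 is an arbitrary example value.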
5. Result
The following figure shows the result on the serial terminal. We can verify the result manually.
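The manual check can also be written as a small reference model. The sketch below assumes that the PE computes a * b + y, which is a guess based on the multiplier and adder described earlier, and it uses the same bit packing as helloworld.c. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION: the PE computes a * b + y (multiplier followed by adder). */
static uint32_t pe_model(uint8_t a, uint8_t b, uint8_t y)
{
    return (uint32_t)a * b + y;
}

/* Unpack one input word as written by helloworld.c
 * (a in bits 7:0, b in bits 15:8, y in bits 23:16) and apply the model. */
static uint32_t expected_output(uint32_t input_word)
{
    assert((input_word >> 24) == 0); /* upper byte is unused by the packing */
    uint8_t a = input_word & 0xFF;
    uint8_t b = (input_word >> 8) & 0xFF;
    uint8_t y = (input_word >> 16) & 0xFF;
    return pe_model(a, b, y);
}
```

Under this assumption, the first input word (a=1, b=8, y=1) would correspond to an expected value of 9, which can be compared against the first line printed after "Output:".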
6. Conclusion
In this tutorial, we covered how to use BRAM, using a PE module as an example.