Part 7: FPGA Memory

Objective

This tutorial explains how to use the Block Memory Generator IP. As an example, we use the PE module for testing: its inputs are read from BRAM and its outputs are written to BRAM.

Source Code

This repository contains all of the code required in order to follow this tutorial.

zybo_tutorial/part_7 at main · weenslab/zybo_tutorial (GitHub)

References

Block Memory Generator v8.4 Product Guide, https://docs.amd.com/v/u/en-US/pg058-blk-mem-gen

1. Overview

1.1. Block Memory Generator

The Block Memory Generator is an IP core that configures the dedicated block RAM (BRAM) on the FPGA. Because BRAM is a dedicated resource, it does not use flip-flop or LUT resources. In its dual-port configuration, the core has two fully independent ports that access a shared memory space. Both A and B ports have a write and a read interface.

Block memory is limited in size. Even on a high-end Zynq chip, the total is only 38.0 Mb (4.75 MB). On the Z7010, it is only 2.1 Mb (0.2625 MB).

Block memory can be added to the design using block design (GUI) or with Verilog/VHDL (Xilinx Parameterized Macros, XPM) code.

Block memory has two operating modes as shown in Figure 1.

BRAM Controller
Standalone

The Block Memory Generator core uses embedded block RAM to generate five types of memories as shown in Figure 2.

Single-port RAM
Simple Dual-port RAM
True Dual-port RAM
Single-port ROM
Dual-port ROM

In this tutorial, we are going to use standalone mode and true dual-port RAM type.

1.2. BRAM Controller Mode

In this mode, the block memory is used together with the AXI BRAM Controller IP, and most of the block memory settings are grayed out, as shown in Figure 3.

The size of the memory is configured from the Address Editor entry of the AXI BRAM Controller IP, instead of the block memory configuration wizard.

The relationship between address range and data depth (32-bit or 64-bit) is shown in the table below.

Range    Depth (32-bit)    Depth (64-bit)
4K       1024              512
8K       2048              1024
16K      4096              2048
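
The depth values in the table follow from dividing the address range (in bytes) by the number of bytes per word. The helper below is a minimal sketch of that arithmetic; the function name bram_depth is hypothetical and not part of any Xilinx API.

```c
#include <stdint.h>

/* Number of addressable words in a BRAM mapped over `range_bytes` bytes,
 * for a data bus that is `width_bits` wide.
 * Returns 0 if the width is not a positive multiple of 8.
 * (Hypothetical helper, for illustration only.) */
uint32_t bram_depth(uint32_t range_bytes, uint32_t width_bits)
{
    if (width_bits == 0 || width_bits % 8 != 0)
        return 0;
    return range_bytes / (width_bits / 8);
}
```

For example, bram_depth(16 * 1024, 32) gives 4096 and bram_depth(16 * 1024, 64) gives 2048, matching the 16K row.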

1.3. BRAM Standalone Mode

In this mode, we can change the block memory data width and size, as well as other settings. A memory in this mode can be used internally by our RTL module, without being connected directly to the PS.

1.4. BRAM Timing Diagram

The following figure shows the BRAM write timing diagram for the BRAM controller mode. The address is a byte address, so it increments by 4 for each 32-bit (4-byte) word. The we signal is 4 bits wide, one bit per byte, so each byte of a word can be written individually.

The following figure shows the BRAM read timing diagram for the BRAM controller mode. The latency from address to data output is one clock cycle.
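
To make the role of the 4-bit we signal concrete, here is a small software model of a single 32-bit write, assuming bit n of we controls byte n of the word (bit 0 for bits 7:0). The function names are hypothetical; this is a sketch of the behavior, not code that runs on the board.

```c
#include <stdint.h>

/* Model of one 32-bit BRAM write with a 4-bit byte write enable.
 * Bit n of `we` selects byte lane n, that is, bits [8n+7:8n] of the word.
 * Lanes whose enable bit is 0 keep their old contents. */
uint32_t bram_write_word(uint32_t old_word, uint32_t din, uint8_t we)
{
    uint32_t mask = 0;
    for (int lane = 0; lane < 4; lane++) {
        if (we & (1u << lane))
            mask |= 0xFFu << (8 * lane);
    }
    return (old_word & ~mask) | (din & mask);
}

/* Word index selected by a byte address on a 32-bit port:
 * byte addresses 0, 4, 8, ... map to words 0, 1, 2, ... */
uint32_t bram_word_index(uint32_t byte_addr)
{
    return byte_addr >> 2;
}
```

With we = 0xF the whole word is replaced, which is the case used by the PE top module later in this tutorial; with we = 0x1 only the lowest byte changes.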

The reset type for BRAM is active-high.

1.5. Accessing BRAM from PS

From the software side, we can write and read BRAM data with a simple memory-mapped program. We initialize a pointer mem_p to the base address of the BRAM. Then, we use this pointer to write and read the data.

#include <stdio.h>
#include <stdint.h>

#define MEM_BASE 0x40000000

// volatile: every access must actually reach the memory-mapped BRAM
volatile uint32_t *mem_p;

int main()
{
	mem_p = (uint32_t *)MEM_BASE;

	// Write to block memory
	for (int i = 0; i <= 4; i++)
		*(mem_p+i) = 0xFFFFFFFF;

	// Read from block memory and print each word in hexadecimal
	for (int i = 0; i <= 4; i++)
		printf("0x%08x\n", (unsigned int)*(mem_p+i));

	return 0;
}

2. PE Module

2.1. Control and Datapath

The following figure shows the PE module. It consists of a multiplier and an adder. The circuit is purely combinational.

Next, we design the top module for this PE module, as shown in the following figure. We add pipeline registers to the input and output of the PE module. Then, we add a control unit, which is implemented as a counter. This control unit drives the address, enable, and write enable signals of the input and output BRAM. For the start signal, we use a rising edge detector that is implemented with a register, a NOT gate, and an AND gate.
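
The rising edge detector can be modeled in software as one bit of state plus the expression start & ~start_reg. The following is a minimal sketch of that behavior, stepped once per clock cycle; the type and function names are hypothetical.

```c
#include <stdbool.h>

/* State of the edge detector: the value of `start` sampled on the
 * previous clock edge (the register in front of the NOT gate). */
typedef struct {
    bool start_reg;
} edge_detector_t;

/* Call once per simulated clock cycle.
 * Returns true only in the cycle where `start` is 1 and the registered
 * copy is still 0, then updates the register with the current `start`. */
bool edge_detector_step(edge_detector_t *ed, bool start)
{
    bool rising = start && !ed->start_reg;
    ed->start_reg = start;
    return rising;
}
```

Holding start high for several cycles therefore produces a single one-cycle pulse, so the control counter is started only once per request.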

This is the Verilog implementation of the PE top module.

pe_top.v

`timescale 1ns / 1ps

module pe_top
    (
        input wire         clk,
        input wire         rst_n,
        // *** Control and status port ***
        output wire        ready,
        input wire         start,
        // *** Data input port ***
        (* X_INTERFACE_INFO = "xilinx.com:interface:bram:1.0 BRAM_PORTA CLK" *)  output wire        clka,
        (* X_INTERFACE_INFO = "xilinx.com:interface:bram:1.0 BRAM_PORTA RST" *)  output wire        rsta,
        (* X_INTERFACE_INFO = "xilinx.com:interface:bram:1.0 BRAM_PORTA EN" *)   output wire        ena,
        (* X_INTERFACE_INFO = "xilinx.com:interface:bram:1.0 BRAM_PORTA ADDR" *) output wire [31:0] addra,
        (* X_INTERFACE_INFO = "xilinx.com:interface:bram:1.0 BRAM_PORTA DIN" *)  output wire [31:0] dina,
        (* X_INTERFACE_INFO = "xilinx.com:interface:bram:1.0 BRAM_PORTA WE" *)   output wire [3:0]  wea,
        (* X_INTERFACE_INFO = "xilinx.com:interface:bram:1.0 BRAM_PORTA DOUT" *) input wire  [31:0] douta,
        // *** Data output port ***
        (* X_INTERFACE_INFO = "xilinx.com:interface:bram:1.0 BRAM_PORTB CLK" *)  output wire        clkb,
        (* X_INTERFACE_INFO = "xilinx.com:interface:bram:1.0 BRAM_PORTB RST" *)  output wire        rstb,
        (* X_INTERFACE_INFO = "xilinx.com:interface:bram:1.0 BRAM_PORTB EN" *)   output wire        enb,
        (* X_INTERFACE_INFO = "xilinx.com:interface:bram:1.0 BRAM_PORTB ADDR" *) output wire [31:0] addrb,
        (* X_INTERFACE_INFO = "xilinx.com:interface:bram:1.0 BRAM_PORTB DIN" *)  output wire [31:0] dinb,
        (* X_INTERFACE_INFO = "xilinx.com:interface:bram:1.0 BRAM_PORTB WE" *)   output wire [3:0]  web,
        (* X_INTERFACE_INFO = "xilinx.com:interface:bram:1.0 BRAM_PORTB DOUT" *) input wire  [31:0] doutb
    );
    
    // Main controller 
    wire start_reg, start_rising;
    reg [5:0] cnt_main_reg;
    reg [31:0] cnt_addra_reg, cnt_addrb_reg;
    wire start_cnt_addra, start_cnt_addrb;
    
    // PE datapath
    wire signed [7:0] a_in, a_in_reg; 
    wire signed [7:0] y_in, y_in_reg;
    wire signed [7:0] b_in, b_in_reg;
    wire signed [7:0] y_out, y_out_reg;   

    // *** Main controller ******************************************************
    register #(1) reg_start_0(clk, rst_n, 1'b1, 1'b0, start, start_reg);
    assign start_rising = start & ~start_reg;
    
    // Counter for FSM
    always @(posedge clk)
    begin
        if (!rst_n)
        begin
            cnt_main_reg <= 0;
        end
        else if (start_rising)
        begin
            cnt_main_reg <= cnt_main_reg + 1;
        end
        else if (cnt_main_reg >= 1 && cnt_main_reg <= 10)
        begin
            cnt_main_reg <= cnt_main_reg + 1;
        end
        else if (cnt_main_reg >= 11)
        begin
            cnt_main_reg <= 0;
        end
    end

    // Input port control
    assign clka = clk;
    assign rsta = ~rst_n;
    assign ena = ((cnt_main_reg >= 1) && (cnt_main_reg <= 8)) ? 1 : 0;
    assign addra = cnt_addra_reg;
    assign dina = 0;
    assign wea = 0;

    // Address counter for BRAM input
    always @(posedge clk)
    begin
        if (!rst_n)
        begin
            cnt_addra_reg <= 0;
        end
        else if (start_cnt_addra)
        begin
            cnt_addra_reg <= cnt_addra_reg + 4;
        end
        else if (cnt_addra_reg >= 32'h00000004 && cnt_addra_reg <= 32'h00000018)
        begin
            cnt_addra_reg <= cnt_addra_reg + 4;
        end
        else if (cnt_addra_reg >= 32'h0000001c)
        begin
            cnt_addra_reg <= 0;
        end
    end
    assign start_cnt_addra = (cnt_main_reg == 1);

    // Output port control
    assign clkb = clk;
    assign rstb = ~rst_n;
    assign enb = ((cnt_main_reg >= 4) && (cnt_main_reg <= 11)) ? 1 : 0;
    assign addrb = cnt_addrb_reg;
    assign dinb = y_out_reg;
    assign web = ((cnt_main_reg >= 4) && (cnt_main_reg <= 11)) ? 4'hf : 4'h0;

    // Address counter for BRAM output
    always @(posedge clk)
    begin
        if (!rst_n)
        begin
            cnt_addrb_reg <= 0;
        end
        else if (start_cnt_addrb)
        begin
            cnt_addrb_reg <= cnt_addrb_reg + 4;
        end
        else if (cnt_addrb_reg >= 32'h00000004 && cnt_addrb_reg <= 32'h00000018)
        begin
            cnt_addrb_reg <= cnt_addrb_reg + 4;
        end
        else if (cnt_addrb_reg >= 32'h0000001c)
        begin
            cnt_addrb_reg <= 0;
        end
    end
    assign start_cnt_addrb = (cnt_main_reg == 4);
    
    // Status control
    assign ready = (cnt_main_reg == 0) ? 1 : 0;
    
    // *** PE datapath **********************************************************
    // Input from BRAM
    assign a_in = douta[7:0];
    assign b_in = douta[15:8];
    assign y_in = douta[23:16];
    
    // Pipeline registers for input
    register #(8) reg_0(clk, rst_n, 1'b1, 1'b0, a_in, a_in_reg);
    register #(8) reg_1(clk, rst_n, 1'b1, 1'b0, y_in, y_in_reg);
    register #(8) reg_2(clk, rst_n, 1'b1, 1'b0, b_in, b_in_reg);
    
    // PE
    pe #(8, 0) pe_0
    (
        .a_in(a_in_reg),
        .y_in(y_in_reg),
        .b(b_in_reg),
        .a_out(),
        .y_out(y_out)
    );
    
    // Pipeline registers for output
    register #(8) reg_3(clk, rst_n, 1'b1, 1'b0, y_out, y_out_reg);
    
endmodule
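
To see how the counter schedules the two BRAM ports, the control logic of pe_top.v can be mirrored in a small C model. This is only a sketch for reasoning about timing (the struct and function names are hypothetical). It suggests that word j is read at count j+1 and written at count j+4; the three-cycle offset appears to correspond to the one-cycle BRAM read latency plus the input and output pipeline registers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Software copy of the three counters in pe_top.v. */
typedef struct {
    uint32_t cnt_main;  /* cnt_main_reg  */
    uint32_t addra;     /* cnt_addra_reg */
    uint32_t addrb;     /* cnt_addrb_reg */
} pe_ctrl_t;

/* Combinational outputs derived from the main counter. */
bool pe_ctrl_ena(const pe_ctrl_t *c)   { return c->cnt_main >= 1 && c->cnt_main <= 8; }
bool pe_ctrl_enb(const pe_ctrl_t *c)   { return c->cnt_main >= 4 && c->cnt_main <= 11; }
bool pe_ctrl_ready(const pe_ctrl_t *c) { return c->cnt_main == 0; }

/* Next value of an address counter: start on `start_cnt`, step by 4
 * through 0x04..0x18, and wrap to 0 after 0x1c. */
static uint32_t next_addr(uint32_t addr, bool start_cnt)
{
    if (start_cnt)                    return addr + 4;
    if (addr >= 0x04 && addr <= 0x18) return addr + 4;
    if (addr >= 0x1c)                 return 0;
    return addr;
}

/* Advance all counters by one rising clock edge. */
void pe_ctrl_step(pe_ctrl_t *c, bool start_rising)
{
    pe_ctrl_t n = *c;
    n.addra = next_addr(c->addra, c->cnt_main == 1);
    n.addrb = next_addr(c->addrb, c->cnt_main == 4);
    if (start_rising)
        n.cnt_main = c->cnt_main + 1;
    else if (c->cnt_main >= 1 && c->cnt_main <= 10)
        n.cnt_main = c->cnt_main + 1;
    else if (c->cnt_main >= 11)
        n.cnt_main = 0;
    *c = n;
}
```

Stepping the model once with start_rising set and then eleven more times without it walks through counts 1 to 11 and returns to the ready state with both address counters back at 0.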

2.2. Testbench

Now that we have the PE top module, we can test it with a testbench. The block design that we are going to test is shown in the following figure. The PE top module is connected to the input BRAM and the output BRAM.

This is the Verilog testbench for the PE top module.

design_1_wrapper_tb.v

`timescale 1ns / 1ps

module design_1_wrapper_tb();
    localparam T = 10;
    
    reg clk;
    reg rst_n;
    reg start;
    wire ready;

    reg [31:0] BRAM_PORTA_addr;
    reg BRAM_PORTA_clk;
    reg [31:0] BRAM_PORTA_din;
    wire [31:0] BRAM_PORTA_dout;
    reg BRAM_PORTA_en;
    reg BRAM_PORTA_rst;
    reg [3:0] BRAM_PORTA_we;
    reg [31:0] BRAM_PORTB_addr;
    reg BRAM_PORTB_clk;
    reg [31:0] BRAM_PORTB_din;
    wire [31:0] BRAM_PORTB_dout;
    reg BRAM_PORTB_en;
    reg BRAM_PORTB_rst;
    reg [3:0] BRAM_PORTB_we;
    
    integer i = 0;
    
    design_1_wrapper dut
    (
        .clk(clk),
        .rst_n(rst_n),
        .ready(ready),
        .start(start),
        .BRAM_PORTA_addr(BRAM_PORTA_addr),
        .BRAM_PORTA_clk(BRAM_PORTA_clk),
        .BRAM_PORTA_din(BRAM_PORTA_din),
        .BRAM_PORTA_dout(BRAM_PORTA_dout),
        .BRAM_PORTA_en(BRAM_PORTA_en),
        .BRAM_PORTA_rst(BRAM_PORTA_rst),
        .BRAM_PORTA_we(BRAM_PORTA_we),
        .BRAM_PORTB_addr(BRAM_PORTB_addr),
        .BRAM_PORTB_clk(BRAM_PORTB_clk),
        .BRAM_PORTB_din(BRAM_PORTB_din),
        .BRAM_PORTB_dout(BRAM_PORTB_dout),
        .BRAM_PORTB_en(BRAM_PORTB_en),
        .BRAM_PORTB_rst(BRAM_PORTB_rst),
        .BRAM_PORTB_we(BRAM_PORTB_we)
    );
    
    always
    begin
        clk = 0;
        BRAM_PORTA_clk = 0;
        BRAM_PORTB_clk = 0;
        #(T/2);
        clk = 1;
        BRAM_PORTA_clk = 1;
        BRAM_PORTB_clk = 1;
        #(T/2);
    end
    
    initial
    begin
        start = 0;

        BRAM_PORTA_addr = 0;
        BRAM_PORTA_din = 0;
        BRAM_PORTA_en = 1;
        BRAM_PORTA_we = 0;
        BRAM_PORTB_addr = 0;
        BRAM_PORTB_din = 0;
        BRAM_PORTB_en = 1;
        BRAM_PORTB_we = 0;
        
        rst_n = 0;
        BRAM_PORTA_rst = 1;
        BRAM_PORTB_rst = 1;
        #(T*10);
        rst_n = 1;
        BRAM_PORTA_rst = 0;
        BRAM_PORTB_rst = 0;
        #(T*10);
        
        BRAM_PORTA_we = 4'hf;
        BRAM_PORTA_addr = 32'h00000000;
        BRAM_PORTA_din = {8'd0, 8'd1, 8'd8, 8'd1};
        #T;
        BRAM_PORTA_addr = 32'h00000004;
        BRAM_PORTA_din = {8'd0, 8'd2, 8'd7, 8'd2};
        #T;
        BRAM_PORTA_addr = 32'h00000008;
        BRAM_PORTA_din = {8'd0, 8'd3, 8'd6, 8'd3};
        #T;
        BRAM_PORTA_addr = 32'h0000000c;
        BRAM_PORTA_din = {8'd0, 8'd4, 8'd5, 8'd4};
        #T;
        BRAM_PORTA_addr = 32'h00000010;
        BRAM_PORTA_din = {8'd0, 8'd5, 8'd4, 8'd5};
        #T;
        BRAM_PORTA_addr = 32'h00000014;
        BRAM_PORTA_din = {8'd0, 8'd6, 8'd3, 8'd6};
        #T;
        BRAM_PORTA_addr = 32'h00000018;
        BRAM_PORTA_din = {8'd0, 8'd7, 8'd2, 8'd7};
        #T;
        BRAM_PORTA_addr = 32'h0000001c;
        BRAM_PORTA_din = {8'd0, 8'd8, 8'd1, 8'd8};
        #T;
        BRAM_PORTA_we = 4'h0;
        BRAM_PORTA_addr = 32'h00000000;
        BRAM_PORTA_din = 0;
                
        start = 1;
        #T;
        start = 0;
        #T;
        
        #(T*10);
        
        for (i = 0; i <= 28; i = i+4)
        begin
            BRAM_PORTB_addr = i;
            #T;
        end
    end    
    
endmodule

The following figure shows the timing diagram of the PE top module. First, the data is written to the input BRAM. Then, the module starts on the rising edge of the start signal. For the duration of the computation, the ready signal is zero, indicating that the PE top module is busy. Finally, the PE results are stored in the output BRAM.

3. System Design

The following figure shows the overall SoC system design for the PE top module. We use the AXI BRAM controller IP to connect the PS and the BRAM. We use AXI GPIO for the control and status signals, that is, the start signal and the ready signal.

The following figure shows the block design in Vivado.

4. Software Design

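For the software design, we first write the input data to the input BRAM. Then, we start the PE module by writing 1 and then 0 to the first AXI GPIO channel (data register at offset 0x0), which produces a rising edge on the start signal. After that, we poll the ready signal on the second AXI GPIO channel (data register at offset 0x8, accessed as mem_gpio_p+2) until it is 1. Finally, we read the output data from the output BRAM.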

helloworld.c

#include <stdio.h>
#include <stdint.h>

#define MEM_INP_BASE 	0x40000000
#define MEM_GPIO_BASE 	0x41200000
#define MEM_OUT_BASE 	0x42000000

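// volatile: these point to memory-mapped hardware, so the start pulse and
// the ready polling loop must not be optimized away
volatile uint32_t *mem_inp_p, *mem_gpio_p, *mem_out_p;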

int main()
{
    // Initialization
    mem_inp_p = (uint32_t *)MEM_INP_BASE;
    mem_gpio_p = (uint32_t *)MEM_GPIO_BASE;
    mem_out_p = (uint32_t *)MEM_OUT_BASE;

    // Write input
    printf("Input:\n");
    for (int i = 0; i <= 7; i++)
    {
    	uint8_t a = i + 1;
    	uint8_t b = 8 - i;
    	uint8_t y = i + 1;
    	*(mem_inp_p+i) = (y << 16) | (b << 8) | a;
    	printf(" a=%d, b=%d, y=%d\n", a, b, y);
    }

    // Start module
    *(mem_gpio_p+0) = 0x1;
    *(mem_gpio_p+0) = 0x0;

    // Wait until ready
    while (!(*(mem_gpio_p+2) & (1 << 0)));

    // Read output
    printf("Output:\n");
    for (int i = 0; i <= 7; i++)
    	printf(" %lu\n", (unsigned long)*(mem_out_p+i));

    return 0;
}
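
Since pe_top.v treats a, b, and y as signed 8-bit values, packing negative operands needs a mask on each byte so that sign bits do not spill into the neighboring fields. The helpers below are a sketch of one way to do this, using the bit layout read by pe_top.v (a in bits 7:0, b in bits 15:8, y in bits 23:16). The names pe_pack and pe_unpack_result are hypothetical.

```c
#include <stdint.h>

/* Pack three signed 8-bit operands into one 32-bit input word:
 * a in bits [7:0], b in bits [15:8], y in bits [23:16].
 * Bits [31:24] are not used by pe_top.v and are left at 0.
 * Casting through uint8_t keeps only the low 8 bits of each operand. */
uint32_t pe_pack(int8_t a, int8_t b, int8_t y)
{
    return ((uint32_t)(uint8_t)y << 16) |
           ((uint32_t)(uint8_t)b << 8)  |
            (uint32_t)(uint8_t)a;
}

/* Interpret the low byte of an output word as a signed 8-bit result.
 * Only the low byte is used, so the helper does not depend on how the
 * upper 24 bits of the word are filled. */
int8_t pe_unpack_result(uint32_t word)
{
    uint8_t low = (uint8_t)(word & 0xFFu);
    return (low < 128) ? (int8_t)low : (int8_t)((int)low - 256);
}
```

For the first input of this tutorial, pe_pack(1, 8, 1) yields 0x00010801, the same word that the testbench writes to address 0.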

5. Result

The following figure shows the result on the serial terminal. We can verify the result manually.
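
For the manual check, a small reference model can compute the values to compare against. The pe module itself is not listed in this part, so the sketch below assumes that it computes y_out = a * b + y_in and keeps the low 8 bits as a signed value; if the actual module differs, the model has to be adapted. The function names are hypothetical.

```c
#include <stdint.h>

/* Assumed PE behavior: y_out = a * b + y_in, truncated to 8 signed bits.
 * This is an assumption based on the description "a multiplier and an
 * adder", not taken from the pe module source. */
int8_t pe_reference(int8_t a, int8_t b, int8_t y_in)
{
    int32_t full = (int32_t)a * (int32_t)b + (int32_t)y_in;
    uint8_t low = (uint8_t)((uint32_t)full & 0xFFu);
    return (low < 128) ? (int8_t)low : (int8_t)((int)low - 256);
}

/* Fill `expected` with the reference outputs for the eight input words
 * written by helloworld.c (a = i + 1, b = 8 - i, y = i + 1). */
void pe_expected_outputs(int8_t expected[8])
{
    for (int i = 0; i < 8; i++)
        expected[i] = pe_reference((int8_t)(i + 1), (int8_t)(8 - i),
                                   (int8_t)(i + 1));
}
```

Under that assumption, the eight expected outputs are 9, 16, 21, 24, 25, 24, 21, and 16.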

6. Conclusion

In this tutorial, we showed how to use BRAM in a design, using as an example a PE module that reads its inputs from BRAM and writes its results back to BRAM.
