Note: This is part 5 of a series on working with FPGAs and in particular the Xilinx Zynq-7000S Programmable System-on-Chip with ARM Cortex-A9 processing core. For part 1, click here: Xilinx ZYNQ System-on-Chip - Getting to know the MiniZed Board
For all parts, click here: Path to Programmable
The MiniZed is a compact coaster- or minidisc-sized board packed with a lot of functionality. It has an ARM processor (Cortex-A9) that can run Linux, embedded inside a Xilinx chip called Zynq, which also contains programmable logic (a field-programmable gate array, or FPGA). The board has Arduino headers for plugging on shields too. There is also on-board wireless (802.11 WiFi and Bluetooth 4.0) and some sensors too! The photo below shows the board in a 3D-printed enclosure kindly provided by Fred27.
In terms of memory, the board has lots of RAM and Flash memory soldered on (512Mbytes and 8Gbytes respectively) that is used by the Cortex-A9 processor inside the Xilinx Zynq chip.
When it comes to creating projects with the MiniZed, the designer can decide to implement some functionality in software (for the processor) but also some functionality in hardware (for the FPGA that is on the same chip). In Xilinx Zynq terminology, the processor portion is known as the Processing System (PS) and the FPGA portion is the Programmable Logic (PL) portion.
Why would one want to do this, rather than code in software? One popular reason is, the PL portion can be used to accelerate things! There could be other reasons too, such as requiring actions with certain precise timing (often that can be hard to do purely in software) or if power efficiency is needed.
The FPGA or PL portion of the chip has its own embedded memory too, known as Block RAM. This blog post discusses it further, and how to configure it using the Xilinx development environment called Vivado.
What is Block RAM?
To explore block memory, we need to go small (but not quite subatomic) and see what’s in the chip. The FPGA portion of the Xilinx chip consists of thousands of configurable logic blocks (CLBs). Xilinx uses the terms tiles and slices to describe groupings of this logic. In the 7-series fabric that the Zynq is built on, a CLB contains two slices, and each slice contains some look-up tables, multiplexers, flip-flops and circuitry for fast addition/subtraction. The look-up tables are used to create arbitrary combinational logic, but technically they could be considered as a type of memory too. The flip-flops can also be considered as types of memory. However, it would be inefficient to use slices purely as memory, and therefore there are additional blocks of dedicated memory interspersed throughout the chip: this is Block RAM.
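To see why a look-up table can be thought of as a tiny memory, here is a minimal sketch in C. It assumes the six-input look-up tables of the 7-series fabric, so each one is modelled as a 64-entry, 1-bit-wide truth table. The function names (lut6_eval, lut6_init_and, lut6_init_xor) are made up for illustration and are not part of any Xilinx library.

```c
#include <stdint.h>

/* A 6-input LUT modelled as a 64-bit truth table: the six inputs form an
 * index, and the bit stored at that index is the LUT output. */
static inline unsigned lut6_eval(uint64_t init, unsigned inputs)
{
    return (unsigned)((init >> (inputs & 0x3Fu)) & 1u);
}

/* Truth table for a 6-input AND gate: only index 63 (all inputs high)
 * holds a 1. */
static inline uint64_t lut6_init_and(void)
{
    return (uint64_t)1 << 63;
}

/* Truth table for a 6-input XOR (odd parity): a 1 is stored at every index
 * that has an odd number of bits set. */
static inline uint64_t lut6_init_xor(void)
{
    uint64_t init = 0;
    for (unsigned i = 0; i < 64; i++) {
        unsigned ones = 0;
        for (unsigned b = 0; b < 6; b++)
            ones += (i >> b) & 1u;
        if (ones & 1u)
            init |= (uint64_t)1 << i;
    }
    return init;
}
```

Changing the logic function only means changing the 64 stored bits, which is exactly how a small read-only memory behaves. It also hints at why building kilobytes of storage this way would burn through slices quickly.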
In Vivado, the Device View can be used to see what resources are available graphically. The light-blue shaded parts are allocated, but can be moved around if desired; the routing (green lines) determines which parts of the chip are being used. By zooming in the device view, it is possible to see down to the logic gate level as shown in the diagram above.
Every entity in the diagram has a reference that uses co-ordinates from the bottom-left of the diagram. So, the slice X43Y99 happens to be positioned at the top-right of the FPGA portion of the chip.
The block RAM is interspersed through the array; there are 50 pieces of it, each consisting of 36kbits, for a total of 1.8Mbits in the particular Zynq chip (Z-7007S) on the MiniZed board.
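The arithmetic behind those figures is simple enough to check in a few lines of C. This is only a sketch: the helper names are invented here, and the byte conversion assumes that 1 kbit means 1024 bits.

```c
#include <stdint.h>

/* Total block RAM capacity in kbits for a given number of blocks. */
static inline uint32_t bram_total_kbits(uint32_t blocks, uint32_t kbits_each)
{
    return blocks * kbits_each;
}

/* The same capacity in bytes, assuming 1 kbit = 1024 bits. */
static inline uint32_t bram_total_bytes(uint32_t blocks, uint32_t kbits_each)
{
    return bram_total_kbits(blocks, kbits_each) * 1024u / 8u;
}
```

With the 50 blocks of 36 kbits mentioned above, bram_total_kbits(50, 36) gives 1800 kbits (the 1.8 Mbits figure) and bram_total_bytes(50, 36) gives 230400 bytes, or 225 Kbytes. That is tiny compared to the external RAM, but it sits right inside the programmable logic.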
How can Block RAM be used by the Processing System?
Although the Cortex-A9 processor usually uses the RAM outside of the chip, soldered on the MiniZed board, there could be a desire to use the block RAM too. For instance, dual-port RAM is a very convenient way of sharing data between hardware running in the programmable logic, and the software on the Cortex processor!
The challenge becomes, how can the processing system (PS) access the block RAM that is interspersed in the PL portion of the chip? The answer lies in something known as AXI. When ARM developed their microprocessor and microcontroller cores, they also developed on-chip interfaces which could be used by semiconductor manufacturers to connect other bits of functionality inside the same chip. So, for instance, if you purchase an ARM Cortex-M microcontroller from Texas Instruments, then it may use the Advanced High-performance Bus (AHB) and Advanced Peripheral Bus (APB) highlighted in the diagram below, to talk to the integrated peripherals on the chip.
Image source: Texas Instruments website
AXI (Advanced eXtensible Interface) is a very popular bus for ARM based chips. It is the one used by Xilinx in the Zynq chips. The bus has a lot of signals (hundreds) but in a nutshell the main ones are separate address lines for read and write operations (known as AR and AW groups of signals respectively), separate data buses for read and write, many handshake signals, and a clock (ACLK) and reset (ARESETn).
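The handshake signals are the key to understanding AXI. Each group of signals (channel) carries a VALID signal from the sender and a READY signal from the receiver, and a transfer only takes place on a clock edge where both are high. The sketch below models that rule in C. The type and function names are hypothetical, and the enum also lists the write response channel (B), which isn't mentioned above.

```c
#include <stdbool.h>
#include <stdint.h>

/* The five AXI channels: write address, write data, write response,
 * read address and read data. */
typedef enum { AXI_AW, AXI_W, AXI_B, AXI_AR, AXI_R } axi_channel_id_t;

/* A very reduced model of one channel: real channels carry many more
 * signals alongside the payload. */
typedef struct {
    bool     valid;    /* driven by the sender: payload is meaningful */
    bool     ready;    /* driven by the receiver: it can accept a payload */
    uint32_t payload;  /* address or data, depending on the channel */
} axi_channel_t;

/* Model one rising edge of ACLK on a channel. Returns true, and stores the
 * payload in *captured, only when VALID and READY are both high. */
static inline bool axi_channel_clock(const axi_channel_t *ch, uint32_t *captured)
{
    if (ch->valid && ch->ready) {
        *captured = ch->payload;
        return true;
    }
    return false;
}
```

If the receiver holds READY low, the sender simply keeps VALID and the payload steady until the receiver catches up, which is how a slow peripheral can throttle the processor without any extra flow-control logic.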
When we wish to access functionality created in the PL portion of the chip, then the AXI interface is very useful. The ARM processor contains AXI master functionality. The PL portion would need to have an AXI slave to translate to/from AXI and the desired functionality.
Using the AXI Interconnect and Block RAM
There are slight implementation differences with AXI; it is possible to have 32-bit wide data busses for instance, but some peripherals may need a 64-bit wide bus. The AXI Interconnect is a piece of Xilinx intellectual property (IP) that can connect to the ARM processor and translate to the different variant slave devices. The diagram below shows the detail. As mentioned there are a lot of signals, but the write data bus is highlighted in red. As can be seen, the processor side (in green) is using a 32-bit wide write data bus, but the AXI Interconnect is being used in this example to translate to a 64-bit wide bus.
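To get a feel for what such a width translation involves, here is a simplified sketch in C of a single 32-bit write being placed onto a 64-bit write data bus. It assumes the usual AXI arrangement of one write strobe (WSTRB) bit per byte lane. The names axi64_beat_t and widen_write32 are invented for illustration, and the real AXI Interconnect does this in hardware and also handles bursts, which are ignored here.

```c
#include <stdint.h>

/* One beat on a 64-bit AXI write data channel. */
typedef struct {
    uint64_t wdata;  /* 64-bit write data bus */
    uint8_t  wstrb;  /* write strobes: one bit per byte lane */
} axi64_beat_t;

/* Place a single 32-bit write onto a 64-bit bus. Address bit 2 selects
 * whether the word lands in the lower or the upper four byte lanes. */
static inline axi64_beat_t widen_write32(uint32_t addr, uint32_t data)
{
    axi64_beat_t beat;
    unsigned upper = (addr >> 2) & 1u;
    beat.wdata = (uint64_t)data << (32u * upper);
    beat.wstrb = (uint8_t)(0x0Fu << (4u * upper));
    return beat;
}
```

A write to an address with bit 2 clear ends up with strobes 0x0F (lower four bytes), and a write to the next word up ends up with strobes 0xF0, so the 64-bit memory only updates the four bytes that the processor actually wrote.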
Another piece of Xilinx IP, the AXI BRAM Controller, acts as an AXI slave to translate from AXI to the interface used by the block RAM in the programmable logic (PL) portion of the chip. Finally, the Block Memory Generator is the IP used to configure as many of the 50 Block RAM elements as are required, from those interspersed across the chip, into a bus width and size specified by the user in Vivado.
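As a rough guide to how many Block RAM elements a given memory will consume, the estimate below may help. It rests on an assumption that isn't stated above: that each 36 kbit block contributes 32 kbits (4096 bytes) of ordinary data storage, with the remaining bits being parity. The helper names are made up, and the Block Memory Generator may well map things differently, so treat the result as a ballpark figure only and check the utilisation report in Vivado for the real number.

```c
#include <stdint.h>

/* Assumption: 4096 bytes of data storage per 36 kbit block (parity bits
 * not counted as data). */
#define BRAM36_DATA_BYTES 4096u

/* Estimated number of 36 kbit blocks needed for a memory of 'bytes' bytes,
 * rounding up to a whole block. */
static inline uint32_t bram36_blocks_needed(uint32_t bytes)
{
    return (bytes + BRAM36_DATA_BYTES - 1u) / BRAM36_DATA_BYTES;
}

/* Depth of the memory in words for a given data-bus width in bits. */
static inline uint32_t bram_depth_words(uint32_t bytes, uint32_t width_bits)
{
    return bytes / (width_bits / 8u);
}
```

For example, an 8 Kbyte memory with a 64-bit data bus would be 1024 words deep and, under the assumption above, would need about two of the 50 blocks.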
Creating Block RAM in Vivado
The steps to create block RAM primarily consist of adding some IP, in particular the AXI slave device called the AXI BRAM Controller. From the block design diagram view (if it isn’t already opened, click on Open Block Design from the left-hand side Flow Navigator in Vivado), you can right-click in any of the white space in the diagram, and select Add IP to do this. Once it is added, double-clicking on it will open up a Re-customize IP window which lets you set properties. A data bus width of 64 bits was set here as shown in the steps diagram below. As with any IP block, once Run Connection Automation is run, the blocks will automatically connect as required. It will also automatically create the Block Memory Generator, which can be double-clicked on to examine if desired. Also, the connected signals can be explored as shown in the earlier diagrams in this blog post!
Creating AXI Interconnect in Vivado
The AXI slave device, the Block RAM Controller created in the previous section, has an unconnected S_AXI interface which ultimately needs to connect somehow back to the processor. To do that, AXI Interconnect IP is required which will interwork the 64-bit wide block RAM that was created, to the processor. Also, from the block diagram, it can be seen that the AXI interface also requires a clock and reset signal (all this is highlighted in red in the block diagram below).
To connect all this up to the processor, double-click the Zynq processing system (PS) block, and then at the bottom, click the green box labelled 32b GP AXI Master Ports. This will bring up a configuration box from where you can enable the master AXI interface for the processor, and then enable the clock for it too. See the steps in the diagram below. Once Run Connection Automation is executed, the connections will automatically be made between the processor and the AXI interface, and the block RAM controller!
The diagram below shows the connected signals between all the modules (click on the Regenerate Layout icon in the diagram graphical toolbar, to tidy it up). The thick blue connections represent many signals, which can be expanded to explore the detail if desired.
Finally, save the block design by either clicking on the save icon, or typing save_bd_design in the Tcl Console. The steps to re-create the VHDL wrapper content prior to generating the bitstream are shown below. After Generate Bitstream has been clicked on in the left side Flow Navigator pane, just click through until the bitstream is successfully created!
As with the previous blog posts, once the bitstream has been created, the task now moves to creating the software board support package or BSP (which provides the startup code and some low-level basic functions) and any software development!
Creating the Board Support Package
To prepare the handoff to the SDK for software development, first, in Vivado, the hardware design needs to be exported. The steps are as in earlier blog posts, except that a new folder was created (called lab6 in this example) as shown below.
Once the SDK is launched, the steps to create the board support package are the same as in blog 2’s diagram titled Creating the Board Support Package, which guides you through the File->New->Board Support Package menu item.
The SDK can be used to see the BSP drivers that are included, and also details about the memory map, and what PL IP blocks were used, as a summary. As can be seen, the BSP included specific drivers for the block RAM. For the curious, the folder path standalone_bsp_0/ps7_cortexa9_0/libsrc/bram_v4_2 contains the source code.
The memory-mapped peripherals and address mapping information are also interesting. The block RAM addressing is visible here.
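Because the block RAM appears in the processor's memory map, software can reach it with ordinary pointer accesses. The sketch below is a simple write-then-read-back test in C. The function name bram_selftest and the test pattern are invented for illustration. On the board, the base pointer would be formed from the block RAM base address shown in the address map (the BSP also records it in xparameters.h); here the function just takes a pointer, so it can be tried against a normal array on a PC first.

```c
#include <stddef.h>
#include <stdint.h>

/* Write a distinct pattern to every 32-bit word, then read each word back.
 * Returns the number of words that did not read back correctly. The
 * 'volatile' qualifier stops the compiler from optimising the accesses
 * away, which matters when the pointer refers to real hardware. */
static size_t bram_selftest(volatile uint32_t *base, size_t words)
{
    size_t errors = 0;

    for (size_t i = 0; i < words; i++)
        base[i] = 0xA5A50000u ^ (uint32_t)i;

    for (size_t i = 0; i < words; i++) {
        if (base[i] != (0xA5A50000u ^ (uint32_t)i))
            errors++;
    }
    return errors;
}
```

A return value of zero means every word read back as written. Each word gets a different value so that a stuck or shorted address line would show up as a mismatch rather than going unnoticed.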
Creating a New Application
For testing the block RAM, some code was provided as part of the training bundle. It isn’t reproduced here, but it relies on the functions available in the BSP, to do things like set up interrupts and timing functions, and to use direct memory access (DMA) for acceleration, since AXI supports that. The snippet of code below shows how transfers are done in non-DMA mode (simply by copying from source to destination one 32-bit word at a time), and in DMA mode using a BSP supplied function called XDmaPs_Start once the structure of type XDmaPs_Cmd has been populated with details such as the source and destination addresses.
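For anyone without the training bundle, the non-DMA path described above boils down to a loop like the one sketched here. This is not the supplied dma_test.c: the function names copy_words and count_mismatches are made up, and the DMA path is left out because it needs the BSP's XDmaPs driver and real hardware to run.

```c
#include <stddef.h>
#include <stdint.h>

/* Non-DMA transfer: the processor itself reads each 32-bit word from the
 * source and writes it to the destination, one word at a time. */
static void copy_words(volatile uint32_t *dst,
                       const volatile uint32_t *src,
                       size_t words)
{
    for (size_t i = 0; i < words; i++)
        dst[i] = src[i];
}

/* Check a transfer afterwards by counting the words that differ. */
static size_t count_mismatches(const volatile uint32_t *a,
                               const volatile uint32_t *b,
                               size_t words)
{
    size_t mismatches = 0;
    for (size_t i = 0; i < words; i++) {
        if (a[i] != b[i])
            mismatches++;
    }
    return mismatches;
}
```

Every word in this loop costs the processor a read and a write over the bus. In DMA mode the processor only describes the transfer and starts it, and the DMA controller then moves the data on its own, which is where the speed difference comes from.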
Just as a reminder (it was covered in blog 2 in the diagram titled Creating a Software Project in C), to create an application, go to File->New->Application Project, give the project a name, and select the board support package called standalone_bsp_0. Click Next, and then choose a template such as Hello World or Empty Application. If you have existing source code such as the dma_test.c file shown in the screenshot above, it can then be imported by right-clicking on the src folder in the project explorer and selecting Import followed by General->File System. As mentioned, for this blog post, the dma_test.c file was supplied as part of the training material.
Running the Project
Connect up the MiniZed board to the PC as shown in the photo below using a mini USB cable.
Next, click on the Program FPGA icon as shown in the steps diagram below. Within seconds the programmable logic should be configured! The code is executed as shown below, by right-clicking on the project name and selecting Run As -> Launch on Hardware (System Debugger).
The code provides options to test and compare the block RAM speed with the external DDR3 RAM, and also to compare non-DMA and DMA speeds. By opening up a serial connection at 115200 baud on the PC, a simple menu was displayed. I selected 1024 words and option 1 (BRAM to BRAM transfer), and observed the significant difference in speed from non-DMA to DMA.
Selecting option 3 (DDR3 to DDR3 transfer) yielded these results:
DDR3 transfers were quicker, primarily because the DDR3 clock speed is higher. Nevertheless, block RAM was still fairly speedy, taking just over twice as many clock cycles compared to DDR3, in DMA mode. It would be possible to do a lot with such speed.
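When comparing runs like these, it can help to turn the raw cycle counts into a ratio or into time. The helpers below are a small sketch for doing that. The names are invented, and the timer frequency has to be supplied by the caller, since the clock that the training code counts against isn't stated here.

```c
#include <stdint.h>

/* How many times faster run B was than run A, given the cycle counts of
 * both runs. Returns 0.0 if run B reports zero cycles. */
static inline double speedup(uint64_t cycles_a, uint64_t cycles_b)
{
    return cycles_b ? (double)cycles_a / (double)cycles_b : 0.0;
}

/* Convert a cycle count into microseconds for a timer running at
 * timer_hz. Returns 0.0 if the frequency is given as zero. */
static inline double cycles_to_us(uint64_t cycles, uint64_t timer_hz)
{
    return timer_hz ? (double)cycles * 1e6 / (double)timer_hz : 0.0;
}
```

Calling speedup() with the non-DMA and DMA cycle counts from the menu output gives the acceleration factor directly, and calling it with the block RAM and DDR3 counts in DMA mode gives the "just over twice" figure mentioned above.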
By exploring the programmable logic capabilities it is possible to see that there is a fair amount of memory interspersed throughout the chip. The ability to read and write to it is extremely important, and despite the underlying complexity of an internal on-chip peripheral bus such as AXI, it is in fact quite easy to get going with it. Adding the relevant IP to connect it all together was very straightforward! Such capability could be important whenever it is desired to interface hardware with the ARM processor’s memory map. Possible use-cases include LCD screens, cameras and high-speed analog-to-digital and digital-to-analog converters.
A while back, I experimented with connecting an LCD screen to a microprocessor. It was a Motorola/Freescale 68000 series chip, and I needed some way of getting video data from the external RAM to the screen. I decided to use an external dual-port RAM to act as the video memory. That way, the 68000 could write to it at any time, and I would use programmable logic in a Xilinx CPLD device to read the memory and write to the screen. It was extremely complicated! I wish I’d had a Zynq chip instead…
Thanks for reading!
Image source: Ant-Man from YouTube