We know SPI as a 4*-wire protocol. But it doesn't have to be.

I'm checking high-speed SPI data transfer with buffers, DMA and parallel data lines.



This time, I'm preparing a test bed for SPI with Buffers and DMA. Some hardcore memory probing going on.


Buffers and DMA


Before we look for hardware performance boosters, let's check what we can do on the firmware side.

A typical speed kicker is using buffers and direct memory access (DMA).

We all typically start off our SPI communication with a loop that sends values byte by byte (or word by word).

Many controllers can do better. I'm using a Hercules controller that supports both buffered SPI and DMA.


Buffered SPI means that you can tell the SPI peripheral where your data sits, and how big it is.

The SPI module will then independently (and often without taking up processor power) physically stream the data.

That's already a good thing. It cuts the hand-offs between controller and SPI module: one per buffer instead of one per data element.


The second option, DMA, adds another win.

Your SPI module doesn't even need the controller to get at the data in that case.

Even though the data won't be delivered to the destination at the time you post it, you can walk away and do other things.


Use case: Draw a Bitmap on an LCD


I'm going to try this out on a 128 x 128 colour LCD.

I have a naive implementation that sends pixel per pixel via SPI.

I'll try to boost performance first by using a line buffer.

The SPI module can then move this bitmap line as a single operation to the LCD.

The second exercise is to enable DMA, and have the SPI do the exchange all by itself.

In a last phase, I'll try to work with two buffers, and fill the data for line 2 while line 1 is being drawn.
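The two-buffer idea can be sketched on a host machine before touching the hardware. This is only an illustration under assumptions: `line_bufs`, `draw_double_buffered()` and `fake_send_line()` are hypothetical names, and the "send" is a stand-in that just records what it was given. On the real controller, that call would start the SPI transfer and return immediately.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define LINE_PIXELS 128u                    /* one LCD line, as in the text */

static uint16_t line_bufs[2][LINE_PIXELS];  /* the two line buffers */

/* Host-side stand-in for "start the SPI transfer": it only records
 * what it was handed, so the logic can be checked without hardware. */
static uint32_t lines_sent;
static uint32_t pixel_sum;

static void fake_send_line(const uint16_t *line, uint16_t count)
{
    uint16_t i;
    for (i = 0; i < count; i++) {
        pixel_sum += line[i];
    }
    lines_sent++;
}

/* Sends a row-major bitmap line by line, alternating between two buffers.
 * Returns 0 on success, -1 if a line does not fit in a buffer. */
static int draw_double_buffered(const uint16_t *data, uint16_t width,
                                uint16_t height)
{
    unsigned active = 0;
    uint16_t row;
    size_t line_bytes = (size_t)width * sizeof(uint16_t);

    if (data == NULL || width == 0 || width > LINE_PIXELS) {
        return -1;
    }
    if (height == 0) {
        return 0;
    }
    memcpy(line_bufs[active], data, line_bytes);   /* prime the first buffer */
    for (row = 0; row < height; row++) {
        fake_send_line(line_bufs[active], width);  /* hardware: start transfer */
        if (row + 1u < height) {
            /* fill the other buffer while this line is "in flight" */
            memcpy(line_bufs[active ^ 1u],
                   data + (size_t)(row + 1u) * width, line_bytes);
        }
        /* hardware: wait here until the transfer of the active buffer is done */
        active ^= 1u;
    }
    return 0;
}
```

On the host the send is synchronous, so nothing actually overlaps. The sketch only shows the buffer bookkeeping: which buffer is being sent and which one is being refilled.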


Naive Implementation of the BitMap Draw


The image will be available in our code as a big array of pixels.

The array contains 16-bit values. Each position just contains the RGB colour of that particular pixel.

My LCD is 128*128. My array has 16384 positions.
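To make that layout concrete, here is a minimal sketch of how such a bitmap could be declared and indexed. The struct name `bitmap_t` and the helpers `rgb565()` and `bitmap_pixel()` are made up for illustration, and the RGB565 packing is an assumption: the text only says that the array holds 16-bit colour values.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical bitmap descriptor. The field names mirror the
 * bmp->width, bmp->height and bmp->data used in the draw loop. */
typedef struct {
    uint16_t width;
    uint16_t height;
    const uint16_t *data;   /* row-major, one 16-bit colour per pixel */
} bitmap_t;

/* Assumed RGB565 packing: 5 bits red, 6 bits green, 5 bits blue. */
static uint16_t rgb565(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r & 0xF8u) << 8) | ((g & 0xFCu) << 3) | (b >> 3));
}

/* Row-major lookup: skip y full rows of `width` pixels, then x pixels. */
static uint16_t bitmap_pixel(const bitmap_t *bmp, uint16_t x, uint16_t y)
{
    return bmp->data[(size_t)y * bmp->width + x];
}
```

For a 128 x 128 image, the last pixel sits at index 127 * 128 + 127 = 16383, which matches the 16384 positions of the array.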



I used Vladimir Rioson's LCD Image Converter - I'll have to write another blog post on the process of getting the data into firmware.


I started off with the simplest draw function possible.

For each horizontal line, I iterate through all the pixels and draw them on the correct position.



    for (yy = 0; yy < bmp->height; yy++) {
        for (xx = 0; xx < bmp->width; xx++) {
            /* row-major index: skip yy full rows of bmp->width pixels */
            point(xx, yy, (bmp->data)[yy * bmp->width + xx]);
        }
    }


Each time we call the point() function, we execute 8 SPI calls. That results in 16K * 8 = 131K SPI calls.
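The arithmetic is easy to put in code. The figure of 8 SPI calls per point comes from the text above. The "one transfer per line" count is an assumption about what a line buffer could achieve, and it ignores any set-up commands the display may need. Both function names are hypothetical.

```c
#include <stdint.h>

#define SPI_CALLS_PER_POINT 8u   /* figure quoted in the text for point() */

/* SPI calls needed when every pixel goes through point(). */
static uint32_t naive_spi_calls(uint32_t width, uint32_t height)
{
    return width * height * SPI_CALLS_PER_POINT;
}

/* Assumption: one buffered transfer per bitmap line, set-up ignored. */
static uint32_t line_buffered_transfers(uint32_t height)
{
    return height;
}
```

For the 128 x 128 display that is 131072 calls in the naive version, against 128 transfers if each line really can go out as a single operation.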

It's almost a miracle that it draws in less than a second.

(I'm using my own Hercules Educational BoosterPack MKII LCD library, ported from the Energia one for that display)


Preparation: Special Memory for DMA and Filling the Buffer Line


DMA needs some preparation. On the Hercules, the memory is by default configured to maximise caching.

It only writes info from cache to the real memory when it's needed by the core.

That's cool as long as it's only the ARM core that talks to that memory. But in our DMA set-up, that's not the case.

The SPI module will hit that memory in parallel with the core.

Any caching scheme that doesn't immediately write through to the physical memory fails. So we have to make special provisions.


It would be stupid to make all memory write-through. That would impact the performance of our controller badly.

What we can do instead is to reserve a specific piece of RAM for the buffers, and make that shared.

We'll call this memory area SHAREDRAM. We'll steal this area from the normally configured RAM.


I will create a buffer for one bitmap line. That's 128 colour values.

Because SPI exchanges data both ways on each clock tick, we'll also have to provide a receive buffer - even though our LCD display will never send data back.

So we'll need one buffer of 128 positions for send data (for a 16-bit array that's 256 bytes, or 0x0100 in hex - handy to know for later).

We'll also have to provision a receive buffer of the same size. In total, we'll need double that room (although the receive buffer will stay unused).


Here's a snippet of the memory map. This file is generated during the build process, but showing it now will prove that we're good (you can also see that I've reserved 0x1000 instead of 2 x 0x0100 bytes).

         name            origin    length      used     unused   attr    fill
----------------------  --------  ---------  --------  --------  ----  --------
  SHAREDRAM             0802f000   00001000  00000100  00000f00  RW 
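The sizing above can be double-checked with a few lines of C. The value of `D_SIZE` is an assumption here (128, one display line), and `buffer_bytes()` and `buffers_fit()` are hypothetical helpers. The region length of 0x1000 is the one shown in the memory map.

```c
#include <stddef.h>
#include <stdint.h>

#define D_SIZE           128u     /* assumed: one 128-pixel line */
#define SHAREDRAM_LENGTH 0x1000u  /* length shown in the memory map */

/* Size in bytes of one buffer of 16-bit elements. */
static size_t buffer_bytes(size_t elements)
{
    return elements * sizeof(uint16_t);
}

/* True when a send and a receive buffer both fit in the region. */
static int buffers_fit(size_t elements, size_t region_length)
{
    return 2u * buffer_bytes(elements) <= region_length;
}
```

One buffer comes out at 0x0100 bytes, which is exactly the `used` column in the map, and two of them take 0x0200 of the 0x1000 reserved.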


From Standard Memory Config to Dedicated DMA Area


When we create a Hercules project for the RM46 controller (the one I'm using here), all RAM is in one chunk.


    RAM     (RW) : origin=0x08001500 length=0x0002EB00


And all that RAM has the following attributes:



For our DMA, we'll need better than shareable. So we'll steal 256 bytes from the RAM, and make them write-through.

You'll need knowledge of your linker command file for that. I've blogged about that.


So we'll first nick room for those 256 bytes from our RAM (I'm taking room for more here), and call that chunk SHAREDRAM.


    RAM       (RW) : origin=0x08001500 length=0x0002DB00
    SHAREDRAM (RW) : origin=0x0802F000 length=0x00001000


(see how we decreased the RAM part by 0x01000?)
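A quick way to convince yourself that the two regions line up is to redo the address arithmetic. The origins and lengths below are copied from the linker snippets above. The `region_t` type and the helper names are hypothetical, just for this check.

```c
#include <stdint.h>

/* One MEMORY entry from the linker command file. */
typedef struct {
    uint32_t origin;
    uint32_t length;
} region_t;

/* First address after the region. */
static uint32_t region_end(region_t r)
{
    return r.origin + r.length;
}

/* True when region b starts exactly where region a stops. */
static int regions_adjacent(region_t a, region_t b)
{
    return region_end(a) == b.origin;
}
```

The shrunken RAM ends at 0x0802F000, right where SHAREDRAM begins, and SHAREDRAM ends at 0x08030000, the same end address the original single RAM block had.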


Let's now refine the cache properties of that little chunk, and make it write-through.



Both ARM core and DMA will have the same view of the memory from now on. Guaranteed.


Getting our Buffers in Shared Memory


Now we'll have to place our buffers inside that SHAREDRAM.
We'll first add the directives for the linker in the linker command file.


    .sharedRAM : {} > SHAREDRAM


That instructs the linker to put all variables that are flagged as .sharedRAM in our special shared memory location.

In our code, we use a #pragma to flag this special condition.


#pragma SET_DATA_SECTION(".sharedRAM")
uint16_t TXDATA[D_SIZE];         /* transmit buffer in sys ram */
uint16_t RXDATA[D_SIZE]= {0};    /* receive  buffer in sys ram */
#pragma SET_DATA_SECTION()       /* back to the default data section */


Everything after the first #pragma goes to the special area. Everything after the second #pragma to the default memory space for that type of source object.

Cool, eh? We've defined a special 4KB area with special cache handling. And we've managed to move our buffer into that area.

In the memory map file (created by the linker during the build, edited by me to only show the highlights), you'll find the proof.



         name            origin    length      used     unused   attr    fill
----------------------  --------  ---------  --------  --------  ----  --------
  RAM                   08001500   0002db00  0000a13c  000239c4  RW  
  SHAREDRAM             0802f000   00001000  00000100  00000f00  RW  

run origin  load origin   length   init length attrs members
----------  ----------- ---------- ----------- ----- -------
0802f000    0802f000    00000100   00000000    rw-
  0802f000    0802f000    00000100   00000000    rw- .sharedRAM

 output                                  attributes/
section   page    origin      length       input sections
--------  ----  ----------  ----------   ----------------
*          0    0802f000    00000100     UNINITIALIZED
                  0802f000    00000100     bitmaputils.obj (.sharedRAM:uninit)

address   name                                
-------   ----                                
0802f000  TXDATA  



One thing that stands out is: why is there only room allocated for the TXDATA buffer, and none for the RXDATA one?

That's because my code isn't finished, and I've not used the RXDATA buffer in my code yet. The linker kindly (and correctly) discards it.


Final Prep Step: Efficiently Filling the Buffer


Our code has to move a line of bitmap into the buffer efficiently. For us C aficionados, that means using memcpy().

We'll zap one line at a time from the full bitmap into the line array. Here's the code to do that.


    memcpy(&TXDATA, (const void *)(bmp->data + (bmp->width) * row), 2 * D_SIZE);


Not much to be said here. We're copying a number of bytes (2 * D_SIZE) from the address of the beginning of a row (bmp->data + (bmp->width) * row) to the buffer (&TXDATA).


Why 2 * D_SIZE? Because memcpy() copies a number of bytes. Our array consists of 16-bit values - and 16 bits is 2 bytes. It takes 2 times the array count to get all data moved.
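The row copy can be tried out on a PC too. In this sketch, `line_buf` is a stand-in for TXDATA, `copy_row()` is a hypothetical wrapper, and `D_SIZE` is assumed to be 128. It uses `sizeof` rather than a literal 2 so the byte count follows the element type, and it refuses rows that are out of range or too wide for the buffer.

```c
#include <stdint.h>
#include <string.h>

#define D_SIZE 128u                  /* assumed: one 128-pixel line */

static uint16_t line_buf[D_SIZE];    /* stand-in for TXDATA */

/* Copies row `row` of a row-major 16-bit bitmap into line_buf.
 * Returns 0 on success, -1 if the row is out of range or too wide. */
static int copy_row(const uint16_t *data, uint16_t width, uint16_t height,
                    uint16_t row)
{
    if (data == NULL || row >= height || width > D_SIZE) {
        return -1;
    }
    /* memcpy() counts bytes, so multiply the pixel count by the element size */
    memcpy(line_buf, data + (size_t)width * row,
           (size_t)width * sizeof line_buf[0]);
    return 0;
}
```

With a two-line test image filled with the values 0 to 255, copying row 1 should leave 128 in the first buffer position and 255 in the last.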


What have we Achieved Now?


We have a buffer ready to optimise SPI traffic. The buffer is configured with the correct cache behaviour.

And we have a fast method to move bitmap chunks (lines) into that buffer.

We're all set up for the next stage, where we'll try to turn this into a fast SPI funnel.


Disclaimer: I haven't checked yet if my display actually accepts a series of values and paints them one after the other. We may be in for a surprise if it doesn't.

Anyways, that doesn't impact the approach described here. I'll just not be able to show it for real then ...


The Series
0 - Buffers and Parallel Data Lines
1a - Buffers and DMA
1b - SPI without Buffers
2 - SPI with Buffers
3a - SPI with DMA
3b - SPI with DMA works
4a - SPI Master with DMA and Parallel Data Lines
Hercules microcontrollers, DMA and Memory Cache