Disclaimer 1: Apologies for the length. These notes are a record of experiments on the BBB, so I was verbose while it was still fresh in my head.

Disclaimer 2: I'm a beginner with ARM and NEON assembler, so some of the information here may be obvious to some, or slightly inaccurate - if so, sorry in advance, and do let me know of any corrections.

 

What is it?

NEON is a set of functionality inside the BeagleBone Black's ARM core, which provides hardware acceleration for operations that can be done in parallel. It could be useful for image manipulation (the example here), but also for data, voice and video operations such as filtering. (Although the AM3359 device also contains a 3D engine, NEON is not related to it).

figure1.png

 

Is it hard to use?

In principle no, although there is a learning curve before you can create new applications that make good use of NEON.

However, NEON is a mature technology, so there are already libraries of code that make use of it. For example, the cairo library supports NEON (so technically the example here could have been done in cairo). Apparently ffmpeg also uses NEON, so that is accelerated too. ARM also has libraries for H.264 that use NEON hardware acceleration.

 

How can one use it?

You could search out libraries that already use NEON. Or, you could create your own custom apps by either making use of special C functions, or by writing inline NEON assembler into your C code. Doing the latter entails spending a bit of time getting familiar with a few of the ARM instructions and stack and registers when C functions are called, so that you can conveniently pass information to/from the NEON assembler code.

 

NEON in a bit more detail

The BBB's AM3359 silicon has a number of on-chip devices, including a 3D graphics engine. That's not something I'm knowledgeable in, but I was curious about simpler 2D effects, sound and video. The ARM processor within the AM3359 contains NEON capability, which according to the documentation is a way of executing special instructions on multiple pieces of data at the same time, while normal instructions continue to run.

 

NEON is built on something known as SIMD (Single Instruction Multiple Data) which takes advantage of the fact that although processors may have (say) 32-bit wide registers, some real world algorithms or media applications may only require 8-bit or 16-bit data. The larger register can be populated with more than one item of data, and then the processor hardware can execute the particular instruction (such as add or multiply) in parallel. They are known as vector operations.
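To make the idea of packing several small values into one wider register concrete, here is a minimal software sketch in plain C. It is only an illustration of the principle: NEON does this in hardware with dedicated lanes, whereas this code uses masking to stop carries leaking between lanes. The function names add_4x8 and lane are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Add four packed 8-bit lanes held in one 32-bit word.
   Each lane wraps modulo 256 on its own; no carry reaches its neighbour. */
uint32_t add_4x8(uint32_t a, uint32_t b)
{
    const uint32_t low7 = 0x7f7f7f7fu;  /* lower 7 bits of every lane */
    const uint32_t top  = 0x80808080u;  /* top bit of every lane */

    /* Adding only the low 7 bits means a carry can reach bit 7 of its own
       lane at most, so it never crosses into the next lane. */
    uint32_t sum = (a & low7) + (b & low7);

    /* Fold the top bits back in with XOR (an add without carry-out). */
    return sum ^ ((a ^ b) & top);
}

/* Extract lane i, where lane 0 is the least significant byte. */
uint8_t lane(uint32_t word, unsigned i)
{
    assert(i < 4);
    return (uint8_t)(word >> (8 * i));
}
```

For example, add_4x8(0x01020304, 0x10203040) gives 0x11223344: four additions from one pass through the function.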

(Note: Prior to NEON, there was a technology known as VFP (Vector Floating Point) that also had vector operations, but they were not executed in parallel. According to this URL, it appears that NEON instructions can be identified as beginning with 'v' and VFP instructions as beginning with 'f'.)

NEON takes SIMD beyond 32 bits and, amongst other features, offers 128-bit registers, which can handle (as an example) eight items of data, each 16 bits wide (8x16=128), simultaneously.

 

Example code

To make NEON useful, a mechanism is needed to take data from memory and load it into the 128-bit wide registers (known as quad-word, or Q, registers) mentioned above. Some special instructions exist which do exactly that; you set a conventional (32-bit) register to point to the memory location where the data stream begins, and then execute the NEON instruction to load from that address upward, into a Q register. Here is an example that I tried as an experiment, based on a diagram in the ARM documentation:

addition_diagram.png

The diagram was translated into the code below, which first populates Q1 and Q2 with some data from RAM:

 

vld1.16     {q1}, [r0:128]
vld1.16     {q2}, [r1:128]
vadd.i16    q0, q1, q2
vst1.32     {q0}, [r2:128]


 

Load instructions (such as vld) take their operands in this direction:

vld dest<--src

Store instructions (such as vst) work in the opposite direction:

vst src-->dst

 

Using this information, it can be seen that the first instruction takes 128 bits of data starting at the address in register r0, and places it into the NEON Q-register called q1. The :128 suffix states that the address is aligned to 128 bits (16 bytes), which is why the data needs to be declared with 16-byte alignment in the C code.

The next line takes 128 bits of data starting at the address in r1, and stores it in q2.

The NEON instruction vadd.i16 performs a parallel addition of the eight 16-bit values in q1 and q2, storing the result in q0.

Finally, the vst operation places the contents of q0 (128 bits as mentioned) into the address space beginning at the address stored in register r2.

 

Traditionally, the task above would have needed a loop that adds one pair of values per iteration.

The detail above is in assembler, but the same operations can be written directly in C using a set of special functions known as intrinsics, provided the compiler understands them. Apparently gcc does (see here for the functions), but I didn't get a chance to try them.
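As a rough idea of what the intrinsic route might look like, here is a minimal sketch of the same eight-lane addition. It was not run on the BBB for these notes. The intrinsics vld1q_u16, vaddq_u16 and vst1q_u16 come from arm_neon.h; the function name add8_u16 is made up for this example, and the scalar loop is only there so the file still builds on a machine without NEON.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Add eight 16-bit values lane by lane: out[i] = a[i] + b[i]. */
void add8_u16(const uint16_t *a, const uint16_t *b, uint16_t *out)
{
    assert(a && b && out);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint16x8_t va = vld1q_u16(a);          /* like vld1.16 {q1}, [r0] */
    uint16x8_t vb = vld1q_u16(b);          /* like vld1.16 {q2}, [r1] */
    vst1q_u16(out, vaddq_u16(va, vb));     /* vadd.i16 followed by vst1 */
#else
    /* Scalar fallback: the traditional loop, one addition per pass. */
    for (int i = 0; i < 8; i++)
        out[i] = (uint16_t)(a[i] + b[i]);
#endif
}
```

With the compiler choosing the registers, there is no need to know the calling convention for this version; the trade-off is less control over the exact instructions generated.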

To experiment with NEON instructions, you could also create an assembler listing file (.s) and then assemble it, or you could write inline assembler in (say) C code. The latter approach was pursued here.

 

Generally, the slight complication with mixing C and assembler is that a little knowledge of the stack and C calling convention is required. With no ARM experience, it was a little challenging, but a couple of evenings of experimenting helped. A quick way to see what is going on is to force the compiler to generate an intermediate assembler listing. So, you could create a C file (called say neon.c) and write a simple main() function that calls a function called (say) neontest() and passes some parameters to it. The neontest function is the one in which you plan to insert some assembler, but for now you can just keep the function empty, and compile the code. Here is the entire neon.c file:


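/* The alignment attribute only applies to the declaration it is attached
   to, so each array gets its own. */
unsigned short int data1[8] __attribute__((aligned (16)));
unsigned short int data2[8] __attribute__((aligned (16)));
unsigned short int out[8] __attribute__((aligned (16)));

void
neontest(unsigned short int *a, unsigned short int *b,
                unsigned short int *q)
{
}

int
main(void)
{
  neontest(data1, data2, out);
  return(0);
}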


 

It is compiled using:

gcc -S neon.c

 

 

The -S tells the compiler to create a .s file, and it can be observed to learn what the compiler is doing. Since we're interested in parallel operations on multiple data, it makes sense to pass at least three parameters to neontest; all of them pointers. Two will be for input data, and the third will be for output data, if we wish to test out the parallel addition described earlier.

 

The resulting assembler code is explained further below (in the section 'deciphering the calling convention'), since it is a bit of a digression from NEON (but is necessary knowledge, otherwise it's hard to know how to get information to/from the NEON assembler code).

 

Anyway, the NEON instructions shown earlier were integrated into the neontest function; the final file and its assembler listing are attached.

 

When inserting inline assembler with gcc, this is the syntax:

 

__asm__(
"   assembler instruction\n\t"
"   another assembler instruction\n\t"
    );
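The basic form above relies on knowing which registers hold the function arguments. As an alternative, gcc's extended asm syntax lets the compiler pick the registers and be told what the assembler touches. The sketch below is an assumption about how the addition example could be written that way; it is not the code in the attached file. The function name neon_add8 is made up, the alignment qualifier is left off so that any address works, and the scalar loop is only a fallback for machines that are not 32-bit ARM with NEON.

```c
#include <assert.h>
#include <stdint.h>

/* Add eight 16-bit values lane by lane using inline NEON assembler. */
void neon_add8(const uint16_t *a, const uint16_t *b, uint16_t *out)
{
    assert(a && b && out);
#if defined(__arm__) && (defined(__ARM_NEON) || defined(__ARM_NEON__))
    __asm__ volatile(
        "vld1.16   {q1}, [%[a]]     \n\t"  /* q1 <-- eight values from a */
        "vld1.16   {q2}, [%[b]]     \n\t"  /* q2 <-- eight values from b */
        "vadd.i16  q0, q1, q2       \n\t"  /* eight additions at once */
        "vst1.16   {q0}, [%[out]]   \n\t"  /* q0 --> out */
        :                                   /* no register outputs */
        : [a] "r"(a), [b] "r"(b), [out] "r"(out)
        : "q0", "q1", "q2", "memory");      /* registers and memory we modify */
#else
    /* Fallback so the sketch still builds where the asm above cannot. */
    for (int i = 0; i < 8; i++)
        out[i] = (uint16_t)(a[i] + b[i]);
#endif
}
```

The clobber list at the end is what tells gcc that q0 to q2 and memory are changed, which avoids having to save registers by hand.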


 

It was compiled using this command line (found on the web; the -Wl portion was to generate a map file, which contains addresses that are useful during debugging):

gcc -march=armv7-a -mtune=cortex-a8 -mfpu=neon -ftree-vectorize -ffast-math -mfloat-abi=softfp -Wl,-Map=mymap.map  neon.c

 

 

Note: The attached file (in neon.zip) actually uses a trick I found on a website, which is to save off all other registers into RAM, so that I can freely use those registers without worrying about them; although it was not necessary for this code, the attached file contains it.

 

The C code (in the main function) just sets up data1 to be

[0 10 20 30 40 50 60 70]

 

and data2 to be

[5  5  5  5  5  5  5  5]

 

and then it calls the neontest function. This is the output:

output is: 5, 15, 25, 35, 45, 55, 65, 75

 

 

So, to summarize, it is relatively straightforward to set up a skeleton C file, inspect it to see the calling convention, and then begin experimenting with NEON instructions.

 

More useful example - Image scaling

Now that mixing NEON assembler into C code felt a little more comfortable, it was worth trying a more useful scenario. I was interested in bilinear interpolation, which can be used for scaling images. This is clearly useful for traditional 2D games like Super Mario (this is in fact the example on the Internet that I used, see here), but it would also be useful for scaling other kinds of images, such as data to be represented on an LCD (which was my end aim, but I ran out of time).
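For anyone unfamiliar with bilinear interpolation, here is a plain C sketch of the idea with no NEON at all. It is not the algorithm from the website or the attached file, just a simple single-pass reference: each output pixel is a weighted blend of the four nearest source pixels. The function names are made up, floating point is used for readability rather than speed, and the image layout is assumed to be three bytes (R,G,B) per pixel.

```c
#include <assert.h>
#include <stdint.h>

/* Blend the four source pixels around the fractional position (fx, fy)
   for one colour channel (0 = R, 1 = G, 2 = B). */
static uint8_t sample_bilinear(const uint8_t *src, int w, int h,
                               int channel, float fx, float fy)
{
    int x0 = (int)fx, y0 = (int)fy;
    int x1 = (x0 + 1 < w) ? x0 + 1 : x0;   /* clamp at the right edge */
    int y1 = (y0 + 1 < h) ? y0 + 1 : y0;   /* clamp at the bottom edge */
    float tx = fx - (float)x0;             /* horizontal blend weight */
    float ty = fy - (float)y0;             /* vertical blend weight */

    float p00 = src[(y0 * w + x0) * 3 + channel];
    float p10 = src[(y0 * w + x1) * 3 + channel];
    float p01 = src[(y1 * w + x0) * 3 + channel];
    float p11 = src[(y1 * w + x1) * 3 + channel];

    float top = p00 + (p10 - p00) * tx;
    float bottom = p01 + (p11 - p01) * tx;
    return (uint8_t)(top + (bottom - top) * ty + 0.5f);  /* round to nearest */
}

/* Scale an interleaved RGB image from sw x sh to dw x dh pixels. */
void scale_bilinear_rgb(const uint8_t *src, int sw, int sh,
                        uint8_t *dst, int dw, int dh)
{
    assert(sw > 0 && sh > 0 && dw > 0 && dh > 0);
    for (int y = 0; y < dh; y++) {
        /* Map output corners onto source corners. */
        float fy = (dh > 1) ? (float)y * (float)(sh - 1) / (float)(dh - 1) : 0.0f;
        for (int x = 0; x < dw; x++) {
            float fx = (dw > 1) ? (float)x * (float)(sw - 1) / (float)(dw - 1) : 0.0f;
            for (int c = 0; c < 3; c++)
                dst[(y * dw + x) * 3 + c] = sample_bilinear(src, sw, sh, c, fx, fy);
        }
    }
}
```

The inner arithmetic is the same multiply-and-add pattern for every pixel and every channel, which is what makes this kind of job a good fit for NEON.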

 

It took quite a bit of playing (and lots of segmentation violations) before the code ran, mainly because I don't yet understand ARM or NEON instructions in any useful detail, so it took a while to understand how to 'glue' the assembler code into the C calling convention to insert the image and extract the result. After it compiled, I tried it on Tintin. The code actually gets called twice; it takes the two iterations to complete. However, it is apparently around 10 times faster to execute than without NEON. (The author has another algorithm which is apparently better still, but that was even harder to follow.)

 

The code is attached (neon.zip file). It expects the input image to be a C array. I used Paint Shop Pro to save in RAW format, and then wrote a quick program to convert to the C array (three bytes, i.e. R,G,B, per pixel). The source image of Tintin was 100 by 140 pixels, so that was 100*140*3 bytes in the array. This is the original image:

tintin-source.png

I wanted to resize it by a factor of 2.4 (by the way, the algorithm needs the input and output dimensions to be divisible by 4 with no remainder), to a size of 240x336. The first pass through the code results in this image:

interim.png

Finally, the second pass results in this:

out.png

So, success, although it took a long time getting there.

Note: if you wish to try the code out, it will generate an unformatted (raw) binary file, which Paint Shop Pro (and presumably other software) can import. There is no header, just the sequence R,G,B,R,G,B... etc; in other words, each pixel is 24-bit.
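The raw layout described above can be summed up in a few lines of C. This is a sketch with made-up function names, not the code from neon.zip: it shows how big the buffer is, where a given pixel sits in it, and how such a buffer could be dumped to a headerless file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Number of bytes in a headerless 24-bit RGB image. */
size_t rgb_size(size_t width, size_t height)
{
    return width * height * 3;
}

/* Byte offset of the R component of pixel (x, y); G and B follow it. */
size_t rgb_offset(size_t x, size_t y, size_t width)
{
    assert(x < width);
    return (y * width + x) * 3;
}

/* Write the pixel buffer to a file as-is. Returns 0 on success, -1 on error. */
int write_raw_rgb(const char *path, const uint8_t *pixels,
                  size_t width, size_t height)
{
    size_t size = rgb_size(width, height);
    FILE *f = fopen(path, "wb");
    if (f == NULL)
        return -1;
    size_t written = fwrite(pixels, 1, size, f);
    if (fclose(f) != 0)
        return -1;
    return (written == size) ? 0 : -1;
}
```

For the images here, rgb_size(100, 140) is 42000 bytes for the source and rgb_size(240, 336) is 241920 bytes for the result. Because there is no header, the width and height have to be typed in by hand when importing the file.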

 

Other interesting things to do with NEON could be the FIR filter example in the PDF document referenced below (in the 'where to find more information' section).

 

To summarize, NEON could be extremely useful for attaining good speed (apparently 5-10 times acceleration) for data that can be handled in parallel.

The references below have information that was found to be extremely valuable to understand NEON and how to use it. The registers and stack format are described below for those who need this information.

 

Deciphering the calling convention

This was mainly done by looking at the assembler listing, but also by stepping through the code with gdb:

The main function (listed earlier), just before it calls neontest(..) resulted in this snippet of assembler:

 

movw    r0, #:lower16:data1
movt    r0, #:upper16:data1
movw    r1, #:lower16:data2
movt    r1, #:upper16:data2
movw    r2, #:lower16:out
movt    r2, #:upper16:out
bl    neontest


 

(The mov instructions operate in the direction mentioned earlier, i.e. mov dst <-- src.)

It can be seen that the addresses (pointers) of data1, data2 and out are placed in r0, r1 and r2 respectively; each movw/movt pair loads the lower and then the upper 16 bits of the 32-bit address.

'bl' is a 'branch with link' instruction: it stores the address of the next instruction in the 'link register' (lr, which is actually r14) and then branches. This is so that when the neontest function exits, the processor can load the program counter with r14 to continue running main() where it left off.

 

The assembler code at entry to neontest looked like this:

 

neontest:
@ args = 0, pretend = 0, frame = 16
@ frame_needed = 1, uses_anonymous_args = 0
@ link register save eliminated.
str    fp, [sp, #-4]!
add    fp, sp, #0
sub    sp, sp, #20
str    r0, [fp, #-8]
str    r1, [fp, #-12]
str    r2, [fp, #-16]


 

The stack grows downward in address space. At this point it's easier to look at a diagram. The diagram is inverted, so that a rising (i.e. growing) stack is actually decreasing in address. Each block in the diagram represents 32 bits, i.e. the stack address reduces in steps of 4 bytes as the stack grows.

stack.png

Before neontest was called, the stack pointer was at a certain location shown on the first diagram (marked as SP0).

Upon entry to the neontest function, the assembler code shown above was executed. The first line stores the frame pointer (fp) register (it is r11) at the address SP-4, i.e. in the block above. It is shown in black in the middle diagram. That same line also decreases the value of SP by 4 (the exclamation mark on that line causes that). So, SP now points to the location marked as SP1 in that middle diagram.

The second assembler line sets fp to equal sp. It is marked as FP1.

The third assembler line decreases SP by 20 bytes, i.e. 20/4 or 5 blocks in the diagram. So, SP is now pointing to the location marked SP2 in the middle diagram.

The fourth assembler line stores r0 (which contained the address of data1) in location fp-8, or in other words two blocks (8 divided by 4) up from the address in fp (which happens to be the location FP1).

 

The remaining two lines do a similar thing with r1 and r2.

The skipped 4 bytes (the white space between FP0 and r0 in the middle diagram) are used to store the link register if required, according to this diagram (which is upside down compared to the diagram above).

 

After the neontest function is close to completion, the following assembler code is executed before control returns back to the main() function:

 

add    sp, fp, #0
ldmfd    sp!, {fp}
bx    lr


 

The first line sets SP to equal FP+0, i.e. it quickly moves the stack pointer to the location FP1. So, on the last diagram, it can be seen that the stack pointer is now at the location indicated as SP3. That location happens to contain the earlier contents of fp.

 

So, we now need to populate fp with that older value, set the SP back to the old SP0 value (shown on the first diagram) and set the program counter to the link register value.

The second assembler line above is responsible for popping off the value FP0 and auto-updating the SP value, so that it moves from the location SP3 back down to the old SP0 value.

Finally, the last assembler line is responsible for jumping to the address in the link register (lr).
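The entry and exit sequences described above can be checked with a toy model in C. This is only a sketch to follow the arithmetic: the Cpu structure and function names are made up, the stack is a small array, and addresses are byte offsets into it rather than real addresses. Each line is commented with the assembler instruction it imitates.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_WORDS 16          /* toy stack: 16 blocks of 32 bits */

typedef struct {
    uint32_t r0, r1, r2;        /* argument registers */
    uint32_t fp, sp;            /* frame pointer and stack pointer */
    uint32_t mem[STACK_WORDS];  /* the stack itself */
} Cpu;

static void store_word(Cpu *c, uint32_t addr, uint32_t value)
{
    assert(addr % 4 == 0 && addr / 4 < STACK_WORDS);
    c->mem[addr / 4] = value;
}

static uint32_t load_word(const Cpu *c, uint32_t addr)
{
    assert(addr % 4 == 0 && addr / 4 < STACK_WORDS);
    return c->mem[addr / 4];
}

/* Entry to neontest. */
void prologue(Cpu *c)
{
    c->sp -= 4;                         /* str fp, [sp, #-4]!  (SP0 -> SP1) */
    store_word(c, c->sp, c->fp);        /*   ...old fp saved at the new sp */
    c->fp = c->sp;                      /* add fp, sp, #0      (FP1) */
    c->sp -= 20;                        /* sub sp, sp, #20     (SP2) */
    store_word(c, c->fp - 8, c->r0);    /* str r0, [fp, #-8] */
    store_word(c, c->fp - 12, c->r1);   /* str r1, [fp, #-12] */
    store_word(c, c->fp - 16, c->r2);   /* str r2, [fp, #-16] */
}

/* Exit from neontest (the final bx lr is not modelled). */
void epilogue(Cpu *c)
{
    c->sp = c->fp;                      /* add sp, fp, #0      (SP3) */
    c->fp = load_word(c, c->sp);        /* ldmfd sp!, {fp} */
    c->sp += 4;                         /*   ...sp back to SP0 */
}
```

Running prologue followed by epilogue leaves sp and fp exactly where they started, which is the whole point of the sequence. The block at fp-4 is never written, matching the skipped 4 bytes in the diagram.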

 

Where to find more information

Useful white paper - contains a FIR example.

http://www.arm.com/files/pdf/neon_support_in_the_arm_compiler.pdf

 

NEON instruction reference

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204j/Bcfjicfj.html

 

GCC NEON function intrinsics (if you wish to code in C)

http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html

 

PDF copy of a presentation on NEON

http://www.elinux.org/images/4/40/Elc2011_anderson_arm.pdf

 

Good explanation of the calling convention

http://stackoverflow.com/questions/15752188/arm-link-register-and-frame-pointer

 

GDB tips (likely you'll need it if you are seeing crashes)

http://www.yolinux.com/TUTORIALS/GDB-Commands.html

 

NEON blog (this is part 1, there are other parts too)

http://blogs.arm.com/software-enablement/161-coding-for-neon-part-1-load-and-stores/