
So after breaking a BeagleBone Black by putting 5V onto a breakout pin on the P8 side... we were supplied with a new one! (Thanks!)


This meant that I could go back to coding for the scoring system. After working out the pinmux settings for the mode and for configuring the pin as an input (from various confusing, and sometimes inaccurate, sources, cross-referenced against the technical reference manuals), I had something which turns a green LED on when the infra-red beam is broken. I've created a nice spreadsheet with the values in, which I'll post up later, probably to Google Docs.


Ball Sensor


The transmitter and receiver are a little difficult to see on there (so, Drew, this is my setup!) and we're not using any capes or such, just plain breadboard and wires. Funnily enough, I didn't find any examples of actually using the PRU to receive data on the pins, just output data on them. The simplest way to handle it has been to check whether a bit has been set on that pin and then behave accordingly.
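
The bit test described above can be sketched in plain C. On the PRU the pin states appear as bits of the R31 register; the bit position and the active-high wiring below are assumptions for illustration, not our actual pin assignment.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit position of the IR receiver on the PRU input register. */
#define BEAM_BIT 15

/* Assumption about the wiring: the receiver drives the pin high
   when the beam is broken, so a set bit means "beam broken". */
static bool beam_broken(uint32_t pru_r31)
{
    return (pru_r31 >> BEAM_BIT) & 1u;
}

/* The green LED simply mirrors the sensor state. */
static bool led_state(uint32_t pru_r31)
{
    return beam_broken(pru_r31);
}
```

On the real PRU this is one mask-and-branch on R31 in assembly; the C version just shows the logic.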


We also suspect that the Video4Linux drivers were playing havoc, preventing us from capturing high-resolution video to do the ball tracking on the BeagleBone Black. Currently the resolution is pretty low, but it seems to be sufficient for our image analysis to work, so we have nice direction prediction.



OpenCV table football ball tracking using a BeagleBone Black from Jon Stockill on Vimeo.



P.S. We found a use for our catalogue...


Farnell supports us!

Howdy!  I'm a member of Chicago's hackerspace, Pumping Station: One (website), and we are competing in element14's BeagleBone Black Hackerspace Challenge.



We are a federally-recognized non-profit organization with over 200 members and turned 4 years old back in April.  I shot the above photo at our original space, before we moved last year to our current space, which is much bigger.  I always liked how our logo looked on the wall there... and my photo also catches the most excellent Mitch Altman teaching folks to solder.  Mitch organized element14's last hackerspace challenge: The Great Global Hackerspace Challenge in 2011.  The Pumping Station: One (aka PS1) team built an Arduino-based biosensor array with EKG, GSR, CO2 & more.


We have a couple of recurring public meetups at our space that have shaped our ideas for the BeagleBone Black (aka BBB) challenge:



Real-time, real fun!

Many members, including myself, were very intrigued by the BBB's Programmable Real-Time Units (PRUs): two 200-MHz microcontrollers embedded in the BBB's TI Sitara AM3359 processor that run independently of the processor's ARM Cortex-A8 core.  This means the OS running on the ARM, such as Linux or Android, can offload real-time tasks such as motor control or communication protocols to the PRUs.  element14 Community member shabaz has written wonderful tutorials about using the PRU to interface an LCD and also create a thermal imaging camera.


Box o' LCDs:

After having done many Raspberry Pi projects, I'm most excited about the BBB's ample number of GPIO pins, as I'd always wanted to connect an inexpensive parallel color LCD but was unable to because the Pi lacks enough pins.  Another PS1 member, Ste (our FPGA & signal processing guru), brought a pile of donated LCD panels to the space, which got us wondering if we could interface one with the BBB to set up control panels & kiosks around our space:


However, we eventually determined that we would need to design an adapter board, which would probably take too long (given the turnaround time for most batch PCB services, especially if we needed a second order to fix errors in the first revision).  We also weren't sure whether our completed DIY LCD cape would actually be enough cheaper than the CircuitCo 7" LCD Cape to make it worthwhile.


PS1's Pick-n-Place project:

Several members have been collaborating on a project to produce a Pick and Place machine. PnP machines are robotic assembly devices that place electronic components on circuit boards as part of the soldering process.  Motion control and computer vision are key ingredients.  The goal is to eventually use computer vision software (like OpenCV) to correct misaligned parts and verify correct placement.  The current prototype relies on dead-reckoning.  Here is an early prototype that was presented at NERP back in February:


    Pick-n-Place machine gantry Ver 0.3a in development (Ed_B)


A couple members involved in the project have moved on to other endeavors, but Ed has continued to work on his gantry build.  He presented a demo at our last NERP meeting:


It is using the TinyG motor controller board, and he was sending it G-code from demo files on a Raspberry Pi over serial:




BeagleBone Black makeover:

We decided that a worthy goal for the challenge would be to replace the TinyG and Pi with just the BeagleBone Black.  The PRUs should be capable of generating the step and direction signals for a stepper motor driver IC like the TI DRV8825.  Software to generate G-code from a board layout (a centroid file) can run on the BBB.  The G-code can then be transformed into an array of movements that is passed to the PRU program.
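
As a rough illustration of the G-code-to-movements step, one linear move can be turned into absolute step counts for the motor driver. The steps-per-millimetre figure and the helper name here are invented for the sketch, not our actual calibration:

```c
#include <stdio.h>

#define STEPS_PER_MM 80  /* illustrative calibration value, not measured */

/* Parse a "G1 Xnn.n Ynn.n" linear move and convert the target
   coordinates (in mm) into absolute step counts.
   Returns 1 on success, 0 if the line isn't a G1 X/Y move. */
static int gcode_to_steps(const char *line, long *x_steps, long *y_steps)
{
    double x, y;
    if (sscanf(line, "G1 X%lf Y%lf", &x, &y) != 2)
        return 0;
    *x_steps = (long)(x * STEPS_PER_MM);
    *y_steps = (long)(y * STEPS_PER_MM);
    return 1;
}
```

An array of such (x_steps, y_steps) pairs is the kind of movement list that could then be handed to the PRU program.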


In my next post, I'll describe the different approaches we've explored so far and what our current course of action is.


And now for something completely unrelated...

I can't conclude an introduction of our hackerspace without mentioning that we have a working Scanning Electron Microscope (SEM) that takes beautiful photos.  It's also surprisingly interactive; here's Steve flying around the microscopic world:


And here are 10-micron-tall letters and numbers etched onto an Analog Devices IC die from 2007:



Happy Hacking,



This week I intended to blog about programming the PRU in assembly and working with the GPIO.


Until I bricked the BeagleBone Black we were given.


Usually when you brick a device you can get it back up: typically it's the instructions/firmware on the flash chip that have become corrupted, or the operating system you're running on the board.


However, sometimes when you're wiring up the breakout pins (P8/P9) you accidentally cross-wire the 5 volt line through a button to an input pin socket, or accidentally short a pin out across ground.


That's when you kill it. The Power LED flashes briefly and nothing more happens.


So yes, I killed the challenge BeagleBone Black that we were donated.


Sorry. Can we have another pretty please?


Let this be a lesson to everyone else...



We were not sure who had time to work on the challenge; the majority of the members have full-time jobs and other obligations, so evenings on weekdays and some weekends are often the best bet.


Paul Brook (pbrook) decided to head the challenge project, directing three other interested members; myself (Christopher Stanton / Stanto), Jon Stockill (nav) and Angus (ajtag).


We really needed a way to control the arms of the foosball table, and PBrook and Nav decided that the best way to do this may be with the BeagleBone Black (BBB) controlling a set of servos. These would help to drive a rack-and-pinion type of system, which we have constructed using the 40W laser cutter that we are renting from ChickenGrylls at NottingHack. It produces something that looks like this:


Laser Cut Rack and Pinion


It's probably difficult to imagine what is going to be done with it, because there will also be modified tubing to grab the arms/rods to rotate the foosballers.


This isn't the only development that has been progressing: we also need a way to track the ball, and this has already been done in many different ways with other systems. The most common way is to use a camera to watch the play field and then perform operations based upon what it sees.
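
At its core, the camera approach boils down to isolating the ball's pixels in each frame and computing their centroid; position deltas between frames then give direction. A toy single-frame version of that idea, over a grayscale buffer and without any vision library (the threshold value is arbitrary):

```c
#include <stddef.h>

/* Find the centroid of all pixels brighter than `thresh` in a
   width x height grayscale frame. Returns 1 and fills (cx, cy)
   if any pixel matched, 0 otherwise. */
static int find_ball(const unsigned char *frame, int width, int height,
                     unsigned char thresh, int *cx, int *cy)
{
    long sum_x = 0, sum_y = 0, count = 0;
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            if (frame[(size_t)y * width + x] > thresh) {
                sum_x += x;
                sum_y += y;
                count++;
            }
    if (count == 0)
        return 0;
    *cx = (int)(sum_x / count);
    *cy = (int)(sum_y / count);
    return 1;
}
```

OpenCV/SimpleCV do the same thing far more robustly (colour filtering, blob detection), which is why the team is using them rather than anything hand-rolled like this.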


Since the BBB is going to be running Linux, PBrook and AJtag have been looking at using OpenCV/SimpleCV interfaced with a camera to help track the ball's movement and feed it into the BBB. The version of Linux that HackSpace members are most familiar with is Debian; while the BBB comes with Ångström, there is a Debian build out there.


So I decided to look at using Debian on the BBB and see if I could make an LED blink with the PRU/PRUSS/PRU-ICSS. Examples and guides on this are confusing for an amateur such as myself. Working with the PRU requires loading compiled assembly code onto the chip, but to work with the pin headers/breakout pins on the BBB, the GPIO pins need to be pinmux'd using the Device Tree Compiler (dtc) (as far as I could tell); then you can play with the memory addresses appropriate to the PRU for input/output.
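
The pinmux values that end up in a device-tree overlay are just small bitfields. A sketch of how one such value is assembled, based on the AM335x pad control register layout (bits 0-2 select the mux mode, bit 3 disables the internal pull, bit 4 selects pull-up vs pull-down, bit 5 enables the input receiver); the macro names are mine, not TI's:

```c
#include <stdint.h>

/* AM335x pad control register fields (conf_<module>_<pin> registers). */
#define MUX_MODE(m)   ((uint8_t)((m) & 0x7))  /* bits 0-2: mux mode 0-7 */
#define PULL_DISABLE  0x08                    /* bit 3: disable pull    */
#define PULL_UP       0x10                    /* bit 4: pull-up select  */
#define RX_ACTIVE     0x20                    /* bit 5: input enable    */

/* Build a pinmux value for an input pin in the given mode,
   leaving the pull at its default (pull-down, enabled). */
static uint8_t pinmux_input(int mode)
{
    return (uint8_t)(RX_ACTIVE | MUX_MODE(mode));
}
```

For example, pinmux_input(6) gives 0x26, the sort of value commonly seen in overlays for a PRU input pin (mode 6) with input enabled.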


Only this has been made a little tricky: the PRU is not officially supported for the community and has been removed from the latest technical reference manual for the BBB. However, it does still exist within version C of the technical reference manual, if you can find it. This is mainly moot, though, as a lot (but not all) of the content has been decanted into a GitHub repo. Most of these can be found by searching for BBB PRU (though not with those exact terms), and there's an example of working with it on this site.


Now, I mentioned working with the PRU in Debian. I discovered that, at the time I grabbed the Debian build, its dtc wasn't modified, and I suspect the kernel was not fully altered either, which makes interfacing with the breakout pins/sockets difficult/impossible. However, I did manage to flash the onboard LEDs, which was a mini-victory for myself. As a distraction I found the patches, packages(?) and guide necessary to patch the BBB modifications into Debian for ARM.


Hopefully soon there will be some rods that are moved and rotated by servos, and a camera tracking a ball. On top of this, we decided that there should be a way of keeping score...

Some have think tanks or mind maps; we just threw around ideas until something stuck, over the course of a few days/weeks. Of course, for a while we were introducing members of the space to what the BeagleBone Black actually was:


"Well, what can it do?"

     "It's like a RasPi, mixed with a PiFace/Arduino, but faster, and done properly"

"Okay, but what shall we do with it?"

     "I dunno."


After everyone that turned up regularly enough was introduced to the BeagleBone and the Challenge, some ideas started to appear.


There were two main ideas in contention. One was a fire-fighting robot on wheels, using a camera of some form with infra-red for light/flame detection and controlling a water gun to shoot at the flames. Then the idea turned to an automated air hockey table, but that was considered a bit tricky due to the way the players' pieces are able to move around quite freely (among other reasons). The alternative was a foosball table controlled by the BeagleBone Black, with the ball tracked by a camera.


We're English, so foosball won, especially when a member bought one from eBay for £50. For an all-wooden table it's in pretty good condition.


Foosball Table


The table's missing a few handles, and the playing field bows upwards a bit in the middle, but with some work it'll be fine. We were already thinking up how the BeagleBone can play its part in controlling arm movement and detecting goals scored by either team; of course, this raised even more questions.


BBB - Getting ACE working

Posted by shabaz Top Member Jun 16, 2013


The BeagleBone Black PRU cores are great for high-speed operations. The AM3359 chip on the BBB contains an ARM core and two PRU cores, amongst other modules. The PRUs run as independent CPUs at 200MHz, freeing up the ARM core to continue running Linux applications. It is quite amazing to have three processors in a single chip that are fairly easy to use. This post is about a library called ACE that could be useful for any method we create to communicate with the PRU cores easily; hopefully we can come up with good protocols for that communication, accelerating PRU adoption and making code easier to understand.


Improving communications with the PRU

Some method is required for transferring binary code to be run on the PRU, and for communication between PRU and ARM while the binary code is run. The TI example software makes use of shared memory communications. Some method or protocol is still required to make this simple and safe to use, and to allow multiple processes to communicate with the PRU if desired. Not everyone will want to know about memory locations just to push a command to software running on the PRU.

An example approach

For example, one approach would be to have some code on the ARM that sleeps until a command is received from some Linux app. It would interpret the command and then send some data (via shared memory for example) that the PRU could use. The PRU would be sitting in a loop waiting for the data. The PRU would read the data, and execute some task (such as read from an ADC) and then send a response back. Meanwhile, the Linux app would continue to run at full speed. Once the PRU completes, the response would be sent to the original application that had issued the command.
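
That handshake can be sketched as a shared struct. On the real board this block would live in the PRU's shared data RAM (mmap'd from the ARM side) and the PRU side would be assembly polling it; the field names and the command number below are invented for illustration:

```c
#include <stdint.h>

/* Layout of a hypothetical command block in PRU shared memory. */
struct pru_mailbox {
    volatile uint32_t cmd;   /* non-zero = command pending          */
    volatile uint32_t arg;   /* command argument                    */
    volatile uint32_t resp;  /* response written by the PRU         */
    volatile uint32_t done;  /* set by the PRU when resp is valid   */
};

#define CMD_READ_ADC 1  /* made-up command code */

/* ARM side: post a command; the Linux app is then free to carry on. */
static void arm_post(struct pru_mailbox *mb, uint32_t cmd, uint32_t arg)
{
    mb->arg  = arg;
    mb->done = 0;
    mb->cmd  = cmd;  /* written last so the PRU sees a consistent block */
}

/* PRU side (simulated): service one pending command, if any. */
static void pru_service(struct pru_mailbox *mb)
{
    if (mb->cmd == CMD_READ_ADC) {
        mb->resp = 0x123 + mb->arg;  /* stand-in for a real ADC read */
        mb->cmd  = 0;
        mb->done = 1;
    }
}
```

The ARM-side code that sleeps until a Linux app issues a command would sit in front of arm_post(); polling or an interrupt from the PRU would replace checking `done`.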


One proposed approach is to make use of ACE available from here. It is considered to be reliable. (There may be other software that is also suitable). It would allow a higher level interface to be created to talk with the PRUs from Linux hosted applications.

This post investigates whether ACE can run on the BBB, so that protocols can be devised later for communication with the Linux process, and then a lower layer for communication down to the PRU (maybe people know of some existing methods that could be reused). The latter protocol needs to be lightweight and easy for the PRU to understand, maybe just tag/length/value encoded, for example.
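
A tag/length/value record of that kind is cheap to build on the ARM side and trivial for the PRU to walk. A minimal sketch (the tag numbers and helper names are made up for illustration):

```c
#include <stddef.h>
#include <stdint.h>

/* Append one tag/length/value record to buf; returns bytes written. */
static size_t tlv_put(uint8_t *buf, uint8_t tag,
                      const uint8_t *val, uint8_t len)
{
    buf[0] = tag;
    buf[1] = len;
    for (uint8_t i = 0; i < len; i++)
        buf[2 + i] = val[i];
    return (size_t)len + 2;
}

/* Read the record at buf: report its tag and length, and return a
   pointer to the value bytes. The next record starts at value + length. */
static const uint8_t *tlv_get(const uint8_t *buf, uint8_t *tag, uint8_t *len)
{
    *tag = buf[0];
    *len = buf[1];
    return buf + 2;
}
```

On the PRU side the equivalent parse is just two byte loads and a pointer bump, which is what makes the encoding attractive for a tiny core.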


Compiling ACE

On the BBB, create a folder to do the development work in, e.g. off your home directory, something like /home/root/develop and create a folder called ace, i.e. so the path is /home/root/develop/ace

Download ACE - the required version is ACE-6.2.0.tar.bz2 and place it in that folder.

bunzip2 ACE-6.2.0.tar.bz2

tar xvf ACE-6.2.0.tar


The tar command will result in a folder called ACE_wrappers being created.

cd ACE_wrappers


export ACE_ROOT=/home/root/develop/ace/ACE_wrappers

cd ace


(the path is now /home/root/develop/ace/ACE_wrappers/ace)

cp config-linux.h config-bbb-linux.h


Create a new file:

vi config.h


in this new blank file, add this line:

#include "ace/config-bbb-linux.h"


Save and exit.

cd ../include/makeinclude/

cp platform_linux.GNU platform_bbb_linux.GNU


Edit the platform_bbb_linux.GNU file, and insert after the comments at the top, this line:

static_libs_only ?= 1


Save and exit.

Then, create this new file:

vi platform_macros.GNU


in this new file, add a line (note there is no hash):

include $(ACE_ROOT)/include/makeinclude/platform_bbb_linux.GNU


Save and exit.





(the path is now /home/root/develop/ace/ACE_wrappers )

cd ace

make



This will take a while to compile (about 20 minutes).

Now that the ACE library is compiled, create a folder and subfolders where you want to store everything, if you don't want to copy to /usr immediately, or if you wish to save the compiled stuff for transferring to other BBBs.

For example:

mkdir /home/root/develop/ace/built

cd /home/root/develop/ace/built

mkdir include

mkdir bin

mkdir lib

mkdir share


Then, type these commands to install in the created folders:




cd ace

make install



To now install in the usual folders:




cd ace

make install


Quick ACE test

To test out ACE, make an example application (this one was selected at random):


cd examples/Logger



Then, in one window, execute

cd /home/root/develop/ace/ACE_wrappers/examples/Logger


(7136|3070193664) starting up server logging daemon



And in another window,

cd /home/root/develop/ace/ACE_wrappers/examples/Logger



You should see this at the server:

(7136|3070193664) connected with localhost.localdomain

message = 1

message = 2

message = 3

message = 4

message = 5

message = 6

message = 7

message = 8

message = 9

message = 10

server logging daemon closing down


The built libraries are attached.


2 Hackspaces... 2 BeagleBone Blacks... 1 Winner!


We're pleased to announce a month-long challenge that will pit two transatlantic hackspaces against each other to create a fantastic project using the awesome BeagleBone Black! Each hackspace has been given a BeagleBone Black and challenged to come up with the most awesome project, then submit their results on element14. We've picked the two hackspaces closest to our UK and US offices to compete, and here's a little bit more about them...


Pumping Station One is a hackerspace located in Chicago. Its mission is to foster a collaborative environment wherein people can explore and create intersections between technology, science, art, and culture. They fulfil their role as a community resource by hosting classes on electronics, programming, crafts, and any other skills that members (or guests) are willing to share.


Based in the North of England, Leeds Hackspace has been running for three years and is a group aiming to provide Leeds with a permanent not-for-profit hackspace. Their members come from a range of backgrounds, have differing levels of experience and are interested in everything from woodworking to programming to robots with lasers. The hackspace provides a shared workshop to develop skills, collaborate on projects and be more active in the community.


Once they have created and submitted their projects, we will ask you, the element14 community, to vote for your favourite. The winning hackspace will receive a $1,000 donation to their group (the runner-up will also receive a $500 donation).


Our hackspaces have already begun to post updates on their projects. Check them out! You can buy your own BeagleBone Black here. Who will you be rooting for? In the words of Connor MacLeod... "There can be only one!"




This was a quick, fun exercise to build a complete thermal camera using the BeagleBone Black, a small LCD and a thermal array sensor.

It was really more of a consolidation, combining some earlier experiments.


The setup

The image here shows the entire assembly, capturing an image of my hand (taken from about 1 foot away. It was hard to take a photo at the same time!).



The final code is attached. The information on each sub-section is at these links:

Thermal array circuit

I2C interface code (either compile it, or just copy the libi2cfunc.a file to /usr/lib, and the i2cfunc.h file to /usr/include )

1.8inch LCD display

Image scaling (or use NEON functionality)

PRU information (you will need to install the PRU assembler if you want to make changes - if you don't, you can use the pre-assembled .bin file in the zip file).


The attached code is not tidy, but it works. It could be optimised a lot; I made no effort to do this. Currently the image updates at about 1Hz, but this could be improved many times over. Many conversions are done that could be simplified.

Edit: It looks like the whole process is actually occurring at many tens of Hz, but the LCD updates at about once per second; this looks like a limitation of the LCD display (the update over the serial interface is occurring rapidly, but internally it presumably only updates once per second). An alternative display is required!

Edit 2: It turned out that the LCD was fine; the thermal sensor was set to internally update at 1Hz, regardless of the readout speed. The version 2 code attached has it now set to 16Hz in ir.c, but possibly 4Hz or 8Hz would be sufficient and give higher accuracy.

Compiling and running the code

Here is how to use the code:


First, make sure the I2C library is installed as mentioned above.


Then, copy the attached code into any folder, unzip and then follow one of these two steps:


1. If you have not installed the PRU assembler:

make partclean; make ir_app; make dtfrag



2. If you have installed the PRU assembler:

make clean; make



Copy the generated .dtbo file to the /lib/firmware folder.


Then, type the following (it could be placed in a startup script when the board has booted):

export SLOTS=/sys/devices/bone_capemgr.9/slots

echo cape-bone-lcd18 > $SLOTS

cat $SLOTS



The code can now be run by typing:

./ir_app



To generate video the code was modified (v3 code) to accept a file prefix.

./ir_app img


This will generate files beginning with img, i.e.

img00000.png, img00001.png, etc.


Then, they can be converted into a video using:

avconv -i img%05d.png -b:v 1000k test.mp4



Another video (the temperature to color conversion had a bug in the earlier video, fixed in the v3 code):


Here is another photo, this time of a tub of ice cream.


This is holding an ice lolly on a stick:


Other ideas

It would be nice to be able to retrieve the images via a PC, either stream the images, or to have a web server to access them, or just dump files in a folder for now. I didn't get a chance to try any of these.

If I get time I might build this into an enclosure, as a permanent project (powering the entire thing from a Li-Ion battery that connects to the board; the Olimex BATTERY-LIPO1400mAh fits, if you buy the right connector).

Disclaimer 1: Apologies for the length; these notes are a record of experiments on the BBB, so I was verbose while it was still fresh in my head.

Disclaimer 2: I'm a beginner with ARM and NEON assembler, so some of the information here may be obvious to some, or slightly inaccurate - if so, sorry in advance, and do let me know of any corrections.


What is it?

NEON is a set of functionality inside the BeagleBone Black's ARM core, which provides hardware acceleration for operations that can be done in parallel. It could be useful for image manipulation (the example here), but also for data, voice and video operations such as filtering. (Although the AM3359 device also contains a 3D engine, NEON is not related to it).



Is it hard to use?

In principle, no, although there will be a learning curve before becoming proficient at creating new applications that make good use of NEON.

However, there are already libraries of code that make use of NEON, since it is a mature technology. For example, the cairo library supports NEON (so technically the example here could have been done in cairo). Apparently ffmpeg also uses NEON, so that is accelerated too. ARM has libraries for H.264 as well, which also use NEON hardware acceleration.


How can one use it?

You could search out libraries that already use NEON. Or, you could create your own custom apps by either making use of special C functions, or by writing inline NEON assembler into your C code. Doing the latter entails spending a bit of time getting familiar with a few of the ARM instructions and stack and registers when C functions are called, so that you can conveniently pass information to/from the NEON assembler code.


NEON in a bit more detail

The BBB's AM3359 silicon has a number of on-chip devices, including a 3D graphics engine. That's not something I'm knowledgeable in, but I was curious about simpler 2D effects and sound and video. The ARM processor within the AM3359 actually contains NEON capability, which according to the documentation is a way of executing special instructions on multiple pieces of data at the same time, while normal instructions continue to run.


NEON is built on something known as SIMD (Single Instruction Multiple Data) which takes advantage of the fact that although processors may have (say) 32-bit wide registers, some real world algorithms or media applications may only require 8-bit or 16-bit data. The larger register can be populated with more than one item of data, and then the processor hardware can execute the particular instruction (such as add or multiply) in parallel. They are known as vector operations.

(Note: Prior to NEON, there was a technology known as VFP (Vector Floating Point) that also had vector operations, but they were not executed in parallel. According to this URL, it appears that NEON instructions can be identified as beginning with 'v', while VFP instructions begin with 'f'.)

NEON takes SIMD beyond 32-bit and, amongst other features, offers 128-bit registers, which can handle (as an example) eight items of data, each 16 bits wide (8x16=128), simultaneously.


Example code

To make NEON useful, a system is needed to take data from memory and populate the 128-bit-wide registers (known as Quad-word, or Q, registers) mentioned above. Special instructions exist which do exactly that: you set a conventional (32-bit) register to point to the memory location where the data stream begins, and then execute the NEON instruction to load from that address upward into a Q register. Here is an example that I tried as an experiment, based on a screenshot in the ARM documentation:


The diagram was translated into this code, and Q1 and Q2 were populated with some data from RAM:


vld1.16     {q1}, [r0:128]

vld1.16     {q2}, [r1:128]

vadd.i16    q0, q1, q2

vst1.32     {q0}, [r2:128]


Load instructions (such as vld) take their operands in this direction:

vld dest<--src

Store instructions (such as vst) work in the opposite direction:

vst src-->dst


Using this information, it can be seen that the first instruction takes 128 bits of data starting at the address in register r0, and dumps it into the NEON Q-register called q1.

The next line takes 128 bits of data starting at the address in r1, and stores in q2.

The NEON instruction vadd is responsible for performing a parallel addition, storing the result in q0.

Finally, the vst operation places the contents of q0 (128 bits as mentioned) into the address space beginning at the address stored in register r2.


Traditionally, performing the task above would have required a loop.
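
For comparison, here is the scalar loop that the single vadd.i16 replaces, adding the eight 16-bit lanes one at a time where NEON does all eight in one instruction:

```c
#include <stdint.h>

/* Scalar equivalent of: vadd.i16 q0, q1, q2
   Adds eight 16-bit values element-wise; like vadd.i16, the
   addition wraps modulo 2^16 on overflow. */
static void add8_u16(const uint16_t *a, const uint16_t *b, uint16_t *out)
{
    for (int i = 0; i < 8; i++)
        out[i] = (uint16_t)(a[i] + b[i]);
}
```

With the sample data used later in this post ([0 10 20 ... 70] plus [5 5 ... 5]) this produces the same [5 15 25 ... 75] result as the NEON version.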

The detail above is in assembler, but the code can be written directly in C using a set of special functions, if the compiler understands them (they are intrinsic functions). Apparently gcc understands them (see here for the functions), but I didn't get a chance to try them.

To experiment with NEON instructions, you could also create an assembler listing file (.s) and then assemble it, or you could write inline assembler in (say) C code. The latter approach was pursued here.


Generally, the slight complication with mixing C and assembler is that a little knowledge of the stack and the C calling convention is required. With no ARM experience, it was a little challenging, but a couple of evenings of experimenting helped. A quick way to see what is going on is to force the compiler to generate an intermediate assembler listing. So, you could create a C file (called, say, neon.c) and write a simple main() function that calls a function called (say) neontest() and passes some parameters to it. The neontest function is the one in which you plan to insert some assembler, but for now you can just keep the function empty, and compile the code. Here is the entire neon.c file:

__attribute__((aligned (16)))

unsigned short int data1[8];

unsigned short int data2[8];

unsigned short int out[8];


void
neontest(unsigned short int *a, unsigned short int *b,
                unsigned short int *q)
{
}


int
main(void)
{
  neontest(data1, data2, out);
  return 0;
}




It is compiled using:

gcc -S neon.c



The -S tells the compiler to create a .s file, which can be examined to learn what the compiler is doing. Since we're interested in parallel operations on multiple data, it makes sense to pass at least three parameters to neontest, all of them pointers: two for the input data, and the third for the output data, if we wish to test out the parallel addition described earlier.


The explanation of the resulting assembler code is described further below (in the section 'Deciphering the calling convention'), since it is a bit of a digression from NEON (but it is necessary knowledge; otherwise it's hard to know how to get information to/from the NEON assembler code).


Anyway, the NEON instructions shown earlier were integrated into the neontest function; the final file is attached, along with the assembler listing.


When inserting inline assembler with gcc, this is the syntax:



asm volatile(

"   assembler instruction\n\t"

"   another assembler instruction\n\t"

);



It was compiled using this command line (found on the web; the -Wl portion was to generate a map file, which contains addresses that are useful during debugging):

gcc -march=armv7-a -mtune=cortex-a8 -mfpu=neon -ftree-vectorize -ffast-math -mfloat-abi=softfp -Wl,-Map=neon.map neon.c



Note: The attached file actually uses a trick I found on a website, which is to save off all the other registers into RAM, so that I can freely use those registers without worrying about them; although it was not necessary for this code, the attached file contains it.


The c code (in the main function) just sets up data1 to be

[0 10 20 30 40 50 60 70]


and data2 to be

[5  5  5  5  5  5  5  5]


and then it calls the neontest function. This is the output:

output is: 5, 15, 25, 35, 45, 55, 65, 75



So, to summarize, it is relatively straightforward to set up a skeleton C file, inspect it to see the calling convention, and then begin experimenting with NEON instructions.


More useful example - Image scaling

Now that mixing NEON assembler into C code feels a little more comfortable, it was worth trying a more useful scenario. I was interested in bilinear interpolation, which can be used for scaling images; this is clearly useful for traditional 2D games like Super Mario (this is in fact the example that was on the Internet which I used, see here), but it would also be useful for scaling other kinds of images, such as data to be rendered on an LCD (which was my end aim, but I ran out of time).
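
The arithmetic at the heart of bilinear scaling is a weighted blend of the four pixels surrounding each sample point. A plain scalar C sketch of that blend (single channel, floating point for clarity; this is the operation NEON then performs on many pixels in parallel, not the referenced author's code):

```c
/* Sample a single-channel image of the given row width at fractional
   coordinates (fx, fy) by blending the four surrounding pixels.
   The caller must keep (fx, fy) at least one pixel inside the image. */
static unsigned char bilinear(const unsigned char *img, int width,
                              double fx, double fy)
{
    int x0 = (int)fx, y0 = (int)fy;           /* top-left neighbour   */
    double wx = fx - x0, wy = fy - y0;        /* fractional weights   */
    const unsigned char *row0 = img + y0 * width;
    const unsigned char *row1 = row0 + width;
    double top = row0[x0] * (1.0 - wx) + row0[x0 + 1] * wx;
    double bot = row1[x0] * (1.0 - wx) + row1[x0 + 1] * wx;
    return (unsigned char)(top * (1.0 - wy) + bot * wy + 0.5);
}
```

Scaling an image is then just evaluating this at every output pixel's source coordinate, which is why it vectorises so well.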


It took quite a bit of playing (and lots of segmentation violations) before the code ran, mainly because I don't really understand ARM or NEON instructions in any useful detail currently, so it took a while to understand how to 'glue' the assembler code into the C calling convention to insert the image and extract the result. After it compiled, I tried it on Tintin. The code actually gets called twice; it takes two iterations to complete. However, it is apparently around 10 times faster to execute than without NEON. (The author has another algorithm which is apparently better still, but that was even harder to follow.)


The code is attached ( file). It expects the input image to be a C array. I used Paint Shop Pro to save in RAW format, and then wrote a quick program to convert to the C array (three bytes, i.e. R,G,B, per pixel). The source image of Tintin was 100 by 140 pixels, so that was 100*140*3 bytes in the array. This is the original image:


I wanted to resize it by a factor of 2.4 (by the way, the algorithm needs the input and output dimensions to be divisible by 4 with no remainder), to a size of 240x336. The first pass through the code results in this image:


Finally, the second pass results in this:


So, success, although it took a long time getting there.

Note: if you wish to try the code out, it will generate a binary file which is unformatted (raw), which Paint Shop Pro (and presumably other software) can import. There is no header, just the sequence R,G,B,R,G,B... and so on; in other words, each pixel is 24-bit.


Other interesting things to do with NEON could include the FIR filter example in the PDF document referenced below (in the 'where to find more information' section).


To summarize, NEON could be extremely useful for attaining good speed (apparently maybe 5-10 times acceleration) for data handled in parallel.

The references below have information that was found to be extremely valuable to understand NEON and how to use it. The registers and stack format are described below for those who need this information.


Deciphering the calling convention

This was mainly done by looking at the assembler listing, but also by stepping through the code with gdb:

The main function (listed earlier), just before it calls neontest(..) resulted in this snippet of assembler:


movw    r0, #:lower16:data1

movt    r0, #:upper16:data1

movw    r1, #:lower16:data2

movt    r1, #:upper16:data2

movw    r2, #:lower16:out

movt    r2, #:upper16:out

bl    neontest


(The mov instructions operate in the direction mentioned earlier, i.e. mov dst <-- src.)

It can be seen that the addresses (pointers) to data1, data2 and out are placed in r0, r1 and r2 respectively.

'bl' is a 'branch link' instruction: it stores the address of the next instruction into the 'link register' (aka lr, which is actually r14). This is so that when the neontest function exits, the processor can load the program counter with r14 to continue running main() where it left off.


The assembler code at entry to neontest looked like this:



@ args = 0, pretend = 0, frame = 16

@ frame_needed = 1, uses_anonymous_args = 0

@ link register save eliminated.

str    fp, [sp, #-4]!

add    fp, sp, #0

sub    sp, sp, #20

str    r0, [fp, #-8]

str    r1, [fp, #-12]

str    r2, [fp, #-16]


The stack grows downward in address space. At this point it's easier to look at a diagram. The diagram is inverted, so that a rising (i.e. growing) stack is actually decreasing in address. Each block in the diagram represents 32 bits, i.e. the stack address reduces in steps of 4 bytes as the stack grows.


Before neontest was called, the stack pointer was at a certain location shown on the first diagram (marked as SP0).

Upon entry to the neontest function, the assembler code shown above was executed. The first line stores the frame pointer (fp) register (it is r11) at the address SP-4, i.e. in the block above; it is shown in black in the middle diagram. That same line also then decreases the value of SP by 4 (the exclamation mark on that line causes the write-back). So, SP now points to the location marked as SP1 in that middle diagram.

The second assembler line sets fp to equal sp. It is marked as FP1.

The third assembler line decreases SP by 20 bytes, i.e. 20/4 or 5 blocks in the diagram. So, SP is now pointing to the location marked SP2 in the middle diagram.

The fourth assembler line stores r0 (which contains the address of data1) at location fp-8, in other words two blocks (8 divided by 4) up from the address in fp (which happens to be location FP1).


The remaining two lines do the same for r1 and r2.

The skipped 4 bytes (the white space between FP0 and r0 in the middle diagram) would be used to store the link register if required, according to this diagram (which is upside down compared to the diagram above).


After the neontest function is close to completion, the following assembler code is executed before control returns back to the main() function:


add    sp, fp, #0

ldmfd    sp!, {fp}

bx    lr


The first line sets SP to equal FP+0, i.e. it quickly moves the stack pointer to the location FP1. So, on the last diagram, it can be seen that the stack pointer is now at the location indicated as SP3. That location happens to contain the earlier contents of fp.


So, we now need to populate fp with that older value, set the SP back to the old SP0 value (shown on the first diagram) and set the program counter to the link register value.

The second assembler line above is responsible for popping off the value FP0 and auto-updating the SP value, so that it moves from the location SP3 back down to the old SP0 value.

Finally, the last assembler line is responsible for jumping to the address in the link register (lr).


Where to find more information

Useful white paper - contains a FIR example.


NEON instruction reference


GCC NEON function intrinsics (if you wish to code in C)


PDF copy of a presentation on NEON


Good explanation of the calling convention


GDB tips (likely you'll need it if you are seeing crashes)


NEON blog (this is part 1, there are other parts too)

BoothStache is a version of BeagleStache by Jason Kridner optimized for an expo hall booth at a conference (like DESIGN West).  Instead of an LCD cape, BoothStache uses the BeagleBone Black's HDMI port to display the webcam feed on an HDTV.  An added twist is a big red USB button that the user presses to send a tweet of their stache photo (thanks to a Python helper that bonnie555 wrote called BeagleButan).  Here's the setup:


And here's a video:


The Twitter account it's using:


I used the following equipment:


BeagleBone Black

USB Camera (Amazon)

USB "Stress" Button (eBay)

USB 2.0 Powered Hub

microHDMI cable

32" HDTV


To install the BoothStache version of BeagleStache, use this github repo:


For example, to download and install, run these commands:

cd $HOME

git clone git://

cd stache

make && make install

NOTE: You will need to add the specific key values for your Twitter account in the config file (


To get the USB button working, use this github repo:


These are the install instructions from the readme file:

cd $HOME

opkg update

opkg install python-setuptools

opkg install python-ctypes

git clone

cd pyusb

python setup.py install

git clone

You should then be able to run beaglebutan to test pressing the button:

root@beaglebone:~/beaglebutan# lsusb

Bus 001 Device 002: ID 1a40:0201 Terminus Technology Inc. FE 2.1 7-port Hub

Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub

Bus 002 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub

Bus 001 Device 006: ID 04f3:04a0 Elan Microelectronics Corp.

Bus 001 Device 005: ID 046d:082d Logitech, Inc.


root@beaglebone:~/beaglebutan# python


tempfile doesn't exist
tempfile doesn't exist
tempfile doesn't exist
tempfile doesn't exist
tempfile doesn't exist
You should then start up the beaglestache program by running:

root@beaglebone:~/stache# node ./tweetstache.js

Refer to this gist for an example of the output on the terminal:


Please leave a comment if you have any questions.