
FPGA Group

6 Posts authored by: jpiat

This blog is part 3 of a 4 part series on implementing a gradient filter on an FPGA. If you have not already read the earlier parts, see the link below to get up to speed before reading this one. You can also catch up on some of our previous blog posts, linked below.


Part 1 and 2 of this blog series


Other FPGA blogs by ValentF(x)


In the previous two parts, we designed modules to interface a camera and then created a gradient filter on the FPGA. One key aspect of using an FPGA is that the design needs to be valid by construction. When writing software it's fairly easy to write a buggy first version of an application and then debug it with a step-by-step debugger or with I/O (prints on a serial port, or LEDs) until the software works. On hardware/FPGA you can easily write a hardware description that compiles/synthesizes well but does not work. When this happens you are left with two options:

  • Use a logic analyzer, either physical (a costly piece of equipment) or soft (a logic analyzer you add to your design in the FPGA) and debug your design outputs.
  • Re-write everything hoping for the best


The best approach when writing HDL is to design a test for every component you create (even if your component is just a structure of already-tested components, you should still write a test for it). This test is implemented as a test-bench. A test-bench is a specific HDL component that cannot be synthesized but can be executed in a simulated environment. The test-bench generates input signals (test vectors) for the device to be tested (Unit Under Test, UUT) and gathers the outputs.


Fig 1 : A test-bench used to consist of the device physically connected to test equipment. In HDL all of this is simulated on the designer’s computer.

The test-bench can be instrumented to automatically validate the device under test by comparing the outputs for a given set of inputs to a reference (Unit Testing). Test-benches can also be used to test the device during its lifetime, to make sure it still complies with its initial specification when the designer makes changes to it or to one of its sub-components (Regression Testing). Because it is impossible to generate all combinations of test inputs, it is very important to make sure that the chosen set covers most of the cases (test coverage).


Fig 2 : Minimal HDL development flow

A test-bench is an independent design, and writing a test can sometimes take more time than writing the component itself. A well-designed test will save you a lot of time when it comes to loading your design onto the device and will help you better understand your component's behavior.

In the test-bench the input signal can be generated using the usual VHDL syntax plus an extra set of non-synthesizable functions, mainly for handling timing aspects and IOs. The TextIO package provides an interesting set of functions for handling file inputs/outputs to allow reading/writing values from/to files.

The test-bench can then be executed by a simulator (ModelSim, ISim (Xilinx’s free simulator), GHDL, etc.). The simulator interprets your VHDL and simulates the behavior of the FPGA. The simulation can be either functional or timed: a timed simulation takes the propagation time of the logic into account, while a functional simulation does not. Because the simulator has to emulate the logic you've written, a simulation can take a very long time. For example, in the next blog post we will write a test-bench for the gradient filter that processes a QVGA image (320x240 pixels); this simulation takes ~30 min to complete. On bigger systems, the simulation time can run into hours (regression and unit tests are best run overnight). The simulation process is part of what makes HDL development so slow compared to software: when you have an error in your design, it usually takes a minute to fix in the HDL but minutes or hours to validate the fix. Compare this with the usual software development techniques and you'll understand why it is so important to think your design through before implementing it.

In the following we will design a test for the gradient filter component we designed in Part 2 of this blog series. This test-bench will be implemented in VHDL and simulated using ISE’s integrated simulator, ISim (comes for free with the web edition).

Basic testing : Testing the arithmetic part of the Sobel filter for X gradient values


In this first part of the testing we will consider the arithmetic part of the Sobel filter, which convolves the pixel window with the Sobel kernel (the generic convolution using DSP blocks, before optimization). At the heart of the convolution is a Multiply And Accumulate (MAC) operation that multiplies two 16-bit inputs and adds the product to the previous output to generate a 32-bit result. In the following we will test this simple component. The test will simply stimulate the design with static values to look for potential bugs in the calculated values.

Generating the test-bench skeleton for the unit under test

ISE comes with a nice feature that auto-generates a test-bench template for a specific component. This frees the designer from the hassle of writing the signal declarations and the component instantiation, so you can concentrate on the test behavior. To do so, right click in the file navigator and select “New Source”. In the wizard, select “VHDL Test Bench”, fill in the filename and location, then click “Next”. In the next window select the component to test (the component must be part of your project) and click “Finish”. Beware that if your component has syntax errors, the generated file won’t be valid. To check the syntax, select your component file in the project navigator and click on “Check Syntax” in the process panel.

Once generated, the test-bench is composed of four parts:

  1. Signal, constant and component declarations
  2. Component instantiation and wiring
  3. Clock generation
  4. Stimuli generation


Parts 1, 2, 3 are auto-generated. ISE auto-detects the system clocks (based on the signal names) and by default generates each clock in a separate process. The clock frequency can be tweaked by setting the constant <clock name>_period.  The process looks like this :

clk_process : process
begin
    clk <= '0';
    wait for clk_period/2;
    clk <= '1';
    wait for clk_period/2;
end process;

This process runs endlessly and does the following :

  • Sets the clock signal to low
  • Waits for half the clock period. Note that this “wait for” statement is one of the non-synthesizable statements of VHDL
  • Sets the clock signal to high
  • Waits for half the clock period

This process generates a square wave of the configured frequency on the clock signal.


Part 4 is partially generated with comments to help you understand where to write your test code.


stim_proc: process
begin
      -- hold reset state for 100 ns.
      wait for 100 ns;
      wait for clk_period*10;
      -- insert stimulus here
      wait;  -- suspend the process once the stimulus is done
   end process;


The first part deals with the system reset. You must drive the reset signal of your UUT active to force the system into reset, and then set the reset inactive just after the “wait for 100 ns;”. The test then does nothing for 10 clock cycles, and the fun part starts at “-- insert stimulus here”.


Your stimulus is the sequence of inputs that test the unit. The inputs are generated using the usual HDL assignment operators, and they are sequenced with the wait statement. The wait statement can be used either with a duration expressed in time units (picoseconds, nanoseconds, ...) or with a boolean condition using the until keyword:


wait for 10 ns ;

wait until clk = '1' ;


Testing MAC16


We have generated the test-bench template for MAC16, now let’s write the test process. We will first write a simple test that will stimulate the MAC16 with two simple values.


stim_proc: process
begin
      -- hold reset state for 100 ns.
      reset <= '1';
      wait for 100 ns;
      reset <= '0';
      wait for clk_period*10;
      -- insert stimulus here
      A <= to_signed(224,16);
      B <= to_signed(3967,16);
      add_subb <= '1' ;
      wait;  -- stop here; the inputs keep their values
   end process;


After writing the test process, click the “Simulation” check-box in the project navigator window, then select the test bench file and click “Simulate Behavioral” in the process window.




If your test-bench contains no errors, this will launch the ISim tool. After a bit of time you should end up with the following window.




Use the zoom-out button and the horizontal scroll-bar to get to the beginning of the simulation with an appropriate scale (you should see the clock edges).




To set the signals display format, right click on the “a[15:0]” signal, select “Radix” and “Signed decimal”.  Do the same for “b[15:0]” and “res[31:0]”. You should now have the following trace.




If you zoom in on the “res” signal between 200ns and 250ns you get the following sequence of results.


888608, 1777216, 2665824, 3554432


As we know the expected behavior of the MAC we can check the result validity :


224*3967 = 888608 -> 888608 + (224*3967) = 1777216 -> 1777216 + (224*3967) = 2665824 …
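The same check can be scripted with a small software model. The sketch below is only an assumption about how MAC16 behaves (accumulator cleared by reset, add_subb = '1' meaning "add", 32-bit two's-complement wrap-around); the function names are hypothetical and not part of the original project.

```python
def to_signed(value, bits):
    """Wrap an integer into a two's-complement range of the given width."""
    mask = (1 << bits) - 1
    value &= mask
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def mac16_sequence(a, b, cycles):
    """Accumulate a*b for `cycles` clock cycles, starting from 0.

    Assumption: the accumulator is cleared by reset and the MAC is in
    accumulate (add) mode, as in the stimulus used in this post.
    """
    acc = 0
    results = []
    for _ in range(cycles):
        acc = to_signed(acc + to_signed(a, 16) * to_signed(b, 16), 32)
        results.append(acc)
    return results


print(mac16_sequence(224, 3967, 4))
```

Under these assumptions the model reproduces the four values read from the waveform.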


At this point if something fails in your design, you can go back to ISE, edit your file and then in ISim press the relaunch button to restart the simulation as in the following image.






Reporting errors

Now that we know that the design works, we can improve the test to automatically report errors. The “assert” statement allows us to report warnings/errors/failures to the designer from the simulation. This report will then help the designer to spot exactly where the problem occurs. In our case we will report a failure if the result of the first MAC cycle differs from what is expected.


stim_proc: process


      -- hold reset state for 100 ns.

       reset <= '1';

      wait for 100 ns;   

       reset <= '0';

      wait for clk_period*10;

      -- insert stimulus here

       A <= to_signed(224,16);

       B <= to_signed(3967,16);

       add_subb <= '1' ;

       wait for clk_period ;

       ASSERT res = (224*3967) REPORT "Result does not match what is expected" SEVERITY FAILURE;


   end process;


In this process if the result is different from the expected result, the simulation will stop. A less critical report would be ERROR or WARNING (won’t stop the simulation) and NOTE would just inform the user. The report message will be printed in the simulator console window.






So far we have only tested the behavior of the MAC16 component for a single pair of values, and we validated the sequence of values by hand. To create a better test that covers more cases, we need an input test vector (a sequence of inputs to apply to the module) and an output test vector (the expected results for that input sequence). These vectors can either be stored in a file that the simulation reads using the TextIO package, or be coded directly in the test-bench. For the purposes of this blog post we will implement the second method (the first method is better for large tests).


First we need to declare the array vector types for our inputs and outputs:


type input_vector_operand_type is array(natural range <>)  of signed(15 downto 0);

type output_vector_res_type is array(natural range <>)  of integer;


Then we need to create the input vectors and expected outputs as follows:


-- test vectors
    constant a_vector : input_vector_operand_type(0 to 5) := (
    to_signed(0, 16),
    to_signed(256, 16),
    to_signed(-64, 16),
    to_signed(16, 16),
    to_signed(0, 16),
    to_signed(0, 16)
    );

    constant b_vector : input_vector_operand_type(0 to 5) := (
    to_signed(1034, 16),
    to_signed(-1, 16),
    to_signed(-89, 16),
    to_signed(32000, 16),
    to_signed(0, 16),
    to_signed(0, 16)
    );



    constant res_vector : output_vector_res_type(0 to 5) := (









For the results, the two initial 0 values are to take into account the pipeline of the MAC16 component. This component has a latency of two clock cycles before a change on the inputs impacts the output.
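Expected-output vectors like this are usually produced by a software model rather than by hand. The sketch below shows one way to do that; it assumes (without confirmation from the original listing) that the accumulator starts at 0, that every step adds a*b, and that the output lags the inputs by two steps. The helper names and the constant name res_vector_model are hypothetical, and the values it prints come from this model, not from the original test-bench.

```python
def expected_results(a_values, b_values, latency=2):
    """Return the value expected on the output at each step of the test loop.

    Assumptions (not confirmed by the source): the accumulator starts at 0,
    every step adds a*b, and the output lags the inputs by `latency` steps.
    """
    acc = 0
    accumulated = []
    for a, b in zip(a_values, b_values):
        acc += a * b
        accumulated.append(acc)
    # Delay the sequence by the pipeline latency and keep the same length.
    return ([0] * latency + accumulated)[:len(a_values)]


def vhdl_integer_vector(name, values):
    """Format a list of integers as a VHDL constant of output_vector_res_type."""
    body = ",\n".join(f"    {v}" for v in values)
    return (f"constant {name} : output_vector_res_type(0 to {len(values) - 1}) := (\n"
            f"{body}\n);")


a_vector = [0, 256, -64, 16, 0, 0]
b_vector = [1034, -1, -89, 32000, 0, 0]
print(vhdl_integer_vector("res_vector_model", expected_results(a_vector, b_vector)))
```

If the real component accumulates or aligns its pipeline differently, only expected_results needs to change; the formatting helper stays the same.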


Then we have to write the process that scans those vectors and reports the errors/failures using assert.


stim_proc: process
begin
      -- hold reset state for 100 ns.
      reset <= '1';
      wait for 100 ns;
      reset <= '0';
      wait for clk_period*10;
      -- insert stimulus here
      for i in 0 to 5 loop
            A <= a_vector(i);
            B <= b_vector(i);
            add_subb <= '1' ;
            -- compare the current output with the expected value for this step
            ASSERT res = res_vector(i) REPORT "Result does not match what is expected "&integer'IMAGE(res_vector(i))&" != "&integer'IMAGE(to_integer(res)) SEVERITY FAILURE;
            wait until falling_edge(clk) ;
      end loop ;
      wait;  -- all vectors applied: stop the process
   end process;


The for loop iterates over the range of the test vectors and for each set of inputs, the result of the MAC16 is tested. If the result does not match the assert condition, the simulation will fail and indicate what went wrong.


Now that the base module of our convolution filter has been proven to work, the other components of the Sobel filter must be tested. Once the MAC16 is tested we can plan to test the full gradient filter. Testing the filter using hand-designed test vectors can be very painful, considering the amount of data that needs to be generated to test a whole image. In this case debugging at a higher level is a better solution and allows us to evaluate the quality of the filter.


Testing the Sobel filter using images will be the topic of the next blog post.



Creative Commons License
This work is licensed to ValentF(x) under a Creative Commons Attribution 4.0 International License.

This blog is part 2 of a 3 part series on implementing a gradient filter on an FPGA. If you have not already read part 1, see the link below to get up to speed before reading this one. You can also catch up on some of our previous blog posts, linked below.

Part1 of this blog series

Other FPGA blogs by ValentF(x)


Gradient filters for image processing are kernel-based operations that process an array of pixels from an input image to generate a single pixel in an output image. Most gradient filter algorithms use a 3x3 pixel window, but some use a 2x2 pixel window. In this post we will use a Sobel-based operator to implement the gradient filter. The Sobel kernel is designed to extract the gradient in an image, either over the U axis (along columns) or the V axis (along horizontal lines), and the two U/V oriented filters can be combined to extract the gradient direction or the U/V gradient intensity.




Figure 1 : logo and its gradient intensity, extracted using U/V Sobel filters


Figure 2 : Computing gradient intensity for a small image

A gradient filter is one step in many vision processing tasks, such as edge enhancement, corner extraction and edge detection.

Problem analysis

In this part of the article we will analyze the Sobel filter for the V direction (rows) and then generalize to the U direction (columns).


The Sobel filter is a 2D spatial high-pass FIR filter (see: Sobel operator - Wikipedia, the free encyclopedia). FIR stands for Finite Impulse Response. This class of filter computes its output based on the input history only, as opposed to an Infinite Impulse Response (IIR) filter, which computes its output based on both the input history and the output history.

The filter kernel is composed of  the following:



This means that to generate a single pixel in the output image, we need to access 9 pixels in the input image, multiply them by 9 values and add the partial results. Because some kernel components are 0, only 6 multiplications and 5 additions actually need to be performed. This operation of multiplying the kernel values with values from the input to generate a single output is called a convolution, and it is the foundation of a wide range of kernel-based operations.


Figure 2 : Convolution operation for a 3x3 kernel


To sum up, the convolution operation requires:

  • To be able to generate a 3x3 pixel window from the input image for every U,V positions in this image
  • To perform 9 multiplications and 8 additions for every pixel (can be optimized depending on the kernel)
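As a software reference for the two requirements above, here is a minimal sketch of the 3x3 operation on an image held fully in memory. The kernel values are inferred from the coefficients (1, 2, 1, -1, -2, -1) given later in this post; their orientation and sign convention are assumptions, and the kernel is applied without flipping (strictly a correlation, which for Sobel only changes the sign of the result).

```python
# Sobel kernel for the V direction, inferred from the coefficients
# (1, 2, 1, -1, -2, -1) listed later in this post; the orientation and
# sign convention are assumptions.
SOBEL_V = [
    [ 1,  2,  1],
    [ 0,  0,  0],
    [-1, -2, -1],
]


def convolve_pixel(image, u, v, kernel):
    """Apply a 3x3 kernel centred on column u, row v of a 2D list of pixels."""
    acc = 0
    for dv in (-1, 0, 1):
        for du in (-1, 0, 1):
            acc += kernel[dv + 1][du + 1] * image[v + dv][u + du]
    return acc


def convolve_image(image, kernel):
    """Filter every pixel that has a full 3x3 neighbourhood (borders skipped)."""
    rows, cols = len(image), len(image[0])
    return [[convolve_pixel(image, u, v, kernel) for u in range(1, cols - 1)]
            for v in range(1, rows - 1)]


# A horizontal edge: bright rows on top, dark rows below.
image = [
    [255, 255, 255, 255],
    [255, 255, 255, 255],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
print(convolve_image(image, SOBEL_V))
```

A model like this is also a convenient source of expected values when the hardware filter is simulated.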


In a typical software implementation, the program has direct access to the complete image, so generating the 3x3 window is just a memory addressing problem. Based on our previous post, Gradient Filter implementation on an FPGA - Part 1 Interfacing an FPGA with a camera, we only have access to a single pixel of the image at a given time, so building the window requires managing a memory that stores part of the pixel stream.


Hardware implementation


To develop our hardware Sobel filter, we will divide the architecture into two main modules, a 3x3 block module and the arithmetic module.

3x3 Block

As seen in the previous post, the camera module does not output a 2D stream of pixels, but rather a 1D stream of pixels with an additional signal indicating when to increment the second dimension (hsync). To design this component we first need to understand what is the smallest amount of data we need to store in order to do the filtering. Designing in an FPGA is, most of the time, a matter of designing the optimal architecture so every memory bit counts. A quick study of the problem shows that the minimum amount of data to store is two lines of the image plus 3 pixels to have a sliding 3x3 window in the image.



Figure 3 : Position of the sliding windows based on the hsync and pixel_clock signals.

This memory management can be implemented simply in hardware. The 3x3 block works on a 3x3 2D register, with each position containing a pixel (8-bit, or 9-bit for signed pixels). The steps of this procedure are as follows.

  • Store the pixels at index 2, 3 and index 3, 3 in memory 1 and memory 2 respectively
  • Shift the columns of the block register
  • Take the incoming pixel and store it at index 3, 3
  • Grab one pixel from memory 2 and store it at index 2, 3
  • Grab one pixel from memory 1 and store it at index 1, 3
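The steps above can be modelled in software to check the bookkeeping before writing any HDL. This is a behavioural sketch only: the names are hypothetical, indices are 0-based, and the operations are applied in a software-friendly order that may differ from the hardware, but it keeps the same storage (two line memories plus a 3x3 register).

```python
def sliding_windows(pixels, width):
    """Yield (row, col, window) for a raster-ordered pixel stream.

    `window` is a 3x3 list of rows; window[2][2] is the newest pixel.
    Only two lines of storage plus the 3x3 register are kept.
    """
    line_mem1 = [0] * width   # pixels from two lines above the current one
    line_mem2 = [0] * width   # pixels from the line just above
    window = [[0] * 3 for _ in range(3)]
    for index, pixel in enumerate(pixels):
        row, col = divmod(index, width)
        top, mid = line_mem1[col], line_mem2[col]
        # Shift the window one column to the left and load the new column.
        for r, new in enumerate((top, mid, pixel)):
            window[r] = window[r][1:] + [new]
        # The line above becomes "two lines above" for the next row.
        line_mem1[col] = mid
        line_mem2[col] = pixel
        if row >= 2 and col >= 2:          # window fully inside the image
            yield row, col, [r[:] for r in window]


# 4x4 test image whose pixel value equals its raster index.
stream = list(range(16))
for row, col, win in sliding_windows(stream, width=4):
    print(row, col, win)
```

Note that the window produced when pixel (row, col) arrives is centred on (row - 1, col - 1), so the filter output lags the input by one line and one pixel.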


The address management of memories 1 and 2 is based on the pixel count and line count extracted from the synchronization signals pixel_clock and hsync. The sequence of operations described previously can all be executed in parallel (which screams for an FPGA). This first version of the 3x3 block requires very little work and works fine at the pixel_clock frequency, but it consumes too many memory units. Most of the time, the block RAMs (BRAM) in an FPGA are atomic, so the synthesizer cannot map multiple memories described in HDL to a single block RAM in the FPGA (let's say you describe a 256-byte memory and a 64-byte memory: they cannot be mapped to a single 2-kbyte memory).

Info:  In the Spartan3/6, block RAMs are 18Kbits with a configurable bus-width, these block RAMs can be split into two 9Kbits on the Spartan-6.

In our case, memory1 and memory2 cannot be mapped to the same block RAM and will use two block RAMs with a lot of their space unused (one line in VGA is 640 pixels, and a block RAM can contain 1K or 2K pixels). Since the addressing of the two memories is very similar (identical, in fact), one trick is to use a single 16-bit wide memory and store the pixels of memory1 in the MSBs of the data and the pixels of memory2 in the LSBs.
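The packing trick is simple bit manipulation. A minimal sketch, with hypothetical helper names and 8-bit pixels assumed:

```python
def pack(pixel_mem1, pixel_mem2):
    """Store the memory1 pixel in the MSBs and the memory2 pixel in the LSBs."""
    return ((pixel_mem1 & 0xFF) << 8) | (pixel_mem2 & 0xFF)


def unpack(word):
    """Recover (memory1 pixel, memory2 pixel) from a 16-bit word."""
    return (word >> 8) & 0xFF, word & 0xFF


line_buffer = [0] * 640           # one 16-bit word per column of a VGA line
line_buffer[10] = pack(0x12, 0xAB)
print(hex(line_buffer[10]), unpack(line_buffer[10]))
```

Both line memories then share one address and one read/write per pixel, which is what lets them live in a single block RAM.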

For the actual code, have a look at

Sobel arithmetic


Once we have the block of pixels for every input pixel, the pixel window needs to be convolved with the Sobel kernel. Since some elements of the kernel are zero, this requires only 6 multiplications. Small FPGAs have very few multipliers, so any arithmetic optimization helps greatly. In the case of the Sobel filter, the optimization comes from the fact that all the multiplications are by ±1 or ±2, so only bit shifts (and sign changes) are required! A general implementation of the convolution would require at least one multiplier (DSP block in the FPGA) and potentially up to 9 multipliers. Considering that a Spartan-6 LX9 FPGA contains 16 DSP blocks, a non-optimized implementation of a convolution filter can easily consume all the DSP blocks of the FPGA.

Once all the multiplications have been processed, we still need to add the products together, which requires 5 additions. These additions could be performed in a single lengthy adder with 6 operands. This kind of addition greatly limits the frequency of the system, because propagating the signal through a 5-stage adder takes more time than through a single adder (more than 5 times longer, due to routing). This is where pipelining (see: Gradient Filter implementation on an FPGA - Part 1 Interfacing an FPGA with a camera for more information on using pipelines in an FPGA) comes to help. The goal of pipelining is to pay the propagation cost of only a single adder on each clock cycle, rather than 5. Such a structure is called an adder-tree. On every clock cycle each stage adds only two operands and stores its output in a register. With such a structure the longest propagation time is reduced to that of a single adder, but the latency of the addition is 4 clock cycles.


Figure 4 : Optimized Sobel convolution computation with pipelined adder-tree

This kind of implementation is very frequency efficient and makes it possible to target very high clock frequencies in the system. The main drawback is that it uses more registers, hence more area of the FPGA.

Now that we have the architecture, we also need to configure the width of the data-path. One possible way of doing this is to perform all operations on 32 bits or 16 bits to avoid overflow. The better solution is to compute the actual width of the data along its path and only use the useful bits, hence limiting the resource usage of the system.

To Sum it all up - Steps Required:

  • The input is 8-bit pixels
  • Pixels are multiplied by positive and negative values, so they need to be coded on 9 bits at the arithmetic module input (signed binary arithmetic)
  • Pixels are multiplied by 1, 2, 1, -1, -2, -1 so  are respectively shifted by 0, 1, 0, 0, 1, 0, which gives us for each partial product 9-bit, 10-bit, 9-bit, 9-bit, 10-bit, 9-bit
  • These values are added - each addition can add up to 1-bit to the wider word : 9-bit + 10-bit = 11-bit, 9-bit+9-bit=10-bit, 9-bit + 10-bit = 11-bit => 11-bit + 10-bit + 11-bit = 13-bit

In fact, because the additions cannot be considered separately, the final addition of 11-bit + 10-bit + 11-bit will not overflow 11 bits. Let's work through the worst case to check.

Worst case for a Sobel filter along the image lines is the following 3x3 pixels window (pixel in this case are unsigned values) :
255, 255, 255
xxx, xxx, xxx
000, 000, 000
1x255+2x255+1x255-1x0-2x0-1x0 = 1020,  fits on 10-bits + a bit for the sign => 11bits.
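The same worst-case reasoning can be checked numerically. The sketch below computes the extreme values of the Sobel sum for unsigned 8-bit pixels and the smallest signed width that holds them; the helper name is hypothetical.

```python
def signed_bits_needed(lo, hi):
    """Smallest two's-complement width that can hold every value in [lo, hi]."""
    bits = 1
    while not (-(1 << (bits - 1)) <= lo and hi <= (1 << (bits - 1)) - 1):
        bits += 1
    return bits


coefficients = [1, 2, 1, -1, -2, -1]
pixel_max = 255  # unsigned 8-bit input pixels

# Extreme outputs: bright pixels under the positive (or negative) coefficients.
highest = sum(c * pixel_max for c in coefficients if c > 0)
lowest = sum(c * pixel_max for c in coefficients if c < 0)

print(lowest, highest, signed_bits_needed(lowest, highest))
```

The range is -1020 to 1020, which fits in 11 signed bits, in agreement with the hand calculation.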

So the output of one Sobel convolution fits on 11 bits. To get the gradient intensity, the U and V gradients need to be combined using the Euclidean distance, which requires squaring each gradient and then taking the square root of the sum.
Square roots are very expensive to perform on an FPGA, so we can choose to approximate the Euclidean distance by summing the absolute values. This saves quite a lot of FPGA area and improves the performance of the system.
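To get a feel for the cost of this approximation, the sketch below compares the two formulas on a few gradient pairs (values chosen for illustration only). The sum of absolute values is never smaller than the Euclidean distance and never larger than √2 times it.

```python
import math


def magnitude_exact(gu, gv):
    """Euclidean gradient magnitude."""
    return math.sqrt(gu * gu + gv * gv)


def magnitude_approx(gu, gv):
    """Sum of absolute values: no multiplier and no square root needed."""
    return abs(gu) + abs(gv)


for gu, gv in [(1020, 0), (300, -400), (-500, 500)]:
    print(gu, gv, round(magnitude_exact(gu, gv), 1), magnitude_approx(gu, gv))
```

The error is zero for purely horizontal or vertical edges and largest for diagonal ones, which is usually acceptable when the result is only thresholded or displayed.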

The full Sobel implementation can be downloaded at:

There are two implementations

  • First implementation uses a processing clock that is 4x the max pixel clock. In this implementation, the FPGA can use up to two clock cycles to process a single pixel. This is not particularly useful for Sobel, but some other convolutions may use a different kernel coefficient that requires multiplication instead of shifting. This implementation is fine for “small” resolutions like VGA at up to 60FPS (24MHz pixel-clock, 100MHz processing clock).
  • Second implementation is fully pipelined and uses the pixel clock as the processing clock. This implementation uses more resources but is very efficient frequency-wise (the pixel clock can be higher than 130MHz) and can process HD images.


The next step will be the debugging of our filter using a specific test-bench to stimulate the design with real images and generate output images.  Stay tuned for part 3 of this blog series for implementing a test-bench and simulating the design.


previous post : Obstacle detection using Laser and image processing on LOGI-Bone

FPGA Camera Data Processing

This is part 1 of a 2 part article which details interfacing a camera to an FPGA, capturing the data and then processing the data using a pipelining technique.  One of the many strengths of using an FPGA is the speed and flexibility it gives to processing data in a real-time manner.  An interface to a camera is a good example of this case scenario where cameras output very high amounts of data very quickly and generally customized or dedicated hardware is required to process this data.


One specific attribute of an FPGA is that it can be used to implement a given processing task directly at the data-source, in this case: the camera. This means that with a good understanding of the signals generated by the camera we can adapt image filters to directly process the signals generated by the camera instead of processing an image stored in memory like a CPU would do, i.e. real-time processing.


A camera is a pixel streaming device. It converts photons into binary information for each pixel. Each pixel is a photon integrator that generates an analog signal, followed by an analog to digital converter. The camera then transmits the captured information on its data bus, one pixel at a time, one row after the other. The pixels can be captured in two different ways that directly affect the kind of application the sensor can be used in: rolling shutter and global shutter.


Rolling Shutter Camera Sensors

Rolling-shutter sensors are widely adopted because they are cheap and can be built for high resolution images. These sensors do not acquire all the pixels at once, but one line after the other. Because the pixels are not all exposed at the same time, this generates artifacts in the image. For example, take a picture of a rotating fan and observe the shape of the fan blades (see image below for comparison). Another noticeable effect can be seen when taking a picture of a scene lit by a halogen or fluorescent light. In that case the pixel lines are not all exposed to the same amount of light, because the light intensity varies at 50/60Hz, driven by the mains frequency.

Global Shutter Camera Sensors

Global shutter sensors are more expensive and are often used in machine vision. In these sensors all of the pixels are exposed at the same time, as a snapshot. The pixel information is then streamed to the capturing device (FPGA in our case). These sensors are more expensive because they require more dedicated logic to record all the pixels at once (buffering). Moreover, the sensor die is larger (larger silicon surface), because the same surface contains the photon integrators and the buffering logic.



Once captured, the pixel data can be streamed over different interfaces to the host device (FPGA in our case). Examples of typical camera data interfaces are parallel interfaces and CSI/LVDS serial interfaces. The parallel interface is composed of a set of electrical signals (one signal per data bit), and is limited in the distance over which the data can be transmitted (on the scale of inches). The serial interface sends the pixel information one bit after another over the same data lines, a positive/negative differential pair. LVDS (Low Voltage Differential Signaling) carries the serial data at high rates (up to 500Mbps for a camera) and allows transmission over longer distances (up to 3 feet on the LOGI SATA type connector).

The LOGI Cam

The LOGI Cam supports many of the Omnivision camera modules, but is shipped with the OV7670 which is a low cost rolling shutter sensor that exposes a parallel data bus with the following signals.


pclk: the synchronization clock to sample every other signal, this signal is active all the time

href: href indicates that a line is being transmitted

vsync: vsync indicates the start of a new image

pixel_data: the 8-bit data bus that carries pixel information at each pclk pulse when href is active

sio_c/sio_d: an i2c like interface to configure the sensor



Fig 0: The first diagram shows how pixels are transmitted in a line. The second part is a zoomed-out view of the transmission, and shows how lines are transmitted in an image.




Pixel Data Coded Representations

The parallel data bus is common for low cost sensors and is well suited to streaming pixel data. What one will notice is that the pixel data is only 8 bits wide, which leads to the question: how does the camera send color data without more than 8 bits per pixel on this data bus? The answer is that each component of the pixel is sent one after another in sequence until the complete pixel data has been transmitted. This means that for a QVGA (240 lines of 320 pixels per line) color image, with 2 bytes per pixel, the camera sends 240 lines of 640 values.


RGB Color Space

One might wonder how the camera can compose each pixel’s color data with only 2 bytes (i.e. does it produce only 2^16 or 65536 different values)? There are two typical ways to represent the pixel colors: RGB (Red Green Blue) and YUV coding. RGB coding splits the 16 bits (two bytes) into an RGB value; on the camera this is called RGB 565, which means that the 16 bits are split into 5 bits for red, 6 bits for green and 5 bits for blue. You will note that there is an extra bit for the green data. This interesting point is guided by our animal nature, which programs our eyes to be more sensitive to subtle changes in green, so getting the best color range requires an extra green data bit *. With RGB565 there is a total of 65536 colors, based upon the 16 color bits available per pixel.
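The RGB 565 split can be illustrated with a few lines of bit manipulation. The field layout (red in the upper bits, blue in the lower bits) is the usual RGB565 convention; the order in which the two bytes appear on the bus is an assumption here and should be checked against the sensor configuration.

```python
def unpack_rgb565(word):
    """Split a 16-bit RGB565 value into (red, green, blue) fields."""
    red = (word >> 11) & 0x1F     # 5 bits
    green = (word >> 5) & 0x3F    # 6 bits
    blue = word & 0x1F            # 5 bits
    return red, green, blue


def pack_rgb565(red, green, blue):
    """Combine 5-bit red, 6-bit green and 5-bit blue into one 16-bit value."""
    return ((red & 0x1F) << 11) | ((green & 0x3F) << 5) | (blue & 0x1F)


def from_bytes(first, second):
    """Assumption: the first byte on the bus holds the upper 8 bits."""
    return (first << 8) | second


word = from_bytes(0xF8, 0x1F)
print(unpack_rgb565(word))
```

Here the two bytes 0xF8 and 0x1F decode to full red, no green and full blue (magenta).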


YUV Color Space

The second way of coding pixel data is called YUV (or YCbCr): Y stands for luminance (the intensity of light for each pixel), U/Cb is the blue-difference component of the image and V/Cr is the red-difference component. In YUV, instead of down-scaling the number of bits for each YUV component, the approach is to downscale the resolution of the U/V values. Our eyes are more sensitive to luminance than to color, due to the fact that the eye has more rod cells, responsible for sensing luminance, than cone cells, which sense colors*. There are a number of YUV formats, including YUV 4:4:4, YUV 4:2:2 and YUV 4:2:0. Each format produces a full resolution image for the Y component (each pixel has a Y value) and a downscaled resolution for U/V. In the camera the Y component has a native resolution of 320x240 for QVGA and the U/V resolution is down-scaled along each line (160x240 for QVGA); that is the YUV 4:2:2 format. See Figure 1 for a depiction of how the image is broken into a full resolution Y component and downscaled U/V components. Note that all of the bits are used for each YUV component, but only every other U/V component is kept, which reduces the total image size.



* For more information on this topic see the links at the end of the page





Fig 1 : For two consecutive Y values (black dots) , there is only one set of color components Cr/Cb


The data transmission of the YUV data is realized by sending the U component for even pixels and V component for odd pixels. For a line the transmission looks like the following.




So, two consecutive Y pixels share the same U/V components (Y0 and Y1 share U0V0).

One advantage of such data transmission is that if your processing only needs the grayscale image, you can drop the U/V components to create a grayscale image instead of computing Y from the corresponding RGB value. In the following we will only base our computations on this YUV color space.
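As a software illustration of this transmission scheme, the sketch below splits one line of interleaved YUV 4:2:2 bytes into its Y, U and V samples. The byte order (Y0 U0 Y1 V0 ...) is an assumption made for the example; the actual order depends on how the camera is configured, and the function name is ours.

```python
def split_yuv422_line(line_bytes):
    """Split one line of interleaved YUV 4:2:2 bytes into Y, U and V lists.

    Assumed byte order: Y0 U0 Y1 V0 Y2 U1 Y3 V1 ...
    """
    y = list(line_bytes[0::2])  # every other byte is a luminance sample
    u = list(line_bytes[1::4])  # U is sent with even pixels
    v = list(line_bytes[3::4])  # V is sent with odd pixels
    return y, u, v


line = bytes([16, 128, 32, 130, 48, 90, 64, 200])
y, u, v = split_yuv422_line(line)
print(y)  # [16, 32, 48, 64]
print(u)  # [128, 90]
print(v)  # [130, 200]
```

The grayscale image is simply the `y` list: no arithmetic is needed, which is the advantage described above.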

Interfacing With the Camera

Now that we understand the camera bus, we can capture image information and make it available for processing on the FPGA. As you noticed, the camera pixel bus is synchronous (there is a clock signal), so we could just take the bus data as it is output by the camera and directly use the camera clock to synchronize our computation. This approach is often used when the pixel clock is at a high frequency (for HD images or high frame-rate cameras), but it requires that each operation on a pixel takes only one clock cycle. This means that if the operation takes more than one clock cycle you’ll have to build a processing pipeline as deep as your computation.


Digression on Pipelining

Pipelining is used when you want to apply more than one operation to a given set of data and still be able to accept a new data item on every clock cycle. This technique is often used at the instruction level in processors and GPUs to increase efficiency. Let's take a quick example that computes the following formula.


Y = A*X + B (with A and B being constant)


To compute the value of Y for a given value of X you just have to do one multiplication followed by one addition.



In a fully sequential way, the processing takes two steps. Each time you get a new X value you must apply the two operations to get Y result. This means that a new value of X data can enter the processing pipeline every two steps, otherwise the processing loses data.


If you want to apply the same processing but still be able to compute a new value of Y at each step, and thus process a new X incoming data at each step, you’ll need to apply pipelining, which means that you will process multiple values of X at the same time. A pipeline for this operation would be:




So after the first step there is no Y value computed, but on the second step Y0 is ready, on the third step Y1 is ready, on the fourth step Y2 is ready and so on. This pipeline has a latency of two (it takes two cycles between data entering the pipeline and the corresponding result going out of the pipeline). Pipelining is very efficient for maximizing the total throughput or processing frequency of data. However, pipelining consumes more resources, as you need to have more than one operation being executed at a given time. For example, if your computation takes 9 operations, you’ll need a 9 stage pipeline (9 steps of latency) and 9 computing resources working at the same time.
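The behavior of this two-stage pipeline can be sketched in plain Python. This is only a step-by-step software model, not HDL; the constants A and B and the function name are hypothetical values chosen for the example.

```python
A, B = 3, 5  # hypothetical constants for Y = A*X + B


def run_pipeline(xs):
    """Simulate a two-stage pipeline (multiply, then add) step by step.

    Returns a list of (step, y) pairs showing when each result is ready.
    """
    stage1 = None  # register between the multiplier and the adder
    outputs = []
    # One extra step with no input lets the last product leave the pipeline.
    for step, x in enumerate(list(xs) + [None], start=1):
        # Stage 2 (adder) works on the product registered at the previous step.
        if stage1 is not None:
            outputs.append((step, stage1 + B))
        # Stage 1 (multiplier) works on the new input during the same step.
        stage1 = A * x if x is not None else None
    return outputs


print(run_pipeline([0, 1, 2]))  # [(2, 5), (3, 8), (4, 11)]
```

A new X enters on every step, and each Y comes out two steps after its X went in, which is the latency of two described above.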


The decision on where to apply pipelining is based upon the maximum processing frequency required from the hardware, the resources available in the hardware and, in some cases, the power consumption of the hardware, i.e. the higher the processing clock, the higher the dynamic power consumption.

Back to our application

Using the LOGI Boards, we consider that we have a rather small FPGA (9K logic elements and few DSP blocks) with limited resources, and that the processing frequency is not an issue, since a VGA image at 30 FPS produces a data stream of ~12 Mpixels per second. So, we won’t use the pixel-clock as the clock source for our system, but rather use a 100 MHz system clock for processing, and will consider that we have at most 4 clock cycles to process each pixel (max of ~24 MHz pixel clock => VGA@60 FPS).
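The cycle budget quoted here can be checked with a few lines of Python. The pixel rates are the approximate figures given in the text; the function name is ours.

```python
SYSTEM_CLOCK_HZ = 100_000_000  # 100 MHz system clock


def cycles_per_pixel(pixel_rate_hz, clock_hz=SYSTEM_CLOCK_HZ):
    """Number of whole system-clock cycles available between two pixels."""
    return clock_hz // pixel_rate_hz


print(cycles_per_pixel(12_000_000))  # 8  (VGA at 30 FPS)
print(cycles_per_pixel(24_000_000))  # 4  (VGA at 60 FPS, worst case)
```

Designing for the worst case of 4 cycles per pixel leaves some margin at 30 FPS.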


Here is the component view of the camera interface for the YUV pixel data bus:




The component generates a pixel bus with YUV and synchronization signals from the multiplexed bus of the camera. The new bus is synchronous to the system clock. This means that to grab pixels from the camera and be able to process them, we need to work with two different clock domains, the camera clock domain and the system clock domain. The two clock domains are asynchronous to each other, i.e. there is no guaranteed phase relation between the two clocks. To make the two asynchronous domains work together, and to limit the risk of metastable conditions (see the link below for an explanation and further information on this topic), we need to perform clock domain crossing to make sure that the data coming out of the camera can be processed with the system clock. In that case the simplest and cheapest way to perform clock domain crossing is to use a multi-flop synchronizer circuit.

This synchronizer circuit is made of an input flip-flop clocked in the input clock domain and a set of two flip-flops clocked in the output clock domain.


What is a Flip-flop ?


A flip-flop is basically the component at the base of most digital circuits whose behavior evolves over time. A D flip-flop has an input named D, an output named Q and a time-base called the clock. In terms of time, the input of the flip-flop is the future and the output of the flip-flop is the present. Each time there is a clock tick (when a rising edge appears on the clock input), time evolves a single step and the future becomes the present (Q takes the value of D at the clock-tick).



If you think of a basic operation such as counting, it basically involves adding one to the present value to compute the future value (and so on). A counter circuit can be described as a register of D flip-flops (N bits wide, depending on the maximum count you want to support) whose input is the output value plus one. Additionally a flip-flop can have an enable input, which enables the copy of D to Q only when it is asserted, and a reset input, which sets Q to an initial value.
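The counter described above can be modeled in a few lines of Python. This is a behavioral sketch only (one call to `tick` stands for one rising clock edge); the class and parameter names are ours.

```python
class DFlipFlop:
    """Behavioral model of an N-bit D flip-flop register with enable and reset."""

    def __init__(self, width, reset_value=0):
        self.mask = (1 << width) - 1
        self.reset_value = reset_value
        self.q = reset_value  # the "present"

    def tick(self, d, enable=True, reset=False):
        """One rising clock edge: Q takes the value of D (the "future")."""
        if reset:
            self.q = self.reset_value
        elif enable:
            self.q = d & self.mask  # keep only the N bits of the register
        return self.q


# A 2-bit counter: the D input is the Q output plus one.
counter = DFlipFlop(width=2)
trace = []
for _ in range(6):
    trace.append(counter.q)
    counter.tick(counter.q + 1)
print(trace)  # [0, 1, 2, 3, 0, 1]
```

The counter wraps around after 3 because the register is only 2 bits wide.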


If you want to know more about flip-flop  you can read :


Back to our synchronizer problem: the camera and the FPGA have two different clocks and thus two different clock domains. The problem is that two independent clock domains do not evolve on the same time-base. For a D flip-flop to work, the future (D input) must be stable for a given amount of time before the clock-tick (setup time) and after the clock-tick (hold time). But when the input of a flip-flop does not come from the same clock domain, it’s not possible to guarantee these timing conditions. The synchronizer circuit is required to minimize the risk of registering an unstable future input into the target clock-domain (more on that in



The camera stream


The data from the camera multiplexes the luminance (Y) and chroma (colors UV) pixel data.  Thus, we need to de-multiplex the Y and the UV components of data and generate a pixel bus where each rising-edge of the new pixel-clock sends the luminance and chroma associated to the pixel. This principle is displayed in following diagram.




This architecture is synchronized to the pixel_clock generated by the camera. This means that on each new clock cycle, data is latched into the D flip-flops. Which data signals are latched depends on which enable signals are active. The enable signals are generated by a state machine that evolves on each clock cycle. In practice this specific state machine is implemented as a counter, as there are no transition conditions (a transition happens on each rising edge of the clock).
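The de-multiplexing principle can be sketched as a small Python model in which a counter plays the role of the state machine and plain variables play the role of the enabled flip-flops. This is a simplified software view, not the actual HDL: the byte order (Y0 U0 Y1 V0) and the fact that both pixels are emitted once V has arrived are assumptions made for the example.

```python
def demux_yuv422(byte_stream):
    """Model the de-multiplexer: a 2-bit counter selects which latch is enabled."""
    state = 0               # counter used as the state machine
    y0 = u = y1 = None      # latches (D flip-flops with enable)
    pixels = []
    for byte in byte_stream:        # one byte per camera clock cycle
        if state == 0:
            y0 = byte               # enable the first Y latch
        elif state == 1:
            u = byte                # enable the U latch
        elif state == 2:
            y1 = byte               # enable the second Y latch
        else:
            v = byte                # V arrives last: both pixels are complete
            pixels.append((y0, u, v))
            pixels.append((y1, u, v))
        state = (state + 1) % 4     # unconditional transition on every clock
    return pixels


print(demux_yuv422([16, 128, 32, 130, 48, 90, 64, 200]))
# [(16, 128, 130), (32, 128, 130), (48, 90, 200), (64, 90, 200)]
```

Each pair of consecutive pixels carries the same U/V values, as expected for YUV 4:2:2.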


Finite State Machine


A finite state machine (FSM) is a model for a sequential process. In this model, the behavior is captured by a set of states (the numbered circles in the previous figure) that are connected through transitions (the arrows between states). These transitions can be conditioned, meaning that the transition between two states can only occur if the associated condition holds true. Each state is associated with a set of actions that are maintained as long as the state is active. A state machine is built from three components: state memory, state evolution, and action. The state memory holds the current state of the state machine, while the state evolution computes the future state based on the system inputs and the present state. The actions are computed either from the current state alone (Moore state machine) or from the current state and the system inputs (Mealy state machine). If you want to know more on state-machines you can read :



Fig 3 : Sequence of the camera interface to understand how U/V data are stored to be used for two consecutive Y values




The outputs of this architecture are fed into a single flip-flop synchronizer (one DFF in each clock domain) and the pixel_out_hsync (inverted signal of href), pixel_out_vsync, pixel_out_clock are generated to be synchronous to the system clock.




The output of the camera interface can then be fed in the appropriate filter. In future development we will stick to this bus format nomenclature (pixel_<out/in>_hsync, pixel_<out/in>_vsync, pixel_<out/in>_clock, pixel_<out/in>_data ) so that all of the filters we design can be chained together easily.


Now that we have an interface to the camera, we can start designing the first image filter.  The design of a 2D convolution operator will be detailed in part 2 of this article.  But for now we have left you some useful links which can help better understand the design concepts that are being used in this project. 


Getting Deeper Into the Article Topics


How the eye extracts color information using cones and rods:


More on clock domain crossing and metastability:


More on YUV color space :


The OV7670 datasheet :


More on rolling-shutter vs global shutter:


Download the latest LOGI projects repository and start having a look at the project and associated HDL.


Vision-related components for FPGA (yuv_camera_camera_interface.vh for corresponding code)

Creative Commons License

This work is licensed to ValentF(x) under a Creative Commons Attribution 4.0 International License.

The Problem


Obstacle detection on low cost mobile indoor robots is usually performed using a variety of sensors, namely sonar and infrared sensors. These sensors provide poor information: they can only detect the presence of a reflective surface in the proximity of the sensor and the distance to that surface. While in most cases this is enough to navigate a robot on a crowded floor, it does not help the robot with other tasks, so more sensors have to be added to the robot. This does not allow us to deviate from the long used paradigm of one task = one sensor.


A camera provides rich information that can be processed to extract a large collection of heterogeneous information for the robot to make decisions. A single image makes it possible, for example, to detect colored items, obstacles, people, etc.


One problem that remains with using a camera is that it can be tricked by specific patterns (optical illusions, or homogeneous scene), or changes in the environment (lighting change for example).


Active vision adds a light projector (visible or infrared) to the system. The projector adds information to the scene, which helps the image processing algorithm. An example of this is Microsoft's first version of the Kinect, which used an infrared projector to allow 3D reconstruction of any scene. Recovering depth information (3D or pseudo 3D) in vision can be performed through three distinct methods:

  • Stereo-vision: Two cameras can be used to recover depth information from a scene (like our eyes and brain do)   
  • Active vision: Projecting known information onto a scene makes it possible to extract depth (just like the Kinect or most 3D scanners)
  • Structure From Motion: SFM works with one or several cameras. The 3D information is recovered by capturing images from different points of view over time (Simultaneous Localization And Mapping, SLAM, does that). Our brain also uses SFM. For example, close an eye and you are still able to construct 3D information by moving your head/body or by subtle movements of the eye.


With 3D information about a given scene, it’s fairly easy to detect obstacles, assuming the definition that an obstacle is an object that sticks out of the ground plane (simple definition).

All these techniques are quite hard to implement in software and harder to implement in hardware (FPGA) and require a lot of computing power to be performed in real-time.

A simpler method to detect an obstacle is to reconstruct 1D information (distance to an object) from the camera using a 1D projector, namely a dot projector or a line projector, such as the laser line seen in figure 1. It’s even easier to simply raise an alarm about the presence of an obstacle for a given orientation or defined threshold in the robot frame (radar style information). 2D information (depth and X position of the object) can be extracted by making multiple 1D measurements.

The Method


The following example pictures the basic principle of a 2D method of object detection using a 1D line.

Fig 1 : This picture shows the camera view of a laser line projected on the ground.

The image of the laser line appears shifted when hitting an obstacle.


Fig 2: In the normal case the image of the laser on the ground appears at a given position in the image.



Fig 3: When the laser hits an obstacle its image appears shifted compared to the case without obstacle

The 2D object detection method involves

  • Projecting information onto the scene
  • Detecting the projected information in the scene image


Using a laser line, each column in the captured camera image frame can be used to make an independent depth measurement (1D measurement in each column). A 2D measurement is then obtained by getting an obstacle detection for each column of the image.

Detecting The Line


The laser line in the image has two distinguishable properties:

  • It’s red
  • It’s straight


A naive approach to detecting the laser line would be to detect red in the image and try to identify segments of the line based on this information. The main problem with the red laser in that case is that because of the sensitivity of the camera, highly saturated red can appear white in the image. Another problem is that because of optical distortion of the camera lens, a line will transform into a curve in the image (film a straight line with a wide angle lens like on the GoPro and you clearly see the effect).


One interesting property of the red laser line, is that because of the intensity, it will generate a high gradient (change of light intensity) along the image column.



Fig 4: Grayscale view of the laser line



Fig 5 : Image of gradient in the vertical direction


This means that one way to discriminate the laser in the image is to compute the gradient along the image column, detect the gradient maximum along the column and assume it’s the line. This gives the position of the laser in each column of the image.
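Here is a minimal software sketch of this idea in plain Python. To keep it short it uses a simple central difference as the vertical gradient instead of the Gaussian and Sobel filters used in the FPGA design, so it is only an illustration of the principle; the function name and the tiny test image are ours.

```python
def laser_rows(gray):
    """For each column, return the row where the vertical gradient is strongest.

    gray is a list of rows, each row being a list of pixel intensities.
    The gradient is a central difference between the rows below and above.
    """
    height, width = len(gray), len(gray[0])
    rows = []
    for col in range(width):
        best_row, best_grad = 0, -1
        for row in range(1, height - 1):
            grad = abs(gray[row + 1][col] - gray[row - 1][col])
            if grad > best_grad:  # keep the first strongest gradient
                best_row, best_grad = row, grad
        rows.append(best_row)
    return rows


# 8x3 test image: bright line on row 5 in columns 0 and 1, on row 2 in column 2.
image = [[10, 10, 10] for _ in range(8)]
image[5][0] = image[5][1] = 200
image[2][2] = 200
print(laser_rows(image))  # [4, 4, 1]
```

The reported row is the upper edge of the line (one row above the bright pixels), because that is where the first strong intensity change occurs.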


On a fully calibrated system (camera intrinsics , extrinsics, distortion parameters, stereo-calibration of laser/camera,  etc) the metric distance to the object could be extracted. In our case we assume that the robot navigates on flat ground and that as a consequence, the laser image should always appear at the same position for each column. If the position moves slightly, it means that there is an object protruding from the ground plane. This allows the algorithm to determine that there is an obstacle on the robot’s path.
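The flat-ground assumption leads to a very simple decision rule, sketched below in Python: compare the detected laser position of each column with a reference position recorded on free ground. The tolerance value and the names are hypothetical and would need tuning on the real robot.

```python
def detect_obstacles(measured_rows, reference_rows, tolerance=2):
    """Return one boolean per column: True where the laser line has moved.

    tolerance is a hypothetical threshold (in pixels) absorbing small jitter.
    """
    return [abs(m - r) > tolerance
            for m, r in zip(measured_rows, reference_rows)]


reference = [4, 4, 4]   # laser position recorded on free, flat ground
measured = [4, 5, 1]    # laser position in the current frame
print(detect_obstacles(measured, reference))  # [False, False, True]
```

Only the third column deviates by more than the tolerance, so an obstacle is reported there.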


The Build


The robot base was purchased from dfrobot and the motors are driven by an Arduino motor shield (L298D based). The FPGA is in charge of generating the PWM signals for each motor (and of PID control in a future evolution), interfacing with the camera, the gradient computation and the column max computation. The position of the max for each column is made available to the BeagleBone Black, which reads it using the logi python library and computes the motor PWM duty cycle to be written to the PWM controller register.


A 7.2v NiCd battery powers the motors and a 5V DC/DC regulator powers the BeagleBone Black.

The LOGI-Cam is equipped with a laser line using the high-current output available on the LOGI-Cam board. A 3D printed mount is used to set the orientation of the laser line, and a wide angle lens was mounted on the camera to allow detection of objects at a closer range.


Fig 6: The camera fitted with the laser mount and a wide angle lens


Fig 7: Laser connected to high current output of logi-cam.
Note the bead of hot glue to avoid putting stress on the solder joints.

Also note the wire that shorts the optional resistor slot to get max current (the laser module already has a built-in resistor)


Fig 8: The assembled bot with an angled cardboard support to point the assembly toward the ground



The design for the FPGA is available on our GitHub account. This design is composed of a pixel processing pipeline that grabs a YUV frame from the camera, extracts the Y component, applies a Gaussian filter (blurring to limit the effect of image noise on the gradient), applies a Sobel filter, computes the maximum value of the vertical gradient for each column and stores the position of the maximum in memory. The memory can be accessed from the wishbone bus. The wishbone bus also connects an i2c_master to configure the camera, a PWM controller to generate PWM for the motors, a GPIO block to control the direction of the motors and the laser state (on/off), and a FIFO to grab images from the sensor for debugging purposes. The behavior of the image processing modules (Sobel, Gaussian) will be detailed in an upcoming blog post.



The system works fairly well but is very sensitive to lighting conditions and to the acceleration of the robot. One side effect of the chosen method is that the sensor also works as a cliff detector! When there is a cliff, the laser disappears from the camera field of view and a random gradient is detected as the max. This gradient has little chance of being where the laser is expected, and as a consequence an obstacle is reported, making the robot stop. The resulting robot system is also pretty heavy for the motor size, and inertia causes the robot to stop with a bit of delay. The video was shot with the algorithm running at 15 FPS (it now runs fine at 30 FPS) and with debugging through a terminal window running over wifi, which prevents the control loop from running as fast as possible.


Future Improvements

The current method is quite sensitive to the lighting of the scene, the reflectivity of the scene and the color of the scene (detecting the red laser line on a red floor won't work well). To improve the reliability we can work in a part of the spectrum where there is less interference with the projector. Using an infrared laser line and a bandpass filter on the camera, we can isolate the system from natural perturbations. One problem with this method is that the images from the camera cannot be used for other tasks.

Another problem can arise with neon lighting, which creates variations in lighting (subtle for the eye, not for a camera). Moreover, since the camera uses a rolling shutter (the image lines are not all captured at the same time but in sequence), the change in lighting creates a change in luminosity from one image line to the next, which in turn creates a perfectly Y-axis oriented gradient that interferes with the gradient created by the laser. The camera has 50Hz rejection but it’s not working as expected.

Another improvement could be to extend the method to the 3D detection scenario using a 2D projector (like on the Kinect). This would require detecting dots (using the Harris or FAST detector algorithms), and a 3D point cloud could then be computed by the processor.

For those who don’t own a Logi Board

The technique described in this article is generic and can also be performed in software using OpenCV for the vision algorithms. In comparison, the strength of an FPGA is that it can perform the vision operations at a much faster pace and with lower latency (the time between an event occurring in the scene and its detection) than a CPU. The FPGA can also simultaneously generate the real-time signals for the motors (PWM). With a Raspberry Pi + Pi-camera or BeagleBone Black + USB camera you can expect to reach ~10 FPS in QVGA with an unpredictable latency.


Getting Deeper Into the Theory


Pinhole camera model :

Understand the camera model used in most algorithms.

Multiple View Geometry Richard Hartley and Andrew Zisserman,Cambridge University Press, March 2004

Know everything about camera calibration, and geometry involved in image formation

Hardware description languages (HDLs) are a category of programming languages that target digital hardware design. These languages provide special features to design sequential logic (the system evolves over time, as represented by a clock) or combinational logic (the system output is a direct function of its input). While these languages have proved to be efficient for designing hardware, they often lack tool support (editors are far behind what you can get to edit C/Java/etc.) and the syntax can be hard to master. Moreover, these languages can generate sub-optimal or faulty hardware, which can be very difficult to debug.


Over the past year some alternative languages have arisen to address the main issues of the more popular HDLs (VHDL/Verilog). These new languages can be classified into two categories as follows.


Categories of Hardware description Languages (HDLs)


  1. HLS (High Level Synthesis) : HLS tools try to take an existing programming language as an input and generate the corresponding HDL (hardware description language). Some of these tools are quite popular in the EDA industry, such as CatapultC from Mentor Graphics or Matlab HDL coder, but are very expensive. Xilinx recently integrated the support of SystemC and C in their Vivado toolchain, but it only supports high-end FPGAs.
  2. Alternative syntax : Some tools propose an alternative syntax to VHDL or Verilog. The alternative syntax approach keeps control over the generated hardware, but gives the advantage of an easier to master syntax and sometimes of easier debugging.


While HLS seems attractive, there is a good chance it will generate sub-optimal hardware if the designer does not write the “software” with hardware in mind. The approach is a bit magical, as you can take existing C/Matlab software and generate hardware in a few clicks.

HLS is very practical to reduce time to first prototype (especially with Matlab) and for people with little (or no) HDL knowledge to produce a functional hardware design.  However,  HLS tools are not good for the users who want to learn digital design and the good HLS tools are usually very expensive (a CatapultC license can cost more than 100k$ [link], and Matlab HDL coder starts at 10k$ [link]).

Over the past year some open-source, free to use alternatives to HDL have emerged. These tools do not pretend to create hardware from behavioral description, but propose to smoothen the learning curve for digital logic design by relying on easier to master syntax and feature rich tools.

In the following we will review two of these alternative languages (myHDL, PSHDL). To test the languages, we will use them to design and debug a simple PWM module. We chose these two languages based on their distinct properties and community adoption. Other tools such as MiGen (python based syntax) will not be covered here, but use the same kind of design flow.

Category 1 - myHDL


myHDL uses python to design and test hardware components. A hardware component is designed as a python function whose arguments are the inputs and outputs of the component. The component can then describe sub-functions and designate them as combinational or sequential logic using a decorator (some text prefixed by the @ symbol that defines properties for a function/method). Once the design is complete, it can be exported to HDL (VHDL or Verilog) using a small python snippet. The design can also be tested/simulated in the python environment and generate waveform traces.


Installation of myHDL

Installing myHDL is straightforward and can be done with a single line on a Linux system (not tested with Windows).

sudo pip install myhdl

Design of the PWM module in myHDL


The pwm component is pretty straightforward. It has two inputs, period and t_on, that designate respectively the period of the pwm signal and the number of clock cycles during which the pwm output stays high within each period. The module has two outputs: pwm (connected to pwm_out in the test-bench), which is the pwm signal, and period_end, which is asserted at the end of a period and de-asserted otherwise. Here is the corresponding myHDL code.

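from myhdl import *


def pwm_module(period, t_on, pwm, period_end, clk):

    # 16-bit counter holding the position inside the current period
    count = Signal(intbv(0)[16:])

    # sequential logic, evaluated on each rising edge of clk
    @always(clk.posedge)
    def logic():
        if count == period:
            count.next = 0
            period_end.next = 1
        else:
            count.next = count + 1
            period_end.next = 0

        if count > t_on:
            pwm.next = 0
        else:
            pwm.next = 1

    return logic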


The module is evaluated/simulated using the following test-bench.


def TestBench():
    clk = Signal(bool(0))
    period = Signal(intbv(200)[16:])
    t_on = Signal(intbv(100)[16:])
    pwm_out = Signal(bool(0))
    period_end = Signal(bool(0))

    pwm_inst = pwm_module(period, t_on, pwm_out, period_end, clk)

    # clock generator: toggles clk at every time step (step value assumed)
    @always(delay(1))
    def tb_clkgen():
        clk.next = not clk

    # stimulus: set the inputs, then print the pwm output for 400 clock cycles
    @instance
    def tb_stim():
        period.next = 200
        t_on.next = 100
        yield delay(2)
        for ii in range(400):
            yield clk.negedge
            print("%3d  %s" % (now(), bin(pwm_out, 1)))
        raise StopSimulation

    return tb_clkgen, tb_stim, pwm_inst


if __name__ == '__main__':
    Simulation(TestBench()).run()



The Corresponding HDL code is generated by changing this line of code:

pwm_inst = pwm_module(period, t_on, pwm_out, period_end, clk)

into this line of code:

pwm_inst = toVHDL(pwm_module, period, t_on, pwm_out, period_end, clk)

and here is the resulting VHDL:




library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;
use std.textio.all;

use work.pck_myhdl_081.all;

entity pwm_module is
    port (
        period: in unsigned(15 downto 0);
        t_on: in unsigned(15 downto 0);
        pwm: out std_logic;
        period_end: out std_logic;
        clk: in std_logic
    );
end entity pwm_module;

architecture MyHDL of pwm_module is

signal count: unsigned(15 downto 0);

begin

PWM_MODULE_LOGIC: process (clk) is
begin
    if rising_edge(clk) then
        if (count = period) then
                count <= to_unsigned(0, 16);
                period_end <= '1';
        else
                count <= (count + 1);
                period_end <= '0';
        end if;
        if (count > t_on) then
                pwm <= '0';
        else
                pwm <= '1';
        end if;
    end if;
end process PWM_MODULE_LOGIC;

end architecture MyHDL;


myHDL lines to VHDL lines : 20 -> 42 = 0.47

Pros and Cons of using myHDL or similar languages

Pros :

  • Python syntax is clean and forces the user to structure the code appropriately
  • The syntax elements introduced for hardware design are relevant and do not add much syntax (the next attribute is a great idea and reflects the hardware behavior)
  • The quality of the generated code is great !
  • The simulation has a great potential, you can even generate a waveform
  • One can take advantage of the extensive collection of Python packages to create powerful simulations
  • Small designs can fit in one file.


Cons :

  • The use of decorators is not great for readability (and I am not a decorator fan …)
  • One really needs to understand digital logic and hardware design before starting a myHDL module
  • One really needs to master the basics of Python before getting started
  • Python will raise errors for syntax errors, but the simulation does not raise warnings or errors if there is a design error (incomplete if/case clause or other things that a synthesizer would detect)
  • There is no (that I know of) myHDL specific editor or environment that would ease the beginner experience.

Category 2 - The Custom Syntax Approach - PSHDL


PSHDL (plain and simple hardware description language) is an HDL with a custom syntax that takes elements inherited from C/SystemC and adds a custom set of keywords and coding elements to represent a hardware module. A module has no arguments but declares some internal variables as “in” or “out”. The nice and clever thing about PSHDL is that only one keyword is used to represent the sequential part of the module. Instead of declaring a computational unit as sequential or combinational logic, the keyword “register” is used to identify the variables/signals that are to be updated in a synchronous process. This is particularly relevant because every HDL design will be translated into LUTs, MUXs, D-latches or registers.

The PSHDL syntax is fairly easy to understand and there is not much syntactical noise. The best thing about PSHDL is that it runs in the web browser! Just like the mBED initiative (for ARM micro-controllers), the website proposes to create a workspace in which you can edit, compile (to VHDL) and debug your design, all on-line. This means no tools to install (you still need to install the ISE/Quartus tools to synthesize) and OS independence (no problem with running it under Linux/Windows). The community is rather small up until now, but the tool deserves a try!

Creating the PWM module using PSHDL


Below is the PSHDL code for the pwm module:


module de.tuhh.ict.pwmModule {
  @clock in bit clk;
  in uint<16> period, t_on;
  out bit pwm_out, period_end;
  register(clock=clk) uint<16> counter;

  if(counter > period){
    counter = 0;
    period_end = 1;
  } else {
    counter = counter + 1;
    period_end = 0;
  }

  if(counter > t_on){
    pwm_out = 0;
  } else {
    pwm_out = 1;
  }
}




Below is the VHDL code generated by PSHDL


library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use work.Casts.ALL;
use work.ShiftOps.ALL;
use work.Types.ALL;

entity de_tuhh_ict_pwmModule is
   port (
       rst : in std_logic;
       clk : in std_logic;
       period : in unsigned(15 downto 0);
       t_on : in unsigned(15 downto 0);
       pwm_out : out std_logic;
       period_end : out std_logic
   );
end;

architecture pshdlGenerated of de_tuhh_ict_pwmModule is
   signal counter : unsigned(15 downto 0);
begin
   process(counter, period, t_on)
   begin
       pwm_out <= '0';
       period_end <= '0';
       if (counter > period) then
           period_end <= '1';
       else
           period_end <= '0';
       end if;
       if (counter > t_on) then
           pwm_out <= '0';
       else
           pwm_out <= '1';
       end if;
   end process;

   process(clk)
   begin
       if RISING_EDGE(clk) then
           if rst = '1' then
               counter <= (others => '0');
           else
               if (counter > period) then
                   counter <= (others => '0');
               else
                   counter <= (counter + TO_UNSIGNED(1, 16));
               end if;
           end if;
       end if;
   end process;
end;


PSHDL lines to VHDL lines : 21 -> 51 = 0.41

Pros and cons

Pros :

  • Online tool ! No installation and is OS agnostic
  • Easy to use syntax for people who know C/C++
  • Clever use of the “register” keyword to denote sequential assignments
  • Outputs nicely formatted VHDL
  • Generated VHDL interfaces easily with existing VHDL code


Cons :

  • Online tool - some people may complain that it's not great for privacy and intellectual property
  • Some syntax elements like the package prefix for a module create lengthy text
  • A “register” can be associated to a clock and reset by passing arguments to it. This is misleading for C/C++ programmers as it creates a type with an argument (and not a template like the other types) which is not valid C/C++ code.
  • Simulation does not seem to be fully functional in the online tool
  • The community is small (but can grow if you give it a try)



These two alternative HDL languages/tools do a great job of easing the process of writing and debugging HDL. They rely on different principles: myHDL defines combinational/sequential functions, while PSHDL defines sequential signals for the sequential behavior. This allows you to pick what works best for you! The main drawback of both of these tools (and of other alternative languages) is that they are not directly supported in the vendor tools (Quartus/ISE) and that they are not recognized as standard hardware design languages in the professional world. This means that you will still have to learn VHDL/Verilog at some point if this is part of your career plan.


There is no indication that FPGA vendors are willing to open the access to their synthesis tools for third parties, so for any VHDL/Verilog alternatives you will still have to install and use their tools to synthesize and create the binary files to configure the FPGA.


One other language that tends to emerge as a standard for hardware (and system) design is SystemC (at least with Xilinx). While myHDL does not rely on any of the SystemC concepts, PSHDL has the advantage of being (to some extent) C/C++ based.


To get more people to use FPGAs, a wide diversity of languages and tools needs to be on offer. Consider the diversity of languages available to program microcontrollers. Some years ago you had to use C or assembler to design embedded software, but now you can use C/C++, Arduino (C++ based), JavaScript, Python and more. We need the same kind of competition between languages for HDL, as each new language may attract more users and create new uses for FPGAs.


What features do you foresee being needed for a killer hardware description language? 

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.


In a previous blog post ValentF(x) gave an explanation of what FPGAs (field programmable gate arrays) are and how they are a very valuable resource when designing electronics systems.  The article went on to describe the major differences in the way FPGAs operate from CPU/MCU technology.  Finally, it was highlighted that FPGAs, especially when used in conjunction with CPU technology, are a powerful tool with both having their own respective strong points in how they process data.  


This blog article focuses on how a user should begin to look at using an FPGA in conjunction with a CPU in a co-processing system. The user will better understand how the system is designed to handle processing multiple tasks, scheduling, mapping of the processing tasks and the intercommunication between the LOGI FPGA boards and the host CPU or MCU.


This article will use examples from the LOGI Face project, an open source animatronics robot project, as the basis for discussing the co-processing methodologies. We will be using real examples from the LOGI projects. Additionally, we will refer the user to the LOGI Face Lite project, a more basic version of LOGI Face that the user can fully replicate with 3D printable parts and off-the-shelf components. The LOGI Face Lite wiki page contains instructions to build and run the algorithms in the project.

What is Hardware/Software Co-design ?

Co-design consists of designing an electronics system as a mixture of software and hardware components. Software components usually run on processors such as a CPU, DSP or GPU, whereas hardware components run on an FPGA or a dedicated ASIC (application specific integrated circuit). This kind of design method is used to take advantage of the inherent parallelism between the tasks of the application and to ease re-use between applications.


Steps for Designing a Hardware/Software Co-processing System


  • Partition the application into the hardware and software components
  • Map the software components to the CPU resources
  • Map the needed custom hardware components to the FPGA resources
  • Schedule the software components
  • Manage the communications between the software and hardware components


These steps can either be performed by co-design tools or by hand, based on the knowledge of the designer. In the following we will describe how the LOGI Pi can be used to perform such a co-design to run real-time control oriented applications or high performance applications with the Raspberry Pi.

Communication between the LOGI Pi and Raspberry Pi

A critical requirement of a co-design system is the method of communication between the FPGA and CPU processing units of the platform. The processing units in this case are the LOGI FPGA and the Raspberry Pi. The LOGI projects use the wishbone bus to ensure fast, reliable and expandable communication between the hardware and software components. The Raspberry Pi does not provide a wishbone bus on its expansion connector, so it was necessary to use the SPI port and to design a hardware “wrapper” component in the FPGA that transforms the SPI serial bus into a 16 bit wishbone master component. The use of this bus allows users to take advantage of the extensive repository of open-source HDL components hosted on and other shared HDL locations.

To handle the communication on the SPI bus, each transaction is composed of the following steps.

1) Set slave select low.
2) Send a 16 bit command word with bits 15 to 2 being the address of the access, bit 1 to indicate burst mode (1), and bit 0 to indicate read (1) or write (0) (see figure).
3) Send/receive 16 bit words to/from the address set in the command word. If burst mode is set, the address will be increased on each subsequent access until the chip select line is set to high (end of transaction).
4) Set slave select high.
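
The command word layout described above can be sketched in Python as follows. This is an illustration only, not the LOGI C API or Python library; the function names `make_command`, `parse_command` and `command_bytes` are hypothetical, and sending the most significant byte first is an assumption that the text does not state.

```python
def make_command(address, read, burst=False):
    """Pack a 16 bit command word.

    bits 15..2 : address of the access (14 bits)
    bit  1     : burst mode (1 = address auto-increments)
    bit  0     : read (1) or write (0)
    """
    if not 0 <= address < (1 << 14):
        raise ValueError("address must fit in 14 bits")
    return (address << 2) | (int(bool(burst)) << 1) | int(bool(read))


def parse_command(word):
    """Unpack a command word into (address, read, burst)."""
    return (word >> 2) & 0x3FFF, bool(word & 0x1), bool(word & 0x2)


def command_bytes(address, read, burst=False):
    """Command word as two bytes, MSB first (byte order is an assumption)."""
    return make_command(address, read, burst).to_bytes(2, "big")


if __name__ == "__main__":
    word = make_command(0x0010, read=True, burst=True)
    print(hex(word), parse_command(word))
```

Because the address only has 14 bits, a single wrapper can address at most 16384 16 bit locations on the wishbone side.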

Such transactions allow users to take advantage of the integrated SPI controller of the Raspberry Pi, which uses a 4096 byte fifo. This access format permits the following useful bandwidth to be reached (the 2 bits of synchro are transfer overhead on the SPI bus):

  • For a single 16 bit access : (16 bit data)/(2 bit synchro + 16 bit command + 16 bit data) => 47% of the theoretical bandwidth.
  • For a 4094 byte access : (2047 * (16 bit data))/(2 bit synchro + 16 bit command + 2047 * (16 bit data)) => 99.9% of the theoretical bandwidth.
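
The two ratios above can be reproduced with a few lines of Python. This is a minimal sketch that simply evaluates the formula as written (payload bits divided by synchro, command and payload bits); the function name `spi_efficiency` is hypothetical.

```python
def spi_efficiency(n_words, sync_bits=2, command_bits=16, word_bits=16):
    """Fraction of transferred bits that carry payload data."""
    payload = n_words * word_bits
    return payload / (sync_bits + command_bits + payload)


if __name__ == "__main__":
    # One 16 bit register access, then a full 4094 byte burst (2047 words).
    for words in (1, 2047):
        print(words, "word(s): %.2f%%" % (100 * spi_efficiency(words)))
```

With these numbers a single word access evaluates to 16/34 and a 2047 word burst to 32752/32770, which is why burst transfers are preferred for moving buffers.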


This means that for most control based applications (writing/reading registers), we get half of the theoretical bandwidth, but for data based applications, such as writing and reading to buffers or memory, the performance is 99% of the theoretical bandwidth. It could be argued that getting rid of the wishbone interface and replacing it with an application specific data communication protocol (formatted packet of data) on the SPI bus could give the maximum bandwidth, but this would break the generic approach that is proposed here. The communication is abstracted using a dedicated C API that provides memory read and write functions.  ValentF(x) also provides a library of hardware components (VHDL) that the user can integrate into designs (servo controller, pwm controller, fifo controller, pid controller …).

Communication between the LOGI Bone and BeagleBone

The BeagleBone exposes an external memory bus (called GPMC, for General-Purpose Memory Controller) on its P8 and P9 expansion connectors. This memory bus provides 16-bit multiplexed address/data, 3 chip selects, read, write, clock, address latch and high-byte/low-byte signals. The bus behavior is configured through the device tree on the Linux system as a synchronous bus with a 50 MHz clock. This bus is theoretically capable of achieving 80 MB/s, but current settings limit the bus speed to a maximum of 20 MB/s read and 27 MB/s write. Higher speeds (50 MB/s) can be achieved by enabling burst access (which requires re-compiling the kernel), but this breaks support for some of the IPs (mainly wishbone_fifo). Even higher speeds were measured by switching the bus to asynchronous mode and disabling DMA, but the data transfers would then increase the CPU load considerably.

On the FPGA side, ValentF(x) provides a wishbone wrapper that transforms this bus protocol into a wishbone master compatible with the LOGI drivers. On the Linux system side a kernel module is loaded and is in charge of triggering DMA transfers for each user request. The driver exposes a LOGI Bone_mem char device in the “/dev” directory that can be accessed through open/read/write/close functions in C or directly using dd from the command line.
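
Since the driver is used through plain open/read/write/close calls, access from Python could look roughly like the sketch below. Everything specific here is an assumption and not the documented driver interface: the helper names `write_words` and `read_words` are hypothetical, the device path is passed in by the caller, the file offset is assumed to select the target address in bytes, and words are assumed to be 16 bit little-endian.

```python
import os
import struct


def write_words(dev_path, offset, words):
    """Write a list of 16 bit words starting at a byte offset (assumed = address)."""
    data = struct.pack("<%dH" % len(words), *words)
    fd = os.open(dev_path, os.O_RDWR)
    try:
        os.lseek(fd, offset, os.SEEK_SET)
        os.write(fd, data)
    finally:
        os.close(fd)


def read_words(dev_path, offset, count):
    """Read up to `count` 16 bit words starting at a byte offset."""
    fd = os.open(dev_path, os.O_RDWR)
    try:
        os.lseek(fd, offset, os.SEEK_SET)
        data = os.read(fd, 2 * count)
    finally:
        os.close(fd)
    data = data[: len(data) // 2 * 2]  # drop a trailing odd byte, if any
    return list(struct.unpack("<%dH" % (len(data) // 2), data))
```

The same helpers work on an ordinary file, which makes it possible to try out host-side code without the board attached.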

This communication is also abstracted using a dedicated C API that provides memory read/write functions. This C API standardizes function accesses for the LOGI Bone and LOGI Pi thus enabling code for the LOGI Bone to be ported to the LOGI Pi with no modification.

Abstracting the communication layer using Python

Because the Raspberry Pi platform is targeted toward education, it was decided to offer the option to abstract the communication over the SPI bus using a Python library that provides easy access function calls to the LOGI Pi and LOGI Bone platforms. The communication package also comes with a Hardware Abstraction Library (HAL) that provides Python support for most of the hardware modules of the LOGI hardware library.  LOGI HAL, which is part of the LOGI Stack, gives easy access to the wishbone hardware modules by providing direct read and write access commands to the modules.  The HAL drivers will be extended as the module base grows.

A Basic Example of Hardware/Software Co-design with LOGI Face

LOGI Face is a demonstration based on the LOGI Pi platform that acts as a telepresence animatronic device. The LOGI Face demo includes software and hardware functionality using the Raspberry Pi and the LOGI Pi FPGA in a co-design architecture. 

LOGI Face Software

The software consists of a VOIP (voice over internet protocol) client, a text to voice synthesizer library and LOGI Tools, which consist of C SPI drivers and Python language wrappers that give easy and direct communication to the wishbone devices on the FPGA. Using a VOIP client allows communication to and from LOGI Face from any internet connected VOIP client, so anyone on the internet can send commands and data to LOGI Face, which are then communicated to the FPGA to control the hardware components. The software parses the commands and data, and tagged subsets of data are then synthesized to speech using the espeak text to voice library. Users can also use the linphone VOIP client to communicate bi-directionally by voice through LOGI Face. The remote voice is played on the speaker installed in LOGI Face, and the local user can then speak back to the remote user using the microphone installed in LOGI Face.

LOGI Face Hardware

The FPGA hardware side implementation consists of a SPI to wishbone wrapper and wishbone hardware modules including servos (mouth and eyebrows), RGB LEDs (hair color), 8x8 LED matrix (eyes) and SPI ADC drivers. The wishbone wrapper acts as glue logic that converts the SPI data to the wishbone protocol. The servos are used to emulate emotion by controlling the mouth, which smiles or frowns, and the eyebrows are likewise used to show emotions. The tasks for the application are depicted in the following diagram.


LOGI Face Tasks

The LOGI Face application tasks are partitioned on the LOGI Pi platform with software components running on the Raspberry Pi and hardware components on the LOGI Pi. The choice of software components was made to take advantage of existing software libraries including the espeak text to speech engine and linphone SIP client. The hardware components are time critical tasks including the wishbone wrapper, servo drivers, LED matrix controller, SPI ADC controller and PWM controller.

Further work on this co-processing system could include optimizing CPU performance by moving parts of the espeak TTS (text to speech) engine and other software tasks to hardware in the FPGA. Moving software to the FPGA is a good example that showcases the flexibility and power of using an FPGA with a CPU.

The final co-processing tasks of the LOGI Face application can be seen in the following diagram.


LOGI Face Lite


LOGI Face Lite is a simplified version of the above mentioned LOGI Face project. The LOGI Face Lite project was created to allow direct access to the main hardware components on LOGI Face. LOGI Face Lite is intended to allow users to quickly build and replicate the basic software and hardware co-processing functions including servo, SPI ADC, PWM RGB LED and 8x8 Matrix LEDs. Each component has an HDL hardware implementation on the FPGA and function API call access from the Raspberry Pi. We hope that this gives users a feel for how they might go about designing a project using the Raspberry Pi or BeagleBone and the LOGI FPGA boards.

Note that the lite version has removed the linphone VOIP client and text to speech functionality to give users a more direct interface to the hardware components using a simple Python script. We hope that this will make it easier to understand and begin working with the components, and that users will move to the full LOGI Face project with all of its features when they are ready.

Diagram of wiring and functions

3D model of frame and components

Assembled LOGI Face Lite

FPGA Control

  • 2x Servos to control the eyebrows - mad, happy, angry, surprised, etc.
  • 2x Servos to control the mouth - smile, frown, etc.
  • 1x RGB LEDs which control the hair color to correspond to mood, sounds, etc.
  • 2 x 8x8 LED matrices which display animated eyes - blink, look up/down or side to side, etc.
  • SPI ADC connected to a microphone to collect ambient sounds, which are used to dynamically add responses to LOGI Face


Software Control

Each of the FPGA controllers can send data to and receive data from the Raspberry Pi. A basic example Python program is supplied which shows how to directly access the FPGA hardware components from the Raspberry Pi.

Build the HDL using the Skeleton Editor

As an exercise, users can use the LOGI Skeleton Editor to generate the LOGI Face Lite hardware project, which can then be synthesized and loaded into the FPGA. A JSON file can be downloaded from the wiki and then imported into the Skeleton Editor, which will then configure the HDL project for the user. The user can then synthesize the HDL generated by the Skeleton Editor and generate a bitstream with Xilinx ISE. Alternatively, we supply a pre-built bitstream to configure the FPGA.

Re-create the Project

We encourage users to go to the LOGI Face Lite ValentF(x) wiki page for a walk through on how to build the mechanical assembly with a 3D printable frame, a list of required parts and instructions to configure the Skeleton project, build the hardware and finally run the software.

You can also jump to any of these resources for the project


The LOGI Pi and LOGI Bone were designed to develop co-designed applications in a low cost tightly coupled FPGA/processor package. On the Raspberry Pi or BeagleBone the ARM processor has plenty of computing power while the Spartan 6 LX9 FPGA provides enough logic for many types of applications that would not otherwise be available with a standalone processor.


An FPGA and processor platform allows users to progressively migrate pieces of an application to hardware to gain performance, while an FPGA only platform can require a lot of work to accomplish simple tasks that processors are very good at. Using languages such as Python on the Raspberry Pi or BeagleBone enables users to quickly connect to a variety of libraries or web services, and the FPGA can act as a proxy to sensors and actuators, taking care of all low-level real-time signal generation and processing. The LOGI Face project can easily be augmented with functions such as broadcasting the weather or reading tweets, emails or text messages by using the many available Python libraries. The hardware architecture can be extended to provide more actuators or perform more advanced processing on the sound input such as FFT, knock detection and other interesting applications.


We hope to hear from you about what kind of projects you would like to see and/or how we might improve our current projects.



Creative Commons License

This work is licensed to ValentF(x) under a Creative Commons Attribution 4.0 International License.