Skip navigation



I am Anton.About 3 years ago I discovered the world of embedded programming.

First it was simple Arduino, but  after few  month I realized, that Arduino is too weak for my ambitions:D

I grabed  DE1 FPGA from  Terasic board, and was very impressed of what can you do with an FPGA:)


Time goes by, and now I want to share my experience and help other  beginners to join the FPGA world.

From time to time(depending on my success) I am producing interesting and detailed FPGA tutorials(not as full time job, but more like hobby).

Here are some of my videos (more on my youtube channel).

Please take a look and  leave a feedback:)



English it not my native language, hope my pronunciation is not very terrible:)


This blog is part 2 of a 3 part series of implementing a gradient filter on an FPGA.  If you have not already read part2 see the link below to get up to speed before reading this blog.  Additionally the user can catch some of our previous blog posts, linked below.

Part1 of this blog series

Other FPGA blogs by ValentF(x)


Gradient filters for image processing are kernel based operations that process an array of pixels from an input image to generate a single pixel in an output image. Most gradient filter algorithms use a 3x3 pixel window but some uses a 2x2 pixel window. In this posts we will work on a Sobel based operator to implement the gradient filter. The Sobel kernel is designed to extract the gradient in an image, either over the U axis (along columns) or V axis (along horizontal lines) and the two U/V oriented filters can be combined to extract the gradient direction or the U/V gradient intensity.




Figure 1 : logo and its gradient intensity, extracted using U/V Sobel filters


Figure 2 : Computing gradient intensity for a small image

Gradient filter is a step for many vision processing tasks like  : edge enhancement, corner extraction, edge detection ...

Problem analysis

In this part of the article we will analyze the Sobel filter for the V direction (rows) and then generalize to the U direction (columns).


The Sobel filter is a 2D spatial high-pass FIR filter (Sobel operator - Wikipedia, the free encyclopedia).  FIR Filter refers to Finite Impulse Response.  This class of filter computes the filter output based-on the input history, as opposed to an Infinite Impulse Response Filter (IIR) that computes the output based on input history and output history.

The filter kernel is composed of  the following:



This means that to generate a single pixel in the output image, we need to access 9 pixels in the input image, multiply them by 9 values and add the partial results. If Some kernel components are 0, only 6 multiplication and 5 addition operations are needed to be performed. This operation of multiplying the kernel values with values from the input to generate a single output, called a convolution and is the foundation of a wide range of kernel based operations.


Figure 2 : Convolution operation for a 3x3 kernel


To sum-up, the convolution operations requires :

  • To be able to generate a 3x3 pixel window from the input image for every U,V positions in this image
  • To perform 9 multiplications and 8 additions for every pixel (can be optimized depending on the kernel)


In a typical software implementation, the program has direct access to the complete image, which makes accessing the data a memory addressing problem to generate the 3x3 window. Based on our previous post, Gradient Filter implementation on an FPGA - Part 1 Interfacing an FPGA with a camera we only have access to a single pixel of the image at a given time so the window problem will require to manage a memory that stores a number of pixels of the pixel stream.


Hardware implementation


To develop our hardware Sobel filter, we will divide the architecture into two main modules, a 3x3 block module and the arithmetic module.

3x3 Block

As seen in the previous post, the camera module does not output a 2D stream of pixels, but rather a 1D stream of pixels with an additional signal indicating when to increment the second dimension (hsync). To design this component we first need to understand what is the smallest amount of data we need to store in order to do the filtering. Designing in an FPGA is, most of the time, a matter of designing the optimal architecture so every memory bit counts. A quick study of the problem shows that the minimum amount of data to store is two lines of the image plus 3 pixels to have a sliding 3x3 window in the image.



Figure 3 : Position of the sliding windows based on the hsync and pixel_clock signals.

This memory management architecture can simply be performed in hardware. The 3x3 Block works on a 2D register of 3x3 size in each position containing a pixel (8-bit or 9-bit for signed pixels).  The steps for this procedure as are follows.

  • Store pixel at index 2, 3 and index 3, 3 respectively in memory 1 and memory 2
  • Shift columns of the block register
  • Take the incoming pixel, store it in at index 3, 3
  • Grab one pixel in memory 2  and store it a index 2, 3
  • Grab one pixel in memory 1 and store it a index 1, 3


The address management of memory location 1 and 2 is performed based on the pixel count and line count extracted from the synchronization signals pixel_clock and hsync. The sequence of operations described previously can all be executed in parallel (screams to use an FPGA).  This first version of the 3x3 block requires very little work, and works fine at the pixel_clock frequency, but consumes too many memory units. Most of the time, block RAM (BRAM) in an FPGA are atomic so the synthesizer cannot map multiple memories described in HDL to a single block RAM in the FPGA (let say you describe a 256 Byte memory and a 64 Byte memory, they cannot be mapped to a single 2kByte memory).

Info:  In the Spartan3/6, block RAMs are 18Kbits with a configurable bus-width, these block RAMs can be split into two 9Kbits on the Spartan-6.

In our case, memory1 and memory2 cannot be mapped to the same block RAM and will use two block RAMs with lot of their spaces being unused (one line in VGA is 640 pixels and a block RAM can contain 1K pixels or 2K pixels). Where the addressing of the two memories being very similar (exactly similar in fact), one trick is to use a 16-bit wide memory and store pixels of memory1 in the MSBs of the data and pixels of memory2 in the LSBs of the data.

For the actual code, have at look at

Sobel arithmetic


Once we have the block of pixels for every input pixel, the pixel window needs to be convoluted with the Sobel kernel. Most elements of the kernel being zeros, this multiplication requires 6 multiplications. Small FPGAs have very few multipliers and all arithmetic optimization will greatly help. In the case of the Sobel filter, the optimization comes from the fact that all multiplications are ones and twos, so only shifting bits are required!  For a general implementation of the convolution, the convolution would require at least one multiplier (DSP block in the FPGA) and potentially up to 9 multipliers.   Considering that a Spartan-6 LX9 FPGA is composed of 16 DSP blocks, a non optimized implementation of a convolution filter can easily consume all DSP blocks of the FPGA.

Once we have all the multiplications processed, we still need to add all the products together which require 5 additions. These additions can be performed in a lengthy adder with 6 operands. This kind of addition will greatly limit the frequency of the system because propagating the signal through a 5 stage adder takes more time than a single adder (more than 5 times slower due to routing). This is where pipelining (see: Gradient Filter implementation on an FPGA - Part 1 Interfacing an FPGA with a camera for more information on using pipelines in an FPGA) comes to help. The goal of pipelining is to only pay the propagation cost of a single adder on each clock cycle, rather than 5. Such a structure is called an adder-tree. For every clock cycle we only add two operands of the 5 stage adder and store the output into a register. With such structure the longest propagation time is reduced to a single adder, but the latency of the addition is 4 clock cycles.


Figure 4 : Optimized Sobel convolution computation with pipelined adder-tree

This kind of implementation is very frequency efficient and allows to target very high clock frequencies in the system. The main drawback is that it uses more registers, hence more area of the FPGA.

Now that we have the architecture we also need to configure the width of the data-path. One possible way for doing this is to perform all operations on 32-bits or 16-bits to avoid overflow. The better solution is to compute the actual length of the data along it’s path and only use the useful bits and hence limit the resource use of the system.

To Sum it all up - Steps Required:

  • Input is a 8-bit pixels
  • Pixels are multiplied by positive and negative values so needs to be coded on 9-bits at the arithmetic module input (signed binary arithmetic)
  • Pixels are multiplied by 1, 2, 1, -1, -2, -1 so  are respectively shifted by 0, 1, 0, 0, 1, 0, which gives us for each partial product 9-bit, 10-bit, 9-bit, 9-bit, 10-bit, 9-bit
  • These values are added - each addition can add up to 1-bit to the wider word : 9-bit + 10-bit = 11-bit, 9-bit+9-bit=10-bit, 9-bit + 10-bit = 11-bit => 11-bit + 10-bit + 11-bit = 13-bit

In fact because each addition cannot be considered separately, we can consider that the final addition of 11-bit + 10-bit + 11-bit will not overflow from 11-bits. Lets do the worst case to check that it works.

Worst case for a Sobel filter along the image lines is the following 3x3 pixels window (pixel in this case are unsigned values) :
255, 255, 255xxx,  xxx,  xxx000, 000, 000
1x255+2x255+1x255-1x0-2x0-1x0 = 1020,  fits on 10-bits + a bit for the sign => 11bits.

So the output of one Sobel convolution for the Sobel filter fits on 11-bits. To get the gradient intensity the U and V gradient needs to be combined using the Euclidean distance.  The Euclidean distance requires to perform the square of each gradient and then perform a square-root of the addition.
Square roots are very expensive to perform on an FPGA, so we can choose to approximate the Euclidean distance by summing the absolute values.  This saves quite a lot of FPGA area and improves the performance of the system.

The full Sobel implementation can be downloaded at:

There are two implementations

  • First implementation uses a processing clock that is 4x the max pixel clock. In this implementation, the FPGA can use up to two clock cycles to process a single pixel. This is not particularly useful for Sobel, but some other convolutions may use a different kernel coefficient that requires multiplication instead of shifting. This implementation is fine for “small” resolutions like VGA at up to 60FPS (24MHz pixel-clock, 100MHz processing clock).
  • Second implementation is fully pipelined and uses the pixel clock as the processing clock. This implementation uses more resources but is very efficient frequency wise (pixel-clock can be higher than 130MHz) and process HD images.


The next step will be the debugging of our filter using a specific test-bench to stimulate the design with real images and generate output images.  Stay tuned for part 3 of this blog series for implementing a test-bench and simulating the design.

Creative Commons License

This work is licensed to ValentF(x) under a Creative Commons Attribution 4.0 International License.

Update 28 June 2015: XXICC has been updated to XXICC (21st Century Co-design) release 0.0q

Here is the release 0.0p of XXICC.  There is no rev 0.0o since the letter O looks too much like the digit 0.  0.0p adds a Flavia implementation for the Gadget Factory Papilio DUO board, which has a Xilinx Spartan-6 LX9 FPGA and an Arduino-compatible Atmel ATmega32U4.  Flavia 0.0p also adds pinout tables to map top-level ports and signal names to FPGA pins.  You can also specify pull-up, pull-down, and keeper circuits for FPGA I/Os.


0.0p is a “partial” release specifically intended to support the author’s element14 ’blog Raspberry Pi 2 meets Papilio DUO.  To get this ’blog published in a timely fashion, we chose to defer some features to next release: see the release notes.


XXICC (21st Century Co-design) is a not-for-profit research project which attempts to bring digital hardware/software co-design into the 21st Century using an improved programming language and a Reduced Software Complexity philosophy.  Its goal is to make it easier and more enjoyable to write and maintain digital hardware and software. XXICC is pronounced "Chicken Coop", so-called because it has so many layers.


For an overview of XXICC, see the home page and wiki.  For details on the GalaxC programming language, XXICC Object Editor, and GalaxC extensions for Hardware Design (GCHD), here are the latest documents and source code.


Release notes for XXICC rev 0.0p

Programming in the GalaxC Language rev 0.0j: reference and user guide for the GalaxC programming language, unchanged for 0.0p.

The XXICC Anthology rev 0.0n: collection of miscellaneous XXICC topics, including user guides for the XXICC Object Editor, GCHD and Flavia.  No changes for 0.0p: see Raspberry Pi 2 meets Papilio DUO for Papilio DUO and synthesis pinout table.

Data files for FlaviaPD59 release 0.0p: Data files for the FlaviaPD59 implementation of the Free Logic Array for the Papilio DUO.

Data files for FlaviaLP56 release 0.0p: Data files for the FlaviaLP56 implementation for the ValentF(x) LOGI-Pi.

Data files for FlaviaLB56 release 0.0p:  Data files for the FlaviaLB56 implementation for the ValentF(x) LOGI-Bone.

Taming the Wild Bitstream (unchanged for 0.0p): Supplement to Flavia: the Free Logic Array.

XXICC code release 0.0p: all source code for XXICC.

XXICC source code listing rev 0.0p: source code listing as PDF.

Release 0.0p does not including XXICC executable binaries: you must build the executable from source code.

GalaxC sample/demo programs rev 0.0k: sample GalaxC programs and GCHD logic libraries, unchanged for 0.0p.

GalaxC sample/demo program listings rev 0.0k: PDF listing of the sample GalaxC programs and GCHD examples, unchanged for 0.0p.

Installing and Running XXICC rev 0.0p: Document describing how to install and run XXICC.

Compiling and Running GalaxC Programs rev 0.0k: Document describing how to compile and run your own GalaxC programs, unchanged for 0.0p.


I've tested XXICC 0.0p on GNU/Linux (Ubuntu on x86 PC, Raspberry Pi Debian "Wheezy", and ODROID-C1 Ubuntu).  My main machine is Ubuntu, so the others are more likely to have anomalies.  Constructive comments and suggestions are most welcome.  I'd especially like to find out how to reproduce some of the bugs that have been eluding me.


XXICC is a FLOSS (Free as in Liberty Open Source Software) project.  Software is licensed under GPLv3 and other content is licensed under Creative Commons CC-BY-SA 3.0.


Cheap Entry into FPGA

Posted by michaelkellett Top Member May 3, 2015

There's been quite  a lot of talk in the FPGA group about cheap ways to get started so a couple of weeks ago I decided to forsake the professional setup I use for real work and try one of the cheapest routes I could find. The idea was to try something as general purpose as possible and to keep clear of as much proprietary hardware and software as possible. That basically means using a simple board with the FPGA and very little else on it and using the FPGA manufacturer's free software.


All the FPGA manufacturers offer development boards and some of the simple ones from Lattice and Altera are quite cheap. Xilinx and MicroSemi don't get down to quite the same price level. There are third party boards at just about any level and the cheapest I can find with a real FPGA are based on Altera's Cyclone2 device.

There are several vendors for these on Aliexpress - this one is typical : ALTERA FPGA Cyslonell EP2C5T144 Minimum System Learning Board Development Board-in Other Electronic Components from Elec…

You'll pay about £10 with free shipping.

I bought mine in a bundle with a programmer from Etang Electronics on Amazon for £32.58 for the lot including shipping.


It looks like this:




And the good news is that it all works !


The Cyclone2 is a rather old design and although Altera still recommend it for new designs it isn't supported by the latest version of their design software. So when you download the software to develop code for it make sure to get version 13.0sp1. The part on the board is  a real FPGA with hardware multipliers, some embedded memory blocks and about 4000 logic elements. You get a voltage regulator, a 50MHz oscillator, a switch and three LEDs.


I've never used Altera's design software before so it took a little while to get round the quirks. There are one or two places where it seems to need to be told stuff it must already know (like where to find the simulator that was installed with it !) and it can generate test bench files but gives them an odd filename extension which you end up changing. There is a lot if support material on the web.


You get a free version of ModelSim for simulation and I found this to be the most difficult part - I expect mainly because I use a paid for version of a different simulator and have done for the last 10 years.

There is a free open source  VHDL simulator (GHDL) but it's reputation is not very good - so if you go this route you are stuck with ModelSim.


These boards are down at Arduino price levels so perfectly affordable for building in to projects.


If any one is interested I'll post some actual code that does something and if any one has an idea of a project they fancy using this board I'll be happy to discuss it.