Sorting Networks - The Verification Problem In the final post on this subject of parallel sorting using FPGAs I will talk about the difficult issue of how to verify that such a design actually works. That is when given a set of any N input numbers in any arbitrary order the design outputs them in ascending sorted order. There are two main verification strategies for hardware designs, based on formal proof and exhaustive testing. The first one uses logical mathematical methods to prove th ...

Sorting Networks - The results for the VHDL implementation of Batcher's sorting algorithm So how good is this VHDL implementation of a parallel sorting network? Before we look at the results here are the two modules that were missing from the previous post, a generic DELAY module for elements to be sorted: library ieee; use ieee.std_logic_1164.ALL; use work.SORTER_PKG.all; entity DELAY is generic(SIZE:INTEGER:=1); port(CLK:in STD_LOGIC; ...

Sorting Networks - The VHDL implementation of Batcher's sorting algorithm OK, enough with the preliminaries, it's time now for the real deal, how do we implement in an FPGA this parallel sorting network thing. The main design goals are creating a generic, reusable and efficient implementation. We want to be able to implement a sorting network of any size N with a single piece of code, we want to be able to sort all kinds of data, not just integers and we want something close in size to ...

Sorting Networks - The Batcher or odd-even mergesort sorting algorithm Now that we have defined the FPGA design engineering problem - parallel sorting of a set of N items in one clock, that is one new sorting operation starting every clock - we need to chose a sorting algorithm.While ideal in terms of performance (number of compare-exchange operations respectively latency) optimal sorting networks are irregular and more importantly they cannot be generated programmatically for an arbitr ...

Sorting Networks Sorting networks are an interesting and unsolved mathematical puzzle. They are quite different from the usual sorting algorithms one encounters in computer science like bubble sort, quick sort, merge sort, heap sort and so on. While sequential algorithms are generic, in the sense that they work on an input set of items of any size, their execution time grows with N, either as O(N·log2N) for the fast algorithms or O(N2) for the more trivial ones and the ...

Sorting and Searching Algorithms Now that we went through a variety of VHDL design building blocks and we saw the techniques to create both generic, reusable and at the same time efficient (in terms of speed and area) designs, using either behavioral inference or primitive instantiations, it is time to put what we have learned in practice. We have already looked briefly at FIRs, Finite Impulse Response Filters in the context of introducing the DSP48 primitive, there is a lot more to be s ...

The DSP58 Primitive Xilinx has recently announced a new 7nm FPGA family called Versal. Devices in this family will have an improved version of the UltraScale/UltraScale+ DSP48E2 primitive we just studied in the last posts. The new Versal primitive is called DSP58 and there are numerous improvements compared to the earlier DSP48. First of all, the signed multiplier, which was 27x18 in DSP48 is now 27x24 and the 48-bit post-adder/accumulator is 58 bits, which is where the name of ...

The DSP48 Primitive - Complex multipliers Traditionally a complex multiplication can be decomposed into four real multiplications and two additions: x+i·y=(a+i·b)(c+i·s)=(a·c−b·s)+i·(a·s+b·c) where i=√−1 This maps well into four DSP48s, including the two additions, which can use the post-adders and the DSP48 P cascades. The latency of a fully pipelined DSP48 implementation is fo ...

The DSP48 Primitive - Inferring larger multipliers The DSP48E2 primitive contains a signed 27x18 multiplier, any signed multiplier up to this size can be implemented with just one such primitive. If we need larger multipliers we can achieve that with multiple DSP48s. The way larger multipliers are built uses a feature of the DSP48 primitive in which the 48-bit dedicated P cascade output of one DSP48 is right shifted by 17 bits before being added to the partial product calculate ...

The DSP48 Primitive - Small Multiplications - Two For the Price of One The DSP48E2 primitive contains a signed 27x18 multiplier, any signed multiplier up to this size can be implemented with just one such primitive. Unsigned multiplications are of course possible if you add a zero MSB bit to the operands and then treat them as signed but the largest unsigned multiplication that can be done that way with one DSP48E2 is 26x17. When the operands are much smaller it becomes possib ...

The DSP48 Primitive - Wide XOR Mode The DSP48 primitive can be used for more than just multiply and accumulate. It can for example implement very wide XOR functions. Apart from the obvious ability of XORing two 48-bit operands using the A concatenated with B, or A:B and C inputs and producing a 48-bit result on the P output, or 48 XOR2 logic functions, it is also possible to implement 8 XOR12s, or 4 XOR24s, or 2 XOR48s or one XOR96 with a single DSP48E2. It is also possible to compute a ...

The DSP48 Primitive - Symmetric FIR with DSP48 Primitive Instantiations We will now add the option of choosing between DSP48 inference or primitive instantiations to the symmetric FIR introduced in Post 22. It might make sense to review Post 22 and Post 23 before continuing. We will use the same technique, the generic BEHAVIORAL can be set to TRUE or FALSE to select between the two implementations. This is very similar to the code used in Post 24, except that now we are using the ...

The DSP48 Primitive - FIR with DSP48 Primitive Instantiations After a break I will be resuming my weekly posts on The Art of FPGA Design. In the last posts we started looking at the DSP48 primitive, essentially a signed 27x18 multiplier which also includes a 27-bit pre-adder and a 3-input 48-bit post-adder. In older FPGA families like the 7-series the multiplier is 25x18, the pre-adder is 25-bits and the 48-bit post-adder has only two inputs. The DSP48 primitive also includes a lot of p ...

The DSP48 Primitive - Instantiating the DSP48 Behavioral inference has many advantages - relatively simple and compact code, works with signed and unsigned operands of any size, hides the intricacies of the DSP48 primitive from the user. It should definitely be the first choice when coding a DSP based design if it produces the desired results in terms of device utilization and clock speed. That's a big if, when things do not go as you want there isn't much you can do - fighting wi ...

The DSP48 Primitive - Behavioral Symmetric FIR Inference The DSP48 primitive has an optional preadder function, which can be used to compute things like PCOUT=PCIN+(A+D)*B, which when used for implementing symmetric or anti-symmetric FIRs can reduce the number of multipliers used in half. The following diagram shows how such a symmetric FIR is built using the case N=4, a symmetric FIR with 8 taps as an example: The forward data delay line is identical to the one for the non-sym ...