The DSP48 Primitive

 

This post starts a longer series dedicated to the DSP48 primitive, a MAC (multiply/accumulate) block that is the workhorse of any signal processing design requiring many mathematical operations beyond simple additions and subtractions. Those simpler operations are well handled by fabric-based implementations that use the dedicated carry chain primitives.

 

The DSP48 comes in multiple flavors, one for each Xilinx FPGA family. It started as a signed 18x18 multiplier in the earliest Virtex devices, about 20 years ago. Over time the multiplier has grown to 25x18 and then 27x18, and a 48-bit post adder and a 25/27-bit preadder have been added.

 

Simplifying things a bit, we can say that a DSP48 computes expressions like P=(A+D)*B+C, where A and D are 25 or 27 bits, B is 18 bits, and P and C are 48 bits, all signed numbers. By the way, the variable names used in this expression match the DSP48 input and output port names, which is of course a good coding practice.
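To make the bit widths concrete, here is a minimal behavioral sketch of that expression in Python. It is not a description of the actual silicon: the function names wrap_signed and dsp48 are my own, and I am assuming that the preadder and the post adder simply wrap around in two's complement when a result does not fit.

```python
def wrap_signed(value, bits):
    """Wrap an integer into the range of a signed two's-complement number."""
    value &= (1 << bits) - 1
    # If the sign bit is set, the value represents a negative number.
    return value - (1 << bits) if value >> (bits - 1) else value


def dsp48(a, d, b, c, ad_bits=27):
    """Simplified model of P = (A + D) * B + C.

    ad_bits is 25 for a 25x18 block and 27 for a 27x18 block.
    Wrap-around on overflow is an assumption of this sketch.
    """
    ad = wrap_signed(wrap_signed(a, ad_bits) + wrap_signed(d, ad_bits), ad_bits)
    m = ad * wrap_signed(b, 18)                      # signed multiplier
    return wrap_signed(m + wrap_signed(c, 48), 48)   # 48-bit post adder


print(dsp48(3, 4, 5, 6))   # (3 + 4) * 5 + 6 = 41
```

The largest possible product of a 27-bit and an 18-bit signed number needs 45 bits, so the 48-bit post adder leaves a few guard bits for accumulation before the result can overflow.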

 

Leaving the historical families behind, we will focus on 7-Series (Spartan7, Artix7, Kintex7, Virtex7 and Zynq7000), which contain a primitive called DSP48E1, and UltraScale/UltraScale+ (Kintex, Virtex and Zynq MPSoC), which have the newer 27x18 flavor called DSP48E2. Xilinx FPGAs contain from as few as 10 DSP48s in the smallest Spartan7 device, the XC7S6, to as many as 12,288 in some of the largest Virtex UltraScale+ devices, the VU13P and VU29P. Similarly, the data sheet maximum clock speed ranges from 464MHz in the slowest speed grade Spartan7 and Artix7 to 891MHz in the fastest speed grade UltraScale+. This means that the peak performance of the DSP48s in the fastest speed grade VU13P device is almost 11TMACs (almost 11 trillion 27x18 multiplications and 48-bit additions every second).
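The peak figure is simply the number of DSP48s multiplied by the clock rate, assuming every DSP48 completes one multiply/accumulate per clock. A short Python check, using only the numbers quoted above (the function name is my own):

```python
def peak_macs_per_second(dsp_count, fmax_hz):
    """Peak throughput, assuming one MAC per DSP48 per clock cycle."""
    return dsp_count * fmax_hz


# VU13P in the fastest speed grade: 12,288 DSP48s at 891 MHz.
peak = peak_macs_per_second(12_288, 891_000_000)
print(f"{peak / 1e12:.2f} TMAC/s")   # prints "10.95 TMAC/s"
```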

 

While data sheet numbers are generally values that cannot be easily achieved in normal designs, this is not the case with the DSP48. As a general rule of thumb, whatever the data sheet DSP48 fMAX value is for a particular device family and speed grade, that level of performance can be achieved relatively easily if proper design rules are followed.

 

Even more importantly, these multiply and accumulate operations are not independent of each other. In typical designs the vast majority of them are sums of products, in some cases of many such terms. FIR filters, complex multiplications, FFTs, linear algebra matrix operations and convolutional neural networks are just a few examples. All DSP48 primitives are organized in vertical columns spanning the entire height of a device, with dedicated cascade connections between them going up along the column. These dedicated cascade chains do not use normal fabric routing, so they do not add to routing congestion and their speed is not affected by unrelated logic. You can chain all the DSP48s in a column and compute a huge sum of products at full speed, without impacting or being affected by the rest of the design in the fabric. The DSP48s not only implement the multiplications; the additions required to calculate the sum of products are also free, provided by the post adders and the dedicated column cascade routing.
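The cascade idea can be sketched in a few lines of Python. This is only an arithmetic model under my own assumptions: each loop iteration stands in for one DSP48 in the column, the running value stands in for the 48-bit cascade connection to the next DSP48 up, and pipelining and timing are ignored entirely. The names wrap48 and cascade_sum_of_products are hypothetical.

```python
def wrap48(value):
    """Wrap an integer into a signed 48-bit two's-complement range."""
    value &= (1 << 48) - 1
    return value - (1 << 48) if value >> 47 else value


def cascade_sum_of_products(a_values, b_values):
    """Sum of products computed one DSP48 at a time, bottom to top."""
    if len(a_values) != len(b_values):
        raise ValueError("each DSP48 needs one A and one B operand")
    cascade = 0   # the bottom DSP48 has nothing cascaded into it
    for a, b in zip(a_values, b_values):
        # Each DSP48 multiplies its own operands and adds the value
        # arriving on the cascade input from the DSP48 below it.
        cascade = wrap48(a * b + cascade)
    return cascade   # output of the top DSP48 in the chain


print(cascade_sum_of_products([4, 5, 6], [1, 2, 3]))   # 4 + 10 + 18 = 32
```

A three-tap FIR filter output, for example, is exactly this kind of sum, with the samples on one operand and the coefficients on the other.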

 

Obviously, the devil hides in the details, as it always does. While you can indeed achieve the maximum DSP48 fMAX, this requires pipelining. There are multiple optional registers inside the DSP48 and they all have to be used to reach that speed. The latency of a properly pipelined DSP48 is 4 clocks if the A+D preadder is used and 3 clocks if it is bypassed. Lower latencies can be achieved at the cost of a reduced clock rate, but this is generally not a good design choice, since it leads to a less efficient design. The columnar nature of the DSP48s makes it easy to compute the sum of products, but transferring the operands and the result to and from the fabric or between columns could become a placement problem.
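Where the 4 and 3 clock figures come from can be illustrated with a toy clock-by-clock model in Python. I am assuming four register stages here (input, preadder output, multiplier output and final P output), with the preadder stage skipped when the preadder is bypassed. The class and attribute names are my own, word widths are ignored, and B and C are simply carried along with the data rather than modeled with their own balancing registers.

```python
class PipelinedDsp48:
    """Toy model of a fully pipelined P = (A + D) * B + C."""

    def __init__(self, use_preadder=True):
        self.use_preadder = use_preadder
        self.in_reg = None   # (a, d, b, c) held by the input registers
        self.ad_reg = None   # (a + d, b, c) held after the preadder
        self.m_reg = None    # (product, c) held after the multiplier
        self.p_reg = None    # final result held in the output register

    def clock(self, inputs=None):
        """Advance one clock edge and return the P register contents."""
        # Update back to front so each stage sees last cycle's values.
        m, ad, i = self.m_reg, self.ad_reg, self.in_reg
        self.p_reg = None if m is None else m[0] + m[1]
        if self.use_preadder:
            self.m_reg = None if ad is None else (ad[0] * ad[1], ad[2])
            self.ad_reg = None if i is None else (i[0] + i[1], i[2], i[3])
        else:
            # Preadder bypassed: D is ignored and A feeds the multiplier.
            self.m_reg = None if i is None else (i[0] * i[2], i[3])
        self.in_reg = inputs
        return self.p_reg


def measure_latency(dsp, inputs):
    """Count clock edges from applying the inputs until P is valid."""
    cycles, result = 1, dsp.clock(inputs)
    while result is None:
        cycles, result = cycles + 1, dsp.clock()
    return cycles, result


print(measure_latency(PipelinedDsp48(use_preadder=True), (3, 4, 5, 6)))
print(measure_latency(PipelinedDsp48(use_preadder=False), (3, 4, 5, 6)))
```

With the preadder in use the result (3+4)*5+6 = 41 appears after 4 clock edges, and with the preadder bypassed the result 3*5+6 = 21 appears after 3, matching the latencies quoted above.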

 

Finally, as with the other coding examples we have seen earlier, the synthesis tool performance is mixed. When it works it works well, and in most cases it is perfectly possible to infer DSP48s from behavioral code and still achieve optimum device utilization and speed. But most cases is not all cases, and it is not uncommon to have to instantiate DSP48 primitives to achieve the desired level of performance. While you can infer a DSP48 without detailed knowledge of how the primitive works, you cannot instantiate one without that knowledge.

 

We will explore both coding styles in the next posts dedicated to the DSP48 primitive.

 
