The DSP48 Primitive - Behavioral FIR Inference


As mentioned earlier, the DSP48 primitive is an essential part of any signal processing FPGA design and in over 90% of cases it's either FIR like sums of products or complex multiplications. For this reason we will focus now on efficient implementation of Finite Impulse Response filters with DSP48s, which will also cover other cases where computation of sums of products is required like linear algebra matrix multiplication and convolutional neural networks.


In mathematical terms what we need to calculate is:


y(n)=Sum(h(k)*x(n-k)) for k=0..N-1


This can be seen as a convolution between the input data x and the filter impulse response h. The length of the impulse response if finite, N in this case, which is also known as the filter number of taps.


There is an overwhelming variety of variants - the data and/or coefficients can be real or complex, the coefficients can be arbitrary or symmetric, with N odd or even in the later case, some of the coefficients could be zero (half-band filters), some of the input samples could be zero (interpolating FIRs), some of the output samples might not be needed (decimating FIRs), there are polyphase versions that can do fractional sample rate changes, the input sample rate could be smaller or larger than the system clock frequency, the list goes on and on. But in the end all these cases can be reduced to a basic building block, the simplest possible case of a single rate (meaning the input and output sample rates are equal to the system clock frequency), real data/real coefficients non-symmetric FIR with N constant coefficient taps. We need to find an elegant way to implement such a filter that is efficient, meaning it runs at the highest possible frequency using the least amount of resources possible and scalable, in the sense that this optimal performance is maintained as the number of taps N increases.


As in previous examples, there are two fundamentally different ways this could be achieved in an HDL design flow, each one with its own advantages and disadvantages. We could use behavioral inference where the FIR is generated by the synthesis tool from a functional description of what we actually want or a structural design approach where we instantiate the DSP48 primitives ourselves, configuring and connecting them as we want. The first method is more readable and easier to test, debug and maintain, even portable to other technologies to a certain extent, while the second one is more efficient and gives us direct control over the FPGA resources. If the first approach produces the results we need it is almost always preferable, if not the second approach, while more difficult can save the day. In both cases, we would like the FIR design to be as generic and reusable as possible.


Even if one always chooses the first way of behavioral inference, this is not an excuse for not understanding how the DSP48 primitive works. This is a complicated module, it has over 50 ports and 50 generic parameters and the number of ways it can be configured and used is in the thousands. It is also relatively difficult to use directly without a deep understanding of its functionality, which is one of the reasons why behavioral inference is preferable, especially for beginners. But the risk of this approach is that you can easily get sub-optimal results in terms of device utilization and speed, which for large and complex designs (in this case many instances of relatively large FIR filters) can become a very expensive proposition.


The most important thing one needs to know when trying to infer DSP48s is that you can always run them as fast as the data sheet will allow, typically between 600MHz and 891MHz, depending on FPGA family and speed grade but only if you use all the optional pipeline registers inside the primitive. For a multiply/add configuration the latency is 3 clocks and for a pre-add/multiply/add it is four clocks. Even if you do not use the post-adder and simply want a plain multiplication, the minimum latency to achieve optimal performance is still 3 clocks. The throughput is always one, meaning one multiply/add every clock but it takes 3 to 4 clocks for the result for a given set of input operands to appear at the outputs. A particular HDL coding style is needed to infer this pipelined DSP48 and also a particular FIR structure called direct systolic needs to be used to achieve an efficient and scalable computation of a large sum of products.


Ignoring the pre-adder for now, which becomes important only for the case of symmetric or anti-symmetric FIRs, each DSP48 in a systolic chain will compute something like PCOUT=A*B+PCIN, where the PCOUT of one DSP48 drives the PCIN of the next DSP48 in the chain. The three register levels that need to be inferred if maximum speed is desired are A and B for the input data and filter coefficients, M which is an internal multiplier pipeline register and P, which is the post-adder output register that will drive the PCOUT port. The PCOUT to PCIN cascade connections use dedicated routing between two adjacent DSP48s located in the same vertical column and they always go from the bottom of the die to the top - a DSP48 chain as long as the height of a DSP48 column can be created. The input data also propagates up the column using dedicated ACOUT to ACIN connections but it needs to be delayed by two clocks, one required by the convolution equation and one to compensate for the P pipeline registers in the post-adder chain. The A cascade connections are also dedicated routing and the two A delay registers are part of the DSP48 primitive so in the end a single rate non-symmetric FIR with N constant coefficient taps can be implemented with N DSP48s, without using any fabric logic or routing resources. This means that this systolic FIR implementation is both very efficient and scalable, running at full speed independent of the number of taps N. This blog is too short to get into too much detail about the DSP48 internal structure and functionality but if you are not already familiar with this, UG479, The 7-Series DSP48E1 User Guide and UG579, The UltraScale DSP48E2 User Guide are required reading, even (or maybe especially) if you only plan to use the behavioral inference path.


Without further delay, here is how I would normally code a single rate non-symmetric FIR using behavioral inference. As mentioned earlier, we want to keep the code generic, easy to use test, debug and reuse. So the design will implement an FIR of any number of taps, with arbitrary fixed point input and output data:


library IEEE; 
use IEEE.STD_LOGIC_1164.all;

use work.types_pkg.all; -- VHDL93 version of package providing SFIXED type support 

entity SYSTOLIC_FIR is
port(CLK:in STD_LOGIC;
in SFIXED_VECTOR; -- set of N coefficients
       I:in SFIXED;         -- forward data input
       O:out SFIXED);       -- filter output end SYSTOLIC_FIR; 

architecture TEST of SYSTOLIC_FIR is
assert I'length<28 report "Input Data width must be 27 bits or less" severity warning;
assert CI'length/N<19 report "Coefficient width must be 18 bits or less" severity warning;

if BEHAVIORAL generate
type TAC is array(0 to N) of SFIXED(I'range);
signal AC:TAC;
type TPC is array(0 to N) of SFIXED(I'high+(CI'high+1)/N+LOG2(N) downto I'low+CI'low/N);
signal PC:TPC;
for K in 0 to N-1 generate
signal A1,A2:SFIXED(I'range):=(others=>'0');
            signal B:SFIXED((CI'high+1)/N-1 downto CI'low/N):=(others=>'0');
signal M:SFIXED(A2'high+B'high+1 downto A2'low+B'low):=(others=>'0');
signal P:SFIXED(PC(K+1)'range):=(others=>'0');
if rising_edge(CLK) then
if K=0 then -- for the first tap the A cascade delay is one clock
else        -- for all the other taps the A cascade delay is two clocks
end if;
-- register for the coefficient inputs
                M<=B*A2;                    -- multiplier internal register
                P<=RESIZE(M+PC(K),PC(K+1)); -- post-adder output register
end if;
end process;
1)<=A2; -- A cascade output
            PC(K+1)<=P;  -- P cascade output
end generate;
high),O'high,O'low); -- truncate the final sum to match the O output port range
  end generate;


The names of the signals defined locally in the lk: generate loop match exactly the DSP48 internal register names. Using SFIXED signals of arbitrary size instead of the STD_LOGIC_VECTORs of fixed size that the DSP48 instantiations would require, with automatic handling of the data resizing and binary point alignment makes the design easier to understand, maintain and use. An example instantiation of this generic FIR block would look like this:


library IEEE; 
use IEEE.STD_LOGIC_1164.all;

use work.types_pkg.all; -- VHDL93 version of package providing SFIXED type support 

generic(N:INTEGER:=4; -- FIR number of taps
port(CLK:in STD_LOGIC;
in SFIXED(7 downto -16);    -- input data is fixed point
       O:out SFIXED(9 downto -22));  -- output data is a different fixed point

architecture TEST of TEST_SYSTOLIC_FIR is
constant COEF:work.types_pkg.REAL_VECTOR(0 to N-1):=(0.125,0.250,0.375,0.5); -- the filter coefficients are REAL
constant CL:INTEGER:=-17; -- constants used to convert the REAL coefficients to SFIXED
constant CH:INTEGER:=0;
signal CI:SFIXED_VECTOR((CH+1)*N-1 downto CL*N); -- FIR coefficients in SFIXED format
convert the filter coefficients from REAL to SFIXED
  lk:for K in 0 to N-1 generate
       CI((CH-CL+1)*(K+1)-1+CI'low downto (CH-CL+1)*K+CI'low)<=SFIXED_VECTOR(TO_SFIXED(COEF(K),CH,CL));
end generate;
-- instantiate the generic SYSTOLIC_FIR
  fr:entity work.SYSTOLIC_FIR generic map(N=>N, -- filter number of taps
port map(CLK=>CLK,
-- filter coefficients, they do not have to be constant
                                       I=>I,    -- FIR input data
                                       O=>O);   -- FIR output data
end TEST;


While the coefficients are constants right now, the top level design could very well implement variable coefficients with no changes to the SYSTOLIC_FIR generic building block. I kept the filter size short, only 4 taps so we can look at the implementation results in schematic form:

This confirms that there are no fabric logic or routing resources used other than DSP48s and dedicated A and P cascade routing and the filter will scale up to larger N values without a degradation in speed. As a general rule, Vivado synthesis will infer properly pipelined FIRs, including symmetric ones with the use of the pre-adder which will be the object of the next post, from behavioral HDL but only if a particular coding style is followed.


Back to the top:The Art of FPGA Design