The DSP48 Primitive - Symmetric FIR with DSP48 Primitive Instantiations
We will now add the option of choosing between DSP48 inference or primitive instantiations to the symmetric FIR introduced in Post 22. It might make sense to review Post 22 and Post 23 before continuing.
We will use the same technique: the generic BEHAVIORAL can be set to TRUE or FALSE to select between the two implementations. The code is very similar to the one used in Post 24, except that we now also use the DSP48E2 D input port and the INMODE input port to take advantage of the pre-adder. The first DSP48 in the chain is again slightly different, and the pre-adder's add or subtract feature lets us select between symmetric and anti-symmetric implementations:
use work.types_pkg.all; -- VHDL93 version of package providing SFIXED type support
entity SYMMETRIC_SYSTOLIC_FIR is
ODD:BOOLEAN:=FALSE; -- if ODD the FIR order is 2*N-1 else it is 2*N
CI:in SFIXED_VECTOR; -- set of N symmetric coefficients, filter order is 2*N if even or 2*N-1 if odd - in this case set the middle coefficient to half the desired value
I:in SFIXED; -- forward data input
O:out SFIXED); -- filter output
architecture TEST of SYMMETRIC_SYSTOLIC_FIR is
assert I'length<28 report "Input Data width must be 27 bits or less" severity warning;
assert CI'length/N<19 report "Coefficient width must be 18 bits or less" severity warning;
sd:entity work.SDELAY generic map(SIZE=>2*N-1)
ib:if BEHAVIORAL generate
type TAC is array(0 to N) of SFIXED(I'range);
type TPC is array(0 to N) of SFIXED(I'high+(CI'high+1)/N+LOG2(N) downto I'low+CI'low/N);
lk:for K in 0 to N-1 generate
signal B:SFIXED((CI'high+1)/N-1 downto CI'low/N):=(others=>'0');
signal M:SFIXED(A2'high+B'high+1 downto A2'low+B'low):=(others=>'0');
if rising_edge(CLK) then
if not ODD and K=0 then -- remove one A delay for the first tap if filter is even symmetric
if ANTISYMMETRIC then
B<=ELEMENT(CI,K,N); -- register for the coefficient inputs
M<=B*AD; -- multiplier internal register
P<=RESIZE(M+PC(K),PC(K+1)); -- post-adder output register
AC(K+1)<=A2; -- A cascade output
PC(K+1)<=P; -- P cascade output
O<=RESIZE(PC(PC'high),O'high,O'low); -- truncate the final sum to match the O output port range
ip:if not BEHAVIORAL generate
type TAC is array(0 to N) of STD_LOGIC_VECTOR(29 downto 0);
signal AC:TAC; -- A cascade
type TPC is array(0 to N) of STD_LOGIC_VECTOR(47 downto 0);
signal PC:TPC; -- P cascade
lk:for K in 0 to N-1 generate
signal C:SFIXED((O'high+1)/N-1 downto O'low/N):=(others=>'0');
signal P:SFIXED(I'high+(CI'high+1)/N+LOG2(N) downto I'low+CI'low/N);
signal INMODE:STD_LOGIC_VECTOR(4 downto 0);
signal OPMODE:STD_LOGIC_VECTOR(8 downto 0);
function A_INPUT(K:INTEGER) return STRING is
if K=0 then
else return "CASCADE";
function AREG(K:INTEGER;ODD:BOOLEAN) return INTEGER is
if not ODD and K=0 then
return 1; -- for the first tap the A cascade delay is one clock
return 2; -- for all the other taps the A cascade delay is two clocks
INMODE<="10100" when (ODD or K>0) and not ANTISYMMETRIC else -- (D+A2)*B1
"10101" when not ODD and K=0 and not ANTISYMMETRIC else -- (D+A1)*B1
"11100" when (ODD or K>0) and ANTISYMMETRIC else -- (D-A2)*B1
"11101"; -- when not ODD and K=0 and ANTISYMMETRIC -- (D-A1)*B1
OPMODE<="110000101" when K=0 else "110010101"; -- P=C+(D±A)*B when K=0 else P=C+PCIN+(D±A)*B
ds:entity work.DSP48E2GW generic map(AMULTSEL=>"AD",
C=>C, -- zero
i0:if K=N-1 generate
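The INMODE and OPMODE settings in the listing above can be read bit by bit. The following is a minimal sketch, not part of the original design: the package name, the function names FIR_INMODE and FIR_OPMODE and the small self-checking testbench are all hypothetical. The bit meanings are taken from the comments in the listing and from the DSP48E2 INMODE and OPMODE field definitions as we understand them, so treat it as an illustration rather than a replacement for the code above.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical helper package, for illustration only
package FIR_MODE_SKETCH_PKG is
  function FIR_INMODE(K:INTEGER; ODD,ANTISYMMETRIC:BOOLEAN) return STD_LOGIC_VECTOR;
  function FIR_OPMODE(K:INTEGER) return STD_LOGIC_VECTOR;
end package;

package body FIR_MODE_SKETCH_PKG is
  function FIR_INMODE(K:INTEGER; ODD,ANTISYMMETRIC:BOOLEAN) return STD_LOGIC_VECTOR is
    -- start from "10100": bit 4 selects B1, bit 2 enables D, bit 0 = '0' selects A2
    variable R:STD_LOGIC_VECTOR(4 downto 0):="10100";
  begin
    if not ODD and K=0 then
      R(0):='1'; -- first tap of an even filter: use A1 (one clock of A delay) instead of A2
    end if;
    if ANTISYMMETRIC then
      R(3):='1'; -- pre-adder subtracts: D-A instead of D+A
    end if;
    return R;
  end function;

  function FIR_OPMODE(K:INTEGER) return STD_LOGIC_VECTOR is
    -- "11 000 01 01": W selects C, Z selects 0, X and Y together select the multiplier output
    variable R:STD_LOGIC_VECTOR(8 downto 0):="110000101";
  begin
    if K>0 then
      R(4):='1'; -- Z="001": also add PCIN from the previous DSP48E2 in the chain
    end if;
    return R;
  end function;
end package body;

library ieee;
use ieee.std_logic_1164.all;
use work.FIR_MODE_SKETCH_PKG.all;

-- Self-checking testbench comparing the functions against the constants in the listing
entity FIR_MODE_SKETCH_TB is
end entity;

architecture SIM of FIR_MODE_SKETCH_TB is
begin
  process
  begin
    assert FIR_INMODE(1,FALSE,FALSE)="10100" report "expected (D+A2)*B1" severity failure;
    assert FIR_INMODE(0,TRUE,FALSE)="10100" report "expected (D+A2)*B1" severity failure;
    assert FIR_INMODE(0,FALSE,FALSE)="10101" report "expected (D+A1)*B1" severity failure;
    assert FIR_INMODE(2,FALSE,TRUE)="11100" report "expected (D-A2)*B1" severity failure;
    assert FIR_INMODE(0,FALSE,TRUE)="11101" report "expected (D-A1)*B1" severity failure;
    assert FIR_OPMODE(0)="110000101" report "expected P=C+M" severity failure;
    assert FIR_OPMODE(3)="110010101" report "expected P=C+PCIN+M" severity failure;
    report "all checks passed";
    wait;
  end process;
end architecture;
```

Since the result depends only on the ODD and ANTISYMMETRIC settings and on the generate index K, functions like these could in principle drive the INMODE and OPMODE ports as constants. The conditional signal assignments in the listing achieve the same effect.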
The structural version of the generic symmetric/anti-symmetric FIR can now be tested the same way as in Post 22; just set the BEHAVIORAL generic to FALSE.
As we saw in Post 22, behavioral code inference did not produce ideal results for this type of symmetric FIR, so here the structural version makes a real difference.
The inferred version of this FIR design uses 4 DSP48E2s, 216 FFs and 108 LUTs and runs at 779MHz in a ZU9EG-2.
The structural version of the FIR uses 4 DSP48E2s, 24 FFs and 12 LUTs (for the SDELAY module) and runs at 909MHz in the same ZU9EG-2. Your maximum clock speed in this particular device will still be 775MHz, because it is limited by the datasheet numbers and not by the timing closure of the FPGA design. In the fastest -3 speed grade the datasheet fMAX is 891MHz.
The lesson here is this: if behavioral code inference produces optimal results, it is the best coding technique and its use is highly recommended. So try it first, and if it works, move on to the next design challenge. If the inference result is not ideal, switching to primitive instantiations is the way to go. It requires more design effort, but it gives you complete control over the synthesis results, so you get the implementation you expect.
For the particular example used in this post, the structural version uses 9x fewer FFs and LUTs and has 17% more timing margin, which is equivalent to about one FPGA speed grade: the structural design in a -2 will be faster than the inferred design in a -3! The lower fabric utilization becomes important if the FIR has a higher order than the N=4 used here, or if there are many instances of this FIR in the same design and we want to close timing at the highest possible clock rate.
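As a quick check of these figures, using only the numbers reported in this post:

```latex
\begin{align*}
\text{FF ratio} &= \frac{216}{24} = 9, &
\text{LUT ratio} &= \frac{108}{12} = 9,\\
\text{clock ratio} &= \frac{909\ \text{MHz}}{779\ \text{MHz}} \approx 1.17, &
\text{speed grade ratio} &= \frac{891\ \text{MHz}}{775\ \text{MHz}} \approx 1.15.
\end{align*}
```

The last ratio compares the -3 and -2 datasheet limits quoted earlier. Using it as a measure of how much faster the fabric gets from one speed grade to the next is an assumption of this check, but it puts the 17% gain in the same range as a one speed grade step.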
In the next posts we will look at other uses of DSP48s and compare inference and instantiation coding techniques for them.