The DSP48 Primitive - Symmetric FIR with DSP48 Primitive Instantiations

 

We will now add the option of choosing between DSP48 inference and primitive instantiation to the symmetric FIR introduced in Post 22. It might make sense to review Post 22 and Post 23 before continuing.

 

We will use the same technique: the generic BEHAVIORAL can be set to TRUE or FALSE to select between the two implementations. The code is very similar to that used in Post 24, except that we now also use the DSP48E2 D input port and the INMODE input port to take advantage of the pre-adder. The first DSP48 in the chain is again slightly different, and we can also select between symmetric and anti-symmetric implementations using the pre-adder's add or subtract feature:


library IEEE;
use IEEE.STD_LOGIC_1164.all;
use IEEE.NUMERIC_STD.all;
use work.types_pkg.all; -- VHDL93 version of package providing SFIXED type support

entity SYMMETRIC_SYSTOLIC_FIR is
  generic(N:INTEGER;
          ODD:BOOLEAN:=FALSE; -- if ODD the FIR order is 2*N-1 else it is 2*N
          ANTISYMMETRIC:BOOLEAN:=FALSE;
          BEHAVIORAL:BOOLEAN:=TRUE);
  port(CLK:in STD_LOGIC;
       CI:in SFIXED_VECTOR; -- set of N symmetric coefficients, filter order is 2*N if even or 2*N-1 if odd - in this case set the middle coefficient to half the desired value
       I:in SFIXED;         -- forward data input
       O:out SFIXED);       -- filter output
end SYMMETRIC_SYSTOLIC_FIR;

architecture TEST of SYMMETRIC_SYSTOLIC_FIR is
  signal ID:SFIXED(I'range);
begin
  assert I'length<28 report "Input Data width must be 27 bits or less" severity warning;
  assert CI'length/N<19 report "Coefficient width must be 18 bits or less" severity warning;

  sd:entity work.SDELAY generic map(SIZE=>2*N-1)
                        port map(CLK=>CLK,
                                 I=>I,
                                 O=>ID);
  ib:if BEHAVIORAL generate
    type TAC is array(0 to N) of SFIXED(I'range);
    signal AC:TAC;
    type TPC is array(0 to N) of SFIXED(I'high+(CI'high+1)/N+LOG2(N) downto I'low+CI'low/N);
    signal PC:TPC;
  begin
    AC(AC'low)<=I;
    PC(PC'low)<=(others=>'0');
    lk:for K in 0 to N-1 generate
      signal A1,A2,D,AD:SFIXED(I'range):=(others=>'0');
      signal B:SFIXED((CI'high+1)/N-1 downto CI'low/N):=(others=>'0');
      signal M:SFIXED(A2'high+B'high+1 downto A2'low+B'low):=(others=>'0');
      signal P:SFIXED(PC(K+1)'range):=(others=>'0');
    begin
      process(CLK)
      begin
        if rising_edge(CLK) then
          D<=ID;
          if not ODD and K=0 then -- remove one A delay for the first tap if filter is even symmetric
            A2<=AC(K);
          else
            A1<=AC(K);
            A2<=A1;
          end if;
          if ANTISYMMETRIC then
            AD<=RESIZE(D-A2,AD);
          else
            AD<=RESIZE(D+A2,AD);
          end if;
          B<=ELEMENT(CI,K,N); -- register for the coefficient inputs
          M<=B*AD;  -- multiplier internal register
          P<=RESIZE(M+PC(K),PC(K+1)); -- post-adder output register
        end if;
      end process;
      AC(K+1)<=A2; -- A cascade output
      PC(K+1)<=P;  -- P cascade output
    end generate;
    O<=RESIZE(PC(PC'high),O'high,O'low); -- truncate the final sum to match the O output port range
  end generate;

  ip:if not BEHAVIORAL generate
    type TAC is array(0 to N) of STD_LOGIC_VECTOR(29 downto 0);
    signal AC:TAC; -- A cascade
    type TPC is array(0 to N) of STD_LOGIC_VECTOR(47 downto 0);
    signal PC:TPC; -- P cascade
  begin
    AC(AC'low)<=(others=>'0');
    PC(PC'low)<=(others=>'0');
    lk:for K in 0 to N-1 generate
      signal C:SFIXED((O'high+1)/N-1 downto O'low/N):=(others=>'0');
      signal P:SFIXED(I'high+(CI'high+1)/N+LOG2(N) downto I'low+CI'low/N);
      signal INMODE:STD_LOGIC_VECTOR(4 downto 0);
      signal OPMODE:STD_LOGIC_VECTOR(8 downto 0);

      function A_INPUT(K:INTEGER) return STRING is
      begin
        if K=0 then
          return "DIRECT";
        else
          return "CASCADE";
        end if;
      end;

      function AREG(K:INTEGER;ODD:BOOLEAN) return INTEGER is
      begin
        if not ODD and K=0 then
          return 1; -- for the first tap the A cascade delay is one clock
        else
          return 2; -- for all the other taps the A cascade delay is two clocks
        end if;
      end;
    begin
      INMODE<="10100" when (ODD or K>0) and not ANTISYMMETRIC else    -- (D+A2)*B1
              "10101" when not ODD and K=0 and not ANTISYMMETRIC else -- (D+A1)*B1
              "11100" when (ODD or K>0) and ANTISYMMETRIC else        -- (D-A2)*B1
              "11101"; -- when not ODD and K=0 and ANTISYMMETRIC      -- (D-A1)*B1
      OPMODE<="110000101" when K=0 else "110010101"; -- P=C+(D±A)*B when K=0 else P=C+PCIN+(D±A)*B
      ds:entity work.DSP48E2GW generic map(AMULTSEL=>"AD",
                                           A_INPUT=>A_INPUT(K),
                                           AREG=>AREG(K,ODD))
                               port map(CLK=>CLK,
                                        A=>I,
                                        B=>ELEMENT(CI,K,N),
                                        C=>C, -- zero
                                        D=>ID,
                                        ACIN=>AC(K),
                                        PCIN=>PC(K),
                                        INMODE=>INMODE, -- pre-adder control; wrapper port name assumed to mirror the DSP48E2 primitive
                                        OPMODE=>OPMODE,
                                        ACOUT=>AC(K+1),
                                        PCOUT=>PC(K+1),
                                        P=>P);
      i0:if K=N-1 generate
        O<=RESIZE(P,O'high,O'low);
      end generate;
    end generate;
  end generate;
end TEST;
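
The four INMODE values in the listing above differ in only two bits: bit 0 selects A1 instead of A2, and bit 3 switches the pre-adder from D+A to D-A. Bit 2 (D input enabled) and bit 4 (B1 selected) are set in all four cases. As a minimal sketch of that encoding, the same values could be produced by a small helper function. The package name INMODE_SKETCH_PKG and the function name PREADD_INMODE are hypothetical and are not part of the original design:

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.all;

package INMODE_SKETCH_PKG is -- hypothetical package, for illustration only
  function PREADD_INMODE(USE_A1,SUBTRACT:BOOLEAN) return STD_LOGIC_VECTOR;
end INMODE_SKETCH_PKG;

package body INMODE_SKETCH_PKG is
  function PREADD_INMODE(USE_A1,SUBTRACT:BOOLEAN) return STD_LOGIC_VECTOR is
    variable M:STD_LOGIC_VECTOR(4 downto 0):="10100"; -- B1 selected, D enabled, (D+A2)*B1
  begin
    if USE_A1 then
      M(0):='1'; -- INMODE(0)=1 uses A1 instead of A2
    end if;
    if SUBTRACT then
      M(3):='1'; -- INMODE(3)=1 makes the pre-adder compute D-A
    end if;
    return M;
  end;
end INMODE_SKETCH_PKG;
```

With this helper, the conditional assignment would reduce to INMODE<=PREADD_INMODE(USE_A1=>not ODD and K=0,SUBTRACT=>ANTISYMMETRIC); and return the same four constants. The explicit table in the listing has the advantage of showing every case next to the arithmetic it selects.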

 

The structural version of the generic symmetric/anti-symmetric FIR can now be tested the same way we did in Post 22; just set the BEHAVIORAL generic to FALSE.
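
As a sketch of what this might look like, the instantiation inside a testbench architecture could read as follows. The label uut and the signals CLK, CI, I and O are assumptions: they are expected to be declared in the enclosing testbench with suitable SFIXED and SFIXED_VECTOR ranges, and they are not copied from the Post 22 testbench. N=4 is the value used for the results in this post:

```vhdl
-- Fragment of a testbench architecture body; signal declarations are assumed to exist
uut:entity work.SYMMETRIC_SYSTOLIC_FIR generic map(N=>4,
                                                   ODD=>FALSE,
                                                   ANTISYMMETRIC=>FALSE,
                                                   BEHAVIORAL=>FALSE) -- FALSE selects the primitive instantiation version
                                       port map(CLK=>CLK,
                                                CI=>CI,
                                                I=>I,
                                                O=>O);
```

Switching BEHAVIORAL back to TRUE with the same stimulus gives a simple way to compare the two implementations against each other.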

 

As we saw in Post 22, the result of behavioral code inference for this type of symmetric FIR was not ideal, so here the structural version really makes a positive difference.

 

The inferred version of this FIR design uses 4 DSP48E2s, 216 FFs and 108 LUTs and runs at 779MHz in a ZU9EG-2.

 

The structural version of the FIR uses 4 DSP48E2s, 24 FFs and 12 LUTs (for the SDELAY module) and runs at 909MHz in the same ZU9EG-2. The maximum usable clock speed in this particular device is 775MHz, limited by the datasheet numbers rather than by timing closure of the FPGA design. In the fastest -3 speed grade the datasheet limit is 891MHz.

 

The lesson here is this: if behavioral code inference produces optimal results, it is the best coding technique and its use is highly recommended. So try it first, and if it works, move on to the next design challenge. If the inference result is not ideal, then switching to primitive instantiations is the way to go. It requires more design effort, but it guarantees the expected results and gives you complete control over the synthesis results.

 

For the particular example used in this post, the structural version uses 9x fewer FFs and LUTs (24 versus 216 FFs, 12 versus 108 LUTs) and has about 17% extra timing margin (909MHz versus 779MHz), which is equivalent to one FPGA speed grade: the structural design in a -2 will be faster than the inferred design in a -3! The lower fabric utilization can become important if the FIR has a higher order than the N=4 we used here, or if there are many instances of this FIR in the same design and we want to close timing at the highest possible clock rate.

 

In the next posts we will look at other uses of DSP48s and compare inference and instantiation coding techniques for them.

 
