## Using the Carry-Save Adder, Computing a Running Average

I will show in the next few posts some design examples where using a 3-input carry-save adder instead of the normal 2-input ripple-carry adder makes a significant difference. The first example is a running average, where we have a stream of input samples and we want to compute a continuous running average every clock, as the average of the last N samples. In mathematical terms:

y(n)=1/N*Sum(x(n-k)), k=0..N-1

As a first observation we will ignore the division by N - if N is a power of 2 then this is just a right shift, which in hardware terms is a free operation. If N is not a power of 2 then we can replace the division by a constant N with multiplication with another constant 1/N. Multiplication by a constant coefficient is another operation where 3-input carry-save adders can be used and we will look at this in a future post. So all we want to do is compute the sum of the last N input samples every clock. A naive implementation would require N-1 2-input adders or N/2 3-input adders but there is a better way. We can rewrite the previous equation as:

y(n)=y(n-1)+x(n)-x(n-N)

and now we need only one 3-input adder and a delay of N clocks, which is much more efficient, especially when N is large. In particular, we should be able to implement both additions with a single CSA 3-input adder, using a single ripple carry chain. We can even code this using behavioral VHDL-2008 very easily like this:

`library IEEE; use IEEE.STD_LOGIC_1164.all; use IEEE.NUMERIC_STD.all;  use work.TYPES_PKG.all;  entity FIR_AVG is  generic(N:INTEGER:=16);  port(CLK:in STD_LOGIC;       I:in SFIXED(0 downto -15);       O:out SFIXED(4 downto -15)); end FIR_AVG;  architecture TEST of FIR_AVG is  signal D:SFIXED_VECTOR(0 to N)(I'range); begin -- D(N)=I*z(-N), delay I by N clocks   D(D'low)<=I;  lk:for K in 1 to N generate       signal Q:SFIXED(I'range):=(others=>'0');     begin       process(CLK)       begin         if rising_edge(CLK) then            Q<=D(K-1);         end if;       end process;       D(K)<=Q;     end generate; -- O(n)=O(n-1)+I(n)-I(n-N)  process(CLK)  begin    if rising_edge(CLK) then       O<=RESIZE(O+I-D(N),O);    end if;  end process; end TEST; `

As usual, the code is very generic and it is written so that any N and any input and output data sizes can be used, but as an example I made N=16 and selected 16-bit input data and 20-bit output data - to maintain full precision and avoid overflows we have to allow for addition bit growth and the output range should be log2(N) bits larger than the input range. The problem with this version of the design is that the synthesis and implementation result is less than ideal, 49 LUT6es and 36 FFs. While the FF count is what was expected, the LUT6 count is almost double of what we want, due mostly to the synthesis tool inferring two carry chains when only one would be enough, and the SRL16s not being mapped two per LUT6.

Once we recognize that the default behavioral coding style does not produce the optimal expected result, it is time to leverage some of the generic structural modules we saw in earlier posts, namely SDELAY and CSA3, which were designed to address the two issues mentioned above. When using these modules the code is almost as compact and also as generic as the original behavioral version. The implementation results are now much better, 30 LUT6es and 36 FFs:

`library IEEE; use IEEE.STD_LOGIC_1164.all; use IEEE.NUMERIC_STD.all;  use work.TYPES_PKG.all;  entity FIR_AVG is  generic(N:INTEGER:=16);  port(CLK:in STD_LOGIC;       I:in SFIXED(0 downto -15);       O:out SFIXED(4 downto -15));end FIR_AVG;  architecture TEST of FIR_AVG is  signal ID:SFIXED(I'range);  signal P:SFIXED(O'range); begin -- ID=I*z(-N), delay I by N clocks   sd:entity work.SDELAY generic map(SIZE=>N)                        port map(CLK=>CLK,                                 I=>I,                                 O=>ID); -- O(n)=O(n-1)+I(n)-I(n-N)   csa:entity work.CSA3 generic map(PIPELINE=>TRUE,                                   NEGATIVE_A=>FALSE,                                   NEGATIVE_B=>TRUE)                       port map(CLK=>CLK,                                A=>I,                                B=>ID,                                C=>P,                                P=>P);  O<=RESIZE(P,O); end TEST;`

This type of running average is essentially a Finite Impulse Response (FIR) filter of order N, where all coefficients happen to be equal to 1. A different kind of running average algorithm is the exponential decay average circuit, which is an Infinite Impulse Response (IIR) filter of order 1 and is described by the equation:

y(n)=a*x(n)+(1-a)*y(n-1)

Right now we are only interested In the particular case where the coefficient a is a negative power of 2, when the two multiplications reduce to arithmetic right shifts, which are free operations in hardware and the entire design becomes a single 3-input carry-save adder:

y(n)=y(n-1)+ShiftRigth(x(n),k)-ShiftRight(y(n-1),k), where a=2**(-k)

Here is the VHDL-2008 implementation:

`library IEEE; use IEEE.STD_LOGIC_1164.all; use IEEE.NUMERIC_STD.all;  use work.TYPES_PKG.all;  entity IIR_AVG is  generic(K:INTEGER:=4); -- a=2**(-K)  port(CLK:in STD_LOGIC;       I:in SFIXED(0 downto -15);       O:out SFIXED(2 downto -15)); end IIR_AVG;  architecture TEST of IIR_AVG is  signal P:SFIXED(O'high downto O'low-K); begin -- O(n)=a*I(n)+(1-a)*O(n-1), where  a=2**(-K)   csa:entity work.CSA3 generic map(PIPELINE=>TRUE,                                   NEGATIVE_A=>FALSE,                                   NEGATIVE_B=>TRUE)                       port map(CLK=>CLK,                                A=>SHIFT_RIGHT(I,K),                                B=>SHIFT_RIGHT(P,K),                                C=>P,                                P=>P);  O<=RESIZE(P,O); end TEST;`

While in the FIR case you could tune the averager by changing the value of N, here you can do the same thing by changing the coefficient a. Both designs are good examples of multiplier-less filters, that also benefit from the 2x reduction in ripple-carry adders we can achieve by using the CSA3 module.

In conclusion, we were able to reduce the size of the design by a factor of almost 2x while achieving the same high speed, over 800 MHz maximum clock rates. If we had many instances of such a module in a design this could start making a significant difference. We were able to do this while keeping the top level of the design very generic, as small and easy to understand and maintain as the behavioral version. The extra work required to achieve a more efficient implementation is hidden in the generic SDELAY and CSA3 lower level modules.

I will continue next week with another example where a 3-input CSA can reduce the size of a large design by a factor of 2x, a generic and arbitrary size adder tree, where we will have another opportunity to also use recursive component instantiations. .

Back to the top:The Art of FPGA Design