The DSP48 Primitive - Wide XOR Mode


The DSP48 primitive can be used for more than multiply and accumulate. It can, for example, implement very wide XOR functions. The obvious case is a bitwise XOR of two 48-bit operands, the concatenation of A and B (written A:B) and the C input, producing a 48-bit result on the P output, which amounts to 48 XOR2 logic functions. Beyond that, a single DSP48E2 can also implement eight XOR12s, four XOR24s, two XOR48s or one XOR96. It is also possible to compute a 3-input bitwise XOR of A:B, C and P (or the P cascade input), which lets you build multi-clock XOR accumulators, or even wider XORs with multiple cascaded DSP48s.
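
To make the counting concrete, here is a small behavioral sketch of how the wide XOR results relate to the 48-bit ALU result. It is only a model of the arithmetic, not of the DSP48E2 pinout: the entity and port names are hypothetical, and it assumes each group covers contiguous bits of the A:B xor C result, which is my reading of the user guide. The point is that a "XOR12" reduces 6 bits of A:B and 6 bits of C, and the "XOR96" reduces all 48 + 48 input bits.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.all;

-- Behavioral sketch only (hypothetical names); requires VHDL-2008
entity WIDE_XOR_MODEL is
  port(AB,C:in STD_LOGIC_VECTOR(47 downto 0);    -- A:B concatenation and C operand
       XOR12:out STD_LOGIC_VECTOR(7 downto 0);   -- eight 12-input XORs
       XOR24:out STD_LOGIC_VECTOR(3 downto 0);   -- four 24-input XORs
       XOR48:out STD_LOGIC_VECTOR(1 downto 0);   -- two 48-input XORs
       XOR96:out STD_LOGIC);                     -- one 96-input XOR
end WIDE_XOR_MODEL;

architecture SKETCH of WIDE_XOR_MODEL is
  signal S:STD_LOGIC_VECTOR(47 downto 0);
begin
  S<=AB xor C; -- the 48 XOR2 functions, each with two inputs

  g12:for K in 0 to 7 generate
    XOR12(K)<=xor S(6*K+5 downto 6*K);     -- 6 ALU bits = 12 input bits
  end generate;

  g24:for K in 0 to 3 generate
    XOR24(K)<=xor S(12*K+11 downto 12*K);  -- 12 ALU bits = 24 input bits
  end generate;

  g48:for K in 0 to 1 generate
    XOR48(K)<=xor S(24*K+23 downto 24*K);  -- 24 ALU bits = 48 input bits
  end generate;

  XOR96<=xor S;                            -- all 48 ALU bits = 96 input bits
end SKETCH;
```

In the real primitive these results share the 8-bit XOROUT port, and the XORSIMD attribute selects which set is visible, so consult the user guide for the actual bit assignment.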


To keep things simple I will illustrate just one such function here, an XOR96, although the number of possible variations is very large. I will also compare an inferred, fabric-based XOR96 with an instantiated DSP48 primitive, again using the DSP48E2GW generic wrapper introduced in Post 23. A BOOLEAN generic parameter selects between the two alternative implementations. Both are functionally equivalent: an XOR96 logic function with a latency of 2 clocks:


library IEEE;
use IEEE.STD_LOGIC_1164.all;

use work.types_pkg.all;

entity XOR96 is
  generic(USE_LUTs:BOOLEAN:=FALSE); -- TRUE selects the fabric version (default value assumed)
  port(CLK:in STD_LOGIC;
       I:in STD_LOGIC_VECTOR(95 downto 0);
       O:out STD_LOGIC);            -- output port name assumed, lost from the listing
end XOR96;

architecture TEST of XOR96 is
begin
  i0:if USE_LUTs generate
       signal RI:STD_LOGIC_VECTOR(I'range):=(others=>'0');
     begin
       process(CLK) -- register all inputs so that we can measure fMAX
       begin
         if rising_edge(CLK) then
           RI<=I;
           O<=xor RI; -- VHDL-2008 wide XOR
         end if;
       end process;
     else i1: generate
       signal ALUMODE:STD_LOGIC_VECTOR(3 downto 0):=X"4";       -- X XOR Y XOR Z
       signal INMODE:STD_LOGIC_VECTOR(4 downto 0):="10001";     -- use A1 and B1
       signal OPMODE:STD_LOGIC_VECTOR(8 downto 0):="000001111"; -- W=0, Z=0, Y=C, X=A:B
       signal A:SFIXED(29 downto 0);
       signal B:SFIXED(17 downto 0);
       signal C:SFIXED(47 downto 0);
       signal D:SFIXED(26 downto 0);
       signal P:SFIXED(47 downto 0);
       signal XOROUT:STD_LOGIC_VECTOR(7 downto 0);
     begin
       A<=SFIXED(I(47 downto 18));
       B<=SFIXED(I(17 downto 0));
       C<=SFIXED(I(95 downto 48));
       dsp:entity work.DSP48E2GW generic map(USE_WIDEXOR=>"TRUE",    -- Use the Wide XOR function (FALSE, TRUE)
                                             XORSIMD=>"XOR24_48_96") -- Mode of operation for the Wide XOR (XOR12, XOR24_48_96)
                                 port map(ALUMODE=>ALUMODE,          -- 4-bit input: ALU control
                                          CLK=>CLK,                  -- 1-bit input: Clock
                                          INMODE=>INMODE,            -- 5-bit input: INMODE control
                                          OPMODE=>OPMODE,            -- 9-bit input: Operation mode - P<=xor A:B:C
                                          A=>A,                      -- 30-bit input: A data
                                          B=>B,                      -- 18-bit input: B data
                                          C=>C,                      -- 48-bit input: C data
                                          D=>D,                      -- 27-bit input: D data
                                          P=>P,                      -- 48-bit output: Primary data
                                          XOROUT=>XOROUT);           -- 8-bit output: XOR data
       O<=XOROUT(3); -- XOR96 result; bit index assumed, check it against the user guide
     end generate;
end TEST;
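
A quick way to check the claimed behavior is a self-checking testbench. The one below is a minimal sketch, not part of the original design: it assumes the entity exposes a USE_LUTs generic, a 96-bit input called I and a single-bit output called O, and that the latency is the 2 clocks stated above. It selects the fabric version so that no DSP48E2 simulation model or wrapper is needed, and it requires a VHDL-2008 simulator.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.all;

entity XOR96_TB is
end XOR96_TB;

architecture SIM of XOR96_TB is
  constant LATENCY:POSITIVE:=2; -- clocks from I to O, as stated in the text
  signal CLK:STD_LOGIC:='0';
  signal I:STD_LOGIC_VECTOR(95 downto 0):=(others=>'0');
  signal O:STD_LOGIC;           -- port names I and O are assumptions
begin
  CLK<=not CLK after 5 ns;      -- 100MHz simulation clock

  -- USE_LUTs=>TRUE exercises the inferred fabric version
  uut:entity work.XOR96 generic map(USE_LUTs=>TRUE)
                        port map(CLK=>CLK,I=>I,O=>O);

  process
    variable V:STD_LOGIC_VECTOR(I'range);
    variable EXPECTED:STD_LOGIC;
  begin
    for K in 0 to 95 loop
      -- set the K+1 least significant bits, so the expected parity alternates
      V:=(others=>'0');
      for J in 0 to K loop
        V(J):='1';
      end loop;
      if (K+1) mod 2=1 then
        EXPECTED:='1';
      else
        EXPECTED:='0';
      end if;
      wait until rising_edge(CLK);
      I<=V;                     -- captured by the design on the next rising edge
      for N in 1 to LATENCY loop
        wait until rising_edge(CLK);
      end loop;
      wait for 1 ns;            -- let O settle after the clock edge
      assert O=EXPECTED
        report "XOR96 mismatch at K="&INTEGER'image(K) severity ERROR;
    end loop;
    report "XOR96 testbench finished";
    std.env.stop;               -- VHDL-2008 way to end the simulation
  end process;
end SIM;
```

Running the same testbench with USE_LUTs=>FALSE would check the DSP48 version too, provided the wrapper and a DSP48E2 simulation model are compiled and the two versions really share the same latency.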


The fabric-based behavioral implementation uses 19 LUT6s and 97 FFs for the two pipeline levels. The instantiated DSP48 version uses one DSP48E2 primitive and nothing else. Both versions will run at the maximum speed permitted by the data sheet for a given FPGA family and speed grade, for example 891MHz in an UltraScale+ speed grade -3 device. As mentioned earlier, the UltraScale Architecture DSP Slice User Guide is mandatory reading for understanding how to configure the DSP48 attributes and ports.


Which of the two versions is to be preferred depends on many aspects. In general it is best to do multiplies and accumulates with the DSP48s, especially in designs that do a lot of signal processing. On the other hand, some designs, such as wired communications and networking, do not require any signal processing and leave the DSP48s unused. In these cases moving logic functions like counters, adders or, as here, wide XORs into the unused DSP48 primitives makes a lot of sense. Wide XORs are needed for computing and checking CRCs (cyclic redundancy checks) and for FEC (forward error correction) at very high data rates, where a lot of such XOR logic functions are needed.
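
To see why CRC and FEC logic maps onto wide XORs: in a parallel implementation of a linear code, each output bit is the XOR of a fixed subset of the input bits, that is, a wide XOR of the data ANDed with a constant mask. The sketch below shows one such term. The entity name, port names and mask value are all hypothetical, and the mask is an arbitrary placeholder that is not derived from any real CRC polynomial.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.all;

-- One parity term of a parallel CRC/FEC equation (hypothetical sketch, VHDL-2008)
entity PARITY_TERM is
  generic(MASK:STD_LOGIC_VECTOR(95 downto 0):=X"A5A5A5A5A5A5A5A5A5A5A5A5"); -- placeholder mask
  port(CLK:in STD_LOGIC;
       D:in STD_LOGIC_VECTOR(95 downto 0);  -- data word, possibly including state bits
       Q:out STD_LOGIC);                    -- XOR of the data bits selected by MASK
end PARITY_TERM;

architecture SKETCH of PARITY_TERM is
begin
  process(CLK)
  begin
    if rising_edge(CLK) then
      Q<=xor (D and MASK); -- masked wide XOR, up to 96 inputs
    end if;
  end process;
end SKETCH;
```

Since the mask is a constant, a synthesis tool will simply drop the unselected inputs when this is inferred in fabric. A DSP48-based version would instead tie the unselected A:B and C bits to zero, with one such term per output bit of the code.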


In the next posts we will continue examining different ways of using the DSP48 primitive.

