Looking through archive.org I found an old web page of mine  it was
archived in February 1999, but if I recall correctly was written three
or four years before that.
And, hey, look! I used to be an engineer!
Adder Architectures
Disclaimer
HTML generated semiautomatically from an old email to the
hotwired mailing list that was written very late at night. Don’t
treat as gospel….
Notation
x_{i} is bit i of the number x, in standard twos complement form.
 · is logical and
 + is logical or
 ¤ is logical xor
Basics
We’re adding two nbit numbers a & b, and a carryin c_{0} to give an
nbit result z, and a carryout c_{n}
z = a plus b plus c_{0}
The operands and result are of the form
a = { a_{n1}, a_{n2}, … a_{1}, a_{0} }
and so on.
Logically addition is this:
For all 0 <= i < n
z_{i} = a_{i} ¤ b_{i} ¤ c_{i}
c_{i+1} = x_{i} · y_{i} + c_{i} · (x_{i} + y_{i}) [1]
This is often reformulated in terms of propagate (p) and generate (g) signals:
g_{i} = a_{i} · b_{i} [2]
p_{i} = a_{i} ¤ b_{i} [3]
(The alternative p_{i} = a_{i} + b_{i} is sometimes used)
(Often a kill signal (k) is used too: k_{i} = not(a_{i} + b_{i}) )
Then the carry signals can be given as:
c_{i+1} = g_{i} + c_{i} · p_{i} substitute [2] & [3] in [1] [4]
and
z_{i} = p_{i} ¤ c_{i}
The expensive bit about addition is generating c_{i}. Once all the c_{i}
are generated the z_{i} can be generated in constant time (not a negligable
time, but at least it’s constant with operand length).
Different adder architectures
Ripple Carry (RCA)
A ripple carry adder (RCA) implements [1] directly. You all understand
ripple carry adders, so I’ll just say that they are cheap and slow.
The critical path in an RCA is from c_{0}, a_{0} or b_{0} through all the
full adders to z_{n1} or c_{n}.
Carry lookahead (CLA)
For a four bit adder we can expand [4] to give
c_{1} = g_{0} + c_{0}·p_{0} [5]
c_{2} = g_{1} + g_{0}·p_{1} + c_{0}·p_{0}·p_{1} [6]
c_{3} = g_{2} + g_{1}·p_{2} + g_{0}·p_{1}_{p}_{2} + c_{0}·p_{0}·p_{1}·p_{2} [7]
c_{4} = g_{3} + g_{2}·p_{3} + g_{1}·p_{2}·p_{3} + g_{0}·p_{1}·p_{2}·p_{3} + c_{0}·p_{0}·p_{1}·p_{2}·p_{3} [8]
This is reasonable for 4 bits, but will get grotesquely large in
terms of are and fanout if taken much further.
One option is to divide the operands into groups of four bits, use
carrylookahead within each group and ripple the carries between groups.
Divide and Conquer
If we can improve on speed of rippling carries within each group, then
surely we can improve on the speed of rippling carries between groups
by a similar approach.
Consider Group_{0}, consisting of
(g_{0}  g_{3}, p_{0}  p_{3})
Defining a `group generate’ g* and a `group propagate’ p*:
For Group_{0}:
g*_{0} = g_{3} + g_{2}·p_{3} + g_{1}·p_{2}·p_{3} + g_{0}·p_{1}·p_{2}·p_{3}
p*_{0} = p_{0}·p_{1}·p_{2}·p_{3}
and identically for all other groups.
Now
c_{4} = g8_{0} + c_{0}·p*_{0}
c_{8} = g*_{1} + g*_{0}·p*_{1} + c_{0}·p*_{0}·p*_{1}
c_{12} = g*_{2} + g*_{1}·p*_{2} + g*_{0}·p*_{1}·p*_{2} + c_{0}·p*_{0}·p*_{1}·p*_{2}
c_{16} = [deleted  it’s huge….]
These are identical to equations [58] used to generate carries within
groups.
There’s no need to stop there.
The fourbit groups can be combined into sixteenbit groups of groups.
Each can produce generate (g**) and propagate (p**) signals, combined as
above to give c_{16}, c_{32}, c_{48} and c_{64}.
This divide and conquer approach will eventually generate c_{i} for the
entire word width.
<This would be a lot clearer with diagrams>
For board level adders CLA is pretty good, ‘cos it can use standard MSI
lookahead generator ICs for nearly everything.
It’s not bad in CMOS, but it’s not particularly good either. I’d use
one of the following architectures in preference. The only exception might
be in a cellbased design where an optimised lookahead cell is provided.
Carry Skip adder
“In VLSI technology the carryskip adder is comparable in speed to the carry
lookahead technique (for commonly used word lengths) while it requires
less chip area and consumes less power.”
– Computer Arithmetic Algorithms, Israel Koren
…and that’s why it’s the adder I’d use for a 32 bit system, and probably
for a 64 bit system too.
Carry propagation can skip any stage for which p_{i} = 1 (ie a_{i} != b_{i}).
Several consecutive stages can be skipped if p_{i} =1 for each stage.
A carryskip adder is divided into groups of consecutive stages, with a
simple ripple carry scheme in each group.
Each group generates a group propagate signal, p*_{i}.
For Group_{i}, consisting of k stages j, j+1, … j+k1
p*_{i} = p_{j} · p_{j+1} · … · p_{j+k1}
This is used to allow an incoming carry into the group to skip over all the
stages in the group and generate a group carry out.
Group_{i}Carry_{out} = c_{j+k} + p*_{i} · Group_{i}Carry_{in}
(c_{j+k} is the normal ripple carry out from the most significat stage
in the group)
The critical path through a carryskip adder is via ripple carry through,
one of the groups, and via the skip carry chain through the remainder of
the groups.
(Think about it. A carry coming out of a group via the ripple
chain must have been generated within the group  if it was generated
before the group, p*_{i} would have been 1 and the stage would have been
skipped. So the critical path will travel through only one group).
In a carryskip adder the groups will not all be the same size. The optimal
division of an nbit adder into carryskip groups depends on the
characteristics of the target technology.
A 32 bit adder might have 10 groups of sizes {1, 2, 3, 4, 5, 6, 5, 3, 2, 1}
for a typical technology.
VLSI implementations of carryskip adders tend to be quite small  using the
32 bit adder given above the extra cost over an RCA is about 20 extra gates.
While a single level of carryskip speeds things up a lot a second level of
skip, skipping over more than one group can speed things up a little further
for very little extra cost.
Carry Select adder
The reason carry propagation is slow in a ripple adder is because each stage
needs to have a_{i}, b_{i} and c_{i} available before it can calculate c_{i}+i.
One way of removing this dependency on c_{i} is to calculate both
a_{i} + b_{i} + 0, and a_{i} + b_{i} + 1, then choose the appropriate result when
c_{i} becomes available.
This is the basic trick of the carry select adder.
For 32bit operands the 32 stage adder could consist of four 8 bit
groups.
Group_{0} is purely an 8bit ripple carry adder, with c_{0} as it’s carry input.
The sum output from this RCA goes directly to the adder output.
Group_{1} has two 8bit ripple cary adders, one with a C_{in} of 0, the other with
a C_{in} of 1. The sum outputs from these two RCAs go to a 2:1 multiplexor
controlled by the C_{out} of Group_{0}. The output of the mux goes to the adder
output.
Group_{2} has two 8bit RCAs, the same as group_{1}. The sum outputs of this go
to two 2:1 muxes. One is controlled by the C_{out} of the Group_{1} chain with a
C_{in} of 0, so the mux output is what the sum would be if the C_{out} of Group_{0}
were 0. The other is controlled by the C_{out} of the Group_{1} chain with a C_{in}
of 1, giving the sum if the C_{out} of Group_{1} were 1.
These two mux outputs are fed to another 2:1 mux, controlled by the C_{out} of
Group_{0}, to select the correct sum to send to the adder output.
Group_{3} is similar. The sums from the two RCAs go through two 2:1 muxes
controled by the two C_{out}s of the two RCAs in Group_{2}, giving two values, one
if the C_{out} of Group_{1} is 1, one if it is 0.
These two values are passed to another pair of 2:1 muxes, controlled by the
two C_{out}s of Group_{1}, giving two values, one if the C_{out} of Group_{0} is 1, the
other if it is 0.
The correct one of these two values is selected by a final 2:1 controlled by
the C_{out} of Group_{0}.
<This is far too complicated for ASCII art at this time of night. If there’s
any interest I might HTMLify it>
Carry select adders tend to be slightly faster than skip adders, particularly
for wide operands. They will typically consume a little over twice the area
of a ripple carry adder. The layout for a select adder tends to be very
regular, which can be a big advantage for datapath compilers/tilers.
One of the DEC Alphas uses a slight variant on this approach. There’s a
paper in Vol 27, No 11, November 1992 of the IEEE Journal of SolidState
Circuits giving a few paragraphs about the adder, and a nice diagram.
Prefix adders
These are bizarre to think about, but very powerful, particularly for longer
word lengths.
It’s a little mathematical in places….
Define an operator ‘o’:
(g,p) o (g’,p’) = (g + (p · g’), p · p’)
Define G_{i} & P_{i}:
(G_{1},P_{1}) = (g_{1},p_{1})
(G_{i},P_{i}) = (g_{i},p_{i}) o (G_{i1},P_{i1}) 2 <= i <= n
Then
c_{i} = G_{i} for 1 <= i <= n [1]
There’s a fairly easy, but not too interesting, inductive proof of [1].
Next, ‘o’ is associative, ie
(g_{1},p_{1}) o ( (g_{2},p_{2}) o (g_{3},p_{3}) )
= ( (g_{1},p_{1}) o (g_{2},p_{2}) ) o (g_{3},p_{3}) for all (g_{i},p_{i}) [2]
Again I’ll miss out the proof  it’s just an ‘expand both sides and notice
they are identical’.
So to find c_{i} it suffices to calculate
(G_{i},P_{i}) = (g_{i},p_{1}) o (g_{i1},p_{i1}) o ··· o (g_{1},p_{1})
and by [2] this can be calculated in any order.
The BrentKung architecture
(The original reference is Brent & Kung, IEEE Transactions on Computers,
Vol C31,No 3, March 1982)
First consider the simple problem of calculating just c_{n}, for n=16.
Since ‘o’ is associative (G_{16},P_{16}) can be generated by a binary tree:
Each wire in the diagram carries a pair of signals (g,p). (g_{i},p_{i}) are
fed in at the base.
Each node ‘O’ performs this operation:
(g_{in},p_{in}) o (g’_{in},p’_{in})


O
\
 \
 \
 \
(g_{in},p_{in}) (g’_{in},p’_{in})
(G_{16},P_{16})


O
\____
 \______
 \________
 \
O O
\ \
 \__  \__
 \__  \__
 \__  \__
 \  \
O O O O
\_ \_ \_ \_
 \_  \_  \_  \_
 \  \  \  \
O O O O O O O O
\ \ \ \ \ \ \ \
 \  \  \  \  \  \  \  \
 \  \  \  \  \  \  \  \
               
               
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
(g_{i},p_{i})
This will give (G_{16},p_{16}) and hence c_{16}. To generate c_{15} downto c_{1}
another similar tree is required:
c_{i}
16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
               
 O  O  O  O  O  O  O  
 \  \  \  \  \  \  \  
  \  \  \  \  \  \  \ 
  O    O    O     
  \_   \_   \_    
   \_    \_    \_    
    \    \    \   
    O           
    \_          
     \__         
      \__        
       \        
       \        
        \       
               
               
               
O               
\____             
  \_______           
     \________        
        \       
O        O       
\        \       
 \__        \__      
  \__        \__     
   \__        \__    
    \        \   
O    O    O    O   
\_   \_   \_   \_  
 \_    \_    \_    \_  
  \    \    \    \ 
O  O  O  O  O  O  O  O 
\  \  \  \  \  \  \  \ 
 \  \  \  \  \  \  \  \
               
               
               
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
(g_{i},p_{i})
There are many varying architectures for prefix adders, driven by
speed/area/layout complexity tradeoffs.