Looking through archive.org I found an old web page of mine - it was archived in February 1999, but if I recall correctly was written three or four years before that.

And, hey, look! I used to be an engineer!

# Adder Architectures

### Disclaimer

HTML generated semi-automatically from an old email to the hotwired mailing list that was written very late at night. Don’t treat as gospel….

### Notation

x_{i} is bit i of the number x, in standard twos complement form.

- · is logical and
- + is logical or
- ¤ is logical xor

### Basics

We’re adding two n-bit numbers a & b, and a carry-in c_{0} to give an
n-bit result z, and a carry-out c_{n}

z = a plus b plus c_{0}

The operands and result are of the form

a = { a_{n-1}, a_{n-2}, … a_{1}, a_{0}}

and so on.

Logically addition is this:

For all 0 <= i < n

z_{i}= a_{i}¤ b_{i}¤ c_{i}

c_{i+1}= x_{i}· y_{i}+ c_{i}· (x_{i}+ y_{i})[1]

This is often reformulated in terms of propagate (p) and generate (g) signals:

g_{i}= a_{i}· b_{i}[2]

p_{i}= a_{i}¤ b_{i}[3]

(The alternative p_{i} = a_{i} + b_{i} is sometimes used)

(Often a kill signal (k) is used too: k_{i} = not(a_{i} + b_{i}) )

Then the carry signals can be given as:

c_{i+1}= g_{i}+ c_{i}· p_{i}substitute[2]&[3]in[1][4]

and

z_{i}= p_{i}¤ c_{i}

The expensive bit about addition is generating c_{i}. Once all the c_{i}
are generated the z_{i} can be generated in constant time (not a negligable
time, but at least it’s constant with operand length).

## Different adder architectures

### Ripple Carry (RCA)

A ripple carry adder (RCA) implements **[1]** directly. You all understand
ripple carry adders, so I’ll just say that they are cheap and slow.

The critical path in an RCA is from c_{0}, a_{0} or b_{0} through all the
full adders to z_{n-1} or c_{n}.

### Carry lookahead (CLA)

For a four bit adder we can expand **[4]** to give

c_{1}= g_{0}+ c_{0}·p_{0}[5]

c_{2}= g_{1}+ g_{0}·p_{1}+ c_{0}·p_{0}·p_{1}[6]

c_{3}= g_{2}+ g_{1}·p_{2}+ g_{0}·p_{1}_{p}_{2}+ c_{0}·p_{0}·p_{1}·p_{2}[7]

c_{4}= g_{3}+ g_{2}·p_{3}+ g_{1}·p_{2}·p_{3}+ g_{0}·p_{1}·p_{2}·p_{3}+ c_{0}·p_{0}·p_{1}·p_{2}·p_{3}[8]

This is reasonable for 4 bits, but will get grotesquely large in terms of are and fan-out if taken much further.

One option is to divide the operands into groups of four bits, use carry-lookahead within each group and ripple the carries between groups.

#### Divide and Conquer

If we can improve on speed of rippling carries within each group, then surely we can improve on the speed of rippling carries between groups by a similar approach.

Consider Group_{0}, consisting of
(g_{0} - g_{3}, p_{0} - p_{3})

Defining a `group generate’ g* and a `group propagate’ p*:

For Group_{0}:

g*_{0}= g_{3}+ g_{2}·p_{3}+ g_{1}·p_{2}·p_{3}+ g_{0}·p_{1}·p_{2}·p_{3}

p*_{0}= p_{0}·p_{1}·p_{2}·p_{3}

and identically for all other groups.

Now

c_{4}= g8_{0}+ c_{0}·p*_{0}

c_{8}= g*_{1}+ g*_{0}·p*_{1}+ c_{0}·p*_{0}·p*_{1}

c_{12}= g*_{2}+ g*_{1}·p*_{2}+ g*_{0}·p*_{1}·p*_{2}+ c_{0}·p*_{0}·p*_{1}·p*_{2}

c_{16}= [deleted - it’s huge….]

These are identical to equations **[5-8]** used to generate carries within
groups.

There’s no need to stop there.

The four-bit groups can be combined into sixteen-bit groups of groups.

Each can produce generate (g**) and propagate (p**) signals, combined as
above to give c_{16}, c_{32}, c_{48} and c_{64}.

This divide and conquer approach will eventually generate c_{i} for the
entire word width.

<This would be a lot clearer with diagrams>

For board level adders CLA is pretty good, ‘cos it can use standard MSI lookahead generator ICs for nearly everything.

It’s not *bad* in CMOS, but it’s not particularly good either. I’d use
one of the following architectures in preference. The only exception might
be in a cell-based design where an optimised lookahead cell is provided.

### Carry Skip adder

“In VLSI technology the carry-skip adder is comparable in speed to the carry look-ahead technique (for commonly used word lengths) while it requires less chip area and consumes less power.”

– Computer Arithmetic Algorithms, Israel Koren

…and that’s why it’s the adder I’d use for a 32 bit system, and probably for a 64 bit system too.

Carry propagation can skip any stage for which p_{i} = 1 (ie a_{i} != b_{i}).
Several consecutive stages can be skipped if p_{i} =1 for each stage.

A carry-skip adder is divided into groups of consecutive stages, with a simple ripple carry scheme in each group.

Each group generates a group propagate signal, p*_{i}.

For Group_{i}, consisting of k stages j, j+1, … j+k-1

p*_{i}= p_{j}· p_{j+1}· … · p_{j+k-1}

This is used to allow an incoming carry into the group to skip over all the stages in the group and generate a group carry out.

Group_{i}-Carry_{out}= c_{j+k}+ p*_{i}· Group_{i}-Carry_{in}

(c_{j+k} is the normal ripple carry out from the most significat stage
in the group)

The critical path through a carry-skip adder is via ripple carry through, one of the groups, and via the skip carry chain through the remainder of the groups.

(Think about it. A carry coming out of a group via the ripple
chain *must* have been generated within the group - if it was generated
before the group, p*_{i} would have been 1 and the stage would have been
skipped. So the critical path will travel through only one group).

In a carry-skip adder the groups will not all be the same size. The optimal division of an n-bit adder into carry-skip groups depends on the characteristics of the target technology.

A 32 bit adder might have 10 groups of sizes {1, 2, 3, 4, 5, 6, 5, 3, 2, 1} for a typical technology.

VLSI implementations of carry-skip adders tend to be quite small - using the 32 bit adder given above the extra cost over an RCA is about 20 extra gates.

While a single level of carry-skip speeds things up a lot a second level of skip, skipping over more than one group can speed things up a little further for very little extra cost.

### Carry Select adder

The reason carry propagation is slow in a ripple adder is because each stage
needs to have a_{i}, b_{i} and c_{i} available before it can calculate c_{i}+i.
One way of removing this dependency on c_{i} is to calculate both
a_{i} + b_{i} + 0, and a_{i} + b_{i} + 1, then choose the appropriate result when
c_{i} becomes available.

This is the basic trick of the carry select adder.

For 32bit operands the 32 stage adder could consist of four 8 bit groups.

Group_{0} is purely an 8bit ripple carry adder, with c_{0} as it’s carry input.
The sum output from this RCA goes directly to the adder output.

Group_{1} has two 8bit ripple cary adders, one with a C_{in} of 0, the other with
a C_{in} of 1. The sum outputs from these two RCAs go to a 2:1 multiplexor
controlled by the C_{out} of Group_{0}. The output of the mux goes to the adder
output.

Group_{2} has two 8bit RCAs, the same as group_{1}. The sum outputs of this go
to two 2:1 muxes. One is controlled by the C_{out} of the Group_{1} chain with a
C_{in} of 0, so the mux output is what the sum would be if the C_{out} of Group_{0}
were 0. The other is controlled by the C_{out} of the Group_{1} chain with a C_{in}
of 1, giving the sum if the C_{out} of Group_{1} were 1.

These two mux outputs are fed to another 2:1 mux, controlled by the C_{out} of
Group_{0}, to select the correct sum to send to the adder output.

Group_{3} is similar. The sums from the two RCAs go through two 2:1 muxes
controled by the two C_{out}s of the two RCAs in Group_{2}, giving two values, one
if the C_{out} of Group_{1} is 1, one if it is 0.

These two values are passed to another pair of 2:1 muxes, controlled by the
two C_{out}s of Group_{1}, giving two values, one if the C_{out} of Group_{0} is 1, the
other if it is 0.

The correct one of these two values is selected by a final 2:1 controlled by
the C_{out} of Group_{0}.

<This is far too complicated for ASCII art at this time of night. If there’s any interest I might HTMLify it>

Carry select adders tend to be slightly faster than skip adders, particularly for wide operands. They will typically consume a little over twice the area of a ripple carry adder. The layout for a select adder tends to be very regular, which can be a big advantage for datapath compilers/tilers.

One of the DEC Alphas uses a slight variant on this approach. There’s a paper in Vol 27, No 11, November 1992 of the IEEE Journal of Solid-State Circuits giving a few paragraphs about the adder, and a nice diagram.

### Prefix adders

These are bizarre to think about, but very powerful, particularly for longer word lengths.

It’s a little mathematical in places….

Define an operator ‘o’:

(g,p) o (g’,p’) = (g + (p · g’), p · p’)

Define G_{i} & P_{i}:

(G_{1},P_{1}) = (g_{1},p_{1})

(G_{i},P_{i}) = (g_{i},p_{i}) o (G_{i-1},P_{i-1}) 2 <= i <= n

Then

c_{i}= G_{i}for 1 <= i <= n[1]

There’s a fairly easy, but not too interesting, inductive proof of **[1]**.

Next, ‘o’ is associative, ie

(g_{1},p_{1}) o ( (g_{2},p_{2}) o (g_{3},p_{3}) )

= ( (g_{1},p_{1}) o (g_{2},p_{2}) ) o (g_{3},p_{3}) for all (g_{i},p_{i})[2]

Again I’ll miss out the proof - it’s just an ‘expand both sides and notice they are identical’.

So to find c_{i} it suffices to calculate

(G_{i},P_{i}) = (g_{i},p_{1}) o (g_{i-1},p_{i-1}) o ··· o (g_{1},p_{1})

and by **[2]** this can be calculated in any order.

#### The Brent-Kung architecture

(The original reference is Brent & Kung, IEEE Transactions on Computers, Vol C-31,No 3, March 1982)

First consider the simple problem of calculating just c_{n}, for n=16.
Since ‘o’ is associative (G_{16},P_{16}) can be generated by a binary tree:

Each wire in the diagram carries a pair of signals (g,p). (g_{i},p_{i}) are
fed in at the base.

Each node ‘O’ performs this operation:

(g_{in},p_{in}) o (g’_{in},p’_{in})

| | O |\ | \ | \ | \

(g_{in},p_{in}) (g’_{in},p’_{in})

(G_{16},P_{16})

| | O |\____ | \______ | \________ | \ O O |\ |\ | \__ | \__ | \__ | \__ | \__ | \__ | \ | \ O O O O |\_ |\_ |\_ |\_ | \_ | \_ | \_ | \_ | \ | \ | \ | \ O O O O O O O O |\ |\ |\ |\ |\ |\ |\ |\ | \ | \ | \ | \ | \ | \ | \ | \ | \ | \ | \ | \ | \ | \ | \ | \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

* (g _{i},p_{i})*

This will give (G_{16},p_{16}) and hence c_{16}. To generate c_{15} downto c_{1}
another similar tree is required:

*c _{i}*

16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 | | | | | | | | | | | | | | | | | O | O | O | O | O | O | O | | | |\ | |\ | |\ | |\ | |\ | |\ | |\ | | | | \| | \| | \| | \| | \| | \| | \| | | | O | | | O | | | O | | | | | | | |\_| | | |\_| | | |\_| | | | | | | | \_ | | | \_ | | | \_ | | | | | | | | \| | | | \| | | | \| | | | | | | | O | | | | | | | | | | | | | | | |\_| | | | | | | | | | | | | | | | \__| | | | | | | | | | | | | | | | \__| | | | | | | | | | | | | | | | \ | | | | | | | | | | | | | | | |\ | | | | | | | | | | | | | | | | \| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | O | | | | | | | | | | | | | | | |\____| | | | | | | | | | | | | | | | \_______ | | | | | | | | | | | | | | | | \________ | | | | | | | | | | | | | | | | \| | | | | | | | O | | | | | | | O | | | | | | | |\ | | | | | | | |\ | | | | | | | | \__ | | | | | | | \__ | | | | | | | | \__ | | | | | | | \__ | | | | | | | | \__ | | | | | | | \__ | | | | | | | | \| | | | | | | | \| | | | O | | | O | | | O | | | O | | | |\_| | | |\_| | | |\_| | | |\_| | | | \_ | | | \_ | | | \_ | | | \_ | | | | \| | | | \| | | | \| | | | \| | O | O | O | O | O | O | O | O | |\ | |\ | |\ | |\ | |\ | |\ | |\ | |\ | | \| | \| | \| | \| | \| | \| | \| | \| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

(g_{i},p_{i})

There are many varying architectures for prefix adders, driven by speed/area/layout complexity tradeoffs.