Looking through archive.org I found an old web page of mine - it was archived in February 1999, but if I recall correctly was written three or four years before that.

And, hey, look! I used to be an engineer!

Adder Architectures

Disclaimer

HTML generated semi-automatically from an old email to the hotwired mailing list that was written very late at night. Don’t treat as gospel….

Notation

x_i is bit i of the number x, in standard two's complement form.

· is logical and
+ is logical or
¤ is logical xor

Basics

We’re adding two n-bit numbers a & b, and a carry-in c₀ to give an n-bit result z, and a carry-out c_n

z = a plus b plus c₀

The operands and result are of the form

a = { a_n-1, a_n-2, … a₁, a₀ }

and so on.

Logically addition is this:

For all 0 <= i < n

z_i = a_i ¤ b_i ¤ c_i

c_i+1 = a_i · b_i + c_i · (a_i + b_i) [1]

This is often reformulated in terms of propagate (p) and generate (g) signals:

g_i = a_i · b_i [2]

p_i = a_i ¤ b_i [3]

(The alternative p_i = a_i + b_i is sometimes used)

(Often a kill signal (k) is used too: k_i = not(a_i + b_i) )

Then the carry signals can be given as:

c_i+1 = g_i + c_i · p_i [4]

(by substituting [2] & [3] into [1])

and

z_i = p_i ¤ c_i

The expensive bit about addition is generating c_i. Once all the c_i are generated the z_i can be generated in constant time (not a negligible time, but at least it’s constant with operand length).
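As a quick sanity check on [2], [3] and [4], here is a minimal Python sketch of a bit-serial add built only from the generate and propagate signals. The function name `add_with_pg` and the choice to pass the word length `n` explicitly are my own assumptions for illustration, not anything from the original email.

```python
def add_with_pg(a, b, c0, n):
    """Add two n-bit integers using generate/propagate signals.

    Returns (z, c_n): the n-bit sum and the carry-out.
    """
    c = c0
    z = 0
    for i in range(n):
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        g_i = a_i & b_i          # generate,  equation [2]
        p_i = a_i ^ b_i          # propagate, equation [3]
        z |= (p_i ^ c) << i      # z_i = p_i xor c_i
        c = g_i | (c & p_i)      # c_i+1,     equation [4]
    return z, c


# 11 + 6 + 1 = 18, so the 4-bit sum is 2 with a carry-out of 1
print(add_with_pg(0b1011, 0b0110, 1, 4))   # (2, 1)
```

The loop computes the carries one after another, which is exactly the slow part that the architectures below try to speed up.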

Different adder architectures

Ripple Carry (RCA)

A ripple carry adder (RCA) implements [1] directly. You all understand ripple carry adders, so I’ll just say that they are cheap and slow.

The critical path in an RCA is from c₀, a₀ or b₀ through all the full adders to z_n-1 or c_n.
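To make the critical path concrete, the sketch below models a full adder straight from [1] and records every carry along the chain. The names `full_adder` and `ripple_carry_add` are hypothetical, and this is a behavioural model only: it says nothing about real gate delays.

```python
def full_adder(a_i, b_i, c_i):
    """One stage: sum bit and carry-out, per equation [1]."""
    z_i = a_i ^ b_i ^ c_i
    c_next = (a_i & b_i) | (c_i & (a_i | b_i))
    return z_i, c_next


def ripple_carry_add(a, b, c0, n):
    """Return (z, carries) where carries[i] is c_i for 0 <= i <= n."""
    carries = [c0]
    z = 0
    for i in range(n):
        z_i, c_next = full_adder((a >> i) & 1, (b >> i) & 1, carries[-1])
        z |= z_i << i
        carries.append(c_next)
    return z, carries


# Worst case: the carry-in ripples through every one of the 8 stages.
z, carries = ripple_carry_add(0xFF, 0x00, 1, 8)
print(z, carries)   # 0 [1, 1, 1, 1, 1, 1, 1, 1, 1]
```

With a = 0xFF and b = 0 every stage propagates, so the carry-out depends on c₀ through all eight full adders, which is the critical path described above.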

Carry lookahead (CLA)

For a four bit adder we can expand [4] to give

c₁ = g₀ + c₀·p₀ [5]

c₂ = g₁ + g₀·p₁ + c₀·p₀·p₁ [6]

c₃ = g₂ + g₁·p₂ + g₀·p₁·p₂ + c₀·p₀·p₁·p₂ [7]

c₄ = g₃ + g₂·p₃ + g₁·p₂·p₃ + g₀·p₁·p₂·p₃ + c₀·p₀·p₁·p₂·p₃ [8]

This is reasonable for 4 bits, but will get grotesquely large in terms of area and fan-out if taken much further.

One option is to divide the operands into groups of four bits, use carry-lookahead within each group and ripple the carries between groups.
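A minimal sketch of that option, assuming a 16 bit word: `cla4` evaluates [5] to [8] directly for one four-bit group, and `add16_cla_groups` ripples the group carry-outs from one group to the next. Both function names are made up for this example.

```python
def cla4(a, b, c0):
    """4-bit carry-lookahead add; returns (4-bit sum, c4)."""
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]
    c1 = g[0] | (c0 & p[0])                                        # [5]
    c2 = g[1] | (g[0] & p[1]) | (c0 & p[0] & p[1])                 # [6]
    c3 = (g[2] | (g[1] & p[2]) | (g[0] & p[1] & p[2])
          | (c0 & p[0] & p[1] & p[2]))                             # [7]
    c4 = (g[3] | (g[2] & p[3]) | (g[1] & p[2] & p[3])
          | (g[0] & p[1] & p[2] & p[3])
          | (c0 & p[0] & p[1] & p[2] & p[3]))                      # [8]
    carries = [c0, c1, c2, c3]
    z = 0
    for i in range(4):
        z |= (p[i] ^ carries[i]) << i
    return z, c4


def add16_cla_groups(a, b, c0):
    """16-bit add: lookahead inside each group, ripple between groups."""
    z = 0
    carry = c0
    for k in range(0, 16, 4):
        s, carry = cla4((a >> k) & 0xF, (b >> k) & 0xF, carry)
        z |= s << k
    return z, carry
```

Note that every carry inside `cla4` is a two-level function of the inputs and c₀ only; none of them waits for the carry below it.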

Divide and Conquer

If we can improve on speed of rippling carries within each group, then surely we can improve on the speed of rippling carries between groups by a similar approach.

Consider Group₀, consisting of (g₀ - g₃, p₀ - p₃)

Defining a ‘group generate’ g* and a ‘group propagate’ p*:

For Group₀:

g*₀ = g₃ + g₂·p₃ + g₁·p₂·p₃ + g₀·p₁·p₂·p₃

p*₀ = p₀·p₁·p₂·p₃

and identically for all other groups.

Now

c₄ = g*₀ + c₀·p*₀

c₈ = g*₁ + g*₀·p*₁ + c₀·p*₀·p*₁

c₁₂ = g*₂ + g*₁·p*₂ + g*₀·p*₁·p*₂ + c₀·p*₀·p*₁·p*₂

c₁₆ = [deleted - it’s huge….]

These are identical to equations [5-8] used to generate carries within groups.

There’s no need to stop there.

The four-bit groups can be combined into sixteen-bit groups of groups.

Each can produce generate (g) and propagate (p) signals, combined as above to give c₁₆, c₃₂, c₄₈ and c₆₄.

This divide and conquer approach will eventually generate c_i for the entire word width.
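Here is a rough two-level sketch for a 16 bit adder, under the assumption of four groups of four bits. The point it tries to show is that one helper (called `lookahead` here, a name invented for this example) serves at both levels: once on the bit-level (g, p) pairs inside a group, and once on the group-level (g*, p*) pairs.

```python
def lookahead(g, p, c_in):
    """Carries for a block of four (g, p) pairs, in the form of [5]-[8].

    Returns ([c_in, c1, c2, c3, c4], g_star, p_star).
    """
    c1 = g[0] | (c_in & p[0])
    c2 = g[1] | (g[0] & p[1]) | (c_in & p[0] & p[1])
    c3 = (g[2] | (g[1] & p[2]) | (g[0] & p[1] & p[2])
          | (c_in & p[0] & p[1] & p[2]))
    g_star = (g[3] | (g[2] & p[3]) | (g[1] & p[2] & p[3])
              | (g[0] & p[1] & p[2] & p[3]))
    p_star = p[0] & p[1] & p[2] & p[3]
    c4 = g_star | (c_in & p_star)
    return [c_in, c1, c2, c3, c4], g_star, p_star


def add16_two_level(a, b, c0):
    g = [(a >> i) & (b >> i) & 1 for i in range(16)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(16)]
    # Level 1: group generate/propagate. These need no carry at all.
    stars = [lookahead(g[k:k + 4], p[k:k + 4], 0)[1:] for k in range(0, 16, 4)]
    g_star = [s[0] for s in stars]
    p_star = [s[1] for s in stars]
    # Level 2: the same equations applied to groups give c0, c4, c8, c12, c16.
    group_carry, _, _ = lookahead(g_star, p_star, c0)
    z = 0
    for k in range(4):
        local, _, _ = lookahead(g[4 * k:4 * k + 4], p[4 * k:4 * k + 4],
                                group_carry[k])
        for j in range(4):
            z |= (p[4 * k + j] ^ local[j]) << (4 * k + j)
    return z, group_carry[4]
```

Extending this to 64 bits would mean one more call to the same helper on the g* and p* of the sixteen-bit groups.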

For board level adders CLA is pretty good, ‘cos it can use standard MSI lookahead generator ICs for nearly everything.

It’s not bad in CMOS, but it’s not particularly good either. I’d use one of the following architectures in preference. The only exception might be in a cell-based design where an optimised lookahead cell is provided.

Carry Skip adder

“In VLSI technology the carry-skip adder is comparable in speed to the carry look-ahead technique (for commonly used word lengths) while it requires less chip area and consumes less power.”
– Computer Arithmetic Algorithms, Israel Koren

…and that’s why it’s the adder I’d use for a 32 bit system, and probably for a 64 bit system too.

Carry propagation can skip any stage for which p_i = 1 (ie a_i != b_i). Several consecutive stages can be skipped if p_i =1 for each stage.

A carry-skip adder is divided into groups of consecutive stages, with a simple ripple carry scheme in each group.

Each group generates a group propagate signal, p*_i.

For Group_i, consisting of k stages j, j+1, … j+k-1

p*_i = p_j · p_j+1 · … · p_j+k-1

This is used to allow an incoming carry into the group to skip over all the stages in the group and generate a group carry out.

Group_i-Carry_out = c_j+k + p*_i · Group_i-Carry_in

(c_j+k is the normal ripple carry out from the most significant stage in the group)

The critical path through a carry-skip adder is via ripple carry through one of the groups, and via the skip carry chain through the remainder of the groups.

(Think about it. A carry coming out of a group via the ripple chain must have been generated within the group - if it was generated before the group, p*_i would have been 1 and the group would have been skipped. So the critical path will travel through only one group.)

In a carry-skip adder the groups will not all be the same size. The optimal division of an n-bit adder into carry-skip groups depends on the characteristics of the target technology.

A 32 bit adder might have 10 groups of sizes {1, 2, 3, 4, 5, 6, 5, 3, 2, 1} for a typical technology.
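A behavioural sketch of a carry-skip adder using those group sizes. It only checks the logic, not the timing: in this model the skip term is logically redundant, since the ripple carry-out already includes it, and the benefit in hardware is purely that the skip path is faster. The names `GROUP_SIZES` and `carry_skip_add` are assumptions made for the example.

```python
GROUP_SIZES = [1, 2, 3, 4, 5, 6, 5, 3, 2, 1]   # least significant group first


def carry_skip_add(a, b, c0, sizes=GROUP_SIZES):
    """Return (sum, carry_out) for an adder sum(sizes) bits wide."""
    z = 0
    group_carry_in = c0
    j = 0                                  # index of the group's first stage
    for k in sizes:
        c = group_carry_in
        group_p = 1                        # p*_i: AND of all p in the group
        for i in range(j, j + k):          # simple ripple inside the group
            a_i = (a >> i) & 1
            b_i = (b >> i) & 1
            p_i = a_i ^ b_i
            z |= (p_i ^ c) << i
            c = (a_i & b_i) | (c & p_i)
            group_p &= p_i
        # Group carry out = ripple carry out + p* . group carry in
        group_carry_in = c | (group_p & group_carry_in)
        j += k
    return z, group_carry_in


print(sum(GROUP_SIZES))                                  # 32
print(carry_skip_add(0xFFFFFFFF, 0x00000000, 1))         # (0, 1)
```

In the second call every group has p* = 1, so in hardware the carry-in would hop across all ten groups on the skip chain instead of rippling through 32 stages.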

VLSI implementations of carry-skip adders tend to be quite small - using the 32 bit adder given above the extra cost over an RCA is about 20 extra gates.

While a single level of carry-skip speeds things up a lot, a second level of skip, skipping over more than one group, can speed things up a little further for very little extra cost.

Carry Select adder

The reason carry propagation is slow in a ripple adder is that each stage needs to have a_i, b_i and c_i available before it can calculate c_i+1. One way of removing this dependency on c_i is to calculate both a_i + b_i + 0, and a_i + b_i + 1, then choose the appropriate result when c_i becomes available.

This is the basic trick of the carry select adder.

For 32bit operands the 32 stage adder could consist of four 8 bit groups.

Group₀ is purely an 8bit ripple carry adder, with c₀ as its carry input. The sum output from this RCA goes directly to the adder output.

Group₁ has two 8bit ripple carry adders, one with a C_in of 0, the other with a C_in of 1. The sum outputs from these two RCAs go to a 2:1 multiplexor controlled by the C_out of Group₀. The output of the mux goes to the adder output.

Group₂ has two 8bit RCAs, the same as Group₁. The sum outputs of these go to two 2:1 muxes. One is controlled by the C_out of the Group₁ chain with a C_in of 0, so the mux output is what the sum would be if the C_out of Group₀ were 0. The other is controlled by the C_out of the Group₁ chain with a C_in of 1, giving the sum if the C_out of Group₀ were 1.

These two mux outputs are fed to another 2:1 mux, controlled by the C_out of Group₀, to select the correct sum to send to the adder output.

Group₃ is similar. The sums from the two RCAs go through two 2:1 muxes controlled by the two C_outs of the two RCAs in Group₂, giving two values, one if the C_out of Group₁ is 1, one if it is 0.

These two values are passed to another pair of 2:1 muxes, controlled by the two C_outs of Group₁, giving two values, one if the C_out of Group₀ is 1, the other if it is 0.

The correct one of these two values is selected by a final 2:1 mux controlled by the C_out of Group₀.

Carry select adders tend to be slightly faster than skip adders, particularly for wide operands. They will typically consume a little over twice the area of a ripple carry adder. The layout for a select adder tends to be very regular, which can be a big advantage for datapath compilers/tilers.
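The basic trick can be sketched in a few lines. This is a simplified model, not the mux tree described above: each group's pair of precomputed results is selected by the already-resolved carry out of the previous group, rather than by cascaded muxes. The helper `rca` stands in for a ripple carry adder and simply uses integer addition; all names here are hypothetical.

```python
def rca(a, b, c_in, width):
    """Stand-in for a width-bit ripple carry adder: (sum, carry_out)."""
    total = a + b + c_in
    return total & ((1 << width) - 1), total >> width


def carry_select_add(a, b, c0, n=32, group=8):
    mask = (1 << group) - 1
    z = 0
    carry = c0
    for k in range(0, n, group):
        ga = (a >> k) & mask
        gb = (b >> k) & mask
        if k == 0:
            # Group 0 is a single RCA fed directly by c0.
            s, carry = rca(ga, gb, c0, group)
        else:
            # Both possibilities are computed without waiting for the carry...
            s0, c_out0 = rca(ga, gb, 0, group)
            s1, c_out1 = rca(ga, gb, 1, group)
            # ...then the incoming carry acts as the 2:1 mux select.
            s, carry = (s1, c_out1) if carry else (s0, c_out0)
        z |= s << k
    return z, carry


print(carry_select_add(0x0000FFFF, 0x00000001, 0))   # (65536, 0)
```

The two `rca` calls per group are the reason the area comes out at a little over twice that of a plain ripple carry adder.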

One of the DEC Alphas uses a slight variant on this approach. There’s a paper in Vol 27, No 11, November 1992 of the IEEE Journal of Solid-State Circuits giving a few paragraphs about the adder, and a nice diagram.

Prefix adders

These are bizarre to think about, but very powerful, particularly for longer word lengths.

It’s a little mathematical in places….

Define an operator ‘o’:

(g,p) o (g’,p’) = (g + (p · g’), p · p’)

Define G_i & P_i:

(G₁,P₁) = (g₁,p₁)

(G_i,P_i) = (g_i,p_i) o (G_i-1,P_i-1) for 2 <= i <= n

Then

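c_i = G_i for 1 <= i <= n [9]

(The equations in this section number the bits from 1 rather than 0, as Brent & Kung do, so c_i is the carry out of bit i, and there is no carry-in.)

There’s a fairly easy, but not too interesting, inductive proof of [9].

Next, ‘o’ is associative, ie

(g₁,p₁) o ( (g₂,p₂) o (g₃,p₃) )

= ( (g₁,p₁) o (g₂,p₂) ) o (g₃,p₃) for all (g_i,p_i) [10]

Again I’ll leave out the proof - it’s just an ‘expand both sides and notice they are identical’.

So to find c_i it suffices to calculate

(G_i,P_i) = (g_i,p_i) o (g_i-1,p_i-1) o ··· o (g₁,p₁)

and by [10] this can be bracketed in any way we like (the operands have to stay in this order though, since ‘o’ is not commutative).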
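Since the proofs are skipped, here is a small Python sketch that checks both claims by brute force instead: associativity of ‘o’ over all 64 combinations of three (g,p) pairs, and the carries obtained by folding ‘o’ over the bits. The function names are invented for the example; `prefix_carries` uses the 1-based numbering of this section and assumes no carry-in.

```python
from itertools import product


def o(left, right):
    """(g,p) o (g',p') = (g + p.g', p.p'); left is the more significant."""
    g, p = left
    g2, p2 = right
    return (g | (p & g2), p & p2)


def prefix_carries(a, b, n):
    """Return [c_1, ..., c_n] for n-bit a + b with no carry-in."""
    gp = [((a >> i) & (b >> i) & 1, ((a >> i) ^ (b >> i)) & 1)
          for i in range(n)]
    acc = gp[0]                      # (G_1, P_1)
    carries = [acc[0]]
    for pair in gp[1:]:
        acc = o(pair, acc)           # (G_i, P_i) = (g_i, p_i) o (G_i-1, P_i-1)
        carries.append(acc[0])
    return carries


pairs = list(product((0, 1), repeat=2))
print(all(o(x, o(y, w)) == o(o(x, y), w)
          for x in pairs for y in pairs for w in pairs))   # True

print(prefix_carries(0b0111, 0b0001, 4))                   # [1, 1, 1, 0]
```

`prefix_carries` evaluates the chain serially, which is no faster than rippling; the interesting part is that associativity lets the same result be computed by a tree.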

The Brent-Kung architecture

(The original reference is Brent & Kung, IEEE Transactions on Computers, Vol C-31, No 3, March 1982)

First consider the simple problem of calculating just c_n, for n=16. Since ‘o’ is associative (G₁₆,P₁₆) can be generated by a binary tree:

Each wire in the diagram carries a pair of signals (g,p). (g_i,p_i) are fed in at the base.

Each node ‘O’ performs this operation:

 (g_in,p_in) o (g’_in,p’_in)
            |
            |
            O
            |\
            | \
            |  \
            |   \
        (g_in,p_in)  (g’_in,p’_in)



             (G₁₆,P₁₆)
                 |
                 |
                 O
                 |\____
                 |     \______
                 |             \________
                 |                      \
                 O                       O
                 |\                      |\
                 | \__                   | \__
                 |    \__                |    \__
                 |       \__             |       \__
                 |          \            |          \
                 O           O           O           O
                 |\_         |\_         |\_         |\_
                 |  \_       |  \_       |  \_       |  \_
                 |    \      |    \      |    \      |    \
                 O     O     O     O     O     O     O     O
                 |\    |\    |\    |\    |\    |\    |\    |\
                 | \   | \   | \   | \   | \   | \   | \   | \
                 |  \  |  \  |  \  |  \  |  \  |  \  |  \  |  \
                 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
                 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
                15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
       (g_i,p_i)


This will give (G₁₆,P₁₆) and hence c₁₆. To generate c₁₅ down to c₁ another similar tree is required:


       c_i
                16 15 14 13 12 11 10  9  8  7  6  5  4  3  2  1
                 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
                 |  O  |  O  |  O  |  O  |  O  |  O  |  O  |  |
                 |  |\ |  |\ |  |\ |  |\ |  |\ |  |\ |  |\ |  |
                 |  | \|  | \|  | \|  | \|  | \|  | \|  | \|  |
                 |  |  O  |  |  |  O  |  |  |  O  |  |  |  |  |
                 |  |  |\_|  |  |  |\_|  |  |  |\_|  |  |  |  |
                 |  |  |  \_ |  |  |  \_ |  |  |  \_ |  |  |  |
                 |  |  |  | \|  |  |  | \|  |  |  | \|  |  |  |
                 |  |  |  |  O  |  |  |  |  |  |  |  |  |  |  |
                 |  |  |  |  |\_|  |  |  |  |  |  |  |  |  |  |
                 |  |  |  |  |  |\_|  |  |  |  |  |  |  |  |  |
                 |  |  |  |  |  |  |\_|  |  |  |  |  |  |  |  |
                 |  |  |  |  |  |  |  \  |  |  |  |  |  |  |  |
                 |  |  |  |  |  |  |  |\ |  |  |  |  |  |  |  |
                 |  |  |  |  |  |  |  | \|  |  |  |  |  |  |  |
                 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
                 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
                 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
                 O  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
                 |\_|  |  |  |  |  |  |  |  |  |  |  |  |  |  |
                 |  |\______ |  |  |  |  |  |  |  |  |  |  |  |
                 |  |  |  |  |\_________ |  |  |  |  |  |  |  |
                 |  |  |  |  |  |  |  | \|  |  |  |  |  |  |  |
                 O  |  |  |  |  |  |  |  O  |  |  |  |  |  |  |
                 |\ |  |  |  |  |  |  |  |\ |  |  |  |  |  |  |
                 | \__ |  |  |  |  |  |  | \__ |  |  |  |  |  |
                 |  | \__ |  |  |  |  |  |  | \__ |  |  |  |  |
                 |  |  | \__ |  |  |  |  |  |  | \__ |  |  |  |
                 |  |  |  | \|  |  |  |  |  |  |  | \|  |  |  |
                 O  |  |  |  O  |  |  |  O  |  |  |  O  |  |  |
                 |\_|  |  |  |\_|  |  |  |\_|  |  |  |\_|  |  |
                 |  \_ |  |  |  \_ |  |  |  \_ |  |  |  \_ |  |
                 |  | \|  |  |  | \|  |  |  | \|  |  |  | \|  |
                 O  |  O  |  O  |  O  |  O  |  O  |  O  |  O  |
                 |\ |  |\ |  |\ |  |\ |  |\ |  |\ |  |\ |  |\ |
                 | \|  | \|  | \|  | \|  | \|  | \|  | \|  | \|
                 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
                 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
                 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
                15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
       (g_i,p_i)
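The two trees can be sketched in Python as two sweeps over a list of (g,p) pairs. This is my reading of the 16 bit diagram rather than code from the paper: the loop bounds, the name `brent_kung_carries` and the node counter are all assumptions, and the word length has to be a power of two. Bits are numbered from 0 as in the diagram, so entry i of the result is c_i+1, with no carry-in.

```python
def o(left, right):
    """(g,p) o (g',p') = (g + p.g', p.p'); left is the more significant."""
    g, p = left
    g2, p2 = right
    return (g | (p & g2), p & p2)


def brent_kung_carries(a, b, n=16):
    """Return ([c_1, ..., c_n], number of 'O' nodes used)."""
    x = [((a >> i) & (b >> i) & 1, ((a >> i) ^ (b >> i)) & 1)
         for i in range(n)]
    nodes = 0
    # First tree: combine pairs, then pairs of pairs, up to the full width.
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):
            x[i] = o(x[i], x[i - d])
            nodes += 1
        d *= 2
    # Second tree: fill in the remaining prefixes from results already known.
    d = n // 4
    while d >= 1:
        for i in range(3 * d - 1, n, 2 * d):
            x[i] = o(x[i], x[i - d])
            nodes += 1
        d //= 2
    return [g for g, _ in x], nodes


carries, nodes = brent_kung_carries(0x00FF, 0x0001)
print(carries)   # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
print(nodes)     # 26
```

For n = 16 the first sweep uses 15 nodes and the second 11, which matches the number of ‘O’s in the two diagrams.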


There are many varying architectures for prefix adders, driven by speed/area/layout complexity tradeoffs.

i used to be an engineer

PUBLISHED ON OCT 14, 2015