# i used to be an engineer

###### PUBLISHED ON OCT 14, 2015

Looking through archive.org I found an old web page of mine - it was archived in February 1999, but if I recall correctly was written three or four years before that.

And, hey, look! I used to be an engineer!

### Disclaimer

HTML generated semi-automatically from an old email to the hotwired mailing list that was written very late at night. Don’t treat as gospel….

### Notation

xi is bit i of the number x, in standard twos complement form.

• · is logical and
• + is logical or
• ¤ is logical xor

### Basics

We’re adding two n-bit numbers a & b, and a carry-in c0 to give an n-bit result z, and a carry-out cn

z = a plus b plus c0

The operands and result are of the form

a = { an-1, an-2, … a1, a0 }

and so on.

For all 0 <= i < n

zi = ai ¤ bi ¤ ci

ci+1 = xi · yi + ci · (xi + yi) 

This is often reformulated in terms of propagate (p) and generate (g) signals:

gi = ai · bi 

pi = ai ¤ bi 

(The alternative pi = ai + bi is sometimes used)

(Often a kill signal (k) is used too: ki = not(ai + bi) )

Then the carry signals can be given as:

ci+1 = gi + ci · pi substitute  &  in  

and

zi = pi ¤ ci

The expensive bit about addition is generating ci. Once all the ci are generated the zi can be generated in constant time (not a negligable time, but at least it’s constant with operand length).

### Ripple Carry (RCA)

A ripple carry adder (RCA) implements  directly. You all understand ripple carry adders, so I’ll just say that they are cheap and slow.

The critical path in an RCA is from c0, a0 or b0 through all the full adders to zn-1 or cn.

For a four bit adder we can expand  to give

c1 = g0 + c0·p0 

c2 = g1 + g0·p1 + c0·p0·p1 

c3 = g2 + g1·p2 + g0·p1p2 + c0·p0·p1·p2 

c4 = g3 + g2·p3 + g1·p2·p3 + g0·p1·p2·p3 + c0·p0·p1·p2·p3 

This is reasonable for 4 bits, but will get grotesquely large in terms of are and fan-out if taken much further.

One option is to divide the operands into groups of four bits, use carry-lookahead within each group and ripple the carries between groups.

#### Divide and Conquer

If we can improve on speed of rippling carries within each group, then surely we can improve on the speed of rippling carries between groups by a similar approach.

Consider Group0, consisting of (g0 - g3, p0 - p3)

Defining a `group generate' g* and a`group propagate’ p*:

For Group0:

g*0 = g3 + g2·p3 + g1·p2·p3 + g0·p1·p2·p3

p*0 = p0·p1·p2·p3

and identically for all other groups.

Now

c4 = g80 + c0·p*0

c8 = g1 + g0·p1 + c0·p0·p*1

c12 = g2 + g1·p2 + g0·p1·p2 + c0·p0·p1·p*2

c16 = [deleted - it’s huge….]

These are identical to equations [5-8] used to generate carries within groups.

There’s no need to stop there.

The four-bit groups can be combined into sixteen-bit groups of groups.

Each can produce generate (g) and propagate (p) signals, combined as above to give c16, c32, c48 and c64.

This divide and conquer approach will eventually generate ci for the entire word width.

<This would be a lot clearer with diagrams>

For board level adders CLA is pretty good, ‘cos it can use standard MSI lookahead generator ICs for nearly everything.

It’s not bad in CMOS, but it’s not particularly good either. I’d use one of the following architectures in preference. The only exception might be in a cell-based design where an optimised lookahead cell is provided.

“In VLSI technology the carry-skip adder is comparable in speed to the carry look-ahead technique (for commonly used word lengths) while it requires less chip area and consumes less power.”
– Computer Arithmetic Algorithms, Israel Koren

…and that’s why it’s the adder I’d use for a 32 bit system, and probably for a 64 bit system too.

Carry propagation can skip any stage for which pi = 1 (ie ai != bi). Several consecutive stages can be skipped if pi =1 for each stage.

A carry-skip adder is divided into groups of consecutive stages, with a simple ripple carry scheme in each group.

Each group generates a group propagate signal, p*i.

For Groupi, consisting of k stages j, j+1, … j+k-1

pi = pj · pj+1 · … · pj+k-1

This is used to allow an incoming carry into the group to skip over all the stages in the group and generate a group carry out.

Groupi-Carryout = cj+k + pi · Groupi-Carryin

(cj+k is the normal ripple carry out from the most significat stage in the group)

The critical path through a carry-skip adder is via ripple carry through, one of the groups, and via the skip carry chain through the remainder of the groups.

(Think about it. A carry coming out of a group via the ripple chain must have been generated within the group - if it was generated before the group, p*i would have been 1 and the stage would have been skipped. So the critical path will travel through only one group).

In a carry-skip adder the groups will not all be the same size. The optimal division of an n-bit adder into carry-skip groups depends on the characteristics of the target technology.

A 32 bit adder might have 10 groups of sizes {1, 2, 3, 4, 5, 6, 5, 3, 2, 1} for a typical technology.

VLSI implementations of carry-skip adders tend to be quite small - using the 32 bit adder given above the extra cost over an RCA is about 20 extra gates.

While a single level of carry-skip speeds things up a lot a second level of skip, skipping over more than one group can speed things up a little further for very little extra cost.

The reason carry propagation is slow in a ripple adder is because each stage needs to have ai, bi and ci available before it can calculate ci+i. One way of removing this dependency on ci is to calculate both ai + bi + 0, and ai + bi + 1, then choose the appropriate result when ci becomes available.

This is the basic trick of the carry select adder.

For 32bit operands the 32 stage adder could consist of four 8 bit groups.

Group0 is purely an 8bit ripple carry adder, with c0 as it’s carry input. The sum output from this RCA goes directly to the adder output.

Group1 has two 8bit ripple cary adders, one with a Cin of 0, the other with a Cin of 1. The sum outputs from these two RCAs go to a 2:1 multiplexor controlled by the Cout of Group0. The output of the mux goes to the adder output.

Group2 has two 8bit RCAs, the same as group1. The sum outputs of this go to two 2:1 muxes. One is controlled by the Cout of the Group1 chain with a Cin of 0, so the mux output is what the sum would be if the Cout of Group0 were 0. The other is controlled by the Cout of the Group1 chain with a Cin of 1, giving the sum if the Cout of Group1 were 1.

These two mux outputs are fed to another 2:1 mux, controlled by the Cout of Group0, to select the correct sum to send to the adder output.

Group3 is similar. The sums from the two RCAs go through two 2:1 muxes controled by the two Couts of the two RCAs in Group2, giving two values, one if the Cout of Group1 is 1, one if it is 0.

These two values are passed to another pair of 2:1 muxes, controlled by the two Couts of Group1, giving two values, one if the Cout of Group0 is 1, the other if it is 0.

The correct one of these two values is selected by a final 2:1 controlled by the Cout of Group0.

<This is far too complicated for ASCII art at this time of night. If there’s any interest I might HTMLify it>

Carry select adders tend to be slightly faster than skip adders, particularly for wide operands. They will typically consume a little over twice the area of a ripple carry adder. The layout for a select adder tends to be very regular, which can be a big advantage for datapath compilers/tilers.

One of the DEC Alphas uses a slight variant on this approach. There’s a paper in Vol 27, No 11, November 1992 of the IEEE Journal of Solid-State Circuits giving a few paragraphs about the adder, and a nice diagram.

These are bizarre to think about, but very powerful, particularly for longer word lengths.

It’s a little mathematical in places….

Define an operator ‘o’:

(g,p) o (g’,p’) = (g + (p · g’), p · p’)

Define Gi & Pi:

(G1,P1) = (g1,p1)

(Gi,Pi) = (gi,pi) o (Gi-1,Pi-1) 2 <= i <= n

Then

ci = Gi for 1 <= i <= n 

There’s a fairly easy, but not too interesting, inductive proof of .

Next, ‘o’ is associative, ie

(g1,p1) o ( (g2,p2) o (g3,p3) )

= ( (g1,p1) o (g2,p2) ) o (g3,p3) for all (gi,pi) 

Again I’ll miss out the proof - it’s just an ‘expand both sides and notice they are identical’.

So to find ci it suffices to calculate

(Gi,Pi) = (gi,p1) o (gi-1,pi-1) o ··· o (g1,p1)

and by  this can be calculated in any order.

#### The Brent-Kung architecture

(The original reference is Brent & Kung, IEEE Transactions on Computers, Vol C-31,No 3, March 1982)

First consider the simple problem of calculating just cn, for n=16. Since ‘o’ is associative (G16,P16) can be generated by a binary tree:

Each wire in the diagram carries a pair of signals (g,p). (gi,pi) are fed in at the base.

Each node ‘O’ performs this operation:

(gin,pin) o (g’in,p’in)
```            |
|
O
|
|
|
|   </pre>

(gin,pin)  (g’in,p’in)

(G16,P16)
|
|
O
|____
|     ______
|             ________
|
O                       O
|\                      |
| __                   | __
|    __                |    __
|       __             |       __
|          \            |
O           O           O           O
|_         |_         |_         |_
|  _       |  _       |  _       |  _
|    \      |    \      |    \      |
O     O     O     O     O     O     O     O
|\    |\    |\    |\    |\    |\    |\    |\
| \   | \   | \   | \   | \   | \   | \   |
|  \  |  \  |  \  |  \  |  \  |  \  |  \  |  \
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
(gi,pi)

This will give (G16,p16) and hence c16. To generate c15 downto c1
another similar tree is required:

ci
16 15 14 13 12 11 10  9  8  7  6  5  4  3  2  1
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
|  O  |  O  |  O  |  O  |  O  |  O  |  O  |  |
|  |\ |  |\ |  |\ |  |\ |  |\ |  |\ |  |\ |  |
|  | |  | |  | |  | |  | |  | |  | |  |
|  |  O  |  |  |  O  |  |  |  O  |  |  |  |  |
|  |  |_|  |  |  |_|  |  |  |_|  |  |  |  |
|  |  |  _ |  |  |  _ |  |  |  _ |  |  |  |
|  |  |  | |  |  |  | |  |  |  | |  |  |  |
|  |  |  |  O  |  |  |  |  |  |  |  |  |  |  |
|  |  |  |  |_|  |  |  |  |  |  |  |  |  |  |
|  |  |  |  |  _|  |  |  |  |  |  |  |  |  |
|  |  |  |  |  |  _|  |  |  |  |  |  |  |  |
|  |  |  |  |  |  |  \  |  |  |  |  |  |  |  |
|  |  |  |  |  |  |  |\ |  |  |  |  |  |  |  |
|  |  |  |  |  |  |  | |  |  |  |  |  |  |  |
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
O  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
|_|  |  |  |  |  |  |  |  |  |  |  |  |  |
|  |  ____ |  |  |  |  |  |  |  |  |  |  |
|  |  |  |  | ________ |  |  |  |  |  |  |  |
|  |  |  |  |  |  |  | |  |  |  |  |  |  |  |
O  |  |  |  |  |  |  |  O  |  |  |  |  |  |  |
|\ |  |  |  |  |  |  |  |\ |  |  |  |  |  |  |
| __ |  |  |  |  |  |  | __ |  |  |  |  |  |
|  | __ |  |  |  |  |  |  | __ |  |  |  |  |
|  |  | __ |  |  |  |  |  |  | __ |  |  |  |
|  |  |  | |  |  |  |  |  |  |  | |  |  |  |
O  |  |  |  O  |  |  |  O  |  |  |  O  |  |  |
|_|  |  |  |_|  |  |  |_|  |  |  |_|  |  |
|  _ |  |  |  _ |  |  |  _ |  |  |  _ |  |
|  | |  |  |  | |  |  |  | |  |  |  | |  |
O  |  O  |  O  |  O  |  O  |  O  |  O  |  O  |
|\ |  |\ |  |\ |  |\ |  |\ |  |\ |  |\ |  |\ |
| |  | |  | |  | |  | |  | |  | |  | |
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
(gi,pi)

There are many varying architectures for prefix adders, driven by