# A Fanout Optimization Algorithm based on the Effort Delay Model

Peyman Rezvani and Massoud Pedram

Department of Electrical Engineering - Systems University of Southern California Los Angeles, CA 90089 Tel: +1-213.740-4458 Email: [peyman.pedram]@usc.edu

#### ABSTRACT

This paper presents LEOPARD, a logical-effort-based fanout optimizer for area and delay, which relies on the availability of a continuous-size buffer library. Based on the concept of logical effort in VLSI circuits, the proposed algorithm attempts to minimize the total buffer area under the required time and input capacitance constraints by constructing the fanout tree topology and sizing the buffers in it. For the case of a single sink, the proposed algorithm produces the optimum buffer chain subject to the given constraints. For the case of multiple sinks, the input capacitance allocation problem is shown to be NP-Complete, and a heuristic is proposed that allocates the input capacitance budget among the branches, solves each branch optimally, and then merges the resulting buffer chains into a fanout tree. Experimental results, for both continuous and discrete buffer libraries, show that LEOPARD achieves a significant reduction in the total buffer area.

## I. INTRODUCTION

Very often in a VLSI design, a signal needs to be distributed to several destinations under a required-time constraint at each destination and a limitation on the load that can be driven by the source of the signal. Fanout optimization is the problem of building a buffer tree between the source and its destinations that satisfies these constraints. Many fanout optimization techniques have been proposed for discrete buffer libraries (e.g., [10]), and approaches that use more accurate delay models or take interconnect delay into account have also been proposed [11]. More recently, however, researchers [1] have started to use continuous, as opposed to discrete, buffer libraries, in the sense that the optimization is performed under the assumption that buffers are available in all sizes. This greatly simplifies the problem and allows the application of more powerful optimization techniques.

In [1], the authors simplified the fanout optimization problem by restricting the search space of tree topologies, in contrast to approaches that consider a larger set of topologies. The authors used a dynamic programming approach to implicitly enumerate the set of feasible solutions and find the optimum buffer topology and sizing. Although [1] restricts the search space, it still finds an optimal solution in this search space under a gain-based delay model.

In this paper, an algorithm is presented that finds the fanout tree topology and sizes of the buffers on the tree by decomposing the whole problem into subproblems and solving each subproblem separately for each sink. The solutions to the subproblems are then merged to form the solution to the whole problem. Our derivation relies on the notions of logical and electrical effort first proposed in [4].

Sutherland and Sproull [4] minimized the delay along any single path by assigning equal effort to each stage on the path. While this approach is proven to minimize the delay, it does not necessarily result in an optimal solution in terms of the total buffer area. Kung [3], on the other hand, solved the fanout optimization problem to minimize the input capacitance at the source under the required time constraints, again without directly minimizing the buffer area. In contrast, the approach presented in this paper minimizes the total buffer area under the required time constraints and the input capacitance constraint at the driver. This is an important distinction because it allows one to trade off delay and input capacitance against the total buffer area.

The remainder of this paper is organized as follows. In Section II, the effort delay model that is used throughout this paper is explained. Section III explains the details of the algorithm. In Section IV, experimental results are shown, and in Section V, we conclude the paper.

#### II. DELAY MODEL

The delay model used in this paper is based on the concept of logical and electrical efforts presented in [4]. The effort-based model is basically a reformulation of the conventional RC model of CMOS gate delay.

Using the same terminology as in [4], the delay of a gate is defined to be:

$$d = \tau(p + gh)$$
 (1)

where $\tau$ is a time unit that characterizes the semiconductor process being used. Because it is only a scaling factor, $\tau$ is not considered from now on. Parameter p is the parasitic delay of the gate; the major contribution to the parasitic delay is the capacitance of the source/drain regions of the transistors that drive the gate's output. Parameter g is the logical effort, which depends on the topology of the gate, and h is the electrical effort, defined as the ratio of the load capacitance driven by the gate to its input capacitance.

The important point is that p and g are independent of the size of the gate, and the only factor that is affected by sizing is the electrical effort h. Reference [4] shows how p and g are independent of sizing by doing the reformulation to define the four factors $\tau$, p, g and h in terms of the resistance and capacitance of a minimum size inverter and a template gate representing the topology of the gate. For details refer to [4].

A preliminary version of this work was presented in [8]. The journal submission includes more theoretical derivation, results, and experimental data.
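As a small illustration of equation (1), the following sketch evaluates the effort-model delay of a single gate. The function names and the numeric values are hypothetical and chosen only for illustration; the electrical effort is taken as the ratio of load capacitance to input capacitance, as in [4].

```python
def electrical_effort(c_load: float, c_in: float) -> float:
    """Electrical effort h: load capacitance divided by input capacitance."""
    return c_load / c_in


def gate_delay(p: float, g: float, h: float, tau: float = 1.0) -> float:
    """Effort-model delay d = tau * (p + g * h), as in equation (1)."""
    return tau * (p + g * h)


# Hypothetical numbers: an inverter (g = 1) with parasitic delay 0.6
# driving a load four times larger than its own input capacitance.
h = electrical_effort(c_load=8.0, c_in=2.0)
d = gate_delay(p=0.6, g=1.0, h=h)
print(h, d)
```

Doubling the size of the gate doubles `c_in` and halves `h`, while `p` and `g` stay fixed, which is the property the model relies on.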

### III. ALGORITHM

In this section, the fanout optimization problem is stated as two separate problems, and each one is solved separately.

One-Sink Fanout Optimization (1FO) Problem: Given the source of a signal Q with maximum driving capability $C_{in}$ and a sink S with capacitive load $C_L$, required polarity $P$ and required time $T_R$, find the optimum number of buffers for a buffer chain and the appropriate sizing for them to minimize the total buffer area such that the delay from Q to S is less than or equal to $T_R$, the required polarity at S is satisfied, and the input capacitance of the first buffer in the chain does not exceed $C_{in}$.

Multiple-Sink Fanout Optimization (mFO) Problem: Given the source of a signal Q with maximum driving capability $C_{in}$ along with a set of *m* sinks $S_i$, each of which is assigned a triplet $(C_{L_i}, T_{R_i}, P_i)$, where $C_{L_i}$ is the capacitive load, $T_{R_i}$ is the required arrival time, and $P_i$ is the required polarity for the sink $S_i$, find a fanout tree of buffers and the appropriate sizing for them to minimize the total buffer area such that the timing constraint and the polarity required at each sink are satisfied and the input capacitance constraint at the source is met.

Note that the only difference between the two problems is the number of sinks to be driven. The objective function, area, in both of these problems is considered to be the summation of input capacitances of all the buffers, which is reasonable with the assumption of continuous sizing for the gates.

The rest of this section is organized as follows. The 1FO problem is solved in Section A, and in Section B, the mFO problem is solved based on the solution derived for 1FO problem.

# A. Buffer Chain

For the 1FO problem, the solution is a chain of buffers between the source and the sink. The variables of the problem are defined to be the number of buffers, $n$, and the electrical efforts of these buffers, $h_1, \ldots, h_n$.



Fig. 1: Buffer Chain.

Since the logical effort for an inverter is 1, the delay through the buffer chain can be expressed in terms of *n* and *h*'s as follows.

$$delay = np_{inv} + \sum_{i=1}^{n} h_i$$
 (2)

The overall area, which is calculated as the summation of the input capacitances of all buffers on the buffer chain, may subsequently be expressed as:

$$area = \sum_{i=1}^{n} C_i = \sum_{i=1}^{n} \frac{C_L}{\prod_{j=i}^{n} h_j}$$
 (3)
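The two expressions above can be checked numerically. The sketch below assumes that $h_1$ belongs to the buffer nearest the source and that the input capacitance of buffer $i$ is $C_L / (h_i h_{i+1} \cdots h_n)$, which is the form of the terms in the area objective; the function names and the example values are hypothetical.

```python
from typing import List


def chain_delay(efforts: List[float], p_inv: float) -> float:
    """Equation (2): n * p_inv plus the sum of the electrical efforts."""
    return len(efforts) * p_inv + sum(efforts)


def chain_input_caps(efforts: List[float], c_load: float) -> List[float]:
    """Input capacitance of every buffer; efforts[0] is h_1 (source side)."""
    caps = []
    c = c_load
    for h in reversed(efforts):   # walk from the sink back to the source
        c = c / h                 # input cap = driven cap / electrical effort
        caps.append(c)
    caps.reverse()
    return caps


def chain_area(efforts: List[float], c_load: float) -> float:
    """Equation (3): area as the sum of all buffer input capacitances."""
    return sum(chain_input_caps(efforts, c_load))


efforts = [2.0, 4.0]              # hypothetical h_1, h_2
caps = chain_input_caps(efforts, c_load=16.0)
area = chain_area(efforts, c_load=16.0)
delay = chain_delay(efforts, p_inv=0.6)
print(caps, area, delay)
```

The first entry of `caps` is $C_1$, the capacitance seen by the source, which is the quantity bounded by $C_{in}$.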

The goal would be to find n and all $h_i$'s to minimize area while both timing and input capacitance constraints are satisfied; that is,

$$\begin{array}{ll} Min & area \\ st: & delay \le T_R \\ & C_1 \le C_{in} \end{array}$$

Theorem 1: In the 1FO problem, delay through the optimum buffer chain is exactly equal to the specified required time $T_R$, i.e., $delay = T_R$.

**Proof:** According to equation (3), area is a monotonically decreasing function of all $h_i$'s (i=1,...,n). In other words, increasing any $h_i$ will always result in a buffer chain with smaller area. The delay, on the other hand, is an increasing function of all $h_i$'s according to (2). Increasing any $h_i$ also decreases $C_1$, so the input capacitance constraint remains satisfied. This means that by increasing any arbitrary $h_i$, area can be decreased and delay can be increased up to the point that delay becomes equal to the given constraint $T_R$; therefore, the optimum buffer chain has $delay = T_R$.

Lemma 1: In the 1FO problem, for a fixed number of buffers, n, in the chain, the optimum buffer chain has $\sum h_i$ equal to a constant $T_R - np_{inv}$.

Proof: According to Theorem 1 and equation (2):

$$np_{inv} + \sum_{i=1}^{n} h_i = T_R$$

The first term on the left hand side, $np_{inv}$, is constant for a given n; therefore, $\sum h_i$ for the optimum buffer chain with n buffers is also constant and equal to:

$$\sum_{i=1}^{n} h_i = T_R - np_{inv}$$
 (4)

Hence the claim is proved.

To find the optimum number of buffers, n, the maximum input capacitance constraint $C_1 \le C_{in}$ is used, where $C_1$ is the input capacitance of the first buffer in the chain being driven by the source signal and $C_{in}$ is the given constraint on the input capacitance.

The input capacitance for the first buffer is computed as follows.

$$C_1 = \frac{C_L}{\prod h_i}$$
 (5)

Let the electrical effort of the chain be defined as the product of electrical efforts of all the buffers, and let it be shown by *H*. Using the above equation, the input capacitance constraint can be restated as follows:

$$H = \prod h_i = \frac{C_L}{C_1} \ge \frac{C_L}{C_{in}}$$
 (6)

Theorem 2: In the 1FO problem, for a fixed number of buffers, n, in the chain, the electrical effort of the buffer chain, H, achieves its maximum value when all $h_i$'s are equal.

Proof: According to Lemma 1, the summation of all $h_i$'s is constant for any given number of buffers. Since the product of positive variables with a constant summation is maximum when all those variables are equal (the arithmetic-geometric mean inequality), all $h_i$'s have to be equal to maximize H.

The electrical effort of each buffer for the buffer chain that maximizes H, according to Theorem 2 and equation (4), would then be:

$$\hat{h}_{i} = \hat{h} = \frac{T_{R} - np_{inv}}{n}$$
  $\forall i = 1, ..., n$  (7)

So the maximum of H, named  $\overline{H}$  as a function of n would be:

$$\overline{H} = \left(\frac{T_R - np_{inv}}{n}\right)^n \quad (8)$$

$\overline{H}$ is drawn in Fig. 2 for $T_R=14$ and $p_{inv}=0.6$.
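To make the shape of this curve concrete, the sketch below tabulates equation (8) at integer buffer counts for $T_R = 14$ and $p_{inv} = 0.6$, the values quoted for Fig. 2. The function name `h_bar` is an assumption made for illustration.

```python
def h_bar(n: int, t_req: float, p_inv: float) -> float:
    """Equation (8): largest chain electrical effort reachable with n buffers."""
    return ((t_req - n * p_inv) / n) ** n


T_R, P_INV = 14.0, 0.6
values = {n: h_bar(n, T_R, P_INV) for n in range(1, 11)}
best_n = max(values, key=values.get)   # integer count with the largest H_bar

for n, value in values.items():
    print(n, round(value, 2))
print("largest H_bar at n =", best_n)
```

The curve rises, peaks, and then falls, so a horizontal line at a given ratio $C_L/C_{in}$ meets it in at most two points.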



According to Theorem 2, there is a maximum value that H can achieve for any given buffer count. Therefore, the only feasible buffer counts are those for which this maximum value, $\overline{H}$, is not smaller than the ratio $C_L/C_{in}$ (equation (6)), and these correspond to the buffer counts between the points of intersection of $\overline{H}$ and the line $C_L/C_{in}$ (points $\tilde{n}_1$ and $\tilde{n}_2$). As an example, for Case I in Fig. 2, there is no feasible solution because there are no intersection points and $\overline{H}$ is below $C_L/C_{in}$ for all buffer counts. For Case III, on the other hand, there are two points of intersection, $\tilde{n}_1$ and $\tilde{n}_2$; therefore, the only feasible buffer counts are between $\tilde{n}_1$ and $\tilde{n}_2$.

With these observations, algorithm OptN in Fig. 3 is proposed for finding the optimum number of buffers and their sizes.

Fig. 3: Algorithm OptN

To find the optimum number of buffers, the line $C_L/C_{in}$ is intersected with the graph of $\overline{H}$ (line 2 of Fig. 3 and Case III in Fig. 2), which results in $\tilde{n}_1$ and $\tilde{n}_2$. Note that:

$$\lim_{n\to 0} \overline{H} = 1$$
 (9)

Therefore, there always exists an $\tilde{n}_1$, unless the line $C_L/C_{in}$ is passing below unity, which means that $C_L$ is less than or equal to $C_{in}$, in which case no buffers need to be used at all. On the other hand, the buffer count is also limited by the required time because of the intrinsic buffer delay. According to equation (4), for the electrical efforts of the buffers to have a meaningful physical interpretation, $T_R - np_{inv}$ has to be positive, which means (line 4 of Fig. 3):

$$n \le \frac{T_R}{p_{inv}}$$
 (10)

In short, the buffer count is limited by $\tilde{n}_1$ on one side and by $\tilde{n}_2$ and $T_R/p_{inv}$ on the other side. Therefore, the optimum buffer count, $n$, lies between $\tilde{n}_1$ and $\min(\tilde{n}_2, T_R/p_{inv})$ (lines 3 and 4 of Fig. 3).
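The bounds on the buffer count can be sketched as a simple enumeration over integer values of n. This is not a reproduction of Fig. 3: instead of intersecting the line with the curve analytically, the sketch below tests each integer directly. The function names and the `parity` argument (a hypothetical encoding of the polarity requirement) are assumptions.

```python
from typing import List, Optional


def h_bar(n: int, t_req: float, p_inv: float) -> float:
    """Equation (8): largest chain electrical effort reachable with n buffers."""
    return ((t_req - n * p_inv) / n) ** n


def feasible_buffer_counts(c_load: float, c_in: float, t_req: float,
                           p_inv: float, parity: Optional[int] = None) -> List[int]:
    """Integer buffer counts that can satisfy equations (6) and (10).

    parity: 0 keeps even counts, 1 keeps odd counts, None ignores polarity.
    """
    ratio = c_load / c_in
    if ratio <= 1.0 and parity in (None, 0):
        return [0]                       # the source can drive the sink directly
    counts = []
    n = 1
    while t_req - n * p_inv > 0:         # equation (10)
        polarity_ok = parity is None or n % 2 == parity
        if polarity_ok and h_bar(n, t_req, p_inv) >= ratio:   # equation (6)
            counts.append(n)
        n += 1
    return counts


print(feasible_buffer_counts(50.0, 1.0, 14.0, 0.6))            # two intersections
print(feasible_buffer_counts(50.0, 1.0, 14.0, 0.6, parity=1))  # odd counts only
print(feasible_buffer_counts(100.0, 1.0, 14.0, 0.6))           # line above the curve
```

An empty result corresponds to the situation in which the line $C_L/C_{in}$ lies above the whole curve, so no chain can meet both constraints.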

There is a possibility that the line $C_L/C_{in}$ could intersect the graph when there is no integer *n* between the points of intersection that satisfies the polarity constraint. This only happens when the line crosses the graph very close to the peak of the graph (Case II in Fig. 2). In lines 5 and 6, the optimum sizing for the buffers on the chain is found by solving a convex optimization problem as follows:

$$\begin{array}{ll} Min & \displaystyle \frac{C_L}{h_n} + \frac{C_L}{h_n h_{n-1}} + \ldots + \frac{C_L}{h_n h_{n-1} \ldots h_1} \\ st: & \displaystyle h_1 + \ldots + h_n \leq T_R - n p_{inv} \\ & \displaystyle h_1 \ldots h_n \geq \frac{C_L}{C_{in}} \end{array}$$
(11)

This is a minimization of a posynomial function with posynomial inequality constraints that can be easily solved in polynomial time [6]. Finally, among the solutions found for all feasible buffer counts, the one with the minimum area is selected as the optimum solution.
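For intuition only, the sizing problem can be explored for a two-buffer chain with a brute-force scan. This is a stand-in technique, not the posynomial solver referred to above: it uses Theorem 1 to fix $h_1 + h_2 = T_R - 2p_{inv}$ and scans $h_2$ on a grid. The function name, grid resolution, and numeric values are hypothetical.

```python
from typing import Optional, Tuple


def size_two_buffer_chain(c_load: float, c_in: float, t_req: float,
                          p_inv: float, steps: int = 20000
                          ) -> Optional[Tuple[float, float, float]]:
    """Return (area, h1, h2) of the best grid point, or None if infeasible."""
    budget = t_req - 2 * p_inv           # h1 + h2, with the delay constraint tight
    best = None
    for k in range(1, steps):
        h2 = budget * k / steps
        h1 = budget - h2
        c2 = c_load / h2                 # input cap of the buffer driving the sink
        c1 = c2 / h1                     # input cap seen by the source
        if c1 > c_in:                    # violates the input capacitance limit
            continue
        area = c1 + c2
        if best is None or area < best[0]:
            best = (area, h1, h2)
    return best


result = size_two_buffer_chain(c_load=30.0, c_in=1.0, t_req=14.0, p_inv=0.6)
print(result)
```

In this example the capacitance constraint is active at the best point: the unconstrained minimum would need a first buffer slightly larger than $C_{in}$ allows, so the scan settles on the boundary where $C_1$ is almost equal to $C_{in}$.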

It is interesting to note that by taking the derivative of  $\overline{H}$  and setting it equal to zero, its maximum value is found to be at:

$$\hat{n} = T_{R} \times \lambda(p_{inv})$$
 (12)

where:

$$\lambda(p_{inv}) = \frac{Lambert(p_{inv}/e)}{p_{inv}(Lambert(p_{inv}/e) + 1)}$$
(13)

The function $Lambert(x)$ is the solution $w$ to the nonlinear equation $we^{w}=x$. For further information about the Lambert function refer to [5]. As $p_{inv}$ tends toward zero:

$$\lim_{p_{inv} \to 0} \lambda(p_{inv}) = \frac{1}{e}$$
 (14)

and this corresponds to allocating the well-known electrical effort of e to each buffer with the assumption of $p_{inv} = 0$.
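Equations (12) to (14) can be checked numerically with a small Newton iteration for the Lambert function, taken here as the non-negative solution $w$ of $we^{w} = x$. The helper names below are assumptions; only the standard library is used.

```python
import math


def lambert_w(x: float, iterations: int = 50) -> float:
    """Solve w * exp(w) = x for x >= 0 by Newton's method."""
    w = math.log1p(x)                    # starting point at or above the root
    for _ in range(iterations):
        ew = math.exp(w)
        w -= (w * ew - x) / (ew * (w + 1.0))
    return w


def lam(p_inv: float) -> float:
    """Equation (13)."""
    w = lambert_w(p_inv / math.e)
    return w / (p_inv * (w + 1.0))


def n_hat(t_req: float, p_inv: float) -> float:
    """Equation (12): real-valued buffer count that maximizes H_bar."""
    return t_req * lam(p_inv)


n_peak = n_hat(14.0, 0.6)
effort_at_peak = (14.0 - n_peak * 0.6) / n_peak
print(n_peak, effort_at_peak)
print(lam(1e-9), 1.0 / math.e)           # equation (14)
```

Setting the derivative of $n \ln\big((T_R - np_{inv})/n\big)$ to zero gives $\ln h = 1 + p_{inv}/h$ for the per-buffer effort $h$ at the peak, which is the condition the sketch can be tested against.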

Theorem 3: Algorithm OptN finds the optimum solution for the 1FO problem.

Proof: Since all of the feasible solutions are explicitly considered, the algorithm is guaranteed to find the optimum solution.

#### B. Buffer Tree

In this section, the more general case of the fanout optimization problem is considered, where the source signal is driving more than one sink.

Reference [3] introduced two transformations that can be performed on a fanout tree, namely *merging* and *splitting*. It is shown here that these transformations maintain the same area, delay, and capacitance.



Fig. 4: Split/Merge Transformations.

Theorem 4: The split/merge transformations applied to a fanout tree preserve the input capacitance (thus area) and the delay.

**Proof:** The proof for the split transformation is as follows: Suppose the electrical effort of the original buffer before splitting is h. Then the delay through the buffer for both of the branches is $h + p_{inv}$, and the input capacitance is $(C_1 + C_2)/h$, which is also the area of the buffer. After splitting the original buffer into two buffers with equal electrical efforts of h, the delay for both branches would still be $h + p_{inv}$, and the input capacitance would be $C_1/h + C_2/h$, thus the same input capacitance and hence the same area. For the merge transformation, one can easily verify the same, provided that the electrical efforts of the buffers to be merged are equal.
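The invariance can be verified with a few lines of arithmetic. The sketch assumes that a buffer with electrical effort h driving a capacitance C has input capacitance C/h and delay h plus the parasitic delay; the numeric values are hypothetical.

```python
def buffer_input_cap(c_driven: float, h: float) -> float:
    """Input capacitance (and area) of a buffer with electrical effort h."""
    return c_driven / h


def buffer_delay(h: float, p_inv: float) -> float:
    """Delay of one inverter stage with electrical effort h."""
    return h + p_inv


c1, c2, h, p_inv = 3.0, 5.0, 4.0, 0.6    # hypothetical branch loads and effort

merged_cap = buffer_input_cap(c1 + c2, h)                         # one shared buffer
split_cap = buffer_input_cap(c1, h) + buffer_input_cap(c2, h)     # two buffers

# Both configurations use the same effort h, so every branch sees the same delay.
print(merged_cap, split_cap, buffer_delay(h, p_inv))
```

If the two buffers had different efforts, the sum of their input capacitances would in general differ from that of any single merged buffer, which is why equality of efforts is required for an exact merge.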

Therefore, if $T^*$ is the optimal fanout tree with the proper sizing of buffers, it can be split into a fanout-free tree consisting of a set of buffer chains, $\overline{T}$, which has the same area as $T^*$, according to Theorem 4, and also satisfies the timing and input capacitance constraints (Fig. 5). First, $\overline{T}$ will be found by using the optimal algorithm presented in Section A. The method used to transform $\overline{T}$ into $T^*$ will be discussed later.

The 1FO problem was stated such that the maximum input capacitance allowed was given. Therefore, before the mFO problem can be broken down into 1FO problems, different portions of $C_{in}$ need to be allocated to each branch (Fig. 5).



Fig. 5: Input Capacitance Allocation for a Fanout-free Buffer Tree.

Input Capacitance Allocation (ICA) Problem: Given a number of sinks, each with a required time, capacitive load, and required polarity, and a total budget on input capacitance $C_{in}$, allocate portions of $C_{in}$ to each branch such that the total area is minimized while the given constraints for all sinks are satisfied.

In this section it is first proven that the ICA problem is NP-Complete and then a heuristic is proposed for solving this problem. Intuitively speaking, the input capacitance allocation problem is similar to the Knapsack problem, where objects of the Knapsack problem correspond to the capacitance budgets of each branch and the total capacitance is limited by the input capacitance constraint $C_{in}$, which corresponds to the Knapsack volume.

Before it can be formally proven that this problem is NP-Complete, the behavior of area must be studied as a function of input capacitance for each buffer chain. The valid range for the buffer count on branch i is $[1, T_{R_i}/p_{inv}]$, according to (10). For each buffer count n in this range, there exists a maximum electrical effort for the buffer chain, according to (8). Therefore, because of the capacitive constraint in equation (6), there exists a minimum required input capacitance:

$$\underline{C}_{i} = \frac{C_{L_{i}}}{\left(\frac{T_{R_{i}} - np_{inv}}{n}\right)^{n}} \quad (15)$$

where the denominator is the maximum value that can be achieved by $\prod h$, according to equation (8). On the other hand, there exists a maximum beneficial input capacitance, $\overline{C}_i$, for each buffer count, which means that allocating an input capacitance larger than $\overline{C}_i$ will not improve area any further. This value can be calculated by solving the same optimization problem as in equation (11) but with the capacitive constraint dropped, i.e.,

$$\{h\} = \begin{cases} Min & area_n \\ st: & delay_n \leq T_{R_i} \end{cases}$$

and then calculating  $\overline{C}_i$  as follows:

$$\overline{C}_i = \frac{C_{L_i}}{\prod h}$$

Obviously, any input capacitance larger than  $\overline{C}_i$  will not improve area any further because allocating  $\overline{C}_i$  already results in the same solution as when the capacitance constraint is dropped.

Now that there exists a range for input capacitance for each buffer count, it can be proven that area is a decreasing function of input capacitance in this range.

Theorem 5: For a fixed number of buffers in a buffer chain, the area cost is a decreasing function of input capacitance for $\underline{C}_i \leq C_{in} \leq \overline{C}_i$.

**Proof:** Increasing the input capacitance, $C_{in}$, for a branch will decrease the ratio $C_L/C_{in}$ in the capacitive constraint of the optimization problem in equation (11). Therefore, there either exists a better solution with smaller area or, if not, the same solution with the same area is still achievable. Hence, increasing the input capacitance will not increase area, and therefore, area is a decreasing function of input capacitance and the claim is proven.

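Area vs. input capacitance for some buffer count n therefore looks like the graph shown in Fig. 6a. The curves for different buffer counts are shown in Fig. 6b, and the minimum area over all buffer counts as a function of input capacitance is shown in Fig. 6c.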

Theorem 6: The ICA problem is NP-Complete.

Proof: To perform the proof, the 0-1 Knapsack problem will be reduced to the ICA problem. In the conventional version of the Knapsack problem, each item has a size and a value and the objective is to maximize the total value. In the ICA problem, however, the objective is to minimize area. Therefore, we will consider the negative of area, rather than the area itself, so as to make the problem a maximization problem rather than a minimization one (Fig. 7a).



Fig. 6a: Area vs. Input Cap. for Some Buffer Count n.

Fig. 6b: Area vs. Input Cap. for Different Buffer Counts.



Fig. 6c: Minimum Area vs. Input Cap.



The value vs. size curve for some item of the 0-1 Knapsack problem is shown in Fig. 7b. The point about this graph is that it is not a continuous one. For sizes below $s_0$, the value is zero, and for sizes greater than or equal to $s_0$, the value is $v_0$. Assuming $\delta$ to be the accuracy of the model, the graph can be modified so that the value rises over an interval of width $\delta$, which makes it a continuous one. With this modification, each Knapsack item can be represented by a branch of the ICA problem. Since the 0-1 Knapsack problem is NP-Complete, the ICA problem is NP-Complete as well.

After proving that ICA is an NP-Complete problem, this section proceeds by proposing a heuristic method for allocating input capacitances to each branch.

Let *m* denote the number of sinks and thus the number of branches. Consider the k-th branch ($1 \le k \le m$); $\overline{H}_k$, the maximum electrical effort of the k-th branch, has its minimum value of 1 at $n_k = 0$ (equation (9)). On the other hand, $\overline{H}_k$ cannot be any larger than $\mu(T_{R_k}, p_{inv})$, the value of $\overline{H}_k(\hat{n}_k)$ when $\hat{n}_k$ is calculated from equation (12). According to equation (5), the maximum value of $\overline{H}_k$ corresponds to the minimum value of $C_{in_k}$; therefore, the minimum required input capacitance for branch k is:

$$\underline{C}_{k} = \frac{C_{L_k}}{\mu(T_{R_k}, p_{inv})}$$
(16)

Allocating any capacitance less than $\underline{C}_k$ to any branch will make that branch infeasible. Hence, m new positive variables $x_k$ for $k = 1, \ldots, m$ are introduced such that:

$$C_{in_k} = \underline{C}_{k} + x_{k}$$
 (17)

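This way, one can be sure that the minimum required capacitance is allocated to each branch. The remaining input capacitance budget is then distributed among the $x_k$'s in a ratio determined by the slopes estimated for the different branches (Fig. 8).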



Fig. 8: Different Slopes Corresponding to Different Branches.

The proposed heuristic is shown in Fig. 9. Line 4 finds  $x_k$ 's such that the desired ratio between them, as discussed above, is fulfilled.

The slope for each branch is estimated as follows:

$$slope_k = \frac{y_{max_k} - y_{min_k}}{x_{max_k} - x_{min_k}} = \frac{\mu(T_{R_k}, p_{inv}) - 1}{T_{R_k}\lambda(p_{inv}) - 0}$$
 (18)
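The quantities in equations (16) and (18) can be computed directly. The sketch below takes $\mu(T_R, p_{inv})$ to be the value of $\overline{H}$ at the real-valued peak $\hat{n} = T_R \lambda(p_{inv})$, evaluates the minimum capacitance and slope for each branch, and reports how much of the budget is left to distribute. It does not implement the allocation rule of Fig. 9; the branch data, budget, and function names are hypothetical.

```python
import math


def lambert_w(x: float, iterations: int = 50) -> float:
    """Solve w * exp(w) = x for x >= 0 by Newton's method."""
    w = math.log1p(x)
    for _ in range(iterations):
        ew = math.exp(w)
        w -= (w * ew - x) / (ew * (w + 1.0))
    return w


def lam(p_inv: float) -> float:
    """Equation (13)."""
    w = lambert_w(p_inv / math.e)
    return w / (p_inv * (w + 1.0))


def mu(t_req: float, p_inv: float) -> float:
    """Peak of H_bar over real n, reached at n_hat = t_req * lam(p_inv)."""
    n_hat = t_req * lam(p_inv)
    return ((t_req - n_hat * p_inv) / n_hat) ** n_hat


def min_branch_cap(c_load: float, t_req: float, p_inv: float) -> float:
    """Equation (16): smallest input capacitance that keeps a branch feasible."""
    return c_load / mu(t_req, p_inv)


def branch_slope(t_req: float, p_inv: float) -> float:
    """Equation (18): slope estimate for a branch."""
    return (mu(t_req, p_inv) - 1.0) / (t_req * lam(p_inv))


P_INV, C_IN = 0.6, 2.0                    # hypothetical process value and budget
branches = [(50.0, 14.0), (20.0, 10.0)]   # hypothetical (C_L, T_R) per sink

min_caps = [min_branch_cap(c, t, P_INV) for c, t in branches]
slopes = [branch_slope(t, P_INV) for _, t in branches]
leftover = C_IN - sum(min_caps)           # budget left for the x_k variables
print(min_caps, slopes, leftover)
```

A negative `leftover` would mean that the budget $C_{in}$ cannot cover even the minimum requirements, so no allocation could make every branch feasible.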

Fig. 9: Algorithm InCapAlloc.

After finding the allocated input capacitances, m instances of the 1FO problem will be generated that can be optimally solved by the algorithm presented in Section A.

# C. Merging Buffer Chains

So far, a continuous-sized buffer library has been assumed. In reality, the ASIC library has a finite (and hopefully large) number of inverter sizes. So the solution needs to be mapped to the inverters available in the library. The main problem when rounding the inverter sizes is that it may result in significant errors. To alleviate this problem, the merging transformation, which is the opposite of the split transformation introduced in Fig. 4, is used.

To show how this works, recall Theorem 4. If the electrical efforts of the buffers on two branches are equal, one can merge them and replace them with a single buffer with the same electrical effort. Note that simply because the electrical efforts of the buffers are the same, one cannot conclude that the buffer sizes are also the same. As shown in Fig. 4, the sizes of the buffers before merging are $C_1/h$ and $C_2/h$, respectively, and the size of the buffer after merging is $(C_1 + C_2)/h$. Therefore, the size of the buffer after merging is equal to the summation of buffer sizes before merging. This fact can be used to reduce the rounding errors. As an example, if two buffers of size 0.35 are merged into a single buffer, the size would be 0.7, and rounding it to a buffer of size 1 would result in a smaller error than rounding each of the two buffers separately.

Clearly one has to be concerned about satisfying the required time and input capacitance constraints when performing this transformation. The merging should be performed in such a way that all timing constraints are satisfied and the area (as well as the input capacitance of the very first stage) is the same. As noted in the proof of Theorem 4, for the merging transformation to produce the exact same area and delay, the electrical efforts of the buffers to be merged must be equal. However, because each branch of the fanout-free tree is optimized separately with respect to the corresponding sink, the electrical efforts of the buffers may not necessarily be equal. Thus a constant r is defined and two buffers are merged if the difference between their electrical efforts is less than or equal to r percent. In addition, two buffers are merged if the rounding error after merging the two is smaller than the summation of rounding errors of each buffer before the merge operation. Obviously, the efficiency of this approach is dependent on the order in which the buffers are selected to be merged. The approach presented here is to cluster the buffers into groups of nearly equal electrical efforts and check for the merging possibilities inside each group. Merging is performed starting at the source of the signal, and proceeding toward the sinks, while at the same time preserving the area so as not to increase the capacitive load imposed on the previous stage. The pseudo-code for a recursive merging algorithm is shown in Fig. 10.

```
algorithm Merge(source)
begin
    B = all buffers driven by source;
    cluster buffers in B based on their electrical efforts;
    foreach cluster
        repeat
            pick two buffers;
            merge if it improves the rounding error;
            add merged buffer to the cluster;
        until no more merging is possible;
    foreach buffer in every cluster
        Merge(buffer);
end
```

Fig. 10: Algorithm Merge.
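A single level of the clustering and merging steps can be sketched as follows. This is an illustrative approximation rather than the authors' implementation: the recursion toward the sinks is omitted, the pair selection is a plain greedy scan, the merged buffer is assumed to take the mean of the two efforts, and the library sizes and buffer values are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Buffer:
    size: float    # continuous size (input capacitance)
    effort: float  # electrical effort h


def rounding_error(size: float, library: List[float]) -> float:
    """Distance from a continuous size to the nearest library size."""
    return min(abs(s - size) for s in library)


def cluster_by_effort(buffers: List[Buffer], r_percent: float) -> List[List[Buffer]]:
    """Group buffers whose effort is within r_percent of the cluster's smallest."""
    clusters: List[List[Buffer]] = []
    for buf in sorted(buffers, key=lambda b: b.effort):
        if clusters:
            base = clusters[-1][0].effort
            if buf.effort - base <= base * r_percent / 100.0:
                clusters[-1].append(buf)
                continue
        clusters.append([buf])
    return clusters


def merge_cluster(cluster: List[Buffer], library: List[float]) -> List[Buffer]:
    """Repeatedly merge a pair when that lowers the total rounding error."""
    work = list(cluster)
    merged_something = True
    while merged_something:
        merged_something = False
        for i in range(len(work)):
            for j in range(i + 1, len(work)):
                a, b = work[i], work[j]
                before = rounding_error(a.size, library) + rounding_error(b.size, library)
                after = rounding_error(a.size + b.size, library)
                if after < before:
                    rest = [x for k, x in enumerate(work) if k not in (i, j)]
                    # Assumption: the merged buffer takes the mean effort.
                    work = rest + [Buffer(a.size + b.size, (a.effort + b.effort) / 2)]
                    merged_something = True
                    break
            if merged_something:
                break
    return work


library = [1.0, 2.0, 4.0]                 # hypothetical inverter sizes
buffers = [Buffer(0.35, 3.0), Buffer(0.35, 3.05), Buffer(2.0, 5.0)]
clusters = cluster_by_effort(buffers, r_percent=5.0)
merged = [merge_cluster(c, library) for c in clusters]
print(merged)
```

With these numbers the two buffers of size 0.35 land in the same cluster and are merged into one buffer of size 0.7, which rounds to size 1 with an error of 0.3 instead of a combined error of 1.3.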

# IV. EXPERIMENTAL RESULTS

Three different sets of experiments were performed. In the first set, the LEOPARD algorithm of Section III was compared with an implementation of the Sutherland algorithm [4], which minimizes delay through a path. The results are reported in Table 1.

| Circuit | Sutherland Delay | Sutherland Area | LEOPARD Area | LEOPARD with 5% slack: Delay | LEOPARD with 5% slack: Area |
|---------|------------------|-----------------|--------------|------------------------------|-----------------------------|
| 1       | 6.97             | 233             | 232          | 7.32                         | 183                         |
| 2       | 6.86             | 19              | 19           | 7.20                         | 15                          |
| 3       | 15.05            | 458             | 455          | 15.80                        | 277                         |
| 4       | 12.85            | 183             | 182          | 13.49                        | 123                         |
| 5       | 8.13             | 22              | 22           | 8.53                         | 17                          |
| 6       | 11.32            | 143             | 142          | 11.89                        | 97                          |
| 7       | 6.86             | 38              | 38           | 7.20                         | 30                          |
| 8       | 12.20            | 198             | 197          | 12.81                        | 134                         |
| 9       | 13.79            | 245             | 245          | 14.48                        | 149                         |
| 10      | 8.50             | 70              | 69           | 8.93                         | 54                          |

Table 1: Comparison with Sutherland.

For all of the experiments, the minimization problems within the LEOPARD algorithm were solved with a numerical optimization package. For each circuit, the capacitive load of the sink and the maximum input capacitance at the source were given. Sutherland's algorithm was first used to minimize the delay, and the resulting delay was then used as the required time constraint for LEOPARD. As shown in Table 1, under the same delay LEOPARD produces almost the same area as Sutherland's approach, whereas allowing a 5% slack on the delay results in a substantially smaller area.

In the next set of experiments, the results from LEOPARD are compared with the results of an implementation of Kung's algorithm [3].

| Circuit | Sinks | Kung InCap | Kung Area | LEOPARD Area | LEOPARD +5% InCap: InCap | LEOPARD +5% InCap: Area |
|---------|-------|------------|-----------|--------------|--------------------------|-------------------------|
| 1       | 5     | 53.28      | 916       | 906          | 55.94                    | 739                     |
| 2       | 4     | 68.34      | 1104      | 1093         | 71.76                    | 907                     |
| 3       | 6     | 34.18      | 462       | 457          | 35.88                    | 381                     |
| 4       | 10    | 236.41     | 1463      | 1451         | 245.05                   | 1231                    |
| 5       | 4     | 153.45     | 1296      | 1284         | 161.12                   | 1079                    |
| 6       | 7     | 156.40     | 1635      | 1619         | 164.22                   | 1347                    |
| 7       | 15    | 158.62     | 5358      | 5295         | 166.55                   | 4210                    |
| 8       | 12    | 29.24      | 4342      | 4290         | 30.70                    | 3376                    |
| 9       | 9     | 21.25      | 3868      | 3820         | 22.31                    | 2995                    |
| 10      | 11    | 21.28      | 5808      | 5735         | 22.34                    | 4461                    |

Table 2: Comparison with Kung.

For each circuit, a number of sinks with capacitive loads, required times, and required polarities are given. The number of sinks for each circuit is shown in column 2. Kung's algorithm was run first, and the input capacitance it produced was then used as the input capacitance constraint for LEOPARD. The resulting areas are reported in columns 4 and 5. Columns 6 and 7 show the LEOPARD results when 5% additional input capacitance is allowed; in that case the area is reduced further for every circuit.
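The area columns of Table 2 can be summarized with a few lines of arithmetic. The sketch below is not part of the original experiments: it copies the table values (reading the circuit 7 LEOPARD area as 5295, an assumption about a damaged entry) and computes the relative area reduction of LEOPARD with respect to Kung's algorithm, both at equal input capacitance and with 5% additional input capacitance. The helper name `reduction` is hypothetical.

```python
# Rows of Table 2:
# (circuit, sinks, Kung InCap, Kung area, LEOPARD area,
#  LEOPARD InCap with +5%, LEOPARD area with +5%)
TABLE2 = [
    (1, 5, 53.28, 916, 906, 55.94, 739),
    (2, 4, 68.34, 1104, 1093, 71.76, 907),
    (3, 6, 34.18, 462, 457, 35.88, 381),
    (4, 10, 236.41, 1463, 1451, 245.05, 1231),
    (5, 4, 153.45, 1296, 1284, 161.12, 1079),
    (6, 7, 156.40, 1635, 1619, 164.22, 1347),
    (7, 15, 158.62, 5358, 5295, 166.55, 4210),  # 5295 is an assumed reading
    (8, 12, 29.24, 4342, 4290, 30.70, 3376),
    (9, 9, 21.25, 3868, 3820, 22.31, 2995),
    (10, 11, 21.28, 5808, 5735, 22.34, 4461),
]


def reduction(base, new):
    """Relative reduction of `new` with respect to `base`, in percent."""
    return 100.0 * (base - new) / base


# Area reduction at the same input capacitance as Kung's solution.
same_cap = [reduction(row[3], row[4]) for row in TABLE2]
# Area reduction when 5% additional input capacitance is allowed.
relaxed = [reduction(row[3], row[6]) for row in TABLE2]

avg_same = sum(same_cap) / len(same_cap)
avg_relaxed = sum(relaxed) / len(relaxed)

print(f"average area reduction, equal InCap: {avg_same:.2f}%")
print(f"average area reduction, +5% InCap:   {avg_relaxed:.2f}%")
```

The per-circuit lists `same_cap` and `relaxed` make it easy to see that the gain from relaxing the input capacitance constraint is much larger than the gain at equal input capacitance.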

Finally, in the last set of experiments, we compared LEOPARD with the fanout optimization program of SIS. SIS provides several fanout optimization algorithms, namely LT-Tree, Two-Level, Bottom-Up, and Balanced, and the best result among them is reported [14]. In this set of experiments, a standard cell library consisting of inverters of different sizes was used. For each inverter, the parameters of the logical effort delay model were calculated from the SIS delay model of the library. A very good match between the SIS delay and the logical effort delay model values was observed.

The fanout optimization programs of SIS were first used to perform fanout optimization. The results are reported in column 6 of Table 3. Then the delay and input capacitance resulting from SIS were used as constraints for LEOPARD. The results, assuming a continuous-size buffer library, are reported in column 3. Then merging and mapping to the real buffers in the ASIC library were performed, and the results are shown in columns 4 and 5. As shown in the table, in the case of continuous-sized buffers the area is expressed in terms of the total input capacitance of the buffers; for discrete-sized buffers, it is the actual buffer area extracted from the library. Results show an average of 38% area improvement for LEOPARD.

| Circuit | Sinks | LEOPARD cont. sizing: Σcap | LEOPARD discrete sizing: Σcap | LEOPARD discrete sizing: Area | SIS: Area |
|---------|-------|----------------------------|-------------------------------|-------------------------------|-----------|
| 1       | 12    | 0.093                      | 0.093                         | 3920                          | 5281      |
| 2       | 6     | 0.032                      | 0.039                         | 3902                          | 4676      |
| 3       | 21    | 0.065                      | 0.088                         | 6090                          | 20952     |
| 4       | 14    | 0.093                      | 0.093                         | 3920                          | 5281      |
| 5       | 21    | 0.060                      | 0.089                         | 7220                          | 11952     |
| 6       | 12    | 0.045                      | 0.062                         | 4814                          | 7857      |
| 7       | 16    | 0.087                      | 0.110                         | 6327                          | 12315     |

Table 3: Comparison with SIS.
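The 38% average quoted for Table 3 can be checked directly from the two area columns. The sketch below is an illustrative check rather than part of the original experiments; it assumes the three damaged area entries read 20952, 7220, and 6327, which is the reading under which the per-circuit improvements average to the reported figure.

```python
# Rows of Table 3: (circuit, sinks, LEOPARD discrete-sizing area, SIS area).
# The entries 20952, 7220 and 6327 are assumed readings of damaged cells.
TABLE3 = [
    (1, 12, 3920, 5281),
    (2, 6, 3902, 4676),
    (3, 21, 6090, 20952),
    (4, 14, 3920, 5281),
    (5, 21, 7220, 11952),
    (6, 12, 4814, 7857),
    (7, 16, 6327, 12315),
]

# Per-circuit area improvement of LEOPARD over SIS, in percent.
improvements = [100.0 * (sis - leo) / sis for _, _, leo, sis in TABLE3]
average_improvement = sum(improvements) / len(improvements)

for (circuit, _, _, _), gain in zip(TABLE3, improvements):
    print(f"circuit {circuit}: {gain:.1f}%")
print(f"average: {average_improvement:.1f}%")
```

Note that this is the mean of the per-circuit percentages, not the reduction in total area summed over all circuits; the two measures generally differ.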

## V. CONCLUSION

This paper presents an optimal algorithm for buffer chains to minimize area under the assumption of continuous sizing of the buffers. The algorithm finds the optimum number of buffers and the optimum sizing for them by solving a posynomial minimization problem subject to posynomial inequality constraints, which can be easily and quickly solved by a convex program solver. Based on this algorithm, a buffer tree construction method was presented for the general fanout optimization case. Considering the fact that the number of discrete sizes for buffers in typical libraries has greatly increased, the assumption of a near-continuous buffer library is valid in most cases.
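To make the phrase "posynomial minimization subject to posynomial constraints" concrete, here is a minimal sketch of a one-variable instance. It is not the formulation used in the paper. It assumes a two-inverter chain in which the first inverter has a fixed input capacitance `c_in`, the second has a free input capacitance `x`, and each stage has a normalized delay equal to its electrical effort plus a parasitic term `p`. The function names and the default `p = 1.0` are hypothetical. The problem is to minimize `x` subject to `x / c_in + c_load / x + 2 * p <= delay_budget`. The objective is a monomial, and the constraint is a posynomial once it is divided by the budget. With one free variable the optimum has a closed form: the smaller root of the quadratic obtained when the constraint is tight.

```python
import math


def chain_delay(c_in, x, c_load, p=1.0):
    """Normalized delay of the assumed two-inverter chain."""
    return x / c_in + c_load / x + 2.0 * p


def min_second_stage(c_in, c_load, delay_budget, p=1.0):
    """Smallest second-stage size x that meets the delay budget.

    Returns None when no sizing can meet the budget.
    """
    t = delay_budget - 2.0 * p      # budget left for the two effort terms
    h = c_load / c_in               # overall electrical effort of the chain
    disc = t * t - 4.0 * h          # discriminant of x^2/c_in - t*x + c_load = 0
    if t <= 0.0 or disc < 0.0:
        return None
    # The delay is below the budget between the two roots; the smaller
    # root is therefore the minimum-area feasible size.
    return c_in * (t - math.sqrt(disc)) / 2.0


# Delay-optimal sizing for c_in = 1, c_load = 16 is x = 4 with delay 10.
# Relaxing the budget to 12 allows a smaller second stage.
x_tight = min_second_stage(1.0, 16.0, 10.0)
x_relaxed = min_second_stage(1.0, 16.0, 12.0)
print(x_tight, x_relaxed, chain_delay(1.0, x_relaxed, 16.0))
```

Longer chains and trees add one variable per buffer and no longer have a closed form, which is where a general convex solver becomes necessary.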

## VI. REFERENCES

- [1] C. L. Berman, J. L. Carter, and K. F. Day, "The Fanout Problem: From Theory to Practice", in C. L. Seitz, editor, Advanced Research in VLSI: Proceedings of the 1989 Decennial Caltech Conference, pp. 69-99, MIT Press, March 1989.
- [2] K. Kodandapani, J. Grodstein, A. Domic, and H. Touati, "A Simple Algorithm for Fanout Optimization using High-Performance Buffer Libraries", Proceedings of Int'l Conference on Computer Aided Design, pp. 466-471, 1993.

- [3] D. S. Kung, "A Fast Fanout Optimization Algorithm for Near-Continuous Buffer Libraries", Proceedings of 35<sup>th</sup> Design Automation Conference, pp. 352-355, 1998.
- [4] I. E. Sutherland and R. F. Sproull, "Logical Effort: Designing for Speed on the Back of an Envelope", Advanced Research in VLSI, Univ. of Calif., Santa Cruz, 1991.
- [5] R. M. Corless, G. H. Gonnet, D. E. G. Hare, D. J. Jeffrey, and D. E. Knuth, "On the Lambert W Function", Advances in Computational Mathematics, volume 5, pp. 329-359, 1996.
- [6] P. M. Vaidya, "A New Algorithm for Minimizing Convex Functions Over Convex Sets", Proceedings of IEEE Foundations of Computer Science, pp. 332-337, Oct. 1989.
- [7] C. Mead and L. Conway, Introduction to VLSI Systems, Addison Wesley, 1980.
- [8] P. Rezvani, A. Ajami, M. Pedram and H. Savoj, "LEOPARD: A logical effort-based fanout optimizer for area and delay", Proc. of Int'l Conf. on Computer Aided Design, pp. 516-519, Nov. 1999.
- [9] M. C. Golumbic, "Combinational Merging", IEEE Trans. on Computers, vol. 25, pp. 1164-1167, Nov. 1976.
- [10] K. J. Singh, A. Sangiovanni-Vincentelli, "A Heuristic Algorithm for the Fanout Problem", Proc. of 27<sup>th</sup> Design Automation Conference, pp. 357-360, June 1990.
- [11] A. Salek, J. Lou, M. Pedram, "A Simultaneous Routing Tree Construction and Fanout Optimization Algorithm", Proc. of Int'l Conf. on Computer Aided Design, pp. 625-630, Nov. 1998.
- [12] P. Cocchini, M. Pedram, G. Piccinini, M. Zamboni, "Fanout Optimization Under a Submicron Transistor-Level Delay Model", *IEEE Trans. on Computer Aided Design*, vol. 19, no. 3, pp. 339-349.
- [13] Yu. Nesterov and A. Nemirovsky, "Interior Point Polynomial Methods in Convex Programming", SIAM, 1994.
- [14] H. J. Touati, "Performance-Oriented Technology Mapping", Ph.D. Dissertation, University of California, Berkeley, 1990.