

#### Robust Design of Power-Efficient VLSI Circuits

Massoud Pedram, University of Southern California, Dept. of Electrical Engineering. Presentation at ISPD, March 28, 2011.



System Power Optimization and Regulation Technology



#### Outline

- Background and Motivation
- Key Problem: Robust & Energy-Efficient Design
- Essential Elements of the Solution Approach
- Changing Landscape and New Opportunities
- Conclusion

# More Functionality and Higher Performance



#### **Power Efficiency**

2010 ITRS's Consumer Stationary Power Consumption







#### M. Pedram

### **Energy Cost**



#### Variations







#### Variations




#### **Overarching Goal**

Figure labels: 12.5%, 3.4 DPMO, Cp = 2, Cpk = 1.5






Figure labels (pipeline stages): Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc, Memory Access, Write Back; Next PC, Next SEQ PC

# Shrinking Design Space and Increasing Uncertainty



- Double arrows indicate the desired scaling direction
- The design space bounded by the three curves is diminishing
- Region of uncertainty for designs is increasing

# Required Components of a Global Solution

- Better characterizations, models, and calculators
- Multi-corner or statistical optimization
- Augment design for runtime adaptability
- Dynamic control based on in-situ sensing

#### A Current Source Model



- The non-linear behavior of the logic cell:
  - 2-D lookup table to store  $I(V_i, V_o)$
- Parasitic effect of the logic cell:
  - 2-D lookup tables to store  $C_i(V_i, V_o)$ ,  $C_M(V_i, V_o)$  and  $C_o(V_i, V_o)$
- A series of SPICE simulations is used to pre-characterize the components of the CSM model
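
As a rough illustration of how a 2-D table such as $I(V_i, V_o)$ might be queried between characterized grid points, the sketch below uses bilinear interpolation. The slides do not say which interpolation scheme is used, so this is an assumption; the grid, the table values, and the name `lookup_2d` are hypothetical.

```python
from bisect import bisect_right

def lookup_2d(vi_axis, vo_axis, table, vi, vo):
    """Bilinearly interpolate table[i][j], sampled at (vi_axis[i], vo_axis[j])."""
    def lower_index(axis, v):
        # Lower grid index of the cell containing v; clamped so that
        # points outside the grid are extrapolated from the edge cell.
        k = bisect_right(axis, v) - 1
        return max(0, min(k, len(axis) - 2))

    i, j = lower_index(vi_axis, vi), lower_index(vo_axis, vo)
    # Fractional position of (vi, vo) inside the selected cell.
    t = (vi - vi_axis[i]) / (vi_axis[i + 1] - vi_axis[i])
    u = (vo - vo_axis[j]) / (vo_axis[j + 1] - vo_axis[j])
    return ((1 - t) * (1 - u) * table[i][j] + t * (1 - u) * table[i + 1][j]
            + (1 - t) * u * table[i][j + 1] + t * u * table[i + 1][j + 1])

# Hypothetical 3x3 table filled with I = Vi - 0.5 * Vo (arbitrary units,
# not real cell data), chosen so the interpolated value is easy to check.
axis = [0.0, 0.6, 1.2]
table = [[vi - 0.5 * vo for vo in axis] for vi in axis]
current = lookup_2d(axis, axis, table, 0.3, 0.9)
```

The same routine would serve the capacitance tables $C_i$, $C_M$, and $C_o$, since they share the $(V_i, V_o)$ indexing.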

#### Transition to "Physical" Gate Modeling: Controlled Current Source Models



#### **Data Explosion Problem**

- Conventional logic cell delay calculation techniques ignore the actual shape of the waveform
- Current Source Modeling (e.g., ECSM)
  - Two-dimensional table of voltage waveforms in terms of input slew and output capacitance
- Size of the CSM library is a serious concern
  - Data volume is orders of magnitude greater than that of a .lib library
  - Multiple libraries are needed across the Process, Voltage, Temperature (PVT) space
  - Additionally, the CSM library may contain power, noise, and variability data
- Goals
  - Reduce library size while maintaining accuracy
  - Parameterize all waveform data vs. slew, cap, and PVT



#### Variational Waveform Modeling

- Sources of variation as input parameters:
  - Such as supply voltage, temperature, L<sub>eff</sub> and V<sub>th</sub>
- Pre-alignment operations
  - V-operators (shift and scale operations in the direction of the voltage axis)



#### **Orthonormal Transformation**



• Each normalized waveform between 0 and 1 is represented by coefficients:  $\alpha_0$ ,  $\alpha_1$ ,  $\alpha_2$ , ...,  $\alpha_n$ 

#### Library Data Compression

- In practice, each normalized waveform between 0 and 1 is represented by a fixed number of coefficients, e.g., 4: a<sub>0</sub>,...,a<sub>3</sub>
  - Time vector extraction from the cell library
  - Pre-alignment
  - Preprocessing, including shifting, scaling, averaging, and weighting
  - Basis set extraction by using (Robust) Principal Component Analysis, (R)PCA
  - Coefficient calculation for *m* "most significant" basis vectors



#### Variational Waveform Modeling

- A 65nm ECSM library with 43141 waveforms
- Nominal process corner, 1.2 volt, and 25°C
- Each gate characterized for 7x7x5x5 (input slew, output capacitive load, supply voltage and temperature) combinations
- A voltage waveform modeled by 21 uniform point increments
- Used the first five coefficients of PCA (76% compression)
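
The 76% figure can be checked with a few lines of arithmetic. The sketch below takes the counts from the slide; the assumption that the shared PCA basis is stored once as five 21-point vectors is mine, and it only matters for the second number.

```python
points_per_waveform = 21   # uniform samples per waveform (from the slide)
coeffs_per_waveform = 5    # PCA coefficients kept per waveform (from the slide)
num_waveforms = 43141      # waveforms in the 65nm ECSM library (from the slide)

# Per-waveform saving, ignoring the shared basis vectors.
compression = 1 - coeffs_per_waveform / points_per_waveform

# Assumption: the basis is stored once as 5 vectors of 21 points each.
basis_overhead = coeffs_per_waveform * points_per_waveform
original_values = num_waveforms * points_per_waveform
compressed_values = num_waveforms * coeffs_per_waveform + basis_overhead
overall = 1 - compressed_values / original_values

print(f"per-waveform: {compression:.1%}, with basis overhead: {overall:.1%}")
```

With this many waveforms the basis overhead is negligible, so both numbers round to the quoted 76%.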



# Required Components of a Global Solution

- Better characterizations, models, and calculators
- Multi-corner or statistical optimization
- Augment design for runtime adaptability
- Dynamic control based on in-situ sensing

# **Optimization flow: Multi-Corners** + Multi-Modes



### **Multiobjective** Optimization

We consider multiobjective optimization problems of the form

$$\text{minimize} \quad \{f_1(\mathbf{x}), f_2(\mathbf{x}), \dots, f_k(\mathbf{x})\} \quad \text{subject to} \quad \mathbf{x} \in S,$$

where

- $f_i: \mathbb{R}^n \rightarrow \mathbb{R}$ = objective function
- $k\ (\geq 2)$ = number of (conflicting) objective functions
- $\mathbf{x}$ = decision vector (of $n$ decision variables $x_i$)
- $S \subset \mathbb{R}^n$ = feasible region formed by constraint functions
- "minimize" = minimize the objective functions simultaneously


#### Concepts

- A decision maker (DM) is needed to identify a final Pareto optimal (PO) solution. (S)he has insight into the problem and can express preference relations
- An *analyst* is responsible for the mathematical side
- *Solution process* = finding a solution
- Final solution = feasible PO solution satisfying the DM
- Ranges of the PO set: *ideal objective vector* z<sup>\*</sup> (lower bounds of the PO set) and approximated *nadir objective vector* z<sup>nad</sup> (upper bounds of the PO set)
- Utopian objective vector, z<sup>\*\*</sup>, is strictly better than z<sup>\*</sup>

$$f_i^* = \min_{x \in S} f_i(x)$$

$$f_{i}^{nad} = \max_{1 \le j \le k} f_{i}(x_{j}^{*}), \qquad x_{j}^{*} = \operatorname*{argmin}_{x \in S} f_{j}(x)$$
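
A minimal sketch of how the ideal vector and the payoff-table estimate of the nadir vector could be computed, assuming the feasible region is approximated by a finite set of sample points. The two quadratic objectives and the helper name `ideal_and_nadir` are hypothetical.

```python
def ideal_and_nadir(objectives, candidates):
    """Estimate the ideal and nadir vectors over a finite candidate set.

    objectives: list of k functions f_i(x)
    candidates: sampled points of the feasible region S
    """
    # x_j^* = argmin over the candidates of f_j
    minimizers = [min(candidates, key=f) for f in objectives]
    # f_i^* = f_i evaluated at its own minimizer
    ideal = [f(x) for f, x in zip(objectives, minimizers)]
    # f_i^nad = max over j of f_i(x_j^*)  (payoff-table estimate)
    nadir = [max(f(x) for x in minimizers) for f in objectives]
    return ideal, nadir

# Toy bi-objective problem on S = [0, 2], sampled every 0.01.
f1 = lambda x: x ** 2
f2 = lambda x: (x - 2) ** 2
S = [i / 100 for i in range(201)]
ideal, nadir = ideal_and_nadir([f1, f2], S)
```

The payoff table is only an approximation of the nadir vector when there are more than two objectives, which is why the slide calls it "approximated".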




#### Concepts cont.

- A value function U: R<sup>k</sup>→R may represent preferences; at other times the DM is expected to be maximizing some value (or utility)
  - We use the notation  $u: \mathbb{R}^n \rightarrow \mathbb{R}$  where u(x) = U(f(x))
- If U(z<sup>1</sup>) > U(z<sup>2</sup>), then the DM prefers z<sup>1</sup> to z<sup>2</sup>. If U(z<sup>1</sup>) = U(z<sup>2</sup>), then z<sup>1</sup> and z<sup>2</sup> are equally good (indifferent)
- Decision making can be thought of as either value maximization or *satisficing*



 Problems are usually solved by scalarization, where a real-valued objective function is formed (depending on parameters). Then, single objective optimizers can be used!



# A Posteriori Methods: Weighting and ε-Constraint Methods

- Generate the PO set (or a part of it); Present it to the DM; Let the DM select one
- **Weighting method:**

$$\text{minimize} \quad \sum_{i=1}^{k} w_i f_i(\mathbf{x}) \quad \text{subject to} \quad \mathbf{x} \in S,$$

  where $w_i > 0$ for all $i = 1, \ldots, k$ and $\sum_{i=1}^{k} w_i = 1$.

- **ε-constraint method:**

$$\text{minimize} \quad f_{\ell}(\mathbf{x}) \quad \text{subject to} \quad f_j(\mathbf{x}) \leq \varepsilon_j \ \text{for all } j = 1, \ldots, k,\ j \neq \ell, \quad \mathbf{x} \in S.$$
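
The two scalarizations named in this slide's title can be sketched on a toy problem. The objectives, the sampled feasible region, and the function names below are hypothetical, and the grid search stands in for whatever single-objective optimizer would be used in practice.

```python
# Toy bi-objective problem on S = [0, 2], sampled every 0.01.
f1 = lambda x: x ** 2
f2 = lambda x: (x - 2) ** 2
S = [i / 100 for i in range(201)]

def weighting_method(w1, w2):
    """Minimize the weighted sum w1*f1 + w2*f2 over S."""
    return min(S, key=lambda x: w1 * f1(x) + w2 * f2(x))

def eps_constraint(eps2):
    """Minimize f1 over S subject to f2(x) <= eps2."""
    feasible = [x for x in S if f2(x) <= eps2]
    return min(feasible, key=f1)

x_ws = weighting_method(0.5, 0.5)   # equal weights
x_eps = eps_constraint(1.0)         # bound the second objective by 1
```

Sweeping the weights or the bound `eps2` traces out different Pareto optimal points, which is how these methods generate the set that is then shown to the DM.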

# A Priori Methods: Goal Programming

- The DM must specify an aspiration level ž<sub>i</sub> for each objective function
- An objective function $f_i$ together with its aspiration level forms a *goal*. Deviations from aspiration levels are minimized ( $f_i(x) + \delta_i = \check{z}_i$, where  $\delta_i$  may be positive or negative)
- The deviations can be represented as overachievements  $\delta_i > 0$  if  $f_i(x) \le \check{z}_i$
- Weighted approach:

# Interactive Methods: Satisficing Trade-Off Method

- Idea: To classify the objective functions:
  - Functions to be improved (I<sup><</sup>)
  - Functions whose values can be relaxed (I<sup>></sup>)
  - Acceptable functions (I<sup>=</sup>)
- Notice that I<sup><</sup> ∪ I<sup>></sup> ∪ I<sup>=</sup> = {1,...,k}
- Assumptions
  - Trade-off information is available in the KT-multipliers
- Aspiration levels for functions in I<sup><</sup> are given by the DM; upper bounds for functions in I<sup>></sup> are derived from the KT-multipliers
- Satisficing decision making is emphasized

# Satisficing Trade-Off Method cont.

Problem

minimize 
$$\max_{\substack{1 \le i \le k}} \left[ \frac{f_i(\mathbf{x}) - z_i^{\star\star}}{\bar{z}_i - z_i^{\star\star}} \right]$$
subj. to  $\mathbf{x} \in S$  or



### Some Results – Transistor Sizing




#### Simulation results for the adder



| Problem type | Method | Gradient | Metric (power ×10<sup>-4</sup>, delay ×10<sup>-10</sup>) | Single-obj. power optimization | Single-obj. delay optimization | Multiobjective optimization |
|---------------------------|--------|----------------------|-------------------------------------------------------------------|-------------------------------------|-------------------------------------|--------------------------------|
| non-convex                | WS     | w/o grad             | Power                                                             | 48.011                              | 48.932                              | 48.719                         |
|                           |        |                      | Delay                                                             | 11.992                              | 10.463                              | 10.537                         |
|                           |        | w grad               | Power                                                             | 47.924                              | 48.908                              | 48.657                         |
|                           |        |                      | Delay                                                             | 12.533                              | 10.279                              | 10.389                         |
|                           | B      | w/o grad             | Power                                                             | 47.924                              | 48.633                              | 48.439                         |
|                           |        |                      | Delay                                                             | 12.533                              | 10.806                              | 11.018                         |
|                           |        | w grad               | Power                                                             | 48.019                              | 48.908                              | 48.148                         |
|                           |        |                      | Delay                                                             | 12.808                              | 10.279                              | 11.317                         |
| convex                    | WS     | w/o grad             | Power                                                             | 48.085                              | 48.495                              | 48.432                         |
|                           |        |                      | Delay                                                             | 13.001                              | 10.647                              | 10.658                         |
|                           |        | w grad               | Power                                                             | 47.809                              | 48.451                              | 48.335                         |
|                           |        |                      | Delay                                                             | 13.642                              | 10.306                              | 10.381                         |
|                           | œ      | w/o grad             | Power                                                             | 47.809                              | 48.464                              | 48.123                         |
|                           |        |                      | Delay                                                             | 13.642                              | 10.603                              | 11.408                         |
|                           |        | w grad               | Power                                                             | 47.809                              | 48.451                              | 48.102                         |
|                           |        |                      | Delay                                                             | 13.642                              | 10.306                              | 10.791                         |

# Required Components of a Global Solution

- Better characterizations, models, and calculators
- Multi-corner or statistical optimization
- Augment design for runtime adaptability
- Dynamic control based on in-situ sensing

# On-Chip Power Delivery Network

- A voltage regulator module (VRM) converts and regulates a DC power source
  - A VRM can typically produce one of a number of distinct voltage levels based on user-specified voltage identification (VID) code
  - A VRM must meet various voltage tolerances such as voltage droops, output ripple and noise, dynamic load limit, etc.
    - Depending on these tolerances, the response time of the VRM can change



An example of voltage droops

### Voltage Regulator Module Tree

- A PDN configuration is defined by a VRM tree, which is a rooted tree with
  - A root node representing the main power source (battery)
  - Internal nodes denoting various VRM's
  - Leaf nodes representing functional blocks (FB's) with their supply voltage and peak current levels
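
A minimal sketch of the VRM tree as a data structure, assuming a simple recursive node type. The class name, the field names, and the example voltages and currents are hypothetical, and the power roll-up ignores VRM conversion losses.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    name: str
    kind: str                      # "source" (root), "vrm" (internal), or "fb" (leaf)
    voltage: float = 0.0           # FB supply voltage in volts (leaves only)
    peak_current: float = 0.0      # FB peak current in amperes (leaves only)
    children: List["Node"] = field(default_factory=list)

def peak_power(node: Node) -> float:
    """Sum of V * I_peak over all functional blocks below this node.

    Assumes all blocks peak at once and VRMs are lossless, so this is
    only a rough bound on what the node must deliver.
    """
    if node.kind == "fb":
        return node.voltage * node.peak_current
    return sum(peak_power(child) for child in node.children)

# Hypothetical PDN: one battery, two VRMs, three functional blocks.
tree = Node("battery", "source", children=[
    Node("VRM1", "vrm", children=[
        Node("core", "fb", voltage=1.0, peak_current=2.0),
        Node("cache", "fb", voltage=1.0, peak_current=0.5),
    ]),
    Node("VRM2", "vrm", children=[
        Node("io", "fb", voltage=1.8, peak_current=0.25),
    ]),
])
total = peak_power(tree)
```

A VRM-to-FB mapping is then simply the choice of which `vrm` node each `fb` leaf hangs under.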



VRM tree with VRM-to-FB mapping

# Energy-Delay Optimal Pipeline Design

- Key idea: Allow the data to pass through a flip flop during some transparency window, instead of on the triggering edge of a clock
- Benefit: Enables opportunistic time borrowing across adjacent pipeline stages
  - The goal is to provide the timing-critical stages with more time to complete their computations
    - Thus, reducing the probability of timing errors
- Determine values of:
  - operating frequency, supply voltage level, transparency window sizes of the individual soft-edge FF-sets
- Problem Formulations:
  - stage delays are modeled as random variables due to process variations
  - allow timing violations to take place, but then implement a mechanism to detect and fix them



#### Soft Edge Flip Flops




#### **Power-Delay Optimal Soft Pipeline**

#### Design of power-delay Optimal Soft Pipeline (OSP)

- minimize total power-delay (energy) of an N-stage pipeline, by finding optimal values of:
  - global supply voltage
  - pipeline clock period
  - transparency windows of SEFF sets
- Solution:
  - enumerate all possible values for *v*,
  - for each *v*, optimally solve a quadratic program, OSP-FV:



#### **SEFF** with Built-in Error Detection

- A shadow latch resamples the input data with some phase delay
  - Need a phase-shifted global clock signal, PS-Clk
- The two sampled data are compared to one another to detect and flag errors





#### Error-Tolerant Statistical Power-Delay Optimal Soft Pipeline (ESOSP)

- minimizes total power-delay
- aggressively scales down the pipeline clock period to improve performance
- employs "SEFF with error detection"
- uses statistical timing constraints
- variables:
  - global supply voltage
  - pipeline clock period
  - transparency windows of SEFF sets
- expected value of power-delay:

$$\Phi = (1 - q_j)P_jT_{clk,j} + q_j(P_j + P_{p,j})\gamma T_{clk,j}$$
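
The expression above can be evaluated directly. The slide does not define its symbols, so the sketch below assumes that $q_j$ is the probability of a timing error, $P_{p,j}$ is the extra power spent on error recovery, and $\gamma$ is the factor by which an error stretches the clock period. The numeric values are hypothetical.

```python
def expected_power_delay(q, p, p_penalty, t_clk, gamma):
    """Expected power-delay: error-free term plus error-recovery term.

    Assumed meanings (not stated on the slide):
      q         probability of a timing error
      p         power in normal operation
      p_penalty additional power during error recovery
      t_clk     clock period
      gamma     clock-period multiplier when an error is corrected
    """
    no_error = (1 - q) * p * t_clk
    with_error = q * (p + p_penalty) * gamma * t_clk
    return no_error + with_error

# Hypothetical normalized values: 1% error rate, 20% power penalty,
# and a recovery that costs two clock periods.
phi = expected_power_delay(q=0.01, p=1.0, p_penalty=0.2, t_clk=1.0, gamma=2.0)
```

Under these assumptions, shortening `t_clk` lowers the first term but raises `q`, which is the trade-off the optimization has to balance.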

- Solution:
  - enumerate all possible values for v,
  - for each v, optimally solve a quadratic program, ESOSP-FV:



#### Multi-Threshold CMOS

- High-V<sub>th</sub> power switches are connected to low-V<sub>th</sub> logic gates
  - Achieves high performance due to low-V<sub>th</sub> logic gates
  - Reduces leakage power dramatically due to the series-connected high-V<sub>th</sub> power switch
- Typically only a header or a footer sleep transistor is used, not both
- A single sleep transistor may be shared among several logic gates



#### **Tri-Modal MTCMOS Switch**



| SLEEP/DROWSY | Multi-Mode Switch<br>Function |
|--------------|-------------------------------|
| 0X           | Active                        |
| 10           | Sleep                         |
| 11           | Drowsy                        |
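
The truth table above maps onto a small decoder. This is only a behavioral sketch; the function name `switch_mode` is hypothetical.

```python
def switch_mode(sleep: int, drowsy: int) -> str:
    """Decode the SLEEP/DROWSY control bits of the tri-modal switch."""
    if sleep == 0:
        return "Active"          # 0X: DROWSY is a don't-care
    return "Drowsy" if drowsy == 1 else "Sleep"   # 11 / 10

modes = {(s, d): switch_mode(s, d) for s in (0, 1) for d in (0, 1)}
```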


#### Multimodal Power-Gated Pipeline Architecture

 Use different TM (Tri-Modal) switches for pipeline registers and for combinational logic gates so as to enable different power-gating modes for these circuit elements



#### **Experimental Results**

- We designed and implemented a 16×16 pipelined carry-save multiplier (CSM) using TSMC 0.18 µm CMOS
  - The circuit is divided into two pipeline stages
  - The 46-bit output of the first stage is latched into the pipeline registers (46 FF's)
  - The first 16 bits out of these 46 bits make the first 16 bits of the product and are passed to the output directly
  - The last 30 bits are passed to the second stage, which produces the last 16 bits of the product



#### Results, cont'd

- Four circuits in standby mode are compared:
  - CMOS
  - Deep-sleep MTCMOS: all the cells (including FF's and logic gates) are in the sleep mode
  - Drowsy MTCMOS: all the cells (including FF's and gates) are in the drowsy mode
  - Data-retentive MTCMOS: logic cells are in sleep mode and pipeline FF's are in drowsy mode

| Circuit Type   | Leakage<br>(nA) | Ground-<br>Bounce<br>(mV) | Wakeup/Ready<br>Latency (ns) |
|----------------|-----------------|---------------------------|------------------------------|
| CMOS           | 63              | -                         | -                            |
| Deep-Sleep     | 0.10            | 473                       | 19.32                        |
| Drowsy         | 48              | 143                       | 4.83                         |
| Data-Retentive | 2.85            | 441                       | 19.32                        |
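
To put the leakage column in perspective, the short calculation below derives the reduction factor of each standby mode relative to plain CMOS. It uses only the numbers in the table above.

```python
# Standby leakage in nA, copied from the table.
leakage_na = {
    "CMOS": 63.0,
    "Deep-Sleep": 0.10,
    "Drowsy": 48.0,
    "Data-Retentive": 2.85,
}

# Reduction factor relative to the CMOS baseline (higher is better).
reduction = {name: leakage_na["CMOS"] / value for name, value in leakage_na.items()}

for name, factor in reduction.items():
    print(f"{name}: {factor:.1f}x lower leakage than CMOS")
```

The data-retentive configuration keeps most of the deep-sleep saving (roughly 22x versus 630x) while the pipeline FF's remain in the drowsy mode.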

| Circuit Type | Stage delay (ns) | Cell area (µm²) | Wire length (µm), n<sub>i</sub>=1 | Wire length (µm), n<sub>i</sub>=2 |
|--------------|------------------|-----------------|-----------------------------------|-----------------------------------|
| CMOS         | 4.54             | 54720           | 54402.6                           | 54402.6                           |
| MTCMOS       | 4.83             | 55710           | 59008.4                           | 56077.2                           |
| % Increase   | 6.4              | 1.8             | 8.5                               | 3.1                               |

# Required Components of a Global Solution

- Better characterizations, models, and calculators
- Multi-corner or statistical optimization
- Augment design for runtime adaptability
- Dynamic control based on in-situ sensing

#### Architectures That Tolerate Uncertainty

- Due to variations and computational constraints, there will be significant uncertainty about the real state of a circuit as a result of an optimization decision
- Need solution techniques that can learn from their mistakes and/or successes on similar problem instances encountered in the past to improve the quality of their decision making
- Utilize a partially observable (semi-)Markov decision process model



#### Supervisory Mode and Dynamic Control



To avoid over-constraining a system in terms of its power dissipation or temperature, one must adopt online control and supervision solutions that change the system behavior on the fly so as to maximize performance without violating the constraints (or provide the required performance while minimizing energy consumption).

# Idleness, State Transitions, and Power Savings



# Reality: Non-Energy-Proportional Servers



$$EIF_{U} = \frac{P_{U}}{P_{1}U}$$

Here  $P_1$  denotes server power dissipation at 100% utilization, whereas  $P_U$  is power at utilization level of U

An energy proportional system will have an Energy Inefficiency Factor (EIF) of one at all utilization levels
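
A short numerical sketch of the EIF definition above. The linear power model, the 200 W peak power, and the 50% idle fraction are hypothetical values chosen only to show how far a server with high idle power sits from the energy-proportional ideal.

```python
def eif(p_u: float, p_1: float, u: float) -> float:
    """Energy Inefficiency Factor: EIF_U = P_U / (P_1 * U)."""
    return p_u / (p_1 * u)

def linear_power(u: float, p_1: float = 200.0, idle_fraction: float = 0.5) -> float:
    """Hypothetical server model: idle power plus a term linear in utilization."""
    p_idle = idle_fraction * p_1
    return p_idle + (p_1 - p_idle) * u

P1 = 200.0  # hypothetical power at 100% utilization, in watts
eif_by_utilization = {u: eif(linear_power(u, P1), P1, u) for u in (0.1, 0.5, 1.0)}
```

In this model the EIF is 1 only at full utilization and grows quickly as utilization drops, since the idle power is paid regardless of load.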

#### Minimum Energy CMP Design with Core Consolidation and DVFS

- Chip Multiprocessor system with M identical cores
  - Per-core DVFS with N (voltage-frequency) configurations
  - Local Queue
- Power Management Unit
- Global Queue
- Task Dispatcher
- Tasks
  - Expected execution time, τ
  - Expected instructions per cycle
- Objective: Minimize the total Energy consumption of cores in a multicore system
- Constraints: Minimum required throughput, IPSreq
- Variables:
  - Number of active cores (total number of cores: M)
  - v-f settings of cores (total settings: J)
  - Task distribution (total number of tasks: K)





#### **A Hierarchical Solution**

- Determine the number of ON cores
  - ON cores are in C0 when active or C2 (halt or sleep) when idle
- Determine operating frequency of ON cores
  - A feedback-based control method is adopted for DVFS setting
    - This is needed due to inherent uncertainty and variability of task characteristics
    - PI Controller: controller adjusts the v-f setting to match the required throughput based on the observed error
    - Feedback control loop determines a single optimum frequency for all cores and then the Quantizer would translate  $f_{opt}$  to the available DVFS levels
- Find a feasible assignment of tasks to the ON cores
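
The feedback step of the hierarchy above can be sketched as a discrete PI controller followed by a quantizer. The gains, the frequency levels, and the throughput numbers are hypothetical, and rounding up to the next available level is an assumption made here so that the throughput requirement is not undershot.

```python
class PIController:
    """Discrete PI controller producing a continuous target frequency f_opt."""

    def __init__(self, kp: float, ki: float, f_init: float):
        self.kp, self.ki = kp, ki
        self.f_init = f_init
        self.integral = 0.0

    def update(self, ips_required: float, ips_observed: float) -> float:
        error = ips_required - ips_observed   # observed throughput error
        self.integral += error                # accumulated error (I term)
        return self.f_init + self.kp * error + self.ki * self.integral

def quantize(f_opt: float, levels):
    """Map f_opt to the lowest available level >= f_opt (highest if none)."""
    for f in sorted(levels):
        if f >= f_opt:
            return f
    return max(levels)

# One control step with hypothetical numbers (GHz and giga-instructions/s).
levels_ghz = [1.0, 1.5, 2.0, 2.5]
controller = PIController(kp=0.2, ki=0.1, f_init=1.0)
f_opt = controller.update(ips_required=6.0, ips_observed=3.6)
f_set = quantize(f_opt, levels_ghz)
```

Because the quantizer is coarse, a real controller would likely need hysteresis or a dead band to avoid toggling between adjacent levels; that detail is not covered in the slides.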





#### **Simulation Results**

- Comparison to a relatively energy-efficient baseline PM
  - It implements the same power reduction techniques as our method
  - It utilizes open loop DVFS, and does not support smart wakeup and shutdown
- The figure compares the frequency setting and throughput
  - The baseline PM always runs with all cores ON
  - The closed loop frequency is always lower for the same throughput
  - In the second 100ms, the baseline chooses lower frequency with four cores running while our proposed method uses only three cores



#### Changing Landscape: Smart Grid and Dynamic Pricing



Source: M. Martonosi


Published: August 09, 2009 11:26 pm



#### Ameren offers power by the hour

Users can track price of electricity

BY MIKE HELENTHAL Commercial-News

DANVILLE — Illinois customers with a day-trader's attitude can save nearly 15 percent on their electricity bills under a new program offered by Ameren.

The two-year-old Power Smart Pricing program was created after Illinois legislators, at the urging of power industry watchdog Citizens Utility Board, required state utility companies to offer pricing programs rewarding customers for "green" diligence.

Area Ameren customers received information on the program with this month's bill, but so far only 5,000 Illinois customers have signed up. Ameren subsidiaries serve some 2.4 million electric customers in Illinois and Missouri.

"I don't think that many people know about it yet," said Jim Chilsen, CUB communications manager. "It can be a big money-saver for the right customer and there are very specific things you should consider. It's a good program, but it's not for everyone."

The program offers customers the ability to track in real time, via the Web, the day-ending regional commodity price of electricity. And as the rate fluctuates, participants can adjust their usage to avoid peak rates the following day.

"You don't have to turn everything off and you don't have to sit around in the dark," said Stephanie Folk, a spokeswoman for CNT Energy.




### Electrical Energy Storage Systems




#### Conclusion

- These are exciting times with many new opportunities and challenges due to planned upgrades to the Power Grid, introduction of renewable sources of energy, smart metering and dynamic energy pricing, people's awareness of environmental issues, etc.
- A holistic, cross-layer approach to energy efficiency and robustness is needed, which spans
  - Application efficiency and energy management, micro-architecture and system design, storage and networks, resource management and scheduling
  - Synthesis and physical design
  - Library characterization and cell design