
Toward Energy-Efficient Computing
Nikos Hardavellas – Parallel Architecture Group, Northwestern University

Energy is Shaping the IT Industry
- #1 of the Grand Challenges for Humanity in the Next 50 Years [Smalley Institute for Nanoscale Research and Technology, Rice U.]
- Computing worldwide: ~408 TWh in 2010 [Gartner]
- Datacenter energy consumption in the US: ~150 TWh in 2011 [EPA]
  - 3.8% of domestic power generation, $15B
  - CO2-equivalent emissions ≈ airline industry (2%)
- Carbon footprint of the world's data centers ≈ Czech Republic
- 20 MW: 200x lower energy/instruction (2 nJ → 10 pJ)
  - 3% of the output of an average nuclear plant!
- 10% annual growth of installed computers worldwide [Gartner]
Exponential increase in energy consumption

More Data → More Energy
- SPEC and TPC dataset growth: faster than Moore's Law
- Same trends in scientific and personal computing
- Large Hadron Collider
  - March '11: 1.6 PB of data (Tier-1)
- Large Synoptic Survey Telescope
  - 30 TB/night ≈ 2x Sloan Digital Sky Surveys per day
  - Sloan: more data than the entire history of astronomy before it
Exponential increase in energy consumption

Technology Scaling Runs Out of Steam
Transistor counts increase exponentially, but…
- Can no longer feed all cores with data fast enough (package pins do not scale) → Bandwidth Wall
- Can no longer power the entire chip (voltage, cooling do not scale) → Power Wall
- Can no longer keep costs at bay (process variation, defects) → Low Yield + Errors
Can fit 1000 cores on chip, but only a handful will be running

Main Sources of Energy Overhead
Useful computation: 0.5 pJ for an integer addition
Major energy overheads:
- Data movement: 1000 pJ across a 400 mm² chip, 16000 pJ to memory
  → Elastic Caches: adapt the cache to the workload's demands
- Processing: 2000 pJ to schedule the operation
  → Seafire: specialized computing on dark silicon
- Circuits: up to 2x voltage guardbands; low voltages + process variation → timing errors
  → Elastic Fidelity: selectively trade accuracy for energy
- Chips fundamentally limited by power: ~130 W for forced-air cooling
  → Galaxy: optically-connected disintegrated processors
[calculations for 28 nm, adapted from S. Keckler's MICRO'11 keynote]
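To make the relative magnitudes concrete, the back-of-envelope arithmetic below is a minimal C sketch that uses only the per-operation energies quoted on this slide; the 28 nm figures come from the slide, everything else is just division.

```c
#include <stdio.h>

int main(void) {
    /* Per-operation energies at 28 nm, as quoted on the slide (picojoules). */
    const double add_pj      = 0.5;     /* useful work: one integer addition    */
    const double onchip_pj   = 1000.0;  /* moving data across a 400 mm^2 chip   */
    const double memory_pj   = 16000.0; /* fetching the data from main memory   */
    const double schedule_pj = 2000.0;  /* scheduling the operation in the core */

    printf("on-chip data movement  = %6.0fx the energy of an addition\n",
           onchip_pj / add_pj);    /* 2000x  */
    printf("memory access          = %6.0fx the energy of an addition\n",
           memory_pj / add_pj);    /* 32000x */
    printf("instruction scheduling = %6.0fx the energy of an addition\n",
           schedule_pj / add_pj);  /* 4000x  */
    return 0;
}
```

The point of the slide follows directly: the arithmetic itself is noise compared to the cost of moving and scheduling the data it operates on.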

Overcoming Data Movement and Processing Overheads
Elastic Caches: adapt the cache to the workload's demands
- Significant energy spent on data movement and coherence requests
- Co-locate data, metadata, and computation
- Decouple address from placement location
- Capitalize on existing OS events → simplify hardware
- Cut on-chip interconnect traffic, minimize off-chip misses
Seafire: specialized computing on dark silicon
- Repurpose dark silicon to implement specialized cores
- Application cherry-picks a few cores; the rest of the chip is powered off
- Vast unused area → many specialized cores → likely to find good matches
- 12x lower energy (conservative)

Overcoming Circuit Overheads and the Power Wall
Elastic Fidelity: selectively trade accuracy for energy
- We don't always need 100% accuracy, but the hardware always provides it
- Language constructs specify the required fidelity for code/data segments
- Steer computation to execution/storage units with the appropriate fidelity and lower voltage
- 35% lower energy
Galaxy: optically-connected disintegrated processors
- Split the chip into chiplets, connect them with optical fibers
- Spread out in space → easy cooling → push away the power wall
- Similarly for bandwidth and yield
- 2-3x speedup over the best alternative
- 53% lower Energy × Delay product on average over the best alternative
[Images: sample output with no errors vs. 10% errors]

Outline
- Overview
➔ Energy scalability for server chips
- Where do we go from here?
  - Short term: Elastic Caches
  - Medium term: Specialized Computing on Dark Silicon
  - Medium-long term: Elastic Fidelity
  - Long term: Optically-Connected Disintegrated Processors
- Summary

Performance Reality: The Free Ride is Over
Physical constraints limit chip scalability [NRC]

Pin Bandwidth Scaling [TU Berlin]
Cannot feed cores with data fast enough to keep them busy

Breaking the Bandwidth Wall: 3D Die Stacking [Loh et al., ISCA'08] [IBM] [Philips]
Delivers TB/sec of bandwidth; use as a large "in-package" cache

Voltage Scaling Has Slowed
In the last decade: 10x transistors, but only 30% lower voltage
"Economic Meltdown of Moore's Law" [Kenneth Brill, Uptime Institute]

Chip Power Scaling [Azizi 2010]
Cooling does not scale! Chips are getting too hot!

The New Cooking Sensation! [Huang]

Where Does Server Energy Go?
Many sources of power consumption:
- Infrastructure (power distribution, room cooling)
  - State-of-the-art data centers push PUE below 1.1
  - Facebook Prineville: 1.07
  - Yahoo! Chillerless Data Center: 1.08
  - Less than 10% wasted on infrastructure
- Servers [Fan, ISCA'07]
  - Processor chips (37%)
  - Memory (17%)
  - Peripherals (29%)
  - …

First-Order Analytical Modeling [Hardavellas, IEEE Micro 2011] [Hardavellas, USENIX ;login: 2012]
Physical characteristics modeled after UltraSPARC T2, ARM11
- Area: cores + caches = 72% of the die, scaled across technologies
- Power: ITRS projections of Vdd, Vth, Cgate, Isub, Wgate, S0
  - Active: cores = f(GHz), cache = f(access rate), NoC = f(hops)
  - Leakage: f(area), f(devices)
  - Devices/ITRS: bulk planar CMOS, UTB-FD SOI, FinFETs, HP/LOP
- Bandwidth: ITRS projections of I/O pins and off-chip clock, f(miss rate, GHz)
- Performance: CPI model based on miss rate
  - Parameters from real server workloads (DB2, Oracle, Apache)
  - Cache miss rate model (validated), Amdahl & Myhrvold laws
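The slide only names the model's ingredients, so the C sketch below shows one way such a first-order search could be wired together: a miss-rate-driven CPI model evaluated over a design space, with designs violating the power or bandwidth budget discarded. All constants (power budget, pin bandwidth, memory latency, the miss-rate curve, the toy power model) are illustrative placeholders, not the paper's calibrated values.

```c
#include <math.h>
#include <stdio.h>

/* Illustrative constants only -- placeholders, not the paper's calibrated values. */
#define POWER_BUDGET_W  130.0   /* forced-air cooling limit quoted earlier          */
#define PIN_BW_GBS      200.0   /* assumed off-chip bandwidth available to the chip */
#define MEM_LAT_CYCLES  200.0   /* assumed average off-chip access penalty          */
#define CPI_BASE          1.0   /* CPI with a perfect cache                         */

/* Toy miss-rate curve: misses per instruction shrink with the sqrt of cache size.  */
static double miss_rate(double cache_mb) { return 0.02 / sqrt(cache_mb); }

int main(void) {
    double best_perf = 0.0, best_cache = 0.0, best_ghz = 0.0;
    int best_cores = 0;

    for (int cores = 4; cores <= 512; cores *= 2)
        for (double cache_mb = 4.0; cache_mb <= 256.0; cache_mb *= 2.0)
            for (double ghz = 1.0; ghz <= 4.0; ghz += 0.5) {
                double mpi   = miss_rate(cache_mb);
                double cpi   = CPI_BASE + mpi * MEM_LAT_CYCLES;     /* memory-latency-bound */
                double ips   = cores * ghz * 1e9 / cpi;             /* aggregate instr/sec  */
                double power = cores * 0.8 * ghz + 0.1 * cache_mb;  /* toy power model (W)  */
                double bw    = ips * mpi * 64.0 / 1e9;              /* GB/s of 64B misses   */

                if (power > POWER_BUDGET_W || bw > PIN_BW_GBS)      /* apply the two walls  */
                    continue;
                if (ips > best_perf) {
                    best_perf = ips; best_cores = cores;
                    best_cache = cache_mb; best_ghz = ghz;
                }
            }

    printf("best feasible design: %d cores @ %.1f GHz, %.0f MB cache -> %.1f GIPS\n",
           best_cores, best_ghz, best_cache, best_perf / 1e9);
    return 0;
}
```

The real model additionally scales area, leakage, and device parameters with ITRS projections and applies Amdahl's law to the parallel fraction; the sketch only illustrates how joint optimization under the power and bandwidth constraints yields the "fewer, slower cores plus more cache" designs discussed on the following slides.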

Caveats
- First-order model
  - The intent is to uncover trends relating the effects of technology-driven physical constraints to the performance of commercial workloads running on multicores
  - The intent is NOT to offer absolute numbers
- Performance model works well for workloads with low MLP
  - Database (OLTP, DSS) and web workloads are mostly memory-latency-bound
- Workloads are assumed parallel
  - Scaling server workloads is reasonable

Area vs. Power Envelope
Good news: we can fit 100s of cores. Bad news: we cannot power them all

Pack More, Slower Cores and Cheaper Cache
The reality of the power wall: a power-performance trade-off
[Plot label: VFS]

Pin Bandwidth Constraint
The bandwidth constraint favors fewer + slower cores, and more cache
[Plot label: VFS]

Example of Optimization Results
Jointly optimize parameters, subject to constraints and software trends
Bandwidth: ~2x loss; Power + Bandwidth: ~5x loss
The design is first bandwidth-constrained, then power-constrained

Performance Analysis of 3D-Stacked Multicores
Chip becomes power-constrained

Core Counts for Peak-Performance Designs
- For server workloads, designs above a modest core count are impractical
- Bandwidth + dataset scaling push up cache sizes (core area << die size)
- Physical characteristics modeled after UltraSPARC T2 (GPP) and ARM11 (EMB)

Short-Term Scaling Implications
- Caches are getting huge: need cache architectures to deal with ≫ MB
- Need to minimize data transfers → Elastic Caches
  - Adapt behavior to the executing workload to minimize transfers
  - Reactive NUCA [Hardavellas, ISCA 2009] [Hardavellas, IEEE Micro 2010]
  - Dynamic Directories [Das, DATE 2012]
Need to push back the bandwidth wall!!!

Data Placement Determines Performance
[Figure: tiled multicore, each tile pairing a core with an L2 cache slice]
Goal: place data on chip close to where they are used

L2 Directory Placement Also…
Goal: co-locate directories with data
[Figure: 32-core tiled CMP illustrating an off-chip access and the L2 directory lookup]

Elastic Caches: Cooperate With the OS and TLB
- Page granularity allows simple + practical hardware
- The core accesses the page table on every access anyway (TLB)
  - Pass information from the "directory" to the core
- Utilize already-existing SW/HW structures and events
Page table entry: VPageAddr | PhyPageAddr | Dir/Ownr ID (log2(N) bits) | P/S/T (2 bits)
TLB entry: VPageAddr | PhyPageAddr | Dir/Ownr ID (log2(N) bits) | P/S (1 bit)
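As a concrete reading of the entry formats above, the C sketch below lays out the extra fields for a hypothetical 64-tile CMP (so log2(N) = 6); the field widths and names beyond those shown on the slide are illustrative assumptions, not the actual hardware layout.

```c
#include <stdio.h>
#include <stdint.h>

/* Page classification kept by the OS (the 2-bit P/S/T field on the slide).   */
enum page_class { PAGE_PRIVATE, PAGE_SHARED, PAGE_INSTRUCTION };

/* Assumed tile count for this sketch: 64 tiles -> log2(N) = 6 owner-ID bits. */
#define TILE_ID_BITS 6

/* Extended page table entry: VPageAddr | PhyPageAddr | Dir/Ownr ID | P/S/T.  */
struct pte_ext {
    uint64_t vpage;                     /* virtual page number                */
    uint64_t ppage;                     /* physical page number               */
    unsigned dir_owner : TILE_ID_BITS;  /* directory / owner tile, log2(N) b. */
    unsigned cls       : 2;             /* P/S/T classification (2 bits)      */
};

/* Extended TLB entry: same owner ID, but only a 1-bit private/shared flag.   */
struct tlb_ext {
    uint64_t vpage;
    uint64_t ppage;
    unsigned dir_owner : TILE_ID_BITS;  /* log2(N) bits                       */
    unsigned is_shared : 1;             /* P/S bit                            */
};

int main(void) {
    printf("pte_ext: %zu bytes, tlb_ext: %zu bytes\n",
           sizeof(struct pte_ext), sizeof(struct tlb_ext));
    return 0;
}
```

Because the extra state rides along in structures the core already consults on every access, the hardware cost is the logN+1 bits per TLB entry quoted on the results slide below.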

Classification Mechanisms
- Instruction classification: all accesses from the L1-I (granularity: block)
- Data classification: private/shared at TLB miss (granularity: OS page)
- Page classification is accurate (<0.5% error)
[Figure: on the first access, core i's TLB miss marks page A "private to i"; on a later TLB miss by another core j, the OS reclassifies A as "shared"]
Bookkeeping through the OS page table and TLB
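A minimal sketch of the OS-side bookkeeping the figure describes, expressed as a TLB-miss handler. The structures are a pared-down stand-in for the entry sketched above, and names such as NO_OWNER and tlb_fill() are hypothetical, not part of any real kernel interface.

```c
#include <stdio.h>

#define NO_OWNER 0xFFFFu   /* hypothetical sentinel: page not yet classified */

enum page_class { PAGE_PRIVATE, PAGE_SHARED, PAGE_INSTRUCTION };

/* Pared-down stand-in for the extended page table entry sketched above.      */
struct page_entry { unsigned dir_owner; unsigned cls; };

/* Stub for the hook that installs translation + classification in the TLB.   */
static void tlb_fill(int core, const struct page_entry *pte) {
    printf("core %d: TLB fill, class=%u, owner=%u\n", core, pte->cls, pte->dir_owner);
}

/* OS-side bookkeeping on a data-TLB miss by `core` for the page behind `pte`. */
static void classify_on_tlb_miss(int core, struct page_entry *pte) {
    if (pte->dir_owner == NO_OWNER) {
        pte->cls       = PAGE_PRIVATE;        /* first touch: private to this core */
        pte->dir_owner = (unsigned)core;
    } else if (pte->cls == PAGE_PRIVATE && pte->dir_owner != (unsigned)core) {
        pte->cls = PAGE_SHARED;               /* a second core touched it: shared  */
        /* (a real OS would also shoot down the previous owner's stale TLB entry)  */
    }
    tlb_fill(core, pte);
}

int main(void) {
    struct page_entry a = { NO_OWNER, PAGE_PRIVATE };
    classify_on_tlb_miss(3, &a);   /* core 3 touches page A first -> private to 3  */
    classify_on_tlb_miss(3, &a);   /* same core again             -> still private */
    classify_on_tlb_miss(7, &a);   /* another core touches A      -> shared        */
    return 0;
}
```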

Elastic Caches
Data placement (R-NUCA) [Hardavellas, ISCA 2009] [Hardavellas, IEEE Micro Top Picks 2010]
- Up to 32% speedup (17% avg.)
- Within 5% on avg. from an ideal cache organization
- No need for HW coherence mechanisms at the LLC
Directory placement (Dynamic Directories) [Das, DATE 2012]
- Up to 37% energy savings on the interconnect (16% avg.)
- No performance penalty (up to 9% speedup)
Negligible hardware overhead
- logN+1 bits per TLB entry, simple logic

Outline - Main Sources of Energy Overhead

Exponentially-Large Area Left Unutilized
Should we waste it?

Repurpose Dark Silicon for Specialized Cores [Hardavellas, IEEE Micro 2011] [Hardavellas, USENIX ;login: 2012]
Don't waste it; harness it instead!
- Use dark silicon to implement specialized cores
- Applications cherry-pick a few cores; the rest of the chip is powered off
- Vast unused area → many cores → likely to find good matches
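To illustrate the "cherry-pick a few cores, power off the rest" idea, the C sketch below models a hypothetical runtime that matches a kernel to a catalog of specialized cores and gates everything else dark. The catalog entries, kernel names, and per-op energies are invented for illustration; only the general scheme comes from the slide.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical catalog of specialized cores sitting in the dark-silicon area. */
struct spec_core {
    const char *kernel;       /* computational motif the core accelerates      */
    double energy_pj_per_op;  /* illustrative energy cost when powered on      */
    int powered;              /* 1 = gated on, 0 = dark                        */
};

static struct spec_core catalog[] = {
    { "fft",      2.0, 0 },
    { "h264",     1.5, 0 },
    { "sort",     3.0, 0 },
    { "general", 50.0, 1 },   /* fallback general-purpose core, always on      */
};
enum { NCORES = sizeof catalog / sizeof catalog[0] };

/* Cherry-pick the core matching `kernel`; leave every other accelerator dark. */
static struct spec_core *select_core(const char *kernel) {
    struct spec_core *chosen = &catalog[NCORES - 1];            /* default: GPP */
    for (int i = 0; i < NCORES; i++)
        if (strcmp(catalog[i].kernel, kernel) == 0) chosen = &catalog[i];
    for (int i = 0; i < NCORES - 1; i++) catalog[i].powered = 0; /* all dark    */
    chosen->powered = 1;                                         /* light one up */
    return chosen;
}

int main(void) {
    struct spec_core *c = select_core("h264");
    printf("running on '%s' core: %.1f pJ/op (vs. %.1f pJ/op on the GPP)\n",
           c->kernel, c->energy_pj_per_op, catalog[NCORES - 1].energy_pj_per_op);
    return 0;
}
```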

The New Core Design
From fat conventional cores to a sea of specialized cores [analogy by A. Chien]

Design for Dark Silicon
Sea of specialized cores, power up only what you need

Core Energy Efficiency [Azizi 2010]

First-Order Core Specialization Model
- Modeling of physically-constrained CMPs across technologies
- Model of specialized cores based on ASIC implementations of H.264
  - Implementations on custom HW (ASICs), FPGAs, and multicores (CMP)
  - Wide range of computational motifs, extensively studied
[Table from [Hameed, ISCA 2010]: frames per sec, energy per frame (mJ), and the performance and energy gaps of CMP vs. ASIC for the IME, FME, Intra, and CABAC kernels; per-kernel gaps on the order of 157x to 707x]
12x LOWER ENERGY compared to the best conventional alternative

Outline - Main Sources of Energy Overhead

100% Fidelity May Not Always Be Necessary
Loop Perforation [Sidiroglou, FSE 2011]: original

100% Fidelity May Not Always Be Necessary
Loop Perforation [Sidiroglou, FSE 2011]: 15% distortion, 2.6x speedup

100% Fidelity May Not Always Be Necessary
Loop Perforation [Sidiroglou, FSE 2011]: 3/8 cores fail

Elastic Fidelity: Trade Off Accuracy for Energy [Roy, CoRR arXiv 2011]
- We don't always require 100% accuracy, but the hardware always provides it
  - Audio, video, imaging, data mining, scientific kernels
- Language constructs specify the required fidelity for code/data segments
- Steer computation to execution/storage units with the appropriate fidelity
- Results: up to 35% lower energy via elastic fidelity on ALUs & caches
  - Turning off ECC: an additional 15-85% from the L2
[Images: original output vs. output with 10% errors allowed]

Simple Code Example

    imprecise[25%] int a[N];
    int b[N];
    ...
    a[0] = a[1] + a[2];
    b[0] = b[1] + b[2];
    ...

[Figure: color-coded voltage legend for the data storage (e.g., cache) and execution units (e.g., ALUs) holding a and b]

Estimating Resilience
- Currently, users specify the error resilience of data
- QoS profilers can automate the fidelity mapping
  - User-provided function to calculate output quality
  - User-provided quality threshold
- The profiler parses the source code
  - Identifies data structures & code segments
- Software fault-injection wrappers determine error resilience
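A minimal sketch of the kind of software fault-injection wrapper the slide mentions: it perturbs a data array with a given bit-error rate, runs the kernel, and compares a quality metric against a threshold. The kernel, quality metric, and threshold here are stand-ins for the user-provided pieces, not part of any actual tool.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>

#define N 1024

/* Stand-in kernel under test: a simple 3-point box blur. */
static void kernel(const float *in, float *out) {
    out[0] = in[0];
    out[N - 1] = in[N - 1];
    for (int i = 1; i < N - 1; i++)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
}

/* Stand-in quality metric: 1 / (1 + RMS error) against the fault-free output. */
static double quality(const float *golden, const float *test) {
    double err = 0.0;
    for (int i = 0; i < N; i++)
        err += (double)(golden[i] - test[i]) * (golden[i] - test[i]);
    return 1.0 / (1.0 + sqrt(err / N));
}

/* Fault-injection wrapper: flip each bit of the data with probability `ber`.  */
static void inject_faults(float *data, double ber) {
    for (int i = 0; i < N; i++) {
        unsigned bits;
        memcpy(&bits, &data[i], sizeof bits);
        for (int b = 0; b < 32; b++)
            if ((double)rand() / RAND_MAX < ber)
                bits ^= 1u << b;
        memcpy(&data[i], &bits, sizeof bits);
    }
}

int main(void) {
    static float in[N], noisy[N], golden[N], out[N];
    for (int i = 0; i < N; i++) in[i] = (float)sin(i * 0.01);
    kernel(in, golden);                          /* fault-free reference run       */

    const double threshold = 0.95;               /* user-provided quality target   */
    for (int e = -6; e <= -2; e++) {             /* sweep bit-error rates          */
        double ber = pow(10.0, e);
        memcpy(noisy, in, sizeof in);
        inject_faults(noisy, ber);               /* model storing `in` imprecisely */
        kernel(noisy, out);
        double q = quality(golden, out);
        printf("bit-error rate %.0e -> quality %.3f (%s)\n",
               ber, q, q >= threshold ? "acceptable" : "too lossy");
    }
    return 0;
}
```

In an actual profiler the kernel and quality function come from the user; sweeping the error rate determines the loosest fidelity setting (and hence the lowest safe voltage) for each data structure.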

Outline - Main Sources of Energy Overhead

Galaxy: Optically-Connected Disintegrated Processors [Pan, WINDS 2010]
- Split the chip into chiplets, connect them with optical fibers
  - Fibers offer high bandwidth and low latency
- Spread the chiplets far apart to cool efficiently
  - Thermal model: 10 cm is enough for 5 chiplets (80 cores)
- Mitigate bandwidth, power, and yield constraints

Nanophotonic Components
[Figure labels: off-chip laser source, coupler, resonant modulators, resonant detectors (Ge-doped), waveguide]
Selective: couple optical energy of a specific wavelength

Modulation and Detection [Batten, HOTI 2008]
- DWDM: multiple wavelengths per waveguide
- 3~5 μm waveguide pitch, 10 Gbps per link
- ~100 Gbps/μm bandwidth density!!!
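The headline density is simple arithmetic over the figures on the slide; the per-waveguide wavelength count did not survive the transcript, so the N_lambda = 50 below is only an assumed value chosen to show how ~100 Gbps/μm can arise from 10 Gbps links at a ~5 μm pitch.

\[
\text{bandwidth density} \;=\; \frac{N_\lambda \times R_{\text{link}}}{\text{waveguide pitch}}
\;\approx\; \frac{50 \times 10\,\text{Gbps}}{5\,\mu\text{m}} \;=\; 100\ \text{Gbps}/\mu\text{m}
\]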

IBM Technology: Dense Off-Chip Coupling
- Dense optical fiber array [Lee, OSA/OFC/NFOEC 2010]
- <1 dB loss, 8 Tbps/mm demonstrated
Tapered couplers solve the off-chip bandwidth problem: Tbps/mm demonstrated

Galaxy Overall Architecture
- 2-3x speedup and 53% lower Energy × Delay product over the best alternative
- 200 mm² die, 64 routers/chiplet, 9 chiplets, 16 cm fiber: > 1K cores

Conclusions
Physical constraints limit chip scaling and performance
Major energy overheads:
- Data movement → Elastic Caches: adapt the cache to the workload's demands
- Processing → Seafire: specialized computing on dark silicon
- Circuit guardbands, process variation → Elastic Fidelity: selectively trade accuracy for energy
- Pushing back the power and bandwidth walls → Galaxy: optically-connected disintegrated processors
Need to innovate across the software/hardware stack
- Devices, programmability, and tools are a great challenge

Thank You!
Parallelism alone is not enough to ride Moore's Law
Overview of our work:
- Elastic Caches: adapt the cache to the workload's demands
- Seafire: specialized computing on dark silicon
- Elastic Fidelity: selectively trade off accuracy for energy
- Galaxy: optically-connected disintegrated processors