Runnemede: Disruptive Technologies for UHPC. John Gustafson, Intel Labs. HPC User Forum, Houston, 2011.

Presentation transcript:

Slide 1: Runnemede: Disruptive Technologies for UHPC. John Gustafson, Intel Labs. HPC User Forum, Houston, 2011.

Slide 2: The battle lines are drawn…
“Caches are for morons.” —Shekhar Borkar, Intel
“We’re going to try to make the entire exascale machine cache-coherent.” —Bill Dally, Nvidia

Slide 3: Intel’s UHPC Approach
Design test chips with the goal of maximizing learning. This is very different from producing product-roadmap processor designs. Going from Peta to Exa is nothing like the last few 1000X increases.

Slide 4: Building with Today’s Technology
Power breakdown of a 1 TFLOP machine built today (total ~5 kW):
  - Compute: 200 W, at 200 pJ per FLOP
  - Memory: 150 W, for 10 TB at 1.5 nJ per byte
  - Com: 100 W, at 100 pJ of communication per FLOP
  - Disk and everything else: ~4450 W (decode and control, translations, etc.; power-supply losses; cooling; etc.)
The trend: kilowatts at Tera, megawatts at Peta, gigawatts at Exa?
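To make the slide’s arithmetic concrete, a minimal back-of-envelope sketch in Python. The per-operation energies are from the slide; the bytes-per-FLOP memory-traffic ratio is an assumption, chosen so the memory term reproduces the slide’s 150 W.

```python
# Power budget of a 1 TFLOP/s machine from per-operation energies.
# Energies are from the slide; BYTES_PER_FLOP is an assumed traffic
# ratio (not on the slide), picked to match the 150 W memory number.

FLOPS = 1e12                 # 1 TFLOP/s
E_FLOP = 200e-12             # 200 pJ per FLOP (compute)
E_MEM_BYTE = 1.5e-9          # 1.5 nJ per byte to memory
E_COM_PER_FLOP = 100e-12     # 100 pJ of communication per FLOP
BYTES_PER_FLOP = 0.1         # assumption

compute_w = FLOPS * E_FLOP                        # -> 200 W
memory_w = FLOPS * BYTES_PER_FLOP * E_MEM_BYTE    # -> 150 W
com_w = FLOPS * E_COM_PER_FLOP                    # -> 100 W
print(f"compute {compute_w:.0f} W, memory {memory_w:.0f} W, com {com_w:.0f} W")

# The same energies at 1 EFLOP/s give ~0.45 GW before any overheads,
# which is the slide's "KW at Tera, MW at Peta, GW at Exa?" progression.
e_per_flop = E_FLOP + BYTES_PER_FLOP * E_MEM_BYTE + E_COM_PER_FLOP
print(f"exascale, same technology: {1e18 * e_per_flop / 1e9:.2f} GW")
```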

Slide 5: The Power & Energy Challenge
A 1 TFLOP machine, today vs. built with Exa technology:

  Compute            200 W    ->   5 W
  Memory             150 W    ->   2 W
  Com                100 W    ->   ~5 W
  Disk and the rest  4550 W   ->   ~3 W disk, ~5 W rest
  Total              ~5 kW    ->   ~20 W

Slide 6: Scaling Assumptions

  Technology (high volume)    45 nm    32 nm    22 nm    16 nm    11 nm    8 nm     5 nm
                              (2008)   (2010)   (2012)   (2014)   (2016)   (2018)   (2020)
  Frequency scaling           15%      10%      8%       5%       4%       3%       2%
  Vdd scaling                 -10%     -7.5%    -5%      -2.5%    -1.5%    -1%      -0.5%
  SD leakage scaling/micron   1X (optimistic) to 1.43X (pessimistic)
  Transistor density          ~2X per generation (the ~0.7X linear shrink per node)

Example core + local memory:
  65 nm: 0.35 MB memory, 5 mm^2 (50%); DP FP add/multiply, integer core, RF, router, 5 mm^2 (50%). Total 10 mm^2; 3 GHz; 6 GFLOPS; 1.8 W.
  8 nm: 0.35 MB memory, 0.17 mm^2 (50%); the same logic, 0.17 mm^2 (50%). Total 0.34 mm^2 (~0.6 mm on a side); 4.6 GHz; 9.2 GFLOPS; 0.24 to 0.46 W.
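A small sketch of how these per-generation percentages compound, assuming the flattened table assigns one percentage to each node step and taking the slide’s 65 nm core (3 GHz) as the baseline; the 1.1 V starting Vdd is also an assumption.

```python
# Compound the slide's per-node frequency gains and Vdd reductions
# from a 65 nm baseline. Percentages and nodes are from the slide;
# the 1.1 V starting supply is an assumed value.

nodes = ["45 nm", "32 nm", "22 nm", "16 nm", "11 nm", "8 nm", "5 nm"]
freq_gain = [0.15, 0.10, 0.08, 0.05, 0.04, 0.03, 0.02]
vdd_drop = [0.10, 0.075, 0.05, 0.025, 0.015, 0.01, 0.005]

f, vdd = 3.0, 1.1   # 65 nm baseline: 3 GHz from the slide; 1.1 V assumed
for node, g, d in zip(nodes, freq_gain, vdd_drop):
    f *= 1 + g
    vdd *= 1 - d
    print(f"{node}: ~{f:.2f} GHz at Vdd ~{vdd:.2f} V")
# The 8 nm line (~4.61 GHz) matches the slide's 8 nm core example (4.6 GHz).
```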

Slide 7: Near-Threshold Logic
[Slide figure: energy efficiency (GOPS/Watt) and active leakage power (mW) vs. supply voltage (V) for 65 nm CMOS at 50°C. Energy efficiency peaks near threshold, around 320 mV, for a 9.6X gain; below 320 mV lies the subthreshold region. Source: H. Kaul et al., paper 16.6, ISSCC 2008.]
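As a sanity check on the 9.6X figure, dynamic switching energy scales roughly as CV^2; a minimal sketch, assuming a ~1.1 V nominal supply (only the 320 mV operating point is from the slide).

```python
# Ideal dynamic-energy gain from voltage scaling, E ~ C * V^2.
# V_NTV (320 mV) is from the slide; V_NOM (1.1 V) is an assumption.

V_NOM, V_NTV = 1.1, 0.32
ideal_gain = (V_NOM / V_NTV) ** 2
print(f"Ideal CV^2 energy gain: {ideal_gain:.1f}x")  # ~11.8x
# The slide reports a 9.6x measured GOPS/W gain, somewhat below the
# ideal CV^2 ratio because leakage grows in weight as Vdd nears threshold.
```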

Slide 8: Revise DRAM Architecture
Traditional DRAM (Page, RAS, CAS):
  - Activates many pages
  - Lots of reads and writes (refresh)
  - Only a small amount of the read data is used
  - Requires a small number of pins
New DRAM architecture (Page, Addr):
  - Activates few pages
  - Reads and writes (refreshes) only what is needed
  - All read data is used
  - Requires a large number of I/Os (3D)
Energy cost today: ~150 pJ/bit.
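The ~150 pJ/bit figure is what makes today’s DRAM path untenable at exascale; a rough estimate, assuming one byte of memory traffic per FLOP (an illustrative ratio, not from the slide).

```python
# Memory power of an exascale machine at today's DRAM access energy.
# E_BIT is from the slide; BYTES_PER_FLOP is an assumed traffic ratio.

E_BIT = 150e-12          # ~150 pJ per bit today
EXAFLOPS = 1e18
BYTES_PER_FLOP = 1.0     # assumption

power_w = EXAFLOPS * BYTES_PER_FLOP * 8 * E_BIT
print(f"Memory power at exascale: {power_w / 1e9:.1f} GW")  # ~1.2 GW
# Hence the slide's argument: activate only the pages you need, use all
# the data you read, and move to many 3D-stacked I/Os.
```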

Slide 9: Data Locality
Communication energy costs:
  - Core-to-core, on chip: ~10 pJ per byte
  - Chip to chip: ~100 pJ per byte
  - Chip to memory: ~150 pJ to ~1.5 nJ per byte
Data movement is expensive; keep it local: (1) core to core, (2) chip to chip, (3) memory.
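Measured against the 200 pJ/FLOP compute cost from slide 4, these per-byte costs show why locality dominates; a small sketch, taking the upper ~1.5 nJ per byte figure for memory.

```python
# Energy to move one 8-byte double at each level of the hierarchy,
# expressed in "FLOPs of energy" at 200 pJ/FLOP (slide 4's figure).

costs_pj_per_byte = {
    "core-to-core (on chip)": 10,
    "chip-to-chip": 100,
    "chip-to-memory": 1500,   # upper end of the slide's memory range
}
E_FLOP_PJ = 200
for level, pj in costs_pj_per_byte.items():
    moved = 8 * pj  # pJ to move one double
    print(f"{level}: {moved} pJ per double = {moved / E_FLOP_PJ:.1f} FLOPs of energy")
# Fetching a double from off-chip memory can cost as much energy as
# ~60 floating-point operations -- keep data local.
```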

Slide 10: Disruptive Approach to Faults
We tend to assume that execution faults (soft errors, hard errors) are rare. That assumption is valid today. Soon, however, we will need much more paranoia in our hardware designs.

Slide 11: Road to Unreliability?
From Peta to Exa, the reliability issues compound:
  - 1,000X parallelism: more hardware for something to go wrong, and >1,000X more intermittent faults due to soft errors (see the sketch after this list)
  - Aggressive Vcc scaling to reduce power/energy: gradual faults due to increased variations; more susceptibility to Vcc droops (noise) and to dynamic temperature variations; exacerbates intermittent faults, i.e. soft errors
  - Deeply scaled technologies: aging-related faults; lack of burn-in?; variability increases dramatically
Resiliency will be the cornerstone.
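A back-of-envelope illustration of the parallelism point. The per-component error rate below is an assumed, illustrative number; only the 1,000X-more-parts scaling comes from the slide.

```python
# System MTBF shrinks linearly with component count: 1,000X more
# hardware means 1,000X more frequent intermittent faults.

mtbf_component_hours = 1e7           # assumed: ~1 soft error per part per 1,000 years
for n_parts in (1e6, 1e9):           # peta-scale-ish vs. exa-scale-ish part counts
    mtbf_system_hours = mtbf_component_hours / n_parts
    print(f"{n_parts:.0e} parts -> system MTBF ~{mtbf_system_hours * 3600:.0f} s")
# 1e6 parts -> ~36,000 s (10 hours) between errors; 1e9 parts -> ~36 s.
```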

Slide 12: Resiliency
Faults and examples:
  - Permanent faults: stuck-at 0 and 1
  - Gradual faults: variability, temperature
  - Intermittent faults: soft errors, voltage droops
  - Aging faults: degradation
Faults cause errors (data and control):
  - Datapath errors: detected by parity/ECC; silent data corruption needs HW hooks (see the parity sketch after this list)
  - Control errors: control is lost (blue screen)
Resiliency must come with minimal overhead, and it spans the whole stack: circuit and design; microarchitecture; microcode and platform; system software; programming system; applications. Its mechanisms: error detection, fault isolation, fault confinement, reconfiguration, recovery and adaptation.
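A minimal sketch of the parity idea behind “detected by parity/ECC”: a software model, for illustration only, of what datapath hardware does on every write and read. A real ECC such as SECDED would also correct single-bit errors.

```python
# One parity bit detects any single-bit flip in a stored word.

def parity(word: int) -> int:
    """Even parity over the bits of an integer word."""
    p = 0
    while word:
        p ^= word & 1
        word >>= 1
    return p

stored = 0b10110100
tag = parity(stored)              # computed when the word is written
corrupted = stored ^ (1 << 3)     # a soft error flips bit 3
assert parity(corrupted) != tag   # the mismatch flags the error on read
print("single-bit error detected")
# A double-bit flip would go undetected: the "silent data corruption"
# case that the slide says needs extra hardware hooks.
```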

Slide 13: Execution Model and Codelets
[Slide diagram: the software stack. Rich programming models/systems sit above a runtime system that schedules a sea of codelets; below is a hardware abstraction over the cores, peripherals/devices, network, and advanced hardware monitoring.]
Codelet: code that can be executed non-preemptively under an event-driven model. The shared-memory model is based on LC (Location Consistency, a generalized single-assignment model [GaoSarkar1980]).
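A toy sketch of the codelet execution model described above: codelets run to completion, non-preemptively, once the events they depend on have fired. All class and function names here are illustrative inventions, not the Runnemede or any codelet-runtime API.

```python
# Minimal event-driven codelet scheduler: register codelets with their
# event dependencies; run each one to completion once all have fired.

from collections import defaultdict

class CodeletRuntime:
    def __init__(self):
        self.waiting = defaultdict(list)   # event name -> codelets blocked on it
        self.ready = []                    # codelets whose events have all fired

    def codelet(self, deps, fn):
        """Register fn to run non-preemptively once all deps have fired."""
        if not deps:
            self.ready.append(fn)
            return
        state = {"missing": set(deps), "fn": fn}
        for d in deps:
            self.waiting[d].append(state)

    def fire(self, event):
        """Signal an event, enabling codelets that were only waiting on it."""
        for st in self.waiting.pop(event, []):
            st["missing"].discard(event)
            if not st["missing"]:
                self.ready.append(st["fn"])

    def run(self):
        while self.ready:
            self.ready.pop()()   # each codelet runs to completion

rt = CodeletRuntime()
rt.codelet([], lambda: (print("produce A"), rt.fire("A")))
rt.codelet(["A"], lambda: print("consume A"))
rt.run()   # prints "produce A", then "consume A"
```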

Slide 14: Summary
  - Voltage scaling to reduce power and energy explodes parallelism
  - The cost of communication vs. computation is the critical balance
  - Resiliency combats the side effects and the unreliability
  - A programming system for extreme parallelism
  - An application-driven, HW/SW co-design approach
  - Self-awareness and an execution model to harmonize it all