Bridging the Moore’s Law Performance Gap with Innovation Scaling Todd Austin University of Michigan.

Presentation transcript:

Bridging the Moore’s Law Performance Gap with Innovation Scaling Todd Austin University of Michigan

Moore’s Law Performance Gap: Today, the gap is cresting 10x.

We Investigate: Who’s to Blame?

The Programmer? Some say programmers cannot overcome hard parallel programming problems. Not the case: parallel programming techniques and infrastructure are advancing more quickly than ever before. Examples: Map-reduce, Hadoop, Spark, NoSQL databases, Memcached, Cassandra.

Largest NA Bitcoin Miner: A GPGPU-based system. Fills a 2000 sq. ft. warehouse. Computes 1 petahash/s. Reportedly generates $8M in Bitcoins per month. Unfortunately, it will soon be obsolete as Bitcoin difficulty continues to scale.

The Educator? Perhaps CS education is failing to educate. Not the case: Michigan CS enrollments are up nearly 250% from 5 years ago, and average entrant GPAs are up as well. The quality of CS education is on an upward trend. CS faculty are among the highest-ranked educators by students (at Michigan). A young area has less “dead wood” than our more venerable engineering counterparts. An active-learning approach to education is the norm for CS in the US.

CS Education in Ethiopia: I have been working with Addis Ababa Institute of Technology to develop CS and IT coursework since 2008, with a special focus on building infrastructure and developing active learning. Nearly 600 students are in the CS program. It is the 2nd most popular major in the university, with many job opportunities. The first?

Silicon Technology? Are transistors failing us? In the past, scaling decreased latency, power, and cost. Yes, in part: silicon scaling has stalled for gate oxide layers. Due to leakage concerns, gate oxide scaling has stopped. This leaves threshold voltages high and prevents large decreases in Vdd. Because scaling has become uneven within the transistor, Dennard scaling has all but stopped. But silicon is still bringing more transistors every generation, and will continue to do so for perhaps a decade or so. This ultimately creates the dark silicon dilemma, which prevents all of the chip from being active at the same time.
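
The argument on this slide can be worked through with first-order arithmetic. The sketch below is an illustration only, not a model taken from the talk: it assumes dynamic power per transistor is proportional to C * V^2 * f, ignores leakage, and uses a hypothetical linear shrink factor `s` per generation. The function names are made up for this example.

```python
def power_density_ratio(s, voltage_scales, frequency_scales=True):
    """Relative change in power density after one generation of shrink by s.

    Assumption: dynamic power per transistor ~ C * V^2 * f, leakage ignored.
    """
    transistors_per_area = s ** 2                 # s^2 more devices per unit area
    capacitance = 1.0 / s                         # smaller devices, less capacitance
    voltage = 1.0 / s if voltage_scales else 1.0  # Vdd stuck when scaling stalls
    frequency = s if frequency_scales else 1.0
    power_per_transistor = capacitance * voltage ** 2 * frequency
    return transistors_per_area * power_per_transistor


def active_fraction(s, generations, voltage_scales):
    """Fraction of the chip that can switch at once under a fixed power budget."""
    density = power_density_ratio(s, voltage_scales) ** generations
    return min(1.0, 1.0 / density)


if __name__ == "__main__":
    s = 1.4  # hypothetical shrink factor per generation
    for gen in range(4):
        ideal = active_fraction(s, gen, voltage_scales=True)
        stalled = active_fraction(s, gen, voltage_scales=False)
        print(f"generation {gen}: Dennard {ideal:.2f}, fixed Vdd {stalled:.2f}")
```

Under these assumptions, ideal Dennard scaling keeps power density constant, so the whole chip can stay active. With Vdd fixed, power density grows by roughly s^2 per generation, so the usable fraction of the chip shrinks by the same factor.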

The Dark Silicon Dilemma. Courtesy Michael UCSD.

The Architects? For decades, architects relied on technology scaling to provide much of the generational design benefit. Can architects bridge the Moore’s Law performance gap with innovative designs? Perhaps, but what provides the 10x+ needed to bridge the gap? Initially, many-core was thought to provide a scalable solution. Many-core designs provide value, but they are not a universal panacea. Architects have other tricks up their sleeves: application-specific architectures, in particular.

Example: CryptoManiac Processor. Circuit level: functional unit design tucks pre- and post-Boolean ops into the clock cycle. Architecture level: an ISA extension exposes the pre/post-ops. Application level: programming re-expresses algorithms to leverage the optimization. Result: 20% performance benefit (could be recast as an energy benefit). [Austin]

[Figure: CryptoManiac block diagram. Labels shown: Req Scheduler, In Q, Out Q, requests, results, Keystore, three CM Procs, pipelined 32-bit MUL, 1K-byte SBOX cache, 32-bit adder, 32-bit rotator, two XOR/AND logical units, and the tags {tiny}, {short}, {tiny}, {long}.]

Example: EFFEX Feature Extraction. Typical computer vision pipeline: Image, Feature Extraction, Object Geometric & Semantic Reasoning, Contextual Scene Reasoning. The stages exchange local features, global features, and spatial-temporal relationships. Feature extraction steps: preprocess the image; scan the image for potential feature locations; filter feature locations; generate signatures of confirmed features; post-process feature descriptors. [Clemons, Austin]

Example: EFFEX Feature Extraction. Initial EFFEX design: 90x greater efficiency for feature extraction algorithms. [Figure labels: Patch Memory, Vector Reduction Units, Heterogeneous Multicore.]

Moving Beyond Homogeneous Parallelism: Many-core designs, while helpful, will soon realize their (perhaps full) potential. What big ideas are next that can translate innovation into scaling? Heterogeneous parallelism is the key to x energy-efficiency gains in light of slowing silicon. C-FAR seeks sustained scaling through highly reusable composable customization, and through affordable design and manufacturing.

The Policy Makers? Perhaps, while solutions exist, policy makers do not have the will to support research into solving this problem. Not the case: while all of the engineering sciences have been hit with funding cuts in the last decade, computer engineering has fared better than many other engineering sciences. Investments in CE research are still flowing from industry and government. Example: Center for Future Architectures Research (C-FAR), with 27 PIs from 14 US universities and a US$30M investment over 5 years.

C-FAR: Center for Future Architectures Research. Sponsored by SRC/DARPA. Goal: Innovation-based scaling for ’s. Design Themes (What solutions are needed?): Communication (Margaret Martonosi), Computation (David Brooks), Storage (Steven Swanson). Viability Themes (Will our ideas actually work?): Applications-to-Architectures (Krste Asanovic), Resilient System Design (Valeria Bertacco), System Integration (Naveen Verma).

The Blame? Cost of Design. The one idea I want to leave with you today: successfully bridging the Moore’s Law Performance Gap is less about “how” to do it, and more about “how much” it costs to attempt it. My claim: if we can effect a 100x reduction in the cost to bring a design to market, performance challenges will eventually solve themselves as the market flourishes with orders of magnitude more designs, some of which will be the big design wins of tomorrow.

Don’t Believe Me? Well, you can’t argue with Mother Nature! r/K selection theory is a biological mechanism that organisms use to better adapt to their environment. In unstable environments, r-selection predominates, as the ability to reproduce quickly is crucial. In stable environments, K-selection predominates, as the ability to compete successfully for limited resources is crucial.

Design Costs Are Skyrocketing. Source: International Business Strategies.

High Costs Kill Innovation: Heterogeneous designs often serve smaller markets.
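
The link between design cost and market size can be made concrete with a simple amortization calculation. This is a minimal sketch under stated assumptions: every dollar figure is a hypothetical placeholder rather than data from the talk, and the function names are invented for illustration. It treats non-recurring engineering (NRE) cost as a fixed sum recovered from the per-unit margin.

```python
def per_unit_cost(nre, unit_cost, volume):
    """Total cost per chip once NRE is spread over the units sold."""
    return nre / volume + unit_cost


def break_even_volume(nre, price, unit_cost):
    """Units that must be sold before the design recovers its NRE."""
    margin = price - unit_cost
    if margin <= 0:
        raise ValueError("price must exceed unit cost to ever break even")
    return nre / margin


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to show the shape of the trade-off.
    price, unit_cost = 20.0, 10.0
    for nre in (50_000_000, 500_000):  # today versus a 100x cheaper design
        units = break_even_volume(nre, price, unit_cost)
        print(f"NRE ${nre:,}: break-even at {units:,.0f} units")
```

Because break-even volume is linear in NRE, a 100x reduction in design cost lowers the minimum viable market by the same factor, which is what makes small-market heterogeneous designs affordable.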

Outcome: “Nanodiversity” is Dwindling. Source: Gartner Group.

Silicon Today: The Good, the Bad and the Ugly. The Good: Moore’s law will continue for the near future. It won’t last forever, but that is another problem. The Bad: Dennard scaling has all but stopped, leaving innovation to fill the performance/power scaling gap, e.g., app-specific design and custom accelerators. The Ugly: Hardware innovation requires design diversity, which is ultimately too expensive to afford. Skyrocketing NREs will necessitate broadly applicable (vanilla and slow) H/W designs.

The Remedy: Scale Innovation via Lower Design Cost. Ultimate goal: make customized design sufficiently inexpensive that anyone can do it anywhere. Address all NRE factors: market size, design costs, build costs. Take inspiration from Web 2.0 and the subsequent innovation explosion.
Approach #1: Reduce the cost of custom hardware, with better tools that understand and leverage the benefits of customization, and by embracing open-source hardware design solutions.
Approach #2: Widen the applicability of custom hardware. Increasing market applicability with composable customization mitigates potentially higher NREs.
Approach #3: Reduce the cost of manufacturing hardware. Utilize assembly-time customization to slash the cost of customization.
Approach #4: Slash software development costs, through more effective reuse, wider adoption of open-source technologies, and novel approaches to verification.

1) Reduce the cost of custom hardware. Better tools: scalable accelerator synthesis and compilation generates code and H/W for highly reusable accelerators for a wide range of applications. Composable design space exploration: composability permits novel search techniques based on machine learning and enables efficient exploration of highly complex design spaces. Embrace open-source concepts. Example: Berkeley’s RISC-V architecture. [Figure labels: CPU, GPU, and accelerator design spaces; automatic composition; composed design space; MxPA/FCUDA; RTL.]

Berkeley’s RISC-V Open-Source ISA

2) Widen the Applicability of Custom H/W. ESP: Ensembles of Specialized Processors. Ensembles are algorithm-specific processors optimized for code “patterns”. Patterns capture common operations across many applications, each with a unique communication and computation structure. The approach promises custom-accelerator speed and efficiency that is widely applicable to general-purpose programs. [Asanovic, Keutzer] [Figure labels: ILP Engine, Dense Engine, Sparse Engine, Graph Engine; Glue Code, Dense Code, Sparse Code, Graph Code; Dense, Graph, Sparse; Multimedia Analysis, Computer Vision, Machine Learning.]
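
One way to picture the ESP idea is as a dispatch from code patterns to specialized engines. The sketch below is hypothetical and is not the ESP toolchain: it assumes each code region has already been tagged with a single pattern, and that anything without a specialized engine (such as glue code) falls back to the ILP engine. The engine names follow the slide labels; the function, table, and region names are invented for illustration.

```python
from collections import defaultdict

# Assumed mapping from pattern tag to engine, following the slide's labels.
ENGINE_FOR_PATTERN = {
    "dense": "Dense Engine",
    "sparse": "Sparse Engine",
    "graph": "Graph Engine",
}
FALLBACK_ENGINE = "ILP Engine"  # assumption: glue code runs here


def assign_engines(regions):
    """Group (region_name, pattern) pairs by the engine that would run them."""
    plan = defaultdict(list)
    for name, pattern in regions:
        engine = ENGINE_FOR_PATTERN.get(pattern, FALLBACK_ENGINE)
        plan[engine].append(name)
    return dict(plan)


if __name__ == "__main__":
    # Hypothetical regions from a vision-style application.
    regions = [
        ("convolve", "dense"),
        ("parse_args", "glue"),
        ("match_features", "sparse"),
        ("scene_graph_walk", "graph"),
    ]
    for engine, names in assign_engines(regions).items():
        print(f"{engine}: {', '.join(names)}")
```

The point of the table-driven design is that a small set of pattern engines can be reused across many applications, which is how the approach widens the market for each piece of custom hardware.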

3) Reduce the cost of manufacturing H/W. Brick-and-mortar silicon explores assembly-time customization, i.e., MCMs + 3D + FPGA interconnect. Diversity comes via the brick ecosystem and interconnect flexibility. Brick design costs are amortized across all designs. Robust interconnect and custom bricks rival ASIC speeds. Brick-and-mortar silicon design flow: 1) assemble the brick layer; 2) connect with the mortar layer; 3) package the assembly; 4) deploy software. [Bertacco, Carloni, Mercaldi] [Figure label: H/W brick.]

4) Slash Software Development Costs. Implement more software reuse. Embrace the open-source movement more fully. Audience, other ideas?

Conclusions: Heterogeneous design could continue Moore’s law scaling through innovation alone. But it requires a diverse hardware ecosystem with affordable customization. Affordable customization won’t happen without our help: 1. Widen the applicability of customization. 2. Reduce the cost of customized design. 3. Reduce the cost of custom manufacturing. The resulting “nanodiversity” is a good thing: better perf, power, cost, and capability; more jobs, companies, and students; more competition and scalable innovation.

Questions?