A Roadmap to Restoring Computing's Former Glory David I. August Princeton University (Not speaking for Parakinetics, Inc.)
Golden era of computer architecture: performance is now ~3 years behind the old trend. [Chart: SPEC CINT performance (log scale) vs. year, across CPU92, CPU95, CPU2000, CPU2006.] Era of DIY: multicore, reconfigurable, GPUs, clusters. "10 Cores!" — 10-Core Intel Xeon, "Unparalleled Performance."
P6 SUPERSCALAR ARCHITECTURE (CIRCA 1994): automatic speculation, automatic pipelining, parallel resources, automatic allocation/scheduling, commit.
MULTICORE ARCHITECTURE (CIRCA 2010): parallel resources — automatic speculation, automatic pipelining, automatic allocation/scheduling, commit.
Realizable parallelism. [Chart: threads vs. time for parallel library calls. Credit: Jack Dongarra.]
“Compiler Advances Double Computing Power Every 18 Years!” – Proebsting’s Law
Multicore Needs:
1. Automatic resource allocation/scheduling, speculation/commit, and pipelining.
2. Low-overhead access to programmer insight.
3. Code reuse, ideally including support for legacy codes as well as new codes.
4. Intelligent automatic parallelization.
Parallel programming + automatic parallelization + parallel libraries + computer architecture. Implicitly parallel programming with critique-based iterative, occasionally interactive, speculatively pipelined automatic parallelization: a roadmap to restoring computing's former glory.
Multicore Needs (1–4, as above). [System diagram: new or existing sequential code and new or existing libraries, each with insight annotations, feed the DSWP family of optimizations, speculative optimizations, and other optimizations; a Complainer/Fixer guides one implementation; the output is parallelized code over machine-specific performance primitives.]
[Figure: Spec-PS-DSWP schedule on four cores — load (LD:1–5), work (W:1–4), and commit (C:1–3) operations per iteration — mirroring the P6 superscalar architecture.]
Example:
  A: while (node) {
  B:   node = node->next;
  C:   res = work(node);
  D:   write(res);
     }
[Program Dependence Graph (PDG): control dependences from A to B, C, D; data dependences B→B (loop-carried), B→C, C→D. Timeline of A, B, C, D instances on three cores.]
Spec-DOALL applied to the example. [Same code and PDG as above; three-core timeline.]
Spec-DOALL schedule. [Timeline: full iterations (A, B, C, D) 1, 2, 3 run concurrently, one iteration per core.]
Spec-DOALL speculates the loop exit, rewriting A: while (node) { as while (true) { so iterations can begin before the exit condition resolves. [Timeline: B, C, D of iterations 2–4 issued across three cores.] On 197.parser the result is a slowdown: the loop-carried dependence through node = node->next still serializes the iterations.
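To make the blocking dependence concrete, here is a runnable sequential version of the slide's loop. The list type and the bodies of work() and write() are not from the talk; they are minimal stand-ins chosen only to make the B→B loop-carried dependence visible.

```c
#include <assert.h>
#include <stddef.h>

struct node { int val; struct node *next; };

static int results[8];
static int nresults;

/* Hypothetical stand-ins for the slide's work() and write(). */
static int work(struct node *n) { return n ? n->val * 2 : 0; }
static void write_res(int res) { results[nresults++] = res; }

/* The slide's loop. Each iteration's B consumes the node produced
 * by the previous iteration's B (a loop-carried dependence), so
 * iterations cannot simply be distributed DOALL-style. */
static void run(struct node *node) {
    while (node) {             /* A */
        node = node->next;     /* B: loop-carried dependence    */
        int res = work(node);  /* C: depends on B               */
        write_res(res);        /* D: depends on C               */
    }
}
```

Note that, as on the slide, C runs on the node *after* B advances, so the final iteration calls work() on the list's end.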
Spec-DOACROSS vs. Spec-DSWP. [Timelines on three cores: Spec-DOACROSS staggers whole iterations across cores; Spec-DSWP assigns stages B, C, D to different cores and streams values between them. Both achieve a throughput of 1 iteration/cycle.]
Comparison: Spec-DOACROSS and Spec-DSWP. With communication latency 1, both sustain 1 iteration/cycle. With communication latency 2, Spec-DOACROSS drops to 0.5 iterations/cycle — the latency problem: cross-core communication sits on the critical path between iterations. Spec-DSWP still sustains 1 iteration/cycle after a longer pipeline fill time, because communication latency affects only the fill, not steady-state throughput.
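The DSWP idea above can be sketched with two pthreads: stage 1 runs the sequential recurrence (A and B) and streams nodes through a queue; stage 2 runs C and D. This is a hand-written illustration, not the paper's runtime; the queue, node type, and work() body are assumptions.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct node { int val; struct node *next; };

#define QCAP 64
static struct node *queue[QCAP];
static int qhead, qtail;   /* single producer, single consumer */
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  qcond = PTHREAD_COND_INITIALIZER;

static void enqueue(struct node *n) {
    pthread_mutex_lock(&qlock);
    while ((qtail + 1) % QCAP == qhead)   /* full: wait */
        pthread_cond_wait(&qcond, &qlock);
    queue[qtail] = n;
    qtail = (qtail + 1) % QCAP;
    pthread_cond_signal(&qcond);
    pthread_mutex_unlock(&qlock);
}

static struct node *dequeue(void) {
    pthread_mutex_lock(&qlock);
    while (qhead == qtail)                /* empty: wait */
        pthread_cond_wait(&qcond, &qlock);
    struct node *n = queue[qhead];
    qhead = (qhead + 1) % QCAP;
    pthread_cond_signal(&qcond);
    pthread_mutex_unlock(&qlock);
    return n;
}

static int work(struct node *n) { return n ? n->val * 2 : 0; }
static long total;   /* stands in for write(res) */

/* Stage 1: the recurrence (A and B) stays on one core; each B value
 * is forwarded to stage 2 through the queue. */
static void *traverse_stage(void *arg) {
    struct node *node = arg;
    while (node) {          /* A */
        node = node->next;  /* B */
        enqueue(node);      /* NULL from the last B is a natural sentinel */
    }
    return NULL;
}

/* Stage 2: C and D consume the stream on another core. */
static void *work_stage(void *arg) {
    (void)arg;
    for (;;) {
        struct node *n = dequeue();
        total += work(n);   /* C + D */
        if (!n) break;
    }
    return NULL;
}

static long run_dswp(struct node *head) {
    pthread_t t1, t2;
    total = 0;
    qhead = qtail = 0;
    pthread_create(&t1, NULL, traverse_stage, head);
    pthread_create(&t2, NULL, work_stage, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return total;
}
```

Because stage 1 never waits on stage 2 (until the queue fills), queue latency delays only when results appear, not how fast iterations retire — the pipeline-fill behavior the comparison slide shows.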
TLS vs. Spec-DSWP [MICRO 2010] Geomean of 11 benchmarks on the same cluster
Multicore Needs (recap). [Outline slide repeated.]
char *memory;
void *alloc(int size);

void *alloc(int size) {
    void *ptr = memory;
    memory = memory + size;
    return ptr;
}
[Execution plan: alloc calls 1–6 on cores 1–3 over time.]
With alloc calls freed to run in any order, the execution plan spreads alloc 1–6 across cores 1–3. The outcome is easily understood non-determinism: the addresses returned vary from run to run, but every caller still gets a valid private region.
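A runnable sketch of why the slide's alloc calls commute. The arena backing store and the disjoint() check are illustrative assumptions; the point is that callers depend only on receiving a fresh region, not on which address they get, so any interleaving of calls is acceptable.

```c
#include <assert.h>
#include <stddef.h>

static char arena[1024];
static char *memory = arena;

/* The slide's bump allocator. Two calls conflict on `memory`, so a
 * dependence analysis serializes them — yet either call order is fine. */
static void *alloc(int size) {
    void *ptr = memory;
    memory = memory + size;
    return ptr;
}

/* What callers actually rely on: regions do not overlap. This holds
 * under every interleaving of alloc calls, which is the sense in
 * which the calls commute. */
static int disjoint(const char *p, int psz, const char *q, int qsz) {
    return p + psz <= q || q + qsz <= p;
}
```

In the talk's system, annotating alloc as Commutative tells the parallelizer that this weaker, order-independent contract is all the program needs.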
~50 of ½ million LOC modified in SPEC CINT 2000; modifications also include the non-deterministic branch. [MICRO '07, Top Picks '08; automatic: PLDI '11]
Multicore Needs (recap). [Outline slide repeated.]
Iterative compilation [Cooper '05; Almagor '04; Triantafyllis '05]. [Search tree over transformation sequences — sum reduction, unroll, rotate — with measured speedups at each step ranging from 0.10X to 30.0X.]
PS-DSWP Complainer.
PS-DSWP Complainer. The Complainer reports the dependences blocking parallelization:
- Red edges: dependences between malloc() and free()
- Blue edges: dependences between rand() calls
- Green edges: flow dependences inside the inner loop
- Orange edges: dependences between function calls
Who can help me? → Programmer annotation. [Transformation sequence: unroll, sum reduction, rotate.]
PS-DSWP Complainer: the sum reductions are recognized.
PS-DSWP Complainer: the programmer annotates calls as Commutative, resolving reported dependences.
PS-DSWP Complainer: Commutative annotations come from both the programmer and the library.
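The sum reduction the Complainer recognizes can be sketched directly. This is a hand-written illustration of the transformation, not the compiler's output; the 4-worker round-robin chunking is an assumption. Because addition commutes and associates, the accumulator can be privatized per worker and merged afterward, breaking the loop-carried dependence.

```c
#include <assert.h>

#define NWORKERS 4

/* Sequential form: sum += a[i] is a loop-carried dependence. */
static long sum_seq(const int *a, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* After sum-reduction: each worker accumulates into a private
 * partial, then the partials are merged. Workers are shown as a
 * loop here; the real optimization runs each chunk on its own core. */
static long sum_reduced(const int *a, int n) {
    long partial[NWORKERS] = {0};
    for (int w = 0; w < NWORKERS; w++)
        for (int i = w; i < n; i += NWORKERS)  /* round-robin chunks */
            partial[w] += a[i];
    long sum = 0;
    for (int w = 0; w < NWORKERS; w++)
        sum += partial[w];
    return sum;
}
```

The Commutative annotations on the slide generalize the same idea beyond arithmetic: any operation whose calls may reorder without changing what the program needs can be privatized or unordered the same way.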
Multicore Needs (recap). [Outline slide repeated.]
Performance relative to best sequential: 128 cores in 32 nodes with Intel Xeon processors [MICRO 2010].
Restoration of Trend
“Compiler Advances Double Computing Power Every 18 Years!” – Proebsting’s Law. Compiler technology vs. architecture/devices. Era of DIY: multicore, reconfigurable, GPUs, clusters. A compiler-technology-inspired class of architectures?
The End