Are New Languages Necessary for Manycore? David I. August Department of Computer Science Princeton University.

THIS is the Problem!
[Figure: SPEC CPU integer performance over time, with a question mark over the trend after 2004]

Why New Multicore Languages Will Fail
The Market:
- Money is earned by relieving customer pain
- Legacy, Legacy, Legacy
- Programmers adopt new programming models
- Parallel programming is more difficult
- Parallel programming models have longevity issues
Automatic Thread Extraction (ATE)

Automatic Thread Extraction
- "That isn't to say we are parallelizing arbitrary C code, that's a fool's errand!" – Richard Lethin
- "Compiler can't determine a tree from a graph…" – Burton Smith
- "Compiler can't determine dependences without type information. Even then…" – Burton Smith
- "Decades of automatic parallelization work has been a failure…" – James Larus
- "All that icky pointer chasing code..." – Tim Mattson

How To Get Parallelism For Multicore?
Nine months ago, with an open mind:
- A priori, select ALL C programs from SPEC CINT 2000
- Our objective function (in priority order):
  1. Extract meaningful parallelism
  2. Prefer automatic over manual
  3. Minimize impact on the programmer when manual

Our Results
[Table: threads at peak, speedup, and LOCs changed for 164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty, 197.parser, 253.perlbmk, 254.gap, 255.vortex, 256.bzip2, and 300.twolf, with geometric and arithmetic means; the numeric columns are not legible in the transcript]
M.L.O.P.: 5 generations, 32 cores, 5.3x speedup

Our Recipe
Recent compiler technology:
- Decoupled Software Pipelining (DSWP) [MICRO 05]
- Parallel-Stage DSWP (PS-DSWP)
- Speculative DSWP (Spec-DSWP) [PACT 07]
Existing technology:
- Speculative DOALL, TLS
- Targeted memory profiling
- Procedure Boundary Elimination [PLDI 06]
Hardware support:
- Compiler-controlled speculation
- Streaming communication [MICRO 06]

Typical Example: 197.parser
[Figure: pipeline of Find English Sentences → Parse Sentences (95%) → Emit Results, parallelized with DSWP and then PS-DSWP (speculative DOALL middle stage)]
Threads run on a multicore model with Itanium 2 cores.

What We Learned
1. A new way of thinking about dependences: go with the flow
2. TLP is easier to extract than ILP
3. A holistic approach is better
4. A limitation exists in the sequential model: determinism

Determinism: A Double-Edged Sword

    while( ):
        x = Rand()

    int Rand():
        state = f2(state)
        return f1(state)

[Figure: the loop scheduled SEQUENTIAL vs. DOALL]
- 56 LOCs in 11 programs: 22 annotations
- Only 2 programs needed more
- Most common culprit: custom allocators

What about Manycore?
Multicore:
- New languages aren't necessary
- Legacy code easily adjusted
Manycore: Implicitly Parallel Sequential Programming
- No optimization for sequential execution (custom allocators)
- Points of non-determinism specified
- Parallel algorithms in sequential codes
- Debuggability, understandability, sanity

The Answer Originates with ATE
The old way: PL folks would write languages, architecture folks would make hardware, and compiler folks would dutifully connect the two.
This will fail for manycore:
- It unduly burdens the programmer
- Performance will suffer
There's a new way…

DO NOT POST ANYTHING AFTER THIS SLIDE

How Code Was Transformed
[Table: LOC changed (all / model) per benchmark; the LOC counts are not legible in the transcript. Model technique and compiler techniques applied:]
- 164.gzip: Y-Branch; TLS Memory, DSWP
- 175.vpr: PURE; Alias, Value, & Control Spec, TLS Mem, DSWP
- 176.gcc: PURE; Alias & Control Spec, TLS Mem, DSWP
- 181.mcf: Alias, Silent Store, & Control Spec, TLS Mem, DSWP, Nested
- 186.crafty: PURE; TLS Mem, DSWP, Nested
- 197.parser: PURE; TLS Mem, DSWP
- 253.perlbmk: Alias, Control, & Value Spec, DSWP
- 254.gap: PURE; TLS Memory, DSWP, Alias Spec
- 255.vortex: Alias & Value Spec, TLS Mem, DSWP
- 256.bzip2: TLS Memory, DSWP
- 300.twolf: PURE; Alias & Control Spec, TLS Mem, DSWP

PURE

Y-Branch

SPEC 2006: 403.gcc
Threads run on a multicore model with Itanium 2 cores.