Polar Opposites: Next Generation Languages & Architectures
Kathryn S. McKinley, The University of Texas at Austin

Collaborators
• Faculty: Steve Blackburn, Doug Burger, Perry Cheng, Steve Keckler, Eliot Moss
• Graduate students: Xianglong Huang, Sundeep Kushwaha, Aaron Smith, Zhenlin Wang (MTU)
• Research staff: Jim Burrill, Sam Guyer, Bill Yoder

Computing in the Twenty-First Century
• New and changing architectures
– Hitting the microprocessor wall
– TRIPS: an architecture for future technology
• Object-oriented languages
– Java and C# becoming mainstream
• Key challenges and approaches
– Memory gap, parallelism
– Language & runtime implementation efficiency
– Orchestrating a new software/hardware dance
– Breaking down artificial system boundaries

Technology Scaling Hitting the Wall
[Figure: a 20 mm chip edge at 130 nm, 100 nm, 70 nm, and 35 nm feature sizes, shown both analytically and qualitatively]
Either way, partitioning for on-chip communication is key.

End of the Road for Out-of-Order Superscalars
• The clock ride is over
– Wire and pipeline limits
– Quadratic out-of-order issue logic
– Power, a first-order constraint
• Major vendors are ending processor lines
• Problems remain for any architectural solution
– ILP: instruction-level parallelism
– Memory latency

Where Are Programming Languages?
• High productivity languages
– Java, C#, Matlab, S, Python, Perl
• High performance languages
– C/C++, Fortran
• Why not both in one?
– Interpretation/JIT vs. compilation
– Language representation: pointers, arrays, frequent method calls, etc.
– Automatic memory management costs
⇒ These obscure ILP and memory behavior

Outline
• TRIPS
– Next-generation tiled EDGE architecture
– ILP compilation model
• Memory system performance
– Garbage collection influence
– The GC advantage: locality, locality, locality
– Online adaptive copying
– Cooperative software/hardware caching

TRIPS Project
• Goals
– Fast clock & high ILP in future technologies
– Architecture sustains 1 TRIPS (one tera-instruction per second) in 35 nm technology
– Cost-performance scalability
– Find the right hardware/software balance: the new balance reduces hardware complexity & power, and brings new compiler responsibilities & challenges
• Hardware/software prototype
– Proof of concept of scalability and configurability
– Technology transfer

TRIPS Prototype Architecture

Execution Substrate
[Figure: a 4x4 execution array with per-row I-cache banks (0–3, plus H) and D-cache/LSQ banks (0–3), register banks, global control, and a branch predictor, connected by a routed interconnect]
Interconnect topology & latency are exposed to the compiler scheduler.

Large Instruction Window Execution Node
[Figure: each execution node contains an ALU, a router, control logic, and a stack of instruction buffers (opcode, src1, src2); 4 logical frames of 4 x 4 instructions]
• Out-of-order instruction buffers form a logical "z-dimension" at each node
• Instruction buffers add depth to the execution array: a 2D array of ALUs holds a 3D volume of instructions
• The entire 3D volume is exposed to the compiler

Execution Model
• SPDI: static placement, dynamic issue
– Dataflow execution within a block
– Sequential execution between blocks
• TRIPS compiler challenges
– Create large blocks of instructions: single entry, multiple exits, predication
– Schedule blocks of instructions onto a tile
– Respect resource limitations: registers, memory operations

Block Execution Model
• Program execution
– Fetch and map a block onto the TRIPS grid
– Execute the block, producing result(s)
– Commit results
– Repeat
• Block dataflow execution
– Each cycle, execute a ready instruction at every node
– A single read of registers and memory locations
– A single write of registers and memory locations
– Update the PC to the successor block
• The TRIPS core may speculatively execute multiple blocks (as well as instructions)
• TRIPS uses branch prediction and register renaming between blocks, but not within a block
[Figure: control-flow graph of blocks A–E from start to end]

Just Right Division of Labor
• TRIPS architecture
– Eliminates short-term temporaries
– Out-of-order execution at every node in the grid
– Exploits ILP and hides unpredictable latencies, without a superscalar's quadratic hardware and without VLIW's guarantees of completion time
• Scale compiler: generates ILP
– Forms large hyperblocks: predicate, unroll, inline, etc.
– Schedules hyperblocks (see the sketch below): maps independent instructions to different nodes, and communicating instructions to the same or nearby nodes
– Lets the hardware deal with unpredictable latencies (loads)
• Exploits both hardware and compiler strengths
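To make the scheduling intuition concrete, here is a hedged sketch of a greedy placer on a 4x4 array, under toy assumptions; it is not the Scale compiler's algorithm, and all names in it (PlacementSketch, place, deps) are illustrative:

```java
// Toy greedy placement: each instruction lands on the free grid node that
// minimizes total hop distance to its producers, so communicating
// instructions cluster at nearby nodes.
public class PlacementSketch {
    static final int DIM = 4;                               // 4x4 execution array

    /** deps[i] lists the (earlier) instructions whose results instruction i consumes. */
    static int[][] place(int n, int[][] deps) {
        int[][] pos = new int[n][];                         // pos[i] = {row, col}
        boolean[][] used = new boolean[DIM][DIM];
        for (int i = 0; i < n; i++) {                       // dataflow (topological) order
            int[] best = null;
            int bestCost = Integer.MAX_VALUE;
            for (int r = 0; r < DIM; r++)
                for (int c = 0; c < DIM; c++) {
                    if (used[r][c]) continue;
                    int cost = 0;                           // total hops to producers
                    for (int d : deps[i])
                        cost += Math.abs(pos[d][0] - r) + Math.abs(pos[d][1] - c);
                    if (cost < bestCost) { bestCost = cost; best = new int[]{r, c}; }
                }
            pos[i] = best;
            used[best[0]][best[1]] = true;
        }
        return pos;
    }

    public static void main(String[] args) {
        // insns 0 and 1 are independent; 2 consumes both; 3 consumes 2
        int[][] deps = {{}, {}, {0, 1}, {2}};
        int[][] pos = place(4, deps);
        for (int i = 0; i < pos.length; i++)
            System.out.println("insn " + i + " -> node (" + pos[i][0] + "," + pos[i][1] + ")");
    }
}
```

In this toy run, the communicating instructions (2 and 3) end up adjacent; the real scheduler must additionally respect the register and memory-operation limits the slide mentions.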

High Productivity Programming Languages
• Interpretation/JIT vs. compilation
• Language representation: pointers, arrays, frequent method calls, etc.
• Automatic memory management costs
• MMTk in IBM Jikes RVM [ICSE'04, SIGMETRICS'04]
– Memory Management Toolkit for Java
– High performance, extensible, portable
– Mark-sweep, copying semispace, reference counting
– Generational collection, Beltway, etc.

Allocation Choices
• Bump pointer
– Fast (an increment & a bounds check)
– Can't incrementally free & reuse: must free en masse
• Free list
– Relatively slow (consult the list for a fit)
– Can incrementally free & reuse individual cells
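A minimal sketch of the two allocators over a simulated heap, in the spirit of (but not copied from) MMTk; all class and method names are illustrative:

```java
// Toy heaps illustrating the trade-off: bump allocation is an increment
// plus a bounds check but can only be reclaimed en masse; free-list
// allocation searches for a fit but supports reuse of individual cells.
import java.util.ArrayList;
import java.util.List;

public class AllocatorSketch {
    static final class BumpAllocator {
        private final int limit;
        private int cursor = 0;
        BumpAllocator(int heapSize) { limit = heapSize; }
        /** Returns the offset of the new cell, or -1 if the space is full. */
        int alloc(int size) {
            if (cursor + size > limit) return -1;    // bounds check
            int start = cursor;
            cursor += size;                          // bump
            return start;
        }
        void reset() { cursor = 0; }                 // free en masse (e.g. after copying GC)
    }

    static final class FreeListAllocator {
        private static final class Cell { int start, size; Cell(int s, int z) { start = s; size = z; } }
        private final List<Cell> free = new ArrayList<>();
        FreeListAllocator(int heapSize) { free.add(new Cell(0, heapSize)); }
        int alloc(int size) {
            for (Cell c : free) {                    // first-fit search: the slow part
                if (c.size >= size) {
                    int start = c.start;
                    c.start += size;
                    c.size -= size;
                    return start;
                }
            }
            return -1;
        }
        void free(int start, int size) { free.add(new Cell(start, size)); }  // individual reuse
    }

    public static void main(String[] args) {
        BumpAllocator bump = new BumpAllocator(1 << 20);
        FreeListAllocator list = new FreeListAllocator(1 << 20);
        System.out.println("bump: " + bump.alloc(64) + ", " + bump.alloc(64));   // 0, 64
        System.out.println("list: " + list.alloc(64) + ", " + list.alloc(64));   // 0, 64
        list.free(0, 64);          // a free-list cell can be reused right away...
        bump.reset();              // ...while the bump space is reclaimed all at once
    }
}
```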

• Bump pointer: ~70 bytes of IA32 instructions, 726 MB/s
• Free list: ~140 bytes of IA32 instructions, 654 MB/s
• The bump pointer is 11% faster in a tight loop
– < 1% in a practical setting
– No significant difference (?)
• Second-order effects?
– Locality??
– Collection mechanism??

Implications for Locality
• Compare the SS (copying semispace) & MS (mark-sweep) mutator
– Mutator time
– Mutator memory performance: L1, L2 & TLB

[Graphs: SS vs. MS mutator time and L1, L2, and TLB performance for the javac, pseudojbb, and db benchmarks]

Locality & Architecture

[Graphs: MS/SS crossover on a 1.6 GHz PPC, a 1.9 GHz AMD, a 2.6 GHz P4, and a 3.2 GHz P4, with a summary plotting the locality vs. space trade-off across all four clock speeds]

Locality in Memory Management
• Explicit memory management is on its way out
– The key GC vs. explicit MM insights are 20 years old
– Technology has changed and is still changing
• Generational and Beltway collectors (a toy generational sketch follows)
– Significant collection-time benefits over full-heap collectors
– Collect young objects; infrequently collect the old space
– A copying nursery attains locality effects similar to a full-heap copying collector
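A toy generational sketch under stated assumptions (reachability is faked with an explicit set, and there is no write barrier; this is not MMTk's code), just to show the frequent-nursery / infrequent-old-space division:

```java
// Toy two-generation heap: the nursery is collected often and survivors
// are promoted; dead young objects are reclaimed en masse.
import java.util.*;

public class GenSketch {
    static final int NURSERY_SIZE = 4;
    final List<Object> nursery = new ArrayList<>();
    final List<Object> oldSpace = new ArrayList<>();
    final Set<Object> live = new HashSet<>();        // stand-in for reachability

    Object alloc() {
        if (nursery.size() == NURSERY_SIZE) minorGC();   // frequent, cheap
        Object o = new Object();
        nursery.add(o);
        return o;
    }

    void minorGC() {
        for (Object o : nursery)
            if (live.contains(o)) oldSpace.add(o);   // survivors are promoted
        nursery.clear();                             // the rest are freed en masse
        // a full collection of oldSpace would happen only when it fills up
    }

    public static void main(String[] args) {
        GenSketch heap = new GenSketch();
        for (int i = 0; i < 10; i++) {
            Object o = heap.alloc();
            if (i % 3 == 0) heap.live.add(o);        // roughly a third survive
        }
        System.out.println("promoted to old space: " + heap.oldSpace.size());
    }
}
```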

Where Are the Misses? Generational Copying Collector
[Graph: where the misses occur in a generational copying collector]

Copy Order
• Static copy orders
– Breadth first: Cheney scan
– Depth first, hierarchical
– Problem: one size does not fit all
• Static profiling per class
– Inconsistent with a JIT
• Object sampling
– Too expensive in our experience
• OOR: Online Object Reordering [OOPSLA'04]
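For concreteness, a minimal sketch of a breadth-first Cheney-style copy over a toy object model (illustrative names, not MMTk's implementation); the to-space list doubles as the scan queue, which is what makes the resulting order breadth-first:

```java
// Cheney-style breadth-first copying over a toy object graph.
import java.util.*;

public class CheneySketch {
    static final class Obj {
        final String name;
        final List<Obj> fields = new ArrayList<>();
        Obj forwarded;                               // forwarding pointer once copied
        Obj(String name) { this.name = name; }
    }

    /** Copies everything reachable from roots, breadth first. */
    static List<Obj> cheneyCopy(List<Obj> roots) {
        List<Obj> toSpace = new ArrayList<>();       // also serves as the scan queue
        for (int i = 0; i < roots.size(); i++)
            roots.set(i, forward(roots.get(i), toSpace));
        for (int scan = 0; scan < toSpace.size(); scan++) {  // Cheney scan pointer
            Obj o = toSpace.get(scan);
            for (int i = 0; i < o.fields.size(); i++)
                o.fields.set(i, forward(o.fields.get(i), toSpace));
        }
        return toSpace;                              // objects now in breadth-first order
    }

    static Obj forward(Obj o, List<Obj> toSpace) {
        if (o == null) return null;
        if (o.forwarded == null) {                   // first visit: copy, leave forwarding ptr
            Obj copy = new Obj(o.name);
            copy.fields.addAll(o.fields);            // fields fixed up later by the scan
            o.forwarded = copy;
            toSpace.add(copy);
        }
        return o.forwarded;
    }

    public static void main(String[] args) {
        Obj a = new Obj("a"), b = new Obj("b"), c = new Obj("c");
        a.fields.add(b); b.fields.add(c); a.fields.add(c);
        for (Obj o : cheneyCopy(new ArrayList<>(List.of(a))))
            System.out.println(o.name);              // breadth first: a, b, c
    }
}
```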

OOR Overview
• Records object accesses in each method (excluding cold basic blocks)
• Finds hot methods by dynamic sampling
• During GC, reorders objects with hot fields in the higher generation
• Copies hot objects into a separate region

Static Analysis Example
The compiler collects access information in hot basic blocks and ignores cold ones (here, the exception handler), building a per-method access list (Access List: 1. A.b, 2. ...):

```java
void foo() {
    A a = ...;
    try {
        ... = a.b;       // hot basic block: the access to A.b is recorded
        ...
    } catch (Exception e) {
        ... = a.c;       // cold basic block: ignored
    }
}
```

Adaptive Sampling
Adaptive sampling flags foo (above) as a hot method; foo's recorded access list (1. A.b, 2. ...) then marks A.b as a hot field.
[Figure: object A points to object B through the hot field b; field c is cold]
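A hedged sketch of timer-based method sampling (illustrative only; Jikes RVM's adaptive sampler differs in detail, and every name here is made up):

```java
// Toy sampler: every 10 ms, one sample is charged to the "currently
// running" method; methods that accumulate enough samples are flagged hot.
import java.util.Map;
import java.util.concurrent.*;

public class SamplingSketch {
    static final Map<String, Integer> samples = new ConcurrentHashMap<>();
    static volatile String currentMethod = "none";
    static final int HOT_THRESHOLD = 10;

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleAtFixedRate(
            () -> samples.merge(currentMethod, 1, Integer::sum),
            10, 10, TimeUnit.MILLISECONDS);
        // simulate a workload: foo runs three times as long as bar
        for (String m : new String[]{"foo", "bar", "foo", "foo"}) {
            currentMethod = m;
            Thread.sleep(50);                        // "run" the method for 50 ms
        }
        timer.shutdown();
        samples.forEach((m, n) -> {
            if (n >= HOT_THRESHOLD) System.out.println(m + " is hot (" + n + " samples)");
        });
    }
}
```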

Advice Directed Reordering
• Example: assume (1,4), (4,7) and (2,6) are hot field accesses
• Copy order: 1, 4, 7, 2, 6, followed by the remaining cold objects (3, ...)
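A minimal sketch reproducing this example (illustrative code, not the OOR implementation): hot edges are chained in advice order, then cold objects are appended:

```java
// Chain hot field edges into a copy order; cold objects trail the hot cluster.
import java.util.*;

public class HotOrder {
    static List<Integer> order(int[][] hotEdges, int numObjects) {
        LinkedHashSet<Integer> hot = new LinkedHashSet<>();  // keeps insertion order
        for (int[] e : hotEdges) { hot.add(e[0]); hot.add(e[1]); }  // source, then target
        List<Integer> result = new ArrayList<>(hot);
        for (int i = 1; i <= numObjects; i++)
            if (!hot.contains(i)) result.add(i);             // cold objects last
        return result;
    }

    public static void main(String[] args) {
        // the slide's example: hot accesses (1,4), (4,7), (2,6) over objects 1..7
        System.out.println(order(new int[][]{{1, 4}, {4, 7}, {2, 6}}, 7));
        // prints [1, 4, 7, 2, 6, 3, 5]
    }
}
```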

OOR System Overview
[Diagram: source code → baseline compiler → executing code; adaptive sampling finds hot methods and triggers the optimizing compiler, which registers hot field accesses in an access-info database; the copying GC looks up this advice when copying objects, which in turn affects locality. The OOR additions are shown alongside the stock Jikes RVM components and the input/output paths.]

Cost of OOR
[Table: execution-time difference between the default build and the OOR build for jess, jack, raytrace, mtrt, javac, compress, pseudojbb, db, antlr, gcold, hsqldb, ipsixql, jython, and ps-fun; the per-benchmark percentages were lost in transcription, but the mean difference is -0.19%, i.e., OOR's overhead is in the noise.]

[Graphs: OOR performance on the db, jython, and javac benchmarks]

Software Is Not Enough; Hardware Is Not Enough
• Problem: inefficient use of the cache
– Hardware limitations: set associativity, and hardware cannot predict the future
• Cooperative software/hardware caching [ISCA'03, PACT'02]
– Combines high-level compiler analysis with dynamic miss behavior
– Lightweight ISA support conveys the compiler's global view to the hardware
– Compiler-guided cache replacement (evict-me), sketched below
– Compiler-guided region prefetching
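To illustrate the evict-me idea, here is a toy set-associative cache whose replacement policy prefers lines the compiler has tagged as dead. This is a hedged model, not the ISA mechanism from the papers, and every name in it is illustrative:

```java
// Toy cache: each access may carry a compiler-supplied "evict-me" hint;
// on a miss, a tagged line is evicted before LRU kicks in.
import java.util.*;

public class EvictMeCacheSketch {
    static final class Line { long tag; boolean evictMe; Line(long t, boolean e) { tag = t; evictMe = e; } }
    final int ways;
    final List<Deque<Line>> sets;                    // front of each deque = most recent

    EvictMeCacheSketch(int numSets, int ways) {
        this.ways = ways;
        sets = new ArrayList<>();
        for (int i = 0; i < numSets; i++) sets.add(new ArrayDeque<>());
    }

    /** Returns true on a hit; evictMe is the compiler's reuse hint for this access. */
    boolean access(long addr, boolean evictMe) {
        int set = (int) (addr % sets.size());
        long tag = addr / sets.size();
        Deque<Line> lines = sets.get(set);
        for (Line l : lines)
            if (l.tag == tag) {                      // hit: move to MRU, refresh hint
                lines.remove(l);
                l.evictMe = evictMe;
                lines.addFirst(l);
                return true;
            }
        if (lines.size() == ways) {                  // miss in a full set: pick a victim
            Line victim = null;
            for (Line l : lines) if (l.evictMe) victim = l;   // prefer a tagged line...
            if (victim == null) victim = lines.peekLast();    // ...else plain LRU
            lines.remove(victim);
        }
        lines.addFirst(new Line(tag, evictMe));
        return false;
    }

    public static void main(String[] args) {
        EvictMeCacheSketch cache = new EvictMeCacheSketch(1, 2);  // 1 set, 2 ways
        cache.access(0, false);       // miss, cached
        cache.access(1, true);        // miss, cached with the evict-me tag
        cache.access(2, false);       // miss: evicts tagged line 1, not line 0
        System.out.println("line 0 still cached: " + cache.access(0, false)); // true
    }
}
```

When no line is tagged, the policy falls back to plain LRU, so a wrong compiler hint costs at most one early eviction.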

Exciting Times
• Dramatic architectural changes
– Execution tiles
– Cache & memory tiles
• Next-generation system solutions
– Moving hardware/software boundaries
– Online optimizations
– Key compiler challenges (the same old ones): ILP and the cache memory hierarchy