Polar Opposites: Next Generation Languages & Architectures
  Kathryn S McKinley
  The University of Texas at Austin
Collaborators
  Faculty
    –Steve Blackburn, Doug Burger, Perry Cheng, Steve Keckler, Eliot Moss
  Graduate Students
    –Xianglong Huang, Sundeep Kushwaha, Aaron Smith, Zhenlin Wang (MTU)
  Research Staff
    –Jim Burrill, Sam Guyer, Bill Yoder
Computing in the Twenty-First Century
  New and changing architectures
    –Hitting the microprocessor wall
    –TRIPS - an architecture for future technology
  Object-oriented languages
    –Java and C# becoming mainstream
  Key challenges and approaches
    –Memory gap, parallelism
    –Language & runtime implementation efficiency
    –Orchestrating a new software/hardware dance
    –Break down artificial system boundaries
Technology Scaling Hitting the Wall
  [Figure: 20 mm chip edge at 130 nm, 100 nm, 70 nm, and 35 nm]
  Analytically … Qualitatively … Either way …
  Partitioning for on-chip communication is key
End of the Road for Out-of-Order SuperScalars
  Clock ride is over
    –Wire and pipeline limits
    –Quadratic out-of-order issue logic
    –Power, a first order constraint
  Major vendors ending processor lines
  Problems for any architectural solution
    –ILP - instruction level parallelism
    –Memory latency
Where are Programming Languages?
  High Productivity Languages
    –Java, C#, Matlab, S, Python, Perl
  High Performance Languages
    –C/C++, Fortran
  Why not both in one?
    –Interpretation/JIT vs compilation
    –Language representation
      Pointers, arrays, frequent method calls, etc.
    –Automatic memory management costs
  Obscure ILP and memory behavior
Outline
  TRIPS
    –Next generation tiled EDGE architecture
    –ILP compilation model
  Memory system performance
    –Garbage collection influence
    –The GC advantage
      Locality, locality, locality
      Online adaptive copying
    –Cooperative software/hardware caching
TRIPS Project
  Goals
    –Fast clock & high ILP in future technologies
    –Architecture sustains 1 TRIPS in 35 nm technology
    –Cost-performance scalability
    –Find the right hardware/software balance
      New balance reduces hardware complexity & power
    –New compiler responsibilities & challenges
  Hardware/Software Prototype
    –Proof-of-concept of scalability and configurability
    –Technology transfer
TRIPS Prototype Architecture
Execution Substrate
  [Figure: execution array of execution nodes with register banks, I-cache 0-3 and I-cache H, D-cache/LSQ 0-3, Global Ctrl, and Branch Predictor]
  Interconnect topology & latency exposed to compiler scheduler
Large Instruction Window
  [Figure: execution node with Control, Router, ALU, and instruction buffers; each buffer entry holds opcode, src1, src2]
  Out-of-order instruction buffers form a logical “z-dimension” in each node
    –4 logical frames of 4 x 4 instructions
  Instruction buffers add depth to execution array
    –2D array of ALUs; 3D volume of instructions
  Entire 3D volume exposed to compiler
Execution Model
  SPDI - static placement, dynamic issue
    –Dataflow within a block
    –Sequential between blocks
  TRIPS compiler challenges
    –Create large blocks of instructions
      Single entry, multiple exit, predication
    –Schedule blocks of instructions on a tile
    –Resource limitations
      Registers, Memory operations
Block Execution Model
  Program execution
    –Fetch and map block to TRIPS grid
    –Execute block, produce result(s)
    –Commit results
    –Repeat
  Block dataflow execution
    –Each cycle, execute a ready instruction at every node
    –Single read of registers and memory locations
    –Single write of registers and memory locations
    –Update the PC to successor block
  TRIPS core may speculatively execute multiple blocks (as well as instructions)
  TRIPS uses branch prediction and register renaming between blocks, but not within a block
  [Figure: blocks A, B, C, D, E between start and end]
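The dataflow firing rule within a block can be sketched in a few lines of Java. This is a toy model under stated assumptions, not the TRIPS ISA or simulator: the class `BlockDataflowSketch`, the `Inst` type, the two-operand limit, and the zero-latency operand delivery are all hypothetical. Each instruction names its consumers rather than its source registers, and every instruction whose operands have arrived fires in the same cycle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntBinaryOperator;

/** Toy model of dataflow firing inside one block; not the TRIPS ISA. */
final class BlockDataflowSketch {

    /** An instruction names its consumers instead of its source registers. */
    static final class Inst {
        final IntBinaryOperator op;
        final int[][] targets;                  // each entry: {consumer index, operand slot}
        final int[] operand = new int[2];
        final boolean[] ready = new boolean[2];
        boolean fired;

        Inst(IntBinaryOperator op, int[][] targets) {
            this.op = op;
            this.targets = targets;
        }

        /** A block input (for example a register read): nothing to wait for. */
        static Inst input(int value, int[][] targets) {
            Inst in = new Inst((a, b) -> value, targets);
            in.ready[0] = true;
            in.ready[1] = true;
            return in;
        }
    }

    /** Runs one block to completion; returns {block result, cycles used}. */
    static int[] execute(List<Inst> block) {
        int cycles = 0;
        int result = 0;
        int remaining = block.size();
        while (remaining > 0) {
            // Select every instruction whose operands are present at the start of the cycle.
            List<Inst> firing = new ArrayList<>();
            for (Inst in : block) {
                if (!in.fired && in.ready[0] && in.ready[1]) {
                    firing.add(in);
                }
            }
            if (firing.isEmpty()) {
                throw new IllegalStateException("block cannot make progress");
            }
            for (Inst in : firing) {
                int value = in.op.applyAsInt(in.operand[0], in.operand[1]);
                in.fired = true;
                remaining--;
                if (in.targets.length == 0) {
                    result = value;             // instruction with no consumers produces the block output
                }
                for (int[] t : in.targets) {    // deliver the result; consumers see it next cycle
                    Inst consumer = block.get(t[0]);
                    consumer.operand[t[1]] = value;
                    consumer.ready[t[1]] = true;
                }
            }
            cycles++;
        }
        return new int[] {result, cycles};
    }

    /** Hypothetical block computing (a + b) * (a - b) with a = 5, b = 3. */
    static int[] demo() {
        List<Inst> block = List.of(
            Inst.input(5, new int[][] {{2, 0}, {3, 0}}),
            Inst.input(3, new int[][] {{2, 1}, {3, 1}}),
            new Inst((a, b) -> a + b, new int[][] {{4, 0}}),
            new Inst((a, b) -> a - b, new int[][] {{4, 1}}),
            new Inst((a, b) -> a * b, new int[][] {}));
        return execute(block);
    }
}
```

In `demo()`, the add and subtract fire together in the second cycle because they are independent, which is the ILP the block format is meant to expose. A real schedule would also account for placement and routing latency between nodes, which this sketch ignores.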
Just Right Division of Labor
  TRIPS architecture
    –Eliminates short-term temporaries
    –Out-of-order execution at every node in grid
    –Exploits ILP, hides unpredictable latencies
      without superscalar quadratic hardware
      without VLIW guarantees of completion time
  Scale compiler - generate ILP
    –Large hyperblocks - predicate, unroll, inline, etc.
    –Schedule hyperblocks
      Map independent instructions to different nodes
      Map communicating instructions to same or close nodes
    –Let hardware deal with unpredictable latencies (loads)
  Exploits Hardware and Compiler Strengths
High Productivity Programming Languages
  Interpretation/JIT vs compilation
  Language representation
    –Pointers, arrays, frequent method calls, etc.
  Automatic memory management costs
  MMTk in IBM Jikes RVM
    –ICSE’04, SIGMETRICS’04
    –Memory Management Toolkit for Java
    –High Performance, Extensible, Portable
    –Mark-Sweep, Copying SemiSpace, Reference Counting
    –Generational collection, Beltway, etc.
Allocation Choices
  Bump-Pointer
    –Fast (increment & bounds check)
    –Can't incrementally free & reuse: must free en masse
  Free-List
    –Relatively slow (consult list for fit)
    –Can incrementally free & reuse cells
  Bump pointer
    –~70 bytes IA32 instructions, 726MB/s
  Free list
    –~140 bytes IA32 instructions, 654MB/s
  Bump pointer 11% faster in tight loop
    –< 1% in practical setting
    –No significant difference (?)
  Second order effects?
    –Locality??
    –Collection mechanism??
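The two allocation choices can be sketched over a toy integer address space. This is a minimal illustration, not MMTk code: the class names, the first-fit policy, and the choice to push freed cells onto the front of the list are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy allocators over an integer address range; names are illustrative, not MMTk's API. */
final class AllocatorSketch {

    /** Bump pointer: one addition and one bounds check per allocation. */
    static final class BumpPointer {
        private final int limit;
        private int cursor;

        BumpPointer(int start, int limit) {
            this.cursor = start;
            this.limit = limit;
        }

        /** Returns the address of the new cell, or -1 when the region is exhausted. */
        int alloc(int bytes) {
            int result = cursor;
            if (result + bytes > limit) {
                return -1;                  // a real runtime would trigger a collection here
            }
            cursor = result + bytes;
            return result;
        }

        /** No per-cell free: the whole region is released en masse. */
        void reset(int start) {
            cursor = start;
        }
    }

    /** Free list: allocation searches for a fit, but single cells can be freed and reused. */
    static final class FreeList {
        private final List<int[]> cells = new ArrayList<>();   // each entry: {address, size}

        FreeList(int start, int limit) {
            cells.add(new int[] {start, limit - start});
        }

        /** First-fit search (an assumed policy); returns -1 if no cell is large enough. */
        int alloc(int bytes) {
            for (int i = 0; i < cells.size(); i++) {
                int[] cell = cells.get(i);
                if (cell[1] >= bytes) {
                    int result = cell[0];
                    cell[0] += bytes;       // carve the allocation off the front of the cell
                    cell[1] -= bytes;
                    if (cell[1] == 0) {
                        cells.remove(i);
                    }
                    return result;
                }
            }
            return -1;
        }

        /** Returns one cell to the list so a later allocation can reuse it. */
        void free(int address, int bytes) {
            cells.add(0, new int[] {address, bytes});
        }
    }
}
```

The bump pointer hands out contiguous addresses in allocation order, while the free list may place consecutive allocations wherever a cell fits. That difference in placement is one candidate for the second order locality effect the slide asks about.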
Implications for Locality
  Compare SemiSpace (SS) & Mark-Sweep (MS) mutator
    –Mutator time
    –Mutator memory performance: L1, L2 & TLB
Locality & Architecture
MS/SS Crossover 1.6GHz PPC
MS/SS Crossover 1.9GHz AMD
MS/SS Crossover 2.6GHz P4
MS/SS Crossover 3.2GHz P4
MS/SS Crossover
  [Figure: crossover points for the 1.6GHz, 1.9GHz, 2.6GHz, and 3.2GHz machines; axis labels: locality, space]
Locality in Memory Management
  Explicit memory management on its way out
    –Key GC vs Explicit MM insights 20 yrs old
    –Technology has changed and is still changing
  Generational and Beltway Collectors
    –Significant collection time benefits over full heap collectors
    –Collect young objects
    –Infrequently collect old space
    –Copying nursery attains similar locality effects as full heap
Where are the Misses? Generational Copying Collector
Copy Order
  Static copy orders
    –Breadth first - Cheney scan
    –Depth first, hierarchical
    –Problem: one size does not fit all
  Static profiling per class
    –Inconsistent with JIT
  Object sampling
    –Too expensive in our experience
  OOR - Online Object Reordering
    –OOPSLA’04
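The two static copy orders differ only in how pending objects are remembered. The sketch below models objects as integer ids and the heap as a map from an object to the objects it references; the class `CopyOrderSketch` and that representation are assumptions for illustration, and no real memory is copied.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy copy orders over an object graph given as id -> referenced ids. */
final class CopyOrderSketch {

    /** Cheney-style scan: to-space doubles as the work queue, which yields breadth-first order. */
    static List<Integer> breadthFirst(Map<Integer, List<Integer>> refs, int root) {
        List<Integer> toSpace = new ArrayList<>();
        Set<Integer> forwarded = new HashSet<>();
        toSpace.add(root);
        forwarded.add(root);
        for (int scan = 0; scan < toSpace.size(); scan++) {
            for (int child : refs.getOrDefault(toSpace.get(scan), List.of())) {
                if (forwarded.add(child)) {     // true only the first time a child is seen
                    toSpace.add(child);
                }
            }
        }
        return toSpace;
    }

    /** Depth-first copy with an explicit stack: a child lands right after its parent's earlier subtree. */
    static List<Integer> depthFirst(Map<Integer, List<Integer>> refs, int root) {
        List<Integer> toSpace = new ArrayList<>();
        Set<Integer> forwarded = new HashSet<>();
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            int obj = stack.pop();
            if (!forwarded.add(obj)) {
                continue;                       // already copied via another reference
            }
            toSpace.add(obj);
            List<Integer> children = refs.getOrDefault(obj, List.of());
            for (int i = children.size() - 1; i >= 0; i--) {
                stack.push(children.get(i));    // reversed so the first field is visited first
            }
        }
        return toSpace;
    }
}
```

For a binary tree with root 1, children 2 and 3, and grandchildren 4 to 7, `breadthFirst` gives 1, 2, 3, 4, 5, 6, 7 and `depthFirst` gives 1, 2, 4, 5, 3, 6, 7. Which order places frequently co-accessed objects together depends on the program's access pattern, which is the "one size does not fit all" problem.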
OOR Overview
  Records object accesses in each method (excludes cold basic blocks)
  Finds hot methods by dynamic sampling
  Reorders objects with hot fields in higher generation during GC
  Copies hot objects into separate region
Static Analysis Example
  [Figure: the compiler collects access info from hot basic blocks (Hot BB) and ignores cold basic blocks (Cold BB)]
  Compiler access list:
    1. A.b
    2. ….
    ….
  Method Foo {
    Class A a;
    try {
      …=a.b; …
    } catch(Exception e){
      …a.c
    }
  }
Adaptive Sampling
  Method Foo {
    Class A a;
    try {
      …=a.b; …
    } catch(Exception e){
      …a.c
    }
  }
  Adaptive sampling: Foo is hot
  Foo accesses:
    1. A.b
    2. ….
    ….
  A.b is hot
  [Figure labels: A, B, b, ….., c]
Advice Directed Reordering
  Example
    –Assume (1,4), (4,7) and (2,6) are hot field accesses
    –Order: 1,4,7,2,6 : 3,
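One way to picture advice-directed copying is a traversal that follows hot field references immediately and defers cold ones. The sketch below is an assumption-laden illustration, not the OOR implementation: the slide does not show the object graph, so the graph in `demo()` (1 references 2, 3, 4; 2 references 5, 6; 4 references 7) is hypothetical, as are the class `HotFirstCopySketch` and the policy of a recursive hot walk plus a FIFO queue for cold references.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy hot-first copy order; the graph and policy are assumptions, not the OOR collector. */
final class HotFirstCopySketch {

    /** Copies from root, chasing hot (parent, child) references eagerly and cold ones later. */
    static List<Integer> copyOrder(Map<Integer, List<Integer>> refs,
                                   Set<List<Integer>> hot, int root) {
        List<Integer> toSpace = new ArrayList<>();
        Set<Integer> forwarded = new HashSet<>();
        Deque<Integer> cold = new ArrayDeque<>();
        cold.add(root);
        while (!cold.isEmpty()) {
            int obj = cold.poll();
            if (!forwarded.contains(obj)) {
                copyHot(obj, refs, hot, forwarded, toSpace, cold);
            }
        }
        return toSpace;
    }

    private static void copyHot(int obj, Map<Integer, List<Integer>> refs,
                                Set<List<Integer>> hot, Set<Integer> forwarded,
                                List<Integer> toSpace, Deque<Integer> cold) {
        forwarded.add(obj);
        toSpace.add(obj);
        for (int child : refs.getOrDefault(obj, List.of())) {
            if (forwarded.contains(child)) {
                continue;
            }
            if (hot.contains(List.of(obj, child))) {
                copyHot(child, refs, hot, forwarded, toSpace, cold);   // place hot target next to its parent chain
            } else {
                cold.add(child);                                        // cold reference: copy later
            }
        }
    }

    /** Hypothetical graph with the slide's hot accesses (1,4), (4,7), (2,6). */
    static List<Integer> demo() {
        Map<Integer, List<Integer>> refs = Map.of(
            1, List.of(2, 3, 4),
            2, List.of(5, 6),
            4, List.of(7));
        Set<List<Integer>> hot = Set.of(List.of(1, 4), List.of(4, 7), List.of(2, 6));
        return copyOrder(refs, hot, 1);
    }
}
```

On this assumed graph `demo()` returns 1, 4, 7, 2, 6, 3, 5, so objects connected by hot accesses end up adjacent in to-space. The order of the cold objects is a product of the assumed graph and queue policy rather than something the slide specifies.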
OOR System Overview
  [Figure: system diagram. Boxes: Source Code, Baseline Compiler, Executing Code, Adaptive Sampling, Optimizing Compiler, Access Info Database, Register Hot Field Accesses, GC: Copies Objects. Edge labels: Hot Methods, Look Up, Adds Entries, Advice, Affects Locality. Legend: OOR addition, Jikes RVM, Input/Output]
Cost of OOR
  [Table: Default, OOR, and Difference (%) for jess, jack, raytrace, mtrt, javac, compress, pseudojbb, db, antlr, gcold, hsqldb, ipsixql, jython, ps-fun]
  Mean difference: -0.19%
Performance db
Performance jython
Performance javac
Software is not enough, Hardware is not enough
  Problem: inefficient use of cache
  Hardware limitations: set associativity, cannot predict the future
  Cooperative Software/Hardware Caching
    –Combines high level compiler analysis with dynamic miss behavior
      Lightweight ISA support conveys compiler’s global view to hardware
    –Compiler-guided cache replacement (evict-me)
    –Compiler-guided region prefetching
    –ISCA’03, PACT’02
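Compiler-guided replacement can be pictured as an LRU cache set whose victim choice first considers lines the compiler has tagged. The sketch below models a single set; the class `EvictMeSetSketch`, the boolean hint passed on each access, and the rule "oldest tagged line first, otherwise plain LRU" are assumptions for illustration rather than the published hardware design.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of one cache set with an evict-me hint; not the published hardware design. */
final class EvictMeSetSketch {

    private static final class Line {
        final int tag;
        boolean evictMe;

        Line(int tag, boolean evictMe) {
            this.tag = tag;
            this.evictMe = evictMe;
        }
    }

    private final int ways;
    private final List<Line> lru = new ArrayList<>();   // index 0 is the least recently used line

    EvictMeSetSketch(int ways) {
        this.ways = ways;
    }

    /** Accesses a tag with the compiler's hint for this reference; returns true on a hit. */
    boolean access(int tag, boolean evictMe) {
        for (int i = 0; i < lru.size(); i++) {
            if (lru.get(i).tag == tag) {
                Line line = lru.remove(i);
                line.evictMe = evictMe;                 // the latest hint wins
                lru.add(line);                          // move to most recently used position
                return true;
            }
        }
        if (lru.size() == ways) {
            lru.remove(victimIndex());
        }
        lru.add(new Line(tag, evictMe));
        return false;
    }

    /** Prefers the oldest line tagged evict-me; falls back to plain LRU. */
    private int victimIndex() {
        for (int i = 0; i < lru.size(); i++) {
            if (lru.get(i).evictMe) {
                return i;
            }
        }
        return 0;
    }
}
```

In a 2-way set, accessing tag 1 (untagged), tag 2 (tagged evict-me), then tag 3 evicts tag 2 even though tag 1 is older, so a later access to tag 1 still hits. Plain LRU would have evicted tag 1, which is the kind of future knowledge the hardware cannot have without the compiler's hint.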
Exciting Times
  Dramatic architectural changes
    –Execution tiles
    –Cache & Memory tiles
  Next generation system solutions
    –Moving hardware/software boundaries
    –Online optimizations
    –Key compiler challenges (same old…)
      ILP and Cache Memory Hierarchy