JavaTile: CMP-simulation with a twist
Dan Greenfield
Computer Architecture Group Internal Presentation, 16th February 2007
Aim of Talk
Introduce JavaTile
Show benefits and problems of approach
Spark interest in collaboration
Invite expertise from multiple areas to solve CMP problems
Quick Background: Exciting Times!
Intel 80-core (1+ TFLOPS) [1]
Cisco 188-core (50 BIPS) [2]
Parts of a CMP
Q: How well does each of the components run?
Q: How well does the network run?
From Pestana et al 2004 [3]
Parts of a CMP: continued
Real Q: How well do applications run?
Motivations
Need more realistic NoC traffic
–Current methods: synthetic, limited applications, low PE count, coarse-grain, OO Superscalar internals
–How is the network used?
–What is needed in NoC for future CMP?
Want system-level view of performance, power and fault-tolerance
–Most current metrics concern the NoC and 'guess' what this means at the system level
Want to explore solutions at all levels
Some Existing CMP Approaches
SimpleScalar-based CMP simulator
–Hydra 4 MIPS-core CMP simulator
–CMP-SIM (extension of SimpleScalar)
SESC Superscalar (1.5 MIPS on 3 GHz P4)
GEMS (commercial Simics-based)
ML-RSIM (SPARC RSIM-based)
Java Virtual Machine
Platform with standard library
Virtual processor executing the Java instruction set ('bytecode')
Compilable to native platform
Java Advantages
A widely deployed standard platform
Its 'machine code' is itself object-oriented, with type information
Amenable to static code analysis
Tools to run efficiently, or compile to native executable
JavaTile Processing Element
JavaTile System
Bytecode Instrumentation
Hook into all instructions that may cause NoC traffic
Fibonacci2();
  Code:
   0: bipush 0
   2: bipush -33
   4: invokestatic #23; //Method monitor/Monitor.methodStart:(II)V
   7: sipush …
   10: sipush 0
   13: invokestatic #26; //Method monitor/Monitor.jumpMarker:(II)V
   16: aload_0
   17: sipush 1
   20: invokestatic #30; //Method monitor/Monitor.syncCycleCount:(I)V
   23: invokespecial #32; //Method java/lang/Object."<init>":()V
   26: sipush …
   29: sipush 4
   32: invokestatic #35; //Method monitor/Monitor.postMethodCall:(II)V
   35: return
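To make the hooks in the listing concrete, here is a minimal sketch of what the receiving side could look like. Only the four method names and their descriptors ((II)V and (I)V) come from the listing. The parameter names, the meaning of each argument, the event log, and the cycle counter are assumptions for illustration, and the class is declared without the monitor package so it compiles as a single file. The values passed in main are made up and do not reproduce the constructor shown above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the monitor.Monitor class that the instrumented
// bytecode calls. Only the method names and descriptors are taken from the
// listing; argument meanings and method bodies are assumptions.
final class Monitor {
    private static final List<String> events = new ArrayList<>();
    private static long cycles = 0;

    private Monitor() {}

    // (II)V hook inserted at the start of an instrumented method.
    static synchronized void methodStart(int a, int b) {
        record("methodStart", a, b);
    }

    // (II)V hook inserted before a marked point in the method body.
    static synchronized void jumpMarker(int a, int b) {
        record("jumpMarker", a, b);
    }

    // (I)V hook; assumed here to add a cycle cost to a running total.
    static synchronized void syncCycleCount(int cycleDelta) {
        cycles += cycleDelta;
        events.add("syncCycleCount(" + cycleDelta + ")");
    }

    // (II)V hook inserted after a method call returns.
    static synchronized void postMethodCall(int a, int b) {
        record("postMethodCall", a, b);
    }

    static synchronized long cycles() {
        return cycles;
    }

    // Snapshot of the log, so callers cannot modify the internal list.
    static synchronized List<String> events() {
        return Collections.unmodifiableList(new ArrayList<>(events));
    }

    static synchronized void reset() {
        events.clear();
        cycles = 0;
    }

    private static void record(String hook, int a, int b) {
        events.add(hook + "(" + a + ", " + b + ")");
    }

    public static void main(String[] args) {
        // Illustrative values only; they are not taken from the slide.
        methodStart(0, 1);
        jumpMarker(2, 0);
        syncCycleCount(1);
        postMethodCall(3, 4);
        for (String e : events()) {
            System.out.println(e);
        }
        System.out.println("cycles = " + cycles());
    }
}
```

Because every hook is a static call taking small integer constants, the instrumenter only needs to push operands and emit one invokestatic per hook, which matches the bipush/sipush/invokestatic pattern in the listing.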
Current Flow
Problems
Garbage Collection
Local memory vs global memory allocation
Passing by pointers (ownership)
Push versus Pull
No Inlining
Auto-Parallelization
Debugging
Auto-Parallelization
Software Pipelining
–e.g. MIT RAW Compiler [4]
–e.g. Princeton DSWP (Decoupled SWP) [5]
Thread-Level Speculation
–Loop-level (e.g. Stanford Jrpm) [6]
–Method-level (e.g. SableSpMT) [7]
Affine Partitioning
–e.g. Incorporated in Stanford SUIF [8]
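As a rough illustration of the software pipelining idea, the sketch below splits one sequential loop into two stages that run on separate threads and communicate only through a bounded queue. It is hand-written and is not what the RAW or DSWP compilers [4, 5] generate. The class name PipelineSketch, the two stage functions, the queue capacity, and the sentinel value are all hypothetical. On a tiled CMP the queue would presumably correspond to traffic between processing elements, but that mapping is an assumption here.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hand-written sketch of a two-stage decoupled pipeline; all names are
// hypothetical and chosen for illustration only.
final class PipelineSketch {
    // Sentinel marking the end of the stream. stageOne never produces it
    // for the small inputs used here.
    private static final int END = Integer.MIN_VALUE;

    private PipelineSketch() {}

    // Stage 1: work that depends only on the loop index.
    static int stageOne(int i) {
        return i * i;
    }

    // Stage 2: work that carries a dependence from one iteration to the next.
    static long stageTwo(long acc, int value) {
        return acc + value;
    }

    // Original form: both stages run in one loop on one thread.
    static long runSequential(int n) {
        long acc = 0;
        for (int i = 0; i < n; i++) {
            acc = stageTwo(acc, stageOne(i));
        }
        return acc;
    }

    // Pipelined form: stage 1 runs ahead on its own thread and feeds
    // stage 2 through a bounded queue.
    static long runPipelined(int n) {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(16);
        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < n; i++) {
                    queue.put(stageOne(i));
                }
                queue.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        try {
            long acc = 0;
            for (int v = queue.take(); v != END; v = queue.take()) {
                acc = stageTwo(acc, v);
            }
            producer.join();
            return acc;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("pipeline interrupted", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("sequential: " + runSequential(10));
        System.out.println("pipelined:  " + runPipelined(10));
    }
}
```

Data flows one way, from stage 1 to stage 2, so the producer never waits on the consumer except when the queue is full. That one-directional flow is what allows the stages to be decoupled.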
References
[1] Intel Polaris, from IDF 2006 slides, photo at
[2] W. Eatherton, "The Push of Network Processing to the Top of the Pyramid," Keynote Slides at:
[3] Pestana et al, Cost-Performance Trade-Offs in Networks on Chip: A Simulation-Based Approach, DATE 2004
[4] Waingold et al, Baring it All to Software: Raw Machines, Computer Vol 30, No 9, 1997
[5] Ottoni et al, Automatic Thread Extraction with Decoupled Software Pipelining, MICRO 2005
[6] Chen et al, The Jrpm System for Dynamically Parallelizing Sequential Java Programs, IEEE Micro Vol 23, No 6, Nov/Dec 2003
[7] Pickett et al, SableSpMT: a software framework for analysing speculative multithreading in Java, PASTE Workshop 2006
[8] Lim et al, An affine partitioning algorithm to maximize parallelism and minimize communications, ACM SIGARCH 1999