Mohamed. M. Saad Mohamed A. Mohamedin & Prof. Binoy Ravindran VT-MENA Program Electrical & Computer Engineering Department Virginia Polytechnic Institute and State University
Motivation & Objectives Background Transactional Memory Jikes RVM Program Reconstruction Architecture Profiler, Builder & Runtime Future Work
Why Multicores? Difficult to make single-core clock frequencies even higher Deeply pipelined circuits ▪ heat problems ▪ speed of light problems ▪ difficult design and verification ▪ large design teams necessary ▪ server farms need expensive air- conditioning
No fast CPUs any more, just more cores! Trend is using multi-core & hyper-threading
At 2005, Sun Niagara (8 cores with HT run 32 HWT) At 2010, Supermicro (48-core AMD Opteron). Now, Sun make boxes with between hardware threads (16 HWT/core, 8 cores/CPU) !! What About Software!!! Are we ready for this HW ?!
Many applications are designed to use few threads Legacy systems were designed to run at a single processor Multi-threading programming is headache for developers (race situations, concurrent access, …) HydraVM: Java Virtual Machine Prototype based on Jikes RVM and targets utilizing large number of cores through detecting automatically possible parallel portions of code
Transactional Memory Jikes RVM (Adaptive Online Architecture)
Atomicity: An operation (or set of operations) appears to the rest of the system to occur instantaneously Example (Money Transfer): …… synchronized { from = from - amount to = to + amount } …… Example (Money Transfer): …… account1.lock() account2.lock() from = from - amount to = to + amount account1.unlock() account2.unlock() …… account1 account2 X Y ≈
Drawbacks Deadlock Livelock Starvation Priority Inversion Non-composable Cost of managing the lock Non-scalable on multiprocessors AB X Y
Simplifies parallel programming by allowing a group of load and store instructions to execute in an atomic way using additional primitives Example (Money Transfer): …… START-TRANSACTION from = from - amount to = to + amount END-TRANSACTION …… …… Commit or Rollback & Retry account1 account2 X Y account1 y account2 y account1 x account2 x
Each transaction has ReadSet & WriteSet Transactions conflict if have the same variable(s) at ReadSet / WriteSet Conflict Resolution using Contention Manager that employs different policies (Aggressive, Polite, Back-Off, Random, …..) Aborted code undo changes (if required) and retries again
Transactions may be nested (multiple levels) Inner transaction share the ReadSet/WriteSet of parent Inner transactions conflicts with each other and with other higher level transactions Aborting parent transaction forces abort for children Inner transactions changes are visible to parents once commit successfully, but hidden from outside world till commit of highest level
Hardware Transactional Memory Modifications in processors, cache and bus protocol ex; unbounded HTM, TCC, …. Software Transactional Memory Software runtime library or the programming language support Minimal hardware support; CAS, LL/SC ex; RSTM, DSTM, ESTM,.. Hybrid Transactional Memory Exploits HTM support to achieve hardware performance for transactions that do not exceed the HTM’s limitations, and STM otherwise ex; LogTM, HyTM, … Distributed Transactional Memory Extends transaction primitives to distributed environment (network of multiple machines) ex; HyFlow, DecentSTM, GenSTM, …
Mature modular open source Java virtual machine designed for research purposes. Unlike most other JVMs it is written in Java! Adaptive Online System
We view program as a set of basic building blocks Each block is a set of instructions Block has single entry and multiple exists Blocks may access the same memory (variables) It is possible to reconstruct the program from these blocks by rearranging it differently with some changes to the control instructions. It is even possible to assign each set of blocks to different thread
int counter = 0; for(int i=0; i 0.3) counter++; else counter--; 0: iconst_0 1: istore_1 2: iconst_0 3: istore_2 4: goto 29 7: invokestatic #13; 10: ldc2_w #19; 13: dcmpl 14: ifle 23 17: iinc 1, 1 20: goto 26 23: iinc 1, -1 26: iinc 2, 1 29: iload_2 30: bipush 12 32: if_icmplt 7 35: return 0: iconst_0 1: istore_1 2: iconst_0 3: istore_2 4: goto 29 7: invokestatic #13; 10: ldc2_w #19; 13: dcmpl 14: ifle 23 17: iinc 1, 1 20: goto 26 23: iinc 1, -1 26: iinc 2, 1 29: iload_2 30: bipush 12 32: if_icmplt 7 35: return
public class Test{ public static void foo(){ int counter = 0; for(int i=0; i 0.3) counter++; else counter--; } public static void zoo(){ System.out.println("hi"); } public static void main(String[] args){ int i=6; if(i<10) foo(); else zoo(); } }
Split code into Basic Block Inject loaded classes with additional instructions to monitor: Program Flow (Which Basic Blocks are accessed and in what order?) Memory accessed by each Basic Block Which Basic Block is doing I/O ?
0: iconst_0 1: istore_1 2: iconst_0 3: istore_2 write J write C visit B1 4: goto 29 7: invokestatic #13; 10: ldc2_w #19; 13: dcmpl read K write K visit B2 14: ifle 23 17: iinc 1, 1 read C write C visit B3 20: goto 26 23: iinc 1, -1 read C write C visit B4 26: iinc 2, 1 read J write J visit B5 29: iload_2 30: bipush 12 visit B6 read J 32: if_icmplt 7 35: return 0: iconst_0 1: istore_1 2: iconst_0 3: istore_2 4: goto 29 7: invokestatic #13; 10: ldc2_w #19; 13: dcmpl 14: ifle 23 17: iinc 1, 1 20: goto 26 23: iinc 1, -1 26: iinc 2, 1 29: iload_2 30: bipush 12 32: if_icmplt 7 35: return Example: int C = 0; for(int J=0; J 0.3) C++; else C--;
Recompile the Java class bytecode into machine-code Replace and reload class definition at memory
Running the profiled code Collecting flow & memory access information and store it at the knowledge repository
Analyze knowledge repository information and know: Which Blocks can be grouped together Which groups of blocks can be parallelized
Program can be represented as a string (each character is a basic block). Example: for (Integer i = 0; i < DIMx; i++) { for (Integer j = 0; j < DIMx; j++) { for (Integer k = 0; k < DIMy; k++) { C[i][j] += A[i][k] * B[k][j]; } } } abjbhcfefghcfefghijbhcfefghcfefghijk ab(jb(hcfefg) 2 hi) 2 jk
ab(jb(hcfefg) 2 hi) 2 k Externalize common blocks patterns as methods Generated methods may be nested Reconstruct the program as producer-consumer pattern Collector ▪ Provides Executor with suitable blocks as Tasks to execute according to flow up-to time Executor ▪ Allocates core threads ▪ Assign tasks to threads ▪ Requests Collector for more blocks based on program flow, after all threads complete
Problems Threads may conflict when access the same variables Threads may finish out of normal order Collector may generate invalid tasks Lets represents each Thread as Transaction When two transactions conflicts abort one that has newer blocks relative to normal execution Transaction will not commit unless its preceding one in timeline is finished Transaction timeout if not reachable
Collects which transactions conflicts and commit rate We can refine the constructed program Builder re-organize generated blocks and recompile the code again
Complete the implementation of HydraVM Profiling by monitoring memory instead of generating new instructions Automatically uses of Java NIO to handle I/O operations and generate callbacks to process it Using thread scheduling techniques instead of TM Formal verification of reconstructed programs matches desired semantics