Mohamed M. Saad, Mohamed A. Mohamedin & Prof. Binoy Ravindran, VT-MENA Program, Electrical & Computer Engineering Department, Virginia Polytechnic Institute and State University

- Motivation & Objectives
- Background
- Transactional Memory
- Jikes RVM
- Program Reconstruction
- Architecture
- Profiler, Builder & Runtime
- Future Work

Why Multicores?
- Difficult to make single-core clock frequencies even higher
- Deeply pipelined circuits:
  ▪ heat problems
  ▪ speed-of-light problems
  ▪ difficult design and verification
  ▪ large design teams necessary
  ▪ server farms need expensive air-conditioning

- No faster CPUs any more, just more cores!
- The trend is toward multi-core & hyper-threading

- 2005: Sun Niagara (8 cores running 32 hardware threads, HWT)
- 2010: Supermicro (48-core AMD Opteron)
- Now: Sun makes boxes with between … hardware threads (16 HWT/core, 8 cores/CPU)!

What about software? Are we ready for this hardware?!

- Many applications are designed to use few threads
- Legacy systems were designed to run on a single processor
- Multithreaded programming is a headache for developers (race conditions, concurrent access, …)
- HydraVM: a Java Virtual Machine prototype based on Jikes RVM that aims to utilize a large number of cores by automatically detecting portions of code that can run in parallel

 Transactional Memory  Jikes RVM (Adaptive Online Architecture)

- Atomicity: an operation (or set of operations) appears to the rest of the system to occur instantaneously.

Example (Money Transfer) with a synchronized block:
    ……
    synchronized {
        from = from - amount
        to = to + amount
    }
    ……

which is roughly equivalent (≈) to explicit locking:
    ……
    account1.lock()
    account2.lock()
    from = from - amount
    to = to + amount
    account1.unlock()
    account2.unlock()
    ……

[Figure: account1, account2, threads X and Y]
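
A runnable Java version of the lock-based transfer may help. This is a minimal sketch: `Account`, its `id` field and `LockedTransfer.transfer` are hypothetical names, and acquiring the two locks in a fixed order (lowest `id` first) is an assumption added here so that two opposite transfers cannot deadlock; the slide itself does not show any lock ordering.

```java
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical account with its own lock, mirroring account1/account2 above. */
final class Account {
    final int id;                                   // used only to impose a global lock order
    final ReentrantLock lock = new ReentrantLock();
    long balance;

    Account(int id, long balance) {
        this.id = id;
        this.balance = balance;
    }
}

final class LockedTransfer {
    /** Both balance updates happen while holding both account locks. */
    static void transfer(Account from, Account to, long amount) {
        // Assumption: always lock the lower id first so opposite transfers cannot deadlock.
        Account first = from.id < to.id ? from : to;
        Account second = (first == from) ? to : from;
        first.lock.lock();
        try {
            second.lock.lock();
            try {
                from.balance -= amount;
                to.balance += amount;
            } finally {
                second.lock.unlock();
            }
        } finally {
            first.lock.unlock();
        }
    }
}
```

Without the ordering rule, one thread transferring account1 to account2 while another transfers account2 to account1 can each hold one lock and wait forever for the other, which is the deadlock drawback listed next.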

Drawbacks of locks:
- Deadlock
- Livelock
- Starvation
- Priority inversion
- Non-composable
- Cost of managing the lock
- Non-scalable on multiprocessors

- Transactional Memory simplifies parallel programming by allowing a group of load and store instructions to execute atomically, using additional primitives.

Example (Money Transfer):
    ……
    START-TRANSACTION
        from = from - amount
        to = to + amount
    END-TRANSACTION
    ……

The transaction either commits, or rolls back and retries.
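
The commit-or-retry behaviour can be imitated in plain Java without a TM library. This sketch is a deliberate simplification, not an STM: the state of both accounts lives in one immutable snapshot held by an `AtomicReference`; a "transaction" reads the snapshot, computes new balances privately, and commits with `compareAndSet`, retrying if another thread committed first. `Balances` and `OptimisticTransfer` are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

/** Immutable snapshot of both balances; every committed update installs a new object. */
final class Balances {
    final long account1;
    final long account2;

    Balances(long account1, long account2) {
        this.account1 = account1;
        this.account2 = account2;
    }
}

final class OptimisticTransfer {
    final AtomicReference<Balances> state;
    final AtomicInteger retries = new AtomicInteger();   // how often a commit lost the race

    OptimisticTransfer(long a1, long a2) {
        state = new AtomicReference<>(new Balances(a1, a2));
    }

    /** START-TRANSACTION ... END-TRANSACTION: read, compute privately, then commit or retry. */
    void transfer1to2(long amount) {
        while (true) {
            Balances seen = state.get();                  // snapshot the transaction read
            Balances next = new Balances(seen.account1 - amount,
                                         seen.account2 + amount);   // private, uncommitted writes
            if (state.compareAndSet(seen, next)) {
                return;                                   // commit: no one changed the snapshot
            }
            retries.incrementAndGet();                    // conflict: drop 'next' and retry
        }
    }
}
```

Because the new values stay private until `compareAndSet` succeeds, "rollback" here is simply discarding `next`; a real STM tracks individual variables instead of one global snapshot.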

- Each transaction has a ReadSet & a WriteSet
- Two transactions conflict when a variable in one's WriteSet also appears in the other's ReadSet or WriteSet
- Conflicts are resolved by a Contention Manager, which can employ different policies (Aggressive, Polite, Back-Off, Random, …)
- Aborted code undoes its changes (if required) and retries
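
The ReadSet/WriteSet rule and one contention policy can be sketched as follows. Variables are identified by name strings; `TxSets`, `conflicts` and `backOffMillis` are hypothetical names. Read/read sharing is treated as harmless, and the capped exponential delay is only one illustrative reading of a "Back-Off" policy, not the policy of any particular STM.

```java
import java.util.HashSet;
import java.util.Set;

/** Read and write sets recorded for one transaction. */
final class TxSets {
    final Set<String> readSet = new HashSet<>();
    final Set<String> writeSet = new HashSet<>();

    TxSets read(String var) { readSet.add(var); return this; }
    TxSets write(String var) { writeSet.add(var); return this; }

    /** True if a variable written by one transaction is read or written by the other. */
    static boolean conflicts(TxSets a, TxSets b) {
        return intersects(a.writeSet, b.readSet)
            || intersects(a.writeSet, b.writeSet)
            || intersects(b.writeSet, a.readSet);
    }

    private static boolean intersects(Set<String> x, Set<String> y) {
        for (String v : x) {
            if (y.contains(v)) return true;
        }
        return false;
    }

    /** Back-Off sketch: an aborted transaction waits 2^attempt ms, capped, before retrying. */
    static long backOffMillis(int attempt, long capMillis) {
        long delay = 1L << Math.min(attempt, 20);
        return Math.min(delay, capMillis);
    }
}
```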

- Transactions may be nested (multiple levels)
- Inner transactions share the ReadSet/WriteSet of their parent
- Inner transactions conflict with each other and with other higher-level transactions
- Aborting a parent transaction forces its children to abort
- An inner transaction's changes become visible to its parent once it commits successfully, but stay hidden from the outside world until the top-level transaction commits
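
The visibility rules of nesting can be shown with a toy Java model of closed nesting. `NestedTx`, its methods, and the `Map<String, Integer>` standing in for shared memory are assumptions for illustration; conflict detection is omitted, so the sketch only demonstrates who sees which writes and when.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of closed nesting with buffered (lazy) writes; conflict detection is omitted. */
final class NestedTx {
    private final Map<String, Integer> memory;              // shared memory
    private final NestedTx parent;                          // null for a top-level transaction
    private final Map<String, Integer> writes = new HashMap<>();
    private boolean aborted;

    private NestedTx(Map<String, Integer> memory, NestedTx parent) {
        this.memory = memory;
        this.parent = parent;
    }

    static NestedTx begin(Map<String, Integer> memory) { return new NestedTx(memory, null); }

    NestedTx beginChild() { return new NestedTx(memory, this); }

    /** A read sees this transaction's writes first, then each ancestor's, then shared memory. */
    int read(String var) {
        for (NestedTx t = this; t != null; t = t.parent) {
            Integer v = t.writes.get(var);
            if (v != null) return v;
        }
        return memory.getOrDefault(var, 0);
    }

    void write(String var, int value) { writes.put(var, value); }

    /** A child publishes to its parent only; the top-level transaction publishes to memory. */
    void commit() {
        for (NestedTx t = this; t != null; t = t.parent) {
            if (t.aborted) {                                // an aborted ancestor dooms the child
                abort();
                throw new IllegalStateException("transaction or an ancestor was aborted");
            }
        }
        if (parent != null) parent.writes.putAll(writes);
        else memory.putAll(writes);
        writes.clear();
    }

    /** Discards buffered writes, including those merged in from already-committed children. */
    void abort() {
        aborted = true;
        writes.clear();
    }
}
```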

- Hardware Transactional Memory: modifications to the processor, cache and bus protocol; e.g., unbounded HTM, TCC, …
- Software Transactional Memory: a software runtime library or programming-language support, with minimal hardware support (CAS, LL/SC); e.g., RSTM, DSTM, ESTM, …
- Hybrid Transactional Memory: exploits HTM support to achieve hardware performance for transactions that do not exceed the HTM's limitations, and uses STM otherwise; e.g., LogTM, HyTM, …
- Distributed Transactional Memory: extends transaction primitives to distributed environments (networks of multiple machines); e.g., HyFlow, DecentSTM, GenSTM, …

- Jikes RVM: a mature, modular, open-source Java virtual machine designed for research purposes. Unlike most other JVMs, it is written in Java!
- Adaptive Online System

- We view a program as a set of basic building blocks
- Each block is a set of instructions
- A block has a single entry and multiple exits
- Blocks may access the same memory (variables)
- The program can be reconstructed from these blocks by rearranging them, with some changes to the control instructions
- It is even possible to assign each set of blocks to a different thread

Example:
    int counter = 0;
    for (int i = 0; i < 12; i++)
        if (Math.random() > 0.3) counter++;
        else counter--;

Bytecode:
     0: iconst_0
     1: istore_1
     2: iconst_0
     3: istore_2
     4: goto 29
     7: invokestatic #13
    10: ldc2_w #19
    13: dcmpl
    14: ifle 23
    17: iinc 1, 1
    20: goto 26
    23: iinc 1, -1
    26: iinc 2, 1
    29: iload_2
    30: bipush 12
    32: if_icmplt 7
    35: return
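
The basic blocks of the bytecode listing above can be found with the textbook "leader" rule: the first instruction, every branch target, and every instruction right after a branch or return starts a new block. In this sketch each instruction is modelled as an offset, an opcode name and a branch target (-1 if none); `Insn` and `BasicBlocks.leaders` are hypothetical names, not Jikes RVM APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** One bytecode instruction: offset, opcode name and branch target (-1 if it does not branch). */
final class Insn {
    final int offset;
    final String opcode;
    final int target;

    Insn(int offset, String opcode, int target) {
        this.offset = offset;
        this.opcode = opcode;
        this.target = target;
    }

    /** Branches, returns and athrow end a basic block. */
    boolean endsBlock() {
        return target >= 0 || opcode.endsWith("return") || opcode.equals("athrow");
    }
}

final class BasicBlocks {
    /** Returns the start offset of every basic block, in ascending order. */
    static List<Integer> leaders(List<Insn> code) {
        TreeSet<Integer> leaders = new TreeSet<>();
        leaders.add(code.get(0).offset);                          // rule 1: first instruction
        for (int i = 0; i < code.size(); i++) {
            Insn insn = code.get(i);
            if (insn.target >= 0) leaders.add(insn.target);       // rule 2: branch targets
            if (insn.endsBlock() && i + 1 < code.size()) {
                leaders.add(code.get(i + 1).offset);              // rule 3: after a block end
            }
        }
        return new ArrayList<>(leaders);
    }
}
```

For the loop above this yields leaders 0, 7, 17, 23, 26, 29 and 35: six blocks for the loop body and setup, plus the final `return`.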

    public class Test {
        public static void foo() {
            int counter = 0;
            for (int i = 0; i < 12; i++)
                if (Math.random() > 0.3) counter++;
                else counter--;
        }

        public static void zoo() {
            System.out.println("hi");
        }

        public static void main(String[] args) {
            int i = 6;
            if (i < 10) foo();
            else zoo();
        }
    }

- Split the code into basic blocks
- Inject the loaded classes with additional instructions to monitor:
  ▪ Program flow (which basic blocks are accessed, and in what order?)
  ▪ Memory accessed by each basic block
  ▪ Which basic blocks perform I/O

Example:
    int C = 0;
    for (int J = 0; J < 12; J++)
        if (Math.random() > 0.3) C++;
        else C--;

Instrumented bytecode:
     0: iconst_0
     1: istore_1
     2: iconst_0
     3: istore_2
        write J
        write C
        visit B1
     4: goto 29
     7: invokestatic #13
    10: ldc2_w #19
    13: dcmpl
        read K
        write K
        visit B2
    14: ifle 23
    17: iinc 1, 1
        read C
        write C
        visit B3
    20: goto 26
    23: iinc 1, -1
        read C
        write C
        visit B4
    26: iinc 2, 1
        read J
        write J
        visit B5
    29: iload_2
    30: bipush 12
        visit B6
        read J
    32: if_icmplt 7
    35: return
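
In Java terms, the injected `visit`, `read` and `write` annotations behave like calls into a profiling runtime. The sketch below hand-instruments the same loop; `BlockProfiler` and `InstrumentedLoop` are hypothetical, `K` stands for whatever state `Math.random()` touches (as in the listing), and each access is tagged with its block explicitly. A real profiler would inject these calls at bytecode level rather than in source.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical profiling runtime called by the injected instructions. */
final class BlockProfiler {
    final List<String> flow = new ArrayList<>();                  // blocks in execution order
    final Map<String, TreeSet<String>> reads = new HashMap<>();   // block -> variables it read
    final Map<String, TreeSet<String>> writes = new HashMap<>();  // block -> variables it wrote

    void visit(String block) { flow.add(block); }

    void read(String block, String var) {
        reads.computeIfAbsent(block, b -> new TreeSet<>()).add(var);
    }

    void write(String block, String var) {
        writes.computeIfAbsent(block, b -> new TreeSet<>()).add(var);
    }
}

/** The example loop (C and J) with the profiling calls written out by hand. */
final class InstrumentedLoop {
    static int run(BlockProfiler p) {
        int c = 0;                                                   // B1: C = 0
        int j = 0;                                                   // B1: J = 0
        p.write("B1", "J"); p.write("B1", "C"); p.visit("B1");
        while (true) {
            p.visit("B6"); p.read("B6", "J");                        // B6: loop test J < 12
            if (j >= 12) break;                                      // falls through to return
            p.read("B2", "K"); p.write("B2", "K"); p.visit("B2");   // B2: Math.random() > 0.3
            if (Math.random() > 0.3) {
                c++;
                p.read("B3", "C"); p.write("B3", "C"); p.visit("B3");
            } else {
                c--;
                p.read("B4", "C"); p.write("B4", "C"); p.visit("B4");
            }
            j++;
            p.read("B5", "J"); p.write("B5", "J"); p.visit("B5");   // B5: J++
        }
        return c;
    }
}
```

The recorded `flow` is the block trace (B1, B6, B2, B3 or B4, B5, B6, …) and the two maps are the per-block memory-access information that ends up in the knowledge repository.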

- Recompile the Java class bytecode into machine code
- Replace and reload the class definition in memory

- Run the profiled code
- Collect flow & memory-access information and store it in the knowledge repository

- Analyze the knowledge repository information to determine:
  ▪ Which blocks can be grouped together
  ▪ Which groups of blocks can be parallelized
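
One simple test for whether two profiled blocks may run in parallel is Bernstein's conditions: neither block may write a variable that the other reads or writes. The sketch applies it to the read/write sets of B3, B4, B5 and B6 from the instrumented loop example; `BlockAccess` and `independent` are hypothetical names, and the slides do not state that HydraVM uses exactly this test.

```java
import java.util.Collections;
import java.util.Set;

/** Variables a basic block read and wrote during profiling. */
final class BlockAccess {
    final String name;
    final Set<String> reads;
    final Set<String> writes;

    BlockAccess(String name, Set<String> reads, Set<String> writes) {
        this.name = name;
        this.reads = reads;
        this.writes = writes;
    }

    /** Bernstein's conditions: no write/read, read/write or write/write overlap. */
    static boolean independent(BlockAccess a, BlockAccess b) {
        return Collections.disjoint(a.writes, b.reads)
            && Collections.disjoint(a.reads, b.writes)
            && Collections.disjoint(a.writes, b.writes);
    }
}
```

Because the sets come from profiling runs, a pair judged independent may still conflict on inputs that were never profiled, which fits the slides' choice of running reconstructed threads as transactions.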

- Program execution can be represented as a string (each character is a basic block).
- Example:
    for (Integer i = 0; i < DIMx; i++) {
        for (Integer j = 0; j < DIMx; j++) {
            for (Integer k = 0; k < DIMy; k++) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }

  Execution string: abjbhcfefghcfefghijbhcfefghcfefghijk
  Compressed: ab(jb(hcfefg)^2hi)^2jk
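
In the compressed form, a parenthesized group followed by a count repeats that group; this sketch writes the count as `^n`, e.g. `ab(jb(hcfefg)^2hi)^2jk`. A small recursive-descent expander is a quick way to check that a compressed pattern really reproduces the recorded trace. `TraceExpander` is a hypothetical helper and the `^n` syntax is an assumption; finding a good compression in the first place is a separate, harder problem not shown here.

```java
/** Expands patterns like "ab(jb(hcfefg)^2hi)^2jk": letters are blocks, (...)^n repeats a group. */
final class TraceExpander {
    private final String s;
    private int pos;

    private TraceExpander(String s) { this.s = s; }

    static String expand(String pattern) {
        TraceExpander p = new TraceExpander(pattern);
        String result = p.sequence();
        if (p.pos != pattern.length()) {
            throw new IllegalArgumentException("unmatched ')' at " + p.pos);
        }
        return result;
    }

    /** Parses letters and groups until the end of input or a closing parenthesis. */
    private String sequence() {
        StringBuilder out = new StringBuilder();
        while (pos < s.length() && s.charAt(pos) != ')') {
            char c = s.charAt(pos);
            if (c == '(') {
                pos++;                                   // consume '('
                String body = sequence();
                expect(')');
                expect('^');
                out.append(body.repeat(number()));       // String.repeat needs Java 11+
            } else if (Character.isLetter(c)) {
                out.append(c);
                pos++;
            } else {
                throw new IllegalArgumentException("unexpected '" + c + "' at " + pos);
            }
        }
        return out.toString();
    }

    private void expect(char c) {
        if (pos >= s.length() || s.charAt(pos) != c) {
            throw new IllegalArgumentException("expected '" + c + "' at " + pos);
        }
        pos++;
    }

    private int number() {
        int start = pos;
        while (pos < s.length() && Character.isDigit(s.charAt(pos))) pos++;
        if (start == pos) throw new IllegalArgumentException("expected a count at " + pos);
        return Integer.parseInt(s.substring(start, pos));
    }
}
```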

ab(jb(hcfefg)^2hi)^2jk
- Externalize common block patterns as methods
- Generated methods may be nested
- Reconstruct the program following a producer-consumer pattern
- Collector
  ▪ Provides the Executor with suitable blocks, as tasks to execute, according to the program flow so far
- Executor
  ▪ Allocates core threads
  ▪ Assigns tasks to threads
  ▪ Requests more blocks from the Collector, based on program flow, after all threads complete
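
The Collector/Executor split maps naturally onto a thread pool that is fed one batch at a time. In this sketch `BlockCollector`, `nextBatch` and `BlockExecutor.run` are hypothetical; tasks are plain `Runnable`s rather than externalized block methods, and an empty batch is assumed to mean "no more blocks". Asking for the next batch only after `invokeAll` returns mirrors "requests more blocks after all threads complete".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical Collector: returns the next batch of block tasks, or an empty list when done. */
interface BlockCollector {
    List<Runnable> nextBatch();
}

final class BlockExecutor {
    /** Runs each batch on a fixed pool of core threads; returns how many tasks were executed. */
    static int run(BlockCollector collector, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        int executed = 0;
        try {
            for (List<Runnable> batch = collector.nextBatch(); !batch.isEmpty();
                 batch = collector.nextBatch()) {
                List<Callable<Object>> calls = new ArrayList<>();
                for (Runnable task : batch) calls.add(Executors.callable(task));
                // Returns only when every task of this batch has finished; exceptions thrown
                // by tasks are captured in the returned futures and ignored in this sketch.
                pool.invokeAll(calls);
                executed += batch.size();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();          // stop early, preserve interrupt status
        } finally {
            pool.shutdown();
        }
        return executed;
    }
}
```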

Problems:
- Threads may conflict when accessing the same variables
- Threads may finish out of normal order
- The Collector may generate invalid tasks

Let's represent each thread as a transaction:
- When two transactions conflict, abort the one holding newer blocks relative to normal (sequential) execution
- A transaction will not commit until its predecessor in the timeline has finished
- A transaction times out if it is not reachable
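
The ordering rules can be sketched with a monitor that lets transactions commit strictly by sequence number. `OrderedCommitter`, `victim` and `commitInOrder` are hypothetical names; sequence numbers stand for position in normal execution order, and reading "times out if not reachable" as a bounded wait for the predecessor is an assumption of this sketch.

```java
/** Commits speculative block transactions strictly in program (sequence) order. */
final class OrderedCommitter {
    private int nextToCommit = 0;   // sequence number of the oldest uncommitted transaction

    /** Conflict rule: abort the transaction that comes later in normal execution order. */
    static int victim(int seqA, int seqB) {
        return Math.max(seqA, seqB);
    }

    /**
     * Waits until all earlier transactions have committed, then runs writeBack and advances
     * the order. Returns false on timeout or interrupt; the caller should then abort and retry.
     */
    synchronized boolean commitInOrder(int seq, Runnable writeBack, long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        try {
            while (nextToCommit != seq) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) return false;      // predecessor never finished: time out
                wait(remaining);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        writeBack.run();
        nextToCommit++;
        notifyAll();                                   // wake successors waiting for their turn
        return true;
    }
}
```

Even if later transactions finish their work first, their effects are published only after every earlier one has committed, so the committed result follows the original program order.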

- Collect which transactions conflict, and the commit rate
- Use this to refine the constructed program: the Builder reorganizes the generated blocks and recompiles the code

- Complete the implementation of HydraVM
- Profiling by monitoring memory instead of generating new instructions
- Automatically use Java NIO to handle I/O operations and generate callbacks to process them
- Use thread-scheduling techniques instead of TM
- Formal verification that reconstructed programs match the desired semantics