Program Demultiplexing: Data-flow based Speculative Parallelization
Saisanthosh Balakrishnan, Guri Sohi — University of Wisconsin-Madison

2 Speculative Parallelization
– Construct threads from a sequential program: loops, methods, …
– Execute the threads speculatively, with hardware support to enforce program order
– Application domain: irregularly parallel programs
– Important now because single-core performance gains are incremental

3 Speculative Parallelization Execution
[Figure: control-flow speculative parallelization with tasks T1–T4]
– Execution model: fork threads in program order for execution; commit tasks in that order
– Limitation: reaching distant parallelism

4 Outline
– Program Demultiplexing Overview
– Program Demultiplexing Execution Model
– Hardware Support
– Evaluation

5 Program Demultiplexing Framework
[Figure: sequential execution vs. PD execution of method M()]
– Trigger: begins execution of the handler
– Handler: sets up the execution and its parameters
– Demultiplexed execution: speculative; stored in the Execution Buffer (EB)
– At the call site: search the EB for a matching execution
– Dependence violations: invalidate affected executions
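As a concrete (if greatly simplified) illustration, the framework can be sketched in C. This toy model assumes a one-entry execution buffer and a hand-written trigger and handler; all names are illustrative, not from the paper's hardware:

```c
#include <assert.h>

/* A speculative execution, buffered until the program reaches the call site. */
typedef struct {
    int valid;    /* cleared if a dependence violation invalidates it */
    int param;    /* parameters captured by the handler */
    int result;   /* return value of the demultiplexed execution */
} EBEntry;

static EBEntry eb;  /* one-entry execution buffer for this sketch */

static int method(int x) { return x * x; }  /* the demultiplexed method */

/* Handler: sets up parameters and starts the speculative execution. */
static void handler(int param) {
    eb.param = param;
    eb.result = method(param);  /* runs speculatively, off the main thread */
    eb.valid = 1;
}

/* Trigger: fires once the method's inputs are ready (here, immediately). */
static void trigger(int param) { handler(param); }

/* At the call site: search the EB; reuse the execution if still valid. */
static int call_site(int param) {
    if (eb.valid && eb.param == param)
        return eb.result;   /* commit the buffered execution */
    return method(param);   /* miss or invalidated: execute normally */
}
```

Here trigger() would be fired by the hardware well before the call site; call_site() then reuses the buffered result only if no dependence violation has invalidated it in the meantime.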

6 Program Demultiplexing Highlights
– Method granularity: well defined, with parameters and a stack for local communication
– The trigger forks the execution: a means of reaching a distant method, separate from the call site
– Independent speculative executions: no control dependence on other executions
– Triggers lead to unordered execution, not in program order

7 Outline
– Program Demultiplexing Overview
– Program Demultiplexing Execution Model
– Hardware Support
– Evaluation

8 Example: 175.vpr, update_bb()

    ..
    x_from = block[b_from].x;
    y_from = block[b_from].y;
    find_to(x_from, y_from, block[b_from].type, rlim, &x_to, &y_to);
    ..
    for (k = 0; k < num_nets_affected; k++) {
        inet = nets_to_update[k];
        if (net_block_moved[k] == FROM_AND_TO)
            continue;
        ..
        if (net[inet].num_pins <= SMALL_NET) {
            get_non_updateable_bb(inet, &bb_coord_new[bb_index]);
        } else {
            if (net_block_moved[k] == FROM)
                update_bb(inet, &bb_coord_new[bb_index],   /* Call Site 1 */
                          &bb_edge_new[bb_index],
                          x_from, y_from, x_to, y_to);
            else
                update_bb(inet, &bb_coord_new[bb_index],   /* Call Site 2 */
                          &bb_edge_new[bb_index],
                          x_to, y_to, x_from, y_from);
        }
        ..
        bb_index++;
    }

9 Handlers
– Provide parameters to the execution
– Achieve separation of the call site from the execution
– Handler code: a slice of the instructions that the call site's arguments depend on; many variants are possible

    update_bb (inet, &bb_coord_new [bb_index], &bb_edge_new [bb_index], x_from, y_from, x_to, y_to);
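A hand-written sketch of what the handler slice for Call Site 1 might compute, using array names from the 175.vpr excerpt; the handler function and the argument struct are illustrative assumptions:

```c
#include <assert.h>

struct bbox { int xmin, ymin, xmax, ymax; };

/* Toy versions of the vpr globals the slice reads. */
int nets_to_update[8];
struct bbox bb_coord_new[8];
int bb_edge_new[8];

/* Arguments that the handler pre-computes for Call Site 1 (illustrative). */
struct update_bb_args {
    int inet;
    struct bbox *coord;
    int *edge;
    int x_from, y_from, x_to, y_to;
};

/* Handler slice: only the instructions Call Site 1's arguments depend on. */
struct update_bb_args handler_cs1(int k, int bb_index,
                                  int x_from, int y_from,
                                  int x_to, int y_to) {
    struct update_bb_args a;
    a.inet   = nets_to_update[k];            /* sliced from the loop body  */
    a.coord  = &bb_coord_new[bb_index];
    a.edge   = &bb_edge_new[bb_index];
    a.x_from = x_from;  a.y_from = y_from;   /* produced earlier by find_to */
    a.x_to   = x_to;    a.y_to   = y_to;
    return a;
}
```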

10 Handlers Example
(Same update_bb code as slide 8.) Handler H1 is the slice of instructions that produce the arguments of Call Site 1; H2 is the slice for Call Site 2.

11 Triggers
– Fork the demultiplexed execution, usually when the method and handler are ready, i.e., when their data dependences are satisfied
– Begin execution of the handler

12 Identifying Triggers
[Figure: sequential execution of M(); the trigger point is where the program state for H + M becomes available]
– Generate a memory profile
– Identify the trigger point: where the program state needed by the handler and method (H + M) is available
– Collect over many executions, for good coverage
– Represent trigger points by instruction attributes: PCs, memory write addresses
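The trigger-point identification above can be illustrated with a toy memory-profile scan: the earliest safe fork point is just after the last committed write that produces a value read by H + M. The data layout here is an assumption for illustration:

```c
#include <assert.h>

struct write_rec { unsigned addr; long commit_time; };

/* Latest commit time among the profiled writes that produce H+M's read set.
 * The trigger can fire once the instruction at this time has committed. */
long trigger_point(const struct write_rec *profile, int n_writes,
                   const unsigned *read_set, int n_reads) {
    long latest = 0;
    for (int r = 0; r < n_reads; r++)
        for (int w = 0; w < n_writes; w++)
            if (profile[w].addr == read_set[r] &&
                profile[w].commit_time > latest)
                latest = profile[w].commit_time;
    return latest;
}
```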

13 Triggers Example
(Same update_bb code as slide 8.) Triggers T1 and T2 start handlers H1 and H2, each followed by method M. The trigger points precede the call sites by a minimum of 400 cycles, while each execution takes about 90 cycles.

14 Handlers Example … (2)
(Same update_bb code as slide 8, with handlers H1/H2 and triggers T1/T2 marked.) The highlighted operands are stack references (see slide 41).

15 Outline
– Program Demultiplexing Overview
– Program Demultiplexing Execution Model
– Hardware Support
– Evaluation

16 Hardware Support Outline
– Support for triggers
– Demultiplexed execution
– Maintaining executions: storage, invalidation, committing (addressed in other speculative-parallelization proposals)

17 Support for Triggers
– Triggers are registered with hardware through ISA extensions, similar to debug watchpoints
– Triggers are evaluated only by committed instructions (PC, address)
– Fast lookup with filters
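A software sketch of this mechanism, assuming a small trigger table and a one-level hash filter that rejects most committed (PC, address) pairs before the full search; the registration interface and filter design are illustrative, not the paper's actual ISA extension:

```c
#include <assert.h>

#define N_TRIGGERS  8
#define FILTER_BITS 64

struct trigger { unsigned pc; unsigned addr; int active; };

static struct trigger table[N_TRIGGERS];
static unsigned char filter[FILTER_BITS];   /* one bit per hash bucket */

static unsigned hash_pair(unsigned pc, unsigned addr) {
    return (pc ^ (addr * 2654435761u)) % FILTER_BITS;
}

/* "ISA extension": register a trigger, like setting a debug watchpoint. */
void register_trigger(int i, unsigned pc, unsigned addr) {
    table[i] = (struct trigger){ pc, addr, 1 };
    filter[hash_pair(pc, addr)] = 1;
}

/* Called only for committed instructions; returns a matching trigger or -1. */
int check_commit(unsigned pc, unsigned addr) {
    if (!filter[hash_pair(pc, addr)])
        return -1;                      /* fast path: definitely no trigger */
    for (int i = 0; i < N_TRIGGERS; i++)
        if (table[i].active && table[i].pc == pc && table[i].addr == addr)
            return i;
    return -1;                          /* filter false positive */
}
```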

18 Demultiplexed Execution
[Figure: main processor plus auxiliary processors P0–P3, each with a private cache]
– Hardware: a typical multiprocessor system
– A private cache holds the speculative data; cache lines are extended with an "access" bit
– Misses are serviced by the main processor; there is no communication with other executions
– On completion: collect the read set R (accessed lines) and the write set W (dirty lines), then invalidate the write set in the cache

19 Execution Buffer Pool
[Figure: pool of entries, each with a read set, write set, method (parameters), and return value]
– Holds speculative executions
– An execution entry contains the read and write sets, the parameters, and the return value
– Alternative: use the cache instead; it may be more efficient and is similar to other proposals, but is not the focus of this paper

20 Invalidating Executions
– For a committed store address, search the read and write sets and invalidate the matching executions

21 Using Executions
– For a given call site, search by method name and parameters
– Get the write and read sets, and commit
– Use the execution if it is accessed by the program, or if it is accessed by another method (nested methods)
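Slides 19 through 21 can be tied together in one small C sketch: execution-buffer entries carry read/write sets of cache-line addresses, a committed store invalidates overlapping entries, and the call site looks up by method name and parameters. All structure here is an illustrative assumption:

```c
#include <assert.h>
#include <string.h>

#define MAX_LINES 8

struct eb_entry {
    int valid;
    const char *method;            /* method name */
    int param;                     /* parameters (one int here) */
    int retval;                    /* return value */
    unsigned read_set[MAX_LINES];  int n_read;   /* accessed cache lines */
    unsigned write_set[MAX_LINES]; int n_write;  /* dirty cache lines */
};

static int in_set(const unsigned *s, int n, unsigned line) {
    for (int i = 0; i < n; i++)
        if (s[i] == line) return 1;
    return 0;
}

/* A committed store invalidates every execution whose read or write set
 * contains the stored-to cache line. */
void on_committed_store(struct eb_entry *eb, int n, unsigned line) {
    for (int i = 0; i < n; i++)
        if (eb[i].valid &&
            (in_set(eb[i].read_set,  eb[i].n_read,  line) ||
             in_set(eb[i].write_set, eb[i].n_write, line)))
            eb[i].valid = 0;
}

/* At a call site: find a valid execution by method name and parameters. */
struct eb_entry *lookup(struct eb_entry *eb, int n,
                        const char *method, int param) {
    for (int i = 0; i < n; i++)
        if (eb[i].valid && strcmp(eb[i].method, method) == 0 &&
            eb[i].param == param)
            return &eb[i];
    return 0;
}
```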

22 Outline
– Program Demultiplexing Overview
– Program Demultiplexing Execution Model
– Hardware Support
– Evaluation

23 Reaching distant parallelism
[Figure: method M() with fork-to-call-site distance A and distance B]

24 Performance evaluation
[Figure: performance results]
– Performance benefits are limited by the methods in the program and by the handler implementation

25 Summary of other results (refer to the paper)
– Method sizes: tens to thousands of instructions; usually in the low hundreds
– Demultiplexed execution overheads: commonly 1.1x to 2.0x
– Trigger points: 1 to 3 per call site; outliers exist (macro usage)
– Handler length: 10 to 50 instructions on average
– Cache lines: roughly 20s read, 10s written
– Demultiplexed executions were held for hundreds of cycles on average

26 Conclusions
– Method granularity exploits the modularity in a program
– The trigger and handler allow the "earliest" possible execution, based on data flow
– Unordered execution reaches distant parallelism
– Orthogonal to other speculative parallelization, which can be used to further speed up demultiplexed executions

Backup

28 Average trigger points per call site
[Figure: average number of trigger points per call site]
– A small set of trigger points exists for a given call site
– This set defines the reachability from the trigger to the call site

29 Evaluation
– Full-system, execution-based simulator: Intel x86 ISA with Virtutech Simics
– 4-wide out-of-order processors
– 64 KB Level 1 caches (2-cycle), 1 MB Level 2 (12-cycle), MSI coherence
– Software toolchain: modified gcc compiler and lancet tool (debugging information, CFG, program dependence graph)
– Simulator-based memory profile; generates triggers and handlers
– No mis-speculations occur

30 Reaching distant parallelism
[Figure: method M(); A = cycles between fork and call site]

31 Execution Buffer Entries
[Figures: number of entries; average cycles held]
– Storage requirements: 284 KB in the worst case
– Entries can be minimized by better scheduling

32 Read and write set
[Figures: cache lines read and cache lines written per execution]

33 Demultiplexed execution overheads
[Figure: execution-time overhead]
– Overheads are due to the handler and to cache misses during the demultiplexed execution
– Common case: between 1.1x and 2.0x
– Small methods lead to high overheads

34 Length of handlers
[Figure: handler instruction count as overhead per benchmark: 14%, 10%, 9%, 100%, 16%, 4%, 40%, 4%]

35 Method sizes
[Figure: distribution of method sizes]

36 Methods
[Table: per benchmark (crafty, gap, gzip, mcf, parser, twolf, vortex, vpr): number of methods, call sites, and execution time (%)]
– The runtime includes frequently called methods

37 Loop-level Parallelization (Mitosis)
[Figure: fork … loop … end]
– Unit: loop iterations
– Live-ins provided by a p-slice, similar to a handler
– The fork instruction is restricted to the same basic-block level and method
– Program-order dependent; ordered forking

38 Method-level parallelization
[Figure: call M() … ret]
– Unit: method continuations (the program after the method returns)
– Orthogonal to PD

39 Reaching distant parallelism
[Figure: per benchmark (crafty, gap, gzip, mcf, parser, twolf, vortex, vpr), the fraction (%) greater than 1, for nested methods M1() and M2() with distances A and B]

40 Reaching distant parallelism
[Figure: B = time from call to earliest execution time (1 outstanding); ratios C / B = R1 and C (no params) / C = R2, for methods M1() and M2()]

41 Issues with the Stack
– The stack pointer is position dependent: the handler has to insert parameters at the right position
– The same stack address can denote different variables, which affects triggers
– The program and the execution have different stack pointers
– The execution's stack may be discarded; committing requires relocating stack results (example: parameters passed by reference)

42 Benchmarks
– SPECint2000 C programs; gcc, perl, bzip2, and eon were not evaluated
– Written with no intention of creating concurrency and with no specific or clean programming style
– Many methods perform several tasks, so there may be fewer opportunities

43 Hardware System
– Intel x86 simulation: Virtutech Simics full system with a Bochs decoder
– 4 processors at 3 GHz; simple memory system
– Microarchitecture model: 4-wide out-of-order, without cracking into micro-ops; branch predictors
– 32 KB L1 (2-cycle), 1 MB L2 (12-cycle)
– MSI coherence, 15-cycle cache-to-cache communication
– Infinite execution buffer pool

44 Software
– Modified gcc compiler tool chain and lancet tool
– Extracted from the compiled binary: debugging information, CFG, program dependence graph
– Dynamic information comes from the simulator; handlers and triggers are generated for each call site as it is encountered
– Control flow in the handler is not included (ongoing work); perfect control transfer from trigger to method is assumed, so the handler does not execute if a branch leads to the method not being called

45 Generating Handlers
– Handler code cannot easily be identified and demarcated; a heuristic demarcates it, terminating when a load address is from the heap
– The handler has loads and stores to the stack, and no stores to the heap
– Limitation: it is a heuristic, and it does not always work

46 Generating Handlers
1. Specifying parameters to the method: the program pushes them onto the stack, which introduces a dependency and prevents separation
2. Computing the parameters: the program performs this near the call site; the code must be identified, dealing with use of the stack, control flow, and inter-method dependences

    1: G = F (N)
    2: if (…)
    3:     X = G + 2
    4: else
    5:     X = G * 2
    6: M (X)
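The numbered example on this slide can be written out in C to show what the handler must replicate: the parameter computation, including the branch. F, the condition, and the concrete values below are illustrative stand-ins:

```c
#include <assert.h>

static int F(int n) { return n + 1; }   /* stand-in for the real F */

/* Handler sketch: recompute parameter X exactly as the call site would. */
int handler_for_M(int n, int cond) {
    int g = F(n);        /* 1: G = F(N)         */
    int x;
    if (cond)            /* 2: if (...)         */
        x = g + 2;       /* 3:   X = G + 2      */
    else
        x = g * 2;       /* 5:   X = G * 2      */
    return x;            /* 6: M(X) receives X  */
}
```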

47 Control-flow in Handlers
[Figure: CFG of C() and a call graph in which C calls D]
– The handler for D (call site in C() at BB3) depends on the call site's control flow
– It must include the loop (BB4 back to BB1) and the branch in BB1
– Inclusion depends on the trigger; multiple iterations may need different triggers
– Ongoing work

48 Other dependences in Handlers
[Figure: call graph with A(X), B(X), C(X), D(X)]
– If C calls D, and A or B calls C, the dependence on X extends across methods
– Multiple call sites may require multiple handlers

49 Buffering Handler Writes
[Figure: processors P1–P3 with caches and the EB]
– General case: writes in the handler are buffered, provided to the execution, and discarded after the execution
– Current implementation: only stack writes

50 Methods for Speculative Execution
– Well encapsulated: defined by parameters and a return value; the stack is used for local computation, the heap for global state
– Often perform specific tasks that access limited global state, which limits side effects