
1 Program Demultiplexing: Data-flow Based Speculative Parallelization
Saisanthosh Balakrishnan, Guri Sohi
University of Wisconsin-Madison

2 Speculative Parallelization
Construct threads from a sequential program
– Loops, methods, …
Execute threads speculatively
– Hardware support to enforce program order
Application domain
– Irregularly parallel programs
Why it matters now
– Single-core performance gains are incremental

3 Speculative Parallelization Execution
Execution model
– Fork threads in program order for execution
– Commit tasks in that order
(Figure: control-flow speculative parallelization with tasks T1 through T4.)
Limitation
– Reaching distant parallelism

4 Outline
Program Demultiplexing Overview
Program Demultiplexing Execution Model
Hardware Support
Evaluation

5 Program Demultiplexing Framework
(Figure: sequential execution vs. PD execution of method M(), showing the call site, trigger, handler, and execution buffer EB.)
Trigger
– Begins execution of the handler
Handler
– Sets up the execution: computes parameters
Demultiplexed execution
– Speculative
– Result stored in the Execution Buffer (EB)
At the call site
– Search the EB for a matching execution
Dependence violations
– Invalidate affected executions

6 Program Demultiplexing Highlights
Method granularity
– Well defined: parameters, stack for local communication
Trigger forks the execution
– Means of reaching a distant method
– Different from the call site
Independent speculative executions
– No control dependence on other executions
– Triggers lead to unordered execution, not program order

7 Outline
Program Demultiplexing Overview
Program Demultiplexing Execution Model
Hardware Support
Evaluation

8 Example: 175.vpr, update_bb()

    ..
    x_from = block[b_from].x;
    y_from = block[b_from].y;
    find_to(x_from, y_from, block[b_from].type, rlim, &x_to, &y_to);
    ..
    for (k = 0; k < num_nets_affected; k++) {
        inet = nets_to_update[k];
        if (net_block_moved[k] == FROM_AND_TO)
            continue;
        ..
        if (net[inet].num_pins <= SMALL_NET) {
            get_non_updateable_bb(inet, &bb_coord_new[bb_index]);
        } else {
            if (net_block_moved[k] == FROM)          /* Call Site 1 */
                update_bb(inet, &bb_coord_new[bb_index],
                          &bb_edge_new[bb_index],
                          x_from, y_from, x_to, y_to);
            else                                     /* Call Site 2 */
                update_bb(inet, &bb_coord_new[bb_index],
                          &bb_edge_new[bb_index],
                          x_to, y_to, x_from, y_from);
        }
        ..
        bb_index++;
    }

9 Handlers
Provide parameters to the execution
Achieve separation of the call site and the execution
Handler code
– A slice of the instructions the call site depends on
– Many variants possible
Example call site:
    update_bb(inet, &bb_coord_new[bb_index], &bb_edge_new[bb_index],
              x_from, y_from, x_to, y_to);
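To make the idea concrete, here is a minimal sketch of what the handler's backward slice for Call Site 1 computes: only the instructions that produce update_bb's parameters survive. The struct and function names (block_t, handler_h1_params) are illustrative, not the paper's actual tooling.

```c
#include <assert.h>

/* Hypothetical data mirroring the 175.vpr example: block positions. */
struct block_t { int x, y; };

/* Handler sketch: the backward slice from the call site, reduced to the
 * two loads that feed the x_from/y_from parameters of update_bb. */
static void handler_h1_params(const struct block_t *block, int b_from,
                              int *x_from, int *y_from)
{
    *x_from = block[b_from].x;
    *y_from = block[b_from].y;
}
```

Because the slice touches only the stack and read-only heap state, it can run at the trigger point, well before the program reaches the call site.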

10 Handlers Example
(Same code as slide 8, with the handler slices H1 and H2 for the two call sites highlighted.)

11 Triggers
Fork the demultiplexed execution
– Usually when the method and handler are ready, i.e., when their data dependencies are satisfied
Begin execution of the handler

12 Identifying Triggers
(Figure: sequential execution of M(), marking the point where the program state for H + M becomes available.)
Generate a memory profile
Identify the trigger point: where the program state for the handler and method (H + M) is available
Collect over many executions
– Good coverage
Represent trigger points by instruction attributes
– PCs, memory write addresses
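A simplified sketch of trigger-point identification from a memory profile, under the assumption that the profile is just the ordered list of committed store addresses: the trigger point is the last store that produces a value H + M will read, since after it the program state for H + M is available. All names here are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Given the program's committed stores in order, and the set of addresses
 * that handler + method read, find the latest producing store: the
 * execution can be forked immediately after it commits. */
static int trigger_point(const uint64_t *stores, int n_stores,
                         const uint64_t *hm_reads, int n_reads)
{
    int last = -1;
    for (int i = 0; i < n_stores; i++)
        for (int j = 0; j < n_reads; j++)
            if (stores[i] == hm_reads[j] && i > last)
                last = i;  /* latest producer of an H + M input */
    return last;           /* fork after this committed store */
}
```

In the paper the point is then represented by instruction attributes (PC, write address) rather than a dynamic index, so it can be recognized in future runs.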

13 Triggers Example
(Same code as slide 8, annotated with trigger points T1 and T2 for the handler/method pairs H1 + M and H2 + M.)
Minimum of 400 cycles between trigger and call site; 90 cycles per execution

14 Handlers Example … (2)
(Same code as slide 8, with handlers H1 and H2, triggers T1 and T2, and the stack references highlighted.)

15 Outline
Program Demultiplexing Overview
Program Demultiplexing Execution Model
Hardware Support
Evaluation

16 Hardware Support Outline
Support for triggers
Demultiplexed execution
Maintaining executions
– Storage
– Invalidation
– Committing
Committing is addressed in other speculative parallelization proposals

17 Support for Triggers
Triggers are registered with hardware
– ISA extensions
– Similar to debug watchpoints
Evaluation of triggers
– Only by committed instructions (PC, address)
– Fast lookup with filters
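The "fast lookup with filters" idea can be sketched in software as a small Bloom-style filter over trigger write addresses, so that most committed stores can skip the full trigger-table lookup. The filter size and hash function below are assumptions for illustration, not the paper's hardware design.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FILTER_BITS 1024
static uint8_t filter[FILTER_BITS / 8];

static unsigned hash_addr(uint64_t a)
{
    return (unsigned)((a ^ (a >> 17)) % FILTER_BITS);
}

/* Registering a trigger sets its bit in the filter. */
static void trigger_register(uint64_t write_addr)
{
    unsigned h = hash_addr(write_addr);
    filter[h / 8] |= (uint8_t)(1u << (h % 8));
}

/* Probed on each committed store: false means no trigger can match,
 * so the full trigger table need not be consulted. True may be a
 * false positive, which only costs an extra table lookup. */
static bool trigger_may_match(uint64_t write_addr)
{
    unsigned h = hash_addr(write_addr);
    return (filter[h / 8] >> (h % 8)) & 1u;
}
```

The one-sided error is the key property: a negative answer is always safe, which is what lets the common case (stores that match no trigger) stay off the critical path.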

18 Demultiplexed Execution
Hardware: a typical multiprocessor system
– Private cache for speculative data
– Cache lines extended with an "access" bit
Misses serviced by the main processor
– No communication with other executions
On completion
– Collect the read set (R): accessed lines
– Collect the write set (W): dirty lines
– Invalidate the write set in the cache
(Figure: main processor P0 and auxiliary processors P1 to P3, each with a private cache C.)

19 Execution Buffer Pool
Holds speculative executions
Each execution entry contains
– Read set and write set
– Method, parameters, and return value
Alternatives
– Use the cache: may be more efficient, similar to other proposals
– Not the focus of this paper

20 Invalidating Executions
For each committed store address
– Search the read and write sets
– Invalidate matching executions
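A software sketch of this invalidation check follows: each entry keeps its read set (R) and write set (W) as line addresses, and a committed store that hits either set invalidates the entry. The fixed sizes and struct names are assumptions for illustration, not the paper's hardware layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_LINES 8

struct eb_entry {
    bool valid;
    int n_read, n_write;
    uint64_t read_set[MAX_LINES];   /* cache lines the execution read */
    uint64_t write_set[MAX_LINES];  /* cache lines the execution wrote */
};

static bool set_contains(const uint64_t *set, int n, uint64_t line)
{
    for (int i = 0; i < n; i++)
        if (set[i] == line)
            return true;
    return false;
}

/* Called for each committed store: a hit in either set means the
 * speculative execution consumed or produced stale data. */
static void eb_invalidate(struct eb_entry *pool, int n, uint64_t store_line)
{
    for (int i = 0; i < n; i++)
        if (pool[i].valid &&
            (set_contains(pool[i].read_set, pool[i].n_read, store_line) ||
             set_contains(pool[i].write_set, pool[i].n_write, store_line)))
            pool[i].valid = false;
}
```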

21 Using Executions
For a given call site
– Search by method name and parameters
– Get the write and read sets
– Commit
If accessed by the program
– Use the execution
If accessed by another method
– Nested methods
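The call-site search can be sketched as a scan keyed on the method and its parameter values: an execution is reusable only if it is still valid and both match. A single integer parameter stands in for the full parameter list here; the structure and names are illustrative, not the paper's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct eb_exec {
    bool valid;
    int method_id;
    int64_t param;      /* stand-in for the real parameter list */
    int64_t ret_value;  /* result to supply if the execution is used */
};

/* Returns the index of a matching execution, or -1 when none exists
 * and the call must run normally. */
static int eb_lookup(const struct eb_exec *pool, int n,
                     int method_id, int64_t param)
{
    for (int i = 0; i < n; i++)
        if (pool[i].valid && pool[i].method_id == method_id &&
            pool[i].param == param)
            return i;   /* commit this execution instead of re-running */
    return -1;
}
```

A parameter mismatch simply falls back to normal execution, so the mechanism never has to be correct about reuse, only about dependence violations.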

22 Outline
Program Demultiplexing Overview
Program Demultiplexing Execution Model
Hardware Support
Evaluation

23 Reaching Distant Parallelism
(Figure: method M() with its fork point and call site; A and B mark the spans between them.)

24 Performance Evaluation
Performance benefits limited by
– The methods in the program
– The handler implementation

25 Summary of Other Results (refer to the paper)
Method sizes
– 10s to 1000s of instructions; usually in the low 100s
Demultiplexed execution overheads
– Common case 1.1x to 2.0x
Trigger points
– 1 to 3 per call site; outliers exist (macro usage)
Handler length
– 10 to 50 instructions on average
Cache lines
– Read: ~20s; written: ~10s
Demultiplexed executions
– Held an average of 100s of cycles

26 Conclusions
Method granularity
– Exploits the modularity in the program
Trigger and handler allow the "earliest" execution
– Data-flow based
Unordered execution
– Reaches distant parallelism
Orthogonal to other speculative parallelization techniques
– Can be used to further speed up demultiplexed executions

27 Backup

28 Average Trigger Points per Call Site
Small set of trigger points for a given call site
– Defines reachability from the trigger to the call site

29 Evaluation
Full-system execution-based simulator
– Intel x86 ISA and Virtutech Simics
– 4-wide out-of-order processors
– 64 KB Level 1 caches (2-cycle), 1 MB Level 2 (12-cycle)
– MSI coherence
Software toolchain
– Modified gcc compiler and lancet tool: debugging information, CFG, program dependence graph
– Simulator-based memory profile
– Generates triggers and handlers
No mis-speculations occur

30 Reaching Distant Parallelism
A = cycles between fork and call site
(Figure: method M() with span A.)

31 Execution Buffer Entries
– Storage requirements: 284 KB in the worst case
– Minimize entries with better scheduling
(Chart: average cycles an entry is held, per benchmark: 900, 590, 70, 520, 413, 244, 160, 308.)

32 Read and Write Set
(Chart: cache lines read and cache lines written, per benchmark.)

33 Demultiplexed Execution Overheads
Overheads due to
– The handler
– Cache misses during demultiplexed execution
Common case
– Between 1.1x and 2.0x
Small methods lead to high overheads
(Chart: execution time overhead, per benchmark.)

34 Length of Handlers
(Chart: handler instruction count overhead, per benchmark: 14%, 10%, 9%, 100%, 16%, 4%, 40%, 4%.)

35 Method Sizes
(Chart: method size distribution, per benchmark.)

36 Methods
– Runtime includes frequently called methods

              Methods   Call Sites   Exec. time (%)
    crafty       24         206            85
    gap          16          59            90
    gzip          9          27            51
    mcf           8           9            30
    parser       12          84            55
    twolf        10          26            92
    vortex       11         106            88
    vpr          11          20            99

37 Loop-Level Parallelization (Mitosis)
Unit: loop iterations
Live-ins from
– A P-slice, similar to a handler
Fork instruction
– Restricted: same basic-block level, same method
– Program-order dependent
– Ordered forking
(Figure: fork at loop entry; loop end.)

38 Method-Level Parallelization
Unit: method continuations
– The program after the method returns
Orthogonal to PD
(Figure: call M(), ret.)

39 Reaching Distant Parallelism
(Chart: fraction of executions with A > 1 (%), per benchmark; figure shows methods M1() and M2() with spans A and B.)
    crafty 60, gap 72, gzip 30, mcf 80, parser 70, twolf 40, vortex 63, vpr 47

40 Reaching Distant Parallelism
B = cycles from call time to the earliest execution time (1 outstanding)
(Figure: methods M1() and M2() with spans B and C; ratios C / B = R1 and C [no params] / C = R2.)

41 Issues with the Stack
The stack pointer is position dependent
– The handler has to insert parameters at the right positions
The same stack addresses denote different variables
– Affects triggers
Different stack pointers in the program and the execution
– The stack may be discarded
– Committing requires relocation of stack results
  Example: parameters passed by reference

42 Benchmarks
SPECint2000 benchmarks
– C programs
– Did not evaluate gcc, perl, bzip2, and eon
Caveats
– Written with no intention of creating concurrency
– No specific or clean programming style: many methods perform several tasks
– May offer fewer opportunities

43 Hardware System
Intel x86 simulation
– Virtutech Simics based full-system simulation, Bochs decoder
– 4 processors at 3 GHz
– Simple memory system
Microarchitecture model
– 4-wide out-of-order, without cracking into micro-ops
– Branch predictors
– 32 KB L1 (2-cycle), 1 MB L2 (12-cycle)
– MSI coherence, 15-cycle cache-to-cache communication
– Infinite execution buffer pool

44 Software
Modified gcc compiler toolchain and lancet tool
Extracted from the compiled binary
– Debugging information
– CFG, program dependence graph
Software
– Dynamic information from the simulator
– Generates the handler and trigger for each call site as it is encountered
Control flow in handlers not included [ongoing work]
– Perfect control transfer from trigger to method assumed
– The handler does not execute if a branch leads to the method not being called

45 Generating Handlers
Handler code cannot easily be identified and demarcated
– Heuristic to demarcate
– Terminate the slice when a load address is from the heap
A handler has
– Loads and stores to the stack
– No stores to the heap
Limitation
– A heuristic; it does not always work
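The demarcation heuristic above can be sketched as follows: while growing the handler slice backwards from the call site, a load is kept only if its address lies in the stack region, and the first heap load terminates the slice. The address-range check and all names are illustrative assumptions, not real process layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if addr lies in the live stack region [sp, stack_top);
 * the stack is assumed to grow down from stack_top. */
static bool is_stack_addr(uint64_t addr, uint64_t sp, uint64_t stack_top)
{
    return addr >= sp && addr < stack_top;
}

/* Walk the candidate load addresses in slicing order; return how many
 * are included in the handler before the heuristic terminates it. */
static int slice_length(const uint64_t *loads, int n,
                        uint64_t sp, uint64_t stack_top)
{
    for (int i = 0; i < n; i++)
        if (!is_stack_addr(loads[i], sp, stack_top))
            return i;    /* heap load: stop the slice here */
    return n;
}
```

This also makes the stated limitation visible: a handler that legitimately needs an early heap value (one that is ready well before the trigger) is cut short by the same rule.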

46 Generating Handlers
1: Specifying parameters to the method
– Pushed onto the stack by the program
– Introduces a dependency that prevents separation
2: Computing parameters
– The program performs it near the call site
– Need to identify that code
– Must deal with use of the stack, control flow, and inter-method dependences

    1: G = F (N)
    2: if (…)
    3:   X = G + 2
    4: else
    5:   X = G * 2
    6: M (X)

47 Control Flow in Handlers
(Figure: CFG of C with basic blocks 1 to 4, and the call graph; C calls D.)
The handler depends on the call site's control flow
Handler for D
– Call site in C(), in basic block 3
– Include the loop from BB 4 back to BB 1
– Include the branch in BB 1
Inclusion depends on the trigger
– Multiple iterations, different triggers
Ongoing work

48 Other Dependencies in Handlers
(Call graph: A(X) and B(X) call C(X); C calls D(X).)
C calls D; A or B calls C
– The dependence on X extends up the call graph
May need multiple handlers
– If there are multiple call sites

49 Buffering Handler Writes
General case
– Writes in the handler must be buffered
– Provided to the execution
– Discarded after the execution
Current implementation
– Only stack writes
(Figure: execution buffer EB with processors P1 to P3 and their caches.)

50 Methods for Speculative Execution
Well encapsulated
– Defined by parameters and return value
– Stack for local computation
– Heap for global state
Often perform specific tasks
– Access limited global state
– Limits side effects

