
Hardware-based Job Queue Management for Manycore Architectures and OpenMP Environments Junghee Lee, Chrysostomos Nicopoulos, Yongjae Lee, Hyung Gyu Lee.


1 Hardware-based Job Queue Management for Manycore Architectures and OpenMP Environments
Junghee Lee, Chrysostomos Nicopoulos, Yongjae Lee, Hyung Gyu Lee and Jongman Kim
Presented by Junghee Lee

2 Introduction
Manycore systems
–Number of cores is increasing
Challenges in scalability
–Memory
–Power consumption
–Cache coherence protocol
–Load balancing

3 Contents
Introduction
Background
–Programming models
–Motivation
IsoNet
Fault-tolerance
Evaluation
Conclusion

4 Programming Models
Parallel programming models
–MPI
–OpenMP
Fine-grained parallelism
–Emerging applications: Recognition, Mining and Synthesis
–Each computation kernel runs for a very short time but exposes abundant parallelism
–Multithreading incurs excessive overhead at this granularity

5 Job Queuing
Creates jobs instead of threads
–One thread per core is created
–Thread: a set of instructions and states of execution
–Job: a set of data that is processed by a thread
Job queue
–Manages the list of jobs
–Maintains load balance
[Figure: CPU, thread and job]
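The job-queuing idea on this slide can be sketched in software as follows. This is a minimal illustration, not the authors' implementation: the function name `run_jobs` and its parameters are hypothetical, and it assumes all jobs are known before the workers start. A fixed pool of threads (one per core by default) pulls jobs from one shared queue, so no thread is created per job.

```python
import os
import queue
import threading


def run_jobs(jobs, process, num_workers=None):
    """Process every job with a fixed pool of worker threads.

    Hypothetical helper: one thread per core is created up front, and
    the threads repeatedly take jobs (plain data) from a shared queue.
    """
    if num_workers is None:
        num_workers = os.cpu_count() or 1

    job_queue = queue.Queue()
    for job in jobs:
        job_queue.put(job)

    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                job = job_queue.get_nowait()
            except queue.Empty:
                return  # all jobs were enqueued beforehand, so we are done
            out = process(job)
            with results_lock:
                results.append(out)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Every worker contends for the same queue here, which is exactly the source of the conflicts discussed on the next slide.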

6 Conflicts in Job Queue
Chance of conflicts increases as:
–The number of cores increases
–The time taken to update the job queue increases
–The job queue is accessed more frequently (job is short)
Previous approaches
–Distributed queues
  •Load balance is maintained by job-stealing
  •The chance of conflicts in one local queue is decreased
–Hardware implementation
  •Time spent on updating the queue is reduced
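The distributed-queue approach with job-stealing can be sketched as below. This is an illustrative software model under stated assumptions, not the scheme evaluated in the talk: `LocalQueue` and `run_with_stealing` are hypothetical names, victims are probed in round-robin order, jobs must not be `None`, and no new jobs are created while processing.

```python
import collections
import threading


class LocalQueue:
    """One job queue per worker, protected by its own lock."""

    def __init__(self, jobs=()):
        self.jobs = collections.deque(jobs)
        self.lock = threading.Lock()

    def pop_local(self):
        # The owner takes work from the tail of its own queue.
        with self.lock:
            return self.jobs.pop() if self.jobs else None

    def steal(self):
        # A thief takes work from the head of a victim's queue.
        with self.lock:
            return self.jobs.popleft() if self.jobs else None


def run_with_stealing(initial_jobs, process):
    """initial_jobs holds one list of jobs per worker.

    Returns (results_per_worker, steal_counts).
    """
    queues = [LocalQueue(jobs) for jobs in initial_jobs]
    n = len(queues)
    results = [[] for _ in range(n)]
    steals = [0] * n

    def worker(me):
        while True:
            job = queues[me].pop_local()
            if job is None:
                # Local queue is empty: probe the other queues in turn.
                for offset in range(1, n):
                    job = queues[(me + offset) % n].steal()
                    if job is not None:
                        steals[me] += 1
                        break
            if job is None:
                return  # every queue was empty and no new jobs can appear
            results[me].append(process(job))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, steals
```

Each lock now guards only one local queue, so contention drops, but an idle worker still pays for every failed probe, which is the "stealing job" cost that shows up in the profile slides.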

7 Profile of SMVM
[Figure: ratio of execution time (0 to 1.0) versus number of cores (4 to 256), broken down into conflicts, stealing job and processing job]

8 Objectives
Requirements of load balancer
–Scalability: conflict-free
–Fault-tolerance
  •The probability of faults increases exponentially as technology scales
Contributions of this paper
–Lightweight micro-network for load balancing
–Scalable even with more than a thousand cores
–Comprehensive fault-tolerance support

9 Contents
Introduction
Background
IsoNet
–Architecture
–Implementation
Fault-tolerance
Evaluation
Conclusion

10 System View
[Figure: system view with blocks labeled CPU, R and I]

11 Microarchitecture of IsoNet Node
[Figure: block diagram with Dual Clock Stack, Job Count, Max Selector, Min Selector, Switch, comparators (Comp), MUX and DEMUX]

12 How It Works
[Figure: grid of nodes annotated with the values 0, 1 and 2]
Tree-based routing: for fault-tolerance
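The figure for this slide did not survive extraction, so the following is only a software sketch of how the max selector, min selector and switch from the previous slide could cooperate over a tree. It is an assumption, not the documented behavior of IsoNet: each node forwards the largest and smallest job count in its subtree to its parent, the root picks a source and a destination, and one job moves per step whenever the counts differ by two or more. The names `select` and `balance_step` are hypothetical, and node ids are assumed to be integers.

```python
def select(tree, counts, node):
    """Return ((max_count, node_id), (min_count, node_id)) for a subtree.

    tree maps a node id to the list of its children; counts maps a
    node id to its current job count. Ties are broken by node id.
    """
    best_max = best_min = (counts[node], node)
    for child in tree.get(node, []):
        child_max, child_min = select(tree, counts, child)
        best_max = max(best_max, child_max)
        best_min = min(best_min, child_min)
    return best_max, best_min


def balance_step(tree, counts, root):
    """Move one job from the fullest node to the emptiest node.

    Returns (source, destination), or None when the system is balanced
    (largest and smallest counts differ by less than two).
    """
    (max_count, src), (min_count, dst) = select(tree, counts, root)
    if max_count - min_count < 2:
        return None
    counts[src] -= 1
    counts[dst] += 1
    return src, dst
```

Calling `balance_step` repeatedly until it returns `None` leaves every pair of nodes within one job of each other. In hardware the selection would be combinational logic rather than recursion, which is why no queue lock is needed.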

13 Single Cycle Implementation
Estimated critical path delay
–11.38 ns (87.8 MHz)
–By Elmore delay model
Single cycle implementation offers low hardware cost
[Figure: critical path; labels: Leaf node, Int. node, Root node, Int. node, Src or Dest, Swt, Src node, Dest node]
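The relation between the critical path delay and the clock rate quoted above is simply the reciprocal, and the Elmore model mentioned on the slide has a compact form for an RC ladder. The sketch below shows both; the function names are hypothetical and any resistance or capacitance values passed in are illustrative, not taken from the talk.

```python
def max_frequency_mhz(critical_path_ns):
    """Highest clock rate (MHz) at which one cycle covers the critical path."""
    return 1000.0 / critical_path_ns


def elmore_delay(resistances, capacitances):
    """Elmore delay of an RC ladder.

    Stage i has series resistance resistances[i] followed by a
    capacitance capacitances[i] to ground. Each capacitor contributes
    its capacitance times the total resistance between it and the driver.
    """
    delay = 0.0
    upstream_resistance = 0.0
    for r, c in zip(resistances, capacitances):
        upstream_resistance += r
        delay += upstream_resistance * c
    return delay
```

For the slide's figure, `max_frequency_mhz(11.38)` evaluates to roughly 87.87 MHz, consistent with the quoted 87.8 MHz.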

14 Hardware Cost Estimation
Component              Gate count   Inst count
DCStack                       204         1024
Selector, Leaf                  0           64
Selector, 1 Child             110          928
Selector, 2 Children          256            2
Selector, 3 Children          480           29
Selector, 4 Children          682            1
Switch                        356         1024
Root                           59            1
Total                      674.50
674.50 * 240 * 4 = 647.52 K = 0.046% of 1.4 B (NVIDIA GTX285)
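The totals on this slide can be re-derived with a few lines of arithmetic. The sketch below assumes that each row gives a per-instance gate count and the number of instances in a 1024-node network, and that the total is the average gate count per node; under that reading the numbers reproduce the slide's 674.50. The slide does not explain the factors 240 and 4. Treating them as the number of nodes and a per-gate transistor factor is an interpretation, labeled as such in the code.

```python
# (component, gates per instance, instances in a 1024-node network)
COMPONENTS = [
    ("DCStack", 204, 1024),
    ("Selector, Leaf", 0, 64),
    ("Selector, 1 Child", 110, 928),
    ("Selector, 2 Children", 256, 2),
    ("Selector, 3 Children", 480, 29),
    ("Selector, 4 Children", 682, 1),
    ("Switch", 356, 1024),
    ("Root", 59, 1),
]
NUM_NODES = 1024


def average_gates_per_node(components, num_nodes):
    """Total gate count over all instances, divided by the node count."""
    return sum(gates * count for _, gates, count in components) / num_nodes


avg_gates = round(average_gates_per_node(COMPONENTS, NUM_NODES), 2)

# Assumed meaning of the slide's factors: 240 nodes, 4 transistors per gate.
ASSUMED_NODES = 240
ASSUMED_TRANSISTORS_PER_GATE = 4
total = avg_gates * ASSUMED_NODES * ASSUMED_TRANSISTORS_PER_GATE

# Share of the 1.4 B figure quoted on the slide for the NVIDIA GTX285.
share_percent = 100.0 * total / 1.4e9
```

The selector rows also sum to 1024 instances (64 + 928 + 2 + 29 + 1), which supports reading the second column as an instance count.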

15 Contents
Introduction
Background
IsoNet
Fault-tolerance
–Transparent mode
–Reconfiguration mode
Evaluation
Conclusion

16 Supporting Fault-Tolerance
Transparent mode
–For faulty CPUs
–Bypass the corresponding IsoNet node
Reconfiguration mode
–For faulty IsoNet node
–Operation
  •When a fault is detected, all IsoNet nodes go into the reconfiguration mode
  •Reconfigure the topology of IsoNet so that the faulty node is excluded
  •Assign a new root node if the root node fails
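The reconfiguration steps above can be illustrated with a small software model. It rests on assumptions the slide does not state: the IsoNet nodes are taken to sit on a 2D mesh, the new tree is built by breadth-first search from the root, and a failed root is replaced by the first healthy entry in a candidate list. The names `choose_root` and `rebuild_tree` are hypothetical.

```python
import collections


def choose_root(candidates, faulty):
    """Return the first root candidate that is not faulty."""
    for node in candidates:
        if node not in faulty:
            return node
    raise ValueError("no healthy root candidate")


def rebuild_tree(width, height, root, faulty):
    """Build a spanning tree over a width x height mesh, skipping faulty nodes.

    Nodes are (x, y) tuples. Returns a dict mapping each reachable
    healthy node to its parent (the root maps to None).
    """
    if root in faulty:
        raise ValueError("root must be healthy")
    parent = {root: None}
    frontier = collections.deque([root])
    while frontier:
        x, y = frontier.popleft()
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            neighbor = (nx, ny)
            inside = 0 <= nx < width and 0 <= ny < height
            if inside and neighbor not in faulty and neighbor not in parent:
                parent[neighbor] = (x, y)
                frontier.append(neighbor)
    return parent
```

For example, `rebuild_tree(3, 3, choose_root([(1, 1), (0, 0)], {(1, 1)}), {(1, 1)})` builds a tree rooted at `(0, 0)` that routes around the failed center node.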

17 Reconfiguration
[Figure: nodes annotated with the values 0 to 3; label: Root Node Candidate]

18 Contents
Introduction
Background
IsoNet
Fault-tolerance
Evaluation
–Experimental setup
–Results
Conclusion

19 Experimental Setup
Simulation framework
–Wind River’s Simics full-system simulator
–CMP with 4 to 64 x86-compatible cores
–Fedora 12 with kernel 2.6.33
Benchmarks from recognition, mining and synthesis applications
–GS: Gauss-Seidel
–MMM: Dense Matrix-Matrix Multiply
–SVA: Scaled Vector Addition
–MVM: Dense Matrix Vector Multiply
–SMVM: Sparse Matrix Vector Multiply

20 Results
[Figure, two panels: MMM (6,473 instructions) and SMVM (2,872 instructions). X-axis: number of cores (4 to 64). Left Y-axis: execution time (10^7 cycles). Right Y-axis: speedup. Series: job stealing, Carbon and IsoNet execution times; Carbon and IsoNet speedups.]

21 Beyond Hundred Cores
[Figure: MMM (6,473 instructions); relative execution time (0 to 1.0) of Carbon and IsoNet versus number of cores (4 to 1024)]

22 Profile of IsoNet
[Figure: ratio of execution time (0 to 1.0) versus number of cores (4 to 64), broken down into conflicts, stealing job and processing job]

23 Conclusion
Scalability is one of the key challenges in the manycore domain
Scalable load balancing is critical for making use of a large number of processing elements
This paper proposes a novel hardware-based dynamic load distributor and balancer, called IsoNet
IsoNet also provides comprehensive fault-tolerance support
Experimental results in a full-system simulation with real applications demonstrate that IsoNet scales better than alternative techniques

24 Questions?
Contact info
Junghee Lee
junghee.lee@gatech.edu
Electrical and Computer Engineering
Georgia Institute of Technology

25 Thank you!

