Hardware-based Job Queue Management for Manycore Architectures and OpenMP Environments Junghee Lee, Chrysostomos Nicopoulos, Yongjae Lee, Hyung Gyu Lee.

Presentation transcript:

Hardware-based Job Queue Management for Manycore Architectures and OpenMP Environments Junghee Lee, Chrysostomos Nicopoulos, Yongjae Lee, Hyung Gyu Lee and Jongman Kim Presented by Junghee Lee

2 Introduction
Manycore systems
–Number of cores is increasing
Challenges in scalability
–Memory
–Power consumption
–Cache coherence protocol
–Load balancing

3 Contents
Introduction
Background
–Programming models
–Motivation
IsoNet
Fault-tolerance
Evaluation
Conclusion

4 Programming Models
Parallel programming models
–MPI
–OpenMP
Fine-grained parallelism
–Emerging applications: Recognition, Mining and Synthesis
–Execution time of each computation kernel is very short but it has abundant parallelism
–Excessive overhead in multithreading

5 Job Queuing
Creates jobs instead of threads
–One thread per core is created
–Thread: a set of instructions and states of execution
–Job: a set of data that is processed by a thread
Job queue
–Manages the list of jobs
–Maintains load balance
[Figure: diagram labeled CPU, Thread, Job]
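
The job-queuing idea on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' runtime: the names `run_jobs`, `kernel`, and `num_workers` are hypothetical, and a single shared software queue stands in for the job queue. A fixed pool of threads (one per core) is created once, and each thread repeatedly pulls a job, which is only data, from the queue.

```python
import os
import queue
import threading


def run_jobs(jobs, kernel, num_workers=None):
    """Process every job with a fixed pool of threads (ideally one per core)."""
    num_workers = num_workers or os.cpu_count() or 1
    job_queue = queue.Queue()
    for index, data in enumerate(jobs):
        job_queue.put((index, data))  # a job is just data, not a thread

    results = [None] * len(jobs)

    def worker():
        while True:
            try:
                index, data = job_queue.get_nowait()
            except queue.Empty:
                return  # queue drained: this thread is done
            results[index] = kernel(data)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


squares = run_jobs(range(8), lambda x: x * x, num_workers=4)
```

Every `get_nowait` call takes the queue's internal lock, so all threads contend on one structure. That contention is the conflict the next slide discusses.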

6 Conflicts in Job Queue
Chance of conflicts increases as:
–The number of cores increases
–The time taken to update the job queue increases
–The job queue is accessed more frequently (job is short)
Previous approaches
–Distributed queues
  –Load balance is maintained by job-stealing
  –The chance of conflicts in one local queue is decreased
–Hardware implementation
  –Time spent on updating the queue is reduced
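
The distributed-queue approach with job stealing can be illustrated with a small deterministic simulation. This is a sketch under stated assumptions, not any particular runtime: the function `simulate_job_stealing` is hypothetical, all jobs take one round, and an idle core steals from the tail of the longest queue while the owner takes jobs from the head.

```python
from collections import deque


def simulate_job_stealing(initial_jobs):
    """Round-based simulation: each core runs one job per round, stealing if idle.

    initial_jobs[i] is the number of unit-length jobs initially in core i's queue.
    Returns (rounds, steals).
    """
    queues = [deque(range(n)) for n in initial_jobs]
    rounds = steals = 0
    while any(queues):
        rounds += 1
        for local in queues:
            if not local:
                # Idle core: pick the core with the longest local queue as victim.
                victim = max(range(len(queues)), key=lambda i: len(queues[i]))
                if len(queues[victim]) > 1:  # leave the victim its own next job
                    local.append(queues[victim].pop())  # steal from the tail
                    steals += 1
            if local:
                local.popleft()  # the owner consumes from the head
    return rounds, steals


unbalanced = simulate_job_stealing([8, 0, 0, 0])
balanced = simulate_job_stealing([2, 2, 2, 2])
```

Each steal touches another core's queue, so stealing itself costs time and can conflict with the owner. This is the overhead that the profile on the next slide separates from useful job processing.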

7 Profile of SMVM
[Figure: ratio of execution time versus number of cores, split into conflicts, stealing job, and processing job]

8 Objectives
Requirements of load balancer
–Scalability: conflict-free
–Fault-tolerance
  –The probability of faults increases exponentially as technology scales
Contributions of this paper
–Lightweight micro-network for load balancing
–Scalable even with more than a thousand cores
–Comprehensive fault-tolerance support

9 Contents
Introduction
Background
IsoNet
–Architecture
–Implementation
Fault-tolerance
Evaluation
Conclusion

10 System View
[Figure: system diagram with blocks labeled CPU, R, and I]

11 Microarchitecture of IsoNet Node
[Figure: block diagram with blocks labeled Comp, MUX, DEMUX, Dual Clock Stack, Job Count, Max Selector, Min Selector, and Switch]

12 How It Works Tree-based routing: for fault-tolerance
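
The slides name a max selector, a min selector, and a switch but do not spell out the transfer policy. The sketch below is therefore an assumption-laden illustration of the general idea: pick the node with the most jobs as the source, the node with the fewest as the destination, and move half of the difference per step. The functions `balance_step` and `balance` and the halving policy are hypothetical, and the tree-based routing is not modeled.

```python
def balance_step(job_counts):
    """One balancing step: move jobs from the fullest node to the emptiest.

    Mutates job_counts in place and returns (src, dst, moved).
    """
    nodes = range(len(job_counts))
    src = max(nodes, key=lambda i: job_counts[i])  # role of the max selector
    dst = min(nodes, key=lambda i: job_counts[i])  # role of the min selector
    moved = (job_counts[src] - job_counts[dst]) // 2  # assumed policy: halve the gap
    job_counts[src] -= moved
    job_counts[dst] += moved
    return src, dst, moved


def balance(job_counts):
    """Repeat balancing steps until no jobs move; return the number of steps."""
    steps = 0
    while balance_step(job_counts)[2] > 0:
        steps += 1
    return steps


counts = [8, 0, 0, 0]
steps_taken = balance(counts)
```

In hardware the selection and the transfer would be done by dedicated logic rather than a loop, which is what removes the software queue conflicts.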

13 Single Cycle Implementation
Estimated critical path delay
–11.38 ns (87.8 MHz)
–By Elmore delay model
Single cycle implementation offers low hardware cost
[Figure: path from Src node to Dest node through leaf, Int., and root nodes and Swt]
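
The quoted frequency follows directly from the critical path delay, since a single-cycle design cannot be clocked faster than one cycle per critical path. The symbols $f_{\max}$ and $t_{\text{crit}}$ are notation introduced here, not taken from the slides:

```latex
f_{\max} = \frac{1}{t_{\text{crit}}}
         = \frac{1}{11.38 \times 10^{-9}\,\text{s}}
         \approx 87.87\ \text{MHz}
```

This agrees with the 87.8 MHz on the slide, which appears to be truncated rather than rounded.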

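14 Hardware Cost Estimation
[Table: instance count and gate count (DCStack, Selector, Switch) per node type: Leaf, 1 Child, Children (three rows), Root; most values were lost in extraction. Surviving fragments: Leaf 064, Children 6821, Root 591]
Total * 240 * 4 = K = 0.046% of 1.4 B (NVIDIA GTX285)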

15 Contents
Introduction
Background
IsoNet
Fault-tolerance
–Transparent mode
–Reconfiguration mode
Evaluation
Conclusion

16 Supporting Fault-Tolerance
Transparent mode
–For faulty CPUs
–Bypass the corresponding IsoNet node
Reconfiguration mode
–For faulty IsoNet node
–Operation
  –When a fault is detected, all IsoNet nodes go into the reconfiguration mode
  –Reconfigure the topology of IsoNet so that the faulty node is excluded
  –Assign a new root node if the root node fails
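
The reconfiguration steps can be sketched as rebuilding a spanning tree that skips faulty nodes. This is only an illustration under several assumptions that the slides do not state: nodes sit on a 2D mesh, the tree is built breadth-first, and a new root is taken from a fixed candidate list. The functions `build_tree` and `reconfigure` are hypothetical names.

```python
from collections import deque


def build_tree(width, height, root, faulty=frozenset()):
    """Return {node: parent} for a BFS spanning tree over healthy mesh nodes."""
    parent = {root: None}
    frontier = deque([root])
    while frontier:
        x, y = frontier.popleft()
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            in_mesh = 0 <= nxt[0] < width and 0 <= nxt[1] < height
            if in_mesh and nxt not in faulty and nxt not in parent:
                parent[nxt] = (x, y)  # faulty nodes are never added to the tree
                frontier.append(nxt)
    return parent


def reconfigure(width, height, root, candidates, faulty):
    """Exclude faulty nodes; promote the first healthy candidate if the root failed."""
    if root in faulty:
        root = next(c for c in candidates if c not in faulty)
    return root, build_tree(width, height, root, faulty)


# The root at the center of a 3x3 mesh fails, so a candidate takes over.
new_root, tree = reconfigure(3, 3, (1, 1), [(0, 0), (2, 2)], {(1, 1)})
# A non-root node fails, so the root is kept and the tree is rebuilt around it.
same_root, tree2 = reconfigure(3, 3, (1, 1), [(0, 0), (2, 2)], {(0, 0)})
```

A faulty node that disconnects part of the mesh would leave some healthy nodes out of the returned tree. The sketch does not handle that case.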

17 Reconfiguration
[Figure: IsoNet topology during reconfiguration, labeled "Root Node Candidate"]

18 Contents
Introduction
Background
IsoNet
Fault-tolerance
Evaluation
–Experimental setup
–Results
Conclusion

19 Experimental Setup
Simulation framework
–Wind River’s Simics full-system simulator
–CMP with 4~64 x86 compatible cores
–Fedora 12 with kernel
Benchmarks from recognition, mining and synthesis applications
–GS: Gauss-Seidel
–MMM: Dense Matrix-Matrix Multiply
–SVA: Scaled Vector Addition
–MVM: Dense Matrix Vector Multiply
–SMVM: Sparse Matrix Vector Multiply

20 Results
[Figure: execution time (10^7 cycles) and speedup versus number of cores for MMM (6,473 instructions) and SMVM (2,872 instructions), comparing job stealing, Carbon, and IsoNet]

21 Beyond Hundred Cores
[Figure: relative execution time versus number of cores for MMM (6,473 instructions), comparing Carbon and IsoNet]

22 Profile of IsoNet
[Figure: ratio of execution time versus number of cores, split into conflicts, stealing job, and processing job]

23 Conclusion
Scalability is one of the key challenges in the manycore domain
Scalability in load balancing is critical to utilizing a large number of processing elements
This paper proposes a novel hardware-based dynamic load distributor and balancer, called IsoNet
IsoNet also provides comprehensive fault-tolerance support
Experimental results in a full-system simulation with real applications demonstrate that IsoNet scales better than alternative techniques

24 Questions? Contact info Junghee Lee Electrical and Computer Engineering Georgia Institute of Technology

25 Thank you!