Atlanta, Georgia
TiNy Threads on BlueGene/P: Exploring Many-Core Parallelisms Beyond The Traditional OS
Handong Ye, Robert Pavel, Aaron Landwehr, Guang R. Gao
Department of Electrical & Computer Engineering, University of Delaware
MTAAP'2010

Introduction
Modern OSes are based upon a sequential execution model (the von Neumann model).
Rapid progress of multi-core/many-core chip technology.
– Parallel computer systems are now implemented on single chips.

Introduction
The conventional OS model must adapt to these underlying changes.
It must further exploit the many levels of parallelism.
– In hardware as well as software.
We present a study of how to perform this adaptation for the IBM BlueGene/P multi-core system.

Outline
Introduction
Contributions
TNT on BlueGene/P
Scheduling TNT across nodes
Synchronization across nodes
TNT Distributed Shared Memory
Results
Conclusions and Future Work

Contributions
Isolated traditional OS functions to a single core of the BG/P multi-core chip.
Ported the TiNy Threads (TNT) execution model to allow for further utilization of the BG/P compute cores.
Expanded the design framework to a multi-chip system designed for scalability to a large number of chips.

TiNy Threads on BG/P
TiNy Threads
– Lightweight, non-preemptive threads.
– API similar to POSIX Threads.
– Originally presented in "TiNy Threads: A Thread Virtual Machine for the Cyclops-64 Cellular Architecture".
– Runs on the IBM Cyclops-64.
Kernel Modifications
– Alterations to the thread scheduler to allow for non-preemption.
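To make the claimed API similarity concrete, here is a minimal sketch. Only the names tnt_create, tnt_join, and tnt_exit appear in the deck; the shim below maps them onto pthreads purely for illustration, since real TNT threads are lightweight and non-preemptive, which pthreads are not, and the exact TNT signatures are assumptions modeled on their POSIX counterparts.

```c
#include <pthread.h>
#include <stdio.h>

/* Hypothetical shim: TNT names from the slides mapped onto pthreads,
 * only to illustrate the claimed API similarity. */
typedef pthread_t tnt_t;

static int tnt_create(tnt_t *t, void *(*fn)(void *), void *arg)
{
    return pthread_create(t, NULL, fn, arg);
}

static int tnt_join(tnt_t t, void **ret)
{
    return pthread_join(t, ret);
}

static void *worker(void *arg)
{
    printf("hello from TNT thread %d\n", *(int *)arg);
    return NULL;
}

int main(void)
{
    tnt_t tid;
    int id = 0;
    tnt_create(&tid, worker, &id);  /* spawn one lightweight thread */
    tnt_join(tid, NULL);            /* wait for it to finish */
    return 0;
}
```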

Multinode Thread Scheduler
The thread scheduler allows TNT to run across multiple nodes.
– Requests are facilitated through RPCs in IBM's Deep Computing Messaging Framework (DCMF).
Multiple Scheduling Algorithms
– Workload-unaware: Random, Round-Robin
– Workload-aware

[Figure: Multinode thread scheduling between Node A and Node B — a tnt_create() call on Node A is forwarded to Node B, which returns the new thread's tid; when the remote thread calls tnt_exit(), the matching tnt_join() on Node A completes.]
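The deck does not show how a target node is chosen, so the following is only a sketch of what the three named policies could look like; NNODES, node_load, and pick_node are hypothetical names, not the paper's API.

```c
#include <stdlib.h>

#define NNODES 64               /* assumed node count, for illustration */

enum policy { RANDOM, ROUND_ROBIN, WORKLOAD_AWARE };

static int node_load[NNODES];   /* hypothetical per-node live-thread count */
static int next_node;           /* round-robin cursor */

/* Choose the node that will host a newly created TNT thread. */
static int pick_node(enum policy p)
{
    switch (p) {
    case RANDOM:
        return rand() % NNODES;
    case ROUND_ROBIN:
        next_node = (next_node + 1) % NNODES;
        return next_node;
    case WORKLOAD_AWARE: {
        int best = 0;
        for (int n = 1; n < NNODES; n++)   /* least-loaded node wins */
            if (node_load[n] < node_load[best])
                best = n;
        return best;
    }
    }
    return 0;
}

int main(void)
{
    /* A tnt_create() on this node would forward its request, via a
     * DCMF RPC, to the node returned by pick_node(). */
    return pick_node(ROUND_ROBIN) >= 0 ? 0 : 1;
}
```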

Synchronization
Three forms
– Mutexes
– Thread joining
– Barriers
Handled similarly to thread scheduling
– Lock requests, join requests, and barrier notifications are sent to the node responsible for the synchronization object.

[Figure: Multinode thread joining — tnt_join(tid) on Node A blocks until the remote thread on Node B calls tnt_exit(), at which point Node B notifies Node A and the join completes.]
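As one concrete instance of this owner-node pattern, here is a sketch of how a distributed mutex could forward its lock traffic. The tnt_mutex names, the message tags, and rpc_call are all hypothetical; the slides only say that requests go to the responsible node over DCMF, whose actual calls are not shown here.

```c
#include <stdio.h>

/* Each synchronization object lives on an owner node; other nodes
 * reach it by RPC. rpc_call() is a local stub standing in for a
 * blocking DCMF round-trip, so the sketch runs standalone. */
enum { MSG_LOCK, MSG_UNLOCK };

struct tnt_mutex {
    int owner_node;   /* node responsible for this mutex */
    int id;           /* object id on the owner node */
};

static int rpc_call(int node, int tag, int arg)
{
    /* Real version: send a message to `node` and block until the
     * owner replies (for MSG_LOCK, once the mutex is actually free). */
    printf("rpc to node %d: tag=%d obj=%d\n", node, tag, arg);
    return 0;
}

static int tnt_mutex_lock(struct tnt_mutex *m)
{
    return rpc_call(m->owner_node, MSG_LOCK, m->id);
}

static int tnt_mutex_unlock(struct tnt_mutex *m)
{
    return rpc_call(m->owner_node, MSG_UNLOCK, m->id);
}

int main(void)
{
    struct tnt_mutex m = { .owner_node = 3, .id = 7 };
    tnt_mutex_lock(&m);     /* request queued on node 3 until granted */
    tnt_mutex_unlock(&m);
    return 0;
}
```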

Characteristics of TDSM
Provides one-sided access to memory distributed among the nodes, through IBM's DCMF.
Allows for virtual address manipulation
– Maps the distributed memory to a single virtual address space.
– Allows for array indexing and memory offsets.
Scalable to a variety of applications
– The size of the desired global shared memory is set at runtime.
Mutability
– The memory allocation and memory distribution algorithms can easily be altered or replaced.
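From the programmer's side, a one-sided interface of this kind might look like the sketch below. tdsm_read and tdsm_write are hypothetical names inferred from the bullets, and a plain local array stands in for the distributed memory so the sketch runs standalone; the real system would issue DCMF one-sided transfers to the owning nodes instead.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical one-sided TDSM interface: one global byte-addressed
 * space, with the runtime resolving which node owns each range. */
static double backing[64];    /* local stand-in for distributed memory */

static int tdsm_read(void *dst, size_t gofs, size_t n)
{
    memcpy(dst, (char *)backing + gofs, n);   /* real version: remote get */
    return 0;
}

static int tdsm_write(size_t gofs, const void *src, size_t n)
{
    memcpy((char *)backing + gofs, src, n);   /* real version: remote put */
    return 0;
}

int main(void)
{
    double x = 3.14, y = 0.0;
    tdsm_write(8 * sizeof(double), &x, sizeof x);  /* global[8] = x */
    tdsm_read(&y, 8 * sizeof(double), sizeof y);   /* y = global[8] */
    printf("y = %f\n", y);
    return 0;
}
```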

[Figure: Example of TDSM — reading global[15] through global[34] into a local buffer splits the access across nodes: Node 6 serves local elements 0 through 14 and Node 7 serves local elements 0 through 4, with Node 5 holding the preceding block.]
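The split in this example is consistent with a simple block distribution of 15 elements per node starting at Node 5. Under that assumption (the actual distribution algorithm is pluggable, per the previous slide), the index translation is just a divide and a modulo:

```c
#include <stdio.h>

/* Assumed block distribution read off the figure: BLOCK consecutive
 * elements per node, with the array starting on FIRST_NODE. */
#define BLOCK      15
#define FIRST_NODE 5

struct location { int node; int offset; };

/* Translate a global array index into (owner node, local offset). */
static struct location tdsm_locate(int gindex)
{
    struct location loc;
    loc.node   = FIRST_NODE + gindex / BLOCK;
    loc.offset = gindex % BLOCK;
    return loc;
}

int main(void)
{
    /* Reproduces the example: global[15]..global[34] spans
     * Node 6 offsets 0..14 and Node 7 offsets 0..4. */
    struct location a = tdsm_locate(15), b = tdsm_locate(34);
    printf("global[15] -> node %d, offset %d\n", a.node, a.offset);
    printf("global[34] -> node %d, offset %d\n", b.node, b.offset);
    return 0;
}
```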

Summary of Results
The TNT thread system shows speedup comparable to that of Pthreads running on the same hardware.
The distributed shared memory operates at 95% of the experimental peak performance of the network, and the distance between nodes is not a sensitive factor.
The total cost of thread creation grows linearly as the number of threads increases.
The cost of waiting at a barrier is constant and independent of the number of threads involved.

Single-Node Thread System Performance
Based upon the radix-2 Cooley-Tukey algorithm, with the Kiss FFT library for the underlying DFT.
The underlying TNT thread model performs comparably to the POSIX standard when the number of threads does not exceed the number of available processor cores.

Memory System Performance
Reads and writes of varying sizes between one and two nodes.
For inter-node communications, data can be transferred at approximately 357 MB/s.
Kumar et al. determined the experimental peak performance on BG/P to be 374 MB/s in their ICS'08 paper.

Memory System Performance
The size of each read/write is a function of the number of nodes across which the data is distributed.
Latency increases linearly as the amount of data increases, regardless of the distance between nodes.

Multinode Thread Creation Cost
Approximately 0.2 seconds per thread.
Remained effectively constant.

Synchronization Costs
The performance of the barrier is effectively a constant 0.2 seconds.

Conclusions and Future Work
Proven the feasibility of the system.
Demonstrated the benefits of an execution-model-driven approach.
Room for Improvement
– Improvements to the kernel
– More rigorous benchmarks
– Improved allocation and scheduling algorithms

Atlanta, Georgia
Thank You

Bibliography
J. del Cuvillo, W. Zhu, Z. Hu, and G. R. Gao, "TiNy Threads: A Thread Virtual Machine for the Cyclops64 Cellular Architecture," Parallel and Distributed Processing Symposium, International, vol. 15, p. 265b, 2005.
S. Kumar, G. Dozsa, G. Almasi et al., "The Deep Computing Messaging Framework: Generalized Scalable Message Passing on the Blue Gene/P Supercomputer," in ICS '08: Proceedings of the 22nd Annual International Conference on Supercomputing. New York, NY, USA: ACM, 2008, pp. 94–103.
M. Borgerding, "Kiss FFT," December 2009.