SmartApps: Application Centric Computing with STAPL
Lawrence Rauchwerger
Parasol Lab, Dept. of Computer Science, Texas A&M
with N. Amato, B. Stroustrup, M. Adams

SmartApps Architecture

[Architecture diagram] At the development stage, the STAPL application is compiled by the static STAPL compiler, augmented with runtime techniques, into compiled code with runtime hooks. In advanced stages, the Smart Application gets runtime information (sample input, system information, etc.), computes an optimal application and RTS + OS configuration, and executes while a predictor & evaluator continuously monitors performance and adapts as necessary: a small adaptation (tuning) triggers runtime tuning without recompilation via the adaptive software's predictor & optimizer; a large adaptation (failure, phase change) recomputes the application and/or reconfigures the RTS + OS via a configurer. A toolbox, database, and adaptive RTS + OS support each stage.

Outline
- SmartApps Concept
- STAPL, the vehicle we use to develop the SmartApps framework
  - Overview
  - High-Level Adaptivity in STAPL
    - Consistency Models
    - FAST: Framework for Algorithm Selection & Tuning
  - RTS
  - Compiler (Pivot)

STAPL: Standard Template Adaptive Parallel Library

STAPL is a library of parallel, generic constructs based on the C++ Standard Template Library (STL).
- Components for program development: pAlgorithms, pContainers, Views, pRange
- Portability and optimization:
  - STAPL RTS and the Adaptive Remote Method Invocation (ARMI) communication library
  - Framework for Algorithm Selection and Tuning (FAST)
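To make the programming model concrete, here is a minimal usage sketch in the STL style described above. The header paths, the stapl_main entry point, and the exact signatures are assumptions modeled on published STAPL examples, not the library's definitive API.

```cpp
#include <cstdlib>
// Hypothetical header names, modeled on published STAPL examples:
#include <stapl/array.hpp>
#include <stapl/algorithms/algorithm.hpp>

// STAPL programs conventionally start in stapl_main, which runs on every
// location (assumption: exact entry-point name and signature).
stapl::exit_code stapl_main(int argc, char* argv[])
{
  stapl::array<int> data(1000000);        // pContainer: a distributed array
  auto v = stapl::make_array_view(data);  // view: the pAlgorithm's data interface

  stapl::generate(v, []{ return std::rand(); }); // pAlgorithms mirror STL algorithms
  stapl::sort(v);                                // parallel sort; FAST may pick the
                                                 // implementation at run time
  return EXIT_SUCCESS;
}
```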

Applications Using STAPL
- Particle Transport: TAXI
- Bioinformatics: Protein Folding
- Geophysics: Seismic Ray Tracing
- Aerospace: MHD
  - Sequential "Ctran" code (7K LOC)
  - STL (1.2K LOC)
  - STAPL (1.3K LOC)

Outline
- SmartApps Concept
- STAPL, the vehicle we use to develop the SmartApps framework
  - Overview
  - High-Level Adaptivity in STAPL
    - Consistency Models
    - FAST: Framework for Algorithm Selection & Tuning
  - RTS
  - Compiler (Pivot)

Consistency Models

Processor Consistency (default and currently supported model)
- Accesses from one processor to another's memory are sequential
- Requires in-order processing of RMIs
- Limited parallelism

Object Consistency (future)
- Accesses to different objects can happen out of order
- Uncovers fine-grained parallelism
  - Accesses to different objects are concurrent
  - Potential gain in scalability
- Can be made the default for specific computational phases

Mixed Consistency (future)
- Use object consistency on select objects; the selection of objects fit for this model can be:
  - Elective: the application specifies that an object's state does not depend on other objects' states
  - Detected: the system asserts the absence of such dependencies where possible
- Use processor consistency on the rest
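The ordering difference can be shown with two remote method invocations. This sketch uses the armi_async primitive introduced later in the talk; the handle types and the Counter class are invented purely for illustration.

```cpp
// Placeholder types; only the armi_async name comes from the talk.
struct Counter { void increment(); };
using ObjectHandle = Counter*;  // stand-in for an ARMI global object handle
using Location     = int;       // stand-in for an ARMI location id

void armi_async(Location, ObjectHandle, void (Counter::*)());

void update(ObjectHandle a, ObjectHandle b, Location loc)
{
  armi_async(loc, a, &Counter::increment);  // RMI 1
  armi_async(loc, b, &Counter::increment);  // RMI 2, to a different object

  // Processor consistency (default): loc services RMI 1 before RMI 2,
  // preserving this processor's program order; independence is not exploited.
  // Object consistency (future): since the RMIs target different objects,
  // loc may service them concurrently, uncovering fine-grained parallelism.
}
```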

Algorithm Selection

Problem description: given multiple implementations of an abstract operation, a specified execution environment, and input data, choose the implementation that maximizes performance.

Our objective: create a general framework for parallel algorithm selection.
- Applicable to any abstract operation: sorting, matrix multiplication, convex hull, ...
- Flexible specification of execution environment and input data parameters
  - Environment: number of processors, memory interconnection, cache, thread management, OS policies
  - Input data: data type, layout, size, other properties (e.g., presortedness of the input)
- Generic modeling interface, so different machine learning approaches can be interchanged

Our strategy: FAST, the Framework for Algorithm Selection & Tuning.
- Integrated approach within STAPL
- Selection is transparent to the end user
- The library adaptively chooses the best algorithm at run time from a library of implementation choices
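In code, transparent selection amounts to a dispatch over candidate implementations driven by a model. A minimal sketch, assuming an invented feature set and hard-coded thresholds where FAST would instead query its learned decision model:

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Invented stand-in for FAST's environment/input parameters.
struct Features { int procs; std::size_t n; double presortedness; };

using SortImpl = std::function<void(std::vector<int>&)>;

// The thresholds below are made up; FAST would consult a model trained offline.
const SortImpl& select_sort(const std::map<std::string, SortImpl>& impls,
                            const Features& f)
{
  if (f.presortedness > 0.9)       return impls.at("adaptive_sort");
  if (f.n < 100000 || f.procs < 4) return impls.at("merge_sort");
  return impls.at("sample_sort");
}
```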

FAST Architecture

FAST Architecture: Problem Specification
The user specifies the parameters that may affect performance (e.g., number of processors, input size, algorithm-specific properties) and their ranges (e.g., input size = 100M..200M), and supplies the list of candidate implementations.

FAST Architecture: Metric Collection
The user provides a method to collect data about an implementation's execution (e.g., a presortedness measure).

FAST Architecture: Instance Selection
FAST selects the training instances to execute and later use when building the decision model.

FAST Architecture: Instance Execution
FAST invokes the user interface to execute the test instances and collect metrics and timings.

FAST Architecture: Database Insertion
FAST inserts the results of the training instances into the database.

FAST Architecture: Model Generation
FAST uses the training instances to generate the decision model.

FAST Architecture: Selection Querying
FAST provides an interface through which the user application queries the generated model.
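Put together, the steps above amount to a train-then-query cycle. Here is a compact sketch with placeholder types; the slides fix the steps, not the API, so every name below is hypothetical:

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct Features { int procs; std::size_t n; double presortedness; };
struct Sample   { Features f; std::string best_impl; };

// Placeholders for FAST's internals:
struct Database { void insert(const Sample&); std::vector<Sample> all() const; };
struct Model    { std::string query(const Features&) const; };

std::string run_candidates_and_pick_best(const Features&); // execution + metrics
Model fit_decision_model(const std::vector<Sample>&);      // model generation

Model train(Database& db, const std::vector<Features>& instances)
{
  for (const Features& f : instances)                 // instance selection done
    db.insert({f, run_candidates_and_pick_best(f)});  // execution + DB insertion
  return fit_decision_model(db.all());                // model generation
}

// Selection querying, from the application's side:
//   std::string impl = model.query(current_features);
```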

Parallel Sorting: Experimental Results

[Figures: SGI Altix selection model; SGI Altix validation set (V1), 100% accuracy at N=120M; adaptive performance penalty]

Attributes for the selection model:
- Processor count
- Data type
- Input size
- Max value (impacts radix sort)
- Presortedness

Parallel Sorting: Altix Relative Performance (V2)
- The model obtains 99.7% of the possible performance.
- The next best single algorithm (sample sort) provides only 90.4%.

FAST: Future Work
- Improve and refine the algorithm selection framework
  - Refine sampling for training input selection
  - Algorithm-developer-specified models
  - Online feedback and model refinement (incremental learning)
  - Machine characterization & microbenchmarks
- Expand use within STAPL
  - pContainers: locking strategies, data (re)distribution, consistency model
  - pAlgorithms: more algorithms, parameter tuning
  - pRange: work granularity and scheduling policies
  - ARMI runtime system: dynamic aggregation

Outline
- SmartApps Concept
- STAPL, the vehicle we use to develop the SmartApps framework
  - Overview
  - High-Level Adaptivity in STAPL
    - Consistency Models
    - FAST: Framework for Algorithm Selection & Tuning
  - RTS
  - Compiler (Pivot)

RTS – Current State

[Architecture diagram] The Smart Application passes application-specific parameters to the STAPL RTS, whose ARMI and Executor components sit above either the operating system (a kernel scheduler with no custom scheduling, e.g. NPTL, plus the memory manager) or the K42 user-level dispatcher. An experimental multithreaded stage adds dedicated communication, RMI, and task threads to ARMI and the Executor, combining custom scheduling with kernel scheduling.

ARMI – Current State

ARMI: Adaptive Remote Method Invocation
- Abstraction of shared-memory and message-passing communication layers (MPI, Pthreads, OpenMP, mixed, Converse)
- The programmer expresses fine-grain parallelism, which ARMI adaptively coarsens to balance latency against overhead
- Support for sync, async, point-to-point, and group communication
- Automated (de)serialization of C++ classes

ARMI can be as easy and natural as shared memory and as efficient as message passing.
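The last bullet, automated (de)serialization, works by having each class enumerate its members for ARMI's packer. The define_type/typer names below follow published STAPL code, but treat the exact signature as an assumption:

```cpp
// A user class exposes its members so ARMI can pack/unpack it when an RMI
// argument crosses a shared-memory or message-passing boundary.
class particle
{
  double pos[3];
  double energy;

public:
  void define_type(stapl::typer& t)  // assumption: hook name and signature
  {
    t.member(pos);
    t.member(energy);
  }
};
```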

ARMI Communication Primitives

Point-to-point communication
- armi_async: non-blocking; does not wait for request arrival or completion
- armi_sync: blocking and non-blocking versions

Collective operations
- armi_broadcast, armi_reduce, etc.; groups for communication can be set adaptively

Synchronization
- armi_fence, armi_barrier: the fence implements a distributed termination detection algorithm to ensure that all requests have been sent, received, and serviced
- armi_wait: blocks until at least one request is received and serviced
- armi_flush: empties the local send buffer, pushing outstanding requests to their remote destinations
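A usage sketch of these primitives follows. The armi_* names come from the slide; the argument lists and helper types are simplified guesses (the real primitives are templated):

```cpp
// Placeholder types invented for illustration:
struct Grid { void apply_flux(double); double residual(); };
using Location = int;
double local_flux();

void   armi_async(Location, Grid*, void (Grid::*)(double), double);
double armi_sync(Location, Grid*, double (Grid::*)());
void   armi_fence();

void relax_boundary(Location neighbor, Grid* remote_grid)
{
  // Fire-and-forget update of the remote object:
  armi_async(neighbor, remote_grid, &Grid::apply_flux, local_flux());

  // Blocking round trip when the result is needed now:
  double r = armi_sync(neighbor, remote_grid, &Grid::residual);
  (void)r;

  // Quiescence: distributed termination detection guarantees every request
  // above has been sent, received, and serviced before we continue.
  armi_fence();
}
```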

RTS Current Work: Multithreading

In ARMI
- A specialized communication thread is dedicated to the emission and reception of messages
  - Reduces latency, particularly on SYNC requests
- Specialized threads process RMIs
  - Uncovers additional parallelism (RMIs from different sources can execute concurrently)
  - Provides a suitable framework for future work on relaxed consistency models and on speculative execution of RMIs

In the Executor
- Specialized threads execute tasks
  - Concurrently execute ready tasks from the DDG (tasks whose dependencies are all satisfied)
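The division of labor can be pictured as a progress loop owned by the communication thread. This is a minimal sketch against standard C++ threads, not the actual RTS internals; the three helper functions are placeholders:

```cpp
#include <atomic>

// Placeholders for the RTS internals this thread would drive:
void poll_network();            // receive incoming messages
void dispatch_to_rmi_threads(); // hand parsed RMIs to the RMI worker threads
void flush_send_buffers();      // emit outgoing requests promptly

std::atomic<bool> shutting_down{false};

// Body of the dedicated communication thread: it does nothing but move
// messages, so computation threads never stall on network progress.
void communication_thread()
{
  while (!shutting_down.load(std::memory_order_acquire)) {
    poll_network();
    dispatch_to_rmi_threads();
    flush_send_buffers();
  }
}
```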

TAXI Experimental Results
- Platform: uBGL, a 2048-processor unclassified Blue Gene cluster at LLNL
- Results shown for two machine configurations:
  - Virtual mode: two independent computation processes per node
  - Coprocessor mode: one processor per node for computation and one for communication

TAXI Results on uBGL

RTS – The Next Generation

[Architecture diagram] The Smart Application passes application-specific parameters to a virtualization layer that provides a unified view of the RTS services. Beneath it, the STAPL RTS implementation (ARMI communicator, executor, smart user-level scheduler with an M:N threading model, memory manager, synchronization server for fences and distributed locks, and scheduler virtualization over a 1:1 model) targets supported systems with the best affinity, while other runtime systems (e.g., Converse, Marcel) cover otherwise unsupported systems. Below sit the K42 user-level dispatcher or a kernel scheduler with no custom scheduling (e.g. NPTL).

RTS Virtualization

Another level of portability
- Allows porting the Smart Application to another runtime system
- Achieves the best possible affinity between RTS and OS: if another RTS is better for a specific system, use it!
- Abstracts away the complexity of supporting a heterogeneous collection of systems

Avoid the special-case implementation paradigm
- Providing support for multiple architectures is costly
  - In time and maintenance
  - In performance (a system supported as a special case will invariably perform worse than the principal targets)
- Why emulate the wheel?

RTS Threading Models

1:1 threading model (one user-level thread mapped onto one kernel thread)
- Default kernel scheduling
- Heavy kernel threads

M:N threading model (M user-level threads mapped onto N kernel threads; see the sketch below)
- Customized scheduling
  - Enables scheduler-based optimizations (e.g., priority scheduling, better support for relaxed consistency models)
- Light user-level threads
  - Smaller threading cost
    - N can match the number of available hardware threads: no kernel-thread swapping, no preemption, no kernel over-scheduling
    - User-level thread scheduling requires no kernel trap
  - Perfect and essentially free load balancing within the node
    - User-level threads are cooperatively scheduled on the available kernel threads and migrate freely
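A rough M:N sketch in standard C++: M lightweight tasks multiplexed onto N kernel threads sized to the hardware. Real M:N user-level threads can also block and migrate mid-execution; here each task simply runs to completion cooperatively, which is enough to show why no preemption or kernel trap is needed between tasks.

```cpp
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

void run_m_on_n(std::queue<std::function<void()>> tasks)  // M user-level tasks
{
  std::mutex m;
  unsigned n = std::thread::hardware_concurrency();       // N = hardware threads
  if (n == 0) n = 1;
  std::vector<std::thread> workers;
  for (unsigned i = 0; i < n; ++i)
    workers.emplace_back([&] {
      for (;;) {
        std::function<void()> task;
        {
          std::lock_guard<std::mutex> lk(m);
          if (tasks.empty()) return;             // cooperative: no preemption
          task = std::move(tasks.front());
          tasks.pop();
        }
        task();  // picking the next task costs a lock, not a kernel trap
      }
    });
  for (auto& w : workers) w.join();
}
```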

RTS Executor

Customized task scheduling
- The executor maintains a ready queue (all tasks in the DDG whose dependencies are satisfied)
- Tasks from the ready queue are ordered by a scheduling policy (e.g., round robin, static block or interleaved block scheduling, dynamic scheduling)
- The RTS decides the policy, but the user can also specify one; policies can differ for every pRange (see the sketch below)

Customized load balancing
- Implements load balancing strategies (e.g., work stealing)
- Allows the user to choose the strategy
- K42: generates a customized work migration manager
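A minimal sketch of a policy-driven executor follows. All names are invented; the slide specifies only the design (ready queue + pluggable ordering policy):

```cpp
#include <cstddef>
#include <deque>
#include <functional>

struct Task { std::function<void()> work; int priority; };

// Pluggable policy: returns the index of the next ready task to run.
struct SchedulingPolicy {
  virtual std::size_t pick(const std::deque<Task>& ready) = 0;
  virtual ~SchedulingPolicy() = default;
};

struct Fifo : SchedulingPolicy {  // stand-in for round robin/static/dynamic
  std::size_t pick(const std::deque<Task>&) override { return 0; }
};

// Drain the ready queue (tasks whose DDG dependencies are satisfied), letting
// the policy decide the order; a different policy could be bound per pRange.
void drain(std::deque<Task>& ready, SchedulingPolicy& policy)
{
  while (!ready.empty()) {
    std::size_t i = policy.pick(ready);
    Task t = std::move(ready[i]);
    ready.erase(ready.begin() + i);
    t.work();  // may satisfy dependencies and append newly ready tasks
  }
}
```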

RTS Synchronization

Efficient implementation of synchronization primitives is crucial: it is one of the main performance bottlenecks and a common scalability limitation in parallel computing.

Fence
- Efficient implementation using a novel distributed termination detection algorithm (sketched below)

Global distributed locks
- Symmetrical implementation to avoid contention
- Support for logically recursive locks (required by the compositional SmartApps framework)

Group-based synchronization
- Allows efficient use of ad-hoc computation groups
- Semantic equivalent of the global primitives
- A scalability requirement for large-scale systems
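The slides do not give the novel algorithm itself, so here is a much-simplified shared-memory sketch of the termination test a fence must implement; a real implementation distributes the counters per location and combines them with a reduction:

```cpp
#include <atomic>

// Simplified global request counters (distributed in the real algorithm):
std::atomic<long> rmi_sent{0};      // incremented on every request sent
std::atomic<long> rmi_serviced{0};  // incremented when a request is serviced

void poll_and_service_pending_rmis();  // placeholder: make progress while waiting

// The fence may return only when the counts are balanced AND stable: a single
// balanced observation can race with a request still in flight, so we require
// the same balanced count on two successive waves.
void fence()
{
  long last_balanced = -1;
  for (;;) {
    poll_and_service_pending_rmis();
    long sent     = rmi_sent.load();
    long serviced = rmi_serviced.load();
    if (sent == serviced && sent == last_balanced)
      return;                                  // stable across two waves: done
    last_balanced = (sent == serviced) ? sent : -1;
  }
}
```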

Outline
- SmartApps Concept
- STAPL, the vehicle we use to develop the SmartApps framework
  - Overview
  - High-Level Adaptivity in STAPL
    - Consistency Models
    - FAST: Framework for Algorithm Selection & Tuning
  - RTS
  - Compiler (Pivot)

The Pivot: Static Analysis of C++ Applications (lead: Bjarne Stroustrup)

[Toolchain diagram] C++ source enters a compiler that produces object code and IPR, the Pivot's program representation, which can be serialized as XPR. Tools operate on these representations, producing "information", specialized representations (e.g., a flow graph), regenerated C++ source, or IDL.

Context for the Pivot

Semantically Enhanced Library (Language), SELL
- Enhanced notation through libraries
- Restricted semantics through tools, and tools that take advantage of those semantics
- In short: C++ + domain-specific library + semantic restrictions

Bell Labs proverbs: "Library design is language design. Language design is library design."

Context for the Pivot

Provide the advantages of specialized languages
- Without introducing new "special purpose" languages
- Without supporting special-purpose language tool chains
- Avoiding the 99.?% language death rate

Provide general support for the SELL idea (C++ + domain-specific library + semantic restrictions)
- Not just a specialized tool per application or library
- The Pivot fits here

Current and Future Work
- Complete the infrastructure
  - Complete the EDG and GCC interfaces
  - Represent headers (modularity) directly
  - Complete the type representation in XPR
- Initial applications
  - Style analysis, including type safety and security
  - Analysis and transformation of STAPL programs
- Build alliances
- Status: currently compiles Firefox; inserts measurement & control for SmartApps

Conclusion
- STAPL is a good platform for developing SmartApps
  - Used on several large-scale, complex applications
  - We plan to release a version soon to friendly users
- Intel TBB, "the simple STAPL for multicores": university research turned into a commercial product
- More info at