SmartApps: Middleware for Adaptive Applications on Reconfigurable Platforms Lawrence Rauchwerger


SmartApps: Middleware for Adaptive Applications on Reconfigurable Platforms Lawrence Rauchwerger Parasol Lab, Dept of Computer Science, Texas A&M

Today: System-Centric Computing

What's missing?
• Compilers are conservative
• The OS offers generic services
• The architecture is generic
• No global optimization; no matching between application, OS, and HW (intractable in the general case)

Classic avenues to performance:
• Parallel algorithms
• Static compiler optimization
• OS support
• Good architecture

[Diagram: system-centric computing. The application (algorithm), the static compiler, and the system (OS & architecture) are developed, analyzed, and optimized separately; input data enters only at execution.]

Our Approach: SmartApps – Application-Centric Computing

SmartApp = Compiler + OS + Architecture + Data + Feedback, under application control, enabling instance-specific optimization.

[Diagram: application-centric computing. The application (algorithm) is developed with a static compiler augmented with run-time techniques; the run-time system performs execution, analysis, and optimization using the input data, a run-time compiler, a modular OS, and a reconfigurable architecture.]

SmartApps Architecture

Development stage: a STAPL application is compiled by the static STAPL compiler, augmented with run-time techniques, producing compiled code plus runtime hooks (predictor & optimizer toolbox).

Advanced stages (Smart Application):
• Get runtime information (sample input, system information, etc.) – predictor & optimizer toolbox
• Execute the application and compute the optimal application and RTS + OS configuration – predictor & evaluator, adaptive software, adaptive RTS + OS
• Continuously monitor performance and adapt as necessary, using a database of past behavior:
  – Small adaptation (tuning): runtime tuning without recompilation
  – Large adaptation (failure, phase change): recompute the application and/or reconfigure the RTS + OS (configurer)

Collaborative Effort: Texas A&M (Parasol, NE) + IBM + LLNL + INRIA
• STAPL (Amato – TAMU)
• STAPL Compiler (Stroustrup/Quinlan, TAMU – LLNL; Cohen, INRIA, France)
• RTS – K42 interface & optimizations (Krieger, IBM)
• Applications (Amato/Adams, TAMU; Novak/Morel, LLNL/LANL)
• Validation on DOE extreme HW: BlueGene (PERCS?) (Moreira/Krieger)

SmartApps written in STAPL

STAPL (Standard Template Adaptive Parallel Library):
• Collection of generic parallel algorithms, distributed containers & a run-time system (RTS)
• Inter-operable with sequential programs
• Extensible and composable by the end-user
• Shared object view: no explicit communication
• Distributed objects: no replication/coherence
• High-productivity environment

The STAPL Programming Environment

[Layer diagram: user code is written against pAlgorithms, pContainers, and the pRange; these sit on the RTS + communication library (ARMI), which runs over OpenMP/MPI/pthreads/native primitives and interfaces to the OS (K42).]

SmartApps Architecture (revisited; same diagram as before)

Algorithm Adaptivity

Problem: parallel algorithms are highly sensitive to:
• Architecture – number of processors, memory interconnection, cache, available resources, etc.
• Environment – thread management, memory allocation, operating system policies, etc.
• Data characteristics – input type, layout, etc.

Solution: adaptively choose the best algorithm from a library of options at run-time.

Adaptive Framework – Overview of Approach

Given: multiple implementation choices for the same high-level algorithm.

• STAPL installation: analyze each pAlgorithm's performance on the system (installation benchmarks, architecture & environment) and create a selection model, stored in a data repository.
• Program execution: gather parameters (data characteristics, run-time tests), query the model, and use the predicted algorithm.

[Diagram: user code plus the parallel algorithm choices are compiled into a STAPL adaptive executable that consults the selection model at run time.]
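To make the run-time side of this flow concrete, here is a minimal C++ sketch of model-driven algorithm selection: gather the input characteristics, query a model built at installation time, and dispatch to the predicted implementation. All names (DataCharacteristics, SelectionModel, the algorithm labels) are hypothetical placeholders, not the actual STAPL/SmartApps interfaces.

#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical input characteristics gathered at run time.
struct DataCharacteristics {
  std::size_t n;          // number of elements
  double      sparsity;   // fraction of distinct elements touched
  int         num_procs;  // processors available to this run
};

// Hypothetical installation-time model: maps characteristics to the algorithm
// predicted to be fastest on this machine. The real framework would evaluate
// a fitted performance model here; we use a trivial stand-in rule.
struct SelectionModel {
  std::string predict(const DataCharacteristics& c) const {
    if (c.sparsity < 0.1) return "selective_privatization";
    return c.num_procs > 16 ? "local_write" : "replicated_buffer";
  }
};

using Impl = std::function<void(std::vector<double>&)>;

// Gather parameters, query the model, run the predicted algorithm.
void run_adaptive(std::vector<double>& data, double sparsity, int num_procs,
                  const SelectionModel& model,
                  const std::map<std::string, Impl>& impls) {
  DataCharacteristics c{data.size(), sparsity, num_procs};
  impls.at(model.predict(c))(data);
}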

Results – Current Status

Investigated three operations:
• Parallel reductions
• Parallel sorting
• Parallel matrix multiplication

On several platforms:
• 128-processor SGI Altix
• 1152-node, dual-processor Xeon cluster
• 68-node, 16-way IBM SMP cluster
• HP V-Class 16-way SMP
• Origin 2000

Adaptive Reduction Selection Framework

[Diagram: a static setup phase runs synthetic experiments and derives the model; an optimizing compiler inserts the algorithm-selection code, producing an adaptive executable. In the dynamic adaptive phase, the executable selects an algorithm, runs with it, and re-selects whenever the application's characteristics change.]

Reductions: Frequent Operations

Reduction: an update operation via associative and commutative operators: x = x ⊕ expr

FOR i = 1 to M
  sum = sum + B[i]

DOALL i = 1 to M
  p = get_pid()
  s[p] = s[p] + B[i]
sum = s[1] + s[2] + … + s[#proc]    // final accumulation of the partial sums

Irregular reduction: updates of array elements through indirection.

FOR i = 1 to M
  A[ X[i] ] = A[ X[i] ] + B[i]

Reductions are a bottleneck for optimization. Many parallelization transformations (algorithms) have been proposed, and none of them always delivers the best performance.
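In shared-memory C++ terms, the replicated-buffer treatment of the irregular reduction above looks roughly like the sketch below: each thread accumulates into a private copy of A, and the private copies are merged at the end. This is an illustrative sketch, not ARMI/STAPL code.

#include <cstddef>
#include <thread>
#include <vector>

// Replicated-buffer parallelization of  A[X[i]] += B[i]  with P threads.
void irregular_reduction(std::vector<double>& A,
                         const std::vector<int>& X,
                         const std::vector<double>& B,
                         unsigned P) {
  std::vector<std::vector<double>> priv(P, std::vector<double>(A.size(), 0.0));
  std::vector<std::thread> workers;
  const std::size_t M = X.size();
  for (unsigned p = 0; p < P; ++p) {
    workers.emplace_back([&, p] {
      for (std::size_t i = p; i < M; i += P)   // cyclic split of the iterations
        priv[p][X[i]] += B[i];                 // race-free: private buffer per thread
    });
  }
  for (auto& t : workers) t.join();
  for (unsigned p = 0; p < P; ++p)             // final cross-thread accumulation
    for (std::size_t j = 0; j < A.size(); ++j)
      A[j] += priv[p][j];
}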

Parallel Reduction Algorithms
• Replicated Buffer: simple, but won't scale when the data access pattern is sparse.
• Replicated Buffer with Links [ICS'02]: reduced communication.
• Selective Privatization [ICS'02]: reduced communication and memory consumption.
• Local Write [Han & Tseng]: zero communication, but extra work.
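For contrast with the replicated-buffer sketch above, the fragment below illustrates the owner-computes idea behind Local Write: each processor owns a block of A and applies only the updates that fall in its block, so the reduction array needs no merging or communication, at the cost of extra (replicated) iteration work. This is a simplified illustration; the published algorithm also partitions and reorders the iterations.

#include <cstddef>
#include <vector>

// One processor's pass in a Local Write-style reduction: it scans the
// iterations but commits only updates to the elements it owns ([my_lo, my_hi)).
void local_write_pass(std::vector<double>& A,
                      const std::vector<int>& X,
                      const std::vector<double>& B,
                      std::size_t my_lo, std::size_t my_hi) {
  for (std::size_t i = 0; i < X.size(); ++i) {
    const std::size_t elem = static_cast<std::size_t>(X[i]);
    if (elem >= my_lo && elem < my_hi)   // owned by this processor
      A[elem] += B[i];                   // no other processor writes this range
  }
}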

Comparison of Parallel Reduction Algorithms – Experimental Setup

Program     Description                # Inputs   Data size
IRREG       Kernel, CFD                4          up to 2M
NBF         Kernel, GROMOS             4          up to 1M
MOLDYN      Synthetic, M.D.            4          up to 100K
Charmm      Kernel, M.D.               3          up to 600K
SPARK 98    Sparse sym. MVM            2          up to 30K
SPICE 2G6   Circuit simulation         4          up to 189K
FMA3D       3D FE solver for solids    1          175K
Total                                  22         7K – 2M

Observations: overall, SelPriv is the best-performing algorithm most often (13 of 22 cases), but no single algorithm works well in all cases.

Memory Reference Model

REAL A[N], pA[N,P]
INTEGER X[2,M]
DOALL i = 1, M
  C1 = func1()
  C2 = func2()
  pA[X[1,i], p] += C1
  pA[X[2,i], p] += C2
DOALL i = 1, N
  A[i] += pA[i, 1:P]

Model parameters (M: # iterations, N: # shared data elements):
• N – number of shared data elements
• Connectivity – how heavily the M iterations reference the N shared elements
• Mobility – number of distinct reduction elements in one iteration; it affects the iteration replication ratio of Local Write
• Other work – non-reduction work per iteration, measured by instrumenting a light-weight timer (~100 clock cycles) in a few iterations

Model (cont.) – Memory access patterns of Replicated Buffer

Sparsity = (# touched elements in the replicated arrays) / (size of the replicated arrays)
  – how efficient is the usage of the replicated array?

# Clusters = number of clusters of touched elements in the replicated array
  – how efficient are the regional usages?
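A possible way to obtain these characteristics at run time is a cheap inspection pass over the index array, as in the sketch below, which computes the sparsity and the number of clusters of touched elements. The function and type names are illustrative, not the framework's actual code.

#include <cstddef>
#include <vector>

struct AccessPattern {
  double sparsity;   // #touched elements / size of the (replicated) array
  int    clusters;   // # contiguous runs of touched elements
};

AccessPattern characterize(const std::vector<int>& X, std::size_t array_size) {
  std::vector<bool> touched(array_size, false);
  for (int idx : X) touched[static_cast<std::size_t>(idx)] = true;

  std::size_t n_touched = 0;
  int clusters = 0;
  bool in_cluster = false;
  for (bool t : touched) {
    if (t) {
      ++n_touched;
      if (!in_cluster) { ++clusters; in_cluster = true; }  // a new cluster starts
    } else {
      in_cluster = false;
    }
  }
  return {static_cast<double>(n_touched) / static_cast<double>(array_size), clusters};
}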

Setup Phase – Model Generation (off-line)

Goal: fit Speedup = F(Parameters).

[Diagram: parameterized synthetic reduction loops are run over a factorial experiment of synthetic parameter values; the measured experimental speedups feed model generation, producing a general linear model for each algorithm.]

Synthetic Experiments

Synthetic reduction loop:

double data[1:N]
FOR j = 1, N * CON
  FOR i = 1, OTH              // non-reduction work
    expr[i] = (memory read, scalar ops)
  FOR i = 1, MOB
    k = index[i,j]
    data[k] += expr[i]

index[*] controls Sparsity and # Clusters.

Experimental parameters:

Parameter       Selected values
N (data size)   8196 – 4,194,304
Connectivity    0.2 – 128
Mobility        2 – 8
Other work      1 – 8
Sparsity        0.02 – 0.99
# Clusters      1, 4, 20

Total cases: ~800

Model Generation

Parameters:
  C – connectivity
  N – size of the reduction array
  M – mobility
  O – non-reduction work / reduction work
  S – sparsity of the replicated array
  L – # clusters

Regression models: match the parameters with the speedup of each scheme. Starting from a general linear model, we sequentially select terms; the final models contain ~30 terms. An alternative method: decision-tree classification.
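At run time, the fitted models can be evaluated as plain dot products over the parameter terms, with the algorithm of highest predicted speedup winning, as in the hypothetical sketch below. The term expansion and coefficients shown are invented for illustration; the real models are fit off-line from the synthetic experiments and contain roughly 30 terms.

#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

struct LinearModel {
  std::string name;
  std::vector<double> coeff;   // coeff[0] is the intercept, then one entry per term
};

double predict_speedup(const LinearModel& m, const std::vector<double>& terms) {
  double s = m.coeff[0];
  for (std::size_t i = 0; i < terms.size(); ++i)   // assumes coeff.size() == terms.size() + 1
    s += m.coeff[i + 1] * terms[i];
  return s;
}

// C, N, M, O, S, L as defined on the slide above.
std::string select_algorithm(const std::vector<LinearModel>& models,
                             double C, double N, double M, double O, double S, double L) {
  std::vector<double> terms = {C, std::log2(N), M, O, S, L, C * S, M * O};  // example expansion
  std::string best;
  double best_speedup = -1.0;
  for (const auto& m : models) {
    const double sp = predict_speedup(m, terms);
    if (sp > best_speedup) { best_speedup = sp; best = m.name; }
  }
  return best;
}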

Evaluation

Q1: Can the prediction models select the right algorithm for a given loop execution instance?
Q2: How far are we from the best possible performance when using our prediction models?

Effectiveness = (algorithm speedup of the model-based recommendation) / (algorithm speedup of the oracle's recommendation)

                            HP V-Class, P=8    IBM Regatta, P=16
Total loop-input cases      22                 21
Correctly predicted cases   18                 19
Average effectiveness       98%                98.8%

Evaluation (cont.)

Q3: How much performance improvement do our prediction models give?

Relative speedup = (speedup of the algorithm chosen by an alternative selection method) / (speedup of the algorithm recommended by our models)

Alternative selection methods:
• RepBuf: always use Replicated Buffer
• Random: randomly select algorithms (average used)
• Default: use SelPriv on the HP and LocalWr on the IBM

Adaptive Reductions

Static irregular reduction:
FOR (t = 1:steps) DO
  FOR (i = 1:M) DO
    access x[ index[i] ]

Adaptive irregular reduction:
FOR (t = 1:steps) DO
  IF (adapt(t)) THEN update index[*]
  FOR (i = 1:M) DO
    access x[ index[i] ]

Phase behavior: Reusability = # steps in a phase. We estimate phase-wise speedups by modeling the overheads of the setup phases of SelPriv and LocalWr.
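The adaptation decision can be phrased as a simple amortization test: pay the inspector/setup cost of SelPriv or LocalWr only when the expected phase length (reusability) is large enough to recover it. The sketch below is a schematic of that test, with invented names and no claim to match the actual selection logic.

// Phase-aware algorithm choice for an adaptive irregular reduction.
enum class Algo { RepBuf, SelPriv, LocalWr };

Algo choose_for_phase(Algo model_choice,          // what the performance model recommends
                      double gain_per_step,       // predicted per-step gain of model_choice over RepBuf
                      double setup_cost,          // inspector/setup overhead of model_choice
                      int    reusability,         // expected # steps before index[] changes again
                      Algo   cheap_default = Algo::RepBuf) {
  // Re-selecting is worthwhile only if the phase lasts long enough
  // to amortize the setup overhead.
  if (gain_per_step * reusability > setup_cost)
    return model_choice;
  return cheap_default;   // e.g., RepBuf, which needs no inspector
}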

AmrRed2D

A synthetic program motivated by modern CFD codes: it iteratively performs 2D irregular mesh refinement and reduction among neighbors.

Input: start mesh 300x300; # steps: 300; adaptation frequency: 15.

DynaSel always tracks the better-performing algorithm.

Moldyn

The relative performance of the algorithms does not change much dynamically, so we artificially specified the reusability of the phases.

[Chart: time steps vs. adaptation phases, comparing large and small phases.]

PP2D in FEATFLOW

The program (a real application):
• PP2D (17K lines): a nonlinear coupled-equations solver using multi-grid methods.
• The irregular reduction loop in the GUPWD subroutine accounts for ~11% of program execution time.
• The distributed input has 4 grids, the largest with ~100K nodes.
• The loop is invoked with 4 (fixed) distinct memory access patterns in an interleaved manner.

Instrumentation:
• The algorithm selection module is wrapped around each invocation of the loop.
• The selection made for each grid is reused for later instances (see the sketch after this slide).
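The "select once per grid, reuse later" instrumentation can be pictured as a small memoizing wrapper around the reduction loop, as in the sketch below. gupwd_wrapper, select_algorithm_for, and run_reduction are hypothetical names standing in for the instrumented FEATFLOW code.

#include <map>

enum class ReductionAlgo { RepBuf, SelPriv, LocalWr };

// Stand-ins for the instrumented application code.
ReductionAlgo select_algorithm_for(int /*grid_id*/) { return ReductionAlgo::SelPriv; }  // runs the model once
void run_reduction(int /*grid_id*/, ReductionAlgo /*algo*/) { /* the GUPWD-style reduction loop */ }

// Wrapper around each invocation of the reduction loop: the selection made
// for a grid is cached and reused for all later instances on that grid.
void gupwd_wrapper(int grid_id) {
  static std::map<int, ReductionAlgo> decision;      // one decision per grid
  auto it = decision.find(grid_id);
  if (it == decision.end())
    it = decision.emplace(grid_id, select_algorithm_for(grid_id)).first;
  run_reduction(grid_id, it->second);
}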

PP2D in FEATFLOW (cont.)

Notes:
• RepBuf, SelPriv, and LocalWr correspond to applying one fixed algorithm to all grids.
• DynaSel selects dynamically once for each grid and reuses the decisions.
• Relative speedups are normalized to the best of the fixed algorithms.

Result: our framework introduces negligible overhead (HP system) and can further improve performance (IBM system).

Sorting – Relative Speedup
• The model obtains 98.8% of the possible performance.
• The next best algorithm (sample sort) provides only 83%.

SmartApps Architecture (revisited; same diagram as before)

The RTS needs to provide (among others):
• Communication library (ARMI)
• Thread management
• Application-specific scheduling:
  – based on the data dependence graph (DDG)
  – based on application-specific policies
  – thread-to-processor mapping
• Memory management
• A bi-directional application–OS interface

Adaptive Apps → Adaptive RTS → Adaptive OS

Optimizing Communication (ARMI)
• Adaptive RTS → adaptive communication (ARMI)
• Minimize applications' execution time using application-specific information:
  – use parallelism to hide latency (MT, …)
  – reduce the critical path lengths of applications
  – selectively use asynchronous/synchronous communication

K42 User-Level Scheduler
• RMI service request threads may be created either:
  – on the local dispatcher and then migrated to the dispatcher of the remote thread, or
  – directly on the dispatcher of the remote thread
• New scheduling logic in the user-level dispatcher:
  – currently only a FIFO ReadyQueue implementation is supported
  – different priority-based scheduling policies are being implemented

SmartApps RTS Scheduler

Integrating application scheduling with K42.

[Diagram: kernel-level dispatchers are scheduled by the K42 kernel; user-level dispatchers and user-level threads are scheduled by the K42 user-level scheduler.]

Priority-based Communication Scheduling
• Based on the type of request: SYNC or ASYNC
• SYNC RMI: a new high-priority thread is created and scheduled to RUN immediately
• ASYNC RMI: a new thread is created, but it is not scheduled to RUN until the current thread yields voluntarily
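A toy model of this dispatch policy is sketched below: SYNC RMIs jump to the front of the ready queue and preempt the running thread, while ASYNC RMIs are queued and serviced only when the running thread yields. The types and queue layout are illustrative and do not reflect K42's actual dispatcher structures.

#include <deque>

struct RmiThread {
  bool high_priority;   // true for SYNC RMIs
  // ... closure that services the request ...
};

struct Dispatcher {
  std::deque<RmiThread> ready;       // READY queue of this user-level dispatcher
  bool preempt_current = false;      // whether the running thread must step aside now

  void on_rmi_request(bool is_sync) {
    RmiThread t{is_sync};
    if (is_sync) {
      ready.push_front(t);           // SYNC: new high-priority thread runs next
      preempt_current = true;        // current thread is moved to READY
    } else {
      ready.push_back(t);            // ASYNC: deferred until the current thread yields
    }
  }
};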

Priority-based Communication Scheduling
• Based on application-specified priorities
• Example: a discrete-ordinates particle transport computation (developed in STAPL)

[Figures: one sweep; eight simultaneous sweeps]

Dependence Graph
• Numbers are cellset indices
• Colors indicate processors

[Four graphs, one per angle-set: A, B, C, D]

RMI Request Trace – Initial State

In the initial state, each dispatcher has a thread in the RUN state.

[Diagram legend: ordinary thread, RMI thread, dispatcher, physical processor; processors P1, P2, P3.]

RMI Request Trace – RMI Request

On an RMI request, a new thread is created on the remote dispatcher to service the request.

[Same diagram as before, with the RMI request shown.]

RMI Request Trace – SYNC RMI

For SYNC RMI requests, the currently running thread is moved to the READY state and the new thread is scheduled to RUN.

RMI Request Trace – ASYNC RMI

For ASYNC RMI requests, the new thread is not scheduled to RUN until the current thread voluntarily yields.

RMI Request Trace – Multiple Pending Requests

With multiple pending requests, the scheduling logic prescribed by the application is enforced to order the servicing of RMI requests.

Memory Consistency Issues
• Switching between threads to service RMI requests may result in memory consistency issues.
• Checkpoints need to be defined at which the execution of a thread can be stopped to service RMI requests:
  – e.g., completion of a method may be a checkpoint for servicing pending RMI requests.
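One way to realize such checkpoints is to queue incoming RMI requests and drain the queue only at designated safe points, e.g. at method completion, as in the sketch below. The names are illustrative placeholders, not the RTS's actual interface.

#include <functional>
#include <queue>

// Requests that arrive while an ordinary thread is running are queued here.
std::queue<std::function<void()>> pending_rmis;

// Called only at safe points, so a servicing RMI never observes the
// object in a half-updated state.
void rmi_checkpoint() {
  while (!pending_rmis.empty()) {
    pending_rmis.front()();   // service one pending RMI request
    pending_rmis.pop();
  }
}

void some_method() {
  // ... mutate local state; no RMI servicing can interleave here ...
  rmi_checkpoint();           // method completion: safe to service pending RMIs
}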