1
SmartApps: Middleware for Adaptive Applications on Reconfigurable Platforms
Lawrence Rauchwerger (http://parasol.tamu.edu/~rwerger/)
Parasol Lab, Dept. of Computer Science, Texas A&M
http://parasol.tamu.edu
2
Today: System-Centric Computing
Classic avenues to performance: parallel algorithms, static compiler optimization, OS support, good architecture.
What's missing? There is no global optimization and no matching between application, OS, and hardware — that matching is intractable for the general case: compilers are conservative, the OS offers generic services, and the architecture is generic.
(Diagram: system-centric computing — Application (algorithm) → Compiler (static) → System (OS & Arch) → Execution; development, analysis & optimization happen separately from the input data.)
3
Our Approach: SmartApps — Application-Centric Computing
SmartApp = Compiler + OS + Architecture + Data + Feedback, under application control, with instance-specific optimization.
(Diagram: application-centric computing — Application (algorithm) → Compiler (static + run-time techniques) → Run-time system for execution, analysis & optimization, with the input data, a run-time compiler, a modular OS, and a reconfigurable architecture all in the loop.)
4
SmartApps Architecture
Development stage: STAPL application → static STAPL compiler, augmented with runtime techniques → compiled code + runtime hooks.
Advanced stages (the smart application):
- Get runtime information (sample input, system information, etc.).
- Compute an optimal application and RTS + OS configuration (predictor & optimizer, toolbox).
- Recompute the application and/or reconfigure the RTS + OS (predictor & evaluator, configurer) — adaptive software on an adaptive RTS + OS.
- Execute the application; continuously monitor performance and adapt as necessary (predictor & evaluator, database).
Small adaptations (tuning) are handled by runtime tuning without recompilation; large adaptations (failure, phase change) feed back into reconfiguration.
5
Collaborative Effort: Texas A&M (Parasol, NE) + IBM + LLNL + INRIA
- STAPL (Amato – TAMU)
- STAPL compiler (Stroustrup/Quinlan – TAMU/LLNL; Cohen – INRIA, France)
- RTS – K42 interface & optimizations (Krieger – IBM)
- Applications (Amato/Adams – TAMU; Novak/Morel – LLNL/LANL)
- Validation on DOE extreme HW: BlueGene (PERCS?) (Moreira/Krieger)
6
SmartApps written in STAPL
STAPL (Standard Template Adaptive Parallel Library):
- Collection of generic parallel algorithms, distributed containers & run-time system (RTS)
- Interoperable with sequential programs
- Extensible and composable by the end user
- Shared object view: no explicit communication
- Distributed objects: no replication/coherence
- High-productivity environment
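This is not STAPL code, but as a rough single-node analogy, standard C++17 parallel algorithms illustrate the "generic algorithm over a container" style that STAPL generalizes to distributed pContainers with a shared object view (no explicit communication in user code):

    // Analogy only: standard C++17 parallel algorithms, not the STAPL API.
    // STAPL would replace the container with a distributed pContainer and the
    // algorithm with a pAlgorithm running on its own RTS.
    #include <vector>
    #include <numeric>
    #include <execution>
    #include <cstdio>

    int main() {
      std::vector<double> v(1'000'000, 1.0);          // sequential container (pContainer in STAPL)
      double sum = std::reduce(std::execution::par,    // generic parallel algorithm (pAlgorithm in STAPL)
                               v.begin(), v.end(), 0.0);
      std::printf("sum = %f\n", sum);
      return 0;
    }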
7
The STAPL Programming Environment (layered):
User code → pAlgorithms, pContainers, pRange → RTS + communication library (ARMI) → OpenMP/MPI/pthreads/native, with an interface to the OS (K42).
8
SmartApps Architecture (revisited; same diagram as slide 4).
9
Algorithm Adaptivity
Problem: parallel algorithms are highly sensitive to:
- Architecture – number of processors, memory interconnect, cache, available resources, etc.
- Environment – thread management, memory allocation, operating system policies, etc.
- Data characteristics – input type, layout, etc.
Solution: adaptively choose the best algorithm from a library of options at run-time (see the sketch below).
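A minimal sketch of run-time algorithm selection among several implementations of the same operation. The names and the selection rule are hypothetical stand-ins; in the framework described on the next slides the model is built from installation-time benchmarks and queried with run-time parameters.

    // Run-time selection among interchangeable implementations (sketch).
    #include <vector>
    #include <algorithm>
    #include <functional>
    #include <string>

    struct RunParams { std::size_t n; int procs; bool nearly_sorted; };

    using SortFn = std::function<void(std::vector<int>&)>;
    struct Choice { std::string name; SortFn fn; };

    // Stub selection model: a real one would be fitted at install time.
    const Choice& select_sort(const std::vector<Choice>& choices, const RunParams& p) {
      return (p.n < 10'000 || p.procs == 1) ? choices.front() : choices.back();
    }

    int main() {
      std::vector<Choice> choices = {
        {"sequential_sort", [](std::vector<int>& v){ std::sort(v.begin(), v.end()); }},
        {"parallel_sort",   [](std::vector<int>& v){ std::sort(v.begin(), v.end()); /* placeholder */ }},
      };
      std::vector<int> data(100'000, 7);
      RunParams p{data.size(), 16, false};
      select_sort(choices, p).fn(data);   // gather parameters, query model, run predicted algorithm
      return 0;
    }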
10
Adaptive Framework – Overview of Approach
- Given: multiple implementation choices for the same high-level algorithm.
- STAPL installation: analyze each pAlgorithm's performance on the system and create a selection model.
- Program execution: gather parameters, query the model, and use the predicted algorithm.
(Diagram: installation benchmarks plus the architecture & environment feed an algorithm performance model stored in a data repository; at run time, user code, the parallel algorithm choices, data characteristics, and run-time tests query the model, which picks the selected algorithm inside the STAPL adaptive executable.)
11
Results – Current Status
Investigated three operations:
- Parallel reductions
- Parallel sorting
- Parallel matrix multiplication
Several platforms:
- 128-processor SGI Altix
- 1152-node, dual-processor Xeon cluster
- 68-node, 16-way IBM SMP cluster
- HP V-Class 16-way SMP
- Origin 2000
12
Adaptive Reduction Selection Framework
Static setup phase: synthetic experiments → model derivation → algorithm-selection code, compiled together with the application by an optimizing compiler into an adaptive executable.
Dynamic adaptive phase: select an algorithm, run the selected algorithm, and re-select whenever the application's characteristics change.
13
Reductions: Frequent Operations
Reduction: an update operation via associative and commutative operators: x = x ⊕ expr.

Scalar reduction and its parallelization:
    FOR i = 1 to M
      sum = sum + B[i]

    DOALL i = 1 to M
      p = get_pid()
      s[p] = s[p] + B[i]           ! partial accumulation
    sum = s[1] + s[2] + … + s[#proc]   ! final accumulation

Irregular reduction: updates of array elements through indirection.
    FOR i = 1 to M
      A[ X[i] ] = A[ X[i] ] + B[i]

Irregular reductions are a bottleneck for optimization. Many parallelization transformations (algorithms) have been proposed, and none of them always delivers the best performance.
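A minimal C++ sketch of the replicated-buffer scheme for the irregular reduction above (the slides use Fortran-like pseudocode; this version uses std::thread and is illustrative only):

    // Replicated buffer: each thread accumulates into a private copy of A;
    // the copies are combined in a final cross-thread accumulation.
    #include <vector>
    #include <thread>

    void replicated_buffer_reduction(std::vector<double>& A,
                                     const std::vector<int>& X,
                                     const std::vector<double>& B,
                                     int nthreads) {
      const std::size_t N = A.size(), M = X.size();
      std::vector<std::vector<double>> priv(nthreads, std::vector<double>(N, 0.0));

      std::vector<std::thread> workers;
      for (int t = 0; t < nthreads; ++t) {
        workers.emplace_back([&, t] {
          // block-partition the iteration space among threads
          std::size_t lo = M * t / nthreads, hi = M * (t + 1) / nthreads;
          for (std::size_t i = lo; i < hi; ++i)
            priv[t][X[i]] += B[i];           // race-free: private buffer
        });
      }
      for (auto& w : workers) w.join();

      for (int t = 0; t < nthreads; ++t)     // final accumulation of partial sums
        for (std::size_t j = 0; j < N; ++j)
          A[j] += priv[t][j];
    }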
14
Parallel Reduction Algorithms
- Replicated Buffer: simple, but does not scale when the data access pattern is sparse.
- Replicated Buffer with Links [ICS02]: reduced communication.
- Selective Privatization [ICS02]: reduced communication and memory consumption.
- Local Write [Han & Tseng]: zero communication, but extra work.
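For contrast with the replicated-buffer sketch above, here is a deliberately naive illustration of the Local Write (owner-computes) idea: each thread updates only the reduction elements it owns, so no combining step is needed, at the price of redundant iteration work. The actual Han & Tseng scheme precomputes per thread the iterations that touch its data, replicating only boundary iterations; the version below simply filters all iterations.

    // Local Write sketch: elements of A are block-owned by threads; every
    // thread scans the iterations but applies only updates it owns.
    #include <vector>
    #include <thread>

    void local_write_reduction(std::vector<double>& A,
                               const std::vector<int>& X,
                               const std::vector<double>& B,
                               int nthreads) {
      const std::size_t N = A.size(), M = X.size();
      std::vector<std::thread> workers;
      for (int t = 0; t < nthreads; ++t) {
        workers.emplace_back([&, t] {
          std::size_t lo = N * t / nthreads, hi = N * (t + 1) / nthreads;  // owned range of A
          for (std::size_t i = 0; i < M; ++i) {        // replicated traversal (extra work)
            std::size_t k = static_cast<std::size_t>(X[i]);
            if (k >= lo && k < hi)
              A[k] += B[i];                            // only the owner writes: zero communication
          }
        });
      }
      for (auto& w : workers) w.join();
    }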
15
Comparison of Parallel Reduction Algorithms – Experimental setup

Program    | Description              | # Inputs | Data size
IRREG      | Kernel, CFD              | 4        | up to 2M
NBF        | Kernel, GROMOS           | 4        | up to 1M
MOLDYN     | Synthetic, M.D.          | 4        | up to 100K
Charmm     | Kernel, M.D.             | 3        | up to 600K
SPARK98    | Sparse sym. MVM          | 2        | up to 30K
SPICE 2G6  | Circuit simulation       | 4        | up to 189K
FMA3D      | 3D FE solver for solids  | 1        | 175K
Total      |                          | 22       | 7K – 2M
16
Observations: Overall, SelPriv is the best-performing algorithm (13 of 22 cases), but no single algorithm works well in all cases.
17
Memory Reference Model
Example loop (P processors, replicated array pA):
    REAL A[N], pA[N,P]
    INTEGER X[2,M]
    DOALL i = 1, M
      C1 = func1()
      C2 = func2()
      pA[X[1,i], p] += C1
      pA[X[2,i], p] += C2
    DOALL i = 1, N
      A[i] += pA[i, 1:P]

Model parameters:
- Connectivity: relates the number of iterations to the number of shared data elements (M: # iterations, N: # shared data elements).
- Mobility: number of distinct reduction elements referenced in one iteration; it affects the iteration replication ratio of Local Write.
- N: number of shared data elements.
- Other work: non-reduction work per iteration, measured by instrumenting a light-weight timer (~100 clock cycles) in a few iterations.
18
Model (cont.) – memory access pattern of Replicated Buffer
- Sparsity = (# touched elements in the replicated arrays) / (size of the replicated arrays): how efficiently is the replicated storage used?
- # Clusters = number of clusters of touched elements in the replicated array: how efficient are the regional usages?
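A minimal illustrative sketch (not from the slides) of how such loop characteristics could be computed from the indirection array before selecting an algorithm; the exact definitions used by the framework may differ from these simplified stand-ins.

    // Compute simplified reduction-loop characteristics from an indirection
    // array X (M references into a reduction array of size N).
    #include <vector>
    #include <cstddef>

    struct LoopStats {
      double connectivity;    // references per shared element (M / N)
      double sparsity;        // fraction of elements actually touched
      std::size_t clusters;   // maximal runs of consecutive touched elements
    };

    LoopStats characterize(const std::vector<int>& X, std::size_t N) {
      std::vector<char> touched(N, 0);
      for (int k : X) touched[static_cast<std::size_t>(k)] = 1;

      std::size_t ntouched = 0, clusters = 0;
      for (std::size_t j = 0; j < N; ++j) {
        ntouched += touched[j];
        if (touched[j] && (j == 0 || !touched[j - 1])) ++clusters;  // start of a touched run
      }
      return { static_cast<double>(X.size()) / N,
               static_cast<double>(ntouched) / N,
               clusters };
    }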
19
Setup Phase — Model Generation (off-line)
Goal: Speedup = F(parameters).
Parameterized synthetic reduction loops are executed over synthetic parameter values in a factorial experiment; the measured experimental speedups drive model generation, yielding a general linear model for each algorithm.
20
Synthetic Experiments

Synthetic reduction loop:
    double data[1:N]
    FOR j = 1, N * CON
      FOR i = 1, OTH
        expr[i] = (memory read, scalar ops)    ! non-reduction work
      FOR i = 1, MOB
        k = index[i,j]                         ! index[*] controls Sparsity, # Clusters
        data[k] += expr[i]

Experimental parameters:
Parameter      | Selected values
N (data size)  | 8196 — 4,194,304
Connectivity   | 0.2 — 128
Mobility       | 2 — 8
Other work     | 1 — 8
Sparsity       | 0.02 — 0.99
# Clusters     | 1, 4, 20
Total cases    | ~800
21
Model Generation
Parameters: C: connectivity; N: size of the reduction array; M: mobility; O: non-reduction work / reduction work; S: sparsity of the replicated array; L: # clusters.
Regression models: match the parameters with the speedup of each scheme. Starting from a general linear model, we sequentially select terms; the final models contain ~30 terms.
Other method: decision-tree classification.
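A minimal sketch (assumed structure, not the actual fitted models) of how such per-algorithm linear models could be evaluated at run time to pick the predicted-best reduction algorithm:

    // Evaluate one fitted linear model per algorithm and pick the argmax.
    // Coefficients and the term basis are placeholders; the real models are
    // fitted from the factorial synthetic experiments and contain ~30 terms.
    #include <vector>
    #include <string>
    #include <cmath>
    #include <cstddef>

    struct Features { double C, N, M, O, S, L; };  // connectivity, size, mobility, other work, sparsity, clusters

    struct LinearModel {
      std::string algorithm;
      std::vector<double> coeff;                   // one coefficient per term below
      double predict(const Features& f) const {
        double terms[] = {1.0, f.C, f.S, f.M, f.O, std::log(f.N), f.C * f.S};  // placeholder basis
        double s = 0.0;
        for (std::size_t i = 0; i < coeff.size() && i < 7; ++i) s += coeff[i] * terms[i];
        return s;                                  // predicted speedup
      }
    };

    const LinearModel& select_algorithm(const std::vector<LinearModel>& models,
                                        const Features& f) {
      std::size_t best = 0;
      for (std::size_t i = 1; i < models.size(); ++i)
        if (models[i].predict(f) > models[best].predict(f)) best = i;
      return models[best];
    }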
22
Evaluation
Q1: Can the prediction models select the right algorithm for a given loop execution instance?
Q2: How far are we from the best possible performance when using our prediction models?

Effectiveness = (algorithm speedup of the model-based recommendation) / (algorithm speedup of the oracle's recommendation)

                           | HP V-Class, P=8 | IBM Regatta, P=16
Total loop-input cases     | 22              | 21
Correctly predicted cases  | 18              | 19
Average effectiveness      | 98%             | 98.8%
23
Evaluation (cont.)
Q3: What is the performance improvement from using our prediction models?

Relative speedup = (speedup of the algorithm chosen by an alternative selection method) / (speedup of the algorithm recommended by our models)

Alternative selection methods:
- RepBuf: always use Replicated Buffer
- Random: randomly select an algorithm (average reported)
- Default: use SelPriv on HP and LocalWr on IBM
24
Adaptive Reductions

Static irregular reduction:
    FOR (t = 1:steps) DO
      FOR (i = 1:M) DO
        access x[ index[i] ]

Adaptive irregular reduction:
    FOR (t = 1:steps) DO
      IF (adapt(t)) THEN update index[*]
      FOR (i = 1:M) DO
        access x[ index[i] ]

Phase behavior: Reusability = # steps in a phase. We estimate phase-wise speedups by modeling the overheads of the setup phases of SelPriv and LocalWr.
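A minimal illustrative sketch of per-phase re-selection: when the indirection array changes, the algorithm is re-chosen, with the (modeled) setup overhead amortized over the expected reusability of the phase. All names here are hypothetical; the stub selection just returns a sequential fallback.

    // Outer time-step loop that re-selects the reduction algorithm at phase
    // changes and reuses it for the rest of the phase.
    #include <vector>
    #include <functional>

    using ReductionFn = std::function<void(std::vector<double>&,
                                           const std::vector<int>&,
                                           const std::vector<double>&)>;

    // Stub for the model-based selection sketched earlier; a real version would
    // weigh each algorithm's setup cost against the expected phase length.
    ReductionFn pick_algorithm(const std::vector<int>& /*index*/,
                               std::size_t /*N*/, int /*expected_reuse*/) {
      return [](std::vector<double>& x, const std::vector<int>& idx,
                const std::vector<double>& b) {
        for (std::size_t i = 0; i < idx.size(); ++i) x[idx[i]] += b[i];  // placeholder: sequential
      };
    }

    void run(std::vector<double>& x, std::vector<int>& index,
             const std::vector<double>& contrib, int steps, int adapt_freq,
             int expected_reuse) {
      ReductionFn reduce = pick_algorithm(index, x.size(), expected_reuse);
      for (int t = 0; t < steps; ++t) {
        if (t > 0 && t % adapt_freq == 0)              // phase change: index[*] was updated
          reduce = pick_algorithm(index, x.size(), expected_reuse);
        reduce(x, index, contrib);                     // algorithm selected for this phase
      }
    }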
25
AmrRed2D
A synthetic program motivated by modern CFD codes: it iteratively performs 2D irregular mesh refinement and a reduction among neighbors.
Input: starting mesh 300x300; # steps: 300; adaptation frequency: 15.
Result: DynaSel always tracks the better-performing algorithm.
26
Moldyn
The relative performance of the algorithms does not change much dynamically, so we artificially specified the reusability of the phases (alternating large and small adaptation phases across the time steps).
27
PP2D in FEATFLOW
The program (a real application):
- PP2D (17K lines): a nonlinear coupled-equation solver using multi-grid methods.
- The irregular reduction loop in the GUPWD subroutine accounts for ~11% of program execution time.
- The distributed input has 4 grids, the largest with ~100K nodes.
- The loop is invoked with 4 (fixed) distinct memory access patterns in an interleaved manner.
Instrumentation:
- The algorithm selection module is wrapped around each invocation of the loop (see the sketch below).
- The selection for each grid is reused for later instances.
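An illustrative sketch of wrapping selection around each loop invocation and caching the decision per grid; the structure and names are hypothetical, not the actual instrumentation, and select_for_pattern stands in for the model query sketched earlier.

    // Cache the model's algorithm choice per grid (access pattern) and reuse
    // it for later invocations of the same loop on the same grid.
    #include <unordered_map>
    #include <vector>
    #include <functional>

    using ReductionFn = std::function<void(std::vector<double>&,
                                           const std::vector<int>&,
                                           const std::vector<double>&)>;

    ReductionFn select_for_pattern(const std::vector<int>& index, std::size_t N);  // model query (see earlier sketch)

    class SelectionCache {
      std::unordered_map<int, ReductionFn> chosen_;   // grid id -> chosen algorithm
    public:
      void invoke(int grid_id, std::vector<double>& A,
                  const std::vector<int>& X, const std::vector<double>& B) {
        auto it = chosen_.find(grid_id);
        if (it == chosen_.end())                      // first time on this grid: query the model
          it = chosen_.emplace(grid_id, select_for_pattern(X, A.size())).first;
        it->second(A, X, B);                          // later instances reuse the cached decision
      }
    };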
28
PP2D in FEATFLOW (cont.)
Notes:
- RepBuf, SelPriv, and LocalWr correspond to applying one fixed algorithm to all grids; DynaSel dynamically selects once for each grid and reuses the decisions.
- Relative speedups are normalized to the best of the fixed algorithms.
Result: our framework introduces negligible overhead (HP system) and can further improve performance (IBM system).
29
Sorting – Relative Speedup
- The model obtains 98.8% of the best possible performance.
- The next-best single algorithm (sample sort) provides only 83%.
30
SmartApps Architecture (revisited; same diagram as slide 4).
31
The RTS needs to provide (among others):
- Communication library (ARMI)
- Thread management
- Application-specific scheduling:
  – based on the data dependence graph (DDG)
  – based on application-specific policies
  – thread-to-processor mapping
- Memory management
- A bi-directional application–OS interface
Adaptive apps ↔ adaptive RTS ↔ adaptive OS.
32
Optimizing Communication (ARMI)
An adaptive RTS implies adaptive communication (ARMI). The goal is to minimize application execution time using application-specific information:
- Use parallelism to hide latency (multithreading, …)
- Reduce the critical path length of the application
- Selectively use asynchronous/synchronous communication (see the sketch below)
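A hypothetical illustration of selectively choosing asynchronous vs. synchronous communication depending on whether the result sits on the critical path. The Comm interface below is invented for this sketch and is not the ARMI API.

    // Async send hides latency behind independent local work; the blocking
    // call is paid only when the result is needed immediately.
    #include <thread>
    #include <functional>

    struct Comm {
      // Fire-and-forget: caller overlaps communication with local work.
      void async_send(int dest, std::function<void()> method) {
        std::thread(std::move(method)).detach();   // sketch only: local stand-in for a remote invocation
        (void)dest;
      }
      // Stand-in for a blocking round trip to the destination.
      double sync_call(int dest, const std::function<double()>& method) {
        (void)dest;
        return method();
      }
    };

    double update_and_read(Comm& comm, bool result_on_critical_path) {
      if (!result_on_critical_path) {
        comm.async_send(1, []{ /* remote update */ });   // latency hidden behind local work
        /* ... independent local work here ... */
        return 0.0;
      }
      return comm.sync_call(1, []{ return 42.0; });      // pay the round trip only when required
    }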
33
K42 User-Level Scheduler
- RMI service-request threads may be created on:
  – the local dispatcher and then migrated to the dispatcher of the remote thread, or
  – the dispatcher of the remote thread directly
- New scheduling logic in the user-level dispatcher:
  – currently only a FIFO ready-queue implementation is supported
  – different priority-based scheduling policies are being implemented
34
SmartApps RTS Scheduler – integrating application scheduling with K42
(Diagram: at kernel level, the K42 kernel schedules kernel-level dispatchers; at user level, user-level dispatchers run user-level threads scheduled by the K42 user-level scheduler.)
35
Priority-based Communication Scheduling – based on the type of request (SYNC or ASYNC)
- SYNC RMI: a new high-priority thread is created and scheduled to RUN immediately.
- ASYNC RMI: a new thread is created but is not scheduled to RUN until the current thread yields voluntarily.
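A minimal illustrative sketch of this two-level policy inside a user-level dispatcher; the types are invented for the example and are not the K42 or STAPL RTS code.

    // SYNC requests preempt the running thread; ASYNC requests are queued and
    // run only when the current thread voluntarily yields.
    #include <deque>
    #include <functional>

    struct Thread { std::function<void()> body; bool high_priority; };

    class Dispatcher {
      std::deque<Thread> ready_;          // FIFO ready queue (as in the current implementation)
      Thread running_{[]{}, false};
    public:
      void on_sync_rmi(std::function<void()> handler) {
        ready_.push_front(running_);                   // current thread -> READY
        running_ = Thread{std::move(handler), true};   // high-priority RMI thread RUNs now
        running_.body();
      }
      void on_async_rmi(std::function<void()> handler) {
        ready_.push_back(Thread{std::move(handler), false});  // deferred until a yield
      }
      void yield() {                                   // voluntary yield by the running thread
        if (ready_.empty()) return;
        ready_.push_back(running_);
        running_ = ready_.front(); ready_.pop_front();
        running_.body();
      }
    };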
36
Priority-based Communication Scheduling – based on application-specified priorities
Example: a discrete-ordinates particle transport computation (developed in STAPL).
(Figures: one sweep vs. eight simultaneous sweeps across the spatial domain.)
37
Dependence Graph
(Figure: dependence graphs for angle-sets A, B, C, and D; numbers are cellset indices, colors indicate processors.)
38
RMI Request Trace – Initial State
(Figure legend: ordinary thread, RMI thread, dispatcher, physical processor; dispatchers run on processors P1, P2, P3.)
In the initial state, each dispatcher has a thread in the RUN state.
39
RMI Request Trace – RMI Request
On an RMI request, a new thread is created to service the request on the remote dispatcher.
40
RMI Request Trace – SYNC RMI
For SYNC RMI requests, the currently running thread is moved to the READY state and the new thread is scheduled to RUN.
41
RMI Request Trace – ASYNC RMI
For ASYNC RMI requests, the new thread is not scheduled to RUN until the current thread voluntarily yields.
42
RMI Request Trace – Multiple Pending Requests
With multiple pending requests, the scheduling logic prescribed by the application is enforced to order the servicing of the RMI requests.
43
Memory Consistency Issues
- Switching between threads to service RMI requests may cause memory consistency issues.
- Checkpoints need to be defined at which a thread's execution can be stopped to service RMI requests; e.g., the completion of a method may be a checkpoint for servicing pending RMI requests (see the sketch below).
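A minimal illustrative sketch of such checkpoints: the thread drains pending RMI requests only at well-defined points (here, after a method completes), so a servicing handler never observes a partially updated object. The names are invented for the example, not taken from the RTS implementation.

    // Service pending RMI requests only at explicit checkpoints placed where
    // shared state is known to be consistent.
    #include <deque>
    #include <functional>

    class RmiCheckpointRunner {
      std::deque<std::function<void()>> pending_;   // RMI requests queued by the RTS
    public:
      void enqueue(std::function<void()> rmi) { pending_.push_back(std::move(rmi)); }

      // Checkpoint: drain pending requests; called only at consistent points.
      void checkpoint() {
        while (!pending_.empty()) {
          auto rmi = std::move(pending_.front());
          pending_.pop_front();
          rmi();
        }
      }

      // Run a user method; its completion is a natural checkpoint.
      void run_method(const std::function<void()>& method) {
        method();       // shared state may be temporarily inconsistent inside
        checkpoint();   // consistent again: safe to service pending RMIs
      }
    };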