1
VGrADS Programming Tools Research: Vision and Overview
Ken Kennedy, Center for High Performance Software, Rice University
http://vgrads.rice.edu/site_visit/april_2005/slides/kennedy-tools.pdf
2
The VGrADS Vision: National Distributed Problem Solving
Where We Want To Be
— Transparent Grid computing
  – Submit a job
  – Find and schedule resources
  – Execute efficiently
Where We Are
— Low-level hand programming
What Do We Need?
— A more abstract view of the Grid
  – Each developer sees a specialized "virtual grid"
— Simplified programming models built on the abstract view
  – Permit the application developer to focus on the problem
(Diagram: databases and supercomputers linked through the Grid)
3
The Original GrADS Vision (architecture diagram): a Program Preparation System — source application, libraries, software components, a whole-program compiler, and a binder — produces a configurable object program; the Execution Environment — resource negotiator, scheduler, Grid runtime system, and real-time performance monitor — runs it, feeding performance problems and negotiation results back to the preparation system as performance feedback.
4
Lessons from GrADS
Mapping and Scheduling for MPI Jobs is Hard
— Although we were able to do some interesting experiments
Performance Model Construction is Hard
— Hybrid static/dynamic schemes are best
— Difficult for application developers to do by hand
Heterogeneity is Hard
— We completely revised the launching mechanisms to support it
— Good scheduling is critical
Rescheduling/Migration is Hard
— Requires application collaboration (generalized checkpointing)
— Requires performance modeling to determine profitability
Scaling to Large Grids is Hard
— Scheduling becomes expensive
5
VGrADS Virtual Grid Hierarchy
6
Virtual Grids and Tools
Abstract Resource Request
— Permits true scalability by mapping from requirements to a set of resources
  – Scalable search produces a manageable resource set
— Virtual Grid services permit effective scheduling
  – Fault tolerance, performance stability
Look-Ahead Scheduling
— Applications map to directed graphs
  – Vertices are computations, edges are data transfers
— Scheduling is done on the entire graph
  – Using automatically constructed performance models for the computations
  – Depends on load prediction (Network Weather Service)
Abstract Programming Interfaces
— Application graphs constructed from scripts
  – Written in standard scripting languages (Python, Perl, Matlab)
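The look-ahead idea above can be sketched as follows: represent the workflow as a DAG and place every vertex while looking at the whole graph, rather than greedily one node at a time. This is a minimal illustration, not the VGrADS scheduler; the task names echo EMAN steps, and the per-resource runtime predictions are invented.

```python
# Sketch: a workflow as a DAG (vertices = computations, edges = data
# transfers), scheduled over the entire graph using predicted runtimes.
# Task names and cost numbers are illustrative, not a real VGrADS API.

tasks = {                       # task -> predicted runtime (min) per resource
    "classalign":   {"rtc": 12, "medusa": 4},
    "classesbymra": {"rtc": 38, "medusa": 10},
    "make3d":       {"rtc": 9,  "medusa": 3},
}
edges = [("classalign", "classesbymra"), ("classesbymra", "make3d")]

def schedule(tasks, edges):
    """Whole-graph scheduling sketch: visit tasks in topological order
    and place each on the resource minimizing its finish time, accounting
    for predecessors' finish times and resource availability."""
    preds = {t: [u for (u, v) in edges if v == t] for t in tasks}
    free = {"rtc": 0, "medusa": 0}          # time each resource becomes free
    finish, placement = {}, {}
    for t in tasks:                          # dict order is topological here
        ready = max((finish[p] for p in preds[t]), default=0)
        res = min(tasks[t], key=lambda r: max(ready, free[r]) + tasks[t][r])
        start = max(ready, free[res])
        finish[t] = start + tasks[t][res]
        free[res] = finish[t]
        placement[t] = res
    return placement, max(finish.values())  # mapping and makespan

placement, makespan = schedule(tasks, edges)
```

With these made-up predictions the chain lands entirely on the faster cluster; real whole-graph schedulers (and the heuristics on the next slides) also weigh data-transfer costs on the edges.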
7
Programming Tools Focus: Automating critical application-development steps
— Initiating application execution
  – Optimize and launch the application on heterogeneous resources
— Scheduling workflow graphs
  – Heuristics required (the problems are NP-complete at best)
  – Good initial results if accurate predictions of resource performance are available (see the EMAN demo)
— Constructing performance models
  – Based on loop-level performance models of the application
  – Requires benchmarking with (relatively) small data sets and extrapolating to larger cases
— Building workflow graphs from high-level scripts
  – Examples: Python (EMAN), Matlab
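The "benchmark small, extrapolate large" step above can be sketched with a simple scaling fit. This is only an illustration of the idea: the benchmark sizes and timings are made up, and a log-log least-squares fit (runtime ≈ c·n^k) stands in for the loop-level models the slides describe.

```python
# Sketch: fit a runtime model to small benchmark runs, then extrapolate
# to a production-size input. Data and the power-law form are illustrative.
import math

sizes    = [1000, 2000, 4000, 8000]   # small benchmark input sizes
runtimes = [0.9, 3.7, 14.8, 59.5]     # measured seconds (invented)

# least-squares fit of log(t) = log(c) + k*log(n)
xs = [math.log(n) for n in sizes]
ys = [math.log(t) for t in runtimes]
m = len(xs)
mx, my = sum(xs) / m, sum(ys) / m
k = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
    sum((x - mx) ** 2 for x in xs)
c = math.exp(my - k * mx)

def predict(n_large):
    """Extrapolated runtime (seconds) for a larger input size."""
    return c * n_large ** k

est = predict(64000)   # extrapolate well beyond the benchmarked range
```

Here k comes out near 2, i.e. the invented data behave quadratically; a real tool would fit per-loop models and validate the extrapolation before trusting it for scheduling.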
8
Initiating Application Execution
Vision: Transparent to the User, with Fault Tolerance
— Binaries shipped or preinstalled
  – Reconfigured where necessary
— Data moved to computations
— Support for fault tolerance
— Support for rescheduling, migration, and restart
Research
— Separation of concerns between workflow management and the virtual grid
— Support for fault tolerance and rescheduling/migration
  – Checkpointing, load projection, performance modeling
— Platform-independent application representation
9
Application Initiation and Management (diagram): data discovery and workflow reduction turn a Pegasus abstract workflow into a concrete workflow; resource discovery and mapping in the Virtual Grid Workflow Scheduler place it onto virtual grid resources; DAGMan then drives execution through the Virtual Grid Execution System.
10
Support for Fault Tolerance and Rescheduling
Fault tolerance
— Workflows: use Pegasus mechanisms for recovery
  – Issue: support for registration of completed steps in vgES
— MPI steps: diskless checkpointing
  – See the Dongarra talk
Rescheduling
— Mechanisms for monitoring and extending virtual grids in vgES
  – Under development
— Issue: determining whether rescheduling will be profitable
  – Projection of the load on the current resources without the current application
  – Models to project performance on the current and alternative resources
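The profitability question above reduces to comparing two projections: time remaining if the application stays put versus time remaining elsewhere plus the one-time cost of moving. A minimal sketch, with invented costs and an illustrative safety margin to absorb prediction error:

```python
# Sketch: is rescheduling/migration profitable? Inputs would come from
# load projection and performance models; the numbers here are invented.

def rescheduling_profitable(remaining_current, remaining_alternative,
                            checkpoint_cost, migration_cost, margin=1.2):
    """Migrate only if projected remaining time on the alternative
    resources, plus checkpoint and migration overhead, beats staying
    put by a safety margin (all times in the same units, e.g. minutes)."""
    stay = remaining_current
    move = remaining_alternative + checkpoint_cost + migration_cost
    return stay > margin * move

# e.g. 400 min left here vs. 250 min elsewhere after a 30 + 20 min move
decision = rescheduling_profitable(400, 250, 30, 20)
```

The margin encodes the slide's point that both projections are model outputs, not measurements: migration should win clearly, not marginally, before it is worth the disruption.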
11
Platform-Independent Applications
Supporting a single application image across different platforms
— Translation and optimization tools that map the application onto the hardware effectively and efficiently
  – At the high level: resource allocation and scheduling
  – At the low level: optimization, scheduling, and runtime reoptimization
— We are working with the LLVM system (from Illinois)
  – Produces good code for a variety of hardware platforms
Run-time Reoptimization
— The idea: in response to poor performance, reoptimize
— The vision: move analysis and planning to compile time to reduce the runtime cost of reoptimization
  – Example: alternative blocking strategies for a loop nest, with the shift triggered by excessive cache misses
12
Reducing the Price of Abstraction
The virtual grid abstracts away details of the underlying hardware execution environment
— Users want the abstraction, but at a minimal price
— We need translation and optimization tools that map the computation onto the hardware effectively and efficiently
  – At the high level: resource allocation and scheduling
  – At the low level: optimization, scheduling, and runtime reoptimization
From an operational perspective, we need a system that can
— Produce good code for a range of hardware platforms, and
— Serve as an experimental platform for reoptimization work
We are working with the LLVM system (from Illinois)
13
Runtime Reoptimization
The Idea: in response to poor performance, reoptimize
— The notion has been proved in high-overhead Java environments
— It is unproven in low-overhead (scientific) environments
— Transformation speed and effectiveness are critical for profitability
The Vision: move analysis and planning to compile time to reduce the runtime cost of reoptimization
— Statically planned, dynamically applied optimizations
— Initial development work in the open-source LLVM system
  – Improvements in the basic infrastructure (done)
  – Improvements in hooks for reoptimization (in progress)
— Example: alternative blocking strategies for a loop nest, with the shift triggered by excessive cache misses
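The "statically planned, dynamically applied" idea can be sketched as a set of code variants prepared ahead of time plus a cheap runtime trigger that shifts between them. The sketch below is illustrative only: the variants and the simulated miss-rate feed are invented, and a real system would read hardware performance counters and switch compiled loop-nest variants, not Python functions.

```python
# Sketch: statically planned, dynamically applied reoptimization.
# Two traversal strategies for a loop nest are prepared "at compile
# time"; at run time an excessive (simulated) cache-miss rate triggers
# a shift to the next planned variant.

def sum_rows(matrix):                 # variant A: straight row-major loop
    return sum(sum(row) for row in matrix)

def sum_blocked(matrix, b=2):         # variant B: blocked (tiled) loop nest
    n, total = len(matrix), 0
    for ii in range(0, n, b):
        for jj in range(0, n, b):
            for i in range(ii, min(ii + b, n)):
                for j in range(jj, min(jj + b, n)):
                    total += matrix[i][j]
    return total

variants = [sum_rows, sum_blocked]    # plan fixed at "compile time"

def run_with_reopt(matrix, miss_rates, threshold=0.3):
    """Apply the current variant each round; shift to the next planned
    variant whenever the observed miss rate exceeds the threshold."""
    current, results = 0, []
    for rate in miss_rates:
        results.append(variants[current](matrix))
        if rate > threshold and current + 1 < len(variants):
            current += 1              # dynamically applied switch
    return results, current

m = [[1, 2], [3, 4]]
results, final_variant = run_with_reopt(m, miss_rates=[0.1, 0.5, 0.2])
```

Both variants compute the same result; only the traversal order (and hence cache behavior) differs, which is exactly why the switch is safe to apply mid-run.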
14
Runtime Reoptimization
Status of work in LLVM
— New register allocator (Eckhardt, Das Gupta)
  – Distribution waiting on IP issues
— New scheduler planned (Eckhardt)
Reoptimization work
— Subject of Das Gupta's thesis
  – Ideas are sound; implementation not yet begun
15
Scheduling Workflows
Vision
— The application includes performance models for all workflow nodes
  – Performance models automatically constructed
— Software schedules applications onto the Grid in two phases
  – Virtual grid requirement and acquisition
  – Model-based scheduling on the returned vGrid
Research
— Scheduling strategy: whole workflow
  – Dramatic makespan reduction
— Two-phase scheduling
  – Exploring trade-offs
  – Expectation: dramatic reduction in complexity with little loss in performance
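The two phases above can be sketched in a few lines: phase one prunes the whole Grid down to a small virtual grid matching an abstract requirement, and phase two runs model-based scheduling only over that returned set. The resource list, attributes, and the toy performance model below are all invented for illustration.

```python
# Sketch of two-phase scheduling: (1) abstract resource request ->
# small virtual grid, (2) model-based scheduling on the returned set.

all_resources = [
    {"name": "rtc",     "cpus": 128, "ghz": 0.9},
    {"name": "acrl",    "cpus": 120, "ghz": 1.3},
    {"name": "medusa",  "cpus": 32,  "ghz": 2.0},
    {"name": "slownet", "cpus": 8,   "ghz": 0.4},
]

def acquire_vgrid(resources, min_cpus):
    """Phase 1: scalable search — map an abstract requirement
    ('clusters with at least min_cpus CPUs') to a manageable set."""
    return [r for r in resources if r["cpus"] >= min_cpus]

def schedule_on_vgrid(vgrid, work_units):
    """Phase 2: model-based scheduling over the small returned set —
    here, a toy model picks the resource minimizing work / aggregate speed."""
    return min(vgrid, key=lambda r: work_units / (r["cpus"] * r["ghz"]))

vgrid = acquire_vgrid(all_resources, min_cpus=32)
best = schedule_on_vgrid(vgrid, work_units=1000)
```

The complexity win claimed on the slide comes from phase 2 seeing only the pruned set: the expensive model-based search never touches resources that could not satisfy the request.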
16
Workflow Scheduling: Results
Testbed
— 64 dual-processor Itanium IA-64 nodes (900 MHz) at the Rice University Terascale Cluster [RTC]
— 60 dual-processor Itanium IA-64 nodes (1300 MHz) at the University of Houston [acrl]
— 16 Opteron nodes (2009 MHz) at the University of Houston Opteron cluster [medusa]
Experiment
— Ran the EMAN refinement cycle and compared running times for "classesbymra", the most compute-intensive parallel step in the workflow
— Determined the 3D structure of the 'rdv' virus particle with a large input data set [2 GB]
17
Results: Efficient Scheduling
We compared the following workflow scheduling strategies:
1. Heuristic scheduling with accurate performance models generated semi-automatically (HAP)
2. Heuristic scheduling with crude performance models based on the CPU power of the resources (HCP)
3. Random scheduling with no performance models (RNP)
4. Weighted random scheduling with accurate performance models (RAP)
We compared the makespan of the "classesbymra" step for the different scheduling strategies.
18
EMAN Scheduling: Varying Performance Models
Set of resources:
— 50 rtc nodes at Rice (IA-64)
— 13 medusa nodes at U of Houston (Opteron)
RDV data set
Varying scheduling strategy:
— RNP: Random / No PerfModel
— RAP: Random / Accurate PerfModel
— HCP: Heuristic / GHz-only PerfModel
— HAP: Heuristic / Accurate PerfModel

Method | Instances on rtc (IA-64) | Instances on medusa (Opteron) | Nodes picked at rtc | Nodes picked at medusa | Exec time at rtc (min) | Exec time at medusa (min) | Overall makespan (min)
RNP    | 89 | 21 | 43 |  9 | 1121 | 298 | 1121
RAP    | 57 | 53 | 34 | 10 |  762 | 530 |  762
HCP    | 58 | 52 | 50 | 13 |  757 | 410 |  757
HAP    | 50 | 60 | 50 | 13 |  386 | 505 |  505
19
Results: Efficient Scheduling

Method | Instances on RTC (IA-64) | Instances on medusa (Opteron) | Nodes picked at RTC | Nodes picked at medusa | Exec time at RTC (min) | Exec time at medusa (min) | Overall makespan (min)
HAP    | 50 | 60 | 50 | 13 |  386 | 505 |  505
HCP    | 58 | 52 | 50 | 13 |  757 | 410 |  757
RNP    | 89 | 21 | 43 |  9 | 1121 | 298 | 1121
RAP    | 57 | 53 | 34 | 10 |  762 | 530 |  762

Set of resources: 50 RTC nodes, 13 medusa nodes
HAP: Heuristic / Accurate PerfModel
HCP: Heuristic / Crude PerfModel
RNP: Random / No PerfModel
RAP: Random / Accurate PerfModel
20
Results: Load Balance

Instances on RTC (IA-64) | Instances on medusa (Opteron) | Instances on acrl (IA-64) | Exec time at RTC (min) | Exec time at medusa (min) | Exec time at acrl (min) | Overall makespan (min)
29 | 42 | 39 | 383 | 410 | 308 | 410

Set of resources: 43 RTC nodes, 14 medusa nodes, 39 acrl nodes
Good load balance, due to accurate performance models
21
Results: Accuracy of Performance Models
Our performance models were quite accurate:
— rank[RTC_node] / rank[medusa_node] = 3.41
— actual_exec_time[RTC_node] / actual_exec_time[medusa_node] = 3.82
— rank[acrl_node] / rank[medusa_node] = 2.36
— actual_exec_time[acrl_node] / actual_exec_time[medusa_node] = 3.01
Accurate relative performance-model values result in an efficient load balance of the classesbymra instances.
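The comparison above can be made quantitative by treating each predicted rank ratio against its measured time ratio as a relative error. The numbers below are taken from the slide; the relative-error metric itself is just one reasonable choice.

```python
# Sketch: relative error of predicted performance ratios vs. measured
# execution-time ratios, using the figures reported on the slide.

pairs = {
    # pair: (predicted rank ratio, measured exec-time ratio)
    "rtc/medusa":  (3.41, 3.82),
    "acrl/medusa": (2.36, 3.01),
}

errors = {name: abs(pred - actual) / actual
          for name, (pred, actual) in pairs.items()}
# rtc/medusa is within ~11% of reality; acrl/medusa is off by ~22%
```

By this measure the RTC/medusa model is the tighter of the two, consistent with the slide's observation that the ratios were close enough to drive an efficient load balance.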
22
Performance Model Construction
Vision: automatic performance modeling
— From the binary for a single resource and execution profiles
— Models for all target resources
Research
— Uniprocessor modeling
  – Can be extended to MPI steps
— Memory hierarchy behavior
— Role of instruction mix
  – Application-specific models
  – Scheduling
23
Performance Prediction Overview (pipeline diagram)
Static analysis: a binary analyzer extracts the control flow graph, loop nesting structure, and basic-block instruction mix from the object code.
Dynamic analysis: a binary instrumenter produces instrumented code; executing it yields basic-block counts, communication volume and frequency, and memory reuse distance.
Post-processing: a post-processing tool combines these into an architecture-neutral model; together with an architecture description, this yields a performance prediction for the target architecture, which feeds the scheduler.
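The last stage of that pipeline can be sketched as follows: an architecture-neutral model (instruction counts plus a memory-reuse profile) is combined with an architecture description to predict time on a target machine. Every number below — counts, latencies, peak rates — is invented, and the single-level memory model and overlap assumption are gross simplifications of what the real tool does.

```python
# Sketch: architecture-neutral model + architecture description ->
# predicted time on the target. All figures are illustrative.

neutral_model = {                 # from BB counts, instruction mix,
    "flops": 4.0e9,               # and the reuse-distance histogram
    "loads": 1.5e9,
    "loads_missing_l2": 2.0e8,    # loads whose reuse distance exceeds L2
}

itanium_desc = {                  # hypothetical architecture description
    "flops_per_sec": 3.6e9,
    "l2_hit_ns": 6,
    "mem_ns": 120,
}

def predict_seconds(model, arch):
    """Combine compute and memory components; assume whichever
    dominates bounds the runtime (a crude overlap model)."""
    compute = model["flops"] / arch["flops_per_sec"]
    hits = model["loads"] - model["loads_missing_l2"]
    memory = (hits * arch["l2_hit_ns"]
              + model["loads_missing_l2"] * arch["mem_ns"]) * 1e-9
    return max(compute, memory)

t = predict_seconds(neutral_model, itanium_desc)
```

Because only the architecture description changes per target, the same neutral model prices out every candidate resource, which is what the scheduler consumes.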
24
Modeling Memory Reuse Distance
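Reuse distance (stack distance) is the number of distinct addresses touched between two accesses to the same address; a histogram of these distances is what drives the memory-hierarchy models on the following slides. A minimal reference implementation on a toy trace — real tools use tree-based structures to avoid the quadratic cost of this version:

```python
# Sketch: memory reuse distance for an address trace. For each access,
# count the distinct addresses seen since the previous access to the
# same address; first (cold) accesses have no reuse distance.

def reuse_distances(trace):
    last_seen = {}
    distances = []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            # distinct addresses between the two accesses to addr
            window = set(trace[last_seen[addr] + 1 : i])
            distances.append(len(window))
        else:
            distances.append(None)   # cold access
        last_seen[addr] = i
    return distances

trace = ["a", "b", "c", "a", "b", "b"]
dists = reuse_distances(trace)       # [None, None, None, 2, 2, 0]
```

An access hits in a fully associative cache of size S exactly when its reuse distance is below S, which is why the histogram of distances predicts miss counts at every cache level at once.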
25
Memory Behavior: NAS LU 2D 3.0
26
TLB page walks increase L2 misses
27
Execution Behavior: NAS LU 2D 3.0
29
Memory Behavior: Sweep3D
30
Execution Behavior: Sweep3D
31
Memory Behavior: Sweep3D
32
Execution Behavior: Sweep3D
33
Building Workflow Graphs
Vision
— The application developer writes in a scripting language
  – Candidates: Python (EMAN), Matlab, S-PLUS/R, LabView
  – Components represent applications
— Software constructs the workflow and data movement
Research
— Related project: telescoping languages
  – Type-based specialization
  – Type analysis
— Plan: harvest this work for Grid application development
  – Just getting started
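The vision above — a script that looks like ordinary calls but actually yields a workflow graph — can be sketched with a recording front end: each call registers a node and its data dependences instead of executing immediately. The `Workflow` class and its `task` method are invented for illustration; the step names echo EMAN-style stages but this is not EMAN's API.

```python
# Sketch: a scripting front end that builds a workflow graph. Calls
# record computations and data-movement edges rather than running them.

class Workflow:
    def __init__(self):
        self.nodes, self.edges = [], []

    def task(self, name, *inputs):
        """Register a computation; each input adds a data-movement
        edge from the producing step (or source data) to this one."""
        self.nodes.append(name)
        for inp in inputs:
            self.edges.append((inp, name))
        return name                  # handle usable as a later input

wf = Workflow()
aligned = wf.task("classalign", "raw_particles")
classes = wf.task("classesbymra", aligned)
model3d = wf.task("make3d", classes)
```

The resulting node list and edge set are exactly the directed graph the scheduler slides operate on, so the developer writes a linear script while the tools see the whole workflow.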