Published by Nelson Gilbert. Modified over 9 years ago.
1
30.09.2004, Computing in High Energy Physics - 2004, Interlaken, Switzerland
Sphinx: A Scheduling Middleware for Data Intensive Applications on a Grid
Jang Uk In, Paul Avery, Richard Cavanaugh, Laukik Chitnis, Mandar Kulkarni, Sanjay Ranka
University of Florida
2
The Problem of Grid Scheduling
o Decentralised ownership
  o No one controls the grid
o Heterogeneous composition
  o Difficult to guarantee execution environments
o Dynamic availability of resources
  o Ubiquitous monitoring infrastructure needed
o Complex policies
  o Issues of trust
  o Lack of accounting infrastructure
  o May change with time
o Information gathering and processing is critical!
3
A Real Life Example
o Merge two grids into a single multi-VO “Inter-Grid”
o How to ensure that
  o neither VO is harmed?
  o both VOs actually benefit?
  o there are answers to questions like:
    o “With what probability will my job be scheduled and complete before my conference deadline?”
o Clear need for a scheduling middleware!
[Figure labels (sites): FNAL, Rice, UI, MIT, UCSD, UF, UW, Caltech, UM, UTA, ANL, IU, UC, LBL, SMU, OU, BU, BNL]
4
Emerging Challenge: Intra-VO Management
o Today’s Grid
  o Few Production Managers
o Tomorrow’s Grid
  o Few Production Managers
  o Many Analysis Users
o How to ensure that
  o “Handles” exist for the VO to “throttle” different priorities
    o Production vs. Analysis
    o User-A vs. User-B
  o The VO is able to
    o “Inventory” all resources currently available to it over some time period
    o Strategically plan for its own use of those resources during that time period
5
Some Requirements for Effective Grid Scheduling
o Information requirements
  o Past & future dependencies of the application
    o Persistent storage of workflows
  o Resource usage estimation
  o Policies
    o Expected to vary slowly over time
  o Global views of job descriptions
  o Request Tracking and Usage Statistics
    o State information important
  o Resource Properties and Status
    o Expected to vary slowly with time
  o Grid weather
    o Latency measurement important
  o Replica management
o System requirements
  o Distributed, fault-tolerant scheduling
  o Customisability
  o Interoperability with other scheduling systems
  o Quality of Service
6
Incorporate Requirements into a Framework
o Assume the GriPhyN Virtual Data Toolkit:
  o Client (request/job submission)
    o Globus clients
    o Condor-G/DAGMan
    o Chimera Virtual Data System
  o Server (resource gatekeeper)
    o MonALISA Monitoring Service
    o Globus services
    o RLS (Replica Location Service)
[Figure labels: VDT Client, VDT Server, ? ? ?]
7
Incorporate Requirements into a Framework
o Assume the GriPhyN Virtual Data Toolkit:
  o Client (request/job submission)
    o Clarens Web Service
    o Globus clients
    o Condor-G/DAGMan
    o Chimera Virtual Data System
  o Server (resource gatekeeper)
    o MonALISA Monitoring Service
    o Globus services
    o RLS (Replica Location Service)
o Framework design principles:
  o Information driven
  o Flexible client-server model
  o General, but pragmatic and simple
    o Implement now; learn; extend over time
  o Avoid adding middleware requirements on grid resources
    o Take what is offered!
[Figure labels: VDT Client, VDT Server, ?, Recommendation Engine]
8
The Sphinx Framework
[Figure labels: Sphinx Server, VDT Client, VDT Server Site, MonALISA Monitoring Service, Globus Resource, Replica Location Service, Condor-G/DAGMan, Request Processing, Data Warehouse, Data Management, Information Gathering, Sphinx Client, Chimera Virtual Data System, Clarens WS Backbone]
9
Sphinx Scheduling Server
o Functions as the Nerve Centre
o Data Warehouse
  o Policies, Account Information, Grid Weather, Resource Properties and Status, Request Tracking, Workflows, etc.
o Control Process
  o Finite State Machine
    o Different modules modify jobs, graphs, workflows, etc. and change their state
  o Flexible
  o Extensible
[Figure labels: Sphinx Server, Control Process, Job Execution Planner, Graph Reducer, Graph Tracker, Job Predictor, Graph Data Planner, Job Admission Control, Message Interface, Graph Predictor, Graph Admission Control, Data Warehouse, Data Management, Information Gatherer]
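The control process described on this slide can be pictured with a minimal Python sketch. The slide only says that modules change the state of jobs inside a finite state machine; the state names, the `Job` class and the `make_module` helper below are hypothetical and are not taken from Sphinx itself.

```python
from dataclasses import dataclass, field

# Hypothetical state names: each key is a state, each value the state that follows it.
TRANSITIONS = {
    "unreduced": "reduced",
    "reduced": "predicted",
    "predicted": "admitted",
    "admitted": "planned",
    "planned": "submitted",
}


@dataclass
class Job:
    name: str
    state: str = "unreduced"
    history: list = field(default_factory=list)


def make_module(handles_state):
    """Build a module that only acts on jobs in the one state it is responsible for."""
    def module(job):
        if job.state != handles_state:
            return False  # not this module's job, leave it untouched
        job.history.append(job.state)
        job.state = TRANSITIONS[handles_state]
        return True
    return module


# One illustrative module per non-final state.
MODULES = [make_module(state) for state in TRANSITIONS]


def control_process(jobs):
    """Offer every job to every module until no module changes anything."""
    progressed = True
    while progressed:
        progressed = False
        for job in jobs:
            for module in MODULES:
                if module(job):
                    progressed = True
    return jobs


jobs = control_process([Job("job-a"), Job("job-b")])
```

Because each module only looks at the state it handles, a new module can be added by registering one more state and transition, which is one way to read the "flexible" and "extensible" claims.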
10
Quality of Service
o For grid computing to become economically viable, a Quality of Service is needed
  o “Can the grid possibly handle my request within my required time window?”
  o If not, why not? When might it be able to accommodate such a request?
  o If yes, with what probability?
o But, grid computing today typically:
  o Relies on “greedy” job placement strategies
    o Works well in a resource rich (user poor) environment
    o Assumes no correlation between job placement choices
  o Provides no QoS
o As a grid becomes resource limited,
  o QoS becomes more important!
    o “greedy” strategies not always a good choice
    o Strong correlation between job placement choices
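One simple way to put numbers on the two QoS questions ("with what probability?" and "when might it be possible?") is an empirical estimate from past completion times. The sketch below is an assumption for illustration only: the slides do not say how Sphinx computes such probabilities, and the history values are made up.

```python
import math


def deadline_probability(past_completion_times, deadline):
    """Fraction of past, comparable requests that finished within `deadline` seconds."""
    if not past_completion_times:
        raise ValueError("no history to estimate from")
    met = sum(1 for t in past_completion_times if t <= deadline)
    return met / len(past_completion_times)


def deadline_for_probability(past_completion_times, target):
    """Smallest observed completion time that at least `target` of past requests met."""
    ordered = sorted(past_completion_times)
    index = max(math.ceil(target * len(ordered)) - 1, 0)
    return ordered[index]


# Made-up completion times in seconds for earlier, similar requests.
history = [1200, 1500, 1800, 2400, 3600, 4000, 5200, 7000]

p = deadline_probability(history, deadline=3600)        # "If yes, with what probability?"
relaxed = deadline_for_probability(history, target=0.9)  # "When might it be able to?"
```

With this history, five of the eight past requests finished within 3600 seconds, so `p` is 0.625, and a 90% target is only met by relaxing the deadline to 7000 seconds.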
11
Policy Constraints
o Defined by Resource Providers
  o Actual grid sites (resource centres)
  o VO management
o Applied to Request Submitters
  o VO, group, user, or even a proxy request (e.g. workflow)
o Valid over a Period of Time
  o Can be dynamic (e.g. periodic) or constant
o Hierarchical/Recursive
  o VOs ↔ Sub-VOs ↔ Users ↔ Proxies
  o Grids ↔ CEs, SEs ↔ Machines
  o Months ↔ Weeks ↔ Days
o VO accounting and book-keeping is necessary
[Figure axes: Submitters, Resources, Time]
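The hierarchical nature of these policies can be sketched as a quota check that walks up the submitter hierarchy. Everything named below (`PARENT`, `QUOTA`, `USED`, the submitter names and the CPU-hour figures) is hypothetical, and the sketch covers a single resource and a single time period rather than the full submitter, resource and time space on the slide.

```python
# Hypothetical submitter hierarchy: user -> sub-VO -> VO.
PARENT = {"user-a": "sub-vo", "sub-vo": "vo", "vo": None}

# Hypothetical CPU-hour quotas and usage for one resource over one period.
QUOTA = {"vo": 1000.0, "sub-vo": 300.0, "user-a": 100.0}
USED = {"vo": 800.0, "sub-vo": 250.0, "user-a": 20.0}


def chain(submitter):
    """Yield the submitter and then each of its ancestors in the hierarchy."""
    while submitter is not None:
        yield submitter
        submitter = PARENT[submitter]


def headroom(submitter):
    """Remaining CPU-hours, limited by the tightest quota anywhere up the chain."""
    return min(QUOTA[s] - USED[s] for s in chain(submitter))


def admit(submitter, cpu_hours):
    """Admit a request only if every level of the hierarchy can absorb it."""
    if cpu_hours > headroom(submitter):
        return False
    for s in chain(submitter):
        USED[s] += cpu_hours  # book-keeping at every level
    return True


first = admit("user-a", 40.0)   # fits: the sub-VO still has 50 CPU-hours left
second = admit("user-a", 20.0)  # rejected: the sub-VO now has only 10 left
```

The second request is refused even though the user is well inside their own quota, which is the point of the recursive structure: the binding limit can sit at any level.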
12
Policy Based Scheduling
o Sphinx provides “soft” QoS through time dependent, global views of
  o Submissions (workflows, jobs, allocation, etc.)
  o Policies
  o Resources
o Uses Linear Programming Methods
  o Satisfy Constraints
    o Policies, User-requirements, etc.
  o Optimise an “objective” function
  o Provides the QoS
o GOAL: Estimate probabilities to meet deadlines within policy constraints
[Figure: Policy Space, with axes Submissions, Resources, Time]
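To make "satisfy constraints, optimise an objective" concrete, here is a toy two-site allocation written as a linear program. The slides do not give the actual Sphinx formulation, so the quotas, costs and variable meanings are invented. The solver is a deliberately naive vertex enumeration that only handles two variables; it stands in for a real LP library.

```python
from itertools import combinations


def solve_lp_2d(cost, constraints, eps=1e-9):
    """Minimise cost[0]*x + cost[1]*y subject to a*x + b*y <= c for each (a, b, c).

    Checks every intersection of two constraint boundaries, which is enough for a
    bounded two-variable LP because an optimum always lies at a vertex.
    Returns (objective value, x, y), or None if no feasible vertex exists.
    """
    best = None
    for (a1, b1, c1), (a2, b2, c2) in combinations(constraints, 2):
        det = a1 * b2 - a2 * b1
        if abs(det) < eps:
            continue  # parallel boundaries have no single intersection point
        # Cramer's rule for the 2x2 system formed by the two boundaries.
        x = (c1 * b2 - c2 * b1) / det
        y = (a1 * c2 - a2 * c1) / det
        if all(a * x + b * y <= c + eps for a, b, c in constraints):
            value = cost[0] * x + cost[1] * y
            if best is None or value < best[0]:
                best = (value, x, y)
    return best


# Hypothetical problem: place 100 jobs on site A (x) and site B (y).
constraints = [
    (1, 0, 60),       # policy quota: at most 60 jobs on site A
    (0, 1, 70),       # policy quota: at most 70 jobs on site B
    (-1, -1, -100),   # user requirement: x + y >= 100, all jobs must be placed
    (-1, 0, 0),       # x >= 0
    (0, -1, 0),       # y >= 0
]
cost = (2, 3)  # assumed expected hours per job on A and on B

value, jobs_a, jobs_b = solve_lp_2d(cost, constraints)
```

The cheaper site is filled up to its quota (60 jobs) and the remaining 40 jobs go to the other site. A realistic problem would have many more variables, for which a proper simplex or interior-point solver is the appropriate tool.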
13
Policy Based Scheduling Simulations using LP
[Figure panels: Highly biased usage quotas; Initial workload; Resource allocations; Changed workload]
14
Effective Use of Monitoring
o Primary useful Real-time observables
  o Queue-depth on Compute Elements
    o Number of idle jobs
    o Number of running jobs
    o Number of available CPU slots
  o “df” on Storage Elements
  o RTT between data transfer points
o Research:
  o Estimate and manage information latency
  o Forecast grid weather
    o Use data-mining techniques
      o Aggregate/Fusion
      o Statistical
      o Mathematical Modelling
    o No better (or worse?) than forecasting Hurricanes!
[Figure: Number of DAGs vs. Completion Time (seconds); series: Sphinx Using Grid3 no Monitoring, Sphinx Using Grid3 with Monitoring]
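As a small illustration of the "statistical" end of grid-weather forecasting, the sketch below smooths queue-depth samples with an exponentially weighted moving average and reports how old the newest sample is. The `QueueSample` record, the smoothing factor and the sample values are assumptions; the slides list this as research and do not specify a method.

```python
from dataclasses import dataclass


@dataclass
class QueueSample:
    """One monitoring reading for a Compute Element (field names are hypothetical)."""
    timestamp: float   # seconds since some reference time
    idle_jobs: int
    running_jobs: int
    free_slots: int


def forecast_idle_jobs(samples, alpha=0.5):
    """Exponentially weighted moving average of idle jobs, newest samples weighted most."""
    ordered = sorted(samples, key=lambda s: s.timestamp)
    estimate = float(ordered[0].idle_jobs)
    for sample in ordered[1:]:
        estimate = alpha * sample.idle_jobs + (1 - alpha) * estimate
    return estimate


def information_latency(samples, now):
    """Age in seconds of the freshest sample, i.e. how stale our view of the site is."""
    return now - max(s.timestamp for s in samples)


# Made-up readings taken one minute apart.
samples = [
    QueueSample(timestamp=0.0, idle_jobs=10, running_jobs=40, free_slots=8),
    QueueSample(timestamp=60.0, idle_jobs=20, running_jobs=45, free_slots=3),
    QueueSample(timestamp=120.0, idle_jobs=30, running_jobs=48, free_slots=0),
]

expected_idle = forecast_idle_jobs(samples)         # 22.5 with alpha = 0.5
staleness = information_latency(samples, now=150.0)  # 30.0 seconds
```

A scheduler could discount or ignore a site whose `staleness` exceeds some threshold, which is one possible reading of "manage information latency".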
15
Sphinx Test Results from Grid3
o Experimental Set Up: Two competing Sphinx Servers
  o One without monitoring
    o LP based scheduling
    o Round-robin load-balancer
  o One with monitoring
    o LP based scheduling
    o Time dependent objective function
o Submitted the same 30 ten-step workflows to each Sphinx Server simultaneously
  o 300 total jobs per Sphinx Server
o Exploited 13 sites across Grid3
  o Each site had to pass a tight stability “cut”
o Grid3 was “loaded” with
  o ATLAS DC2 Production
  o Other iVDGL work
o To first order, Sphinx minimised the time spent waiting in the remote CE queue
o Real-time, knowledge based decisions are important!!
[Figure: Number of Jobs vs. Idle Time (seconds); series: Sphinx Using Grid3 no Monitoring, Sphinx Using Grid3 with Monitoring (queue-depth only)]
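The contrast between the two servers can be illustrated with a toy comparison of round-robin placement against placement that consults queue depth. This is not a reproduction of the Grid3 experiment: the three sites, their starting queue depths and the "jobs ahead in the queue" proxy for idle time are all invented for illustration.

```python
from itertools import cycle


def place_round_robin(queue_depth, n_jobs):
    """Place jobs on sites in fixed rotation, ignoring how busy each site is."""
    depth = dict(queue_depth)
    total_ahead = 0
    sites = cycle(list(depth))
    for _ in range(n_jobs):
        site = next(sites)
        total_ahead += depth[site]  # jobs already waiting ahead of this one
        depth[site] += 1
    return total_ahead


def place_by_queue_depth(queue_depth, n_jobs):
    """Place each job on the site whose queue is currently shortest."""
    depth = dict(queue_depth)
    total_ahead = 0
    for _ in range(n_jobs):
        site = min(depth, key=depth.get)
        total_ahead += depth[site]
        depth[site] += 1
    return total_ahead


# Made-up starting queue depths (idle jobs already waiting at each site).
initial = {"site-a": 0, "site-b": 5, "site-c": 20}

blind = place_round_robin(initial, n_jobs=6)
informed = place_by_queue_depth(initial, n_jobs=6)
```

In this toy case the blind strategy puts 53 jobs in total ahead of the six new ones, while the queue-aware strategy puts only 15 ahead, which matches the direction of the slide's conclusion about time spent waiting in the remote queue.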
16
Conclusions
o Scheduling on a grid has unique requirements
  o Information
  o System
o Decisions based on global views providing a Quality of Service are important
  o Particularly in a resource limited environment
o Sphinx is an extensible, flexible grid middleware which
  o Already implements many required features for effective global scheduling
  o Provides an excellent “workbench” for future activities!
o For more information, please visit
  o http://www.griphyn.org/sphinx/