July, 2000.Simulation of distributed computing systems I.C. Legrand1 MONARC Models Of Networked Analysis at Regional Centers Iosif C. Legrand (CALTECH)

July, 2000.Simulation of distributed computing systems I.C. Legrand1 MONARC Models Of Networked Analysis at Regional Centers Iosif C. Legrand (CALTECH) Simulation of distributed computing systems

July, 2000.Simulation of distributed computing systems I.C. Legrand2 u The Design and the Development of a Simulation program for large scale distributed computing systems. u Validation tests è Queuing Theory è Performance measurements based on an Object Oriented data model for specific HEP applications. u A large system simulation: the CMS HLT Farm. u Proposal for a Dynamic, Self Organising scheduling system. u Summary Contents

July, 2000.Simulation of distributed computing systems I.C. Legrand3 The GOALS of the Simulation Program u To perform realistic simulation and modelling of large scale distributed computing systems, customised for specific HEP applications. u To provide a design framework to evaluate the performance of a range of possible computer systems, as measured by their ability to provide the physicists with the requested data in the required time, and to optimise the cost. u To narrow down a region in this parameter space in which viable models can be chosen by any of the LHC-era experiments. u To offer a dynamic and flexible simulation environment.

July, 2000.Simulation of distributed computing systems I.C. Legrand4 The Monarc Simulation Program HThis Simulation program is not intended to be a detailed simulator for basic components such as operating systems, data base servers or routers. Instead, based on realistic mathematical models and measured parameters on test bed systems for all the basic components, it aims to correctly describe the performance and limitations of large distributed systems with complex interactions. HAt the same time it provides a flexible framework for evaluating different strategies for middleware software design, providing dynamic load balancing and optimising resource utilisation as well as turnaround time for high priority tasks.

July, 2000.Simulation of distributed computing systems I.C. Legrand5 Design Considerations of the Simulation Program è The simulation and modelling task for the MONARC project requires to describe many complex programs running concurrently in a distributed architecture. è A process oriented approach for discrete event simulation is well suited to describe concurrent running programs. k “Active objects” (having an execution thread, a program counter, stack...) provide an easy way to map the structure of a set of distributed running programs into the simulation environment.

July, 2000.Simulation of distributed computing systems I.C. Legrand6 Design Considerations of the Simulation Program (2) u This simulation project is based on Java (TM) technology which provides adequate tools for developing a flexible and distributed process oriented simulation. Java has built-in multi-thread support for concurrent processing, which can be used for simulation purposes by providing a dedicated scheduling mechanism. u The distributed objects support (through RMI or CORBA) can be used on distributed simulations, or for an environment in which parts of the system are simulated and interfaced through such a mechanism with other parts which actually are running the real application. The distributed object model can also provide the environment to be used for autonomous mobile agents.

July, 2000.Simulation of distributed computing systems I.C. Legrand7 Data Model It provides: uRealistic mapping for an object data base uSpecific HEP data structure uTransparent access to any data uAutomatic storage management uAn efficient way to handle very large number of objects. uEmulation of clustering factors for different types of access patterns. uHandling related objects in different data bases.

July, 2000.Simulation of distributed computing systems I.C. Legrand8 Multitasking Processing Model Concurrent running tasks share resources (CPU, memory, I/O) “ Interrupt” driven scheme: For each new task or when one task is finished, an interrupt is generated and all “processing times” are recomputed. It provides: Handling of concurrent jobs with different priorities. An efficient mechanism to simulate multitask processing. An easy way to apply different load balancing schemes.

July, 2000.Simulation of distributed computing systems I.C. Legrand9 “Interrupt” driven simulation  for each new message an interrupt is created and for all the active transfers the speed and the estimated time to complete the transfer are recalculated. An efficient and realistic way to simulate concurrent transfers having different sizes / protocols. LAN/WAN Simulation Model

July, 2000.Simulation of distributed computing systems I.C. Legrand10 Arrival Patterns A flexible mechanism to define the Stochastic process of how users perform data processing tasks Dynamic loading of “Activity” tasks, which are threaded objects and are controlled by the simulation scheduling mechanism Physics Activities Injecting “Jobs” Each “Activity” thread generates data processing jobs for( int k =0; k< jobs_per_group; k++) { Job job = new Job( this, Job.ANALYSIS, "TAG”, 1, events_to_process); farm.addJob(job ); // submit the job sim_hold ( 1000 ); // wait 1000 s } Regional Centre Farm Job Activity Job Activity êThese dynamic objects are used to model the users behavior

July, 2000.Simulation of distributed computing systems I.C. Legrand11 Regional Centre Model Complex Composite Object

July, 2000.Simulation of distributed computing systems I.C. Legrand12 Input Parameters for the Simulation Program Response functions are based on “the previous state” of the component, a set of system related parameters (SysP) and parameters for a specific request (ReqP). Such a time response function allows to describe correctly Highly Nonlinear Processes or “Chaotic” Systems behavior (typical for caching, swapping…) It is important to correctly identify and describe the time response functions for all active components in the system. This should be done using realistic measurements. The simulation frame allows one to introduce any time dependent response function for the interacting components.

July, 2000.Simulation of distributed computing systems I.C. Legrand13 Simulation GUI Simulation GUI One may dynamically add and (re)configure Regional Centers parameters

July, 2000.Simulation of distributed computing systems I.C. Legrand14 Simulation GUI (2) Simulation GUI (2) On-line monitoring for major parameters in the simulation. Tools to analyze the data.

July, 2000.Simulation of distributed computing systems I.C. Legrand15 Results repository and the “publishing” procedure Results repository and the “publishing” procedure Web Server afs/nfs file system RMI Server Write Objects

July, 2000.Simulation of distributed computing systems I.C. Legrand16 Queueing theory (1) M | M | 1 Model Queueing theory (1) M | M | 1 Model E[S] arrivalswaitingin service E[N] Mean number of jobs E[R] Mean response time

July, 2000.Simulation of distributed computing systems I.C. Legrand17 Queueing theory (2) M | M | 1 network queue model

July, 2000.Simulation of distributed computing systems I.C. Legrand18 Validation Measurements I The AMS Data Access Case Validation Measurements I The AMS Data Access Case SimulationMeasurements Raw DataDB LAN 4 CPUs Client

July, 2000.Simulation of distributed computing systems I.C. Legrand19 Validation Measurements I Validation Measurements I Local DB access 32 jobs The Distribution of the jobs processing time Simulation mean 109.5 Measurement mean 114.3

July, 2000.Simulation of distributed computing systems I.C. Legrand20 Validation Measurements I Measurements & Simulation Validation Measurements I Measurements & Simulation SimulationMeasurements

July, 2000.Simulation of distributed computing systems I.C. Legrand21 Validation Measurements II Validation Measurements II 2Mbps WAN Client - Server CPU usage Data Traffic 12 3 4 5 6

July, 2000.Simulation of distributed computing systems I.C. Legrand22 Simple Example: Resource Utilisation vs. Job’s Response Time Mean 0.55Mean 0.72Mean 0.93 Physics Analysis Example 180 CPUs200 CPUs250 CPUs

July, 2000.Simulation of distributed computing systems I.C. Legrand23 The CMS HLT Farm Simulation Pileup DB Pileup DB Pileup DB Pileup DB Pileup DB HPSS Pileup DB Pileup DB Signal DB Signal DB Signal DB... 6 Servers for Signal Output Server Output Server Lock Server Lock Server SUN... FARM 140 Processing Nodes 17 Servers 9 Servers Total 24 Pile Up Servers 2 Objectivity Federations

July, 2000.Simulation of distributed computing systems I.C. Legrand24 The HLT Farm Configuration HDistribute Pileup over 24 Linux servers HOther data over 6 Linux servers (70GB disk each) H2 Linux stations used for federations (Metadata, catalog etc) H2 Linux stations used for Journal files (used in locking) Hshift20 (SUN) Used for lockserving and Output (2 x ~250GB disks) The strategy is to use many commodity PCs as data Base Servers

July, 2000.Simulation of distributed computing systems I.C. Legrand25 Network Traffic & Job efficiency Mean measured Value ~48MB/s Measurement Simulation Jet Muon

July, 2000.Simulation of distributed computing systems I.C. Legrand26 Total Time for Jet & Muon Production Jobs

July, 2000.Simulation of distributed computing systems I.C. Legrand27 CPU LOAD Jet Production Job Muon Production Job Read Write

July, 2000.Simulation of distributed computing systems I.C. Legrand28 A Self-Organising Job Scheduling System The aim of this proposal is to describe a possible approach for the scheduling task, as a system able to dynamically learn and cluster information in a large dimensional parameter space. This dynamic scheduling system should be seen as an adaptive middle layer software, aware of current available resources and based on the “past experience” to optimise the job performance and resource utilisation.

July, 2000.Simulation of distributed computing systems I.C. Legrand29 A self organising model This approach is based on using the “past experience” from jobs that have been executed to create a dynamic decision making scheme. A competitive learning algorithm can be used to “cluster” correlated information in the multi-dimensional space. In our case, a feature mapping architecture able to map a high- dimensional input space into a much lower-dimensional structure in such a way that most of similarly correlated patters in the original data remain in the mapped space. Such a clustering scheme seems possible for this decision making scheme as we expert a strong correlation between the parameters involved. Compared with an “intuitive” model we expect that such an approach will offer a better way to analyse possible options and can evolve and improve itself dynamically.

July, 2000.Simulation of distributed computing systems I.C. Legrand30 A simple toy example We assume that the time to execute a job in the local farm having a certain load (  ) is: Where t 0 is the theoretical time to perform the job and f(  ) describes the effect of the farm load in the job execution time. If the job is executed on a remote site, an extra factor (  ) is introduced reducing the response time:

July, 2000.Simulation of distributed computing systems I.C. Legrand31 Clustering of the Self Organising Map

July, 2000.Simulation of distributed computing systems I.C. Legrand32 Evaluating the self organising scheduling with the simulation tool DECISION Intuitive scheduling scheme RCs SIMULATION Activities Simulation Job “Self Organizing Map” Job Warming up The decision for future jobs is based on identifying the clusters in the total parameter space, which are close to the hyper-plane defined in this space by the {Job} {System} subset. In this way the decision can be done evaluating the typical performances of this list of close clusters and chose a decision set which meets the expected / available performance / resources and cost.

July, 2000.Simulation of distributed computing systems I.C. Legrand33 Summary kA CPU- and code-efficient approach for the simulation of distributed systems has been developed for the MONARC project. èprovides an easy way to map the distributed data processing, transport and analysis tasks onto the simulation. ècan handle dynamically any model configuration, including very elaborate ones with hundreds of interacting complex objects. ècan run on real distributed computer systems, and may interact with real components. èThese results are encouraging and motivate us to continue developing this simulation tool. èModelling and understanding current systems, their performance and limitations, is essential for the design of the future large scale distributed processing systems. èA frame to evaluate different strategies for the middle layer software to optimize resource utilization in such distributed systems is under development.

July, 2000.Simulation of distributed computing systems I.C. Legrand1 MONARC Models Of Networked Analysis at Regional Centers Iosif C. Legrand (CALTECH)

Similar presentations

Presentation on theme: "July, 2000.Simulation of distributed computing systems I.C. Legrand1 MONARC Models Of Networked Analysis at Regional Centers Iosif C. Legrand (CALTECH)"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

July, 2000.Simulation of distributed computing systems I.C. Legrand1 MONARC Models Of Networked Analysis at Regional Centers Iosif C. Legrand (CALTECH)

Similar presentations

Presentation on theme: "July, 2000.Simulation of distributed computing systems I.C. Legrand1 MONARC Models Of Networked Analysis at Regional Centers Iosif C. Legrand (CALTECH)"— Presentation transcript:

Similar presentations

About project

Feedback