MONARC 2 - distributed systems simulation -
Ciprian Mihai Dobre, Corina Stratan
POLITEHNICA University of Bucharest
California Institute of Technology
National Center for Information Technology
The Goals of the Project
- To perform realistic simulation and modelling of large scale distributed computing systems, customised for specific large scale HEP applications.
- To provide a design framework to evaluate the performance of a range of possible computer systems, as measured by their ability to provide the physicists with the requested data in the required time, and to optimise the cost.
- To narrow down a region in this parameter space in which viable models can be chosen by any of the future LHC-era experiments.
- To offer a dynamic and flexible simulation environment.
LHC Computing: Different from Previous Experiment Generations
One of the four LHC detectors (CMS).
Online system: a multi-level trigger filters out background and reduces the data volume.
- level 1 - special hardware
- level 2 - embedded processors
- level 3 - PCs
Rates shown along the trigger chain: 40 MHz (40 TB/sec), 75 kHz (75 GB/sec), 5 kHz (5 GB/sec), 100 Hz ( MB/sec).
Data processing, offline analysis, selection.
Raw recording rate 0.1 – 1 GB/sec; PetaBytes / year.
CMS Off-Line LHC Computing Data Analysis
- Geographical dispersion: of people and resources.
- Complexity: the detector and the LHC environment.
- Scale: ~100 times more processing power; Petabytes per year of data.
1800 Physicists, 150 Institutes, 32 Countries.
This is a very large scale distributed system, and it has to provide (near) real-time data access for all the participants.
Regional Center Hierarchy (Worldwide Data Grid)
[Figure: tier hierarchy. Tier 0 +1: Online System, Offline Farm, CERN Computer Center. Tier 1: France Center, FNAL Center, Italy Center, UK Center. Tier 2: Tier2 Centers. Tier 3: Institutes (~0.25TIPS). Tier 4: Workstations. A physics data cache is also shown. Link labels: ~PBytes/sec, 100–1000 MBytes/sec, ~2.4 Gbits/sec, ~622 Mbits/sec, ~ Gbits/sec, Mbits/sec.]
- One bunch crossing every 25 nsecs.
- An event is ~1 MByte in size.
- Physicists work on analysis “channels”.
- Processing power: ~200,000 of today’s fastest PCs.
Simulation Models
The simulation model:
- abstracts the components of the real system and their interactions
- must be equivalent to the simulated system
Simulation models:
- continuous time - the system is described by a set of differential equations
- discrete time - the state changes only at certain moments in time
In MONARC: one of the discrete time models (Discrete Event Simulation – DES); the events represent important activities in the system and are managed with the aid of an internal clock.
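The discrete event idea can be sketched in a few lines of Java. This is a minimal illustration, not the MONARC engine: the class `MiniDes`, its methods and the event labels are all hypothetical names. The point it shows is that the internal clock does not tick; it jumps directly to the time stamp of the next event in the queue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Minimal discrete event simulation loop (illustrative; all names are hypothetical).
final class MiniDes {
    // An event: a time stamp on the simulated clock plus a label.
    static final class Event implements Comparable<Event> {
        final double time;
        final String name;

        Event(double time, String name) {
            this.time = time;
            this.name = name;
        }

        @Override
        public int compareTo(Event other) {
            return Double.compare(time, other.time);
        }
    }

    private final PriorityQueue<Event> queue = new PriorityQueue<>();
    private double clock = 0.0; // the internal simulation clock

    // Schedule an event `delay` time units after the current simulated time.
    void schedule(double delay, String name) {
        if (delay < 0) throw new IllegalArgumentException("delay must be >= 0");
        queue.add(new Event(clock + delay, name));
    }

    // Process events in time order; the clock jumps straight to each event's time.
    List<String> run() {
        List<String> trace = new ArrayList<>();
        while (!queue.isEmpty()) {
            Event e = queue.poll();
            clock = e.time;
            trace.add(clock + ":" + e.name);
        }
        return trace;
    }

    double now() {
        return clock;
    }

    public static void main(String[] args) {
        MiniDes sim = new MiniDes();
        sim.schedule(5.0, "job finished");
        sim.schedule(1.0, "job submitted");
        sim.schedule(3.0, "transfer done");
        sim.run().forEach(System.out::println); // events come out in time order
    }
}
```

Nothing happens between two events, so the cost of a run depends on the number of events, not on the length of the simulated time interval.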
A Global View for Modelling
[Figure: layered view. Layers: Simulation Engine, Basic Components, Specific Components, Computing Models. Component labels: LAN, WAN, DB, CPU, Scheduler, Job, Catalog, Analysis Jobs, Distributed Scheduler, MetaData. Alongside: MONITORING of REAL Systems and Testbeds.]
Regional Center Model
[Figure: a REGIONAL CENTER connected to the LAN and the WAN. Labels: Activity, Job, Job Scheduler; a FARM of CPU units, each running AJob instances and attached through a Link Port; DB Servers, each with a Link Port; DB Index.]
The Simulation Engine
- Provides the multithreading mechanism for the simulation.
- The entities with time dependent behavior are mapped on “active objects”.
- In the simulation engine: management of active objects and events.
- Thread reusability (thread pool).
[Figure: class diagram. Labels: Engine, Scheduler, Task, Event, EventQueue, WorkerThread, Pool, Activity, JobScheduler, Farm, CPUUnit, AJob, Job.]
Multitasking Processing Model
Concurrent running tasks share resources (CPU, memory, I/O).
“Interrupt” driven scheme: for each new task, or when one task is finished, an interrupt is generated and all “processing times” are recomputed.
It provides:
- Handling of concurrent jobs with different priorities.
- An efficient mechanism to simulate multitask processing.
- An easy way to apply different load balancing schemes.
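A sketch of the interrupt driven scheme, under stated assumptions: a single CPU of fixed power is shared among the running tasks in proportion to a priority weight. The class `SharedCpu`, the `weight` field and the sharing rule are hypothetical choices for illustration; the slides do not say how MONARC divides the CPU. At every interrupt (task arrival or completion) the remaining work of each task is updated, so every estimated finish time changes.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; class, method and field names are hypothetical.
final class SharedCpu {
    static final class Task {
        final String name;
        final double weight; // assumed priority weight: larger weight, larger CPU share
        double remaining;    // work still to be done, in arbitrary units

        Task(String name, double work, double weight) {
            this.name = name;
            this.remaining = work;
            this.weight = weight;
        }
    }

    private final double power; // work units processed per second
    private final List<Task> running = new ArrayList<>();
    private double clock = 0.0; // time of the last interrupt

    SharedCpu(double power) {
        this.power = power;
    }

    private double totalWeight() {
        double sum = 0.0;
        for (Task t : running) sum += t.weight;
        return sum;
    }

    // Processing rate a running task receives under the current mix of tasks.
    double rateOf(Task t) {
        return power * t.weight / totalWeight();
    }

    // Time at which a running task would finish if nothing else changed.
    double estimatedFinish(Task t) {
        return clock + t.remaining / rateOf(t);
    }

    // "Interrupt" step: charge every running task for the work done since the last interrupt.
    private void advanceTo(double now) {
        if (now < clock) throw new IllegalArgumentException("time cannot go backwards");
        double total = totalWeight();
        for (Task t : running) {
            double done = (now - clock) * power * t.weight / total;
            t.remaining = Math.max(0.0, t.remaining - done);
        }
        clock = now;
    }

    // A new task arrives; `now` must not be later than the next completion.
    void add(double now, Task t) {
        advanceTo(now);
        running.add(t);
    }

    // Advance to the earliest completion, remove that task and return it.
    Task finishNext() {
        if (running.isEmpty()) throw new IllegalStateException("no running task");
        Task first = running.get(0);
        for (Task t : running) {
            if (estimatedFinish(t) < estimatedFinish(first)) first = t;
        }
        advanceTo(estimatedFinish(first));
        running.remove(first);
        return first;
    }

    double now() {
        return clock;
    }
}
```

For example, with a power of 100 units/s, a task of 1000 units running alone would finish at t = 10 s. If a second task of 300 units with the same weight arrives at t = 4 s, the estimate for the first task moves to 16 s; once the second task completes at 10 s, the first one gets the whole CPU again and finishes at 13 s.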
Engine tests
Processing a TOTAL of simple jobs in 1, 10, 100, 1000, 2000, 4000, CPUs (number of CPUs = number of parallel threads).
More tests:
Job Scheduling
- Dynamically loadable modules for each regional center.
- Basic job scheduler: assigns the jobs to CPUs from the local farm.
- More complex schedulers: allow job migration between regional centers.
[Figure: Site A, where the JobScheduler, a dynamically loadable module, drives the CPU FARM.]
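One common way to obtain a dynamically loadable module in Java is to pick the implementation by class name at run time through reflection. The slides do not say how MONARC loads its schedulers, so the sketch below is an assumption throughout: the interface `JobSchedulerModule`, the class `LeastLoadedCpu` and the helper `SchedulerLoader` are hypothetical names. The basic scheduler here simply assigns a job to the least loaded CPU of the local farm.

```java
// Illustrative sketch; every type below is a hypothetical name, not a MONARC class.

// Contract that each scheduler module is assumed to implement.
interface JobSchedulerModule {
    // Returns the index of the CPU that should receive the next job.
    int pickCpu(double[] cpuLoads);
}

// A basic scheduler: choose the least loaded CPU of the local farm.
final class LeastLoadedCpu implements JobSchedulerModule {
    LeastLoadedCpu() {
    }

    @Override
    public int pickCpu(double[] cpuLoads) {
        if (cpuLoads.length == 0) throw new IllegalArgumentException("empty farm");
        int best = 0;
        for (int i = 1; i < cpuLoads.length; i++) {
            if (cpuLoads[i] < cpuLoads[best]) best = i;
        }
        return best;
    }
}

final class SchedulerLoader {
    // Instantiate a scheduler from its class name, e.g. a name read from a configuration file.
    static JobSchedulerModule load(String className) {
        try {
            return Class.forName(className)
                    .asSubclass(JobSchedulerModule.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException("cannot load scheduler " + className, e);
        }
    }
}
```

With this arrangement each regional center could name a different scheduler class in its configuration, and the rest of the simulation would only see the interface.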
Centralized Scheduling
[Figure: a GLOBAL Job Scheduler above the sites. Each site (Site A, Site B) has its own JobScheduler and CPU FARM.]
Distributed Scheduling – market model –
[Figure: several sites, each with its own JobScheduler and CPU FARM. Labels on the exchanges between sites: Request, COST, DECISION.]
Example: simple distributed scheduling
- Very simple scheduling algorithm, based on searching for the center with the minimum load.
- We simulated the activity of 4 regional centers.
- When all the centers are heavily loaded, the number of job transfers grows unnecessarily.
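The minimum load rule, and the reason it causes needless transfers, can be sketched as follows. The class `MinLoadScheduler` and the `minGain` parameter are hypothetical: with `minGain = 0` the method is the plain minimum load search described above, while a positive `minGain` is one possible remedy (an assumption, not something the slides propose) that keeps the job local unless the remote center is clearly less loaded.

```java
// Illustrative sketch; the class name and the minGain threshold are assumptions.
final class MinLoadScheduler {
    // Returns the index of the center that should run a job submitted at center `local`.
    // The job migrates only if the best center is less loaded by more than `minGain`.
    static int choose(double[] loads, int local, double minGain) {
        int best = local;
        for (int i = 0; i < loads.length; i++) {
            if (loads[i] < loads[best]) best = i; // ties keep the job where it is
        }
        return (loads[local] - loads[best] > minGain) ? best : local;
    }

    public static void main(String[] args) {
        double[] busy = {0.97, 0.95, 0.96, 0.98}; // four heavily loaded centers
        System.out.println(choose(busy, 0, 0.0)); // plain rule: the job is transferred
        System.out.println(choose(busy, 0, 0.1)); // with a threshold: the job stays local
    }
}
```

When all four centers are near saturation, the plain rule still moves the job for a gain of two percent in load, which is the effect observed in the simulation.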
Network Model
[Figure: simulated network components (WAN, LAN, LinkPort, Farm), with simulated local traffic and simulated inter-regional traffic.]
LAN/WAN Simulation Model
[Figure: Nodes attached through Links to LANs; the LANs reach the Internet connections through a ROUTER.]
- “Interrupt” driven simulation: for each new message an interrupt is created, and the speed and the estimated time to complete are recalculated for all the active transfers.
- Continuous flow between events!
- An efficient and realistic way to simulate concurrent transfers having different sizes / protocols.
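A worked sketch of the "continuous flow between events" idea, under simplifying assumptions that are not taken from the slides: all transfers start together, they share one link equally, and there is no protocol overhead. Between two completion events every active transfer advances at the same constant speed, so the finish times follow directly from the sorted sizes. The class `SharedLink` is a hypothetical name.

```java
import java.util.Arrays;
import java.util.Comparator;

// Illustrative sketch; equal sharing of a single link is an assumption.
final class SharedLink {
    // sizes: amount of data per transfer; bandwidth: data units per second for the whole link.
    // Returns the finish time of each transfer, in the order of the input array.
    static double[] finishTimes(double[] sizes, double bandwidth) {
        int n = sizes.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble(i -> sizes[i])); // smallest finishes first

        double[] finish = new double[n];
        double clock = 0.0; // simulated time of the last completion event
        double sent = 0.0;  // data already sent by every still active transfer
        for (int k = 0; k < n; k++) {
            int active = n - k; // transfers sharing the link until the next completion
            double size = sizes[order[k]];
            clock += (size - sent) * active / bandwidth; // each one runs at bandwidth / active
            sent = size;
            finish[order[k]] = clock;
        }
        return finish;
    }

    public static void main(String[] args) {
        // Three transfers of 100, 300 and 200 units on a link of 100 units/s.
        System.out.println(Arrays.toString(finishTimes(new double[] {100, 300, 200}, 100)));
    }
}
```

In the example, the three transfers run at one third of the link speed until the smallest completes at 3 s; the two remaining ones then speed up, finishing at 5 s and 6 s. Only three events are needed, whatever the transfer sizes.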
Network Model
The model closely follows the TCP/IP layers.
- Layers: Network Access Layer, Internet Layer, Transport Layer, Application Layer.
- Simulation classes: LinkPort, LAN, WAN; Message; Protocol (TCPProtocol, UDPProtocol); Network Job.
Data Model
[Figure: class diagram. Labels: Client, Database Index, LinkPort, Database, DContainer, Database Server, Mass Storage, Mapping, Task, Database Entity.]
Data Model
[Figure: Generic Data Container with the attributes Size, Event Type, Event Range and Access Count, and its INSTANCE. Other labels: FTP Server Node, DB Server, NFS Server, FILE, Data Base, Custom Data Server, Network FILE, META DATA Catalog, Replication Catalog, Export / Import.]
Data Model
[Figure: how a job obtains its data. Labels: JOB, Data Request, META DATA Catalog, Replication Catalog, Data Container (select from the options), List Of IO Transactions, Data Processing JOB.]
Activities: Arrival Patterns
- A flexible mechanism to define the stochastic process of how users perform data processing tasks.
- Dynamic loading of “Activity” tasks, which are threaded objects controlled by the simulation scheduling mechanism.
Physics Activities Injecting “Jobs”
Each “Activity” thread generates data processing jobs:

    for (int k = 0; k < jobs_per_group; k++) {
        Job job = new Job(this, Job.ANALYSIS, "TAG", 1, events_to_process);
        farm.addJob(job);   // submit the job
        sim_hold(1000);     // wait 1000 s
    }

[Figure: Activity objects inject Jobs into the Regional Centre Farm.]
These dynamic objects are used to model the users' behavior.
Output of the simulation
[Figure: components of the Simulation Engine (Node, DB, Router, User) publish results; Output Listeners with Filters pass them on to Log Files, EXCEL, GRAPHICS and clients.]
- Any component in the system can generate generic results objects.
- Any client can subscribe with a filter and will receive the results it is interested in.
- The structure is very similar to that of MonALISA. We will soon integrate the output of the simulation framework into MonALISA.
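The subscribe-with-a-filter mechanism can be sketched with the standard `Predicate` and `Consumer` interfaces. This is an illustration under assumptions: the class `ResultBus`, the `Result` fields and the parameter names used in the usage note are hypothetical, and neither the MONARC output classes nor the MonALISA API are reproduced here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Illustrative sketch; all names are hypothetical.
final class ResultBus {
    // A generic result object that any simulated component can publish.
    static final class Result {
        final String source;    // e.g. a node, a DB server or a router
        final String parameter; // what was measured
        final double value;

        Result(String source, String parameter, double value) {
            this.source = source;
            this.parameter = parameter;
            this.value = value;
        }
    }

    // A client's interest: a filter plus the listener that receives matching results.
    private static final class Subscription {
        final Predicate<Result> filter;
        final Consumer<Result> listener;

        Subscription(Predicate<Result> filter, Consumer<Result> listener) {
            this.filter = filter;
            this.listener = listener;
        }
    }

    private final List<Subscription> subscriptions = new ArrayList<>();

    void subscribe(Predicate<Result> filter, Consumer<Result> listener) {
        subscriptions.add(new Subscription(filter, listener));
    }

    // Deliver a result to every client whose filter accepts it.
    void publish(Result result) {
        for (Subscription s : subscriptions) {
            if (s.filter.test(result)) s.listener.accept(result);
        }
    }
}
```

A client interested only in CPU load could call `bus.subscribe(r -> r.parameter.equals("cpu_load"), r -> System.out.println(r.source + " " + r.value))`; results published by other components for other parameters would never reach it.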
Conclusions
- Modelling and understanding current systems, their performance and limitations, is essential for the design of large scale distributed processing systems. This will require continuous iterations between modelling and monitoring.
- Simulation and modelling tools must provide the functionality to help in designing complex systems and to evaluate different strategies and algorithms for the decision making units and the data flow management.
- For future development: efficient distributed scheduling algorithms, data replication, more complex examples.