1 Lattice QCD Clusters
Amitoj Singh, Fermi National Accelerator Laboratory

2 Introduction
- The LQCD Clusters
- Cluster monitoring and response
- Cluster job types
- Job submission, scheduling and allocation
- Job execution
- Wish List
- Questions and Answers

3 The LQCD Clusters

Cluster | Processor | Nodes | MILC performance
qcd  | 2.8 GHz P4E, Intel E7210 chipset, 1 GB main memory, Myrinet | … | … MFlops/node (0.1 TFlops total)
pion | 3.2 GHz Pentium 640, Intel E7221 chipset, 1 GB main memory, Infiniband SDR | … | … MFlops/node (0.8 TFlops total)
kaon | 2.0 GHz dual Opteron, nVidia CK804 chipset, 4 GB main memory, Infiniband DDR | … | … MFlops/node (2.2 TFlops total)

4 pion and qcd clusters [photos: pion cluster front, pion cluster back, qcd cluster back]

5 kaon cluster [photos: kaon cluster front, kaon cluster back, kaon head-nodes & Infiniband spine]

6 Cluster monitoring

Worker-node nannies monitor critical components and processes such as:
- health (CPU/system temperature, CPU/system fan speeds)
- batch queue clients (PBS mom) *
- disk space
- NFS mount points
- high-speed interconnects
Except for the item marked *, the nannies only report any anomalies that may exist; for the * item a corrective action is defined. A corrective action needs to be well defined, with sufficient decision paths to fully automate the error diagnosis and recovery process. Users are sophisticated enough to report any performance-related issues.

The head-node nanny monitors critical processes such as:
- mrtg graph-plotting scripts *
- automated scripts that generate the cluster status pages *
- batch queue server (PBS server)
- NFS server *
Except for the items marked *, the nanny restarts processes that have exited abnormally. All unhealthy nodes are shown as blinking on the cluster status pages; cluster administrators can then analyze the mrtg plots to isolate the problem.

Network fabric
For the high-speed network interconnects, nannies monitor and plot the health of critical components (switch blade temperature, chassis fan speeds) on the 128-port Myrinet spine switch. No automated corrective action has been defined for any anomalies that may occur. Cluster administrators can run Infiniband cluster administration tools to locate bad Infiniband cables, failing spine or leaf switch ports, and failing Infiniband HCAs. The Infiniband hardware has been reliable.
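As an illustration of the worker-node checks above, a minimal sketch of one nanny cycle is shown below. The sensor path, thresholds, scratch location, and the `service pbs_mom restart` recovery command are assumptions for illustration; the production nannies, their report channel, and their corrective actions are site-specific.

```python
# Hypothetical sketch of one worker-node nanny cycle: report-only checks
# plus the single automated corrective action (restarting PBS mom).
# Paths, thresholds, and commands are assumptions, not production values.
import shutil
import subprocess

CPU_TEMP_LIMIT_C = 70                 # assumed threshold
MIN_FREE_SCRATCH_BYTES = 5 * 2**30    # assumed minimum free scratch space

def cpu_temperature_c():
    # sysfs-style reading; a real nanny might use IPMI or lm_sensors instead
    with open("/sys/class/thermal/thermal_zone0/temp") as f:
        return int(f.read().strip()) / 1000.0

def pbs_mom_running():
    return subprocess.run(["pgrep", "-x", "pbs_mom"],
                          stdout=subprocess.DEVNULL).returncode == 0

def run_checks():
    anomalies = []
    # report-only items: health and disk space
    if cpu_temperature_c() > CPU_TEMP_LIMIT_C:
        anomalies.append("cpu temperature above limit")
    if shutil.disk_usage("/scratch").free < MIN_FREE_SCRATCH_BYTES:
        anomalies.append("scratch disk nearly full")
    # the * item: a corrective action is defined for the batch queue client
    if not pbs_mom_running():
        subprocess.run(["service", "pbs_mom", "restart"], check=False)
        anomalies.append("pbs_mom was down; restart attempted")
    return anomalies

if __name__ == "__main__":
    # anomalies would be reported to the admins / cluster status pages
    for a in run_checks():
        print("ANOMALY:", a)
```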

7 Cluster job types

A large fraction of the jobs run on the LQCD clusters are limited by:
- memory bandwidth
- network bandwidth
[plots: memory-bandwidth-bound performance, network-bandwidth-bound performance]

8 Cluster job execution

OpenPBS (Torque) and the Maui scheduler schedule jobs using a "FIFO" algorithm, as follows:
- Jobs are queued in the order of submission.
- Maui runs the highest (oldest) jobs in the queue in order, except that it will not start a job if any of the following are true:
  a) the job would put the number of running jobs by a particular user over the limit
  b) the job would put the total number of nodes used by a particular user over the limit
  c) the job specifies resources that cannot be fulfilled (e.g. a specific set of nodes requested by the user)
- If a job is blocked by any of the above, Maui runs the next eligible job.
- Under certain conditions, Maui may run the next eligible job even when only limit (c) holds; this is called backfilling. Maui looks at the state of the queue and the running jobs and, based on the requested and used wall-clock times, predicts when the job blocked by (c) will be able to run. If jobs lower in the queue can run without delaying the predicted start time of the job blocked by (c), Maui runs those jobs.

Once a job is ready to run, a set of nodes is allocated to it exclusively for the requested wall time. Almost all jobs run on the LQCD clusters are MPI jobs. Users can refer to the PBS_NODEFILE environment variable explicitly, or its use is coded into the mpirun launch script (see the sketch below).
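A minimal sketch of how a launch wrapper might consume PBS_NODEFILE, as mentioned above. The binary name, the mpirun flags, and the wrapper itself are assumptions for illustration, not the clusters' production launch script.

```python
# Hypothetical sketch of the node-allocation step inside an mpirun launch
# wrapper: PBS exports PBS_NODEFILE, which lists one hostname per allocated
# slot for the job's exclusive reservation.  Binary name and flags are assumed.
import os
import subprocess

def launch(app="./milc_binary", extra_args=()):
    nodefile = os.environ["PBS_NODEFILE"]        # set by PBS for every job
    with open(nodefile) as f:
        hosts = [line.strip() for line in f if line.strip()]
    nprocs = len(hosts)
    cmd = ["mpirun", "-np", str(nprocs), "-machinefile", nodefile,
           app, *extra_args]
    print("launching on", nprocs, "slots:", " ".join(cmd))
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    launch()
```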

9 Cluster job execution (cont'd)

Typical user jobs use 8, 16 or 32 nodes and run for a maximum wall time of 24 hours. A user nanny job running on the head-node executes job streams. Each job stream is a PBS job which:
- on the job head-node (MPI node 0), copies a lattice (the problem) stored in dCache to the local scratch disk
- divides the lattice among the nodes and copies the sub-lattices to each node's local scratch disk
- launches an MPI process on each node, which computes on its sub-lattice
- has the main process (MPI process 0) gather the results from each node onto the job head-node (MPI node 0) and copy the output into dCache
- marks checkpoints at regular intervals for error recovery
The output of one job stream is the input lattice for the next job stream. If a job stream fails, the nanny job restarts the stream from the most recent saved checkpoint, as in the sketch below.
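A minimal sketch of the checkpoint/restart loop a head-node nanny might implement for a chain of job streams. The file layout, checkpoint naming, and the submit_pbs_job helper are hypothetical; the real nanny additionally handles the dCache transfers and sub-lattice distribution described above.

```python
# Hypothetical sketch of a head-node nanny driving a chain of job streams.
# Each stream's output lattice is the next stream's input; on failure the
# stream is resubmitted from the most recent checkpoint.  submit_pbs_job()
# and the checkpoint-naming scheme are assumptions for illustration.
import glob
import os

def latest_checkpoint(stream_dir):
    """Return the newest checkpoint lattice in a stream's work area, if any."""
    ckpts = sorted(glob.glob(os.path.join(stream_dir, "ckpt_*.lat")))
    return ckpts[-1] if ckpts else None

def run_stream_chain(input_lattice, n_streams, submit_pbs_job):
    """submit_pbs_job(stream_dir, start_lattice) -> (ok, output_lattice)."""
    lattice = input_lattice
    for i in range(n_streams):
        stream_dir = f"stream_{i:03d}"
        while True:
            # restart from the newest checkpoint if one exists
            start_from = latest_checkpoint(stream_dir) or lattice
            ok, output_lattice = submit_pbs_job(stream_dir, start_from)
            if ok:
                break                    # stream finished; move on
        lattice = output_lattice         # output feeds the next stream
    return lattice
```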

10 Wish List
- Missing link between the monitoring process and the scheduler: the scheduler could do better by being node- and network-aware.
- Ability to monitor factors that are critical to application performance (e.g. thermal instabilities can throttle CPU speed, which ultimately affects performance).
- Very few automated corrective actions are defined for the components and processes that are currently being monitored.
- Using current health data, the ability to predict node failures rather than just updating mrtg plots (see the sketch below).
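One way the last item might be approached: a minimal sketch, assuming health samples are already collected as (timestamp, temperature) pairs and assuming a throttling threshold and look-ahead window, that flags a node whose temperature trend is heading toward the throttling point instead of only plotting it.

```python
# Hypothetical sketch: fit a linear trend to recent CPU-temperature samples
# and flag nodes likely to reach an assumed throttling threshold soon.
# Sampling source, threshold, and look-ahead window are assumptions.
THROTTLE_TEMP_C = 85.0
LOOKAHEAD_S = 6 * 3600        # warn if throttling is predicted within 6 hours

def predict_throttle(samples):
    """samples: list of (unix_time, temp_c), oldest first."""
    if len(samples) < 2:
        return False
    (t0, y0), (t1, y1) = samples[0], samples[-1]
    if t1 == t0:
        return False
    slope = (y1 - y0) / (t1 - t0)          # degrees C per second
    if slope <= 0:
        return False                       # not warming; nothing to flag
    seconds_to_limit = (THROTTLE_TEMP_C - y1) / slope
    return 0 <= seconds_to_limit <= LOOKAHEAD_S

# Example: a node warming from 70 C to 78 C over four hours trips the warning.
print(predict_throttle([(0, 70.0), (4 * 3600, 78.0)]))
```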