Pre-GDB 2014 Infrastructure Analysis
Christian Nieke – IT-DSS, 13.5.2014



What do we want to do?
- Understand our systems
  - Which parameters influence the behaviour?
  - What are our use cases?
  - Which metrics can measure performance? (Quantify)
- Improve
  - Describe our expectations and current status
  - Detect problems
  - Understand what causes them
- Predict
  - Future demand
  - Effect of system changes

What do we have already?
Data sources:
- EOS logs per file access: some 20 metrics per open/close sequence, very similar to the xroot f-stream (collected by Matevz)
- LSF: another 20 metrics per job (cpu, wall, OS, virt/phys, RSS, swap, local I/O)
- Dashboards: experiment-side classification and (in some cases) event rate
But combining them is tricky…
- Lack of a unique identifier for cross matches (see the sketch below)
- Combine information via an xroot plugin?
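Since there is no shared identifier between the EOS and LSF records, any cross match has to be approximate. A minimal sketch of one possible approach, matching on execution host and time-window containment; the record fields (host, open/close and start/end timestamps) are illustrative assumptions, not the real EOS or LSF schemas:

```python
# Hypothetical sketch: approximate cross-match of EOS file-access records and
# LSF job records when no shared identifier exists.
from dataclasses import dataclass

@dataclass
class EosAccess:
    host: str        # client host that opened the file
    open_ts: float   # open timestamp (Unix seconds)
    close_ts: float  # close timestamp

@dataclass
class LsfJob:
    job_id: str
    host: str        # execution host reported by LSF
    start_ts: float
    end_ts: float

def cross_match(accesses, jobs):
    """Attribute each EOS access to the LSF job on the same host whose
    lifetime covers the open/close interval (best effort, not unique)."""
    matches = []
    for acc in accesses:
        candidates = [j for j in jobs
                      if j.host == acc.host
                      and j.start_ts <= acc.open_ts
                      and acc.close_ts <= j.end_ts]
        if len(candidates) == 1:          # keep unambiguous matches only
            matches.append((acc, candidates[0]))
    return matches

accesses = [EosAccess("lxbatch042", open_ts=1000.0, close_ts=1600.0)]
jobs = [LsfJob("job-1", "lxbatch042", start_ts=900.0, end_ts=2000.0)]
print(cross_match(accesses, jobs))        # one unambiguous pair
```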

Important parameters
Analysis and effects of:
- Remote/federated access, e.g. CERN-Wigner
- I/O & network
- CPU: AMD/Intel
- Configuration: vector read/TTreeCache, OS, etc.
- Use cases
  - Analysis/Production
  - Per experiment
  - Internal (EOS-balancing etc.)
  - …

Metrics - How to measure performance?
[Diagram: example job timelines built from alternating I/O and CPU phases, annotated with the resulting CPU efficiency (50%, 60%, 67%).]
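For reference, a minimal sketch of the CPU-efficiency metric behind the diagram, assuming the usual definition of CPU time over wall-clock time; the phase durations are illustrative, not taken from the slide:

```python
# Minimal sketch: CPU efficiency as cpu_time / wall_time for a job that
# alternates between computing and waiting on I/O.

def cpu_efficiency(phases):
    """phases: list of (kind, seconds) tuples, kind in {"cpu", "io"}."""
    wall = sum(sec for _, sec in phases)
    cpu = sum(sec for kind, sec in phases if kind == "cpu")
    return cpu / wall

# A job that waits on I/O as long as it computes: 50% efficiency.
print(cpu_efficiency([("io", 10), ("cpu", 10), ("io", 10), ("cpu", 10)]))  # 0.5
# The same CPU work with halved I/O time: efficiency rises to ~67%.
print(cpu_efficiency([("io", 5), ("cpu", 10), ("io", 5), ("cpu", 10)]))    # 0.667
```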

Example: Throughput vs. CPU ratio
- The idea is that a low CPU ratio points to slow I/O
- But the relation is not that straightforward…

Metrics - How to measure performance?
- Runtime?
  - If the same job runs faster, this is obviously better
  - But hard to compare: different jobs, different data sets (and sizes)
- Event rate?
  - Intuitive: events / second
  - More stable than just time (ratios), for the same kind of job and similar data
  - Useful for comparison of parameters, e.g. CERN/Wigner for a similar job mix (see the sketch below)
  - Right now not reported by all experiments
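A small illustration of how event rates could be compared for the same kind of job across two sites; the job records, numbers, and field names are hypothetical:

```python
# Illustrative sketch: compare the event rate (events per wall-clock second)
# of the same kind of job at two locations, e.g. CERN vs. Wigner.
from statistics import mean

jobs = [
    {"site": "CERN",   "events": 200_000, "wall_s": 3_600},
    {"site": "CERN",   "events": 180_000, "wall_s": 3_400},
    {"site": "Wigner", "events": 200_000, "wall_s": 4_100},
    {"site": "Wigner", "events": 190_000, "wall_s": 4_000},
]

for site in ("CERN", "Wigner"):
    rates = [j["events"] / j["wall_s"] for j in jobs if j["site"] == site]
    print(f"{site}: mean event rate {mean(rates):.1f} events/s")
```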

Metrics - How to measure performance?
- Job statistics: collect key metrics (time spent in CPU, I/O and network transfer rate, event rate…) per job
- Allows optimization for users and the system
  - User: "50% of the time spent in network? I should check this!"
  - System: "95% of the time spent in CPU? We should upgrade CPUs rather than switches."
- Categorize into job profiles
  - Allow users to compare their jobs to the average in their category (a sketch follows below)
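A sketch of what comparing a job against the average of its category could look like; the profile name, metrics, and numbers are assumptions for illustration only:

```python
# Hypothetical sketch: compare a job's key metrics to the average of its
# job-profile category so the user can spot anomalies (e.g. too much network).
from statistics import mean

profile_jobs = {  # previously collected jobs, grouped by profile
    "analysis": [
        {"cpu_frac": 0.72, "net_frac": 0.20, "event_rate": 45.0},
        {"cpu_frac": 0.80, "net_frac": 0.12, "event_rate": 52.0},
    ],
}

def compare_to_profile(job, profile):
    """Return (job value, category average) per metric."""
    peers = profile_jobs[profile]
    return {metric: (job[metric], mean(p[metric] for p in peers))
            for metric in job}

my_job = {"cpu_frac": 0.40, "net_frac": 0.50, "event_rate": 20.0}
for metric, (value, avg) in compare_to_profile(my_job, "analysis").items():
    print(f"{metric}: job={value:.2f} vs category average={avg:.2f}")
```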

Proposal for xroot
- Proposal: add an xroot client plugin to collect CPU and user metrics per session, correlated (by session id) with the f-stream
  - e.g. cpu / wall / memory / swap / local IO, event rate from ROOT / experiment framework
- Define an app info field for the main workloads
  - e.g. app_info := "app/version/specific_setting"
  - EOS internally uses: /eos/gridftp, /eos/balancing, /eos/draining, etc.
- This would allow passive analysis of
  - experiment throughput per 'app'
  - statistical comparison of sites / OS / software versions
- One would not be able to average evt/s over different apps, but one would be able to sum up relative increases/decreases (see the sketch below)
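A sketch of the kind of passive analysis the proposal would enable: joining client-side session metrics with server-side f-stream records on a shared session id and aggregating per app_info. All record layouts here are illustrative assumptions, not the real xroot or EOS formats:

```python
# Hypothetical sketch: correlate client-plugin metrics with f-stream records
# by session id, then aggregate throughput and efficiency per app_info.
from collections import defaultdict

client_metrics = [   # as the proposed xroot client plugin might report them
    {"session_id": "s1", "cpu_s": 540, "wall_s": 900, "events": 40_000},
    {"session_id": "s2", "cpu_s": 300, "wall_s": 800, "events": 25_000},
]
fstream_records = [  # server-side f-stream, one record per session
    {"session_id": "s1", "app_info": "exp_fw/7.1/analysis", "bytes_read": 4e9},
    {"session_id": "s2", "app_info": "exp_fw/7.1/analysis", "bytes_read": 6e9},
]

by_session = {r["session_id"]: r for r in fstream_records}
per_app = defaultdict(lambda: {"cpu_s": 0, "wall_s": 0, "events": 0, "bytes": 0.0})

for m in client_metrics:
    f = by_session.get(m["session_id"])
    if f is None:
        continue                       # no matching f-stream record
    agg = per_app[f["app_info"]]
    agg["cpu_s"] += m["cpu_s"]
    agg["wall_s"] += m["wall_s"]
    agg["events"] += m["events"]
    agg["bytes"] += f["bytes_read"]

for app, a in per_app.items():
    print(app,
          "CPU efficiency:", round(a["cpu_s"] / a["wall_s"], 2),
          "events/s:", round(a["events"] / a["wall_s"], 1),
          "MB/s:", round(a["bytes"] / a["wall_s"] / 1e6, 1))
```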

How to improve the system?
- Model for expected behaviour
  - For now: average values within a given group
  - Goal: models for specific, well-defined groups, e.g. Analysis/Production per experiment, Internal (EOS-balancing)
- Detection of unexpected behaviour
  - For now: detection of outliers (a simple sketch follows below)
  - Goal: automatically detect deviation from the models
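A minimal sketch of the current "detection of outliers" step, assuming a simple deviation cut against the group average; the metric name and threshold are illustrative:

```python
# Illustrative sketch: flag jobs whose metric deviates strongly from the
# average of their group (here CPU efficiency, with a z-score cut).
from statistics import mean, stdev

def find_outliers(jobs, metric="cpu_eff", threshold=3.0):
    """Return jobs lying more than `threshold` standard deviations from the
    group mean for the given metric."""
    values = [j[metric] for j in jobs]
    if len(values) < 2:
        return []
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [j for j in jobs if abs(j[metric] - mu) / sigma > threshold]

group = [{"job_id": i, "cpu_eff": 0.85} for i in range(50)]
group.append({"job_id": 999, "cpu_eff": 0.03})   # one pathological job
print(find_outliers(group))                       # flags job 999
```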

Example: Results based on simple detection
- LSF group: automatic detection of low CPU rate (<50%)
- Users are informed personally
- Experiments above 80%

Example: Users lacking information
- One example from the new LSF low-CPU-usage probes
  - One user with often only 2-3% CPU utilisation, kB/s average transfer rate - all data at US-T2
  - After vector-read/TTC was enabled: improvement by a factor of 4.5 in turn-around and CPU utilisation
  - Remaining factor of 6 wrt local EOS access
- Low "visible" impact on users (given sufficient batch slots)
  - Even slow jobs are running stably in parallel
  - No concrete expectation about what their job duration / transfer speed should be

Summary
- Understanding the system:
  - Several parameters and use cases identified
  - Good metrics are still an open question
- Improving the system:
  - Currently semi-automatic detection of anomalies
  - Working towards higher automation
  - But already some success stories
- Prediction: still in the future…
- Proposal: xroot plugin for better integration of monitoring information