Dealing with real resources Wed July 21st, 3:15pm Igor Sfiligoi, OSG Scalability Area coordinator and OSG glideinWMS factory manager.

Dealing with real resources Wed July 21st, 3:15pm Igor Sfiligoi, OSG Scalability Area coordinator and OSG glideinWMS factory manager University of California San Diego

OSG Summer School 2010

Real resources
- They have limits
- They break!
  - Sometimes in very strange ways
- You don't always know what they are
- You need to share them

What resources do you use?
- Compute resources
- Storage resources
- Network resources

Real resources: Compute resources

Use them or lose them
- CPU cycles cannot be stored
  - They are either used or wasted
- Batch processing tries to minimize the waste
  - But it cannot do miracles
  - User jobs still need to be efficient

CPU inefficiencies
- Disk operations
  - Disk is orders of magnitude slower than CPU
  - Reading many files in parallel is a killer
- Network operations
  - If you wait for data over the network, you are likely wasting CPU
  - WAN is much slower than LAN
- We can assume you use efficient algorithms, right?
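
One rough way to see how much of a job's time goes to waiting on disk or network is to compare the CPU time it consumed with the wall-clock time it took. The sketch below is only an illustration: `cpu_efficiency`, `busy` and `waiting` are hypothetical names, and `time.sleep` merely stands in for a job blocked on I/O.

```python
import time

def cpu_efficiency(work):
    """Run work() and return CPU time divided by wall-clock time."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    work()
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    return cpu_used / wall_used if wall_used > 0 else 0.0

def busy():
    # Pure computation: keeps the CPU occupied the whole time.
    total = 0
    for i in range(2_000_000):
        total += i * i
    return total

def waiting():
    # Stand-in for a job blocked on disk or network: sleeping uses almost no CPU.
    time.sleep(0.5)

if __name__ == "__main__":
    print(f"busy:    {cpu_efficiency(busy):.2f}")
    print(f"waiting: {cpu_efficiency(waiting):.2f}")
```

A ratio close to 1 means the job was mostly computing; a ratio close to 0 means the slot was held while the CPU sat idle.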

Sharing
- Most of the time you will be sharing a compute box with other people
  - Accessing the same physical disk
  - Competing for the same network link
- Memory (RAM) may become a limit
  - Most OSes hide it from you through virtual memory
  - But that can be very slow! (disk access)

Job requirements
- CPU type
- Operating system (flavor)
- Installed libraries
- Memory needs
Do you know them? Can you minimize them?
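
If you do not know your job's requirements, a test run can record some of them. The following is a minimal sketch, assuming a Unix-like worker node (the `resource` module is not available on Windows); `describe_environment` and `peak_memory_mib` are hypothetical helper names.

```python
import platform
import resource
import sys

def describe_environment():
    """Collect the CPU type and OS flavor of the machine the job ran on."""
    return {
        "machine": platform.machine(),   # CPU architecture, e.g. x86_64
        "system": platform.system(),     # e.g. Linux
        "release": platform.release(),   # kernel release
        "python": platform.python_version(),
    }

def peak_memory_mib():
    """Peak resident memory of this process so far, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    if sys.platform == "darwin":
        return peak / (1024 * 1024)
    return peak / 1024

if __name__ == "__main__":
    print(describe_environment())
    print(f"peak memory: {peak_memory_mib():.1f} MiB")
```

Calling both at the end of a representative test job gives numbers you can compare against what a site advertises.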

Where do you get the resources?
- Your desktop/laptop?
- Local cluster?
- The Grid?
  - Which one? (OSG, TeraGrid, NYGrid, EGEE, ...)
- The Cloud?
  - Which one? (Amazon, Magellan, Microsoft, Google, ...)

On OSG
- Which sites support me?
- What tools will I use?
- How do I request the right resources?
  - I know what I need, right?
- How do I partition my work?
  - Cannot be a single serial job
  - Then you need to split it across sites (OSG MM and pilots can help here)
- Remember you may want to use other resources, too!

Site selection
- Can I trust the information system?
  - Whom does the site support?
  - Are resource attributes correct?
- Do I get all the information needed?
  - What about sites with different resources?
- How do I express my requirements?
  - Globus uses RSL
  - But the semantics are site-specific!
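
To make the RSL point concrete, here is a hedged sketch that assembles an RSL-style request string. The attribute names used in the example (`executable`, `count`, `maxWallTime`, `maxMemory`) are assumed from the commonly documented GRAM RSL attributes, and `build_rsl` is a hypothetical helper, not part of any Globus tool. How a given site interprets the values is exactly the site-specific part.

```python
def rsl_quote(value):
    """Quote a string value; an embedded double quote is written twice."""
    text = str(value)
    return '"' + text.replace('"', '""') + '"'

def build_rsl(**attributes):
    """Render keyword arguments as a conjunction of (name=value) relations."""
    parts = []
    for name, value in attributes.items():
        if isinstance(value, (list, tuple)):
            # Multi-valued attributes are written as space-separated values.
            rendered = " ".join(rsl_quote(item) for item in value)
        elif isinstance(value, int):
            rendered = str(value)
        else:
            rendered = rsl_quote(value)
        parts.append(f"({name}={rendered})")
    return "&" + "".join(parts)

if __name__ == "__main__":
    # Assumed units: minutes for maxWallTime, megabytes for maxMemory.
    print(build_rsl(executable="/bin/hostname", count=1,
                    maxWallTime=60, maxMemory=2048))
```

Even when the string is well formed, a site may ignore or reinterpret an attribute, so it is worth verifying on a test job that the limits you asked for were actually applied.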

My job is not working
- Is the problem 1/1000 or 1000/1000?
  - I hope you tested with a single one first!
- Do you know how to debug it?
  - Interactive access is usually not an option (although pilots can partially help)
  - Does it work on your laptop?
- What do the logs say?
  - You do write to them, right? (could be just stderr)
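
Since interactive access is usually not available, the log is often the only evidence you get back. A minimal sketch of a job logger that writes to stderr and stamps every line with the worker node's hostname and the process ID; `make_job_logger` is a hypothetical helper name.

```python
import logging
import os
import socket
import sys

def make_job_logger(name="job"):
    """Return a logger that writes timestamped lines to stderr."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        # Escape '%' so the hostname cannot break the format string.
        host = socket.gethostname().replace("%", "%%")
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(logging.Formatter(
            "%(asctime)s %(levelname)s " + host
            + f" pid={os.getpid()} %(message)s"))
        logger.addHandler(handler)
    return logger

if __name__ == "__main__":
    log = make_job_logger()
    log.info("starting in %s", os.getcwd())
    log.info("arguments: %s", sys.argv[1:])
    try:
        open("/nonexistent/input.dat")
    except OSError:
        # exception() records the full traceback along with the message.
        log.exception("could not open input file")
```

Knowing which host a failing job ran on is what lets you tell a 1/1000 problem tied to one node from a 1000/1000 problem in your own setup.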

What can go wrong?
- You don't understand your requirements
- The batch system did not honor your requirements
  - Or you expressed them in a wrong way
- Corrupted software
- Corrupted data
- Hardware problems (lots of components involved)
- You already ruled out bugs in your code, right?

My jobs don't finish
- Have they ever started?
- Nope
  - Too restrictive requirements?
  - Permissions?
  - Batch system bug?
  - Just need to wait a little more?
- Yes
  - Your job has a bug?
  - Hung connection?
  - Have they been restarted several times? (preemption)

Grid related problems
- Permissions!
  - "Globus error 7!"
- Gatekeeper problems
  - "Globus error 3: an I/O operation failed"
  - "Globus error 4: jobmanager unable to set default to the directory requested"
  - "Globus error 17: the job failed when the job manager attempted to run it"
  - "Globus error 22: the job manager failed to create an internal script argument file"
  - "Globus error 47: the gatekeeper failed to run the job manager"
  - "Globus error 121: the job state file doesn't exist"

Grid related problems (2)
- Black holes
  - Worker nodes that “eat” your jobs
- Worker node problems
  - Misconfigured OS
  - Missing OSG software
  - Missing VO software
  - Disk full
  - Preemption
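
A black hole typically shows up in your own job records as one node that fails many jobs very quickly. The sketch below flags such nodes. The record layout `(node, runtime_seconds, exit_code)` and all thresholds are assumptions chosen for illustration, not values from the talk.

```python
from collections import defaultdict

def find_black_holes(jobs, max_runtime_s=60, min_jobs=5,
                     min_failure_fraction=0.9):
    """Return nodes where nearly every job failed almost immediately.

    jobs: iterable of (node, runtime_seconds, exit_code) tuples.
    """
    per_node = defaultdict(lambda: [0, 0])  # node -> [total, fast failures]
    for node, runtime_s, exit_code in jobs:
        stats = per_node[node]
        stats[0] += 1
        if exit_code != 0 and runtime_s <= max_runtime_s:
            stats[1] += 1
    return sorted(
        node for node, (total, fast) in per_node.items()
        if total >= min_jobs and fast / total >= min_failure_fraction
    )

if __name__ == "__main__":
    records = [("wn01", 8, 1)] * 6 + [("wn02", 3600, 0)] * 5
    print(find_black_holes(records))
```

The list of suspect nodes is the kind of concrete evidence a site administrator needs in order to act on a report.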

How do you fix Grid problems?
- There is very little you can do by yourself
- Most problems can only be solved by administrators at the Grid site
  - GOC can act as an intermediary
- Help me, please!

Real resources: Storage resources

A two-dimensional problem
- Storage is used for extended periods of time
  - Not instantaneous like CPU
  - People like to cling to it as long as possible
- Two dimensions
  - Space (number of bytes)
  - Time (from-to)
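
One way to picture the two dimensions is to multiply them: the load a dataset puts on shared storage is its size times how long it stays there. This is only an illustrative measure (the "TB-days" unit and the function name are assumptions, not something the talk defines).

```python
def storage_footprint_tb_days(size_bytes, days_held):
    """Space multiplied by time, expressed in terabyte-days."""
    return size_bytes / 1e12 * days_held

if __name__ == "__main__":
    # 2 TB kept for a month occupies as much space-time as 20 TB kept 3 days.
    print(storage_footprint_tb_days(2e12, 30))
    print(storage_footprint_tb_days(20e12, 3))
```

Seen this way, deleting old files promptly is as effective as producing smaller ones.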

Uniform yet heterogeneous
- Storage is much more uniform than CPU
  - Everyone stores bytes
- But the interface can be heterogeneous
  - What protocol does it speak?
  - How do I access it?
  - Do I need to request explicit permission before I can write?
  - How long will my data stay there?

What storage is available?
- OSG has a few standard areas
  - Local dir
  - $OSG_APP, $OSG_DATA
- A site may have a local storage element
  - But it is not guaranteed
- Additional site-specific areas
  - Ask!

Storage selection
- Several storage areas
  - Which one should I use?
- Sometimes little choice
  - Where is the needed data?
  - Which one has enough free space?
- Careful about performance; it depends on
  - Locality
  - Architecture
  - API

Access pattern
- Just read from the original source?
  - Or should I make a local copy?
- Just write to the target area?
  - Or should I first create a local copy?
- How do I move the data?
  - In your job or externally scheduled?
  - How do I handle errors?
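
If you choose to make a local copy inside the job, the copy step itself needs error handling. A minimal sketch, assuming the source is reachable through the POSIX interface; `stage_in` and its retry parameters are hypothetical, and a real job might call a storage-specific or Grid copy tool instead of `shutil.copy2`.

```python
import shutil
import time
from pathlib import Path

def stage_in(source, scratch_dir, attempts=3, delay_s=1.0):
    """Copy source into scratch_dir, retrying with exponential backoff."""
    scratch = Path(scratch_dir)
    scratch.mkdir(parents=True, exist_ok=True)
    target = scratch / Path(source).name
    for attempt in range(1, attempts + 1):
        try:
            shutil.copy2(source, target)
            return target
        except OSError:
            if attempt == attempts:
                raise  # give up and let the job fail loudly
            # Wait longer after each failure: delay, 2*delay, 4*delay, ...
            time.sleep(delay_s * 2 ** (attempt - 1))
```

Retrying helps with transient overloads; re-raising after the last attempt keeps a permanent failure visible in the job's exit status and logs.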

Tools to use
- POSIX interface
- System command line tools (possibly overloaded)
- Storage specific tools (dccp)
- Grid tools (srm-cp)

Error types
- Cannot reach storage area
- Wrong permissions
- Data is corrupted
- Access is painfully slow
- Disk full

Access problems
- Local disks
  - NFS stale mount?
  - Disk failure?
- Remote storage area
  - Authentication failure?
  - Networking issue?
  - Server overload?
  - Server down? (different administrative domain)

Wrong permissions
- Authentication != Authorization
- Can read but not write?
- Were you able to write through one interface, but cannot read from another?
- Can copy the whole file, but not access only a piece of it?
- What about group access?

Data corruption
- Hardware does misbehave
- Someone could have overwritten (or deleted) your files
- Can affect both data and code
- Do you know how to check for corruption?
- What do you do if either code or data gets corrupted?
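
A common way to check for corruption is to record a checksum when the file is created and compare it after every transfer. A minimal sketch using SHA-256 from the Python standard library; the choice of hash and the helper names are assumptions, and storage systems often provide their own checksums.

```python
import hashlib

def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path, expected_hex):
    """True if the file's current checksum matches the recorded one."""
    return sha256_of(path) == expected_hex.lower()
```

The same check applies to code: verifying your executable or tarball before running it distinguishes a corrupted copy from a genuine bug.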

Way too slow?
- How many of your jobs access the same area at the same time?
- What is your access pattern?
- Can you schedule file transfers?
- Can you use a different storage area?

Disk full
- Storage capacity is limited
  - And often shared with other users
- Can your quota be increased?
- Can you use a different area?
- Can you delete some of your old files?
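
A job can check for free space before it starts writing, rather than failing halfway through. A minimal sketch; `has_room` and the 10% safety margin are assumptions. Note that `shutil.disk_usage` reports free space on the file system and does not know about per-user quotas.

```python
import shutil

def has_room(path, needed_bytes, safety_margin=0.1):
    """True if the file system holding path has the space plus a margin."""
    usage = shutil.disk_usage(path)
    return usage.free >= needed_bytes * (1 + safety_margin)

if __name__ == "__main__":
    # Ask for 5 GB in the current working directory.
    print(has_room(".", 5 * 10**9))
```

Because the area is shared, the answer can change between the check and the write, so the write itself still needs error handling.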

Real resources: Network resources

Multi-dimensional problem
[Diagram: network path between endpoints A and B]
- Composed of many pieces
  - You at best see the two ends
  - But many other segments in between
- Network is always shared
  - And often you don't even know it

Use-it-or-lose-it
- The network is like the CPU
  - If you don't use it, it is lost
- However, most network admins are not happy if you use all the bandwidth
  - Due to the massively shared nature of it

Networking problems
- No connectivity
- Unreliable connectivity
- Slow speed

No connectivity
[Diagram: blocked connections between endpoints A and B]
- Firewalls are becoming common
  - Mostly blocking incoming connectivity
- At least one side must allow incoming connections
  - Or it is impossible to use the network
- Can your tool work with firewalls?
  - e.g. Grid submission requires incoming connections for both server and client
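
Before blaming a tool, it helps to test whether a plain outgoing TCP connection can be opened at all. A minimal sketch; `can_connect` is a hypothetical helper, and the host and port you probe depend on the service you need.

```python
import socket

def can_connect(host, port, timeout_s=5.0):
    """Try to open an outgoing TCP connection; False on any failure."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False
```

A firewall that silently drops packets shows up as a timeout rather than an immediate refusal, which is why the explicit timeout matters. This only tests the outgoing direction; checking incoming connectivity requires a probe from the other end.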

Unreliable connectivity
- Connection established just fine
  - But dropped after 20s
  - Did not even notify the two ends!
- Works with up to 20 clients
  - The 21st cannot connect anymore
  - The other 20 may get killed as well
- Massive UDP packet loss
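
When a connection can be dropped without either end being told, the application has to detect it on its own. A minimal sketch that combines a socket timeout with TCP keepalive; `make_watchful_socket` and the chosen intervals are assumptions, and the `TCP_KEEPIDLE` option is only set where the platform exposes it (for example on Linux).

```python
import socket

def make_watchful_socket(timeout_s=30.0, idle_s=10):
    """Create a TCP socket that notices silently dropped connections."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Blocking calls raise a timeout error instead of hanging forever.
    sock.settimeout(timeout_s)
    # Ask the OS to probe the peer when the connection is idle.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Start probing after idle_s seconds instead of the OS default.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    return sock
```

Periodic keepalive traffic can also stop some middleboxes from expiring an idle connection, although that depends on the device in between.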

Slow speed
- A bottleneck at one of the two ends?
- A bottleneck somewhere in between?
- One of the two ends needs OS tuning?
  - WAN very different from LAN
- Are you sending very small packets?
  - Using a very inefficient protocol?
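
One reason the WAN differs so much from the LAN is the bandwidth-delay product: the amount of data that must be in flight to keep a link full grows with the round-trip time. The sketch below does that arithmetic; the link speeds and round-trip times are illustrative assumptions, not measurements from the talk.

```python
def bandwidth_delay_product_bytes(bandwidth_bits_per_s, rtt_s):
    """Bytes that must be in flight to keep the link fully used."""
    return bandwidth_bits_per_s * rtt_s / 8

def max_throughput_bits_per_s(window_bytes, rtt_s):
    """Upper bound for one TCP stream limited by its window size."""
    return window_bytes * 8 / rtt_s

if __name__ == "__main__":
    # 1 Gbit/s link: LAN with 0.5 ms round trip vs WAN with 100 ms.
    print(bandwidth_delay_product_bytes(1e9, 0.0005))  # about 62 KB
    print(bandwidth_delay_product_bytes(1e9, 0.1))     # about 12.5 MB
    # A 64 KiB window over the same WAN caps a stream at about 5 Mbit/s.
    print(max_throughput_bits_per_s(64 * 1024, 0.1))
```

A window size that is ample on the LAN can therefore throttle a WAN transfer to a small fraction of the link, which is where OS-level buffer tuning or multiple parallel streams come in.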

Summary
- Life in distributed computing is hard
  - Many things can go wrong
- Grid computing is even harder
  - You have little control over most resources
- To keep it manageable:
  - Use debug-friendly tools whenever possible
  - Log as much as you can

OSG Users Experience
- See for yourself: ProblemsEncounteredByVOsDuringJobSubmission

Questions?
- Questions? Comments?
- Feel free to ask me questions later: Igor Sfiligoi
- Upcoming sessions
  - None!
  - Enjoy the evening