Published by Jean Ellis. Modified over 9 years ago.
1
Site Management & Support for 1.2
Technical Problems
Other Problems
What is a Sysadmin?
What is wrong with the way we work?
Me as a Site Manager
Me as an Iteam member
Me as an angry old man
What worked!!!
2
Technical Problems
Installation and Configuration Problems
  (Known) defects of the install tools
  Many services only partially or wrongly configured by the delivered objects
  Some services delivered without a configuration object (WP1)
  Some configuration scripts delivered untested (syntax errors)
  Some configuration scripts acted destructively
  Localization sometimes complicated
  Almost no tools provided by the WPs to verify correctness of installation/configuration (strictly not a technical problem)
  RPMs not correctly built (rare) or not relocatable
  For one tool the correct configuration was not known by EDG (not a technical problem)
Manual checks & interaction with each node
Many errors, many iterations with developers and the Iteam
3
Technical Problems
Running the Testbed: Problems
  Stability of core services (many!!!!!)
  No test procedures to identify a faulty service (not a technical problem)
  Core services do not manage limited resources (memory, disk, files, etc.), and some make it impossible for the sysadmins to do it.
The system works only for short periods in an “All Hands on Deck” environment (demos)
The day-to-day user experiences the service provided by the sites as almost nonexistent (it worked in the demo, why not for me??)
4
Examples: Resources Not Managed
On the RB
  /tmp holds the sandboxes of all jobs owned by the same user (sysadmins cannot set quotas per VO or user; a single user submitting 100 jobs can bring the system down, and it is impossible for the sysadmin to resolve!!!)
  The response to resource exhaustion (ports, open files) is often a crash of the service; this is a sign of simplistic error management (lazy programming)
On the CE
  No concept of scratch space, and users technically cannot manage the space in their home dirs (status information and temp data in the same dir)
On the SE
  No clean way for user or sysadmin to manage the disk space. Quotas are possible, but the system provides no means to safely remove a file.
No production operation possible without permanent exchange between admins and users
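As a stopgap for the missing quota support described above, a sysadmin could at least measure which sandboxes are eating the disk. The following is a minimal sketch, assuming (hypothetically) that each job sandbox is a separate subdirectory under one root directory; the function names and the directory layout are illustrative and not part of any EDG tool.

```python
import os

def tree_size(path):
    """Sum the sizes of all non-directory entries below path (symlinks are not followed)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                # The file vanished or is unreadable: skip it instead of crashing.
                continue
    return total

def usage_by_sandbox(root):
    """Return {subdirectory name: bytes used} for each directory directly under root.

    Assumption: one subdirectory per job sandbox (hypothetical layout).
    """
    usage = {}
    with os.scandir(root) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                usage[entry.name] = tree_size(entry.path)
    return usage

def over_limit(usage, limit_bytes):
    """Return the sorted names of sandboxes using more than limit_bytes."""
    return sorted(name for name, used in usage.items() if used > limit_bytes)
```

Such a report only identifies the offenders; it does not solve the underlying problem that the service itself offers no way to limit or safely clean up the space.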
5
Other Problems
Getting Information: Problems
  Tracing a problem is very hard (error messages for users are often useless, no unit tests to run, logfiles are scattered, different message formats, sometimes no time stamps, confusing)
  The Iteam mailing list is sometimes the single source of information
  The Iteam mailing list is too busy (discussions on the architecture, general end user problems, love and hate, integration of version XX….)
  After releasing software to the Iteam, the WPs often do not feel responsible for providing configuration information or helping to trace problems (this has improved recently)
Users get frustrated, Sysadmins get even more frustrated
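To illustrate what tracing across scattered logfiles involves, here is a minimal sketch that merges lines from several logs into one timeline. The two timestamp formats are hypothetical stand-ins for "different formats of messages"; the real service logs are not described in these slides. Lines without a time stamp inherit the last time stamp seen in the same log.

```python
import re
from datetime import datetime

# Hypothetical line formats, for illustration only.
PATTERNS = [
    (re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (.*)$"), "%Y-%m-%d %H:%M:%S"),
    (re.compile(r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) (.*)$"), "%d/%m/%Y %H:%M:%S"),
]

def parse_line(line):
    """Return (timestamp, message); timestamp is None if no known format matches."""
    for pattern, fmt in PATTERNS:
        match = pattern.match(line)
        if match:
            return datetime.strptime(match.group(1), fmt), match.group(2)
    return None, line

def merge_logs(sources):
    """Merge {source name: iterable of lines} into one time-ordered list.

    Each result entry is (timestamp, source name, message).
    """
    entries = []
    for name, lines in sources.items():
        last_seen = datetime.min  # used for lines before the first time stamp
        for raw in lines:
            stamp, message = parse_line(raw.rstrip("\n"))
            if stamp is None:
                stamp = last_seen
            else:
                last_seen = stamp
            entries.append((stamp, name, message))
    # sorted() is stable, so lines with equal time stamps keep their file order.
    return sorted(entries, key=lambda entry: entry[0])
```

A call such as `merge_logs({"rb": open_rb_log, "ce": open_ce_log})`, where both arguments are open file objects, would yield one list that can be read top to bottom.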
6
Other Problems
User Support: Problems
  Users often have trouble finding the user documentation.
  Users tend to ignore the user guide (80% of messages we get)
  There is no working first level user support apart from the site managers (might be getting better soon)
  There is no “Message of the day” for the end users (web page??)
  Some of the provided documentation is too complex for the end user (more HowTo’s needed)
  End users should not (need to) read the Iteam mailing list
Users send us too many identical mails and we do not have enough time to do our real work
More later on developing, testing, debugging
7
Other Problems
Built-in Robustness: Problems
  Single points of failure (MDS, GridMap Files, Proxy, RB, Lbserver), no automatic failover.
  Programming is not defensive (crashes on errors, …)
Failing services bring other services down
Finding the real reason for crashes is hard
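The combination of automatic failover and defensive error handling asked for above can be sketched in a few lines. This is a generic illustration, not how any EDG service works: the endpoint names and the `request` callable are hypothetical, and a real client would also need timeouts and health checks.

```python
def call_with_failover(endpoints, request,
                       retriable=(ConnectionError, TimeoutError, OSError)):
    """Try request(endpoint) on each endpoint in order.

    Returns (endpoint, result) for the first endpoint that answers.
    Raises RuntimeError listing every failure if none answers, so the
    caller gets a readable error instead of an unexplained crash.
    """
    failures = []
    for endpoint in endpoints:
        try:
            return endpoint, request(endpoint)
        except retriable as exc:
            # Record the failure and move on to the next replica.
            failures.append((endpoint, repr(exc)))
    raise RuntimeError(f"all endpoints failed: {failures}")
```

For example, `call_with_failover(["mds-a.example.org", "mds-b.example.org"], query)` would fall back to the second (hypothetical) host if `query` raises a connection error on the first.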
8
What is a Sysadmin/Site Manager?
Has to:
  Provide resources
  Install systems
  Maintain systems
  Maintain hardware
  Manage resources
  Manage users
  Install software packages
  Run provided tests
  Maintain services following provided procedures
  Report problems with services
  Provide basic user support
  Manage security
Normally does not have to:
  Debug applications and services
  Keep dysfunctional services running
  Discover procedures to maintain services
  Discover the functionality of services
  Provide user support on all levels, including the usage of certain applications
  Take responsibility for the deficits of the software installed (make it work)
If not convinced, compare EDG with ORACLE or SAP software.
EDG has to clarify the role of site managers!!!
9
What is wrong with the way we work?
Warning: Much of what follows has been addressed before. Almost all can be summarized by Fab’s new mission statement on quality and usability.
Some of my favorites:
  Development, testing, integration, deployment and debugging
  What is an SE?
  Vacation
  Conferences, project x
  All is good in version 2
  It is WPa’s responsibility
  I never got a job run….
10
What is wrong with the way we work?
Development, testing, integration, deployment and debugging
General:
  Development, testing and debugging have to take place on a system with a well-defined software base installed.
  The build process of the software has to be documented and has to be predictable. (The steps from source to binary to RPM have to be reliable.)
Testing:
  Unit testing was insufficient, test cases were not complex enough, and stress tests were missing
  Testing during integration was not done with use cases close enough to applications
  Focus was on only one problem package at a time (job submission, data management)
  For each version and package the required functionality has to be described by UseCases, and goals have to be set for reliability
Poor performance has to be addressed by the project in a more constructive way.
Internal review(s) of design, implementation and procedures.
11
What is wrong with the way we work?
Development, testing, integration, deployment and debugging
Integration & Deployment: We have to be more strict!
  Software released without usable configuration information has to be rejected
  Software without unit test protocols is not allowed to start integration
  Software can’t technically be released for integration without implemented test software
  The WPs have to understand that problems that block a required UseCase have to be addressed by them at any time
12
What is wrong with the way we work?
Development, testing, integration, deployment and debugging
Debugging:
  Complex problems can only be found if users (Iteam, Loose Cannons, WP8) and developers form small teams to get the problem resolved (see JJ and WP1)
What can the core site managers do?
  Provide more developers and verification testbeds
  Participate in the testing
  Run the provided tests on their site (the test group's framework will help)
  Report all problems found
13
What is wrong with the way we work?
Vacation, Conferences
During 1.2, a cry for help was very often answered with variations on the theme:
  The expert is on vacation
  The only one who can do this is at a conference, working for his other project
  We can’t do that now because this is done by Any One and she is not here, she is at a workshop
A project of this size has to stay operational during conferences and vacations
Work package managers have to guarantee that the core competency of the WP is covered. Knowledge has to be spread widely enough.
Is a 20% person worth the communication overhead?
For how many grid projects can you work?
14
What is wrong with the way we work?
All is good in version 2 or 3
  We got this answer when we discovered bizarre features.
  1.2.x has to be usable for some reasonable form of production!!
It is WPa’s responsibility
  We got this from two WPs at the same (critical) time.
  More good will? The UN? A stick?
I never got a job run….
  We have heard that from a few users, repeated all over.
  We are not LxBatch… But without knowing the commands it doesn’t even work there.
What is an SE??
  During the release of 1.1.4 we had several discussions on the Iteam list.
  For 1.2 there still seem to be open questions.
15
Other Problems
Readiness for Production: Problems
  Core functionality still missing in some areas (accounting, storage management)
  Robustness still not good enough (too many interventions needed)
  Users have to have too much “Site Awareness” (the Grid should hide this)
  Too many developers are not users (eat your own dog food!)
Get more users into the development, and people with production experience
16
What worked?
A lot (just a few examples).
The WP1 debugging (well, it almost worked)
  The close cooperation between JJ and the WP1 team was a good example of how we can make progress. We learned a lot from this. (Still, Francesco has to train his people.)
  It proved how important clear, short-term goals are (the infamous 1%)
  It proved how useful a test suite is (Ingo’s job storms, even if they are limited)
  The simple IRC tool proved to be almost perfect for the communication between the testers, developers and the sysadmins.
Resolving the Objectivity problem
  The communication between Iteam and CMS worked very well (Common Sense)
The initial start of the ATLAS data challenge test
  Having a real customer with real requirements and a challenging deadline sparked a lot of activity in EDG pointing in the right direction.