1
GridKa Cloud in WLCG Grid
Andrzej Olszewski, Institute of Nuclear Physics PAN, Kraków
LCGFrance Meeting, Grenoble, 27 November 2008
2
GridKa Cloud
3
Tier2 Sites in GridKa Cloud
Site, Country, CPU [kSi2k], Disk [TB], LHC VOs
Desy-HH, Germany, 300, 130, Atlas, CMS
Desy-ZN
GOEGRID, 400, 100, Atlas
LMU/LRZ
MPI/RZG, 330, 150
Wuppertal
Freiburg
CYFRONET, Poland, Atlas, Alice, LHCb
PSNC, 10, Atlas, Alice
CSCS, Switzerland, 200, 90, Atlas, LHCb, CMS
FZU, Czech, 160, 40
HEPHY-UIBK, Austria, 50
4
Cloud in Atlas Experiment
WLCG Grid is a hierarchy of Tier1, Tier2 and Tier3 sites
  Each Tier1 site has an associated Cloud of Tier2 sites
  The Tier1 works in a team with its Cloud as a standalone unit
  Together they participate in worldwide data distribution and simulation production, which in the case of Atlas is driven centrally at CERN
The role of Tier1 and Tier2 sites varies with each experiment’s computing model
  In general a Tier1 could work alone, but Tier2 sites depend on Tier1 services
  Tier1 runs the file transfer service and data catalogs (FTS/LFC) for its Cloud
  Tier1 also runs DDM and production client services (DQ2/Pilot factory)
  Services run at Tier1 are essential to the health of the Cloud
A Tier1 site with its associated Cloud has an assigned share of Atlas production
  Tier1 handles its production tasks and its share of data storage and data distribution
  Tier1 must also handle data from production in the Cloud
  Tier2 Cloud provides resources and services for simulation production
  Tier2 Cloud provides storage and access to analysis data for physicists
  Tier2 Cloud provides resources and services for distributed analysis
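The dependency described on this slide (Tier2 sites rely on services that the Tier1 runs for the whole Cloud) can be sketched as a small data structure. This is only an illustration: the class name `Cloud`, its fields and methods are hypothetical, and the rule that any failed Tier1 service blocks every Tier2 site is a simplifying assumption, not something stated in the talk. Site and service names are taken from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Cloud:
    """A Tier1 site together with its associated Tier2 sites (hypothetical model)."""
    tier1: str
    # Services the Tier1 runs for the whole Cloud; True means "currently up".
    tier1_services: dict = field(default_factory=dict)
    tier2_sites: list = field(default_factory=list)

    def failed_services(self) -> list:
        """Names of Tier1 services that are currently down, sorted."""
        return sorted(name for name, up in self.tier1_services.items() if not up)

    def usable_tier2_sites(self) -> list:
        """Tier2 sites that can work, given that they depend on Tier1 services."""
        if self.failed_services():
            # Simplifying assumption: any failed Tier1 service blocks every Tier2 site.
            return []
        return list(self.tier2_sites)


gridka = Cloud(
    tier1="FZK",
    tier1_services={"FTS": True, "LFC": True, "DQ2": True, "pilot factory": True},
    tier2_sites=["CYFRONET", "PSNC", "FZU", "CSCS"],
)
print(gridka.usable_tier2_sites())

# Simulate an LFC outage at the Tier1: the Tier2 sites can no longer be used.
gridka.tier1_services["LFC"] = False
print(gridka.failed_services(), gridka.usable_tier2_sites())
```

A real check would probably distinguish between services (for example, a catalog outage affects data access differently than a pilot factory outage), but the sketch shows why Tier1 stability dominates Cloud performance.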
5
Tasks in the Atlas Cloud
Keep Cloud in contact with WLCG Grid
  Install and update required services
  Monitor availability of pledged resources
  Provide and monitor availability of basic services
Keep Cloud in contact with Atlas experiment
  Keep information flowing in both directions
  Participate in designing Atlas operations
  Supervise the implementation of Atlas requirements
Run Atlas DDM and Production client software
  Detect (monitor) problems with Site Services, Atlas Services, Operations
  Diagnose problems (provide understanding) at the expert level
  Cooperate with Atlas/sites/users to fix problems
What does this mean?
  Keep the system running according to Atlas model requirements, by:
    Monitoring sites’ and services’ readiness and performance
    Responding quickly to problems
Who is doing this?
  Tier2s and their associated manpower provide much of the driving force
6
Atlas People in GridKa Cloud
Activities started in 2005/2006 by people from Munich: Guenter Duckeck, John Kennedy, Cedric Serfon
  + volunteers from the Czech Republic (Jiri Chudoba) and from Poland (Andrzej Olszewski)
  Guenter and John doing organization; John working everywhere, in particular on simulation production; Cedric on data management; Andrzej on testing and monitoring Tier2 sites; Jiri helping everywhere and doing user support
In 2006 the idea of Cloud organization was clarified
  Cloud coordinator + deputy coordinator
  Several Cloud support groups
  Atlas contact at Tier1 FZK
  Atlas contacts at Tier2 sites
7
Cloud Coordinator
Cloud coordinator in charge of overall (co)operation
  + Deputy to take over after the end of term
  Keeps Cloud in contact with WLCG and Atlas plans and requirements
  Coordinates cooperation between site contacts and Cloud support groups
  Monitors, keeps track of and coordinates solving problems
  Organizes internal Cloud meetings
  Represents Cloud in ATLAS operations meeting
  High priority task!
First GridKa Cloud coordinator: John Kennedy (2006 – mid 2008)
2008: Proposal of a rotating Cloud coordinator with a 6-month term
  Current coordinator: Andrzej Olszewski, accepted first term: June-Dec 2008
  Current deputy: Simon Nderitu
In Germany the D-Grid national grid organization runs in parallel to WLCG
  Cloud needs to cooperate with D-Grid since both work for Atlas
  Guenter Duckeck is a D-Grid representative of HEPCG (HEP Computing Grid)
  Guenter contributes a lot on organizational and practical matters, especially when German sites are involved
8
Cloud Support Groups
DDM operations: Cedric Serfon
  Management and configuration of storage resources
  Running DDM services and operations at the Cloud level
  Monitoring DDM operations and solving problems
Production operations: John Kennedy, Kendall Reeves -> Rod Walker
  Configuring sites for Panda production system
  Running pilot factory
  Monitoring and solving problems
SW & DB installation: Gernot Krobath -> Julien de Graat, Andrzej Olszewski, John Kennedy
  Configuring sites for Atlas software requirements
  Monitoring status of software installation and solving problems
Monitoring: Stefan Birkholz, Arnulf Quadt, Cano Ay
  Cloud custom monitoring for a quick check of the status of basic activities
9
Atlas GridKa Tier1 Contact
Tier1 has many tasks that are specific to this large unit of WLCG Grid
  Tier1 is required to provide its services with a high level of availability and reliability
  Instabilities/interruptions cause a long tail of GridKa and the Cloud going out of production, low efficiency in data transfers, and low reliability in distributed analysis performed by users
  Cloud and Tier2 sites belonging to the Cloud rely heavily on the functioning of Tier1
  Cloud performance strongly depends on Tier1
GridKa Tier1 contact: first Andreas Heiss, currently Simon Nderitu
  Dedicated Atlas contact
  Active involvement in ATLAS GridKa cloud meetings & mailing lists
  Assures closer contact and cooperation with GridKa admins on FTS, LFC, dCache
    GGUS trouble ticket system is not useful to initiate & manage operations; we need fast feedback & a single first contact
  Follows problems that need a solution at the Tier1 site admin level
  Organizes Atlas Tier1 specific activities
10
Atlas Tier2 Site Contacts
There is no direct communication from Atlas to Tier2 sites
  Except for pure Grid type trouble tickets
  Tier2 sites are integrated by participating in Cloud activities
Each Tier2 site participating in the Cloud needs a single main contact
  Participates in GridKa cloud meetings & lists
  High priority task
Main tasks
  Transfer information on Atlas requirements to the local site
  Organize Atlas computing
  Chase up problems, manage troubleshooting through good contacts with local admins
11
Atlas Accounting of Cloud Tasks
Coordinator, 0.5 FTE: John Kennedy, Andrzej Olszewski, Guenter Duckeck
Atlas Tier1 contact, 1 FTE: Simon Nderitu as ATLAS contact + various GridKa admins
DDM Cloud manager, 1 FTE: Cedric Serfon mostly + Kai Leffhalm and others
Production Cloud manager, 0.5 FTE: Kendall Reeves + John Kennedy
SW Installation, 0.3 FTE (proposal not accepted): Gernoth Krobath mostly + A. Olszewski, J. Kennedy
Atlas Tier2 contacts, 2 FTE / 4 * 0.5 FTE per average Tier2, distributed over 8 Tier2 sites in GridKa Cloud
12
Organization Structure
[Organization chart. Boxes shown: WLCG; Atlas (Atlas Teams, Atlas Operations, Atlas Shifts); D-Grid; GridKa Cloud (FZK Representative, Cloud Coordinator, Tier1 Contact); Cloud support groups (Soft Installation, Production, DDM); sites CYF, PSNC, FZU, CSCS, HEPHY, DESY-HH, FREI, WUPP, MPI/RZG, LMU/LRZ, DESY-ZN, GOETT]
13
Cloud Communication: Role of Coordinator
Primary contact person for Atlas, Cloud Support groups and Tier2 sites
  Cooperates with Atlas Tier1 contact person
  Cooperates with national representatives (Guenter Duckeck for Germany)
  Collects and provides info about Atlas requirements and Cloud status
  Proposes and implements global Cloud policies in consultation with Cloud members
  Collaborates with Cloud Support groups by monitoring problems, discussing solution proposals and setting priorities
Tools
  Static web pages
  Atlas/WLCG operations meetings, Atlas/WLCG conferences, high level meetings (GDB, ICB, ...)
  TAB and FZK weekly meeting (Simon and Guenter)
  Cloud Meetings: Monthly, Weekly and F2F
  Mail to Atlas Cloud mailing list
  Mail to Cloud Support Groups via Cloud management list
  Mail to Site Contacts (exceptionally to Site Admins)
Separate coordination & technical meetings
14
Communication Tools
Cloud Meetings
  Face2Face (F2F): 1-2 times a year
    Discuss major topics and strategies as proposed by Atlas and Cloud Coordinator
    Overview of sites’ development plans
  Monthly: once a month, on the first Wednesday of the month
    Summary of Atlas plans and activities during the last month
    Any special hot topics and Atlas requirements with respect to sites
    Summary of current plans and issues seen by sites during the last month
  Weekly: meeting of Coordinator, Tier1 contact and Cloud Support group reps
    Discussing issues from the last week, following up problems raised at monthly meetings
    Creating and following weekly actions to improve Cloud operations and performance
Cloud mailing lists
  General Cloud mailing list of Atlas collaborators and users in GridKa Cloud
  Cloud management mailing list: Coordinators, Tier1 and Cloud Support Groups
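The monthly meeting rule on this slide (first Wednesday of each month) is easy to compute when generating a meeting calendar. The sketch below is only an illustration of that rule; the function name `first_wednesday` is hypothetical and not part of any Cloud tool.

```python
import datetime


def first_wednesday(year: int, month: int) -> datetime.date:
    """Return the date of the first Wednesday of the given month."""
    first = datetime.date(year, month, 1)
    # date.weekday() counts Monday as 0, so Wednesday is 2.
    offset = (2 - first.weekday()) % 7
    return first + datetime.timedelta(days=offset)


# Monthly Cloud meeting dates for the last quarter of 2008.
for month in (10, 11, 12):
    print(first_wednesday(2008, month))
```

The modulo keeps the offset between 0 and 6, so a month that starts on a Wednesday returns its first day.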
15
Cloud Communication: Cloud and Atlas
Info on Atlas requirements, Monitoring and Support for Atlas Production
  Cloud coordinator and support group coordinators participate in Atlas meetings, subscribe to mailing lists, collect info about Atlas plans
    We should ask Atlas to send clear summary info to Cloud coordinators (similar to the Tier1 list). If there were such a list and it were actively used, participating in so many meetings and subscribing to so many mailing lists might not be necessary
  Cloud coordinators report to Atlas during conferences and in operations and various management meetings
  Cloud regional support should cooperate closely with central Atlas support groups. We need to use central communication tools (Savannah for bug reports, eLog webs for shifts & experts to note the problems being followed) to be aware of the other side's actions
  GGUS tickets should be forwarded to the Cloud mailing list as well as to the sites
Tools
  Atlas web pages with requirements
  ADC and ADCOS mailing lists and meetings
  Cloud Coordinator mailing list
  Savannah and eLog tools to notify Atlas about Cloud support activity in solving problems
  Mails to GridKa Cloud management and Cloud Support mailing lists
16
Cloud Communication: Cloud and Sites
Cloud sending Atlas info to sites, sites responding
  Mail exchange when transferring info about Atlas requirements, problems, status of work
  But no quick info from sites on current problems unless discussion is initiated by the Cloud
  Evaluation of site use, performance and problems found during Atlas operations; review of site availability and usage by Atlas from (EGEE, ...) accounting
  Info about site changes and upgrade plans provided during the meetings
  Review of action items for sites and Atlas from monthly meetings
  Discussing long term Atlas, Cloud and Site plans
Tools
  Mails on atlas-germany, to specific site contact people and directly to site admins
  Mails from site contacts to Cloud mailing list
  Support mailing lists: GridKa (dCache, FTS, LFC)
  GGUS, DECH tickets
    Slow transfer to site admins: more than 24h on weekends
    Now, with direct site reference, should be processed faster
  Meetings: Monthly and F2F
  Accounting and availability on a monthly basis
17
Cloud Communication: Cloud and Users
User Support
  Notification of problems from users to Atlas Cloud support and DECH support
  Also those transferred from Atlas global user support
Tools
  Hypernews GridKa Forum
  Atlas GridKa Cloud mailing list to reach users
  DECH support
18
Cloud Communication: Internal Cloud Support
Internal communication
  No scheduled Cloud monitoring
  Internal cloud support via mail exchange: notifying Cloud about problems under investigation, notifying about status and problem solutions
  Mail exchange is difficult to follow; no place with a permanent record (eLog?) of actions
  Weekly meeting of Cloud coordinator, Cloud support groups and Tier1 coordinators, discussing current issues and relaying info from specialized Atlas meetings that support coordinators attend
Tools
  Cloud support Wiki page: current support group details
  GridKa Cloud mailing list for notifying about taking tasks
  Cloud support Wiki page: support group solutions for problems on sites
  Use Atlas general communication tools, like: eLog, IM, Team GGUS tickets
  Create and start using specific Cloud channels?
  Weekly Meetings: started in June
19
Cloud Info: static
Cloud presentation page: link
  Sites
  Site grid info links
  Contact people at sites
Cloud status page: link
  Description of Atlas plans, requirements
  Status of Cloud readiness with respect to Atlas requirements and operations
  Information updated more often than the once-a-month Cloud meeting
    Info provided by support subgroups during weekly meetings
    Updates from sites as soon as the situation changes
Cloud details
  Resources: link, Software, Services: link etc.
Cloud support: link
  Description of a support model in the cloud
  Contact people for cloud support subgroups
  Site specific info page from support subgroups
    Software installation tricks on special sites
    Non-obvious site-specific solutions for production
    Links to Atlas help pages
These pages will be a great help for the coordinator, cloud support team, Atlas support team and cloud users. They will save us the time spent keeping our own information notes.
23
Cloud Info: dynamic Cloud Monitoring
Cloud Monitoring page
  For quick detection and diagnosis of problems by Cloud support groups and sites
  Provides first-glance info on the status of Atlas activities on Cloud sites, indicated by status color, plus links to more detailed info for diagnosing problems
  Monitors Atlas DDM and Production
  Monitors SAM tests on Cloud sites, including GangaRobot (analysis test)
Features wish list
  Monitoring of Atlas DDM and Production tests
  Monitoring of Atlas software installation status
  Links to monitoring of FZK Tier1, Tier2 site services
  Links to network: status and current load
  List of tickets sent to sites in the Cloud
  Improve bad site status detection by selecting only SAM tests critical for Atlas production
  Automate problem detection and add sending internal alarms, tickets
Implementation wishes
  Should provide and use more general development frameworks and tools
  Should be more configurable and customizable
  Should allow customized monitoring at the Experiment, Cloud, Site levels
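Two items on the wish list, judging site status only from SAM tests critical for Atlas production and automating problem detection, can be sketched as follows. This is a minimal illustration under stated assumptions: the test names in `CRITICAL_FOR_ATLAS` are made up (the slides do not list the real critical tests), the colors "green", "red" and "grey" are assumed, and the functions `site_status` and `sites_needing_alarm` are hypothetical rather than part of the actual monitoring page.

```python
# Hypothetical test names; the real list of Atlas-critical SAM tests
# is not given in the slides.
CRITICAL_FOR_ATLAS = frozenset({"job-submit", "srm-put", "srm-get"})


def site_status(results: dict, critical=CRITICAL_FOR_ATLAS) -> str:
    """Map one site's test results (name -> passed?) to a status color.

    Only tests in `critical` are considered, so a failing non-critical
    test does not turn the site red.
    """
    relevant = [ok for name, ok in results.items() if name in critical]
    if not relevant:
        return "grey"  # no critical test results available
    return "green" if all(relevant) else "red"


def sites_needing_alarm(cloud_results: dict) -> list:
    """Sorted names of sites whose critical tests currently fail."""
    return sorted(
        site for site, results in cloud_results.items()
        if site_status(results) == "red"
    )


# Example input with invented results for two Cloud sites.
example = {
    "FZU": {"job-submit": True, "srm-put": True, "other-test": False},
    "PSNC": {"job-submit": False, "srm-put": True},
}
for site, results in example.items():
    print(site, site_status(results))
print("alarm:", sites_needing_alarm(example))
```

Keeping the critical set as a parameter matches the wish for monitoring that can be customized at the Experiment, Cloud and Site levels: each level could pass its own set of tests.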
24
Cloud Collaboration with Atlas Shifts
Scheduled monitoring by Cloud experts not feasible
  Not enough support people
  Would replicate work done by shift groups
  Difficult to get Cloud monitoring accounted for by Atlas
Need to participate in and draw manpower from global Atlas support and shift groups
  Benefit of earning Atlas credit for work done for the benefit of the whole Atlas community
  Avoid duplication of work in doing monitoring shifts, diagnosing problems etc.
Interface with global Atlas operations and shifts
  Clearly assigns responsibilities
  Ensures that all areas are covered all the time
  Use tools and procedures used in global Atlas support
  Requires good communication between support people inside the Cloud and with Atlas groups
  Provide Cloud information details for outside people participating in debugging problems
Cloud experts should concentrate on good communication with Tier1 and Tier2 site contacts
  Should be on a mailing list receiving problems detected by Atlas shift people
  ADCoS ELOG + T0 ELOG + ELOG
  Savannah for DDM problem tracking
  Team Tickets for all shifts