
EGEE-II INFSO-RI-031688 Enabling Grids for E-sciencE www.eu-egee.org EGEE and gLite are registered trademarks CYFRONET site report Marcin Radecki CYFRONET.




1 CYFRONET site report
Marcin Radecki, CYFRONET
EGEE-II INFSO-RI-031688, Enabling Grids for E-sciencE, www.eu-egee.org
EGEE and gLite are registered trademarks

2 Enabling Grids for E-sciencE, EGEE-II INFSO-RI-031688. Operations Workshop, Stockholm, 13-15.06.2007
CYFRONET - overview
Site scale
- 150 dual CPU nodes
  - 40 CPUs ia64, Itanium2
  - 260 CPUs ia32, Xeon
Production and PPS service
Core Services
- Toplevel BDII, WMS, RB
- LFC, VOMS for Gaussian, Turbomole VOs
- Regional SAM instance
Tools we use for site management
- OS deployment: local RPM repository + network install
- YAIM for site configuration
- Ganglia for monitoring

3 Daily operations
Tools used in daily grid operations
- Monitoring
  - SAM, GStat
  - Regional Nagios instance http://nagios.ce-egee.org
    - E-mail notifications (Core Service alerts arrive as SMS)
    - Smart test hierarchy: always report the right problem
- Documentation
  - Wikis: GOC, CE ROC, SEE ROC, other
  - Other web pages
- Support
  - lcgrollout
  - Regional 1st line support team: just need to answer tickets and cooperate
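The "smart test hierarchy" idea can be illustrated with a minimal sketch: when tests depend on one another, only failures with no failing ancestor are reported, so a dead host is not also reported as a failed job submission. The test names and the `DEPENDS_ON` map below are hypothetical, since the slide does not list the actual hierarchy used in the regional Nagios instance.

```python
# Hypothetical hierarchy: each test maps to the test it depends on (None = top level).
DEPENDS_ON = {
    "host-ping": None,
    "ce-gatekeeper": "host-ping",
    "ce-job-submit": "ce-gatekeeper",
    "se-srm": "host-ping",
}


def root_failures(failing, depends_on):
    """Return the failing tests that have no failing ancestor."""
    roots = []
    for test in sorted(failing):
        parent = depends_on.get(test)
        masked = False
        # Walk up the hierarchy; a failing ancestor means this failure is only a symptom.
        while parent is not None:
            if parent in failing:
                masked = True
                break
            parent = depends_on.get(parent)
        if not masked:
            roots.append(test)
    return roots
```

With this sketch, a host outage that also breaks the gatekeeper and job submission is reported once, as the `host-ping` failure.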

4 Missing features
Documentation
- Proper operations manuals for grid middleware (e.g. host certificate copies on WMS)
- Better information access: documentation is too dispersed and hidden in local pages
Deployment
- Grid middleware could be more similar to the standard network services used on servers
  - Standard/consistent paths and files for logs, pids etc.
  - Standard/consistent logging format so logwatch and logcheck can use them
  - No hard-coded values in installation and configuration scripts
  - Only the components needed for standard operation of a service in the distribution
Monitoring
- SAM web interface improvements:
  - failures for all VOs for a given site on one page
  - all sensors for a given site on one page
  - all sensors for the whole region on one page
- Some monitoring could be done at the site to avoid failures that are outside of the site
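To make the request for a standard logging format concrete, here is a minimal sketch of one consistent line layout that a log scanner could parse with a single pattern. The layout (timestamp, service name, pid, level, message), the function names, and the example service name are assumptions for illustration; they are not a format used by gLite, logwatch, or logcheck.

```python
import re
from datetime import datetime

# Assumed layout: "<ISO timestamp> <service>[<pid>]: <LEVEL> <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}) "
    r"(?P<service>[\w.-]+)\[(?P<pid>\d+)\]: "
    r"(?P<level>[A-Z]+) (?P<msg>.*)$"
)


def format_line(ts, service, pid, level, msg):
    """Render one log line in the assumed common layout."""
    return f"{ts.strftime('%Y-%m-%dT%H:%M:%S')} {service}[{pid}]: {level} {msg}"


def parse_line(line):
    """Return the fields of a log line as a dict, or None if it does not match."""
    match = LINE_RE.match(line)
    return match.groupdict() if match else None
```

If every service wrote lines in one such layout, a single rule set in a log checker could cover all of them instead of one rule set per component.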

5 Most Frequent Failures
Most frequent scheduled interventions
- HW/SW upgrade, maintenance, reconfiguration
Most frequent unscheduled interventions
- HW/power problems, bugs in middleware updates (WMS, RB)
Percentage of real problems detected by COD
- A few percent
- Majority detected and reported by the regional 1st line support team + regional monitoring system

6 Seamless Deployment
Deploying Core Services
- Seamless toplevel BDII deployment is ready: we use a special setup for the toplevel BDII, with several machines, that supports failover and maintenance work
- Update instances one after another
Ideas for seamless deployment
- Services sometimes fail just after an upgrade, so releases need thorough testing by the ROC
- Own testbed and repository, with updates published only after proper testing, so sites can safely autoupdate
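The "update instances one after another" approach can be sketched as a rolling update: take one instance out of service, update it, check it, and only then move on, so the remaining machines keep answering queries. This is a minimal sketch under stated assumptions; `update` and `is_healthy` are hypothetical callables standing in for whatever procedure and check a site actually uses, and the slide does not describe the real CYFRONET setup in this detail.

```python
def rolling_update(instances, update, is_healthy):
    """Update instances one at a time; stop at the first unhealthy one.

    Returns (updated, in_service): the instances updated successfully and
    the set of instances still serving when the procedure ends.
    """
    in_service = set(instances)
    updated = []
    for inst in instances:
        # Never drain the last serving instance, otherwise the service goes down.
        if len(in_service) <= 1:
            raise RuntimeError("refusing to drain the last serving instance")
        in_service.discard(inst)
        update(inst)
        if not is_healthy(inst):
            # Leave the broken instance out of service and stop the rollout.
            return updated, in_service
        in_service.add(inst)
        updated.append(inst)
    return updated, in_service
```

Stopping at the first unhealthy instance limits a bad release to one machine, which fits the slide's observation that services sometimes fail just after an upgrade.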

7 Communication
Communication with users
- Direct e-mails, via GGUS or the regional helpdesk
Correlation of cross-site issues
- Enough for the moment
- Via the regional mailing list, Broadcast tool, operations meetings
Points to improve communication
- Tickets and e-mail notifications about operational problems should be answered more promptly, especially if a Central Service is failing
- Better web pages with cross links and some user-friendly collaboration technologies for exchanging experience and knowledge. Ticketing systems are often inhuman and too formal

8 Usefulness of bodies
- COD: problem notification body, makes sure problems are solved, takes care of monitoring tools
- 1st line support team:
  - assists site admins until the problem is solved or a software bug is reported
  - very helpful for inexperienced admins
  - overlaps with COD: some cooperation between 1st line support and the COD team on duty could be established
- Operations meeting
  - tackles unsolvable problems detected by 1st line support, and operational problems with core services and middleware
  - puts pressure on stuck GGUS tickets

