Frontier Status Alessandro De Salvo on behalf of the Frontier group


Frontier Status
Alessandro De Salvo on behalf of the Frontier group
25 June 2019

Sites’ news: CERN
- Investigating the possibility to upgrade to CentOS
- Migrate keeping the same IPs, if possible, or change the IPs
- Both options are viable, but keeping the old IPs would be desirable
- Smooth operation, no relevant problems observed since March

Sites’ news: Lyon (Emmanouil Vamvakopoulos)
- Stable infrastructure
- 4 Frontier launchpads
  - 3 x VM (4 vCPU, 8 GB RAM, 200G) on CC-openstack
  - 1 physical machine (DELL R430) with 2 x E5-2623 v3 @ 3.00GHz, 32 GB RAM, 2 x 1 TB disk
  - Frontier squid version 4.4-1.1
  - Frontier tomcat version 7.0.90_3.40-1
  - All monitoring tools installed and configured (snmp mrtg, awstats, max-threads, filebeat to send logs to ElasticSearch)
- Hardware upgrade of the Oracle infrastructure
  - The 4 dedicated machines shared between DBATL (ATRL replica) and DBAMI (AMI) were replaced in Sep 2015 (DELL 630 servers) with a 7-year warranty
  - The SAN storage back-end and the Fabric switches were replaced in 2014 with a 5-year warranty (Hitachi, HS130)
  - Plan to replace the SAN storage back-end by next year
- Recent problems with the DB infrastructure
  - Some isolated issues with storage back-end I/O saturation in Feb 2019
  - The back-end SAN storage is shared among many DBs/groups; another Oracle DB (on different nodes) triggered the I/O issue
  - Tuning performed by the DBMaster (on the other VO's DB, which triggered the issue)

Sites’ news: RAL (Tim Adye)
- Completed migration of the RAL Frontier service from the old Hyper-V to new VMware hypervisors [1 Apr 2019]
- 3 new servers on the LHC-OPN network, allowing logstash monitoring through the CERN firewall
- All monitoring now working: MRTG, logstash (Kibana), awstats, maxthreads
- Added IPv6 support [4 June 2019]
- All running smoothly
- One incident [2 June 2019], when the service was degraded for 14 hours because the disk was full on 2 of 3 servers. Fixed by enabling rotation for one logfile.
- RAL Frontier service supported until the end of 2019. As agreed with ADC, it will be decommissioned in January as Oracle is phased out at RAL.

ES/Kibana Monitoring
- Smooth operation
- New filters, enabling the parsing of the SQL queries
  - The filters are fast enough to guarantee no delay in operations
  - Very useful for debugging queries at all levels, including the analytics
- Lost-data incident in Chicago, temporarily affecting the ES/Kibana monitoring operations
  - Solved thanks to the prompt restore of the data
  - Data from May are not available; it is not clear why they were not recorded. Still under investigation, but not a critical problem
- Some progress on the CMS dashboards and data collection

Backup Proxies and Failover monitor
- Backup proxies
  - In production for a few months
  - No problems observed, and they were very useful to track down misconfigured sites or general problems
  - AGIS cleanup completed, removing the off-site squids from the AGIS site configs
  - Plans to inhibit direct connections to the launchpads starting in July
    - Need to coordinate with ADC in order to test this
- Failover monitor
  - Protection against corrupted awstats records added
    - Sometimes many "l" characters are added to the beginning of the hostname
    - This increases the number of distinct hostnames, resulting in a huge, hard-to-read table and a long script processing time
    - The problem is hard to debug, so it was decided to just put a protection into the monitoring
    - If there are more than 2 "l" characters in the hostname, they are counted and removed from the beginning of the hostname
  - Moving unidentified cern.ch hostnames from the Unknown site to the CERN-PROD site
    - The IP for hostnames of (presumably) VMs which end with cern.ch often cannot be identified
    - Now, every hostname in the Unknown site which ends with cern.ch is moved to the CERN-PROD site
  - IP ranges are now used to discern sites which share a geoip location
    - IP ranges are defined in the GOCDB site definition (or OIM equivalent)
    - IP ranges are rarely used: when one is defined, sites most often set it to 0.0.0.0/0 (which the script ignores, since every IP belongs to that range)
    - This makes a difference for very few sites right now
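The three failover monitor rules above can be illustrated with a short Python sketch. This is not the actual monitoring script: the function names, the "Unknown" site label as a plain string, and the dictionary layout for the IP ranges are all assumptions made for illustration. The "more than 2 l" rule is interpreted here as a run of "l" characters at the start of the hostname.

```python
import ipaddress
import re
from typing import Dict, List, Optional


def strip_leading_l(hostname: str) -> str:
    """Drop a leading run of 'l' characters when it is longer than two.

    Assumption: the corruption shows up as a run of 'l' at the very start.
    """
    match = re.match(r"l+", hostname)
    if match and len(match.group()) > 2:
        return hostname[len(match.group()):]
    return hostname


def assign_site(hostname: str, site: str) -> str:
    """Move hostnames ending with cern.ch from the Unknown site to CERN-PROD."""
    if site == "Unknown" and hostname.endswith("cern.ch"):
        return "CERN-PROD"
    return site


def site_from_ip(ip: str, site_ranges: Dict[str, List[str]]) -> Optional[str]:
    """Return the first site whose declared IP range contains the address.

    site_ranges is a hypothetical mapping of site name to CIDR strings.
    Catch-all ranges such as 0.0.0.0/0 are skipped, since every IP matches.
    """
    addr = ipaddress.ip_address(ip)
    for site, ranges in site_ranges.items():
        for cidr in ranges:
            net = ipaddress.ip_network(cidr, strict=False)
            if net.prefixlen == 0:
                continue  # catch-all range carries no information
            if addr in net:
                return site
    return None
```

Applied in this order (clean the hostname, try the IP ranges, then fall back to the cern.ch rule), a site that declares only 0.0.0.0/0 is treated as if it had declared no range at all.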

WLCG Squid Ops joint group
- New (joint) group of experts following up the issues shown in the monitoring pages (both Frontier and CVMFS)
- Michal Svatos (ATLAS), Edita Kizinevic (CMS) and Barry Blumenfeld (CMS)
- wlcg-squid-ops@cern.ch