Considerations for database servers Castor review – June 2006 Eric Grancher, Nilo Segura Chinchilla IT-DES.

Slides:



Advertisements
Similar presentations
Storage Review David Britton,21/Nov/ /03/2014 One Year Ago Time Line Apr-09 Jan-09 Oct-08 Jul-08 Apr-08 Jan-08 Oct-07 OC Data? Oversight.
Advertisements

ITEC474 INTRODUCTION.
Advanced Oracle DB tuning Performance can be defined in very different ways (OLTP versus DSS) Specific goals and targets must be set => clear recognition.
Database Tuning. Objectives Describe the roles associated with database tuning. Describe the dependency between tuning in different development phases.
Copyright © 2012 AirWatch, LLC. All rights reserved. Proprietary & Confidential. Mobile Content Strategies and Deployment Best Practices.
Acknowledgments Byron Bush, Scott S. Hilpert and Lee, JeongKyu
DB server limits (process/sessions) Carlos Fernando Gamboa, BNL Andrew Wong, TRIUMF WLCG Collaboration Workshop, CERN Geneva, April 2008.
High Availability Group 08: Võ Đức Vĩnh Nguyễn Quang Vũ
INTRODUCTION OS/2 was initially designed to extend the capabilities of DOS by IBM and Microsoft Corporations. To create a single industry-standard operating.
CASTOR Upgrade, Testing and Issues Shaun de Witt GRIDPP August 2010.
Working with SQL and PL/SQL/ Session 1 / 1 of 27 SQL Server Architecture.
Castor F2F Meeting Barbara Martelli Castor Database CNAF.
Module 15: Monitoring. Overview Formulate requirements and identify resources to monitor in a database environment Types of monitoring that can be carried.
Database Services for Physics at CERN with Oracle 10g RAC HEPiX - April 4th 2006, Rome Luca Canali, CERN.
Bob Thome, Senior Director of Product Management, Oracle SIMPLIFYING YOUR HIGH AVAILABILITY DATABASE.
M ODULE 2 D ATABASE I NSTALLATION AND C ONFIGURATION Section 1: DBMS Installation 1 ITEC 450 Fall 2012.
CERN - IT Department CH-1211 Genève 23 Switzerland t The High Performance Archiver for the LHC Experiments Manuel Gonzalez Berges CERN, Geneva.
Online Database Support Experiences Diana Bonham, Dennis Box, Anil Kumar, Julie Trumbo, Nelly Stanfield.
CERN - IT Department CH-1211 Genève 23 Switzerland t Tier0 database extensions and multi-core/64 bit studies Maria Girone, CERN IT-PSS LCG.
WLCG Service Report ~~~ WLCG Management Board, 27 th October
CERN IT Department CH-1211 Genève 23 Switzerland t EIS section review of recent activities Harry Renshall Andrea Sciabà IT-GS group meeting.
Operation of CASTOR at RAL Tier1 Review November 2007 Bonny Strong.
The protection of the DB against intentional or unintentional threats using computer-based or non- computer-based controls. Database Security – Part 2.
Oracle Tuning Considerations. Agenda Why Tune ? Why Tune ? Ways to Improve Performance Ways to Improve Performance Hardware Hardware Software Software.
Oracle Tuning Ashok Kapur Hawkeye Technology, Inc.
Selling the Storage Edition for Oracle November 2000.
CASTOR Databases at RAL Carmine Cioffi Database Administrator and Developer Castor Face to Face, RAL February 2009.
Achieving Scalability, Performance and Availability on Linux with Oracle 9iR2-RAC Grant McAlister Senior Database Engineer Amazon.com Paper
CERN Physics Database Services and Plans Maria Girone, CERN-IT
RAL Site Report Castor Face-to-Face meeting September 2014 Rob Appleyard, Shaun de Witt, Juan Sierra.
Introduction to Oracle. Oracle History 1979 Oracle Release client/server relational database 1989 Oracle Oracle 8 (object relational) 1999.
WLCG Service Report ~~~ WLCG Management Board, 9 th August
CERN - IT Department CH-1211 Genève 23 Switzerland t Oracle Real Application Clusters (RAC) Techniques for implementing & running robust.
Review of Recent CASTOR Database Problems at RAL Gordon D. Brown Rutherford Appleton Laboratory 3D/WLCG Workshop CERN, Geneva 11 th -14 th November 2008.
CASTOR evolution Presentation to HEPiX 2003, Vancouver 20/10/2003 Jean-Damien Durand, CERN-IT.
CERN-IT Oracle Database Physics Services Maria Girone, IT-DB 13 December 2004.
Computing Facilities CERN IT Department CH-1211 Geneva 23 Switzerland t CF Automatic server registration and burn-in framework HEPIX’13 28.
CERN IT Department CH-1211 Genève 23 Switzerland t Load Testing Dennis Waldron, CERN IT/DM/DA CASTOR Face-to-Face Meeting, Feb 19 th 2009.
CERN Database Services for the LHC Computing Grid Maria Girone, CERN.
Copyright 2007, Information Builders. Slide 1 Machine Sizing and Scalability Mark Nesson, Vashti Ragoonath June 2008.
CERN - IT Department CH-1211 Genève 23 Switzerland t High Availability Databases based on Oracle 10g RAC on Linux WLCG Tier2 Tutorials, CERN,
Oracle for Physics Services and Support Levels Maria Girone, IT-ADC 24 January 2005.
18 Copyright © 2004, Oracle. All rights reserved. Backup and Recovery Concepts.
Distributed Logging Facility Castor External Operation Workshop, CERN, November 14th 2006 Dennis Waldron CERN / IT.
Database Performance Eric Grancher - Nilo Segura Oracle Support Team IT/DES.
2 Copyright © 2006, Oracle. All rights reserved. Configuring Recovery Manager.
Maria Girone CERN - IT Tier0 plans and security and backup policy proposals Maria Girone, CERN IT-PSS.
CNAF Database Service Barbara Martelli CNAF-INFN Elisabetta Vilucchi CNAF-INFN Simone Dalla Fina INFN-Padua.
LHC Logging Cluster Nilo Segura IT/DB. Agenda ● Hardware Components ● Software Components ● Transparent Application Failover ● Service definition.
RAC aware change In order to reduce the cluster contention, a new scheme for the insertion has been developed. In the new scheme: - each “client” receives.
Database CNAF Barbara Martelli Rome, April 4 st 2006.
Status of tests in the LCG 3D database testbed Eva Dafonte Pérez LCG Database Deployment and Persistency Workshop.
CASTOR in SC Operational aspects Vladimír Bahyl CERN IT-FIO 3 2.
Log Shipping, Mirroring, Replication and Clustering Which should I use? That depends on a few questions we must ask the user. We will go over these questions.
REMINDER Check in on the COLLABORATE mobile app Best Practices for Oracle on VMware - Deep Dive Darryl Smith Chief Database Architect Distinguished Engineer.
I NTRODUCTION OF W EEK 2  Assignment Discussion  Due this week:  1-1 (Exam Proctor): everyone including in TLC  1-2 (SQL Review): review SQL  Review.
ASGC incident report ASGC/OPS Jason Shih Nov 26 th 2009 Distributed Database Operations Workshop.
DB Questions and Answers open session (comments during session) WLCG Collaboration Workshop, CERN Geneva, 24 of April 2008.
WLCG Accounting Task Force Update Julia Andreeva CERN GDB, 8 th of June,
Cofax Scalability Document Version Scaling Cofax in General The scalability of Cofax is directly related to the system software, hardware and network.
Sql Server Architecture for World Domination Tristan Wilson.
Jean-Philippe Baud, IT-GD, CERN November 2007
Oracle structures on database applications development
Database Services at CERN Status Update
Service Challenge 3 CERN
Maximum Availability Architecture Enterprise Technology Centre.
Castor services at the Tier-0
SQL Server Monitoring Overview
Get Oracle 8i Running on Your Linux Server Straight Away!
Introduction of Week 3 Assignment Discussion
Presentation transcript:

Considerations for database servers Castor review – June 2006 Eric Grancher, Nilo Segura Chinchilla IT-DES

Eric Grancher / Nilo Segura Chinchilla (IT/DES) 2 Outline  Current implementation  Machine setup  Database setup  Throttling  Return on experience  Documentation, operations and sharing with Tier1 sites  Upcoming implementation  Our conclusions

Eric Grancher / Nilo Segura Chinchilla (IT/DES) 3 Questions to be addressed  _Only_ for the database part  3. Is the service infrastructure, and particularly the database server hardware, appropriate?  4. Are the necessary operational procedures in place (with adequate trained personnel) to ensure the MoU service reliability commitments can be met?  5. Does the overall service have the required performance and scalability the Tier0? For the CAF? What aspects of the software or system limit performance or scalability? Are these limits a concern?  6. Is the current policy for the support of Castor at other sites effective and sustainable?

Eric Grancher / Nilo Segura Chinchilla (IT/DES) 4 Return on the database architecture  Castor(1) name server database  Castor(2) stager database  Castor(2) distributed logging facility database  SRM database  Different worklaod, can be grouped.

Eric Grancher / Nilo Segura Chinchilla (IT/DES) 5 Current implementation  1 Castor name server database  6 Castor2 environments (each 2 database servers stager + DLF): ALICE, ATLAS, CMS, LHCb, ITDC, SC4  Castor2 test (RAC system and DLF) and dev  Castor2 SRM database  -> 18 database systems  Using “CERN disk servers”  RedHat Enterprise Linux (3ES x86)  Oracle Enterprise Edition database (10.2)

Eric Grancher / Nilo Segura Chinchilla (IT/DES) 6 Issues with “CERN disk servers”  Hardware problems  Lack of support from the vendor  Hardware problems, corruptions (see next slide)  Machines are not RedHat Enterprise certified  Ultimately Oracle Support can reject calls (has happened: Oracle Support “ask your HW vendor to study the issue and review the certification”)  Not designed for database workload  Database systems require a lot of IO operations per second (typically 8kB) -the case for the Castor2 stager- and very few “large IO”  Lack of performance / memory  New preferred deployment Oracle platform is Linux x86-64  Larger memory  Several CPUs and/or cores

Eric Grancher / Nilo Segura Chinchilla (IT/DES) 7 Corruptions  11 corruptions worked on between November 2005 and April 2006  On control files  On redo log files  On data files Fri Nov 11 01:16: Errors in file /ORA/dbs00/oracle/admin/CASTORSG/bdump/castorsg_arc0_7289.trc: ORA-00354: corrupt redo log block header ORA-00353: log corruption near block change time 11/11/ :00:59  Corruptions typically imply service downtime (lengthy recovery) and lot of work, may generate data loss.  For Castor2, no data lost (but data loss on similar types of hardware for other services).  Many of these corruptions have been fixed by the DBA during the night, delicate to automate fix (depends on type / place of corruption, not doing the right thing might harm more than help!).  (since 1996), never had any of these type of corruptions on other platforms (Oracle writes a block which can be read later but not consistent with the format).  Many thanks to IT-FIO Castor deployment and Linux Support for their help / investigations / new machines installations on the hardware platform.

Eric Grancher / Nilo Segura Chinchilla (IT/DES) 8 Throttling  The “database system” has to sustain the load and manage priorities (even in worst cases, where it should throttle the load while still being reactive)  Throttling  CPU in case of resource starvation  Active sessions (better usage of resource, limits latch inter-locking)  Strict limitations  PGA Memory (generate ORA-600 to be trapped)  Sessions (only for new session, no issue in stable usage, “garde-fou”)  SGA maxsize (should not be reached, but…)  Hints for the system PGA target SGA target  Space of course Each user / major component has its own tablespace with limitations

Eric Grancher / Nilo Segura Chinchilla (IT/DES) 9 Castor NameServer (DB) scalability  Using a test (thanks to Olof/Ben), multithreaded / loop  Cns_creat  Cns_stat  Cns_setfsize  Cns_filestat  Without problem, up to 920 Castor nameserver operations per second (with 15 threads).  Oracle database is fully instrumented. We use the “wait event interface” as the main source for performance tuning (and early indication of scalability issues).

Eric Grancher / Nilo Segura Chinchilla (IT/DES) 10 Castor NameServer (DB) figures EventWaits Time (s) Avg Wait(ms) % Total Call Time Wait Class enq: TX - row lock contention135,8598, Application log file parallel write67,2824, System I/O log file sync75,3334, Commit wait for scn ack43,8991, Other control file sequential read6, System I/O

Eric Grancher / Nilo Segura Chinchilla (IT/DES) 11 New platform  Use of the cluster (RAC) and standby (DataGuard) technologies understanding its limits  Certified hardware (tested / validated / supported / guaranteed that the HW supplier and Oracle will work together on eventual issues)  Flexible (depending on needs)  NS, stager, DLF, read-status  Eventually in a dynamic way  Flexible (can have components made more powerful)  Can scale with load

Eric Grancher / Nilo Segura Chinchilla (IT/DES) 12 Initial to future architectures Initial to future architectures DLF ALICE stager DLF ATLAS stager DLF CMS stager DLF LHCb stager DLF … stager DLF stager standby ALICE DLF stager standby CMS DLF stager standby ATLAS DLF stager standby LHCb … NS-CMS NS-LHCb NS-ALICE NS-ATLAS Name Server (unique db) standby … SRM… SRM-… NSSRM

Eric Grancher / Nilo Segura Chinchilla (IT/DES) 13 Variations / combinations  We may see that the usage patterns actually put a high-load on “name server”, stager, “status read”…  Oracle clustering (RAC) make it possible to re-allocate resource where needed  1 node for DLF + 2 nodes for stager versus  1 node for stager + 2 nodes for DLF…  Having standby (DataGuard) enables us to perform almost upgrades with very minimal downtime (<1 minute), cluster (RAC) does not.  All database instances in a cluster share the same database, if there is a corruption, all nodes in the cluster may stop. With a standby, databases do not share “files”.  Cluster and standby combination is the Oracle recommended maximum availability architecture.

Eric Grancher / Nilo Segura Chinchilla (IT/DES) 14 Same hardware anyway DLF stager standby NS-CMS NS-LHCb NS-ALICE NS-ATLAS Name Server (unique db) NS-ATLAS NS-ALICE = DLF stager standby servers storage network storage arrays NS-CMS

Eric Grancher / Nilo Segura Chinchilla (IT/DES) 15 Collaboration with other institutes  Oracle licensing.  Castor2 has been tested and works with Oracle Express Edition (free / limited version).  Provide the creation scripts (with space and user allocation throttling/limits…) and can provide help for the initial deployment.  Full production support has to be done “locally”.  Provide “Oracle statistics” (to have the same execution plans than the ones validated at CERN).  Provide recipes / CERN Castor database guideline for deployment.  Sharing of information ( for now, will to have phone calls, meetings).

Eric Grancher / Nilo Segura Chinchilla (IT/DES) 16 Operations  Database administrators are part of castor-deployment mailing list.  Very good interaction with the Castor dev/deployment teams.  Database administrators can be called 24x7 (we also cover Christmas) with rotation on the people. Operators have the procedure (and had to exercise it , see corruptions).  All operations are logged in Twiki (visible to Castor deployment team, experiments).  Reference documents for installation  Space allocation  “Oracle user” privileges  Throttling and limits  Table statistics

Eric Grancher / Nilo Segura Chinchilla (IT/DES) 17 Security  Model follows security best practices  no extra processes ran on the database servers,  No OS user login (administrators via ssh key).  OS and Oracle patches applied on a regular basis).  Use Oracle advanced security mechanisms can be used (encryption, more secure authentication)  Tested and used on other services, non-significant overhead (especially for non-bandwidth intensive applications).  CERN has Oracle license for it.  To be validated, but a priori no issue to be faced.

Eric Grancher / Nilo Segura Chinchilla (IT/DES) 18 Our conclusions  A lot of experience gathered with the “initial” deployment (for the throttling / limits, for the execution plans / together with the other sites).  We have “encouraged” an early move for all databases to database version 10.2 (10gR2), “premier support” until July 2010 (extended 2013).  Security of the platform is correct satisfactory, more can be done if required (transparent for the application).  We believe, based on our experience, that adequate hardware / infrastructure is mandatory for a service to reach its quality and availability goals.

Eric Grancher / Nilo Segura Chinchilla (IT/DES) 19 References  Oracle support lifetime-support-policy-datasheet.pdf lifetime-support-policy-datasheet.pdf  Olof’s stress test on the stager (stress on Oracle) stress-testing.pdf stress-testing.pdf