Published by Moses Briggs. Modified over 9 years ago.
1
LHC Logging Cluster
Nilo Segura, IT/DB
2
Agenda
● Hardware Components
● Software Components
● Transparent Application Failover
● Service definition
3
Hardware Components
● Two Sun Fire V240
  – Dual CPU 1 GHz, 4 GB memory
  – Dual internal disks, dual power supply
● One Sun StorEdge 3510FC
  – 2 Gb Fibre Channel architecture
  – 12 x 146 GB 10k RPM FC disks
  – Two RAID controllers with 1 GB cache
  – Can accept up to 2 x 3510 JBOD expansion trays
● Both machines share the same set of disks
  – The 3510 can accept up to 8 directly attached hosts (or up to 4 with a redundant configuration)
4
Software Components
● Sun Cluster 3.1 Update 1
● Solaris 9
● Oracle RDBMS 9.2.0.5 (Real Application Cluster)
● Oracle Distributed Lock Manager
● Veritas Volume Manager 3.5
● Sun certification completed
  – checking correct level of patches
  – shutting down one of the nodes
  – disconnecting one of the nodes from the disk system
  – etc.
5
High Availability
● The purpose of the cluster installation is to offer 365x24 access to the database
● No single point of failure
  – Two nodes, two disk systems, two...
● Recovery/availability is offered by the Oracle software (Real Application Cluster)
  – Transactions are recovered by the surviving instance
● The following cases were tested
  – Listener down (re-connection is immediate)
  – Listener up but instance down (re-connection is immediate)
  – Machine down (re-connection takes longer: 3 minutes when connecting from a Linux client, due to the TCP driver timeout)
    ● The timeout can be tweaked, but...
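The "machine down" case is slow because the client waits for the operating system's TCP timeout before it gives up on the dead node. As a rough illustration only (this is not how the Oracle client is implemented), the Python sketch below tries each listener address in turn with an explicit connect timeout. The function name `connect_first_available` and the 5-second default are assumptions; the host names and port are the ones used in the connect descriptors of this presentation.

```python
import socket

# Listener addresses as they appear in the lhclog connect descriptor.
ADDRESSES = [("sunlhclog01.cern.ch", 1521), ("sunlhclog02.cern.ch", 1521)]


def connect_first_available(addresses, timeout_s=5.0):
    """Return (socket, address) for the first address that accepts a TCP connection.

    Hypothetical helper: each attempt is bounded by timeout_s instead of the
    (much longer) default TCP timeout of the operating system.
    """
    errors = []
    for host, port in addresses:
        try:
            sock = socket.create_connection((host, port), timeout=timeout_s)
            return sock, (host, port)
        except OSError as exc:  # refused, unreachable, or timed out
            errors.append(f"{host}:{port}: {exc}")
    raise ConnectionError("no listener reachable: " + "; ".join(errors))
```

A caller would use `sock, addr = connect_first_available(ADDRESSES)`. A short timeout speeds up failover but can give up too early on a slow network, which is presumably the trade-off behind "can be tweaked, but...".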
6
Transparent Application Failover
● For SELECT operations, if the connection is lost, the session is resumed transparently on the surviving node
  – Tested and working: the session stops for a few seconds and then resumes without the user issuing a new connect request
  – Not tested with the JDBC Thin driver; it will work with the JDBC OCI driver
● Sessions modifying data will still lose the connection and need to re-connect
  – As expected, the current transaction will be rolled back
● Possibility of LOAD BALANCING at the level of the connect string
  – Not enabled for the moment, perhaps later
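Since sessions that modify data lose their connection and have their transaction rolled back, the application has to reconnect and replay the work itself. The following Python sketch shows one possible shape for that logic. Everything in it is hypothetical: `ConnectionLost` stands in for whatever error the real driver raises, and `connect` and `work` are callables supplied by the application.

```python
class ConnectionLost(Exception):
    """Hypothetical stand-in for a driver's 'connection lost' error."""


def run_transaction(connect, work, max_attempts=3):
    """Run work(conn) and commit; on connection loss, reconnect and replay.

    connect: callable returning a new connection object with a commit() method.
    work:    callable performing all statements of ONE transaction.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for _ in range(max_attempts):
        conn = connect()
        try:
            result = work(conn)
            conn.commit()
            return result
        except ConnectionLost as exc:
            # The uncommitted transaction was rolled back when the session
            # died, so the whole unit of work is replayed on a new session.
            last_error = exc
    raise last_error
```

The sketch assumes `work` is safe to replay from the start. It does not cover a connection lost during `commit()` itself, where the application cannot tell whether the commit succeeded without checking the data.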
7
Service definition - General
● Service to run 365x24; backups will not interrupt database access
  – Export + hot backups
  – Oracle Recovery Manager will reduce the backup window
● Problems with the service are to be reported to oracle.support@cern.ch and/or the Oracle GSM telephone (depending on the criticality of the problem)
  – Same mechanism as used for the SUNSLPS and LEP Database servers
● However, the system can still collapse for other reasons (network outage, power failure, gremlins...), so applications must be able to react to these events (local buffering?)
  – Instance failure when recovering a distributed transaction
  – The surviving instance tried to recover and crashed at the same point
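The "local buffering?" suggestion can be sketched as a small client-side queue that holds records while the database is unreachable and flushes them in order once it comes back. This is a minimal illustration in Python under stated assumptions: the class name `BufferedWriter`, the `send` callable, the use of `ConnectionError` to signal an outage, and the buffer limit are all hypothetical.

```python
from collections import deque


class BufferedWriter:
    """Queue records locally and forward them in order when the database is up."""

    def __init__(self, send, max_buffered=10000):
        # send: callable that writes one record and raises ConnectionError
        #       when the database cannot be reached (an assumption).
        self._send = send
        # Bounded buffer: when full, the OLDEST record is silently dropped.
        self._pending = deque(maxlen=max_buffered)

    @property
    def pending(self):
        """Number of records still waiting to be written."""
        return len(self._pending)

    def write(self, record):
        """Buffer the record, then try to flush everything that is pending."""
        self._pending.append(record)
        return self.flush()

    def flush(self):
        """Send pending records oldest first; stop at the first failure."""
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                return False  # keep the record and everything after it
            self._pending.popleft()
        return True
```

Dropping the oldest record when the buffer is full is only one possible policy. A logging application might instead prefer to spill to disk or to block the producer, depending on how much data loss is acceptable.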
8
Service Definition - Patches
● We may need to interrupt the service for updates...
  – If all goes well, one one-day (scheduled) interruption per year
  – We should be able to apply Solaris patches one node at a time
    ● Moving applications from one node to another
  – Oracle apparently offers rolling upgrade features in their RAC patches
    ● Some patches that touch common structures used by all the instances will still require database downtime
● But: critical security patches may need to be applied at any given moment (following Sun and/or CERN Security Chief requests)
  – Removed all unneeded Solaris services to avoid potential problems
    ● Private firewall for all the database servers, à la AIS?
9
lhclog=(DESCRIPTION=
  (FAILOVER=on)
  (LOAD_BALANCE=off)
  (ADDRESS= (PROTOCOL=TCP) (HOST=sunlhclog01.cern.ch) (PORT=1521) )
  (ADDRESS= (PROTOCOL=TCP) (HOST=sunlhclog02.cern.ch) (PORT=1521) )
  (CONNECT_DATA=
    (SERVICE_NAME=LHCLOGDB)
    (FAILOVER_MODE= (TYPE=SELECT) (METHOD=BASIC) )
  )
)
10
lhclog=(DESCRIPTION=
  (FAILOVER=on)
  (LOAD_BALANCE=off)
  (ADDRESS= (PROTOCOL=TCP) (HOST=sunlhclog01.cern.ch) (PORT=1521) )
  (ADDRESS= (PROTOCOL=TCP) (HOST=sunlhclog02.cern.ch) (PORT=1521) )
  (CONNECT_DATA=
    (SERVICE_NAME=LHCLOGDB)
    (FAILOVER_MODE= (TYPE=SELECT) (METHOD=PRECONNECT) )
  )
)
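The two connect descriptors on slides 9 and 10 differ only in the failover METHOD: with BASIC the connection to the backup node is opened at failover time, while with PRECONNECT it is opened in advance. Because nested parentheses are easy to get wrong by hand, the Python sketch below generates the same descriptor from its parameters. The function name `taf_descriptor` and its signature are hypothetical and are not part of any Oracle tool.

```python
def taf_descriptor(alias, hosts, service, method, port=1521):
    """Build a tnsnames-style connect descriptor with TYPE=SELECT failover.

    Hypothetical helper: alias, hosts, service and method are supplied by
    the caller; method must be BASIC or PRECONNECT.
    """
    if method not in ("BASIC", "PRECONNECT"):
        raise ValueError("method must be BASIC or PRECONNECT")
    addresses = "".join(
        f"(ADDRESS=(PROTOCOL=TCP)(HOST={host})(PORT={port}))" for host in hosts
    )
    return (
        f"{alias}=(DESCRIPTION=(FAILOVER=on)(LOAD_BALANCE=off)"
        f"{addresses}"
        f"(CONNECT_DATA=(SERVICE_NAME={service})"
        f"(FAILOVER_MODE=(TYPE=SELECT)(METHOD={method}))))"
    )


if __name__ == "__main__":
    hosts = ["sunlhclog01.cern.ch", "sunlhclog02.cern.ch"]
    for method in ("BASIC", "PRECONNECT"):
        print(taf_descriptor("lhclog", hosts, "LHCLOGDB", method))
```

The output is a single-line form of the entries shown on the slides. Generating both variants from one template makes it obvious that METHOD is the only setting that changes.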
11
Database
● Space will be managed automatically by Oracle
  – No need to specify extent size
  – Unlimited number of extents