1
Deployment and operations of a high availability infrastructure for relational databases in a heterogeneous Tier 1 workload environment
Carlos Fernando Gamboa, Michael Ernst, John Hover, Hironori Ito
RACF Computer Facility, Brookhaven National Lab, US
Computing in High Energy and Nuclear Physics (CHEP) conference, ASGC, Taiwan, 2010
2
Outline
- Introduction
  - Oracle database services at BNL within the WLCG context
  - Overview of database services at BNL
  - Topology of database services
- Operational challenges
  - Case 1: Remote Oracle database access
  - Case 2: Database hardware migration
- Conclusions
3
Worldwide LHC Computing Grid and BNL
Worldwide LHC Computing Grid (WLCG) / Open Science Grid (OSG)
The common grid technology infrastructure relies on relational databases to:
- Distribute and catalog LHC data: LHC File Catalog (LFC), File Transfer Service (FTS)
- Replicate and store detector conditions and calibration metadata
Oracle Real Application Clusters is the database technology adopted by the WLCG: scalable and reliable.
Brookhaven National Laboratory serves as the US Tier 1 site for the ATLAS VO, as part of OSG, for the LHC experiment:
- About 6,100,000 GB of total online storage
- About 1,658 physical CPUs and 5,652 logical CPUs
- Hosts a conditions database which is replicated from Tier 0 using Oracle Streams; BNL is part of the 3D project
- 4 independent sets of database clusters providing about 50 TB of raw disk storage
4
Brookhaven National Laboratory
5
Topology of Oracle database services hosted at BNL
An independent cluster set per application service:
- Dual nodes with Direct Attached Storage (DAS)
- Storage distribution adjusted to application needs (hardware RAID levels, storage and spindles)
- Flexible architecture that allows nodes and storage to be added according to application needs
- Homogeneous software stack deployed: Real Application Clusters 10gR2 (database server, Clusterware, ASM file system)
Cluster layout (diagram): nodes N1,1/N1,2 with Storage1 host the LFC and FTS database; N2,1/N2,2 with Storage2 host the VOMS and Priority Stager database; N3,1/N3,2 with Storage3 host the Conditions database; N4,1/N4,2 with Storage4 host the TAGS test database.
Database services are accessed via LAN or via LAN/WAN, depending on the service.
6
Cluster hardware (Node 1 and Node 2)
IBM 3550/3650 server description:
- Two dual-core to two quad-core 3 GHz processors, 64-bit architecture
- RAM 16 GB to 32 GB
Interconnectivity:
- Server to clients: 1 Gb/s NIC
- Server to storage: QLogic 4 Gb FC dual-port PCI-X HBA, 1 m LC-LC Fibre Channel cable
Storage:
- IBM DS3400 FC dual-controller enclosure: 2 hot-swap disks per enclosure, 4 Gbps SW SFP transceivers, 12 SAS disks (15k rpm, 300 GB or 450 GB per disk), configured as RAID 10
- IBM DS3000 storage expansion: 12 SAS disks (15k rpm, 300 GB or 450 GB per disk)
Monitoring tools: Oracle Enterprise Manager Grid Control, Nagios, Ganglia
7
Distribution of database services per production cluster: LFC and FTS database
- Dedicated to hosting BNL and US Tier 3 LFC and FTS data.
- Each database service is distributed on only one node; in case of failure, database services fail over to the surviving node.
- The cluster is inside the BNL firewall.
- TSM is enabled for tape backups in addition to the disk backups.
Diagram: the FTS DB and the LFC DB (BNL and Tier 3) hold about 350 GB of stored data; backup processes on Node 1 and Node 2 write to disk backup and tape backup.
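The slides do not detail the backup tooling beyond disk backups and TSM tape backups. As a hedged sketch only, a dual disk/tape strategy of this kind is commonly driven from RMAN with a media-management library for TSM (Tivoli Data Protection for Oracle); the library path below is hypothetical:

    RMAN> CONFIGURE DEFAULT DEVICE TYPE TO DISK;
    RMAN> CONFIGURE CHANNEL DEVICE TYPE DISK FORMAT '/backup/%U';
    RMAN> CONFIGURE CHANNEL DEVICE TYPE SBT_TAPE
            PARMS 'SBT_LIBRARY=/opt/tivoli/tsm/client/oracle/bin64/libobk.so';
    RMAN> BACKUP DATABASE PLUS ARCHIVELOG;        # regular backup to disk
    RMAN> BACKUP DEVICE TYPE SBT_TAPE DATABASE;   # periodic copy to TSM-managed tape

This is not taken from the presentation; it only illustrates how disk and tape destinations can coexist in one backup configuration.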
8
Conditions database production cluster (Node 1 and Node 2)
IBM 3650 server description:
- 2 quad-core 3 GHz processors, 64-bit architecture
- RAM 32 GB
Storage:
- IBM DS3400 FC dual controller: 1 GB buffer cache, 2 hot-swap disks per enclosure, 4 Gbps SW SFP transceivers, 12 SAS disks (15k rpm, 450 GB per disk)
- IBM DS3000 storage expansions: 36 SAS disks (15k rpm, 450 GB per disk)
- ASM data disk group: RAID 1 LUNs, external redundancy
- ASM FRA disk group: RAID 6 LUNs, external redundancy
Services and data (diagram):
- 3D Conditions database plus DB admin tables: 502 GB used of 5 TB of data space
- 3D Conditions Frontier database service
- 3D Conditions Streams processes and backup processes run on Node 1 and Node 2
- FRA / disk backup space: 5 TB
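Because the RAID protection is done in the storage hardware, the ASM disk groups above use external redundancy, i.e. ASM itself does no mirroring. A minimal sketch of how such disk groups are created on the ASM instance (the LUN device paths are hypothetical, not taken from the slides):

    SQL> CREATE DISKGROUP DATA EXTERNAL REDUNDANCY
           DISK '/dev/mapper/cond_data_lun1',
                '/dev/mapper/cond_data_lun2';

    SQL> CREATE DISKGROUP FRA EXTERNAL REDUNDANCY
           DISK '/dev/mapper/cond_fra_lun1';

With external redundancy, each LUN presented by the DS3400/DS3000 RAID sets becomes a single ASM disk, and ASM only stripes across them.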
9
Distribution of database services per production cluster: summary of storage allocated per cluster

Table 1
RAC #   Oracle service            Total raw space   Total space after RAID 10
1       TAGS                      6 TB              2.8 TB
2       FTS / LFC / LFC Tier 3    6 TB              2.8 TB
3       Conditions DB             ~21.6 TB          5 TB
4       VOMS / Priority Stager    6 TB              2.8 TB
10
Operational challenges
Case 1: Remote Oracle database access
11
Case 1. Remote database service accessibility (WAN/LAN)
Diagram: the Conditions Oracle RAC database at Tier 1 (BNL) is accessed over the LAN by BNL Tier 1 worker nodes, and over the WAN by Tier 2 and Tier 3 interactive users and by the Michigan Calibration Center. The data is replicated from the source database at Tier 0 via Oracle Streams.
12
Case 1. Remote database service accessibility (WAN/LAN)
General motivation: to understand why a user reconstruction job takes much longer to complete when run from a remote-site client than when run at the local site (a BNL worker node), in both cases using a direct connection to the BNL Oracle Conditions Database (no proxy/cache is considered in this case).
13
Case 1. Remote database service accessibility (WAN/LAN)
Remote client side: general observations during interactive client job execution
- Varying numbers of simultaneous connection threads generate database activity.
- No high load was observed on the client side.
- Long execution times were observed on the client side when running reconstruction jobs against the BNL Conditions database; the same job ran faster in other environments.

Table 2
Client / Oracle DB                    BNL Cond. DB (minutes)   CERN Cond. DB (minutes)
BNL worker node                       ~3
CERN worker node                                               ~3
Indiana University remote client      30-40
14
Case 1. Remote database service accessibility (WAN/LAN)
Two different approaches were taken to understand the difference in job resolution time:
- Approach 1: test and verification of TCP network parameters at the client site.
- Approach 2: analysis of the job's queries in terms of their network performance.
15
Approach 1: Test and verification of TCP network parameters at the client site
The test consisted of varying different database (Oracle Net / SQL*Net) and OS kernel TCP network parameters on the client side and observing the behavior of the job resolution time, considering:
- Network latency between the client and server side.
- TCP buffers on the client and server side.
- The application connection mechanism.
The client was located at a remote site (Indiana University). The job was configured to run using Athena Release 14. The iperf network tool was used during this test.
16
Approach 1: Test and verification of TCP network parameters at the client site
Key Oracle network configuration files:
- listener.ora (server): contains the listening protocol addresses, the supported services, and parameters that control the listener process runtime behavior.
- sqlnet.ora (client, server): contains the parameters that specify preferences for how a client or server uses Oracle Net protocol features.
- tnsnames.ora (client, server): maps net service names to connect descriptors.
The following parameters can be defined in these files:

Table 3
Network file                    Parameters
sqlnet.ora                      DEFAULT_SDU_SIZE, RECV_BUF_SIZE, SEND_BUF_SIZE
tnsnames.ora, listener.ora      SDU, RECV_BUF_SIZE, SEND_BUF_SIZE
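As an illustration of where these parameters go, a minimal client-side sketch follows; the host, port, and service names are hypothetical, and the SDU and buffer values are ones similar to those exercised in Table 4 below:

    # sqlnet.ora (client side)
    DEFAULT_SDU_SIZE = 32767
    RECV_BUF_SIZE = 14250000
    SEND_BUF_SIZE = 14250000

    # tnsnames.ora (client side)
    CONDDB =
      (DESCRIPTION =
        (SDU = 32767)
        (SEND_BUF_SIZE = 14250000)
        (RECV_BUF_SIZE = 14250000)
        (ADDRESS = (PROTOCOL = TCP)(HOST = conddb.example.bnl.gov)(PORT = 1521))
        (CONNECT_DATA = (SERVICE_NAME = conddb))
      )

Setting the values in the connect descriptor lets them be tuned per service alias without touching the global sqlnet.ora defaults.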
17
Approach 1: Test and verification of TCP network parameters at the client site
Session Data Unit (SDU):
- Allows limited control over packet sizes.
- Possible values: 512 to 32767 bytes; 2048 is the default.
- To minimize overhead it should be adjusted to the Maximum Segment Size (MSS) of the network protocol being used. Thus:
  MSS = Maximum Transmission Unit - (TCP + IP) header size = 1500 bytes (Ethernet) - 20 bytes (TCP) - 20 bytes (IP) = 1460 bytes
- Negotiated by client and server for data retrieval; the minimum value is used when client and server settings differ.
18
Approach 1: Test and verification of TCP network parameters at the client site
RECV_BUF_SIZE, SEND_BUF_SIZE:
- Alter the TCP send and receive windows.
- If these parameters are not set, the OS buffer sizes are used.
- Their values depend on:
  - The negotiated SDU size.
  - The bandwidth-delay product (BDP): the send/receive buffers should be proportional to the BDP, which is itself the product of the network bandwidth and the latency.
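As an illustrative calculation only (the 1 Gb/s path bandwidth is an assumption, and the ~30 ms round-trip time is the value later measured for the Indiana University client in Table 5):

    BDP = bandwidth x RTT = (1 Gb/s / 8 bits per byte) x 0.030 s ≈ 3.75 MB
    suggested socket buffer ≈ 3 x BDP ≈ 11 MB

The factor of 3 is a commonly cited rule of thumb for sizing Oracle Net socket buffers relative to the BDP, and the result is of the same order as the ~14 MB SEND/RECV buffer values exercised in Table 4.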
19
Approach 1: Test and verification of TCP network parameters at the client site
Test results

Table 4
Date      Test           SDU (bytes)  SND_BUFFER (bytes)  RCV_BUFFER (bytes)  Reco job finish time (mm:ss.ms)  Max idle observed (minutes)
10/28/08  Prior tuning   2048         OS                  OS                  35:00-40:00                      ~30-37
10/28/08  Job default    2048         OS                  OS                  19:49.25                         ~18
10/28/08  Test 2         8352         14250000            -                   18:09.75                         ~13-14
10/28/08  Test 3         32767        14250000            -                   18:14.21                         ~15
11/04/08  Test 4         2048         14250000            54750000            19:29.98                         ~18
11/04/08  Test 5         8352         14250000            54750000            18:20.83                         ~16
11/04/08  Test 6         31744        14250000            54750000            18:08.95                         ~17
11/06/08  Test 7         2048         OS                  OS                  21:05.31                         ~17:15
11/06/08  Test 8         8760         29200000            14600000            20:30.22                         ~13-14
11/06/08  Test 9         31120        29200000            14600000            20:45.22                         ~15
20
Approach 1: Test and verification of TCP network parameters at the client site
Results: it was possible to decrease the job resolution time by almost 50% after configuring the network parameters, but it remained considerably high.
The entire test methodology and results can be found at:
http://indico.cern.ch/getFile.py/access?contribId=10&sessionId=2&resId=1&materialId=slides&confId=43856
21
Approach 2: Analysis of the job's queries in terms of their network performance
The test consisted of tracing the job's access to the database from a client located at a remote site and from one at the local site. Database sessions were traced and analyzed in terms of their network footprint, considering:
- The remote client was located at Indiana University. The job was configured to run using Athena Release 15, which uses a new database connection mechanism (an application improvement).
- WAN and LAN network latency.
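The slides do not state which tracing mechanism was used. A common way to capture this kind of per-call network footprint on the server side is Oracle's extended SQL trace (event 10046), sketched here for a hypothetical session:

    -- enable extended SQL trace (with wait events and binds) in the current session
    SQL> ALTER SESSION SET tracefile_identifier = 'remote_client_test';
    SQL> ALTER SESSION SET EVENTS '10046 trace name context forever, level 12';

    -- ... run the reconstruction job's database workload ...

    SQL> ALTER SESSION SET EVENTS '10046 trace name context off';

The resulting trace file records every database call together with its "SQL*Net message to/from client" waits, from which the number of round trips and the time spent on the network can be totalled (for example with tkprof).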
22
Approach 2: Analysis of the job's queries in terms of their network performance
Chart: cumulative connection time.
23
Approach 2: Analysis of the job's queries in terms of their network performance
Comparing the two server trace files, one generated by the longest connection of the remote client and one generated by the local client, it was found that:
- The local and remote tests issued the same number of database calls; the resolution time of this connection depended on the RTT between the client and the database.
- ~75% of the total database calls were generated by the underlying application when retrieving data from 4 queries; each of these 4 queries fetched 1744 rows.

Table 5
                          Remote client (Indiana University)   Local client (BNL worker node)
Round-trip time           ~30 ms                               ~1.0 ms
DB calls / connection     ~28K                                 ~28K
Total connection time     ~846 seconds                         ~28 seconds
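As a back-of-the-envelope check of the numbers in Table 5 (an illustration, not taken from the slides), the total connection time is almost entirely accounted for by the network round trips of the database calls:

    ~28,000 DB calls x ~30 ms RTT  ≈ 840 s   (remote client, close to the ~846 s measured)
    ~28,000 DB calls x ~1.0 ms RTT ≈  28 s   (local client, matching the ~28 s measured)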
24
Approach 2: Analysis of the job's queries in terms of their network performance
Subsequent analysis by the ATLAS database group and the COOL group found that the cause of this 75% overhead was a problem in the CORAL software: the 4 queries retrieved data from a schema in which a column is defined as CLOB type, and these queries were using 4 database calls per row retrieved. This was solved in Savannah bug #51429.
The entire test methodology and results can be found at:
http://indico.cern.ch/getFile.py/access?contribId=9&sessionId=12&resId=1&materialId=slides&confId=50976
25
Operational challenges
Case 2: Database hardware migration
26
Case 2. Database hardware migration
General goal: to migrate the underlying database service to new hardware while minimizing service disruption, and to increase the database service's storage and processing capacity (Conditions database).
The database applications hosted at BNL can be organized by user interaction as:
- Read/write: LFC, FTS, Priority Stager.
- Read:
  - Conditions database: the write process is controlled via Oracle Streams, which updates the database replica from Tier 0 to the BNL Tier 1.
  - VOMS: periodically synchronized with the VOMS at Tier 0; the procedure is controlled at BNL.
Two Oracle technologies are available to achieve this goal:
- Data Guard
- Transportable Tablespaces
27
Case 2: Database hardware migration, Data Guard
General considerations:
- It is the recommended tool for performing hardware migration within the 3D context.
- A passive database replica (installed on the new cluster) is periodically updated using transactional (redo) logs from the active database on the old hardware.
- When the database service is ready to be switched over to the new hardware, a downtime of the service is required; before the network interconnections are switched, the latest changes are applied to the passive replica.
- This reduces the downtime of the service.
More details can be found at:
http://indicobeta.cern.ch/getFile.py/access?contribId=13&sessionId=2&resId=1&materialId=slides&confId=43856
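As a minimal sketch of the switchover step for a physical standby configuration (the slides do not show the exact commands used at BNL, and intermediate shutdown/mount steps are only indicated in comments):

    -- on the current primary (old hardware): hand over the primary role
    SQL> ALTER DATABASE COMMIT TO SWITCHOVER TO PHYSICAL STANDBY WITH SESSION SHUTDOWN;
    -- (shut down the old primary and mount it as a standby before continuing)

    -- on the standby (new hardware): take over the primary role and open
    SQL> ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY;
    SQL> ALTER DATABASE OPEN;

Until the switchover, the standby on the new cluster simply keeps applying redo shipped from the old cluster, which is what keeps the required downtime short.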
28
Case 2: Database hardware migration, Transportable Tablespaces (TT)
A feature that allows the datafiles of a tablespace to be copied between databases. In addition, an import of the tablespace metadata is required to plug the tablespace into the database being populated with the copy.
Specific details of this technology can be found at:
http://indicobeta.cern.ch/conferenceOtherViews.py?view=standard&confId=6552
This is the procedure recommended to recover a Tier 1 site, using another Tier 1 as the source site, when the entire Conditions database replica is compromised.
29
Case 2: Database hardware migration, TT for hardware migration of the Conditions database
General considerations:
- The writer account is controlled from Tier 0; writes are suspended by disabling the Streams process between the source database at Tier 0 and the BNL Conditions production database.
- The production database (old hardware) is the source database; the new hardware is the destination.
- The tablespaces to be migrated can be selected: only the Conditions database data is migrated, giving straightforward isolation between services (TAGS, Conditions DB).
- Tablespaces used as a source need to be in read-only mode.
- This can be done without affecting the user service: no downtime is required.
30
Case 2. TT for hardware migration of the Conditions database
General procedure (preparation-day intervention and switchover):
- Disable the data replication process:
  - Split the Streams replication to BNL (Tier 0).
  - Stop capture and propagation (Tier 0).
  - Stop the apply process (BNL).
- Move datafiles and metadata: proceed with the TT steps (copy the datafiles and import the metadata into the target database), as sketched below.
- Redirect services to the new cluster: migrate the production VIPs to the new nodes in a rolling fashion.
- Re-enable replication (capture, propagation and apply).
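A minimal sketch of the transportable-tablespace step itself; the tablespace, directory, and file names are hypothetical and the actual BNL procedure is not detailed in the slides:

    -- on the source database (old hardware): make the tablespace read only
    SQL> ALTER TABLESPACE cond_data READ ONLY;

    # export the tablespace metadata with Data Pump
    $ expdp system DIRECTORY=dpump_dir DUMPFILE=cond_tts.dmp \
        TRANSPORT_TABLESPACES=cond_data

    # copy the datafiles of cond_data to the new cluster's storage

    # plug the tablespace into the destination database (new hardware)
    $ impdp system DIRECTORY=dpump_dir DUMPFILE=cond_tts.dmp \
        TRANSPORT_DATAFILES='/u01/oradata/cond_data01.dbf'

    -- on the destination database: return the tablespace to read/write
    SQL> ALTER TABLESPACE cond_data READ WRITE;

Because the source tablespaces only need to be read only for the duration of the copy, and the Conditions workload at BNL is read-only for users, the user service is not interrupted.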
31
Case 2. TT for hardware migration of the Conditions database
This procedure was used to migrate the Conditions database to the current production hardware. No user service downtime was required:
- About 40 minutes to move the biggest datafile (~170 GB).
- About 90 minutes to move all the schemas hosted in the database.
32
Summary and future plans
- Oracle database services at BNL were presented.
- Two operational cases were summarized.
- New hardware has been acquired for the FTS, LFC and VOMS database services:
  - The database service will be migrated using Data Guard.
  - The VOMS database service will be migrated using Transportable Tablespaces.
33
Acknowledgements
Case 1. Remote Oracle database access:
- Fred Luehring, Indiana University
- Tiesheng Dai, Muon Calibration Center at Michigan
- ATLAS Database Group
- COOL, CERN IT Group
Case 2. Database hardware migration:
- Eva Dafonte, CERN IT
- Shigeki Misawa and the BNL RACF/GCE group