Implementing ASM Without HW RAID, A User’s Experience


1 Implementing ASM Without HW RAID, A User's Experience
Luca Canali, CERN
Dawid Wojcik, CERN
UKOUG, Birmingham, December 2008

2 Outline
Introduction to ASM: disk groups, fail groups, normal redundancy
Scalability and performance of the solution
Possible pitfalls, sharing experiences
Implementation details, monitoring, and tools to ease ASM deployment

3 Architecture and main concepts
Why ASM?
Provides the functionality of a volume manager and a cluster file system
Raw access to storage for performance
Why ASM-provided mirroring?
Allows the use of lower-cost storage arrays
Allows mirroring across storage arrays, so that a single array is not a single point of failure
Array (HW) maintenance can be done in a rolling way
Enables stretch clusters

4 ASM and cluster DB architecture
Oracle architecture built from redundant low-cost components: servers, SAN, storage
This is the architecture deployed at CERN for the Physics DBs; more at http://cern.ch/phydb

5 Files, extents, and failure groups
Files and extent pointers
Failgroups and ASM mirroring

6 ASM disk groups
Example: HW = 4 disk arrays with 8 disks each
An ASM diskgroup is created using all available disks
The end result is similar to a file system on RAID 1+0
ASM allows mirroring across storage arrays
Oracle RDBMS processes access the storage directly (raw disk access)
[Diagram: ASM diskgroup striped within each of Failgroup1 and Failgroup2, mirrored across the two failgroups]
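As an illustration, a normal-redundancy diskgroup mirroring across two arrays could be created as follows (a minimal sketch; the diskgroup name and disk paths are hypothetical, not the ones used at CERN):

-- Minimal sketch: normal-redundancy diskgroup mirroring across two arrays.
-- Diskgroup name and disk paths are hypothetical.
CREATE DISKGROUP data1 NORMAL REDUNDANCY
  FAILGROUP array1 DISK '/dev/mpath/arr1_1', '/dev/mpath/arr1_2'
  FAILGROUP array2 DISK '/dev/mpath/arr2_1', '/dev/mpath/arr2_2';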

7 Performance and scalability
ASM with normal redundancy has been stress tested for CERN's use cases: it scales and performs well

8 Case Study: the largest cluster I have ever installed, RAC5
The test used 14 servers

9 Multipathed Fibre Channel
8 FC switches: 4 Gbps (10 Gbps uplink)

10 Many spindles
26 storage arrays (16 SATA disks each)

11 Case Study: I/O metrics for the RAC5 cluster
Measured, sequential I/O: read 6 GB/s; read-write 3+3 GB/s
Measured, small random I/O: read 40K IOPS (8 KB read operations)
Note: 410 SATA disks, 26 HBAs on the storage arrays
Servers: 14 x 4+4 Gbps HBAs, 112 cores, 224 GB of RAM

12 How the test was run
A custom SQL-based DB workload:
IOPS: probe a large table (several TBs) randomly via several parallel query slaves, each reading a single block at a time
MBPS: read a large (several TBs) table with parallel query
The test table used for the RAC5 cluster was 5 TB in size, created inside a disk group of 70 TB
Scripts are available on request; a sketch of the two access patterns is shown below
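The actual CERN scripts are available on request; the following is only a minimal sketch of the two access patterns, assuming a hypothetical test table testbig(id, payload) with an index on id:

-- IOPS sketch: run from many concurrent sessions; each iteration does one
-- random index probe followed by a single-block table read.
-- Table testbig and its columns are hypothetical.
DECLARE
  v_cnt PLS_INTEGER;
BEGIN
  FOR i IN 1 .. 100000 LOOP
    SELECT COUNT(payload) INTO v_cnt
    FROM   testbig
    WHERE  id = TRUNC(DBMS_RANDOM.VALUE(1, 1e9));  -- assumes an index on id
  END LOOP;
END;
/

-- MBPS sketch: full scan of the large table with parallel query,
-- driving large sequential multiblock reads.
SELECT /*+ FULL(t) PARALLEL(t, 32) */ COUNT(*) FROM testbig t;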

13 Possible pitfalls: production stories, sharing experiences
3 years in production, 550 TB of raw capacity

14 Rebalancing speed
Rebalancing is performed (and is mandatory) after space management operations
Typically after HW failures (to restore the mirror)
Goal: balanced space allocation across disks, not based on performance or utilization
ASM instances are in charge of rebalancing
Scalability of rebalancing operations? In 10g serialization wait events can limit scalability
Even at maximum speed, rebalancing is not always I/O bound (see the sketch below for setting the rebalance power and monitoring progress)
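A minimal sketch of the relevant commands, run in the ASM instance (the diskgroup name is hypothetical):

-- Raise the rebalance power for a diskgroup (0-11 in 10g).
ALTER DISKGROUP data1 REBALANCE POWER 8;

-- Monitor progress of the rebalance from the ASM instance.
SELECT group_number, operation, state, power, sofar, est_work, est_minutes
FROM   v$asm_operation;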

15 Rebalancing, an example
Rebalancing speed is measured in MB/minute to conform to V$ASM_OPERATION units
Test conditions can change the results (OS, storage, Oracle version, number of ASM files, etc.)
It is a good idea to repeat the measurements whenever several parameters of the environment change, to get meaningful results

16 VLDB and rebalancing
Rebalancing operations can move more data than expected
Example: 5 TB allocated on ~100 disks of 200 GB each; one disk is replaced (diskgroup rebalance); the total I/O workload is 1.6 TB (8x the disk size!)
How to see this: query V$ASM_OPERATION; the column EST_WORK keeps growing during the rebalance
The issue: excessive repartnering
Rebalancing in RAC fails over when an instance crashes, but it does not restart if all instances are down (typical of single instance)
There is no obvious way to tell whether a diskgroup has a pending rebalance operation
A partial workaround is to query V$ASM_DISK for imbalances in disk occupation (TOTAL_MB and FREE_MB), as in the sketch below
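A minimal sketch of such a check (not the production monitoring query, only an illustration):

-- Compare per-disk free-space ratios within each diskgroup; a large spread
-- suggests that a rebalance is still pending or was interrupted.
SELECT group_number,
       MIN(free_mb / total_mb) AS min_free_ratio,
       MAX(free_mb / total_mb) AS max_free_ratio
FROM   v$asm_disk
WHERE  group_number > 0
  AND  total_mb > 0
GROUP BY group_number;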

17 Rebalancing issues wrap-up
Rebalancing can be slow: many hours for very large disk groups
Associated risk: a second disk failure while rebalancing
Worst case: loss of the diskgroup because partner disks fail
Similar problems affect RAID5 volume rebuilds

18 Fast Mirror Resync
ASM 10g with normal redundancy does not allow offlining part of the storage
A transient error in a storage array can cause several hours of rebalancing to drop and re-add disks
This is a limiting factor for scheduled maintenance
11g has the new feature 'fast mirror resync', a great feature for rolling interventions on the HW (see the sketch below)
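A minimal sketch of how the 11g feature can be used for a rolling intervention (diskgroup and failgroup names are hypothetical; compatible.asm must be at least 11.1):

-- Let offlined disks stay offline for up to 4 hours before ASM drops them.
ALTER DISKGROUP data1 SET ATTRIBUTE 'disk_repair_time' = '4h';

-- Offline a whole failgroup (e.g. one storage array) for maintenance...
ALTER DISKGROUP data1 OFFLINE DISKS IN FAILGROUP array1;

-- ...then bring it back: only the extents changed in the meantime are resynced.
ALTER DISKGROUP data1 ONLINE DISKS IN FAILGROUP array1;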

19 ASM and filesystem utilities
Only a few tools can access ASM: asmcmd, dbms_file_transfer, XDB FTP
Limited operations (no copy, rename, etc.) and they require open DB instances
File operations are therefore difficult in 10g
In 11g asmcmd has the copy (cp) command
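For example, in 10g a file can still be copied out of ASM with DBMS_FILE_TRANSFER, at the cost of creating directory objects first; a minimal sketch with hypothetical diskgroup, path and file names:

-- Copy a datafile out of ASM using DBMS_FILE_TRANSFER (10g).
-- Directory paths and file names are hypothetical.
CREATE DIRECTORY asm_src AS '+DATA1/mydb/datafile';
CREATE DIRECTORY fs_dest AS '/backup/mydb';

BEGIN
  DBMS_FILE_TRANSFER.COPY_FILE(
    source_directory_object      => 'ASM_SRC',
    source_file_name             => 'users.259.657641887',
    destination_directory_object => 'FS_DEST',
    destination_file_name        => 'users_01.dbf');
END;
/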

20 ASM and corruption
ASM metadata corruption
Can be caused by bugs; one case in production after a disk eviction
Physical data corruption
ASM automatically fixes most corruption on the primary extent, typically when a full backup reads the blocks
Corruption on the secondary extent goes undetected until a disk failure or rebalance exposes it
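As a precaution, the ASM metadata of a diskgroup can be checked explicitly; a minimal sketch (10g syntax, diskgroup name hypothetical; note this verifies ASM metadata, not user data blocks):

-- Check ASM metadata consistency; NOREPAIR only reports, REPAIR attempts fixes.
ALTER DISKGROUP data1 CHECK ALL NOREPAIR;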

21 For HA our experience is that disaster recovery is needed
Corruption issues were fixed using a physical standby to move to 'fresh' storage
For HA, our experience is that disaster recovery is needed: a standby DB and an on-disk (flash) copy of the DB

22 Implementation details

23 Storage deployment
Current storage deployment for Physics Databases at CERN:
SAN, FC (4 Gb/s) storage enclosures with SATA disks (8 or 16 per enclosure)
Linux x86_64, no ASMLib; device mapper instead (naming persistence + HA)
Over 150 FC storage arrays (production, integration and test) and ~2000 LUNs exposed
Biggest DB over 7 TB (more to come when the LHC starts; estimated growth up to 11 TB/year)

24 Storage deployment: ASM implementation details
Storage in JBOD configuration (1 disk -> 1 LUN)
Each disk is partitioned at the OS level:
1st partition: 45% of the disk size, on the faster outer sectors of the disk (short stroke)
2nd partition: the rest of the disk, on the slower inner sectors (full stroke)

25 Storage deployment: two diskgroups created for each cluster
DATA: data files and online redo logs, on the outer part of the disks
RECO: flash recovery area destination (archived redo logs and on-disk backups), on the inner part of the disks
One failgroup per storage array (Failgroup1 ... Failgroup4 spanning DATA_DG1 and RECO_DG1); see the sketch below
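A sketch of how such a pair of diskgroups could be created from the p1/p2 partitions, with one failgroup per array; the statements are illustrative, abbreviated to two arrays and two disks each, and the names only follow the rstor convention shown on the later slides:

-- DATA diskgroup on the outer (p1) partitions, one failgroup per storage array.
CREATE DISKGROUP rac9_datadg1 NORMAL REDUNDANCY
  FAILGROUP rstor401 DISK '/dev/mpath/rstor401_1p1', '/dev/mpath/rstor401_2p1'
  FAILGROUP rstor402 DISK '/dev/mpath/rstor402_1p1', '/dev/mpath/rstor402_2p1';

-- RECO diskgroup on the inner (p2) partitions of the same disks.
CREATE DISKGROUP rac9_recodg1 NORMAL REDUNDANCY
  FAILGROUP rstor401 DISK '/dev/mpath/rstor401_1p2', '/dev/mpath/rstor401_2p2'
  FAILGROUP rstor402 DISK '/dev/mpath/rstor402_1p2', '/dev/mpath/rstor402_2p2';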

26 Storage management
SAN setup in a JBOD configuration involves many steps and can be time consuming:
Storage level: logical disks, LUN mappings
FC infrastructure: zoning
OS: creating the device mapper configuration (multipath.conf: name persistency + HA)

27 Storage management: storage manageability
DBAs set up the initial configuration
ASM means extra maintenance in case of storage maintenance (e.g. disk failure)
Problems:
How to quickly set up the SAN configuration
How to manage disks and keep track of the mappings physical disk -> LUN -> Linux disk -> ASM disk, e.g.
SCSI [1:0:1:3] & [2:0:1:3] -> /dev/sdn & /dev/sdax -> /dev/mpath/rstor901_3 -> ASM disk TEST1_DATADG1_0016
The ASM end of this chain can be queried as sketched below
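A minimal sketch of querying the ASM end of the mapping (device path to ASM disk, diskgroup and failgroup) from the ASM instance:

-- Map OS device paths, as seen by ASM, to ASM disk, diskgroup and failgroup names.
SELECT d.path, d.name AS asm_disk, g.name AS diskgroup, d.failgroup
FROM   v$asm_disk d
LEFT JOIN v$asm_diskgroup g ON g.group_number = d.group_number
ORDER BY d.path;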

28 Storage management: solution
Configuration DB: a repository of FC switches, port allocations and all SCSI identifiers for all nodes and storage arrays
Big initial effort, easy to maintain, high ROI
Custom tools:
Map SCSI (block) devices <-> device mapper devices <-> physical storage and FC port
Map device mapper devices <-> ASM disks
Automatic generation of the device mapper configuration

29 Storage management: SCSI id (host,channel,id) -> storage name and FC port
Custom-made script lssdisks.py maps SCSI id -> block device -> device mapper name and status -> storage name and FC port.
[~]$ lssdisks.py
The following storages are connected:
* Host interface 1:
  Target ID 1:0:0: - WWPN: …D0230BE0B5 - Storage: rstor316, Port: 0
  Target ID 1:0:1: - WWPN: …D0231C3F8D - Storage: rstor317, Port: 0
  Target ID 1:0:2: - WWPN: …D0232BE081 - Storage: rstor318, Port: 0
  Target ID 1:0:3: - WWPN: …D0233C… - Storage: rstor319, Port: 0
  Target ID 1:0:4: - WWPN: …D0234C3F68 - Storage: rstor320, Port: 0
* Host interface 2:
  Target ID 2:0:0: - WWPN: …D0230BE0B5 - Storage: rstor316, Port: 1
  Target ID 2:0:1: - WWPN: …D0231C3F8D - Storage: rstor317, Port: 1
  . . .
SCSI Id    Block DEV  MPath name    MP status  Storage
[0:0:0:0]  /dev/sda
[1:0:0:0]  /dev/sdb   rstor316_CRS  OK         rstor316
[1:0:0:1]  /dev/sdc   rstor316_…    OK         rstor316
[1:0:0:2]  /dev/sdd   rstor316_…    FAILED     rstor316
[1:0:0:3]  /dev/sde   rstor316_…    OK         rstor316
. . .

30 Storage management: device mapper name -> ASM disk and status
Custom-made script listdisks.py maps each device mapper name to the corresponding ASM disk and its status.
[~]$ listdisks.py
DISK          NAME               GROUP_NAME    FG        H_STATUS  MODE    MOUNT_S  STATE
rstor401_1p1  RAC9_DATADG1_0006  RAC9_DATADG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL
rstor401_1p2  RAC9_RECODG1_0000  RAC9_RECODG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL
rstor401_2p…                                             UNKNOWN   ONLINE  CLOSED   NORMAL
rstor401_2p…                                             UNKNOWN   ONLINE  CLOSED   NORMAL
rstor401_3p1  RAC9_DATADG1_0007  RAC9_DATADG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL
rstor401_3p2  RAC9_RECODG1_0005  RAC9_RECODG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL
rstor401_4p1  RAC9_DATADG1_0002  RAC9_DATADG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL
rstor401_4p2  RAC9_RECODG1_0002  RAC9_RECODG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL
rstor401_5p1  RAC9_DATADG1_0001  RAC9_DATADG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL
rstor401_5p2  RAC9_RECODG1_0006  RAC9_RECODG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL
rstor401_6p1  RAC9_DATADG1_0005  RAC9_DATADG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL
rstor401_6p2  RAC9_RECODG1_0007  RAC9_RECODG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL
rstor401_7p1  RAC9_DATADG1_0000  RAC9_DATADG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL
rstor401_7p2  RAC9_RECODG1_0001  RAC9_RECODG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL
rstor401_8p1  RAC9_DATADG1_0004  RAC9_DATADG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL
rstor401_8p2  RAC9_RECODG1_0004  RAC9_RECODG1  RSTOR401  MEMBER    ONLINE  CACHED   NORMAL
rstor401_CRS1
rstor401_CRS2
rstor401_CRS3
rstor402_1p1  RAC9_DATADG1_0015  RAC9_DATADG1  RSTOR402  MEMBER    ONLINE  CACHED   NORMAL
. . .

31 Storage management: device mapper alias, naming persistency and multipathing (HA)
Custom-made script gen_multipath.py generates the device mapper configuration, giving persistent aliases and multipathed (HA) access:
SCSI [1:0:1:3] & [2:0:1:3] -> /dev/sdn & /dev/sdax -> /dev/mpath/rstor916_1
[~]$ gen_multipath.py
# multipath default configuration for PDB
defaults {
    udev_dir            /dev
    polling_interval    …
    selector            "round-robin 0"
    . . .
}
multipaths {
    multipath {
        wwid    …c26660be0b5080a407e00
        alias   rstor916_CRS
    }
    multipath {
        wwid    …c26660be0b5080a407e01
        alias   rstor916_1
    }
    . . .
}

32 Storage monitoring
ASM-based mirroring means Oracle DBAs need to be alerted of disk failures and evictions
Dashboard: global overview, custom solution (RACMon)
ASM level monitoring: Oracle Enterprise Manager Grid Control; RACMon alerts on missing disks and failgroups, plus a dashboard
Storage level monitoring: RACMon tracks LUNs' health and storage configuration details on the dashboard
A sketch of a missing-disk check is shown below
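This is not RACMon itself, only a minimal sketch of the kind of alert condition such monitoring can run against the ASM instance:

-- Flag diskgroup member disks that are not online or have gone missing.
SELECT group_number, name, path, header_status, mode_status, mount_status
FROM   v$asm_disk
WHERE  group_number > 0
  AND (mode_status <> 'ONLINE' OR mount_status = 'MISSING');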

33 Storage monitoring
ASM instance level monitoring and storage level monitoring dashboards (screenshots)
Example alerts: a new failing disk on RSTOR614; a new disk installed on RSTOR903, slot 2

34 Conclusions
Oracle ASM diskgroups with normal redundancy are used at CERN instead of HW RAID
Performance and scalability are very good
Allows the use of low-cost HW
Requires more admin effort from the DBAs than high-end storage
11g has important improvements
Custom tools ease administration

35 Thank you - Q&A
Links: http://cern.ch/phydb http://www.cern.ch/canali

