BELT & SUSPENDERS: HA & DR in one solution
Ray English, Sr. Systems Administrator, Indianapolis Power & Light Company

About Ray English
- Sr. Systems Administrator, Indianapolis Power & Light Company
  - Focusing on UNIX (primarily Solaris) systems
- UNIX geek since 1994
- Sun Certified System Administrator
- VCS administrator since 2000
- Experience in Indianapolis
  - IPL
  - Lilly
  - General Motors Allison Transmission (EDS)

OMS Overview
- OMS = Outage Management System
- Records outage calls from IPL customers
  - IVR (317-261-8111 & 317-261-8222)
  - Phone center agents
  - “Last gasp” from meters: “I’m meter 12345 and I just lost power.”
- ~500,000 customers in Indianapolis area
- Predictive analysis of root cause based on call grouping (transformer, pole, etc.)
- Industry-specific software from CGI/M3i
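
As a toy illustration of what "predictive analysis based on call grouping" can mean, the sketch below groups callers by the equipment they share. It is not how the CGI/M3i product works: the device names, the `UPSTREAM` mapping, and the two-call threshold are all assumptions made up for this example. The idea is only that when several callers sit behind the same device, that device is a better suspect than each individual service.

```python
# Hypothetical topology: meter -> chain of upstream devices, nearest first.
UPSTREAM = {
    "meter-1": ["xfmr-A", "fuse-7"],
    "meter-2": ["xfmr-A", "fuse-7"],
    "meter-3": ["xfmr-B", "fuse-7"],
    "meter-4": ["xfmr-C", "fuse-9"],
}


def predict_root_cause(calls, upstream=UPSTREAM, min_calls=2):
    """Return the nearest device shared by every calling meter, or None.

    calls: iterable of meter ids that reported an outage.
    min_calls: assumed threshold below which no grouping is attempted.
    """
    meters = sorted(set(calls))
    if len(meters) < min_calls:
        return None  # too few calls to group
    chains = [upstream[m] for m in meters]
    # Walk the first caller's chain from the customer outward and stop at
    # the first device that every other caller also depends on.
    for device in chains[0]:
        if all(device in chain for chain in chains[1:]):
            return device
    return None  # callers share no equipment in this toy topology
```

With this mapping, calls from `meter-1` and `meter-2` point at `xfmr-A`, while calls from `meter-1` and `meter-3` point one level further out, at `fuse-7`.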

OMS Map View

Zoom-in on outage

Outage summaries

OMS business challenges
- Critical system
  - Customers expect 100% reliability
  - Utilized to dispatch trucks to restore outages
- Data is constantly churning
  - Information from minutes ago could be useless
  - The more data, the better idea we have of what’s wrong
- Utilized most during high stress
  - Evenings (end-of-day for day shift)
    - Storms
    - Customers arriving home from work
  - Poor weather (storms, ice storms, snow)
  - High customer expectations
- Keep it simple

OMS Technical Architecture
- Oracle databases
  - Sun Solaris SPARC systems
- Application Tier
  - Windows systems
- Client Tier
  - Windows workstations
    - Dispatchers
    - Trucks
  - IVR systems
  - Web front-end call center agents (iCall)

High Availability (the belt)
- VERITAS Cluster Server
- Failover within datacenter
  - Human error
  - Power feeds
  - Networking
  - SAN
  - Isolated environmental
  - Application failure
  - Server failure
- Rolling upgrades

Overview of HA setup

VCS service group configuration

DR Challenges
- Loss of a site
- Need up-to-the-minute data
  - Information from minutes ago could be useless
  - No data is better than incorrect data
    - “Know that you don’t know anything.”
- Seamless to users
  - Dispatchers
  - Crews
  - Call center representatives
  - Customers

Disaster Recovery (the suspenders)
- Moderately close proximity
  - ~10 miles +/-
- Robust fiber
  - Public IP subnet & VCS heartbeats span data centers
  - Redundant loop around city
  - Lots of bandwidth
- EMC SRDF
  - Symmetrix Remote Data Facility
  - Other technologies available (VVR, etc.)
- VERITAS Cluster Server (Global Cluster Option)

Cluster Terminology
- Stretch cluster
- Stretched cluster
- Campus cluster
- Extended cluster
- Data replication cluster
- Metro cluster
- Metro stretched cluster

Overview of DR setup

VCS service group with SRDF

Overview of DR setup

Production node crashes (Time for HA!)

Loss of production site (Time for DR!)
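
The two scenarios above (a node crash handled by HA, a site loss handled by DR) can be pictured as one priority-ordered list of systems that spans both data centers. The sketch below is a toy model of that idea only. It does not call VCS, and the node names, site labels, and priorities are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    site: str
    priority: int  # lower number = preferred failover target


# Hypothetical stretch cluster: two nodes per site.
NODES = [
    Node("prod1", "production", 0),
    Node("prod2", "production", 1),
    Node("dr1", "dr", 2),
    Node("dr2", "dr", 3),
]


def pick_target(nodes, healthy):
    """Return the most-preferred node whose name is in `healthy`, or None."""
    candidates = [n for n in nodes if n.name in healthy]
    return min(candidates, key=lambda n: n.priority, default=None)


def describe(nodes, healthy, current_site="production"):
    """Label the failover as HA (same site) or DR (other site)."""
    target = pick_target(nodes, healthy)
    if target is None:
        return "no healthy node"
    kind = "HA" if target.site == current_site else "DR"
    return f"{kind} failover to {target.name}"
```

If only `prod1` is lost, the surviving production node wins and the move stays inside the data center. If both production nodes are lost, the same rule lands on the DR site without any separate DR logic.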

Running at the DR site

Failback to the production site

The data is there. Now what?

Gotchas
- Mounts should be the same on both sides
- SRDF needs to be “synchronous”
  - Diskgroup, volumes, and filesystem need to be consistent
  - “Adaptive copy” doesn’t cut it: individual devices in the disk group fall behind
- Networking between sites needs to be robust
  - Redundant: Prevent split-brain
  - Fast: VCS heartbeats, data replication
  - Big: Data replication, public network traffic
- Freeze/disable failover to the DR servers
  - Risk vs. Reward
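
To see why individual devices falling behind is a problem, consider a filesystem spread over several devices in one disk group. A toy model, under the assumption that every write carries a global sequence number: the remote copy is usable only if it holds writes 1..k for some k, with no holes. The function name and data layout below are illustrative and do not reflect any SRDF interface.

```python
def is_consistent(writes, applied):
    """Check whether the remote copy is a clean prefix of the write stream.

    writes:  list of (seq, device) tuples in the order they were issued.
    applied: dict device -> highest seq already applied at the remote site.
    """
    # Every write the remote site currently holds, across all devices.
    remote = {seq for seq, dev in writes if seq <= applied.get(dev, 0)}
    # A consistent image contains exactly the first len(remote) writes.
    prefix = {seq for seq, _ in writes[: len(remote)]}
    return remote == prefix


WRITES = [(1, "d1"), (2, "d2"), (3, "d1"), (4, "d2")]

# Devices kept in lockstep: the remote image is always a prefix.
in_step = is_consistent(WRITES, {"d1": 3, "d2": 2})

# One device lags on its own: write 3 is present but write 2 is missing.
lagging = is_consistent(WRITES, {"d1": 3, "d2": 0})
```

Here `in_step` is true and `lagging` is false. The second case is the one to avoid: the remote disk group contains a later write without an earlier one, so the volumes and filesystem on top of it cannot be trusted after a failover.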

Why have idle DR hardware?
- Run Test, Development, Sandbox, Training, etc. environments on DR equipment when it’s not needed.
- Load on these environments will probably be minimal if you’re in “DR Mode”
- Also add these environments to VCS
  - Easily offline if horsepower is needed for DR
  - Service group dependencies
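
The "offline if horsepower is needed" decision can be sketched as a small capacity calculation. This is only an illustration of the reasoning; the slide's approach is to let VCS handle it through service group dependencies. The group names, CPU counts, and importance ranks below are hypothetical.

```python
def groups_to_offline(running, needed_cpus, total_cpus):
    """Pick non-production groups to offline so production fits on the DR host.

    running: dict name -> (cpus, importance); lower importance goes first.
    Returns the group names in the order they would be taken offline.
    """
    free = total_cpus - sum(cpus for cpus, _ in running.values())
    victims = []
    # Walk the groups from least to most important until enough is free.
    for name, (cpus, _) in sorted(running.items(), key=lambda kv: kv[1][1]):
        if free >= needed_cpus:
            break
        victims.append(name)
        free += cpus
    if free < needed_cpus:
        raise RuntimeError("not enough capacity even with everything offlined")
    return victims


# Hypothetical DR host with 16 CPUs, fully used by non-production groups.
RUNNING = {"sandbox": (4, 0), "training": (4, 1), "test": (8, 2)}
plan = groups_to_offline(RUNNING, needed_cpus=8, total_cpus=16)
```

In this example `plan` lists `sandbox` and then `training`, leaving `test` running because 8 CPUs are already free by that point.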

Global cluster service groups?
- Adds complexity that may not be needed
  - Networking (DNS, etc.)
  - Management in VCS (Cluster of clusters)
  - GCO Proxy
- Instead, use parts of Global Cluster
  - Replication agents

Oracle RAC (parallel service groups)
- Oracle RAC between metro sites using data replication requires use of Global Cluster service groups because you’re failing between clusters, not machines.
  - All-or-nothing at each site (because only 1 site can have valid data access at a time) is enforced by GCO
- Machine-based failover for Oracle RAC within each site is primarily handled by Oracle RAC itself.
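
The all-or-nothing rule can be stated as a simple invariant: a parallel group may be online on many nodes at once, but all of them must belong to one site. The check below is a minimal sketch of that invariant with hypothetical node and site names. It says nothing about how GCO enforces it.

```python
# Hypothetical node -> site mapping for a two-site RAC deployment.
SITE_OF = {
    "prod1": "production",
    "prod2": "production",
    "dr1": "dr",
    "dr2": "dr",
}


def sites_online(online_nodes, site_of=SITE_OF):
    """Return the set of sites that currently host an online instance."""
    return {site_of[node] for node in online_nodes}


def is_valid_rac_state(online_nodes, site_of=SITE_OF):
    """True when every online instance sits in the same site (or none are up)."""
    return len(sites_online(online_nodes, site_of)) <= 1
```

Both production nodes online is a valid state, and so is both DR nodes online. One node from each site is not, because only one site has valid access to the replicated data at a time.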

Remember…
- Don’t underestimate the power of network and storage magic.
- Call 261-8222 to report IPL power outages
- VCS makes “belt & suspenders” easy for metro failover clusters with robust infrastructure.
- A “fall back to an hour ago / yesterday / last week” situation requires other planning besides this (backups).
- Your mileage may vary.

Obligatory slide of logos

Questions?
- Any questions?
- Ray English, ray_english@yahoo.com