Multi-Data-Center Hadoop in a Snap
Dr. Konstantin Boudnik
Vice President, Open Source Development
My background
● 15-year Sun Microsystems veteran: JVM, distributed systems
● Vice President, Apache Bigtop
● Committer, PMC member and contributor to various ASF projects
● Member of Apache IPMC
● Early Hadoop committer
WANdisco Background
● WANdisco: Wide Area Network Distributed Computing
–Enterprise-ready, high-availability software solutions that enable globally distributed organizations to meet today’s data challenges of secure storage, scalability and availability
● Leader in tools for software engineers
–Subversion
–Apache Software Foundation sponsor
● Highly successful IPO, London Stock Exchange, June 2012 (LSE:WAND)
● US patent for active-active replication technology granted, November 2012
● Global locations
–San Ramon (CA)
–Chengdu (China)
–Tokyo (Japan)
–Boston (MA)
–Sheffield (UK)
–Belfast (UK)
Customers
Non-Stop Hadoop
● Non-Intrusive Plugin
● Provides Continuous Availability
● In the LAN / Across the WAN
● Active/Active
3 Key Problems For Multi Cluster Hadoop LAN / WAN
Enterprise Ready Hadoop
Characteristics of Mission Critical Applications
● Require 100% Uptime of Hadoop
–SLAs, Regulatory Compliance
● Require HDFS to be Deployed Globally
–Share Data Between Data Centers
–Data is Consistent, Not Eventually Consistent
● Ease Administrative Burden
–Reduce Operational Complexity
–Simplify Disaster Recovery
–Lower RTO/RPO
● Allow Maximum Utilization of Resources
–Within the Data Center
–Across Data Centers
Breaking Away from Active/Passive: What’s in a NameNode
Single Standby:
● Inefficient utilization of resource
–Journal Nodes
–ZooKeeper Nodes
–Standby Node
● Performance Bottleneck
● Still tied to the beeper
● Limited to LAN scope
Active / Active:
● All resources utilized
–Only NameNode configuration
–Scale as the cluster grows
–All NameNodes active
● Load balancing
● Set resiliency (# of active NN)
● Global Consistency
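The "Set resiliency (# of active NN)" item can be made concrete with a little arithmetic. The slides do not say which agreement rule the active NameNodes use, so the sketch below assumes a simple majority quorum; that assumption, and the function names, are illustrative only and not a description of the product.

```python
def majority(n_nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    if n_nodes < 1:
        raise ValueError("need at least one node")
    return n_nodes // 2 + 1


def tolerated_failures(n_nodes: int) -> int:
    """Nodes that can fail while a majority of the original set survives.

    ASSUMPTION: agreement needs a strict majority of the configured
    active NameNodes. The slides do not state the actual rule.
    """
    return n_nodes - majority(n_nodes)


# How resiliency grows with the number of active NameNodes.
table = {n: tolerated_failures(n) for n in (1, 3, 4, 5, 7)}
```

Under this assumption an even node count buys no extra resiliency over the odd count below it, which is why such deployments are usually sized with odd numbers.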
Breaking Away from Active/Passive: What’s in a Data Center
Standby Datacenter:
● Idle Resource
–Single Data Center Ingest
–Disaster Recovery Only
● One way synchronization
–DistCp
● Error Prone
–Clusters can diverge over time
● Difficult to scale > 2 Data Centers
–Complexity of sharing data increases
Active / Active:
● DR Resource Available
–Ingest at all Data Centers
–Run Jobs in both Data Centers
● Replication is Multi-Directional
–active/active
● Absolute Consistency
–Single HDFS spans locations
● ‘N’ Data Center support
–Global HDFS allows appropriate data to be shared
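The contrast between one-way periodic copying and multi-directional replication can be illustrated with a toy model. This is a minimal sketch, not DistCp and not WANdisco's implementation: `Site`, `periodic_copy` and `GlobalLog` are hypothetical names, and the "coordinator" simply applies every write at every site in one agreed order.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    """A toy data center holding a path -> contents namespace."""
    name: str
    files: dict = field(default_factory=dict)


def periodic_copy(source: Site, target: Site) -> None:
    """One-way sync: the target receives the source's state as of now."""
    target.files.update(source.files)


class GlobalLog:
    """Hypothetical coordinator: each write gets one global sequence
    number and is applied at every site in that same order."""

    def __init__(self, sites):
        self.sites = list(sites)
        self.log = []

    def write(self, origin: Site, path: str, data: str) -> int:
        self.log.append((origin.name, path, data))
        for site in self.sites:
            site.files[path] = data
        return len(self.log)  # sequence number of this write


# Active/passive: the standby is only as fresh as the last copy.
primary, standby = Site("dc1"), Site("dc2")
primary.files["/logs/a"] = "v1"
periodic_copy(primary, standby)
primary.files["/logs/a"] = "v2"      # written after the last copy
standby.files["/logs/b"] = "local"   # a write at the standby never flows back
stale = standby.files["/logs/a"]
diverged = primary.files != standby.files

# Active/active: writes may originate anywhere, all sites converge.
east, west = Site("east"), Site("west")
shared = GlobalLog([east, west])
shared.write(east, "/logs/a", "v1")
shared.write(west, "/logs/b", "v1")
shared.write(west, "/logs/a", "v2")
consistent = east.files == west.files
```

In the first half the standby still holds the old version and the two namespaces differ; in the second half both sites end with identical contents regardless of where each write originated.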
One Cluster Approach
● Example Applications
–HBase
–RT Query
–MapReduce
● Poor Resource Management
–Data Locality Issues
–Network Use
–Complex
Multiple Clusters
Creating Multiple Clusters
● Example Applications
–HBase
–RT Query
–MapReduce
● Need to share data between clusters
–DistCp / Stale Data
–Inefficient use of storage and/or network
–Some clusters may not be available
Multiple Clusters
Cluster Zones
● Zoning for Optimal Efficiency
● 100% HDFS Consistency
Multi Datacenter Hadoop Disaster Recovery
WAN REPLICATION
● Absolute Consistency
● Maximum Resource Use
● Lower Recovery Time/Point
● Replicate Only What You Want
● Better Utilization of Power/Cooling
● Lower TCO
● LAN Speed Performance
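"Replicate Only What You Want" implies some rule that decides which parts of the namespace cross the WAN. The slides do not show how that rule is configured, so the following is only a sketch under the assumption that selection is done per directory; `is_replicated` and the example paths are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Iterable


def is_replicated(path: str, replicated_dirs: Iterable[str]) -> bool:
    """Return True if `path` is one of the replicated directories
    or lies anywhere beneath one of them.

    ASSUMPTION: replication is selected per directory. Matching is done
    on whole path components, so "/sharedprivate" is not under "/shared".
    """
    p = PurePosixPath(path)
    for d in replicated_dirs:
        root = PurePosixPath(d)
        if p == root or root in p.parents:
            return True
    return False


# Hypothetical policy: only two directory trees cross the WAN.
policy = ["/shared", "/warehouse/global"]
decisions = {
    path: is_replicated(path, policy)
    for path in (
        "/shared/logs/2014/a.log",
        "/warehouse/global/sales",
        "/warehouse/local/scratch",
        "/sharedprivate/x",
    )
}
```

Comparing path components rather than raw string prefixes avoids accidentally replicating sibling directories whose names merely start with the same characters.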
Architecture of a Non-Stop Hadoop
Technical Use Cases
● Eliminate Performance Bottleneck
–HBase issues
● Multi Data-Center Ingest
–Information doesn't need to be sent to one DC and then copied back to the other using DistCp
–Parallel ingest methods don’t require redirected data streams
–Ingest data at, or close to, the source
–Global Analysis (logs, click streams, etc.)
● Cluster Zones
–Efficient use of resources based on application profile
–HBase, MapReduce, Spark, etc.
● Maximize Data Center Resource Utilization
–All data centers can be used to run different jobs concurrently
● Disaster Recovery
–Data is as current as possible (no periodic syncs)
–Virtually zero downtime to recover from regional data center failure
–Regulatory compliance
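The "no periodic syncs" point under Disaster Recovery can be put in numbers. The sketch below estimates the worst-case window of data at risk when a remote copy is refreshed by a scheduled job. It assumes each job captures the source as of its start time and that the copy is usable only once the job finishes; the figures are made up for illustration.

```python
from datetime import timedelta


def worst_case_exposure(sync_interval: timedelta,
                        copy_duration: timedelta) -> timedelta:
    """Longest time a write can exist at the source only.

    ASSUMPTION: a copy job starts every `sync_interval`, captures the
    source as of its start, and takes `copy_duration` to complete.
    A write made just after a job starts is picked up by the next job
    and is safe only when that job finishes.
    """
    if sync_interval <= timedelta(0) or copy_duration < timedelta(0):
        raise ValueError("interval must be positive, duration non-negative")
    return sync_interval + copy_duration


# Illustrative figures only: copy every 6 hours, each run takes 1 hour.
exposure = worst_case_exposure(timedelta(hours=6), timedelta(hours=1))
```

With continuous replication there is no scheduled interval in this formula, which is the sense in which the slide describes the remote data as being as current as possible.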
Non-Stop Hadoop Demonstration
Q & A
Thank you