How did it start?
• At Google
• Lots of semi-structured data
• Commodity hardware
• Horizontal scalability
• Tight integration with MapReduce
Why NoSQL?
• RDBMS don't scale
• Typically large monolithic systems
• Hard to shard
• Specialized hardware... expensive!
• Buzzword!
Google BigTable
• Distributed multi-level map
• Fault tolerant, persistent
• Scalable
• Runs on commodity hardware
• Self managing
• Large number of read/write ops
• Fast scans
HBase
• Open-source BigTable
• HDFS as underlying DFS
• ZooKeeper as lock service
• Tight integration with Hadoop MapReduce
HBase
• Data model
• Architecture, implementation (Regions, Region Servers, etc.)
• API
• Current status and future direction
• Use cases
• How to think in HBase (or NoSQL)?
Data Model
• Sparse, multi-dimensional map
• (row, column, timestamp) → cell
• Column = Column Family:Column Qualifier
[Figure: a table with rows (row key AK), columns (Fam1:Qual1), and timestamps; the cell at AK / Fam1:Qual1 holds v1 at t1 and v2 at t2, with t2 > t1]
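The data model above can be pictured as nested maps. The following is only a toy sketch in Python, not HBase code: the class name `SparseTable` and its methods are hypothetical, and real HBase stores everything as byte arrays rather than strings.

```python
from collections import defaultdict


class SparseTable:
    """Toy model of a sparse (row, column, timestamp) -> value map."""

    def __init__(self):
        # row -> "family:qualifier" -> {timestamp: value}
        # Cells that were never written occupy no space at all.
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, family, qualifier, timestamp, value):
        self._rows[row][f"{family}:{qualifier}"][timestamp] = value

    def get(self, row, family, qualifier, timestamp=None):
        """Return the newest version at or before `timestamp`.

        With timestamp=None the newest version overall is returned.
        """
        versions = self._rows.get(row, {}).get(f"{family}:{qualifier}", {})
        candidates = [t for t in versions if timestamp is None or t <= timestamp]
        return versions[max(candidates)] if candidates else None


table = SparseTable()
table.put("AK", "Fam1", "Qual1", timestamp=1, value="v1")
table.put("AK", "Fam1", "Qual1", timestamp=2, value="v2")
latest = table.get("AK", "Fam1", "Qual1")            # newest version
older = table.get("AK", "Fam1", "Qual1", timestamp=1)
```

Keeping every version under its own timestamp is what lets a reader ask for the value of a cell as of an earlier time.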
Regions
• Region: contiguous set of lexicographically sorted rows
• Maximum region size: hbase.hregion.max.filesize (default 256MB)
• Regions are hosted by Region Servers
Regions and Splitting
[Figure: two regions, row1 to row256 and row257 to row600; as writes arrive, the second region splits into row257 to row400 and row401 to row600]
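Splitting can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual HBase algorithm: the `Region` and `Table` classes are hypothetical, sizes are tracked per row, the split point is simply the median key, and row keys are zero-padded so that lexicographic order matches numeric order.

```python
import bisect


class Region:
    def __init__(self, start_key, rows=None):
        self.start_key = start_key        # inclusive lower bound of the key range
        self.rows = dict(rows or {})      # row key -> stored size in bytes

    def size(self):
        return sum(self.rows.values())

    def split(self):
        """Move the upper half of the sorted keys into a new region."""
        keys = sorted(self.rows)
        upper_keys = keys[len(keys) // 2:]
        return Region(upper_keys[0], {k: self.rows.pop(k) for k in upper_keys})


class Table:
    def __init__(self, max_filesize):
        self.max_filesize = max_filesize  # plays the role of hbase.hregion.max.filesize
        self.regions = [Region("")]       # starts as one region covering all keys

    def put(self, row_key, nbytes):
        # Regions are kept sorted by start key, so a binary search finds the owner.
        starts = [r.start_key for r in self.regions]
        i = bisect.bisect_right(starts, row_key) - 1
        region = self.regions[i]
        region.rows[row_key] = nbytes
        if region.size() > self.max_filesize and len(region.rows) > 1:
            self.regions.insert(i + 1, region.split())


table = Table(max_filesize=300)
for n in range(1, 7):
    table.put(f"row{n:03d}", 100)
start_keys = [r.start_key for r in table.regions]
```

Because rows are sorted, a split only needs a single boundary key, and each half remains a contiguous key range.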
System Structure
[Figure: system components: Master, Region Servers, HDFS, ZooKeeper, MapReduce]
Master
• Region splitting
• Load balancing
• Metadata operations
• Multiple masters for failover
ZooKeeper
• Master election
• Locate -ROOT- region
• Region Server membership
Where is my row?
• 3-level hierarchical lookup scheme
[Figure: ZooKeeper → -ROOT- → .META. → MyTable → MyRow; -ROOT- holds one row per .META. region, and .META. holds one row per table region]
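The lookup chain can be sketched with sorted catalogs and binary search. All names and contents below (`ZOOKEEPER`, `REGIONS`, `rs1` to `rs3`, the zero-padded row keys) are made up for illustration; real catalog rows carry more information.

```python
import bisect


def find_entry(catalog, key):
    """Return the location of the last catalog entry whose start key is <= key.

    `catalog` is a list of (start_key, location) pairs sorted by start_key.
    """
    starts = [start for start, _ in catalog]
    return catalog[bisect.bisect_right(starts, key) - 1][1]


# Level 1: ZooKeeper knows where the -ROOT- region lives.
ZOOKEEPER = {"-ROOT-": "root_region"}

REGIONS = {
    # Level 2: -ROOT- has one row per .META. region.
    "root_region": [
        (("", ""), "meta_region_1"),
        (("MyTable", "row401"), "meta_region_2"),
    ],
    # Level 3: each .META. region has one row per table region.
    "meta_region_1": [
        (("MyTable", ""), "rs1"),
        (("MyTable", "row257"), "rs2"),
    ],
    "meta_region_2": [
        (("MyTable", "row401"), "rs3"),
    ],
}


def locate(table, row):
    key = (table, row)
    root = REGIONS[ZOOKEEPER["-ROOT-"]]      # ask ZooKeeper for -ROOT-
    meta = REGIONS[find_entry(root, key)]    # -ROOT- points to a .META. region
    return find_entry(meta, key)             # .META. points to the Region Server


server = locate("MyTable", "row300")
```

Clients typically cache the results of these lookups, so the full three-step walk is only needed on a cache miss.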
[Figure: write path inside a Region Server]
• Region: a Memstore plus HFiles (on HDFS)
• HLog: append-only WAL on HDFS (Sequence File), one per Region Server
• HFile: immutable sorted map (byte[] → byte[]), i.e. (row, column, timestamp) → cell value
• Write: goes to the HLog and the Memstore
• Flush: the Memstore is written out as a small HFile
• Compaction: small HFiles are merged into a larger HFile
[Figure: read path inside a Region Server; a read is served from the Region's Memstore and its HFiles (on HDFS)]
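The write and read paths above can be condensed into a small sketch. This is a toy model only: the class `RegionStore` is hypothetical, keys are plain strings rather than (row, column, timestamp) byte arrays, the log is a Python list instead of a file on HDFS, and flushing is triggered by entry count rather than memory size.

```python
class RegionStore:
    """Toy write-ahead log + memstore + immutable sorted files."""

    def __init__(self, flush_threshold=2):
        self.hlog = []        # append-only log, stands in for the HLog
        self.memstore = {}    # in-memory buffer of recent writes
        self.hfiles = []      # immutable sorted (key, value) tuples, oldest first
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.hlog.append((key, value))   # log first so the write survives a crash
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the memstore out as a new small, sorted, immutable file."""
        if self.memstore:
            self.hfiles.append(tuple(sorted(self.memstore.items())))
            self.memstore = {}

    def compact(self):
        """Merge all files into one; newer files override older ones."""
        merged = {}
        for hfile in self.hfiles:        # oldest first, so later updates win
            merged.update(hfile)
        self.hfiles = [tuple(sorted(merged.items()))] if merged else []

    def read(self, key):
        """Check the memstore first, then files from newest to oldest."""
        if key in self.memstore:
            return self.memstore[key]
        for hfile in reversed(self.hfiles):
            for k, v in hfile:
                if k == key:
                    return v
        return None


store = RegionStore(flush_threshold=2)
store.write("a", "1")
store.write("b", "2")    # second write triggers a flush
store.write("a", "3")
store.write("c", "4")    # triggers another flush
files_before = len(store.hfiles)
store.compact()
files_after = len(store.hfiles)
```

Files are never modified in place: updates land in newer files, and compaction keeps the number of files a read has to consult small.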
Ways to access
• Java
• REST
• Thrift
• Scala
• Jython
• Groovy DSL
• Ruby shell
• Java MR, Cascading, Pig, Hive
Java API
• Get
• Put
• Delete
• Scan
• IncrementColumnValue
• TableInputFormat - MapReduce Source
• TableOutputFormat - MapReduce Sink
Other Features
• Compression
• In-memory column families
• Multiple masters
• Rolling restart
• Bloom filters
• Efficient bulk loads
• Source and sink for Hive, Pig, Cascading
How to think in HBase?
HBase v/s RDBMS
• Neither solves all problems
• It's really a wrong comparison
• But puts things in context
HBase v/s RDBMS
HBase                                    | RDBMS
Column oriented                          | Row oriented (mostly)
Flexible schema, add columns on the fly  | Fixed schema
Good with sparse tables                  | Not optimized for sparse tables
No query language                        | SQL
Wide tables                              | Narrow tables
Joins using MR - not optimized           | Optimized for joins (small, fast ones too!)
Tight integration with MR                | Not really...
HBase v/s RDBMS
HBase                                                     | RDBMS
De-normalize your data                                    | Normalize as you can
Horizontal scalability. Just add hardware                 | Hard to shard and scale
Consistent                                                | Consistent
No transactions                                           | Transactional
Good for semi-structured data as well as structured data  | Good for structured data
HBase v/s RDBMS
Rule: You probably don't need HBase if your data can easily fit and be processed on a single RDBMS box.
But then, you are at Hadoop Day, so it probably can't!
Q&A