COSC6376 Cloud Computing Lecture 8: BigTable and Dynamo Instructor: Weidong Shi (Larry), PhD Computer Science Department University of Houston
Outline Plan Project Next Class BigTable and HBase Dynamo
Projects
Sample Projects Support video processing using HDFS and Mapreduce Image processing using cloud Security services using cloud Web analytics using cloud Cloud based MPI Novel applications of cloud based storage New pricing model Cyber physical system with cloud as the backend Bioinformatics using Mapreduce
Next week In-Class Presentation
In-Class Presentation Oct 3, Next Thursday In-Class Each team, 10 minutes What should be included in the presentation Team Objectives Plan of work
Project Proposal Due: Oct 8 Formal project description (at most 4 pages) Team members Objective Tools Plan of work (tasks and assignments) Division of labor Roadmap Risk and mitigation strategy
Plan Today Bigtable and HBase Dynamo Thursday Paxos
Reading Assignment Due: Thursday
Reading Assignment Due: Thursday
Bigtable Fay Chang, et al @google.com
Global Picture
BigTable Distributed multi-level map. Fault-tolerant, persistent. Scalable: thousands of servers, terabytes of in-memory data, petabytes of disk-based data, millions of reads/writes per second, efficient scans. Self-managing: servers can be added/removed dynamically; servers adjust to load imbalance. Often want to examine data changes over time, e.g. contents of a web page over multiple crawls.
Basic Data Model A BigTable is a sparse, distributed, persistent multi-dimensional sorted map: (row, column, timestamp) -> cell contents. Good match for most Google applications.
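The map above can be sketched as a toy in-memory structure. This is only an illustration of the (row, column, timestamp) -> cell contents idea, not Bigtable's API: the class name `SparseTable`, its methods, and the sample row `com.example.www` are all assumptions made for this example.

```python
class SparseTable:
    """Toy in-memory map: (row, column, timestamp) -> cell contents."""

    def __init__(self):
        # Only cells that were written exist, which is what makes the map sparse.
        self.cells = {}  # (row, column) -> {timestamp: value}

    def put(self, row, column, timestamp, value):
        self.cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp` (newest overall if None)."""
        versions = self.cells.get((row, column), {})
        eligible = [t for t in versions if timestamp is None or t <= timestamp]
        return versions[max(eligible)] if eligible else None

    def scan(self, start_row, end_row):
        """Yield (row, column) keys in sorted order for start_row <= row < end_row."""
        for row, column in sorted(self.cells):
            if start_row <= row < end_row:
                yield row, column


# Hypothetical data: two crawls of the same page, kept as two versions of one cell.
table = SparseTable()
table.put("com.example.www", "contents:", 1, "<html>crawl 1</html>")
table.put("com.example.www", "contents:", 2, "<html>crawl 2</html>")
table.put("com.example.www", "anchor:home", 2, "Example")
```

Calling `table.get("com.example.www", "contents:", timestamp=1)` returns the older crawl, which is the "examine data changes over time" use case.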
Tablet Contains some range of rows of the table. Built out of multiple SSTables. [Figure: a tablet (Start: aardvark, End: apple) built from two SSTables, each made of 64K blocks plus an index]
Chubby A persistent and distributed lock service. Consists of 5 active replicas; one replica is the master and serves requests. Service is functional when a majority of the replicas are running and in communication with one another, i.e. when there is a quorum. Implements a name service that consists of directories and files.
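The quorum condition is simple arithmetic. A minimal sketch, assuming a hypothetical helper `has_quorum` (not part of Chubby):

```python
def has_quorum(running_replicas, total_replicas=5):
    """True when a strict majority of the replicas are running and connected."""
    return running_replicas > total_replicas // 2
```

With 5 replicas the service tolerates 2 failures: `has_quorum(3)` holds, `has_quorum(2)` does not.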
Bigtable and Chubby Bigtable uses Chubby to: Ensure there is at most one active master at a time, Store the bootstrap location of Bigtable data (Root tablet), Discover tablet servers and finalize tablet server deaths, Store Bigtable schema information (column family information), Store access control list. If Chubby becomes unavailable for an extended period of time, Bigtable becomes unavailable.
Tablet Serving “Log Structured Merge Trees” Image Source: Chang et al., OSDI 2006
Tablet Representation [Figure: a tablet consists of an append-only log on GFS, SSTables on GFS, and a write buffer in memory (random-access); arrows mark the write and read paths] SSTable: immutable on-disk ordered map from string -> string. String keys: <row, column, timestamp> triples.
Compactions Minor compaction: converts the memtable into an SSTable; reduces memory usage and log traffic on restart. Merging compaction: reads the contents of a few SSTables and the memtable, and writes out a new SSTable; reduces number of SSTables. Major compaction: merging compaction that results in only one SSTable; no deletion records, only live data.
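A rough sketch of how the log, memtable, SSTables, and compactions fit together. It assumes everything lives in memory, uses a hypothetical `Tablet` class with a `memtable_limit` parameter and a `TOMBSTONE` deletion marker, and leaves out merging compaction (it would merge only some of the SSTables). It is not the Bigtable implementation.

```python
import bisect

TOMBSTONE = object()  # deletion record (hypothetical marker for this sketch)


def lookup(sstable, key):
    """Binary search in an immutable, sorted tuple of (key, value) pairs."""
    keys = [k for k, _ in sstable]
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return True, sstable[i][1]
    return False, None


class Tablet:
    def __init__(self, memtable_limit=2):
        self.log = []        # stands in for the append-only log on GFS
        self.memtable = {}   # in-memory write buffer
        self.sstables = []   # oldest first; each one is immutable and sorted
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.log.append((key, value))      # log first, then apply in memory
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.minor_compaction()

    def delete(self, key):
        self.write(key, TOMBSTONE)

    def read(self, key):
        # Newest data wins: memtable first, then SSTables from newest to oldest.
        if key in self.memtable:
            value = self.memtable[key]
            return None if value is TOMBSTONE else value
        for sstable in reversed(self.sstables):
            found, value = lookup(sstable, key)
            if found:
                return None if value is TOMBSTONE else value
        return None

    def minor_compaction(self):
        """Freeze the memtable into a new SSTable; the log can then be truncated."""
        if not self.memtable:
            return
        self.sstables.append(tuple(sorted(self.memtable.items(), key=lambda kv: kv[0])))
        self.memtable = {}
        self.log.clear()

    def major_compaction(self):
        """Rewrite everything into one SSTable holding only live data."""
        self.minor_compaction()
        merged = {}
        for sstable in self.sstables:      # oldest first, so newer values overwrite
            merged.update(sstable)
        live = {k: v for k, v in merged.items() if v is not TOMBSTONE}
        self.sstables = [tuple(sorted(live.items(), key=lambda kv: kv[0]))]
```

Until a major compaction runs, a deleted key still occupies space as a deletion record in some SSTable; afterwards it is gone entirely.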
Refinements: Locality Groups Can group multiple column families into a locality group. Separate SSTable is created for each locality group in each tablet. Segregating column families that are not typically accessed together enables more efficient reads. In WebTable, page metadata can be in one group and contents of the page in another group.
Refinements: Compression Many opportunities for compression: similar values in the same row/column at different timestamps, similar values in different columns, similar values across adjacent rows. Two-pass custom compression scheme. First pass: compress long common strings across a large window. Second pass: look for repetitions in a small window. Speed emphasized, but good space reduction (10-to-1).
Refinements: Bloom Filters Read operation has to read from disk when desired SSTable isn’t in memory. Reduce number of accesses by specifying a Bloom filter. Allows us to ask if an SSTable might contain data for a specified row/column pair. Small amount of memory for Bloom filters drastically reduces the number of disk seeks for read operations. Use implies that most lookups for non-existent rows or columns do not need to touch disk.
Bloom Filters
Approximate set membership problem Suppose we have a set S = {s1, s2, ..., sm} from a universe U. Represent S in such a way that we can quickly answer “Is x an element of S?” To take as little space as possible, we allow false positives (i.e. x ∉ S, but we answer yes). If x ∈ S, we must answer yes.
Bloom filters Consist of an array A[n] of n bits (space), and k independent random hash functions h1, …, hk : U --> {0, 1, ..., n-1}. 1. Initially set the array to 0. 2. For all s ∈ S, set A[hi(s)] = 1 for 1 ≤ i ≤ k (an entry can be set to 1 multiple times; only the first time has an effect). 3. To check if x ∈ S, we check whether all locations A[hi(x)] for 1 ≤ i ≤ k are set to 1. If not, clearly x ∉ S. If all A[hi(x)] are set to 1, we assume x ∈ S.
[Figure: the bit array starts with all 0s. Each element of S (x1, x2) is hashed k times and each hash location is set to 1.]
[Figure: to test y, look at its k hash locations.] If only 1s appear, conclude that y is in S. This may yield a false positive.
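The three steps can be sketched in a few lines. The slides assume k independent random hash functions; here they are approximated by salting SHA-256 with the index i, which is an assumption of this sketch. The class and method names are hypothetical as well.

```python
import hashlib


class BloomFilter:
    def __init__(self, n_bits, k):
        self.n = n_bits
        self.k = k
        self.bits = [0] * n_bits          # step 1: array A starts as all 0s

    def _positions(self, item):
        # Stand-in for h1..hk: hash the item together with the index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n

    def add(self, item):
        for pos in self._positions(item):  # step 2: set A[hi(s)] = 1
            self.bits[pos] = 1

    def might_contain(self, item):
        # Step 3: any 0 means "definitely not in S"; all 1s means "probably in S".
        return all(self.bits[pos] for pos in self._positions(item))


bloom = BloomFilter(n_bits=1024, k=3)
for row_key in ["aardvark", "apple"]:
    bloom.add(row_key)
```

`might_contain` never returns False for an item that was added. It can return True for an item that was never added, which costs only an unnecessary disk read in the SSTable use case.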
Bigtable Applications
Application 1: Google Analytics Enables webmasters to analyze traffic patterns at their web sites. Statistics such as: number of unique visitors per day and the page views per URL per day, percentage of users that made a purchase given that they earlier viewed a specific page. How? A small JavaScript program that the webmaster embeds in their web pages. Every time the page is visited, the program is executed. Program records the following information about each request: user identifier, the page being fetched.
Application 1: Google Analytics Two of the Bigtables. Raw click table (~ 200 TB): a row for each end-user session. Row name includes the website’s name and the time at which the session was created, so sessions that visit the same web site are clustered together and sorted chronologically. Compression factor of 6-7. Summary table (~ 20 TB): stores predefined summaries for each web site. Generated from the raw click table by periodically scheduled MapReduce jobs. Each MapReduce job extracts recent session data from the raw click table. Row name includes the website’s name and the column family is the aggregate summaries. Compression factor is 2-3.
Application 2: Google Earth & Maps Functionality: pan, view, and annotate satellite imagery at different resolution levels. One Bigtable stores raw imagery (~ 70 TB): row name is a geographic segment. Names are chosen to ensure adjacent geographic segments are clustered together. Column family maintains sources of data for each segment.
Application 3: Personalized Search Records user queries and clicks across Google properties. Users browse their search histories and request personalized search results based on their historical usage patterns. One Bigtable: row name is userid. A column family is reserved for each action type, e.g., web queries, clicks. User profiles are generated using MapReduce. These profiles personalize live search results. Replicated geographically to reduce latency and increase availability.
HBase is an open-source, distributed, column-oriented database built on top of HDFS based on BigTable!
HBase is .. A distributed data store that can scale horizontally to 1,000s of commodity servers and petabytes of indexed storage. Designed to operate on top of the Hadoop distributed file system (HDFS) or Kosmos File System (KFS, aka Cloudstore) for scalability, fault tolerance, and high availability.
Backdrop Started by Chad Walters and Jim. 2006.11: Google releases paper on BigTable. 2007.2: Initial HBase prototype created as Hadoop contrib. 2007.10: First usable HBase. 2008.1: Hadoop becomes Apache top-level project and HBase becomes subproject. 2008.10~: HBase 0.18, 0.19 released.
Why HBase ? HBase is a Bigtable clone. It is open source. It has a good community and promise for the future. It is developed on top of the Hadoop platform and integrates well with it, if you are using Hadoop already.
HBase Is Not … No join operators. Limited atomicity and transaction support: HBase supports multiple batched mutations of single rows only. Data is unstructured and untyped. Not accessed or manipulated via SQL: programmatic access via Java, REST, or Thrift APIs; scripting via JRuby.
HBase benefits over RDBMS No real indexes. Automatic partitioning. Scales linearly and automatically with new nodes. Commodity hardware. Fault tolerance. Batch processing.
Testing
$ hbase shell
> create 'test', 'data'
0 row(s) in 4.3066 seconds
> list
test
1 row(s) in 0.1485 seconds
> put 'test', 'row1', 'data:1', 'value1'
0 row(s) in 0.0454 seconds
> put 'test', 'row2', 'data:2', 'value2'
0 row(s) in 0.0035 seconds
> put 'test', 'row3', 'data:3', 'value3'
0 row(s) in 0.0090 seconds
> scan 'test'
ROW COLUMN+CELL
row1 column=data:1, timestamp=1240148026198, value=value1
row2 column=data:2, timestamp=1240148040035, value=value2
row3 column=data:3, timestamp=1240148047497, value=value3
3 row(s) in 0.0825 seconds
> disable 'test'
09/04/19 06:40:13 INFO client.HBaseAdmin: Disabled test
0 row(s) in 6.0426 seconds
> drop 'test'
09/04/19 06:40:17 INFO client.HBaseAdmin: Deleted test
0 row(s) in 0.0210 seconds
> list
0 row(s) in 2.0645 seconds
Connecting to HBase Java client: get(byte [] row, byte [] column, long timestamp, int versions); Non-Java clients: Thrift server hosting HBase client instance (sample ruby, c++, & java (via thrift) clients); REST server hosts HBase client. TableInput/OutputFormat for MapReduce: HBase as MR source or sink. HBase Shell: ./bin/hbase shell YOUR_SCRIPT
Dynamo
Motivation Build a distributed storage system: Scale Simple: key-value Highly available Guarantee Service Level Agreements (SLA)
System Assumptions and Requirements Query Model: simple read and write operations to a data item that is uniquely identified by a key. Other Assumptions: operation environment is assumed to be non-hostile and there are no security related requirements such as authentication and authorization.
Service Level Agreements (SLA) Application can deliver its functionality in a bounded time: every dependency in the platform needs to deliver its functionality with even tighter bounds. Example: a service guaranteeing that it will provide a response within 300ms for 99.9% of its requests for a peak client load of 500 requests per second. Service-oriented architecture of Amazon’s platform
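A small sketch of checking the example SLA against measured latencies. The function names and the sample latency lists are made up for illustration, and the peak-load condition (500 requests per second) is not modeled.

```python
def fraction_within(latencies_ms, bound_ms):
    """Fraction of requests answered within the bound."""
    within = sum(1 for latency in latencies_ms if latency <= bound_ms)
    return within / len(latencies_ms)


def meets_sla(latencies_ms, bound_ms=300, required_fraction=0.999):
    return fraction_within(latencies_ms, bound_ms) >= required_fraction


# Hypothetical measurements: 10,000 requests each.
good = [120] * 9995 + [450] * 5    # 99.95% within 300 ms
bad = [120] * 9950 + [450] * 50    # 99.5% within 300 ms
```

Here `meets_sla(good)` holds and `meets_sla(bad)` does not, even though the mean latency of `bad` is well under 300 ms. An SLA stated at the 99.9th percentile constrains the slow tail, which an average hides.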
Design Consideration Sacrifice strong consistency for availability Conflict resolution is executed during read instead of write, i.e. “always writeable”. Other principles: Incremental scalability. Symmetry. Decentralization. Heterogeneity.
Summary of techniques used in Dynamo and their advantages
Problem | Technique | Advantage
Partitioning | Consistent hashing | Incremental scalability
High availability for writes | Vector clocks with reconciliation during reads | Version size is decoupled from update rates
Handling temporary failures | Sloppy quorum and hinted handoff | Provides high availability and durability guarantee when some of the replicas are not available
Recovering from permanent failures | Anti-entropy using Merkle trees | Synchronizes divergent replicas in the background
Membership and failure detection | Gossip-based membership protocol and failure detection | Preserves symmetry and avoids having a centralized registry for storing membership and node liveness information
Partitioning and Consistent Hashing
Caches can Load Balance Numerous items in central server. Requests can swamp server. Distribute items among cache nodes. Clients get items from cache nodes. Server gets only 1 request per item. Server Items distributed among caches Users get items from caches
Who Caches What? Each cache node should hold few items else cache gets swamped by clients Each item should be in few cache nodes else server gets swamped by caches and cache invalidations/updates expensive
A Solution: Hashing Example: y = ax+b (mod n) Intuition: Assigns items to “random” cache nodes few items per cache Easy to compute which cache holds an item Server items assigned to caches by hash function. Users use hash to compute cache for item.
Problem: Adding Cache Nodes Suppose a new cache node arrives. How does it affect the hash function? Natural change: y = ax+b (mod n+1). Problem: changes bucket for nearly every item, so every cache node will be flushed and servers get swamped with new requests. Goal: when a bucket is added, few items move.
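The effect is easy to measure. A minimal sketch, with arbitrary illustrative constants a = 31 and b = 7 and integer item ids, that counts how many items change bucket when n grows from 10 to 11:

```python
def bucket(x, n, a=31, b=7):
    """Hash from the slide: y = ax + b (mod n)."""
    return (a * x + b) % n


def fraction_moved(items, n_before, n_after):
    """Fraction of items whose bucket differs after the number of caches changes."""
    moved = sum(1 for x in items if bucket(x, n_before) != bucket(x, n_after))
    return moved / len(items)


moved = fraction_moved(range(1000), 10, 11)
```

For these constants at least 90% of the items land in a different cache, because an item keeps its bucket only when ax + b leaves the same remainder modulo 10 and modulo 11.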
Solution: Consistent Hashing Use standard hash function to map cache nodes and items to points in the unit interval (“random” points spread uniformly). Item assigned to nearest cache node. Computation as easy as a standard hash function.
Properties All buckets get roughly the same number of items (like standard hashing). When the kth bucket is added, only a 1/k fraction of items move, and only from a few caches. When a cache node is added, minimal reshuffling of cached items is required.
Consistent Hashing Partition using consistent hashing. Keys hash to a point on a fixed circular space. Ring is partitioned into a set of ordered slots, and servers and keys are hashed over these slots. Nodes take positions on the circle. A, B, and D exist: B is responsible for the AB range, D for the BD range, A for the DA range. C joins: the BD range is split; C gets BC from D.
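A minimal ring sketch for the A, B, D example. It assumes MD5 as the "standard hash function" and assigns each key to the first node at or after its position on the circle (wrapping around), which matches the range picture above. The `Ring` class and `point` helper are hypothetical names.

```python
import bisect
import hashlib


def point(name):
    """Map a node name or key to a point in the unit interval [0, 1)."""
    digest = hashlib.md5(name.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class Ring:
    def __init__(self, nodes=()):
        self.points = []                  # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        bisect.insort(self.points, (point(node), node))

    def owner(self, key):
        # First node at or after the key's position; wrap to the start if none.
        i = bisect.bisect_left(self.points, (point(key), ""))
        return self.points[i % len(self.points)][1]


ring = Ring(["A", "B", "D"])
keys = [f"key{i}" for i in range(1000)]
before = {k: ring.owner(k) for k in keys}
ring.add("C")                             # C joins
after = {k: ring.owner(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Every key in `moved` now belongs to C, and all of them came from a single previous owner (C's successor on the ring). No other key changes hands, in contrast with the mod n scheme.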
Virtual Nodes “Virtual Nodes”: Each node can be responsible for more than one virtual node. If a node becomes unavailable the load handled by this node is evenly dispersed across the remaining available nodes. When a node becomes available again, the newly available node accepts a roughly equivalent amount of load from each of the other available nodes.
Replication Each data item is replicated at N hosts. “preference list”: The list of nodes that is responsible for storing a particular key.
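Virtual nodes and the preference list can be sketched on the same kind of ring. Assumptions in this sketch: MD5 positions, 100 virtual nodes per physical node labelled `node#i`, and a preference list built by walking clockwise from the key and skipping virtual nodes whose physical node is already chosen, so that the N replicas sit on N distinct hosts. `VirtualRing` and its methods are hypothetical names.

```python
import bisect
import hashlib
from collections import Counter


def position(label):
    return int.from_bytes(hashlib.md5(label.encode()).digest()[:8], "big")


class VirtualRing:
    def __init__(self, nodes, vnodes=100):
        # Each physical node appears at `vnodes` positions on the ring.
        self.tokens = sorted(
            (position(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )

    def preference_list(self, key, n):
        """First n distinct physical nodes found walking clockwise from the key."""
        start = bisect.bisect_left(self.tokens, (position(key), ""))
        chosen = []
        for step in range(len(self.tokens)):
            node = self.tokens[(start + step) % len(self.tokens)][1]
            if node not in chosen:
                chosen.append(node)
                if len(chosen) == n:
                    break
        return chosen

    def owner(self, key):
        return self.preference_list(key, 1)[0]


keys = [f"key{i}" for i in range(2000)]
full = VirtualRing(["A", "B", "C", "D"])
without_a = VirtualRing(["B", "C", "D"])      # node A becomes unavailable
# Which nodes pick up the keys that A used to own?
gained = Counter(without_a.owner(k) for k in keys if full.owner(k) == "A")
```

Keys that A did not own stay where they were, and A's former keys are spread over the remaining nodes rather than all falling on a single neighbor. `full.preference_list("cart:42", 3)` returns three distinct hosts for N = 3.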