Lecture 8: BigTable and Dynamo

Lecture 8: BigTable and Dynamo COSC6376 Cloud Computing Instructor: Weidong Shi (Larry), PhD Computer Science Department University of Houston

Outline: Plan, Project, Next Class, BigTable and HBase, Dynamo

Projects

Sample Projects: support video processing using HDFS and MapReduce; image processing using cloud; security services using cloud; web analytics using cloud; cloud-based MPI; novel applications of cloud-based storage; new pricing models; cyber-physical systems with cloud as the backend; bioinformatics using MapReduce.

Next week In-Class Presentation

In-Class Presentation Oct 3 (next Thursday), in class. Each team: 10 minutes. What should be included in the presentation: team, objectives, plan of work.

Project Proposal Due: Oct 8 Formal project description (at most 4 pages) Team members Objective Tools Plan of work (tasks and assignments) Division of labor Roadmap Risk and mitigation strategy

Plan Today: Bigtable and HBase, Dynamo. Thursday: Paxos.

Reading Assignment Due: Thursday

Bigtable Fay Chang, et al @google.com

Global Picture

BigTable A distributed multi-level map. Fault-tolerant, persistent. Scalable: thousands of servers, terabytes of in-memory data, petabytes of disk-based data, millions of reads/writes per second, efficient scans. Self-managing: servers can be added/removed dynamically and adjust to load imbalance. Often want to examine data changes over time, e.g. the contents of a web page over multiple crawls.

Basic Data Model A BigTable is a sparse, distributed, persistent multi-dimensional sorted map: (row, column, timestamp) -> cell contents. Good match for most Google applications.
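
A minimal sketch of that map as nested sorted structures, purely to make the indexing concrete; the class and method names below are illustrative, not from the paper:

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy in-memory rendering of the BigTable map:
// (row, column, timestamp) -> cell contents, rows kept in sorted order,
// multiple timestamped versions per cell.
public class ToyBigtable {
    // row -> (column -> (timestamp -> value))
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, byte[]>>> rows =
            new TreeMap<>();

    public void put(String row, String column, long timestamp, byte[] value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(column, c -> new TreeMap<>())
            .put(timestamp, value);
    }

    // Most recent version at or before the given timestamp, or null if absent.
    public byte[] get(String row, String column, long asOfTimestamp) {
        NavigableMap<String, NavigableMap<Long, byte[]>> columns = rows.get(row);
        if (columns == null) return null;
        NavigableMap<Long, byte[]> versions = columns.get(column);
        if (versions == null) return null;
        Map.Entry<Long, byte[]> entry = versions.floorEntry(asOfTimestamp); // newest ts <= asOfTimestamp
        return entry == null ? null : entry.getValue();
    }
}
```

Because the outer map is sorted by row key, a scan over a row range is just an in-order walk, which is what makes range scans cheap in this model.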

Tablet Contains some range of rows of the table; built out of multiple SSTables. (Diagram: a tablet covering rows aardvark through apple, made up of SSTables, each consisting of 64K blocks plus an index.)

Chubby A persistent and distributed lock service. Consists of 5 active replicas; one replica is the master and serves requests. The service is functional when a majority of the replicas are running and in communication with one another, i.e., when there is a quorum. Implements a nameservice that consists of directories and files.

Bigtable and Chubby Bigtable uses Chubby to: Ensure there is at most one active master at a time, Store the bootstrap location of Bigtable data (Root tablet), Discover tablet servers and finalize tablet server deaths, Store Bigtable schema information (column family information), Store access control list. If Chubby becomes unavailable for an extended period of time, Bigtable becomes unavailable.

Tablet Serving “Log Structured Merge Trees” Image Source: Chang et al., OSDI 2006

Tablet Representation Writes are appended to an append-only log on GFS and applied to a write buffer in memory (random-access); reads see the write buffer merged with the SSTables on GFS. SSTable: immutable on-disk ordered map from string to string. String keys: <row, column, timestamp> triples.
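
A hedged, single-node sketch of the write and read paths just described, ignoring GFS, recovery, and version merging; the names are illustrative:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// A write appends to the commit log and updates the in-memory memtable;
// a read checks the memtable first, then the SSTables from newest to oldest.
// (Real Bigtable returns a merged view across all of them; this sketch returns the first hit.)
public class ToyTablet {
    private final List<String> commitLog = new ArrayList<>();                  // stand-in for the GFS log
    private final NavigableMap<String, byte[]> memtable = new TreeMap<>();     // sorted write buffer
    private final Deque<NavigableMap<String, byte[]>> sstables = new ArrayDeque<>(); // newest first

    public void write(String key, byte[] value) {
        commitLog.add(key + "=" + new String(value, StandardCharsets.UTF_8));  // 1. log for recovery
        memtable.put(key, value);                                              // 2. apply to the buffer
    }

    public byte[] read(String key) {
        byte[] v = memtable.get(key);                        // newest data lives in the memtable
        if (v != null) return v;
        for (NavigableMap<String, byte[]> sst : sstables) {  // then the immutable on-disk SSTables
            v = sst.get(key);
            if (v != null) return v;
        }
        return null;
    }
}
```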

Compactions
Minor compaction: converts the memtable into an SSTable; reduces memory usage and log traffic on restart.
Merging compaction: reads the contents of a few SSTables and the memtable, and writes out a new SSTable; reduces the number of SSTables.
Major compaction: a merging compaction that results in only one SSTable; no deletion records, only live data.
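
Continuing the toy tablet above, a sketch of what minor and major compactions do to those structures (the tombstone encoding and names are assumptions):

```java
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.Map;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;

public class ToyCompaction {
    // Minor compaction: freeze the memtable into a new immutable SSTable and clear it,
    // so memory can be reclaimed and the log truncated.
    static NavigableMap<String, byte[]> minorCompaction(NavigableMap<String, byte[]> memtable,
                                                        Deque<NavigableMap<String, byte[]>> sstables) {
        NavigableMap<String, byte[]> newSstable =
                Collections.unmodifiableNavigableMap(new TreeMap<>(memtable));
        sstables.addFirst(newSstable);   // newest SSTable goes in front
        memtable.clear();
        return newSstable;
    }

    // Major compaction: merge all SSTables into one, keeping only the newest live value per key
    // and dropping deletion records (here a null value stands in for a tombstone).
    static NavigableMap<String, byte[]> majorCompaction(Deque<NavigableMap<String, byte[]>> sstables) {
        NavigableMap<String, byte[]> merged = new TreeMap<>();
        Set<String> seen = new HashSet<>();
        for (NavigableMap<String, byte[]> sst : sstables) {           // newest to oldest
            for (Map.Entry<String, byte[]> e : sst.entrySet()) {
                if (seen.add(e.getKey()) && e.getValue() != null) {
                    merged.put(e.getKey(), e.getValue());             // newest live version wins
                }
            }
        }
        sstables.clear();
        sstables.addFirst(merged);
        return merged;
    }
}
```

A merging compaction is the same merge applied to just a few SSTables plus the memtable, which is why it reduces the SSTable count without rewriting everything.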

Refinements: Locality Groups Multiple column families can be grouped into a locality group, and a separate SSTable is created for each locality group in each tablet. Segregating column families that are not typically accessed together enables more efficient reads. In WebTable, page metadata can be in one group and the contents of the page in another group.

Refinements: Compression Many opportunities for compression: similar values in the same row/column at different timestamps, similar values in different columns, similar values across adjacent rows. Two-pass custom compression scheme: the first pass compresses long common strings across a large window; the second pass looks for repetitions in a small window. Speed is emphasized, but space reduction is still good (10-to-1).

Refinements: Bloom Filters A read operation has to read from disk when the desired SSTable isn't in memory. Reduce the number of accesses by specifying a Bloom filter, which allows us to ask whether an SSTable might contain data for a specified row/column pair. A small amount of memory for Bloom filters drastically reduces the number of disk seeks for read operations; in practice, most lookups for non-existent rows or columns do not need to touch disk.

Bloom Filters

Approximate set membership problem Suppose we have a set S = {s1, s2, ..., sm} ⊆ U (the universe). Represent S in such a way that we can quickly answer "Is x an element of S?" To take as little space as possible, we allow false positives (i.e., x ∉ S but we answer yes). If x ∈ S, we must answer yes.

Bloom filters consist of an array A[n] of n bits (the space) and k independent random hash functions h1, ..., hk : U -> {0, 1, ..., n-1}.
1. Initially set the array to 0.
2. For each s ∈ S, set A[hi(s)] = 1 for 1 ≤ i ≤ k (an entry can be set to 1 multiple times; only the first time has an effect).
3. To check if x ∈ S, check whether all locations A[hi(x)] for 1 ≤ i ≤ k are set to 1. If not, clearly x ∉ S. If all A[hi(x)] are set to 1, we assume x ∈ S.

The array starts with all 0s. Each element of S (e.g. x1 and x2) is hashed k times, and each hash location is set to 1.

To query an element y, check its k hash locations. If only 1s appear, conclude that y is in S. This may yield a false positive.
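
A compact Bloom filter sketch following steps 1-3 above; deriving the k hashes from two base hashes (double hashing) is a common shortcut and an assumption here, not necessarily the scheme Bigtable uses:

```java
import java.util.BitSet;

// Bloom filter over strings: n bits, k hash locations per element.
// False positives are possible; false negatives are not.
public class BloomFilter {
    private final BitSet bits;
    private final int n;   // number of bits
    private final int k;   // number of hash functions

    public BloomFilter(int n, int k) {
        this.bits = new BitSet(n);
        this.n = n;
        this.k = k;
    }

    // i-th hash = h1 + i*h2 (double hashing) instead of k truly independent functions.
    private int hash(String x, int i) {
        int h1 = x.hashCode();
        int h2 = (h1 >>> 16) | 1;              // cheap second hash, forced odd
        return Math.floorMod(h1 + i * h2, n);
    }

    public void add(String s) {
        for (int i = 0; i < k; i++) bits.set(hash(s, i));      // set all k locations to 1
    }

    public boolean mightContain(String x) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(hash(x, i))) return false;           // any 0 bit: definitely not in S
        }
        return true;                                           // all 1s: probably in S
    }
}
```

In the SSTable use case, a false from `mightContain` lets a read skip the disk seek entirely; a true occasionally costs a wasted seek, which is the acceptable false-positive price.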

Bigtable Applications

Application 1: Google Analytics Enables webmasters to analyze traffic patterns at their web sites. Statistics such as: the number of unique visitors per day, the page views per URL per day, and the percentage of users that made a purchase given that they earlier viewed a specific page. How? A small JavaScript program that the webmaster embeds in their web pages. Every time the page is visited, the program is executed and records the following information about each request: user identifier and the page being fetched.

Application 1: Google Analytics Two of the Bigtables: Raw click table (~200 TB): a row for each end-user session. The row name includes the website's name and the time at which the session was created, which clusters sessions that visit the same web site and keeps them in sorted chronological order. Compression factor of 6-7. Summary table (~20 TB): stores predefined summaries for each web site, generated from the raw click table by periodically scheduled MapReduce jobs; each MapReduce job extracts recent session data from the raw click table. The row name includes the website's name, and the column family holds the aggregate summaries. Compression factor of 2-3.

Application 2: Google Earth & Maps Functionality: pan, view, and annotate satellite imagery at different resolution levels. One Bigtable stores raw imagery (~70 TB): the row name is a geographic segment, and names are chosen to ensure adjacent geographic segments are clustered together. A column family maintains the sources of data for each segment.

Application 3: Personalized Search Records user queries and clicks across Google properties. Users browse their search histories and request personalized search results based on their historical usage patterns. One Bigtable: the row name is the userid, and a column family is reserved for each action type, e.g., web queries, clicks. User profiles are generated using MapReduce and are used to personalize live search results. Replicated geographically to reduce latency and increase availability.

HBase is an open-source, distributed, column-oriented database built on top of HDFS and modeled on BigTable.

HBase is .. A distributed data store that can scale horizontally to 1,000s of commodity servers and petabytes of indexed storage. Designed to operate on top of the Hadoop distributed file system (HDFS) or Kosmos File System (KFS, aka Cloudstore) for scalability, fault tolerance, and high availability.

Backdrop Started toward the end of 2006 by Chad Walters and Jim Kellerman.
2006.11: Google releases the paper on BigTable.
2007.2: Initial HBase prototype created as a Hadoop contrib.
2007.10: First usable HBase.
2008.1: Hadoop becomes an Apache top-level project and HBase becomes a subproject.
2008.10~: HBase 0.18, 0.19 released.

Why HBase? HBase is a Bigtable clone. It is open source. It has a good community and promise for the future. It is developed on top of, and has good integration with, the Hadoop platform, which helps if you are using Hadoop already.

HBase Is Not … No join operators. Limited atomicity and transaction support: HBase supports multiple batched mutations of single rows only. Data is unstructured and untyped. Not accessed or manipulated via SQL; programmatic access is via Java, REST, or Thrift APIs, and scripting via JRuby.

HBase benefits over an RDBMS: no real indexes, automatic partitioning, scales linearly and automatically with new nodes, commodity hardware, fault tolerance, batch processing.

Testing
$ hbase shell
> create 'test', 'data'
0 row(s) in 4.3066 seconds
> list
test
1 row(s) in 0.1485 seconds
> put 'test', 'row1', 'data:1', 'value1'
0 row(s) in 0.0454 seconds
> put 'test', 'row2', 'data:2', 'value2'
0 row(s) in 0.0035 seconds
> put 'test', 'row3', 'data:3', 'value3'
0 row(s) in 0.0090 seconds
> scan 'test'
ROW     COLUMN+CELL
row1    column=data:1, timestamp=1240148026198, value=value1
row2    column=data:2, timestamp=1240148040035, value=value2
row3    column=data:3, timestamp=1240148047497, value=value3
3 row(s) in 0.0825 seconds
> disable 'test'
09/04/19 06:40:13 INFO client.HBaseAdmin: Disabled test
0 row(s) in 6.0426 seconds
> drop 'test'
09/04/19 06:40:17 INFO client.HBaseAdmin: Deleted test
0 row(s) in 0.0210 seconds
> list
0 row(s) in 2.0645 seconds

Connecting to HBase
Java client: get(byte[] row, byte[] column, long timestamp, int versions);
Non-Java clients: a Thrift server hosting an HBase client instance (sample Ruby, C++, and Java-via-Thrift clients); a REST server that hosts an HBase client.
TableInputFormat/TableOutputFormat for MapReduce: HBase as MR source or sink.
HBase Shell: ./bin/hbase shell YOUR_SCRIPT
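
The get(row, column, timestamp, versions) call above comes from the early HTable-style Java API. As a rough sketch of the same put/get flow with the newer (HBase 1.x+) client API, reusing the 'test' table and 'data' family from the shell example (details vary by HBase version):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("test"))) {

            // Equivalent of: put 'test', 'row1', 'data:1', 'value1'
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("data"), Bytes.toBytes("1"), Bytes.toBytes("value1"));
            table.put(put);

            // Equivalent of: get 'test', 'row1'
            Get get = new Get(Bytes.toBytes("row1"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("data"), Bytes.toBytes("1"));
            System.out.println(Bytes.toString(value));       // prints value1
        }
    }
}
```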

Dynamo

Motivation Build a distributed storage system: Scale Simple: key-value Highly available Guarantee Service Level Agreements (SLA)

System Assumptions and Requirements Query Model: simple read and write operations to a data item that is uniquely identified by a key. Other Assumptions: operation environment is assumed to be non-hostile and there are no security related requirements such as authentication and authorization.

Service Level Agreements (SLA) An application can deliver its functionality in a bounded time: every dependency in the platform needs to deliver its functionality with even tighter bounds. Example: a service guaranteeing that it will provide a response within 300ms for 99.9% of its requests for a peak client load of 500 requests per second. (Figure: the service-oriented architecture of Amazon's platform.)
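
Since the SLA is stated at the 99.9th percentile rather than the mean, checking it amounts to a percentile computation over measured latencies. A small illustration (not from the paper; the nearest-rank method is an assumption):

```java
import java.util.Arrays;

public class SlaCheck {
    // True if at least 99.9% of the measured latencies are within the bound.
    static boolean meetsSla(long[] latenciesMs, long boundMs) {
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int idx = (int) Math.ceil(0.999 * sorted.length) - 1;   // nearest-rank 99.9th percentile
        return sorted[Math.max(idx, 0)] <= boundMs;
    }

    public static void main(String[] args) {
        long[] latencies = {12, 40, 95, 120, 300, 180, 60, 250, 75, 310};
        // With only 10 samples the 99.9th percentile is the maximum (310 ms), so a 300 ms SLA fails.
        System.out.println(meetsSla(latencies, 300));   // false
    }
}
```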

Design Consideration Sacrifice strong consistency for availability Conflict resolution is executed during read instead of write, i.e. “always writeable”. Other principles: Incremental scalability. Symmetry. Decentralization. Heterogeneity.

Summary of techniques used in Dynamo and their advantages (Problem / Technique / Advantage):
Partitioning / Consistent hashing / Incremental scalability.
High availability for writes / Vector clocks with reconciliation during reads / Version size is decoupled from update rates.
Handling temporary failures / Sloppy quorum and hinted handoff / Provides high availability and durability guarantee when some of the replicas are not available.
Recovering from permanent failures / Anti-entropy using Merkle trees / Synchronizes divergent replicas in the background.
Membership and failure detection / Gossip-based membership protocol and failure detection / Preserves symmetry and avoids having a centralized registry for storing membership and node liveness information.

Partitioning and Consistent Hashing

Caches can Load Balance Numerous items sit in a central server, and requests can swamp the server. Distribute items among cache nodes; clients get items from the cache nodes, and the server gets only 1 request per item.

Who Caches What? Each cache node should hold few items, else the cache gets swamped by clients. Each item should be in few cache nodes, else the server gets swamped by caches and cache invalidations/updates become expensive.

A Solution: Hashing Example: y = ax + b (mod n). Intuition: assigns items to "random" cache nodes, so there are few items per cache, and it is easy to compute which cache node holds an item. Items are assigned to caches by the hash function; users use the hash to compute the cache for an item.

Problem: Adding Cache Nodes Suppose a new cache node arrives. How does it affect the hash function? The natural change is y = ax + b (mod n+1). Problem: this changes the bucket for almost every item, so every cache node will be flushed and servers get swamped with new requests. Goal: when a bucket is added, few items should move.
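
A small demonstration of the problem, with made-up hash parameters: going from n to n+1 buckets under y = ax + b (mod n) relocates almost every item.

```java
public class NaiveRehash {
    static int bucket(int item, int a, int b, int n) {
        return Math.floorMod(a * item + b, n);   // y = ax + b (mod n)
    }

    public static void main(String[] args) {
        int a = 31, b = 7, n = 10, items = 10_000;
        int moved = 0;
        for (int x = 0; x < items; x++) {
            if (bucket(x, a, b, n) != bucket(x, a, b, n + 1)) moved++;  // bucket under n vs. n+1 nodes
        }
        // Roughly 90% of items land in a different bucket after adding a single node.
        System.out.printf("%d of %d items moved (%.1f%%)%n", moved, items, 100.0 * moved / items);
    }
}
```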

Solution: Consistent Hashing Use a standard hash function to map both cache nodes and items to points in the unit interval; the "random" points spread uniformly. An item is assigned to the nearest cache node (bucket). Computation is as easy as a standard hash function.

Properties All buckets get roughly the same number of items (like standard hashing). When the kth bucket is added, only a 1/k fraction of items move, and only from a few caches. When a cache node is added, minimal reshuffling of cached items is required.

Consistent Hashing Partition using consistent hashing: keys hash to a point on a fixed circular space; the ring is partitioned into a set of ordered slots, and servers and keys are hashed over these slots. Nodes take positions on the circle. With A, B, and D present, B is responsible for the A-B range, D for the B-D range, and A for the D-A range. When C joins, the ranges split and C takes over the B-C range from D.
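
A minimal ring sketch using a sorted map from hash positions to nodes; the MD5-based position hash mirrors Dynamo's choice, but the class and method names are placeholders:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.NavigableMap;
import java.util.TreeMap;

// Consistent hashing ring: nodes and keys hash to points on a circle; a key is stored
// on the first node at or after its position, wrapping around at the top of the ring.
public class HashRing {
    private final NavigableMap<Long, String> ring = new TreeMap<>();

    static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xffL);   // first 64 bits of the MD5
            return h;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public void addNode(String node)    { ring.put(hash(node), node); }
    public void removeNode(String node) { ring.remove(hash(node)); }

    // Walk clockwise from the key's position to the first node.
    public String nodeFor(String key) {
        if (ring.isEmpty()) return null;
        Long pos = ring.ceilingKey(hash(key));
        return ring.get(pos != null ? pos : ring.firstKey());
    }
}
```

Adding node C only changes nodeFor for keys whose positions fall between B and C, which is exactly the "few items move" property above.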

Virtual Nodes Each node can be responsible for more than one virtual node. If a node becomes unavailable, the load handled by this node is evenly dispersed across the remaining available nodes. When a node becomes available again, it accepts a roughly equivalent amount of load from each of the other available nodes.

Replication Each data item is replicated at N hosts. “preference list”: The list of nodes that is responsible for storing a particular key.
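
Combining the ring with virtual nodes and replication, a sketch of how a preference list can be built: walk clockwise from the key and collect the first N distinct physical nodes. The "#i" virtual-node naming and the placeholder hash are assumptions for illustration:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Dynamo-style preference list on a consistent-hashing ring with virtual nodes.
public class PreferenceList {
    private final NavigableMap<Long, String> ring = new TreeMap<>();  // position -> physical node

    // Placeholder 64-bit hash; Dynamo itself uses MD5, but any well-spread hash works for a sketch.
    private static long hash(String s) {
        long h = 1125899906842597L;
        for (int i = 0; i < s.length(); i++) h = 31 * h + s.charAt(i);
        return h;
    }

    // Each physical node owns several positions (virtual nodes) on the ring.
    public void addNode(String node, int virtualNodes) {
        for (int i = 0; i < virtualNodes; i++) ring.put(hash(node + "#" + i), node);
    }

    // The N distinct physical nodes encountered walking clockwise from the key's position.
    public List<String> preferenceList(String key, int n) {
        List<String> result = new ArrayList<>();
        if (ring.isEmpty()) return result;
        Long start = ring.ceilingKey(hash(key));
        if (start == null) start = ring.firstKey();                  // wrap past the top of the ring
        Iterator<Map.Entry<Long, String>> it = ring.tailMap(start, true).entrySet().iterator();
        boolean wrapped = false;
        while (result.size() < n) {
            if (!it.hasNext()) {
                if (wrapped) break;                                   // fewer than n physical nodes exist
                it = ring.headMap(start, false).entrySet().iterator();
                wrapped = true;
                continue;
            }
            String node = it.next().getValue();
            if (!result.contains(node)) result.add(node);             // skip extra vnodes of chosen nodes
        }
        return result;
    }
}
```

The first node on the list typically coordinates reads and writes for the key, with the remaining N-1 nodes acting as replicas.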