CS 347: Parallel and Distributed Data Management
Notes X: BigTable, HBASE, Cassandra

Sources
– HBase: The Definitive Guide, Lars George, O'Reilly
– Cassandra: The Definitive Guide, Eben Hewitt, O'Reilly
– Bigtable: A Distributed Storage System for Structured Data, F. Chang et al., ACM Transactions on Computer Systems, Vol. 26, No. 2, June 2008

Lots of Buzz Words!
"Apache Cassandra is an open-source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tunably consistent, column-oriented database that bases its distribution design on Amazon's Dynamo and its data model on Google's Bigtable."
Clearly, it is buzz-word compliant!

Basic Idea: Key-Value Store
(figure: Table T, rows indexed by key)
– keys are sorted
– API:
  – lookup(key) → value
  – lookup(key range) → values
  – getNext → value
  – insert(key, value)
  – delete(key)
– each row has a timestamp
– single-row actions are atomic (but not persistent in some systems?)
– no multi-key transactions
– no query language!
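To make the API above concrete, here is a minimal single-node sketch in Python; it is an illustration of the interface, not the BigTable/HBase/Cassandra API, and the class and method names are made up.

```python
import bisect, time

class KVStore:
    """Toy key-value store: sorted keys, per-row timestamps, single-row operations."""
    def __init__(self):
        self._keys = []    # sorted list of keys
        self._rows = {}    # key -> (value, timestamp)

    def insert(self, key, value):
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = (value, time.time())   # each row carries a timestamp

    def lookup(self, key):
        return self._rows.get(key)

    def lookup_range(self, lo, hi):
        """Return (key, value, ts) for lo <= key <= hi, in key order."""
        i = bisect.bisect_left(self._keys, lo)
        j = bisect.bisect_right(self._keys, hi)
        return [(k,) + self._rows[k] for k in self._keys[i:j]]

    def get_next(self, key):
        """Smallest key strictly greater than `key`, with its value."""
        i = bisect.bisect_right(self._keys, key)
        if i < len(self._keys):
            k = self._keys[i]
            return (k,) + self._rows[k]
        return None

    def delete(self, key):
        if key in self._rows:
            del self._rows[key]
            self._keys.remove(key)
```

Here single-row atomicity is trivial because everything runs in one process; the real systems have to provide it across logs, compactions, and replicas.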

Fragmentation (Sharding)
(figure: table T split into tablets stored on server 1, server 2, server 3)
– use a partition vector
– "auto-sharding": the partition vector is selected automatically
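A hedged sketch of how a partition vector might route a key to a tablet server; the split points and server names below are made up for illustration.

```python
import bisect

# partition vector: split points that divide the sorted key space into tablets
split_points = ["g", "p"]                       # tablet 0: keys < "g"; tablet 1: "g"..< "p"; tablet 2: >= "p"
tablet_servers = ["server 1", "server 2", "server 3"]

def route(key):
    """Pick the tablet (and thus the server) responsible for `key`."""
    tablet = bisect.bisect_right(split_points, key)
    return tablet_servers[tablet]

print(route("apple"))   # -> server 1
print(route("kiwi"))    # -> server 2
print(route("zebra"))   # -> server 3
```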

Tablet Replication
(figure: a tablet with its primary on server 3 and backups on server 4 and server 5)
Cassandra:
– Replication Factor (number of copies)
– R/W Rule: One, Quorum, All
– placement policy (e.g., Rack Unaware, Rack Aware, ...)
– read all copies (return the fastest reply, do repairs if necessary)
HBase: does not manage replication itself; relies on HDFS
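As a rough illustration of the Cassandra-style R/W rule, the sketch below computes how many replica acknowledgements each level (One, Quorum, All) requires for a given replication factor N; reads are guaranteed to overlap the latest write when R + W > N. The function name is invented for illustration.

```python
def required_acks(level, n):
    """Replica acknowledgements needed at consistency level `level` with replication factor n."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return n // 2 + 1
    if level == "ALL":
        return n
    raise ValueError(level)

N = 3
for r, w in [("ONE", "ONE"), ("QUORUM", "QUORUM"), ("ONE", "ALL")]:
    R, W = required_acks(r, N), required_acks(w, N)
    print(f"read={r}, write={w}: R={R}, W={W}, R+W>N? {R + W > N}")
```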

Need a "directory"
– Table Name: Key → Server that stores the key → Backup servers
– can be implemented as a special table

Tablet Internals
(figure: an in-memory layer on top of layered, immutable disk segments)
Design philosophy: only write to disk using sequential I/O.
– flush the memory layer to disk periodically (minor compaction)
– the tablet is the merge of all layers (files)
– disk segments are immutable
– writes are efficient; reads are only efficient when all data is in memory
– use Bloom filters to speed up reads
– periodically reorganize into a single layer (merge, major compaction)
– deletes are recorded as tombstones
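A minimal sketch of this write/read path (memory layer, immutable segments, tombstones, Bloom-filter check), assuming a toy in-memory representation; real tablets add a write-ahead log, sorted on-disk files, and background compaction.

```python
TOMBSTONE = object()   # marker written in place of a deleted value

class Tablet:
    def __init__(self, flush_threshold=4):
        self.memtable = {}        # mutable in-memory layer
        self.segments = []        # newest-first list of (bloom_set, frozen_dict)
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memtable[key] = value             # writes only touch memory (plus a log, not shown)
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def delete(self, key):
        self.put(key, TOMBSTONE)               # deletes are just writes of a tombstone

    def flush(self):
        """Minor compaction: freeze the memory layer as a new immutable segment."""
        bloom = set(self.memtable)             # stand-in for a real Bloom filter
        self.segments.insert(0, (bloom, dict(self.memtable)))
        self.memtable = {}

    def get(self, key):
        """A read is the merge of all layers, newest first; the Bloom filter skips segments."""
        if key in self.memtable:
            v = self.memtable[key]
            return None if v is TOMBSTONE else v
        for bloom, seg in self.segments:
            if key in bloom and key in seg:
                v = seg[key]
                return None if v is TOMBSTONE else v
        return None
```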

Column Family
(figure: table columns grouped into families)
– for storage, treat each row as a single "super value"
– the API provides access to sub-values (use family:qualifier to refer to sub-values, e.g., price:euros, price:dollars)
– Cassandra allows a "super column": two-level nesting of columns (e.g., column A can have sub-columns X and Y)
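A small sketch of the family:qualifier addressing, treating each row as a nested map; this is a toy model of the layout, not any system's real API, and the row contents are invented.

```python
# one row, stored as a single "super value": family -> qualifier -> value
row = {
    "price": {"euros": 9.50, "dollars": 10.25},    # column family "price"
    "info":  {"title": "Widget", "color": "red"},  # column family "info"
}

def get_cell(row, column):
    """Look up a sub-value given a 'family:qualifier' column name."""
    family, qualifier = column.split(":")
    return row.get(family, {}).get(qualifier)

print(get_cell(row, "price:euros"))   # -> 9.5
print(get_cell(row, "info:color"))    # -> red
```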

Vertical Partitions
(figure: a column family can be manually implemented as a separate table)
– column families are good for sparse data and for column scans
– not so good for full-tuple reads
– are atomic updates to a row still supported?
– the API supports actions on the full table, mapped to actions on the column tables
– the API supports a column "project"
– to decide on a vertical partition, you need to know the access patterns

Failure Recovery (BigTable, HBase)
(figure: a tablet server with in-memory state and a write-ahead log on GFS or HDFS; the master node pings tablet servers and reassigns tablets to a spare tablet server on failure)

Failure Recovery (Cassandra)
– no master node; all nodes in the "cluster" are equal
– access any table in the cluster at any server
– that server forwards requests to the other servers
(figure: clients can contact server 1, server 2, or server 3 directly)

CS 347: Parallel and Distributed Data Management
Notes X: MemCacheD

MemCacheD
– general-purpose distributed memory caching system
– open source

What MemCacheD Is
(figure: the application talks to cache 1, cache 2, cache 3; each cache sits beside a data source, but there is no connection between a cache and its data source)
– get_object(cache 1, MyName) → X; put(cache 1, MyName, X)
– each cache is a hash table of (name, value) pairs
– a cache can purge MyName whenever it likes
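Since the cache and the data source are not connected, the application itself glues them together with the usual cache-aside pattern; a sketch (the `cache.get` / `cache.put` / `data_source.read` calls are stand-ins, not a specific client library's API):

```python
def get_object(cache, name, data_source):
    """Try the cache first; on a miss, read from the data source and repopulate the cache."""
    value = cache.get(name)            # hypothetical cache client call
    if value is None:                  # miss: the cache may have purged the entry at any time
        value = data_source.read(name)
        cache.put(name, value)         # best effort; the cache may purge it again later
    return value
```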

What MemCacheD Could Be (but ain't)
(figure: the caches form a single distributed cache in front of the data sources; get_object(X) returns X from whichever cache, or data source, holds it)

Persistence?
(figure: the same caches, now backed by a DBMS)
– each cache is a hash table of (name, value) pairs and can purge MyName whenever it likes
– for persistence, back the cache with a DBMS
– example: MemcacheDB = memcached + BerkeleyDB

CS 347: Parallel and Distributed Data Management
Notes X: ZooKeeper

ZooKeeper
– coordination service for distributed processes
– provides clients with a high-throughput, highly available, memory-only file system
(figure: a client talks to ZooKeeper, which stores a tree of znodes a, b, c, d, e under the root /; e.g., znode /a/d/e has state)

ZooKeeper Servers
(figure: each client connects to one server; every server holds a replica of the ZooKeeper state)
– reads are served locally by the server the client is connected to
– writes are propagated to and synchronized across all servers
– writes are totally ordered, using the Zab algorithm

Failures
(figure: a client connected to one server replica)
– if your server dies, just connect to a different one!

ZooKeeper Notes
Differences from a file system:
– all nodes can store data
– storage size is limited
API: insert node, read node, read children, delete node, ...
– can set triggers (watches) on nodes
– clients and servers must know all servers
– ZooKeeper works as long as a majority of servers are available
– writes are totally ordered; reads are ordered with respect to writes

ZooKeeper Applications
– metadata/configuration store
– lock server
– leader election
– group membership
– barrier
– queue
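As an illustration of one of these recipes, here is a sketch of a lock (equivalently, leader election) built from ephemeral, sequential znodes. It is written against a hypothetical client object `zk` with create / get_children / delete / wait_for_delete calls, not a specific ZooKeeper client library.

```python
def acquire_lock(zk, lock_path="/locks/mylock"):
    """Lock recipe sketch: the contender with the smallest sequence number holds the lock."""
    # ephemeral: the znode disappears if our session dies; sequential: server appends a counter
    me = zk.create(lock_path + "/node-", data=b"", ephemeral=True, sequential=True)
    my_name = me.split("/")[-1]
    while True:
        children = sorted(zk.get_children(lock_path))
        if my_name == children[0]:
            return me                              # smallest sequence number: we own the lock
        # otherwise wait for the znode just ahead of us to go away, then re-check
        predecessor = children[children.index(my_name) - 1]
        zk.wait_for_delete(lock_path + "/" + predecessor)   # hypothetical blocking watch

def release_lock(zk, me):
    zk.delete(me)   # explicit release; an ephemeral znode also vanishes if the holder crashes
```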

CS 347: Parallel and Distributed Data Management
Notes X: S4

Material based on the S4 paper and documentation.

S4
Platform for processing unbounded data streams:
– general purpose
– distributed
– scalable
– partially fault tolerant (whatever this means!)

Data Stream
– terminology: event (data record), key, attribute
– question: can an event have duplicate attributes for the same key? (I think so...)
– the stream is unbounded, generated by, say, user queries, purchase transactions, phone calls, sensor readings, ...

S4 Processing Workflow
(figure: a user-specified graph of "processing units" through which events flow)

Inside a Processing Unit
(figure: a processing unit contains processing elements PE1, PE2, ..., PEn, one per key value, e.g., key=a, key=b, key=z)

Example
– input: a stream of English quotes
– output: a sorted list of the top K most frequent words
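A sketch of how the PEs for this example might look: a keyless splitter, one counter PE per word, and a single top-K PE. The class names, `emit` callback, and method signatures are invented for illustration; they are not the real S4 API.

```python
import heapq
from collections import defaultdict

class QuoteSplitterPE:
    """Keyless PE: splits each quote event into one event per word."""
    def __init__(self, emit):
        self.emit = emit
    def process(self, quote):
        for word in quote.lower().split():
            self.emit(stream="word", key=word, value=1)

class WordCountPE:
    """One instance per word (the key); counts occurrences and forwards updates."""
    def __init__(self, word, emit):
        self.word, self.count, self.emit = word, 0, emit
    def process(self, value):
        self.count += value
        self.emit(stream="count", key="topk", value=(self.word, self.count))

class TopKPE:
    """Single instance (key='topk'): keeps the K most frequent words seen so far."""
    def __init__(self, k):
        self.k, self.counts = k, defaultdict(int)
    def process(self, value):
        word, count = value
        self.counts[word] = count
    def top(self):
        return heapq.nlargest(self.k, self.counts.items(), key=lambda kv: kv[1])
```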


Processing Nodes
(figure: events are routed to processing nodes 1..m by hash(key); each processing node hosts the PEs for the keys that hash to it, e.g., key=a, key=b)

Dynamic Creation of PEs
– as a processing node sees new key attributes, it dynamically creates new PEs to handle them
– think of PEs as threads

Another View of a Processing Node

Failures
– the communication layer detects node failures and provides failover to standby nodes
– what happens to events in transit during a failure? (my guess: events are lost!)

How do we do DB operations on top of S4?
– selects and projects are easy!
– what about joins?

What is a Stream Join?
– for a true join, we would need to store all inputs forever! Not practical...
– instead, define a window join:
  – at time t a new R tuple arrives
  – it only joins with the previous w S tuples
(figure: streams R and S)

One Idea for Window Join
(figure: streams R and S feed a processing unit; the "key" is the join attribute)
code for PE:
  for each event e:
    if e.rel = R [ store e in Rset (last w)
                   for s in Sset: output join(e, s) ]
    else ...
Is this right??? (it enforces the window on a per-key-value basis)
Maybe add sequence numbers to events to enforce the correct window?

Another Idea for Window Join
(figure: streams R and S feed a single processing unit)
All R and S events have key = "fake"; say the join attribute is C.
code for PE:
  for each event e:
    if e.rel = R [ store e in Rset (last w)
                   for s in Sset:
                     if e.C = s.C then output join(e, s) ]
    else if e.rel = S ...
The entire join is done in one PE: no parallelism.
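A runnable sketch of this single-PE window join, with the slide's pseudocode turned into Python; `w` is the window size in tuples, events are plain dicts, and the class and callback names are invented (this is not the S4 API).

```python
from collections import deque

class WindowJoinPE:
    """Joins R and S events on attribute C; each side keeps only its last w tuples."""
    def __init__(self, w, output):
        self.r_window = deque(maxlen=w)   # last w R tuples
        self.s_window = deque(maxlen=w)   # last w S tuples
        self.output = output

    def process(self, event):
        if event["rel"] == "R":
            self.r_window.append(event)
            for s in self.s_window:
                if event["C"] == s["C"]:
                    self.output((event, s))
        else:  # event from S: the symmetric case
            self.s_window.append(event)
            for r in self.r_window:
                if event["C"] == r["C"]:
                    self.output((r, event))
```

As the slide notes, because every event carries the same fake key, all of this work lands in one PE, so there is no parallelism.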

Do You Have a Better Idea for Window Join?
(figure: streams R and S)

Final Comment: Managing the State of a PE
(figure: a PE with its associated state)
– who manages the state? S4: the user does; Mupet: the system does
– is the state persistent?

CS 347: Parallel and Distributed Data Management
Notes X: Hyracks

Hyracks
– a generalization of MapReduce
– infrastructure for "big data" processing
– material here based on the Hyracks paper (appeared in ICDE 2011)

A Hyracks data object:
– records partitioned across N sites
– a simple record schema is available (more than just key-value)
(figure: partitions at site 1, site 2, site 3)

Operations
(figure: an operator together with a distribution rule for its input/output partitions)

Operations (parallel execution)
(figure: operator instances op 1, op 2, op 3 running in parallel)

Example: Hyracks Specification
(figure: a job specification; input and output are sets of partitions, starting from the initial partitions, mapping to new partitions, plus 2 replicated partitions)

Notes
– the job specification can be produced manually or automatically

Example: Activity Node Graph
(figure: activities connected by sequencing constraints)

Example: Parallel Instantiation
(figure: activities grouped into stages; a stage starts after its input stages finish)

System Architecture

Library of Operators:
– file readers/writers
– mappers
– sorters
– joiners (various types)
– aggregators
– can add more

Library of Connectors:
– N:M hash partitioner
– N:M hash-partitioning merger (inputs sorted)
– N:M range partitioner (with partition vector)
– N:M replicator
– 1:1 connector
– can add more!
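A hedged sketch of what the N:M hash-partitioning connector does: each of the N sender partitions independently routes every record to one of M receiver partitions by hashing a field, so records with equal field values all meet at the same receiver. The function, field name, and sample records are made up for illustration.

```python
def hash_partition(sender_records, m, field):
    """Route one sender partition's records to M receiver partitions by hash(field)."""
    receivers = [[] for _ in range(m)]
    for record in sender_records:
        receivers[hash(record[field]) % m].append(record)
    return receivers

# Each of the N senders runs this on its own partition; receiver j is fed the union
# of every sender's bucket j.
buckets = hash_partition([{"k": "a", "v": 1}, {"k": "b", "v": 2}, {"k": "a", "v": 3}],
                         m=2, field="k")
```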

Hyracks Fault Tolerance: Work in Progress?
– one approach: save partial results (operator outputs) to persistent storage; after a failure, redo all the work needed to reconstruct the missing data
– can we do better? Maybe: each process retains its previous results until they are no longer needed?
(figure: Pi's output r1, r2, ..., r8; the earlier results have "made their way" to the final result, the later ones are the current output)

CS 347: Parallel and Distributed Data Management
Notes X: Pregel

Material based on the Pregel paper (SIGMOD 2010).
Note: there is an open-source version of Pregel called Giraph.

Pregel
– a computational model/infrastructure for processing large graphs
– prototypical example: PageRank
(figure: a small graph with vertices a, b, d, e, f, x, where a and b point to x)
PR[i+1, x] = f( PR[i, a]/n_a, PR[i, b]/n_b ), where n_a and n_b are the out-degrees of a and b

Pregel
– synchronous computation in iterations (supersteps)
– in one iteration, each node:
  – gets messages from its neighbors
  – computes
  – sends data to its neighbors
(same PageRank figure and formula as the previous slide)
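A minimal sketch of this synchronous loop: compute every active vertex, then deliver all messages at the barrier, and stop when everyone has voted to halt and nothing is in flight. The driver and the vertex interface (`.active`, `.compute(superstep, messages)` returning a dict of outgoing messages) are invented for illustration, not Pregel's implementation.

```python
def run(vertices, max_supersteps=50):
    """vertices: dict id -> vertex with .active and .compute(superstep, messages) -> {dest: msg}."""
    inbox = {vid: [] for vid in vertices}
    for superstep in range(1, max_supersteps + 1):
        outbox = {vid: [] for vid in vertices}
        any_activity = False
        for vid, v in vertices.items():
            if v.active or inbox[vid]:                 # an incoming message reactivates a halted vertex
                sent = v.compute(superstep, inbox[vid])
                for dest, msg in sent.items():
                    outbox[dest].append(msg)
                any_activity = any_activity or v.active or bool(sent)
        inbox = outbox                                 # barrier: messages are delivered next superstep
        if not any_activity:                           # all vertices halted and no messages in flight
            break
```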

Pregel vs. MapReduce/S4/Hyracks/...
– in MapReduce, S4, Hyracks, ..., the workflow is separate from the data
– in Pregel, the data (the graph) drives the data flow

Pregel Motivation
– many applications require graph processing
– MapReduce and other workflow systems are not a good fit for graph processing
– need to run graph algorithms on many processors

Example of Graph Computation

Termination
– after each iteration, each vertex votes to halt or not
– if all vertices vote to halt, the computation terminates

Vertex Compute (simplified)
Available data for iteration i:
– InputMessages: { [from, value] }
– OutputMessages: { [to, value] }
– OutEdges: { [to, value] }
– MyState: value
OutEdges and MyState are remembered for the next iteration.

Max Computation
  change := false
  for [f, w] in InputMessages do
    if w > MyState.value then
      [ MyState.value := w
        change := true ]
  if (superstep = 1) OR change then
    for [t, w] in OutEdges do
      add [t, MyState.value] to OutputMessages
  else
    vote to halt

Page Rank Example
(figure: the PageRank vertex compute code, annotated with: the iteration count, iterating through InputMessages, MyState.value, and a shorthand for sending the same message to all out-edges)

Single-Source Shortest Paths
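For concreteness, a sketch of the standard Pregel-style shortest-paths vertex compute, in the same spirit as the max computation above (MyState holds the best known distance from the source). The class layout is illustrative Python compatible with the toy driver sketched earlier, not the paper's code.

```python
INF = float("inf")

class SSSPVertex:
    def __init__(self, vid, out_edges, is_source=False):
        self.vid = vid
        self.out_edges = out_edges                  # list of (neighbor_id, edge_weight)
        self.state = 0 if is_source else INF        # best known distance from the source
        self.active = True

    def compute(self, superstep, messages):
        candidate = min(messages, default=INF)      # best distance offered by any neighbor
        out = {}
        if candidate < self.state or (superstep == 1 and self.state == 0):
            self.state = min(self.state, candidate)
            for neighbor, weight in self.out_edges: # tell neighbors about the improvement
                out[neighbor] = self.state + weight
        self.active = False                         # vote to halt; a new message reactivates us
        return out
```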

Architecture
(figure: a master node coordinating worker A, worker B, worker C; input data 1 and input data 2 hold the graph, which has nodes a, b, c, d, ...; sample record: [a, value])
– partition the graph and assign it to workers (e.g., worker A gets vertices a, b, c; worker B gets d, e; worker C gets f, g, h)
– read the input data; worker A forwards input values to the appropriate workers
– run superstep 1
– at the end of superstep 1, send messages; halt?
– run superstep 2
– checkpoint: write to stable storage: MyState, OutEdges, InputMessages (or OutputMessages)
– if a worker dies, find a replacement and restart from the latest checkpoint

Interesting Challenge
– how best to partition the graph for efficiency?
(figure: a small graph with vertices a, b, d, e, f, g, x)

CS 347: Parallel and Distributed Data Management
Notes X: Kestrel

Kestrel
– a Kestrel server handles a set of reliable, ordered message queues
– a server does not communicate with other servers (advertised as good for scalability!)
(figure: clients issue "get q" and "put q" requests to a server that holds the queues and a log)
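A toy sketch of the server-side idea in the figure: an ordered in-memory queue whose operations are appended to a journal so the queue can be rebuilt after a crash. This illustrates the picture only; it is not Kestrel's actual protocol or journal format, and the class and file layout are invented.

```python
from collections import deque

class ReliableQueue:
    """Ordered queue whose operations are journaled so state can be rebuilt after a restart."""
    def __init__(self, journal_path):
        self.items = deque()
        self.journal = open(journal_path, "a+")

    def put(self, item):
        self.journal.write(f"PUT {item}\n")   # log first, then apply
        self.journal.flush()
        self.items.append(item)

    def get(self):
        if not self.items:
            return None
        self.journal.write("GET\n")
        self.journal.flush()
        return self.items.popleft()

    @staticmethod
    def replay(journal_path):
        """Rebuild the in-memory queue by replaying the journal after a crash."""
        q = deque()
        with open(journal_path) as f:
            for line in f:
                if line.startswith("PUT "):
                    q.append(line[4:].rstrip("\n"))
                elif line.startswith("GET"):
                    q.popleft()
        return q
```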