CIS 455/555: Internet and Web Systems
Bigtable and Percolator
University of Pennsylvania, April 25, 2016
© 2016 A. Haeberlen, Z. Ives


Announcements
- 2nd midterm: Wednesday, 10:30am-noon, in three rooms: TOWN 319, BENN 231, Berger auditorium
- Please complete the course evaluations! Please let me know how you liked the class (topics covered, structure, projects, assignments, ...) and especially what aspects could be improved. I already know the workload is very high. Your feedback will benefit future instances of CIS 455!
- Project demo slots will be available later today; one member of each team should sign up for one slot
- Reading: Peng & Dabek, "Large-scale Incremental Processing Using Distributed Transactions and Notifications", OSDI 2010

Reminder: Google award
- The team with the best search engine will receive an award (sponsored by Google)
- Criteria: architecture/design, speed, quality of search results, reliability, user interface, written final report
- The winning team gets four Nexus 7 tablets
- Winners will be announced on the course web page

Plan for today
- Bigtable   <-- NEXT
- Percolator
  - Transactions and locking
  - Snapshot isolation
  - Observers

Bigtable
- Implements a multidimensional sorted map: keys are (row, column, timestamp); this provides versioning
- Data is maintained in lexicographic order (by row key)
- Atomic lookup and update operations on each row, but no atomic cross-row operations
- Used by many Google projects, including Google Earth, the web index, and possibly others
[Figure from the Bigtable paper (OSDI 2006): an example table showing a "column family" and different versions of a cell]
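The data model described above can be illustrated with a small in-memory sketch. The class name `TinyBigtable` and its methods are invented for this illustration; the real system is distributed and has a much richer API, but the shape of the map is the same: a sparse, sorted mapping from (row, column, timestamp) to a cell value.

```python
class TinyBigtable:
    """Toy model of Bigtable's map: (row, column, timestamp) -> value."""

    def __init__(self):
        # row -> column -> list of (timestamp, value), newest first
        self.rows = {}

    def put(self, row, column, timestamp, value):
        versions = self.rows.setdefault(row, {}).setdefault(column, [])
        versions.append((timestamp, value))
        versions.sort(reverse=True)  # keep the newest version first

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp` (latest if None)."""
        versions = self.rows.get(row, {}).get(column, [])
        for ts, value in versions:
            if timestamp is None or ts <= timestamp:
                return value
        return None

    def scan(self, start_row, end_row):
        """Rows are kept in lexicographic order, so range scans are cheap."""
        for row in sorted(self.rows):
            if start_row <= row < end_row:
                yield row
```

Note how a read with an explicit timestamp returns an older version of the cell; this versioning is exactly what Percolator later builds its snapshot reads on.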

Bigtable implementation
- A single-master system, similar to GFS
- The table is broken into tablets, each of which contains a contiguous region of the key space; tablets are stored by tablet servers
- A master assigns tablets to tablet servers
- Persistent state is stored in GFS files; recently committed data is kept in a memtable in memory
- Designed to be scalable: handles petabytes of data, runs reliably on large numbers of unreliable machines
[Figure from the Bigtable paper (OSDI 2006): a write op goes to the tablet log and the memtable; a read op merges the memtable with the SSTable files]
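The tablet read/write path in the figure can be sketched as follows. This is a simplification (the class and method names are invented; real SSTables are immutable sorted files in GFS, not Python dicts), but it shows the key idea: writes are logged for durability and buffered in memory, while reads consult the memtable before falling back to older flushed data.

```python
class Tablet:
    """Toy sketch of a Bigtable tablet's read/write path."""

    def __init__(self):
        self.log = []          # stands in for the commit log in GFS
        self.memtable = {}     # recently committed data, in memory
        self.sstables = []     # immutable flushed snapshots, newest first

    def write(self, key, value):
        self.log.append((key, value))  # durability first, then memory
        self.memtable[key] = value

    def read(self, key):
        if key in self.memtable:       # freshest data wins
            return self.memtable[key]
        for sstable in self.sstables:  # then newest flushed data
            if key in sstable:
                return sstable[key]
        return None

    def minor_compaction(self):
        """Freeze the memtable into a new immutable SSTable."""
        self.sstables.insert(0, dict(self.memtable))
        self.memtable = {}
```

After a crash, the memtable would be rebuilt by replaying the log, which is why the log append happens before the in-memory update.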

Some services that use Bigtable
- In 2006, there were 388 non-test Bigtable clusters at Google, with a combined total of 24,500 tablet servers
- Example: Google Analytics
  - Raw click table (~200 TB): one row for each end-user session
  - Summary table (~20 TB): predefined summaries per website
Source: Bigtable paper (OSDI 2006)

Flashback
- Bigtable uses many of the technologies we've been looking at in this course:
  - The lock service is made fault-tolerant with Paxos
  - The tablet location hierarchy is basically a B+ tree
  - Clients can run per-row transactions
  - Data is persisted in a scalable file system, GFS
  - Bigtable can be used as a source or target for MapReduce jobs
- More details are in the paper: F. Chang, J. Dean, S. Ghemawat, W. Hsieh, D. Wallach, M. Burrows, T. Chandra, A. Fikes, R. Gruber: "Bigtable: A Distributed Storage System for Structured Data", OSDI 2006

Plan for today
- Bigtable
- Percolator   <-- NEXT
  - Transactions and locking
  - Snapshot isolation
  - Observers

Why Percolator?
- Scenario: web crawler
  - We have a huge index (Google: tens of petabytes), and we need to run some computation on it (e.g., PageRank updates, clustering, ...); Google's indexing system is a chain of many MapReduces
  - Every day we recrawl a small part of the web; how do we update the index?
- Alternative #1: Run MapReduce on the changed pages only
  - Problem: not accurate; for example, there may be links between the new pages and the rest of the web
- Alternative #2: Re-run MapReduce on the entire data set
  - Problem: wasteful; discards work done in earlier runs. This is what Google actually used to do prior to Percolator
- Alternative #3: Update incrementally

Example
[Figure: duplicate elimination in the MapReduce-based pipeline. A per-URL table has columns URL, Checksum, PageRank, IsCanonical; nyt.com and nytimes.com share checksum 0xabcdef01 with PageRanks 6 and 9, and the 'map / reduce / invert links' steps maintain a Checksum -> Canonical table (0xabcdef01 -> nyt.com) so that only one copy is marked canonical]

What is Percolator?
- A system for incrementally processing updates to a large data set
- The Percolator-based indexing system is known as 'Caffeine'
  - It reduced the average age of documents in Google search results by 50%; documents move through Caffeine about 100x faster than through the previous system
- Published at OSDI 2010: Peng & Dabek, "Large-scale Incremental Processing Using Distributed Transactions and Notifications"

What Percolator provides
- Percolator builds on Bigtable, but additionally provides the following two abstractions:
  - ACID transactions (as seen earlier) with snapshot isolation
  - Observers: a way to organize incremental computation
- What is an observer?
  - Essentially, a small piece of code that is invoked whenever a specific column changes
  - Percolator applications are structured as a series of observers: an external process (e.g., the crawler) triggers updates in the table; each update is handled by an observer, which then produces more updates, and thus more work for other observers, etc.

Why ACID semantics?
- Couldn't they have built this system without transactions? Transactions are not 'free'; they have some overhead
- Yes, but transactions make it easier to reason about the state of the system, especially when many updates are performed concurrently
  - They avoid the introduction of errors (e.g., by bugs or crashes) into a long-lived repository, and they allow easy construction of consistent, up-to-date indexes
- An interesting change of perspective, given the earlier debates (e.g., Stonebraker/DeWitt)

Snapshot isolation
- What is snapshot isolation?
  - Conceptually, each transaction performs all its reads at a start timestamp and all its writes at a commit timestamp
  - In the paper's example, transaction 2 does not see the writes from transaction 1, but transaction 3 sees the writes from both 1 and 2
- Implemented using versioning
- Protects against write-write conflicts: if two transactions write to the same cell, at least one aborts
- Comparison to serializability?
[Figure from the Percolator paper: three transactions on a timeline, each spanning from its read (start) timestamp to its write (commit) timestamp]
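The two rules above (read at the start timestamp, detect write-write conflicts at commit) can be sketched over a versioned store. All names here (`SnapshotStore`, `next_ts`, etc.) are invented for illustration; in Percolator the timestamps come from a separate oracle service and the versions live in Bigtable.

```python
class SnapshotStore:
    """Minimal sketch of snapshot isolation over a versioned key-value map."""

    def __init__(self):
        self.versions = {}   # key -> list of (commit_ts, value)
        self.clock = 0       # stands in for the timestamp oracle

    def next_ts(self):
        self.clock += 1
        return self.clock

    def read(self, key, ts):
        """Read the newest version committed at or before `ts`."""
        best = None
        for commit_ts, value in self.versions.get(key, []):
            if commit_ts <= ts and (best is None or commit_ts > best[0]):
                best = (commit_ts, value)
        return best[1] if best else None

    def commit(self, start_ts, writes):
        """Commit buffered writes; abort (return None) on a write-write conflict."""
        for key in writes:
            for commit_ts, _ in self.versions.get(key, []):
                if commit_ts > start_ts:   # someone wrote this cell after our snapshot
                    return None
        commit_ts = self.next_ts()
        for key, value in writes.items():
            self.versions.setdefault(key, []).append((commit_ts, value))
        return commit_ts
```

A transaction that started before a conflicting writer committed neither sees that writer's data nor is allowed to overwrite it, which is exactly the write-write protection described above.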

Locking in Percolator
- Quite different from DBMS locking:
  - Locks are kept in special Bigtable columns; this ensures persistence and provides high throughput
  - Remember: accesses to individual rows are already atomic in Bigtable!
- Initial state of the running example (table from the Percolator paper):

  key  | bal:data | bal:lock | bal:write
  Bob  | 6:       | 6:       | 6: data @ 5
       | 5: $10   | 5:       | 5:
  Joe  | 6:       | 6:       | 6: data @ 5
       | 5: $2    | 5:       | 5:

Transactions in Percolator
1. At the beginning, obtain a start timestamp (comes from a timestamp oracle)
2. Buffer all writes until commit time
3. At commit time, try to lock all the cells being written ('prewrite')
   - If existing locks are found, the transaction aborts
   - A random cell is designated as the 'primary'; the other cells contain a reference to the primary
4. Obtain a commit timestamp from the oracle
5. Release the locks and make the writes visible
   - Start with the primary (this ensures that roll-forward is possible)
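The steps above can be condensed into a sketch of the prewrite/commit protocol. This is a simplified stand-in (the `Cell` class and function signatures are invented; the real system stores these three column groups in Bigtable and relies on its single-row atomicity), but the commit point is the same: the instant the primary lock is replaced by a write record.

```python
class Cell:
    """Per-(row, column) state, mirroring the data/lock/write columns."""
    def __init__(self):
        self.data = {}    # start_ts -> buffered value
        self.lock = {}    # start_ts -> name of the primary cell
        self.write = {}   # commit_ts -> start_ts of the data it exposes

def prewrite(cells, writes, start_ts, primary):
    """Step 3: lock every written cell, pointing each lock at the primary."""
    for key, value in writes.items():
        cell = cells.setdefault(key, Cell())
        # Abort on an existing lock, or on a write committed after our snapshot.
        if cell.lock or any(ts > start_ts for ts in cell.write):
            return False
        cell.data[start_ts] = value
        cell.lock[start_ts] = primary
    return True

def commit(cells, writes, start_ts, commit_ts, primary):
    """Step 5: expose the writes, primary first."""
    primary_cell = cells[primary]
    if start_ts not in primary_cell.lock:
        return False  # our lock was cleaned up; we must abort
    # Commit point: atomically swap the primary lock for a write record.
    del primary_cell.lock[start_ts]
    primary_cell.write[commit_ts] = start_ts
    # Secondaries can be rolled forward even if we crash from here on.
    for key in writes:
        if key != primary:
            cells[key].lock.pop(start_ts, None)
            cells[key].write[commit_ts] = start_ts
    return True
```

Because the primary is committed first, a crash between the two phases leaves enough information behind for other transactions to finish or undo the work, as discussed next.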

Handling faulty nodes in Percolator
- What if a node fails mid-transaction?
  - What would happen in a DBMS? What is the effect in Percolator? What needs to happen?
- What should a transaction do if it finds locks left behind by another transaction?
  - Option #1: Roll back. Possible if the primary lock still exists, because then no changes have been made visible yet (the primary lock is always removed first)
  - Option #2: Roll forward. Necessary if the primary lock no longer exists: the transaction has committed, so all of its remaining writes need to be made visible
- What if two transactions 'collide'?
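The rollback/roll-forward decision can be sketched as a small cleanup routine. The cell representation (plain dicts with 'data', 'lock', and 'write' maps) and the function name are invented for this illustration; the point is only the decision rule: the presence of the primary lock decides the fate of the whole transaction.

```python
def cleanup(cells, key, lock_ts):
    """Resolve a stale lock found at `cells[key]` left by transaction `lock_ts`.

    cells: {name: {'data': {ts: val}, 'lock': {ts: primary}, 'write': {cts: ts}}}
    """
    primary = cells[key]['lock'][lock_ts]      # every lock names its primary
    if lock_ts in cells[primary]['lock']:
        # Primary lock still present: nothing is visible yet, so roll back
        # by erasing the transaction's locks and buffered data everywhere.
        for cell in cells.values():
            cell['lock'].pop(lock_ts, None)
            cell['data'].pop(lock_ts, None)
        return 'rolled back'
    # Primary lock gone: the transaction committed. Roll forward by copying
    # the primary's commit record to this secondary cell.
    commit_ts = next(ts for ts, s in cells[primary]['write'].items() if s == lock_ts)
    cells[key]['write'][commit_ts] = lock_ts
    del cells[key]['lock'][lock_ts]
    return 'rolled forward'
```

Note that either outcome is safe to apply concurrently by any transaction that stumbles on the lock: the decision depends only on durable state in the table, not on the crashed client.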

An example
- Transfer $7 from Bob to Joe
- The lock on Bob's 'bal' column is chosen as the primary
- Until version 8 of Bob's row is written, the transaction would be rolled back if the lock holder crashed; after that point, it would be rolled forward
- The visible data at this point is still $10/$2 (see bal:write); the new balances $3/$9 are not yet visible at version 7
[Table taken from the Percolator paper (OSDI '10): at version 7, Bob's bal:data holds $3 with a PRIMARY lock and Joe's holds $9 with a lock referencing the primary; at version 8, the locks are removed and bal:write records data @ 7 for both rows]

Observers
- The user writes code ('observers') that is triggered by changes to the table
  - Register a function and a set of columns to be observed; Percolator invokes the function when data in those columns is modified
- Similar to database triggers
  - But: unlike triggers, observers and their triggering transactions are not atomic, so observers cannot be used to maintain data integrity!
  - The authors claim that this makes them 'easier to understand'

Observer execution
- Guarantees:
  - Multiple observers are allowed to observe the same column (consequence: it is possible to shoot yourself in the foot!)
  - But: at most one observer's transaction will commit for each observed change
- How is this implemented?
  - Special 'notify' and 'acknowledgment' columns
  - When a transaction writes to an observed cell, it sets the 'notify' column
  - Percolator workers continuously perform a distributed scan of the table, looking for dirty 'notify' cells
  - When one is found, the observer runs and then updates the 'acknowledgment' column
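The notify/acknowledgment mechanism can be sketched as follows. All names are invented for this illustration, and the real system runs the observer inside a transaction and scans a distributed table rather than a local set; the sketch only shows the at-most-once bookkeeping per observed change.

```python
def write_observed(table, row, col, value, notify):
    """A write to an observed cell also sets its 'notify' dirty bit."""
    table[(row, col)] = value
    notify.add((row, col))

def worker_scan(table, notify, observers, acked):
    """One pass of a worker's scan: run observers on un-acked dirty cells.

    observers: list of (watched_column, callback) pairs.
    """
    ran = []
    for cell in sorted(notify):
        if cell in acked:                  # change already handled
            continue
        row, col = cell
        for watched_col, observer in observers:
            if col == watched_col:
                observer(table, row, col)  # may produce further writes
        acked.add(cell)                    # at most once per observed change
        ran.append(cell)
    return ran
```

Each observer run may itself call `write_observed` on other columns, which is how work cascades through a chain of observers.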

Theory and practice
- Observation: worker threads tend to cluster in the same region of the table!
  - Why might this be happening? When a worker is busy, other workers queue up behind it; similar to 'bus clumping' in public transport
- Solution? When a worker finds that it is scanning the same row as another worker, it jumps to a random cell
  - Not applicable to public transport

"It depends"
- What is faster: MapReduce or Percolator?
- Example: clustering on 240 machines with a continuous crawl
- The answer depends on the crawl rate: at high crawl rates, Percolator saturates its resources
[Figure from the Percolator paper: relative performance as a function of crawl rate]

Alternatives to Percolator
- Traditional DBMS: better for smaller computations (Percolator is designed for multi-petabyte data sets!)
- MapReduce: better if computations can't be broken down into small updates
- Bigtable alone: better if the computation does not have strong consistency requirements

Recap: Percolator
- A more recent technology used in Google's index
- Allows incremental updates: no need to re-run large MapReduce jobs over the entire index. Result: the data in the index is 'fresher'
- Main components: transactions and observers
- Provides snapshot isolation semantics (e.g., cheaper reads)
- Runs over Bigtable, which in turn runs over GFS
- An "existence proof for distributed transactions at Web scale"

Review questions

You should be able to...
- Identify security problems in web systems and apply suitable countermeasures. Example: devise attacks on a poorly secured servlet
- Write XQueries (FLWOR etc.)
- Compare various consistency models. Example: eventual/sequential/snapshot consistency
- Understand the fundamentals of Information Retrieval. Example: compare the Boolean model and the Vector model
- Understand techniques for achieving robustness to various types of faults, and their costs. Example: how would you build a storage system that handles a) crash faults, b) rational behavior, c) Byzantine faults?

Review questions
- Compare the architectures of Google and Mercator
- What is a Sybil attack, and how can you defend a system against it?
- How would you implement a DHT in Pastry? Be able to provide pseudocode and discuss failure cases
- Explain similarities and differences between the semantics of RPCs and local function calls. Can you pass values by reference in an RPC? How can you achieve exactly-once semantics?
- Compare SOAP and REST
- Explain PageRank: what is the intuition? How is it computed?

Review questions
- Compare XQuery and XSLT
- What are the web-specific challenges for Information Retrieval?
- Compare the Boolean model and the Vector model
- Compare HITS and PageRank
- Write a simple MapReduce program
- What are possible defenses against various SEO techniques?
- Explain the utility computing model; compare it to the classical model
- Compare different consistency models
- Be able to work through a simple ARIES example
- Which faults can you (not) recover from in 2PC?
- Give an example of a fault that is rational but not Byzantine

Review questions
- Design or debug a simple incentive scheme. How would you exploit BitTorrent for your own profit?
- Explain why we need fault models
- How would you implement search suggestions? How would you implement phrase search?
- How can you 'optimize' the PageRank of your site?
- Explain what the utility computing model is
- 2PC: explain how to recover from a given fault
- 2PL: explain why it works and how it could go wrong
- For each component of ACID: name one technique that can be used to implement it, and provide an example where it goes wrong

Review questions
- Explain TF-IDF ranking
- Explain the idea behind stemming
- Write an XQuery (with FLWOR). Example: use of a correspondence table
- Discuss the importance of replication for a new service

I hope you liked CIS 455/555! Please don't forget to complete your course evaluations.