CSC 536 Lecture 8

Outline
- Reactive Streams
  - Streams
  - Reactive streams
  - Akka streams
- Case study: Google infrastructure (part I)

Reactive Streams

Streams
Stream: a process involving data flow and transformation
- data possibly of unbounded size
- focus on describing the transformation
Examples:
- bulk data transfer
- real-time data sources
- batch processing of large data sets
- monitoring and analytics

Needed: Asynchrony
For fault tolerance:
- encapsulation
- isolation
For scalability:
- distribution across nodes
- distribution across cores
Problem: managing data flow across an async boundary

Types of Async Boundaries
- between different applications
- between network nodes
- between CPUs
- between threads
- between actors

Possible solutions
Traditional way:
- synchronous/blocking (possibly remote) method calls
- does not scale

Possible solutions
Traditional way:
- synchronous/blocking (possibly remote) method calls
- does not scale
Push way:
- asynchronous/non-blocking message passing
- scales!
- problem: message buffering and message dropping

Supply and Demand
Traditional way:
- synchronous/blocking (possibly remote) method calls
- does not scale
Push way:
- asynchronous/non-blocking message passing
- scales!
- problem: message buffering and message dropping
Reactive way:
- non-blocking
- non-dropping

Reactive way View slides of

Supply and Demand
- data items flow downstream
- demand flows upstream
- data items flow only when there is demand
  - recipient is in control of the incoming data rate
  - data in flight is bounded by signaled demand

Dynamic Push-Pull
- “push” behavior when the consumer is faster
- “pull” behavior when the producer is faster
- switches automatically between these
- batching demand allows batching data

Tailored Flow Control Splitting the data means merging the demand

Tailored Flow Control Merging the data means splitting the demand

Reactive Streams
Back-pressured Asynchronous Stream Processing
- asynchronous non-blocking data flow
- asynchronous non-blocking demand flow
- goal: minimal coordination and contention
Message passing allows for distribution
- across applications
- across nodes
- across CPUs
- across threads
- across actors

Reactive Streams Projects
Standard implemented by many libraries
Engineers from
- Netflix
- Oracle
- Red Hat
- Twitter
- Typesafe
- …
See

Reactive Streams
- All participants had the same basic problem
- All are building tools for their community
- A common solution benefits everybody
- Interoperability to make best use of efforts:
  - minimal interfaces
  - rigorous specification of semantics
  - full TCK for verification of implementation
  - complete freedom for many idiomatic APIs

The underlying (internal) API

trait Publisher[T] {
  def subscribe(sub: Subscriber[T]): Unit
}

trait Subscription {
  def requestMore(n: Int): Unit
  def cancel(): Unit
}

trait Subscriber[T] {
  def onSubscribe(s: Subscription): Unit
  def onNext(elem: T): Unit
  def onError(thr: Throwable): Unit
  def onComplete(): Unit
}
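As a minimal sketch of how these interfaces fit together (assuming the trait definitions above; PrintingSubscriber is an illustrative name, not part of the standard), a subscriber signals demand through its Subscription and is never sent more elements than it has requested:

// Illustrative Subscriber: it requests one element at a time, so the
// Publisher can never push faster than this consumer processes.
class PrintingSubscriber[T] extends Subscriber[T] {
  private var subscription: Subscription = _

  def onSubscribe(s: Subscription): Unit = {
    subscription = s
    s.requestMore(1)                // signal initial demand
  }
  def onNext(elem: T): Unit = {
    println(elem)                   // process the element...
    subscription.requestMore(1)     // ...then ask for the next one
  }
  def onError(thr: Throwable): Unit = thr.printStackTrace()
  def onComplete(): Unit = println("stream completed")
}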

The Process

Reactive Streams
- All calls on Subscriber must dispatch async
- All calls on Subscription must not block
- Publisher is just there to create Subscriptions

Akka Streams
- Powered by Akka Actors
- Type-safe streaming through Actors with bounded buffering
- The Akka Streams API is geared towards end-users
- The Akka Streams implementation uses the Reactive Streams interfaces (Publisher/Subscriber) internally to pass data between the different processing stages
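A minimal sketch of the end-user API: a Source, a map stage, and a Sink, with back-pressure handled by the library. This assumes a recent Akka version in which the implicit ActorSystem is enough to materialize the stream; older versions also need an explicit ActorMaterializer.

import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}

object StreamSketch extends App {
  implicit val system: ActorSystem = ActorSystem("streams-demo")
  import system.dispatcher                     // execution context for the completion callback

  Source(1 to 100)                             // upstream: emits only when demand arrives
    .map(_ * 2)                                // processing stage
    .runWith(Sink.foreach[Int](println))       // downstream: signals demand as it consumes
    .onComplete(_ => system.terminate())       // shut down once the stream is done
}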

Examples
View slides of basic.scala, TcpEcho.scala, WritePrimes.scala

Overview of Google’s distributed systems

Original Google search engine architecture

More than just a search engine

Organization of Google’s physical infrastructure
- PCs per rack (terabytes of disk space each)
- 30+ racks per cluster
- hundreds of clusters spread across data centers worldwide

System architecture requirements
- Scalability
- Reliability
- Performance
- Openness (at the beginning, at least)

Overall Google systems architecture

Google infrastructure

Design philosophy
- Simplicity: software should do one thing and do it well
- Provable performance: “every millisecond counts”
  - estimate performance costs (accessing memory and disk, sending a packet over the network, locking and unlocking a mutex, etc.)
- Testing: “if it ain’t broke, you’re not trying hard enough”
  - stringent testing

Data and coordination services
- Google File System (GFS)
  - broadly similar to NFS and AFS
  - optimized to the type of files and data access used by Google
- BigTable
  - a distributed database that stores (semi-)structured data
  - just enough organization and structure for the type of data Google uses
- Chubby
  - a locking service (and more) for GFS and BigTable

GFS requirements
- Must run reliably on the physical platform
  - must tolerate failures of individual components
  - so application-level services can rely on the file system
- Optimized for Google’s usage patterns
  - huge files (100+MB, up to 1GB)
  - relatively small number of files
  - accesses dominated by sequential reads and appends
  - appends done concurrently
- Meets the requirements of the whole Google infrastructure
  - scalable, reliable, high performance, open
- Important: throughput has higher priority than latency

GFS architecture
Files are stored in 64MB chunks in a cluster with
- a master node (operations log replicated on remote machines)
- hundreds of chunk servers
Chunks are replicated 3 times

Reading and writing
When the client wants to access a particular offset in a file
- the GFS client translates this to a (file name, chunk index) pair
- and then sends this to the master
When the master receives the (file name, chunk index) pair
- it replies with the chunk identifier and replica locations
- the client then accesses the closest chunk replica directly
No client-side caching
- caching would not help in the type of (streaming) access GFS has
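A rough sketch of the translation step performed by the client library, assuming GFS’s 64MB chunk size; the names GfsClientSketch and ChunkLocation are illustrative, not the real GFS client API:

// Illustrative only: shows the shape of the (file name, chunk index) lookup.
final case class ChunkLocation(chunkHandle: String, replicaAddresses: Seq[String])

object GfsClientSketch {
  val ChunkSize: Long = 64L * 1024 * 1024

  // Step 1: translate a byte offset into a chunk index
  def chunkIndex(offset: Long): Long = offset / ChunkSize

  // Step 2: the master maps (file name, chunk index) to a chunk handle and
  // replica locations; here it is just a stub
  def askMaster(fileName: String, index: Long): ChunkLocation =
    sys.error("stub: the real client sends an RPC to the master")

  // Step 3: the client reads the byte range directly from the closest replica
}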

Keeping chunk replicas consistent

When the master receives a mutation request from a client
- the master grants a chunk replica a lease (that replica becomes the primary)
- and returns the identity of the primary and the other replicas to the client
The client sends the mutation directly to all the replicas
- replicas cache the mutation and acknowledge receipt
The client sends a write request to the primary
- the primary orders mutations and updates accordingly
- the primary then requests that the other replicas apply the mutations in the same order
- when all the replicas have acknowledged success, the primary reports an ack to the client
What consistency model does this seem to implement?
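A highly simplified sketch of the primary’s role in this write path; the names Mutation, PrimaryReplica and SecondaryStub are illustrative, and leases, chunk versions and retries are omitted:

// Illustrative only: the primary assigns a serial order to cached mutations
// and succeeds only once every secondary has applied them in that order.
final case class Mutation(id: Long, offset: Long, data: Array[Byte])

trait SecondaryStub {
  def applyInOrder(m: Mutation, serial: Long): Boolean   // ack on success
}

class PrimaryReplica(secondaries: Seq[SecondaryStub]) {
  private var nextSerial = 0L
  private val cached = scala.collection.mutable.Map.empty[Long, Mutation]

  // data is first pushed to all replicas and cached, not yet applied
  def cache(m: Mutation): Unit = cached(m.id) = m

  // the client then asks the primary to commit the write
  def write(id: Long): Boolean = cached.get(id).exists { m =>
    nextSerial += 1                                       // primary picks the order
    applyLocally(m, nextSerial)
    secondaries.forall(_.applyInOrder(m, nextSerial))     // all must ack for success
  }

  private def applyLocally(m: Mutation, serial: Long): Unit = ()  // apply to the local chunk
}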

GFS (non-)guarantees
- Writes (at a file offset) are not atomic
  - concurrent writes to the same location may corrupt replicated chunks
  - if any replica is left inconsistent, the write fails (and is retried a few times)
- Appends are executed atomically “at least once”
  - the offset is chosen by the primary
  - may end up with non-identical replicated chunks, with some having duplicate appends
- GFS does not guarantee that the replicas are identical
  - it only guarantees that some file regions are consistent across replicas
- When needed, GFS uses an external locking service (Chubby)
  - as well as a leader election service (also Chubby) to select the primary replica

Bigtable
GFS provides raw data storage
Also needed: storage for structured data
- optimized to handle the needs of Google’s apps
- that is reliable, scalable, high-performance, open, etc.

Examples of structured data
- URLs: content, crawl metadata, links, anchors, PageRank, …
- Per-user data: user preference settings, recent queries/search results, …
- Geographic locations: physical entities (shops, restaurants, etc.), roads, satellite image data, user annotations, …

Commercial DB
Why not use a commercial database?
- not scalable enough
- too expensive
- a full-featured relational database is not required
- low-level optimizations may be needed

Bigtable table
Implementation: sparse distributed multi-dimensional map
(row, column, timestamp) → cell contents
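A toy in-memory model of this map, purely illustrative; real Bigtable persists it in SSTables on GFS, and the names below are not Bigtable’s API:

import scala.collection.immutable.TreeMap

object BigtableModelSketch {
  final case class Cell(timestamp: Long, value: String)

  // row key -> (column "family:qualifier" -> versions, newest first);
  // TreeMap keeps rows in lexicographic order, as Bigtable does
  type Table = TreeMap[String, Map[String, List[Cell]]]

  val empty: Table = TreeMap.empty

  def write(t: Table, row: String, column: String, ts: Long, v: String): Table = {
    val columns  = t.getOrElse(row, Map.empty[String, List[Cell]])
    val versions = (Cell(ts, v) :: columns.getOrElse(column, Nil)).sortBy(-_.timestamp)
    t.updated(row, columns.updated(column, versions))
  }

  def read(t: Table, row: String, column: String): Option[Cell] =
    t.get(row).flatMap(_.get(column)).flatMap(_.headOption)   // newest version
}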

Rows
- Each row has a key
  - a string up to 64KB in size
- Access to data in a row is atomic
- Rows are ordered lexicographically
  - rows close together lexicographically reside on the same or nearby machines (locality)

Columns
Example row “com.cnn.www”: column ‘contents:’ holds the page contents, while anchor columns ‘anchor:com.cnn.www/sport’ and ‘anchor:com.cnn.www/world’ hold the anchor texts “CNN Sports” and “CNN world”
Columns have a two-level name structure: family:qualifier
Column family
- logical grouping of data
- groups an unbounded number of columns (named with qualifiers)
- may have a single column with no qualifier

Timestamps
Used to store different versions of data in a cell
- default to the current time
- can also be set explicitly by the client
Garbage Collection
Per-column-family GC settings
- “Only retain most recent K values in a cell”
- “Keep values until they are older than K seconds”
- …
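A small sketch of the two GC policies above, applied to one cell’s version list; illustrative only, since real Bigtable applies these during compactions:

object GcPolicySketch {
  final case class Version(timestampMs: Long, value: String)

  // “Only retain most recent K values in a cell”
  def keepMostRecent(versions: List[Version], k: Int): List[Version] =
    versions.sortBy(-_.timestampMs).take(k)

  // “Keep values until they are older than K seconds”
  def keepNewerThan(versions: List[Version], maxAgeSeconds: Long, nowMs: Long): List[Version] =
    versions.filter(v => nowMs - v.timestampMs <= maxAgeSeconds * 1000)
}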

API
Create / delete tables and column families

Table *T = OpenOrDie("/bigtable/web/webtable");

RowMutation r1(T, "com.cnn.www");
r1.Set("anchor:com.cnn.www/sport", "CNN Sports");
r1.Delete("anchor:com.cnn.www/world");
Operation op;
Apply(&op, &r1);

Bigtable architecture
An instance of BigTable is a cluster that stores tables
- library on the client side
- master server
- tablet servers
A table is decomposed into tablets

Tablets
A table is decomposed into tablets
- a tablet holds a contiguous range of rows
- 100MB - 200MB of data per tablet
- a tablet server is responsible for ~100 tablets
Each tablet is represented by
- a set of files stored in GFS
  - the files use the SSTable format, a mapping of (string) keys to (string) values
- log files
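A toy model of the SSTable interface, illustrative only; a real SSTable is an immutable, block-structured file on GFS with an index, supporting point lookups and range scans:

import scala.collection.immutable.TreeMap

// Sorted, immutable mapping from string keys to string values
final class SSTableSketch(entries: TreeMap[String, String]) {
  // look up the value for a key
  def lookup(key: String): Option[String] = entries.get(key)

  // iterate over all key/value pairs in the range [from, until)
  def scan(from: String, until: String): Iterator[(String, String)] =
    entries.iteratorFrom(from).takeWhile { case (k, _) => k < until }
}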

Tablet Server
The master assigns tablets to tablet servers
Tablet server
- handles read/write requests to its tablets from clients
- no data goes through the master
The Bigtable client requires a naming/locator service (Chubby) to find the root tablet, which is part of the metadata table
- the metadata table contains metadata about actual tablets, including location information of the associated SSTables and log files

Master
Upon startup, must grab the master lock to ensure it is the single master of a set of tablet servers
- the lock is provided by the locking service (Chubby)
Monitors tablet servers
- periodically scans the directory of tablet servers provided by the naming service (Chubby)
- keeps track of tablets assigned to its tablet servers
- obtains a lock on each tablet server from the locking service (Chubby)
  - the lock is the communication mechanism between master and tablet server
Assigns unassigned tablets in the cluster to the tablet servers it monitors
- and moves tablets around to achieve load balancing
Garbage collects underlying files stored in GFS

BigTable tablet architecture
Each SSTable is an ordered and immutable mapping of keys to values

Tablet Serving
- Writes are committed to the log
- Memtable: ordered log of recent commits (in memory)
- SSTables really store a snapshot
When the memtable gets too big
- create a new empty memtable
- merge the old memtable with SSTables and write to GFS
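A sketch of this write path with illustrative names; the commit log and the merge with existing SSTables are reduced to a comment and a callback:

import scala.collection.immutable.TreeMap

// Illustrative only: recent writes go into an in-memory sorted map; when it
// grows past a threshold it is written out as a new immutable SSTable.
final class TabletSketch(flushThreshold: Int, flushToGfs: TreeMap[String, String] => Unit) {
  private var memtable = TreeMap.empty[String, String]

  def write(key: String, value: String): Unit = {
    // 1. (not shown) append the mutation to the tablet's commit log in GFS
    memtable = memtable.updated(key, value)    // 2. apply it to the memtable
    if (memtable.size >= flushThreshold) {
      flushToGfs(memtable)                     // 3. write a new SSTable to GFS
      memtable = TreeMap.empty                 // 4. start a fresh, empty memtable
    }
  }

  // reads consult the memtable first, then the on-disk SSTables (not shown)
  def readRecent(key: String): Option[String] = memtable.get(key)
}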

SSTable Operations
- Look up the value for a key
- Iterate over all key/value pairs in a specified range
Bigtable relies on the lock service (Chubby) to
- ensure there is at most one active master
- administer tablet server death
- store column family information
- store access control lists

Chubby
Chubby provides to the infrastructure
- a locking service
- a file system for reliable storage of small files
- a leader election service (e.g. to select a primary replica)
- a name service
Seemingly violates the “simplicity” design philosophy, but Chubby really provides an asynchronous distributed agreement service

Chubby API

Overall architecture of Chubby
Cell: a single instance of the Chubby system
- 5 replicas
- 1 master replica
Each replica maintains a database of directories and files/locks
Consistency is achieved using Lamport’s Paxos consensus protocol, which uses an operation log
Chubby internally supports snapshots to periodically GC the operation log

Paxos distributed consensus algorithm
A distributed consensus protocol for asynchronous systems
Used by servers managing replicas in order to reach agreement on an update when
- messages may be lost, re-ordered, duplicated
- servers may operate at arbitrary speed and fail
- servers have access to stable persistent storage
Fact: consensus is not always possible in asynchronous systems
- Paxos works by ensuring safety (correctness), not liveness (termination)

Paxos algorithm - step 1

Paxos algorithm - step 2
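Since the step figures are not reproduced here, a minimal single-decree acceptor sketch may help make the two phases concrete; this is illustrative, not Chubby’s implementation, which runs Paxos over an operation log:

// Illustrative single-decree Paxos acceptor: phase 1 (prepare/promise) and
// phase 2 (accept). Proposer logic, messaging and persistence are omitted.
final case class Promise(promisedN: Long, acceptedN: Option[Long], acceptedValue: Option[String])

class Acceptor {
  private var promisedN: Long = -1L                 // highest proposal number promised
  private var acceptedN: Option[Long] = None        // number of the last accepted proposal
  private var acceptedValue: Option[String] = None  // value of the last accepted proposal

  // Phase 1: promise not to accept any proposal numbered below n,
  // reporting any previously accepted value back to the proposer
  def prepare(n: Long): Option[Promise] =
    if (n > promisedN) { promisedN = n; Some(Promise(n, acceptedN, acceptedValue)) }
    else None                                       // reject: already promised a higher number

  // Phase 2: accept the proposal unless a higher-numbered prepare was promised
  def accept(n: Long, value: String): Boolean =
    if (n >= promisedN) {
      promisedN = n; acceptedN = Some(n); acceptedValue = Some(value)
      true
    } else false
}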

The Big Picture
Customized solutions for Google-type problems
- GFS: stores data reliably
  - just raw files
- BigTable: provides a key/value map
  - database-like, but doesn’t provide everything we need
- Chubby: locking mechanism
  - handles all synchronization problems

Common Principles
One master, multiple workers
- MapReduce: master coordinates work amongst map / reduce workers
- Chubby: master among five replicas
- Bigtable: master knows about the location of tablet servers
- GFS: master coordinates data across chunkservers