CSC 536 Lecture 8
Outline
- Reactive Streams
  - Streams
  - Reactive streams
  - Akka streams
- Case study: Google infrastructure (part I)
Reactive Streams
Streams
Stream: a process involving data flow and transformation
- data possibly of unbounded size
- focus on describing the transformation
Examples
- bulk data transfer
- real-time data sources
- batch processing of large data sets
- monitoring and analytics
Needed: Asynchrony
For fault tolerance:
- encapsulation
- isolation
For scalability:
- distribution across nodes
- distribution across cores
Problem: managing data flow across an async boundary
Types of Async Boundaries
- between different applications
- between network nodes
- between CPUs
- between threads
- between actors
Possible solutions
Traditional way: synchronous/blocking (possibly remote) method calls
- does not scale
Push way: asynchronous/non-blocking message passing
- scales!
- problem: message buffering and message dropping
Reactive way: non-blocking and non-dropping
Reactive way View slides of
Supply and Demand
- data items flow downstream
- demand flows upstream
- data items flow only when there is demand
  - the recipient is in control of the incoming data rate
  - data in flight is bounded by signaled demand
Dynamic Push-Pull
- “push” behavior when the consumer is faster
- “pull” behavior when the producer is faster
- switches automatically between these
- batching demand allows batching data
Tailored Flow Control
- splitting the data means merging the demand
- merging the data means splitting the demand
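To make the demand arithmetic concrete, here is a hedged sketch of one possible policy (the names and the splitting strategy are illustrative only, not what any particular implementation does): a fan-out stage can forward upstream only the minimum of its downstreams' demands, while a fan-in stage has to divide its single downstream demand among its upstreams.

object FlowControlSketch {
  // Fan-out (broadcast): the same element goes to every downstream, so an
  // element can only be emitted when all downstreams have signaled demand.
  // The demand forwarded upstream is therefore the minimum of the downstream
  // demands, i.e. the demands are merged.
  def mergedDemand(downstreamDemands: Seq[Long]): Long =
    if (downstreamDemands.isEmpty) 0L else downstreamDemands.min

  // Fan-in (merge): one downstream consumes elements from several upstreams,
  // so its demand has to be split among them (here: an even split, with the
  // remainder spread over the first few upstreams).
  def splitDemand(downstreamDemand: Long, upstreams: Int): Seq[Long] = {
    val base  = downstreamDemand / upstreams
    val extra = (downstreamDemand % upstreams).toInt
    Seq.tabulate(upstreams)(i => base + (if (i < extra) 1L else 0L))
  }
}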
Reactive Streams
Back-pressured asynchronous stream processing:
- asynchronous non-blocking data flow
- asynchronous non-blocking demand flow
Goal: minimal coordination and contention
Message passing allows for distribution
- across applications
- across nodes
- across CPUs
- across threads
- across actors
Reactive Streams Projects
A standard implemented by many libraries, with engineers from
- Netflix
- Oracle
- Red Hat
- Twitter
- Typesafe
- …
See
Reactive Streams
- all participants had the same basic problem
- all are building tools for their community
- a common solution benefits everybody
- interoperability to make best use of efforts
- minimal interfaces
- rigorous specification of semantics
- full TCK for verification of implementations
- complete freedom for many idiomatic APIs
The underlying (internal) API

trait Publisher[T] {
  def subscribe(sub: Subscriber[T]): Unit
}

trait Subscription {
  def requestMore(n: Int): Unit
  def cancel(): Unit
}

trait Subscriber[T] {
  def onSubscribe(s: Subscription): Unit
  def onNext(elem: T): Unit
  def onError(thr: Throwable): Unit
  def onComplete(): Unit
}
The Process
Reactive Streams
- all calls on a Subscriber must be dispatched asynchronously
- all calls on a Subscription must not block
- a Publisher is just there to create Subscriptions
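To make the handshake concrete, here is a toy, single-threaded sketch built on the traits above (RangePublisher and PrintSubscriber are invented for illustration; a spec-compliant implementation must dispatch Subscriber calls asynchronously and handle cancellation and errors more carefully). The subscriber signals a bounded demand up front, so only that many elements flow:

// Toy, single-threaded sketch of the traits above; illustration only.
class RangePublisher(from: Int, to: Int) extends Publisher[Int] {
  def subscribe(sub: Subscriber[Int]): Unit = {
    var next = from
    var done = false
    sub.onSubscribe(new Subscription {
      def requestMore(n: Int): Unit = {
        var remaining = n
        // Emit only as many elements as have been demanded.
        while (remaining > 0 && next <= to && !done) {
          sub.onNext(next)
          next += 1
          remaining -= 1
        }
        if (next > to && !done) { done = true; sub.onComplete() }
      }
      def cancel(): Unit = done = true
    })
  }
}

// Signals a fixed amount of demand once and never asks for more,
// so at most `batch` elements are ever in flight.
class PrintSubscriber(batch: Int) extends Subscriber[Int] {
  def onSubscribe(s: Subscription): Unit = s.requestMore(batch)
  def onNext(elem: Int): Unit = println(elem)
  def onError(thr: Throwable): Unit = thr.printStackTrace()
  def onComplete(): Unit = println("done")
}

// new RangePublisher(1, 5).subscribe(new PrintSubscriber(3))
// prints 1, 2, 3 and then waits for further demand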
Akka Streams
- powered by Akka Actors
- type-safe streaming through Actors with bounded buffering
- the Akka Streams API is geared towards end users
- the Akka Streams implementation uses the Reactive Streams interfaces (Publisher/Subscriber) internally to pass data between the different processing stages
Examples
View slides of
- basic.scala
- TcpEcho.scala
- WritePrimes.scala
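Those files are not reproduced here; the following is a minimal, self-contained example in the same spirit (assuming a recent Akka 2.6-style Scala API; the object name is made up). The throttled stage slows the stream down, and back-pressure keeps the fast source from outrunning it:

import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.duration._

object BasicStreamExample extends App {
  implicit val system: ActorSystem = ActorSystem("streams-demo")

  // A fast source of integers flowing into a deliberately slow stage.
  // Akka Streams propagates the downstream demand upstream, so the source
  // never produces faster than the slow stage can handle (back-pressure).
  Source(1 to 100)
    .map(_ * 2)
    .throttle(10, 1.second)            // simulate a slow downstream stage
    .runWith(Sink.foreach(println))
    .onComplete(_ => system.terminate())(system.dispatcher)
}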
Overview of Google’s distributed systems
Original Google search engine architecture
More than just a search engine
Organization of Google’s physical infrastructure
- PCs per rack (terabytes of disk space each)
- 30+ racks per cluster
- hundreds of clusters spread across data centers worldwide
System architecture requirements
- scalability
- reliability
- performance
- openness (at the beginning, at least)
Overall Google systems architecture
Google infrastructure
Design philosophy
Simplicity
- software should do one thing and do it well
Provable performance
- “every millisecond counts”
- estimate performance costs (accessing memory and disk, sending a packet over the network, locking and unlocking a mutex, etc.)
Testing
- “if it ain’t broke, you’re not trying hard enough”
- stringent testing
Data and coordination services
Google File System (GFS)
- broadly similar to NFS and AFS
- optimized for the types of files and data access used by Google
BigTable
- a distributed database that stores (semi-)structured data
- just enough organization and structure for the type of data Google uses
Chubby
- a locking service (and more) for GFS and BigTable
GFS requirements
Must run reliably on the physical platform
- must tolerate failures of individual components
- so application-level services can rely on the file system
Optimized for Google’s usage patterns
- huge files (100+ MB, up to 1 GB)
- relatively small number of files
- accesses dominated by sequential reads and appends
- appends done concurrently
Meets the requirements of the whole Google infrastructure
- scalable, reliable, high performance, open
Important: throughput has higher priority than latency
GFS architecture
Files stored in 64 MB chunks in a cluster with
- a master node (operations log replicated on remote machines)
- hundreds of chunk servers
Chunks replicated 3 times
Reading and writing
When the client wants to access a particular offset in a file
- the GFS client translates this to a (file name, chunk index) pair
- and then sends it to the master
When the master receives the (file name, chunk index) pair
- it replies with the chunk identifier and replica locations
The client then accesses the closest chunk replica directly
No client-side caching
- caching would not help in the type of (streaming) access GFS has
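As a hedged sketch of that read path (ChunkLocation, askMaster, and readFromChunkServer are invented stand-ins for the real RPCs):

object GfsReadSketch {
  // Illustrative types; the real GFS protocol is RPC-based and more involved.
  final case class ChunkLocation(handle: Long, replicas: List[String])

  val ChunkSize: Long = 64L * 1024 * 1024   // files are stored as 64 MB chunks

  // Step 1: the client turns (file name, byte offset) into (file name, chunk index).
  def toChunkIndex(offset: Long): Long = offset / ChunkSize

  // Step 2: ask the master where that chunk lives; the reply carries the
  // chunk identifier and the replica locations (stubbed out here).
  def askMaster(file: String, chunkIndex: Long): ChunkLocation =
    ChunkLocation(handle = 42L, replicas = List("chunkserver-a", "chunkserver-b", "chunkserver-c"))

  // Step 3: read directly from the closest replica; no data flows through
  // the master and nothing is cached on the client side.
  def read(file: String, offset: Long, length: Int): Array[Byte] = {
    val loc     = askMaster(file, toChunkIndex(offset))
    val replica = loc.replicas.head                  // "closest" replica, simplified
    readFromChunkServer(replica, loc.handle, offset % ChunkSize, length)
  }

  // Hypothetical chunk-server read, stubbed out here.
  def readFromChunkServer(server: String, handle: Long,
                          chunkOffset: Long, length: Int): Array[Byte] =
    new Array[Byte](length)
}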
Keeping chunk replicas consistent
When the master receives a mutation request from a client
- the master grants a chunk replica a lease (that replica becomes the primary)
- and returns the identity of the primary and of the other replicas to the client
The client sends the mutation directly to all the replicas
- replicas cache the mutation and acknowledge receipt
The client sends a write request to the primary
- the primary orders mutations and updates accordingly
- the primary then requests that the other replicas apply the mutations in the same order
When all the replicas have acknowledged success, the primary reports an ack to the client
What consistency model does this seem to implement?
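A sketch of the same mutation flow in code form, with invented trait names (Master, Primary, Secondary) standing in for the real RPC interfaces:

object GfsWriteSketch {
  // Invented types just to make the message flow explicit.
  trait Master  { def grantLease(file: String, chunkIndex: Long): (Primary, List[Secondary]) }
  trait Replica { def bufferData(data: Array[Byte]): Unit }   // cache the mutation, ack receipt
  trait Secondary extends Replica { def applyInOrder(serial: Long): Unit }
  trait Primary   extends Replica {
    // Orders the buffered mutation, applies it locally, forwards the order
    // to the secondaries, and acks the client once all have succeeded.
    def writeRequest(secondaries: List[Secondary]): Unit
  }

  def clientWrite(master: Master, file: String, chunkIndex: Long, data: Array[Byte]): Unit = {
    // 1. Master grants a lease to one replica (the primary) and returns all replica identities.
    val (primary, secondaries) = master.grantLease(file, chunkIndex)

    // 2. Client pushes the data directly to every replica; each buffers it.
    (primary :: secondaries).foreach(_.bufferData(data))

    // 3. Client sends the write request to the primary only; the primary
    //    orders the mutation, drives the secondaries, and acks the client
    //    (all of which happens inside writeRequest in this sketch).
    primary.writeRequest(secondaries)
  }
}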
GFS (non-)guarantees
Writes (at a file offset) are not atomic
- concurrent writes to the same location may corrupt replicated chunks
- if any replica is left inconsistent, the write fails (and is retried a few times)
Appends are executed atomically “at least once”
- the offset is chosen by the primary
- may end up with non-identical replicated chunks, with some having duplicate appends
GFS does not guarantee that the replicas are identical
- it only guarantees that some file regions are consistent across replicas
When needed, GFS relies on an external locking service (Chubby)
- as well as a leader election service (also Chubby) to select the primary replica
Bigtable
GFS provides raw data storage
Also needed: storage for structured data
- optimized to handle the needs of Google’s apps
- reliable, scalable, high-performance, open, etc.
Examples of structured data
- URLs: content, crawl metadata, links, anchors, PageRank, …
- per-user data: user preference settings, recent queries/search results, …
- geographic locations: physical entities (shops, restaurants, etc.), roads, satellite image data, user annotations, …
Commercial DB
Why not use a commercial database?
- not scalable enough
- too expensive
- a full-featured relational database is not required
- low-level optimizations may be needed
Bigtable table
Implementation: a sparse, distributed, multi-dimensional map
(row, column, timestamp) → cell contents
Rows
Each row has a key
- a string up to 64 KB in size
Access to data in a row is atomic
Rows are ordered lexicographically
- rows close together lexicographically reside on one machine or on nearby machines (locality)
Columns
Example row “com.cnn.www”: column ‘contents:’ holds the page contents, while anchor columns such as ‘anchor:com.cnn.www/sport’ and ‘anchor:com.cnn.www/world’ hold the anchor texts “CNN Sports” and “CNN world”
Columns have a two-level name structure: family:qualifier
Column family
- a logical grouping of data
- groups an unbounded number of columns (named with qualifiers)
- may have a single column with no qualifier
Timestamps
Used to store different versions of data in a cell
- default to the current time
- can also be set explicitly by the client
Garbage collection
- per-column-family GC settings
- “only retain most recent K values in a cell”
- “keep values until they are older than K seconds”
- …
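A toy, in-memory sketch of the (row, column, timestamp) → contents model described above; Bigtable itself is of course a distributed system, not a Scala Map, and the helper names are invented:

object BigtableModelSketch {
  // Column names are family:qualifier; a cell is addressed by (row, column, timestamp).
  final case class Cell(row: String, column: String, timestamp: Long)

  // A sparse, multi-dimensional map from cell coordinates to contents.
  var table: Map[Cell, String] = Map.empty

  def put(row: String, column: String, value: String,
          timestamp: Long = System.currentTimeMillis()): Unit =
    table += Cell(row, column, timestamp) -> value

  // Read the most recent version of a cell; older versions are kept until
  // garbage-collected according to the column family's GC settings.
  def getLatest(row: String, column: String): Option[String] =
    table.collect { case (Cell(`row`, `column`, ts), v) => ts -> v }
         .toSeq.sortBy(-_._1).headOption.map(_._2)

  def main(args: Array[String]): Unit = {
    put("com.cnn.www", "contents:", "<html>…")
    put("com.cnn.www", "anchor:com.cnn.www/sport", "CNN Sports")
    println(getLatest("com.cnn.www", "anchor:com.cnn.www/sport"))  // Some(CNN Sports)
  }
}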
API
Create / delete tables and column families

Table *T = OpenOrDie("/bigtable/web/webtable");

RowMutation r1(T, "com.cnn.www");
r1.Set("anchor:com.cnn.www/sport", "CNN Sports");
r1.Delete("anchor:com.cnn.www/world");

Operation op;
Apply(&op, &r1);
Bigtable architecture
An instance of Bigtable is a cluster that stores tables
- a library on the client side
- a master server
- tablet servers
A table is decomposed into tablets
Tablets
A table is decomposed into tablets
- a tablet holds a contiguous range of rows
- 100 MB - 200 MB of data per tablet
- a tablet server is responsible for ~100 tablets
Each tablet is represented by
- a set of files stored in GFS; the files use the SSTable format, a mapping of (string) keys to (string) values
- log files
Tablet Server
The master assigns tablets to tablet servers
The tablet server
- handles read/write requests to its tablets from clients
- no data goes through the master
A Bigtable client requires a naming/locator service (Chubby) to find the root tablet, which is part of the metadata table
The metadata table contains metadata about the actual tablets, including location information for the associated SSTables and log files
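A sketch of how a client might resolve a row key to a tablet server via that chain (Chubby → root tablet → metadata tablet → user tablet); the trait names are invented, and a real client caches these locations:

object TabletLookupSketch {
  // Invented types: a tablet location is just the address of the serving tablet server.
  final case class TabletLocation(server: String)

  trait Chubby        { def rootTabletLocation(): TabletLocation }
  trait MetadataTable {
    // Given a (table, row key), return where the tablet containing that row lives.
    def lookup(at: TabletLocation, table: String, rowKey: String): TabletLocation
  }

  def locateTablet(chubby: Chubby, meta: MetadataTable,
                   table: String, rowKey: String): TabletLocation = {
    // 1. Chubby names the root tablet of the metadata table.
    val root = chubby.rootTabletLocation()
    // 2. The root tablet points to the metadata tablet covering (table, rowKey).
    val metaTablet = meta.lookup(root, "METADATA", table + ":" + rowKey)
    // 3. That metadata tablet records the location of the user tablet itself;
    //    the client then talks to that tablet server directly.
    meta.lookup(metaTablet, table, rowKey)
  }
}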
Master
Upon startup, must grab the master lock to ensure it is the single master of a set of tablet servers
- the lock is provided by the locking service (Chubby)
Monitors tablet servers
- periodically scans the directory of tablet servers provided by the naming service (Chubby)
- keeps track of the tablets assigned to its tablet servers
- obtains a lock on each tablet server from the locking service (Chubby)
- the lock is the communication mechanism between master and tablet server
Assigns unassigned tablets in the cluster to the tablet servers it monitors, moving tablets around to achieve load balancing
Garbage-collects underlying files stored in GFS
BigTable tablet architecture
Each SSTable is an ordered and immutable mapping of keys to values
Tablet Serving
Writes are committed to a log
Memtable: ordered log of recent commits (in memory)
SSTables really store a snapshot
When the memtable gets too big
- create a new empty memtable
- merge the old memtable with the SSTables and write to GFS
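A toy sketch of this write/read path, with in-memory structures standing in for the GFS commit log and SSTable files (names invented; real compaction is considerably more involved):

import scala.collection.immutable.TreeMap

object TabletServingSketch {
  // An SSTable here is just an ordered, immutable map of keys to values;
  // in Bigtable it is a file in GFS with the same logical structure.
  type SSTable = TreeMap[String, String]

  final class Tablet(flushThreshold: Int) {
    private var memtable: TreeMap[String, String] = TreeMap.empty
    private var sstables: List[SSTable] = Nil
    private var log: List[(String, String)] = Nil        // stands in for the GFS commit log

    def write(key: String, value: String): Unit = {
      log = (key, value) :: log                          // 1. commit to the log
      memtable += (key -> value)                         // 2. apply to the in-memory memtable
      if (memtable.size >= flushThreshold) flush()       // 3. flush when it gets too big
    }

    // Reads consult the memtable first, then the SSTables.
    def read(key: String): Option[String] =
      memtable.get(key).orElse(sstables.view.flatMap(_.get(key)).headOption)

    // Merge the old memtable with the existing SSTables (newest data wins)
    // and "write" the result as a new SSTable; in Bigtable this goes to GFS.
    private def flush(): Unit = {
      val merged = sstables.reverse.foldLeft(TreeMap.empty[String, String])(_ ++ _) ++ memtable
      sstables = List(merged)
      memtable = TreeMap.empty
    }
  }
}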
SSTable operations
- look up the value for a key
- iterate over all key/value pairs in a specified range
Bigtable relies on the lock service (Chubby) to
- ensure there is at most one active master
- administer tablet server death
- store column family information
- store access control lists
Chubby
Chubby provides to the infrastructure
- a locking service
- a file system for reliable storage of small files
- a leader election service (e.g. to select a primary replica)
- a name service
Seemingly violates the “simplicity” design philosophy, but Chubby really provides an asynchronous distributed agreement service
Chubby API
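As a rough approximation of the handle-based operations described in the published Chubby paper (simplified signatures in Scala, not Chubby's actual interface):

object ChubbyApiSketch {
  // Approximation of the handle-based API from the Chubby paper;
  // signatures are simplified and illustrative only.
  trait Handle
  trait Chubby {
    def open(path: String): Handle                               // open a node (file or directory)
    def close(handle: Handle): Unit
    def getContentsAndStat(handle: Handle): (Array[Byte], Long)  // whole-file read plus metadata
    def setContents(handle: Handle, data: Array[Byte]): Unit     // whole-file write
    def delete(handle: Handle): Unit
    def acquire(handle: Handle): Unit                            // acquire the lock on the node (blocking)
    def tryAcquire(handle: Handle): Boolean                      // non-blocking lock attempt
    def release(handle: Handle): Unit
  }
}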
Overall architecture of Chubby
Cell: a single instance of the Chubby system
- 5 replicas
- 1 master replica
Each replica maintains a database of directories and files/locks
Consistency is achieved using Lamport’s Paxos consensus protocol, which uses an operation log
Chubby internally supports snapshots to periodically GC the operation log
Paxos distributed consensus algorithm
A distributed consensus protocol for asynchronous systems
Used by servers managing replicas in order to reach agreement on an update when
- messages may be lost, re-ordered, or duplicated
- servers may operate at arbitrary speed and fail
- servers have access to stable persistent storage
Fact: consensus is not always possible in asynchronous systems
Paxos works by ensuring safety (correctness), not liveness (termination)
Paxos algorithm - step 1
Paxos algorithm - step 2
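As a hedged sketch of the two phases of single-decree Paxos from the proposer's point of view (message and type names are illustrative, and acceptor-side bookkeeping is hidden behind a trait):

object PaxosSketch {
  // Illustrative message type: a promise may report a previously accepted (ballot, value).
  final case class Promise(ballot: Long, accepted: Option[(Long, String)])

  trait Acceptor {
    // Phase 1 (prepare): promise not to accept ballots lower than `ballot`,
    // reporting any value already accepted; None models a lost reply or refusal.
    def prepare(ballot: Long): Option[Promise]
    // Phase 2 (accept): accept the value unless a higher ballot has been promised.
    def accept(ballot: Long, value: String): Boolean
  }

  def propose(acceptors: List[Acceptor], ballot: Long, myValue: String): Option[String] = {
    val quorum = acceptors.size / 2 + 1

    // Step 1: send prepare(ballot) to all acceptors and collect promises.
    val promises = acceptors.flatMap(_.prepare(ballot))
    if (promises.size < quorum) return None              // no majority: retry with a higher ballot

    // If any acceptor already accepted a value, propose the one with the
    // highest ballot; otherwise we are free to propose our own value.
    val value = promises.flatMap(_.accepted).sortBy(-_._1).headOption.map(_._2).getOrElse(myValue)

    // Step 2: send accept(ballot, value); the value is chosen once a majority accepts.
    val acks = acceptors.count(_.accept(ballot, value))
    if (acks >= quorum) Some(value) else None
  }
}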
The Big Picture
Customized solutions for Google-type problems
- GFS: stores data reliably; just raw files
- BigTable: provides a key/value map; database-like, but doesn’t provide everything we need
- Chubby: locking mechanism; handles all synchronization problems
Common Principles
One master, multiple workers
- MapReduce: the master coordinates work amongst map/reduce workers
- Chubby: one master among five replicas
- Bigtable: the master knows the locations of the tablet servers
- GFS: the master coordinates data across chunkservers