1 CSC 536 Lecture 8

2 Outline Reactive Streams Streams Reactive streams Akka streams Case study Google infrastructure (part I)

3 Reactive Streams

4 Streams Stream  Process involving data flow and transformation  Data possibly of unbounded size  Focus on describing transformation Examples  bulk data transfer  real-time data sources  batch processing of large data sets  monitoring and analytics

5 Needed: Asynchrony For fault tolerance:  Encapsulation  Isolation For scalability:  Distribution across nodes  Distribution across cores Problem: Managing data flow across an async boundary

6 Types of Async Boundaries between different applications between network nodes between CPUs between threads between actors

7 Possible solutions Traditional way:  Synchronous/blocking (possibly remote) method calls  Does not scale

8 Possible solutions Traditional way:  Synchronous/blocking (possibly remote) method calls  Does not scale Push way:  Asynchronous/non-blocking message passing  Scales!  Problem: message buffering and message dropping

9 Supply and Demand Traditional way:  Synchronous/blocking (possibly remote) method calls  Does not scale Push way:  Asynchronous/non-blocking message passing  Scales!  Problem: message buffering and message dropping Reactive way:  non-blocking  non-dropping

10 Reactive way View slides 24-55 of http://www.slideshare.net/ktoso/reactive-streams-akka-streams-geecon-prague-2014

11 Supply and Demand data items flow downstream demand flows upstream data items flow only when there is demand  recipient is in control of incoming data rate  data in flight is bounded by signaled demand

12 Dynamic Push-Pull “push” behavior when consumer is faster “pull” behavior when producer is faster switches automatically between these batching demand allows batching data

13 Tailored Flow Control Splitting the data means merging the demand

14 Tailored Flow Control Merging the data means splitting the demand

15 Reactive Streams Back-pressured Asynchronous Stream Processing  asynchronous non-blocking data flow  asynchronous non-blocking demand flow  Goal: minimal coordination and contention Message passing allows for distribution  across applications  across nodes  across CPUs  across threads  across actors

16 Reactive Streams Projects Standard implemented by many libraries Engineers from Netflix, Oracle, Red Hat, Twitter, Typesafe, … See http://reactive-streams.org

17 Reactive Streams All participants had the same basic problem All are building tools for their community A common solution benefits everybody Interoperability to make best use of efforts  minimal interfaces  rigorous specification of semantics  full TCK for verification of implementation  complete freedom for many idiomatic APIs

18 The underlying (internal) API

trait Publisher[T] {
  def subscribe(sub: Subscriber[T]): Unit
}

trait Subscription {
  def requestMore(n: Int): Unit
  def cancel(): Unit
}

trait Subscriber[T] {
  def onSubscribe(s: Subscription): Unit
  def onNext(elem: T): Unit
  def onError(thr: Throwable): Unit
  def onComplete(): Unit
}
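To make the demand signaling concrete, here is a small illustrative Subscriber written against the interfaces above (not from the slides; the class name and batch size are made up). It bounds in-flight data by requesting a batch up front and then one more element for each element it consumes:

class BatchingSubscriber[T](batchSize: Int) extends Subscriber[T] {
  private var subscription: Subscription = _

  def onSubscribe(s: Subscription): Unit = {
    subscription = s
    s.requestMore(batchSize)        // signal initial demand: bounds the data in flight
  }

  def onNext(elem: T): Unit = {
    println(s"got $elem")           // process the element...
    subscription.requestMore(1)     // ...then signal demand for one more
  }

  def onError(thr: Throwable): Unit = thr.printStackTrace()
  def onComplete(): Unit = println("stream completed")
}

The publisher may never call onNext more times than have been requested, which is exactly the non-blocking, non-dropping back-pressure described earlier.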

19 The Process

20 Reactive Streams All calls on Subscriber must dispatch async All calls on Subscription must not block Publisher is just there to create Subscriptions

21 Akka Streams Powered by Akka Actors Type-safe streaming through Actors with bounded buffering Akka Streams API is geared towards end-users Akka Streams implementation uses the Reactive Streams interfaces (Publisher/Subscriber) internally to pass data between the different processing stages
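As a quick illustration (not one of the lecture's example files), a minimal Akka Streams pipeline looks roughly like this; exact imports and materializer setup vary across Akka versions, so treat it as a hedged sketch:

import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Flow, Sink, Source}

object StreamDemo extends App {
  implicit val system: ActorSystem = ActorSystem("demo")
  implicit val materializer: ActorMaterializer = ActorMaterializer()  // needed in older Akka versions; newer ones derive a materializer from the ActorSystem

  // Source -> Flow -> Sink; back-pressure between stages is handled automatically.
  val source = Source(1 to 100)
  val double = Flow[Int].map(_ * 2)
  val sink   = Sink.foreach[Int](println)

  source.via(double).runWith(sink)
}

Each stage runs asynchronously on top of actors and only receives elements once downstream demand has been signaled.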

22 Examples View slides 62-80 of http://www.slideshare.net/ktoso/reactive-streams-akka-streams-geecon-prague-2014 basic.scala TcpEcho.scala WritePrimes.scala

23 Overview of Google’s distributed systems

24 Original Google search engine architecture

25 More than just a search engine

26 Organization of Google’s physical infrastructure 40-80 PCs per rack (terabytes of disk space each) 30+ racks per cluster Hundreds of clusters spread across data centers worldwide

27 System architecture requirements Scalability Reliability Performance Openness (at the beginning, at least)

28 Overall Google systems architecture

29 Google infrastructure

30 Design philosophy Simplicity: software should do one thing and do it well Provable performance: "every millisecond counts"; estimate performance costs (accessing memory and disk, sending a packet over the network, locking and unlocking a mutex, etc.) Testing: "if it ain't broke, you're not trying hard enough"; stringent testing

31 Data and coordination services Google File System (GFS) Broadly similar to NFS and AFS Optimized to type of files and data access used by Google BigTable A distributed database that stores (semi-)structured data Just enough organization and structure for the type of data Google uses Chubby a locking service (and more) for GFS and BigTable

32 GFS requirements Must run reliably on the physical platform Must tolerate failures of individual components, so that application-level services can rely on the file system Optimized for Google's usage patterns: huge files (100+MB, up to 1GB), relatively small number of files, accesses dominated by sequential reads and appends, appends done concurrently Meets the requirements of the whole Google infrastructure: scalable, reliable, high performance, open Important: throughput has higher priority than latency

33 GFS architecture Files are stored in 64MB chunks in a cluster with a master node (whose operations log is replicated on remote machines) and hundreds of chunk servers Chunks are replicated 3 times

34 Reading and writing When the client wants to access a particular offset in a file, the GFS client translates this to a (file name, chunk index) pair and sends it to the master When the master receives the (file name, chunk index) pair, it replies with the chunk identifier and replica locations The client then accesses the closest chunk replica directly (see the sketch below) No client-side caching: caching would not help with the type of (streaming) access GFS has
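A minimal sketch of this read path, assuming hypothetical MasterClient/ChunkServer interfaces (the names are illustrative, not the real GFS client API):

object GfsReadSketch {
  val ChunkSize: Long = 64L * 1024 * 1024

  case class ChunkInfo(chunkId: Long, replicas: Seq[ChunkServer])

  // Illustrative stand-ins for the master and chunk server RPC interfaces.
  trait MasterClient { def lookup(file: String, chunkIndex: Long): ChunkInfo }
  trait ChunkServer  { def read(chunkId: Long, offsetInChunk: Long, len: Int): Array[Byte] }

  // Pick the "closest" replica; here simply the first one, for illustration.
  def closest(replicas: Seq[ChunkServer]): ChunkServer = replicas.head

  def read(master: MasterClient, file: String, offset: Long, len: Int): Array[Byte] = {
    val chunkIndex = offset / ChunkSize                // translate the byte offset to a chunk index
    val info       = master.lookup(file, chunkIndex)   // master returns chunk id + replica locations
    closest(info.replicas).read(info.chunkId, offset % ChunkSize, len)  // data flows directly from a chunk server
  }
}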

35 Keeping chunk replicas consistent

36 When the master receives a mutation request from a client, it grants a chunk replica a lease (that replica becomes the primary) and returns the identity of the primary and the other replicas to the client The client sends the mutation data directly to all the replicas; the replicas cache the mutation and acknowledge receipt The client then sends a write request to the primary The primary orders the mutations and updates its chunk accordingly The primary then requests that the other replicas apply the mutations in the same order When all the replicas have acknowledged success, the primary reports an ack to the client (see the sketch below) What consistency model does this seem to implement?
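A hedged sketch of that mutation flow, using an illustrative Replica interface (not real GFS code):

object GfsWriteSketch {
  final case class Mutation(chunkId: Long, offset: Long, data: Array[Byte])

  // Illustrative replica interface.
  trait Replica {
    def cache(m: Mutation): Unit                           // buffer the data and ack receipt
    def applyMutation(m: Mutation, serial: Long): Boolean  // apply in the primary-assigned order
  }

  // The primary (lease holder) chooses a serial order and asks the secondaries to follow it.
  def write(primary: Replica, secondaries: Seq[Replica], m: Mutation, serial: Long): Boolean = {
    (primary +: secondaries).foreach(_.cache(m))                  // client pushes the data to every replica
    val primaryOk = primary.applyMutation(m, serial)              // primary applies the mutation first
    primaryOk && secondaries.forall(_.applyMutation(m, serial))   // secondaries apply in the same order
  }
}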

37 GFS (non-)guarantees Writes (at a file offset) are not atomic Concurrent writes to the same location may corrupt replicated chunks If any replica is left inconsistent, the write fails (and is retried a few times) Appends are executed atomically "at least once" Offset is chosen by primary May end up with non-identical replicated chunks, with some having duplicate appends GFS does not guarantee that the replicas are identical It only guarantees that some file regions are consistent across replicas When needed, GFS relies on an external locking service (Chubby), as well as a leader election service (also Chubby), to select the primary replica

38 Bigtable GFS provides raw data storage Also needed: storage for structured data... optimized to handle the needs of Google's apps... that is reliable, scalable, high-performance, open, etc.

39 Examples of structured data URLs: Content, crawl metadata, links, anchors, PageRank,... Per-user data: User preference settings, recent queries/search results, … Geographic locations: Physical entities (shops, restaurants, etc.), roads, satellite image data, user annotations, …

40 Commercial DB Why not use a commercial database? Not scalable enough Too expensive Full-featured relational database not required Low-level optimizations may be needed

41 Bigtable table Implementation: Sparse distributed multi-dimensional map (row, column, timestamp) → cell contents
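To make the data model concrete, here is a toy (decidedly non-Bigtable) Scala model of that map; the names are purely illustrative:

object BigtableModel {
  type Row       = String   // row key
  type Column    = String   // "family:qualifier"
  type Timestamp = Long
  type Table     = Map[(Row, Column, Timestamp), String]   // sparse: absent cells simply have no entry

  // Return the most recent value of a cell (latest timestamp wins).
  def latest(t: Table, row: Row, col: Column): Option[String] =
    t.collect { case ((r, c, ts), v) if r == row && c == col => (ts, v) }
      .toSeq.sortBy { case (ts, _) => -ts }
      .headOption.map(_._2)
}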

42 Rows Each row has a key A string up to 64KB in size Access to data in a row is atomic Rows ordered lexicographically Rows close together lexicographically reside on the same or nearby machines (locality)

43 Columns
Figure: example row "com.cnn.www" with column 'contents:' holding the page contents, 'anchor:com.cnn.www/sport' holding "CNN Sports", and 'anchor:com.cnn.www/world' holding "CNN world"
Columns have two-level name structure: family:qualifier Column family: a logical grouping of data; groups an unbounded number of columns (named with qualifiers); may have a single column with no qualifier

44 Timestamps Used to store different versions of data in a cell default to current time can also be set explicitly by the client Garbage collection: per-column-family GC settings, e.g. "Only retain most recent K values in a cell", "Keep values until they are older than K seconds",... (a sketch of the first policy follows)
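A hedged sketch of the "only retain most recent K values in a cell" policy, applied to the toy Table type introduced above:

object GcSketch {
  import BigtableModel._

  def keepMostRecent(t: Table, k: Int): Table =
    t.groupBy { case ((row, col, _), _) => (row, col) }                  // group all versions of each cell
      .values
      .flatMap(_.toSeq.sortBy { case ((_, _, ts), _) => -ts }.take(k))   // keep only the k newest versions
      .toMap
}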

45 API Create / delete tables and column families

// C++ usage example (as on the slide): open a table and apply a row mutation
Table *T = OpenOrDie("/bigtable/web/webtable");
RowMutation r1(T, "com.cnn.www");
r1.Set("anchor:com.cnn.www/sport", "CNN Sports");
r1.Delete("anchor:com.cnn.www/world");
Operation op;
Apply(&op, &r1);

46 Bigtable architecture An instance of Bigtable is a cluster that stores tables: a library on the client side, a master server, and tablet servers A table is decomposed into tablets

47 Tablets A table is decomposed into tablets Tablet holds contiguous range of rows 100MB - 200MB of data per tablet Tablet server responsible for ~100 tablets Each tablet is represented by A set of files stored in GFS The files use the SSTable format, a mapping of (string) keys to (string) values Log files

48 Tablet Server Master assigns tablets to tablet servers Tablet server handles read/write requests to its tablets from clients No data goes through the master The Bigtable client requires a naming/locator service (Chubby) to find the root tablet, which is part of the metadata table The metadata table contains metadata about actual tablets, including location information of the associated SSTables and log files (see the lookup sketch below)
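A hedged sketch of that tablet-location lookup: Chubby points at the root tablet, the metadata table maps a (table, row key) to a tablet location, and the client then talks to the owning tablet server directly. All names here are illustrative stand-ins, not real Bigtable or Chubby interfaces:

object TabletLookupSketch {
  case class TabletLocation(server: String, sstables: Seq[String], logFiles: Seq[String])

  trait Chubby        { def rootTabletServer(): String }                              // naming/locator service
  trait MetadataTable { def locate(table: String, rowKey: String): TabletLocation }   // metadata rows

  def findTablet(chubby: Chubby,
                 openMetadata: String => MetadataTable,   // reach the metadata table via the root tablet
                 table: String, rowKey: String): TabletLocation = {
    val root = chubby.rootTabletServer()   // 1. Chubby gives the location of the root tablet
    val meta = openMetadata(root)          // 2. the root tablet leads to the metadata table
    meta.locate(table, rowKey)             // 3. metadata gives the tablet server + SSTable/log locations
  }
}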

49 Master Upon startup, must grab the master lock to ensure it is the single master of a set of tablet servers provided by the locking service (Chubby) Monitors tablet servers periodically scans the directory of tablet servers provided by the naming service (Chubby) keeps track of tablets assigned to its tablet servers obtains a lock on the tablet server from the locking service (Chubby) the lock is the communication mechanism between master and tablet server Assigns unassigned tablets in the cluster to the tablet servers it monitors and moves tablets around to achieve load balancing Garbage collects underlying files stored in GFS

50 BigTable tablet architecture Each SSTable is an ordered and immutable mapping of keys to values

51 Tablet Serving Writes committed to log Memtable: ordered log of recent commits (in memory) SSTables really store a snapshot When Memtable gets too big Create new empty Memtable Merge old Memtable with SSTables and write to GFS
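A hedged sketch of this write path: commit to a log, apply to an in-memory sorted memtable, and when the memtable gets too big, freeze it as an immutable SSTable (written to GFS). The names are illustrative, not real Bigtable code:

import scala.collection.immutable.TreeMap

final case class SSTable(data: TreeMap[String, String])   // immutable, sorted key -> value map

final class TabletSketch(maxMemtableSize: Int) {
  private var log      = Vector.empty[(String, String)]   // stand-in for the commit log in GFS
  private var memtable = TreeMap.empty[String, String]    // ordered, in-memory recent writes
  private var sstables = List.empty[SSTable]              // newest first

  def write(key: String, value: String): Unit = {
    log = log :+ (key -> value)                 // 1. commit the write to the log
    memtable += (key -> value)                  // 2. apply it to the memtable
    if (memtable.size >= maxMemtableSize) flush()
  }

  private def flush(): Unit = {                 // 3. memtable too big: snapshot it as a new SSTable
    sstables = SSTable(memtable) :: sstables
    memtable = TreeMap.empty
  }

  // Reads consult the memtable first, then the SSTables from newest to oldest.
  def read(key: String): Option[String] =
    memtable.get(key).orElse(sstables.flatMap(_.data.get(key)).headOption)
}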

52 SSTable Operations Look up the value for a key Iterate over all key/value pairs in a specified range Bigtable also relies on the lock service (Chubby) to: ensure there is at most one active master; administer tablet server death; store column family information; store access control lists

53 Chubby Chubby provides the infrastructure with a locking service a file system for reliable storage of small files a leader election service (e.g. to select a primary replica) a name service Seemingly violates the "simplicity" design philosophy, but... Chubby really provides an asynchronous distributed agreement service (see the election sketch below)
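A hedged sketch of leader election built on a Chubby-style lock service, as suggested above; the LockService trait is an illustrative stand-in, not the actual Chubby client API:

trait LockService {
  def tryAcquire(path: String): Boolean               // try to grab an exclusive lock on a file
  def setContents(path: String, data: String): Unit   // small-file storage
  def getContents(path: String): Option[String]
}

object ElectionSketch {
  // A replica becomes primary iff it wins the lock; it then advertises its address
  // in the lock file so that clients can find the current primary (name service).
  def electPrimary(chubby: LockService, myAddress: String, lockFile: String): Boolean =
    if (chubby.tryAcquire(lockFile)) {
      chubby.setContents(lockFile, myAddress)
      true
    } else false
}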

54 Chubby API

55 Overall architecture of Chubby Cell: single instance of Chubby system 5 replicas 1 master replica Each replica maintains a database of directories and files/locks Consistency achieved using Lamport’s Paxos consensus protocol that uses an operation log Chubby internally supports snapshots to periodically GC the operation log

56 Paxos distributed consensus algorithm A distributed consensus protocol for asynchronous systems Used by servers managing replicas in order to reach agreement on an update when messages may be lost, re-ordered, or duplicated servers may operate at arbitrary speed and fail servers have access to stable persistent storage Fact: consensus is not always possible in asynchronous systems Paxos works by ensuring safety (correctness), not liveness (termination)

57 Paxos algorithm - step 1

58 Paxos algorithm - step 2
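In standard (single-decree) Paxos, step 1 is the prepare/promise phase and step 2 is the accept/accepted phase. Here is a hedged, simplified sketch of the acceptor side (illustrative only, not Chubby's actual implementation); a proposer must hear from a majority of acceptors in each step, and in step 2 it must propose the highest-numbered value reported in step 1, if any:

final class Acceptor {
  private var promised: Long = -1L                      // highest proposal number promised so far
  private var accepted: Option[(Long, String)] = None   // (proposal number, value) last accepted

  // Step 1 (prepare): if n is higher than any promise so far, promise to ignore
  // lower-numbered proposals and report the previously accepted value (if any);
  // otherwise reject by returning None.
  def prepare(n: Long): Option[Option[(Long, String)]] =
    if (n > promised) { promised = n; Some(accepted) } else None

  // Step 2 (accept): accept the value unless a higher-numbered promise has been made.
  def accept(n: Long, value: String): Boolean =
    if (n >= promised) { promised = n; accepted = Some((n, value)); true } else false
}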

59 The Big Picture Customized solutions for Google-type problems GFS: stores data reliably, but just raw files BigTable: provides a key/value map, database-like but doesn't provide everything we need Chubby: locking mechanism, handles all synchronization problems

60 Common Principles One master, multiple workers MapReduce: master coordinates work amongst map / reduce workers Chubby: master among five replicas Bigtable: master knows about location of tablet servers GFS: master coordinates data across chunkservers


Download ppt "CSC 536 Lecture 8. Outline Reactive Streams Streams Reactive streams Akka streams Case study Google infrastructure (part I)"

Similar presentations


Ads by Google