1
Go Stream
Matvey Arye, Princeton/Cloudflare
Albert Strasheim, Cloudflare
2
Awesome CDN service for websites big & small
Millions of requests a second at peak
24 data centers across the globe
3
Data Analysis
– Customer-facing analytics
– System health monitoring
– Security monitoring
=> Need a global view
4
Functionality
Calculate aggregate functions on fast, big data
Aggregate across nodes (and across datacenters)
Data stored at different time granularities
5
Storm & Rainbird
6
Basic Design Requirements
1. Reliability – exactly-once semantics
2. High data volumes
7
Our Environment (diagram): Sources => Stream processing => Storage
8
Basic Programming Model (diagram): a pipeline of operators with storage between stages: Op => Storage => Op => Storage => Op
9
Existing Systems
S4: the reliability model is not consistent
Storm: exactly-once semantics requires batching; reliability only inside the stream-processing system
What if a source goes down? The DB?
10
The Need For End-to-End Reliability (diagram): Source => Stream Processing => Storage
When a source comes back up, where does it start sending data from?
If using something like Storm, additional reliability mechanisms are needed.
11
The Takeaway
Reliability of stream processing alone is not enough.
Need end-to-end reliability, or multiple reliability mechanisms.
12
Design of Reliability
Avoid queuing when a destination has failed
– Rely on storage at the edges
– Minimize replication
Minimize edge cases
No specialized hardware
13
Big Design Decisions End-to-end reliability Only transient operator state
14
Recovering From Failure
Source: "I am starting a stream with you. What have you already seen from me?"
Storage: "I've seen up to [the highest ID]."
Source: "Okie dokie. Here is all the new stuff."
15
Tracking what you have seen (diagram: items 1, 2, 3, 4 in sequence)
Option 1: store an identifier for every item
Option 2: store one identifier for the highest number seen
16
Tracking what you have seen (diagram: items 1, 2, 3, 4 in sequence)
Store an identifier for every item:
– The answer to "what have I seen?" is huge
– Requires lots of storage for IDs
Store one identifier for the highest number:
– Parallel processing of ordered data is tricky
17
Tension between: parallelization, high data volume, ordering, reliability
18
Go Makes This Easier
A language from Google designed for concurrency.
Goroutines run code concurrently (diagram: four goroutines, each labeled "I run code").
Channels send data between goroutines.
Most synchronization is done by passing data.
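A minimal sketch of the pattern this slide describes: two goroutines synchronized purely by passing data over channels, with no shared memory. The `square` stage is illustrative, not part of go-stream's API.

```go
package main

import "fmt"

// square reads numbers from in and writes their squares to out,
// closing out when the input is exhausted.
func square(in <-chan int, out chan<- int) {
	for n := range in {
		out <- n * n
	}
	close(out)
}

func main() {
	in := make(chan int)
	out := make(chan int)

	go square(in, out) // runs concurrently with main

	go func() {
		for i := 1; i <= 3; i++ {
			in <- i
		}
		close(in)
	}()

	// Receiving from out is the only synchronization needed.
	for sq := range out {
		fmt.Println(sq) // prints 1, 4, 9
	}
}
```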
19
Goroutine Scheduling
Channels are FIFO queues with a maximum capacity, so a goroutine can be in one of 4 states:
1. Executing code
2. Waiting for a thread to execute code
3. Blocking to receive data from a channel
4. Blocking to send data to a channel
The scheduler optimizes the assignment of goroutines to threads.
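The channel behavior behind states 3 and 4 can be seen with a buffered channel; this small sketch shows the FIFO queue with a fixed capacity:

```go
package main

import "fmt"

func main() {
	// A buffered channel is a FIFO queue with capacity 2.
	ch := make(chan string, 2)

	ch <- "a" // does not block: buffer has room
	ch <- "b" // does not block: buffer is now full
	// A third send here would block (state 4) until a receiver drains the queue.

	fmt.Println(<-ch) // prints "a" — FIFO order
	fmt.Println(<-ch) // prints "b"
	// A receive here would block (state 3) until someone sends.
}
```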
20
Efficient Ordering Under The Hood (diagram: input tuples 1, 2, 3, 4)
The source distributes items to workers in a specific order. Each worker has two output channels: a count of output tuples for each input, and the actual result tuples. Reading from each worker:
1. Read one tuple off the count channel; assign the count to X
2. Read X tuples off the result channel
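The ordering protocol on this slide can be sketched as follows. Each worker here turns input n into n copies of n purely for illustration; the worker function and channel layout are assumptions, not go-stream's real API. The reader visits workers in the same order the source used, so output order matches input order even though workers run in parallel:

```go
package main

import "fmt"

// worker consumes input tuples and, for each one, first announces on
// counts how many results follow, then emits the results themselves.
func worker(in <-chan int, counts chan<- int, results chan<- int) {
	for n := range in {
		counts <- n // count of output tuples for this input
		for i := 0; i < n; i++ {
			results <- n
		}
	}
	close(counts)
}

func main() {
	const numWorkers = 2
	ins := make([]chan int, numWorkers)
	counts := make([]chan int, numWorkers)
	results := make([]chan int, numWorkers)
	for i := range ins {
		ins[i] = make(chan int)
		counts[i] = make(chan int, 16)
		results[i] = make(chan int, 16)
		go worker(ins[i], counts[i], results[i])
	}

	// Source: distribute inputs round-robin, in a known order.
	inputs := []int{1, 2, 3, 4}
	go func() {
		for i, n := range inputs {
			ins[i%numWorkers] <- n
		}
		for _, ch := range ins {
			close(ch)
		}
	}()

	// Reader: visit workers in the same round-robin order. For each
	// input tuple, read one count X, then exactly X results.
	for i := range inputs {
		w := i % numWorkers
		x := <-counts[w]
		for j := 0; j < x; j++ {
			fmt.Print(<-results[w], " ")
		}
	}
	fmt.Println() // prints: 1 2 2 3 3 3 4 4 4 4
}
```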
21
Intuition behind the design
Multiple output channels allow each worker to write independently.
The count channel tells the reader how many tuples to expect.
The reader does not block except when a result is needed to satisfy ordering.
Judicious blocking allows the scheduler to use blocking as a signal for which worker to schedule.
22
Throughput does not suffer
23
The Big Picture - Reliability
Sources provide monotonically increasing IDs – per stream
The stream processor preserves ordering – per source-stream
A central DB maintains a mapping of: source-stream => highest ID processed
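A sketch of the central mapping this slide describes: because IDs are monotonically increasing per stream and ordering is preserved, a single integer per source-stream suffices to answer the recovery handshake. The type and method names are illustrative, not go-stream's real types.

```go
package main

import (
	"fmt"
	"sync"
)

// checkpointDB stands in for the central DB: it maps a
// source-stream name to the highest ID processed.
type checkpointDB struct {
	mu      sync.Mutex
	highest map[string]int64
}

// Ack records that everything up to and including id has been
// processed for the given source-stream.
func (db *checkpointDB) Ack(stream string, id int64) {
	db.mu.Lock()
	defer db.mu.Unlock()
	if id > db.highest[stream] {
		db.highest[stream] = id
	}
}

// Resume answers the source's question on reconnect:
// "what have you already seen from me?"
func (db *checkpointDB) Resume(stream string) int64 {
	db.mu.Lock()
	defer db.mu.Unlock()
	return db.highest[stream]
}

func main() {
	db := &checkpointDB{highest: make(map[string]int64)}
	db.Ack("dc1/requests", 41)
	db.Ack("dc1/requests", 42)
	fmt.Println(db.Resume("dc1/requests")) // prints 42; the source resends from 43
}
```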
24
Functionality of the Stream Processor
Compression, serialization
Partitioning for distributed sinks
Bucketing – take individual records and construct aggregates
– Across source nodes
– Across time, with adjustable granularity
Batching – submitting many records at once to the DB
Bucketing and batching are all done with transient state
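The batching step above can be sketched like this. The `batcher` type and its flush callback are assumptions for illustration; note that the buffer is transient state, matching the design decision earlier in the talk:

```go
package main

import "fmt"

// batcher accumulates records and flushes them to a sink in groups,
// trading per-record overhead for fewer, larger DB writes.
type batcher struct {
	size  int
	buf   []string
	flush func(batch []string)
}

// Add buffers one record, flushing when the batch is full.
func (b *batcher) Add(rec string) {
	b.buf = append(b.buf, rec)
	if len(b.buf) >= b.size {
		b.Flush()
	}
}

// Flush submits whatever is buffered, if anything.
func (b *batcher) Flush() {
	if len(b.buf) == 0 {
		return
	}
	b.flush(b.buf)
	b.buf = nil // transient state: nothing survives a restart
}

func main() {
	b := &batcher{size: 3, flush: func(batch []string) {
		fmt.Printf("flushing batch of %d\n", len(batch))
	}}
	for _, r := range []string{"a", "b", "c", "d"} {
		b.Add(r)
	}
	b.Flush() // flush the partial final batch
}
```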
25
Where to get the code
Stable: https://github.com/cloudflare/go-stream
Bleeding edge: https://github.com/cevian/go-stream
Contact: arye@cs.princeton.edu
26
Data Model Streaming OLAP-like cubes Useful summaries of high-volume data
27
Cube Dimensions (diagram): a grid with Time on one axis (01:01:00, 01:01:01) and URL on the other (foo.com/r, foo.com/q, bar.com/n, bar.com/m)
28
Cube Aggregates (diagram): each cell, e.g. (bar.com/m, 01:01:01), holds an aggregate such as (Count, Max)
29
Updating A Cube (diagram): Request #1 arrives for bar.com/m at 01:01:00 with latency 90 ms; the target cell currently holds (0,0)
30
Map Request To Cell (diagram): the request's coordinates (01:01:00, bar.com/m) select the cell holding (0,0)
31
Update The Aggregates (diagram): the cell becomes (1,90) – count 1, max latency 90 ms
32
Update In-Place (diagram): Request #2 for the same cell, with latency 50 ms, updates it to (2,90) – count 2, max unchanged
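The update walked through on these slides can be sketched directly: a (count, max) aggregate folded in place as requests arrive. The `cell` type and string key are illustrative stand-ins for the real cube's structures.

```go
package main

import "fmt"

// cell holds the (count, max-latency) aggregate for one
// (time, URL) coordinate in the cube.
type cell struct {
	count int
	max   int // max latency in ms
}

// update folds one request into the cell in place, as on the slides:
// request #1 (90 ms) -> (1,90), request #2 (50 ms) -> (2,90).
func (c *cell) update(latencyMs int) {
	c.count++
	if latencyMs > c.max {
		c.max = latencyMs
	}
}

func main() {
	// The cube maps a (time, URL) key to its cell.
	cube := map[string]*cell{}
	key := "01:01:00|bar.com/m"

	for _, latency := range []int{90, 50} {
		if cube[key] == nil {
			cube[key] = &cell{}
		}
		cube[key].update(latency)
	}
	fmt.Println(cube[key].count, cube[key].max) // prints: 2 90
}
```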
33
Cube Slice (diagram): fixing a range along one dimension, e.g. times 01:01:00 through 01:01:59, selects a slice of the cube
34
Cube Rollup (diagram): aggregating along a dimension, e.g. URL: bar.com/* at Time: 01:01:01, or URL: foo.com/* at Time: 01:01:01
35
Rich Structure (diagram): rollup cells over the cube, holding aggregates such as (5,90), (3,75), (8,199), (21,40):
Cell | URL | Time
A | bar.com/* | 01:01:01
B | * |
C | foo.com/* | 01:01:01
D | foo.com/r | 01:01:*
E | foo.com/* | 01:01:*
36
Key Property
2 types of rollups:
1. Across dimensions
2. Across sources
We use the same aggregation function for both.
This imposes powerful conceptual constraints: semantic properties are preserved when changing the granularity of reporting.
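A sketch of why one function can serve both rollups: if the merge of two (count, max) aggregates is associative and commutative, it does not matter whether the two inputs are neighboring cells along a dimension or the same cell reported by two sources. The `agg`/`merge` names are illustrative:

```go
package main

import "fmt"

// agg is a (count, max) aggregate, as in the cube slides.
type agg struct {
	count int
	max   int
}

// merge combines two aggregates. It is associative and commutative,
// so the same function handles rollups across dimensions and
// rollups across sources.
func merge(a, b agg) agg {
	m := a.max
	if b.max > m {
		m = b.max
	}
	return agg{count: a.count + b.count, max: m}
}

func main() {
	// Across dimensions: bar.com/m + bar.com/n => bar.com/*
	fmt.Println(merge(agg{5, 90}, agg{3, 75})) // prints: {8 90}

	// Across sources: two data centers reporting the same cell.
	fmt.Println(merge(agg{2, 40}, agg{1, 55})) // prints: {3 55}
}
```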
37
Where to get the code
Stable: https://github.com/cloudflare/go-stream
Bleeding edge: https://github.com/cevian/go-stream
Contact: arye@cs.princeton.edu