Scalable Distributed Stream Processing Presented by Ming Jiang
Centralized stream processing review
When distribution is needed: a distributed federation of participating nodes in different administrative domains; collaboration between the different domains is required
Two complementary efforts for this situation: Aurora* (intra-participant distribution); Medusa (inter-participant distribution)
Three pieces to be shared: Aurora; an overlay network for communication; algorithms for high availability
Three architectural issues: communications; load sharing; high availability in the presence of failure
Communications: Naming (participants, entity-name). Routing: 1. a data source or an administrator registers a schema and a stream; 2. when the data source (DS) produces an event, it labels
Communications: Message transport: multiplexing all the message streams on a single TCP connection. Remote definition: used because process migration is too complicated
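One way to picture multiplexing several message streams over a single connection is to tag each message with a stream identifier and a length. The sketch below is only an illustration: the frame layout (a 4-byte stream id followed by a 4-byte payload length) and the names `encode_frame` and `decode_frames` are assumptions, not the format used by Aurora* or Medusa, and the bytes are handled in memory rather than on a real socket.

```python
import struct

# Hypothetical frame header: 4-byte stream id, 4-byte payload length (network byte order).
HEADER = struct.Struct("!II")


def encode_frame(stream_id: int, payload: bytes) -> bytes:
    """Prefix a payload with its stream id and length so streams can share one connection."""
    return HEADER.pack(stream_id, len(payload)) + payload


def decode_frames(buffer: bytes):
    """Split received bytes into (stream_id, payload) pairs.

    Returns the complete frames plus any trailing bytes of an incomplete frame.
    """
    frames = []
    offset = 0
    while offset + HEADER.size <= len(buffer):
        stream_id, length = HEADER.unpack_from(buffer, offset)
        end = offset + HEADER.size + length
        if end > len(buffer):
            break  # the last frame has not fully arrived yet
        frames.append((stream_id, buffer[offset + HEADER.size:end]))
        offset = end
    return frames, buffer[offset:]


# Three messages from two logical streams interleaved on one byte stream.
wire = (
    encode_frame(1, b"quote:IBM")
    + encode_frame(2, b"temp:21")
    + encode_frame(1, b"quote:HP")
)
frames, rest = decode_frames(wire)
```

The receiver demultiplexes by stream id; any leftover bytes are kept until the rest of the frame arrives.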
Load Management: repartitioning Aurora networks based on loads and resources: box sliding; box splitting
Box Sliding: takes a box on the edge of a sub-network on one machine and shifts it to its neighbor. Example: upstream box sliding
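Box sliding can be sketched as moving one box across the boundary between two neighboring machines while keeping the order of the pipeline intact. The example below is a minimal illustration under assumptions: each machine is modeled as an ordered list of box names, and the function names and the three-box chain are hypothetical.

```python
def slide_upstream(upstream: list, downstream: list) -> None:
    """Move the box on the upstream edge of `downstream` onto the upstream machine."""
    if not downstream:
        raise ValueError("downstream machine has no box to slide")
    upstream.append(downstream.pop(0))


def slide_downstream(upstream: list, downstream: list) -> None:
    """Move the box on the downstream edge of `upstream` onto the downstream machine."""
    if not upstream:
        raise ValueError("upstream machine has no box to slide")
    downstream.insert(0, upstream.pop())


# Hypothetical chain filter -> map -> tumble, split across two machines.
machine_a = ["filter"]
machine_b = ["map", "tumble"]

# Machine B is overloaded, so its edge box slides upstream to machine A.
slide_upstream(machine_a, machine_b)
```

After the slide, machine A runs `filter` and `map`, machine B runs only `tumble`, and the concatenated pipeline is unchanged.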
Box Splitting: creates a copy of a box that is intended to run on a second machine, to offload the first. Needs a filter to act as a router
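The role of the filter as a router can be sketched for the simplest case, a stateless box. In the illustration below, a predicate decides which copy of the box sees each tuple, and the two outputs are then unioned. The names `filter_router` and `run_split`, the doubling box, and the sample readings are all assumptions made for the example.

```python
def filter_router(stream, predicate):
    """Route each tuple to the first copy if the predicate holds, else to the second."""
    first, second = [], []
    for item in stream:
        (first if predicate(item) else second).append(item)
    return first, second


def run_split(box, stream, predicate):
    """Run two copies of a stateless box on the routed sub-streams and union the results."""
    to_first, to_second = filter_router(stream, predicate)
    out_first = [box(t) for t in to_first]    # copy on the original machine
    out_second = [box(t) for t in to_second]  # copy on the second machine
    return out_first + out_second             # union of the two outputs


def double(x):
    """Hypothetical stateless box."""
    return x * 2


readings = [5, 12, 7, 30, 1]
merged = run_split(double, readings, lambda x: x < 10)
```

The union contains the same tuples as running the box unsplit, although their order may differ.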
Box splitting: Tumble and Merge. Box splitting has to be transparent: the split network must produce the same result as the unsplit box
Box splitting: if the predicate in the filter is B<3, then A machine: 1,2,3,4,7; B machine: 5,6. [Figure: output of A machine, output of B machine, and final result after merge]
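The routing in this example can be replayed with made-up data. The sketch below uses seven hypothetical tuples whose `B` values are chosen so that the predicate B<3 sends tuples 1, 2, 3, 4, 7 to machine A and tuples 5, 6 to machine B; the actual values on the slide's figure are not recoverable. As a simplification, the Tumble box is modeled as a plain count per value of `A` over the whole batch, and the merge as a sum of the partial counts.

```python
from collections import Counter

# Hypothetical input: ids 1..7 with attributes A and B (values are assumptions).
tuples = [
    {"id": 1, "A": 1, "B": 2},
    {"id": 2, "A": 1, "B": 1},
    {"id": 3, "A": 2, "B": 2},
    {"id": 4, "A": 2, "B": 1},
    {"id": 5, "A": 2, "B": 3},
    {"id": 6, "A": 3, "B": 4},
    {"id": 7, "A": 3, "B": 2},
]


def count_by_a(stream):
    """Simplified stand-in for a counting Tumble box: count tuples per value of A."""
    return Counter(t["A"] for t in stream)


# Filter acting as router, with predicate B < 3.
machine_a_input = [t for t in tuples if t["B"] < 3]
machine_b_input = [t for t in tuples if not t["B"] < 3]

partial_a = count_by_a(machine_a_input)  # partial counts on machine A
partial_b = count_by_a(machine_b_input)  # partial counts on machine B

# Merge: partial counts must be summed, not just unioned, to stay transparent.
merged = partial_a + partial_b
unsplit = count_by_a(tuples)
```

Here `merged` equals `unsplit`, which is what transparency requires. A plain union of the two partial outputs would instead report two separate counts for the groups that were split across machines.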
Key partitioning challenges: choosing what to offload; choosing what to split; choosing filters; others…
High Availability: utilize the push-based nature of the data flow
Failure detection and recovery: 1. each server periodically sends heartbeat messages to its upstream neighbors; 2. if any server does not respond for a pre-defined time, it is assumed to have failed; 3. the recovery phase is initiated, emulating the processing of the failed server (load shedding can be used)
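The detection steps above can be sketched as a small monitor kept by an upstream neighbor. This is a minimal illustration under assumptions: the class name `HeartbeatMonitor`, the server names, and the 3-second timeout are hypothetical, and time is passed in explicitly so the example is deterministic. The recovery phase itself is not modeled.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat received from each downstream server."""

    def __init__(self, timeout: float):
        self.timeout = timeout  # pre-defined silence period before assuming failure
        self.last_seen = {}     # server name -> time of last heartbeat

    def record_heartbeat(self, server: str, now: float) -> None:
        """Step 1: note a heartbeat message received from a downstream server."""
        self.last_seen[server] = now

    def failed_servers(self, now: float):
        """Step 2: servers silent for longer than the timeout are assumed failed."""
        return sorted(
            server
            for server, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


monitor = HeartbeatMonitor(timeout=3.0)
monitor.record_heartbeat("s1", now=0.0)
monitor.record_heartbeat("s2", now=0.0)
monitor.record_heartbeat("s1", now=2.0)  # s1 keeps reporting, s2 goes silent

suspects = monitor.failed_servers(now=4.0)
```

At time 4.0 only `s2` has been silent for longer than the timeout, so it is the one for which the upstream neighbor would start the recovery phase.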
Thank you!