Conflux: Distributed, real-time actionable insights on high-volume data streams
Vinay Eswara, Jai Krishna, Gaurav Srivastava
veswara,jaikrishna,gsrivastava@vmware.com
2nd December 2016
Introduction
- A monitoring system based on time series data was needed for a cloud-scale data center.
- Supports multi-tenancy.
- Design configuration maximums (Configuration Maximums Reference):
  - 50,000 VMs / 30,000 powered-on VMs
  - 85 metrics per VM, each emitted every 20 seconds
- Data volume: (30,000 VMs x 85 metrics x 1 KB packets) / 20 s
- Approximately 127 MB/s, or 11 TB per day, not accounting for compression.
CONFIDENTIAL
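The data-volume estimate can be reproduced with a few lines of arithmetic. This is only a check of the slide's figures; decimal units (1 MB = 1000 KB, 1 TB = 1,000,000 MB) are an assumption, and the constant names are made up for this sketch.

```python
# Sizing figures from the slide. Decimal units are an assumption of this sketch.
POWERED_ON_VMS = 30_000
METRICS_PER_VM = 85
PACKET_KB = 1            # one 1 KB packet per metric sample
EMIT_INTERVAL_S = 20     # each metric is emitted every 20 seconds
SECONDS_PER_DAY = 86_400

kb_per_second = POWERED_ON_VMS * METRICS_PER_VM * PACKET_KB / EMIT_INTERVAL_S
mb_per_second = kb_per_second / 1000
tb_per_day = mb_per_second * SECONDS_PER_DAY / 1_000_000

print(f"{mb_per_second:.1f} MB/s, {tb_per_day:.2f} TB/day")  # 127.5 MB/s, 11.02 TB/day
```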
Objective
- Group streams arbitrarily in real time, e.g. all customer VMs, or capacity utilization / accounting workloads.
- Compute machine-learning-style models fast: M = A x (s1)^a’ + B x (s2)^b’ + C x (s3)^c’, where s1, s2, s3 are streams.
- Handle updates to model functions and groups fast, while remaining highly available, horizontally scalable, and easy to deploy using VM templates.
Existing solutions
- Twitter: Heron, Kestrel
- Google: MillWheel, Photon
- Apache: Spark, Storm, Samza
Background: naive solution - Modulo
- Placement rule: ID % num servers (3) = server number, with servers S0, S1, S2 and users u0 to u9.
- Server S2 crashes and rehashing occurs: ID % num servers (2) = server number, with servers S0, S1.
- Users not related to the crash are shuffled as well: u3, u4, u9.
- Is it possible to redistribute only the users homed on the crashed server?
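The shuffle described on this slide can be checked directly. The sketch below applies the modulo rule to users u0 to u9 before and after S2 is lost; the function and variable names are illustrative only.

```python
def modulo_home(user_id: int, num_servers: int) -> int:
    """Naive placement: the server index is the ID modulo the server count."""
    return user_id % num_servers

users = range(10)                                  # u0 .. u9, as on the slide
before = {u: modulo_home(u, 3) for u in users}     # S0, S1, S2 alive
after = {u: modulo_home(u, 2) for u in users}      # S2 has crashed

# Users that must move because their home server is gone.
on_crashed = [u for u in users if before[u] == 2]
# Users that moved even though their home server was still alive.
collateral = [u for u in users if before[u] != 2 and before[u] != after[u]]

print("homed on S2:", on_crashed)       # [2, 5, 8]
print("shuffled anyway:", collateral)   # [3, 4, 9]
```

Three of the seven users that were not on S2 still change servers, which is the cost consistent hashing is meant to avoid.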
Background: consistent hashing primer
[Figure: consistent hashing of nodes and users]
Vnodes in Practice, node failure
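A minimal sketch of a consistent hash ring with virtual nodes, to illustrate what happens on node failure. The slides do not say which hash function or how many vnodes Conflux uses; MD5, 64 vnodes per node, and the `HashRing` class are assumptions made for this example.

```python
import bisect
import hashlib

def ring_hash(key: str) -> int:
    # MD5 is only an illustrative stable hash; the slides do not name one.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

class HashRing:
    """Toy consistent hash ring with virtual nodes (vnodes)."""

    def __init__(self, nodes, vnodes_per_node=64):
        # Each node contributes several points on the ring.
        self._points = sorted(
            (ring_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes_per_node)
        )
        self._positions = [pos for pos, _ in self._points]

    def remove(self, node):
        # Losing a node removes only that node's vnodes from the ring.
        self._points = [p for p in self._points if p[1] != node]
        self._positions = [pos for pos, _ in self._points]

    def lookup(self, key: str) -> str:
        # The owner is the first vnode clockwise from the key, wrapping around.
        idx = bisect.bisect_right(self._positions, ring_hash(key))
        return self._points[idx % len(self._points)][1]

ring = HashRing(["S0", "S1", "S2"])
users = [f"u{i}" for i in range(1000)]
before = {u: ring.lookup(u) for u in users}
ring.remove("S2")
after = {u: ring.lookup(u) for u in users}
moved = [u for u in users if before[u] != after[u]]

# Only users that were homed on the failed node change owner.
print(all(before[u] == "S2" for u in moved))  # True
```

Because S2's load is spread over many vnodes, its users are redistributed across the surviving nodes rather than landing on a single neighbour.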
Definitions
- Packet: the atomic unit of input and output in Conflux. It is a set of (ID, Metric, Timestamp, Value) tuples.
- Stream: a logically unbounded sequence of tuples bearing the same ID.
- Routing: the process of consistently hashing the ID in each packet over the live nodes of the Conflux cluster to decide which node the packet is delivered to.
- Metric: an individual, timestamped, measurable property of a phenomenon being observed.
- Note: all timestamps are UTC (client provided).
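One possible way to model these definitions in code is sketched below. The type names, field names, and the example metric name are hypothetical; the slides do not describe Conflux's actual wire format. A real stream is unbounded, so the helper here only groups a finite batch of packets by ID.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List, NamedTuple

class Sample(NamedTuple):
    id: str          # stream ID
    metric: str      # metric name (hypothetical example: "cpu.usage")
    timestamp: int   # UTC epoch seconds, supplied by the client
    value: float

@dataclass
class Packet:
    samples: List[Sample]   # a packet is a set of (ID, Metric, Timestamp, Value) tuples

def split_into_streams(packets):
    """Group the tuples of a finite batch of packets by stream ID."""
    streams = defaultdict(list)
    for packet in packets:
        for sample in packet.samples:
            streams[sample.id].append(sample)
    return dict(streams)

packets = [
    Packet([Sample("vm-1", "cpu.usage", 0, 41.0), Sample("vm-2", "cpu.usage", 0, 77.5)]),
    Packet([Sample("vm-1", "cpu.usage", 20, 43.0)]),
]
streams = split_into_streams(packets)
print(sorted(streams))                          # ['vm-1', 'vm-2']
print([s.value for s in streams["vm-1"]])       # [41.0, 43.0]
```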
Method: Consistent hashing in Conflux
- Each stream has a unique ID. The consistent hash of that ID selects the Conflux node.
- This shards the universe of streams across the nodes: cache partitioning.
- On failure: batch acknowledgements lead to retransmission of the batch.
- On failure: Cassandra replication makes the data available locally again, since the hashes match.
Groups
- A set of streams with IDs is a group, e.g. G = A + B + C + D.
- Conflux treats a group itself as a stream with ID ‘G’.
- This allows group composition, e.g. GoG = G1 + G2 + G3.
- When ingesting a packet with some ID ‘X’ that belongs to group ‘G’, Conflux simply retransmits the packet with its ID changed to ‘G’. This is called feed-forward.
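Feed-forward and group composition can be sketched as follows. The membership table and the `ingest` function are hypothetical, and a recursive call stands in for re-transmitting the packet to whichever node owns the group ID. The sketch assumes group membership contains no cycles.

```python
from collections import defaultdict

# Hypothetical membership table: member stream ID -> IDs of the groups it belongs to.
MEMBERSHIP = {
    "A": ["G1"], "B": ["G1"],
    "C": ["G2"],
    "G1": ["GoG"], "G2": ["GoG"],   # groups are streams too, so they can be members
}

def ingest(stream_id, value, delivered):
    """Record a value for stream_id, then feed it forward to each parent group."""
    delivered[stream_id].append(value)
    for group_id in MEMBERSHIP.get(stream_id, []):
        # Feed-forward: the same payload, re-sent under the group's ID.
        ingest(group_id, value, delivered)

delivered = defaultdict(list)
for stream_id, value in [("A", 1.0), ("B", 2.0), ("C", 3.0)]:
    ingest(stream_id, value, delivered)

print(delivered["G1"])    # [1.0, 2.0]
print(delivered["GoG"])   # [1.0, 2.0, 3.0]
```

Because a group is addressed by an ordinary stream ID, the group of groups needs no special handling beyond one more membership entry.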
Merging streams based on groups
- Streams A, B, C, D are hashed to different nodes on the consistent hash ring.
- Membership is cached at each node: A=>G, B=>G, C=>G, D=>G.
- Data is re-transmitted with group ID ‘G’: feed-forward.
Models / Formulae
- Streams A, B, C, D are hashed to different nodes on the consistent hash ring.
- Membership and the first stage of computation are cached at each node: X x Ax => G, Y x By => G, Z x Cz => G, W x Dw => G.
- Data is re-transmitted with group ID ‘G’: feed-forward.
- G = (X x Ax) + (Y x By) + …
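The two-stage evaluation on this slide might look like the sketch below. It reads X, Y, Z, W as per-member coefficients and Ax, By, Cz, Dw as values from streams A to D, which is an interpretation of the slide's notation. The coefficient values, the `GroupSum` class, and the choice to keep the latest term per member are all assumptions; the slides do not say how terms are aligned in time.

```python
# Hypothetical coefficients for group G, cached on each member stream's home node.
COEFFICIENTS = {"A": 0.5, "B": 2.0, "C": 1.0, "D": 4.0}

def first_stage(stream_id, value):
    """Runs on the member's node: scale the value and re-label the result for 'G'."""
    return ("G", stream_id, COEFFICIENTS[stream_id] * value)

class GroupSum:
    """Runs on the node that owns 'G': keeps the latest term per member and sums them."""

    def __init__(self):
        self.latest = {}

    def update(self, member_id, term):
        self.latest[member_id] = term
        return sum(self.latest.values())

group = GroupSum()
total = 0.0
for stream_id, value in [("A", 10.0), ("B", 1.0), ("C", 3.0), ("D", 0.5)]:
    _, member_id, term = first_stage(stream_id, value)   # stage 1, at the member's node
    total = group.update(member_id, term)                # stage 2, at G's node

print(total)  # 0.5*10 + 2*1 + 1*3 + 4*0.5 = 12.0
```

Doing the multiplication where the member stream already lives means the group's node only has to add pre-scaled terms.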
Group Gx create, with members A, B, C
Group Gx member delete
Implementation
- Single unit of deployment.
- Thresholding + HTTP callouts = customized actions.
- Data is persisted into Cassandra with a TTL for disk reclamation.
- Cassandra compaction is done daily in an off-peak window.
- A JavaScript engine is used to define groups / formulae on the fly.
- 5-node cluster: 8 vCPU, 32 GB RAM, 2 TB disk.
Results
- 1-node ingestion rate with approximately 60% CPU:
- Recovery: single node failure with a 5-node cluster

Run#  Avg CPU before  Msg/s before  Max CPU in recovery  Avg CPU after
1     63%             1647          93%                  75%
2     64%             1587          97%                  88%
3     62%             1688          96%                  72%
4     -               1711          100%                 86%
5     61%             1649          95%                  80%
Conclusion: How is Conflux different
- Conflux routes with consistent hashing so that all related streams of a group or formula end up on the same node.
- This allows fast in-memory evaluation using cached data on one node.
- Using the same consistent hash function for message routing as for persistence ensures that reads and writes are always on local disk.
- Consistent hashing also ensures that read/write locality is preserved in case of failure.
Future work
- Tree groups for load balancing.
- Tree group fast updates, compaction.
- Pure-dynamic groups defined by a function, e.g. all nodes whose CPU > 80%.
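As an illustration of the idea only (this is listed as future work, so nothing here reflects an existing Conflux feature), a pure-dynamic group could be expressed as a predicate evaluated over the latest value of each stream. The function name and the sample data are hypothetical.

```python
def dynamic_group(latest_values, predicate):
    """Return the IDs whose latest value satisfies the predicate."""
    return sorted(sid for sid, value in latest_values.items() if predicate(value))

# Hypothetical latest CPU readings, in percent.
latest_cpu = {"vm-1": 91.0, "vm-2": 42.5, "vm-3": 80.0, "vm-4": 99.9}

hot = dynamic_group(latest_cpu, lambda cpu: cpu > 80)
print(hot)  # ['vm-1', 'vm-4']
```

Membership of such a group changes whenever a member's value crosses the threshold, which is why fast membership updates matter.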
Q & A
FAQ
- Does more RAM per node help? => To an extent.
- Does more CPU per node help? => Oh yes!!
- What if a rack dies? => VM affinity, anti-affinity.
- What if a datacenter dies? => Tough luck.
- Why not Spark, Heron, Samza?
- Can I go across geographies? => No, backplane IP ensures superfast connections.