Distributed, real-time actionable insights on high-volume data streams

Presentation transcript:

Conflux: Distributed, real-time actionable insights on high-volume data streams
Vinay Eswara, Jai Krishna, Gaurav Srivastava (veswara,jaikrishna,gsrivastava@vmware.com)
2nd December 2016

Introduction
A monitoring system based on time-series data, needed for a cloud-scale data center; supports multi-tenancy.
Design configuration maximums: 50,000 VMs / 30,000 powered-on VMs, each emitting 85 metrics every 20 seconds.
Data volume: (30,000 VMs x 85 metrics x 1 KB packets) / 20 s, approximately 127 MB/s or 11 TB per day, not accounting for compression.
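A quick back-of-the-envelope check of these figures (a sketch added here, not part of the original deck; decimal units assumed):

```python
# Sanity check of the ingest-rate figures quoted above (decimal units assumed).
vms = 30_000          # powered-on VMs
metrics = 85          # metrics emitted per VM
packet_kb = 1.0       # ~1 KB per metric packet
interval_s = 20       # emission interval in seconds

kb_per_s = vms * metrics * packet_kb / interval_s      # 127,500 KB/s
print(f"{kb_per_s / 1000:.1f} MB/s")                   # ~127.5 MB/s
print(f"{kb_per_s * 86_400 / 1e9:.1f} TB/day")         # ~11.0 TB/day
```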

Objective
Group streams arbitrarily in real time, e.g. all of a customer's VMs, capacity utilization, an accounting workload, etc.
Compute machine-learning-style models fast: M = A·s1^a' + B·s2^b' + C·s3^c', where s1, s2, s3 are streams.
Handle updates to model functions and groups fast, while remaining highly available, horizontally scalable, and easy to deploy from VM templates.
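For illustration, a minimal sketch of evaluating such a model over the latest time-aligned samples (the stream names, coefficients, and pointwise-evaluation assumption are mine, not from the deck):

```python
# Minimal sketch: evaluate M = A*s1**a + B*s2**b + C*s3**c over aligned samples.
def evaluate_model(samples: dict, terms: list) -> float:
    """samples: latest value per stream ID; terms: (stream_id, coefficient, exponent)."""
    return sum(coeff * samples[sid] ** exp for sid, coeff, exp in terms)

latest = {"s1": 0.7, "s2": 0.4, "s3": 1.2}
model = [("s1", 2.0, 1.0), ("s2", 0.5, 2.0), ("s3", 1.0, 0.5)]   # (A, a'), (B, b'), (C, c')
print(evaluate_model(latest, model))   # 2*0.7 + 0.5*0.4**2 + 1.2**0.5 ≈ 2.58
```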

Existing solutions
Twitter: Heron, Kestrel
Google: MillWheel, Photon
Apache: Spark, Storm, Samza

Background: naive solution - modulo sharding
With 3 servers, server number = ID % 3. When server S2 crashes, the mapping becomes ID % 2 and rehashing occurs: users unrelated to the crash are shuffled as well (e.g. u3, u4, u9 move between the surviving servers).
Is it possible to redistribute only the users homed on the crashed server?
(Figure: users u0-u9 mapped to servers S0-S2 before the failure and to S0-S1 after.)
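A small sketch reproducing the shuffle described on this slide (user IDs are hypothetical):

```python
# Modulo sharding: when server S2 (index 2) crashes, users that never lived on
# S2 still get reshuffled between the surviving servers.
users = list(range(10))                       # u0 .. u9
before = {u: u % 3 for u in users}            # 3 servers: S0, S1, S2
after = {u: u % 2 for u in users}             # S2 gone, only S0 and S1 remain

moved = [u for u in users if before[u] != after[u]]
unrelated = [u for u in moved if before[u] != 2]
print(moved)       # [2, 3, 4, 5, 8, 9] -- most of the population moves
print(unrelated)   # [3, 4, 9]          -- moved even though they were never on S2
```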

Background: consistent hashing primer
(Figure: nodes and users placed on the same hash ring.)
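A minimal consistent-hash-ring sketch for reference (illustrative only; Conflux's actual hash function and vnode count are not given in the deck):

```python
# Toy consistent-hash ring with virtual nodes: a key is owned by the first
# vnode clockwise from its hash, so removing a node only moves that node's keys.
import bisect
import hashlib

class Ring:
    def __init__(self, nodes, vnodes=8):
        self.ring = {}                                   # hash position -> physical node
        for node in nodes:
            for v in range(vnodes):
                self.ring[self._hash(f"{node}#{v}")] = node
        self.positions = sorted(self.ring)

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        i = bisect.bisect(self.positions, self._hash(key)) % len(self.positions)
        return self.ring[self.positions[i]]

ring = Ring(["n0", "n1", "n2"])
print(ring.node_for("stream-42"))   # the same stream ID always lands on the same node
```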

Vnodes in practice: node failure
(Figure sequence: a node's vnodes are removed from the ring when it fails, so only the streams it owned are reassigned to the remaining nodes.)

Definitions
Packet: the atomic unit of input and output in Conflux; a set of (ID, Metric, Timestamp, Value) tuples.
Stream: a logically unbounded sequence of tuples bearing the same ID.
Routing: consistent hashing of the ID in each packet against the live nodes of the Conflux cluster to decide which node the packet is delivered to.
Metric: an individual, timestamped, measurable property of a phenomenon being observed.
Note: all timestamps are UTC (client provided).
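A sketch of these definitions as types (the field names and flat tuple layout are my assumptions):

```python
# Packet / Stream tuple structure as defined above (field names assumed).
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Sample:
    id: str          # stream ID (e.g. a VM identifier, or a group ID)
    metric: str      # metric name, e.g. "cpu.usage"
    timestamp: int   # UTC epoch seconds, provided by the client
    value: float

Packet = List[Sample]   # a packet is a set of such tuples ingested together
```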

Method: consistent hashing in Conflux
Each stream has a unique ID; the consistent hash of that ID selects the Conflux node.
This shards the universe of streams across the nodes: cache partitioning.
Failure: batch acknowledgements mean an unacknowledged batch is simply retransmitted.
Failure: Cassandra replication makes the data available locally again, since the hashes match.

Groups
A set of streams with given IDs is a group, e.g. G = A + B + C + D.
Conflux treats a group itself as a stream with ID 'G'. This allows group composition, e.g. a group of groups: GoG = G1 + G2 + G3.
When ingesting a packet with some ID 'X' belonging to group 'G', Conflux simply retransmits the packet with its ID changed to 'G'. This is called feed-forward (sketched below).
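A feed-forward sketch under these definitions (the membership map, field names, and routing callback are assumptions for illustration):

```python
# Feed-forward: on ingest, a packet with ID 'X' belonging to group 'G' is
# retransmitted with its ID rewritten to 'G', then routed by the hash of 'G'.
membership = {"A": {"G"}, "B": {"G"}, "C": {"G"}, "D": {"G"}, "G": {"GoG"}}

def feed_forward(packet: dict, route) -> None:
    for group_id in membership.get(packet["id"], ()):
        forwarded = dict(packet, id=group_id)   # same samples, group ID substituted
        route(forwarded)                        # re-routed via consistent hash of the new ID

feed_forward({"id": "A", "metric": "cpu", "ts": 0, "value": 0.7}, route=print)
# -> {'id': 'G', 'metric': 'cpu', 'ts': 0, 'value': 0.7}
```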

Merging streams based on groups
(Figure: streams A, B, C, D are hashed to different nodes on the consistent hash ring; membership A=>G, B=>G, C=>G, D=>G is cached at each node; each node re-transmits the data with group ID 'G' via feed-forward, so G's data converges on one node.)

Models / Formulae
(Figure: streams A, B, C, D are hashed to different nodes; membership and the first stage of computation, X·A^x => G, Y·B^y => G, Z·C^z => G, W·D^w => G, are cached at each node; the weighted terms are re-transmitted with group ID 'G' via feed-forward and combined on G's node as G = (X·A^x) + (Y·B^y) + ...)
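A sketch of the two-stage evaluation this slide implies (the split of work between the stream's node and the group's node is from the slide; the code and names are mine):

```python
# Stage 1 runs on each stream's home node: apply that stream's term, e.g. X * A**x,
# and feed the partial result forward to group G. Stage 2 runs on G's home node.
terms = {"A": (2.0, 1.0), "B": (0.5, 2.0)}    # stream_id -> (coefficient, exponent)

def stage1(stream_id: str, value: float) -> float:
    coeff, exp = terms[stream_id]
    return coeff * value ** exp

def stage2(partials) -> float:
    return sum(partials)                      # G = (X*A**x) + (Y*B**y) + ...

print(stage2([stage1("A", 0.7), stage1("B", 0.4)]))   # 1.4 + 0.08 = 1.48
```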

Group Gx create, with members A,B,C

Group Gx member delete

Implementation
Single unit of deployment.
Thresholding + HTTP callouts = customized actions.
Data persisted into Cassandra with a TTL for disk reclamation; Cassandra compaction is done daily in an off-peak window.
A JavaScript engine is used to define groups / formulae on the fly.
5-node cluster: 8 vCPU, 32 GB RAM, 2 TB disk.
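The persistence path mentioned here could look roughly like the following (keyspace, table schema, and TTL value are assumptions; shown with the DataStax Python driver):

```python
# Hypothetical sketch of TTL-based persistence into Cassandra (schema assumed).
from cassandra.cluster import Cluster

session = Cluster(["conflux-cassandra"]).connect("conflux")
session.execute(
    "INSERT INTO samples (id, metric, ts, value) VALUES (%s, %s, %s, %s) USING TTL 604800",
    ("vm-42", "cpu.usage", 1480636800, 0.7),   # row expires after 7 days, reclaiming disk
)
```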

Results
Single-node ingestion rate at approximately 60% CPU; recovery from a single node failure in a 5-node cluster:

Run#  Avg CPU before  Msg/s before  Max CPU in recovery  Avg CPU after
1     63%             1647          93%                  75%
2     64%             1587          97%                  88%
3     62%             1688          96%                  72%
4     -               1711          100%                 86%
5     61%             1649          95%                  80%

Conclusion: how is Conflux different
Conflux routes packets with consistent hashing so that all related streams of a group or formula end up on the same node. This allows fast in-memory evaluation using cached data on one node.
Using the same consistent hash function for message routing as for persistence ensures that reads and writes are always on local disk, and that read/write locality is preserved in case of failure.

Future work
Tree groups for load balancing; fast tree-group updates and compaction.
Purely dynamic groups defined by a function, e.g. all nodes whose CPU > 80%.

Q & A

FAQ
Does more RAM per node help? => To an extent.
Does more CPU per node help? => Oh yes!
What if a rack dies? => VM affinity / anti-affinity.
What if a datacenter dies? => Tough luck.
Why not Spark / Heron / Samza?
Can I go across geographies? => No; the backplane IP network relies on superfast connections.