Streaming Big Data on Azure with HDInsight Kafka, Storm and Spark 6/8/2018 8:44 PM Streaming Big Data on Azure with HDInsight Kafka, Storm and Spark Raghav Mohan Program Manager Azure HDInsight © Microsoft Corporation. All rights reserved. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
Bing Ads and Marketing campaigns use streaming on Azure to process ad campaigns in realtime. Scenario: For Bing Ads, the latency of reports (metrics on search ads) had been the number one customer dissatisfaction (DSAT) issue for years. FastBI is the solution that brings the end-to-end latency to under 20 minutes. Low latency monetization log ingestion using Siphon is a key part of this. Solution: Onboarded to Microsoft’s Databus (Siphon), which is powered by Azure HDInsight Kafka. Multiple Kafka clusters along with Spark Streaming are deployed across various regions in a multi-tenant fashion to process millions of events/second. Result: Through Siphon + HDInsight Kafka and Spark Streaming, Bing Ads is able to process 50K events per second, 7 TB per day, with 60 seconds latency. Guarantees on data completion before the next set of events is processed. Helped reduce costs and increase Bing Ads revenue by a multiplier factor.
BingAds Marketing Campaigns
Office 365 Customer Fabric is a shared near real time analysis of customer and service data that enables anyone in O365 to quickly and deeply understand our customers. Scenario: Collect hundreds of service logs from O365 services with minutes of latency. Meet O365 compliance requirements. Handle EUII scrubbing (hashing/encryption). Solution: Made possible through Microsoft’s Databus (Siphon), which is powered by Azure HDInsight Kafka. Multiple Kafka clusters along with Spark Streaming are deployed across various regions in a multi-tenant fashion to process millions of events/second. Result: Sub-minute latency for Customer Fabric results. O365 compliance requirements met through Azure HDInsight. Scalable pub-sub by leveraging the power of the cloud with Azure HDInsight.
Realtime Analytics Scenarios: Real-time fraud detection; Fleet management and Connected cars; Clickstream analysis; Real-time patient monitoring; Smart grid; Customer behavior in stores; IT Infrastructure and Network monitoring; Real time demand and inventory management; and many more…
Real-time IoT Scenarios: Phone Tracking Across Cell Sites; Connected Car - Remote Management & Diagnostics; Asset Tracking; Fleet Management; Facilities Management; Personnel Tracking & Crowd Control; Ride Sharing; Geofencing; Racecar Telemetry; Connected Manufacturing; and many more…
Connected Car: Scenarios enabled by building a Connected Car Platform; Monetization and business opportunities for key players in the industry; Future possibilities and driving experiences. Sources: McKinsey; Accenture; Corp Strat analysis
Insurers: Pay per use insurance; Behavior based driving contracts; Predictions for driving behavior; Occasion related policies
Telematics & Predictive Services: Use insights from vehicle data to prevent downtime, warranty, and recall issues, and offer new services that improve user experience. Vehicle Health Reports; Maintenance Reminders; Vehicle Alerts; Convenience
Advanced Navigation: Unify navigation data elements like maps, weather, traffic, and parking to deliver optimized routing and location based services. Highly Automated Driving. Maps, Geolocation & Geofencing: Geolocation & geofencing; Contextual POI search; Contextual routing; Geospatial Analysis; Maps for Highly Automated Driving (HAD). Sample scenario: A car knows that its driver leaves for work at 7:55am every day. It detects that the driver has an 8:00am meeting, and selects and suggests a different route based on optimal cell connectivity for the Skype call, rather than the shortest trip duration.
Connected Car Architecture powered by Azure HDInsight. Diagram labels: Long term storage; 4G/5G network cards; Real-time applications; IoT Hubs; Open source Stream Processing on Azure HDInsight; Real-time dashboards; Azure VNet Boundary; Laptop as Gateway
Agenda: Scenarios for realtime Big Data analytics; Break down how to build a streaming pipeline, technologies used and tradeoffs; Industry shift towards open source, challenges and how HDInsight helps; Understand Kafka internals and HDInsight Kafka benefits; Techniques to estimate the resources needed for a streaming system; Deploy a streaming pipeline, walk through a real life example
Big Data Architecture. Pipeline stages: Data Sources; Ingest; Prepare (normalize, clean, etc.); Analyze (stat analysis, ML, etc.); Publish (for programmatic consumption, BI/visualization); Consume (Alerts, Operational Stats, Insights). Layers: Data Consumption (Ingestion); Data Processing; Presentation/Serving Layer
Big Data Architecture. Pipeline stages: Data Sources; Ingest; Prepare (normalize, clean, etc.); Analyze (stat analysis, ML, etc.); Publish (for programmatic consumption, BI/visualization); Consume (Alerts, Operational Stats, Insights).
SPEED LAYER, REALTIME ANALYTICS: Data Processing; Business apps; Custom apps; Sensors + devices; Realtime Machine Learning (Anomaly Detection); PowerBI dashboard; Azure Stream Analytics (Shared with field Ops, customers, MIS, and Engineers); CosmosDB.
BATCH LAYER, BATCH ANALYTICS: HDI + ISVs OLAP for Data Warehousing; HDI Custom ETL Aggregate/Partition; Machine Learning (Spark + Azure ML) (Failure and RCA Predictions); Operational logs; Local DB Logs; Legacy Data.
BIG DATA STORAGE ANALYTICS: Big Data Storage; Big Data Applications; Azure Data Lake Store; CosmosDB; Azure Blob Storage.
INTERACTIVE ANALYTICS: Interactive HDInsight clusters; Data Scientists, BI Analysts
Breaking down the Streaming Space: IT’S CROWDED! Google Millwheel; Azure Stream Analytics
Big Data Streaming patterns. Diagram labels: Business apps; Custom apps; Sensors and devices; Events; High throughput Event Ingestion (~million events/sec); Low latency Complex Event Processing; Long term storage; Real-time applications; Real-time dashboards
Breaking down the Streaming Space (Purpose: Technology).
High throughput Event Ingestion: Apache Kafka; Azure Event Hubs; Amazon Kinesis Firehose.
Complex Event Processing: Apache Storm; Apache Heron; Apache Spark Streaming; Azure Stream Analytics; Microsoft Orleans; Apache Samza; Apache Flink; Apache Kafka Streams; Google Millwheel; Google Cloud Dataflow; Amazon Kinesis Analytics
Big Data Streaming patterns. Diagram labels: Business apps; Custom apps; Sensors and devices; Events; Event Ingestion: Event Hubs; Stream Processing: Azure Stream Analytics; Long term storage; Real-time applications; Real-time dashboards
Streaming patterns on Azure. Diagram labels: Business apps; Custom apps; Sensors and devices; Events; Event Ingestion: Event Hubs; Stream Processing: Azure Stream Analytics; Open Source Services on HDInsight; Long term storage; Real-time applications; Real-time dashboards
Streaming concepts: Realtime streaming really means analyzing data in motion at any given point in time. Windowing semantics: Sliding, Tumbling, and Hopping windows. Processing semantics: At-least once (no messages missed, but duplicates are possible and must be tolerable); At-most once (no duplicates, but messages may be missed); Exactly once (no duplicates or messages missed). https://msdn.microsoft.com/en-us/library/azure/dn835019.aspx
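The windowing terms on this slide can be made concrete with a small sketch. It assumes integer event timestamps in seconds and windows aligned to time zero; the function names are illustrative and do not come from any streaming library. A tumbling window places each event in exactly one fixed, non-overlapping window, while a hopping window of a given size advances by a smaller hop, so one event can fall into several windows. Sliding windows are left out here because their exact definition varies between engines.

```python
from typing import List, Tuple

Window = Tuple[int, int]  # half-open interval [start, end), in seconds


def tumbling_window(ts: int, size: int) -> Window:
    """Return the single fixed, non-overlapping window that contains ts."""
    start = (ts // size) * size
    return (start, start + size)


def hopping_windows(ts: int, size: int, hop: int) -> List[Window]:
    """Return every window of length `size`, starting at a multiple of `hop`, that contains ts."""
    windows = []
    start = (ts // hop) * hop          # latest window start that is <= ts
    while start > ts - size:           # window still covers ts
        if start >= 0:                 # ignore windows that would start before time zero
            windows.append((start, start + size))
        start -= hop
    return sorted(windows)


if __name__ == "__main__":
    print(tumbling_window(125, 60))       # (120, 180)
    print(hopping_windows(125, 60, 30))   # [(90, 150), (120, 180)]
```

With a hop equal to the window size, the hopping case reduces to the tumbling case.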
Choosing a streaming platform: Open Source technologies (Kafka, Spark, Storm) versus managed services (AWS Kinesis, Azure Event Hubs, Azure Stream Analytics).
Performance & Scalability. Open source: elegant scale model, no throttling. Managed services: multi-tenant service, throughput unit limit, message size limit.
Multi Cloud + Hybrid Architectures. Open source: lift and shift model; the same architecture works on either cloud, or on-prem. Managed services: not lift + shift; applications must be migrated to use the technology specific APIs from one platform to the other.
Developer Ecosystem. Open source: Open Source community, Xplat developer familiarity, Java, Scala, C#, Python, notebooks. Managed services: .Net and SQL familiarity.
Support. Open source: strong community, forums, StackOverflow. Managed services: company support.
Ease of use. Open source: cluster management involved. Managed services: no cluster management; pick up and go model.
Companies using Open Source Streaming (Kafka, Storm and Spark)
Challenges with Open Source: Getting up and running is hard. Hiring and retaining the talent is a challenge. Highly trained SREs and livesite experts are needed to ensure zero downtime; each second the pipeline is down means lost data and revenue. No single authority for implementing security and compliance. Not one size fits all: each enterprise has different requirements, and tailoring these technologies requires additional work (and many more…)
Azure HDInsight (99.9% SLA on each technology): Open Source Analytics service for the enterprise
Azure HDInsight: Open source analytics service for the Enterprise. Fully-managed Hadoop and Spark for the cloud; 99.9% SLA; 100% Open Source Hortonworks data platform; Clusters up and running in minutes; Familiar BI tools, interactive open source notebooks; 63% lower TCO than deploying your own Hadoop on-premises*; Scale clusters on demand; Secure Hadoop workloads via Active Directory and Ranger; Compliance for Open Source bits; Best in class monitoring and predictive operations via OMS; Native Integration with leading ISVs. *IDC study “The Business Value and TCO Advantage of Apache Hadoop in the Cloud with Microsoft Azure HDInsight”
Deploy Globally within minutes: Multi Region Availability; Available in >25 regions world-wide; Launched most recently in US West 2 and UK regions; Available in China, Europe and US Government clouds
Security + Compliance to enable OSS for Enterprises. Perimeter Level Security: Virtual Networks; Network Security Groups (firewalls). Authentication: Azure Active Directory; Kerberos authentication. Authorization: Apache Ranger; RBAC for Admin; POSIX ACLs for Data Plane. Data Security: Server-Side encryption at rest; HTTPS/TLS in transit. Related session: Enterprise security and monitoring for big data solutions on Azure HDInsight, Speaker: Saurin Shah, Thursday, September 28 4:00 PM
Rich Developer Ecosystem: Plugins for HDI are available for the most popular IDEs for agile development and debugging. A unique feature of the IntelliJ plugin is remote debugging of Spark jobs running on the HDInsight cluster. Rich support for powerful notebooks used by data scientists. Develop in C#, deploy on Linux in Java via HDI developed SCP.Net technology. New: Remote Debugging through IntelliJ: https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-intellij-tool-debug-remotely-through-ssh https://www.youtube.com/watch?v=wQtj_wjn1Ac
HDInsight Applications – Data Science: https://aka.ms/hdi-scn-datascience Related sessions: Building modern data pipelines with Spark on Azure HDInsight, Speaker: Maxim Lukiyanov, Tuesday, September 26 12:30 PM; Patterns, Architecture, & Best Practices: Scaling Machine Learning Algorithms with Azure HDInsight, Speaker: Xiaoyong Zhu
HDInsight Applications – Data Warehousing: https://aka.ms/hdi-scn-warehousing Related session: Building Petabyte scale Interactive Data warehouse in Azure HDInsight, Speakers: Ashish Thapliyal, Dharmesh Kakadia, Wednesday, September 27 4:00 PM
Azure HDInsight Kafka for HDInsight addition
Big Data Streaming patterns. Diagram labels: Business apps; Custom apps; Sensors and devices; Events; Event Processing; Stream Processing; Long term storage; Real-time applications; Real-time dashboards
Kafka Architecture. Diagram labels: Producer; Consumer; Cluster; TOPIC-01; Apache Zookeeper. (L) marks the leader replica of a partition.
Config: Topics 1; Partitions 3; Nodes/Brokers 3; Replicas 2.
Broker Id 1: Partition 1 Replica 1 (L); Partition 3 Replica 2.
Broker Id 2: Partition 2 Replica 1 (L); Partition 1 Replica 2.
Broker Id 3: Partition 3 Replica 1 (L); Partition 2 Replica 2.
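The broker layout on this slide (1 topic, 3 partitions, 3 brokers, 2 replicas) can be reproduced with a short sketch. It assumes a plain round-robin placement in which partition p's leader sits on broker p and the follower sits on the next broker; Kafka's own assignment logic is more involved, so treat `assign_replicas` and `partitions_still_served` as hypothetical helpers for illustration only.

```python
from typing import Dict, List


def assign_replicas(num_partitions: int, num_brokers: int,
                    replication_factor: int) -> Dict[int, List[int]]:
    """Map each partition to its broker ids; the first broker in each list is the leader."""
    if replication_factor > num_brokers:
        raise ValueError("replication factor cannot exceed the number of brokers")
    layout = {}
    for partition in range(1, num_partitions + 1):
        leader_index = (partition - 1) % num_brokers
        # Followers go on the brokers that follow the leader, wrapping around.
        layout[partition] = [
            ((leader_index + r) % num_brokers) + 1
            for r in range(replication_factor)
        ]
    return layout


def partitions_still_served(layout: Dict[int, List[int]], failed_broker: int) -> List[int]:
    """Return the partitions that keep at least one replica if one broker is lost."""
    return [p for p, brokers in layout.items()
            if any(b != failed_broker for b in brokers)]


if __name__ == "__main__":
    layout = assign_replicas(num_partitions=3, num_brokers=3, replication_factor=2)
    print(layout)                               # {1: [1, 2], 2: [2, 3], 3: [3, 1]}
    print(partitions_still_served(layout, 1))   # [1, 2, 3]
```

With two replicas per partition, losing any single broker still leaves every partition with one copy, which is the point of the second replica in the diagram.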
HDInsight Kafka: Managed Kafka with a 99.9% SLA. Through 4 clicks and 12 minutes, get a managed Kafka cluster that can scale beyond 768 TB (soft limit). Only offering to provide a 99.9% SLA on Kafka uptime. We constantly monitor and fix Kafka failures, VM and disk failures so you can concentrate on writing realtime applications and the higher level pipelines.
HDInsight Kafka: Kafka Rack Awareness for Azure. Kafka was designed with a single dimensional view of a rack. Azure environments provide higher reliability with a 2D rack view with Update Domains and Fault Domains. HDInsight Kafka adds rack awareness support for environments like Azure by spreading out the replicas across update domains and fault domains. This provides the highest levels of Kafka uptime. We also provide tooling for customers to rebalance this: https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-kafka-high-availability Designed carefully such that no change to application code or open source is required.
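To illustrate the two-dimensional rack view, here is a minimal sketch that picks brokers for one partition so that no two replicas share a fault domain or an update domain. This is not HDInsight's actual placement algorithm, which the slide does not describe; the function name, the greedy strategy, and the broker-to-domain mapping are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

Domain = Tuple[int, int]  # (fault_domain, update_domain)


def pick_replica_brokers(broker_domains: Dict[int, Domain],
                         replication_factor: int) -> List[int]:
    """Greedily pick brokers so that replicas never share a fault or update domain."""
    chosen: List[int] = []
    used_fault_domains = set()
    used_update_domains = set()
    for broker in sorted(broker_domains):
        fault_domain, update_domain = broker_domains[broker]
        if fault_domain in used_fault_domains or update_domain in used_update_domains:
            continue  # would co-locate two replicas in one failure or update unit
        chosen.append(broker)
        used_fault_domains.add(fault_domain)
        used_update_domains.add(update_domain)
        if len(chosen) == replication_factor:
            return chosen
    raise ValueError("not enough distinct fault/update domains for the requested replicas")


if __name__ == "__main__":
    # Hypothetical cluster: broker id -> (fault domain, update domain)
    brokers = {1: (0, 0), 2: (0, 1), 3: (1, 0), 4: (1, 1), 5: (2, 2)}
    print(pick_replica_brokers(brokers, 3))   # [1, 4, 5]
```

A greedy pass like this can fail even when a valid placement exists, so a real implementation would need a more careful search; the sketch only shows the constraint being enforced.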
HDI Kafka + Azure Managed Disks
HDInsight Kafka: Out of the box integration with Azure Operations Management Suite (OMS). Azure OMS provides a rich experience for alerting, monitoring, and automated runbooks against threshold metrics. HDInsight Kafka exposes everything from VM NIC and disk-level metrics to Kafka’s JMX metrics. Through a single dashboard, monitor not only the cluster but the end-to-end streaming pipeline for predictive maintenance. For example, say a disk is getting 80% full: not only can you alert on this metric, you can also automatically trigger a runbook to add nodes to the cluster.
HDInsight Kafka: Connect streaming pipelines with Virtual Networks. HDInsight Kafka in the cloud is deployed in an Azure VNet for providing the highest layer of data security. Build on the VNet security by either: deploying applications in the VNet; joining the VNet from on-prem through a secure VPN; or peering two Azure VNets together. HDInsight has rich templates for one click deploy for all of the above. https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-kafka-connect-vpn-gateway
HDInsight Kafka: Extend from on-prem to HDInsight Kafka. Diagram label: Azure VNet Boundary. https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-kafka-connect-vpn-gateway
HDInsight Kafka: Kafka Replication with MirrorMaker. Kafka is often deployed in multiple environments for disaster recovery, high availability, and on-prem to cloud hybrid scenarios. These require replication of data from one Kafka cluster to another. HDInsight has worked closely with enterprise customers to understand this need, and provides support for data replication scenarios through Apache Kafka's MirrorMaker. https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-kafka-mirroring
HDI Kafka Customer Success "Toyota manufactures millions of cars running globally, and building a connected car platform to process real-time data at Toyota scale is a monumental challenge. To process events at Toyota’s scale, technologies such as Kafka need to be leveraged. Since HDInsight is the only managed platform that provides Kafka as a managed service with a 99.9% SLA, Toyota was able to leverage the scalable technology of Kafka, Storm and Spark on Azure HDInsight. Using the HDInsight platform, we were able to deploy enterprise grade streaming pipelines to process events from millions of cars every second. This is just scratching the surface - the future of global connected cars on Azure HDInsight is bright, and we are excited for what's in store." --Vijay Chemuturi, Chief Product Owner, Toyota Connected https://azure.microsoft.com/en-us/blog/announcing-public-preview-of-apache-kafka-on-hdinsight-with-azure-managed-disks
Size estimate calculations. Typical bottlenecks in Big Data systems: CPU; Memory; Network; Disks + Storage. Kafka uses the filesystem cache, hence CPU and memory are often not a bottleneck for ingesting millions of events/sec. Network and disk often do become the bottleneck; to balance the total throughput, tradeoffs between these often need to be made.
Size estimate calculations - Kafka
Inputs: Message rate: 10,000 messages/sec. Message size: 150 KB upper bound. Replica count: 3. Retention policy: 12 hours.
Known inputs through perf runs: D12V2 VMs; network limit: 450 MB/sec per node.
Total throughput: 10,000 messages/sec * 150 KB/message * 3 replicas = 4500 MB/sec.
Nodes needed from a network perspective: 4500 MB/sec / 450 MB/sec per node = 10 nodes.
Nodes needed from a storage perspective: 4500 MB/sec * 12 hours (43,200 sec) = 194.4 TB over the retention period. Each VM can attach 16 disks of 1 TB each (16 TB per node), so 194.4 TB / 16 TB per node = 12.15, rounded up to 13 nodes.
Final number of nodes: MAX(network nodes, storage nodes) = 13 nodes with 16 managed disks on each node.
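The sizing arithmetic above can be wrapped in a small helper so that other message rates or retention policies are easy to try. This is a sketch under the slide's own assumptions: decimal units (1 MB = 1,000 KB, 1 TB = 1,000,000 MB), a fixed per-node network limit, and a fixed number of equally sized disks per node. The function and parameter names are illustrative, not part of any HDInsight tooling.

```python
import math


def kafka_nodes_needed(messages_per_sec: float, message_kb: float, replicas: int,
                       retention_hours: float, node_network_mb_per_sec: float,
                       disks_per_node: int, disk_tb: float) -> dict:
    """Estimate broker count as the larger of the network-bound and storage-bound counts."""
    # Replicated write throughput in MB/sec (1 MB = 1,000 KB, as on the slide).
    throughput_mb_per_sec = messages_per_sec * message_kb * replicas / 1000
    network_nodes = math.ceil(throughput_mb_per_sec / node_network_mb_per_sec)

    # Data retained over the retention period, in TB (1 TB = 1,000,000 MB).
    storage_tb = throughput_mb_per_sec * retention_hours * 3600 / 1_000_000
    storage_nodes = math.ceil(storage_tb / (disks_per_node * disk_tb))

    return {
        "throughput_mb_per_sec": throughput_mb_per_sec,
        "network_nodes": network_nodes,
        "storage_tb": storage_tb,
        "storage_nodes": storage_nodes,
        "nodes": max(network_nodes, storage_nodes),
    }


if __name__ == "__main__":
    estimate = kafka_nodes_needed(
        messages_per_sec=10_000, message_kb=150, replicas=3, retention_hours=12,
        node_network_mb_per_sec=450, disks_per_node=16, disk_tb=1,
    )
    print(estimate["network_nodes"], estimate["storage_nodes"], estimate["nodes"])  # 10 13 13
```

In this example storage is the binding constraint; shortening the retention policy to 8 hours would bring the storage-bound count down to 9 nodes and make the network the limit instead.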
Demo: Build Streaming pipeline with Kafka + Spark Streaming
Demo Structure. Diagram labels: Azure VNet Boundary; Long term storage. Steps: 1. Create Topic. 2. Publish NYC Taxi Events. 3. Stream events in a tumbling window. 4. Analyze in realtime SQL.
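The shape of steps 3 and 4 can be sketched without a cluster. The example below is a stand-in, not the demo's code: a Python list plays the role of events consumed from the Kafka topic, the in-memory `sqlite3` database plays the role of the SQL engine, and the field names `pickup_ts` and `fare` as well as the 60-second window are hypothetical. The real demo uses Kafka and Spark Streaming on HDInsight.

```python
import sqlite3

WINDOW_SECONDS = 60  # hypothetical tumbling window length

# Stand-in for taxi events read from the topic (timestamps in seconds).
events = [
    {"pickup_ts": 5, "fare": 10.0},
    {"pickup_ts": 42, "fare": 14.0},
    {"pickup_ts": 61, "fare": 8.0},
    {"pickup_ts": 119, "fare": 12.0},
    {"pickup_ts": 130, "fare": 20.0},
]


def summarize_by_window(events, window_seconds):
    """Bucket events into tumbling windows, then aggregate each window with SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE trips (window_start INTEGER, fare REAL)")
    conn.executemany(
        "INSERT INTO trips VALUES (?, ?)",
        # Step 3: each event lands in exactly one window, keyed by the window start.
        [((e["pickup_ts"] // window_seconds) * window_seconds, e["fare"]) for e in events],
    )
    # Step 4: trip count and average fare per window.
    rows = conn.execute(
        "SELECT window_start, COUNT(*), AVG(fare) FROM trips "
        "GROUP BY window_start ORDER BY window_start"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in summarize_by_window(events, WINDOW_SECONDS):
        print(row)
    # (0, 2, 12.0)
    # (60, 2, 10.0)
    # (120, 1, 20.0)
```

In a streaming engine the same grouping runs continuously as events arrive, and each window's result is emitted once the window closes rather than after all data has been loaded.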
Microsoft Databus (Siphon) Usage: 8 million events per second peak ingress; 800 TB ingress per day (10 GB per sec); 1,800 production Kafka brokers; 450 topics; 15 sec 99th percentile latency. Key customer scenarios: Ads Monetization (Fast BI); O365 Customer Fabric NRT – Tenant & User insights; BingNRT Operational Intelligence; Presto (Fast SML) interactive analysis; Delve Analytics
Microsoft Databus Architecture. Diagram labels: Open Source; Microsoft Internal; Siphon
HDInsight Streaming investments: Continued innovation for making HDInsight Kafka, Storm and Spark Streaming more secure and compliant for our enterprise customers; Richer monitoring experience to further reduce the operational cost of Open Source Streaming
HDInsight Streaming investments: Close collaboration with the Open Source Streaming Community
Resources
HDInsight developer academy + guide: https://academy.microsoft.com/en-us/professional-program/big-data/ https://github.com/hdinsight/hdinsight-dev-guide/blob/master/HDInsight%20Developer%20Guide.pdf
HDInsight Streaming resources
Getting up and running with Apache Kafka on Azure HDInsight: Deploy Kafka on Azure HDInsight with one click; Configure Scalability with Managed Disks; Configure High availability with Rack awareness.
Build realtime pipelines with Spark Streaming and Storm with one click: Use Kafka with Storm on Azure HDInsight; Use Kafka with Spark Streaming (DStreams) on Azure HDInsight; Use Kafka with Spark Structured Streaming on Azure HDInsight.
Monitoring, Debugging + Extensions: Learn how to use HDInsight Kafka's integration with Azure Monitoring; Connect to HDInsight Kafka from an on-premises network, or a development environment, using Azure Virtual Networks; Use MirrorMaker to replicate data from on-premises, or another Kafka instance, to and from HDInsight Kafka.
Other: https://www.pluralsight.com/courses/spark-kafka-cassandra-applying-lambda-architecture Toyota Blog: https://azure.microsoft.com/en-us/blog/announcing-public-preview-of-apache-kafka-on-hdinsight-with-azure-managed-disks/
Thank you! Questions + Feedback: Raghav Mohan, ramoha@Microsoft.com. Please fill out the speaker evaluations!
HDInsight Ignite sessions
Building modern data pipelines with Spark on Azure HDInsight. Speaker: Maxim Lukiyanov. Tuesday, September 26 12:30 PM
Building Petabyte scale Interactive Data warehouse in Azure HDInsight. Speakers: Ashish Thapliyal, Dharmesh Kakadia. Wednesday, September 27 4:00 PM
Enterprise security and monitoring for big data solutions on Azure HDInsight. Speaker: Saurin Shah. Thursday, September 28 4:00 PM
Streaming Big Data on Azure with HDInsight Kafka, Storm and Spark. Speaker: Raghav Mohan. Thursday, September 28 9:00 AM
Tweet to us with #HDInsight, and come check out the HDInsight Booth for SWAG!