1
Realtime Streaming on HDInsight
Microsoft Ignite 2016, 10/28/2017 12:34 AM
DA233
Raghav Mohan, Program Manager, Azure Big Data
© 2016 Microsoft Corporation. All rights reserved. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
2
Agenda
Explore the Big Data Streaming space and understand the technologies
Build a near real-time, enterprise-ready IoT solution on HDInsight
Examine customer journeys on HDInsight Streaming
3
Big Data Streaming use cases
4
Big Data Streaming Scenarios
Real-time Fraud Detection
Streaming ETL
Predictive Maintenance
Call Center Analytics
IT Infrastructure and Network Monitoring
Customer Behavior Prediction
Log Analytics
Real-time Cross Sell Offers
Fleet monitoring and Connected Cars
Real-time Patient Monitoring
Smart Grid
Real-time Marketing
and many more…
5
Real-time IoT Scenarios
Phone Tracking Across Cell Sites
Connected Car - Remote Management & Diagnostics
Asset Tracking
Fleet Management
Facilities Management
Personnel Tracking & Crowd Control
Ride Sharing
Geofencing
Racecar Telemetry
Connected Manufacturing
and many more…
6
Simple IoT Setup
[Diagram labels: Devices (Arduino Uno); Gateways (Laptop as Gateway); IoT Hubs; Stream Processing (Azure Stream Analytics, Azure HDInsight)]
7
Building a Robust Big Data Architecture
8
Big Data Architecture
Pipeline stages: Data Sources; Ingest; Prepare (normalize, clean, etc.); Analyze (stat analysis, ML, etc.); Publish (for programmatic consumption, BI/visualization); Consume (Alerts, Operational Stats, Insights)
Layers: Data Consumption (Ingestion); Data Processing; Presentation/Serving Layer
9
Lambda Architecture
Pipeline stages: Data Sources; Ingest; Prepare (normalize, clean, etc.); Analyze (stat analysis, ML, etc.); Publish (for programmatic consumption, BI/visualization); Consume (Alerts, Operational Stats, Insights)
Layers: Data Consumption (Ingestion); Data Processing, split into a Speed Layer and a Batch Layer; Presentation/Serving Layer
10
Lambda Architecture
Pipeline stages: Data Sources; Ingest; Prepare (normalize, clean, etc.); Analyze (stat analysis, ML, etc.); Publish (for programmatic consumption, BI/visualization); Consume (Alerts, Operational Stats, Insights)
Layers: Data Consumption (Ingestion); Speed Layer; Batch Layer; Presentation/Serving Layer
11
Big Data Streaming patterns
[Diagram labels: event sources (Business apps, Custom apps, Sensors and devices); Event Processing (Event Hubs); Stream Processing (Azure Stream Analytics); outputs (Power BI, Azure Data Lake Store)]
12
Streaming concepts
Realtime streaming data really means analyzing data in motion at any given point in time.
Concepts we will quickly review:
Windowing Semantics
Processing Semantics
13
Sliding Windows
A 10-second Window with a 5-second “Slide”
Slide.interval: 5 seconds
Window.size: 10 seconds
Example query: every 5 seconds give me the count of tweets over the last 10 seconds
[Figure: tweets on a timeline from 0 to 30 seconds, grouped into overlapping 10-second windows that advance every 5 seconds]
14
Tumbling Windows
A 10-second Tumbling Window
Slide.interval: 10 seconds
Window.size: 10 seconds
Example query: every 10 seconds give me the count of tweets over the last 10 seconds
[Figure: tweets on a timeline from 0 to 30 seconds, grouped into non-overlapping 10-second windows]
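Both window types can be expressed with one function in which only the slide interval changes. The following is a minimal Java sketch, not taken from the talk: it assumes each tweet is represented only by its arrival time in seconds, and the class name `WindowCounts`, the method `countPerWindow`, and the sample timestamps are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

public class WindowCounts {
    /**
     * Counts events per window. Each window covers [end - size, end) and one
     * result is emitted every `slide` seconds, for end = slide, 2 * slide, ...
     * up to `horizon`. slide == size gives tumbling windows; slide < size
     * gives sliding (overlapping) windows.
     */
    public static List<Integer> countPerWindow(int[] timestamps, int size, int slide, int horizon) {
        List<Integer> counts = new ArrayList<>();
        for (int end = slide; end <= horizon; end += slide) {
            int start = end - size;
            int count = 0;
            for (int t : timestamps) {
                if (t >= start && t < end) {
                    count++;
                }
            }
            counts.add(count);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical tweet arrival times in seconds (not the slide's data).
        int[] tweets = {1, 3, 4, 7, 9, 12, 13, 18, 21, 22, 27};
        System.out.println("sliding : " + countPerWindow(tweets, 10, 5, 30));
        System.out.println("tumbling: " + countPerWindow(tweets, 10, 10, 30));
    }
}
```

With these sample timestamps the sliding query yields [3, 5, 4, 3, 3, 3], one result every 5 seconds with tweets counted in up to two overlapping windows, and the tumbling query yields [5, 3, 3], where every tweet is counted exactly once.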
15
Stream Processing Semantics
At-Least Once Processing (Duplicates are tolerable)
At-Most Once Processing (No duplicate processing)
Exactly Once Processing (This is hard to achieve in practice)
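One common way to get exactly-once results on top of at-least-once delivery is to make the consumer idempotent. The sketch below is an illustration under stated assumptions rather than anything shown in the talk: it assumes every message carries a unique id, and the class `DeliverySemantics`, its `Message` type, and the sample stream are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DeliverySemantics {
    /** A message with a unique id, so that a redelivery can be recognized. */
    public static final class Message {
        final String id;
        final int amount;

        public Message(String id, int amount) {
            this.id = id;
            this.amount = amount;
        }
    }

    /** At-least-once: every delivery is applied, so a redelivered message counts twice. */
    public static int applyEveryDelivery(List<Message> deliveries) {
        int total = 0;
        for (Message m : deliveries) {
            total += m.amount;
        }
        return total;
    }

    /** At-least-once delivery plus an idempotent consumer: each id is applied only once. */
    public static int applyOncePerId(List<Message> deliveries) {
        Set<String> seen = new HashSet<>();
        int total = 0;
        for (Message m : deliveries) {
            if (seen.add(m.id)) { // add() returns false for an id that was already processed
                total += m.amount;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical stream: m2 is delivered twice because its first ack was lost.
        List<Message> deliveries = Arrays.asList(
                new Message("m1", 10),
                new Message("m2", 20),
                new Message("m2", 20),
                new Message("m3", 5));
        System.out.println(applyEveryDelivery(deliveries)); // 55
        System.out.println(applyOncePerId(deliveries));     // 35
    }
}
```

The set of seen ids is kept in memory here; a real pipeline would have to persist it (or rely on transactional writes), which is part of why exactly-once is hard to achieve in practice.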
16
Azure Big Data Streaming patterns
[Diagram labels: event sources (Business apps, Custom apps, Sensors and devices); Event Processing (Event Hubs); Stream Processing (Azure Stream Analytics); outputs (Power BI, ADLS); Open Source Services]
17
Choosing an ingestion platform
Compared: HDI Kafka vs. Azure Event Hubs
Managed: Yes
Ordering
Delivery: At-least-once
Lifetime: Configurable | 1–30 days
Replication: Configurable (cross DC)
Throughput: *nodes | 20 throughput units
MapReduce: No
Record size: 256K
Cost: Low - Medium | Low
18
Choosing a stream processing platform
Compared: Azure Stream Analytics vs. HDInsight Open Source (Storm + Spark Streaming)
Managed: Yes
Temporal operators
  Azure Stream Analytics: Windowed aggregates and temporal joins are supported out of the box
  HDInsight: Temporal operators must be implemented
Development experience
  Azure Stream Analytics: Interactive authoring and debugging experience through Azure Portal
  HDInsight: Visual Studio, etc.
Data encoding formats
  Azure Stream Analytics: Requires the UTF-8 data format
  HDInsight: Any data encoding format may be implemented via custom code
Scalability
  Azure Stream Analytics: Number of Streaming Units for each job. Each Streaming Unit processes up to 1 MB/s. Max of 50 units by default. Call to increase limit
  HDInsight: Number of nodes in the HDI Spark cluster. No limit on number of nodes (top limit defined by your Azure quota). Call to increase limit
Data processing limits
  Azure Stream Analytics: Users can scale the number of Streaming Units up or down to increase data processing or optimize costs. Scale up to 1 GB/s
  HDInsight: Users can scale cluster size up or down to meet needs
Late arrival and out of order event handling
  Azure Stream Analytics: Built-in configurable policies to reorder, drop events or adjust event time
  HDInsight: User must implement logic to handle this scenario
19
Choosing a streaming platform
Compared: Open Source on HDI (Kafka, Spark, Storm) vs. Azure Event Hubs, Azure Stream Analytics
Performance & Scalability
  Open Source on HDI: Elegant scale model; no throttling
  Azure Event Hubs, Azure Stream Analytics: Multi-tenant service; throughput unit limit; message size limit
Developer Ecosystem
  Open Source on HDI: Open Source community, Xplat developer familiarity, Java, Scala, C#, Python, notebooks
  Azure Event Hubs, Azure Stream Analytics: .NET and SQL familiarity
Support
  Open Source on HDI: Strong community, Forums, StackOverflow
  Azure Event Hubs, Azure Stream Analytics: Microsoft support
Ease of use
  Open Source on HDI: Cluster management involved
  Azure Event Hubs, Azure Stream Analytics: No cluster management. Pick up and go model
20
Big Data Streaming patterns
[Diagram labels: event sources (Business apps, Custom apps, Sensors and devices); Event Processing (Event Hubs); Stream Processing (Azure Stream Analytics); outputs (Power BI, HBase, Azure Data Lake Store); Open Source Services]
21
Open source powers a lot of enterprises
This is just scratching the surface. Full lists are on the “Powered By” pages for Spark, Storm and Kafka.
Companies shown under the Kafka, Storm and Spark columns: LinkedIn, Alibaba, Uber, Netflix, Yelp, Amazon, Microsoft, Twitter, Groupon, Pinterest, Yahoo, Oracle, Spotify, Alibaba
22
Open source is great, but …
Hard to set up, run and scale
Not secure or compliant
Community supported; other models have clunky support
Huge cost to operationalize, monitor and fix issues 24/7
23
Azure HDInsight: Cloud Hadoop and Spark as a Service on Azure
Fully-managed Hadoop and Spark for the cloud with 99.9% SLA
100% Open Source Hortonworks data platform
Clusters up and running in minutes
Enterprise level monitoring and alerting with Operations Management Suite
Familiar BI tools for analysis, or open source notebooks for interactive data science
63% lower TCO than deploying your own Hadoop on-premises*
24
Azure HDInsight Compliance for Open Source baked in
Leverage leading ISVs with HDInsight via one-click
25
HDInsight Offerings
26
Big Data Streaming patterns
[Diagram labels: event sources (Business apps, Custom apps, Sensors and devices); Stream Buffering; Stream Processing; outputs (Power BI, ADLS)]
27
Big Data Streaming patterns
[Diagram labels: Azure Operations Management Suite; event sources (Business apps, Custom apps, Sensors and devices); Event Processing; Stream Processing; outputs (Power BI, ADLS)]
More Enterprise Features to come…
28
What is Kafka?
High throughput, low-latency message queueing
Developed at LinkedIn, now part of Apache OSS
Simple Publish-Subscribe messaging model (much like Dish subscribers)
Robust, scalable architecture
Data is partitioned and replicated across the whole cluster
Scales linearly as throughput increases
29
Kafka Architecture
Config: Topics 1; Partitions 3; Nodes/Brokers 3; Replicas 1
Broker Id 1: Partition 1 Replica 1 (L)
Broker Id 2: Partition 2 Replica 1 (L)
Broker Id 3: Partition 3 Replica 1 (L)
Clients: one Producer, three Consumers
Coordination: Apache Zookeeper Cluster
(L) marks the leader replica of a partition.
30
Kafka Architecture
Config: Topics 1; Partitions 3; Nodes/Brokers 3; Replicas 2
Broker Id 1: Partition 1 Replica 1 (L); Partition 3 Replica 2
Broker Id 2: Partition 2 Replica 1 (L); Partition 1 Replica 2
Broker Id 3: Partition 3 Replica 1 (L); Partition 2 Replica 2
Clients: one Producer, three Consumers
Coordination: Apache Zookeeper Cluster
(L) marks the leader replica of a partition.
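The two-replica layout on this slide follows a simple round-robin pattern: the first replica of partition p sits on broker p, and the second replica sits on the next broker, wrapping around. The Java sketch below reproduces that pattern; it is a simplified illustration, not Kafka's actual assignment logic (which is more involved and can, for example, take racks into account), and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class ReplicaPlacement {
    /**
     * Returns, for each partition in order, the ids of the brokers that hold
     * its replicas. Brokers are numbered from 1. The first id in each inner
     * list is treated as the leader, mirroring the "(L)" marker on the slide.
     */
    public static List<List<Integer>> assign(int partitions, int brokers, int replicas) {
        if (replicas > brokers) {
            throw new IllegalArgumentException("replicas must not exceed brokers");
        }
        List<List<Integer>> assignment = new ArrayList<>();
        for (int p = 0; p < partitions; p++) {
            List<Integer> brokerIds = new ArrayList<>();
            for (int r = 0; r < replicas; r++) {
                // Replica r of partition p goes r brokers after the partition's first broker.
                brokerIds.add((p + r) % brokers + 1);
            }
            assignment.add(brokerIds);
        }
        return assignment;
    }

    public static void main(String[] args) {
        List<List<Integer>> layout = assign(3, 3, 2);
        for (int p = 0; p < layout.size(); p++) {
            System.out.println("Partition " + (p + 1) + " -> brokers " + layout.get(p));
        }
    }
}
```

For 3 partitions, 3 brokers and 2 replicas this gives brokers [1, 2], [2, 3] and [3, 1], which matches the slide: each broker leads one partition and holds a follower copy of another, so losing any single broker leaves every partition with one live replica.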
31
HDInsight Kafka Advantages
Native Azure Operations Management Suite support
Rack awareness for Azure environments
Only platform in the industry to provide managed Kafka with a 99.9% SLA
Cross-datacenter support for disaster recovery
32
Stream Processing
33
Apache Storm introduction
Tuple: core unit of data; an immutable set of key/value pairs
Stream: unbounded sequence of Tuples
Spout: source of streams; wraps a streaming data source and emits Tuples
34
Spout API
Lifecycle API: open(), close(), activate(), deactivate()
public interface ISpout extends Serializable {
    void open(Map conf, TopologyContext context, SpoutOutputCollector collector);
    void close();
    void activate();
    void deactivate();
    void nextTuple();
    void ack(Object id);
    void fail(Object id);
}
35
Spout API
Core API: nextTuple()
public interface ISpout extends Serializable {
    void open(Map conf, TopologyContext context, SpoutOutputCollector collector);
    void close();
    void activate();
    void deactivate();
    void nextTuple();
    void ack(Object id);
    void fail(Object id);
}
36
Spout API
Reliability API: ack(), fail()
public interface ISpout extends Serializable {
    void open(Map conf, TopologyContext context, SpoutOutputCollector collector);
    void close();
    void activate();
    void deactivate();
    void nextTuple();
    void ack(Object id);
    void fail(Object id);
}
37
Using Bolts
Bolt (compute): core functions of a streaming computation; receives tuples and does stuff
Write to a data store
Read from a data store
Perform arbitrary computation
(Optionally) Emit additional streams
38
Bolt API
Lifecycle API: prepare(), cleanup()
public interface IBolt extends Serializable {
    void prepare(Map stormConf, TopologyContext context, OutputCollector collector);
    void cleanup();
    void execute(Tuple input);
}
39
Bolt API
Core API: execute()
public interface IBolt extends Serializable {
    void prepare(Map stormConf, TopologyContext context, OutputCollector collector);
    void cleanup();
    void execute(Tuple input);
}
40
Topologies
41
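Define topology: count Twitter tweets by language
TopologyBuilder b = new TopologyBuilder();
// The spout emits tweets; each tuple carries a "lang" field.
b.setSpout("TwitterSampleSpout", new TwitterSampleSpout());
// Fields grouping sends all tuples with the same "lang" value to the same bolt task.
b.setBolt("LanguageCountBolt", new LanguageCountBolt())
    .fieldsGrouping("TwitterSampleSpout", new Fields("lang"))
    .setNumTasks(7);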
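The reason the topology groups on the "lang" field is that a per-language count is only correct if every tweet in a given language reaches the same bolt task. A minimal Java sketch of that routing idea follows; it does not use the Storm library, the hash-modulo rule is a stand-in for whatever Storm does internally, and the class name, method names, and sample values are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class FieldsGroupingSketch {
    /** Maps a field value to a task index in [0, numTasks); equal values get equal indexes. */
    public static int taskFor(String fieldValue, int numTasks) {
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    /** Routes each language code to its task and returns, per task, a count for each language. */
    public static Map<Integer, Map<String, Integer>> route(List<String> langs, int numTasks) {
        Map<Integer, Map<String, Integer>> countsByTask = new TreeMap<>();
        for (String lang : langs) {
            countsByTask
                    .computeIfAbsent(taskFor(lang, numTasks), t -> new TreeMap<>())
                    .merge(lang, 1, Integer::sum);
        }
        return countsByTask;
    }

    public static void main(String[] args) {
        // Hypothetical "lang" values emitted by the spout; 7 tasks as in the topology above.
        List<String> langs = Arrays.asList("en", "fr", "en", "ja", "en", "fr");
        System.out.println(route(langs, 7));
    }
}
```

Several languages may share one task, but a single language is never split across tasks, so each task's local count for a language is the complete count.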
42
Apache Storm Trident
Extension of Storm, built for stateful stream processing and low-latency querying
Fluent, stream-oriented API
Merges and joins
High-level abstraction
Built on Storm’s core primitives
Built for aggregation, groupings, functions, & filters
Uses micro-batching
Helps achieve Exactly Once processing with Storm
43
Storm on HDInsight
Only platform to offer Apache Storm as a managed Service with % SLA
Cost savings via need-based scaling for clusters and topologies
.NET development for Linux clusters via SCP.Net (Microsoft-developed)
Native Operations Management Suite integration
44
Apache Spark as a platform
[Diagram: Spark SQL, Spark Streaming, GraphX (graph) and MLlib (machine learning) on top of Apache Spark, on top of Data Sources]
45
Apache Spark Takeaways (amongst many)
Achieve Big Data tasks from one platform: data cleaning, ETL, streaming, machine learning, interactive queries
Extremely fast: built using in-memory constructs such as Resilient Distributed Datasets, and DataFrames (Spark 2.0+)
Easy to use: rich, powerful interactive notebooks for data scientists; Scala and Python support for developers
46
Spark on Azure HDInsight
Fully Managed Service
100% open source Apache Spark and Hadoop bits
Fully supported by Microsoft and Hortonworks
99.9% Azure Cloud SLA
Certifications: PCI, ISO 27018, SOC, HIPAA, EU-MC
Optimized for data exploration, experimentation and development
Jupyter Notebooks (Scala, Python, automatic data visualizations)
IntelliJ and Eclipse plugins (job submission, remote debugging)
ODBC connector for Power BI, Tableau, Qlik, SAP, Excel, etc.
47
Integrated with Azure data pipelines
Batch pipelines: Azure Data Factory orchestrates Spark ETL jobs; PowerBI connector for Spark; HA of job submission service
Streaming pipelines: High Availability for Spark Streaming using Yarn; Azure connectors: EventHub, Power BI, Azure SQL, DW, Data Lake
Github: Blog: saving-spark-dataframe-to-powerbi/
49
Spark Streaming
Spark offers two streaming approaches:
Discretized Streams (DStreams)
Structured Streaming (Spark 2.0+)
50
Limitations of DStream APIs
Processing data using event time: DStreams natively support batch time only, which makes it hard to process late-arriving data
Interoperability between streaming and batch: while the APIs are already similar, manual translation is still needed between the two
Exactly-once processing semantics are not guaranteed: exactly-once semantics depend on the Receiver implementation and are not always guaranteed
Checkpointing efficiency: checkpointing saves the complete RDD; no incremental storage is possible
51
Structured streaming
Static DataFrame and Continuous DataFrame: a single API!
The simplest way to perform streaming analytics is not having to reason about streaming.
52
Structured streaming
High-level streaming API based on Datasets/DataFrames: event time, windowing, sessions, sources & sinks; end-to-end exactly once processing semantics
Unifies streaming, interactive and batch queries: aggregate data in a stream, then serve using JDBC; add, remove, change queries at runtime; build and apply ML models
53
Batch ETL pipeline
// Read from static json file
val input = spark.read
  .format("json")
  .load("source-path")

// Do query
val output = input
  .select("clientid", "querytime")
  .where("querytime > 100")

// Write to static json file
output.write
  .format("json")
  .save("dest-path")
54
Streaming pipeline
// readStream…load() creates a streaming DataFrame; it does not start any computation
val input = spark.readStream
  .format("json")
  .load("source-path")

val output = input
  .select("clientid", "querytime")
  .where("querytime > 100")

// writeStream…start() specifies where to output the data and starts processing
output.writeStream
  .format("json")
  .start("dest-path")
55
Continuous aggregations
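// Continuously compute average query time across all clients
// (the slide abbreviates this as input.avg; a DataFrame aggregates through groupBy() or agg())
input.groupBy().avg("querytime")

// Continuously compute average query time of each type of device manufacturer
input.groupBy("devicemake").avg("querytime")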
56
Continuous windowed aggregations
import org.apache.spark.sql.functions.window
import spark.implicits._  // enables the $"column" syntax

input.groupBy($"devicemake", window($"event-time", "10 minutes"))
  .avg("querytime")

Continuously compute average query time of each type of device manufacturer in the last 10 minutes using event time
Simplifies event-time stream processing (not possible in DStreams)
Works in both streaming and batch jobs
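What grouping by an event-time window means can be shown without Spark. The Java sketch below is an illustration under stated assumptions: it uses tumbling windows keyed by the start of the window an event's own timestamp falls into, so a late-arriving event is still credited to the right window. The class `EventTimeWindows`, its `Event` type, the "make@windowStart" key format, and the sample data are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class EventTimeWindows {
    /** One query measurement, stamped with the time at which it happened on the device. */
    public static final class Event {
        final String deviceMake;
        final long eventTimeSec;
        final double queryTime;

        public Event(String deviceMake, long eventTimeSec, double queryTime) {
            this.deviceMake = deviceMake;
            this.eventTimeSec = eventTimeSec;
            this.queryTime = queryTime;
        }
    }

    /** Average query time per "make@windowStart", using tumbling windows on event time. */
    public static Map<String, Double> averageByMakeAndWindow(List<Event> events, long windowSec) {
        Map<String, double[]> sumAndCount = new TreeMap<>();
        for (Event e : events) {
            // The window is chosen from the event's own timestamp, not from its arrival order.
            long windowStart = Math.floorDiv(e.eventTimeSec, windowSec) * windowSec;
            double[] acc = sumAndCount.computeIfAbsent(
                    e.deviceMake + "@" + windowStart, k -> new double[2]);
            acc[0] += e.queryTime;
            acc[1] += 1;
        }
        Map<String, Double> averages = new TreeMap<>();
        for (Map.Entry<String, double[]> entry : sumAndCount.entrySet()) {
            averages.put(entry.getKey(), entry.getValue()[0] / entry.getValue()[1]);
        }
        return averages;
    }

    public static void main(String[] args) {
        // Listed in arrival order; the last event arrives late but belongs to window [0, 600).
        List<Event> events = Arrays.asList(
                new Event("acme", 30, 100.0),
                new Event("acme", 650, 300.0),
                new Event("zeta", 40, 50.0),
                new Event("acme", 590, 200.0));
        System.out.println(averageByMakeAndWindow(events, 600));
    }
}
```

With 10-minute (600-second) windows, the late "acme" event at second 590 is averaged with the one at second 30, giving 150.0 for the first window, while the event at second 650 forms its own window with 300.0. Grouping by batch arrival time instead would have put the late event in the wrong bucket.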
57
Joining streams with static data
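Join streaming data from Kafka with a static dataset from a JDBC source to enrich the streaming data

// The slide abbreviates the source as .kafka("device-updates").
// A real job also needs a kafka.bootstrap.servers option, which the slide does not give.
val kafkaDataset = spark.readStream
  .format("kafka")
  .option("subscribe", "device-updates")
  .load()

// The JDBC URL is elided on the slide.
val staticDataset = spark.read
  .jdbc("jdbc://", "device-info", new java.util.Properties())

// Assumes the Kafka records have been parsed into columns that include devicemake.
val joinedDataset = kafkaDataset.join(staticDataset, "devicemake")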
58
Streaming vs. Micro-batching
Pipeline stages: Data Sources; Ingest; Prepare (normalize, clean, etc.); Analyze (stat analysis, ML, etc.); Publish (for programmatic consumption, BI/visualization); Consume (Alerts, Operational Stats, Insights)
Layers: Data Consumption (Ingestion); Data Processing, with a Speed Layer, Micro-Batching and a Batch Layer; Presentation/Serving Layer
60
Spark Micro-batching
[Figure: events on a timeline (secs) are grouped by batch interval into micro-batches RDD1, RDD2, RDD3 and RDD4; one micro-batch is labeled “5”]
61
Compare and contrast
Storm: event-by-event processing; micro-batching to achieve Exactly Once; lower latency without Trident; with Trident, performance comparison is on a tuning basis
Spark Streaming: supports micro-batching only, which adds to latency but may provide higher throughput; lower overhead for Lambda architectures; same API for batch and streaming
Both support windowing, transformations, and exactly-once processing. Evaluate which to use depending on your performance and scale needs and your development ecosystem.
62
Demo: Deploy Kafka and Spark Streaming on HDInsight
63
Putting it all together
64
Simple IoT Setup
[Diagram labels: Devices (Arduino Uno); Gateways (Laptop as Gateway); IoT Hubs; Stream Processing (Azure HDInsight)]
65
Customers
66
Shared Data (Siphon)
67
Siphon Usage
6 million events per second ingress
4 petabytes processed per day
1,700+ production Kafka brokers
10 sec 99th percentile latency
Key customer scenarios: Ads Monetization (Fast BI); O365 Customer Fabric NRT – Tenant & User insights; BingNRT Operational Intelligence; Presto (Fast SML) interactive analysis; Delve Analytics
68
Customer Fabric: Understanding O365 Customers
Customer Fabric is a shared insight repository built on near real time analysis of customer and service data that enables anyone in O365 to quickly and deeply understand our customers.
Key capabilities: provides sophisticated user/tenant SEEC facets (Satisfaction, Engagement, Experience, Core); built on a compliant, NRT, big data platform that scales to every user in the world; easily accessible for anyone in O365 to leverage and contribute to
Key scenario requirements: collect hundreds of service logs from O365 services with minutes of latency; support OSS technology stack (Spark streaming, Cassandra); meet O365 compliance requirements; handle EUII scrubbing (hashing/encryption)
Key benefits from Siphon: scalable pub/sub; O365 compliance; native OSS integration (Spark streaming); auditing infrastructure
69
Siphon Architecture
[Diagram legend: Open Source; Microsoft Internal; Siphon]
70
Data completeness – Audit Trail
Sampled vs Full Auditing support
[Diagram labels: Load Balancer; Collectors; Kafka Brokers (four brokers holding partitions P0 to P2, P3 to P5, P6 to P8 and P9 to P11); High Level Consumer]
71
Continue your Ignite learning path
Visit Channel 9 to access a wide range of Microsoft training and event recordings
Head to the TechNet Eval Centre to download trials of the latest Microsoft products
Visit Microsoft Virtual Academy for free online training
72
Thank you
Chat with me in the Speaker Lounge