1
#SQLSAT454 Azure Stream Analytics [Part of the Data Platform] Marco Parenzan
2
#SQLSAT454 Sponsors
3
#SQLSAT454 Meet Marco Parenzan Microsoft MVP 2015 for Azure. Develops modern distributed and cloud solutions. marco.parenzan@1nn0va.it Passionate about speaking and inspiring programmers, students, people. www.innovazionefvg.net I’m a developer!
4
#SQLSAT454 Agenda Why a developer talks about analytics Analytics in a modern world Introduction to Azure Stream Analytics Stream Analytics Query Language (SAQL) Handling time in Azure Stream Analytics Scaling Analytics Conclusions
5
#SQLSAT454 WHY A DEVELOPER TALKS ABOUT ANALYTICS
6
#SQLSAT454 What is Analytics? From Wikipedia: Analytics is the discovery and communication of meaningful patterns in data. Especially valuable in areas rich with recorded information, analytics relies on the simultaneous application of statistics, computer programming and operations research to quantify performance. Analytics often favors data visualization to communicate insight.
7
#SQLSAT454 IoT proof of concept
8
#SQLSAT454 Event-based systems An event is “something happened… somewhere… sometime!” Events arrive at different times, i.e. they have unique timestamps. Events arrive at different rates (events/sec): in any given period of time there may be 0, 1 or more events.
9
#SQLSAT454 Azure Service Bus
Relay: NAT and firewall traversal service, request/response services, unbuffered with TCP throttling.
Queue: transactional cloud AMQP/HTTP broker; high-scale, high-reliability messaging; sessions, scheduled delivery, etc.
Topic: transactional message distribution; up to 2000 subscriptions per topic; up to 2K/100K filter rules per subscription.
Notification Hub: high-scale notification distribution to most mobile push notification services; millions of notification targets.
Event Hub: hyper scale. A million clients. Concurrent.
10
#SQLSAT454 Azure Event Hubs Event producers: > 1M producers, > 1 GB/sec aggregate throughput. Events are routed to partitions (direct or by hash). Throughput Units: 1 ≤ TUs ≤ partition count; 1 TU = 1 MB/s writes, 2 MB/s reads.
11
#SQLSAT454 Microsoft Azure IoT Services
Devices and device connectivity: Event Hubs, Service Bus, external data sources.
Storage: SQL Database, Table/Blob Storage, DocumentDB.
Analytics: Stream Analytics, Machine Learning, HDInsight, Data Factory.
Presentation & action: Power BI, App Service, Notification Hubs, Mobile Services, BizTalk Services.
12
#SQLSAT454 ANALYTICS IN A MODERN WORLD
13
#SQLSAT454 Traditional analytics Everything around us produces data: from devices, sensors, infrastructures and applications. Traditional Business Intelligence first collects data and analyzes it afterwards, typically with one day of latency (the day after). But we live in a fast-paced world: social media, Internet of Things, just-in-time production. Offline data quickly loses its value: for many organizations, capturing and storing event data for later analysis is no longer enough. Data at Rest
14
#SQLSAT454 Analytics in a modern world We work with streaming data. We want to monitor and analyze data in near real time, typically with a few seconds up to a few minutes of latency. We don’t have time to stop, copy the data and analyze it: we have to work with streams of data. Data in Motion
15
#SQLSAT454 Scenarios Real-time ingestion, processing and archiving of data Real-time Analytics Connected devices (Internet of Things)
16
#SQLSAT454 Why Stream Analytics in the Cloud? Not all data is local. Event data is already in the Cloud. Event data is globally distributed. Bring the processing to the data, not the data to the processing.
17
#SQLSAT454 Apply cloud principles Focus on building solutions (PaaS or SaaS) without having to manage complex infrastructure and software: no hardware or other up-front costs, no time-consuming installation or setup. Elastic scale, where resources are efficiently allocated and paid for as requested. Scale to any volume of data while still achieving high throughput, low latency and guaranteed resiliency. Up and running in minutes.
18
#SQLSAT454 SCENARIO demo
19
#SQLSAT454 An API can be a “thing” API Apps, Logic Apps, world-wide distributed APIs (REST) consume resources (CPU, storage, network bandwidth). Each request is logged, with Event Hub or in log files. Evaluate how the API is doing with “real time” statistics; e.g. ASP.NET apps can log directly to Event Hub.
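Once API request logs flow into Event Hub, a short Stream Analytics query can compute per-endpoint statistics in near real time. A minimal sketch, assuming a hypothetical input named ApiLogStream whose events carry RequestTime, ApiPath and DurationMs fields:

    SELECT
        ApiPath,
        COUNT(*) AS Requests,             -- requests per endpoint in each window
        AVG(DurationMs) AS AvgDurationMs  -- average latency in each window
    FROM ApiLogStream TIMESTAMP BY RequestTime
    GROUP BY ApiPath, TumblingWindow(minute, 1)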
20
#SQLSAT454 INTRODUCTION TO AZURE STREAM ANALYTICS
21
#SQLSAT454 What is Azure Stream Analytics? Azure Stream Analytics is a cost-effective event processing engine. Users describe their desired transformations in a SQL-like syntax. It is a stream processing engine that integrates with a scalable event queuing system like Azure Event Hubs.
22
#SQLSAT454 Canonical Stream Analytics Pattern
23
#SQLSAT454 Real-time analytics Intake of millions of events per second (up to 1 GB/s), with scale that accommodates variable loads. Low, auto-adaptive processing latency (sub-second to seconds). Transform, augment, correlate: temporal operations. Correlate between different streams, or with reference data. Find patterns, or lack of patterns, in data in real time.
24
#SQLSAT454 No challenges with scale Elasticity of the cloud for scale-out: spin up any number of resources on demand, scale from small to large when required. Distributed, scale-out architecture.
25
#SQLSAT454 Fully managed No hardware (PaaS offering): bypasses the need for deployment expertise. No software to provision and maintain. No performance tuning. Spin up any number of resources on demand. Expand your business globally leveraging Azure regions.
26
#SQLSAT454 Mission critical availability Guaranteed event delivery: guaranteed not to lose events or produce incorrect output, guaranteed “once and only once” delivery of events, ability to replay events. Guaranteed business continuity: guaranteed uptime (three nines of availability), auto-recovery from failures, built-in state management for fast recovery. Effective audits: privacy and security properties of solutions are evident, Azure integration for monitoring and ops alerting.
27
#SQLSAT454 Lower costs Efficiently pay only for usage: architected for multi-tenancy, no paying for idle resources. Typical cloud expense model: low startup costs, ability to incrementally add resources, reduce costs when business needs change.
28
#SQLSAT454 Rapid development SQL-like language: high-level (focus on the stream analytics solution), concise (less code to maintain). First-class support for event streams and reference data. Built-in temporal semantics: built-in temporal windowing and joining, simple policy configuration to manage out-of-order events and late arrivals.
29
#SQLSAT454 Azure Stream Analytics Data source → Collect → Process → Consume/Deliver. Event inputs: Event Hub, Azure Blob. Transform: temporal joins, filters, aggregates, projections, windows, etc. Enrich and correlate with reference data (Azure Blob). Outputs: SQL Azure, Azure Blobs, Event Hub, Service Bus Queue, Service Bus Topics, Table storage, Power BI. Temporal semantics, guaranteed delivery, guaranteed uptime.
30
#SQLSAT454 Input sources for a Stream Analytics Job Currently supported input Data Streams are Azure Event Hub, Azure IoT Hub and Azure Blob Storage. Multiple input Data Streams are supported. Advanced options let you configure how the Job will read data from an input blob (which folders to read from, when a blob is ready to be read, etc.). Reference data is usually static or changes very slowly over time; it must be stored in Azure Blob Storage and is cached for performance.
31
#SQLSAT454 Defining Event Schema The serialization format and the encoding for the input data sources (both Data Streams and Reference Data) must be defined. Currently three formats are supported: CSV, JSON and Avro (a binary format with JSON schemas: https://avro.apache.org/docs/1.7.7/spec.html). For the CSV format a number of common delimiters are supported: comma (,), semicolon (;), colon (:), tab and space. For CSV and Avro you can optionally provide the schema for the input data.
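For instance, a JSON-encoded toll event like the ones used in the examples below might look like this (a hypothetical payload; the field names are illustrative):

    { "TollId": 1, "EntryTime": "2014-09-10T12:01:00.000Z", "LicensePlate": "JNB 7001" }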
32
#SQLSAT454 Output for Stream Analytics Jobs Currently supported output data stores:
Azure Blob storage: creates log files with temporal query results; ideal for archiving.
Azure Table storage: more structured than blob storage, easier to set up than SQL Database, and durable (in contrast to Event Hub).
SQL Database: stores results in an Azure SQL Database table; ideal as a source for traditional reporting and analysis.
Event Hub: sends an event to an event hub; ideal to generate actionable events such as alerts or notifications.
Service Bus Queue: sends an event on a queue; ideal for sending events sequentially.
Service Bus Topics: sends an event to subscribers; ideal for sending events to many consumers.
PowerBI.com: ideal for near real time reporting!
DocumentDb: ideal if you work with JSON and object graphs.
33
#SQLSAT454 PREPARATION demo
34
#SQLSAT454 STREAM ANALYTICS QUERY LANGUAGE (SAQL)
35
#SQLSAT454 SAQL – Language & Library
36
#SQLSAT454 Supported types
bigint: integers in the range -2^63 (-9,223,372,036,854,775,808) to 2^63-1 (9,223,372,036,854,775,807).
float: floating point numbers in the range -1.79E+308 to -2.23E-308, 0, and 2.23E-308 to 1.79E+308.
nvarchar(max): text values, comprised of Unicode characters. Note: a value other than max is not supported.
datetime: defines a date combined with a time of day with fractional seconds, based on a 24-hour clock and relative to UTC (time zone offset 0).
Inputs will be cast into one of these types. We can control these types with a CREATE TABLE statement: this does not create a table, but just a data type mapping for the inputs.
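A minimal sketch of such a mapping for the toll entry stream used in the examples below (the input name and columns are illustrative):

    CREATE TABLE EntryStream (
        TollId BIGINT,
        EntryTime DATETIME,
        LicensePlate NVARCHAR(MAX),
        Toll FLOAT
    )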
37
#SQLSAT454 INTO clause Pipelines data from input to output. Without an INTO clause we write to the destination named ‘output’. We can have multiple outputs: with the INTO clause we can choose the appropriate destination for every SELECT, e.g. send all events to blob storage for big data analysis, but send special events to an event hub for alerting. SELECT UserName, TimeZone INTO Output FROM InputStream WHERE Topic = 'XBox'
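A minimal sketch of that routing, assuming hypothetical output aliases ArchiveBlob (blob storage) and AlertHub (event hub):

    SELECT * INTO ArchiveBlob FROM InputStream

    SELECT UserName, TimeZone INTO AlertHub FROM InputStream WHERE Topic = 'XBox'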
38
#SQLSAT454 WHERE clause Specifies the conditions for the rows returned in the result set for a SELECT statement, query expression, or subquery There is no limit to the number of predicates that can be included in a search condition. SELECT UserName, TimeZone FROM InputStream WHERE Topic = 'XBox'
39
#SQLSAT454 JOIN We can combine multiple event streams, or an event stream with reference data, via a join (inner join) or a left outer join. In the join clause we can specify the time window in which we want the join to take place. We use a special version of DATEDIFF for this, as in the sketch below.
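A minimal sketch, assuming hypothetical EntryStream and ExitStream inputs, that matches entries and exits of the same license plate within 15 minutes:

    SELECT ES.TollId, ES.LicensePlate
    FROM EntryStream ES TIMESTAMP BY EntryTime
    JOIN ExitStream EX TIMESTAMP BY ExitTime
        ON ES.LicensePlate = EX.LicensePlate
        AND DATEDIFF(minute, ES, EX) BETWEEN 0 AND 15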
40
#SQLSAT454 Reference Data Seamless correlation of event streams with reference data Static or slowly-changing data stored in blobs CSV and JSON files in Azure Blobs scanned for new snapshots on a settable cadence JOIN (INNER or LEFT OUTER) between streams and reference data sources Reference data appears like another input: SELECT myRefData.Name, myStream.Value FROM myStream JOIN myRefData ON myStream.myKey = myRefData.myKey
41
#SQLSAT454 Reference data tips Currently reference data cannot be refreshed automatically: you need to stop the job and specify a new snapshot with the reference data. Reference data lives only in Blob storage. In practice you use services like Azure Data Factory to move data from Azure data sources to Azure Blob Storage. Have you followed Francesco Diaz’s session?
42
#SQLSAT454 UNION
SELECT TollId, ENTime AS Time, LicensePlate
FROM EntryStream TIMESTAMP BY ENTime
UNION
SELECT TollId, EXTime AS Time, LicensePlate
FROM ExitStream TIMESTAMP BY EXTime

EntryStream:
TollId  EntryTime                LicensePlate  …
1       2014-09-10 12:01:00.000  JNB 7001      …
1       2014-09-10 12:02:00.000  YXZ 1001      …
3       2014-09-10 12:02:00.000  ABC 1004      …

ExitStream:
TollId  ExitTime                 LicensePlate
1       2009-06-25 12:03:00.000  JNB 7001
1       2009-06-25 12:03:00.000  YXZ 1001
3       2009-06-25 12:04:00.000  ABC 1004

Result:
TollId  Time                     LicensePlate
1       2014-09-10 12:01:00.000  JNB 7001
1       2014-09-10 12:02:00.000  YXZ 1001
3       2014-09-10 12:02:00.000  ABC 1004
1       2009-06-25 12:03:00.000  JNB 7001
1       2009-06-25 12:03:00.000  YXZ 1001
3       2009-06-25 12:04:00.000  ABC 1004
43
#SQLSAT454 STORING, FILTERING AND DECODING demo
44
#SQLSAT454 HANDLING TIME IN AZURE STREAM ANALYTICS
45
#SQLSAT454 Traditional queries Traditional querying assumes the data doesn’t change while you are querying it: we query a fixed state. If the data is changing, snapshots and transactions ‘freeze’ the data while we query it. Since we query a finite state, our query should finish in a finite amount of time. (diagram: table → query → result table)
46
#SQLSAT454 A different kind of query When analyzing a stream of data, we deal with a potentially infinite amount of data. As a consequence our query will never end! To solve this problem, most queries will use time windows. (diagram: stream → temporal query → result stream)
47
#SQLSAT454 Arrival Time vs Application Time Every event that flows through the system comes with a timestamp that can be accessed via System.Timestamp. This timestamp can be either the arrival time or an application time, which the user can specify in the query; a record can have multiple timestamps associated with it. The arrival time has different meanings based on the input source: for events from Azure Service Bus Event Hub, the arrival time is the timestamp given by the Event Hub; for Blob storage, it is the blob’s last modified time. If the user wants to use an application time, they can do so using the TIMESTAMP BY keyword. Data are sorted by the timestamp column.
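A minimal sketch, assuming the toll EntryStream carries its own EntryTime field to be used as the application time:

    SELECT TollId, LicensePlate, System.Timestamp AS EventTime
    FROM EntryStream TIMESTAMP BY EntryTime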
48
#SQLSAT454 Temporal Joins
SELECT Make
FROM EntryStream ES TIMESTAMP BY EntryTime
JOIN ExitStream EX TIMESTAMP BY ExitTime
    ON ES.Make = EX.Make
    AND DATEDIFF(second, ES, EX) BETWEEN 0 AND 10
(diagram: toll entry events {"Mazda",6} {"BMW",7} {"Honda",2} {"Volvo",3} and toll exit events {"Mazda",3} {"BMW",7} {"Honda",2} {"Volvo",3} on a 0–25 second timeline)
49
#SQLSAT454 Windowing Concepts A common requirement is to perform some set-based operation (count, aggregation, etc.) over events that arrive within a specified period of time. GROUP BY returns data aggregated over a certain subset of the data; but how do you define a subset in a stream? With windowing functions! Each GROUP BY requires a windowing function. (diagram: events on a timeline split into Window 1, Window 2, Window 3; a Sum aggregate over each window produces one output event per window)
50
#SQLSAT454 Three types of windows Every window operation outputs events at the end of the window. The output of the window is a single event based on the aggregate function used; the event has the timestamp of the window. All windows have a fixed length.
Tumbling window: aggregate per time interval.
Hopping window: scheduled overlapping windows.
Sliding window: window constantly re-evaluated.
51
#SQLSAT454 Tumbling Window Tumbling windows repeat and are non-overlapping: an event can belong to only one tumbling window. Query: count the total number of vehicles entering each toll booth every interval of 20 seconds.

SELECT TollId, COUNT(*)
FROM EntryStream TIMESTAMP BY EntryTime
GROUP BY TollId, TumblingWindow(second, 20)

(diagram: the same event timeline split into 20-second and 10-second tumbling windows)
52
#SQLSAT454 Hopping Window Hopping windows repeat, can overlap, and hop forward in time by a fixed period (the same as a tumbling window if hop size = window size); events can belong to more than one hopping window. Query: count the number of vehicles entering each toll booth every interval of 20 seconds; update results every 10 seconds.

SELECT COUNT(*), TollId
FROM EntryStream TIMESTAMP BY EntryTime
GROUP BY TollId, HoppingWindow(second, 20, 10)

(diagram: 20-second hopping windows with a 10-second hop, and 10-second hopping windows with a 5-second hop)
53
#SQLSAT454 Sliding Window A sliding window continuously moves forward by an ε (epsilon) and produces an output only on the occurrence of an event; every window has at least one event, and events can belong to more than one sliding window. Query: find all the toll booths which have served more than 10 vehicles in the last 20 seconds.

SELECT TollId, COUNT(*)
FROM EntryStream ES
GROUP BY TollId, SlidingWindow(second, 20)
HAVING COUNT(*) > 10

(diagram: 20-second sliding windows re-evaluated as events arrive)
54
#SQLSAT454 TEMPORAL TASKS demo
55
#SQLSAT454 SCALING ANALYTICS
56
#SQLSAT454 Streaming Unit A streaming unit is a measure of the computing resource available for processing a Job. A streaming unit can process up to 1 MB/second. By default every job consists of 1 streaming unit. The total number of streaming units that can be used depends on: the rate of incoming events and the complexity of the query.
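For example, under the 1 MB/s-per-unit rule of thumb above, an input stream averaging 3 MB/s would need at least 3 streaming units, plus headroom for query complexity.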
57
#SQLSAT454 Multiple steps, multiple outputs A query can have multiple steps to enable pipelined execution. A step is a sub-query defined using WITH (a “common table expression”); the only query outside of the WITH keyword is also counted as a step. Steps can be used to develop complex queries more elegantly by creating an intermediary named result. Each step’s output can be sent to multiple output targets using INTO.

WITH Step1 AS (
    SELECT COUNT(*) AS CountTweets, Topic
    FROM TwitterStream PARTITION BY PartitionId
    GROUP BY TumblingWindow(second, 3), Topic, PartitionId
),
Step2 AS (
    SELECT AVG(CountTweets)
    FROM Step1
    GROUP BY TumblingWindow(minute, 3)
)
SELECT * INTO Output1 FROM Step1
SELECT * INTO Output2 FROM Step2
SELECT * INTO Output3 FROM Step2
58
#SQLSAT454 Scaling Concepts – Partitions When a query is partitioned, input events are processed and aggregated in separate partition groups, and output events are produced for each partition group. To read from Event Hubs, ensure that the number of partitions matches; the query within the step must have the PARTITION BY keyword. If your input is a partitioned event hub, we can write partitioned queries and partitioned subqueries (WITH clause). Since each step can scale up to 6 streaming units, a non-partitioned query with a 3-fold partitioned subquery can have (1 + 3) × 6 = 24 streaming units!

SELECT COUNT(*) AS Count, Topic
FROM TwitterStream PARTITION BY PartitionId
GROUP BY TumblingWindow(minute, 3), Topic, PartitionId

(diagram: one Event Hub feeding three partitioned query results)
59
#SQLSAT454 Out of order inputs Event Hub guarantees monotonicity of the timestamp on each partition of the Event Hub, and all events from all partitions are merged by timestamp order, so there will be no out-of-order events. When it’s important for you to use the sender’s timestamp, and a timestamp from the event payload is chosen using TIMESTAMP BY, several sources of disorder can be introduced: producers of the events have clock skews; network delay from the producers sending the events to Event Hub; clock skews between Event Hub partitions. Do we skip out-of-order events (drop) or do we pretend they happened just now (adjust)?
60
#SQLSAT454 Handling out of order events On the configuration tab, you will find the defaults. Using 0 seconds as the out-of-order tolerance window means you assert all events are in order all the time. To allow ASA to correct the disorder, you can specify a non-zero out-of-order tolerance window size: ASA will buffer events up to that window and reorder them, using the user-chosen timestamp, before applying the temporal transformation. Because of the buffering, the side effect is that the output is delayed by the same amount of time, so you will need to tune the value to reduce the number of out-of-order events while keeping the latency low.
61
#SQLSAT454 STRUCTURING AND SCALING QUERY demo
62
#SQLSAT454 CONCLUSIONS
63
#SQLSAT454 Summary Azure Stream Analytics is the PaaS solution for analytics on streaming data. It is programmable with a SQL-like language. Handling time is a special and central feature. It scales with cloud principles: elastic, self-service, multitenant, pay per use. More questions: other solutions, pricing, what to do with that data, futures.
64
#SQLSAT454 Microsoft real-time stream processing options
65
#SQLSAT454 Apache Storm (in HDInsight) Apache Storm is a distributed, fault-tolerant, open source real-time event processing solution. Storm was originally used by Twitter to process massive streams of data from the Twitter firehose. Today, Storm is a top-level project of the Apache Software Foundation. Typically, Storm is integrated with a scalable event queuing system like Apache Kafka or Azure Event Hubs.
66
#SQLSAT454 Stream Analytics vs Apache Storm Storm: data transformation; can handle more dynamic data (if you’re willing to program); requires programming. Stream Analytics: ease of setup; JSON and CSV formats only; can change queries within 4 minutes; only takes inputs from Event Hub and Blob Storage; only outputs to Azure Blob, Azure Tables, Azure SQL, Power BI.
67
#SQLSAT454 Pricing Pricing is based on volume per job: the volume of data processed and the streaming units required to process the data stream.
Volume of data processed: volume of data processed by the streaming job (in GB), € 0.0009 per GB.
Streaming unit (a blended measure of CPU, memory and throughput): € 0.0262 per hour, about € 18.86 per month.
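As a sanity check on the monthly figure, assuming one streaming unit running continuously for a 30-day month: € 0.0262/hour × 720 hours ≈ € 18.86.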
68
#SQLSAT454 Azure Machine Learning Understand the “sequence” of data in the history to predict the future: Azure can ‘learn’ which values preceded issues.
69
#SQLSAT454 Power BI A solution to create real-time dashboards. A SaaS service inside Office 365.
70
#SQLSAT454 Futures [started] Native integration with Azure Machine Learning; provide better ways to debug. [planned] Call to a REST endpoint to invoke custom code. [under review] Take input from DocumentDb.
71
#SQLSAT454 Thanks Don’t forget to fill in the evaluation form here: http://speakerscore.com/sqlsat454 Marco Parenzan http://twitter.com/marco_parenzan http://www.slideshare.net/marcoparenzan http://www.github.com/marcoparenzan
72
#SQLSAT454 #sqlsat454