Azure Stream Analytics

Presentation on theme: "Azure Stream Analytics"— Presentation transcript:

1 Azure Stream Analytics
Marco Parenzan @marco_parenzan

2 Sponsors

3 Organizers getlatestversion.it

4 Meet Marco Parenzan | @marco_parenzan
Microsoft MVP 2015 for Azure. Develops modern distributed and cloud solutions. Passionate about speaking and inspiring programmers, students and people. I'm a developer!

5 Agenda
Why a developer talks about analytics
Analytics in a modern world
Introduction to Azure Stream Analytics
Stream Analytics Query Language (SAQL)
Handling time in Azure Stream Analytics
Scaling Analytics
Conclusions

6 Analytics in a modern world

7 What is Analytics? (from Wikipedia)
Analytics is the discovery and communication of meaningful patterns in data. Especially valuable in areas rich with recorded information, analytics relies on the simultaneous application of statistics, computer programming and operations research to quantify performance. Analytics often favors data visualization to communicate insight.

8 IoT proof of concept

9 Event-based systems
Event: "something happened… …somewhere… …sometime!"
Events arrive at different times, i.e. they have unique timestamps
Events arrive at different rates (events/sec); in any given period of time there may be 0, 1 or more events

10 Azure Service Bus
Relay: NAT and firewall traversal service, request/response services, unbuffered with TCP throttling
Queue: transactional cloud AMQP/HTTP broker; high-scale, high-reliability messaging; sessions, scheduled delivery, etc.
Topic: transactional message distribution; up to 2000 subscriptions per topic; up to 2K/100K filter rules per subscription
Notification Hub: high-scale notification distribution to most mobile push notification services; millions of notification targets
Event Hub: hyper scale; a million concurrent clients

11 Azure Event Hubs
Event producers (> 1M) send events that are distributed across partitions via a hash of the PartitionKey
Consumer group(s) read through direct receivers or through an Event Processor Host (IEventProcessor)
> 1 GB/sec aggregate throughput
Throughput Units: 1 ≤ TUs ≤ partition count; 1 TU = 1 MB/s writes, 2 MB/s reads
Up to 32 partitions via the portal, more on request

12 Microsoft Azure IoT Services
Devices → Device Connectivity → Storage → Analytics → Presentation & Action
Device connectivity: Event Hubs, Service Bus, External Data Sources
Storage: SQL Database, Table/Blob Storage, DocumentDB
Analytics: Machine Learning, Stream Analytics, HDInsight, Data Factory
Presentation & action: App Service, Power BI, Notification Hubs, Mobile Services, BizTalk Services

13 Analytics in a modern world

14 Traditional analytics: data at rest
Everything around us produces data: devices, sensors, infrastructures and applications
Traditional Business Intelligence first collects data and analyzes it afterwards
Typically 1 day of latency: analysis happens the day after
But we live in a fast-paced world: social media, Internet of Things, just-in-time production
Offline data quickly loses its value: for many organizations, capturing and storing event data for later analysis is no longer enough

15 Analytics in a modern world
Data in motion: we work with streaming data
We want to monitor and analyze data in near real time
Typically a few seconds up to a few minutes of latency
So we don't have time to stop, copy the data and analyze it: we have to work with streams of data as they flow

16 Scenarios
Real-time ingestion, processing and archiving of data: connected devices (Internet of Things) produce a continuous stream of data that is processed in flight (scrubbing PII, adding geo-tags, doing IP lookups) before being sent to a data store
Real-time analytics: collect real-time metrics to gain immediate insight into a website's usage patterns or application performance, with real-time dashboards where customers can see trends as soon as they occur
Fraud detection: monitor financial transactions in real time to detect fraudulent activity
Business operations: analyze real-time data to respond to dynamic environments and take immediate action
Connected devices: get real-time information from connected devices such as machines, buildings or cars so that relevant action can be taken (scheduling a repair technician, pushing software updates, performing a specific automated action); monitor and diagnose this data to generate alerts, respond to events or optimize operations

17 Why Stream Analytics in the Cloud?
Not all data is local: event data is already in the cloud, and event data is globally distributed
Bring the processing to the data, not the data to the processing

18 Apply cloud principles
Focus on building solutions (PaaS or SaaS) without having to manage complex infrastructure and software
No hardware or other up-front costs, no time-consuming installation or setup
Elastic scale: resources are efficiently allocated and paid for as requested
Scale to any volume of data while still achieving high throughput, low latency and guaranteed resiliency
Up and running in minutes

19 Demo: scenario

20 An API can be a "thing"
API Apps, Logic Apps, a world-wide distributed (REST) API
Resource consuming (CPU, storage, network bandwidth)
Each request is logged, with Event Hub or in log files (e.g. ASP.NET apps can log directly to Event Hub)
Evaluate how the API is doing with "real time" statistics

21 Introduction to Azure Stream Analytics

22 What is Azure Stream Analytics?
Azure Stream Analytics is a cost-effective event processing engine
Developers describe their desired transformations in a SQL-like syntax
It is a stream processing engine integrated with a scalable event queuing system such as Azure Event Hubs

23 Canonical Stream Analytics Pattern
Event production: devices (IP-capable devices on Windows/Linux, low-power devices with an RTOS), applications, legacy IoT (custom protocols)
Collection: cloud gateways (web APIs), field gateways
Ingestion: Event Hubs
Stream analysis: Stream Analytics, Machine Learning
Storage and batch analysis: SQL DB, Storage Tables, Storage Blobs
Presentation and action: data analytics (Power BI), web/thick client dashboards, search and query, devices to take action, more to come…

24 Real-time analytics
Intake millions of events per second (up to 1 GB/s), at variable loads
Scale that accommodates variable loads and preserves event order on a per-device basis
Low processing latency, auto adaptive (sub-second to seconds)
Transform, augment, correlate; temporal operations
Correlate between different streams, or with reference (more static) data: think of augmenting events containing IPs with geo-location data, or real-time stock trading events with stock information
Find patterns, or the lack of patterns, in data in real time; this allows pattern and anomaly detection

25 No challenges with scale
Elasticity of the cloud for scale-out: spin up any number of resources on demand
Scale from small to large when required
Distributed, scale-out architecture

26 Fully managed
No hardware (PaaS offering), no need for deployment expertise
No software provisioning and maintenance, no performance tuning
Spin up any number of resources on demand
Expand your business globally leveraging Azure regions

27 Mission critical availability
Guaranteed event delivery: guaranteed not to lose events or produce incorrect output; guaranteed "once and only once" delivery of events; ability to replay events (on failures, or from a particular time, based on the retention policy set up on Event Hubs) without writing any code
Guaranteed business continuity: guaranteed uptime (three nines of availability); auto-recovery from failures; built-in state management for fast recovery; recovery does not need to restart at the beginning of a window, it can resume from where the failure occurred, which keeps the business as real-time as possible
Effective audits: privacy and security properties of the solution are evident; Azure integration for monitoring and ops alerting

28 Lower costs
Efficiently pay only for usage: the typical cloud expense model
Architected for multi-tenancy: you don't pay for idle resources
Low startup costs, with the ability to incrementally add resources
Reduce costs when business needs change

29 Rapid development: SQL-like language, built-in temporal semantics
High-level: focus on the stream analytics solution; concise: less code to maintain
First-class support for event streams and reference data
Built-in temporal windowing and joining; simple policy configuration to manage out-of-order events and late arrivals
Stream Analytics gives developers the fastest productivity experience by abstracting the complexities of writing code for scale-out over distributed systems and for custom analytics: developers only describe the desired transformations in a declarative SQL-like language (filter, project, aggregate, compare with reference data, temporal operations) and the system handles everything else
Normally, event processing solutions are arduous to implement because of the amount of custom code required: code for parallelization, deployment over a distributed platform, scheduling and monitoring, plus the analytical functions themselves; other cloud services may handle the distributed platform, but their code is still procedural and more complex than SQL, and on-premises software may not even be designed to scale out to high data volumes
Development, maintenance and debugging can be done entirely through the Azure portal
At public preview, supported inputs are Azure Event Hubs and Azure Blobs; supported outputs are Azure Event Hubs, Azure Blobs, Azure SQL Database and Azure Tables

30 Azure Stream Analytics
Collect: event inputs (Event Hub, Azure Blob) and reference data (Azure Blob / Azure Storage) from the data source
Process: transform (temporal joins, filter, aggregates, projections, windows, etc.), enrich, correlate; temporal semantics, guaranteed delivery, guaranteed up time
Deliver: outputs to SQL Azure, Azure Blobs, Event Hub, Service Bus Queue, Service Bus Topics, Table storage, Power BI
Consume: BI dashboards, predictive analytics

31 Input sources for a Stream Analytics Job
Every job must have at least one data stream source; multiple input data streams are supported
Currently supported data stream inputs are Azure Event Hub, Azure IoT Hub and Azure Blob Storage
Every data stream must have a name (input alias): you use this name in the query to refer to that stream (it is the name of the 'table' you SELECT from; more on this later)
Advanced options let you configure how the job reads data from an input blob (which folders to read from, when a blob is ready to be read, etc.):
Blob serialization boundary: determines when a blob is ready for reading; Blob Boundary (the blob can be uploaded as a single piece or in blocks, but every block must be committed before the blob is read, and it can't be appended) or Block Boundary (blocks can be continuously added and each block is individually readable as soon as it is committed)
Path pattern: the file path used to locate blobs within the specified container; it may contain one or more of the variables {date}, {time} and {partition}, e.g. cluster1/logs/{date}/{time}/{partition} or cluster1/logs/{date}
Reference data is optional: it is usually static or changes very slowly over time (a product catalog, a city-to-zipcode mapping, customer profile data); it must be stored in Azure Blob Storage and is cached in memory for performance
The Azure portal provides wizards to guide you through adding inputs, and you can test connectivity to verify that the information you entered is correct

32 Defining the event schema
The serialization format and the encoding for the input data sources (both data streams and reference data) must be defined
Currently three formats are supported: CSV, JSON and Avro (a binary format)
For CSV, a number of common delimiters are supported: comma (,), semi-colon (;), colon (:), tab and space
For CSV and Avro you can optionally provide the schema for the input data
Only UTF-8 encoding is supported for now

33 Outputs for Stream Analytics Jobs
Azure Blob storage: creates log files with temporal query results; ideal for archiving
Azure Table storage: more structured than blob storage, easier to set up than SQL Database, and durable (in contrast to Event Hub)
SQL Database: stores results in an Azure SQL Database table; ideal as a source for traditional reporting and analysis; note that the schema of the result event and the table must be compatible
Event Hub: sends an event to an event hub; ideal for generating actionable events such as alerts or notifications
Service Bus Queue: sends an event to a queue; ideal for sending events sequentially
Service Bus Topics: sends an event to subscribers; ideal for sending events to many consumers
Power BI: ideal for near-real-time reporting
DocumentDB: ideal if you work with JSON and object graphs
The Azure portal provides wizards to guide you through adding an output, a process similar to adding inputs; just as with inputs, for blob storage and event hubs you define the serialization format (CSV, JSON or Avro) with UTF-8 encoding

34 Demo: preparation

35 Stream Analytics Query Language (SAQL)

36 SAQL – Language & Library
DML: SELECT, FROM, WHERE, GROUP BY, HAVING, CASE WHEN THEN ELSE, INNER/LEFT OUTER JOIN, UNION, CROSS/OUTER APPLY, CAST, INTO, ORDER BY ASC/DESC
Date and time functions: DateName, DatePart, Day, Month, Year, DateTimeFromParts, DateDiff, DateAdd
Aggregate functions: Sum, Count, Avg, Min, Max, StDev, StDevP, Var, VarP
Temporal functions: Lag, IsFirst, CollectTop
String functions: Len, Concat, CharIndex, Substring, PatIndex
Windowing extensions: TumblingWindow, HoppingWindow, SlidingWindow, Duration
Scaling extensions: WITH, PARTITION BY, OVER

37 Supported types
Inputs will be cast into one of these types:
bigint: integers in the range -2^63 (-9,223,372,036,854,775,808) to 2^63-1 (9,223,372,036,854,775,807)
float: floating point numbers in the range -1.79E+308 to -2.23E-308, 0, and 2.23E-308 to 1.79E+308
nvarchar(max): text values, comprised of Unicode characters (a value other than max is not supported)
datetime: a date combined with a time of day with fractional seconds, based on a 24-hour clock and relative to UTC (time zone offset 0)
We can control these types with a CREATE TABLE statement: this does not create a table, it just defines a data type mapping for the input (see the sketch below)
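For illustration, a minimal sketch of such a type mapping, assuming a hypothetical EntryStream input shaped like the toll-booth examples later in this deck; the statement only declares the expected types for incoming fields, it creates nothing:

-- Declare the expected types for the fields of the EntryStream input.
-- This is a type mapping only: no table is created anywhere.
CREATE TABLE EntryStream (
    TollId bigint,
    EntryTime datetime,
    LicensePlate nvarchar(max),
    Toll float
);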

38 INTO clause: pipelining data from input to output
Without an INTO clause we write to the destination named 'output'
We can have multiple outputs: with the INTO clause we can choose the appropriate destination for every SELECT
E.g. send all events to blob storage for big data analysis, but send special events to an event hub for alerting (see the sketch below)
SELECT UserName, TimeZone INTO Output FROM InputStream WHERE Topic = 'XBox'
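A sketch of that routing idea, assuming hypothetical output aliases BlobArchive and AlertHub configured on the job; two independent SELECT statements in the same job route the same input to different destinations:

-- archive everything to blob storage for later batch analysis
SELECT *
INTO BlobArchive
FROM InputStream

-- forward only the events that need immediate attention to an event hub
SELECT UserName, TimeZone
INTO AlertHub
FROM InputStream
WHERE Topic = 'XBox'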

39 WHERE clause Specifies the conditions for the rows returned in the result set for a SELECT statement, query expression, or subquery There is no limit to the number of predicates that can be included in a search condition. SELECT UserName, TimeZone FROM InputStream WHERE Topic = 'XBox'

40 JOIN
We can combine multiple event streams, or an event stream with reference data, via a join (inner join) or a left outer join
In the join clause we can specify the time window in which we want the join to take place
We use a special version of DATEDIFF for this

41 Reference Data
Seamless correlation of event streams with reference data
Static or slowly-changing data stored in blobs: CSV and JSON files in Azure Blobs, scanned for new snapshots on a settable cadence
JOIN (INNER or LEFT OUTER) between streams and reference data sources
Reference data appears like just another input:
SELECT myRefData.Name, myStream.Value FROM myStream JOIN myRefData ON myStream.myKey = myRefData.myKey
Currently reference data cannot be refreshed automatically: you need to stop the job and specify a new snapshot (refresh functionality is being worked on)

42 Reference data tips
Currently reference data cannot be refreshed automatically: you need to stop the job and specify a new snapshot with the reference data
Reference data can live only in Blob storage
In practice you use services like Azure Data Factory to move data from other Azure data sources into Azure Blob Storage
Have you followed Francesco Diaz's session?

43 UNION
Combines the results of two or more queries into a single result set that includes all the rows that belong to all the queries in the union
The number and order of the columns must be the same in all queries, and the data types must be compatible
If ALL is not specified, duplicate rows are removed
SELECT TollId, ENTime AS Time, LicensePlate FROM EntryStream TIMESTAMP BY ENTime
UNION
SELECT TollId, EXTime AS Time, LicensePlate FROM ExitStream TIMESTAMP BY EXTime
(The slide shows an EntryStream table with TollId, EntryTime and LicensePlate, an ExitStream table with TollId, ExitTime and LicensePlate, and the unioned result with a single Time column.)

44 Demo: storing, filtering and decoding

45 Handling time in Azure Stream Analytics

46 Traditional queries
Traditional querying assumes the data doesn't change while you are querying it: you query a fixed state
If the data is changing, snapshots and transactions 'freeze' the data while we query it
Since we query a finite state, our query should finish in a finite amount of time
(Diagram: a query over a table producing a result table)

47 A different kind of query
When analyzing a stream of data, we deal with a potentially infinite amount of data
As a consequence, our query will never end!
To solve this problem, most queries use time windows
(Diagram: a temporal query over a stream producing a result stream)

48 Arrival time vs application time
Every event that flows through the system comes with a timestamp that can be accessed via System.Timestamp
This timestamp is either the arrival time or an application time that the user can specify in the query; a record can have multiple timestamps associated with it
The arrival time has different meanings depending on the input source: for events from Azure Service Bus Event Hub it is the timestamp given by the Event Hub, for Blob storage it is the blob's last modified time
If you want to use an application time, you can do so using the TIMESTAMP BY keyword (see the sketch below)
Data is sorted by the timestamp column
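A minimal sketch of TIMESTAMP BY, reusing the EntryStream toll-booth input that appears later in this deck; EntryTime is assumed to be a field of the event payload, so the query is evaluated on the sender's application time rather than on the arrival time:

SELECT TollId, LicensePlate, System.Timestamp AS EventTime
FROM EntryStream TIMESTAMP BY EntryTime
WHERE TollId = 1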

49 Temporal joins
SELECT Make
FROM EntryStream ES TIMESTAMP BY EntryTime
JOIN ExitStream EX TIMESTAMP BY ExitTime
ON ES.Make = EX.Make AND DATEDIFF(second, ES, EX) BETWEEN 0 AND 10
Joins are used to combine events from two or more input sources
Joins are temporal in nature: each JOIN must provide limits on how far the matching rows can be separated in time; an unbounded join would have to match each event against a potentially infinite collection of events
The time bounds are specified inside the ON clause using the DATEDIFF function; note that inside a JOIN condition DATEDIFF takes the input source name (or its alias), not a datetime column as in a SELECT statement, and the timestamp associated with each event is picked internally
LEFT OUTER JOIN is also supported, to select rows from the left input that do not meet the join condition; useful for pattern detection (see the sketch after this slide)
You cannot use SELECT * in joins
Example (from the slide diagram): toll entry events {"Mazda",6} {"BMW",7} {"Honda",2} {"Volvo",3} and toll exit events {"Mazda",3} {"Honda",2} {"BMW",7} {"Volvo",3}. The query result is [Mazda, Volvo]: "Honda" is not in the result because its exit event precedes its entry event (DATEDIFF is negative), and "BMW" is not in the result because its entry and exit events are more than 10 seconds apart.
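A sketch of the pattern-detection idea with LEFT OUTER JOIN, assuming the same EntryStream/ExitStream shape used above; it reports license plates that entered a toll booth but produced no matching exit event within 5 minutes (the right side of the join is NULL when no match exists):

SELECT ES.TollId, ES.LicensePlate, ES.EntryTime
FROM EntryStream ES TIMESTAMP BY EntryTime
LEFT OUTER JOIN ExitStream EX TIMESTAMP BY ExitTime
ON ES.TollId = EX.TollId AND ES.LicensePlate = EX.LicensePlate
AND DATEDIFF(minute, ES, EX) BETWEEN 0 AND 5
WHERE EX.LicensePlate IS NULL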

50 Windowing concepts
A common requirement is to perform some set-based operation (count, aggregation, etc.) over events that arrive within a specified period of time
GROUP BY returns data aggregated over a certain subset of the data; how do we define a subset in a stream? With windowing functions: each GROUP BY requires a windowing function (windowing is an extension to T-SQL)
A window contains event data along a timeline and enables operations against the events within that window, for example summing the values of a payload field
Every window operation outputs a single event at the end of the window, based on the aggregate function used, with a timestamp equal to the window end time; the window start is exclusive and the window end is inclusive (a 5-minute window from 12:00 to 12:05 includes all events with timestamp greater than 12:00 and up to and including 12:05)
The timestamp of the output event can be projected in the SELECT statement using the System.Timestamp property with an alias (see the sketch below)
Every window automatically aligns itself to the zeroth hour: for example, a 5-minute tumbling window aligns to (12:00-12:05], (12:05-12:10], …
All windows must be used in a GROUP BY clause; currently all window types have a fixed width
(Diagram: events with values 1, 5, 4, 2, 6, 8 at times t1…t6 grouped into three windows; the Sum aggregate outputs 18 for the first window and 14 for the second.)
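A minimal sketch of projecting the window end time, again assuming the EntryStream toll-booth input:

SELECT TollId, COUNT(*) AS VehicleCount, System.Timestamp AS WindowEnd
FROM EntryStream TIMESTAMP BY EntryTime
GROUP BY TollId, TumblingWindow(minute, 5)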

51 Three types of windows
Every window operation outputs events at the end of the window: the output is a single event based on the aggregate function used, carrying the timestamp of the window
All windows have a fixed length
Tumbling window: aggregate per time interval
Hopping window: scheduled overlapping windows
Sliding window: window constantly re-evaluated as events arrive
(Diagram: the same six events grouped into three windows, with Sum outputs 18 and 14.)

52 Tumbling window
Tumbling windows repeat and are non-overlapping: an event can belong to only one tumbling window
Syntax: TUMBLINGWINDOW(timeunit, windowsize), where timeunit is day, hour, minute, second, millisecond, microsecond or nanosecond, and windowsize is an integer describing the width of the window
Query: count the total number of vehicles entering each toll booth every interval of 20 seconds
SELECT TollId, COUNT(*)
FROM EntryStream TIMESTAMP BY EntryTime
GROUP BY TollId, TumblingWindow(second, 20)
(Diagram: a 10-second and a 20-second tumbling window over a stream of events on a 60-second timeline.)

53 Hopping window
Hopping windows repeat and can overlap: they hop forward in time by a fixed period
The window is defined by two time spans, the hop size H and the window size S: for every H time units a new window of size S is created
A hopping window is the same as a tumbling window if the hop size equals the window size; events can belong to more than one hopping window
Syntax: HOPPINGWINDOW(timeunit, windowsize, hopsize), or HOPPINGWINDOW(Duration(timeunit, windowsize), Hop(timeunit, hopsize)) when window size and hop size use different time units; the Duration function can also be used with other window types to specify the window size
Query: count the number of vehicles entering each toll booth every interval of 20 seconds, updating the results every 10 seconds
SELECT COUNT(*), TollId
FROM EntryStream TIMESTAMP BY EntryTime
GROUP BY TollId, HoppingWindow(second, 20, 10)
(Diagram: a 10-second hopping window with a 5-second hop and a 20-second hopping window with a 10-second hop over the same event timeline.)

54 Sliding window
A sliding window continuously moves forward by an ε (epsilon, one hundredth of a nanosecond) and produces an output only when an event occurs, so every window contains at least one event
Events can belong to more than one sliding window
Syntax: SLIDINGWINDOW(timeunit, windowsize)
Query: find all the toll booths which have served more than 10 vehicles in the last 20 seconds
SELECT TollId, Count(*)
FROM EntryStream ES
GROUP BY TollId, SlidingWindow(second, 20)
HAVING Count(*) > 10
(Diagram: a 10-second and a 20-second sliding window over the event timeline.)

55 A 20-second sliding window, step by step
«1» enters: window contains 1
«5» enters: 1, 5
«9» enters: 1, 5, 9
«1» exits: 5, 9
«5» exits: 9
«9» exits: empty
«8» enters: 8

56 Demo: temporal tasks

57 Scaling Analytics

58 Streaming Units
A streaming unit is a measure of the computing resources available for processing a job
A streaming unit can process up to 1 MB/second
By default every job consists of 1 streaming unit
The total number of streaming units that can be used depends on the rate of incoming events and the complexity of the query, as well as on the partition configuration of the inputs and the query defined for the job
Valid values start at 1, 3, 6 and then go upwards in increments of 6

59 Multiple steps, multiple outputs
A query can have multiple steps to enable pipelined execution
A step is a sub-query defined using WITH (a "common table expression"); the only query outside the WITH keyword also counts as a step
Steps can be used to develop complex queries more elegantly by creating intermediary named results
Each step's output can be sent to multiple output targets using INTO
WITH Step1 AS (
    SELECT Count(*) AS CountTweets, Topic
    FROM TwitterStream PARTITION BY PartitionId
    GROUP BY TumblingWindow(second, 3), Topic, PartitionId
),
Step2 AS (
    SELECT Avg(CountTweets)
    FROM Step1
    GROUP BY TumblingWindow(minute, 3)
)
SELECT * INTO Output1 FROM Step1
SELECT * INTO Output2 FROM Step2
SELECT * INTO Output3 FROM Step2

60 Scaling concepts: partitions
When a query is partitioned, input events are processed and aggregated in separate partition groups, and output events are produced for each partition group; if a combined aggregate is needed, add a second non-partitioned step to aggregate
The query within the step must use the PARTITION BY keyword, and to read from Event Hubs the number of partitions must match
Currently you can only partition by the PartitionId field, a built-in field that indicates which partition of the source data stream the event came from; since Event Hubs supports partitioning, you can easily write partitioned queries and partitioned subqueries (WITH clause) over Event Hubs input
Partitioning a step allows more streaming units to be allocated to a job, since there is a limit on the number of units assignable to an un-partitioned step: a non-partitioned query with a 3-fold partitioned subquery can use (1 + 3) × 6 = 24 streaming units
SELECT Count(*) AS Count, Topic
FROM TwitterStream PARTITION BY PartitionId
GROUP BY TumblingWindow(minute, 3), Topic, PartitionId
(Diagram: an Event Hub with partitions 1, 2 and 3 feeding a partitioned Stream Analytics query that produces Result 1, Result 2 and Result 3.)

61 Out of order inputs
Event Hub guarantees monotonicity of the timestamps on each partition of the Event Hub
All events from all partitions are merged by timestamp order, so there will be no out-of-order events
When it is important to use the sender's timestamp, and a timestamp from the event payload is chosen using TIMESTAMP BY, several sources of disorder can be introduced: producers of the events have clock skews; there is network delay between the producers and the Event Hub; there are clock skews between Event Hub partitions
Do we skip out-of-order events (drop) or do we pretend they happened just now (adjust)?

62 Handling out of order events
On the configuration tab you will find the defaults; using 0 seconds as the out-of-order tolerance window means you assert that all events are in order all the time
To allow ASA to correct the disorder, you can specify a non-zero out-of-order tolerance window: ASA buffers events up to that window and reorders them using the chosen timestamp before applying the temporal transformation
Because of the buffering, the side effect is that the output is delayed by the same amount of time, so you need to tune the value to reduce the number of out-of-order events while keeping latency low

63 Demo: structuring and scaling a query

64 Conclusions

65 Summary
Azure Stream Analytics is the PaaS solution for analytics on streaming data
It is programmable with a SQL-like language
Handling time is a special and central feature
It scales with cloud principles: elastic, self-service, multi-tenant, pay per use
More questions: other solutions, pricing, what to do with that data, futures

66 Microsoft real-time stream processing options
StreamInsight: built for complex event processing in SQL Server; deployed on-premises or on Azure IaaS; no open-source support; development in .NET/LINQ with Visual Studio
Stream Analytics: built for ease of development and operationalization; Azure PaaS; no open-source support; development in SQL through the web browser
Storm on HDInsight: built for flexibility and customizability; Azure PaaS; open-source support; development in SCP.NET, Java or Python, with deep Visual Studio integration
Microsoft offers both on-premises and cloud-based real-time stream processing options: StreamInsight is part of SQL Server and targets on-premises deployments, while Storm for HDInsight and Azure Stream Analytics are both PaaS offerings for real-time event stream processing; designing a solution requires evaluating which offering best suits your requirements
Storm for Azure HDInsight is an Apache open-source stream analytics platform running on Microsoft Azure: highly flexible, with contributions from the Hadoop community, and highly customizable through development languages like Java and .NET
Azure Stream Analytics is a fully managed Microsoft first-party event processing engine that provides real-time analytics in a SQL-based query language to speed up development; it is easy to operationalize with a small number of resources and drives a low price point with its multi-tenancy architecture

67 Apache Storm (in HDInsight)
Apache Storm is a distributed, fault-tolerant, open source real-time event processing solution. Storm was originally used by Twitter to process massive streams of data from the Twitter firehose. Today, Storm is an incubator project as part of the Apache Software foundation. Typically, Storm will be integrated with a scalable event queuing system like Apache Kafka or Azure Event Hubs.

68 Stream Analytics vs Apache Storm
Data transformation: Storm can handle more dynamic data (if you're willing to program), but requires programming
Stream Analytics: ease of setup; JSON and CSV formats only; queries can be changed within 4 minutes; takes inputs only from Event Hub and Blob Storage; outputs only to Azure Blob, Azure Tables, Azure SQL and Power BI

69 Pricing
Pricing is based on volume per job: the volume of data processed and the streaming units required to process the data stream
Volume of data processed: the volume of data processed by the streaming job (in GB), priced per GB
Streaming unit: a blended measure of CPU, memory and throughput, priced per hour (the slide quotes € 18,864 per month)

70 Azure Machine Learning
Understand the "sequence" of the data in the history to predict the future
Azure can 'learn' which values preceded issues

71 Power BI
Solution to create real-time dashboards
SaaS service, inside Office 365

72 Futures
Native integration with Azure Machine Learning [started]
Provide better ways to debug [planned]
Call to a REST endpoint to invoke custom code [under review]
Take input from DocumentDB

73 Thanks
Don't forget to fill in the evaluation form
Marco Parenzan @marco_parenzan

