1
Microsoft Ignite 2016 | BRK3184: Explore Spark 2.0 and structured streaming in Microsoft Azure HDInsight
Maxim Lukiyanov, Senior Program Manager, Big Data
© 2016 Microsoft Corporation. All rights reserved. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
2
Agenda Spark in Azure cloud (HDInsight) Spark 2.0 – Datasets
Spark 2.0 – Structured streams Spark 2.0 – Structured streaming internals Demo
3
Apache Spark in recent times
Great momentum in the industry
- Active and large community
- Supported by all major big data vendors
- Fast release cadence
New in Spark 2.0
- Dataset is the new unifying API
- Tungsten Phase 2 (3-10x speedup)
- Structured Streams [ALPHA]
4
Spark on Azure HDInsight
Fully Managed Service
- 100% open source Apache Spark and Hadoop bits
- Fully supported by Microsoft and Hortonworks
- 99.9% Azure Cloud SLA
- Certifications: PCI, ISO 27018, SOC, HIPAA, EU-MC
Spark 2.0
- Latest Spark 2.0 with 100+ stability fixes (available later this week, on 9/30)
Optimized for data exploration, experimentation, and development
- Jupyter Notebooks (Scala, Python, automatic data visualizations)
- IntelliJ and Eclipse plugins (job submission, remote debugging)
- ODBC connector for Power BI, Tableau, Qlik, SAP, Excel, etc.
5
Integrated with Azure data pipelines
Batch pipelines
- Azure Data Factory orchestrates Spark ETL jobs
- Power BI connector for Spark
- HA of the job submission service
Streaming pipelines
- High availability for Spark Streaming using YARN
- Azure connectors: Event Hub, Power BI, Azure SQL, SQL DW, Data Lake
GitHub: Blog: saving-spark-dataframe-to-powerbi/
6
Spark 2.0
7
New in Spark 2.0
- Datasets become the new unifying API
- Spark as a compiler: Tungsten Phase 2 (3-10x speedup)
- Structured Streams: re-architected streaming, but in Spark 2.0 still in [ALPHA]
8
Spark Datasets
9
From RDD to Dataset: What is RDD?
Partitions: distributed data
Lineage: dependencies
Compute function: Partition => Iterator[T]
10
From RDD to Dataset: What is RDD? (continued)
Partitions: distributed data
Lineage: dependencies
Compute function: Partition => Iterator[T]
Both the computation and the data are opaque to Spark.
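The three pieces above can be sketched in plain Scala (no Spark dependency; the names here are illustrative, not Spark's actual internals):

```scala
// Minimal sketch of the RDD contract: partitions, lineage, and a
// compute function Partition => Iterator[T].
case class Partition(index: Int)

trait MiniRDD[T] {
  def partitions: Seq[Partition]          // distributed data
  def dependencies: Seq[MiniRDD[_]]       // lineage
  def compute(p: Partition): Iterator[T]  // opaque compute function

  // Transformations only build lineage; nothing runs until collect().
  def map[U](f: T => U): MiniRDD[U] = {
    val parent = this
    new MiniRDD[U] {
      def partitions = parent.partitions
      def dependencies = Seq(parent)
      def compute(p: Partition) = parent.compute(p).map(f)
    }
  }
  def collect(): Seq[T] = partitions.flatMap(compute)
}

// A source RDD over in-memory partitions.
def parallelize[T](parts: Seq[Seq[T]]): MiniRDD[T] = new MiniRDD[T] {
  def partitions = parts.indices.map(Partition(_))
  def dependencies = Nil
  def compute(p: Partition) = parts(p.index).iterator
}

val rdd = parallelize(Seq(Seq(1, 2), Seq(3, 4))).map(_ * 10)
```

Note that `map` captures an arbitrary closure: Spark can schedule and replay it, but cannot look inside it, which is exactly the opacity the slide points out.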
11
From RDD to Dataset: What is Dataset?
Structured computation and data
- Data types have a known structure
- Common computations are transparent to the optimizer
- Type-safe: operations on data can be expressed using domain objects
12
Structured APIs in Spark
                 SQL       DataFrames     Datasets
Syntax errors    Runtime   Compile time   Compile time
Analysis errors  Runtime   Runtime        Compile time
13
DataFrame = Dataset[Row]
Spark 2.0 unifies these APIs: a DataFrame is simply a Dataset[Row]. Untyped operations yield generic Row objects, and types can be enforced on generic rows by using df.as[MyClass].
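To make the Row-vs-typed distinction concrete, here is a deliberately simplified sketch that mocks Row as a Map[String, Any] and .as[MyClass] as a manual conversion (illustrative only; real Spark uses encoders for this):

```scala
// Mock of a generic row: untyped field access, values are Any.
type Row = Map[String, Any]

case class Person(name: String, age: Long)

// Re-impose a type on a generic row, as df.as[Person] would.
def as(r: Row): Person =
  Person(r("name").asInstanceOf[String], r("age").asInstanceOf[Long])

val df: Seq[Row] = Seq(Map("name" -> "Ada", "age" -> 36L)) // DataFrame = Dataset[Row]
val ds: Seq[Person] = df.map(as)                           // typed view of the same data
```

With the typed view, a misspelled field or a wrong type is a compile-time error on `Person`, whereas on the raw rows it would only fail at runtime.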
14
Shared Optimization & Execution
DataFrames, Datasets and SQL share the same optimization/execution pipeline
15
Logical query plan
== Optimized Logical Plan ==
Project [name#75, age#74L]
+- Filter (age#74L > 30)
   +- Relation[age#74L, name#75] JSONRelation
16
Structured computation
                 Functions                                      Expressions
Scala code       (x: Int) => x > 30                             df("x") > 30
Compiles into    class $function1 { def apply(Int): Boolean }   GreaterThan(x, Literal(30))
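A toy version of such an expression tree shows why it is transparent to the optimizer, unlike an opaque compiled closure (hypothetical class names; Catalyst's real Expression hierarchy is much richer):

```scala
// Toy expression tree in the spirit of GreaterThan(x, Literal(30)).
sealed trait Expr
case class Column(name: String)          extends Expr
case class Literal(value: Int)           extends Expr
case class GreaterThan(l: Expr, r: Expr) extends Expr

// Because the predicate is data, an optimizer (or interpreter) can walk it.
def value(e: Expr, row: Map[String, Int]): Int = e match {
  case Column(n)  => row(n)
  case Literal(v) => v
  case _          => sys.error("not a scalar expression")
}

def holds(p: Expr, row: Map[String, Int]): Boolean = p match {
  case GreaterThan(a, b) => value(a, row) > value(b, row)
  case _                 => sys.error("not a predicate")
}

val pred = GreaterThan(Column("x"), Literal(30)) // df("x") > 30
```

A `(x: Int) => x > 30` closure can only be invoked; this tree can also be inspected, rewritten, pushed down to a source, or compiled to specialized code.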
17
Catalyst expressions: support for 100+ native functions with optimized codegen implementations. Examples:
- String manipulation: concat, format_string, lower
- Date/time: current_timestamp, date_format, date_add
- Math: sqrt, randn
- Other: monotonicallyIncreasingId, sparkPartitionId
18
Predicate pushdown (figure omitted: Spark code shown alongside the query sent to SQL Server)
19
Spark structured data model
Primitives: Byte, Short, Integer, Long, Float, Double, Decimal, String, Binary, Boolean, Timestamp, Date
Array[Type]: variable-length collection
Struct: fixed number of nested columns with fixed types
Map[Type, Type]: variable-length association
20
Tungsten’s compact row encoding
(Figure: the row (123, "apache", "spark") laid out as a null bitmask (0x0), fixed-width slots holding 123 inline plus offsets to the variable-length section (32, 56), followed by the length-prefixed field data: 6 "apache", 5 "spark".)
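The layout idea can be sketched with a ByteBuffer. This is a simplified stand-in for Tungsten's UnsafeRow: the slot sizes and offset packing below are this sketch's own choices, not Spark's exact format.

```scala
import java.nio.ByteBuffer

// Toy fixed-then-variable row layout: an 8-byte null bitmask, one 8-byte
// slot per field (inline value, or offset and length packed into one word
// for strings), then the variable-length string bytes at the end.
def encode(id: Long, a: String, b: String): Array[Byte] = {
  val aBytes = a.getBytes("UTF-8"); val bBytes = b.getBytes("UTF-8")
  val fixed = 8 + 3 * 8                            // bitmask + 3 slots
  val buf = ByteBuffer.allocate(fixed + aBytes.length + bBytes.length)
  buf.putLong(0L)                                  // null bitmask: no nulls
  buf.putLong(id)                                  // slot 0: inline long
  var off = fixed
  buf.putLong((off.toLong << 32) | aBytes.length)  // slot 1: offset | length
  off += aBytes.length
  buf.putLong((off.toLong << 32) | bBytes.length)  // slot 2: offset | length
  buf.put(aBytes).put(bBytes)
  buf.array()
}

def readString(row: Array[Byte], slot: Int): String = {
  val word = ByteBuffer.wrap(row).getLong(8 + slot * 8)
  val off = (word >>> 32).toInt
  val len = (word & 0xffffffffL).toInt
  new String(row, off, len, "UTF-8")
}

val row = encode(123L, "apache", "spark")
```

Every field is reachable by pointer arithmetic on a flat byte array, with no per-field Java objects, which is what makes the encoding both compact and cheap to process.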
21
Tungsten’s compact encoding types
Row-oriented, and column-oriented (new in 2.0), the latter used in vectorized processing of data (for example, by the new Parquet reader).
22
Space efficiency (chart omitted; y-axis in GB; Databricks-reported data)
23
Serialization performance (chart omitted; y-axis in millions of objects per second; Databricks-reported data)
24
Whole-stage codegen: Spark as a Compiler
Query plan (figure): Scan -> Filter -> Project -> Aggregate; whole-stage codegen fuses these operators into a single generated function.
25
Generated code operates directly on compact encoding
DataFrame code: df("age") > 30
Catalyst expression: GreaterThan(age, Literal(30))
Generated code: a JVM intrinsic operation that will be JIT-compiled into pointer arithmetic
26
TPC-DS benchmark results
The TPC-DS performance improvement is in the range of 3-10x (Databricks-reported data).
27
Spark 2.0 - Structured Streams
28
Existing Spark DStream APIs
- Introduced in Spark 0.7
- Capable of exactly-once semantics
- Fault-tolerant at scale
- High throughput
29
Limitations of DStream APIs
Processing data using event time: DStreams natively support batch time only, which makes it hard to process late-arriving data.
Interoperability between streaming and batch: while the APIs are similar, manual translation is still needed between the two.
Exactly-once processing semantics: these depend on the Receiver implementation and are not always guaranteed.
Checkpointing efficiency: checkpointing saves the complete RDD; no incremental storage is possible.
30
Structured streaming: a static DataFrame and a continuous DataFrame share a single API! The simplest way to perform streaming analytics is not having to reason about streaming at all.
31
High-level streaming API based on Datasets/DataFrames
- Event time, windowing, sessions, sources & sinks
- End-to-end exactly-once processing semantics
Unifies streaming, interactive, and batch queries
- Aggregate data in a stream, then serve it using JDBC
- Add, remove, or change queries at runtime
- Build and apply ML models
32
Batch ETL pipeline

// Read from a static JSON file
val input = spark.read
  .format("json")
  .load("source-path")

// Do the query
val output = input
  .select("clientid", "querytime")
  .where("querytime > 100")

// Write to a static JSON file
output.write
  .format("json")
  .save("dest-path")
33
Streaming pipeline

// readStream...load() creates a streaming DataFrame; it does not start any computation
val input = spark.readStream
  .format("json")
  .load("source-path")

val output = input
  .select("clientid", "querytime")
  .where("querytime > 100")

// writeStream...start() specifies where to output the data and starts processing
output.writeStream
  .format("json")
  .start("dest-path")
34
Structured streaming model
Input: the input stream as an append-only table
Trigger: how frequently to check the input for new data (e.g., every second)
Query: the processing operations (map, filter, window, aggregate)
Result state: the result table, updated every trigger interval
(Diagram: at each trigger t, the query runs over the data up to t and produces a complete output.)
35
Structured streaming model
Output: what part of the result state to write to the sink after each trigger interval
Output modes:
- Complete output: write the full result state each time
- Append output: write only added rows
- Delta output: write only changed rows
*Output mode applicability depends on the type of query.
(Diagram: at each trigger t, the query runs over the data up to t and emits a delta output.)
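A toy simulation of the model: the input is an append-only table, each trigger recomputes the query over all data seen so far, and the output mode decides what to emit. The trigger/complete/delta names follow the slides; everything else (the bucket-count query, the API shape) is illustrative.

```scala
// The append-only input table and the result state after the last trigger.
var inputTable = Vector.empty[Int]
var lastResult = Map.empty[Boolean, Int]

// Query: count rows with querytime > 100 vs the rest.
// Returns (complete output, delta output) for this trigger interval.
def trigger(newRows: Seq[Int]): (Map[Boolean, Int], Map[Boolean, Int]) = {
  inputTable ++= newRows                     // new data appended to the input table
  val result = inputTable.groupBy(_ > 100).map { case (k, v) => k -> v.size }
  val delta  = result.filter { case (k, n) => lastResult.get(k) != Some(n) }
  lastResult = result
  (result, delta)                            // complete mode vs delta mode
}

val (complete1, delta1) = trigger(Seq(50, 150)) // first trigger interval
```

Complete mode re-emits the whole result table each interval, while delta mode emits only the rows whose value changed since the previous trigger.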
36
Continuous aggregations
input.avg("querytime")
Continuously compute the average query time across all clients.

input.groupBy("devicemake")
  .avg("querytime")
Continuously compute the average query time for each type of device manufacturer.
37
Continuous windowed aggregations
input.groupBy(
    $"devicemake",
    window($"event-time", "10 min"))
  .avg("querytime")
Continuously compute the average query time for each type of device manufacturer over the last 10 minutes, using event time. This simplifies event-time stream processing (not possible in DStreams) and works in both streaming and batch jobs.
38
Joining streams with static data
Join streaming data from Kafka with a static dataset from a JDBC source to enrich the streaming data:

val kafkaDataset = spark.readStream
  .kafka("device-updates")
  .load()

val staticDataset = spark.read
  .jdbc("jdbc://", "device-info")

val joinedDataset = kafkaDataset.join(
  staticDataset, "devicemake")
39
Query management

val query = result.writeStream
  .format("parquet")
  .outputMode("append")
  .start("dest-path")

query: a handle to the running streaming computation. Multiple queries can be active at the same time; each query has a unique name to keep track of its state.
query.stop(): stop the query
query.awaitTermination(): wait for it to terminate
query.exception(): get the error, once terminated
query.sourceStatuses() / query.sinkStatuses(): get statuses
40
Structured Streaming Internals
41
Continuous incremental execution
The Catalyst optimizer generates the streaming logical plan (DataFrame -> Logical Plan -> Planner). The planner then generates an incremental execution plan for each input batch of data (incremental physical plans 1, 2, 3, ...). Because it goes through Catalyst, structured streaming benefits from Tungsten codegen and off-heap memory optimizations.
42
The planner periodically polls the source for new data, then schedules incremental execution over that data and the writing of results to the sink. (Diagram: incremental execution 1 covers offsets up to count 100, incremental execution 2 up to count 110; offsets are tracked in HDFS, and data arrives from Event Hub.)
43
Continuous aggregations
The planner maintains the in-memory state of aggregations and passes it from one incremental execution to the next. The state is backed by a State Store in HDFS for fault tolerance. (Diagram: incremental execution 1, covering offsets up to count 100, produces state that incremental execution 2, covering offsets up to count 110, consumes.)
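The state hand-off can be sketched as a fold over offset ranges. This is a simplification: in Spark the state lives in the HDFS-backed State Store rather than a local value, and the names here are illustrative.

```scala
// Aggregation state carried between incremental executions:
// a running count and sum, from which the average is derived.
case class AggState(count: Long, sum: Long) {
  def avg: Double = if (count == 0) 0.0 else sum.toDouble / count
}

// One incremental execution: fold a batch of query times into the state.
def incrementalExecution(state: AggState, batch: Seq[Long]): AggState =
  AggState(state.count + batch.size, state.sum + batch.sum)

val offsetRanges = Seq(Seq(100L, 110L), Seq(90L)) // two batches from the source
val finalState = offsetRanges.foldLeft(AggState(0L, 0L))(incrementalExecution)
```

Each execution only touches its own batch; the accumulated state is what lets the query answer over all data seen so far without reprocessing it.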
44
Fault tolerance: to achieve fault tolerance, every component (planner, source, state, sink) must be resilient to failure and able to recover its state or replay it.
45
Fault tolerance of the planner: the planner writes the range of offsets to an HDFS-backed WAL before execution. A planner failure fails the current execution; on restart, the planner reads the latest offset range from the WAL and re-executes that range.
46
Fault tolerance of sources: sources MUST BE replayable (like Kafka or Event Hub). In case of failure, the planner will request re-execution of the previous range of offsets, and the source must be able to provide exactly the same data.
47
Fault tolerance of state: intermediate state is stored by workers in the State Store, a transactional, versioned key-value store backed by HDFS. After a failure, the planner recovers the state from the correct version in the State Store.
48
Fault tolerance of sinks: sinks MUST BE idempotent, meaning that writing the same data twice must not result in duplication.
49
Exactly-once processing semantics
Achieved as the sum of all the requirements: offset tracking in the WAL plus versioned state management, replayable sources, and idempotent sinks together give exactly-once semantics of event processing.
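A minimal sketch of why an idempotent sink plus a replayable source yields exactly-once output: after a failure the source re-serves a committed offset range, and a sink keyed by batch id ignores the duplicate write. The batch-id-keyed map is illustrative, not Spark's actual sink API.

```scala
import scala.collection.mutable

// An idempotent sink: rows are committed under their batch id.
val sink = mutable.Map.empty[Int, Seq[String]]      // batchId -> committed rows

def writeBatch(batchId: Int, rows: Seq[String]): Unit =
  if (!sink.contains(batchId)) sink(batchId) = rows // already committed: no-op

writeBatch(0, Seq("a", "b"))
writeBatch(1, Seq("c"))
writeBatch(1, Seq("c"))                             // replay after failure: ignored
```

The replayed batch must contain exactly the same data as the original (that is what "replayable source" guarantees), so skipping the duplicate write is safe.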
50
Demo Maxim Lukiyanov
51
Roadmap
52
Structured streams roadmap
Structured Streaming in Spark 2.0 [ALPHA]
- Basic infrastructure and API
- Event time, windows, aggregations
- Append and complete output modes
- Support for a subset of batch queries
- Sources: files (Kafka coming after the 2.0 release)
- Sinks: files and an in-memory table
Structured Streaming in Spark 2.1+
- Stability and scalability
- Support for more queries
- More sources and sinks
- ML integration
53
Related content
Documentation: Structured Streaming Programming Guide
Sessions:
- [BRK3226, 10:45am] Build interactive data analysis environments using Apache Spark
- [BRK3183, 2:15pm] Leverage R and Spark in Azure HDInsight for scalable machine learning
- [BRK3248] Build successful Big Data infrastructure using Azure HDInsight
54
Free IT Pro resources: to advance your career in cloud technology
- Plan your career path: Microsoft IT Pro Career Center (cloud role mapping, expert advice on skills needed, self-paced curriculum by cloud role)
- Get started with Azure: Microsoft IT Pro Cloud Essentials ($300 Azure credits and extended trials, a 3-month Pluralsight subscription with 10 courses, a phone support incident)
- Demos and how-to videos: Microsoft Mechanics (weekly short videos and insights from Microsoft's leaders and engineers)
- Connect with peers and experts: Microsoft Tech Community
55
Please evaluate this session
Your feedback is important to us! From your PC or tablet, visit MyIgnite; from your phone, download and use the Ignite Mobile App.
56
Q & A Maxim Lukiyanov
57