Python Spark Intro for Data Science

Presentation on theme: "Python Spark Intro for Data Science"— Presentation transcript:

1 Python Spark Intro for Data Science
copyright 2017 Trainologic LTD

2 Who am I? Alex Landa
Father of two. Problem solver at Trainologic: BigData, Scala, Python, Java, DevOps. Over 12 years in the software industry as a Developer, Researcher, Team Lead and Architect. Served in 8200.

3 Spark
Spark is a cluster computing engine. It provides a high-level API in Scala, Java, Python and R, along with high-level tools: Spark SQL, MLlib, GraphX and Spark Streaming.

4 RDD
The basic abstraction in Spark is the RDD, which stands for Resilient Distributed Dataset. It is a collection of items whose source may be, for example: Hadoop (HDFS), JDBC, ElasticSearch, and more.

5 Main RDD Concepts
The main concepts regarding RDDs are: partitions, dependencies, and lazy computation.

6 Partitions
An RDD is partitioned. A partition is usually computed on a different process (usually on a different machine). This is the implementation of the distributed part of the RDD.

7 Dependencies
An RDD can depend on other RDDs. This is because the RDD is lazy: if you perform a map operation on an RDD, you get a new RDD that depends on the original one and contains only metadata (i.e., the computing function). Only when a specific command (like collect) is called is the flow actually computed.
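For example, a minimal PySpark sketch of this laziness (the values and the app name are illustrative):

from pyspark import SparkContext

sc = SparkContext(appName="lazy-demo")
numbers = sc.parallelize([1, 2, 3, 4])    # RDD #1: metadata plus a reference to the data
doubled = numbers.map(lambda x: x * 2)    # RDD #2: depends on RDD #1, nothing runs yet
print(doubled.collect())                  # the action triggers the computation: [2, 4, 6, 8]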

8 Where does it run?
Driver: executes the main program, creates the RDDs and collects the results.
Executors: execute the RDD operations and participate in the shuffle.

9 Where does it run?
(Cluster architecture diagram, taken from the Spark wiki.)

10 Map-Reduce
Map-reduce was introduced by Google. It is a model that breaks a task into sub-tasks, distributes them to be executed in parallel (map), and aggregates the results (reduce). Between the map and reduce parts, an intermediate phase called 'shuffle' can be introduced. In the shuffle phase, the output of the map operations is sorted and aggregated for the reduce part.

11 Mapper
The Mapper represents the logic to be applied to a key/value pair of input and returns an intermediate key/value output. It is analogous to the 'select' clause in SQL (with a simple 'where'). In Spark, this role is played by transformations (e.g., map, flatMap, filter).

12 Side Effects
The mapper's logic should be idempotent (i.e., without side effects) and purely functional. Functions that are passed to Spark are shipped together with their closure (dependent variables, classes, etc.), an approach that is also recommended for all code.

13 RDD from Collection
You can create an RDD from a collection: sc.parallelize(list). This takes a sequence from the driver and distributes it across the nodes. Note that the distribution is lazy, so be careful with mutable collections! Important: if it is a range collection, use the range method, as it does not create the whole collection on the driver.
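A short sketch of both options (assumes a SparkContext sc; the data and partition counts are illustrative):

data = [("a", 1), ("b", 2), ("c", 3)]
rdd = sc.parallelize(data, numSlices=4)       # ships the driver-side list and splits it into 4 partitions
big = sc.range(0, 10_000_000, numSlices=8)    # lazy numeric range, never materialized on the driver
print(rdd.count(), big.count())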

14 RDD from File
Spark supports reading files, directories and compressed files. The following methods are available out of the box:
textFile – returns an RDD[String] (lines).
wholeTextFiles – returns an RDD[(String, String)] with filename and content.
sequenceFile – Hadoop sequence files, RDD[(K, V)].
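A hedged sketch of these readers (the paths are illustrative):

lines = sc.textFile("/data/logs/*.log")            # RDD of lines
files = sc.wholeTextFiles("/data/docs/")           # RDD of (filename, content) pairs
pairs = sc.sequenceFile("/data/seq/part-00000")    # RDD of (key, value) from a Hadoop SequenceFile
print(lines.count(), files.keys().take(3))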

15 RDD Actions
The following (selected) methods evaluate the RDD (i.e., they are not lazy):
collect() – returns a list containing all the elements of the RDD. This is the main method that evaluates the RDD.
count() – returns the number of elements in the RDD.
first() – returns the first element of the RDD.
foreach(f) – applies the function f to each element of the RDD.

16 RDD Actions
isEmpty – evaluates the RDD.
max/min.
reduce((T, T) => T) – parallel reduction.
take(n) – returns the first n elements.
takeSample().
takeOrdered(n) – returns the first (smallest) n elements.
top(n) – returns the first (largest) n elements.

17 RDD Actions
countByKey – for pair RDDs.
save*File (e.g., saveAsTextFile).
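A small sketch of common actions (every call below triggers evaluation; the data and output path are illustrative):

rdd = sc.parallelize([3, 1, 4, 1, 5, 9, 2, 6])
print(rdd.count())                      # 8
print(rdd.take(3))                      # first 3 elements
print(rdd.top(2))                       # [9, 6]
print(rdd.reduce(lambda a, b: a + b))   # 31

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
print(pairs.countByKey())               # counts per key: a -> 2, b -> 1
pairs.saveAsTextFile("/tmp/pairs-out")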

18 RDD Transformations
Most of the other RDD methods are lazy, i.e., they create a new RDD with metadata. Transformations are divided into two main types: those that shuffle and those that don't.

19 Transformations without shuffle
map(func) – returns a new distributed dataset formed by passing each element of the source through the function func.
filter(func) – returns a new dataset formed by selecting those elements of the source on which func returns true.
flatMap(func) – similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
*Taken from the official Apache Spark documentation.

20 Transformations without shuffle
mapPartitions(func) – similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.
*Taken from the official Apache Spark documentation.
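A hedged sketch of these narrow transformations (the input strings are illustrative; nothing runs until collect()):

words = sc.parallelize(["spark makes", "big data", "simple"])
tokens = words.flatMap(lambda line: line.split())   # 0..n outputs per input element
upper = tokens.map(lambda w: w.upper())
short = upper.filter(lambda w: len(w) <= 5)

def count_per_partition(it):
    # mapPartitions receives an iterator per partition and must return an iterator
    yield sum(1 for _ in it)

print(short.collect())
print(tokens.mapPartitions(count_per_partition).collect())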

21 Shuffle
Shuffle operations repartition the data across the network. They can be very expensive operations in Spark, so you must be aware of where and why a shuffle happens. Order is not guaranteed inside a partition. Popular operations that cause a shuffle are: groupBy*, reduceBy*, sort*, aggregateBy*, and join/intersect operations on multiple RDDs.

22 Shuffle transformations
distinct([numTasks]) – returns a new dataset that contains the distinct elements of the source dataset.
groupByKey([numTasks]) – when called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
reduceByKey(func, [numTasks]) – when called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V.
*Taken from the official Apache Spark documentation.

23 Shuffle transformations
join(otherDataset, [numTasks]) – when called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin and fullOuterJoin.
sort, sortByKey, ...
More in the Spark programming guide.
*Taken from the official Apache Spark documentation.
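A small sketch of these wide transformations (the keys and values are illustrative):

sales = sc.parallelize([("us", 10), ("il", 5), ("us", 7), ("de", 3)])
totals = sales.reduceByKey(lambda a, b: a + b)     # combines locally before shuffling
grouped = sales.groupByKey().mapValues(list)       # ships all values per key across the network

regions = sc.parallelize([("us", "Americas"), ("il", "EMEA"), ("de", "EMEA")])
joined = totals.join(regions)                      # (key, (total, region))
print(joined.sortByKey().collect())

Where both work, reduceByKey is usually preferred over groupByKey because it aggregates within each partition before the shuffle, so less data moves across the network.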

24 Jobs, Stages and Tasks
A job in Spark is the execution of an action. Each action is seen in the UI as a job, named after the action method. A job is split into tasks (the transformations applied per partition), and tasks are grouped into stages.

25 Stage
It is important to understand that the shuffle operation is often the most expensive part of the execution, so everything that can be done without shuffling should stay local. A stage is a group of tasks that can be run serially on a partition.

26 Caching
One of the strongest features of Spark is the ability to cache an RDD. Spark can cache the items of the RDD in memory or on disk, which lets you avoid expensive re-calculations. cache() and persist() store the content in memory; you can choose a different storage level by supplying it to the persist method.

27 Caching
You can store the RDD in memory, on disk, or try memory first and fall back to disk. In memory it can be stored either deserialized or serialized; note that serialized is more space-efficient but works the CPU harder. You can also specify that the cache be replicated, resulting in faster recovery in case of partition loss.

28 In-memory Caching
When using memory as cache storage, you can choose between a serialized and a deserialized state. The serialized state occupies less memory at the expense of more CPU. However, even in deserialized mode, Spark still needs to figure out the size of the RDD partition; it uses the SizeEstimator utility for this purpose, which can sometimes be quite expensive (for complicated data structures).
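A minimal caching sketch (the path is illustrative; the storage level shown is a standard PySpark constant):

from pyspark import StorageLevel

logs = sc.textFile("/data/logs/*.log")
errors = logs.filter(lambda line: "ERROR" in line)
errors.persist(StorageLevel.MEMORY_AND_DISK)   # spill to disk if it does not fit in memory
print(errors.count())    # the first action materializes and caches the partitions
print(errors.take(5))    # served from the cache, the files are not re-read
errors.unpersist()       # free the cache when it is no longer needed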

29 unpersist
Caching is based on LRU. If you don't need a cache anymore, remove it with the unpersist() method.

30 Spark SQL
Spark SQL provides a unified engine (Catalyst) with 3 APIs: SQL, DataFrames (untyped) and Datasets (typed). As of Spark 2.0, Datasets are available only in Scala and Java.

31 DataFrames and Datasets
Originally Spark provided only DataFrames. A DataFrame is conceptually a table with typed columns; however, it is not typed at compilation. Starting with Spark 1.6, Datasets were introduced: a Dataset also represents a table with columns, but the row is typed at compile time. Datasets can be used in Scala and Java only.

32 SparkSession
The entry point to Spark SQL. It doesn't require a SparkContext (one is managed within it). It supports binding to the current thread (an InheritableThreadLocal) for use with getOrCreate(), and supports configuration settings (SparkConf, app name and master).
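A minimal SparkSession sketch (the app name, master and config value are illustrative):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("intro-demo")
         .master("local[*]")
         .config("spark.sql.shuffle.partitions", "8")
         .getOrCreate())
sc = spark.sparkContext   # the underlying SparkContext is managed for you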

33 Creating DataFrames
DataFrames are created using the Spark session:
df = spark.read.json("/data/people.json")
df = spark.read.csv("/data/people.csv")
df = spark.read...
To create a DataFrame from an existing RDD, map it to an RDD of Row:
rdd.map(lambda i: Row(single=i, double=i ** 2)).toDF()
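A hedged end-to-end sketch of the RDD route (the column names and values are illustrative):

from pyspark.sql import Row

rows = sc.parallelize(range(1, 6)).map(lambda i: Row(single=i, double=i ** 2))
df = rows.toDF()   # column names and types are inferred from the Row objects
df.show()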

34 Schema
In many cases Spark can infer the schema of the data by itself (using reflection or sampling). Use printSchema() to explore the schema that Spark deduced. Spark also lets you provide the schema yourself, and lets you cast types to change an existing schema: df.withColumn("c1", df["c1"].cast(IntegerType()))
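A sketch of inspecting and adjusting a schema (the file path and column names are illustrative):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

df = spark.read.json("/data/people.json")
df.printSchema()                                           # show what Spark inferred
df = df.withColumn("age", df["age"].cast(IntegerType()))   # cast an existing column

# Or provide the schema up front instead of letting Spark infer it:
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
people = spark.read.schema(schema).json("/data/people.json")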

35 Exploring the DataFrame
Filtering by columns:
df.filter("column > 2")
df.filter(df.column > 2)
df.filter(df["column"] > 2)
Selecting specific fields:
df.select(df["column1"], df["column2"])

36 Exploring the DataFrame
Aggregations:
df.groupBy(..).sum/count/avg/max/min/.. – aggregation per group
df.agg(functions.min(df["col"])) – calculates a single aggregate value
Aliasing:
df.alias("table1")
df.withColumnRenamed("old1", "new1")
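A hedged aggregation sketch (the DataFrame and column names are illustrative):

from pyspark.sql import functions as F

sales = spark.createDataFrame(
    [("us", 10.0), ("il", 5.0), ("us", 7.0)], ["region", "amount"])

per_region = sales.groupBy("region").agg(
    F.sum("amount").alias("total"),
    F.count("*").alias("orders"))
overall_min = sales.agg(F.min(sales["amount"]))   # a single whole-DataFrame aggregate

per_region.show()
overall_min.show()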

37 Running SQL queries on a DataFrame
Register the DataFrame as a SQL table: df.createOrReplaceTempView("table1")
Remember the Spark session builder? Now we can run SQL queries: spark.sql("select count(*) from table1")
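A minimal SQL-on-DataFrame sketch (the view name, path and query are illustrative):

people = spark.read.json("/data/people.json")
people.createOrReplaceTempView("people")
adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()
print(spark.sql("SELECT count(*) AS n FROM people").collect())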

38 Query Planning
All of the Spark SQL APIs (SQL, Datasets and DataFrames) result in a Catalyst execution. Query execution is composed of 4 stages: logical plan, optimized logical plan, physical plan, and code generation (Tungsten).

39 Logical Plan
In the first stage, Catalyst creates a logical execution plan: a tree composed of query primitives and combinators that match the requested query.

40 Optimized Logical Plan
Catalyst runs a set of rules to optimize the logical plan, e.g.: merge projections, push down filters, convert outer joins to inner joins.

41 Physical Plan
The physical plan is composed of RDD operations. There are usually several alternative physical plans for each logical plan; Spark uses a cost analyzer to decide which physical plan to use (resource estimation).
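You can inspect these plans yourself with explain() (the query is illustrative):

df = spark.read.json("/data/people.json")
query = df.filter(df["age"] > 21).select("name")
query.explain(True)   # prints the parsed, analyzed and optimized logical plans and the physical plan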

42 Thank You!

