Python Spark Intro for Data Science

Presentation on theme: "Python Spark Intro for Data Science"— Presentation transcript:

1 Python Spark Intro for Data Science
copyright 2017 Trainologic LTD

2 Who am I? Alex Landa
Father of two. Problem solver at Trainologic: BigData, Scala, Python, Java, DevOps. Over 12 years in the software industry as a Developer, Researcher, Team Lead and Architect. Served in 8200.

3 Spark
Spark is a cluster computing engine. It provides a high-level API in Scala, Java, Python and R, along with high-level tools: Spark SQL, MLlib, GraphX and Spark Streaming.

4 RDD
The basic abstraction in Spark is the RDD, which stands for Resilient Distributed Dataset. It is a collection of items whose source may be, for example: Hadoop (HDFS), JDBC, ElasticSearch, and more.

5 Main RDD Concepts
The main concepts regarding RDDs are: partitions, dependencies, and lazy computation.

6 Partitions
An RDD is partitioned. A partition is usually computed on a different process (usually on a different machine). This is the implementation of the distributed part of the RDD.

7 Dependencies
An RDD can depend on other RDDs. This is because the RDD is lazy: if you perform a map operation on an RDD, you get a new RDD that depends on the original one and contains only metadata (i.e., the computing function). Only when a specific command (like collect) is called is the flow actually computed.
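For example, a minimal PySpark sketch of this laziness (the values and the app name are illustrative):

from pyspark import SparkContext

sc = SparkContext(appName="lazy-demo")
numbers = sc.parallelize([1, 2, 3, 4])    # RDD #1: metadata plus a reference to the data
doubled = numbers.map(lambda x: x * 2)    # RDD #2: depends on RDD #1, nothing runs yet
print(doubled.collect())                  # the action triggers the computation: [2, 4, 6, 8]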

8 Where does it run?
Driver: executes the main program, creates the RDDs and collects the results.
Executors: execute the RDD operations and participate in the shuffle.

9 Where does it run?
(Cluster architecture diagram, taken from the Spark wiki.)

10 Map-Reduce
Map-reduce was introduced by Google. It is a model that breaks a task into sub-tasks, distributes them to be executed in parallel (map), and aggregates the results (reduce). Between the map and reduce parts, an intermediate phase called 'shuffle' can be introduced. In the shuffle phase, the output of the map operations is sorted and aggregated for the reduce part.

11 Mapper
The Mapper represents the logic to be applied to a key/value pair of input and returns an intermediate key/value output. It is analogous to the 'select' clause in SQL (with a simple 'where'). In Spark, this role is played by transformations (e.g., map, flatMap, filter).

12 Side Effects
The mapper's logic should be idempotent (i.e., without side effects) and purely functional. Functions that are passed to Spark are shipped together with their closure (dependent variables, classes, etc.), an approach that is also recommended for all code.

13 RDD from Collection
You can create an RDD from a collection: sc.parallelize(list). This takes a sequence from the driver and distributes it across the nodes. Note that the distribution is lazy, so be careful with mutable collections! Important: if it is a range collection, use the range method, as it does not create the whole collection on the driver.
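A short sketch of both options (assumes a SparkContext sc; the data and partition counts are illustrative):

data = [("a", 1), ("b", 2), ("c", 3)]
rdd = sc.parallelize(data, numSlices=4)       # ships the driver-side list and splits it into 4 partitions
big = sc.range(0, 10_000_000, numSlices=8)    # lazy numeric range, never materialized on the driver
print(rdd.count(), big.count())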

14 RDD from File
Spark supports reading files, directories and compressed files. The following methods are available out of the box:
textFile – returns an RDD[String] (lines).
wholeTextFiles – returns an RDD[(String, String)] with filename and content.
sequenceFile – Hadoop sequence files, RDD[(K, V)].
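A hedged sketch of these readers (the paths are illustrative):

lines = sc.textFile("/data/logs/*.log")            # RDD of lines
files = sc.wholeTextFiles("/data/docs/")           # RDD of (filename, content) pairs
pairs = sc.sequenceFile("/data/seq/part-00000")    # RDD of (key, value) from a Hadoop SequenceFile
print(lines.count(), files.keys().take(3))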

15 RDD Actions
The following (selected) methods evaluate the RDD (i.e., they are not lazy):
collect() – returns a list containing all the elements of the RDD. This is the main method that evaluates the RDD.
count() – returns the number of elements in the RDD.
first() – returns the first element of the RDD.
foreach(f) – applies the function f to each element of the RDD.

16 RDD Actions
isEmpty – evaluates the RDD.
max/min.
reduce((T, T) => T) – parallel reduction.
take(n) – returns the first n elements.
takeSample().
takeOrdered(n) – returns the first (smallest) n elements.
top(n) – returns the first (largest) n elements.

17 RDD Actions
countByKey – for pair RDDs.
save*File (e.g., saveAsTextFile).
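A small sketch of common actions (every call below triggers evaluation; the data and output path are illustrative):

rdd = sc.parallelize([3, 1, 4, 1, 5, 9, 2, 6])
print(rdd.count())                      # 8
print(rdd.take(3))                      # first 3 elements
print(rdd.top(2))                       # [9, 6]
print(rdd.reduce(lambda a, b: a + b))   # 31

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
print(pairs.countByKey())               # counts per key: a -> 2, b -> 1
pairs.saveAsTextFile("/tmp/pairs-out")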

18 RDD Transformations
Most of the other RDD methods are lazy, i.e., they create a new RDD with metadata. Transformations are divided into two main types: those that shuffle and those that don't.

19 Transformations without shuffle
map(func) – returns a new distributed dataset formed by passing each element of the source through the function func.
filter(func) – returns a new dataset formed by selecting those elements of the source on which func returns true.
flatMap(func) – similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
*Taken from the official Apache Spark documentation.

20 Transformations without shuffle
mapPartitions(func) – similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.
*Taken from the official Apache Spark documentation.
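A hedged sketch of these narrow transformations (the input strings are illustrative; nothing runs until collect()):

words = sc.parallelize(["spark makes", "big data", "simple"])
tokens = words.flatMap(lambda line: line.split())   # 0..n outputs per input element
upper = tokens.map(lambda w: w.upper())
short = upper.filter(lambda w: len(w) <= 5)

def count_per_partition(it):
    # mapPartitions receives an iterator per partition and must return an iterator
    yield sum(1 for _ in it)

print(short.collect())
print(tokens.mapPartitions(count_per_partition).collect())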

21 Shuffle
Shuffle operations repartition the data across the network. They can be very expensive operations in Spark, so you must be aware of where and why a shuffle happens. Order is not guaranteed inside a partition. Popular operations that cause a shuffle are: groupBy*, reduceBy*, sort*, aggregateBy*, and join/intersect operations on multiple RDDs.

22 Shuffle transformations
distinct([numTasks]) – returns a new dataset that contains the distinct elements of the source dataset.
groupByKey([numTasks]) – when called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
reduceByKey(func, [numTasks]) – when called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V.
*Taken from the official Apache Spark documentation.

23 Shuffle transformations
join(otherDataset, [numTasks]) – when called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin and fullOuterJoin.
sort, sortByKey, ...
More in the Spark programming guide.
*Taken from the official Apache Spark documentation.
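A small sketch of these wide transformations (the keys and values are illustrative):

sales = sc.parallelize([("us", 10), ("il", 5), ("us", 7), ("de", 3)])
totals = sales.reduceByKey(lambda a, b: a + b)     # combines locally before shuffling
grouped = sales.groupByKey().mapValues(list)       # ships all values per key across the network

regions = sc.parallelize([("us", "Americas"), ("il", "EMEA"), ("de", "EMEA")])
joined = totals.join(regions)                      # (key, (total, region))
print(joined.sortByKey().collect())

Where both work, reduceByKey is usually preferred over groupByKey because it aggregates within each partition before the shuffle, so less data moves across the network.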

24 Jobs, Stages and Tasks
A job in Spark is the execution of an action. Each action is seen in the UI as a job, named after the action method. A job is split into tasks (the transformations applied per partition), and tasks are grouped into stages.

25 Stage
It is important to understand that the shuffle operation is often the most expensive part of the execution, so everything that can be done without shuffling should stay local. A stage is a group of tasks that can be run serially on a partition.

26 Caching
One of the strongest features of Spark is the ability to cache an RDD. Spark can cache the items of the RDD in memory or on disk, which lets you avoid expensive re-calculations. cache() and persist() store the content in memory; you can choose a different storage level by supplying it to the persist method.

27 Caching
You can store the RDD in memory, on disk, or try memory first and fall back to disk. In memory it can be stored either deserialized or serialized; note that serialized is more space-efficient but works the CPU harder. You can also specify that the cache be replicated, resulting in faster recovery in case of partition loss.

28 In-memory Caching
When using memory as cache storage, you can choose between a serialized and a deserialized state. The serialized state occupies less memory at the expense of more CPU. However, even in deserialized mode, Spark still needs to figure out the size of the RDD partition; it uses the SizeEstimator utility for this purpose, which can sometimes be quite expensive (for complicated data structures).
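A minimal caching sketch (the path is illustrative; the storage level shown is a standard PySpark constant):

from pyspark import StorageLevel

logs = sc.textFile("/data/logs/*.log")
errors = logs.filter(lambda line: "ERROR" in line)
errors.persist(StorageLevel.MEMORY_AND_DISK)   # spill to disk if it does not fit in memory
print(errors.count())    # the first action materializes and caches the partitions
print(errors.take(5))    # served from the cache, the files are not re-read
errors.unpersist()       # free the cache when it is no longer needed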

29 unpersist
Caching is based on LRU. If you don't need a cache anymore, remove it with the unpersist() method.

30 Spark SQL
Spark SQL provides a unified engine (Catalyst) with 3 APIs: SQL, DataFrames (untyped) and Datasets (typed). As of Spark 2.0, Datasets are available only in Scala and Java.

31 DataFrames and Datasets
Originally Spark provided only DataFrames. A DataFrame is conceptually a table with typed columns; however, it is not typed at compilation. Starting with Spark 1.6, Datasets were introduced: a Dataset also represents a table with columns, but the row is typed at compile time. Datasets can be used in Scala and Java only.

32 SparkSession
The entry point to Spark SQL. It doesn't require a SparkContext (one is managed within it). It supports binding to the current thread (an InheritableThreadLocal) for use with getOrCreate(), and supports configuration settings (SparkConf, app name and master).
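A minimal SparkSession sketch (the app name, master and config value are illustrative):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("intro-demo")
         .master("local[*]")
         .config("spark.sql.shuffle.partitions", "8")
         .getOrCreate())
sc = spark.sparkContext   # the underlying SparkContext is managed for you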

33 Creating DataFrames
DataFrames are created using the Spark session:
df = spark.read.json("/data/people.json")
df = spark.read.csv("/data/people.csv")
df = spark.read...
To create a DataFrame from an existing RDD, map it to an RDD of Row:
rdd.map(lambda i: Row(single=i, double=i ** 2)).toDF()
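A hedged end-to-end sketch of the RDD route (the column names and values are illustrative):

from pyspark.sql import Row

rows = sc.parallelize(range(1, 6)).map(lambda i: Row(single=i, double=i ** 2))
df = rows.toDF()   # column names and types are inferred from the Row objects
df.show()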

34 Schema
In many cases Spark can infer the schema of the data by itself (using reflection or sampling). Use printSchema() to explore the schema that Spark deduced. Spark also lets you provide the schema yourself, and lets you cast types to change an existing schema: df.withColumn("c1", df["c1"].cast(IntegerType()))
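A sketch of inspecting and adjusting a schema (the file path and column names are illustrative):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

df = spark.read.json("/data/people.json")
df.printSchema()                                           # show what Spark inferred
df = df.withColumn("age", df["age"].cast(IntegerType()))   # cast an existing column

# Or provide the schema up front instead of letting Spark infer it:
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
people = spark.read.schema(schema).json("/data/people.json")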

35 Exploring the DataFrame
Filtering by columns:
df.filter("column > 2")
df.filter(df.column > 2)
df.filter(df["column"] > 2)
Selecting specific fields:
df.select(df["column1"], df["column2"])

36 Exploring the DataFrame
Aggregations:
df.groupBy(..).sum/count/avg/max/min/.. – aggregation per group
df.agg(functions.min(df["col"])) – calculates a single aggregate value
Aliasing:
df.alias("table1")
df.withColumnRenamed("old1", "new1")
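A hedged aggregation sketch (the DataFrame and column names are illustrative):

from pyspark.sql import functions as F

sales = spark.createDataFrame(
    [("us", 10.0), ("il", 5.0), ("us", 7.0)], ["region", "amount"])

per_region = sales.groupBy("region").agg(
    F.sum("amount").alias("total"),
    F.count("*").alias("orders"))
overall_min = sales.agg(F.min(sales["amount"]))   # a single whole-DataFrame aggregate

per_region.show()
overall_min.show()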

37 Running SQL queries on a DataFrame
Register the DataFrame as a SQL table: df.createOrReplaceTempView("table1")
Remember the Spark session builder? Now we can run SQL queries: spark.sql("select count(*) from table1")
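A minimal SQL-on-DataFrame sketch (the view name, path and query are illustrative):

people = spark.read.json("/data/people.json")
people.createOrReplaceTempView("people")
adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()
print(spark.sql("SELECT count(*) AS n FROM people").collect())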

38 Query Planning
All of the Spark SQL APIs (SQL, Datasets and DataFrames) result in a Catalyst execution. Query execution is composed of 4 stages: logical plan, optimized logical plan, physical plan, and code generation (Tungsten).

39 Logical Plan
In the first stage, Catalyst creates a logical execution plan: a tree composed of query primitives and combinators that match the requested query.

40 Optimized Logical Plan
Catalyst runs a set of rules to optimize the logical plan, e.g.: merge projections, push down filters, convert outer joins to inner joins.

41 Physical Plan
The physical plan is composed of RDD operations. There are usually several alternative physical plans for each logical plan; Spark uses a cost analyzer to decide which physical plan to use (resource estimation).
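You can inspect these plans yourself with explain() (the query is illustrative):

df = spark.read.json("/data/people.json")
query = df.filter(df["age"] > 21).select("name")
query.explain(True)   # prints the parsed, analyzed and optimized logical plans and the physical plan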

42 Thank You!

