Python Spark Intro for Data Science

Python Spark Intro for Data Science copyright 2017 Trainologic LTD

Who am I? Alex Landa. Father of two. Problem solver at Trainologic: BigData, Scala, Python, Java, DevOps. Over 12 years in the software industry: Developer, Researcher, Team Lead and Architect. Served in 8200. 2 copyright 2017 Trainologic LTD

Spark Spark is a cluster computing engine. Provides high-level APIs in Scala, Java, Python and R. Provides high-level tools: Spark SQL, MLlib, GraphX and Spark Streaming. 3 copyright 2017 Trainologic LTD

RDD The basic abstraction in Spark is the RDD. It stands for Resilient Distributed Dataset. It is a collection of items whose source may be, for example: Hadoop (HDFS), JDBC, Elasticsearch, and more… 4 copyright 2017 Trainologic LTD

Main RDD Concepts The main concepts regarding an RDD: Partitions. Dependencies. Lazy computation. 5 copyright 2017 Trainologic LTD

Partitions An RDD is partitioned. Each partition is usually computed in a different process (often on a different machine). This is the implementation of the distributed part of the RDD. 6 copyright 2017 Trainologic LTD

Dependencies An RDD can depend on other RDDs. This is because RDDs are lazy: if you perform a map operation on an RDD you get a new RDD which depends on the original one, and it contains only metadata (i.e., the computing function). Only when a specific command (like collect) is issued is the flow actually computed. 7 copyright 2017 Trainologic LTD
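A minimal sketch of this laziness in PySpark, assuming an existing SparkContext named sc:

    nums = sc.parallelize([1, 2, 3, 4])    # source RDD
    squares = nums.map(lambda x: x * x)    # new RDD holding only metadata; nothing runs yet
    print(squares.collect())               # the action triggers the computation: [1, 4, 9, 16]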

Where does it run? Driver: executes the main program, creates the RDDs, collects the results. Executors: execute the RDD operations, participate in the shuffle. 8 copyright 2017 Trainologic LTD

Where does it run? [Diagram of PySpark driver and executor internals] Taken from the Spark wiki - https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals 9 copyright 2017 Trainologic LTD

Spark Map-Reduce Map-reduce was introduced by Google. It is an interface that can break a task into sub-tasks, distribute them to be executed in parallel (map), and aggregate the results (reduce). Between the Map and Reduce parts, an intermediate phase called 'shuffle' can be introduced. In the Shuffle phase, the output of the Map operations is sorted and aggregated for the Reduce part. 10 copyright 2017 Trainologic LTD
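A hedged word-count sketch showing how map, shuffle and reduce look in PySpark (sc is an existing SparkContext; the input path is hypothetical):

    lines = sc.textFile("/data/input.txt")
    counts = (lines.flatMap(lambda line: line.split())   # map: line -> words
                   .map(lambda word: (word, 1))          # map: word -> (word, 1)
                   .reduceByKey(lambda a, b: a + b))     # shuffle + per-key reduce
    print(counts.take(5))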

Spark Mapper The Mapper represents the logic applied to a key/value pair of input. It returns an intermediate key/value output. It is like the 'select' clause in SQL (with a simple 'where'). In Spark these are the transformations (e.g., map, flatMap, filter). 11 copyright 2017 Trainologic LTD

Spark Side Effects The mapper's logic should be idempotent (i.e., without side effects) and purely functional. Functions that are passed to Spark are shipped with their closure (dependent variables, classes, etc.), an approach that is also recommended for all code. 12 copyright 2017 Trainologic LTD

RDD from Collection You can create an RDD from a collection: sc.parallelize(list). It takes a sequence from the driver and distributes it across the nodes. Note: the distribution is lazy, so be careful with mutable collections! Important: if it is a range collection, use the range method, as it does not create the collection on the driver. 13 copyright 2017 Trainologic LTD
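A minimal sketch of both options, assuming an existing SparkContext sc:

    rdd = sc.parallelize([10, 20, 30, 40], numSlices=4)   # distribute a driver-side list
    print(rdd.count())                                    # 4

    # for numeric ranges, sc.range never materializes the list on the driver
    big = sc.range(0, 1000000, numSlices=8)
    print(big.sum())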

Spark RDD from File Spark supports reading files, directories and compressed files. The following methods are available out of the box: textFile – returns an RDD[String] (lines). wholeTextFiles – returns an RDD[(String, String)] of filename and content. sequenceFile – Hadoop sequence files, RDD[(K,V)]. 14 copyright 2017 Trainologic LTD
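A hedged sketch of the file-reading methods; the paths below are hypothetical:

    lines = sc.textFile("/data/logs/*.txt")      # RDD of lines; globs and compressed files are supported
    files = sc.wholeTextFiles("/data/docs/")     # RDD of (filename, whole-file content)
    pairs = sc.sequenceFile("/data/pairs.seq")   # RDD of (key, value) from a Hadoop sequence file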

Spark RDD Actions The following (selected) methods evaluate the RDD (they are not lazy): collect() – returns a list containing all the elements of the RDD. This is the main method that evaluates the RDD. count() – returns the number of elements in the RDD. first() – returns the first element of the RDD. foreach(f) – performs the function on each element of the RDD. 15 copyright 2017 Trainologic LTD
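A short sketch of these basic actions, assuming an existing SparkContext sc:

    rdd = sc.parallelize(["a", "b", "c"])
    print(rdd.collect())             # ['a', 'b', 'c'] - brings all elements to the driver
    print(rdd.count())               # 3
    print(rdd.first())               # 'a'
    rdd.foreach(print)               # runs on the executors; output appears in executor logs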

RDD Actions isEmpty – evaluates the RDD. max/min. reduce((T,T) => T) – parallel reduction. take(n) – returns the first n elements. takeSample(). takeOrdered(n) – returns the smallest n elements. top(n) – returns the largest n elements. 16 copyright 2017 Trainologic LTD
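And a sketch of these remaining actions on a small RDD:

    rdd = sc.parallelize([5, 3, 8, 1, 9])
    print(rdd.reduce(lambda a, b: a + b))   # 26 - parallel reduction
    print(rdd.take(2))                      # [5, 3] - first two elements
    print(rdd.takeOrdered(2))               # [1, 3] - two smallest
    print(rdd.top(2))                       # [9, 8] - two largest
    print(rdd.isEmpty())                    # False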

RDD Actions countByKey – for pair RDDs. save*File (e.g., saveAsTextFile). 17 copyright 2017 Trainologic LTD

RDD Transformations Most of the other RDD methods are lazy, i.e., they create a new RDD with metadata. Transformations are divided into two main types: those that shuffle and those that don't. 18 copyright 2017 Trainologic LTD

Transformations without shuffle map(func) - Return a new distributed dataset formed by passing each element of the source through a function func. filter(func) - Return a new dataset formed by selecting those elements of the source on which func returns true. flatMap(func) - Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item). *Taken from the official Apache Spark documentation 19 copyright 2017 Trainologic LTD
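A minimal sketch of these three transformations, assuming an existing SparkContext sc:

    sentences = sc.parallelize(["spark is fast", "spark is lazy"])
    print(sentences.map(lambda s: s.upper()).collect())        # exactly one output per input
    print(sentences.filter(lambda s: "fast" in s).collect())   # keeps only matching elements
    print(sentences.flatMap(lambda s: s.split()).collect())    # 0..n outputs per input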

Transformations without shuffle mapPartitions(func) - Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T. *Taken from the official Apache Spark documentation 20 copyright 2017 Trainologic LTD
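A hedged mapPartitions sketch; func receives an iterator over one partition and must return an iterator:

    rdd = sc.parallelize(range(10), numSlices=2)

    def sum_partition(it):
        # per-partition work, e.g. a good place to reuse one DB connection per partition
        yield sum(it)

    print(rdd.mapPartitions(sum_partition).collect())   # one sum per partition, e.g. [10, 35]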

Shuffle Shuffle operations repartition the data across the network. They can be very expensive operations in Spark, so you must be aware of where and why a shuffle happens. Order is not guaranteed inside a partition. Popular operations that cause a shuffle are: groupBy*, reduceBy*, sort*, aggregateBy* and join/intersect operations on multiple RDDs. 21 copyright 2017 Trainologic LTD

Shuffle transformations distinct([numTasks]) - Return a new dataset that contains the distinct elements of the source dataset. groupByKey([numTasks]) - When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. reduceByKey(func, [numTasks]) - When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. *Taken from the official Apache Spark documentation 22 copyright 2017 Trainologic LTD

Shuffle transformations join(otherDataset, [numTasks]) - When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin. sortBy, sortByKey… More at http://spark.apache.org/docs/latest/programming-guide.html *Taken from the official Apache Spark documentation 23 copyright 2017 Trainologic LTD
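A short sketch of the common shuffle transformations on pair RDDs, assuming an existing SparkContext sc (result ordering may vary):

    sales = sc.parallelize([("a", 3), ("b", 5), ("a", 2)])
    regions = sc.parallelize([("a", "EU"), ("b", "US")])
    print(sales.reduceByKey(lambda x, y: x + y).collect())   # e.g. [('a', 5), ('b', 5)]
    print(sales.groupByKey().mapValues(list).collect())      # e.g. [('a', [3, 2]), ('b', [5])]
    print(sales.join(regions).collect())                     # e.g. [('a', (3, 'EU')), ('a', (2, 'EU')), ('b', (5, 'US'))]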

Jobs, Stages and Tasks A Job in Spark is the execution of an action. Each action is seen in the UI as a job, named after the action method. A job is split into tasks (covering all the transformations). Tasks are grouped into stages. 24 copyright 2017 Trainologic LTD

Spark Stage A stage is a group of tasks that can run serially on a partition. It is important to understand that the shuffle is often the expensive part of the execution, so everything that can be done without shuffling should stay local. 25 copyright 2017 Trainologic LTD

Spark Caching One of the strongest features of Spark is the ability to cache an RDD. Spark can cache the items of the RDD in memory or on disk, letting you avoid expensive re-calculations. The cache() and persist() methods store the content in memory; you can choose a different storage level by passing it to persist(). 26 copyright 2017 Trainologic LTD

Spark Caching You can store the RDD in memory, on disk, or try memory first and fall back to disk on failure. In memory it can be stored either deserialized or serialized; serialized is more space-efficient but makes the CPU work harder. You can also specify that the cache be replicated, resulting in faster recovery in case of partition loss. 27 copyright 2017 Trainologic LTD
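A hedged caching sketch with an explicit storage level (the input path is hypothetical):

    from pyspark import StorageLevel

    lengths = sc.textFile("/data/big.txt").map(lambda line: len(line))
    lengths.persist(StorageLevel.MEMORY_AND_DISK)   # keep in memory, spill to disk if needed

    print(lengths.count())   # the first action materializes and caches the RDD
    print(lengths.sum())     # reuses the cached partitions instead of re-reading the file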

Spark In-memory Caching When using memory as cache storage, you have the option of either a serialized or deserialized state. The serialized state occupies less memory at the expense of more CPU. However, even in deserialized mode, Spark still needs to figure out the size of the RDD partition; it uses the SizeEstimator utility for this purpose, which can sometimes be quite expensive (for complicated data structures). 28 copyright 2017 Trainologic LTD

unpersist Cache eviction is based on LRU. If you no longer need a cache, remove it with the unpersist() method. 29 copyright 2017 Trainologic LTD

Spark SQL Spark SQL provides a unified engine (Catalyst) with 3 APIs: SQL. DataFrames (untyped). Datasets (typed) – Scala and Java only as of Spark 2.0. 30 copyright 2017 Trainologic LTD

DataFrames and Datasets Originally Spark provided only DataFrames. A DataFrame is conceptually a table with typed columns; however, it is not typed at compile time. Starting with Spark 1.6, Datasets were introduced: they also represent a table with columns, but the row is typed at compile time. Datasets can be used in Scala and Java only. 31 copyright 2017 Trainologic LTD

SparkSession The entry point to Spark SQL. It doesn't require a SparkContext (one is managed within it). Supports binding to the current thread (InheritableThreadLocal) for use with getOrCreate(). Supports configuration settings (SparkConf, app name and master). 32 copyright 2017 Trainologic LTD
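A minimal sketch of creating a SparkSession with the builder (the app name and config value are illustrative):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("python-spark-intro")
             .master("local[*]")
             .config("spark.sql.shuffle.partitions", "8")
             .getOrCreate())

    sc = spark.sparkContext   # the underlying SparkContext is managed by the session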

Creating DataFrames DataFrames are created using the Spark session: df = spark.read.json("/data/people.json") df = spark.read.csv("/data/people.csv") df = spark.read.x.. To create a DataFrame from an existing RDD you have to map it to an RDD of Row: ...map(lambda i: Row(single=i, double=i ** 2)).toDF() 33 copyright 2017 Trainologic LTD
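A complete hedged version of the RDD-to-DataFrame snippet, assuming the SparkSession created above:

    from pyspark.sql import Row

    rdd = spark.sparkContext.parallelize(range(5))
    df = rdd.map(lambda i: Row(single=i, double=i ** 2)).toDF()
    df.show()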

Spark Schema In many cases Spark can infer the schema of the data by itself (using reflection or sampling). Use printSchema() to explore the schema that Spark deduced. Spark also lets you provide the schema yourself, and cast types to change an existing schema: df.withColumn("c1", df["c1"].cast(IntegerType())) 34 copyright 2017 Trainologic LTD
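A hedged sketch of providing the schema yourself (the column names and path are hypothetical):

    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    schema = StructType([
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True),
    ])
    df = spark.read.csv("/data/people.csv", schema=schema, header=True)
    df.printSchema()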

Exploring the DataFrame Filtering by columns: df.filter("column > 2") df.filter(df.column > 2) df.filter(df["column"] > 2) Selecting specific fields: df.select(df["column1"], df["column2"]) 35 copyright 2017 Trainologic LTD

Exploring the DataFrame Aggregations: df.groupBy(..).sum/count/avg/max/min/.. – aggregation on a group df.agg(functions.min(df["col"])) – calculates a single value over the whole DataFrame Aliasing: df.alias("table1") df.withColumnRenamed("old1", "new1") 36 copyright 2017 Trainologic LTD
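A short hedged aggregation example on a toy DataFrame:

    from pyspark.sql import functions as F

    df = spark.createDataFrame([("a", 1), ("a", 3), ("b", 5)], ["key", "val"])
    df.groupBy("key").sum("val").show()          # per-group aggregation
    df.agg(F.min(df["val"])).show()              # a single value over the whole DataFrame
    df.withColumnRenamed("val", "value").show()  # rename a column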

Running SQL queries on a DataFrame Register the DF as an SQL table: df.createOrReplaceTempView("table1") Remember the Spark session? SparkSession.builder.. Now we can run SQL queries: spark.sql("select count(*) from table1") 37 copyright 2017 Trainologic LTD

Spark Query Planning All of the Spark SQL APIs (SQL, Datasets and DataFrames) result in a Catalyst execution. Query execution is composed of 4 stages: Logical plan. Optimized logical plan. Physical plan. Code-gen (Tungsten). 38 copyright 2017 Trainologic LTD

Spark Logical Plan In the first stage, Catalyst creates a logical execution plan: a tree composed of query primitives and combinators that matches the requested query. 39 copyright 2017 Trainologic LTD

Optimized Logical Plan Catalyst runs a set of rules for optimizing the logical plan. E.g.: Merge projections. Push down filters. Convert outer joins to inner joins. 40 copyright 2017 Trainologic LTD

Physical Plan The physical plan is composed of RDD operations. There are usually several alternative physical plans for each logical plan. Spark uses a cost analyzer to decide which physical plan to use (resource estimation). 41 copyright 2017 Trainologic LTD
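You can inspect the plans Catalyst produces for a query with explain(); a small sketch using the toy DataFrame from the aggregation example:

    df = spark.createDataFrame([("a", 1), ("b", 5)], ["key", "val"])
    query = df.filter(df["val"] > 2).groupBy("key").count()
    query.explain(True)   # prints the parsed, analyzed and optimized logical plans plus the physical plan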

Spark Thank You! 42 copyright 2017 Trainologic LTD