Concept & Examples of pyspark
Apache Spark? Roughly speaking, Spark = RDD + Interface
Spark supports an RDD-based processing interface. This presentation is based on pyspark.
Apache Spark? Roughly speaking, Spark = RDD + Interface
Essential concepts: Lazy Execution, Immutable, Lineage
What is RDD? RDD (Resilient Distributed Dataset)
- Resilient: if data in memory is lost, it can be recreated
- Distributed: stored in memory across the cluster
- Dataset: initial data can come from a file or be created programmatically
The RDD is the fundamental unit of data in Spark.
Apache Spark? Roughly speaking, Spark = RDD + Interface
Essential concept: Immutable. An RDD cannot be modified; transformations generate new RDDs instead.
Apache Spark? Roughly speaking, Spark = RDD + Interface
Essential concept: Lazy Execution. RDD processing does not occur until an action operation is called.
Apache Spark? Roughly speaking, Spark = RDD + Interface
Essential concept: Lineage. Spark's fault-tolerance policy: each RDD logs the lineage of transformations that produced it, so lost data can be recomputed.
Creating RDD RDD can be created from:
- A file or set of files (.txt, .json, .csv, object file, etc.)
- Data in memory
- Another RDD
Creating RDD with pyspark
Initializing pyspark: define configuration options (e.g., the appName) and create a SparkContext.
Creating RDD with pyspark
Create an RDD from the local text file "purplecow.txt", then print the number of lines in "purplecow.txt".
RDD Operations Two types of RDD operations
Actions: return values
Transformations: define a new RDD based on the current ones
Action Operations with pyspark
Action operations: operations that return values.
count: return the number of elements
take(n): return an array of the first n elements
Transformation Operations with pyspark
Transformation operations: create a new RDD from an existing one.
map(function): creates a new RDD by performing a function on each record in the base RDD
filter(function): creates a new RDD by including or excluding each record in the base RDD according to a boolean function
Lazy Execution Data in RDDs is not processed until an action is performed.
Lazy Execution When using transformation functions, no RDD processing occurs; Spark only records the lineage log. When an action function is called, RDD processing is performed based on the lineage log.
Functional Programming in Spark
Key concepts: passing functions as input to other functions; anonymous functions.
Passing-functions example with pyspark: define a function 'toUpper' and pass it as input to the map function.
Functional Programming in Spark
Anonymous Functions Functions defined in-line without an identifier. Supported in Python and Scala (e.g., lambda x: …).
Working with RDDs RDDs can hold any type of element
- Primitive types: integers, characters, booleans, etc.
- Sequence types: strings, lists, arrays, tuples, dicts, etc.
- Scala/Java objects (if serializable)
- Mixed types
Some types of RDDs have additional functionality
- Pair RDDs: RDDs consisting of key-value pairs
- Double RDDs: RDDs consisting of numeric data
Some other general RDD operations
Transformations
- flatMap: maps one element in the base RDD to multiple elements
- distinct: filter out duplicates
- union: add all elements of two RDDs into a single new RDD
Other RDD operations
- first: return the first element of the RDD
- foreach: apply a function to each element in an RDD
- top(n): return the largest n elements using natural ordering
flatMap and distinct Example of flatMap and distinct, with and without the distinct step.
Pair RDD Pair RDDs are a special form of RDD
Each element must be a key-value pair (a two-element tuple). Keys and values can be any type.
Commonly used functions to create pair RDDs: map, flatMap/flatMapValues, keyBy
Pair RDD Examples Example of creating a pair RDD
Split each line of Sample.txt by '\t' and map the data into a key-value pair RDD.
Example. Keying web log by User ID
Split each line by whitespace and key it by the 3rd element.
Mapping Single Rows to Multiple Pairs
Text file format: numbers and IDs are delimited by '\t'; multiple IDs are delimited by ':'.
Mapping Single Rows to Multiple Pairs
Split the text by '\t' line by line to create a pair RDD, then apply flatMapValues with delimiter ':' to produce multiple pairs.
MapReduce - Mapping the textfile data
MapReduce - Reducing phase
MapReduce with pyspark
Other Pair RDD Operations
Example. Join Web log with KB articles
Join the requested-file field of the web logs, which matches the format 'KBDOC-[0-9]*', with the Article ID of the kblist data.
Example. Join Web log with KB articles
Mapping kblist.txt
Example. Join Web log with KB articles
Mapping the web log
Example. Join Web log with KB articles
Joining the two RDDs by key