1
Hadoop Tutorials Spark
Kacper Surdy Prasanth Kothuri
2
About the tutorial
- The third session in the Hadoop tutorial series, this time given by Kacper and Prasanth
- Fully dedicated to the Spark framework, which is extensively discussed, actively developed and used in production
- A mixture of a talk and hands-on exercises
3
What is Spark
- A framework for performing distributed computations; scalable, applicable for processing TBs of data
- Easy programming interface; supports Java, Scala, Python, R
- Varied APIs: DataFrames, SQL, MLlib, Streaming (see the example below)
- Multiple cluster deployment modes
- Multiple data sources: HDFS, Cassandra, HBase, S3
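To give a flavour of the programming interface, a minimal DataFrame sketch as it could be typed into spark-shell (sqlContext is provided by the shell; the HDFS path and the size column are made-up examples):

val df = sqlContext.read.json("hdfs:///user/demo/events.json")  // hypothetical path
df.printSchema()                      // inspect the inferred schema
df.filter(df("size") > 100).count()   // SQL-like filtering on distributed data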
4
Compared to Impala
- Similar concept of the workload distribution
- Overlap in SQL functionalities
- Spark is more flexible and offers a richer API
- Impala is fine-tuned for SQL queries
[Diagram: nodes 1 to X connected by an interconnect network, each node with its own memory, CPU and disks]
5
Evolution from MapReduce
- 2002: MapReduce at Google
- 2004: MapReduce paper
- 2006: Hadoop at Yahoo!
- 2008: Hadoop Summit
- 2010: Spark paper
- 2014: Apache Spark becomes a top-level Apache project; the same year Google notes that MapReduce has been decommissioned ("we stopped using MapReduce since …")
Based on a Databricks slide
6
Deployment modes
- Local mode: makes use of multiple cores/CPUs with thread-level parallelism
- Cluster modes (see the sketch below):
  - Standalone
  - Apache Mesos
  - Hadoop YARN (typical for Hadoop clusters, with centralised resource management)
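As an illustration, a minimal sketch of how the deployment mode is selected in application code (in practice the master is usually given on the command line via --master; host names are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("demo").setMaster("local[4]")  // local mode, 4 threads
// Cluster modes use a different master URL instead, e.g.
//   spark://master-host:7077     (standalone)
//   mesos://master-host:5050     (Apache Mesos)
//   yarn-client / yarn-cluster   (Hadoop YARN, Spark 1.x syntax)
val sc = new SparkContext(conf)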
7
How to use it
- Interactive shell: spark-shell (Scala), pyspark (Python)
- Job submission: spark-submit (use also for Python; a sketch follows below)
- Notebooks: Jupyter, Zeppelin, SWAN (centralised CERN Jupyter service)
- Interfaces: terminal and web interface
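For job submission, a minimal self-contained application might look as follows (a sketch; the object and file names are made up):

import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SimpleApp"))
    val evens = sc.parallelize(1 to 1000).filter(_ % 2 == 0).count()  // count even numbers
    println("Found " + evens + " even numbers")
    sc.stop()
  }
}

Packaged into a jar, it would be run with: spark-submit --class SimpleApp simple-app.jar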
8
Example – Pi estimation
import scala.math.random
val slices = 2
val n = 100000 * slices   // the multiplier was lost in extraction; 100000 (as in the stock Spark example) is assumed
val rdd = sc.parallelize(1 to n, slices)
val sample = rdd.map { i =>
  val x = random
  val y = random
  if (x*x + y*y < 1) 1 else 0
}
val count = sample.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / n)
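Why this estimates pi: each sampled point (x, y) is uniform on the unit square, and the condition x*x + y*y < 1 selects the quarter of the unit circle inside it, whose area is pi/4. The fraction count / n therefore approaches pi/4, so 4 * count / n approaches pi.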
9
Example – Pi estimation
- Log in to one of the cluster nodes haperf1[01-12]
- Start spark-shell
- Copy or retype the content of /afs/cern.ch/user/k/kasurdy/public/pi.spark
10
Spark concepts and terminology
11
Transformations, actions (1)
- Transformations define how you want to transform your data. The result of a transformation of a Spark dataset is another Spark dataset. They do not trigger a computation. Examples: map, filter, union, distinct, join
- Actions are used to collect the result from a dataset defined previously. The result is usually an array or a number. They start the computation. Examples: reduce, collect, count, first, take (see the example below)
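A minimal sketch in spark-shell showing the difference (the values are arbitrary):

val rdd = sc.parallelize(1 to 10)    // a distributed dataset
val evens = rdd.filter(_ % 2 == 0)   // transformation: no computation yet
val doubled = evens.map(_ * 2)       // transformation: still no computation
val sum = doubled.reduce(_ + _)      // action: the job runs now and returns 60
doubled.collect()                    // action: returns Array(4, 8, 12, 16, 20)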
12
Transformations, actions (2)
import scala.math.random
val slices = 2
val n = 100000 * slices   // the multiplier was lost in extraction; 100000 is assumed
val rdd = sc.parallelize(1 to n, slices)
val sample = rdd.map { i =>
  val x = random
  val y = random
  if (x*x + y*y < 1) 1 else 0
}
val count = sample.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / n)
[Diagram: the map function turns each of the elements 1 … n into a 1 or a 0; the reduce function sums them up to count, and pi is estimated as 4 * count / n]
13
Driver, worker, executor
- The same Pi-estimation code as on the previous slides, this time typically run on a cluster
[Diagram: the driver program (holding the SparkContext) talks to the cluster manager, which allocates executors on the worker nodes; the driver then sends tasks to those executors]
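To make this concrete, here is a simplified version of the example annotated with where each piece runs (a sketch):

val rdd = sc.parallelize(1 to 1000, 2)  // defined on the driver
val doubled = rdd.map(_ * 2)            // the closure is serialized by the driver
                                        // and executed on the executors
val total = doubled.reduce(_ + _)       // partial sums computed on the executors,
                                        // combined and returned to the driver
println(total)                          // runs on the driver only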
14
Stages, tasks
- Task: a unit of work that will be sent to one executor
- Stage: each job gets divided into smaller sets of tasks, called stages, that depend on each other; the boundaries between stages act as synchronization points (a sketch follows below)
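A sketch of where a stage boundary comes from: reduceByKey needs a shuffle, so the job below runs as two stages, each with one task per partition:

val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), 2)
val pairs = words.map(w => (w, 1))     // narrow transformation: stays in the same stage
val counts = pairs.reduceByKey(_ + _)  // shuffle boundary: a new stage begins here
counts.collect()                       // action: executes the two stages of tasks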
15
Your job as a directed acyclic graph (DAG)
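The DAG behind a job can be inspected from the shell with toDebugString, which prints the RDD lineage with indentation marking the shuffle (stage) boundaries:

val counts = sc.parallelize(Seq("a", "b", "a"))
  .map(w => (w, 1))
  .reduceByKey(_ + _)
println(counts.toDebugString)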
16
Monitoring pages
- If you can, scroll the console output up to get an application ID (e.g. application_…_0070) and open its monitoring page directly
- Otherwise use the link below to find your application on the list
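Assuming a reasonably recent Spark version (1.5 or later), the ID can also be read directly from a running shell instead of scrolling the console output:

println(sc.applicationId)   // e.g. application_<cluster start time>_0070 on YARN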
17
Monitoring pages
18
Data APIs