SPARK CLUSTER SETTING
賴家民
Outline
Why Use Spark
Spark Runtime Architecture
Build Spark
Local mode
Standalone Cluster Manager
Hadoop Yarn
SparkContext with RDD
Why Use Spark
Speed
Ease of Use
Generality
Runs Everywhere
Spark Runtime Architecture
The Driver
1. Converting a user program into tasks
2. Scheduling tasks on executors
Executors: worker processes responsible for running the individual tasks
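As an aside (not from the slides), here is a minimal PySpark sketch of this split: the driver builds up transformations lazily, and only the action at the end makes it ship tasks to the executors. The app name and numbers are purely illustrative.

from pyspark import SparkContext

# The driver process owns the SparkContext.
sc = SparkContext(appName="driver-demo")          # app name is illustrative

# Transformations only describe the computation; no tasks run yet.
nums = sc.parallelize(range(1000), 4)             # 4 partitions -> up to 4 parallel tasks
squares = nums.map(lambda x: x * x)

# The action makes the driver turn this lineage into tasks and
# schedule them on the executors; the result comes back to the driver.
total = squares.reduce(lambda x, y: x + y)
print(total)

sc.stop()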
Spark Runtime Architecture in this course
[Diagram] Spark Driver on scml1; Standalone Cluster Master on scml1; Cluster Worker on scml1 running an Executor; Cluster Worker on scml2 running an Executor
Build Spark - Download it first
Go to https://spark.apache.org/downloads.html
Choose the Spark version and the package type that fit your system.
wget ………………….
Build Spark - Start to build
Set JAVA_HOME:
vim ~/.bashrc ; export JAVA_HOME=/usr/lib/jvm/java-7-oracle
Build with the right package:
mvn -DskipTests clean package
Build in our course:
mvn package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -DskipTests
If you do not build it this way, you will get an IPC connection failure when Spark talks to the Hadoop 2.6.0 cluster.
Build Spark - Set SSH certificates
ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
scp ~/.ssh/authorized_keys scml@scml2:~/.ssh
Because this was already set up when building Hadoop, we skip this step.
Local mode
Interactive shells:
bin/pyspark
bin/spark-shell
Launch an application:
bin/spark-submit my_script.py
If the Spark cluster has been started: bin/spark-submit --master local my_script.py
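The slide references my_script.py without showing it; a hypothetical minimal version that could be launched with bin/spark-submit --master local my_script.py might look like this.

from pyspark import SparkConf, SparkContext

# Hypothetical contents for my_script.py (not shown in the slides).
conf = SparkConf().setAppName("my_script")
sc = SparkContext(conf=conf)

data = sc.parallelize([1, 2, 3, 4, 5])
total = data.reduce(lambda x, y: x + y)
print(total)   # 15

sc.stop()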
Standalone Cluster Manager
On scml1:
cp slaves.template slaves
vim ~/spark-1.2.0/conf/slaves
ADD scml1 and scml2
cp spark-env.sh.template spark-env.sh
ADD export SPARK_MASTER_IP=192.168.56.111
ADD export SPARK_LOCAL_IP=192.168.56.111
Standalone Cluster Manager
On scml2:
cp spark-env.sh.template spark-env.sh
ADD export SPARK_MASTER_IP=192.168.56.111
ADD export SPARK_LOCAL_IP=192.168.56.112
Check whether the standalone cluster is running:
spark-1.2.0/bin/spark-submit --master spark://192.168.56.111:7077 spark_test.py
Spark Standalone Web UI: 192.168.56.111:8080
Standalone Cluster Manager
Parameters:
--executor-memory
--driver-memory
--total-executor-cores
--executor-cores
--deploy-mode cluster or client
HDFS input:
rawData = sc.textFile("hdfs://192.168.56.111:9000/kddcup.data_10_percent")
The HDFS URL depends on your Hadoop core-site.xml setting.
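Putting the submit parameters and the HDFS input together, a rough sketch of a job for this cluster (the master and NameNode addresses come from the slides; the script name, app name, and resource sizes are illustrative):

from pyspark import SparkConf, SparkContext

# Could be submitted e.g. with:
# bin/spark-submit --master spark://192.168.56.111:7077 \
#     --executor-memory 1g --total-executor-cores 2 hdfs_input_demo.py
conf = SparkConf().setAppName("hdfs_input_demo")
sc = SparkContext(conf=conf)

# NameNode address must match the Hadoop core-site.xml setting.
rawData = sc.textFile("hdfs://192.168.56.111:9000/kddcup.data_10_percent")
print(rawData.count())

sc.stop()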
Hadoop Yarn
On scml1:
vim spark-1.2.0/conf/spark-env.sh
ADD export HADOOP_CONF_DIR="/home/scml/hadoop-2.6.0/etc/hadoop/"
Check whether Spark on YARN is running:
spark-1.2.0/bin/spark-submit --master yarn-client spark_test.py
SparkContext with RDD
Main entry point to Spark functionality
Available in the shell as the variable sc
Creating RDDs
# Turn a Python collection into an RDD
> sc.parallelize([1, 2, 3])
# Load text file from local FS, HDFS, or S3
> sc.textFile("file.txt")
> sc.textFile("directory/*.txt")
> sc.textFile("hdfs://namenode:9000/path/file")
# Use existing Hadoop InputFormat (Java/Scala only)
> sc.hadoopFile(keyClass, valClass, inputFmt, conf)
Basic Transformations
> nums = sc.parallelize([1, 2, 3])
# Pass each element through a function
> squares = nums.map(lambda x: x * x)   # => {1, 4, 9}
# Keep elements passing a predicate
> even = squares.filter(lambda x: x % 2 == 0)   # => {4}
# Map each element to zero or more others
> nums.flatMap(lambda x: range(x))   # => {0, 0, 1, 0, 1, 2}
range(x) yields the sequence of numbers 0, 1, ..., x-1
Basic Actions
> nums = sc.parallelize([1, 2, 3])
# Retrieve RDD contents as a local collection
> nums.collect()   # => [1, 2, 3]
# Return first K elements
> nums.take(2)   # => [1, 2]
# Count number of elements
> nums.count()   # => 3
# Merge elements with an associative function
> nums.reduce(lambda x, y: x + y)   # => 6
# Write elements to a text file
> nums.saveAsTextFile("hdfs://file.txt")
Working with Key-Value Pairs
Spark's "distributed reduce" transformations operate on RDDs of key-value pairs.
Python:
pair = (a, b)
pair[0]   # => a
pair[1]   # => b
Scala:
val pair = (a, b)
pair._1   // => a
pair._2   // => b
Java:
Tuple2 pair = new Tuple2(a, b);
pair._1   // => a
pair._2   // => b
Some Key-Value Operations
> pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])
> pets.reduceByKey(lambda x, y: x + y)   # => {(cat, 3), (dog, 1)}
> pets.groupByKey()   # => {(cat, [1, 2]), (dog, [1])}
> pets.sortByKey()   # => {(cat, 1), (cat, 2), (dog, 1)}
reduceByKey also automatically implements combiners on the map side
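To make the combiner remark concrete, here is a small sketch (not from the slides) that computes the same per-key sums two ways: reduceByKey pre-aggregates within each partition before the shuffle, while groupByKey ships every value across the network first.

from pyspark import SparkContext

sc = SparkContext(appName="combiner-demo")   # app name is illustrative

pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])

# Pre-aggregates per partition (map-side combine), then merges partial sums.
sums_reduce = pets.reduceByKey(lambda x, y: x + y)

# Shuffles every (key, value) pair, then sums on the reduce side.
sums_group = pets.groupByKey().mapValues(lambda vals: sum(vals))

print(sorted(sums_reduce.collect()))   # [('cat', 3), ('dog', 1)]
print(sorted(sums_group.collect()))    # same result, more data shuffled

sc.stop()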
Example: Word Count
> lines = sc.textFile("hdfs://192.168.56.111:9000/hamlet.txt")
> counts = lines.flatMap(lambda line: line.split(" ")) \
               .map(lambda word: (word, 1)) \
               .reduceByKey(lambda x, y: x + y)
Data flow: "to be or" / "not to be" -> "to" "be" "or" "not" "to" "be" -> (to, 1) (be, 1) (or, 1) (not, 1) (to, 1) (be, 1) -> (be, 2) (not, 1) (or, 1) (to, 2)
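For completeness, a runnable version of this word count with an output step added (the output directory is hypothetical; everything else follows the slide):

from pyspark import SparkContext

sc = SparkContext(appName="word_count")

lines = sc.textFile("hdfs://192.168.56.111:9000/hamlet.txt")
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda x, y: x + y))

# Inspect a few results on the driver...
for word, n in counts.take(10):
    print("%s %d" % (word, n))

# ...or write everything back to HDFS (hypothetical output directory).
counts.saveAsTextFile("hdfs://192.168.56.111:9000/hamlet_wordcount")

sc.stop()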
Other Key-Value Operations
> visits = sc.parallelize([("index.html", "1.2.3.4"), ("about.html", "3.4.5.6"), ("index.html", "1.3.3.1")])
> pageNames = sc.parallelize([("index.html", "Home"), ("about.html", "About")])
> visits.join(pageNames)
# ("index.html", ("1.2.3.4", "Home"))
# ("index.html", ("1.3.3.1", "Home"))
# ("about.html", ("3.4.5.6", "About"))
> visits.cogroup(pageNames)
# ("index.html", (["1.2.3.4", "1.3.3.1"], ["Home"]))
# ("about.html", (["3.4.5.6"], ["About"]))
Thank you. sena.lai1982@gmail.com