
1 SPARK CLUSTER SETTING 賴家民

2  Why Use Spark  Spark Runtime Architecture  Build Spark  Local mode  Standalone Cluster Manager  Hadoop Yarn  SparkContext with RDD

3 Why Use Spark  Speed  Ease of Use  Generality  Runs Everywhere

4 Spark Runtime Architecture  The Driver: 1. Converts the user program into tasks 2. Schedules tasks on executors  Executors: worker processes responsible for running the individual tasks
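
A minimal PySpark sketch of this split (the script name and numbers below are illustrative assumptions, not from the slides): the code that builds RDDs runs in the driver, the lambda passed to map is executed by the executors, and collect brings the results back to the driver.

from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("driver_vs_executor"))  # created in the driver process
squares = sc.parallelize(range(10)).map(lambda x: x * x)              # the lambda is shipped to and run by executors
print(squares.collect())                                              # results are returned to the driver
sc.stop()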

5 Spark Runtime Architecture in this course  The Spark Driver and the Standalone cluster master run on scml1  Cluster workers run executors on scml1 and scml2

6 Build Spark - Download it first  Go to https://spark.apache.org/downloads.html  Choose the Spark version and the package type that fit your system.  wget ………………….

7 Build Spark - Start to build  Set JAVA_HOME  vim ~/.bashrc; export JAVA_HOME=/usr/lib/jvm/java-7-oracle  Build with the package profile that matches your cluster  mvn -DskipTests clean package  Build used in this course  mvn package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -DskipTests  If you do not build against the matching Hadoop/YARN version like this, you will get an IPC connection failure.

8 Build Spark - Set SSH certificates  ssh-keygen -t rsa -P ""  cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys  scp ~/.ssh/authorized_keys scml@scml2:~/.ssh  Because this was already set up when building Hadoop, we skip this step.

9 Local mode  Interactive shells  bin/pyspark  bin/spark-shell  Launch an application  bin/spark-submit my_script.py  If a Spark cluster is already running, force local execution with bin/spark-submit --master local my_script.py
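
A minimal sketch of what my_script.py might contain (the file name comes from the slide; the body below is an assumption): create a SparkContext, load a small text file, and print a line count.

# my_script.py -- hypothetical contents for the spark-submit example above
from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("my_script"))
lines = sc.textFile("file.txt")   # any small text file next to the script
print(lines.count())              # number of lines in the file
sc.stop()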

10 Standalone Cluster Manager  scml1  cp slaves.template slaves  vim ~/spark-1.2.0/conf/slaves  Add scml1 and scml2  cp spark-env.sh.template spark-env.sh  Add export SPARK_MASTER_IP=192.168.56.111  Add export SPARK_LOCAL_IP=192.168.56.111

11 Standalone Cluster Manager  scml2  cp spark-env.sh.template spark-env.sh  Add export SPARK_MASTER_IP=192.168.56.111  Add export SPARK_LOCAL_IP=192.168.56.112  Check whether the standalone cluster is running  spark-1.2.0/bin/spark-submit --master spark://192.168.56.111:7077 spark_test.py  Spark Standalone Web UI  192.168.56.1:8080

12 Standalone Cluster Manager  Parameters  --executor-memory  --driver-memory  --total-executor-cores  --executor-cores  --deploy-mode cluster or client  HDFS input  rawData = sc.textFile("hdfs://192.168.56.111:9000/kddcup.data_10_percent")  The HDFS URI depends on your Hadoop core-site.xml setting
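
A short usage sketch continuing from the rawData RDD above (illustrative; it assumes the KDD Cup file is comma-separated, which the slide does not state):

print(rawData.count())                               # total number of records in the file
print(rawData.take(1))                               # peek at the first record
fields = rawData.map(lambda line: line.split(","))   # split each record into fields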

13 Hadoop Yarn  scml1  vim spark-1.2.0/conf/spark-env.sh; Add export HADOOP_CONF_DIR="/home/scml/hadoop-2.6.0/etc/hadoop/"  Check whether running on YARN works  spark-1.2.0/bin/spark-submit --master yarn-client spark_test.py
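
The master can also be set inside the script instead of on the command line; a minimal sketch assuming HADOOP_CONF_DIR is exported as above (the body is illustrative, not the course's spark_test.py):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("yarn-client").setAppName("yarn_check")  # yarn-client: driver runs here, executors in YARN containers
sc = SparkContext(conf=conf)
print(sc.parallelize(range(100)).sum())                               # => 4950 if the YARN cluster is reachable
sc.stop()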

14 SparkContext with RDD  Main entry point to Spark functionality  Available in the shell as the variable sc
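
In bin/pyspark the context already exists, so a quick sanity check looks like this (the master URL shown depends on how the shell was started):

> sc.master                           # e.g. 'local[*]' or 'spark://192.168.56.111:7077'
> sc.parallelize([1, 2, 3]).count()   # => 3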

15 Creating RDDs
# Turn a Python collection into an RDD
> sc.parallelize([1, 2, 3])
# Load text file from local FS, HDFS, or S3
> sc.textFile("file.txt")
> sc.textFile("directory/*.txt")
> sc.textFile("hdfs://namenode:9000/path/file")
# Use existing Hadoop InputFormat (Java/Scala only)
> sc.hadoopFile(keyClass, valClass, inputFmt, conf)

17 Basic Transformations
> nums = sc.parallelize([1, 2, 3])
# Pass each element through a function
> squares = nums.map(lambda x: x * x)   # => {1, 4, 9}
# Keep elements passing a predicate
> even = squares.filter(lambda x: x % 2 == 0)   # => {4}
# Map each element to zero or more others
> nums.flatMap(lambda x: range(x))   # => {0, 0, 1, 0, 1, 2}
range(x) yields the sequence of numbers 0, 1, …, x-1

18 Basic Actions
> nums = sc.parallelize([1, 2, 3])
# Retrieve RDD contents as a local collection
> nums.collect()   # => [1, 2, 3]
# Return first K elements
> nums.take(2)   # => [1, 2]
# Count number of elements
> nums.count()   # => 3
# Merge elements with an associative function
> nums.reduce(lambda x, y: x + y)   # => 6
# Write elements to a text file
> nums.saveAsTextFile("hdfs://file.txt")

19 Working with Key-Value Pairs
Spark's "distributed reduce" transformations operate on RDDs of key-value pairs
Python: pair = (a, b)
        pair[0]   # => a
        pair[1]   # => b
Scala:  val pair = (a, b)
        pair._1   // => a
        pair._2   // => b
Java:   Tuple2 pair = new Tuple2(a, b);
        pair._1   // => a
        pair._2   // => b

20 Some Key-Value Operations
> pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])
> pets.reduceByKey(lambda x, y: x + y)   # => {(cat, 3), (dog, 1)}
> pets.groupByKey()   # => {(cat, [1, 2]), (dog, [1])}
> pets.sortByKey()   # => {(cat, 1), (cat, 2), (dog, 1)}
reduceByKey also automatically implements combiners on the map side

21 Example: Word Count
> lines = sc.textFile("hdfs://192.168.56.111:9000/hamlet.txt")
> counts = lines.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda x, y: x + y)
Data flow: "to be or" "not to be" → "to" "be" "or" "not" "to" "be" → (to, 1) (be, 1) (or, 1) (not, 1) (to, 1) (be, 1) → (be, 2) (not, 1) (or, 1) (to, 2)

22 Other Key-Value Operations
> visits = sc.parallelize([("index.html", "1.2.3.4"), ("about.html", "3.4.5.6"), ("index.html", "1.3.3.1")])
> pageNames = sc.parallelize([("index.html", "Home"), ("about.html", "About")])
> visits.join(pageNames)
# ("index.html", ("1.2.3.4", "Home"))
# ("index.html", ("1.3.3.1", "Home"))
# ("about.html", ("3.4.5.6", "About"))
> visits.cogroup(pageNames)
# ("index.html", (["1.2.3.4", "1.3.3.1"], ["Home"]))
# ("about.html", (["3.4.5.6"], ["About"]))

23 Thanks! sena.lai1982@gmail.com

