SPARK CLUSTER SETTING
賴家民
Outline
Why Use Spark
Spark Runtime Architecture
Build Spark
Local mode
Standalone Cluster Manager
Hadoop Yarn
SparkContext with RDD
Why Use Spark
Speed
Ease of Use
Generality
Runs Everywhere
Spark Runtime Architecture
The Driver
1. Converting a user program into tasks
2. Scheduling tasks on executors
Executors: worker processes responsible for running the individual tasks
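As an aside (not from the slides), here is a minimal PySpark sketch of this split: the driver builds up transformations lazily, and only the action at the end makes it ship tasks to the executors. The app name and numbers are purely illustrative.

from pyspark import SparkContext

# The driver process owns the SparkContext.
sc = SparkContext(appName="driver-demo")          # app name is illustrative

# Transformations only describe the computation; no tasks run yet.
nums = sc.parallelize(range(1000), 4)             # 4 partitions -> up to 4 parallel tasks
squares = nums.map(lambda x: x * x)

# The action makes the driver turn this lineage into tasks and
# schedule them on the executors; the result comes back to the driver.
total = squares.reduce(lambda x, y: x + y)
print(total)

sc.stop()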
Spark Runtime Architecture in this course
[Diagram] Spark Driver on scml1; Standalone Cluster Master on scml1; Cluster Worker on scml1 running an Executor; Cluster Worker on scml2 running an Executor
Build Spark - Download it first
Go to https://spark.apache.org/downloads.html
Choose the Spark version and the package type that fit your system.
wget ………………….
Build Spark - Start to build
Set JAVA_HOME:
vim ~/.bashrc ; export JAVA_HOME=/usr/lib/jvm/java-7-oracle
Build with the right package:
mvn -DskipTests clean package
Build in our course:
mvn package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -DskipTests
If you do not build it this way, you will get an IPC connection failure when Spark talks to the Hadoop 2.6.0 cluster.
Build Spark - Set SSH certificates
ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
scp ~/.ssh/authorized_keys scml@scml2:~/.ssh
Because this was already set up when building Hadoop, we skip this step.
Local mode
Interactive shells:
bin/pyspark
bin/spark-shell
Launch an application:
bin/spark-submit my_script.py
If the Spark cluster has been started: bin/spark-submit --master local my_script.py
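The slide references my_script.py without showing it; a hypothetical minimal version that could be launched with bin/spark-submit --master local my_script.py might look like this.

from pyspark import SparkConf, SparkContext

# Hypothetical contents for my_script.py (not shown in the slides).
conf = SparkConf().setAppName("my_script")
sc = SparkContext(conf=conf)

data = sc.parallelize([1, 2, 3, 4, 5])
total = data.reduce(lambda x, y: x + y)
print(total)   # 15

sc.stop()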
Standalone Cluster Manager
On scml1:
cp slaves.template slaves
vim ~/spark-1.2.0/conf/slaves
ADD scml1 and scml2
cp spark-env.sh.template spark-env.sh
ADD export SPARK_MASTER_IP=192.168.56.111
ADD export SPARK_LOCAL_IP=192.168.56.111
Standalone Cluster Manager
On scml2:
cp spark-env.sh.template spark-env.sh
ADD export SPARK_MASTER_IP=192.168.56.111
ADD export SPARK_LOCAL_IP=192.168.56.112
Check whether the standalone cluster is running:
spark-1.2.0/bin/spark-submit --master spark://192.168.56.111:7077 spark_test.py
Spark Standalone Web UI: 192.168.56.111:8080
Standalone Cluster Manager
Parameters:
--executor-memory
--driver-memory
--total-executor-cores
--executor-cores
--deploy-mode cluster or client
HDFS input:
rawData = sc.textFile("hdfs://192.168.56.111:9000/kddcup.data_10_percent")
The HDFS URL depends on your Hadoop core-site.xml setting.
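Putting the submit parameters and the HDFS input together, a rough sketch of a job for this cluster (the master and NameNode addresses come from the slides; the script name, app name, and resource sizes are illustrative):

from pyspark import SparkConf, SparkContext

# Could be submitted e.g. with:
# bin/spark-submit --master spark://192.168.56.111:7077 \
#     --executor-memory 1g --total-executor-cores 2 hdfs_input_demo.py
conf = SparkConf().setAppName("hdfs_input_demo")
sc = SparkContext(conf=conf)

# NameNode address must match the Hadoop core-site.xml setting.
rawData = sc.textFile("hdfs://192.168.56.111:9000/kddcup.data_10_percent")
print(rawData.count())

sc.stop()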
Hadoop Yarn
On scml1:
vim spark-1.2.0/conf/spark-env.sh
ADD export HADOOP_CONF_DIR="/home/scml/hadoop-2.6.0/etc/hadoop/"
Check whether Spark on YARN is running:
spark-1.2.0/bin/spark-submit --master yarn-client spark_test.py
SparkContext with RDD
Main entry point to Spark functionality
Available in the shell as the variable sc
Creating RDDs
# Turn a Python collection into an RDD
> sc.parallelize([1, 2, 3])
# Load text file from local FS, HDFS, or S3
> sc.textFile("file.txt")
> sc.textFile("directory/*.txt")
> sc.textFile("hdfs://namenode:9000/path/file")
# Use existing Hadoop InputFormat (Java/Scala only)
> sc.hadoopFile(keyClass, valClass, inputFmt, conf)
Basic Transformations
> nums = sc.parallelize([1, 2, 3])
# Pass each element through a function
> squares = nums.map(lambda x: x * x)   # => {1, 4, 9}
# Keep elements passing a predicate
> even = squares.filter(lambda x: x % 2 == 0)   # => {4}
# Map each element to zero or more others
> nums.flatMap(lambda x: range(x))   # => {0, 0, 1, 0, 1, 2}
range(x) yields the sequence of numbers 0, 1, ..., x-1
Basic Actions
> nums = sc.parallelize([1, 2, 3])
# Retrieve RDD contents as a local collection
> nums.collect()   # => [1, 2, 3]
# Return first K elements
> nums.take(2)   # => [1, 2]
# Count number of elements
> nums.count()   # => 3
# Merge elements with an associative function
> nums.reduce(lambda x, y: x + y)   # => 6
# Write elements to a text file
> nums.saveAsTextFile("hdfs://file.txt")
Working with Key-Value Pairs
Spark's "distributed reduce" transformations operate on RDDs of key-value pairs.
Python:
pair = (a, b)
pair[0]   # => a
pair[1]   # => b
Scala:
val pair = (a, b)
pair._1   // => a
pair._2   // => b
Java:
Tuple2 pair = new Tuple2(a, b);
pair._1   // => a
pair._2   // => b
Some Key-Value Operations
> pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])
> pets.reduceByKey(lambda x, y: x + y)   # => {(cat, 3), (dog, 1)}
> pets.groupByKey()   # => {(cat, [1, 2]), (dog, [1])}
> pets.sortByKey()   # => {(cat, 1), (cat, 2), (dog, 1)}
reduceByKey also automatically implements combiners on the map side
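To make the combiner remark concrete, here is a small sketch (not from the slides) that computes the same per-key sums two ways: reduceByKey pre-aggregates within each partition before the shuffle, while groupByKey ships every value across the network first.

from pyspark import SparkContext

sc = SparkContext(appName="combiner-demo")   # app name is illustrative

pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])

# Pre-aggregates per partition (map-side combine), then merges partial sums.
sums_reduce = pets.reduceByKey(lambda x, y: x + y)

# Shuffles every (key, value) pair, then sums on the reduce side.
sums_group = pets.groupByKey().mapValues(lambda vals: sum(vals))

print(sorted(sums_reduce.collect()))   # [('cat', 3), ('dog', 1)]
print(sorted(sums_group.collect()))    # same result, more data shuffled

sc.stop()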
Example: Word Count
> lines = sc.textFile("hdfs://192.168.56.111:9000/hamlet.txt")
> counts = lines.flatMap(lambda line: line.split(" ")) \
               .map(lambda word: (word, 1)) \
               .reduceByKey(lambda x, y: x + y)
Data flow: "to be or" / "not to be" -> "to" "be" "or" "not" "to" "be" -> (to, 1) (be, 1) (or, 1) (not, 1) (to, 1) (be, 1) -> (be, 2) (not, 1) (or, 1) (to, 2)
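For completeness, a runnable version of this word count with an output step added (the output directory is hypothetical; everything else follows the slide):

from pyspark import SparkContext

sc = SparkContext(appName="word_count")

lines = sc.textFile("hdfs://192.168.56.111:9000/hamlet.txt")
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda x, y: x + y))

# Inspect a few results on the driver...
for word, n in counts.take(10):
    print("%s %d" % (word, n))

# ...or write everything back to HDFS (hypothetical output directory).
counts.saveAsTextFile("hdfs://192.168.56.111:9000/hamlet_wordcount")

sc.stop()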
Other Key-Value Operations
> visits = sc.parallelize([("index.html", "1.2.3.4"), ("about.html", "3.4.5.6"), ("index.html", "1.3.3.1")])
> pageNames = sc.parallelize([("index.html", "Home"), ("about.html", "About")])
> visits.join(pageNames)
# ("index.html", ("1.2.3.4", "Home"))
# ("index.html", ("1.3.3.1", "Home"))
# ("about.html", ("3.4.5.6", "About"))
> visits.cogroup(pageNames)
# ("index.html", (["1.2.3.4", "1.3.3.1"], ["Home"]))
# ("about.html", (["3.4.5.6"], ["About"]))
Thank you. sena.lai1982@gmail.com