SPARK CLUSTER SETTING 賴家民

Outline
- Why Use Spark
- Spark Runtime Architecture
- Build Spark
- Local mode
- Standalone Cluster Manager
- Hadoop Yarn
- SparkContext with RDD

Why Use Spark
- Speed
- Ease of Use
- Generality
- Runs Everywhere

Spark Runtime Architecture
- The Driver
  1. Converting a user program into tasks
  2. Scheduling tasks on executors
- Executors
  - Worker processes responsible for running the individual tasks

Spark Runtime Architecture in this course
(Diagram of the course cluster)
- Spark Driver: runs on scml1
- Standalone Cluster Master: runs on scml1
- Cluster Workers: scml1 and scml2, each running an Executor

Build Spark - Download it first
- Go to the Apache Spark downloads page
- Choose the Spark version with the package that fits your system
- wget ………………….

Build Spark - Start to build
- Set JAVA_HOME
  - vim ~/.bashrc
  - export JAVA_HOME=/usr/lib/jvm/java-7-oracle
- Build with the right package
  - mvn -DskipTests clean package
- Build for our course
  - mvn package -Pyarn -Dyarn.version=<version> -Phadoop-<version> -Dhadoop.version=<version> -Phive -DskipTests
- If you don't build it this way, you will get an IPC connection failure.

Build Spark - Set SSH certificates
- ssh-keygen -t rsa -P ""
- cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
- scp ~/.ssh/authorized_keys …
- Because this was already set up when building Hadoop, we skip this step here.

Local mode
- Interactive shells
  - bin/pyspark
  - bin/spark-shell
- Launch an application
  - bin/spark-submit my_script.py
  - If a Spark cluster is already running, use bin/spark-submit --master local my_script.py to force local execution (see the sketch below).
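For concreteness, a minimal sketch of what a my_script.py submitted this way might contain (the script body below is an illustrative assumption, not taken from the original slides):

    from pyspark import SparkConf, SparkContext

    # The master ("local", the standalone master, or YARN) is normally
    # supplied on the command line via spark-submit --master.
    conf = SparkConf().setAppName("my_script")
    sc = SparkContext(conf=conf)

    # A trivial job: sum the squares of 1..10
    nums = sc.parallelize(range(1, 11))
    print(nums.map(lambda x: x * x).reduce(lambda x, y: x + y))  # => 385

    sc.stop()

Launched with bin/spark-submit --master local my_script.py, the same file also runs unchanged against the standalone or YARN masters configured on the following slides.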

Standalone Cluster Manager
- On scml1
  - cp slaves.template slaves
  - vim ~/spark-1.2.0/conf/slaves
    - add scml1 and scml2
  - cp spark-env.sh.template spark-env.sh
    - add export SPARK_MASTER_IP=<master IP>
    - add export SPARK_LOCAL_IP=<this node's IP>

Standalone Cluster Manager
- On scml2
  - cp spark-env.sh.template spark-env.sh
    - add export SPARK_MASTER_IP=<master IP>
    - add export SPARK_LOCAL_IP=<this node's IP>
- Check whether the standalone cluster works
  - spark-1.2.0/bin/spark-submit --master spark://<master IP>:7077 spark_test.py
- Spark Standalone Web UI
  - http://<master IP>:8080

Standalone Cluster Manager
- Parameters
  - --executor-memory
  - --driver-memory
  - --total-executor-cores
  - --executor-cores
  - --deploy-mode cluster or client
- HDFS input (see the example below)
  - rawData = sc.textFile("hdfs://<namenode IP>:9000/kddcup.data_10_percent")
  - The URI depends on your Hadoop core-site.xml setting.
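As an illustrative continuation of the rawData RDD above (a sketch; treating the last comma-separated field of each KDD Cup record as its label is an assumption, not something stated on the slide):

    # Count records per label in the KDD Cup sample
    labelCounts = (rawData
                   .map(lambda line: line.strip().split(",")[-1])
                   .map(lambda label: (label, 1))
                   .reduceByKey(lambda x, y: x + y))
    print(labelCounts.collect())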

Hadoop Yarn
- On scml1
  - vim spark-1.2.0/conf/spark-env.sh
  - add export HADOOP_CONF_DIR="/home/scml/hadoop-<version>/etc/hadoop/"
- Check whether running on YARN works
  - spark-1.2.0/bin/spark-submit --master yarn-client spark_test.py

SparkContext with RDD
- Main entry point to Spark functionality
- Available in the shell as the variable sc
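Outside the interactive shells, sc is built explicitly from a SparkConf, and the resources that an earlier slide passes as spark-submit flags can also be set there. A minimal sketch (the application name and the resource values are illustrative):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("cluster-setting-demo")
            .set("spark.executor.memory", "2g")   # same effect as --executor-memory 2g
            .set("spark.cores.max", "4"))         # same effect as --total-executor-cores 4
    sc = SparkContext(conf=conf)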

Creating RDDs
# Turn a Python collection into an RDD
> sc.parallelize([1, 2, 3])
# Load text file from local FS, HDFS, or S3
> sc.textFile("file.txt")
> sc.textFile("directory/*.txt")
> sc.textFile("hdfs://namenode:9000/path/file")
# Use an existing Hadoop InputFormat (Java/Scala only)
> sc.hadoopFile(keyClass, valClass, inputFmt, conf)

Basic Transformations
> nums = sc.parallelize([1, 2, 3])
# Pass each element through a function
> squares = nums.map(lambda x: x * x)   # => {1, 4, 9}
# Keep elements passing a predicate
> even = squares.filter(lambda x: x % 2 == 0)   # => {4}
# Map each element to zero or more others
> nums.flatMap(lambda x: range(x))   # => {0, 0, 1, 0, 1, 2}
# range(x) is the sequence of numbers 0, 1, …, x-1

Basic Actions
> nums = sc.parallelize([1, 2, 3])
# Retrieve RDD contents as a local collection
> nums.collect()   # => [1, 2, 3]
# Return the first K elements
> nums.take(2)   # => [1, 2]
# Count the number of elements
> nums.count()   # => 3
# Merge elements with an associative function
> nums.reduce(lambda x, y: x + y)   # => 6
# Write elements to a text file
> nums.saveAsTextFile("hdfs://file.txt")

Working with Key-Value Pairs
Spark's "distributed reduce" transformations operate on RDDs of key-value pairs.
Python:  pair = (a, b)
         pair[0]   # => a
         pair[1]   # => b
Scala:   val pair = (a, b)
         pair._1   // => a
         pair._2   // => b
Java:    Tuple2 pair = new Tuple2(a, b);
         pair._1   // => a
         pair._2   // => b

Some Key-Value Operations
> pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])
> pets.reduceByKey(lambda x, y: x + y)   # => {(cat, 3), (dog, 1)}
> pets.groupByKey()   # => {(cat, [1, 2]), (dog, [1])}
> pets.sortByKey()   # => {(cat, 1), (cat, 2), (dog, 1)}
reduceByKey also automatically implements combiners on the map side.
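The combiner behaviour is why reduceByKey is usually preferred over groupByKey for aggregations: partial results are merged on each node before the shuffle. As an illustrative extension of the pets RDD above (the per-key average is not on the original slide):

    # Per-key average: pair each value with a count, reduce, then divide.
    # reduceByKey merges the (sum, count) pairs map-side before shuffling.
    sums_counts = (pets
                   .mapValues(lambda v: (v, 1))
                   .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])))
    averages = sums_counts.mapValues(lambda p: p[0] / float(p[1]))
    print(averages.collect())   # => [('cat', 1.5), ('dog', 1.0)] (order may vary)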

Example: Word Count
> lines = sc.textFile("hdfs://<namenode IP>:9000/hamlet.txt")
> counts = lines.flatMap(lambda line: line.split(" ")) \
                .map(lambda word: (word, 1)) \
                .reduceByKey(lambda x, y: x + y)
Dataflow: "to be or" / "not to be"  →  "to" "be" "or" "not" "to" "be"  →  (to, 1) (be, 1) (or, 1) (not, 1) (to, 1) (be, 1)  →  (be, 2) (not, 1) (or, 1) (to, 2)
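Pulling the previous slides together, a complete word-count application that could be launched with bin/spark-submit as on the Local mode slide (a sketch; the output path, the top-10 step, and the placeholder namenode address are assumptions):

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("wordcount")
    sc = SparkContext(conf=conf)

    lines = sc.textFile("hdfs://<namenode IP>:9000/hamlet.txt")
    counts = (lines.flatMap(lambda line: line.split(" "))
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda x, y: x + y))

    # Ten most frequent words, using only operations introduced on these slides
    print(counts.map(lambda kv: (kv[1], kv[0])).sortByKey(ascending=False).take(10))

    counts.saveAsTextFile("hdfs://<namenode IP>:9000/hamlet_counts")
    sc.stop()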

Other Key-Value Operations
> visits = sc.parallelize([ ("index.html", "<visitor IP 1>"), ("about.html", "<visitor IP 2>"), ("index.html", "<visitor IP 3>") ])
> pageNames = sc.parallelize([ ("index.html", "Home"), ("about.html", "About") ])
> visits.join(pageNames)
# ("index.html", ("<visitor IP 1>", "Home"))
# ("index.html", ("<visitor IP 3>", "Home"))
# ("about.html", ("<visitor IP 2>", "About"))
> visits.cogroup(pageNames)
# ("index.html", (["<visitor IP 1>", "<visitor IP 3>"], ["Home"]))
# ("about.html", (["<visitor IP 2>"], ["About"]))

Thank you