Spark and Scala

Spark - What is it? Why does it matter? Spark is a general-purpose distributed data processing engine that is suitable for use in a wide range of circumstances. On top of the Spark core data processing engine, there are libraries for SQL, machine learning, graph computation, and stream processing, which can be used together in an application. Programming languages supported by Spark include: Java, Python, Scala, and R. Application developers and data scientists incorporate Spark into their applications to rapidly query, analyze, and transform data at scale. Tasks most frequently associated with Spark include ETL and SQL batch jobs across large data sets, processing of streaming data from sensors, IoT, or financial systems, and machine learning tasks.

Name a few of the programming languages supported by Spark.

Layers and Packages in Spark
Spark SQL - Queries can be made on structured data using both SQL (Structured Query Language) and HQL (Hive Query Language).
MLlib - designed to support multiple machine learning algorithms; for iterative, in-memory workloads it can run up to 100 times faster than Hadoop's disk-based MapReduce.
Spark Streaming - allows the creation of interactive applications that perform data analysis on live streamed data.
GraphX - an engine supported by Spark that allows the user to perform computations on graphs.
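To make the Spark SQL layer concrete, here is a minimal, hedged sketch in Python; the SparkSession builder call is standard PySpark, while the people.json file and its name/age fields are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

df = spark.read.json("people.json")      # load structured data (hypothetical file)
df.createOrReplaceTempView("people")     # expose the DataFrame as a SQL table

# The same data can now be queried with plain SQL
adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()

spark.stop()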

What are the names of the layers and packages in Spark?

Downloading Spark and Installation Go to https://spark.apache.org/downloads.html and click the download button. Spark requires Java to run, and Spark itself is implemented in Scala, so make sure you have both installed on your computer. The Spark environment can be configured to your preferences using SparkConf or the Spark shell; some packaged distributions also let you configure Spark at install time. You then initialize a new SparkContext in your preferred language (e.g. Python, Java, Scala, R); this sets up the required services and connects to an execution environment for Spark applications.
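A minimal sketch of this initialization step in Python, assuming Spark and Java are installed and pyspark is on the Python path; the application name and master URL are placeholders.

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("MyFirstApp").setMaster("local[*]")
sc = SparkContext(conf=conf)   # connects to a local execution environment

print(sc.version)              # quick check that the context is up
sc.stop()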

Simple Spark Application A Spark application can be developed in three supported languages: 1) Scala 2) Python 3) Java. Spark provides two main abstractions in an application: the RDD (Resilient Distributed Dataset) and, in the newest versions, the Dataset. There are two types of shared variables for parallel operations: broadcast variables, which can be stored in the cache of each worker, and accumulators, which help with aggregation functions such as addition. Three basic steps complete an application: 1) writing the application, 2) compiling and packaging (for Scala and Java applications), and 3) running the application.
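A minimal sketch of the two shared-variable types in Python; the lookup table and input keys are hypothetical, and the SparkContext setup assumes a local installation.

from pyspark import SparkContext

sc = SparkContext("local[*]", "SharedVariables")

# Broadcast variable: a read-only value cached on each worker
lookup = sc.broadcast({"a": 1, "b": 2, "c": 3})

# Accumulator: workers can only add to it (aggregation by addition)
missing = sc.accumulator(0)

def translate(key):
    if key not in lookup.value:
        missing.add(1)
        return 0
    return lookup.value[key]

print(sc.parallelize(["a", "b", "x", "c"]).map(translate).collect())  # [1, 2, 0, 3]
print(missing.value)                                                  # 1
sc.stop()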

How a Spark Application Runs on a cluster? A Spark application runs as independent processes, coordinated by the SparkSession object in the driver program. The resource or cluster manager assigns tasks to workers, one task per partition. A task applies its unit of work to the dataset in its partition and outputs a new partition dataset. Because iterative algorithms apply operations repeatedly to data, they benefit from caching datasets across iterations. Results are sent back to the driver application or can be saved to disk.
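As a hedged illustration of why caching helps when a dataset is reused, here is a small Python sketch; input.txt is a hypothetical file and sc is assumed to be an existing SparkContext (for example from the pyspark shell).

lines = sc.textFile("input.txt")                      # hypothetical input file
words = lines.flatMap(lambda l: l.split()).cache()    # keep the partitions in memory

print(words.count())             # first action computes the RDD and caches it
print(words.distinct().count())  # second action reuses the cached partitions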

Simple Spark Application Continued After linking the Spark library, the next step is to create a Spark context object with the desired Spark configuration, which tells Apache Spark how to access a cluster. In this example the constructor arguments are: "Word Count" - the name of the application you want to run; "local" - the master URL to connect the Spark application to; /usr/local/spark - the home directory of Apache Spark; and two Map() arguments, the first specifying the environment and the second the variables to pass to worker nodes.
Creating a Spark RDD: the next step in the word count example creates an input Spark RDD that reads the text file input.txt using the Spark context created in the previous step.
Spark RDD transformations in the word count example: the application code transforms the input RDD into a counts RDD. flatMap() tokenizes the lines from the input text file into words, map() pairs each word with a count of 1, and reduceByKey() adds up those counts to give the number of repetitions of each word in the text file.
Running the Spark application: we submit the word count example using the Spark shell ($ spark-shell) instead of running the word count program as a whole. The steps are: create a Spark context with a Spark configuration, get the threshold, read in the text file and split each document into words, count the occurrence of each word, filter out words with fewer than the threshold occurrences, and count characters.
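A hedged Python sketch of the steps just listed; the input file name and threshold value are hypothetical, and the code assumes a local pyspark installation.

from pyspark import SparkContext

# create Spark context with Spark configuration
sc = SparkContext("local", "Word Count")

# get threshold
threshold = 2

# read in text file and split each document into words
words = sc.textFile("input.txt").flatMap(lambda line: line.split(" "))

# count the occurrence of each word
word_counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

# filter out words with fewer than threshold occurrences
frequent = word_counts.filter(lambda pair: pair[1] >= threshold)

# count characters
char_counts = (frequent.flatMap(lambda pair: list(pair[0]))
                       .map(lambda c: (c, 1))
                       .reduceByKey(lambda a, b: a + b))

print(frequent.collect())
print(char_counts.collect())
sc.stop()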

True or False: In order to run Spark, you don’t need Java.

Spark Abstraction Resilient Distributed Dataset (RDD) It is the fundamental abstraction in Apache Spark and its basic data structure. An RDD is an immutable collection of objects which is computed on the different nodes of the cluster. Resilient - i.e. fault-tolerant, so it is able to recompute missing or damaged partitions due to node failures. Distributed - since the data resides on multiple nodes. Dataset - represents the records of the data you work with. The user can load the dataset externally, from a JSON file, CSV file, text file, or a database via JDBC, with no specific data structure required.
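A short, hedged sketch of loading external data into RDDs in Python; the file names are hypothetical and sc is assumed to be an existing SparkContext.

text_rdd = sc.textFile("data.txt")                                    # plain text file
csv_rdd = sc.textFile("data.csv").map(lambda line: line.split(","))   # CSV rows as lists
json_rdd = sc.textFile("data.json")                                   # one JSON record per line

print(text_rdd.take(5))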

Spark Abstraction: DataFrame, Spark Streaming, GraphX
DataFrame - We can term a DataFrame as a Dataset organized into columns. DataFrames are similar to a table in a relational database or a data frame in R/Python.
Spark Streaming - An extension of the Spark core which allows real-time stream processing from several sources. Working together with Spark SQL, it offers a unified, continuous DataFrame abstraction that can be used for interactive and batch queries, with scalable, high-throughput, and fault-tolerant processing.
GraphX - One more example of a specialized data abstraction. It enables developers to analyze social networks and other graphs alongside Excel-like two-dimensional data.
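A minimal DataFrame sketch in Python; the rows are made-up data and spark is assumed to be an existing SparkSession.

from pyspark.sql import Row

people = spark.createDataFrame([
    Row(name="Ann", age=34),
    Row(name="Bob", age=23),
])

people.printSchema()                    # column structure, like a relational table
people.filter(people.age > 30).show()   # a column-based query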

Which Spark Abstraction is known to be fault tolerant?

PYSPARK PySpark is an interactive shell for Spark in Python. Using PySpark we can work with RDDs in the Python programming language through the Py4J library. PySpark offers the pyspark shell, which links the Python API to the Spark core and initializes the Spark context. PySpark supports Python 2.6 and above. The Python shell can be used to explore data interactively and is a simple way to learn the API.

Example:

from pyspark import SparkContext

sc = SparkContext("local", "count app")
words = sc.parallelize([
    "scala", "java", "hadoop", "spark", "akka",
    "spark vs hadoop", "pyspark", "pyspark and spark"
])
counts = words.count()
print("Number of elements in RDD -> %i" % counts)

Output: Number of elements in RDD -> 8

Question What are the languages supported by Apache Spark for developing big data applications?

Python API In PySpark, RDDs support the same methods as their Scala counterparts but take Python functions and return Python collection types. Because Python is dynamically typed, RDDs can hold objects of multiple types. You can also pass functions that are defined with the def keyword; this is useful for longer functions that can't be expressed using a lambda, as in the sketch below.
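A small, hedged sketch of passing a function defined with def to an RDD method; the words are made-up data and sc is assumed to be an existing SparkContext.

def clean_word(word):
    # longer logic than a lambda comfortably holds
    word = word.strip().lower()
    return word.rstrip(".,!?")

words = sc.parallelize(["Spark,", "SCALA!", " python "]).map(clean_word)
print(words.collect())   # ['spark', 'scala', 'python']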

Python API Functions can access objects in enclosing scopes, although modifications to those objects made inside RDD methods will not be propagated back (see the sketch below). PySpark will automatically ship these functions to workers, along with any objects that they reference. Instances of classes will be serialized and shipped to workers by PySpark, but the classes themselves cannot be automatically distributed to workers.
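A hedged sketch of the closure behavior described above; sc is assumed to be an existing SparkContext, and the counter variable is hypothetical.

counter = 0

def double_and_count(x):
    global counter
    counter += 1          # modifies a copy shipped to a worker process
    return x * 2

print(sc.parallelize([1, 2, 3, 4]).map(double_and_count).collect())  # [2, 4, 6, 8]
print(counter)  # still 0 on the driver: worker-side changes are not propagated back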

Question: Do we have a machine learning API in Python? (True/False)

What is Scala? Scala is a general-purpose language that can be used to develop solutions for any software problem. Scala combines object-oriented and functional programming in one concise, high-level language. Scala's static types help avoid bugs in complex applications, and its JVM and JavaScript runtimes let you build high-performance systems with easy access to huge ecosystems of libraries. Scala offers a toolset for writing scalable concurrent applications in a simple way, with more confidence in their correctness. Scala is an excellent base for parallel, distributed, and concurrent computing, which is widely thought to be a very big challenge in software development, but Scala's unique combination of features meets that challenge well. Scala is also a compiled language (it compiles to JVM bytecode).

Why Scala? Applications written in Scala are less costly to maintain and easier to evolve, because Scala is a functional and object-oriented programming language that underpins the Lightbend reactive stack and helps developers write code that is more concise than other options. Scala is used outside of its killer-app domain as well, of course, and for a while there was enough hype around the language that even if the problem at hand could easily be solved in Java, Scala would still be the preference, as the language was seen as a future replacement for Java. It reduces the amount of code developers have to write.

What is a drawback of Scala?

Example programs with Scala and Python WordCount: In this example we use a few transformations to build a dataset of (String, Int) pairs called counts and save it to a file.

Python:

text_file = sc.textFile("hdfs://...")                     # read data from file
counts = text_file.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")                       # save output to a file

Scala:

val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

Prediction with Regression using MLlib

Scala:

// Read data into a DataFrame
val df = sqlContext.createDataFrame(data).toDF("label", "features")
val lr = new LogisticRegression().setMaxIter(10)
// Fit the model to the data
val model = lr.fit(df)
val weights = model.weights
// Predict the results using the model
model.transform(df).show()

Python:

# Read data into a DataFrame
df = sqlContext.createDataFrame(data, ["label", "features"])
lr = LogisticRegression(maxIter=10)
# Fit the model to the data
model = lr.fit(df)
# Predict the results using the model
model.transform(df).show()

Questions?