ITCS-3190

Overview Apache Spark is a fast, general-purpose engine for large-scale data processing. Its key strengths are speed, ease of use, generality, and the ability to run everywhere.

Overview Spark Core is the underlying execution engine: programs written in Scala, Java, or Python are all ultimately executed by Spark Core. Spark SQL supports querying structured data via SQL and the Hive Query Language (HQL). Spark Streaming enables analytical and interactive applications over live streaming data; Spark runs its operations directly on the streamed data as it arrives. MLlib is a machine learning library built on top of Spark that supports many machine learning algorithms and, because it works in memory, can run almost 100 times faster than MapReduce. GraphX is Spark's own graph computation engine.

Downloading Download a recent released version of Spark from http://spark.apache.org/downloads.html. Select the package type "Pre-built for Hadoop 2.4 and later" and click Direct Download, which downloads a compressed TAR file. Unpack the file using the tar command-line tool that comes with most Unix and Linux variants (including Mac OS X), or with any free TAR extractor.

Getting Started Spark comes with shells much like operating system shells such as Bash or the Windows Command Prompt; the difference is that Spark shells let you manipulate data that is distributed across many machines. To open a Spark shell, go to your Spark directory and type bin/pyspark for the Python version (bin\pyspark on Windows) or bin/spark-shell for the Scala version. In Spark, operations on distributed collections are expressed on RDDs (resilient distributed datasets). In the session below, the variable lines is an RDD created from a text file; we can then run various operations on the RDD, such as counting the lines of text or printing the first one.
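A minimal PySpark shell session illustrating this (the file name README.md is an assumption):
    >>> lines = sc.textFile("README.md")   # create an RDD from a text file
    >>> lines.count()                      # count the number of lines
    >>> lines.first()                      # print the first line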

Spark Application Like MapReduce applications, each Spark application is a self-contained computation that runs user-supplied code to compute a result. In Spark, the highest-level unit of computation is an application. A Spark application can be used for a single batch job, an interactive session with multiple jobs, or a long-lived server continually satisfying requests. A Spark job can consist of more than just a single map and reduce, and a Spark application can have processes running on its behalf even when it is not running a job. Multiple tasks can run within the same executor, which enables extremely fast task startup as well as in-memory data storage, resulting in orders-of-magnitude faster performance; the sketch below shows one application running several jobs.
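A minimal sketch (application name and data are illustrative, not from the slides) of one application running several jobs through a single SparkContext:
    from pyspark import SparkConf, SparkContext

    sc = SparkContext(conf=SparkConf().setAppName("MultiJobApp"))
    data = sc.parallelize(range(1000)).cache()         # cached in executor memory

    print(data.count())                                # job 1
    print(data.filter(lambda x: x % 2 == 0).count())   # job 2, reuses the cached data
    sc.stop()                                          # the application ends here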

Simple Spark Apps Simple Spark App: Pi Estimation. Spark can be used for very computation-intensive tasks. This code estimates pi by "throwing darts" at a circle: pick random points in the unit square (from (0,0) to (1,1)) and check how many fall inside the unit circle. That fraction should be approximately pi/4, which then gives an estimate for pi; a sketch of the code appears below.
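A minimal PySpark sketch of this estimation (the sample count and names are assumptions):
    from random import random
    from operator import add
    from pyspark import SparkContext

    sc = SparkContext(appName="PiEstimation")
    num_samples = 1000000

    def inside(_):
        # throw one dart: a random point in the unit square
        x, y = random(), random()
        return 1 if x * x + y * y < 1 else 0

    count = sc.parallelize(range(num_samples)).map(inside).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / num_samples))
    sc.stop()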

Simple Spark Apps Simple Spark App: Word Count. The application creates a SparkConf and SparkContext (a Spark application corresponds to an instance of the SparkContext class), gets a word frequency threshold, reads an input set of text documents, counts the number of times each word appears, filters out all words that appear fewer times than the threshold, and, for the remaining words, counts the number of times each letter occurs. A sketch of this pipeline appears below.
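A minimal PySpark sketch of this pipeline (the file name and threshold value are assumptions):
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("WordCount")
    sc = SparkContext(conf=conf)

    threshold = 2                                    # word frequency threshold
    lines = sc.textFile("input.txt")                 # read the input text documents

    # count the number of times each word appears
    word_counts = (lines.flatMap(lambda line: line.split())
                        .map(lambda word: (word, 1))
                        .reduceByKey(lambda a, b: a + b))

    # filter out words that appear fewer times than the threshold
    filtered = word_counts.filter(lambda pair: pair[1] >= threshold)

    # for the remaining words, count the number of times each letter occurs
    char_counts = (filtered.flatMap(lambda pair: list(pair[0]))
                           .map(lambda c: (c, 1))
                           .reduceByKey(lambda a, b: a + b))

    print(char_counts.collect())
    sc.stop()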

Simple Spark Apps In MapReduce, this computation requires two chained MapReduce applications, as well as persisting the intermediate data to HDFS between them. In Spark, the application requires about 90 percent fewer lines of code than one developed using the MapReduce API.

Introduction to Scala Scala is a scalable programming language influenced by Haskell and Java. You can use any Java code in Scala, making it almost as fast as Java but with much shorter code. It allows fewer errors (no null pointer errors) and is more flexible: every Scala function is a value, and every value is an object. The Scala interpreter is an interactive shell for writing expressions:
    $ scala             (starts the interpreter)
    scala> 3 + 5        (expression to be evaluated by the interpreter)
    Unnamed0: Int = 8   (result of the evaluation)
    scala> :quit        (quits the interpreter)

Scala Scala is a scalable language: an object-oriented language in which every value is an object, combined with functional language concepts. You can start using it like Java and gradually adopt more functional-style syntax. It runs on the JVM, and many design patterns are already natively supported. For example, the Java pair construction
    Pair<Integer, String> p = new Pair<Integer, String>(1, "Scala");
becomes, in Scala,
    val p = new MyPair(1, "scala")

PySpark Python is dynamically typed, so RDDs can hold objects of multiple types. PySpark does not yet support a few API calls, such as lookup and non-text input files. PySpark requires Python 2.6 or higher, and PySpark applications are executed using a standard CPython interpreter in order to support Python modules that use C extensions. The developers have not tested PySpark with Python 3 or with alternative Python interpreters such as PyPy or Jython.

Python API Key Differences In PySpark, RDDs support the same methods as their Scala counterparts but take Python functions and return Python collection types. You can also pass functions that are defined with the def keyword; this is useful for longer functions that cannot be expressed using a lambda, as shown below.
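A brief sketch of both styles (the data and function names are assumptions; sc is an existing SparkContext):
    doubled = sc.parallelize([1, 2, 3, 4]).map(lambda x: x * 2)   # short function passed as a lambda

    def describe(x):
        # longer logic that would be awkward to express as a lambda
        if x % 2 == 0:
            return "%d is even" % x
        return "%d is odd" % x

    described = sc.parallelize([1, 2, 3, 4]).map(describe)        # function defined with def
    print(doubled.collect(), described.collect())                 # plain Python lists come back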

Python API Key Differences Functions can access objects in enclosing scopes, although modifications to those objects within RDD methods will not be propagated back (see the sketch below). PySpark automatically ships these functions to workers, along with any objects that they reference. Instances of classes will be serialized and shipped to workers by PySpark, but the classes themselves cannot be automatically distributed to workers.
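A sketch of this behavior (variable names are assumptions; sc is an existing SparkContext):
    counter = 0

    def increment(x):
        global counter
        counter += 1          # modifies only the worker's copy of counter
        return x

    sc.parallelize(range(10)).map(increment).collect()
    print(counter)            # still 0 on the driver: the change was not propagated back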

Interactive Use The bin/pyspark script launches a Python interpreter that is configured to run PySpark applications. The Python shell can be used to explore data interactively and is a simple way to learn the API. By default, the bin/pyspark shell creates a SparkContext that runs applications locally on a single core; to connect to a non-local cluster or to use multiple cores, set the MASTER environment variable. PySpark can also be used from standalone Python scripts by creating a SparkContext in your script and running the script using bin/pyspark, as sketched below.
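A minimal standalone-script sketch (the file name, application name, and master URL are assumptions; here the master is set in code via setMaster, an alternative to setting the MASTER environment variable):
    # save as my_script.py and run it with bin/pyspark my_script.py
    # (newer releases use bin/spark-submit my_script.py instead)
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("StandaloneExample").setMaster("local[4]")   # use 4 local cores
    sc = SparkContext(conf=conf)
    print(sc.parallelize(range(100)).sum())
    sc.stop()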

Spark Code Examples Text search of error messages in a log file
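A hedged sketch of this example (the file name and search terms are assumptions; sc is an existing SparkContext):
    lines = sc.textFile("log.txt")                               # load the log file as an RDD
    errors = lines.filter(lambda line: "ERROR" in line)          # keep only the error messages
    errors.cache()                                               # keep them in memory for reuse
    print(errors.count())                                        # total number of error messages
    print(errors.filter(lambda line: "MySQL" in line).count())   # errors that mention MySQL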