Hadoop Tutorials: Spark
Kacper Surdy, Prasanth Kothuri
About the tutorial
The third session in the Hadoop tutorial series, this time given by Kacper and Prasanth. The session is fully dedicated to the Spark framework, which is extensively discussed, actively developed, and used in production. A mixture of a talk and hands-on exercises.
What is Spark
A framework for performing distributed computations, scalable and applicable to processing TBs of data, with an easy programming interface. Supports Java, Scala, Python and R. Varied APIs: DataFrames, SQL, MLlib, Streaming. Multiple cluster deployment modes. Multiple data sources: HDFS, Cassandra, HBase, S3. Why is this good: TBs of data, massive computations, scalability.
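A minimal sketch of the DataFrame and SQL APIs mentioned above, assuming a Spark 2.x spark-shell where the session object spark is predefined; the HDFS path and the presence of a header row are hypothetical:

// read a hypothetical CSV file from HDFS into a DataFrame
val df = spark.read.option("header", "true").csv("hdfs:///tmp/example.csv")
df.printSchema()                               // inspect the inferred schema
df.createOrReplaceTempView("example")          // expose the DataFrame to SQL
spark.sql("SELECT count(*) FROM example").show()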
Compared to Impala
Similar concept of workload distribution and an overlap in SQL functionality. Spark is more flexible and offers a richer API; Impala is fine-tuned for SQL queries.
[Diagram: nodes 1 to X, each with its own memory, CPU and disks, connected by an interconnect network.]
Evolution from MapReduce
[Timeline, 2002 to 2014: MapReduce at Google, MapReduce paper, Hadoop at Yahoo!, Hadoop Summit, Spark paper, Apache Spark becomes a top-level Apache project. A Google quote notes they stopped using MapReduce, which was decommissioned in 2014.]
Based on a Databricks slide.
Deployment modes
Local mode: makes use of multiple cores/CPUs with thread-level parallelism.
Cluster modes: Standalone, Apache Mesos, Hadoop YARN (typical for Hadoop clusters), with centralised resource management.
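As a sketch, the deployment mode is selected through the master URL, passed either with the --master option of spark-shell/spark-submit or in the application code; the host name below is hypothetical and the plain "yarn" value assumes Spark 2.x:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("DeploymentModes")
  .setMaster("local[*]")                 // local mode: thread-level parallelism on all cores
  // .setMaster("spark://host:7077")     // standalone cluster (hypothetical host)
  // .setMaster("yarn")                  // Hadoop YARN with centralised resource management
val sc = new SparkContext(conf)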
How to use it
Interactive shell: spark-shell, pyspark.
Job submission: spark-submit (use it also for Python).
Notebooks: Jupyter, Zeppelin, SWAN (centralised CERN Jupyter service).
Interfaces: terminal and web interface.
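For job submission, a self-contained application such as the sketch below can be packaged into a jar and launched with spark-submit; the object name, jar name and master value are illustrative only:

// spark-submit --class PiApp --master yarn pi-app.jar
import org.apache.spark.{SparkConf, SparkContext}
import scala.math.random

object PiApp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PiApp"))
    val n = 200000
    val count = sc.parallelize(1 to n)
      .map { _ => val x = random; val y = random; if (x*x + y*y < 1) 1 else 0 }
      .reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / n)
    sc.stop()
  }
}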
Example – Pi estimation (Monte Carlo method, https://en.wikipedia.org/wiki/Monte_Carlo_method; instructions on the next slide)

import scala.math.random

val slices = 2
val n = 100000 * slices
val rdd = sc.parallelize(1 to n, slices)
val sample = rdd.map { i =>
  val x = random
  val y = random
  if (x*x + y*y < 1) 1 else 0
}
val count = sample.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / n)
Example – Pi estimation
Log in to one of the cluster nodes haperf1[01-12], start spark-shell, and copy or retype the content of /afs/cern.ch/user/k/kasurdy/public/pi.spark
Spark concepts and terminology
Transformations, actions (1)
Transformations define how you want to transform your data. The result of a transformation of a Spark dataset is another Spark dataset. They do not trigger a computation. Examples: map, filter, union, distinct, join.
Actions are used to collect the result from a dataset defined previously. The result is usually an array or a number. They start the computation. Examples: reduce, collect, count, first, take.
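A small sketch of the difference, runnable in spark-shell (where sc is predefined): the transformations only describe the result, and nothing is computed until an action is called.

val numbers = sc.parallelize(1 to 1000)
val evens   = numbers.filter(_ % 2 == 0)   // transformation: nothing runs yet
val doubled = evens.map(_ * 2)             // transformation: still nothing runs
doubled.count()                            // action: triggers the computation, returns 500
doubled.take(5)                            // action: returns Array(4, 8, 12, 16, 20)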
Transformations, actions (2)

import scala.math.random

val slices = 2
val n = 100000 * slices
val rdd = sc.parallelize(1 to n, slices)       // create the dataset 1, 2, 3, ..., n
val sample = rdd.map { i =>                    // transformation: the map function
  val x = random
  val y = random
  if (x*x + y*y < 1) 1 else 0
}
val count = sample.reduce(_ + _)               // action: the reduce function triggers the computation
println("Pi is roughly " + 4.0 * count / n)

[Diagram: the elements 1, 2, 3, ..., n-1, n flow through the map function and are combined by the reduce function; with count = 157018 and n = 200000, 4.0 * count / n ≈ 3.14.]
Driver, worker, executor (typically on a cluster)

import scala.math.random
                                               // the driver runs the script and holds the SparkContext (sc)
val slices = 2
val n = 100000 * slices
val rdd = sc.parallelize(1 to n, slices)
val sample = rdd.map { i =>                    // the map and reduce functions run in executors on worker nodes
  val x = random
  val y = random
  if (x*x + y*y < 1) 1 else 0
}
val count = sample.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / n)    // the result is collected back to the driver

[Diagram: the driver (SparkContext) talks to the cluster manager, which allocates executors on the worker nodes.]
Stages, tasks
Task: a unit of work that will be sent to one executor.
Stage: each job gets divided into smaller sets of tasks, called stages, that depend on each other; the boundaries between stages act as synchronization points.
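A sketch of how stages appear in practice: a shuffle (here introduced by reduceByKey) splits the job into two stages, and each stage runs as one task per partition.

val words  = sc.parallelize(Seq("spark", "hadoop", "spark", "yarn"), 2)
val pairs  = words.map(w => (w, 1))        // stage 1: narrow transformation, 2 tasks
val counts = pairs.reduceByKey(_ + _)      // shuffle boundary: starts stage 2
counts.collect()                           // action: runs both stages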
Your job as a directed acyclic graph (DAG)
https://dwtobigdata.wordpress.com/2015/09/29/etl-with-apache-spark/
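One way to look at the DAG from the shell is the toDebugString method on an RDD, which prints the lineage of transformations that the web UI also visualises (a minimal sketch):

val counts = sc.parallelize(Seq("a", "b", "a"))
  .map(w => (w, 1))
  .reduceByKey(_ + _)
println(counts.toDebugString)              // prints the lineage, indented at shuffle boundaries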
Monitoring pages
If you can, scroll the console output up to get an application ID (e.g. application_1468968864481_0070) and open http://haperf100.cern.ch:8088/proxy/application_XXXXXXXXXXXXX_XXXX/
Otherwise use the link below to find your application on the list: http://haperf100.cern.ch:8088/cluster
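If the application ID has scrolled away, it can also be read from the SparkContext inside spark-shell (a small sketch):

println(sc.applicationId)                  // e.g. application_XXXXXXXXXXXXX_XXXX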
Monitoring pages
Data APIs