Apache Spark
Osman AIDEL
Centre de Calcul de l’Institut National de Physique Nucléaire et de Physique des Particules
◦ What is Spark?
◦ Spark architecture
◦ Core concept: Resilient Distributed Dataset
◦ Spark features
◦ LSST use case
What is Apache Spark?
◦ Cluster computing platform for data analysis
◦ Wide range of workloads in a unified framework
◦ Very fast: in-memory caching
◦ API for Scala, Java, and Python
◦ High availability
◦ Best suited for batch applications that apply the same operation to all elements of a dataset
◦ Core concept: RDD (Matei Zaharia’s research paper)
What is Spark?
[Diagram: a Python/Scala/Java program reads blocks from a big data source into RDDs; Spark runs tasks in executors on workers 1..N and writes the results to a destination such as a database or a shared file system]
Resilient Distributed Dataset
◦ An RDD in Spark is simply an immutable distributed collection of objects
◦ Users handle an external dataset through the RDD object:
  lines = sc.textFile('mydataset.file')
◦ RDDs offer two types of operations:
  Transformations: they construct a new RDD from a previous one
  pythonLines = lines.filter(lambda line: "Python" in line)
  Actions: they compute a result based on an RDD
  lines.count()
◦ The MapReduce paradigm can be implemented easily:
  lines.map(func)
  lines.reduce(func)
◦ Joining pair RDDs:
  pairRDD1: {(1, 2), (3, 4), (3, 6)}
  pairRDD2: {(3, 9)}
  One-pair-RDD transformation: pairRDD1.groupByKey() -> {(1, [2]), (3, [4, 6])}
  Two-pair-RDD transformation: pairRDD1.join(pairRDD2) -> {(3, (4, 9)), (3, (6, 9))}
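As a worked illustration of the operations above, here is a minimal PySpark sketch; it assumes a local Spark installation and a text file named mydataset.file (the file name is the placeholder from the slide):

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

# Transformation: builds a new RDD from the previous one (lazy).
lines = sc.textFile("mydataset.file")
pythonLines = lines.filter(lambda line: "Python" in line)

# Action: triggers the actual computation and returns a result.
print(pythonLines.count())

# MapReduce style: map each line to its length, then reduce to a total.
total_chars = lines.map(lambda line: len(line)).reduce(lambda a, b: a + b)

# Pair-RDD transformations from the example above.
pairRDD1 = sc.parallelize([(1, 2), (3, 4), (3, 6)])
pairRDD2 = sc.parallelize([(3, 9)])
grouped = pairRDD1.groupByKey()   # {(1, [2]), (3, [4, 6])}
joined = pairRDD1.join(pairRDD2)  # {(3, (4, 9)), (3, (6, 9))}

sc.stop()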
Spark architecture
[Diagram: the driver/application talks to a cluster manager (Mesos, YARN, or the Spark standalone cluster), which starts executors on workers 1..N; each executor runs tasks and holds a cache]
◦ The Spark driver converts a user program into tasks (an execution plan)
◦ Create your application with the Scala, Java, or Python API
◦ Submit your application to the Spark driver
◦ The Spark driver requests resources from the cluster manager
◦ When executors are started they register themselves with the driver
◦ Executors run tasks and return results to the driver / an external destination
◦ Data storage: local FS, NFS, AFS, Amazon S3, HDFS
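To make the submission flow concrete, below is a sketch of a minimal self-contained application; the file name app.py and the input path are hypothetical placeholders, not taken from the talk:

from pyspark import SparkConf, SparkContext

# app.py -- a minimal self-contained Spark application.
conf = SparkConf().setAppName("my-app")  # the master is chosen at submit time
sc = SparkContext(conf=conf)

# The driver turns this program into a plan of tasks; the executors
# run the tasks and the result comes back to the driver.
count = sc.textFile("hdfs:///path/to/input").count()
print(count)
sc.stop()

# Handed to the cluster with, e.g.:
#   spark-submit --master yarn app.py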
Spark features
◦ MLlib (machine learning library)
◦ Spark Streaming: Flume, Kafka, Elasticsearch
◦ Spark SQL: JDBC, ODBC, Hive, MongoDB
◦ GraphX: graph manipulation
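As one small taste of these modules, here is a minimal Spark SQL sketch (Spark 1.x-era API; the table name and sample rows are made up for illustration):

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext("local[*]", "sql-demo")
sqlContext = SQLContext(sc)

# Build a DataFrame from an RDD of Rows and query it with plain SQL.
rows = sc.parallelize([Row(name="fits1", size=2880), Row(name="fits2", size=5760)])
df = sqlContext.createDataFrame(rows)
df.registerTempTable("files")
sqlContext.sql("SELECT name FROM files WHERE size > 2880").show()

sc.stop()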
LSST use case
◦ Build a FITS dictionary: a CSV file lists all files in /sps/snls12 (GPFS)
  33 million files
  9 million FITS files
◦ FITS metadata extraction (a sketch of this step follows below):
  cards: number of keywords contained in the header
  size of the header unit (bytes)
  used space in the header unit (bytes)
  size of the data unit (bytes)
  used space in the data unit (bytes)
  …
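A hedged sketch of what the per-file extraction could look like with astropy; extract_fits_meta is a hypothetical helper name, and only some of the fields computed by the real job are known from the slide:

from astropy.io import fits

def extract_fits_meta(path):
    with fits.open(path) as hdus:
        header = hdus[0].header
        # Number of keyword cards in the primary header.
        n_cards = len(header.cards)
        # Header unit size in bytes (FITS headers are padded to 2880-byte blocks).
        header_bytes = len(header.tostring())
        # Data unit size computed from BITPIX and the NAXISn keywords.
        naxis = header.get("NAXIS", 0)
        if naxis == 0:
            data_bytes = 0
        else:
            data_bytes = abs(header.get("BITPIX", 0)) // 8
            for i in range(1, naxis + 1):
                data_bytes *= header["NAXIS%d" % i]
        return {"path": path, "cards": n_cards,
                "header_bytes": header_bytes, "data_bytes": data_bytes}

# Distributed over the file list with Spark (assuming one path per CSV line):
# meta = sc.textFile("filelist.csv").map(extract_fits_meta)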
[Diagram: the Spark driver on the master distributes the CSV file lists to executors on workers 1..10; each executor reads FITS files from /sps/snls12, caches intermediate results, and writes the extracted metadata to mongoDB]
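One plausible shape for the final write step, assuming pymongo is available on the workers; the host, database, and collection names are placeholders:

from pymongo import MongoClient

def save_partition(records):
    # One client per partition, so connections are opened on the
    # workers rather than on the driver.
    client = MongoClient("mongodb://mongo-host:27017")
    docs = list(records)
    if docs:
        client.fitsdict.headers.insert_many(docs)
    client.close()

# meta is the RDD of metadata dicts built in the extraction step:
# meta.foreachPartition(save_partition)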
Spark & mongoDB SGBDPlateforme Connections / transactions Loading stepData size MariaDB / Galera 3 machines (16 core & 32GB of RAM) ~ 9 millions146 heures400 GB MongoDB1 VM (4 cores & 4 GB of RAM) ~ 9 millions32 heures46 GB
Conclusion
◦ Spark is very easy to install and to use
◦ The application needs to be designed around the RDD concept
◦ The FITS use case is a good way to test Spark, but it is not representative of all of Spark’s possibilities
Perspectives
◦ A centralized FITS dictionary
  I/O simulator
◦ Image co-addition: the co-addition is CPU intensive and does not consume much memory, so Spark might be a good candidate for this task
◦ File monitoring
  Moving all files no longer accessed on GPFS to HPSS
  Using MongoDB with sharding
Web sites:
◦ 4Quant
Books:
◦ Learning Spark
◦ Advanced Analytics with Spark: Patterns for Learning from Data at Scale