Centre de Calcul de l’Institut National de Physique Nucléaire et de Physique des Particules Apache Spark Osman AIDEL.



Outline
 What is Spark?
 Spark architecture
 Core concept: Resilient Distributed Dataset
 Spark features
 LSST use case

What is Apache Spark?
 Cluster computing platform for data analysis
 Wide range of workloads in a unified framework
 Very fast: in-memory caching
 API for Scala, Java, and Python
 High availability
 Best suited for batch applications that apply the same operation to all elements of a dataset
 Core concept: RDD (Matei Zaharia's research paper)

What is Spark? (diagram) A Python/Scala/Java program acts as the driver. Spark reads blocks from a big data source and distributes them as RDDs across N workers, each running an executor with tasks; results are written to a destination such as a database or a shared file system.

Resilient Distributed Dataset
 An RDD in Spark is simply an immutable distributed collection of objects
 Users handle an external dataset through the RDD object
◦ lines = sc.textFile('mydataset.file')
 RDDs offer two types of operations
◦ Transformations: they construct a new RDD from a previous one
 pythonLines = lines.filter(lambda line: "Python" in line)
◦ Actions: they compute a result based on an RDD
 lines.count()
 The MapReduce paradigm can be easily implemented
◦ lines.map(func)
◦ lines.reduce(func)
 Joining RDDs
◦ pairRDD1: {(1, 2), (3, 4), (3, 6)}
◦ pairRDD2: {(3, 9)}
◦ One-pair-RDD transformation:
 pairRDD1.groupByKey() -> {(1, [2]), (3, [4, 6])}
◦ Two-pair-RDD transformation:
 pairRDD1.join(pairRDD2) -> {(3, (4, 9)), (3, (6, 9))}
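The pair-RDD results on this slide can be checked with a small pure-Python sketch of the groupByKey and join semantics. Plain dictionaries and lists stand in for Spark's distributed implementation; the function names mirror the RDD API but this is not PySpark:

```python
from collections import defaultdict

def group_by_key(pairs):
    """Group (key, value) pairs into {key: [values]}, like RDD.groupByKey()."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return dict(groups)

def join(left, right):
    """Inner join on keys, like RDD.join(): emits (key, (v_left, v_right))."""
    right_groups = group_by_key(right)
    return [(k, (v, w)) for k, v in left for w in right_groups.get(k, [])]

pair_rdd1 = [(1, 2), (3, 4), (3, 6)]
pair_rdd2 = [(3, 9)]

print(group_by_key(pair_rdd1))   # {1: [2], 3: [4, 6]}
print(join(pair_rdd1, pair_rdd2))  # [(3, (4, 9)), (3, (6, 9))]
```

In real Spark the same calls run partition by partition across executors, with a shuffle moving equal keys onto the same partition before grouping.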

Spark architecture
 Components: the driver/application, a cluster manager (Mesos, YARN, or the Spark standalone cluster), and N workers, each running an executor with tasks and a cache
 Create your application: API for Scala, Java, and Python
 Submit your application to the Spark driver
 The Spark driver converts the user program into tasks (an execution plan)
 The Spark driver asks the cluster manager for resources
 When executors are started, they register themselves with the driver
 Executors run tasks and return results to the driver or to an external destination
 Data storage: local FS, NFS, AFS, Amazon S3, HDFS
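The submission step above is typically done with `spark-submit`. A sketch of such an invocation, where the script name, master, and resource sizes are placeholders, not values from the slides:

```shell
# Submit a Python application to a YARN-managed cluster.
# my_app.py and the resource sizes below are hypothetical.
spark-submit \
  --master yarn \
  --num-executors 4 \
  --executor-memory 2G \
  my_app.py
```

The driver then negotiates those executors with the cluster manager, and the same command works against Mesos or a standalone master by changing the `--master` URL.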

Spark features
◦ MLlib (machine learning library)
◦ Streaming
 Flume, Kafka, ElasticSearch
◦ Spark SQL
 JDBC, ODBC, Hive, MongoDB
◦ GraphX
 graph processing

LSST use case
 Build a FITS dictionary
 CSV file listing all files in /sps/snls12 (GPFS)
◦ 33 million files
◦ 9 million FITS files
 FITS metadata extraction
◦ cards: number of keywords contained in the header
◦ size of the header unit (bytes)
◦ used space for the header unit (bytes)
◦ size of the data unit (bytes)
◦ used space for the data unit (bytes)
◦ …
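The header-size metadata above follows directly from the FITS format: each header keyword occupies one 80-byte card, and header and data units are padded to 2880-byte blocks. A small sketch of that arithmetic (the function names are illustrative, not from the slides):

```python
import math

CARD_SIZE = 80      # each FITS header keyword record ("card") is 80 bytes
BLOCK_SIZE = 2880   # FITS header and data units are padded to 2880-byte blocks

def header_used_space(ncards):
    """Bytes actually occupied by the header cards."""
    return ncards * CARD_SIZE

def unit_size(used_bytes):
    """Allocated size of a header or data unit, padded to full FITS blocks."""
    return math.ceil(used_bytes / BLOCK_SIZE) * BLOCK_SIZE

# A header with 36 cards exactly fills one block; 37 cards spill into two.
print(unit_size(header_used_space(36)))  # 2880
print(unit_size(header_used_space(37)))  # 5760
```

The difference between used space and unit size is exactly the padding the extraction job records for each of the 9 million FITS files.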

(Diagram) The Spark driver on a master node distributes CSV file lists from /sps/snls12 across 10 workers; each worker runs an executor with tasks and a cache, and the extracted metadata is written to MongoDB.

Spark & MongoDB

DBMS              Platform                           Connections/transactions  Loading time  Data size
MariaDB / Galera  3 machines (16 cores & 32 GB RAM)  ~9 million                146 hours     400 GB
MongoDB           1 VM (4 cores & 4 GB RAM)          ~9 million                32 hours      46 GB

Conclusion
 Spark is very easy to install and to use
 The application needs to be designed around the RDD concept
 The FITS use case is a good approach for testing Spark, but it is not representative of Spark's full possibilities

Perspectives
 A centralized FITS dictionary
◦ I/O simulator
 Image co-addition: co-addition is CPU-intensive and does not consume much memory, so Spark might be a good candidate for this task
 File monitoring
◦ Moving all files no longer accessed on GPFS to HPSS
◦ Using MongoDB with sharding
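As a rough illustration of why co-addition parallelizes well: the core operation is an element-wise mean over aligned exposures, so each pixel (or row of pixels) can be processed independently in a Spark task. A minimal pure-Python sketch, where the image representation and function name are assumptions, not from the slides:

```python
def coadd(images):
    """Per-pixel mean of a list of aligned images (lists of rows of pixels).

    Each pixel is independent, so the work splits naturally across
    Spark tasks, e.g. one partition of pixel rows per task.
    """
    n = len(images)
    rows, cols = len(images[0]), len(images[0][0])
    return [[sum(img[r][c] for img in images) / n for c in range(cols)]
            for r in range(rows)]

# Two 2x2 exposures: the co-added image is their pixel-wise average.
a = [[1.0, 2.0], [3.0, 4.0]]
b = [[3.0, 4.0], [5.0, 6.0]]
print(coadd([a, b]))  # [[2.0, 3.0], [4.0, 5.0]]
```

This matches the CPU-bound, low-memory profile noted above: each task only holds its own slice of the input exposures.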

 Web sites:
◦ 4Quant
 Books:
◦ Learning Spark
◦ Advanced Analytics with Spark: Patterns for Learning from Data at Scale