Published by Meryl Webster. Modified over 8 years ago.
1
Apache Spark
Osman AIDEL
Centre de Calcul de l’Institut National de Physique Nucléaire et de Physique des Particules
2
What is Spark?
Spark architecture
Core concept: Resilient Distributed Dataset
Spark features
LSST use case
3
What is Apache Spark?
Cluster computing platform for data analysis
Wide range of workloads in a unified framework
Very fast: in-memory caching
APIs for Scala, Java, and Python
High availability
Best suited for batch applications that apply the same operation to all elements of a dataset
Core concept: RDD (Matei Zaharia's research paper)
4
What is Spark?
[Diagram: a Python/Scala/Java program runs on SPARK. A big data source is read in blocks as RDDs, processed by executor tasks on Workers 1 to N, and the results go to a destination (database, shared file system, …).]
5
Resilient Distributed Dataset
An RDD in Spark is simply an immutable distributed collection of objects
Users handle an external dataset through the RDD object
  ◦ lines = sc.textFile('mydataset.file')
RDDs offer two types of operations
  ◦ Transformations: they construct a new RDD from a previous one
      pythonLines = lines.filter(lambda line: "Python" in line)
  ◦ Actions: they compute a result based on an RDD
      lines.count()
The MapReduce paradigm can be easily implemented
  ◦ lines.map(func)
  ◦ lines.reduce(func)
Join RDD
  ◦ pairRDD1: {(1, 2), (3, 4), (3, 6)}
  ◦ pairRDD2: {(3, 9)}
  ◦ One pair RDD transformation: pairRDD1.groupByKey() -> {(1, [2]), (3, [4, 6])}
  ◦ Two pair RDD transformation: pairRDD1.join(pairRDD2) -> {(3, (4, 9)), (3, (6, 9))}
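The groupByKey and join results on this slide can be checked with a small plain-Python sketch. It does not use Spark: ordinary lists stand in for RDDs, and the helper names `group_by_key` and `join` are hypothetical, chosen only to mirror the RDD methods. The three sample lines are made up for illustration.

```python
from collections import defaultdict
from functools import reduce


def group_by_key(pairs):
    """Collect the values of each key, mimicking what groupByKey() returns."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def join(left, right):
    """Inner join on the key, mimicking what pairRDD1.join(pairRDD2) returns."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    # One output pair per matching (left value, right value) combination.
    return [(key, (lv, rv)) for key, lv in left for rv in right_by_key.get(key, [])]


# The two pair datasets from the slide, as plain lists.
pair_rdd1 = [(1, 2), (3, 4), (3, 6)]
pair_rdd2 = [(3, 9)]

grouped = group_by_key(pair_rdd1)   # [(1, [2]), (3, [4, 6])]
joined = join(pair_rdd1, pair_rdd2)  # [(3, (4, 9)), (3, (6, 9))]

# Filter, count, map and reduce on made-up sample lines.
lines = ["Python is fun", "Scala too", "More Python"]
python_lines = [line for line in lines if "Python" in line]  # like filter()
python_count = len(python_lines)                             # like count()
total_chars = reduce(lambda a, b: a + b, map(len, lines))    # like map() then reduce()
```

Key 1 has no partner in the second dataset, so it disappears from the join, while key 3 yields one output pair per matching value.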
6
Spark architecture
[Diagram: the driver / application talks to a cluster manager (Mesos, YARN, or Spark Cluster), which controls Workers 1 to N; each worker runs an executor with tasks and a cache.]
Steps (numbered 1 to 6 on the diagram):
  ◦ Create your application: APIs for Scala, Java, and Python
  ◦ Submit your application to the Spark driver
  ◦ The Spark driver converts the user program into tasks (execution plan)
  ◦ The Spark driver requests resources from the cluster manager
  ◦ When executors start, they register themselves with the driver
  ◦ Executors run tasks and return results to the driver or to an external source
Data storage:
  ◦ Local FS
  ◦ NFS, AFS
  ◦ Amazon S3
  ◦ HDFS
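The division of labour between driver and executors can be illustrated with a toy sketch in plain Python. It is not how Spark is implemented: a thread pool stands in for the executors, there is no cluster manager, and the names `split_into_partitions`, `count_matching` and `run_job` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Driver side: cut the dataset into roughly equal partitions."""
    size, rest = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < rest else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def count_matching(partition, word):
    """Task: runs on one executor and processes a single partition."""
    return sum(1 for line in partition if word in line)


def run_job(data, word, num_executors=4):
    """Driver: plan one task per partition, run them, combine the results."""
    partitions = split_into_partitions(data, num_executors)
    with ThreadPoolExecutor(max_workers=num_executors) as executors:
        partial_counts = list(
            executors.map(lambda part: count_matching(part, word), partitions)
        )
    # The partial results come back to the driver, which merges them.
    return sum(partial_counts)


# Made-up input: every third line mentions Python.
lines = [f"line {i} Python" if i % 3 == 0 else f"line {i} Scala" for i in range(10)]
total = run_job(lines, "Python")
```

Each task sees only its own partition, which is why the approach suits jobs that apply the same operation to every element of a dataset.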
7
Spark features
  ◦ MLlib (machine learning library)
  ◦ Streaming: Flume, Kafka, Elasticsearch
  ◦ Spark SQL: JDBC, ODBC, Hive, MongoDB
  ◦ GraphX: graph manipulation
8
LSST use case
Build a FITS dictionary
CSV file listing all files in /sps/snls12 (GPFS)
  ◦ 33 million files
  ◦ 9 million FITS files
FITS metadata extraction
  ◦ cards: number of keywords contained in the header
  ◦ Size of the header unit (bytes)
  ◦ Used space for the header unit (bytes)
  ◦ Size of the data unit (bytes)
  ◦ Used space for the data unit (bytes)
  ◦ …
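A minimal sketch of how these quantities could be derived from a primary FITS header, in plain Python. It is not the code used for the dictionary. It relies on the FITS layout of 80-character cards stored in 2880-byte blocks, and it makes labelled assumptions: the card count includes the END card, "used space" means the bytes before padding, and the data size is |BITPIX|/8 times the product of the NAXISn values (no extensions or random groups). The function name `read_header_stats` and the sample header are hypothetical.

```python
import io

BLOCK = 2880  # FITS headers and data are stored in 2880-byte blocks
CARD = 80     # one header card is 80 ASCII characters


def read_header_stats(stream):
    """Read a primary header and return the metadata listed on the slide."""
    cards, header_unit_size, done = [], 0, False
    while not done:
        block = stream.read(BLOCK)
        if len(block) < BLOCK:
            raise ValueError("truncated FITS header")
        header_unit_size += BLOCK
        for offset in range(0, BLOCK, CARD):
            card = block[offset:offset + CARD].decode("ascii")
            cards.append(card)
            if card[:8].rstrip() == "END":  # END closes the header
                done = True
                break

    # Only the integer keywords needed for the data size are parsed.
    ints = {}
    for card in cards:
        key = card[:8].rstrip()
        if card[8:10] == "= " and (key == "BITPIX" or key.startswith("NAXIS")):
            ints[key] = int(card[10:].split("/")[0])

    data_used = 0
    if ints.get("NAXIS", 0) > 0:
        data_used = abs(ints["BITPIX"]) // 8
        for axis in range(1, ints["NAXIS"] + 1):
            data_used *= ints[f"NAXIS{axis}"]

    return {
        "cards": len(cards),                  # assumption: END is counted
        "header_unit_size": header_unit_size,
        "header_used": len(cards) * CARD,     # assumption: bytes before padding
        "data_unit_size": -(-data_used // BLOCK) * BLOCK,  # rounded up to a block
        "data_used": data_used,
    }


def make_card(text):
    """Pad a card to 80 characters (helper for the synthetic example)."""
    return text.ljust(CARD).encode("ascii")


# Synthetic header of a 100 x 50 image with 16-bit pixels.
sample = b"".join([
    make_card("SIMPLE  =                    T"),
    make_card("BITPIX  =                   16"),
    make_card("NAXIS   =                    2"),
    make_card("NAXIS1  =                  100"),
    make_card("NAXIS2  =                   50"),
    make_card("END"),
]).ljust(BLOCK, b" ")

stats = read_header_stats(io.BytesIO(sample))
```

Only the header is read, so the cost per file stays small even when the data unit is large.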
9
[Diagram: SPARK driver, master, and Workers 1 to 10, each with an executor, tasks, and a cache; CSV files, /sps/snls12, and MongoDB.]
10
Spark & MongoDB
DBMS | Platform | Connections / transactions | Loading step | Data size
MariaDB / Galera | 3 machines (16 cores & 32 GB of RAM) | ~9 million | 146 hours | 400 GB
MongoDB | 1 VM (4 cores & 4 GB of RAM) | ~9 million | 32 hours | 46 GB
11
Conclusion
SPARK is very easy to install and to use
The application needs to be designed around the RDD concept
The FITS use case is a good way to test SPARK but is not representative of SPARK's possibilities
12
Perspectives
A centralized FITS dictionary
  ◦ IO simulator
Image co-addition: co-addition is CPU intensive and does not consume much memory, so SPARK might be a good candidate for this task
File monitoring
  ◦ Moving all files on GPFS that are no longer accessed to HPSS
  ◦ Using MongoDB with sharding
13
Web sites:
  ◦ https://spark-summit.org : 4Quant
  ◦ http://thunder-project.org
  ◦ https://spark.apache.org
  ◦ https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
Books:
  ◦ Learning Spark
  ◦ Advanced Analytics with Spark: Patterns for Learning from Data at Scale