Apache Spark
Osman AIDEL
Centre de Calcul de l’Institut National de Physique Nucléaire et de Physique des Particules
◦ What is Spark?
◦ Spark architecture
◦ Core concept: Resilient Distributed Dataset
◦ Spark features
◦ LSST use case
What is Apache Spark?
◦ Cluster computing platform for data analysis
◦ Wide range of workloads in a unified framework
◦ Very fast: in-memory caching
◦ API for Scala, Java, and Python
◦ High availability
◦ Best suited for batch applications that apply the same operation to all elements of a dataset
◦ Core concept: RDD (Matei Zaharia’s research paper)
What is Spark?
[Diagram: a Python/Scala/Java program reads blocks from a big data source into RDDs; Spark runs tasks in executors on workers 1..N and writes the results to a destination such as a database or a shared file system]
Resilient Distributed Dataset
◦ An RDD in Spark is simply an immutable distributed collection of objects
◦ Users handle an external dataset through the RDD object:
  lines = sc.textFile('mydataset.file')
◦ RDDs offer two types of operations:
  Transformations: they construct a new RDD from a previous one
  pythonLines = lines.filter(lambda line: "Python" in line)
  Actions: they compute a result based on an RDD
  lines.count()
◦ The MapReduce paradigm can be implemented easily:
  lines.map(func)
  lines.reduce(func)
◦ Joining pair RDDs:
  pairRDD1: {(1, 2), (3, 4), (3, 6)}
  pairRDD2: {(3, 9)}
  One-pair-RDD transformation: pairRDD1.groupByKey() -> {(1, [2]), (3, [4, 6])}
  Two-pair-RDD transformation: pairRDD1.join(pairRDD2) -> {(3, (4, 9)), (3, (6, 9))}
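As a worked illustration of the operations above, here is a minimal PySpark sketch; it assumes a local Spark installation and a text file named mydataset.file (the file name is the placeholder from the slide):

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

# Transformation: builds a new RDD from the previous one (lazy).
lines = sc.textFile("mydataset.file")
pythonLines = lines.filter(lambda line: "Python" in line)

# Action: triggers the actual computation and returns a result.
print(pythonLines.count())

# MapReduce style: map each line to its length, then reduce to a total.
total_chars = lines.map(lambda line: len(line)).reduce(lambda a, b: a + b)

# Pair-RDD transformations from the example above.
pairRDD1 = sc.parallelize([(1, 2), (3, 4), (3, 6)])
pairRDD2 = sc.parallelize([(3, 9)])
grouped = pairRDD1.groupByKey()   # {(1, [2]), (3, [4, 6])}
joined = pairRDD1.join(pairRDD2)  # {(3, (4, 9)), (3, (6, 9))}

sc.stop()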
Spark architecture
[Diagram: the driver/application talks to a cluster manager (Mesos, YARN, or the Spark standalone cluster), which starts executors on workers 1..N; each executor runs tasks and holds a cache]
◦ The Spark driver converts a user program into tasks (an execution plan)
◦ Create your application with the Scala, Java, or Python API
◦ Submit your application to the Spark driver
◦ The Spark driver requests resources from the cluster manager
◦ When executors are started they register themselves with the driver
◦ Executors run tasks and return results to the driver / an external destination
◦ Data storage: local FS, NFS, AFS, Amazon S3, HDFS
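To make the submission flow concrete, below is a sketch of a minimal self-contained application; the file name app.py and the input path are hypothetical placeholders, not taken from the talk:

from pyspark import SparkConf, SparkContext

# app.py -- a minimal self-contained Spark application.
conf = SparkConf().setAppName("my-app")  # the master is chosen at submit time
sc = SparkContext(conf=conf)

# The driver turns this program into a plan of tasks; the executors
# run the tasks and the result comes back to the driver.
count = sc.textFile("hdfs:///path/to/input").count()
print(count)
sc.stop()

# Handed to the cluster with, e.g.:
#   spark-submit --master yarn app.py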
Spark features
◦ MLlib (machine learning library)
◦ Spark Streaming: Flume, Kafka, Elasticsearch
◦ Spark SQL: JDBC, ODBC, Hive, MongoDB
◦ GraphX: graph manipulation
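As one small taste of these modules, here is a minimal Spark SQL sketch (Spark 1.x-era API; the table name and sample rows are made up for illustration):

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext("local[*]", "sql-demo")
sqlContext = SQLContext(sc)

# Build a DataFrame from an RDD of Rows and query it with plain SQL.
rows = sc.parallelize([Row(name="fits1", size=2880), Row(name="fits2", size=5760)])
df = sqlContext.createDataFrame(rows)
df.registerTempTable("files")
sqlContext.sql("SELECT name FROM files WHERE size > 2880").show()

sc.stop()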
LSST use case
◦ Build a FITS dictionary: a CSV file lists all files in /sps/snls12 (GPFS)
  33 million files
  9 million FITS files
◦ FITS metadata extraction (a sketch of this step follows below):
  cards: number of keywords contained in the header
  size of the header unit (bytes)
  used space in the header unit (bytes)
  size of the data unit (bytes)
  used space in the data unit (bytes)
  …
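A hedged sketch of what the per-file extraction could look like with astropy; extract_fits_meta is a hypothetical helper name, and only some of the fields computed by the real job are known from the slide:

from astropy.io import fits

def extract_fits_meta(path):
    with fits.open(path) as hdus:
        header = hdus[0].header
        # Number of keyword cards in the primary header.
        n_cards = len(header.cards)
        # Header unit size in bytes (FITS headers are padded to 2880-byte blocks).
        header_bytes = len(header.tostring())
        # Data unit size computed from BITPIX and the NAXISn keywords.
        naxis = header.get("NAXIS", 0)
        if naxis == 0:
            data_bytes = 0
        else:
            data_bytes = abs(header.get("BITPIX", 0)) // 8
            for i in range(1, naxis + 1):
                data_bytes *= header["NAXIS%d" % i]
        return {"path": path, "cards": n_cards,
                "header_bytes": header_bytes, "data_bytes": data_bytes}

# Distributed over the file list with Spark (assuming one path per CSV line):
# meta = sc.textFile("filelist.csv").map(extract_fits_meta)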
[Diagram: the Spark driver on the master distributes the CSV file lists to executors on workers 1..10; each executor reads FITS files from /sps/snls12, caches intermediate results, and writes the extracted metadata to mongoDB]
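One plausible shape for the final write step, assuming pymongo is available on the workers; the host, database, and collection names are placeholders:

from pymongo import MongoClient

def save_partition(records):
    # One client per partition, so connections are opened on the
    # workers rather than on the driver.
    client = MongoClient("mongodb://mongo-host:27017")
    docs = list(records)
    if docs:
        client.fitsdict.headers.insert_many(docs)
    client.close()

# meta is the RDD of metadata dicts built in the extraction step:
# meta.foreachPartition(save_partition)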
Spark & mongoDB SGBDPlateforme Connections / transactions Loading stepData size MariaDB / Galera 3 machines (16 core & 32GB of RAM) ~ 9 millions146 heures400 GB MongoDB1 VM (4 cores & 4 GB of RAM) ~ 9 millions32 heures46 GB
Conclusion
◦ Spark is very easy to install and to use
◦ The application needs to be designed around the RDD concept
◦ The FITS use case is a good way to test Spark, but it is not representative of all of Spark’s possibilities
Perspectives
◦ A centralized FITS dictionary
  I/O simulator
◦ Image co-addition: the co-addition is CPU intensive and does not consume much memory, so Spark might be a good candidate for this task
◦ File monitoring
  Moving all files no longer accessed on GPFS to HPSS
  Using MongoDB with sharding
Web sites:
◦ 4Quant
Books:
◦ Learning Spark
◦ Advanced Analytics with Spark: Patterns for Learning from Data at Scale