Centre de Calcul de l’Institut National de Physique Nucléaire et de Physique des Particules Apache Spark Osman AIDEL.

• What is Spark?
• Spark architecture
• Core concept: Resilient Distributed Dataset
• Spark features
• LSST use case

What is Apache Spark?
• Cluster computing platform for data analysis
• Handles a wide range of workloads in a unified framework
• Very fast: in-memory caching
• APIs for Scala, Java, and Python
• High availability
• Best suited for batch applications that apply the same operation to all elements of a dataset
• Core concept: RDD (see Matei Zaharia's research paper)

What is Spark?
[Diagram: a Python/Scala/Java program drives Spark. Data from a big data source is loaded into RDDs, processed block by block by tasks running in executors on workers 1 to N, and written to a destination (database, shared file system, …).]

Resilient Distributed Dataset
• An RDD in Spark is simply an immutable distributed collection of objects
• Users handle external datasets through the RDD object
  ◦ lines = sc.textFile('mydataset.file')
• RDDs offer two types of operations
  ◦ Transformations: they construct a new RDD from a previous one
    ▪ pythonLines = lines.filter(lambda line: "Python" in line)
  ◦ Actions: they compute a result based on an RDD
    ▪ lines.count()
• The MapReduce paradigm can be easily implemented
  ◦ lines.map(func)
  ◦ lines.reduce(func)
• Joining RDDs
  ◦ pairRDD1: {(1, 2), (3, 4), (3, 6)}
  ◦ pairRDD2: {(3, 9)}
  ◦ Transformation on one pair RDD:
    ▪ pairRDD1.groupByKey() -> {(1, [2]), (3, [4, 6])}
  ◦ Transformation on two pair RDDs:
    ▪ pairRDD1.join(pairRDD2) -> {(3, (4, 9)), (3, (6, 9))}
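The operations on this slide can be mimicked in plain Python to make their semantics concrete. The sketch below does not use Spark at all: it models an RDD as an ordinary list, and the helper names `group_by_key` and `join` as well as the sample `lines` are hypothetical stand-ins for `groupByKey()`, `join()` and a real dataset. It only illustrates what each operation returns, not how Spark distributes the work.

```python
from collections import defaultdict
from functools import reduce

# Sample data (made up for this sketch; a real job would use sc.textFile).
lines = ["Python is used with Spark", "Scala too", "Java and Python"]

# Transformation: build a new collection from a previous one (like lines.filter).
python_lines = [line for line in lines if "Python" in line]

# Action: compute a result from a collection (like count()).
python_count = len(python_lines)

# MapReduce: map each line to its number of words, then reduce by addition.
words_per_line = map(lambda line: len(line.split()), lines)
total_words = reduce(lambda a, b: a + b, words_per_line)


def group_by_key(pairs):
    """Plain-Python model of groupByKey(): collect all values sharing a key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def join(left, right):
    """Plain-Python model of join(): pair up values whose keys match."""
    return [(k, (v, w)) for k, v in left for k2, w in right if k == k2]


# The pair collections from the slide.
pair_rdd1 = [(1, 2), (3, 4), (3, 6)]
pair_rdd2 = [(3, 9)]

grouped = group_by_key(pair_rdd1)
joined = join(pair_rdd1, pair_rdd2)
```

Unlike this list-based model, Spark evaluates transformations lazily and does not guarantee the order of the results of `groupByKey()` or `join()`.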

Spark architecture
[Diagram: the driver / application talks to a cluster manager (Spark Cluster, Mesos, or YARN), which manages workers 1 to N. Each worker hosts an executor, with a cache, that runs tasks. Data storage: local FS, NFS, AFS, Amazon S3, HDFS.]
1. Create your application: API for Scala, Java, and Python
2. Submit your application to the Spark driver
3. The Spark driver converts the user program into tasks (execution plan)
4. The Spark driver requests resources from the cluster manager
5. When executors are started, they register themselves with the driver
6. Executors run tasks and return results to the driver or to an external source
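The driver/executor split described above can be sketched as a loose analogy using only the Python standard library. This is not Spark code and not how Spark is implemented: a thread pool stands in for the executors, and the names `make_tasks`, `run_task` and `run_job` are hypothetical. It only shows the shape of the flow: the driver cuts the work into tasks, the workers run them, and the partial results come back to the driver.

```python
from concurrent.futures import ThreadPoolExecutor


def make_tasks(data, n_partitions):
    """Driver side: split the dataset into at most n_partitions chunks."""
    size = max(1, -(-len(data) // n_partitions))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def run_task(partition):
    """Executor side: apply the same operation to every element of a chunk."""
    return sum(x * x for x in partition)


def run_job(data, n_workers=4):
    """Driver side: hand tasks to the pool and combine the partial results."""
    tasks = make_tasks(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_results = list(pool.map(run_task, tasks))
    return sum(partial_results)


result = run_job(list(range(10)))  # sum of squares of 0..9
```

In a real cluster the executors are separate processes on other machines, and the cluster manager, not the driver, decides where they start.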

Spark features
• MLlib (machine learning library)
• Streaming
  ◦ Flume, Kafka, Elasticsearch
• Spark SQL
  ◦ JDBC, ODBC, Hive, MongoDB
• GraphX
  ◦ Graph manipulation

LSST use case
• Build a FITS dictionary
• CSV file listing all files in /sps/snls12 (GPFS)
  ◦ 33 million files
  ◦ 9 million FITS files
• FITS metadata extraction
  ◦ cards: number of keywords contained in the header
  ◦ Size of the header unit (bytes)
  ◦ Used space for the header unit (bytes)
  ◦ Size of the data unit (bytes)
  ◦ Used space for the data unit (bytes)
  ◦ …
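The metadata fields listed above could be derived from a FITS primary header roughly as follows. This is a minimal sketch, not the code used in the talk. It relies on two facts of the FITS format: header cards are 80 characters long, and header and data units are padded to multiples of 2880 bytes. Everything else is an assumption: that "used space" means the bytes before padding, that the END card counts as a card, and that the data unit is a plain image array. The names `fits_header_stats` and `make_card` are hypothetical.

```python
import math

CARD = 80     # a FITS header card is 80 ASCII characters
BLOCK = 2880  # header and data units are padded to 2880-byte blocks


def make_card(key, value=None):
    """Build one 80-character card (simplified formatting, for the demo only)."""
    text = key.ljust(8) if value is None else f"{key:<8}= {value:>20}"
    return text.ljust(CARD).encode("ascii")


def fits_header_stats(raw):
    """Compute the listed metadata fields for the primary header in `raw`."""
    values = {}
    n_cards = 0
    for offset in range(0, len(raw), CARD):
        card = raw[offset:offset + CARD].decode("ascii")
        n_cards += 1  # assumption: the END card is counted as a card
        key = card[:8].strip()
        if key == "END":
            break
        if card[8:10] == "= ":
            # Keep the value and drop an optional "/ comment" suffix.
            # (Good enough for the numeric keywords read below.)
            values[key] = card[10:].split("/")[0].strip()
    else:
        raise ValueError("no END card found")

    header_used = n_cards * CARD
    naxis = int(values.get("NAXIS", 0))
    n_pixels = 1 if naxis else 0
    for i in range(1, naxis + 1):
        n_pixels *= int(values[f"NAXIS{i}"])
    data_used = abs(int(values["BITPIX"])) // 8 * n_pixels
    return {
        "cards": n_cards,
        "header_size": math.ceil(header_used / BLOCK) * BLOCK,
        "header_used": header_used,
        "data_size": math.ceil(data_used / BLOCK) * BLOCK,
        "data_used": data_used,
    }


# Synthetic header for a 100 x 50 image of 16-bit integers.
header = b"".join([
    make_card("SIMPLE", "T"),
    make_card("BITPIX", 16),
    make_card("NAXIS", 2),
    make_card("NAXIS1", 100),
    make_card("NAXIS2", 50),
    make_card("END"),
]).ljust(BLOCK)

stats = fits_header_stats(header)
```

Only the header bytes need to be read to obtain all five values, which keeps the per-file cost low when scanning millions of files.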

[Diagram: the Spark driver and master distribute the CSV listings of /sps/snls12 across workers 1 to 10, each running an executor with tasks and a cache; the results are stored in MongoDB.]

Spark & MongoDB

DBMS              Platform                               Connections / transactions   Loading step   Data size
MariaDB / Galera  3 machines (16 cores & 32 GB of RAM)   ~9 million                   146 hours      400 GB
MongoDB           1 VM (4 cores & 4 GB of RAM)           ~9 million                   32 hours       46 GB
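To put the two loading runs side by side, the figures above can be turned into rough per-record rates. The sketch below is only arithmetic on the slide's numbers. It assumes that each of the roughly 9 million transactions corresponds to one FITS file, and that 1 GB means 10^9 bytes; since the counts are approximate, the results are orders of magnitude rather than measurements.

```python
RECORDS = 9_000_000  # assumption: one transaction per FITS file

# (loading time in hours, stored data size in GB), taken from the slide
runs = {
    "MariaDB / Galera": (146, 400),
    "MongoDB": (32, 46),
}


def records_per_second(hours):
    """Average insertion rate over the whole loading step."""
    return RECORDS / (hours * 3600)


def kb_per_record(gigabytes):
    """Average stored size per record, in kilobytes (1 GB = 1e9 bytes)."""
    return gigabytes * 1e9 / RECORDS / 1e3


summary = {
    name: (records_per_second(hours), kb_per_record(size))
    for name, (hours, size) in runs.items()
}

# How many times shorter the MongoDB loading step was.
time_ratio = runs["MariaDB / Galera"][0] / runs["MongoDB"][0]

for name, (rate, size) in summary.items():
    print(f"{name}: {rate:.1f} records/s, {size:.1f} kB per record")
```

Under these assumptions the MongoDB run is about 4.6 times shorter and stores roughly one ninth of the data per record, on a much smaller platform.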

Conclusion
• Spark is very easy to install and to use
• The application must be designed around the RDD concept
• The FITS use case is a good way to test Spark, but it is not representative of everything Spark can do

Perspectives
• A centralized FITS dictionary
  ◦ IO simulator
• Image co-addition: co-addition is CPU intensive and does not consume much memory, so Spark might be a good candidate for this task
• File monitoring
  ◦ Moving all files that are no longer accessed from GPFS to HPSS
  ◦ Using MongoDB with sharding

• Web sites:
  ◦ https://spark-summit.org: 4Quant
  ◦ http://thunder-project.org
  ◦ https://spark.apache.org
  ◦ https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
• Books:
  ◦ Learning Spark
  ◦ Advanced Analytics with Spark: Patterns for Learning from Data at Scale
