
DM_PPT_NP_v01 SESIP_0715_GH2
Putting some Spark into HDF5
Gerd Heber & Joe Lee, The HDF Group, Champaign, Illinois, USA
This work was supported by NASA/GSFC under Raytheon Co. contract number NNG10HP02C

The Return of … (image)

Outline
- "The Big Schism"
- A Shiny New Engine
- Getting off the Ground
- Future Work

July 14 – 17, 2015

"The Big Schism"
An HDF5 file is a Smart Data Container.
"This is what happens, Larry, when you copy an HDF5 file into HDFS!" (Walter Sobchak)
Natural Habitat: Traditional File System | Block Store: Hadoop "File System" (HDFS)
Don't mess with HDF5!

Now What?
Ask questions:
- Who wants HDF5 files in Hadoop? (volatile)
- Who wants to program MapReduce? (nobody)
- How big are your HDF5 files? (long-tailed distribution)
No size (solution) fits all...
Do experiments:
- Reverse-engineer the format (students, weirdos)
- In-core processing (fiddly)
- Convert to Avro (some success)
Sit tight and wait for something better!

Spark Concepts
Formally, an RDD is a read-only, partitioned collection of records. RDDs can only be created through deterministic operations on either (1) a dataset in stable storage or (2) other existing RDDs.
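To make the definition concrete, here is a toy, pure-Python analogue of an RDD (illustration only, not Spark code): a read-only, partitioned collection that can be built either from driver-side data or from another such collection via a deterministic transformation. Real RDDs additionally live on a cluster and are evaluated lazily.

```python
# Toy analogue of an RDD: read-only, partitioned, and only ever created
# (1) from a dataset or (2) from another RDD by a deterministic operation.
class ToyRDD:
    def __init__(self, partitions):
        # immutable ("read-only") partitioned collection of records
        self._partitions = tuple(tuple(p) for p in partitions)

    @classmethod
    def from_data(cls, records, num_partitions):
        # creation path (1): partition a dataset across "workers"
        parts = [records[i::num_partitions] for i in range(num_partitions)]
        return cls(parts)

    def map(self, f):
        # creation path (2): a new collection via a deterministic operation;
        # the original is left untouched
        return ToyRDD([map(f, p) for p in self._partitions])

    def collect(self):
        # gather all records back on the "driver"
        return [x for p in self._partitions for x in p]
```

For example, `ToyRDD.from_data(range(6), 2).map(lambda x: x * x).collect()` returns the squares 0..25, while the source collection is unchanged.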

What's Great about Spark
- Refreshingly abstract
- Supports Python
- Typically runs in RAM
- Has batteries included

Experimental Setup
GSSTF_NCEP.3 collection, 7/1/1987 to 12/31/2008
- 7,850 HDF-EOS5 files, 16 MB per file, ~120 GB total
- 4 variables on a daily 1440x720 grid:
  - Sea level pressure (hPa)
  - 2 m air temperature (C)
  - Sea surface skin temperature (C)
  - Sea surface saturation humidity (g/kg)
Lenovo ThinkPad X230T
- Intel Core i5-3320M (2 cores, 4 threads), 8 GB of RAM, Samsung SSD 840 Pro
- Windows 8.1 (64-bit), Apache Spark

Getting off the Ground
Where do they dwell? (image)

General Strategy
1. Create our first RDD: a list of file names/paths/...
   a. Traverse the base directory and compile a list of HDF5 files
   b. Partition the list via SparkContext.parallelize()
2. Use the RDD's flatMap method to calculate something interesting, e.g., summary statistics

Calculating the Tair_2m mean and median for 3.5 years took about 10 seconds on my notebook.

Variations
- Instead of traversing directories, you can provide a CSV file of [HDF5 file names, path names, hyperslab selections, etc.] to partition
- A fast SSD array goes a long way
- If you have a distributed file system (e.g., GPFS, Lustre, Ceph), you should be able to feed large numbers of Spark workers (running on a cluster)
- If you don't have a parallel file system and use most of the data in a file, you can stage (copy) the files first on the cluster nodes
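The CSV variation might look like the sketch below. The manifest format (file name, dataset path, start:stop row range) is hypothetical, chosen only to illustrate turning hyperslab selections into an RDD of per-worker tasks; h5py is assumed to be installed on the workers.

```python
import csv
import io

def parse_manifest(text):
    """Parse manifest rows of the hypothetical form: file,dataset,start:stop."""
    tasks = []
    for file_name, dataset, rows in csv.reader(io.StringIO(text)):
        start, stop = (int(x) for x in rows.split(":"))
        tasks.append((file_name, dataset, slice(start, stop)))
    return tasks

def read_slab(task):
    """Per-task worker for flatMap: yield the selected hyperslab."""
    import h5py  # assumed available on the workers
    file_name, dataset, slab = task
    with h5py.File(file_name, "r") as f:
        yield f[dataset][slab]
```

On the driver, `sc.parallelize(parse_manifest(manifest_text)).flatMap(read_slab)` then distributes the selections instead of whole files, which is the point of the first bullet: the unit of partitioning no longer has to be a file.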

Conclusion
- Forget MapReduce, stop worrying about HDFS
- With Spark, exploiting data parallelism has never been more accessible (easier and cheaper)
- Current HDF5-to-Spark on-ramps can be effective under the right circumstances, but are kludgy
- Work with us to build the right things right!

References
[BigHDF]
[Blog] eos/
[Report] Zaharia et al., Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, UC Berkeley
[Spark]
[YouTube] Mark Madsen: Big Data, Bad Analogies, 2014
