National Aeronautics and Space Administration Jet Propulsion Laboratory California Institute of Technology Pasadena, California Part of the AIST Framework.

Presentation transcript:

National Aeronautics and Space Administration
Jet Propulsion Laboratory, California Institute of Technology, Pasadena, California
Part of the AIST Framework for Comparing Data Containers Study
Thomas Huang
Ed Armstrong, Namrata Malarout, Chris Mattmann
Jet Propulsion Laboratory, California Institute of Technology
4800 Oak Grove Drive, Pasadena, CA, United States of America
2016 ESIP Winter Meeting

What is AsterixDB?
- Big Data Management System (BDMS)
- Semi-structured, NoSQL-style data model: the Asterix Data Model (ADM)
  - Extended JSON with object database support (JSON++)
- Expressive and declarative query language: the Asterix Query Language (AQL)
- Parallel runtime query execution engine: Hyracks
  - Currently supports up to cores and 500+ disks
- Partitioned LSM-based data storage and indexing
- Queries data stored in HDFS as well as data stored natively in AsterixDB
- Rich type support (spatial, temporal, …)
  - Records, lists, bags
  - Open vs. closed types
- Secondary indexing options: B+ trees, R-trees, and inverted keyword indexes
- Transactional
Figure labels: Semi-structured Data Management; Parallel Database Systems; World of Hadoop & Friends

AsterixDB System Overview
Figure: AsterixDB system overview

Hyracks: The Parallel Runtime Execution Engine
- Partitioned-parallel platform for data-intensive computing
- Job = dataflow DAG of operators and connectors
  - Operators consume and produce partitions of data
  - Connectors route (repartition) data between operators
- Hyracks vs. the "competition"
  - Based on time-tested parallel database principles
  - vs. Hadoop: more flexible model and less "pessimistic"
  - (vs. Dryad: supports data as a first-class citizen)
  - Faster job activation, data pipelining, binary format, state-of-the-art DB-style operators (hash-based, indexed, ...)
- Tested at Yahoo! Labs on 180 nodes (1440 cores, 720 disks)
Figure: Asterix Software Stack
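The operator/connector split described on this slide can be illustrated with a small Python toy. This is only a sketch of the idea, not the Hyracks API: the function names `hash_repartition` and `count_by_key`, the record layout, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict

def hash_repartition(partitions, key, n_out):
    """Connector (toy): route every record to an output partition by key hash."""
    out = [[] for _ in range(n_out)]
    for part in partitions:
        for rec in part:
            out[hash(key(rec)) % n_out].append(rec)
    return out

def count_by_key(partition, key):
    """Operator (toy): consume one partition and produce per-key counts."""
    counts = defaultdict(int)
    for rec in partition:
        counts[key(rec)] += 1
    return dict(counts)

# Two input partitions of hypothetical (variable_name, value) records.
inputs = [[("sst", 1), ("mask", 2)], [("sst", 3), ("mask", 4), ("sst", 5)]]
key = lambda rec: rec[0]

# After repartitioning, all records with the same key sit in one partition,
# so each partition can be counted independently and the results merged.
repartitioned = hash_repartition(inputs, key, n_out=2)
results = [count_by_key(p, key) for p in repartitioned]
merged = {k: v for r in results for k, v in r.items()}
print(merged)
```

In a real engine each partition would run on a different node or core; here the list comprehension simply stands in for that parallel step.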

Preprocessing Earth Science Application
- Dynamic subsetting: extract the specified spatial-temporal extent of a given variable
- Statistics aggregation using selected oceanographic data
Dataset
- GHRSST L4 (1991 – present) CMC 0.2deg Global Foundation Sea Surface Temperature Analysis
- Temporal resolution: daily
- Spatial resolution: 0.2 degrees (latitude) x 0.2 degrees (longitude)
- Spatial and temporal resolution used in this study: unchanged from the raw data
- Subsetting details: the raw data is an 1800 x 901 grid; each record contains a 50 x 50 grid subset
- Final size of the subset being used: 2.43 GB (for 4 months of data)
- A single NetCDF file of the sample data is about 2.3 MB. After running the script, the ingestion file produced is 19.7 MB. Similarly, when the script was run on 13 files, the ingestion file created is MB.
- The record size increases almost 9 times. The increase is due to going from compressed netCDF4 to uncompressed ASCII JSON. This is consistent with the
- If a single file produces 19.7 MB, we can expect a huge difference in size between one year of raw data and its ingestion file, which will be very large.
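The subsetting and size figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes that partial tiles at the grid edge are kept as their own records (901 is not a multiple of 50); the slide does not say how the edge is handled, so the chunk count is an estimate under that assumption.

```python
import math

GRID_LON, GRID_LAT = 1800, 901   # raw grid size from the slide
CHUNK = 50                       # chunk edge length from the slide

# Assumption: edge tiles smaller than 50 x 50 are kept, hence the ceiling.
chunks_lon = math.ceil(GRID_LON / CHUNK)
chunks_lat = math.ceil(GRID_LAT / CHUNK)
total_chunks = chunks_lon * chunks_lat

netcdf_mb = 2.3    # one compressed NetCDF file
ingest_mb = 19.7   # the ingestion file produced from it
ratio = ingest_mb / netcdf_mb

print(f"{chunks_lon} x {chunks_lat} = {total_chunks} chunks per daily grid")
print(f"size increase: {ratio:.2f}x")
```

Under these assumptions one daily grid yields 684 chunk records, and the per-file size increase works out to roughly 8.6 times, in line with the "almost 9 times" quoted on the slide.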

Workflow
1. Use wget to download the data set from PO.DAAC: ftp://podaac-ftp.jpl.nasa.gov/ [path to dataset]
2. The form_json.py Python script, developed for this study, uses ncdump-json to dump the metadata and data associated with every variable in JSON format.
3. The form_adm.py script then changes the JSON output to conform to the concepts and syntax of the Asterix Data Model (ADM). The output of this script is a '.adm' file.
4. The chunk.py script reads every record from the produced .adm file and divides it into 50x50 chunks with the associated spatial information.
5. Using the AsterixDB Web API, we create a schema for the dataset and load the created ingestion file.
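The chunking in step 4 could look roughly like the Python sketch below. It is not the study's chunk.py: the function name `chunk_grid`, the record fields (`lat_min`, `lat_max`, `lon_min`, `lon_max`, `values`), and the choice to keep smaller edge tiles are assumptions made for illustration.

```python
def chunk_grid(grid, lats, lons, size=50):
    """Yield one dict per tile: its values plus the bounding coordinates.

    grid is a list of rows indexed [lat][lon]; edge tiles may be smaller
    than size x size (an assumption, the slides do not specify this).
    """
    for i in range(0, len(lats), size):
        for j in range(0, len(lons), size):
            yield {
                "lat_min": lats[i],
                "lat_max": lats[min(i + size, len(lats)) - 1],
                "lon_min": lons[j],
                "lon_max": lons[min(j + size, len(lons)) - 1],
                "values": [row[j:j + size] for row in grid[i:i + size]],
            }

# Tiny stand-in grid (5 x 7) chunked into 3 x 3 tiles for demonstration.
lats = [0, 1, 2, 3, 4]
lons = [0, 1, 2, 3, 4, 5, 6]
grid = [[10 * i + j for j in range(len(lons))] for i in range(len(lats))]
chunks = list(chunk_grid(grid, lats, lons, size=3))
print(len(chunks), chunks[0]["values"], chunks[-1]["values"])
```

Carrying the bounding coordinates in every record is what lets a later query select chunks by region without reading the values themselves.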

Observations: Handling Large Datasets
- AsterixDB is not able to handle this amount of data in the currently available version (0.8.6) or in the snapshot of the upcoming version (0.8.7).
- Ingesting only the metadata and the data of some variables works fine. The variables that are ingested are latitude, longitude, and time.
- For the other important variables, namely analysed_sst, mask, analysis_error, and sea_ice_fraction, the size of the array isn't handled.
- The main limitations the AsterixDB team has come across lately in this area come from the object model (e.g. 65k limit on string size) or from the storage layer (objects cannot be bigger than half a page).

Metrics
Preprocessing
- For 4 months of data (year 1991: data available from day 243 to 365)
- Time taken for chunking: ~ s
- Time taken for ncdump: s
- Time taken for '.adm' file construction: s
- The increase in size of the record is almost 9 times.
Data Ingestion
- Wall clock time to convert NetCDF to .adm file: ~ s
- Disk space required for 4 months: raw data = 235 MB vs. AsterixDB-friendly format = 2.43 GB
- Disk space for entire dataset ( ) = 17 GB
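As a rough consistency check on the disk-space figures, the short calculation below derives per-day sizes for the four-month sample. It assumes one file per day for every day from 243 to 365 (the dataset is daily, but gaps are not mentioned on the slide) and decimal units (1 GB = 1000 MB).

```python
first_day, last_day = 243, 365
n_days = last_day - first_day + 1   # assumes no missing days in the range

raw_mb = 235.0                      # raw NetCDF data for the 4 months
adm_mb = 2.43 * 1000                # ADM data, assuming 1 GB = 1000 MB

expansion = adm_mb / raw_mb
raw_per_day_mb = raw_mb / n_days
adm_per_day_mb = adm_mb / n_days

print(f"{n_days} daily files")
print(f"raw: {raw_per_day_mb:.2f} MB/day, ADM: {adm_per_day_mb:.2f} MB/day")
print(f"overall expansion: {expansion:.1f}x")
```

Under these assumptions the ADM data comes to a little under 20 MB per day, which matches the 19.7 MB ingestion file reported for a single NetCDF file.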

Observation: Data Loading
- AsterixDB is not able to handle the size of an entire file during ingestion. We solved this problem by chunking the data.
- Loading multiple ADM files to populate the dataset: the solution is to set up a data feed adapter.
- The aggregation queries currently throw errors because they do not work with ordered lists and collections of objects. We are working with the AsterixDB dev team to figure out workarounds until the bugs are resolved.

Current Activities
- Integration into the NEXUS architecture to compare performance between the AsterixDB backend and the current Cassandra backend
- Constructing AQL queries to
  - find the average of the data in individual chunks
  - subset data based on an input search region
Figure: NEXUS: Deep Data Platform. Credit: T. Huang et al., 2015
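The two planned queries are stated in terms of AQL; as a language-neutral illustration of their logic, here is a Python stand-in. The record layout, the function names `chunk_mean` and `intersects`, and the fill value of -999.0 are hypothetical and are not taken from the study's schema.

```python
def chunk_mean(chunk, fill_value=None):
    """Average of all values in one chunk, skipping the fill value."""
    vals = [v for row in chunk["values"] for v in row if v != fill_value]
    return sum(vals) / len(vals) if vals else None

def intersects(chunk, lat_min, lat_max, lon_min, lon_max):
    """True if the chunk's bounding box overlaps the search region."""
    return (chunk["lat_min"] <= lat_max and chunk["lat_max"] >= lat_min
            and chunk["lon_min"] <= lon_max and chunk["lon_max"] >= lon_min)

# Two hypothetical chunk records; -999.0 marks a missing cell.
chunks = [
    {"lat_min": 0, "lat_max": 9, "lon_min": 0, "lon_max": 9,
     "values": [[1.0, 2.0], [3.0, -999.0]]},
    {"lat_min": 10, "lat_max": 19, "lon_min": 0, "lon_max": 9,
     "values": [[4.0, 6.0]]},
]

means = [chunk_mean(c, fill_value=-999.0) for c in chunks]
selected = [c for c in chunks if intersects(c, 5, 8, 5, 8)]
print(means, len(selected))
```

Filtering on the bounding box first and averaging only the selected chunks keeps the aggregation from touching data outside the search region.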

THANKS
Questions, and more information