1
National Aeronautics and Space Administration
Jet Propulsion Laboratory, California Institute of Technology, Pasadena, California

Part of the AIST Framework for Comparing Data Containers Study

Thomas Huang
Ed Armstrong, Namrata Malarout, Chris Mattmann
Jet Propulsion Laboratory, California Institute of Technology
4800 Oak Grove Drive, Pasadena, CA 91109-8099, United States of America

2016 ESIP Winter Meeting
2
What is AsterixDB?
– Big Data Management System (BDMS)
– Semi-structured, NoSQL-style data model: the Asterix Data Model (ADM), an extension of JSON with object-database support ("JSON++")
– Expressive, declarative query language: the Asterix Query Language (AQL)
– Parallel runtime query execution engine, Hyracks; currently supports 1000+ cores and 500+ disks
– Partitioned, LSM-based data storage and indexing
– Queries data stored in HDFS as well as data stored natively in AsterixDB
– Rich type support (spatial, temporal, …); records, lists, bags; open vs. closed types
– Secondary indexing options: B+ trees, R trees, and inverted keyword indexes
– Transactional

AsterixDB sits at the intersection of semi-structured data management, parallel database systems, and the world of Hadoop and friends.
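The "open vs. closed types" distinction above means a declared ADM datatype can either admit extra, undeclared fields (open) or reject them (closed). The slides do not show this in code; as a rough illustrative sketch in Python (not actual AsterixDB code, and with hypothetical field names), the idea is:

```python
# Illustrative sketch only: an "open" type accepts fields beyond those
# declared; a "closed" type rejects them. Field names are hypothetical.
DECLARED = {"id", "time", "lat", "lon"}

def check_record(record, is_open=True):
    """Return True if the record satisfies the (hypothetical) datatype."""
    if not DECLARED <= record.keys():
        return False                      # a declared field is missing
    extra = record.keys() - DECLARED
    return is_open or not extra           # closed types forbid extra fields

rec = {"id": 1, "time": "1991-09-01", "lat": 10.0, "lon": 20.0, "sst": 291.4}
check_record(rec, is_open=True)   # True: an open type keeps the extra "sst"
check_record(rec, is_open=False)  # False: a closed type rejects it
```

Open types make it easy to ingest records whose shape varies; closed types let the system store and index records more compactly.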
3
AsterixDB System Overview (architecture diagram)
4
Hyracks: The Parallel Runtime Execution Engine
– Partitioned-parallel platform for data-intensive computing
– Job = dataflow DAG of operators and connectors
  – Operators consume and produce partitions of data
  – Connectors route (repartition) data between operators
– Hyracks vs. the "competition"
  – Based on time-tested parallel database principles
  – vs. Hadoop: more flexible model and less "pessimistic"
  – vs. Dryad: supports data as a first-class citizen
  – Faster job activation, data pipelining, binary format, state-of-the-art DB-style operators (hash-based, indexed, …)
– Tested at Yahoo! Labs on 180 nodes (1,440 cores, 720 disks)

(Asterix software stack diagram)
5
Preprocessing: Earth Science Application
– Dynamic subsetting: extract the specified spatial-temporal extent of a given variable
– Statistics aggregation using selected oceanographic data

Dataset: GHRSST L4 (1991 – present) CMC 0.2deg Global Foundation Sea Surface Temperature Analysis
– Temporal resolution: daily
– Spatial resolution: 0.2 degrees (latitude) x 0.2 degrees (longitude)
– Spatial and temporal resolution used in this study: unchanged, same as the raw data
– Subsetting details: the raw data is an 1800 x 901 grid; each subset record contains a 50 x 50 grid chunk
– Final size of the subset used: 2.43 GB (for 4 months of data)

A single NetCDF file of the sample data is about 2.3 MB. After running the script, the ingestion file produced is 19.7 MB; similarly, running the script on 13 files produced a 256.2 MB ingestion file. The record size thus grows by a factor of almost 9, which results from converting compressed NetCDF4 to uncompressed ASCII JSON. Since a single file produces 19.7 MB, the ingestion file for even one year of data will be very large.
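The slides give the grid dimensions (1800 x 901) and chunk size (50 x 50) but not the chunking scheme itself. A minimal sketch of one plausible scheme, assuming edge chunks are simply truncated (since 901 is not a multiple of 50):

```python
def chunk_bounds(nlat=901, nlon=1800, size=50):
    """Yield (lat0, lat1, lon0, lon1) index bounds for size x size chunks.
    Edge chunks may be smaller; the scheme is an assumption, not taken
    from the slides."""
    for i in range(0, nlat, size):
        for j in range(0, nlon, size):
            yield (i, min(i + size, nlat), j, min(j + size, nlon))

chunks = list(chunk_bounds())
len(chunks)  # 19 * 36 = 684 chunks per daily granule
```

At 684 chunks per daily file, four months of data already yields tens of thousands of chunk records, which is why the ingestion files grow so quickly.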
6
Workflow
1. Use wget to download the dataset from ftp://podaac-ftp.jpl.nasa.gov/ [path to dataset].
2. The form_json.py Python script uses ncdump-json to dump the metadata and data associated with every variable in JSON format.
3. The form_adm.py script then adjusts the JSON output to conform to the concepts and syntax of the Asterix Data Model (ADM). The output of this script is a '.adm' file.
4. The chunk.py script reads every record from the produced .adm file and divides it into 50 x 50 chunks with the associated spatial information.
5. Using the AsterixDB Web API, we create a schema for the dataset and load the created ingestion file.
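The core of steps 2 – 4 is serializing each chunk as a JSON-like ADM record. The scripts themselves are not reproduced in the slides; a minimal hypothetical stand-in for the serialization step might look like this (field names are assumptions, and real ADM additionally supports types such as datetime and point that plain JSON lacks):

```python
import json

def to_adm(record):
    """Serialize one chunk record as a line of ADM-flavoured JSON.
    Hypothetical stand-in for the form_adm.py / chunk.py output step."""
    return json.dumps(record, separators=(",", ":"))

record = {
    "chunk_id": 0,
    "time": "1991-08-31",                      # day 243 of 1991
    "analysed_sst": [[291.2, 291.3], [291.1, 291.4]],  # 2x2 toy chunk
}
line = to_adm(record)
```

Emitting one record per line like this is what lets AsterixDB's bulk load (step 5) ingest the file chunk by chunk.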
7
Observations: Handling Large Datasets
– AsterixDB is not able to handle this volume of data, either in the current release (0.8.6) or in a snapshot of the upcoming version (0.8.7).
– Ingesting only the metadata and the data of some variables works fine. The variables that ingest successfully are: latitude, longitude, time.
– For the other important variables, namely analysed_sst, mask, analysis_error, and sea_ice_fraction, the size of the array cannot be handled.
– The main limitations the AsterixDB team has encountered in this area come from the object model (e.g., a 65k limit on string size) and from the storage layer (objects cannot be bigger than half a page).
8
Metrics

Preprocessing (for 4 months of data; year 1991, data available from day 243 to day 365):
– Time taken for chunking: ~4855.76 s
– Time taken for ncdump: ~353.70 s
– Time taken for '.adm' file construction: ~2331.64 s
– The record size grows by a factor of almost 9.

Data ingestion:
– Wall-clock time to convert NetCDF to a .adm file: ~7541.10 s
– Disk space required for 4 months: raw data = 235 MB vs. AsterixDB-friendly format = 2.43 GB
– Disk space for the entire raw dataset (1991 – 2015) = 17 GB
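The "almost 9 times" expansion factor follows directly from the file sizes reported earlier (2.3 MB per NetCDF granule vs. 19.7 MB per ingestion file):

```python
netcdf_mb = 2.3      # one compressed NetCDF4 granule (from the slides)
adm_mb = 19.7        # its uncompressed ASCII JSON / .adm ingestion file
ratio = adm_mb / netcdf_mb
round(ratio, 1)      # about 8.6, i.e. "almost 9 times"

# The 13-file measurement is consistent with the single-file one:
ratio_13 = 256.2 / (13 * netcdf_mb)
```

Extrapolating, the full 17 GB raw archive would expand to well over 100 GB of ingestion files at this ratio, which motivates the chunked-loading strategy on the next slide.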
9
Observation: Data Loading
– AsterixDB is not able to ingest an entire file at once. The problem was solved by chunking the data and loading multiple ADM files to populate the dataset; the longer-term solution is to set up a data feed adapter.
– The aggregation queries currently throw errors because they do not yet work with ordered lists and collections of objects. We are working with the AsterixDB dev team on workarounds until the bugs are resolved.
10
Current Activities
– Integration into the NEXUS architecture to compare the performance of an AsterixDB backend against the current Cassandra backend
– Constructing AQL queries to:
  – find the average of the data in individual chunks
  – subset the data based on an input search region

NEXUS: Deep Data Platform (diagram). Credit: T. Huang et al., 2015
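The AQL queries themselves are not shown in the slides. As a language-neutral sketch in Python of what the two queries compute (field names and the fill-value convention are assumptions):

```python
def chunk_mean(values, fill=None):
    """Mean of one chunk's SST values, skipping fill/masked cells."""
    vals = [v for row in values for v in row if v != fill]
    return sum(vals) / len(vals) if vals else None

def in_region(chunk, lat_min, lat_max, lon_min, lon_max):
    """True if a chunk's bounding box intersects the search region,
    i.e. the chunk belongs in the requested subset."""
    return not (chunk["lat_max"] < lat_min or chunk["lat_min"] > lat_max or
                chunk["lon_max"] < lon_min or chunk["lon_min"] > lon_max)

chunk_mean([[1, 2], [3, None]])  # 2.0 (the fill cell is skipped)
box = {"lat_min": 0, "lat_max": 5, "lon_min": 0, "lon_max": 5}
in_region(box, 4, 10, 4, 10)     # True: boxes overlap at the corner
```

In the AsterixDB version, the averaging would be an AQL aggregate over the chunk dataset and the region test a spatial predicate over each chunk's stored bounds, with a secondary R-tree index making the region filter efficient.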
11
THANKS

Questions and more information: Thomas.Huang@jpl.nasa.gov