Published by Jack Lyons. Modified over 9 years ago.
1
Working with Big Data in the Geosciences - Finding the Needle in the Haystack
Sangmi Pallickara
Computer Science Department
Colorado State University
sangmi@cs.colostate.edu
2
Big Data in Geosciences
- Volume
- Velocity
- Variety
04/18/2013
3
- Storage must be over a collection of machines
- Avoid central coordinators
- Cope with failures
- Preserve data locality without introducing storage imbalances and the accompanying query hotspots
- Support range queries and fast ingest of new data
4
Galileo Design Considerations
- Symmetric storage nodes
  - No special-function or “controller” nodes
  - Storage and retrievals may go to any node, and will be forwarded to the targeted node(s)
- Incremental scale-up
- Failure-resiliency
- Accounts for geospatial component in data
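The symmetric-node idea above (any node accepts a request and forwards it to the node responsible for the data) can be sketched roughly as follows. This is an illustration only, not Galileo's implementation: the `Node` class, `make_cluster`, and the hash-modulo placement rule are all hypothetical. In particular, the slides do not say how Galileo picks the target node, and a plain hash ignores the geospatial component that the design is said to account for.

```python
import hashlib


class Node:
    """One storage node. Every node runs the same code; there is no controller."""

    def __init__(self, name):
        self.name = name
        self.peers = []    # all nodes in the cluster, including this one
        self.blocks = {}   # local storage: key -> data

    def owner_of(self, key):
        # Hypothetical placement rule (an assumption, not Galileo's):
        # hash the key and map it onto the list of peers.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.peers)
        return self.peers[index]

    def store(self, key, data):
        target = self.owner_of(key)
        if target is self:
            self.blocks[key] = data
        else:
            target.store(key, data)   # forward to the responsible node
        return target.name

    def retrieve(self, key):
        target = self.owner_of(key)
        if target is self:
            return self.blocks.get(key)
        return target.retrieve(key)   # forward to the responsible node


def make_cluster(names):
    """Build a cluster in which every node knows every other node."""
    nodes = [Node(name) for name in names]
    for node in nodes:
        node.peers = nodes
    return nodes


cluster = make_cluster(["node-a", "node-b", "node-c"])
owner = cluster[0].store("block-001", b"payload")   # sent to an arbitrary node
data = cluster[2].retrieve("block-001")             # read via a different node
print(owner, data)
```

Because every node computes the same owner for a given key, a client can contact whichever node is convenient. A modulo-based rule would move most keys whenever a node is added, so a system aiming for incremental scale-up would likely need a different placement scheme.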
5
Galileo key features
- Support for large numbers (10^9) of small files
- High throughput storage and retrieval
- Data is multidimensional with multiple types
- Time-series data
- Support for exact match and range queries (with wildcards) along multiple dimensions
- Support for multiple data formats: netCDF, BUFR, HDF 4/5, and data from the Defense Meteorological Satellite Program
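To make the query feature above concrete, here is a minimal sketch of exact-match, range, and wildcard constraints evaluated along several dimensions. It is not Galileo's query API: `Range`, `ANY`, `matches`, and `run_query` are hypothetical names, the sample records are made up for illustration, and the linear scan stands in for whatever indexing the real system uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    """Inclusive range constraint on one dimension."""
    low: float
    high: float


ANY = object()  # wildcard: places no restriction on the dimension


def matches(record, query):
    """Return True if the record satisfies every constraint in the query."""
    for dimension, constraint in query.items():
        if constraint is ANY:
            continue
        if dimension not in record:
            return False
        value = record[dimension]
        if isinstance(constraint, Range):
            if not (constraint.low <= value <= constraint.high):
                return False
        elif value != constraint:   # exact match
            return False
    return True


def run_query(records, query):
    # Linear scan for clarity only; a real store would use an index.
    return [record for record in records if matches(record, query)]


# Made-up sample records, for illustration only.
records = [
    {"lat": 40.57, "lon": -105.08, "temperature": 281.0},
    {"lat": 39.74, "lon": -104.99, "temperature": 290.5},
    {"lat": 41.25, "lon": -95.93, "temperature": 275.2},
]

# Spatial range on two dimensions, wildcard on the third.
spatial = {"lat": Range(39.0, 41.0), "lon": Range(-109.0, -102.0), "temperature": ANY}
print(len(run_query(records, spatial)))   # 2

# Exact match on a single dimension.
print(run_query(records, {"temperature": 275.2}))
```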
6
Planned/Ongoing deployments for Galileo
- International Centre for Radio Astronomy Research
  - Australian SKA Pathfinder telescope
  - ~1 PB of time-series data
- CSU Atmospheric Sciences & Precision Wind (Boulder)
  - Short-term wind forecast predictions
- CSU Civil & Environmental Engineering department
  - Sustainable management of watershed systems
- Climate.org
7
Related work
- Google File System
- BigTable
- Distributed Hash Table (DHT) based systems: Pastry, Chord, Dynamo, and CAN
- SciDB
- MongoDB
8
Dataset used in performance evaluations
- Sourced from NOAA NAM Project
- Dimensions/Features:
  - Geospatial: Latitude, Longitude
  - Time Series: Start Time, End Time
  - Temperature
  - Relative Humidity
  - Wind Speed
  - Snow Depth
- Composed of 1 billion files (8 TB)
9
Storage Throughput
- Block is about 8 KB of data
- 56,000 blocks per second in a system with 48 nodes
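The two figures above imply an aggregate data rate, which the short calculation below works out. It is a back-of-the-envelope sketch, not a measurement from the slides: it assumes decimal kilobytes (1 KB = 1,000 bytes), treats "about 8 KB" as exactly 8 KB, and assumes load is spread evenly over the 48 nodes when deriving per-node numbers.

```python
# Figures taken from the slide.
BLOCK_SIZE_BYTES = 8 * 1000     # "about 8 KB"; decimal kilobytes assumed
BLOCKS_PER_SECOND = 56_000
NODE_COUNT = 48

# Aggregate throughput across the whole cluster.
aggregate_bytes_per_second = BLOCK_SIZE_BYTES * BLOCKS_PER_SECOND
aggregate_mb_per_second = aggregate_bytes_per_second / 1e6

# Per-node figures, assuming an even spread of load (an assumption).
blocks_per_node_per_second = BLOCKS_PER_SECOND / NODE_COUNT
mb_per_node_per_second = aggregate_mb_per_second / NODE_COUNT

print(f"aggregate: {aggregate_mb_per_second:.0f} MB/s")              # 448 MB/s
print(f"per node:  {blocks_per_node_per_second:.0f} blocks/s, "
      f"{mb_per_node_per_second:.2f} MB/s")                           # 1167 blocks/s, 9.33 MB/s
```

Under these assumptions the cluster ingests roughly 448 MB/s in total, or a little over 9 MB/s per node.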
10
Query Performance

Query                      First Result (ms)  Last Result (ms)  Dataset Size  Dataset Creation (ms)  Download Time (ms)
No Match                               42.09             47.05             0                   0.01                 N/A
One Match                              42.96             50.39             1                   0.01               50.47
Standard Query                         44.1              55.57         1,411                   0.02              241.45
Temporal Range                         47.54            588.89         8,535                   0.29            9,142.36
Spatial Range (US)                     48.07            261.81        31,413                   0.05            1,845.67
Spatial Range (CO)                     43.08             57.73         1,643                   0.01              252.03
Spatial Range (NE CO)                  42.81             57.23           398                   0.01               62.13
Exhaustive Feature Search              53.97         64,069.30     8,230,612                   3.66          459,297.52
11
Thank you!
Galileo: http://galileo.cs.colostate.edu
Sangmi Pallickara
sangmi@cs.colostate.edu
http://www.cs.colostate.edu/~sangmi