
Working with Big Data in the Geosciences - Finding the Needle in the Haystack (Sangmi Pallickara)


1. Working with Big Data in the Geosciences - Finding the Needle in the Haystack
Sangmi Pallickara
Computer Science Department, Colorado State University
sangmi@cs.colostate.edu

2. Big Data in Geosciences
- Volume
- Velocity
- Variety
04/18/2013

3. Storage must be over a collection of machines
- Avoid central coordinators
- Cope with failures
- Preserve data locality without introducing storage imbalances
  - And the accompanying query hotspots
- Support range queries and fast ingest of new data

4. Galileo Design Considerations
- Symmetric storage nodes
  - No special-function or "controller" nodes
  - Storage and retrievals may go to any node, and will be forwarded to the targeted node(s)
- Incremental scale-up
- Failure-resiliency
- Accounts for geospatial component in data
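The forwarding behavior of symmetric nodes can be illustrated with a minimal sketch. The slides do not describe how Galileo chooses the target node, so everything here is an assumption made for illustration: the `StorageNode` class, the `build_cluster` helper, and the hash-modulo placement rule are hypothetical. Modulo placement in particular is a simplification that would not suit incremental scale-up, and it ignores the geospatial component of the data.

```python
import hashlib
from typing import Dict, List, Optional


class StorageNode:
    """Hypothetical symmetric node: every node runs identical code."""

    def __init__(self, name: str, all_nodes: List[str]) -> None:
        self.name = name
        # Every node holds the same sorted membership list, so all nodes
        # agree on which node owns a given key.
        self.all_nodes = sorted(all_nodes)
        self.blocks: Dict[str, bytes] = {}

    def owner_of(self, key: str) -> str:
        # Assumed placement rule: hash the key, then pick a node by modulo.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.all_nodes)
        return self.all_nodes[index]

    def store(self, key: str, data: bytes,
              cluster: Dict[str, "StorageNode"]) -> str:
        owner = self.owner_of(key)
        if owner != self.name:
            # This node is only the entry point: forward to the target node.
            return cluster[owner].store(key, data, cluster)
        self.blocks[key] = data
        return self.name

    def retrieve(self, key: str,
                 cluster: Dict[str, "StorageNode"]) -> Optional[bytes]:
        owner = self.owner_of(key)
        if owner != self.name:
            return cluster[owner].retrieve(key, cluster)
        return self.blocks.get(key)


def build_cluster(names: List[str]) -> Dict[str, StorageNode]:
    """Create an in-memory cluster in which no node has a special role."""
    return {name: StorageNode(name, names) for name in names}
```

In this sketch a client may call `store` or `retrieve` on any node in the dictionary returned by `build_cluster`; the request reaches the owning node after at most one forward.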

5. Galileo key features
- Support for large numbers (10^9) of small files
- High throughput storage and retrieval
- Data is multidimensional with multiple types
- Time-series data
- Support for exact match and range queries (with wildcards) along multiple dimensions
- Support for multiple data formats
  - netCDF, BUFR, HDF 4/5, and data from the Defense Meteorological Satellite Program
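To make the query types concrete, here is a small sketch of exact-match, range, and wildcard constraints evaluated across several dimensions. It is not Galileo's query API, which the slides do not show: the `matches` and `find` functions, the `WILDCARD` marker, the feature names, and the sample values are all hypothetical. A real system would consult an index rather than scan every record.

```python
from typing import Any, Dict, Iterable, List

WILDCARD = "*"  # assumed marker meaning "any value is acceptable"


def matches(record: Dict[str, Any], query: Dict[str, Any]) -> bool:
    """Return True if the record satisfies every constraint in the query.

    A constraint is one of:
      - WILDCARD: the dimension is unconstrained
      - a (low, high) tuple: inclusive range match
      - any other value: exact match
    """
    for feature, constraint in query.items():
        if isinstance(constraint, str) and constraint == WILDCARD:
            continue
        if feature not in record:
            return False
        value = record[feature]
        if isinstance(constraint, tuple):
            low, high = constraint
            if not (low <= value <= high):
                return False
        elif value != constraint:
            return False
    return True


def find(records: Iterable[Dict[str, Any]],
         query: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Linear scan over records, kept simple for illustration only."""
    return [record for record in records if matches(record, query)]


# Made-up sample records; the values are not taken from the NOAA dataset.
sample_records = [
    {"latitude": 40.57, "longitude": -105.08, "temperature": 280.0, "snow_depth": 0.0},
    {"latitude": 35.00, "longitude": -100.00, "temperature": 295.0, "snow_depth": 0.0},
]

# Range on two spatial dimensions, exact match on one feature, wildcard on another.
sample_query = {
    "latitude": (37.0, 41.0),
    "longitude": (-109.05, -102.05),
    "snow_depth": 0.0,
    "temperature": WILDCARD,
}
```

Calling `find(sample_records, sample_query)` keeps only the first record, because the second falls outside both spatial ranges.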

6. Planned/Ongoing deployments for Galileo
- International Centre for Radio Astronomy Research
  - Australian SKA Pathfinder telescope
  - ~1 PB of time-series data
- CSU Atmospheric Sciences & Precision Wind (Boulder)
  - Short-term wind forecast predictions
- CSU Civil & Environmental Engineering department
  - Sustainable management of watershed systems
- Climate.org

7. Related work
- Google File System
- BigTable
- Distributed Hash Table (DHT) based systems
  - Pastry, Chord, Dynamo, and CAN
- SciDB
- MongoDB

8. Dataset used in performance evaluations
- Sourced from NOAA NAM Project
- Dimensions/Features:
  - Geospatial: Latitude, Longitude
  - Time Series: Start Time, End Time
  - Temperature
  - Relative Humidity
  - Wind Speed
  - Snow Depth
- Composed of 1 billion files (8 TB)

9. Storage Throughput
- Block is about 8 KB of data
- 56,000 blocks per second in a system with 48 nodes
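The two figures on this slide imply an aggregate data rate and a per-node share, worked out below. The calculation assumes 1 KB = 1024 bytes, treats "about 8 KB" as exactly 8 KB, and assumes the load is spread evenly over the 48 nodes; none of these is stated on the slide, so the results are rough estimates.

```python
# Figures taken from the slide.
BLOCK_SIZE_BYTES = 8 * 1024   # "about 8 KB"; assumes 1 KB = 1024 bytes
BLOCKS_PER_SECOND = 56_000    # system-wide storage rate
NODE_COUNT = 48

# Aggregate throughput across the whole system.
aggregate_bytes_per_second = BLOCK_SIZE_BYTES * BLOCKS_PER_SECOND
aggregate_mib_per_second = aggregate_bytes_per_second / 1024 ** 2

# Per-node share, assuming an even spread of blocks over the nodes.
blocks_per_node_per_second = BLOCKS_PER_SECOND / NODE_COUNT
mib_per_node_per_second = aggregate_mib_per_second / NODE_COUNT

print(f"Aggregate: {aggregate_mib_per_second:.1f} MiB/s")
print(f"Per node:  {blocks_per_node_per_second:.0f} blocks/s, "
      f"{mib_per_node_per_second:.2f} MiB/s")
```

Under these assumptions the system stores roughly 437.5 MiB/s in total, or about 1,167 blocks/s (a little over 9 MiB/s) per node.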

10. Query Performance

Query                      First Result (ms)  Last Result (ms)  Dataset Size  Dataset Creation (ms)  Download Time (ms)
No Match                   42.09              47.05             0             0.01                   N/A
One Match                  42.96              50.39             1             0.01                   50.47
Standard Query             44.1               55.57             1,411         0.02                   241.45
Temporal Range             47.54              588.89            8,535         0.29                   9,142.36
Spatial Range (US)         48.07              261.81            31,413        0.05                   1,845.67
Spatial Range (CO)         43.08              57.73             1,643         0.01                   252.03
Spatial Range (NE CO)      42.81              57.23             398           0.01                   62.13
Exhaustive Feature Search  53.97              64,069.30         8,230,612     3.66                   459,297.52

11. Thank you!
- Galileo
  - http://galileo.cs.colostate.edu
- Sangmi Pallickara
  - sangmi@cs.colostate.edu
  - http://www.cs.colostate.edu/~sangmi

