1
Big Data Vs. (Traditional) HPC
Gagan Agrawal, Ohio State
ICPP Big Data Panel (09/12/2012)
2
Big Data Vs. (Traditional) HPC
They will clearly co-exist
– Fine-grained simulations will prompt more `big-data' problems
– The ability to analyze data will prompt finer-grained simulations
– Even instrument data can prompt more simulations
Third and Fourth Pillars of Scientific Research
Critical need
– The HPC community must get very engaged in `big-data'
3
Other Thoughts
Onus on the HPC community
– The Database, Cloud, and Viz communities have been active for a while now
  Abstractions like MapReduce are neat! So are parallel and streaming visualization solutions
– Many existing solutions are very low on performance
  Do people realize how slow Hadoop really is? And yet it is one of the most successful open-source software projects
– We are needed! The programming model design and implementation community hasn't even looked at `big-data' applications
– We must engage application scientists
  Who are often stuck in `I don't want to deal with the mess' mode
4
Impact on Leadership Class Systems
Unlike HPC, the commercial sector has a lot of experience with `big-data'
– Facebook, Google
– They seem to do fine with large, fault-tolerant commodity clusters
`Big-data' might create a push back against memory/I/O-bound architecture trends
– Might make the journey to exascale harder, though
`Big-data' problems should certainly be considered while addressing fault-tolerance and power challenges
5
Open Questions
How do we develop parallel data analysis solutions?
– Hadoop?
– MPI + file I/O calls? (see the sketch below)
– SciDB (array analytics)?
– Parallel R?
Desiderata
– No reloading of data (rules out SciDB and Hadoop)
– Performance while implementing new algorithms (rules out Parallel R)
– Transparency with respect to data layouts and parallel architectures
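To make the `MPI + file I/O calls' option concrete, here is a minimal sketch (not from the talk): each rank reads a disjoint slice of a flat binary file of doubles through MPI-IO, forms a partial sum, and MPI_Allreduce combines the partial sums into a global mean. The file name `analysis_input.bin' and its layout (a raw array of doubles) are hypothetical stand-ins.

// Hedged sketch, not from the talk: block-partitioned MPI-IO read plus a
// global reduction.  File name and layout are assumed for illustration.
#include <mpi.h>
#include <iostream>
#include <vector>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "analysis_input.bin",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    MPI_Offset bytes = 0;
    MPI_File_get_size(fh, &bytes);

    // Block-partition the array of doubles across ranks.
    MPI_Offset total = bytes / (MPI_Offset)sizeof(double);
    MPI_Offset chunk = (total + nprocs - 1) / nprocs;
    MPI_Offset begin = (MPI_Offset)rank * chunk;
    MPI_Offset count = 0;
    if (begin < total)
        count = (begin + chunk <= total) ? chunk : total - begin;

    std::vector<double> buf(count);
    MPI_File_read_at(fh, begin * (MPI_Offset)sizeof(double), buf.data(),
                     (int)count, MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    // Local partial sum, then a global reduction across ranks.
    double local_sum = 0.0, global_sum = 0.0;
    for (double x : buf) local_sum += x;
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);

    if (rank == 0 && total > 0)
        std::cout << "global mean = " << global_sum / (double)total << "\n";
    MPI_Finalize();
    return 0;
}

The appeal of this option is that the data stays in its native file and never has to be reloaded into a separate system; the cost is that every new analysis has to re-implement partitioning and I/O by hand.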
6
Our Ongoing Work: MATE++
A very efficient MapReduce-like system for scientific data analytics
– MapReduce and an additional reduction-based API (an illustrative sketch follows below)
– Can plug and play with different data formats
– No reloading of data
– Flexibly uses different forms of parallelism
  GPUs, Fusion architectures, …
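The slide does not show the MATE++ API itself; the sketch below is only an illustration, under my own assumptions, of what a reduction-based alternative to map()/reduce() can look like: the application defines a reduction object with an accumulate step and a combine step, and the runtime keeps one private copy per worker and merges the copies at the end, avoiding intermediate key/value pairs. The MeanReduction type and the serial worker loop are hypothetical.

// Illustrative sketch only: NOT the actual MATE++ API.  A reduction object
// replaces map()/reduce(); each worker accumulates into a private copy and
// the copies are combined at the end.
#include <iostream>
#include <numeric>
#include <vector>

// Hypothetical reduction object: running sum and count, enough for a mean.
struct MeanReduction {
    double sum = 0.0;
    long   count = 0;
    // Called once per data element (the local reduction step).
    void accumulate(double x) { sum += x; count += 1; }
    // Called to merge per-worker copies (the global reduction step).
    void combine(const MeanReduction &other) {
        sum += other.sum;
        count += other.count;
    }
};

int main() {
    std::vector<double> data(1000);
    std::iota(data.begin(), data.end(), 0.0);   // stand-in for file-resident data

    // Toy runtime loop standing in for the system: each worker processes a
    // slice of the data into its own reduction object.
    const int workers = 4;
    std::vector<MeanReduction> partial(workers);
    std::size_t chunk = data.size() / workers;
    for (int w = 0; w < workers; ++w) {
        std::size_t lo = w * chunk;
        std::size_t hi = (w == workers - 1) ? data.size() : lo + chunk;
        for (std::size_t i = lo; i < hi; ++i)
            partial[w].accumulate(data[i]);      // would run in parallel
    }

    // Merge the per-worker copies into the final result.
    MeanReduction result;
    for (const auto &p : partial)
        result.combine(p);
    std::cout << "mean = " << result.sum / result.count << "\n";
    return 0;
}

In a real system the worker loop would run across threads, GPU devices, or nodes, which is where the explicit combine step pays off.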
7
Data Management/Reduction Solutions
Must provide server-side data sub-setting, aggregation, and sampling
– Without reloading data into a `system'
Our approach: light-weight data management solutions
– Automatic data virtualization
– Support a virtual (e.g., relational) view over NetCDF, HDF5, etc.
– Support sub-setting and aggregation using a high-level language
– A new sampling approach based on bit-vectors (a sketch follows this list)
  Create lower-resolution representative datasets
  Measure the loss of information with respect to key statistical measures
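The bit-vector sampling idea can be illustrated with the following hedged sketch (the actual approach is not detailed on the slide): a bit-vector records which elements of a large array are retained in a lower-resolution representative dataset, and the loss of information is estimated by comparing simple statistics (mean and variance) of the sample against the full data. The stride-based selection and the synthetic sine-wave data are assumptions for illustration only.

// Hedged sketch of bit-vector-driven sampling, not the paper's algorithm.
#include <cmath>
#include <iostream>
#include <vector>

// Mean and variance of a dataset, used as the "key statistical measures".
static void stats(const std::vector<double> &v, double &mean, double &var) {
    mean = 0.0;
    for (double x : v) mean += x;
    mean /= v.size();
    var = 0.0;
    for (double x : v) var += (x - mean) * (x - mean);
    var /= v.size();
}

int main() {
    // Stand-in for one variable of a large NetCDF/HDF5 array.
    std::vector<double> full(100000);
    for (std::size_t i = 0; i < full.size(); ++i)
        full[i] = std::sin(0.001 * i);

    // Bit-vector: keep every stride-th element (stride set per target resolution).
    const std::size_t stride = 8;
    std::vector<bool> keep(full.size(), false);
    for (std::size_t i = 0; i < full.size(); i += stride)
        keep[i] = true;

    // Materialize the lower-resolution representative dataset.
    std::vector<double> sample;
    for (std::size_t i = 0; i < full.size(); ++i)
        if (keep[i]) sample.push_back(full[i]);

    // Measure loss of information with respect to mean and variance.
    double m_full, v_full, m_samp, v_samp;
    stats(full, m_full, v_full);
    stats(sample, m_samp, v_samp);
    std::cout << "mean error: " << std::fabs(m_full - m_samp)
              << "  variance error: " << std::fabs(v_full - v_samp) << "\n";
    return 0;
}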