Big Data Vs. (Traditional) HPC Gagan Agrawal Ohio State ICPP Big Data Panel (09/12/2012)

Presentation transcript:

Big Data Vs. (Traditional) HPC
• They will clearly co-exist
  – Fine-grained simulations will prompt more 'big-data' problems
  – Ability to analyze data will prompt finer-grained simulations
  – Even instrument data can prompt more simulations
• Third and Fourth Pillars of Scientific Research
• Critical Need
  – HPC community must get very engaged in 'big-data'

Other Thoughts
• Onus on HPC Community
  – Database, Cloud, and Viz communities active for a while now
      · Abstractions like MapReduce are neat!
      · So are Parallel and Streaming Visualization Solutions
  – Many existing solutions very low on performance
      · Do people realize how slow Hadoop really is?
      · And yet, it is one of the most successful open-source software projects
  – We are needed!
      · Programming model design and implementation community hasn't even looked at 'big-data' applications
  – We must engage application scientists
      · Who are often stuck in 'I don't want to deal with the mess'

Impact on Leadership Class Systems
• Unlike HPC, the commercial sector has a lot of experience in 'big-data'
  – Facebook, Google
• They seem to do fine with large fault-tolerant commodity clusters
• 'Big-data' might create a push back from memory / I/O-bound architecture trends
  – Might make journey to Exascale harder though
• 'Big-data' problems should certainly be considered while addressing fault-tolerance and power challenges

Open Questions
• How do we develop parallel data analysis solutions?
  – Hadoop?
  – MPI + file I/O calls?
  – SciDB (array analytics)?
  – Parallel R?
• Desiderata
  – No reloading of data (rules out SciDB and Hadoop)
  – Performance while implementing new algorithms (rules out parallel R)
  – Transparency with respect to data layouts and parallel architectures

Our Ongoing Work: MATE++
• A very efficient Map-Reduce-like System for Scientific Data Analytics
  – MapReduce and another reduction-based API
  – Can plug and play with different data formats
  – No reloading of data
  – Flexibly use different forms of parallelism
      · GPUs, Fusion Architecture …
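
To make the contrast between a MapReduce API and a reduction-based API concrete, here is a minimal sketch in plain Python. It is not the MATE++ interface, which the slide does not show. The function names (`mapreduce_histogram`, `local_reduction`, `merge`, `reduction_histogram`) and the histogram workload are assumptions chosen for illustration. The MapReduce version emits key-value pairs and groups them before reducing. The reduction-based version updates a per-chunk reduction object in place and merges the partial objects at the end, so no intermediate pairs are stored.

```python
from collections import defaultdict
from functools import reduce


def mapreduce_histogram(chunks, bin_width):
    """MapReduce style: emit (bin, 1) pairs, group them by key, reduce each group."""
    # Map phase: one intermediate pair per input element.
    pairs = [(int(v // bin_width), 1) for chunk in chunks for v in chunk]
    # Shuffle phase: group the intermediate values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce phase: combine the values of each group.
    return {key: sum(values) for key, values in groups.items()}


def local_reduction(chunk, bin_width):
    """Reduction style: fold each element directly into a reduction object."""
    robj = defaultdict(int)  # hypothetical per-worker reduction object
    for v in chunk:
        robj[int(v // bin_width)] += 1
    return robj


def merge(acc, partial):
    """Global combination: merge one partial reduction object into the accumulator."""
    for key, count in partial.items():
        acc[key] += count
    return acc


def reduction_histogram(chunks, bin_width):
    """Run the local reduction per chunk, then merge the partial results."""
    partials = [local_reduction(chunk, bin_width) for chunk in chunks]
    return dict(reduce(merge, partials, defaultdict(int)))


chunks = [[0.5, 1.2, 1.9], [2.1, 0.1]]
mr_result = mapreduce_histogram(chunks, bin_width=1.0)
red_result = reduction_histogram(chunks, bin_width=1.0)
```

Both functions compute the same histogram. In this sketch each chunk stands in for the data owned by one worker, and the list comprehension over chunks is where a real system would run the local reductions in parallel.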

Data Management/Reduction Solutions
• Must provide server-side data sub-setting, aggregation and sampling
  – Without reloading data into a 'system'
• Our Approach: Light-weight data management solutions
  – Automatic Data Virtualization
  – Support virtual (e.g. relational) view over NetCDF, HDF5 etc.
  – Support sub-setting and aggregation using a high-level language
  – A new sampling approach based on bit-vectors
      · Create lower-resolution representative datasets
      · Measure loss of information with respect to key statistical measures
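
The bit-vector sampling idea can be illustrated with a small sketch. This is not the authors' algorithm, which the slide does not describe. It assumes a simple equal-width binning, one bit-vector per bin stored as a Python integer, and a deterministic "every `step`-th element per bin" selection rule. All function names are hypothetical. The point is that the sample is chosen from the index alone, each value bin stays represented, and the information loss can then be checked against a statistic such as the mean.

```python
def build_bitmap_index(values, bin_width):
    """Build one bit-vector per value bin; bit i is set if values[i] falls in that bin."""
    index = {}
    for i, v in enumerate(values):
        bin_id = int(v // bin_width)
        index[bin_id] = index.get(bin_id, 0) | (1 << i)
    return index


def set_bits(bitvector):
    """Yield the positions of the set bits, lowest position first."""
    pos = 0
    while bitvector:
        if bitvector & 1:
            yield pos
        bitvector >>= 1
        pos += 1


def stratified_sample(index, step):
    """Pick every `step`-th element of each bin, using only the bit-vectors."""
    chosen = []
    for bin_id in sorted(index):
        positions = list(set_bits(index[bin_id]))
        chosen.extend(positions[::step])
    return sorted(chosen)


def mean(xs):
    """Arithmetic mean of a non-empty sequence."""
    return sum(xs) / len(xs)


values = [0.2, 0.4, 1.1, 1.3, 1.5, 1.7, 2.9, 0.6]
index = build_bitmap_index(values, bin_width=1.0)
sample_ids = stratified_sample(index, step=2)

# Only now touch the raw data, and only at the sampled positions.
sample = [values[i] for i in sample_ids]
# One possible loss measure: absolute difference between full and sampled means.
mean_loss = abs(mean(values) - mean(sample))
```

Here `sample_ids` keeps five of the eight elements while still covering all three bins. Other loss measures (variance, histogram distance) could be substituted for the mean in the same way.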