Jialin Liu, Surendra Byna, Yong Chen. Oct. 08, 2013. Data-Intensive Scalable Computing Laboratory (DISCL), Lawrence Berkeley National Lab (LBNL).


Segmented Analysis for Reducing Data Movement. Jialin Liu, Surendra Byna, Yong Chen. Data-Intensive Scalable Computing Laboratory (DISCL), Lawrence Berkeley National Lab (LBNL). Big Data 2013.

Outline
 Motivation and Idea
 Related Work & Potentials
 System Design
 Evaluation
 Conclusion and Future Work

Motivation and Idea
 Many scientific applications now generate several terabytes (TB) of data in a single run, and data sizes are expected to reach petabytes (PB) in the near future.
 VPIC (Vector Particle-In-Cell), a plasma physics code: 26 bytes per particle, 30 TB.
 Climate applications.
 Post-analysis based on subset queries generates large amounts of overlapping I/O.

Motivation and Idea
 CDO: Climate Data Operators
 About 200 operators to manipulate NetCDF datasets.
Task1: cdo ensmean in1 in2 in3 ofile1
Task2: cdo ensmean in3 in4 in5 ofile2
Task3: cdo ensmean in1 in2 in5 ofile3
 The tasks share input files (e.g., in1 and in2 appear in both Task1 and Task3), so their analyses read overlapping data.
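The redundancy among these three tasks can be seen by intersecting their input sets. A minimal illustrative sketch (the task and file names mirror the slide; the code itself is not part of CDO):

```python
from itertools import combinations

# Input files of the three "cdo ensmean" tasks from the slide.
tasks = {
    "Task1": {"in1", "in2", "in3"},
    "Task2": {"in3", "in4", "in5"},
    "Task3": {"in1", "in2", "in5"},
}

# Any file shared by two tasks is read (and reduced) more than once;
# this repeated I/O is what segmented analysis aims to eliminate.
for a, b in combinations(sorted(tasks), 2):
    shared = tasks[a] & tasks[b]
    if shared:
        print(f"{a} and {b} share {sorted(shared)}")
```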

Motivation and Idea
[Figure: an analysis query produces results; later analyses reuse those results.]
 Data movement is reduced via reusing results.

Challenges
Basic Idea:
 Segmented analysis reuses query results by detecting overlap between tasks.
Challenges:
 How to detect the overlap
 How to reuse the results

Related Work and Potential
 Database: Materialized View (snapshot) [Source: wiki]
 A database object that contains the results of a query, e.g., a local copy of remotely located data, or a summary.
 MapReduce: Intermediate Results [Source: VLDB'12]
 Cache intermediate results from MapReduce jobs and reuse them for future workflows.
 No existing work in HPC scientific data management.
 FlexQuery: Online Query for Visualization [Georgia Tech]
 SDS: Scientific Data Service [LBNL]
 FASM: Fast Data Analysis with Statistical Metadata [Texas Tech]

System Design: Overview
[Figure: pipeline components: Task, Overlap Detection, Cache Aggregation, Optimized I/O, In-situ Segmentation, File Systems, Result.]

System Design: Overlap Detection
Overlapping condition: computation and I/O.
 Computation: Max, Mean, Histogram
 I/O: each task i selects a hyperslab described by per-dimension Start and Length vectors, e.g., Start(15:300:50) Length(30:20:40); Start(1:3:5) Length(10:200:30); Start(80:1000:20) Length(3:5:10)
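With selections expressed as per-dimension Start/Length vectors, the I/O side of the overlapping condition reduces to interval intersection in every dimension. A sketch of that check (not the authors' implementation):

```python
# Two hyperslab selections overlap iff their half-open intervals
# [start, start + length) intersect in every dimension.
def overlaps(start_a, len_a, start_b, len_b):
    return all(
        sa < sb + lb and sb < sa + la
        for sa, la, sb, lb in zip(start_a, len_a, start_b, len_b)
    )

# The first dimensions of Start(15:300:50)/Length(30:20:40) and
# Start(80:1000:20)/Length(3:5:10) are disjoint ([15, 45) vs [80, 83)),
# so these two selections cannot overlap.
disjoint = not overlaps((15, 300, 50), (30, 20, 40),
                        (80, 1000, 20), (3, 5, 10))
```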

System Design: In-situ Segmentation
 Low-level chunking: user-specified fixed-size chunks
 High-level segmenting: dimension-driven flexible segmentation
[Figure: data passes through low-level chunking into sub-results, which high-level segmenting combines into results, spanning the computation and I/O stages.]
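For a decomposable operator such as the mean, segmenting lets each chunk's sub-result be computed once and reused by later overlapping queries. A minimal 1-D sketch, under the simplifying assumptions that query ranges are chunk-aligned and that the chunk size and cache layout are illustrative only:

```python
CHUNK = 4
cache = {}  # segment index -> (partial sum, element count)

def segment_mean(data, lo, hi):
    """Mean of data[lo:hi], assembled from cached per-segment partials."""
    total = count = 0
    for seg in range(lo // CHUNK, (hi - 1) // CHUNK + 1):
        if seg not in cache:  # compute the sub-result once...
            part = data[seg * CHUNK:(seg + 1) * CHUNK]
            cache[seg] = (sum(part), len(part))
        t, c = cache[seg]     # ...and reuse it for overlapping queries
        total, count = total + t, count + c
    return total / count
```

For example, after a query over [0, 8) populates segments 0 and 1, a second query over [4, 12) recomputes only segment 2 and reuses the cached segment 1, which is the data-movement saving the slide describes.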

Evaluation
4D NetCDF datasets, 108 GB, 40 OSTs, 1 MB stripe size, 640 nodes; 76 MB per request per process, 100 tasks run sequentially.
 1.2X speedup at 10% overlap, up to 13.5X at 90% overlap
 2X to 8X as the overlapping rate grows from 10% to 90%

Evaluation
Caching data vs. caching results
 Segmented analysis achieves the lowest total execution time
 Bandwidth is close to that of data caching (data movement is reduced)

Evaluation
 Overhead
 Cache file reads: 72.1% of the overhead
 5% of total execution time
 High-level segmentation
 The co-existing scheme performs better
 Matches the I/O pattern

Conclusion and Future Work
 Conclusion
 Reuse query results and perform segmented analysis.
 In big data analysis, such data-centric optimization can substantially reduce data movement.
 The segmented analysis idea has potential in real applications, e.g., real-time analysis and interactive systems.
 Future work
 Optimal partial-result reusing
 Prefetching-like segmented analysis

Segmented Analysis for Reducing Data Movement. Thanks! Q&A