
Segmented Analysis for Reducing Data Movement
Jialin Liu, Surendra Byna, Yong Chen
Oct. 08, 2013
Data-Intensive Scalable Computing Laboratory (DISCL)
Lawrence Berkeley National Lab (LBNL)
Big Data 2013

Outline  Motivation and Idea  Related Work & Potentials  System Design  Evaluation  Conclusion and Future Work 2

Motivation and Idea  Many scientific applications nowadays generate a few terabytes (TB) of data in a single run and the data sizes are expected to reach petabytes (PB) in the near future.  VPIC, Vector Particle in Cell, Plasma physics, 26 bytes per particle, 30TB  Climate applications.  Post-analysis based on subset query generates huge amounts of overlapping I/O.

Motivation and Idea  CDO: Climate Data Operator  200 operators to manipulate the NetCDF dataset Task1: cdo ensmean in1 in2 in3 ofile1 Task2: cdo ensmean in3 in4 in5 ofile2 Task3: cdo ensmean in1 in2 in5 ofile3

Motivation and Idea
[Figure: analyses issue queries and produce results; results are reused by later analyses.]
Data movement is reduced by reusing results.

Challenges
Basic idea:
- Segmented analysis reuses query results by detecting overlap.
Challenges:
- How to detect the overlap
- How to reuse the results

Related Work and Potential  Database: Materialized View (snapshot) [Source: wiki]  A database object that contains the results of a query. E.g., a local copy of data located remotely or a summary.  MapReduce: Intermediate Results [Source: VLDB’12]  Intermediate results from MapReduce jobs and reuse them for future workflows. No work in HPC Scientific Data Management  FlexQuery: Online Query for Visualization [Georgia Tech]  SDS: Scientific Data Service [LBNL]  FASM: Fast Data Analysis with Statistical Metadata [Texas Tech]

System Design: Overview
[Figure: architecture diagram. Components shown: Task, Overlap Detection, Cache, Aggregation, Optimized I/O, In-situ Segmentation, File Systems, Result.]

System Design: Overlap Detection
Overlapping condition: computation and I/O
[Figure: an incoming Task i is compared against recorded entries. Computations shown: Max, Mean, Histogram. I/O regions shown: Start(15:300:50) Length(30:20:40); Start(1:3:5) Length(10:200:30); Start(80:1000:20) Length(3:5:10).]
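A minimal sketch of the overlapping condition, under stated assumptions: two tasks overlap only if they request the same computation and their I/O regions intersect in every dimension. Regions are treated as half-open boxes `[start, start + length)`. The `Region` and `Task` classes, the function names, and the pairing of `max` with `Start(15:300:50)` are illustrative choices, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Region:
    start: Tuple[int, ...]   # per-dimension start index
    length: Tuple[int, ...]  # per-dimension extent

@dataclass(frozen=True)
class Task:
    op: str          # computation name, e.g. "max", "mean", "histogram"
    region: Region   # I/O region the computation reads

def intersect(a: Region, b: Region) -> Optional[Region]:
    """Return the common sub-region of a and b, or None if they are disjoint."""
    if len(a.start) != len(b.start):
        raise ValueError("regions must have the same number of dimensions")
    starts, lengths = [], []
    for sa, la, sb, lb in zip(a.start, a.length, b.start, b.length):
        lo = max(sa, sb)
        hi = min(sa + la, sb + lb)
        if hi <= lo:          # no overlap along this dimension
            return None
        starts.append(lo)
        lengths.append(hi - lo)
    return Region(tuple(starts), tuple(lengths))

def detect_overlap(new: Task, recorded: Task) -> Optional[Region]:
    """Overlap requires the same computation AND intersecting I/O regions."""
    if new.op != recorded.op:
        return None
    return intersect(new.region, recorded.region)

# Assumed example: a recorded "max" over one of the regions shown on the slide.
recorded = Task("max", Region((15, 300, 50), (30, 20, 40)))
incoming = Task("max", Region((30, 310, 60), (30, 20, 40)))
common = detect_overlap(incoming, recorded)
print(common)
```

The returned region is the part of the incoming request whose result could in principle be served from earlier work; the remainder still has to be read.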

System Design: In-situ Segmentation
- Low-level chunking: the user specifies a fixed chunk size
- High-level segmenting: dimension-driven, flexible segmentation
[Figure: data is split by low-level chunking or high-level segmenting; computation and I/O on each piece produce sub-results, which are combined into results.]
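The two segmentation styles can be sketched as follows, under stated assumptions: low-level chunking cuts a flat index range into fixed-size pieces, high-level segmenting cuts a multidimensional shape along one chosen dimension, and a mean is computed as per-piece sub-results `(sum, count)` that are combined afterwards. All function names here are hypothetical.

```python
def fixed_chunks(n, chunk):
    """Low-level chunking: split [0, n) into (start, length) pieces of a fixed size."""
    return [(s, min(chunk, n - s)) for s in range(0, n, chunk)]

def segment_along(shape, dim, parts):
    """High-level segmenting: split `shape` along dimension `dim` into `parts` blocks."""
    base, extra = divmod(shape[dim], parts)
    segments, offset = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue
        start = [0] * len(shape)
        start[dim] = offset
        length = list(shape)
        length[dim] = size
        segments.append((tuple(start), tuple(length)))
        offset += size
    return segments

def partial_mean(values):
    """Sub-result for a mean: (sum, count), which can be cached and combined later."""
    return (sum(values), len(values))

def combine_means(partials):
    """Aggregate (sum, count) sub-results into the final mean."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

data = list(range(10))
chunks = fixed_chunks(len(data), 4)
sub_results = [partial_mean(data[s:s + l]) for s, l in chunks]
mean = combine_means(sub_results)
segments = segment_along((6, 4), dim=0, parts=3)
print(chunks, sub_results, mean, segments)
```

Keeping `(sum, count)` rather than a per-piece mean is what makes the sub-results combinable; a computation such as max would only need the per-piece maximum.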

Evaluation
- Setup: 4D NetCDF datasets, 108 GB; 40 OSTs, 1M stripe size, 640 nodes.
- Workload: 76 MB per request per process; 100 tasks run sequentially.
- 1.2X at 10% overlap to 13.5X at 90% overlap
- 2X to 8X as the overlap rate goes from 10% to 90%

Evaluation
Caching data vs. caching results
- Segmented analysis achieves the lowest total execution time.
- Bandwidth is close to that of the data cache (data movement is reduced).

Evaluation  Overhead  Cache file read: 72.1%  5% of total execution time  High-level segmentation  Co-existing performs better  Match with I/O pattern

Conclusion and Future Work  Conclusion  Reuse the query results and perform the Segmented Analysis  In big data analysis, such data centric optimization can potentially reduce the huge amounts of data movements.  The segmented analysis idea have potential in real application, e.g., real-time analysis, interactive system, etc.  Future work  Optimal partial result reusing  Prefetching-like segmented analysis

Segmented Analysis for Reducing Data Movement
Thanks
Q&A
http://discl.cs.ttu.edu
