Jialin Liu, Surendra Byna, Yong Chen Oct. 08. 2013 Data-Intensive Scalable Computing Laboratory (DISCL) Lawrence Berkeley National Lab (LBNL) Segmented.


1 Segmented Analysis for Reducing Data Movement
Jialin Liu, Surendra Byna, Yong Chen
Data-Intensive Scalable Computing Laboratory (DISCL), Lawrence Berkeley National Lab (LBNL)
Oct. 08, 2013, Big Data 2013

2 Outline
• Motivation and Idea
• Related Work & Potentials
• System Design
• Evaluation
• Conclusion and Future Work

3 Motivation and Idea
• Many scientific applications now generate a few terabytes (TB) of data in a single run, and data sizes are expected to reach petabytes (PB) in the near future.
  • VPIC (Vector Particle in Cell), plasma physics: 26 bytes per particle, 30 TB
  • Climate applications
• Post-analysis based on subset queries generates huge amounts of overlapping I/O.

4 Motivation and Idea
• CDO: Climate Data Operator
• 200 operators to manipulate NetCDF datasets
Task1: cdo ensmean in1 in2 in3 ofile1
Task2: cdo ensmean in3 in4 in5 ofile2
Task3: cdo ensmean in1 in2 in5 ofile3
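The overlap among the three CDO tasks can be made concrete with a small sketch. The file names are taken from the commands on the slide; the helper names `shared_inputs` and `redundant_reads` are hypothetical and only illustrate how much input is read more than once when the tasks run independently.

```python
from itertools import combinations

# Input files per task, copied from the three `cdo ensmean` commands.
TASKS = {
    "Task1": ["in1", "in2", "in3"],
    "Task2": ["in3", "in4", "in5"],
    "Task3": ["in1", "in2", "in5"],
}


def shared_inputs(tasks):
    """Return {(task_a, task_b): sorted list of files both tasks read}."""
    shared = {}
    for (a, files_a), (b, files_b) in combinations(tasks.items(), 2):
        shared[(a, b)] = sorted(set(files_a) & set(files_b))
    return shared


def redundant_reads(tasks):
    """Count file reads beyond the first read of each distinct file."""
    total = sum(len(files) for files in tasks.values())
    distinct = len({f for files in tasks.values() for f in files})
    return total - distinct


if __name__ == "__main__":
    for pair, files in shared_inputs(TASKS).items():
        print(pair, files)
    print("redundant reads:", redundant_reads(TASKS))
```

Every pair of tasks shares at least one input file, so running them in isolation repeats reads that a result-reuse scheme could avoid.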

5 Motivation and Idea
[Figure: analysis, query, and results, with results reused by a later analysis]
Data movement is reduced via reusing results.

6 Challenges
Basic Idea:
• Segmented analysis to reuse query results by detecting overlap.
Challenges:
• How to detect the overlap
• How to reuse the results

7 Related Work and Potential
• Database: Materialized View (snapshot) [Source: wiki]
  • A database object that contains the results of a query, e.g., a local copy of data located remotely, or a summary.
• MapReduce: Intermediate Results [Source: VLDB’12]
  • Intermediate results from MapReduce jobs are stored and reused for future workflows.
No work in HPC Scientific Data Management
• FlexQuery: Online Query for Visualization [Georgia Tech]
• SDS: Scientific Data Service [LBNL]
• FASM: Fast Data Analysis with Statistical Metadata [Texas Tech]

8 System Design: Overview
[Figure: system components: Task, Overlap Detection, Cache, Aggregation, Optimized I/O, In-situ Segmentation, File Systems, Result]

9 System Design: Overlap Detection
Overlapping Condition: Computation and I/O
Computation: Max, Mean, Histogram
I/O (Task i):
Start(15:300:50) Length(30:20:40)
Start(1:3:5) Length(10:200:30)
Start(80:1000:20) Length(3:5:10)
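A minimal sketch of the overlapping condition, assuming that two tasks overlap only when they apply the same computation and their I/O regions intersect in every dimension. It also assumes that `Start(a:b:c)` and `Length(a:b:c)` give the per-dimension start index and extent of a region. The `Region` and `Task` types and the `intersect` and `overlap` functions are hypothetical names, not the authors' implementation.

```python
from typing import NamedTuple, Optional, Tuple


class Region(NamedTuple):
    start: Tuple[int, ...]   # per-dimension start index
    length: Tuple[int, ...]  # per-dimension extent


class Task(NamedTuple):
    op: str        # e.g. "max", "mean", "histogram"
    region: Region


def intersect(a: Region, b: Region) -> Optional[Region]:
    """Return the region common to a and b, or None if they are disjoint."""
    if len(a.start) != len(b.start):
        raise ValueError("regions must have the same number of dimensions")
    starts, lengths = [], []
    for sa, la, sb, lb in zip(a.start, a.length, b.start, b.length):
        lo = max(sa, sb)
        hi = min(sa + la, sb + lb)  # exclusive upper bound
        if hi <= lo:
            return None  # no overlap in this dimension, so none overall
        starts.append(lo)
        lengths.append(hi - lo)
    return Region(tuple(starts), tuple(lengths))


def overlap(t1: Task, t2: Task) -> Optional[Region]:
    """Overlap requires the same computation and intersecting I/O."""
    if t1.op != t2.op:
        return None
    return intersect(t1.region, t2.region)


if __name__ == "__main__":
    cached = Task("max", Region((15, 300, 50), (30, 20, 40)))   # from the slide
    new = Task("max", Region((30, 310, 60), (30, 20, 40)))      # hypothetical
    print(overlap(cached, new))
```

Under this reading, the three regions listed on the slide are pairwise disjoint, so the usage example pairs the first of them with a made-up second region that does intersect it.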

10 System Design: In-situ Segmentation
• Low-level chunking: user-specified, fixed-size chunks
• High-level segmenting: dimension-driven, flexible segmentation
[Figure: data is split by low-level chunking (I/O) and high-level segmenting (computation) into sub-results, which are combined into results]
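The two ideas on this slide can be sketched as follows, under stated assumptions: a region is split into segments of a chosen shape, a sub-result is computed per segment, and sub-results are combined into the final result. The functions `chunk_1d`, `segment`, `partial`, and `combine` are hypothetical, and the sub-result layout (max, sum, count) is simply one choice that lets max and mean be rebuilt without rereading the data.

```python
from itertools import product


def chunk_1d(start, length, chunk):
    """Split [start, start + length) into pieces of at most `chunk` elements."""
    end = start + length
    return [(s, min(chunk, end - s)) for s in range(start, end, chunk)]


def segment(start, length, seg_shape):
    """Split an N-D region into segments, each a (start, length) tuple pair.

    Setting seg_shape equal to the region length in a dimension leaves that
    dimension unsplit, which gives a dimension-driven segmentation.
    """
    per_dim = [chunk_1d(s, l, c) for s, l, c in zip(start, length, seg_shape)]
    return [
        (tuple(p[0] for p in combo), tuple(p[1] for p in combo))
        for combo in product(*per_dim)
    ]


def partial(values):
    """Sub-result for one segment: enough to rebuild max and mean later."""
    return {"max": max(values), "sum": sum(values), "count": len(values)}


def combine(parts):
    """Merge sub-results from several segments into a final result."""
    total = sum(p["sum"] for p in parts)
    count = sum(p["count"] for p in parts)
    return {"max": max(p["max"] for p in parts), "mean": total / count}


if __name__ == "__main__":
    # Split only along dimension 0, in steps of 10.
    for seg in segment((15, 300, 50), (30, 20, 40), (10, 20, 40)):
        print(seg)

    # 1-D illustration: fixed-size chunks of 4, then combine sub-results.
    data = list(range(10))
    parts = [partial(data[s:s + n]) for s, n in chunk_1d(0, len(data), 4)]
    print(combine(parts))
```

Because each sub-result is stored per segment, a later task that covers some of the same segments would only need to read and compute the segments it has not seen.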

11 Evaluation
4D NetCDF datasets, 108 GB, 40 OSTs, 1 MB stripe size, 640 nodes. 76 MB per request per process, 100 tasks run sequentially.
• 1.2X at 10% overlap to 13.5X at 90% overlap
• 2X to 8X at overlapping rates from 10% to 90%

12 Evaluation
Cache data vs. cache results
• Segmented analysis achieves the least total execution time
• Bandwidth close to data cache (data movement reduced)

13 Evaluation
• Overhead
  • Cache file read: 72.1%
  • 5% of total execution time
• High-level segmentation
  • Co-existing performs better
  • Match with I/O pattern

14 Conclusion and Future Work
• Conclusion
  • Reuse the query results and perform segmented analysis.
  • In big data analysis, such data-centric optimization can potentially reduce huge amounts of data movement.
  • The segmented analysis idea has potential in real applications, e.g., real-time analysis, interactive systems, etc.
• Future work
  • Optimal partial result reuse
  • Prefetching-like segmented analysis

15 Segmented Analysis for Reducing Data Movement Thanks Q&A http://discl.cs.ttu.edu
