Published by Barbra Willis. Modified over 8 years ago.
1
Segmented Analysis for Reducing Data Movement
Jialin Liu, Surendra Byna, Yong Chen
Data-Intensive Scalable Computing Laboratory (DISCL), Lawrence Berkeley National Lab (LBNL)
Oct. 08, 2013, Big Data 2013
2
Outline
  Motivation and Idea
  Related Work & Potentials
  System Design
  Evaluation
  Conclusion and Future Work
3
Motivation and Idea
Many scientific applications now generate a few terabytes (TB) of data in a single run, and data sizes are expected to reach petabytes (PB) in the near future.
  VPIC (Vector Particle in Cell), plasma physics: 26 bytes per particle, 30 TB
  Climate applications
Post-analysis based on subset queries generates huge amounts of overlapping I/O.
4
Motivation and Idea
CDO (Climate Data Operators): 200 operators to manipulate NetCDF datasets
  Task1: cdo ensmean in1 in2 in3 ofile1
  Task2: cdo ensmean in3 in4 in5 ofile2
  Task3: cdo ensmean in1 in2 in5 ofile3
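The three CDO tasks above share input files, which is the overlapping I/O the slides refer to. The following is a minimal sketch, not the authors' implementation, that lists which inputs each pair of tasks has in common and counts how many file reads are issued versus how many distinct files exist. The `tasks` dictionary and the helper `shared_inputs` are hypothetical names introduced only for illustration.

```python
from itertools import combinations

# The three tasks from the slide: (operator, input files, output file).
tasks = {
    "Task1": ("ensmean", ["in1", "in2", "in3"], "ofile1"),
    "Task2": ("ensmean", ["in3", "in4", "in5"], "ofile2"),
    "Task3": ("ensmean", ["in1", "in2", "in5"], "ofile3"),
}


def shared_inputs(tasks):
    """Return {(task_a, task_b): sorted list of input files both tasks read}."""
    shared = {}
    for (name_a, spec_a), (name_b, spec_b) in combinations(tasks.items(), 2):
        common = sorted(set(spec_a[1]) & set(spec_b[1]))
        if common:
            shared[(name_a, name_b)] = common
    return shared


overlaps = shared_inputs(tasks)

# Reads issued if every task runs independently vs. distinct files on disk.
total_reads = sum(len(inputs) for _, inputs, _ in tasks.values())
distinct_reads = len({f for _, inputs, _ in tasks.values() for f in inputs})

for pair, files in overlaps.items():
    print(pair, "share", files)
print(total_reads, "reads issued for", distinct_reads, "distinct files")
```

Every pair of tasks shares at least one input here, so running them independently reads several files more than once.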
5
Motivation and Idea
[Figure: the results of an analysis query are kept and reused by a later analysis.]
Data movement is reduced via reusing results.
6
Challenges
Basic idea: segmented analysis reuses query results by detecting overlap between tasks.
Challenges:
  How to detect the overlap
  How to reuse the results
7
Related Work and Potential
Database: Materialized View (snapshot) [Source: wiki]
  A database object that contains the results of a query, e.g., a local copy of data located remotely, or a summary.
MapReduce: Intermediate Results [Source: VLDB’12]
  Keep intermediate results from MapReduce jobs and reuse them for future workflows.
No work in HPC
Scientific Data Management
  FlexQuery: Online Query for Visualization [Georgia Tech]
  SDS: Scientific Data Service [LBNL]
  FASM: Fast Data Analysis with Statistical Metadata [Texas Tech]
8
System Design: Overview
[Figure: system overview. Labeled components: Task, Overlap Detection, Cache, Aggregation, Optimized I/O, In-situ Segmentation, File Systems, Result.]
9
System Design: Overlap Detection
Overlapping condition: computation and I/O
[Figure: task i is compared against recorded entries. Each entry has a computation (Max, Mean, Histogram) and an I/O region:
  Start(15:300:50) Length(30:20:40)
  Start(1:3:5) Length(10:200:30)
  Start(80:1000:20) Length(3:5:10)]
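One way to read the overlapping condition is that two tasks overlap when they run the same computation and their I/O regions, each given as a per-dimension start and length, intersect. The sketch below illustrates that reading under stated assumptions; it is not taken from the authors' system, and the names `region_overlap` and `tasks_overlap` are hypothetical.

```python
from typing import Optional, Tuple

# A region is (start, length), each a tuple with one entry per dimension.
Region = Tuple[Tuple[int, ...], Tuple[int, ...]]


def region_overlap(a: Region, b: Region) -> Optional[Region]:
    """Intersect two hyper-rectangles; return None if they are disjoint."""
    start, length = [], []
    for sa, la, sb, lb in zip(a[0], a[1], b[0], b[1]):
        lo = max(sa, sb)
        hi = min(sa + la, sb + lb)  # exclusive end of the intersection
        if hi <= lo:
            return None  # no overlap in this dimension, so none overall
        start.append(lo)
        length.append(hi - lo)
    return tuple(start), tuple(length)


def tasks_overlap(op_a: str, region_a: Region,
                  op_b: str, region_b: Region) -> Optional[Region]:
    """Assumed condition: same computation AND intersecting I/O region."""
    if op_a != op_b:
        return None
    return region_overlap(region_a, region_b)


# Regions taken from the slide; the new task's region is made up for the demo.
cached_1 = ((15, 300, 50), (30, 20, 40))
cached_2 = ((1, 3, 5), (10, 200, 30))
new_task = ((30, 310, 60), (30, 20, 40))

hit = tasks_overlap("mean", cached_1, "mean", new_task)
miss_region = tasks_overlap("mean", cached_2, "mean", new_task)
miss_op = tasks_overlap("max", cached_1, "mean", new_task)
print(hit, miss_region, miss_op)
```

The returned intersection is the part of the new task that could be served from a recorded result; the remainder would still have to be read and computed.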
10
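System Design: In-situ Segmentation
Low-level chunking: the user specifies a fixed-size chunk
High-level segmenting: dimension-driven, flexible segmentation
[Figure: data is divided by low-level chunking or high-level segmenting; computation on each piece produces sub-results, which are combined into results. Labels: Data, Sub-Results, Results, Computation, I/O.]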
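The Data, Sub-Results, Results flow can be illustrated with fixed-size chunking on a one-dimensional array. This is a simplified sketch under stated assumptions: it covers only low-level chunking, not the dimension-driven high-level segmenting, and the helpers `chunk_bounds`, `sub_result`, and `combine` are hypothetical names. Each chunk keeps a partial result (sum, count, max) that can be combined later into a mean or max without rereading the data.

```python
def chunk_bounds(n, chunk_size):
    """Low-level chunking: split range(n) into fixed-size [start, end) pieces."""
    return [(s, min(s + chunk_size, n)) for s in range(0, n, chunk_size)]


def sub_result(values):
    """Partial result for one chunk: (sum, count, max)."""
    return (sum(values), len(values), max(values))


def combine(sub_results):
    """Combine per-chunk partial results into the final mean and max."""
    sub_results = list(sub_results)
    total = sum(s for s, _, _ in sub_results)
    count = sum(c for _, c, _ in sub_results)
    return {"mean": total / count, "max": max(m for _, _, m in sub_results)}


data = [float(i) for i in range(10)]
bounds = chunk_bounds(len(data), 4)

# First task: compute and cache one sub-result per chunk.
cache = {b: sub_result(data[b[0]:b[1]]) for b in bounds}
full = combine(cache.values())

# A later task over elements [4, 10) reuses two cached sub-results
# instead of reading the data again.
later = combine([cache[(4, 8)], cache[(8, 10)]])
print(bounds, full, later)
```

Reuse in this sketch works only when a later request lines up with chunk boundaries; a request that cuts through a chunk would need that chunk's data reread.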
11
Evaluation
4D NetCDF datasets, 108 GB, 40 OSTs, 1 MB stripe size, 640 nodes.
76 MB per request per process, 100 tasks run sequentially.
1.2X at 10% overlap to 13.5X at 90% overlap
2X to 8X as the overlapping rate goes from 10% to 90%
12
Evaluation
Cache data vs. cache results
  Segmented analysis achieves the least total execution time
  Bandwidth close to data cache (data movement reduced)
13
Evaluation
Overhead
  Cache file read: 72.1%
  5% of total execution time
High-level segmentation
  Co-existing performs better
  Match with I/O pattern
14
Conclusion and Future Work
Conclusion
  Reuse the query results and perform segmented analysis.
  In big data analysis, such data-centric optimization can potentially reduce huge amounts of data movement.
  The segmented analysis idea has potential in real applications, e.g., real-time analysis, interactive systems, etc.
Future work
  Optimal partial result reusing
  Prefetching-like segmented analysis
15
Segmented Analysis for Reducing Data Movement
Thanks
Q&A
http://discl.cs.ttu.edu