Data Locality Aware Strategy for Two-Phase Collective I/O. Rosa Filgueira, David E. Singh, Juan C. Pichel, Florin Isaila, and Jesús Carretero. Universidad Carlos III de Madrid (Spain).

Summary
- Problem description.
- Main objectives.
- Locality-aware strategy for Two-Phase I/O:
  - Linear Assignment Problem.
  - LA-Two-Phase I/O (LATP).
- Evaluation.
- Results.
- Conclusions.

1. Problem description (I)
Parallel scientific applications generate large amounts of data.
Access pattern:
- Individual processes read and write non-contiguously.
- The collective access, taken as a whole, is contiguous.
Collective I/O aggregates many small individual requests into larger ones:
- Disk-directed I/O (aggregation close to the disk).
- Two-phase I/O (aggregation at the compute nodes): our optimization target.

1. Problem description (II)
Two-Phase I/O phases:
- Shuffle: aggregate data into contiguous buffers.
- I/O: transfer the contiguous buffers to the file system.
Before these two phases:
- The file region is divided into equal contiguous regions called File Domains (FDs), as sketched below.
- Each FD is assigned to a subset of the compute nodes (the aggregators).
- Each aggregator is responsible for transferring all the data of its FD to the file system.
Cause of inefficiency: the assignment of FDs to aggregators is independent of the data distribution.
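A toy sketch (hypothetical names, not ROMIO's actual code) of how conventional two-phase I/O divides the accessed file range into equal, contiguous file domains and hands them to aggregators in rank order, ignoring where the data actually resides:

```python
def file_domains(start_offset, end_offset, n_aggregators):
    """Divide [start_offset, end_offset) into one contiguous file domain
    per aggregator; domain i is handled by aggregator i, in rank order."""
    total = end_offset - start_offset
    fd_size = (total + n_aggregators - 1) // n_aggregators  # ceiling division
    domains = []
    for i in range(n_aggregators):
        lo = start_offset + i * fd_size
        hi = min(lo + fd_size, end_offset)
        domains.append((lo, hi))
    return domains

# A 400-byte region split among 4 aggregators: FD0=[0,100), ..., FD3=[300,400),
# assigned to ranks 0..3 regardless of which rank holds the corresponding data.
print(file_domains(0, 400, 4))
```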

1. Problem description (III)
[Figure: vector P is written to a file in parallel by 4 processes, using file domains Fd0-Fd3 (steps 1º, 2º, 3º).]
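A minimal mpi4py sketch of this kind of write (the file name, block sizes and interleaved layout are illustrative assumptions, not taken from the slides): four processes each hold non-contiguous pieces of P, and the collective Write_all triggers the two-phase machinery that shuffles the pieces into contiguous file domains.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()

block = 4                     # doubles per piece owned by this rank (assumed)
pieces = 8                    # pieces per rank, interleaved across ranks
local = np.full(block * pieces, rank, dtype='d')   # this rank's part of P

# File view: every rank owns every nprocs-th block of `block` doubles.
etype = MPI.DOUBLE
filetype = MPI.DOUBLE.Create_vector(pieces, block, block * nprocs)
filetype.Commit()

fh = MPI.File.Open(comm, 'P.dat', MPI.MODE_WRONLY | MPI.MODE_CREATE)
fh.Set_view(rank * block * etype.Get_size(), etype, filetype)
fh.Write_all(local)           # collective write: two-phase I/O happens here
fh.Close()
filetype.Free()
```

Run with 4 processes (e.g. mpiexec -n 4), each rank's non-contiguous pieces are gathered by the aggregators before hitting the file system.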

2. Main objectives
Replace the rigid assignment of FDs with an assignment that depends on the initial data distribution.
Our assignment increases I/O efficiency and reduces:
- The number of communication operations.
- The volume of communication.
- The total execution time.

3. Locality-aware strategy for Two-Phase I/O
This work presents Locality-Aware Two-Phase (LATP) I/O.
- LATP employs the Linear Assignment Problem (LAP) to find an optimal assignment of FDs to processes during the I/O stage.

3.1 Linear Assignment Problem (I)
LAP computes the optimal assignment of m items to n elements, given an m x n cost matrix.
Several algorithms have been developed for LAP:
- The Hungarian algorithm.
- The Jonker-Volgenant algorithm.
- The APC and APS algorithms.
All of them produce the same assignment; they differ only in the time needed to compute the optimal allocation.
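A hedged illustration of the kind of problem LAP solves here (the matrix values are made up, and the paper's own solver may differ from SciPy's, but the optimal assignment is the same): entry (p, d) is how many bytes process p already holds from file domain d, and we want the one-to-one assignment that maximizes local data.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Made-up locality matrix: bytes of FD d already resident on process p.
locality = np.array([[90, 10,  0,  0],
                     [ 5, 60, 30,  5],
                     [ 0, 20, 70, 10],
                     [ 5, 10,  0, 85]])

procs, fds = linear_sum_assignment(-locality)   # negate to maximize locality
for p, d in zip(procs, fds):
    print(f"process {p} <- file domain {d} ({locality[p, d]} local bytes)")
```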

3.1 Linear Assignment Problem (II)
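For reference, the standard LAP formulation (whether this is exactly what the original slide displayed is an assumption): choose assignment variables x_{ij} that maximize Σ_i Σ_j c_{ij} x_{ij}, subject to Σ_j x_{ij} = 1 for every item i, Σ_i x_{ij} = 1 for every element j, and x_{ij} ∈ {0, 1}. In LATP, c_{ij} plays the role of the locality between process i and file domain j.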

3.2 LA-Two-Phase I/O
[Figures: the original interval-to-process communication pattern vs. the LATP communication pattern.]
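A rough sketch of how the interval (FD) assignment could be derived from the access pattern, under assumed names (an illustration of the idea, not the authors' implementation): intersect each process's (offset, length) requests with the FD boundaries to build the locality matrix, then solve the LAP on it.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def locality_matrix(requests, domains):
    """requests[p]: list of (offset, length) pairs issued by process p.
    domains[d]: (lo, hi) byte range of file domain d.
    Entry (p, d) counts how many of p's bytes fall inside domain d."""
    m = np.zeros((len(requests), len(domains)), dtype=np.int64)
    for p, reqs in enumerate(requests):
        for off, length in reqs:
            for d, (lo, hi) in enumerate(domains):
                overlap = min(off + length, hi) - max(off, lo)
                if overlap > 0:
                    m[p, d] += overlap
    return m

# Toy example: 4 processes, 4 equal 100-byte file domains.
domains = [(0, 100), (100, 200), (200, 300), (300, 400)]
requests = [[(0, 80), (350, 20)], [(80, 120)], [(200, 90)], [(290, 60)]]
m = locality_matrix(requests, domains)
procs, fds = linear_sum_assignment(-m)          # maximize bytes kept local
print(dict(zip(fds, procs)))                    # file domain -> aggregator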

4. Evaluation (I)
Platform:
- Magerit cluster (CESVIMA), 1200 eServer BladeCenter nodes.
- Node: 2 IBM 64-bit processors, 64 GB RAM and 40 GB HD.
- Interconnection: Myrinet.
- MPICH version: MPICH-GM (NOGM).
- File system: PVFS with 1 metadata server and 8 I/O servers (64 KB striping factor).

4. Evaluation (II)
Application: BISP3D
- Semiconductor device simulator based on finite element methods.
- Problem input: an unstructured mesh.
- The mesh is divided into several sub-domains (METIS library).
- Each sub-domain is assigned to one process.
- Each process performs its calculations on the assigned data.
- The results are written to a file.

4. Evaluation (III)
Performed evaluations:
- Different meshes.
- Different loads.
[Table: file size (in MB) of each file as a function of the mesh (Mesh 1-4) and the load.]

4. Evaluation (IV)
Two-Phase I/O stages:
- File offsets and lengths calculation (st1).
- File offsets and lengths communication (st2).
- Interval assignment (st3).
- File domain calculation (st4).
- Access request calculation (st5).
- Metadata transfer (st6).
- Buffer writing (st7).
- File writing (st8).
[Figures: Mesh 1 with load 100, with 16 processes and with 64 processes.]

5. Results
- Percentage of improvement in stages 6 and 7.
- Reduction of the transferred data volume.
- Overall improvement.

5.1 Improvement in stages 6 and 7
In st6, each process:
- calculates which requests of the other processes lie in its FD,
- creates a list of offsets and lengths for each process,
- sends these lists to the rest of the processes.
In st7, each process sends the data described by the lists built in st6.
LATP:
- reduces the time of st6 and st7 in most cases;
- increases locality (maximizes the data stored in the local FD), so each process:
  - sends less data to the other processes,
  - reduces the volume and number of communication operations.
[Figure: results for Mesh 1.]
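A rough sketch, with assumed names (an illustration of the idea rather than the paper's code), of the computation behind st6: each aggregator intersects every process's (offset, length) requests, known from st2, with its own FD to build the per-process lists that st6 exchanges and whose data st7 transfers. The more of that data is already local, the less has to cross the network.

```python
def pieces_in_my_fd(all_requests, my_fd):
    """all_requests[p]: (offset, length) pairs issued by process p (from st2).
    my_fd: (lo, hi) range of the FD assigned to this aggregator.
    Returns lists[p] = pieces of p's requests that fall inside my FD."""
    lo, hi = my_fd
    lists = []
    for reqs in all_requests:
        mine = []
        for off, length in reqs:
            piece_lo, piece_hi = max(off, lo), min(off + length, hi)
            if piece_hi > piece_lo:
                mine.append((piece_lo, piece_hi - piece_lo))
        lists.append(mine)
    return lists

# With a locality-aware assignment, most bytes in lists[my_rank] are already
# local, so fewer pieces have to cross the network in st6/st7.
print(pieces_in_my_fd([[(0, 80), (350, 20)], [(80, 120)], [(200, 90)], [(290, 60)]],
                      (0, 100)))
```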

5.2 Reduction of communications
When LATP is applied, the transferred data volume is reduced.

5.3 Overall improvement
[Figures: overall improvement for Mesh 1, Mesh 2, Mesh 3 and Mesh 4.]

6. Conclusions
- LATP is an optimization of two-phase collective I/O.
- It uses the Linear Assignment Problem to maximize locality.
- It improves overall performance.
- The new stage (st3) adds insignificant overhead.
- It scales well: the greater the number of processes, the larger the improvement.