Integrated Maximum Flow Algorithm for Optimal Response Time Retrieval of Replicated Data Nihat Altiparmak, Ali Saman Tosun The University of Texas at San Antonio

Declustering and Parallel I/O
ICPP 2012, 9/11/2012, Department of Computer Science, UTSA
[Figure: buckets declustered across Disk 0, Disk 1, Disk 2, and Disk 3; label: Disk Access]

Replication
- Replication is a common technique used for redundancy and better performance in declustering schemes
- Retrieval using the first copy requires two accesses
- We can use the second copy to retrieve in one access
- Problem: Which copy to use for the best performance?
[Figure: Copy 1 and Copy 2]

Optimal Response Time Retrieval: Problem Definition
- N disks
- |Q| buckets
- Each bucket can be replicated among multiple disks
- Find a retrieval schedule so that the response time of the query Q is minimized

Basic Retrieval Problem
Assumptions:
1. Disks are homogeneous
2. No initial load
3. No network delay
Max-flow solution [Chen'93]:
- Check whether max-flow = |Q| (= 6 in the example)
- If not, increment the capacities of the disk-t edges and call max-flow again
- O(|Q|) calls in the worst case
[Figure: flow network from source s through the bucket vertices [0,0], [0,1], [1,0], [1,1], [2,0], [2,1] and the disk vertices to sink t]
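The black-box procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation from [Chen'93]: the function names (`build_network`, `max_flow`, `black_box_retrieval`), the input format (`replicas[b]` lists the disks holding bucket `b`), the choice of Edmonds-Karp as the max-flow routine, and the starting capacity ceil(|Q|/N) are all hypothetical choices made for the example.

```python
from collections import deque

def build_network(replicas, n_disks, disk_cap):
    """Residual capacities for s -> bucket -> disk -> t (all names are illustrative)."""
    cap = {"s": {}, "t": {}}
    for d in range(n_disks):
        cap[("disk", d)] = {"t": disk_cap}      # disk-t edge: accesses allowed per disk
        cap["t"][("disk", d)] = 0
    for b, disks in enumerate(replicas):
        cap["s"][("bucket", b)] = 1             # each bucket is retrieved exactly once
        cap[("bucket", b)] = {"s": 0}
        for d in disks:
            cap[("bucket", b)][("disk", d)] = 1
            cap[("disk", d)][("bucket", b)] = 0
    return cap

def max_flow(cap, s, t):
    """Edmonds-Karp: augment along shortest residual paths until none remains."""
    flow = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow
        bottleneck, v = float("inf"), t
        while parent[v] is not None:            # find the path bottleneck
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:            # update residual capacities
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u
        flow += bottleneck

def black_box_retrieval(replicas, n_disks):
    """Return (accesses per disk, number of max-flow calls)."""
    q = len(replicas)
    disk_cap = -(-q // n_disks)                 # assumed start: ceil(|Q| / N)
    calls = 0
    while disk_cap <= q:
        calls += 1
        cap = build_network(replicas, n_disks, disk_cap)   # flow is rebuilt from scratch
        if max_flow(cap, "s", "t") == q:
            return disk_cap, calls
        disk_cap += 1                           # increment the disk-t capacities
    raise ValueError("some bucket has no replica on any disk")
```

For instance, `black_box_retrieval([[0], [0], [0], [1]], 2)` needs two max-flow calls, because three of the four buckets exist only on disk 0. Every call recomputes the whole flow, which is the repeated work that an integrated algorithm tries to avoid.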

Generalized Retrieval Problem
Heterogeneous Disks
- Disks might have different response times depending on the rotational speed (7.2K, 10K, 15K RPM, etc.), interface (SCSI, IDE, etc.), and underlying technology (HDD, SSD, etc.)
- Retrieval from the fastest disk is preferred
Multi-site Retrieval and Network Delay
- Data might be distributed among multiple storage arrays located on different servers
- Retrieval from the server with minimum network delay is preferred
Initial Load
- A disk might have an initial load to be retrieved from previous queries
- Retrieval from the disk with minimum or possibly no initial load is preferred

Generalized Retrieval Problem
- The generalized retrieval problem can be solved using the binary capacity scaling and capacity incrementation techniques proposed in [Altiparmak'12]
[Figure: a hybrid storage array (15K RPM HDDs and SSD), an SSD storage array, and an HDD storage array (10K RPM HDDs), annotated with Initial Load and Network Delay]
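To make the generalized setting concrete, here is a rough Python sketch that searches for the smallest completion time T at which every bucket can be scheduled. It is not the binary capacity scaling technique of [Altiparmak'12], whose details are not given in these slides; it is a plain binary search over T under an assumed linear cost model in which disk j can serve floor((T - delay[j] - load[j]) / cost[j]) buckets by time T. The names `cost`, `delay`, `load`, `feasible`, and `min_response_time` are hypothetical, and all times are assumed to be non-negative integers with cost[j] >= 1.

```python
def disk_capacities(T, cost, delay, load):
    """Buckets each disk can serve by time T under the assumed linear cost model."""
    return [max(0, (T - delay[j] - load[j]) // cost[j]) for j in range(len(cost))]

def feasible(replicas, caps):
    """True if every bucket can be assigned to a replica disk within the capacities.

    Augmenting-path search specialised to the bucket/disk network: a full disk
    may hand one of its buckets over to another replica to make room.
    """
    assigned = [[] for _ in caps]               # buckets currently scheduled on each disk

    def place(bucket, seen):
        for d in replicas[bucket]:
            if d in seen:
                continue
            seen.add(d)
            if len(assigned[d]) < caps[d]:      # free slot on this disk
                assigned[d].append(bucket)
                return True
            for i, other in enumerate(assigned[d]):
                if place(other, seen):          # relocate another bucket, reuse its slot
                    assigned[d][i] = bucket
                    return True
        return False

    return all(place(b, set()) for b in range(len(replicas)))

def min_response_time(replicas, cost, delay, load):
    """Smallest integer T for which a complete retrieval schedule exists."""
    q = len(replicas)
    lo = 0
    # By this time every disk could serve all |Q| buckets on its own.
    hi = max(delay[j] + load[j] + q * cost[j] for j in range(len(cost)))
    while lo < hi:
        mid = (lo + hi) // 2
        if feasible(replicas, disk_capacities(mid, cost, delay, load)):
            hi = mid                            # feasibility is monotone in T
        else:
            lo = mid + 1
    return lo
```

With a fast local disk (cost 1, no delay) and a slow remote disk (cost 5, delay 2), four buckets replicated on both are best read entirely from the fast disk, giving T = 4. If the fast disk carries an initial load of 6, the search instead finds T = 9, with one bucket moved to the slow disk.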

Generalized Retrieval Problem
- Fact: Deciding the retrieval schedule is a time-critical issue
- Observation: Max-flow is called multiple times as a black-box function with similar capacity values
- Limitations:
  - Flow values within consecutive calls cannot be conserved
  - The same flow calculations are performed over and over
- Contributions:
  - Can we conserve the flows within multiple runs of max-flow? Integrated maximum flow algorithm
  - Can we make it even faster? Parallel integrated maximum flow algorithm
[Figure: Site 1 and Site 2; labels: RUN MAX-FLOW, Use Capacity Scaling!, Use Capacity Incrementation!]

Talk Outline
- Motivation and Background
- Ford-Fulkerson Based Solution
- Push-relabel Based Solution
- Parallel Push-relabel Solution
- Evaluation
- Conclusion

Ford-Fulkerson Based Solution
- Uses the augmenting path method
- Repeatedly sends flow along augmenting paths until no such path remains
- The Ford-Fulkerson based integrated algorithm proposed in [Chen'93] for the basic retrieval problem can easily be modified for the generalized case
[Figure: algorithm listings for the Basic Retrieval Case [Chen'93] and the Generalized Retrieval Case]
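The idea of integrating capacity incrementation into an augmenting-path algorithm can be illustrated with the following Python sketch for the basic (homogeneous) case. It is a simplified reading of the approach, not the listing from [Chen'93]: the function names, the dictionary-based residual graph, and the input format (`replicas[b]` lists the disks holding bucket `b`) are assumptions made for the example. The key point is that the residual graph is kept between capacity increments, so flow found earlier is never recomputed.

```python
from collections import deque

def augment(res, s, t):
    """Send one unit along a shortest residual s-t path; return False if none exists.

    One unit is the bottleneck of every path here, because each path starts
    with a source-bucket edge of capacity 1.
    """
    parent = {s: None}
    queue = deque([s])
    while queue and t not in parent:
        u = queue.popleft()
        for v, c in res[u].items():
            if c > 0 and v not in parent:
                parent[v] = u
                queue.append(v)
    if t not in parent:
        return False
    v = t
    while parent[v] is not None:
        u = parent[v]
        res[u][v] -= 1
        res[v][u] += 1
        v = u
    return True

def integrated_retrieval(replicas, n_disks):
    """Return (accesses per disk, {bucket: disk}) while conserving flow across increments."""
    q = len(replicas)
    res = {"s": {}, "t": {}}
    for d in range(n_disks):
        res[("disk", d)] = {"t": 0}             # disk-t capacities start at zero
        res["t"][("disk", d)] = 0
    for b, disks in enumerate(replicas):
        res["s"][("bucket", b)] = 1
        res[("bucket", b)] = {"s": 0}
        for d in disks:
            res[("bucket", b)][("disk", d)] = 1
            res[("disk", d)][("bucket", b)] = 0
    flow, accesses = 0, 0
    while flow < q:
        if accesses == q:
            raise ValueError("some bucket has no replica on any disk")
        accesses += 1
        for d in range(n_disks):
            res[("disk", d)]["t"] += 1          # capacity incrementation; existing flow is kept
        while augment(res, "s", "t"):           # only the missing flow is searched for
            flow += 1
    # A bucket is read from disk d when the reverse residual edge d -> bucket carries a unit.
    schedule = {b: d for b, disks in enumerate(replicas)
                for d in disks if res[("disk", d)][("bucket", b)] == 1}
    return accesses, schedule
```

As a usage example, `integrated_retrieval([[0, 1], [0, 1], [0, 2], [0, 3]], 4)` finds a schedule with one access per disk even though every bucket has a copy on disk 0, which mirrors the replication example earlier in the talk.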

Push-relabel Based Solution
- Sends flow along individual edges instead of the entire augmenting path
- Leads to better performance [Goldberg'88]
- Most practical implementations are based on the push-relabel algorithm
[Figure: listings for the Push-relabel Algorithm and the Generalized Retrieval Case, annotated with "Initialization" and "Condition to stop (Flow = |Q|)"]
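For readers unfamiliar with push-relabel, the following Python sketch shows a generic FIFO variant on a capacity matrix. It is a textbook-style illustration, not the integrated algorithm from the slides and not LEDA's implementation; the helper `retrieval_network` and its vertex numbering (0 for s, then buckets, then disks, then t) are hypothetical conventions chosen for the example.

```python
from collections import deque

def push_relabel(capacity, s, t):
    """Generic FIFO push-relabel; returns the max-flow value (the excess collected at t)."""
    n = len(capacity)
    flow = [[0] * n for _ in range(n)]          # antisymmetric: flow[v][u] == -flow[u][v]
    height = [0] * n
    excess = [0] * n
    height[s] = n
    active = deque()
    for v in range(n):                          # initialization: saturate all edges out of s
        if capacity[s][v] > 0:
            flow[s][v] = capacity[s][v]
            flow[v][s] = -capacity[s][v]
            excess[v] += capacity[s][v]
            excess[s] -= capacity[s][v]
            if v != t:
                active.append(v)
    while active:
        u = active.popleft()
        while excess[u] > 0:                    # discharge u
            pushed = False
            for v in range(n):
                residual = capacity[u][v] - flow[u][v]
                if residual > 0 and height[u] == height[v] + 1:
                    delta = min(excess[u], residual)
                    flow[u][v] += delta         # push along a single admissible edge
                    flow[v][u] -= delta
                    excess[u] -= delta
                    if v != s and v != t and excess[v] == 0:
                        active.append(v)        # v becomes active
                    excess[v] += delta
                    pushed = True
                    if excess[u] == 0:
                        break
            if not pushed:                      # relabel: lift u just above its lowest neighbour
                height[u] = 1 + min(height[v] for v in range(n)
                                    if capacity[u][v] - flow[u][v] > 0)
    return excess[t]

def retrieval_network(replicas, n_disks, disk_cap):
    """Capacity matrix for s -> buckets -> disks -> t (illustrative vertex numbering)."""
    q = len(replicas)
    n = q + n_disks + 2
    s, t = 0, n - 1
    cap = [[0] * n for _ in range(n)]
    for b, disks in enumerate(replicas):
        cap[s][1 + b] = 1
        for d in disks:
            cap[1 + b][1 + q + d] = 1
    for d in range(n_disks):
        cap[1 + q + d][t] = disk_cap
    return cap, s, t
```

On the network for `replicas = [[0], [0], [0], [1]]` with two disks, the flow is 3 when each disk allows two accesses and reaches |Q| = 4 when each disk allows three, which is the stopping condition (Flow = |Q|) named on the slide.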

Push-relabel Based Solution
- Considers all possible retrieval times, starting from the minimum, in an exhaustive search manner. Worst case complexity is [formula missing from the transcript]
- Adapt the binary capacity scaling technique presented in [Altiparmak'12]
  - Worst case complexity becomes [formula missing from the transcript]
- Performs better in practice thanks to the flow conservation property
- Push-relabel operations are unchanged, so the integrated algorithm can easily be parallelized!

Parallel Push-relabel Solution
- Most new generation storage arrays are powered with multi-core processors
  - EMC Symmetrix VMAX has four quad-core 2.33 GHz Intel Xeon processors
- We can reduce the computation time further by using a parallel push-relabel implementation
- Many parallel push-relabel algorithms have been proposed
  - [Goldberg'88], [Anderson'92], [Bader'05], [Hong'11]
- The most recent implementation, in [Hong'11], claims to outperform the others

Parallel Push-relabel Solution: [Hong'11]'s Implementation
- Uses the push-relabel technique proposed in [Goldberg'88]
- Multiple processes/threads do not need any locks or barriers to protect the push and relabel operations
- Each thread independently determines its own termination without using any locks or barriers
- Requires atomic read-modify-write instructions
  - Shared flow and excess values are updated by multiple threads using atomic operations
- Complexity: [formula missing from the transcript]
- We use [Hong'11]'s implementation for our parallel push-relabel based solution

Evaluation
- Algorithms are implemented in C++, except the parallel implementation, which uses C with pthreads
- We used the LEDA library for the graph structure and the black-box max-flow calculation
  - LEDA uses Goldberg and Tarjan's push-relabel algorithm for max-flow (O(|V|^3) complexity)
- The integrated push-relabel algorithm is implemented on top of LEDA's max-flow implementation for fair comparison
- Algorithms are compiled using gcc/g++ (version number missing from the transcript) with the compiler optimization levels resulting in the fastest execution time

Evaluation: Query Loads
Load 1
- The distribution of queries is similar to the distribution of the queries in a particular query type (Range, Arbitrary, or Connected)
- Expected bucket size is [value missing from the transcript] for range queries and [value missing from the transcript] for arbitrary queries
Load 2
- The distribution of queries is uniform
- Expected bucket size is [value missing from the transcript]
Load 3
- Smaller queries are more likely
- Expected bucket size is much smaller than in the other loads, [value missing from the transcript]

Execution Time: Ford-Fulkerson vs. Push-relabel
[Figure: three panels, Load 1, Load 2, and Load 3]

Execution Time Ratio: Push-relabel Black-Box/Integrated
[Figure: three panels, Load 1, Load 2, and Load 3]

Execution Time Ratio: Push-relabel Sequential/Parallel
[Figure: three panels, labeled Load 1, Load 2, and Load 1 in the transcript]

Conclusion
- The integrated push-relabel based algorithm is up to 2.5X faster than the existing black-box counterpart
- The parallel implementation achieves a maximum speed-up of 1.7X (1.2X on avg.) over the sequential integrated algorithm using two threads
  - For small queries of load 3 and more than two threads, we observed a load-balancing issue
- Together with the parallel push-relabel implementation, the proposed algorithm runs up to 4.25X (3X on avg.) faster than the existing black-box algorithm

References
[Altiparmak'12] Nihat Altiparmak and A. Ş. Tosun. Generalized optimal response time retrieval of replicated data from storage arrays. Technical Report.
[Anderson'92] Richard J. Anderson and Joao C. Setubal. On the parallel implementation of Goldberg's maximum flow algorithm. In Proceedings of the Fourth Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA '92, pages 168–177, New York, NY, USA. ACM.
[Bader'05] David A. Bader and Vipin Sachdeva. A cache-aware parallel implementation of the push-relabel network flow algorithm and experimental evaluation of the gap relabeling heuristic. In ISCA PDCS, pages 41–48.
[Hong'11] Bo Hong and Zhengyu He. An asynchronous multithreaded algorithm for the maximum network flow problem with nonblocking global relabeling heuristic. IEEE Transactions on Parallel and Distributed Systems, 22(6):1025–1033, June.
[Chen'93] L. T. Chen and D. Rotem. Optimal response time retrieval of replicated data. In ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 36–44.
[Goldberg'88] Andrew V. Goldberg and Robert E. Tarjan. A new approach to the maximum flow problem. Journal of the ACM, 35:921–940.

Thank You! Questions?