End-to-end Data-flow Parallelism for Throughput Optimization in High-speed Networks
Esma Yildirim
Data Intensive Distributed Computing Laboratory, University at Buffalo (SUNY)
Condor Week 2011

Motivation
- Data grows ever larger, hence the need for speed in transferring it
- Technology advances with the introduction of high-speed networks and complex computer architectures, which are not yet fully utilized
- Still, many questions remain open:
  - "I cannot get the speed I am supposed to get from the network."
  - "I have a 10G high-speed network connecting supercomputers. Why do I still get under 1G throughput?"
  - "I can't wait for a new protocol to replace the current ones. Why can't I get high throughput with what I have at hand?"
  - "OK, maybe I am asking too much, but I want the optimal settings that achieve maximal throughput."
  - "I want high throughput without congesting the traffic too much. How can I do it at the application level?"

Introduction
- Users of data-intensive applications need intelligent services and schedulers that provide models and strategies to optimize their data transfer jobs
- Goals:
  - Maximize throughput
  - Minimize model overhead
  - Do not cause contention among users
  - Use a minimum number of end-system resources

Introduction
- Current optical technology supports 100G transport; hence, utilizing the network fully challenges the middleware to provide faster data transfer speeds
- Achieving multiple Gbps of throughput has become a burden over TCP-based networks
- Parallel streams can overcome TCP's inefficient network utilization
- Finding the optimal number of streams is a challenging task
- With faster networks, end systems have become the major source of bottleneck
  - CPU, NIC and disk bottlenecks
- We provide models to decide on the optimal level of parallelism and the number of CPU/disk stripes

Outline
- Stork Overview
- End-system Bottlenecks
- End-to-end Data-flow Parallelism
- Optimization Algorithm
- Conclusions and Future Work

Stork Data Scheduler
- Implements state-of-the-art models and algorithms for data scheduling and optimization
- Started as part of the Condor project, as the PhD thesis of Dr. Tevfik Kosar
- Currently developed at University at Buffalo and funded by NSF
- Heavily uses some Condor libraries such as ClassAds and DaemonCore

Stork Data Scheduler (cont.)
- Stork v.2.0 is available with enhanced features
- Supports more than 20 platforms (mostly Linux flavors)
- Windows and Azure Cloud support planned soon
- The most recent enhancement: Throughput Estimation and Optimization Service

End-to-end Data Transfer
- Method to improve the end-to-end data transfer throughput: application-level data flow parallelism
  - Network-level parallelism (parallel streams)
  - Disk/CPU-level parallelism (stripes)

Network Bottleneck
Step 1: Effect of Parallel Streams on Disk-to-disk Transfers
- Parallel streams can improve the data throughput, but only to a certain extent
- Disk speed presents a major limitation
- Parallel streams may even have an adverse effect if the disk speed upper limit is already reached

Disk Bottleneck
Step 2: Effect of Parallel Streams on Memory-to-memory Transfers and CPU Utilization
- Once the disk bottleneck is eliminated, parallel streams improve the throughput dramatically
- Throughput either becomes stable or falls after reaching its peak, due to network or end-system limitations; e.g. the network interface card limit (10G) could not be reached (7.5 Gbps inter-node)

CPU Bottleneck
Step 3: Effect of Striping and Removal of CPU Bottleneck
- Striped transfers improve the throughput dramatically
- The network card limit is reached for inter-node transfers (9 Gbps)

Prediction of Optimal Parallel Stream Number
- Throughput formulation: Newton's Iteration Model
- a', b' and c' are three unknowns to be solved; hence three throughput measurements at different parallelism levels (n) are needed (a hedged sketch of this fit follows below)
- Sampling strategy:
  - Exponentially increasing parallelism levels
  - Choose points that are not close to each other
  - Select points that are powers of 2: 1, 2, 4, 8, …, 2^k
  - Stop when the throughput starts to decrease, or increases only very slowly compared to the previous level
- Selection of 3 data points:
  - From the available sampling points
  - For every 3-point combination, calculate the predicted throughput curve
  - Find the distance between the actual and predicted throughput curves
  - Choose the combination with the minimum distance
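The throughput formula itself is not reproduced on the slide; in the authors' related publications the aggregate throughput of n parallel streams is commonly modeled as Th(n) = n / sqrt(a'·n^2 + b'·n + c'), and the sketch below assumes that form. Since 1/Th(n)^2 = a' + b'/n + c'/n^2 is linear in the three unknowns, three sampled (n, throughput) pairs determine them directly. The helper names and the sample measurements are illustrative assumptions, not values from the talk.

```python
import numpy as np

# Hedged sketch (not the authors' implementation): fit the assumed model
# Th(n) = n / sqrt(a*n^2 + b*n + c) from three sampled (n, throughput) pairs
# taken at exponentially increasing parallelism levels, then pick the
# parallelism level that maximizes the predicted throughput.

def fit_model(samples):
    """samples: three (parallelism level, measured throughput in Mbps) pairs."""
    # 1/Th(n)^2 = a + b/n + c/n^2 is linear in (a, b, c), so three
    # measurements determine the coefficients exactly.
    A = np.array([[1.0, 1.0 / n, 1.0 / n**2] for n, _ in samples])
    y = np.array([1.0 / th**2 for _, th in samples])
    return np.linalg.solve(A, y)                    # -> array([a, b, c])

def predict_throughput(n, coeffs):
    a, b, c = coeffs
    return n / np.sqrt(a * n**2 + b * n + c)

def optimal_streams(coeffs, max_n=64):
    # Choose the integer stream count with the highest predicted throughput.
    return max(range(1, max_n + 1), key=lambda n: predict_throughput(n, coeffs))

# Illustrative (made-up) measurements at levels 1, 2 and 4:
coeffs = fit_model([(1, 900.0), (2, 1500.0), (4, 1800.0)])
n_opt = optimal_streams(coeffs)
print(n_opt, predict_throughput(n_opt, coeffs))
```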

Flow Model of End-to-end Throughput
- CPU nodes are modeled as the nodes of a maximum-flow problem
- Memory-to-memory transfers are simulated with dummy source and sink nodes
- The capacities of the disk and the network are found by applying the parallel stream model, taking the resource capacities (NIC & CPU) into consideration
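As a concrete illustration of what "converting capacities into a flow problem" means, the hypothetical sketch below builds a tiny capacitated graph with dummy source and sink nodes and computes its maximum flow. The topology and the capacity values are assumptions chosen for illustration, not the paper's actual graph.

```python
import networkx as nx

# Hypothetical illustration (assumed topology and capacities, not the paper's
# graph): a dummy source and sink bracket the disk systems, NICs and the
# network path, each modeled as a capacitated arc.
G = nx.DiGraph()
U_disk, U_nic, U_network = 4000, 10000, 9000       # assumed capacities in Mbps

G.add_edge("src", "disk_s", capacity=U_disk)        # source disk system
G.add_edge("disk_s", "nic_s", capacity=U_nic)       # source NIC
G.add_edge("nic_s", "nic_d", capacity=U_network)    # network path
G.add_edge("nic_d", "disk_d", capacity=U_nic)       # destination NIC
G.add_edge("disk_d", "sink", capacity=U_disk)       # destination disk system

max_th, _ = nx.maximum_flow(G, "src", "sink")
print(max_th)   # upper bound on end-to-end throughput (the disk cap, 4000, here)
```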

Flow Model of End-to-end Throughput
- Convert the end-system and network capacities into a flow problem
- Goal: provide the maximal possible data transfer throughput given real-time traffic (maximize Th), deciding:
  - Number of streams per stripe (N_si)
  - Number of stripes per node (S_x)
  - Number of nodes (N_n)
- Assumptions:
  - Parameters not given, found by the model:
    - Available network capacity (U_network)
    - Available disk system capacity (U_disk)
  - Parameters given:
    - CPU capacity (U_CPU), taken as 100% assuming the nodes are idle at the beginning of the transfer
    - NIC capacity (U_NIC)
    - Number of available nodes (N_avail)
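The optimization program itself is not spelled out on this slide; a plausible reconstruction from the variables listed here and on the next slide (an assumption, not the authors' exact formulation) would look like:

```latex
\[
\begin{aligned}
\text{maximize}\quad & Th \;=\; \textstyle\sum_{k} X_{f_k} \\
\text{subject to}\quad
 & X_{ij} \;=\; \textstyle\sum_{k:\, f_k \text{ uses } i \to j} X_{f_k} \;\le\; U_{ij}
   && \text{arc capacities (disk, network, NIC)} \\
 & X_{f_k} \;\le\; U_f
   && \text{each stripe carries at most the optimal per-flow capacity} \\
 & \text{CPU load implied by the } Sx_{ij} \text{ stripes at a node} \;\le\; U_{CPU}
   && \text{per-node CPU budget} \\
 & N_n \;\le\; N_{avail}
   && \text{node budget}
\end{aligned}
\]
```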

Flow Model of End-to-end Throughput
- Variables:
  - U_ij = total capacity of the arc from node i to node j
  - U_f = maximal (optimal) capacity of each flow (stripe)
  - N_opt = number of streams achieving U_f
  - X_ij = total amount of flow passing i -> j
  - X_fk = amount of each flow (stripe)
  - N_si = number of streams to be used for each flow X_fk on arc i -> j
  - Sx_ij = number of stripes passing i -> j
  - N_n = number of nodes
- Inequalities:
  - There is a high positive correlation between the throughput of parallel streams and CPU utilization
  - The relation between CPU utilization and throughput is modeled as linear, with two coefficients a and b (see the sketch below)
  - a and b are solved by least-squares regression over the sampled throughput and CPU utilization measurements
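The linear relation itself is not reproduced on the slide; the sketch below assumes the form U_cpu = a·Th + b (the direction of the fit is an assumption, and the sample numbers are made up) and shows the least-squares step plus how the fit could yield a CPU-derived capacity for the flow model.

```python
import numpy as np

# Hedged sketch: least-squares fit of an assumed linear relation
# U_cpu = a * Th + b from sampled (throughput, CPU utilization) pairs.
# All numbers below are illustrative, not measured values.
th    = np.array([900.0, 1800.0, 3500.0, 6600.0])    # Mbps, from sampling runs
u_cpu = np.array([0.12, 0.25, 0.47, 0.88])            # fraction of one core

a, b = np.polyfit(th, u_cpu, deg=1)   # slope a, intercept b (least squares)

# Estimated throughput one stripe can sustain before saturating a single core,
# usable as the CPU-derived arc capacity in the flow model:
th_per_core = (1.0 - b) / a
print(a, b, th_per_core)
```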

OPT_B Algorithm for Homogeneous Resources
- This algorithm finds the best parallelism values for maximal throughput on homogeneous resources (a speculative outline follows below)
- Input parameters:
  - A set of sampling values from the sampling algorithm (Th_N)
  - Destination CPU and NIC capacities (U_CPU, U_NIC)
  - Available number of nodes (N_avail)
- Output:
  - Number of streams per stripe (N_si)
  - Number of stripes per node (S_x)
  - Number of nodes (N_n)
- Assumes both source and destination nodes are idle
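The slide gives only the algorithm's inputs and outputs; the outline below is a speculative sketch (an assumption, not the published OPT_B) of how such a search could proceed, reusing predict_throughput and optimal_streams from the earlier sketch: pick the per-stripe stream count from the fitted model, then keep doubling the number of stripes, spreading them over additional nodes, until the predicted throughput is capped by the NIC, the network, or the node budget.

```python
# Speculative outline (assumptions, not the published OPT_B algorithm);
# reuses predict_throughput/optimal_streams from the earlier sketch.

def opt_b(coeffs, u_nic_mbps, u_network_mbps, cores_per_node, n_avail, max_stripes=64):
    n_si = optimal_streams(coeffs)                 # streams per stripe from the model
    th_stripe = predict_throughput(n_si, coeffs)   # predicted Mbps per stripe

    best_stripes = 1
    best_th = min(th_stripe, u_nic_mbps, u_network_mbps)
    stripes = 2
    while stripes <= max_stripes:
        nodes = -(-stripes // cores_per_node)      # ceil: at most one stripe per core
        if nodes > n_avail:
            break
        th = min(stripes * th_stripe,              # aggregate of all stripes
                 nodes * u_nic_mbps,               # NIC caps each node
                 u_network_mbps)                   # network caps the total
        if th <= best_th:                          # stop when doubling stops helping
            break
        best_stripes, best_th = stripes, th
        stripes *= 2

    n_n = -(-best_stripes // cores_per_node)
    return n_si, best_stripes, n_n, best_th        # N_si, S_x, N_n, predicted Mbps

# Example with the case-study setting (10G NIC, ~9G network, 4 cores, 2 nodes):
# n_si, s_x, n_n, th = opt_b(coeffs, 10000, 9000, 4, 2)
```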

OPT_B - Application Case Study
- Systems: Oliver, Eric
- Network: LONI (local area)
- Processor: 4 cores
- Network interface: 10GigE Ethernet
- Transfer: disk-to-disk (Lustre)
- Available number of nodes: 2

OPT_B - Application Case Study (cont.)
- Th_Nsi = 903.41 Mbps at p = 1
- Th_Nsi = … Mbps at p = 2
- Th_Nsi = … Mbps at p = 4
- Th_Nsi = … Mbps at p = 8
- N_opt = 3, N_si = 2
[Flow-graph figure: 9 Gbps network, per-arc N_si and Sx_ij assignments]

OPT_B - Application Case Study (cont.)
- S_x = 2: Th_Sx 1,2,2 = …
- S_x = 4: Th_Sx 1,4,2 = …
- S_x = 8: Th_Sx 2,4,2 = …
[Flow-graph figure: 9 Gbps network, per-arc N_si and Sx_ij assignments]

OPT_B - LONI memory-to-memory, 10G

OPT_B - LONI memory-to-memory, 1G: Algorithm Overhead

Conclusions
- We have achieved end-to-end data transfer throughput optimization with data flow parallelism
  - Network-level parallelism: parallel streams
  - End-system parallelism: CPU/disk striping
- At both levels we have developed models that predict the best combination of stream and stripe numbers

Future Work
- We have focused on the TCP and GridFTP protocols; we would like to adapt our models to other protocols
- We have tested these models on a 10G network and plan to test them on a faster network
- We would like to increase the heterogeneity among the nodes at the source or destination

Acknowledgements
- This project is sponsored in part by the National Science Foundation under award numbers:
  - CNS (CAREER) - Research & Theory
  - OCI (Stork) - SW Design & Implementation
  - CCF (CiC) - Stork for Windows Azure
- We would also like to thank Dr. Miron Livny and the Condor Team for their continuous support of the Stork project.