ANALYSIS OF USER SUBMISSION BEHAVIOR ON HPC AND HTC

Presentation transcript:

ANALYSIS OF USER SUBMISSION BEHAVIOR ON HPC AND HTC Rafael Ferreira da Silva USC Information Sciences Institute CS&T Directors' Meeting

OUTLINE
- Introduction: HPC and HTC Job Schedulers; Problem Statement
- Workload Characterization: Mira (ALCF); Compact Muon Solenoid (CMS)
- User Behavior in HPC: Think Time; Runtime and Waiting Time; Job Notifications
- User Behavior in HTC: Think Time; Batches of Jobs; Batch-Wise Submission
- Summary: Conclusions; Future Research Directions

REFERENCES
- S. Schlagkamp, R. Ferreira da Silva, W. Allcock, E. Deelman, and U. Schwiegelshohn, "Consecutive Job Submission Behavior at Mira Supercomputer," 25th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC), 2016.
- S. Schlagkamp, R. Ferreira da Silva, E. Deelman, and U. Schwiegelshohn, "Understanding User Behavior: from HPC to HTC," International Conference on Computational Science (ICCS), 2016.

INTRODUCTION
Feedback-Aware Performance Evaluation of Job Schedulers
- Evaluation with previously recorded workload traces captures only one instantiation of a dynamic process: system performance drives user reaction, which determines the generated load, until a stable state is reached
- "User reaction is a mystery" (D. G. Feitelson, 2015)
- Gap between theory and practice: further understanding of user reactions to system performance is needed (U. Schwiegelshohn, 2014)

THINK TIME
- How do users react to system performance? A data-driven analysis
- Think time: the time between a job completion and the consecutive job submission

OUTLINE > Workload Characterization

WORKLOAD CHARACTERISTICS: MIRA (ALCF), 2014

System totals: 49,152 nodes; 786,432 cores; 487 users; 78,782 jobs; 5,698 million CPU hours; 6,093 s average runtime

Top science fields:
  Physics:            73 users;  24,429 jobs;  2,256 million CPU hours;  7,147 s avg. runtime
  Materials Science:  77 users;  12,546 jobs;    895 million CPU hours;  5,820 s avg. runtime
  Chemistry:          51 users;  10,286 jobs;    810 million CPU hours;  6,131 s avg. runtime

WORKLOAD CHARACTERISTICS: COMPACT MUON SOLENOID (CMS)

  August 2014:   1,435,280 jobs; 392 users; 75 execution sites; 15,484 execution nodes; 792,603 completed jobs; 385,447 jobs with exit code != 0; 9,444.6 s avg. runtime; 55.3 MB avg. disk usage
  October 2014:  1,638,803 jobs; 408 users; 72 execution sites; 15,034 execution nodes; 816,678 completed jobs; 476,391 jobs with exit code != 0; 9,967.1 s avg. runtime; 32.9 MB avg. disk usage

WORKLOAD CHARACTERIZATION
Figures: jobs' resource requirements at Mira; job submission interarrival times per day; job submission interarrival times per hour

OUTLINE > User Behavior in HPC

USER THINK TIME
- The user's think time quantifies the timespan between a job completion and the submission of the next job (by the same user)
- Only positive think times of less than 8 hours between subsequent job submissions are accounted for
- No change in the past 20 years
- Figure: average think times in several traces from the Parallel Workloads Archive and from Mira
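As a concrete illustration (not taken from the talk), here is a minimal sketch of how per-user think times could be computed from a job trace. The record fields user, submit_time, and end_time are assumed names; the positive-value and 8-hour filters follow the definition above.

    from collections import defaultdict

    MAX_THINK_TIME = 8 * 3600  # keep only think times below 8 hours (slide definition)

    def think_times(jobs):
        """Compute per-user think times from a job trace.

        Each job is a dict with 'user', 'submit_time', and 'end_time'
        (timestamps in seconds). Think time = time between a job's
        completion and the same user's next submission; only positive
        values shorter than 8 hours are kept.
        """
        by_user = defaultdict(list)
        for job in jobs:
            by_user[job['user']].append(job)

        result = defaultdict(list)
        for user, user_jobs in by_user.items():
            user_jobs.sort(key=lambda j: j['submit_time'])
            for prev, nxt in zip(user_jobs, user_jobs[1:]):
                tt = nxt['submit_time'] - prev['end_time']
                if 0 < tt < MAX_THINK_TIME:
                    result[user].append(tt)
        return result

Averaging these per-user lists for each trace would reproduce the kind of comparison shown for the Parallel Workloads Archive traces and Mira.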

CORRELATIONS: RESPONSE TIME (WAITING TIME + RUNTIME)
- General behavior: subsequent submission behavior is independent of the science field/application
- Low values (nearly instantaneous submissions) are typically due to the use of automated scripts
- Peaks (e.g., Engineering) are due to outliers (about 8 h)

CORRELATIONS: RUNTIME AND WAITING TIME
- Runtime and waiting time have an equal influence on user behavior
- Reducing queuing times would not significantly improve think times for long-running jobs
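One way to quantify "equal influence", sketched here with no claim that this is the statistic used in the talk: compare the correlation of think time with runtime against its correlation with waiting time. The variable names in the usage comment are placeholders.

    from statistics import mean

    def pearson(xs, ys):
        """Plain Pearson correlation coefficient (no external dependencies)."""
        mx, my = mean(xs), mean(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        var_x = sum((x - mx) ** 2 for x in xs)
        var_y = sum((y - my) ** 2 for y in ys)
        return cov / (var_x * var_y) ** 0.5

    # Hypothetical usage: similar coefficients would indicate that runtime and
    # waiting time influence think time to a comparable degree.
    # r_runtime = pearson(runtimes, think_times)
    # r_waiting = pearson(waiting_times, think_times)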

CORRELATIONS: NUMBER OF NODES
- For small jobs (~10^3 nodes), average think times are relatively small (<1.5 h)
- For larger jobs, the think time substantially increases:
  - Users do not fully understand the behavior of their applications as the number of cores increases
  - Resource allocation for larger jobs is delayed
  - Larger jobs require additional settings and refinements (increased job complexity)

CORRELATIONS: WORKLOAD
- Workload of a job: w(j) = processing time × number of nodes
- Think time is heavily correlated with the workload
- Workload has more impact than the number of nodes alone, as it also considers runtime
- Similar conclusions to the number-of-nodes analysis
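To make the workload correlation concrete, a rough sketch of binning the (job, think time) pairs from the earlier sketch by w(j); the field names runtime and nodes and the order-of-magnitude bins are assumptions, only the formula w(j) comes from the slide.

    import math
    from collections import defaultdict

    def workload(job):
        # w(j) = processing time x number of nodes (formula from the slide)
        return job['runtime'] * job['nodes']

    def mean_think_time_by_workload(pairs):
        """Average think time per order of magnitude of workload w(j).

        `pairs` is an iterable of (job, think_time) tuples, where think_time
        is the time until the same user's next submission.
        """
        bins = defaultdict(list)
        for job, tt in pairs:
            bins[int(math.log10(max(workload(job), 1)))].append(tt)
        return {10 ** b: sum(v) / len(v) for b, v in sorted(bins.items())}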

ANALYSIS OF JOB CHARACTERISTICS
- Small (<= 512 nodes) and less complex (workload <= 10^6) jobs represent 49.2% of total jobs and consume less than 277 CPU hours
- More complex jobs do yield higher think times; however, the behavior is similar whether runtime or waiting time prevails
- Think times are small when runtime prevails
- User behavior is not impacted by the job size; complex jobs require more think time
- Lack of accurate runtime estimates

ANALYSIS OF JOB NOTIFICATIONS
- 17,736 out of 78,782 jobs used the email notification mechanism
- The overall user behavior is nearly identical regardless of whether the user is notified
- Figure (large jobs): average think time as a function of response time for jobs with and without notification upon job completion

SUMMARY: USER BEHAVIOR IN HPC
- Think time: there has been no shift in think time behavior over the past 20 years; this stability is due to the currently restrictive definition used to model think time
- Runtime and waiting time: simulating submission behavior has to consider other job characteristics and system performance components
- Job notifications: a notification mechanism has no influence on the subsequent user behavior, so there is no need to model user (un)awareness of job completion in performance evaluation simulations

OUTLINE > User Behavior in HTC

CHARACTERIZING THINK TIME: FROM HPC TO HTC
- HPC: tightly coupled applications
- HTC: embarrassingly parallel applications (atypical behavior)
- Method: think time

BATCHES OF JOBS
- User-triggered job submissions are often clustered; these clusters are denoted as batches
- Two jobs successively submitted by a user belong to the same batch if the interarrival time between their submissions is within a threshold
- Large interarrival thresholds may not capture the actual job submission behavior in HTC systems
- Figure panels: MIRA, CMS08, CMS10
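A minimal sketch of this batch definition, assuming the same trace format as in the earlier sketches; the default threshold of 60 s anticipates the value chosen on the next slide and is purely illustrative here.

    def group_into_batches(user_jobs, threshold=60):
        """Group one user's jobs into batches.

        Two consecutively submitted jobs belong to the same batch if the
        interarrival time between their submissions is within `threshold`
        seconds. `user_jobs` is a list of dicts with a 'submit_time' field.
        """
        if not user_jobs:
            return []
        user_jobs = sorted(user_jobs, key=lambda j: j['submit_time'])
        batches, current = [], [user_jobs[0]]
        for prev, job in zip(user_jobs, user_jobs[1:]):
            if job['submit_time'] - prev['submit_time'] <= threshold:
                current.append(job)
            else:
                batches.append(current)
                current = [job]
        batches.append(current)
        return batches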

REDEFINING THINK TIME
- Think time for HTC: quantifies the timespan between two subsequent submissions of bags of tasks
- Most jobs belonging to the same experiment and user (97%) are submitted within one minute
- A threshold of 60 s is used to distinguish automated bag-of-tasks submissions from human-triggered submissions (batches)
- Figure: distribution of interarrival times (CDF)
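A sketch of how the CDF value behind the 60 s choice could be checked against a trace; grouping jobs by (user, experiment) is an assumption about the available metadata, and the 97% figure comes from the slide, not from this code.

    def fraction_within(jobs_by_user_and_experiment, t=60):
        """Fraction of interarrival times (same user and experiment) <= t seconds.

        `jobs_by_user_and_experiment` maps (user, experiment) keys to lists of
        job dicts with a 'submit_time' field. The slide reports ~97% at one
        minute for the CMS traces, motivating the 60 s batch threshold.
        """
        gaps = []
        for jobs in jobs_by_user_and_experiment.values():
            jobs = sorted(jobs, key=lambda j: j['submit_time'])
            gaps.extend(b['submit_time'] - a['submit_time']
                        for a, b in zip(jobs, jobs[1:]))
        return sum(g <= t for g in gaps) / len(gaps) if gaps else 0.0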

THINK TIME IN HTC
- Both HTC workloads follow the same linear trend
- Batch-wise analysis: lower think times compared to the standard analysis based on individual jobs
- HTC bags of tasks (BoTs) are comparable to HPC jobs
- User behavior in CMS is not strictly related to waiting time
- Figure panels: response time, waiting time

ALTERNATIVE THINK TIME DEFINITIONS
- B_exp: batches based on ground-truth knowledge from the CMS traces
- B_user: bags of tasks based on jobs submitted by the same user (most common approach)
- General: jobs are treated individually
- The subsequent think time behavior of the general interpretation is closer to B_exp than to B_user
- Figure: comparison of different data interpretations for think time (CMS08, CMS10)

SUMMARY: USER BEHAVIOR IN HTC
- Think time: although HTC workloads are composed of thousands of embarrassingly parallel jobs, the general human submission behavior is comparable to HPC
- Batches of jobs: additional information is required to properly identify HTC batches
- Batch-wise submission: subsequent behavior in HPC is sensitive to job complexity, while bags of tasks (BoTs) drive the HTC behavior
- There is no strong correlation between waiting and think times in the CMS experiments, due to the dynamic behavior of queuing times within BoTs

FUTURE RESEARCH DIRECTIONS
- Extend and explore different think time definitions (e.g., based on concurrent activities)
- Model think time as a function of job complexity from past job submissions
- Cognitive studies: understand user reactions based on waiting times and satisfaction
- In-depth characterization of waiting times in bags of tasks to improve the correlation analysis between queuing time and think time

ANALYSIS OF USER SUBMISSION BEHAVIOR ON HPC AND HTC Thank You Questions? Rafael Ferreira da Silva, Ph.D. Research Assistant Professor Department of Computer Science University of Southern California rafsilva@isi.edu – http://rafaelsilva.com http://pegasus.isi.edu