
1 Distributed High Throughput Computing with HTCondor and the Open Science Grid
Miron Livny, John P. Morgridge Professor of Computer Science, Wisconsin Institutes for Discovery, University of Wisconsin-Madison

2 “… many fields today rely on high-throughput computing for discovery.”
“Many fields increasingly rely on high-throughput computing”

3 “Recommendation 2.2. NSF should (a) … and (b) broaden the accessibility and utility of these large-scale platforms by allocating high-throughput as well as high-performance workflows to them.”

4 The submitted performance analysis must include a broad range of applications and workflows requiring the highest capabilities in terms of scale (massive number of processors used in a single tightly-coupled application), high throughput (massive number of processors used by ensembles), and data analytics (large scale-out workloads for massive data analysis). Performance analysis must include projected time-to-solution improvement for all applications running on the proposed Phase 1 system over the existing BlueWaters system.

5

6

7

8 In 1996 I introduced the distinction between High Performance Computing (HPC) and High Throughput Computing (HTC) in a seminar at the NASA Goddard Space Flight Center, and a month later at the European Laboratory for Particle Physics (CERN). In June 1997, HPCWire published an interview on High Throughput Computing.

9

10 Claims for “benefits” provided by Distributed Processing Systems
P.H. Enslow, “What is a Distributed Data Processing System?” Computer, January 1978:
High Availability and Reliability
High System Performance
Ease of Modular and Incremental Growth
Automatic Load and Resource Sharing
Good Response to Temporary Overloads
Easy Expansion in Capacity and/or Function

11 “I would like to thank Miron Livny and his software engineering team for producing a high-throughput computational workflow management system that is, in my opinion, not only an improvement on its predecessors, but also on its successors.” “The HTCondor pool at the University of Manchester has computed over 1,800 years of results for our researchers. Our users simply love this software system, which is the best way of saying thanks to Miron Livny and the men and women of his HTCondor software engineering team.”

12

13 High Throughput Computing requires automation, as it is an activity that involves large numbers of jobs. FLOPY ≠ (60*60*24*7*52)*FLOPS. 300K hours * 1 job ≠ 1 hour * 300K jobs.
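To make the scale concrete, here is a minimal sketch of an HTCondor submit description file that queues a large number of independent jobs from a single submission; the executable and file names are hypothetical placeholders, not from the talk.

    # sweep.sub -- hypothetical parameter sweep; one process per input index
    universe        = vanilla
    executable      = analyze.sh
    arguments       = $(Process)
    output          = out/run_$(Process).out
    error           = out/run_$(Process).err
    log             = sweep.log
    request_cpus    = 1
    request_memory  = 1GB
    queue 300000

A single condor_submit sweep.sub places 300K jobs in the local queue; HTCondor then automates the matching, execution, and bookkeeping for all of them.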

14 “When a workflow might consist of 600,000 jobs, we don’t want to rerun them if we make a mistake. So we use DAGMan (Directed Acyclic Graph Manager, a meta-scheduler for HTCondor) and Pegasus workflow manager to optimize changes,” added Couvares who is responsible for distributed computing research and engineering in support of the LIGO Scientific Collaboration
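As a rough sketch of the mechanism behind that quote, a DAGMan input file names each node’s HTCondor submit file and the dependencies between nodes; the node and file names below are hypothetical. If a run fails partway through, DAGMan writes a rescue DAG so that only the unfinished nodes are rerun.

    # inspiral.dag -- hypothetical three-node workflow
    JOB  TemplateBank   template_bank.sub
    JOB  MatchedFilter  matched_filter.sub
    JOB  Coincidence    coincidence.sub
    PARENT TemplateBank  CHILD MatchedFilter
    PARENT MatchedFilter CHILD Coincidence
    # Retry a transiently failing node up to 3 times before failing it
    RETRY MatchedFilter 3

The workflow is started with condor_submit_dag inspiral.dag; DAGMan itself runs as an HTCondor job that submits and tracks the node jobs.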

15 Example of a LIGO Inspiral DAG (Workflow)

16 2017 Nobel Prize in Physics

17

18 2013 Nobel Prize in Physics “I think we have it, do you agree?”
Armed with the 5σ significance delivered by more than 6,000 scientists from the ATLAS and CMS experiments at the LHC, the Director General of CERN, Rolf Heuer, asked on July 4, 2012: “I think we have it, do you agree?” “We have now found the missing cornerstone of particle physics. We have a discovery. We have observed a new particle that is consistent with a Higgs boson.” The result, he noted, was “only possible because of the extraordinary performance of the accelerators, experiments and the computing grid.”

19

20

21 All Managed by HTCondor!
350K

22 HTC customers want to run globally (acquire any resource, local or remote, that is capable of running their job/task and willing to do so for as long as it remains willing) while submitting locally (queue and manage their resource acquisition and job/task (workflow) execution at their local submit point).
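As an illustrative sketch of the submit-locally / run-globally model, an HTCondor submit file can declare that input and output files travel with the job, so the job can land on any matching machine, local or remote, rather than only on machines that share a filesystem with the submit point; the file names are hypothetical.

    # fit.sub -- job ships its own inputs and brings results back on exit
    universe                = vanilla
    executable              = fit_model.sh
    transfer_input_files    = data.tar.gz, config.json
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    requirements            = (Arch == "X86_64") && (OpSys == "LINUX")
    queue

The user still runs condor_submit and condor_q at their local access point; where the job actually executes is left to the matchmaker.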

23 The Open Science Grid (OSG): a national fabric of distributed HTC services

24 “The members of OSG are united by a commitment to promote the adoption and to advance the state of the art of distributed high throughput computing (DHTC) – shared utilization of autonomous resources where all the elements are optimized for maximizing computational throughput.”

25 OSG provides services to advance this vision
The OSG vision in a nutshell:
Across the nation, institutions invest in research computing to remain competitive.
Science is a team sport, and institutions with an island mentality will underperform.
Integration is key to success.
OSG provides services to advance this vision.

26 Transparent Computing across different resource types
[Diagram: an Access Point and the OSG Metascheduling Service share work across a National Supercomputer, a Collaborator’s Cluster, Nationally Shared Clusters, a Commercial Cloud, and the Local Cluster.] The Open Science Grid (OSG) integrates computing across different resource types and business models.

27

28 Integrating Compute & Storage Clusters
Each green dot represents a cluster registered in OSG

29 1.55B core hours in 12 months! Almost all jobs executed by the OSG leverage (HT)Condor technologies:
Condor-G
HTCondor-CE
Bosco
Condor Collectors
HTCondor overlays
HTCondor pools

30 Core Hours from HPC systems

31 Example of a LIGO Inspiral DAG

32

33 Dynamic discovery pathways at scale: Architecture view
[Architecture diagram. Layers: Discovery; Integrative Services (“Middleware”); “Foundational” CI Resources. Labels include: Science Portals, Research Facilities, Applications and Frameworks, Workflow Systems, Collaboration Platforms, Observation, Discipline-specific Environments, Data Management, Authentication, OSG, HPC access, Community, Campus and national resources, NSF-supported resources, International resources, National/International Research & Education Networks, Commercial Networks, Private and commercial clouds.]

34

35 Sharing @ UW-Madison campus
The Center for High Throughput Computing (CHTC) has been serving the campus for more than 10 years. It delivered more than 380M core hours in the past 12 months to researchers with HTC workloads from more than 50 departments. ~10% of these hours were provided by the Open Science Grid (OSG).

36 In 24 hours the CHTC serves a broad and diverse collection of science disciplines

37 Many owners of Resources!
Resources located in data centers, machine rooms, laboratories, classrooms, desks, …

38

39 HEPCloud is an R&D project led by the Fermi National Accelerator Laboratory (Fermilab) Computing Division.

40

41 The Decision Engine will have to implement on-the-fly capacity planning to control the acquisition and release of resources.

42 An Underpinning Aim: Elasticity (in cloud computing) aims to match the amount of resources allocated to a service (application/workflow) with the amount of resources it actually requires, avoiding over- or under-provisioning. [en.wikipedia.org/wiki/Elasticity_(cloud_computing)]

43 CHTC (HTCondor) and OSG offer a variety of services to
CHTC (HTCondor) and OSG offer a variety of services to:
individual researchers
research labs
institutes
universities
regional and national infrastructures
Services range from consulting to software tools to operating distributed computing resources. Our goal is to empower you to support the researchers at your institution. Let’s work together to support HT research!

