1 Evolution of OSG to support virtualization and multi-core applications (Perspective of a Condor Guy) Dan Bradley University of Wisconsin Workshop on adapting applications and computing services to multi-core and virtualization, CERN, June 2009

2 VMs in OSG
VM worker nodes
– enthusiastic reports from site admins
– transparent to users
VM middleware pilots
– just a more dynamic version of the above
– on-demand creation of worker nodes or services
– still transparent to grid users
VM user pilots/jobs
– VO or user provides the VM
– at the prototype stage

3 VM Worker Node Example
Purdue: Linux in Windows, using GridAppliance
Run Condor in a Linux VM
– consistent execution environment for jobs
– VM is transparent to the job
– IPOP network overlay to span firewalls etc.
– sandboxes the job
Hope to incorporate as many as possible of the 27k computers on campus

4 VM middleware pilot: “Wispy” Science Gateway
Purdue’s VM cloud testbed, built with Nimbus
Used by the TeraGrid Science Gateways program
– prototype for creating OSG clusters on TeraGrid resources
– a VM looks like an OSG worker node

5 Clemson VO Cluster
– KVM virtual machines
– OSG CE in virtual machines
– Condor as the LRM
– Condor within the VM image
– NFS share and PVFS setup
KVM offers a snapshot mode that lets us use a single image file; writes are temporary.

6 Results
Engage VO site Clemson-Birdnest on OSG Production
Virtual cluster size responds to load

7 Routing excess grid jobs to the clouds
Examples:
Clemson’s “watchdog”
– monitor the job queue
– run EC2 VMs as extra Condor worker nodes
Condor JobRouter
– with plugins, can transform a “vanilla” job into a VM or EC2 job, e.g. Red Hat’s MRG/EC2 approach (a route sketch follows below)
– or operate more like the watchdog, generating pilots; reminds me of glideinWMS pilot scheduling, which should be nearly directly applicable
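To make the JobRouter option concrete, here is a hedged sketch of a single route, in the new-ClassAd syntax, that rewrites opted-in vanilla jobs as grid-universe jobs aimed at an EC2-style endpoint. The route name, the WantCloud opt-in attribute, the endpoint URL, and the AMI id are placeholders, not a tested recipe; a real deployment would also set credential and keypair attributes on the routed job and contextualize the VM so it joins the Condor pool as a worker node.

    # Hedged condor_job_router sketch; WantCloud, the endpoint, and the AMI id
    # are illustrative placeholders.
    JOB_ROUTER_ENTRIES = \
      [ name = "Overflow_to_EC2"; \
        Requirements = target.WantCloud =?= True; \
        GridResource = "ec2 https://ec2.amazonaws.com/"; \
        set_EC2AmiID = "ami-00000000"; \
        MaxJobs = 100; \
        MaxIdleJobs = 10; \
      ]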

8 VM Instantiation
Nimbus, VMware, Platform, Red Hat, …
– likely many solutions will be used
Condor can run VMs using the familiar batch-system interface
– start/stop/checkpoint VMs as if they were jobs
– VMware, Xen, KVM
– can submit to EC2 interfaces (e.g. Nimbus)
– working example: MIT CSAIL VMware jobs running on the Wisconsin CHTC cluster
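For the “VMs as jobs” interface, the submit file looks much like an ordinary Condor job. Here is a minimal sketch for the KVM flavor, assuming a local image called worker-node.qcow2; the image name and memory size are placeholders, and VMware and Xen use the same universe with their own vm_* settings.

    # Minimal VM-universe submit file sketch (KVM flavor).
    universe   = vm
    vm_type    = kvm
    vm_memory  = 1024                      # MB of RAM for the guest
    vm_disk    = worker-node.qcow2:vda:w   # file:device:permission
    executable = worker-node-vm            # used only as a label for the job
    log        = vm.$(Cluster).log
    queue

Submission to an EC2 interface (e.g. through Nimbus) instead uses the grid universe, with grid_resource pointing at the cloud endpoint.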

9 Multi-core jobs in OSG
MPI grid jobs are doable
– tend to require site-specific tuning
– don’t directly extend to multi-core in all cases
For multi-core, we want a standard way to request
– a whole CPU or a whole machine
– or N cores with M memory
… but currently this is not standardized
– example PBS RSL: (jobtype=single)(xcount=N)(host_xcount=1) (see the submit-file sketch below)
Condor:
– site-specific (custom ClassAd attributes)
– not currently configured at most sites
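To make the RSL bullet concrete, here is a hedged sketch of a Condor-G submit file that passes that RSL to a GT2 gatekeeper in front of a PBS cluster; the gatekeeper hostname and wrapper script are placeholders.

    # Hedged Condor-G sketch: request 8 cores on a single PBS node via GT2.
    universe      = grid
    grid_resource = gt2 ce.example.edu/jobmanager-pbs
    globus_rsl    = (jobtype=single)(xcount=8)(host_xcount=1)
    executable    = run_mpi.sh
    output        = mpi.out
    error         = mpi.err
    log           = mpi.log
    queue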

10 Example of multi-core in OSG
Wisconsin GLOW site
– several users require the whole machine
– workflow is many small MPI jobs
– running on dual 4-core CPUs with 12 GB RAM
Choices
– statically configure one-slot machines (sketched below)
– or mix whole-machine and single-core policies
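The static choice is essentially a one-line startd policy; a minimal sketch using the standard knob:

    # Static option: advertise each machine as a single slot that owns all of
    # its cores and memory, so only one (multi-core) job runs on it at a time.
    NUM_SLOTS = 1

The cost is that a single-core job landing there idles the other seven cores, which is what the mixed policy on the next slides tries to avoid.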

11 Condor mixed-core config
A batch slot can represent a whole machine, a CPU, a GPU, or whatever
Slots can represent different slices of the same physical resources
[diagram: overlapping slots spanning individual cores, CPUs, and the whole machine]

12 Condor mixed-core config
Slot policy controls the interaction between overlapping slots
– can use job suspension to implement reservation
– a job can discover the size of its slot at runtime via the environment
– CPU affinity can be set
– the next development release improves accounting: weighted slots
– example config from the Condor How-to:
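The How-to link did not survive this transcript; below is a hedged, simplified sketch in the same spirit for the 8-core, 12 GB GLOW machines. Slots 1-8 are single cores, slot 9 overlaps them and represents the whole machine, and suspension acts as the reservation. The RequiresWholeMachine job attribute and the policy expressions are illustrative, not the exact published recipe.

    # Resources are deliberately double-counted so both slot layers can be carved out.
    NUM_CPUS = 16        # 2 x 8 physical cores
    MEMORY   = 24576     # 2 x 12 GB, in MB

    NUM_SLOTS_TYPE_1 = 8                    # slots 1-8: one core each
    SLOT_TYPE_1      = cpus=1, memory=1536
    NUM_SLOTS_TYPE_2 = 1                    # slot 9: the whole machine
    SLOT_TYPE_2      = cpus=8, memory=12288

    # Whole-machine jobs (submitted with +RequiresWholeMachine = True) may only
    # start on slot 9; ordinary jobs go to the single-core slots.
    START = ( SlotID == 9 && TARGET.RequiresWholeMachine =?= True ) || \
            ( SlotID  < 9 && TARGET.RequiresWholeMachine =!= True )

    # Let each slot see the others' state, then suspend single-core jobs while
    # the whole-machine slot is busy (reservation via suspension).
    STARTD_SLOT_ATTRS = State, Activity
    WANT_SUSPEND = True
    SUSPEND  = ( SlotID < 9 ) && ( Slot9_Activity =?= "Busy" )
    CONTINUE = ( SlotID == 9 ) || ( Slot9_Activity =!= "Busy" )

A whole-machine job then adds +RequiresWholeMachine = True to its submit file and requires an 8-CPU slot (e.g. Requirements = Cpus >= 8).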

13 Condor variable-core
Dynamic slicing of the machine into slots is also possible
– create a new slot from the leftovers after matching a job
– current issue: fragmentation
  a steady stream of small jobs can starve large jobs
  doesn’t reserve/preempt N small slots to create a big slot
  instead, must periodically drain the machine to defragment
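For comparison with the overlapping-slot setup, the dynamic-slicing alternative is a partitionable slot; a minimal sketch with the standard knobs (the request sizes are placeholders):

    # Startd side: one partitionable slot owning the whole machine; dynamic
    # slots are carved out of it to fit each matched job, and the leftovers
    # remain advertised as the shrinking partitionable slot.
    NUM_SLOTS_TYPE_1          = 1
    SLOT_TYPE_1               = cpus=100%, memory=100%, disk=100%
    SLOT_TYPE_1_PARTITIONABLE = True

    # Submit side: each job states how big a slice it needs, e.g.
    #   request_cpus   = 4
    #   request_memory = 6144

Nothing here holds space back for big requests, which is exactly the fragmentation problem described above.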

14 Summary
VM worker nodes
– already happening
VM pilots/jobs
– happening at the resource-provider level
– user-provided VMs are more complex
  need agreement of sites
  need to deploy common interfaces
  need logging of pilot activity (like glexec)
Multi-core jobs
– grid interface needs improvement
– need a standard Condor configuration across sites