Building a Virtualized Desktop Grid – Eric Sedore

Why create a desktop grid?
- One prong of a three-pronged strategy to enhance research infrastructure on campus (physical hosting, HTC grid, private research cloud)
- Create a common, no-cost (to them) resource pool for the research community - especially beneficial for researchers with limited access to compute resources
- Attract faculty/researchers
- Leverage an existing resource
- Use as a seed to work toward critical mass in the research community

Goals
- Create a Condor pool sizeable enough for "significant" computational work (initial success = 2,000 concurrent cores)
- Create and deploy the grid infrastructure rapidly (6 months)
- Secure and low-impact enough to run on any machine on campus
- Create an adaptive research environment (virtualization)
- Simple for distributed desktop administrators to add computers to the grid
- Automated methods for detecting/enabling Intel VT (for the hypervisor)
- Automated hypervisor deployment

Integration of Existing Components
- Condor
- VirtualBox
- Windows 7 (64-bit)
- TCL / FreeWrap – Condor VM Catapult (glue)
- AD – Group Policy Preferences (deployment sketch below)
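
Automated hypervisor deployment through Group Policy can be as simple as a computer startup script that installs VirtualBox unattended. A hypothetical sketch; the share path and installer version are placeholders, and --silent is the stock Windows installer's unattended switch:

    REM Hypothetical GPO startup script: install VirtualBox if it is not already present
    IF EXIST "%ProgramFiles%\Oracle\VirtualBox\VBoxManage.exe" GOTO :EOF
    \\fileserver\deploy\VirtualBox-4.x.x-Win.exe --silent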

Typical Challenges Introducing the Grid (FUD)
- Security
  - You want to use "my" computer?
  - Where does my research data go?
- Technical
  - Hypervisor / VM management
  - Scalability
  - After you put "the grid" on my computer…
- Governance
  - Who gets access to "my" resources?
  - How does the scheduling work?

Security

Security on the client
- Grid processes run as a non-privileged user
- Virtualization abstracts the research environment / interaction
- VMs on the local drive are encrypted at all times (using the non-privileged user's certificate)
  - Both in the local cached repository and when running in a slot
  - Utilizes the Windows 7 Encrypted File System (see the sketch below)
  - Allows grid work on machines where end users are local administrators
- To-do: create a signature to assure researchers (and admins) that the VM started is "approved" and has not been modified (i.e., not modified to be a botnet)
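
A minimal sketch of the EFS step, run as the non-privileged grid user so the encryption certificate belongs to that account; the repository path is a placeholder:

    REM Hypothetical example: encrypt the local VM cache and everything beneath it
    cipher /e /s:"C:\CVMC\vm-repo"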

Securing/Protecting the Infrastructure
- Create an isolated private 10.x.x.x network via VPN tunnels (pfSense and OpenVPN)
- Limit bandwidth for each research VM to protect against a network DoS (example configuration below)
- Research VMs are NATed on the desktops
- Other standard protections – firewalls, ACLs
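
As an illustration, a client-side OpenVPN configuration inside a research VM might look roughly like the following; the server name, port, certificate files, and rate are all assumptions:

    # Hypothetical OpenVPN client config for a research VM
    client
    dev tun
    proto udp
    remote vpn.example.edu 1194
    ca ca.crt
    cert research-vm.crt
    key research-vm.key
    # cap outgoing bandwidth (bytes/sec) to limit the damage a runaway job can do
    shaper 1000000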

[Diagram: Condor infrastructure roles. An OpenVPN end-point (pfSense) acting as firewall/router connects the private 10.x.x.x network to the public network; other roles include the ITS-SL6-LSCSOFT Condor submit server. The end-point is noted as a bottleneck for higher-bandwidth jobs.]

Technical

Condor VM Coordinator (CVMC)
- Condor's VM "agent" on the desktop
- Manages distribution of the local virtual machine repository
- Manages encryption of the virtual machines
- Runs as a non-privileged user – reduces adoption barriers
- Pseudo scheduler
  - Rudimentary logic for when to allow grid activity
  - Windows specific – is there a user logged in? (see the sketch below)
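
Since CVMC is built with TCL/FreeWrap (per the components slide), its scheduling check might look roughly like the Tcl sketch below; the quser-based login test and the procedure names are assumptions, and the time window mirrors the "low impact" policy described later:

    # Hypothetical sketch of CVMC's pseudo-scheduler decision
    proc user_logged_in {} {
        # quser exits non-zero when no interactive session exists
        if {[catch {exec quser} out]} { return 0 }
        return 1
    }

    proc grid_allowed {} {
        # nobody logged in: grid load is allowed at any time
        if {![user_logged_in]} { return 1 }
        # user logged in: only run outside the 7 AM - 5 PM window
        scan [clock format [clock seconds] -format %H] %d hour
        return [expr {$hour < 7 || $hour >= 17}]
    }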

Why did you write CVMC?
- Runs as a non-privileged user (and needs a Windows profile)
- Mistrust of a 3rd-party agent (the Condor client) on all campus desktops – especially when turned over to the research community – even with the strong sandbox controls in Condor
- Utilizes the built-in MS Task Scheduler for idle detection (example below) – no processes running in the user's context for activity detection
- VM repository management
- Encryption
- It seemed so simple when I started…
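
A hypothetical Task Scheduler registration along these lines would start CVMC only after the machine has been idle for a while; the task name, path, and 10-minute threshold are made up:

    REM Hypothetical example: run CVMC once the machine has been idle for 10 minutes
    schtasks /create /tn "CVMC-Idle" /tr "C:\CVMC\cvmc.exe" /sc onidle /i 10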

Job Configuration
- Requirements = ( TARGET.vm_name == "its-u11-boinc" ) && ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) && ( TARGET.Disk >= DiskUsage ) && ( ( TARGET.Memory * 1024 ) >= ImageSize ) && ( ( RequestMemory * 1024 ) >= ImageSize ) && ( TARGET.HasFileTransfer )
- ClassAd addition: vm_name = "its-u11-boinc"
- CVMC uses the vm_name ClassAd to determine which VM to launch
- Jobs without vm_name can use running VMs (assuming the requirements match) – but they won't start up new VMs
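
For illustration, a submit description for this setup might look like the sketch below. The universe, executable, input file, and memory request are placeholders; condor_submit typically folds its default Arch/OpSys/Disk/Memory clauses into the full Requirements expression shown above.

    # Hypothetical submit description targeting the its-u11-boinc VM
    universe                = vanilla
    executable              = analyze.sh
    transfer_input_files    = input.dat
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    request_memory          = 1024
    +vm_name                = "its-u11-boinc"
    requirements            = (TARGET.vm_name == "its-u11-boinc")
    queue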

[Diagram: the Win 7 client side shows the Task Scheduler (idle state), CVMC, VirtualBox, the local VM repo, and the execute slots (Slot 1, Slot 2, …); the Condor back-end side shows the web server (VM distribution) and the Condor queue.]

Technical Challenges
- Host resource starvation
  - Leave memory for the host OS
  - Memory controls on jobs (within Condor)
- Unique combination of approaches implementing Condor
  - CVMC / web service
  - VM distribution
  - Build custom VMs based on job needs vs. scavenging existing operating system configurations
  - Hypervisor expects to have an interactive session environment (Windows profile)
- Reinventing the wheel on occasion

How do you "ensure" low impact?
- When no one is logged in, CVMC will allow grid load regardless of the time
- When a user is logged in, CVMC will kill grid load at 7 AM and not allow it to run again until 5 PM (regardless of whether the machine is idle)
- Leave the OS memory (512 MB-1 GB) so it does not page out key OS components (using a simple memory allocation method – sketch below)
- Do not cache VM disks – keeps the OS from filling its memory cache with VM I/O traffic
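
One way the "simple memory allocation method" might work is to size the VM from the host's physical memory minus a fixed reserve. A rough Tcl sketch, assuming the output format of "VBoxManage list hostinfo" and reusing the its-u11-boinc VM name from the job-configuration slide:

    # Hypothetical sketch: leave ~1 GB of RAM for the host OS when sizing the VM
    set reserve_mb 1024
    set info [exec VBoxManage list hostinfo]
    regexp {Memory size: (\d+) MByte} $info -> host_mb
    set vm_mb [expr {$host_mb - $reserve_mb}]
    exec VBoxManage modifyvm its-u11-boinc --memory $vm_mb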

Keep OS from Caching VM I/O
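
VirtualBox exposes this per storage controller: turning off the host I/O cache keeps the VM's disk traffic out of the Windows file cache. A hypothetical command, assuming the VM's controller is named "SATA":

    REM Hypothetical example: bypass the host OS cache for the VM's disk I/O
    VBoxManage storagectl its-u11-boinc --name "SATA" --hostiocache off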

Next Steps
- Grow the research community – depth and diversity
- Increase the pool size – ~12,000 eligible cores
- Infrastructure scalability
  - Condor (tuning/sizing)
  - Network / storage (NFS – Parrot / Chirp)

Solving the Data Transfer Problem
- Born from an unfinished side project 7+ years ago.
- Goal: maximize the compute resources available to LIGO's search for gravitational waves
  - More cycles == a better search.
- Problem: huge input data, impractical to move with the job.
- How to...
  - Run on other LIGO Data Grid sites without a shared filesystem?
  - Run on clusters outside the LIGO Data Grid lacking LIGO data?
- Tools to get the job done: ihope, GLUE, Pegasus, Condor checkpointing, and Condor-C.
- People: Kayleigh Bohémier, Duncan Brown, Peter Couvares. Help from SU ITS, the Pegasus team, and the Condor team.

Idea: Cross-Pool Checkpoint Migration
- Condor_compiled (checkpointable) jobs.
- Jobs start on a LIGO pool with local data.
- Jobs read in data and pre-process.
- Jobs call checkpoint_and_exit().
- The Pegasus workflow treats the checkpoint image as output, and provides it as "input" to a second Condor-C job.
- The Condor-C job transfers and executes the standalone checkpoint on the remote pool, and transfers the results back.

Devil in the Details
- Condor's checkpoint_and_exit() caused the job to exit with SIGUSR2, so we needed to catch that and treat it as success (see the sketch below).
- Standalone checkpoint images didn't like to restart in a different cwd, even if they shouldn't care, so we had to binary-edit each checkpoint image to replace the hard-coded /path/to/cwd with .////////////
  - Will be fixed in Condor 7.8?
- Pegasus needed minor mods to support Condor-C "grid" jobs with Condor file transfer
  - Fixed for the next Pegasus release.
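
One place the SIGUSR2-as-success rule can be expressed is the Condor submit description; a hypothetical version (signal number 12 assumes Linux):

    # Hypothetical: leave the queue on a clean exit OR on exit via SIGUSR2 (signal 12)
    on_exit_remove = (ExitBySignal == False && ExitCode == 0) || (ExitBySignal == True && ExitSignal == 12)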

Working Solution
- Move jobs that do not require the input files on the SUGAR cluster over to the remote campus cluster.