Next-generation cluster management architecture and software

Slides:



Advertisements
Similar presentations
How We Manage SaaS Infrastructure Knowledge Track
Advertisements

Dr. Kalpakis CMSC 621, Advanced Operating Systems. Fall 2003 URL: Distributed System Architectures.
Mecanismos de alta disponibilidad con Microsoft SQL Server 2008 Por: ISC Lenin López Fernández de Lara.
© 2003 IBM Corporation IBM Systems and Technology Group Operating System Attributes for High Performance Computing Ken Rozendal Distinguished Engineer.
Faith Allington Program Manager Microsoft Corporation Session Code: WSV304.
Network Management Overview IACT 918 July 2004 Gene Awyzio SITACS University of Wollongong.
MarketNet Directory Services (MDS) Weibin Zhao Henning Schulzrinne Department of Computer Science Columbia University.
Draft-li-rtgwg-cc-igp-arch-00IETF 88 RTGWG1 An Architecture of Central Controlled Interior Gateway Protocol (IGP) draft-li-rtgwg-cc-igp-arch-00 Zhenbin.
Patch Management Module 13. Module You Are Here VMware vSphere 4.1: Install, Configure, Manage – Revision A Operations vSphere Environment Introduction.
Chapter 7 Configuring & Managing Distributed File System
System Design/Implementation and Support for Build 2 PDS Management Council Face-to-Face Mountain View, CA Nov 30 - Dec 1, 2011 Sean Hardman.
Distributed Data Stores – Facebook Presented by Ben Gooding University of Arkansas – April 21, 2015.
Automate Microsoft Azure Ross Sponholtz Mark Ghazai.
ATIF MEHMOOD MALIK KASHIF SIDDIQUE Improving dependability of Cloud Computing with Fault Tolerance and High Availability.
Cyberinfrastructure Status July, NSF reverse site visit Refactoring and cleanup after review preparations Coordinating Node technology changes.
 Cloud computing  Workflow  Workflow lifecycle  Workflow design  Workflow tools : xcp, eucalyptus, open nebula.
Tools and software process for the FLP prototype B. von Haller 9. June 2015 CERN.
WP4-install task report WP4 workshop Barcelona project conference 5/03 German Cancio.
Successful Deployment and Solid Management … Close Relatives Tim Sinclair, General Manager, Windows Enterprise Management.
AUTOBUILD Build and Deployment Automation Solution.
Anilkumar Dantu CCIE (22536)
1 Moshe Shadmon ScaleDB Scaling MySQL in the Cloud.
Openstack on Openstack how to bootstrap a cloud Paul Voccio Director, Infrastructure Engineering Rackspace.
SONIC-3: Creating Large Scale Installations & Deployments Andrew S. Neumann Principal Engineer, Progress Sonic.
NA-MIC National Alliance for Medical Image Computing UCSD: Engineering Core 2 Portal and Grid Infrastructure.
Managing the CERN LHC Tier0/Tier1 centre Status and Plans March 27 th 2003 CERN.ch.
SONIC-3: Creating Large Scale Installations & Deployments Andrew S. Neumann Principal Engineer Progress Sonic.
DNS DNS overview DNS operation DNS zones. DNS Overview Name to IP address lookup service based on Domain Names Some DNS servers hold name and address.
Plug-in Architectures Presented by Truc Nguyen. What’s a plug-in? “a type of program that tightly integrates with a larger application to add a special.
System/SDWG Update Management Council Face-to-Face Flagstaff, AZ August 22-23, 2011 Sean Hardman.
{ Tanya Chaturvedi MBA(ISM) Hadoop is a software framework for distributed processing of large datasets across large clusters of computers.
Module 11: Configuring and Managing Distributed File System.
Cluster computing. 1.What is cluster computing? 2.Need of cluster computing. 3.Architecture 4.Applications of cluster computing 5.Advantages of cluster.
Next Generation of Apache Hadoop MapReduce Owen
And scales by cloning the app on multiple servers/VMs/Containers Traditional architecture approach Microservices architecture approach A microservice.
Module 11 Configuring and Managing Distributed File System.
Information Initiative Center, Hokkaido University North 11, West 5, Sapporo , Japan Tel, Fax: General.
1 Ivan Marsic Rutgers University LECTURE 2: Software Configuration Management.
Atrium Router Project Proposal Subhas Mondal, Manoj Nair, Subhash Singh.
OSIsoft High Availability PI Replication Colin Breck, PI Server Team Dave Oda, PI SDK Team.
NASA MSFC Mission Operations Laboratory MSFC NASA MSFC Mission Operations Laboratory Kelvin Nichols, EO50 March 2016 MSFC ISS DTN Project Status.
Software Defined Networking BY RAVI NAMBOORI. Overview  Origins of SDN.  What is SDN ?  Original Definition of SDN.  What = Why We need SDN ?  Conclusion.
OpenQRM is not Dead the lightning version Building a cloud in 5 mnutes by Kris Buytaert.
Shaopeng, Ho Architect of Chinac Group
Daisy4nfv: An Installer Based upon Open Source Project – Daisy & Kolla
Tales from CICD for vR Suite Daniele Ulrich
Building ARM IaaS Application Environment
Distributed Cache Technology in Cloud Computing and its Application in the GIS Software Wang Qi Zhu Yitong Peng Cheng
LECTURE 2: Software Configuration Management
Implementing Active Directory Domain Services
Self Healing and Dynamic Construction Framework:
Deploying and Maintaining Server Images
Configuration Management with Azure Automation DSC
CHAPTER 3 Architectures for Distributed Systems
Maintaining software solutions
LECTURE 3: Software Configuration Management
20409A 7: Installing and Configuring System Center 2012 R2 Virtual Machine Manager Module 7 Installing and Configuring System Center 2012 R2 Virtual.
Outline Midterm results summary Distributed file systems – continued
ABHISHEK SHARMA ARVIND SRINIVASA BABU HEMANT PRASAD 08-OCT-2018
Software System Testing
The Problem ~6,000 PCs Another ~1,000 boxes But! Affected by:
TechEd /23/2019 9:23 AM © 2013 Microsoft Corporation. All rights reserved. Microsoft, Windows, and other product names are or may be registered trademarks.
Distributed Edge Computing
5G Use Case Configuration & PNF SW Upgrade using NETCONF ONAP DDF, Jan 9, 2019 Ericsson.
Final Review 27th March Final Review 27th March 2019.
NFV and SD-WAN Multi vendor deployment
Nolan Leake Co-Founder, Cumulus Networks Paul Speciale
GNFC Architecture and Interfaces
ONAP Architecture Principle Review
Containers on Azure Peter Lasne Sr. Software Development Engineer
Presentation transcript:

Next-generation cluster management architecture and software LA-UR-18-30725 Next-generation cluster management architecture and software Towards Continuous Uptime Paul Peltz Jr. and Lowell Wofford 11/11/2018 Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA

Current Issues Current Cluster Management Tools Centralized Monolithic Proprietary/Outdated Not Configuration Management and Orchestration capable Downtime Binary Network Topologies (Up/Down) Updates/Patches Schedulers Lack integration into the cluster management Inference model only

How do we solve these problems? Current Cluster Management Tools Complete redesign necessary Distributed State Engine, Modular Downtime System should never go down Degraded modes Distributed Services Rolling Upgrades CI/CD of Patches and Changes Schedulers Ties directly into the distributed state engine

Redesign of HPC Architecture

How we will test and deploy changes in the future Build Server Gitlab Config Mgmt Master Master Master Sub-Master Sub-Master Sub-Master

Stage 1 – Functional Testing Build Server Gitlab Config Mgmt OBS Git CM Master Master Master Sub-Master Sub-Master Sub-Master

Stage 2 – Small Scale Test Build Server Gitlab Config Mgmt OBS Git CM Master Master Master Sub-Master Sub-Master Sub-Master

Stage 3 Scale Testing Build Server Gitlab Config Mgmt OBS Git CM Master Master Master Sub-Master Sub-Master Sub-Master

Stage 4 – Update Additional Master/Sub-Master Build Server Gitlab Config Mgmt OBS Git CM Master Master Master Sub-Master Sub-Master Sub-Master

Final Stage – Continue to reboot nodes into new images once jobs complete Build Server Gitlab Config Mgmt OBS Git CM Master Master Master Sub-Master Sub-Master Sub-Master

Sounds great, how do we do this? Service decentralization HA is not acceptable VMs/Micro Services Migrate services Distributed State Engine The cluster manager must be able to track state, transitions, and final states of the system including the service nodes. Backwards compatible APIs Need at least one version compatibility between service APIs. Scheduler Integration Must be able to schedule the transition of release states on the system Handle A/B configuration if there are compatibility issues

Automated and Continuous Testing! Testing first on development systems (TDS) Automated testing for functionality and performance during each stage of deployment Once release is approved that branch is released into production Configuration management notifies state engine to begin rolling through upgrade Testing is performed for each step If a test exception occurs, human intervention will be required to resolve the issue Roll forward or roll back support

Kraken is a State Engine for System Automation It can: boot systems, load images, recover from failures… and more. Kraken tracks two kinds of state: What state are we in? (“Discoverable”) What state should we be in? (“Configuration”) Kraken knows how to mutate from the state you’re in to the state you want: Maintains a directed graph of known state mutations (e.g. power_off -> power_on) Searches the graph to create a chain of mutations to get from where we are to where we want to be. Simplified mutation graph for booting a node through traditional PXE. Edges are mutations, boxes are states.

Kraken is Distributed Kraken uses a protobuf-based, light-weight datagram protocol to synchronize state across the system. This uses an eventual consistency model to converge on synchronized state. Discoverable state migrates up the tree from the end-point nodes (computes). Configuration state migrates down the tree from the master, or “full-state node.” Sideways replication is future work.

Kraken is Modular Modules make kraken actually do work. The core of Kraken: State management Plugin/Extension management Modules can: Instantiate microservices Declare that they can mutate state Declare that they can discover state Extensions add new variables to the state

State of the Kraken Basic set of functional microservices & plugins Can boot multiple architectures Uses layered container images in reference implementation Open source (BSD-3): https://github.com/hpc/kraken Working towards first full release

Thanks!