MyOps: An Operational Framework for PlanetLab Deployments

Presentation transcript:

MyOps An Operational Framework for PlanetLab Deployments 1

Outline 2
o Objective of MyOps
o Current status
o Future ideas
o Questions at any time

Example of Feedback 3

Objective: Close Operational Cycle 4
System - Provides service (slice)
Monitoring - Feedback from running system
Operator - Interpret feedback into tasks
Management - Control running system

Challenges: Break-down 5
System may not deliver service
Monitoring may not observe useful metrics
Operator may not know
  o how to interpret observations
  o how to control the system
  o what the service goals are
Management may not control system

Requirements for Operational Systems 6
Satisfy Minimal Conditions
  1. Physical Integrity
  2. Interconnectivity
  3. Controllable
  4. Provide a Service
Two requirements
  o Reliably reach the final condition
  o When failures occur, repair or report automatically
Two approaches in MyOps
  o Precise bootstrap stages (not discussed)
  o Operational monitoring & management in platform

System: PlanetLab Slices 7

Monitoring Types 8
Open-loop monitoring
  o Identify the unknown
  o More information, fine-grained
Operational monitoring (closed-loop)
  o Correctness
  o Less information, coarse-grained
  o Actionable

Management Types 9
Open-loop management
  o Bootstrap/Deploy from the ground up
  o Inefficient, coarse-grained
  o No feedback
Operational management (closed-loop)
  o Tweak the system to correct behavior
  o More efficient, fine-grained

Example 10
Observe: Node is Off-line
Control: Attempt to Power-on
Observe: Node is On-line but Failed to boot
Observe: Failed to boot Error
Control: Create ticket & Send to local contact
Time passes
Control: Disable slice creation
Observe: Local contact responds
Observe: Node is Powered-on and Running
Control: Re-enable slice creation
Control: Close ticket
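
The observe/control sequence on this slide can be read as a small closed loop in which each observation triggers at most a few control actions. The following is a minimal sketch under that reading, not the MyOps implementation; the observation and action names (`offline`, `power_on`, `no_response`, and so on) are hypothetical labels chosen for illustration.

```python
def control_loop(observations):
    """Map a sequence of observations to control actions (illustrative only).

    All observation and action names are hypothetical, not MyOps identifiers.
    """
    actions = []
    ticket_open = False
    slices_enabled = True
    for obs in observations:
        if obs == "offline":
            # Observe: node is off-line -> Control: attempt to power on.
            actions.append("power_on")
        elif obs == "failed_to_boot" and not ticket_open:
            # Open one ticket per incident and notify the local contact.
            ticket_open = True
            actions += ["create_ticket", "notify_local_contact"]
        elif obs == "no_response" and ticket_open and slices_enabled:
            # Time passed without a reply: apply the incentive once.
            slices_enabled = False
            actions.append("disable_slice_creation")
        elif obs == "running":
            # Node recovered: undo the incentive, then close the ticket.
            if not slices_enabled:
                slices_enabled = True
                actions.append("enable_slice_creation")
            if ticket_open:
                ticket_open = False
                actions.append("close_ticket")
    return actions


trace = ["offline", "failed_to_boot", "failed_to_boot", "no_response", "running"]
print(control_loop(trace))
```

Tracking `ticket_open` and `slices_enabled` keeps the loop idempotent: repeating the same observation does not repeat the same action.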

History of PlanetLab Operations 11
Open-loop Monitoring with Open-loop Management
  o Collect fine-grained statistics using CoMon
  o Act with coarse-grained operations (e.g. Reinstall)
  o Manual bridge between the two
Moving towards Closed-loop Operations
  o Collect targeted metrics
  o Take directed, problem-specific actions
  o Automate actions based on policy

PlanetLab Operations 12
Close the monitor/management cycle
  o Direct automation of common operations
  o Indirect through remote contacts and incentives

MyOps Architecture 13
Collection from Node
Translated by policy to Automated action

MyOps Architecture 14
Collection from Node
Send notice to Local contact to take action

MyOps Architecture 15
When there is no response
Indirect influence with incentives

Collection 16
Operational monitoring of specific targets, such as:
  o Boot status, Filesystem status
  o DNS - internal and external
  o RPMs
  o System services, etc.
Periodic collection
  o Coarse-grained collection at a human timescale
  o Time-series of events and status
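
One way to picture a "time-series of events and status" is to collapse periodic status samples into events that record only the changes. The sketch below assumes a hypothetical `(timestamp, status)` sample format; neither the format nor the status strings come from MyOps.

```python
def to_events(samples):
    """Collapse periodic (timestamp, status) samples into change events.

    Each event is (timestamp, previous_status, new_status). The sample
    format and status names are assumptions made for this illustration.
    """
    events = []
    previous = None
    for timestamp, status in samples:
        if status != previous:
            events.append((timestamp, previous, status))
            previous = status
    return events


samples = [(0, "boot"), (1, "boot"), (2, "down"), (3, "down"), (4, "boot")]
for event in to_events(samples):
    print(event)
```

Because collection is coarse-grained, the event list stays short even over long periods, which suits policies written against status changes instead of raw samples.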

Policy 17
Constraints over a time-series of events
To satisfy a constraint
  o Automated action
  o Send notice
  o Apply incentive
Policy defines
  o Preferred status of system
  o Frequency of actions
  o Magnitude of incentives
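
A policy of this kind can be sketched as a function from a status history to one of the three responses listed on the slide. The example below is an illustration under stated assumptions: the `Policy` fields, the day-based thresholds, and the returned action names are hypothetical and do not reflect actual MyOps settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Hypothetical policy parameters; names and defaults are illustrative."""
    preferred_status: str = "online"
    notice_after: int = 2      # days in violation before the first notice
    notice_every: int = 3      # frequency of repeated notices, in days
    incentive_after: int = 8   # days in violation before an incentive applies


def days_violating(history, policy):
    """Count the most recent consecutive daily samples that violate the policy."""
    count = 0
    for status in reversed(history):
        if status == policy.preferred_status:
            break
        count += 1
    return count


def decide(history, policy):
    """Pick the response for a daily status history (oldest sample first)."""
    down = days_violating(history, policy)
    if down == 0:
        return "none"
    if down >= policy.incentive_after:
        return "apply_incentive"
    if down >= policy.notice_after:
        due = (down - policy.notice_after) % policy.notice_every == 0
        return "send_notice" if due else "wait"
    return "automated_action"


policy = Policy()
print(decide(["online", "down"], policy))
print(decide(["online"] + ["down"] * 5, policy))
print(decide(["online"] + ["down"] * 8, policy))
```

The three thresholds correspond loosely to the slide's "preferred status", "frequency of actions", and the point at which incentives begin; the magnitude of an incentive is left out of this sketch.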

Automation 18
Automatic correction of common bootstrap problems
  o Communication errors with MyPLC
  o Corrupt filesystem repair
  o Retry when state is unknown
  o PCU Reboot
  o Reinstall
Automation Notices
  o Bad disk
  o Minimal hardware
  o Bad DNS
  o Bad node configuration

Notices & Incentives 19
Notices are indirect paths to node management
  o Node down / online / specific problem (e.g. DNS, disk)
  o Site down / online
  o Privilege reduced / restored
  o PCU errors
The incentives on MyPLC
  o Sites 10 slices
  o Disable slice creation
  o Disable running slices
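
The "privilege reduced / restored" notices suggest an escalation ladder over site privileges. The sketch below assumes a hypothetical three-level ladder built from the two clearly legible incentives on the slide; the level names, the one-step escalation, and the full restore on recovery are all assumptions, not documented MyOps behavior.

```python
# Hypothetical privilege ladder, mildest level first. The first incentive on
# the slide is not legible in the transcript and is deliberately left out.
LEVELS = ["full_access", "slice_creation_disabled", "running_slices_disabled"]


def reduce_privilege(level):
    """Escalate one step, staying at the strictest level once it is reached."""
    index = LEVELS.index(level)
    return LEVELS[min(index + 1, len(LEVELS) - 1)]


def restore_privilege(level):
    """Assumed behavior: a recovered site returns directly to full access."""
    return LEVELS[0]


level = "full_access"
level = reduce_privilege(level)
print(level)
level = reduce_privilege(level)
print(level)
print(restore_privilege(level))
```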

Validation of Notices & Incentives 20
[Figure; recoverable labels: A, B, C, D, E, Notice, Bug, Fix, Kernel Bug, Fix, Fix2]

Time to Restore Down Node (all issues) 21

Future Ideas 22
Generalize
  o Configuration
  o Collect from multiple sources
  o Expose policy
  o Act on multiple targets
  o Self-monitoring
Positive Incentives
  o Special access to services
  o Additional resources (Slices, Bandwidth, CPU, etc.)

Time to Reply (when there is a reply) 23