Chris Adams, Solutions Architect, Microsoft Corporation
Brian Copps, Service Engineering Mgr, Microsoft Corporation
SESSION CODE: MGT303
Agenda for the Session
What’s our Operations Role at Microsoft?
- 10 Geo Data Centers
- > 99.9% Global Availability
- 2M+ Concurrent Connections
- 700M Unique Clients/Month
- 100B Downloads/Year
- 500+ Petabytes of Egress/Year
- 25-50% Service Growth (YoY)

- Poor Business Planning
- Technology Complexities
- Expectation to Manage Growth/Costs

1 - Mitigate Risks (Safety 1st)
2 - Cost Efficacy (Manage Budget)
3 - Innovate & Impact (Value)
[Figure: Critical Incident and Warning Incident]
Risk Mitigation
- Asset Accountability (Servers & Data)
- Baseline Health Monitoring Pulse
- Elimination of legacy monitoring tools with limited or no supportability models

Cost Efficacy
- System Center (SC) Suite vs. SC + 3rd-party + custom tools
- Simplification Strategy: saved $1.5M in FY10 (people/infrastructure)
- Simplification Plan for DC Monitoring & Mgmt is 50% complete; targeting another $500k in FY11

Innovations
- Objective is to establish the foundation for E2E Service Monitoring & Management
- Non-SC Collection (3rd-party solutions)
- Threshold Alerts
- Performance
[Diagram: Web Site VIP, Web Server 1, Web Server 2, Database 1]
// Synthetic transaction test page: a monitoring probe requests this page and
// treats any HTTP status code of 600 or above as a functional failure.
using System;
using System.Net;

public partial class TestServiceSample : System.Web.UI.Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        // Default to 200 (OK)
        int httpStatusCode = (int)HttpStatusCode.OK;

        // Test a sample WCF service through its generated proxy (SampleSvc service reference)
        SampleSvc.SampleSvcClient client = new SampleSvc.SampleSvcClient();
        try
        {
            Response.Write("Functional test: ");
            Widget[] widgets = client.GetWidgets();

            // Check for the expected result
            if (widgets == null || widgets.Length == 0)
                throw new InvalidOperationException("Unexpected result.");

            Response.Write("Passed");
            client.Close();
        }
        catch (Exception ex)
        {
            // Release the proxy without waiting on a possibly faulted channel
            client.Abort();

            // Depending on which test was performed or what the failure was,
            // set a custom HTTP status code of 600 or above.
            httpStatusCode = 600;
            Response.Write(String.Format("Failed. Exception message: {0}", ex.Message));
        }

        Response.StatusCode = httpStatusCode;
        Response.Write(" HTTP Status Code: " + httpStatusCode);
        Response.Flush();
    }
}
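Because the test page reports its outcome only through the HTTP status code, whatever polls it needs a rule for reading that code. A minimal sketch of such a rule, assuming 200 means the functional test passed, 600 or above means the page caught a failure in its catch block, and any other code points at the web tier itself; `ProbeResult` and `ProbeClassifier` are hypothetical names, not a System Center API.

```csharp
using System;

// Hypothetical outcome categories for one request to the synthetic test page.
public enum ProbeResult
{
    Healthy,           // HTTP 200: the functional test passed
    FunctionalFailure, // status >= 600: the page ran but a check failed
    HttpError          // any other status (e.g. 404, 500, 503)
}

public static class ProbeClassifier
{
    // Lowest custom failure code the test page emits (assumed convention).
    public const int CustomFailureThreshold = 600;

    // Map a raw HTTP status code from the test page to a probe outcome.
    public static ProbeResult Classify(int statusCode)
    {
        if (statusCode == 200)
            return ProbeResult.Healthy;
        if (statusCode >= CustomFailureThreshold)
            return ProbeResult.FunctionalFailure;
        return ProbeResult.HttpError;
    }
}
```

For example, `ProbeClassifier.Classify(600)` returns `ProbeResult.FunctionalFailure`, while `ProbeClassifier.Classify(503)` returns `ProbeResult.HttpError`. Keeping the two failure classes separate lets an alert distinguish a broken back-end dependency behind a page that is still being served from a page that is not being served at all.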
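// Role entry point for a Windows Azure web role (class name assumed; the slide shows only OnStart).
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.WindowsAzure.Diagnostics;
using Microsoft.WindowsAzure.ServiceRuntime;

public class WebRole : RoleEntryPoint
{
    public override bool OnStart()
    {
        // Get the default diagnostics configuration
        DiagnosticMonitorConfiguration config = DiagnosticMonitor.GetDefaultInitialConfiguration();

        // Windows performance counters
        List<string> counters = new List<string>();
        counters.Add(@"\Processor(_Total)\% Processor Time");
        counters.Add(@"\Memory\Available Mbytes");
        counters.Add(@"\TCPv4\Connections Established");
        counters.Add(@"\ASP.NET Applications(__Total__)\Requests/Sec");
        counters.Add(@"\Network Interface(*)\Bytes Received/sec");
        counters.Add(@"\Network Interface(*)\Bytes Sent/sec");

        // Sample each counter every 5 minutes
        foreach (string counter in counters)
        {
            PerformanceCounterConfiguration counterConfig = new PerformanceCounterConfiguration();
            counterConfig.CounterSpecifier = counter;
            counterConfig.SampleRate = TimeSpan.FromMinutes(5);
            config.PerformanceCounters.DataSources.Add(counterConfig);
        }
        config.PerformanceCounters.ScheduledTransferPeriod = TimeSpan.FromMinutes(5);

        // Windows event logs: transfer Warning and above every minute
        config.WindowsEventLog.DataSources.Add("System!*");
        config.WindowsEventLog.DataSources.Add("Application!*");
        config.WindowsEventLog.ScheduledTransferPeriod = TimeSpan.FromMinutes(1);
        config.WindowsEventLog.ScheduledTransferLogLevelFilter = LogLevel.Warning;

        // IIS logs (part of the default Directories configuration), transferred every 10 minutes
        config.Directories.ScheduledTransferPeriod = TimeSpan.FromMinutes(10);

        DiagnosticMonitor.Start("DiagnosticsConnectionString", config);

        // For information on handling configuration changes,
        // see the MSDN topic on RoleEnvironment.Changing.
        RoleEnvironment.Changing += RoleEnvironmentChanging;

        return base.OnStart();
    }

    // Assumed handler, following the usual Azure SDK role template:
    // if a configuration setting is changing, cancel to restart this role instance.
    private void RoleEnvironmentChanging(object sender, RoleEnvironmentChangingEventArgs e)
    {
        if (e.Changes.Any(change => change is RoleEnvironmentConfigurationSettingChange))
        {
            e.Cancel = true;
        }
    }
}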
- Hardware Inventory – CPUs, drive size, memory, …
- Software Inventory – products, security patches, versions, …
- Configuration Data – registry, file versions, …
- Performance Counters – server and custom collections
- Events and Alerts – time, frequency, most common
- Availability KPIs – SLA %, download time, page size
- Error Data – HTTP error codes, common failing pages, DNS resolution, …
- Incident / Problem – trends, resolution % by tier, KPIs, …
- Change / Config Mgmt – trends, request frequency, …
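Of the data listed above, the availability KPI (SLA %) is the most direct to compute from synthetic-transaction results. A minimal sketch, assuming availability is measured as the share of successful probes in a window and compared against a target such as the > 99.9% Global Availability figure on the operations slide; the `AvailabilityKpi` class and its methods are hypothetical. For scale, a 99.9% target over a 30-day month leaves 30 × 24 × 60 × 0.001 = 43.2 minutes of allowed downtime.

```csharp
using System;

// Hypothetical helpers for an availability (SLA %) KPI built from probe results.
public static class AvailabilityKpi
{
    // Percentage of successful probes in a measurement window.
    public static double AvailabilityPercent(long successfulProbes, long totalProbes)
    {
        if (totalProbes <= 0)
            throw new ArgumentOutOfRangeException(nameof(totalProbes), "At least one probe is required.");
        if (successfulProbes < 0 || successfulProbes > totalProbes)
            throw new ArgumentOutOfRangeException(nameof(successfulProbes));
        return 100.0 * successfulProbes / totalProbes;
    }

    // True when measured availability meets or exceeds the SLA target (e.g. 99.9).
    public static bool MeetsSla(double availabilityPercent, double targetPercent)
    {
        return availabilityPercent >= targetPercent;
    }

    // Minutes of downtime an SLA target allows over a window of the given length.
    public static double AllowedDowntimeMinutes(double targetPercent, TimeSpan window)
    {
        return window.TotalMinutes * (100.0 - targetPercent) / 100.0;
    }
}
```

With 999 successful probes out of 1,000, `AvailabilityPercent(999, 1000)` returns 99.9, which just meets a 99.9 target; 998 out of 1,000 does not. `AllowedDowntimeMinutes(99.9, TimeSpan.FromDays(30))` returns approximately 43.2 (floating-point rounding keeps it from being exact).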
Putting it all Together
System Center in Action - Best Practices: running-the-largest-corp-online-service-tips-tricks-and-guidance.aspx
System Center Team Blog
Learn more on System Center Web
Sign up for Tech·Ed 2011 and save $500, starting June 8 – June 31st.
You can also register at the North America 2011 kiosk located at registration.
Join us in Atlanta next year.