Published by Sydney McDonald. Modified over 9 years ago.
2
ARC301
5
BUT…
7
we are here
8
From
To
9
Microsoft has datacenter capacity around the world… and we’re growing
Chicago, Cheyenne, Dublin, Amsterdam, Hong Kong, Singapore, San Antonio, Boydton, Shanghai, Quincy, Des Moines, Brazil
35+ factors in site selection: proximity to customers; energy, fiber infrastructure; skilled workforce
"Data Centers have become as vital to the functioning of society as power stations." The Economist
12
Office365.com
Our surface area is too big/partitioned to manage sanely
Service management is largely done via our Datacenter Service Fabric
North America 1, North America n, Europe 1
DATACENTER AUTOMATION
13
Big Data, External Signals, System Signals, Access Approval, Auditing, Compliance, Changes, Safety, Orchestration, Repair, Machine Learning
We simplify by focusing all our work along the three pillars; these work in tandem to create a great service fabric.
This allows us to create a virtuous automation system that is SAFE and DATA DRIVEN while being AGILE at very high scale.
14
Orchestration: Central Admin (CA), the change/task engine for the service
Deployment/Patching: Build, System orchestration (CA) + specialized system and server setup
Monitoring: eXternal Active Monitoring (XAM): outside-in probes; Local Active Monitoring (LAM/MA): server probes and recovery; Data Insights (DI): system health assessment/analysis
Diagnostics, Perf: Extensible Diagnostics Service (EDS): perf counters; Watson (per server)
Data (Big, Streaming): Cosmos, Data Pumpers/Schedulers, Data Insights streaming analysis
On-call Interfaces: Office Service Portal, Remote PowerShell admin access
Notification/Alerting: Smart Alerts (phone, email alerts), on-call scheduling, automated alerts
Provisioning/Directory: Service Account Forest Model (SAFM) via AD and tenant/user addition/updates via Provisioning Pipeline
Networking: Routers, Load Balancers, NATs
New Capacity Pipeline: Fully automated server/device/capacity deployment
DATACENTER AUTOMATION
17
Multi-signal analysis, Data driven automation, Confidence in data
communicate, snooze, recover, block
AUTO AUTO
19
PARTITION
Office365.com
Each scenario tests each DB WW every ~5 mins, ensuring near continuous verification of availability
From two+ locations to ensure accuracy and redundancy in system
250 million test transactions per day to verify the service
Synthetics create a robust “baseline” or heartbeat for the service
NETWORK
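To make the scale of the synthetic probing concrete, here is a rough back-of-the-envelope sketch in Python. It assumes exactly two probe locations and an exact 5-minute interval (the slide says "two+" and "~5mins"), and the function names are hypothetical, not part of any Office 365 tooling.

```python
MINUTES_PER_DAY = 24 * 60


def probes_per_day(scenarios: int, databases: int,
                   locations: int = 2, interval_minutes: int = 5) -> int:
    """Test transactions per day if every scenario probes every DB
    from every location once per interval (assumed model)."""
    runs_per_day = MINUTES_PER_DAY // interval_minutes  # 288 runs at 5 minutes
    return scenarios * databases * locations * runs_per_day


def implied_scenario_db_pairs(total_per_day: int,
                              locations: int = 2,
                              interval_minutes: int = 5) -> float:
    """Invert the model: how many (scenario, DB) pairs would produce
    the given daily transaction count under the same assumptions."""
    runs_per_day = MINUTES_PER_DAY // interval_minutes
    return total_per_day / (locations * runs_per_day)
```

Under these assumptions a single (scenario, DB) pair accounts for 576 transactions per day, so 250 million transactions per day would correspond to several hundred thousand such pairs. The real figure depends on the actual number of locations and the probe cadence.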
23
Deviation from normal means something might be wrong
99.5% and 0.5% historical thresholds
Moving Average +/- 2 Standard Deviations
Methodology for data computed
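The "Moving Average +/- 2 Standard Deviations" rule can be sketched as follows. This is a minimal illustration, not the service's actual methodology: the window size, the use of the sample standard deviation, and the function names are all assumptions, and the 99.5%/0.5% historical thresholds are not modelled here.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def deviation_band(history: Sequence[float], window: int = 12,
                   k: float = 2.0) -> Tuple[float, float]:
    """Return (low, high) = moving average -/+ k standard deviations
    computed over the most recent `window` samples."""
    recent = list(history[-window:])
    if len(recent) < 2:
        raise ValueError("need at least two samples to estimate spread")
    centre = mean(recent)
    spread = stdev(recent)  # sample standard deviation (assumed choice)
    return centre - k * spread, centre + k * spread


def is_anomalous(value: float, history: Sequence[float],
                 window: int = 12, k: float = 2.0) -> bool:
    """True when the new value falls outside the band."""
    low, high = deviation_band(history, window, k)
    return value < low or value > high
```

For example, with a history hovering around 100, a new reading of 110 falls outside the band while 101 does not. A wider window smooths the baseline but reacts more slowly to genuine shifts.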
24
4:46 PM is when the alert was raised
This is 4:46 PM!
Allows us to inform customers in real time
Keeps engineers focused on recovery
Improves transparency with support and others who keep customers happy
27
CAPACITY
33
1) Run a simple patching command to initiate patching: Request-CAReplaceBinary
2) CA creates a patching approval request email with all relevant details
3) CA applies patching in stages (scopes) and notifies the requestor
4) Approver reviews scopes and determines if the patch should be approved
5) Once approved, CA will start staged rollout (first scope only contains 2 servers)
6) CA moves to the next stage ONLY if the previous scope has been successfully patched AND health index is high
7) Supports “Persistent Patching” mode
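The gating rule in steps 5 and 6 (advance only if the previous scope patched successfully and the health index is high) can be sketched in Python. This is an illustration under stated assumptions, not Central Admin's implementation: `staged_rollout`, `patch_server`, `health_index`, and the 0.95 threshold are all hypothetical.

```python
from typing import Callable, List, Sequence


def staged_rollout(scopes: Sequence[Sequence[str]],
                   patch_server: Callable[[str], bool],
                   health_index: Callable[[], float],
                   min_health: float = 0.95) -> List[int]:
    """Patch scopes in order and return the indices of scopes that completed.

    The rollout halts as soon as a server fails to patch or the health
    index read after a scope drops below `min_health` (assumed threshold).
    """
    completed: List[int] = []
    for index, scope in enumerate(scopes):
        # all() short-circuits, so patching stops at the first failed server.
        patched_ok = all(patch_server(server) for server in scope)
        if not patched_ok or health_index() < min_health:
            return completed  # halt: do not touch later scopes
        completed.append(index)
    return completed
```

A caller would pass a small first scope (the slide mentions two servers) followed by progressively larger ones, so that a bad patch is caught while its blast radius is still small.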
35
Old network design…To new:
38
What we are today is a mix of experimentation, learning from others and industry trends (and making a lot of mistakes!)
39
Product Team Service Operations Service Tier 2 Operations Tier 1 Operations Product Team Service Tier 1 Operations Service Product Team Operations Software Aided Processes Support Other Product Team
we are here
40
September People Impact
1. 176 unique on-calls were paged
2. 33 of them got > 15 pages (40% of pages)
3. 30 got >= 8 and <= 15 pages (35% of pages)
4. 113 got < 8 pages (15% of pages)