Machine Learning Turbo-Charges the Ops Portion of DevOps
Sampanna Salunke, Consulting Member Technical Staff
Oracle Management Cloud
March 2017

The Product Area I Work On
Our Vision: Complete, integrated suite of systems management solutions
Services: Application Performance Monitoring, Infrastructure Monitoring, Log Analytics, IT Analytics, Security Monitoring & Analytics, Orchestration, Compliance
Designed for heterogeneous applications and infrastructure
Rapid time to value
[Diagram label: On Premise]

Program Agenda
1 Defining terms
2 Why Machine Learning is Perfect for (Dev)Ops
3 Making Machine Learning Smarter
4 Q&A

Defining Terms (source: wikipedia.com)
Machine Learning: Machine learning is the subfield of computer science that gives computers the ability to learn without being explicitly programmed. Evolved from the study of pattern recognition and computational learning theory in artificial intelligence, machine learning explores the study and construction of algorithms that can learn from and make predictions on data.
DevOps: DevOps (a clipped compound of "software DEVelopment" and "information technology OPerationS") is a term used to refer to a set of practices that emphasize the collaboration and communication of both software developers and information technology (IT) professionals while automating the process of software delivery and infrastructure changes.

IT Organizations are Drowning in Data
  Too many tools
  Too much data
  No insight

Rate of Change Increasing Due to DevOps Automation
[Figure: Continuous Integration pipeline with stages Develop, Build, Package, Deploy]

Machine Learning is Perfect for (Dev)Ops
Structured, Time-Series Data:
  User Performance Metrics
  Server-side Performance Metrics (App & Infrastructure)
  Configurations
  Events/Alerts
  Transaction Payloads
Unstructured Text Data:
  Log Records
Data characteristics:
  Massive volume
  Highly patterned
  Predictable format
  Exists in identifiable silos
  Exhibits long-term trends
  Sources constantly change

Machine Learning Powers Oracle Management Cloud
Machine learning capabilities: Anomaly Detection, Clustering, Forecasting, Correlation
Tiers covered by the Unified Platform: Application, Middle Tier, Data Tier, Virtualization Tier (VM, Container), Infrastructure Tier
Data collected: Real Users, Synthetic Users, App metrics, Transactions, Server metrics, Diagnostics, Logs, Host metrics, VM metrics, Container metrics, CMDB, Tickets, Alerts

To us, “smarter” means 3 things…
  Enhance Algorithms
  Increase Breadth
  Increase Depth

Threshold Based Alerting is Being Eclipsed
Before you shout at me: threshold-based alerting is a must in many situations, especially for user-facing application response times (e.g., a page should always load in less than a second).
For everything else, the standard was to set thresholds manually or via percentile.
Manual setting is becoming increasingly impractical: what should the thresholds be, and who is going to set them?
Percentile-based alerting had its day, but it does not scale from an alert volume perspective.
  If alerts are set at the 99.9th percentile, then 1 million metrics produce 1,000 alerts per sample.
  If those metrics are sampled every 5 minutes, that is 1,000 alerts every 5 minutes,
  or 200 alerts per minute. NOT OK.
OMC, and indeed the industry, is incrementally replacing thresholds with high-low channels derived from a time-series model such as Holt-Winters.

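The alert-volume arithmetic on this slide can be checked with a few lines of Python. This is only a back-of-the-envelope sketch; the function name and parameters are illustrative and not part of OMC.

```python
def alert_volume(num_metrics, percentile, sample_interval_min):
    """Expected alerts when every metric alerts above the given percentile.

    Returns (alerts per sampling round, alerts per minute).
    """
    # Fraction of samples expected to exceed the percentile, e.g. 99.9 -> 0.001
    exceed_fraction = (100.0 - percentile) / 100.0
    per_sample = num_metrics * exceed_fraction
    return per_sample, per_sample / sample_interval_min


per_sample, per_minute = alert_volume(1_000_000, 99.9, 5)
print(round(per_sample), round(per_minute))  # 1000 200
```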
OMC’s Baselining & Anomaly Detection
Begin with the Basics:
  Distribution-Based Unseasonal Model
  Daily + Weekly Additive Holt-Winters Modeling
  Automatic Season Detection
Tune Based on Validation:
  Robust to Sparse Pattern Variability
  Robust to Small Anomalies
  Graceful Transition from Daily to Weekly
  Evaluation Model Segmentation
[Figure: CPU Utilization over Time, with these annotations]
  No seasonality detected.
  Daily seasonality detected. Baselines are wide because the metric has a weekly pattern.
  Weekly seasonality detected, and baselines are much tighter around the observed values.
  Anomalies because observations are higher than expected.
  Anomalies because observations are lower than expected.

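To make the idea of a high-low channel concrete, here is a minimal sketch of textbook additive Holt-Winters smoothing with a single season, plus a band of k standard deviations around the one-step-ahead prediction. It is not OMC's implementation: the smoothing constants, the band width `k`, and the synthetic hourly CPU series are all assumptions chosen for illustration.

```python
import math


def holt_winters_additive(series, season_len, alpha=0.3, beta=0.05, gamma=0.2):
    """One-step-ahead additive Holt-Winters predictions aligned with `series`.

    The first `season_len` entries are None because that season is used
    to initialize the level and seasonal components.
    """
    if len(series) < 2 * season_len:
        raise ValueError("need at least two full seasons")
    first = series[:season_len]
    second = series[season_len:2 * season_len]
    level = sum(first) / season_len
    trend = (sum(second) - sum(first)) / season_len ** 2
    seasonal = [x - level for x in first]
    preds = [None] * season_len
    for t in range(season_len, len(series)):
        s = seasonal[t % season_len]
        preds.append(level + trend + s)  # prediction made before seeing series[t]
        y = series[t]
        prev_level = level
        level = alpha * (y - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        seasonal[t % season_len] = gamma * (y - level) + (1 - gamma) * s
    return preds


def out_of_channel(series, preds, k=3.0):
    """Indices where the observation leaves the band pred +/- k * residual std."""
    residuals = [y - p for y, p in zip(series, preds) if p is not None]
    sigma = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    return [i for i, (y, p) in enumerate(zip(series, preds))
            if p is not None and abs(y - p) > k * sigma]


# Synthetic hourly CPU utilization with a daily cycle and one injected spike.
cpu = [50 + 20 * math.sin(2 * math.pi * h / 24) for h in range(24 * 7)]
cpu[100] += 40
flagged = out_of_channel(cpu, holt_winters_additive(cpu, season_len=24))
print(flagged)
```

Note that the spike also disturbs the model for a few steps afterwards, so points right after index 100 may be flagged as well. That is one reason a production system would want to be "robust to small anomalies" rather than update blindly on every sample.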
9x Improvement in False Positive Rate by Addressing Common Corner Cases
Graceful Day-to-Week Transition
  Before: Weekdays and weekends are allowed to be imbalanced.
  After: Select days to keep weekday-weekend balance.
Sparse Pattern Variability
  Before: Flagged as an anomaly due to load/measurement variability.
  After: Computing baselines at higher scale (hourly, configurable) solves this problem.
Small Anomalies
  Before: Anomalies are out-of-band samples.
  After: Anomalies are statistically significant out-of-band samples.

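The "statistically significant out-of-band samples" idea can be sketched as follows. The slide does not say which test OMC uses; this example assumes a simple binomial tail test, where `p_out` (the chance that a normal sample falls outside the band) and `alpha` (the significance level) are hypothetical parameters.

```python
from math import comb


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))


def significant_anomaly(samples, low, high, p_out=0.05, alpha=0.01):
    """True if the count of out-of-band samples is unlikely under normal behavior."""
    out_of_band = sum(1 for x in samples if x < low or x > high)
    return binomial_tail(len(samples), out_of_band, p_out) < alpha


# One stray sample in an hour of 5-minute readings is not significant...
one_blip = significant_anomaly([50] * 11 + [95], low=40, high=60)
# ...but half of the hour spent outside the band is.
sustained = significant_anomaly([50] * 6 + [95] * 6, low=40, high=60)
print(one_blip, sustained)  # False True
```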
Scalability
Incremental updates to baseline models
  Learning algorithms improve with more data.
  Storing months of data for millions of targets is expensive.
  Models are updated incrementally, so a model can reflect months of learning even when the stored data covers only a short duration.
Segmenting models when evaluating data
  Testing incoming data for anomalies needs to be fast.
  To speed up processing, models are cached.
  But time-series models such as Holt-Winters consume a lot of memory.
  To reduce memory costs, the model is segmented, and only the part of the model required for processing the current time is cached.

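Both scalability ideas can be illustrated with a small sketch. It assumes a much simpler model than Holt-Winters: a running mean and variance per hour-of-week, updated with Welford's online algorithm so that no raw history needs to be kept. The class names and the 168-segment layout are illustrative assumptions, not OMC's design.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Welford's online mean/variance: updated per sample, no history stored."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


class SegmentedBaseline:
    """One RunningStats per hour-of-week segment (168 segments in total)."""

    def __init__(self):
        self.segments = {}  # stand-in for a persistent model store

    @staticmethod
    def segment_key(hour_index):
        return hour_index % 168

    def update(self, hour_index, value):
        key = self.segment_key(hour_index)
        self.segments.setdefault(key, RunningStats()).update(value)

    def segment_for(self, hour_index):
        """Fetch only the segment needed to evaluate the current hour."""
        return self.segments.get(self.segment_key(hour_index))


baseline = SegmentedBaseline()
baseline.update(3, 10.0)    # hour 3 of week 1
baseline.update(171, 20.0)  # hour 3 of week 2 lands in the same segment
print(baseline.segment_for(3).mean)  # 15.0
```

An evaluator that only needs the current hour can cache the single segment returned by `segment_for` instead of the whole model.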
Baselining Laid Foundation for Early Warning
  Derivative of Baseline Algorithm
  Hybrid Long & Short Term Modeling
  Configurable Horizon & Sensitivity
  Sensitivity can be Controlled via Confidence
[Figure annotations]
  Forecast Mirrors Baseline when Observations are In Line with Expectation
  Forecast Becomes Baseline + Trend of Errors when Observations Deviate

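The "baseline + trend of errors" behavior can be sketched as below. This is an interpretation of the slide, not OMC's algorithm: it assumes the error trend is an ordinary least-squares line fitted to recent residuals (observation minus baseline) and extrapolated over the forecast horizon. Function names are hypothetical.

```python
def linear_fit(ys):
    """Ordinary least-squares slope and intercept of ys against x = 0..n-1."""
    n = len(ys)
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def early_warning_forecast(recent_obs, recent_baseline, future_baseline):
    """Future baseline plus the extrapolated trend of recent baseline errors."""
    errors = [o - b for o, b in zip(recent_obs, recent_baseline)]
    slope, intercept = linear_fit(errors)
    n = len(errors)
    return [b + intercept + slope * (n + h) for h, b in enumerate(future_baseline)]


# Observations in line with expectation: the forecast mirrors the baseline.
in_line = early_warning_forecast([10, 12, 11, 13], [10, 12, 11, 13], [12, 14])
# Observations drifting upward by 1 per step: the drift is carried forward.
drifting = early_warning_forecast([10, 11, 12, 13], [10, 10, 10, 10], [10, 10])
print(in_line, drifting)
```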
OMC’s Forecasting Capability
Begin with the Basics:
  Robust Linear Regression for Unseasonal
  Automatic Season Detection
  Tolerance Intervals
Tune Based on Validation:
  Season-Specific Trending-Uncertainty
  Regime Change Detection
  Seasonal Pattern Trending
  Temporal Weighting
[Figure: Traditional Linear Forecast compared with OMC]

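The slide does not name the robust regression method. As one well-known example of the family, here is a sketch of the Theil-Sen estimator, which takes the median of all pairwise slopes and is therefore not thrown off by a single outlier the way an ordinary least-squares fit would be.

```python
from itertools import combinations
from statistics import median


def theil_sen(ys):
    """Robust slope and intercept of ys against x = 0..n-1 (Theil-Sen)."""
    slopes = [(ys[j] - ys[i]) / (j - i)
              for i, j in combinations(range(len(ys)), 2)]
    slope = median(slopes)
    intercept = median(y - slope * x for x, y in enumerate(ys))
    return slope, intercept


# A clean trend of +2 per step with one wild outlier at x = 4.
slope, intercept = theil_sen([0, 2, 4, 6, 100, 10, 12])
print(slope, intercept)  # 2.0 0.0
```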
2x Improvement in Forecast Accuracy by Addressing Common Corner Cases
Season-Specific Trending-Uncertainty
  Low Seasons: Flat & Predictable
  High Seasons: Trend & Fluctuate
  Sparse High Seasons: Flat & Predictable
Regime Change Detection
  Before: Legacy Linear Fit
  After: Regime Change Identified

2x Improvement in Forecast Accuracy by Addressing Common Corner Cases
Seasonal Pattern Trending
  [Figure panels: Before, After]
Temporal Weighting
  Before: Un-Weighted
  After: Temporally Weighted

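Temporal weighting can be illustrated with a weighted least-squares line in which recent samples count more than old ones. The exponential decay and the `half_life` parameter are assumptions for this sketch; the slide does not describe the weighting scheme OMC uses.

```python
def weighted_linear_fit(ys, half_life):
    """Least-squares line through ys (x = 0..n-1) with exponentially decaying weights.

    The newest sample has weight 1; a sample `half_life` steps older has weight 0.5.
    """
    n = len(ys)
    weights = [0.5 ** ((n - 1 - x) / half_life) for x in range(n)]
    w_sum = sum(weights)
    x_mean = sum(w * x for x, w in enumerate(weights)) / w_sum
    y_mean = sum(w * y for w, y in zip(weights, ys)) / w_sum
    sxx = sum(w * (x - x_mean) ** 2 for x, w in enumerate(weights))
    sxy = sum(w * (x - x_mean) * (y - y_mean)
              for x, (w, y) in enumerate(zip(weights, ys)))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


# Flat for 20 steps, then growing by 1 per step for the last 10 steps.
history = [0] * 20 + list(range(1, 11))
unweighted_slope, _ = weighted_linear_fit(history, half_life=1e9)  # ~equal weights
weighted_slope, _ = weighted_linear_fit(history, half_life=3)
print(unweighted_slope, weighted_slope)
```

With nearly equal weights the long flat stretch drags the slope down, while the temporally weighted fit follows the recent growth much more closely.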
Data Unification & Normalization Enables Greater Breadth
Sources: Application Performance Monitoring, Security Monitoring & Analytics, Infrastructure Monitoring, Log Analytics, Orchestration, Compliance
All sources feed the Oracle Management Cloud Data Store.
The norm is repo-by-repo projects, which are slow and incremental.
By centralizing data, we are able to deliver ML-driven features more quickly.
Pipeline: Convert to Time Series (Clustering & Rollup), Base Lining & Anomaly Detection, IT Analytics

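The "Convert to Time Series (Clustering & Rollup)" step can be pictured with a toy example: group log records that differ only in variable tokens, then count each group per time bucket so the counts can be baselined like any other metric. The masking rule, the record format of (epoch seconds, message), and the one-hour bucket are assumptions for illustration only.

```python
import re
from collections import Counter


def log_signature(message):
    """Replace numbers and hex ids with a placeholder so similar records share a key."""
    return re.sub(r"\b(0x[0-9a-fA-F]+|\d+)\b", "<*>", message)


def rollup(records, bucket_seconds=3600):
    """Count records per (bucket start time, signature)."""
    counts = Counter()
    for ts, msg in records:
        bucket = ts - ts % bucket_seconds
        counts[(bucket, log_signature(msg))] += 1
    return counts


records = [
    (10, "Connection 42 timed out"),
    (20, "Connection 7 timed out"),
    (30, "Disk full"),
    (3700, "Connection 9 timed out"),
]
series = rollup(records)
print(series[(0, "Connection <*> timed out")])  # 2
```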
For More Information/Questions
cloud.oracle.com/management
community.oracle.com/mgmtcloud
blogs.oracle.com/cloud
#MgmtCloud
@OracleMgmtCloud