Device Failure Prediction


Device Failure Prediction
Ankit Maharia, Pulkit Kapoor, Sreyas Krishna Natarajan

Outline
- Problem Statement & Motivation
- Methodology/Implementation
- ML Results
- Future Scope
- Conclusion

Problem Statement
- Build a service to facilitate the creation of an open dataset of device health metrics.
- Predict failures using a public dataset.

Motivation
- Limited public dataset availability for researchers. Several academic papers have been published based on data sets obtained under NDA, but they provide only high-level guidance on which parts of the SMART data make good inputs for prediction models.
- Avoid degraded data redundancy or complete data loss. Complete data loss can occur in case of concurrent failures.
- Tune background scrub speed: adapt the scrub rate dynamically based on the predicted chance of encountering errors, rather than using one fixed scrub rate throughout.

Methodology
- A cloud service to which external services and storage systems can push their device metrics.
- Devices can include HDD, SSD, SAS and NVMe drives from different vendors.
- Each has a different mechanism for exposing similar metrics, although they are all broadly referred to as SMART metrics.

Service Dependencies
- Pecan: Python web framework for creating the REST API.
- Elasticsearch: stores the device metrics.
- MongoDB: stores the host id and host secret on the service side. MongoDB can be removed; it is only a double check while authenticating.

Flow - Registering a host
- Done via a POST API call to /register-host.
- Returns a unique host_id and host_secret, which are to be used when sending metrics.
- The client can store them in its persistent storage. For Ceph, we have stored them in the manager store. There is one manager store for the entire cluster, even though multiple manager processes can be running.
- What is a host? It provides a grouping of devices on the same machine.
- The host_id and host_secret are also stored in MongoDB for additional validation. This check can be safely removed, eliminating MongoDB entirely.
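The registration call above could be sketched from the client side as follows. This is a minimal sketch, not the service's actual client: the base URL, the empty request body, and the assumption that the response is a JSON object with `host_id` and `host_secret` keys are all illustrative. The request is only built, not sent.

```python
import json
import urllib.request


def build_register_request(base_url):
    """Build (but do not send) the POST request for /register-host.

    base_url is a placeholder; an empty body is an assumption.
    """
    url = base_url.rstrip("/") + "/register-host"
    return urllib.request.Request(url, data=b"", method="POST")


def parse_credentials(response_body):
    """Extract (host_id, host_secret) from a response body.

    Assumes the service answers with a JSON object holding both keys.
    """
    doc = json.loads(response_body)
    return doc["host_id"], doc["host_secret"]
```

A client would send the request once (for example with `urllib.request.urlopen`), then persist the returned pair, as the slide describes for the Ceph manager store.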

Flow - Sending Device Metrics
- Done via a POST API call to /store-device-metrics.
- The client sets host_id and host_secret in the request headers and posts the metrics payload.
- The service stores the payload in Elasticsearch, replacing the device serial number under smartctl_json with a SHA-1 hash.
- Speaker note: give an example of how to track a person.
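The two halves of this flow can be illustrated with a short sketch. The payload layout is an assumption: it supposes the serial number sits at `smartctl_json["serial_number"]`, as in smartctl's JSON output, and that the payload is sent as JSON. The function names and base URL are hypothetical, and the request is only built, not sent.

```python
import copy
import hashlib
import json
import urllib.request


def anonymize_serial(payload):
    """Return a copy of payload whose serial number is replaced by its SHA-1 hex digest.

    Assumes the serial lives at payload["smartctl_json"]["serial_number"].
    """
    out = copy.deepcopy(payload)
    smart = out.get("smartctl_json", {})
    serial = smart.get("serial_number")
    if serial is not None:
        smart["serial_number"] = hashlib.sha1(serial.encode("utf-8")).hexdigest()
    return out


def build_metrics_request(base_url, host_id, host_secret, payload):
    """Build (but do not send) the POST request for /store-device-metrics."""
    url = base_url.rstrip("/") + "/store-device-metrics"
    headers = {
        "host_id": host_id,          # credentials travel in the headers
        "host_secret": host_secret,
        "Content-Type": "application/json",
    }
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")
```

Because the hash is deterministic, the same drive keeps the same identifier across uploads, so its history can still be joined without storing the raw serial number.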

DEMO
- Demo 1
- Demo 2

CEPH INTEGRATION

FAILURE PREDICTION

SMART (Self-Monitoring, Analysis and Reporting Technology)
SMART is a monitoring system supported by most drives. It reports various indicators of drive health, including several types of errors, as well as operational data such as drive temperature and power-on hours.

Features

Features: Interesting
- SMART 5 (S5): count of reallocated sectors. When a read or a write operation on a sector fails, the drive will mark the sector as bad and remap (reallocate) it to a spare sector on disk.
- SMART 187 (S187): read errors that could not be recovered using ECC.
- SMART 197 (S197): count of "unstable" sectors. Some drives mark a sector as "unstable" following a failed read, and remap it only after waiting for a while to see whether the data can be recovered in a subsequent read or when it gets overwritten.
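Pulling these three attributes out of a stored record might look like the sketch below. It assumes the ATA layout of smartctl's JSON output, where attributes appear under `ata_smart_attributes.table` with an `id` and a `raw.value`. The feature names (`smart_5_raw` and so on) are hypothetical, and other device types such as NVMe expose different structures.

```python
# SMART attribute IDs highlighted on the slide.
FEATURE_IDS = (5, 187, 197)


def extract_features(smartctl_json):
    """Map each attribute of interest to its raw value, or None if the drive omits it."""
    table = smartctl_json.get("ata_smart_attributes", {}).get("table", [])
    raw_by_id = {row["id"]: row["raw"]["value"] for row in table}
    return {f"smart_{attr_id}_raw": raw_by_id.get(attr_id) for attr_id in FEATURE_IDS}
```

Returning `None` for a missing attribute keeps the feature vector the same width for every drive, leaving the choice of imputation to the model pipeline.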

Challenges
- How do we define a failure?
- Users rarely (if ever) upload data about failures, which leads to a lack of failure signal.
- If metrics for a device are not reported on a given day but were reported the previous day, the device can be assumed to have failed.
- If a device has moved from one host to another, it could be marked as a failure (fixed and attached back).
- Also, each party has its own notion of failure: vendor vs. user, different tests, different conditions, etc.
- Note: the last point is not yet implemented, as we add the Backblaze data under one host, but it can be implemented with ease.
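The two heuristics described here can be sketched in a few lines. The input shapes are assumptions made for illustration: `reports` maps a device id to the set of dates on which it reported, and `observations` is a list of `(day, host_id, device_id)` tuples. Neither reflects the service's actual Elasticsearch schema.

```python
from datetime import date, timedelta


def inferred_failures(reports, as_of):
    """Devices that reported on the day before as_of but not on as_of itself."""
    previous = as_of - timedelta(days=1)
    return sorted(
        dev for dev, days in reports.items()
        if previous in days and as_of not in days
    )


def moved_devices(observations):
    """Devices that have been seen under more than one host."""
    hosts_by_device = {}
    for _day, host_id, device_id in observations:
        hosts_by_device.setdefault(device_id, set()).add(host_id)
    return sorted(dev for dev, hosts in hosts_by_device.items() if len(hosts) > 1)
```

Both rules produce only a proxy label: a device that was powered off or decommissioned would also stop reporting, which is part of why defining failure is listed as a challenge.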

Prediction Pipeline
- Process the metrics dumped to the Elasticsearch index on a daily basis.
- Add a failure signal.
- Check multipath.
- Flow validated using Backblaze data.

Validation setup
- We used the Backblaze dataset from 2013 to 2016.
- Data was sampled to a 3:2 ratio (label 0 : label 1).
- Negative label (0): device did not fail.
- Train on 2013 through 2016 Q3; test on 2016 Q4.
- Failure samples: 3250 in train, 350 in test.
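One way to obtain such a 3:2 class ratio and a time-based split is sketched below. The slides do not say how the sampling was done, so random downsampling of the negative class is an assumption, as are the row layout (`label` and an ISO-formatted `date` string) and the function names.

```python
import random


def downsample_negatives(rows, neg_per_pos=1.5, seed=0):
    """Keep all failures (label 1) and at most neg_per_pos healthy rows (label 0) per failure."""
    positives = [r for r in rows if r["label"] == 1]
    negatives = [r for r in rows if r["label"] == 0]
    keep = min(len(negatives), int(len(positives) * neg_per_pos))
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return positives + rng.sample(negatives, keep)


def split_by_date(rows, test_start):
    """Rows dated before test_start go to train, the rest to test.

    ISO date strings (YYYY-MM-DD) compare correctly as plain strings.
    """
    train = [r for r in rows if r["date"] < test_start]
    test = [r for r in rows if r["date"] >= test_start]
    return train, test
```

Splitting by date rather than at random mirrors the setup on the slide (train through 2016 Q3, test on 2016 Q4) and avoids training on observations that postdate the test period.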

Results
Best model: Random Forest

Model  Precision  Recall  F1
1      0.75       0.67    0.71
2      0.55       0.83    0.66

- Precision measures how accurate the model's failure predictions are (what fraction of predicted failures were actual failures).
- Recall is the coverage of failures (what percentage of actual failures we were able to predict).
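The F1 column follows from the other two, since F1 is the harmonic mean of precision and recall. The sketch below uses the standard definitions; the function names are illustrative, and the confusion-matrix counts behind the reported numbers are not given in the slides.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Applying `f1_score` to the two reported rows and rounding to two decimals reproduces the F1 values in the table (0.71 and 0.66).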

Future Scope
- Dockerize the service (Done!)
- REST API for prediction
- Better machine learning models
- Script to publicly release the dataset

Questions?

Thank You!