
Device Failure Prediction



Presentation on theme: "Device Failure Prediction"— Presentation transcript:

1

2 Device Failure Prediction
Ankit Maharia Pulkit Kapoor Sreyas Krishna Natarajan

3 Outline
Problem Statement & Motivation
Methodology / Implementation
ML Results
Future Scope
Conclusion

4 Problem Statement Build a service to facilitate the creation of an open dataset of device health metrics, and perform failure prediction using the public dataset.

5 Motivation
Limited public dataset availability for researchers. Several academic papers have been published based on such datasets under NDA, but they provide only high-level guidance on which parts of the SMART data make good inputs for prediction models.
Avoid degraded data redundancy or complete data loss: concurrent device failures can cause complete data loss.
Tuning background scrub speed: adapt the scrub rate dynamically based on the predicted chance of encountering errors, rather than using one fixed scrub rate throughout.

6 Methodology A cloud service where external services/storage systems can push their device metrics. Devices can include HDD, SSD, SAS, and NVMe drives from different vendors. Each has different mechanisms for exposing similar metrics, although they are all broadly referred to as SMART metrics.

7 Service Dependencies
Pecan: Python web framework used to create the REST API.
Elasticsearch: stores the device metrics.
MongoDB: stores the host_id and host_secret on the service side. MongoDB can be removed; it is only a double check during authentication.

8 Flow - Registering a host
Done via a POST API call to /register-host. Returns a unique host_id and host_secret, which are to be used when sending metrics. The client can store them in its persistent storage; for Ceph, we store them in the manager store. What is a host? It provides a grouping of devices on the same machine. There is one manager store for the entire cluster, even though multiple manager processes can be running. The host_id and host_secret are also stored in MongoDB for additional validation; this check can be safely removed, eliminating MongoDB entirely.
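The registration flow above can be sketched as a minimal client. The /register-host endpoint and the host_id/host_secret fields come from the slide; the base URL and the exact JSON shape of the response are assumptions for illustration only.

```python
import json

# Hypothetical base URL; the real service address is not given in the slides.
SERVICE_URL = "https://device-metrics.example.com"

def build_register_request():
    """Build the POST request for /register-host (no credentials needed yet)."""
    return {
        "method": "POST",
        "url": SERVICE_URL + "/register-host",
        "headers": {"Content-Type": "application/json"},
    }

def store_credentials(response_body, store):
    """Persist the returned host_id/host_secret in the client's own
    persistent storage (Ceph keeps them in the cluster-wide manager store)."""
    creds = json.loads(response_body)
    store["host_id"] = creds["host_id"]
    store["host_secret"] = creds["host_secret"]
    return store
```

The credentials are written once at registration time and then reused for every subsequent metrics upload.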

9 Flow - Sending Device Metrics
Done via a POST API call to /store-device-metrics. The client sets host_id and host_secret in the request headers and posts a payload like the one below. The service stores the payload in Elasticsearch, replacing the device serial number under smartctl_json with a SHA-1 hash, so a raw serial number cannot be used to track a device back to a person.
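The anonymization step can be sketched as follows. The SHA-1 replacement under smartctl_json comes from the slide; the field name serial_number is an assumption based on smartctl's JSON output.

```python
import copy
import hashlib

def anonymize_payload(payload):
    """Return a copy of the metrics payload in which the device serial
    number under smartctl_json is replaced by its SHA-1 hex digest, so
    the stored metrics cannot be traced back to a physical device."""
    out = copy.deepcopy(payload)
    serial = out["smartctl_json"]["serial_number"]
    out["smartctl_json"]["serial_number"] = hashlib.sha1(
        serial.encode("utf-8")).hexdigest()
    return out
```

Hashing is deterministic, so the same physical device still maps to the same (opaque) identifier across daily uploads.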

10 DEMO - Demo 1 - Demo 2

11 CEPH INTEGRATION

12

13

14 FAILURE PREDICTION

15 SMART (Self-Monitoring, Analysis and Reporting Technology)
SMART is a monitoring system supported by most drives that reports various indicators of drive health, including various types of errors as well as operational data such as drive temperature and the drive's power-on hours.

16 Features

17 Features: Interesting
SMART 5 (S5): count of reallocated sectors. When a read or write operation on a sector fails, the drive marks the sector as bad and remaps (reallocates) it to a spare sector on disk.
SMART 187 (S187): read errors that could not be recovered using ECC.
SMART 197 (S197): count of "unstable" sectors. Some drives mark a sector as "unstable" following a failed read, and remap it only after waiting a while to see whether the data can be recovered by a subsequent read or when it gets overwritten.
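Pulling these three attributes out of `smartctl --json` output could look like the sketch below. The `ata_smart_attributes.table` layout follows smartctl's JSON schema, but treat the details as an assumption rather than part of the original service.

```python
# SMART attribute IDs highlighted on the slide: 5, 187, 197.
INTERESTING_IDS = {5, 187, 197}

def extract_features(smartctl_json):
    """Collect raw values of the interesting SMART attributes from a
    smartctl --json style document (ata_smart_attributes.table)."""
    features = {}
    table = smartctl_json.get("ata_smart_attributes", {}).get("table", [])
    for attr in table:
        if attr["id"] in INTERESTING_IDS:
            features["smart_%d_raw" % attr["id"]] = attr["raw"]["value"]
    return features
```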

18 Challenges How do we define a failure?
Users rarely (effectively never) upload data about failures, which leads to a lack of failure signal. Heuristics:
If metrics for a device were reported on one day but not the next, the device can be assumed to have failed.
If a device has moved from one host to another, it could be marked as a failure (fixed and attached back elsewhere).
Also, each party has its own notion of failure: vendor vs. user, different tests, different conditions, etc.
The last heuristic is not yet implemented (we add the Backblaze data under one host), but it can be implemented with ease.
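The first heuristic above (reported yesterday, silent today) can be sketched as a simple predicate; the set-of-dates representation is an assumption for illustration.

```python
from datetime import date, timedelta

def assumed_failed(report_dates, day):
    """First heuristic from the slide: a device is assumed to have
    failed on `day` if it reported metrics on the previous day but
    not on `day` itself. report_dates is the set of dates on which
    metrics were received for the device."""
    return (day - timedelta(days=1)) in report_dates and day not in report_dates
```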

19 Prediction Pipeline
Metrics dumped to the Elasticsearch index are processed on a daily basis.
A failure signal is added.
Multipath is checked.
The flow was validated using Backblaze: we used the Backblaze dataset from 2013 to 2016. Data was sampled to a 3:2 ratio (label 0 : label 1). Train failure samples: 3250; test failure samples: 350. Trained on Q3, tested on 2016 Q4. Negative label (0): the device did not fail.
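The 3:2 sampling step can be sketched as downsampling the majority (non-failure) class; the (features, label) row representation is an assumption, not the pipeline's actual data format.

```python
import random

def downsample(rows, neg_per_pos=(3, 2), seed=0):
    """Downsample negatives so that label 0 : label 1 matches the 3:2
    ratio used on the slide. rows is a list of (features, label) pairs
    with label 1 meaning 'device failed'."""
    pos = [r for r in rows if r[1] == 1]
    neg = [r for r in rows if r[1] == 0]
    want_neg = len(pos) * neg_per_pos[0] // neg_per_pos[1]
    rng = random.Random(seed)       # fixed seed for a reproducible sample
    rng.shuffle(neg)
    return pos + neg[:want_neg]
```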

20 Results
Best model: Random Forest

Model  Precision  Recall  F1
1      0.75       0.67    0.71
2      0.55       0.83    0.66

Precision measures how accurate the model's failure predictions are. Recall is the coverage of failures (what percentage of actual failures we were able to predict).
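The metrics in the table follow the standard definitions; a minimal sketch of computing them for the positive (failure) class:

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the positive class (label 1 = failure)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Note the F1 columns above are consistent with this formula, e.g. 2 * 0.75 * 0.67 / (0.75 + 0.67) ≈ 0.71.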

21 Future Scope
Dockerize the service (Done!)
REST API for prediction
Better machine learning models
Script to publicly release the dataset

22 Questions?

23 Thank You!

