Overview of Research in Dependable Computing Systems Lab

Saurabh Bagchi ECE & CS Purdue University Last updated: June 2016

What is Dependable Computing?
We need computer systems that we can depend on in the face of:
  Naturally occurring faults: hardware malfunction, software bugs
  Malicious intrusions: insider attack or external adversaries
Dependability: the property that the system continues to provide functionality, within time bounds, despite the above kinds of failures.
What do we do at DCSL?
  Build dependability mechanisms in software, at the application or middleware level
  Introduce them in practical systems
  Evaluate them in practical settings, with accelerated error rate

Who Are We?
14 graduate students
2 research scientists
We are part of two centers: CERIAS and GE-PRIAM
Our support comes from a variety of sources:
  Federal government: NSF, Department of Defense, Department of Energy, NIH
  Private organizations: IBM, Northrop Grumman, Lawrence Livermore, Google, AT&T

Thrust #1: Distributed Intrusion Tolerant System
Distributed systems are subjected to malicious attacks on their services, including hitherto unknown attacks.
Objective is to tolerate intrusions, not just detect them.
Different phases: Detection, Diagnosis, Containment, Response.
Solution approach:
  Attack graph based modeling of multi-stage attacks
  Algorithms for containment and survivability computation and semi-optimal response selection
  Provides situational awareness in the face of imperfect observability
  Dealing with zero-day attacks
Work supported by: Missile Defense Agency (MDA), Northrop Grumman, NSF

[Figure: attack graph diagram; surviving labels: Hacker, S4, S3]
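The attack-graph idea can be illustrated with a minimal sketch. It is not the lab's algorithm: the graph, the node names S1 to S4, and the "isolate one node" response policy are all assumptions made up for illustration. The sketch computes which services an attacker can reach through multi-stage attacks, then picks the single node whose containment leaves the fewest services exposed.

```python
from collections import deque

# Hypothetical attack graph: an edge A -> B means an attacker who controls A
# can attempt to compromise B. Node names are illustrative only.
ATTACK_GRAPH = {
    "hacker": ["S1", "S2"],
    "S1": ["S3"],
    "S2": ["S3"],
    "S3": ["S4"],
    "S4": [],
}


def reachable(graph, start, blocked=frozenset()):
    """Return the nodes reachable from start, never entering blocked nodes."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen and nxt not in blocked:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {start}


def best_single_containment(graph, attacker, candidates):
    """Pick the one node whose isolation leaves the fewest nodes exposed.

    Ties are broken alphabetically so the result is deterministic.
    """
    return min(
        candidates,
        key=lambda node: (len(reachable(graph, attacker, blocked={node})), node),
    )


exposed = reachable(ATTACK_GRAPH, "hacker")
choice = best_single_containment(ATTACK_GRAPH, "hacker", ["S1", "S2", "S3", "S4"])
print(sorted(exposed), choice)
```

In this toy graph, isolating S3 cuts the only path to S4, so it is the preferred containment point. A realistic response selector would also weigh the cost of taking a service offline, which this sketch ignores.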

Thrust #2: Debugging Large-Scale Distributed Applications
Goal is to provide highly available applications (e.g., web services) in a distributed environment:
  Perform failure prediction
  Perform bug localization
Challenges in today's distributed systems:
  Large number of entities and large amount of data
  Interactions between entities cause error propagation
  High throughput or low latency requirements for the application
Work supported by: NSF, AT&T, Lawrence Livermore National Lab (LLNL)

Thrust #2: Predictive Reliability Engine for Cellular Networks
[Figure: predictive reliability engine architecture. Surviving labels: ISP Backbone, eNodeB, MME, S/P-Gateway, IMS Core, P-GW VM, Extreme Cloud Site, Handoff thresholds, Call pause, Online data, Offline data (GPEH), Offline Training, Online predictor, ML Classifier]

Thrust #2: Metric-based Bug Localization
Application: request rate, transactions, DB reads/writes, etc.
Middleware: virtual machine and container statistics
Operating System: CPU, memory, I/O, network statistics
Hardware: CPU performance counters
Tivoli Operations Manager
How can we use these metrics to localize the root cause of problems?

Thrust #2: Metric-based Bug Localization
Look for abnormal time patterns
Pinpoint code regions that are correlated with these abnormal patterns
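The two steps above can be sketched as follows. This is an illustration under stated assumptions, not the lab's tool: the z-score test against a baseline window, the Pearson correlation ranking, the metric values, and the region names `parse_request` and `cache_insert` are all hypothetical.

```python
from statistics import mean, pstdev


def abnormal_windows(series, baseline_len, threshold=3.0):
    """Return indices whose value lies more than `threshold` baseline
    standard deviations away from the baseline mean."""
    base = series[:baseline_len]
    mu, sigma = mean(base), pstdev(base)
    if sigma == 0:
        return [i for i, v in enumerate(series) if v != mu]
    return [i for i, v in enumerate(series) if abs(v - mu) / sigma > threshold]


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    norm = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return cov / norm if norm else 0.0


def rank_regions(metric, region_activity):
    """Order code regions by how strongly their activity tracks the metric."""
    return sorted(
        region_activity,
        key=lambda name: abs(pearson(metric, region_activity[name])),
        reverse=True,
    )


# Made-up data: one memory metric and per-window call counts for two regions.
memory_mb = [100, 102, 99, 101, 100, 180, 190, 185]
region_activity = {
    "parse_request": [5, 6, 5, 6, 5, 6, 5, 6],
    "cache_insert": [1, 1, 1, 1, 1, 9, 10, 9],
}

print(abnormal_windows(memory_mb, baseline_len=5))
print(rank_regions(memory_mb, region_activity))
```

With this data, the last three windows are flagged as abnormal and `cache_insert` ranks first, since its activity rises in the same windows as the memory metric. Correlation only suggests where to look; it does not by itself establish the root cause.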

Thrust #2: Debugging at Extreme Scales
Problem Statement:
  Applications are being run at large scales: a large number of processes (such as Hadoop clusters) and large amounts of data (such as computational genomics applications)
  A correctness or performance problem does not show up at small scales, but shows up at large scales
  Current debugging techniques cannot handle such problems
[Figure: number of times a loop executes vs. scale, for training runs and production runs. Accounting for scale makes trends clear, errors at large scales obvious.]
[Figure: number of times a loop executes vs. run number, for training runs and production runs. Is there a bug in one of the production runs?]
Current debugging techniques work process by process, are centralized rather than distributed, and are offline.
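The point that accounting for scale makes errors obvious can be sketched with a simple model. The assumptions here are mine, not the slides': loop counts are taken to grow linearly with scale, the model is an ordinary least-squares line fitted on small training runs, and the 20% tolerance and all run data are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("training runs must cover at least two distinct scales")
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def flag_runs(training, production, tolerance=0.2):
    """Flag production runs whose loop count deviates from the scale-based
    prediction by more than `tolerance` (as a fraction of the prediction)."""
    slope, intercept = fit_line(
        [scale for scale, _ in training], [count for _, count in training]
    )
    flagged = []
    for scale, count in production:
        expected = slope * scale + intercept
        if abs(count - expected) > tolerance * abs(expected):
            flagged.append((scale, count))
    return flagged


# Made-up (scale, loop count) pairs: small training runs, large production runs.
training = [(4, 40), (8, 80), (16, 160), (32, 320)]
production = [(256, 2560), (512, 5120), (1024, 6000)]

print(flag_runs(training, production))
```

Looking at raw loop counts alone, 6000 is the largest value and looks unremarkable next to 5120. Once the expected count at scale 1024 is extrapolated from the training runs, the run stands out as far below the trend.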

Thrust #3: Dependable Embedded Wireless Network
Embedded wireless networks have fundamental resource constraints and are often deployed in hostile or uncertain environments.
Constraints include: Energy, Bandwidth, Untrusted nodes, Disconnected networks
Goal: Develop middleware that provides a robust platform, keeping environment constraints in mind
  Provide detection, diagnosis, and isolation functionality
Our sample application areas: Precision Agriculture, Industrial Monitoring, Smart Grids, Smart Buildings, Surveillance

Thrust #3: Dependable Embedded Wireless Network
Solution directions:
  Record and Replay:
    Software for tracing events on the wireless node
    Replaying traced events on a lab server, with debugging support from the replayed trace
    Fastest reprogramming protocol to upload a patch to the network while nodes are deployed in the field
  Game theoretic security:
    Game theoretic formulation of adversary and defender incentives
    Model incorporates multiple stakeholders with partially aligned interests
    Probabilistically optimal investment of fixed defense budget
  Randomization-based security:
    Randomize the binary so that an adversary cannot reverse engineer it and inject attacks
    Place canaries strategically in control and data regions
    Randomize placements of functions and data
Work supported by: NSF, Sandia
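As a toy illustration of the randomization-based direction, the sketch below shuffles the placement of functions in a simulated memory layout and puts a canary slot after each one. Everything in it is hypothetical: the function names and sizes, the base address, the 4-byte canary, and the simplification of placing a canary after every function rather than choosing positions strategically.

```python
import random

CANARY_SIZE = 4  # bytes; hypothetical value for illustration


def randomized_layout(functions, base=0x1000, seed=None):
    """Place functions in random order, each followed by a canary slot.

    `functions` maps a function name to its size in bytes.
    Returns a list of (address, name, size) entries in address order.
    """
    rng = random.Random(seed)
    names = list(functions)
    rng.shuffle(names)
    layout, addr = [], base
    for name in names:
        layout.append((addr, name, functions[name]))
        addr += functions[name]
        layout.append((addr, "<canary>", CANARY_SIZE))
        addr += CANARY_SIZE
    return layout


# Made-up function sizes for a small sensor-node binary.
functions = {"read_sensor": 64, "send_packet": 128, "sleep_node": 32}

for address, name, size in randomized_layout(functions, seed=1):
    print(hex(address), name, size)
```

Each call with a different seed yields a different ordering, so an address learned from one node's binary need not hold on another. A real system would randomize the compiled binary itself; this sketch only models the resulting address map.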

Interested? Want to learn more?
PI: Saurabh Bagchi
https://engineering. purdue
