Building Hyperscale IOT Services With TLA+

Presentation transcript:

Building Hyperscale IOT Services With TLA+. Hi, I am Vaibhav from Azure IoT. Today I want to talk about how we use TLA+ to build cloud solutions that operate at IoT scale.

Things, IoT Hub, Insights, Actions. Let's first understand what IoT is and what Azure IoT offers. Billions of devices connect to the cloud today. These devices could be smart bulbs, smart plugs, modern cars, security cameras, or oil rigs. After connecting to the cloud, they send massive amounts of data. The data is enriched and transformed, and then it is handed over to the storage layer. From there the data is processed by AI models, rule engines and other services. The processing results in an action that goes back to the device through the same channel.

IoT Example: Intruder Detection. Data Source: Camera. Enrichment: Add Location. AI: Identify Objects. Rule: Detect Intruder. Action: Warn Intruder. Send Command Back. Let me show a real example to explain how it works. Let's say a security camera sends an image. Enrichment adds the camera location details to the data. An AI model identifies objects in the image. A rule engine flags a foreign object as an intruder and decides to warn the intruder. The command to run the warning goes back to the device through the same channel.
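The stages on this slide can be sketched as a chain of small functions. This is only an illustration: the names `Telemetry`, `DEVICE_LOCATIONS`, `enrich`, `identify_objects`, `intruder_rule` and the `"play-warning"` command are hypothetical, and the "AI model" is a stand-in that reads labels from the payload rather than a real classifier.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Telemetry:
    device_id: str
    image: bytes
    properties: Dict[str, str] = field(default_factory=dict)
    objects: List[str] = field(default_factory=list)


# Hypothetical lookup table used by the enrichment step.
DEVICE_LOCATIONS = {"cam-1": "warehouse-door"}


def enrich(msg: Telemetry) -> Telemetry:
    """Enrichment: add the camera location to the message."""
    msg.properties["location"] = DEVICE_LOCATIONS.get(msg.device_id, "unknown")
    return msg


def identify_objects(msg: Telemetry) -> Telemetry:
    """Stand-in for the AI model: the fake image is a comma-separated label list."""
    msg.objects = msg.image.decode().split(",")
    return msg


def intruder_rule(msg: Telemetry) -> Optional[str]:
    """Rule engine: return a command for the device, or None if nothing to do."""
    return "play-warning" if "person" in msg.objects else None


def process(msg: Telemetry) -> Optional[str]:
    """Run the stages in slide order; the result is sent back to the device."""
    return intruder_rule(identify_objects(enrich(msg)))
```

Calling `process(Telemetry("cam-1", b"person,car"))` yields the warning command, while an image without a person yields `None`.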

Exponential growth. IoT is Growing: Billions of devices connect to the cloud and IoT is still growing exponentially. Cellular IoT connections are expected to reach 3.5B in 2023. [Forbes] Rare is Relative: As scale grows, we gradually start hitting rare race conditions and failure sequences that used to be unlikely. We can't test for every permutation of failures and race conditions, so we use TLA+ to detect and eliminate them during design. As I said earlier, billions of devices connect to the cloud, sending a large volume of data. Projections show exponential growth in the future.

Model Checking A Design In The Product Lifecycle: Collect and Prioritize Requirements; Define Abstract Solution; TLA+ Model Check; Work Estimate; Implement, Test, Deploy Cycles. Every project starts with understanding the requirements. Once we have a clear set of requirements, we do an abstract design where we identify the components we need to build, the components we need to modify, and the interactions between them. At this point we do the TLA+ modeling and update the design as needed to get a verified model. Then we do the estimates, and if the project is funded it goes through the regular implement, test, deploy cycle.

Model Checking A Design: Cost, Skill and Process. Project Overhead: Less than 5% of overall project time. Prevents expensive post-production issues. Skill Set: One engineer proficient in TLA+ in a team of five. Others, including managers, review it. We model what we change, not the whole system.

Model Checking A Design: Scope of Model. TLA+ covers: the algorithm and cross-component interactions; async interactions, states and state transitions. Traditional tests cover: the implementation of each component; mostly synchronous operations modeled as actions.

Example projects. Cross Region Failover: Azure IoT Hub allows cross-region failovers. The design is model checked to ensure RPO and RTO guarantees. Message Routing: The IoT telemetry message routing system enriches and transforms data. New routing features are model checked. Device Cache: The Azure in-memory device store is backed by Azure Storage as the durable store. Every aspect of the design is model checked.

Example Project: Device State Notifications. * Some details are excluded to maintain confidentiality. This was all theory. Let's look at a real example now. Let's see what the problem was, what we modeled, and what we verified. Let's look at the full evolution of the design.

Device State Notifications. Core problem: Devices (like security cameras) can connect to and disconnect from the Azure IoT cloud. Allow customers (on a cellphone app or a backend service) to receive notifications when the device status changes. Duplicates are allowed; ordering is not guaranteed. Scale: Millions of devices can connect or disconnect at any time. Thousands of servers can fail over at the same time. The notification system as well as the backend database wouldn't operate at this scale. When a security camera connects to the cloud it is online. When it disconnects it is offline. We wanted the customers to know about these events. As notifications can arrive out of order, we also wanted to put a sequence number in notifications, so that the customer can know the real order of events. For example, if the camera connected and then disconnected, it is currently offline, but an out-of-order delivery would make the customer think it is online. On its own, this is a fairly simple problem to solve. We can use a two-phase approach, where we first update the database with the device state change, then send the notification, and then update the database again to record that the notification has gone out. Scale makes it a complex problem. Millions of devices can connect and disconnect at the same time, and thousands of servers can fail at once. We cannot overwhelm the backend database or the notification system during these events. We also cannot reduce the availability of device connect by putting a database or notification call in the critical path. This makes it a much harder problem to solve.
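To make the role of the sequence number concrete, here is a minimal client-side sketch, assuming each notification carries a device id, a state and a sequence number that increases with every device event. The names `Notification` and `DeviceStatusView` are hypothetical and not part of any Azure SDK.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Notification:
    device_id: str
    state: str     # "Connected" or "Disconnected"
    sequence: int  # assumed to increase with every device event


class DeviceStatusView:
    """Client-side view that tolerates duplicate and out-of-order delivery."""

    def __init__(self) -> None:
        self._latest: Dict[str, Notification] = {}

    def apply(self, n: Notification) -> bool:
        """Apply a notification; return True only if it changed the view."""
        current = self._latest.get(n.device_id)
        if current is not None and n.sequence <= current.sequence:
            return False  # duplicate or stale: a newer event was already seen
        self._latest[n.device_id] = n
        return True

    def state_of(self, device_id: str) -> Optional[str]:
        n = self._latest.get(device_id)
        return n.state if n is not None else None
```

With this view, a late "Connected 1" that arrives after "Disconnected 2" is ignored, so the client still sees the camera as offline.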

Rudimentary Approach. (Diagram labels: Device, Client, Topic; Connect 1, Disconnect 2; Primary Failed, Standby, Primary (New); Database: Disconnected 0, Connected 1, Disconnected 2.) Let's look at the rudimentary approach we started with. Any time the device state changes, the service first notifies the client and then, at some point after sending the notification, updates the DB. If the service fails, the standby takes over and resumes the operation from there.

Rudimentary Approach – Resend. (Diagram labels: Device, Client, Topic; Connect 1, Disconnect 2, with Disconnect 2 repeated in the second build of the slide; Primary Failed, Standby, Primary (New); Database: Disconnected 0, Connected 1, Disconnected 2.) Let's look at a failure scenario. If the service fails before sending the notification, things work fine: the new primary looks at the DB and the device state, and it sends the notification if needed.

Rudimentary Approach – Failure. (Diagram labels: Device, Client, Topic; Disconnect 1, Connect 2; Primary Failed, Standby, Primary (New); Database: Connected 0, Connected 0, Disconnected 1.) Now let's look at a race condition that can happen.

Device State Notifications. Components: Device; Service/In-memory; Database/Persistence; Client. States: Connected/Disconnected; Notified/Pending; Last notification; Notifications received.

TLA+. Components: Device; Service/In-memory; Database/Persistence; Client/Notified-State. States: Connected/Disconnected; Notified/Pending; Last notification; Notifications received. The TLA+ model has four transitions, shown on the right. At any time the device can connect or disconnect, the notification can go out, the DB can get updated, or the partition can move. On the left side I have the data structures representing each state.

Invariant. When we run the model with the invariant saying the client must get notified about the eventual state of the device, we see a failure.
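The actual TLA+ specification is not part of this transcript. As a rough illustration of what such a check does, the sketch below explores a deliberately simplified model of the rudimentary approach in Python. Everything in it is an assumption made for illustration: the `State` fields, the transition names, the bounds on device changes and failovers, and the way the invariant is approximated ("whenever no work is pending, the client's last notification matches the device state"). The race it finds is one possible race in this toy model, not necessarily the one on the slide.

```python
from collections import deque
from typing import Dict, Iterator, List, NamedTuple, Optional, Tuple


class State(NamedTuple):
    device: bool          # True = connected
    client: bool          # last state the client was notified of
    db: bool              # last state persisted in the database
    pending_notify: bool  # in-memory: a notification still has to be sent
    pending_db: bool      # in-memory: a database write still has to happen
    changes_left: int     # bound on connect/disconnect events
    failovers_left: int   # bound on partition moves


def next_states(s: State) -> Iterator[Tuple[str, State]]:
    if s.changes_left > 0:  # device connects or disconnects
        yield "DeviceChange", s._replace(
            device=not s.device, pending_notify=True,
            changes_left=s.changes_left - 1)
    if s.pending_notify:    # notify first, write the DB later
        yield "Notify", s._replace(
            client=s.device, pending_notify=False, pending_db=True)
    if s.pending_db:
        yield "UpdateDb", s._replace(db=s.client, pending_db=False)
    if s.failovers_left > 0:
        # In-memory state is lost; the new primary compares DB and device.
        yield "Failover", s._replace(
            pending_notify=(s.db != s.device), pending_db=False,
            failovers_left=s.failovers_left - 1)


def invariant(s: State) -> bool:
    """If nothing is pending, the client must know the device's state."""
    return s.pending_notify or s.pending_db or s.client == s.device


def find_violation(initial: State) -> Optional[List[str]]:
    """Breadth-first search; returns the shortest violating action trace."""
    parents: Dict[State, Optional[Tuple[str, State]]] = {initial: None}
    queue = deque([initial])
    while queue:
        s = queue.popleft()
        if not invariant(s):
            trace: List[str] = []
            step = parents[s]
            while step is not None:
                action, prev = step
                trace.append(action)
                step = parents[prev]
            return trace[::-1]
        for action, t in next_states(s):
            if t not in parents:
                parents[t] = (action, s)
                queue.append(t)
    return None
```

Starting from `State(True, True, True, False, False, 2, 1)`, the search reports the trace DeviceChange, Notify, DeviceChange, Failover: the client was told "disconnected", the device reconnected, and the new primary sees the DB and the device agree, so it never sends the final "connected" notification.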


Invariant Failure

Fix. Generate Sequence Numbers: The service assigns a sequence number and stores the event in the DB. We use an algorithm for in-memory sequence number generation. Checkpoint: A background process sends the notification and tracks the sequence number processed. Failover: On failover, the new service resumes from where the last one left off.

Sophisticated Solution – Simple. (Diagram labels: Device, Client, Topic; Connect 1, Disconnect 2; Primary Failed, Standby, Primary (New); Database: Disconnected 0, Connected 1, Disconnected 2; sequence numbers 1, 2.) Let's look at the fix we applied for this issue. We have a high-water mark and a hi-lo sequence generator per service instance. We assign the next sequence number to the next device event and keep track of the sequence number up to which notifications have gone out. When a device connects, the service updates the DB, then sends the notification, and at some point adds the checkpoint saying which notifications have gone out.
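The talk does not show the generator itself. As a generic illustration of the hi-lo technique, the sketch below reserves a block of numbers from a durable store once per block and hands out the numbers inside the block from memory. `HiCounterStore`, `reserve_block` and the block size are hypothetical stand-ins, not the service's real storage API.

```python
class HiCounterStore:
    """In-memory stand-in for the durable store holding the next free block."""

    def __init__(self) -> None:
        self._next_hi = 0

    def reserve_block(self) -> int:
        """Atomically reserve one block (the only durable write needed)."""
        hi = self._next_hi
        self._next_hi += 1
        return hi


class HiLoSequenceGenerator:
    """Hands out increasing numbers with one store call per block."""

    def __init__(self, store: HiCounterStore, block_size: int = 1000) -> None:
        self._store = store
        self._block_size = block_size
        self._hi = store.reserve_block()
        self._lo = 0

    def next(self) -> int:
        if self._lo == self._block_size:  # block exhausted: reserve a new one
            self._hi = self._store.reserve_block()
            self._lo = 0
        value = self._hi * self._block_size + self._lo
        self._lo += 1
        return value
```

A new service instance created after a failover reserves a fresh block, so its numbers are always higher than anything the failed instance handed out. The unused remainder of the old block is skipped, which leaves gaps but keeps the order.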

Sophisticated Solution - Failover. (Diagram labels: Device, Client, Topic; Connect 1, Disconnect 2; Primary Failed, Standby, Primary (New); Database: Disconnected 0, Connected 1, Disconnect 2; sequence numbers 1, 2.) If the service fails in between, the new primary takes over from the last notified sequence number.
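The failover behaviour can be sketched as follows, assuming events are persisted before they are sent and a checkpoint records the highest sequence number known to be notified. `Database`, `NotificationService` and their methods are hypothetical, the topic is a plain list, and sequence numbers are derived from the stored events for brevity rather than from a hi-lo generator.

```python
from typing import Dict, List, Tuple


class Database:
    """In-memory stand-in for the durable store."""

    def __init__(self) -> None:
        self.events: Dict[int, str] = {}  # sequence number -> device state
        self.checkpoint = 0               # highest sequence known as notified


class NotificationService:
    def __init__(self, db: Database, topic: List[Tuple[int, str]]) -> None:
        self.db = db
        self.topic = topic
        # Simplification: continue after the highest persisted sequence.
        self.next_seq = max(db.events, default=0) + 1

    def on_device_event(self, state: str) -> int:
        """Assign a sequence number and persist the event before notifying."""
        seq = self.next_seq
        self.next_seq += 1
        self.db.events[seq] = state
        return seq

    def send_pending(self) -> int:
        """Send every event above the checkpoint; return the highest sent."""
        last = self.db.checkpoint
        for seq in sorted(s for s in self.db.events if s > self.db.checkpoint):
            self.topic.append((seq, self.db.events[seq]))
            last = seq
        return last

    def save_checkpoint(self, seq: int) -> None:
        self.db.checkpoint = max(self.db.checkpoint, seq)


def demo() -> List[Tuple[int, str]]:
    db, topic = Database(), []
    primary = NotificationService(db, topic)
    primary.on_device_event("Connected")
    primary.save_checkpoint(primary.send_pending())
    primary.on_device_event("Disconnected")
    primary.send_pending()  # primary fails before writing the checkpoint
    standby = NotificationService(db, topic)  # new primary takes over
    standby.save_checkpoint(standby.send_pending())
    return topic
```

In `demo()`, the new primary resends "Disconnected 2" because the checkpoint still says 1. The client sees a duplicate, which the requirements allow, but it never misses the final state.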

TLA+. Components: Device; Service/In-memory; Database/Persistence; Client/Notified-State. States: Connected/Disconnected; Notified/Pending; Last notification; Notifications received. The model is similar, with some minor changes. The invariant passes this time. Think about the cost of updating a TLA+ model versus debugging and then fixing such an issue in production.

Invariant


Takeaways. Correctness and Optimizations: We can process millions of concurrent device connects and disconnects correctly. Database writes and sending notifications are not in the critical path. We use in-memory sequence numbers instead of hitting the database. Cost and Savings: We invested less than a week in the TLA+ model. No design changes were needed at the development and testing stages. Optimizations reduced operational cost by an order of magnitude.

Conclusion. Operating at Scale: Reduces post-production issues in count and complexity. Eliminates design reworks that previously took 20-40% of project time. Allows aggressive optimizations that are necessary at IoT scale. Agility and Clarity: It's easier to read, review and iterate over 50 lines of TLA+ than 10 pages of design doc. Makes cross-component integrations simpler, as everyone codes against the same specification. Code reviews use the TLA+ specifications as reference and primarily focus on code quality.

Questions?