Dealing with the chaos monkey Mobile Computing Bruce Scharlau, University of Aberdeen, 2012
Background
You have a large international service built on top of web services in ‘the cloud’, which you rely upon. What happens to your service if they disappear? How will your customers respond?
We can place data elsewhere on the network
Use a web service to store data elsewhere: save photos to Flickr, or files to some other app in the cloud. Files can be saved automatically, or at the user's discretion with time values, etc. (Twitter, email apps, or photo capture).
Amazon Web Services died for several days a few years ago: only one company that used them carried on while others suffered the outage.
Working on a form online and losing the connection. Working disconnected, and then syncing the device when ‘in contact’. Saving state in a game. Persistence lets you add ‘memory’ to the application.
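The "work disconnected, then sync" idea can be sketched as a small outbox that saves pending changes to local storage and sends them once a connection is available. This is only an illustration under stated assumptions: the `Outbox` class, the `send` callable, and the JSON file are hypothetical and do not come from the slides.

```python
import json
import os


class Outbox:
    """Hypothetical local queue: changes are saved to disk, then sent when online."""

    def __init__(self, path):
        self.path = path
        self.pending = []
        if os.path.exists(path):
            # Reload anything left over from an earlier, disconnected session.
            with open(path) as f:
                self.pending = json.load(f)

    def _save(self):
        with open(self.path, "w") as f:
            json.dump(self.pending, f)

    def record(self, change):
        """Store a change locally so it survives a lost connection or restart."""
        self.pending.append(change)
        self._save()

    def sync(self, send):
        """Send pending changes in order; stop at the first connection failure."""
        while self.pending:
            try:
                send(self.pending[0])
            except ConnectionError:
                break  # still offline: keep the rest for the next attempt
            self.pending.pop(0)
            self._save()
        return len(self.pending)  # number of changes still waiting
```

The app would call `record` for every user edit and call `sync` whenever it believes it is back ‘in contact’; a return value of 0 means everything has been delivered.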
Netflix’s chaos monkey saved them
They had built a service to create random outages of the services they used. This forced them to provide a minimal service despite outages. When Amazon went down, they were prepared.
Feed & grow your chaos monkey
How often will remote data be accessed? How quickly does remote data need to appear? How often will the data be updated/edited? Where will minimal data be stored? These answers will suggest solutions for you.
Remote data may not always be needed
What you put on remote servers depends upon your own product and how it is deployed.
Remote data may not be instant
If remote data is not expected to appear instantly, then slower servers of your own may suffice for interim periods.
Remote data can be slowly edited
Remote data can be staged so that current versions are held locally and can be used when remote services fail.
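Staging remote data locally can be sketched as a wrapper that refreshes a local copy whenever the remote service answers, and falls back to that copy when it does not. The `StagedData` class and the `fetch_remote` callable are hypothetical names used only for illustration.

```python
class StagedData:
    """Hypothetical wrapper that keeps the last good remote value locally."""

    def __init__(self, fetch_remote):
        self.fetch_remote = fetch_remote  # callable that may raise during an outage
        self.local_copy = None
        self.has_copy = False
        self.is_stale = False

    def get(self):
        try:
            self.local_copy = self.fetch_remote()
            self.has_copy = True
            self.is_stale = False
        except (ConnectionError, TimeoutError):
            if not self.has_copy:
                raise  # nothing staged yet, so there is no fallback
            self.is_stale = True  # serve the staged version instead
        return self.local_copy
```

The `is_stale` flag lets the caller decide whether to tell the user that the data shown may be out of date.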
Storing your own minimal data may be necessary
Remote web services help, but are not the only route to success.
All depends upon data storage needs
How often will the data be accessed? How quickly does the data need to appear? How often will the data be updated/edited? Will the data be added to over time? Will the data be deleted? How will the data need to be used? These answers will suggest solutions for you.
Use caches to manage data
Caches come in different shapes and sizes. Some can handle data before it’s written to the db; some can hold data while the db is changed, etc.
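One kind of cache that handles data before it is written to the db is a write-behind cache. The sketch below is a simplified illustration, not a production design: `WriteBehindCache` is a hypothetical name, and a plain dict stands in for the database.

```python
class WriteBehindCache:
    """Hypothetical cache that accepts writes first and flushes to the db later."""

    def __init__(self, db):
        self.db = db        # any dict-like store stands in for a real database
        self.cache = {}
        self.dirty = set()  # keys written to the cache but not yet to the db

    def put(self, key, value):
        """Accept the write immediately; the db is not touched yet."""
        self.cache[key] = value
        self.dirty.add(key)

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        value = self.db[key]  # cache miss: read through to the db
        self.cache[key] = value
        return value

    def flush(self):
        """Write pending values to the db, e.g. once a db change has finished."""
        flushed = 0
        for key in list(self.dirty):
            self.db[key] = self.cache[key]
            self.dirty.discard(key)
            flushed += 1
        return flushed
```

Reads see the newest value straight away, while the db only receives it when `flush` is called. The trade-off is that unflushed writes are lost if the process dies first.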
Remove 3rd party dependencies
Don’t make your app wait for third party responses before it replies to the user. Find a way to use the 3rd party in an asynchronous manner so your speed isn’t determined by their response time.
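One way to use a third party asynchronously is to hand the call to a background worker and reply to the user straight away. The sketch below uses Python's standard `concurrent.futures` thread pool; `slow_third_party` and `handle_request` are hypothetical stand-ins, and the half-second delay is an arbitrary assumption.

```python
import time
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=4)


def slow_third_party(order_id):
    """Stand-in for a slow external call (hypothetical, e.g. a notification API)."""
    time.sleep(0.5)
    return f"notified:{order_id}"


def handle_request(order_id, third_party=slow_third_party):
    """Reply to the user at once; the third-party call runs in the background."""
    future = executor.submit(third_party, order_id)
    reply = {"order": order_id, "status": "accepted"}
    return reply, future
```

The caller gets `reply` without waiting, and can check `future` later (or attach a callback with `future.add_done_callback`) to record whether the third party eventually succeeded.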
Separate out functions, etc
Keep functions in separate libraries to ease maintenance and development. When everything is put in one component it becomes entangled and causes problems with response rates.
Take this further and assume anything could fail
Servers die, power fails, things fail. Build your system to withstand this and you’ll do fine. You will end up with a resilient infrastructure.
When code is ready then test
Run automatic tests on the code, but also test that the code works by randomly stopping services, etc. The quick start guide at https://github.com/Netflix/SimianArmy/wiki/Quick-Start-Guide will guide you.
http://www.codinghorror.com/blog/2011/04/working-with-the-chaos-monkey.html
http://techblog.netflix.com/search/label/chaos%20monkey
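The idea of testing by randomly stopping services can be tried in miniature before touching real infrastructure. The sketch below is a toy, in-process illustration and does not use the SimianArmy tool or its API: `FakeService`, `chaos_monkey`, and `home_page` are hypothetical names.

```python
import random


class FakeService:
    """Hypothetical stand-in for a remote dependency that can be switched off."""

    def __init__(self, name):
        self.name = name
        self.up = True

    def call(self):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} ok"


def chaos_monkey(services, rng):
    """Stop one randomly chosen service and return it."""
    victim = rng.choice(services)
    victim.up = False
    return victim


def home_page(services):
    """Build a page from whatever services respond; never fail outright."""
    parts = []
    for service in services:
        try:
            parts.append(service.call())
        except ConnectionError:
            parts.append(f"{service.name} unavailable")
    return parts
```

A test can then loop many times, let `chaos_monkey` stop a service, and assert that `home_page` still returns a minimal page rather than raising.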
Run these tests when suitable staff are available
Run these tests when staff expect them so that they can respond accordingly and learn from them. Run them on the production side so that responses can be organised accordingly. Better now than at 3am at the weekend…
Must be run against production
Chaos monkey must be run against production, as this is where it counts and where nuances exist that can’t be replicated in test environments. All of this fits into the larger ‘devops’ approach to development.
http://www.ibm.com/developerworks/java/library/a-devops1/index.html
Mobile ticketing site example
Tickets for purchase from many events. Huge demand when tickets are first released. Unpredictable demand when events go viral.