CWIC Developers Meeting, January 29th 2014
Calin Duma
Service Level Agreements: High Availability, Reliability and Performance
Agenda
– What are SLAs
– Why use SLAs
– Joint SLOs / SLAs dependencies
– How to establish joint SLOs / SLAs
– CWIC data provider SLO challenges
– Initial sample approach
– CWIC Start performance challenges
– CWIC Start metrics options
– Joint metrics to consider
What are SLAs
Service Level Agreements:
– Specify service level requirements between a service provider and a service consumer
– Are often expressed as a legal contract, with penalties for non-compliance
– Use concrete, measurable service level objectives (SLOs) to test that SLAs are being met
In general there is a recognized gap between the expected service levels and the delivered ones:
– Availability: downtime per year (e.g. a budget of 5 minutes of downtime per year translates directly into an uptime-percentage SLO)
– Reliability: advertised component failure rates; can be mitigated by fault-tolerant software and system design
– Performance: SLOs oriented toward response time (completion minus submission) and throughput (concurrent requests); response times increase as throughput increases
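The downtime-to-uptime conversion above is simple arithmetic; a minimal Ruby sketch (Ruby being the language CWIC Start runs on):

```ruby
# Convert an allowed yearly downtime budget into an uptime SLO percentage.
MINUTES_PER_YEAR = 365 * 24 * 60 # 525,600

def uptime_slo(downtime_minutes_per_year)
  (1.0 - downtime_minutes_per_year.to_f / MINUTES_PER_YEAR) * 100
end

# A 5-minutes-per-year downtime budget is roughly "five nines":
puts format('%.4f%%', uptime_slo(5)) # 99.9990%
```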
Why use SLAs
– CWIC is gaining popularity and offers excellent exposure for data islands (India, China, Brazil, etc.)
– We should provide better end-user service: service consumers know what to expect when using GCMD, CWIC and CWIC Start (and other clients)
– We should establish SLOs for our applications:
  – They involve hardware resources, infrastructure platforms (OS, web application stack) and custom code
  – Teams are motivated to work toward agreed-upon targets
  – They can justify, and provide empirical data for, future hardware and software needs
Joint SLOs / SLAs dependencies
– CWIC Start depends on GCMD and CWIC
– CWIC depends on GCMD and 5 providers: NASA, INPE, GHRSST, USGSLSI and CCMEO
– To have availability, reliability and performance SLOs we would have to coordinate among 8 components:
  1. CWIC Start
  2. GCMD
  3. CWIC
  4. NASA / ECHO
  5. INPE
  6. GHRSST
  7. USGSLSI
  8. CCMEO
– If any of these components is down or slow, the end user is subject to a sub-optimal experience
– Complexity will increase as more providers are added
How to establish joint SLOs / SLAs
– Although our services are free to use, we can still provide a reasonable user experience and set realistic user expectations
– A true joint SLO / SLA would be at most that of the weakest component, and is therefore not desirable
– CWIC, GCMD, CWIC Start and ECHO can work together on joint SLOs / SLAs
– CWIC can collect existing provider SLAs where applicable, or help providers think about SLAs
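The "weakest component" point follows from how availabilities compose: if a request needs every component in the chain, the joint availability is at best the product of the individual figures. A sketch with purely illustrative numbers (these are not measured CWIC availabilities):

```ruby
# Joint availability of serially dependent components, assuming independent
# failures: the product of the individual availabilities, which is always
# at or below the weakest component's figure.
components = {
  'CWIC Start' => 0.999,
  'GCMD'       => 0.999,
  'CWIC'       => 0.999,
  'Provider'   => 0.995  # weakest link
}

joint = components.values.reduce(:*)
puts format('Joint availability: %.3f%%', joint * 100)
```

Note that the joint figure (about 99.2% here) is worse than even the weakest single component, which is why chaining many providers without coordination erodes the end-user experience.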
CWIC data provider SLO challenges
– Similar to ECHO's challenge of dealing with its 11 data partners
– The ECHO model is something we can learn from:
  – Provide individual availability notices on the CWIC WGISS home page
  – If providers do not communicate downtimes or availability, collect statistics with monitoring technologies / APIs
  – Collect CWIC Start and CWIC metrics that can capture current SLOs for all external dependencies
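Collecting availability statistics when a provider does not report them can be as simple as polling an endpoint on a schedule. A hypothetical probe sketch; the endpoint URL, timeout, and return shape are illustrative, not the actual CWIC monitoring setup:

```ruby
require 'net/http'
require 'uri'

# Poll a provider endpoint once, recording up/down status and response time.
# A scheduler (cron, sidekiq, etc.) would call this periodically and the
# results would feed availability statistics.
def probe(url, timeout: 10)
  uri = URI(url)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  response = Net::HTTP.start(uri.host, uri.port,
                             use_ssl: uri.scheme == 'https',
                             open_timeout: timeout,
                             read_timeout: timeout) do |http|
    http.get(uri.request_uri)
  end
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  { up: response.is_a?(Net::HTTPSuccess), seconds: elapsed }
rescue StandardError
  # Any network or HTTP failure counts as a downtime observation.
  { up: false, seconds: nil }
end
```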
Initial Sample Approach
CWIC Start Performance Challenges
CWIC Start had performance issues due to:
– A distributed-search memory leak
– Inconsistent OpenLayers map rendering
– A potential memory leak under the high load generated by search bots
It was, and still is, very challenging to pinpoint performance problems because of:
– Ruby on Rails running on top of JRuby, and the difficulty of using memory profilers that point to the actual Ruby code
– The clustered / load-balanced deployment, with requests from the same user being serviced on different hosts
– Difficulties in collecting host-level performance metrics such as free physical memory, swap utilization, CPU and network I/O
CWIC Start metrics options
We are investigating Real User Monitoring (RUM) metrics that capture the user's browser experience:
– Google Analytics (~26 subjects with hundreds of dimensions / specific descriptive attributes)
– W3C Navigation Timing to complement GA
– New Relic: excellent back-end code instrumentation targeting SLAs and detailed performance metrics
We added semantic logging and detailed durations to make it easy to trace requests on a cluster:
– Example: [813d9f df507a10eac] [ ] Started GET "/datasetssearch?standard=csw" for at :22:
– Example: HttpRequest.submit RESPONSE, DURATION (uCPU sCPU usCPU real): ( )
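The idea behind the log lines above is to tag every entry with a per-request ID, so a single request can be traced across clustered hosts, and to record CPU vs. wall-clock durations. A minimal Ruby sketch of that pattern; the method name and log format here are illustrative, not the actual CWIC Start code:

```ruby
require 'securerandom'
require 'benchmark'

# Wrap a unit of work, logging a per-request ID plus user CPU, system CPU,
# and wall-clock ("real") durations for it.
def with_request_logging(action)
  request_id = SecureRandom.hex(8)
  times = Benchmark.measure { yield }
  puts format('[%s] %s DURATION (uCPU sCPU real): (%.4f %.4f %.4f)',
              request_id, action, times.utime, times.stime, times.real)
  request_id
end

with_request_logging('HttpRequest.submit') { sleep 0.01 }
```

Because the same request ID appears on every log line for a request, logs from different hosts in the cluster can be correlated after the fact.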
Joint metrics to consider
– The CWIC ExtJS application is an excellent start
– Questions to answer:
  – Who is using CWIC and GCMD CSW (clientId)?
  – Who is using CWIC and GCMD OpenSearch (clientId)?
  – Who is using CWIC without GCMD interaction?
  – Granule metadata and data downloads via CWIC-provided links
  – Percentage of direct downloads vs. provider welcome-page redirects
  – Average response times
Joint metrics to consider (cont.)
Questions to answer (CSW and OpenSearch):
– Number of errors due to provider internal errors
– Number of errors due to CWIC internal errors
– Number of errors due to provider unavailability
– CWIC-specific performance metrics per provider
– GCMD-specific performance metrics
– Others?
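Once search outcomes are logged with an error category and a duration, the per-provider error counts and average response times above fall out of a simple aggregation. A sketch with a hypothetical record structure and made-up sample data:

```ruby
# Illustrative search-outcome records; in practice these would come from
# the semantic logs or a metrics store.
records = [
  { provider: 'INPE', status: :ok,             seconds: 1.2 },
  { provider: 'INPE', status: :provider_error, seconds: 0.4 },
  { provider: 'NASA', status: :ok,             seconds: 0.8 },
  { provider: 'NASA', status: :unavailable,    seconds: nil }
]

by_provider = records.group_by { |r| r[:provider] }
by_provider.each do |provider, recs|
  errors = recs.count { |r| r[:status] != :ok }
  timed  = recs.map { |r| r[:seconds] }.compact
  avg    = timed.sum / timed.size
  puts format('%s: %d error(s), avg response %.2fs', provider, errors, avg)
end
```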