Experimenting with Complex Event Processing for Large Scale Internet Services Monitoring
Stephan Grell, Olivier Nano
Microsoft, Ritter Strasse 23, Aachen, 52072, Germany
{stgrell,
Overview
Agenda:
– SLA monitoring system
– Scenarios
– Language expressiveness
– Offline analysis
– Reliability
European Microsoft Innovation Center (EMIC) Overview
Founded in May 2003 (under Craig Mundie)
~40 employees + students
Goals:
– Applied collaborative research with European partners (BT, Philips, etc.)
– Participation in FP6, FP7 and other collaborative projects
– Generating strong prototypes to drive interest at MS
– Some internal projects for MS (port of CF for Symbian)
SLA monitoring system
Developed as part of the FP6 SeCSE project
Scenarios
S1: Synthetic transactions
Test applications ping service functionality regularly
The SLA evaluates success, response time and failure states
The system takes appropriate actions depending on the state
Requirements:
– Single-node CEP system
– Pattern detection
– State modeling
S2: User-generated events
Monitor local service instances
Aggregate at a higher level:
– Per service role
– Across service roles / per service
Requirements:
– Distributed CEP system
– Capacity management
– High availability
Support for on-the-fly query adaptation and root cause analysis
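To make the S1 state-modeling requirement concrete, here is a minimal Python sketch, not the system's actual implementation. The class names, the three states (OK, DEGRADED, FAILED), the 500 ms response-time threshold and the "three consecutive failures" rule are hypothetical choices for illustration. Each ping result is folded into a state, and an action is recorded only when the state changes.

```python
from dataclasses import dataclass
from enum import Enum


class SlaState(Enum):
    OK = "ok"
    DEGRADED = "degraded"
    FAILED = "failed"


@dataclass
class PingResult:
    success: bool        # did the synthetic transaction succeed?
    response_ms: float   # measured response time in milliseconds


class SlaEvaluator:
    """Hypothetical SLA evaluator; thresholds and states are illustrative."""

    def __init__(self, max_response_ms: float = 500.0, fail_after: int = 3):
        self.max_response_ms = max_response_ms
        self.fail_after = fail_after      # consecutive failures before FAILED
        self.consecutive_failures = 0
        self.state = SlaState.OK
        self.actions = []                 # (old_state, new_state) transitions

    def on_event(self, ping: PingResult) -> SlaState:
        self.consecutive_failures = 0 if ping.success else self.consecutive_failures + 1

        if self.consecutive_failures >= self.fail_after:
            new_state = SlaState.FAILED
        elif not ping.success or ping.response_ms > self.max_response_ms:
            new_state = SlaState.DEGRADED
        else:
            new_state = SlaState.OK

        if new_state != self.state:
            # Act only on state transitions, not on every incoming event.
            self.actions.append((self.state, new_state))
            self.state = new_state
        return self.state


evaluator = SlaEvaluator(max_response_ms=500.0, fail_after=3)
pings = [PingResult(True, 120), PingResult(True, 800), PingResult(False, 0),
         PingResult(False, 0), PingResult(False, 0), PingResult(True, 90)]
states = [evaluator.on_event(p) for p in pings]
```

A slow but successful ping moves the service to DEGRADED, the third consecutive failure moves it to FAILED, and a single fast success returns it to OK; a real deployment would likely add hysteresis before recovering.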
Language expressiveness
Detecting patterns?
– Over available data
– Over available data with temporal constraints
Building state machines?
Needed: a simple way to formulate a state machine
Question: How can non-experts be enabled to use the tools?
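A pattern with a temporal constraint, such as “event A followed by event B within T seconds”, is itself a small state machine: each unmatched A is a partial match that either completes or expires. The Python sketch below assumes timestamp-ordered input and uses hypothetical event names (deploy, error); pairing each B with the oldest open A is only one of several possible match-consumption policies.

```python
from collections import deque
from typing import Iterable, Iterator, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, event type)


def followed_by_within(events: Iterable[Event], first: str, second: str,
                       window_s: float) -> Iterator[Tuple[float, float]]:
    """Yield (t_first, t_second) when `second` occurs at most `window_s`
    seconds after an unmatched `first`. Input must be ordered by timestamp."""
    open_matches = deque()  # timestamps of unmatched `first` events, oldest first
    for ts, kind in events:
        # Transition 1: partial matches whose time window has closed expire.
        while open_matches and ts - open_matches[0] > window_s:
            open_matches.popleft()
        if kind == first:
            # Transition 2: a new partial match starts.
            open_matches.append(ts)
        elif kind == second and open_matches:
            # Transition 3: complete the oldest open partial match.
            yield open_matches.popleft(), ts


stream = [(0.0, "deploy"), (5.0, "error"), (10.0, "deploy"), (20.0, "deploy"),
          (25.0, "error"), (100.0, "deploy"), (200.0, "error")]
matches = list(followed_by_within(stream, "deploy", "error", window_s=60.0))
```

Here the error at t=200 is not reported, because the preceding deploy at t=100 fell outside the 60-second window and its partial match expired.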
Offline analysis / debugging
Required for debugging processing plans
– CEP simulation environment
  Automated event generation based on the query
  Step-by-step execution of the query
  Conditional breakpoints
– Smart logging at runtime
  Only the traces required for the query in question are stored
  Only the data that produced a “bad” result is stored
– Support for building the right query from the available data
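The smart-logging idea can be sketched as a bounded in-memory window of recent inputs per query that is persisted only when the query emits a “bad” result. This is a hypothetical Python sketch: SmartTraceLogger, the capacity parameter, the 500 ms “bad” criterion and the in-memory `stored` list (a stand-in for durable storage) are illustrative assumptions.

```python
from collections import deque


class SmartTraceLogger:
    """Hypothetical per-query trace logger: buffers recent inputs in memory
    and persists them only when the query produces a "bad" result."""

    def __init__(self, query_id: str, capacity: int = 100):
        self.query_id = query_id
        self.recent = deque(maxlen=capacity)  # oldest inputs are evicted automatically
        self.stored = []                      # stand-in for durable trace storage

    def record_input(self, event) -> None:
        self.recent.append(event)

    def record_output(self, result, is_bad: bool) -> None:
        if is_bad:
            # Persist a snapshot of the buffered inputs leading up to the bad result.
            self.stored.append({"query": self.query_id,
                                "result": result,
                                "inputs": list(self.recent)})


logger = SmartTraceLogger("ping-latency", capacity=3)
for latency_ms in [100, 120, 90, 700]:
    logger.record_input(latency_ms)
    # Illustrative "bad" criterion: latency above 500 ms.
    logger.record_output(latency_ms, is_bad=latency_ms > 500)
```

Normal outputs cost only a bounded amount of memory; storage is touched once, for the 700 ms result, together with the last three inputs.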
Reliable infrastructure
Survive failures: high availability
– Replication
– Distributed storage
Correct output
– How to compare outputs?
Deal with overload scenarios
– Intelligent load shedding vs. delayed execution
Question: What is the required “quality of service”?
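The trade-off between load shedding and delayed execution can be illustrated with a toy single-node simulation. In this hypothetical Python sketch, events carry an integer priority (higher means more important), the node processes a fixed number of events per tick, and “intelligent” shedding is approximated by dropping the lowest-priority queued events once a bounded buffer overflows; real systems may use much richer shedding policies.

```python
import heapq
from typing import List, Tuple


def simulate(arrivals: List[List[Tuple[int, str]]], per_tick: int,
             policy: str, max_buffer: int = 4):
    """Toy overload simulation (hypothetical, for illustration only).

    arrivals[t] lists the (priority, event_id) pairs arriving at tick t.
    policy="shed":  keep at most max_buffer queued events, dropping the
                    lowest-priority ones before processing.
    policy="delay": keep every event and process it in a later tick.
    Returns (processed, dropped); processed is a list of (tick, event_id).
    """
    if per_tick < 1:
        raise ValueError("per_tick must be at least 1")
    pending = []  # min-heap of (-priority, arrival_seq, event_id)
    processed, dropped = [], []
    seq, tick = 0, 0
    while tick < len(arrivals) or pending:
        for prio, eid in (arrivals[tick] if tick < len(arrivals) else []):
            heapq.heappush(pending, (-prio, seq, eid))
            seq += 1
        if policy == "shed":
            while len(pending) > max_buffer:
                # The lowest-priority (and, on ties, newest) entry is the max tuple.
                worst = max(pending)
                pending.remove(worst)
                heapq.heapify(pending)
                dropped.append(worst[2])
        # Process up to per_tick events, highest priority first.
        for _ in range(min(per_tick, len(pending))):
            processed.append((tick, heapq.heappop(pending)[2]))
        tick += 1
    return processed, dropped


burst = [[(1, "a"), (5, "b"), (1, "c"), (9, "d"), (1, "e"), (5, "f")], [], []]
shed_out = simulate(burst, per_tick=2, policy="shed")
delay_out = simulate(burst, per_tick=2, policy="delay")
```

With shedding, the burst is cleared within two ticks but the low-priority events c and e are lost; with delayed execution every event is processed, but c and e finish two ticks after they arrived. Which outcome is acceptable depends on the required quality of service.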
Next steps
Engaging in new scenarios
Development focuses on:
– High availability
– Debugging / root cause analysis
Explore a heterogeneous CEP system that spans:
– Servers
– Embedded devices
– Sensors
– The cloud?