A Collaborative Monitoring Mechanism for Making a Multitenant Platform Accountable (HotCloud '10)
By Xuanran Zong
Background
Applications are moving to the cloud:
– Pay-as-you-go basis
– Resource multiplexing
– Reduced over-provisioning cost
Cloud service uncertainty:
– How do clients know whether the cloud provider handles their data and logic correctly?
  – Logic correctness
  – Consistency constraints
  – Performance
Service level agreement (SLA)
To ensure data and logic are handled correctly, the service provider offers a service level agreement to its clients:
– Performance, e.g. one EC2 compute unit provides roughly the CPU capacity of a 1.0–1.2 GHz 2007 Opteron or Xeon processor
– Availability, e.g. the service will be up 99.9% of the time
SLA problems
– Few means are provided for clients to hold an SLA accountable when a problem occurs
  – Accountable means we know who is responsible when things go wrong
  – Monitoring is provided by the provider itself
– Clients are often required to furnish all the evidence by themselves to be eligible to claim credit for an SLA violation
EC2 SLA Reference:
Accountability service
– Provided by a third party
– Responsibilities:
  – Collect evidence based on the SLA
  – Runtime compliance checking and problem detection
Problem description
Clients have a set of end-points {ep_0, ep_1, …, ep_n-1} that operate on data stored in a multitenant environment.
Many things can go wrong:
– Data is modified without the owner's permission
– A consistency requirement is broken
The accountability service should detect these issues and provide evidence.
System architecture
– A wrapper is provided by the third party
– The wrapper captures the input/output of each end-point ep_i and sends it to the accountability service (a minimal sketch follows)
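A minimal sketch of what such a wrapper could look like, assuming the end-point is an ordinary callable and the accountability service exposes a log_operation method; the class and field names here are illustrative, not the paper's actual interface.

```python
import time


class EndpointWrapper:
    """Intercepts calls to a tenant end-point ep_i and forwards an operation log."""

    def __init__(self, endpoint, accountability_service):
        self.endpoint = endpoint               # the wrapped end-point ep_i
        self.service = accountability_service  # third-party accountability service

    def __call__(self, operation, *args, **kwargs):
        result = self.endpoint(operation, *args, **kwargs)
        # Capture the input and output of the call and ship them as a log message.
        self.service.log_operation({
            "timestamp": time.time(),
            "endpoint": getattr(self.endpoint, "__name__", "ep_i"),
            "operation": operation,
            "input": args,
            "output": result,
        })
        return result
```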
Accountability service
The accountability service maintains a view of the data state:
– Reflects what the data should be from the users' perspective
– Aggregates the users' data-update requests to calculate the data state
– Authenticates query results against the calculated data state
Evidence collection and processing
The logging wrapper w_ep extracts operation information and sends a log message to the accountability service W (see the dispatch sketch below):
– If it is an update service, W updates the MB-tree
– If it is a query service, W authenticates the result against the MB-tree, checking correctness and completeness
– The MB-tree maintains the data state
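The dispatch logic in W might look roughly like the following. The paper keeps the data state in a Merkle B-tree (next slide); a plain dictionary stands in for it here so the update/query split stays visible, and all names and the log-message layout are assumptions of this sketch.

```python
class AccountabilityServiceW:
    def __init__(self):
        self.data_state = {}                       # stand-in for the MB-tree

    def process_log(self, log):
        op = log["operation"]
        if op in ("insert", "update"):             # update services change the state
            key, value = log["input"]
            self.data_state[key] = value
        elif op == "delete":
            self.data_state.pop(log["input"], None)
        elif op == "query":                        # query services get authenticated
            key = log["input"]
            expected = self.data_state.get(key)
            if log["output"] != expected:
                # Evidence of a violation: the returned result disagrees with
                # the state implied by the owners' own update requests.
                raise RuntimeError(f"authentication failed for key {key!r}")
```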
Data state calculation
– Use a Merkle B-tree (MB-tree) to maintain the data state
– By combining the items in the verification object (VO), we can recompute the root of the MB-tree and compare it with the maintained root, revealing the correctness and completeness of the query result (a simplified verification sketch follows)
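A simplified illustration of VO-based authentication: it assumes a plain binary Merkle path rather than the MB-tree's fan-out and boundary records, and the hashing scheme and function names are choices made for this sketch only.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def recompute_root(result_records, vo_path):
    """Recompute the Merkle root from the query result plus the VO.

    vo_path is a list of (sibling_hash, sibling_is_left) pairs from leaf to root.
    """
    node = h(b"|".join(r.encode() for r in result_records))
    for sibling, sibling_is_left in vo_path:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node


def authenticate(result_records, vo_path, trusted_root) -> bool:
    # Correctness and completeness hold only if the recomputed root matches
    # the root that the accountability service maintains.
    return recompute_root(result_records, vo_path) == trusted_root
```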
Consistency issue
What if the log messages arrive out of order?
– Assume eventual consistency
– Assume clocks are synchronized
– W maintains a sliding window of log messages sorted by timestamp (a sketch of the reordering buffer follows)
– The window size is determined by the maximum delay of delivering a log message from a client to W
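One possible shape for that reordering buffer, under the stated assumptions of synchronized clocks and a known maximum delivery delay; the heap-based layout and names are mine, not the paper's.

```python
import heapq
import itertools
import time


class SlidingWindow:
    """Buffers log messages and releases them in timestamp order."""

    def __init__(self, max_delay_seconds: float):
        self.max_delay = max_delay_seconds    # max client-to-W delivery delay
        self._buffer = []                     # min-heap ordered by timestamp
        self._tie = itertools.count()         # breaks ties between equal timestamps

    def add(self, log):
        heapq.heappush(self._buffer, (log["timestamp"], next(self._tie), log))

    def release(self, now=None):
        """Yield, in timestamp order, every buffered log old enough that no
        earlier message can still be in flight."""
        now = time.time() if now is None else now
        while self._buffer and self._buffer[0][0] <= now - self.max_delay:
            yield heapq.heappop(self._buffer)[2]
```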
Collaborative monitoring mechanism
The current approach is centralized, which raises availability, scalability, and trustworthiness concerns.
Let's make it distributed:
– The data state is maintained by a set of services
– Each service maintains a view of the data state
Design choice I
The log is sent to one data state service, which then propagates it to the other services synchronously.
– Pros: strong consistency; a request can be answered by any service
– Cons: large overhead due to synchronous communication
Design choice II
The log is sent to one service, which propagates it to the others asynchronously.
– Pros: better logging performance
– Cons: uncertainty in answering an authentication request
Their design
Somewhere between the two extremes:
– Partition the key range into a few disjoint regions
– A log message is sent only to its designated region
– Log messages are propagated synchronously within a region and asynchronously across regions
– An authentication request is directed to the service whose region overlaps most with the request range (see the routing sketch below)
  – Answer with certainty if the request range falls inside the service's region
  – Otherwise, wait for the asynchronous propagation to catch up
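A sketch of how the routing rules could be implemented, assuming the key space is split into disjoint contiguous regions with one data-state service per region; the region representation, containment check, and function names are assumptions of this sketch, not the paper's mechanism.

```python
def overlap(region, query_range):
    """Length of the intersection of two half-open ranges (lo, hi)."""
    lo = max(region[0], query_range[0])
    hi = min(region[1], query_range[1])
    return max(0, hi - lo)


def route_log(regions, key):
    """A log message goes only to the service owning the key's designated region."""
    for service, (lo, hi) in regions.items():
        if lo <= key < hi:
            return service
    raise KeyError(key)


def route_authentication(regions, query_range):
    """Direct an authentication request to the service whose region overlaps
    the query range the most. If the range is not fully contained in that
    region, the service must wait for asynchronous updates from the others."""
    service = max(regions, key=lambda s: overlap(regions[s], query_range))
    lo, hi = regions[service]
    fully_contained = lo <= query_range[0] and query_range[1] <= hi
    return service, fully_contained
```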
Evaluation: overhead
– Centralized design
– Where does the overhead come from?
Evaluation: VO calculation overhead
Evaluation: performance improvement with multiple data state services
Discussion
– The paper articulates the problem clearly and shows one solution that employs a third party to make the data state accountable.
– Which part is the main overhead: communication or VO calculation?
– The distributed design does not help much when the query range is large.
– Are people willing to sacrifice performance (at least doubling the time) to make the service accountable?
– Can a similar design be used to make other aspects accountable, for instance performance?