Published by Robert Barnett. Modified over 9 years ago.
1
A Recovery-Friendly, Self-Managing Session State Store
Benjamin Ling, Emre Kiciman, Armando Fox
{bling,emrek,fox}@cs.stanford.edu
2
© 2004 Benjamin Ling
Outline
- Motivation: What is Session State?
- SSM:
    - Architecture
    - Algorithm
    - Backpressure and Admission Control
- SSM + Pinpoint
    - Self-recovering, self-monitoring
- Benchmarks
- Next steps: Sun Reference AppServer integration
- Conclusion
3
Proliferation of J2EE and Web Services
- J2EE embraced as industry standard
- Framework
    - Simplifies development
    - Allows for portability of services
    - Standardized interfaces
- However, difficulties remain…
4
The Pain: Administration and Maintenance
- Administration is difficult and costly
    - Cost: database admins cost ~$200K/yr per head
    - Development efficiency is negatively impacted
- Failure/Recovery is costly
    - Recovery is slow, especially for site outages
    - Data loss on crashes
    - Users adversely affected
5
Not All State is Created Equal
- Various types of state in J2EE…
    - User profile state
    - Persistent shared state
    - Transaction history state
- But usually stored in the same place
    - Stored in DB or FS
Focus on particular class
Exploit its properties
Simplify Administration and Maintenance
6
Example of Session State
7
Properties of Session State
- Subcategory of session state
    - Single-user, serial access, semi-persistent data
    - Examples: Temporary application data, application workflow
    - Example of usage (e.g. J2EE):
[Diagram: Browser and App Server, with steps numbered 1-6]
8
Goal
- Build a session state store that is:
    - Failure-friendly
        - Does not lose data on crash
        - Degrades gracefully
    - Recovery-friendly
        - Recovers fast
    - Self-Managing
10
Session State Manager (SSM)
[Diagram: AppServer stubs connected to Bricks 1-5; brick resources labeled "RAM, Network Interface"]
- Redundant, in-memory hash table distributed across nodes
- Algorithm: Redundancy similar to quorums
    - Write to many random nodes, wait for few (avoid performance coupling)
    - Read one
11
Write example: “Write to Many, Wait for Few”
[Diagram: Browser, AppServer with stub, Bricks 1-5]
- Try to write to W random bricks, W = 4
- Must wait for WQ bricks to reply, WQ = 2
15
Write example: “Write to Many, Wait for Few”
[Diagram: Browser, AppServer with stub, Bricks 1-5; bricks that did not reply are marked "Crashed? Slow?"]
- Try to write to W random bricks, W = 4
- Must wait for WQ bricks to reply, WQ = 2
- Cookie holds metadata (here: bricks 1, 4)
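The write path on these slides can be sketched as a small simulation. This is a minimal sketch under stated assumptions, not SSM's actual code: the `Brick` class, the `write_session` function, and the cookie layout are hypothetical names introduced for illustration, and list order stands in for the order in which replies arrive. Only W, WQ, and the idea that the cookie records which bricks acknowledged come from the slides.

```python
import random


class Brick:
    """Hypothetical in-memory brick; alive=False models a crashed node."""

    def __init__(self, brick_id, alive=True):
        self.brick_id = brick_id
        self.alive = alive
        self.table = {}

    def write(self, key, value):
        """Store the value and return True as an ack, or False if crashed."""
        if not self.alive:
            return False
        self.table[key] = value
        return True


def write_session(bricks, key, value, w=4, wq=2, rng=random):
    """Send the write to W random bricks and succeed once WQ have acked.

    Returns hypothetical cookie metadata naming the first WQ bricks that
    acked, or None if fewer than WQ bricks replied.
    """
    targets = rng.sample(bricks, w)
    # The request goes to all W targets; a real stub would send in parallel
    # and stop waiting after WQ replies. List order stands in for arrival order.
    acks = [b.brick_id for b in targets if b.write(key, value)]
    if len(acks) < wq:
        return None
    return {"key": key, "bricks": acks[:wq]}


bricks = [Brick(i) for i in range(1, 6)]
bricks[2].alive = False  # Brick 3 has crashed
cookie = write_session(bricks, "session-42", {"cart": ["book"]})
```

With five bricks and one crashed, any choice of W = 4 targets still contains at least three live bricks, so the write meets WQ = 2 without waiting on the failed node.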
16
Read example:
[Diagram: Browser, AppServer with stub, Bricks 1-5; cookie lists bricks 1, 4]
- Try to read from Bricks 1, 4
17
Read example:
[Diagram: Browser, AppServer with stub, Bricks 1-5; cookie lists bricks 1, 4]
18
Read example:
[Diagram: Browser, AppServer with stub, Bricks 1-5]
- Brick 1 crashes
19
Read example:
[Diagram: Browser, AppServer with stub, Bricks 2-5; Brick 1 is no longer shown]
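The read path shown on these slides can be sketched the same way. This is a minimal sketch, assuming bricks are modeled as plain dictionaries keyed by brick id and that the cookie has the hypothetical layout `{"key": ..., "bricks": [...]}`; neither the function name `read_session` nor that layout comes from the slides.

```python
def read_session(bricks, cookie):
    """Try each brick named in the cookie until one returns the value."""
    for brick_id in cookie["bricks"]:
        brick = bricks.get(brick_id)
        if brick is None:  # this brick has crashed or been removed
            continue
        if cookie["key"] in brick:
            return brick[cookie["key"]]
    return None  # no surviving replica holds the state


# Bricks modeled as plain dicts keyed by brick id (an assumption of this sketch).
bricks = {i: {} for i in range(1, 6)}
cookie = {"key": "session-42", "bricks": [1, 4]}
for brick_id in cookie["bricks"]:
    bricks[brick_id]["session-42"] = {"cart": ["book"]}

del bricks[1]  # Brick 1 crashes, as on the slide
state = read_session(bricks, cookie)  # served by Brick 4
```

Because the cookie names every brick that acknowledged the write, the stub needs only one of them to be alive in order to serve the read.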
20
SSM: Failure and Recovery
- Failure of single node
    - No data loss: WQ-1 copies remain
    - State is available for R/W during failure
- Recovery
    - Restart only; no recovery procedure
    - No special-case recovery code
    - State is available for R/W during brick restart
    - Session state is self-recovering
        - User’s access pattern causes data to be rewritten
21
Backpressure and Admission Control
[Diagram: two AppServers with stubs, Bricks 1-5]
- Heavy flow to Brick 3
- Drop Requests
22
Backpressure and Admission Control
[Diagram: two AppServers with stubs, Bricks 1-5]
- Drop Requests
- Reduce Sending
- Reject requests
24
Recovery Philosophy
[Figure: recovery cost (Cheap to Expensive) against detection accuracy (Accurate to Lax); labels: Downtime, Undetected Errors, Ideal, Aggressive, Hard]
25
Failure detection and Recovery
[Figure: timeline with labels Failure, Detection, Recovery, Recovered]
- SSM: Failure masked, instant recovery
26
False Positives
[Figure: timeline with labels False positive triggered, Instant recovery, Normal Operation]
27
Statistical Monitoring
[Diagram: Bricks 1-5 and Pinpoint]
Statistics: NumElements, MemoryUsed, InboxSize, NumDropped, NumReads, NumWrites
28
Statistical Monitoring
[Diagram: Bricks 1-5 and Pinpoint, with a REBOOT label]
Statistics: NumElements, MemoryUsed, InboxSize, NumDropped, NumReads, NumWrites
30
SSM Monitoring
- N replicated bricks handle read/write requests
    - Cannot do structural anomaly detection!
    - Alternative features (performance, mem usage, etc.)
- Activity statistics: How often did a brick do something?
    - Msgs received/sec, dropped/sec, etc.
    - Same across all peers, assuming balanced workload
    - Use anomalies as likely failures
- State statistics: Current state of system
    - Memory usage, queue length, etc.
    - Similar pattern across peers, but may not be in phase
    - Look for patterns in time-series; differences in patterns indicate failure at a node.
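The activity-statistics idea (peers should look alike under a balanced workload, so an outlier is a likely failure) can be illustrated with a simple peer comparison. This is a sketch only: the slides do not say which test Pinpoint applies, so the median-absolute-deviation rule, the `anomalous_bricks` name, and the threshold of 3.0 are all assumptions chosen for illustration.

```python
from statistics import median


def anomalous_bricks(stat_by_brick, threshold=3.0):
    """Return ids of bricks whose statistic deviates strongly from the peer median.

    Deviation is measured in units of the median absolute deviation (MAD).
    The threshold is an arbitrary illustrative cutoff, not a value from the slides.
    """
    values = list(stat_by_brick.values())
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # Most peers agree exactly: flag anything that differs at all.
        return sorted(b for b, v in stat_by_brick.items() if v != center)
    return sorted(
        b for b, v in stat_by_brick.items() if abs(v - center) / mad > threshold
    )


# Hypothetical NumDropped readings over one interval.
num_dropped = {"brick1": 2, "brick2": 3, "brick3": 250, "brick4": 1, "brick5": 2}
suspects = anomalous_bricks(num_dropped)  # ["brick3"]
```

A median-based rule is used here because a single misbehaving brick barely moves the median, whereas it would inflate a mean and standard deviation computed over only five peers.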
31
Surprising Patterns in Time-Series
1. Discretize time-series into string. [Keogh]
   [0.2, 0.3, 0.4, 0.6, 0.8, 0.2] -> “aaabba”
2. Calculate the frequencies of short substrings in the string.
   “aa” occurs twice; “ab”, “bb”, and “ba” each occur once.
3. Compare frequencies to normal; look for substrings that occur much less or much more often than normal.
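The three steps above can be sketched as follows. This is a minimal sketch, not the cited method in full: the single breakpoint at 0.5 is an assumption chosen because it reproduces the slide's example string, and the function names and the `ratio` cutoff in `surprising` are hypothetical.

```python
from collections import Counter


def discretize(series, breakpoints=(0.5,), alphabet="ab"):
    """Map each value to a letter according to the interval it falls in."""
    symbols = []
    for value in series:
        index = sum(value >= bp for bp in breakpoints)
        symbols.append(alphabet[index])
    return "".join(symbols)


def substring_counts(text, length=2):
    """Count every overlapping substring of the given length."""
    return Counter(text[i:i + length] for i in range(len(text) - length + 1))


def surprising(observed, normal, ratio=2.0):
    """Substrings seen at least `ratio` times more or less often than normal."""
    flagged = []
    for sub in sorted(set(observed) | set(normal)):
        seen, expected = observed.get(sub, 0), normal.get(sub, 0)
        if seen >= ratio * max(expected, 1) or expected >= ratio * max(seen, 1):
            flagged.append(sub)
    return flagged


word = discretize([0.2, 0.3, 0.4, 0.6, 0.8, 0.2])  # "aaabba"
counts = substring_counts(word)                    # aa: 2, ab: 1, bb: 1, ba: 1
```

In use, `normal` would hold substring counts gathered from healthy peers or from a brick's own history, and `surprising(counts, normal)` would list the substrings worth investigating.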
33
Microbenchmarks
- UC Berkeley Millennium Cluster
    - Six bricks running
- Candidate Write Set = 3, Write quota = 2
- Candidate Read Set = 2
- State Size = 8K
34
Induced Fault
[Graph annotations:]
- One brick killed
- Brick restarted by PP (Pinpoint)
- SSM unaffected
35
Memory fault
[Graph annotations:]
- Memory fault detected in hash
- PP restarts Brick
- SSM unaffected
36
Network Fault: 70% packet loss
[Graph annotations:]
- Network fault injected
- Fault detected
- Brick killed
- PP restarts Brick
37
Performance Fault
[Graph annotation:]
- Performance fault injected
38
Macrobenchmark
- TellMe’s Email-By-Phone Application
- Session state stored in memory
    - Email header information
    - Index information
- Alter application to store session state using
    - Disk
    - SSM
39
Macrobenchmark
[Graph annotations:]
- 25% Throughput Degradation compared to in-memory
- Throughput preserved compared to disk
40
Future Work
- Integrate with Sun’s reference Application Server
    - Enterprise benchmarks
- Statistical Anomaly Detection
    - Too many magic numbers
- Integrated ROC-J2EE application server
41
Conclusion
SSM: A Recovery-Friendly, Self-Managing Session State Store
Benjamin Ling
bling@cs.stanford.edu
http://swig.stanford.edu/
42
Existing solutions:
- File System and Databases
    - Poor failure behavior
        - Lose data (FS)
    - Slow recovery (both)
    - Difficult to administer (DB)
    - Difficult to tune (both)
- In-memory replication using primary/secondary:
    - Performance coupling
    - Poor failover (uneven load balancing)
43
Other implementation details
- Garbage collection
    - Generational hash table
        - Hash table of hash tables
        - Each hash table has an associated time range
        - When time has passed, GC that table
    - No reference counting, scanning, etc.
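The generational hash table described above can be sketched as follows. This is a minimal sketch under stated assumptions: the class name, the fixed per-entry lifetime, the generation width, and the explicit `now` arguments are all hypothetical choices for illustration, and entries are placed in the generation that covers their expiry time, which the slide does not specify.

```python
class GenerationalTable:
    """Hash table of hash tables; each inner table covers one time range."""

    def __init__(self, generation_seconds=60, lifetime_seconds=300):
        self.generation_seconds = generation_seconds
        self.lifetime_seconds = lifetime_seconds
        self.generations = {}  # generation index -> {key: value}

    def _generation(self, timestamp):
        return int(timestamp // self.generation_seconds)

    def put(self, key, value, now):
        """Store the entry in the generation that covers its expiry time."""
        expires = now + self.lifetime_seconds
        self.generations.setdefault(self._generation(expires), {})[key] = value

    def get(self, key, now):
        """Search unexpired generations, newest first, so later writes win."""
        current = self._generation(now)
        for index in sorted(self.generations, reverse=True):
            if index < current:
                break
            if key in self.generations[index]:
                return self.generations[index][key]
        return None

    def collect(self, now):
        """Drop whole generations whose time range has passed."""
        current = self._generation(now)
        expired = [i for i in self.generations if i < current]
        for index in expired:
            del self.generations[index]
        return len(expired)


table = GenerationalTable(generation_seconds=60, lifetime_seconds=300)
table.put("s1", "a", now=0)    # expires at t=300
table.put("s2", "b", now=120)  # expires at t=420
dropped = table.collect(now=360)
```

Collection discards an entire inner table in one step, which is why no per-entry reference counting or scanning is needed; the cost is that expiry is only as precise as the generation width.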
44
SSM: Self-Managing
- Adaptive:
    - Stub maintains count of maximum allowable in-flight requests to each brick
        - Additive increase on successful request
        - Multiplicative decrease on timeout
    - Stubs discover capacity of each brick
Self-Tuning
- Admission control
    - Stubs say “no” if insufficient bricks
    - Propagate backpressure from bricks to clients
        - Turn users away under overload
Self-Protecting
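The stub-side behavior listed above (additive increase, multiplicative decrease, and refusing work when too few bricks have capacity) can be sketched as follows. This is a minimal sketch, not SSM's implementation: the `BrickWindow` class, the `admit_write` function, and the specific constants (initial limit 4, increase 1, decrease factor 0.5) are assumptions made for illustration.

```python
class BrickWindow:
    """Per-brick limit on in-flight requests, adjusted AIMD-style by the stub."""

    def __init__(self, initial=4.0, increase=1.0, decrease=0.5, floor=1.0):
        self.limit = initial
        self.increase = increase
        self.decrease = decrease
        self.floor = floor
        self.in_flight = 0

    def has_capacity(self):
        return self.in_flight < int(self.limit)

    def try_send(self):
        """Admit a request only if the brick has spare capacity."""
        if not self.has_capacity():
            return False
        self.in_flight += 1
        return True

    def on_success(self):
        self.in_flight -= 1
        self.limit += self.increase  # additive increase

    def on_timeout(self):
        self.in_flight -= 1
        # Multiplicative decrease, never below the floor.
        self.limit = max(self.floor, self.limit * self.decrease)


def admit_write(windows, wq=2):
    """Admission control: say "no" unless at least WQ bricks can take a request."""
    available = [brick_id for brick_id, w in windows.items() if w.has_capacity()]
    return len(available) >= wq


busy = BrickWindow(initial=4.0)
sent = [busy.try_send() for _ in range(5)]  # the fifth request is refused
windows = {1: busy, 2: BrickWindow(), 3: BrickWindow()}
ok = admit_write(windows, wq=2)
```

When `admit_write` returns False, the stub would reject the request rather than queue it, which is how backpressure from overloaded bricks reaches the client.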