A Recovery-Friendly, Self-Managing Session State Store Benjamin Ling, Emre Kiciman, Armando Fox

{bling,emrek,fox}@cs.stanford.edu

© 2004 Benjamin Ling

Outline
- Motivation: What is Session State?
- SSM:
  - Architecture
  - Algorithm
  - Backpressure and Admission Control
- SSM + Pinpoint
  - Self-recovering, self-monitoring
- Benchmarks
- Next steps: Sun Reference AppServer integration
- Conclusion

Proliferation of J2EE and Web Services
- J2EE embraced as industry standard
- Framework
  - Simplifies development
  - Allows for portability of services
  - Standardized interfaces
- However, difficulties remain…

The Pain: Administration and Maintenance
- Administration is difficult and costly
  - Cost: database administrators cost ~$200K/yr each
  - Development efficiency negatively impacted
- Failure/recovery is costly
  - Recovery is slow, especially after site outages
  - Data loss on crashes
  - Users adversely affected

Not All State is Created Equal
- Various types of state in J2EE…
  - User profile state
  - Persistent shared state
  - Transaction history state
- But usually stored in the same place
  - Stored in a DB or FS
- Approach: focus on a particular class of state, exploit its properties, and simplify administration and maintenance

Properties of Session State
- Session state is a distinct subcategory of state
  - Single-user, serial access, semi-persistent data
  - Examples: temporary application data, application workflow
  - Example of usage (e.g., J2EE):
[Figure: example request flow between Browser and App Server, steps 1 to 6]

Session State Manager (SSM)
[Figure: AppServer with stubs connected to Bricks 1 to 5 (RAM, network interface)]
- Redundant, in-memory hash table distributed across nodes
- Algorithm: redundancy similar to quorums
  - Write to many random nodes, wait for few (avoids performance coupling)
  - Read one


Write example: “Write to Many, Wait for Few”
[Figure: Browser, AppServer stub, and Bricks 1 to 5]
- Try to write to W random bricks, W = 4
- Must wait for WQ bricks to reply, WQ = 2
- The cookie holds metadata: the bricks that acknowledged the write (bricks 1 and 4 in the figure)
- The remaining bricks may have crashed or may just be slow; the stub does not wait for them
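The write path can be sketched in Python. This is a minimal, single-process illustration under stated assumptions: `Brick`, `write_many_wait_few` and `read_one` are hypothetical names, a brick is a plain dict, a crashed brick raises `ConnectionError`, and the returned list of acknowledging brick IDs stands in for the cookie metadata. It is not SSM's actual stub code.

```python
import random
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


class Brick:
    """Hypothetical in-memory brick: a dict from session key to state."""

    def __init__(self, brick_id):
        self.brick_id = brick_id
        self.alive = True
        self.table = {}

    def put(self, key, value):
        if not self.alive:
            raise ConnectionError(f"brick {self.brick_id} is down")
        self.table[key] = value
        return self.brick_id


def write_many_wait_few(bricks, key, value, w=4, wq=2, timeout=1.0):
    """Send the write to w random bricks and return once wq have replied.

    The returned brick IDs stand in for the metadata kept in the cookie.
    """
    targets = random.sample(bricks, w)
    pool = ThreadPoolExecutor(max_workers=w)
    pending = {pool.submit(b.put, key, value) for b in targets}
    acked = []
    try:
        while pending and len(acked) < wq:
            done, pending = wait(pending, timeout=timeout,
                                 return_when=FIRST_COMPLETED)
            if not done:  # no reply within the timeout: stop waiting
                break
            for fut in done:
                if fut.exception() is None:  # crashed bricks raise instead
                    acked.append(fut.result())
    finally:
        pool.shutdown(wait=False)  # never block on slow stragglers
    if len(acked) < wq:
        raise RuntimeError("fewer than WQ bricks acknowledged the write")
    return acked


def read_one(bricks, cookie, key):
    """Read from the first live brick named in the cookie."""
    by_id = {b.brick_id: b for b in bricks}
    for brick_id in cookie:
        brick = by_id[brick_id]
        if brick.alive and key in brick.table:
            return brick.table[key]
    raise KeyError(key)


bricks = [Brick(i) for i in range(1, 6)]
bricks[2].alive = False  # brick 3 has crashed
cookie = write_many_wait_few(bricks, "session-42", {"step": 3})
print(cookie, read_one(bricks, cookie, "session-42"))
```

Because the stub returns as soon as WQ replies arrive, a slow or crashed brick among the W targets does not delay the request; this is the performance decoupling the slides refer to.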

SSM: Failure and Recovery
- Failure of a single node
  - No data loss: at least WQ-1 copies remain
  - State is available for R/W during the failure
- Recovery
  - Recovery is simply a restart; there is no recovery phase
  - No special-case recovery code
  - State is available for R/W during brick restart
  - Session state is self-recovering
    - The user's access pattern causes data to be rewritten
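A toy simulation of this failure and recovery behavior, assuming a synchronous write that stores each update on three randomly chosen live bricks (a simplification of write-to-many/wait-for-few). `crash`, `restart`, `write` and `read` are hypothetical names. The point it illustrates: a restarted brick comes back empty with no recovery code, reads are served by surviving copies, and the next request rewrites the session state.

```python
import random


class Brick:
    """Hypothetical in-memory brick; all of its state lives in RAM."""

    def __init__(self, brick_id):
        self.brick_id = brick_id
        self.alive = True
        self.table = {}

    def crash(self):
        self.alive = False
        self.table = {}  # in-memory state is lost on a crash

    def restart(self):
        self.alive = True  # comes back empty: no recovery protocol


def write(bricks, key, value, copies=3):
    """Simplified write: store the value on `copies` random live bricks."""
    live = [b for b in bricks if b.alive]
    chosen = random.sample(live, copies)
    for b in chosen:
        b.table[key] = value
    return [b.brick_id for b in chosen]  # plays the role of the cookie


def read(bricks, cookie, key):
    """Read one copy, skipping bricks that are down or no longer hold it."""
    by_id = {b.brick_id: b for b in bricks}
    for brick_id in cookie:
        b = by_id[brick_id]
        if b.alive and key in b.table:
            return b.table[key]
    raise KeyError(key)


bricks = [Brick(i) for i in range(1, 6)]
by_id = {b.brick_id: b for b in bricks}

cookie = write(bricks, "s1", {"step": 1})
by_id[cookie[0]].crash()            # one replica fails
state = read(bricks, cookie, "s1")  # still served by a surviving copy
by_id[cookie[0]].restart()          # plain restart, nothing to replay
cookie = write(bricks, "s1", {"step": state["step"] + 1})  # next request rewrites
print(read(bricks, cookie, "s1"))   # → {'step': 2}
```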

SSM Monitoring
- N replicated bricks handle read/write requests
  - Cannot do structural anomaly detection!
  - Alternative features (performance, memory usage, etc.)
- Activity statistics: how often did a brick do something?
  - Messages received/sec, dropped/sec, etc.
  - Same across all peers, assuming a balanced workload
  - Use anomalies as likely failures
- State statistics: current state of the system
  - Memory usage, queue length, etc.
  - Similar pattern across peers, but may not be in phase
  - Look for patterns in the time series; differences in patterns indicate a failure at a node
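One way to compare an activity statistic across peers, sketched under assumptions: the slides do not say which test SSM uses, so the median/median-absolute-deviation rule, the threshold `k = 3`, and the name `anomalous_bricks` are illustrative choices, not the authors' method.

```python
from statistics import median


def anomalous_bricks(rates, k=3.0):
    """Return IDs of bricks whose rate is far from the peer median.

    `rates` maps brick ID -> an activity statistic such as messages
    received per second. With a balanced workload every brick should be
    close to the median, so a large deviation suggests a failure.
    """
    values = list(rates.values())
    center = median(values)
    mad = median([abs(v - center) for v in values])
    scale = mad if mad > 0 else 1e-9  # identical peers: any gap is suspect
    return sorted(b for b, v in rates.items() if abs(v - center) / scale > k)


msgs_per_sec = {1: 505.0, 2: 495.0, 3: 502.0, 4: 60.0, 5: 498.0}
print(anomalous_bricks(msgs_per_sec))  # → [4]
```

This kind of point-in-time comparison only suits activity statistics; as the slide notes, state statistics such as queue length may be out of phase across peers, which is why they call for time-series pattern comparison instead.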

Surprising Patterns in Time-Series
1. Discretize the time series into a string [Keogh]: [0.2, 0.3, 0.4, 0.6, 0.8, 0.2] -> “aaabba”
2. Calculate the frequencies of short substrings in the string: “aa” occurs twice; “ab”, “bb” and “ba” occur once each.
3. Compare the frequencies to normal behavior; look for substrings that occur much less or much more often than normal.
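The three steps can be sketched as follows. This is a simplification of Keogh's approach: it discretizes with fixed breakpoints (a single breakpoint at 0.5, which reproduces the slide's “aaabba”) rather than the normalization and Gaussian breakpoints of the SAX representation, and it scores substrings by a plain count difference. `discretize`, `substring_counts`, `surprising` and the threshold are hypothetical.

```python
from bisect import bisect_right
from collections import Counter


def discretize(series, breakpoints=(0.5,), alphabet="ab"):
    """Step 1: map each value to the symbol of the interval it falls in."""
    return "".join(alphabet[bisect_right(breakpoints, x)] for x in series)


def substring_counts(s, n=2):
    """Step 2: count every (overlapping) length-n substring."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def surprising(observed, reference, n=2, threshold=3.0):
    """Step 3: substrings whose count differs a lot from the reference.

    Reference counts are rescaled to the observed string's length.
    """
    obs = substring_counts(observed, n)
    ref = substring_counts(reference, n)
    scale = (len(observed) - n + 1) / (len(reference) - n + 1)
    return sorted(k for k in set(obs) | set(ref)
                  if abs(obs[k] - ref[k] * scale) >= threshold)


s = discretize([0.2, 0.3, 0.4, 0.6, 0.8, 0.2])
print(s)                    # → aaabba
print(substring_counts(s))  # "aa": 2; "ab", "bb", "ba": 1 each
print(surprising("aabbaaaaaaaa", "aabbaabbaabb"))  # → ['aa']
```

In the last line the reference string is a brick's normal periodic pattern, and the observed string is a brick whose statistic has stopped oscillating, so “aa” occurs far more often than normal.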

Macrobenchmark
- TellMe's Email-By-Phone application
- Session state stored in memory
  - Email header information
  - Index information
- Alter the application to store session state using
  - Disk
  - SSM

Future Work
- Integrate with Sun's reference Application Server
  - Enterprise benchmarks
- Statistical anomaly detection
  - Too many magic numbers
- Integrated ROC-J2EE application server

Existing solutions:
- File systems and databases
  - Poor failure behavior
    - Lose data (FS)
  - Slow recovery (both)
  - Difficult to administer (DB)
  - Difficult to tune (both)
- In-memory replication using primary/secondary:
  - Performance coupling
  - Poor failover (uneven load balancing)

Other implementation details
- Garbage collection
  - Generational hash table
    - Hash table of hash tables
    - Each hash table has an associated time range
    - When that time has passed, GC that table
  - No reference counting, scanning, etc.
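A minimal sketch of the generational hash table, assuming each generation covers a fixed `generation_span` seconds and that only the newest `num_generations` generations are retained (both hypothetical parameters, as is the class name). An injectable clock keeps the example deterministic.

```python
import time


class GenerationalTable:
    """Hash table of hash tables, one inner table per time range.

    New entries go into the newest generation. Once a generation's time
    range has expired, its whole inner table is dropped in one step, so no
    per-entry reference counting or scanning is needed.
    """

    def __init__(self, generation_span, num_generations, clock=time.monotonic):
        self.span = generation_span
        self.num_generations = num_generations
        self.clock = clock
        self.generations = {}  # generation index -> {key: value}

    def _gen(self, t):
        return int(t // self.span)

    def put(self, key, value):
        self.collect()
        gen = self._gen(self.clock())
        self.generations.setdefault(gen, {})[key] = value

    def get(self, key):
        self.collect()
        for gen in sorted(self.generations, reverse=True):  # newest first
            if key in self.generations[gen]:
                return self.generations[gen][key]
        raise KeyError(key)

    def collect(self):
        oldest_kept = self._gen(self.clock()) - self.num_generations + 1
        for gen in [g for g in self.generations if g < oldest_kept]:
            del self.generations[gen]  # GC the whole inner table at once


now = [0.0]
table = GenerationalTable(60, 2, clock=lambda: now[0])
table.put("session-7", {"step": 1})
now[0] = 90.0
print(table.get("session-7"))  # → {'step': 1}
now[0] = 130.0
table.collect()
print(table.generations)  # → {}
```

Expiry is coarse-grained: an entry lives between one and two generation spans depending on when it was written, which is the trade-off for never touching individual entries.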

SSM: Self-Managing
- Adaptive:
  - Each stub maintains a maximum number of allowable in-flight requests to each brick
    - Additive increase on a successful request
    - Multiplicative decrease on a timeout
  - Stubs discover the capacity of each brick ⇒ self-tuning
- Admission control
  - Stubs say “no” if there are insufficient bricks
  - Propagate backpressure from bricks to clients
    - Turn users away under overload ⇒ self-protecting
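The self-tuning and admission-control rules can be sketched per brick as an AIMD window. The step sizes (+1 per success, halving on timeout), the floor of one request, and the names `BrickWindow` and `admit` are assumptions for illustration; the slides only specify additive increase, multiplicative decrease, and refusing requests when too few bricks have capacity.

```python
import random


class BrickWindow:
    """Stub-side limit on in-flight requests to one brick, tuned by AIMD."""

    def __init__(self, initial=4.0, increase=1.0, decrease=0.5, floor=1.0):
        self.limit = initial
        self.in_flight = 0
        self.increase = increase
        self.decrease = decrease
        self.floor = floor

    def has_room(self):
        return self.in_flight < int(self.limit)

    def on_send(self):
        self.in_flight += 1

    def on_success(self):
        self.in_flight -= 1
        self.limit += self.increase  # additive increase

    def on_timeout(self):
        self.in_flight -= 1
        # multiplicative decrease, never below the floor
        self.limit = max(self.floor, self.limit * self.decrease)


def admit(windows, w):
    """Pick w random bricks with spare capacity, or refuse the request."""
    ready = [b for b, win in windows.items() if win.has_room()]
    if len(ready) < w:
        return None  # say "no": backpressure reaches the client
    chosen = random.sample(ready, w)
    for b in chosen:
        windows[b].on_send()
    return chosen


win = BrickWindow(initial=4.0)
win.on_send()
win.on_success()   # 4.0 -> 5.0
win.on_send()
win.on_timeout()   # 5.0 -> 2.5
print(win.limit)   # → 2.5

windows = {i: BrickWindow(initial=1.0) for i in range(1, 6)}
print(len(admit(windows, w=4)))  # → 4: those four bricks are now at their limit
print(admit(windows, w=4))       # → None: only one brick has room
```

Halving on timeout lets a stub back off quickly from an overloaded or failing brick, while slow additive growth probes for spare capacity; this is how the stubs can discover each brick's capacity without manual configuration.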
