Welcome to the Winter 2004 ROC Retreat

Armando Fox and David Patterson

About ROC Retreats
- Purpose of semi-annual retreats
  - Progress reports/talks from academia and industry
  - Exposure/feedback on new ideas or work in progress
  - Brainstorming in an immersive atmosphere
  - Industry/visitor feedback, opportunities for collaboration
  - Skiing
- Logistics
  - Web server with retreat talks/papers; thanks to Mike Howard and Bob Miller

ROC Events
- Aaron Brown, UC Berkeley => Dr. Aaron Brown, IBM Research
- Pete Broadwell, UC Berkeley => Pete Broadwell, M.S., ???
- Soon: Mike Chen, UC Berkeley => Dr. Mike Chen, ???
- ROC work recognized in the 2003 Scientific American 50

Recent Publications (since June 2003)
- Published or to appear:
  - Ben Ling, Emre Kiciman, Armando Fox: Session State: Beyond Soft State, in NSDI 2004
  - Mike Chen, Anthony Accardi, Emre Kiciman, Jim Lloyd, Eric Brewer, Armando Fox: Path-Based Failure and Evolution Management, in NSDI 2004
  - George Candea, Steve Zhang, Emre Kiciman, Armando Fox: Application-Generic Recovery for Internet Middleware, Cluster Computing Journal (special issue on Autonomic Computing), summer 2004
  - George Candea, James Cutler, Armando Fox: Improving Availability with Recursive Microreboots: A Soft-State System Case Study, Performance Evaluation Journal, 56(1-3), March 2004
- In submission:
  - George Candea and Armando Fox: Microreboots: An Application-Generic Recovery Technique for Internet Services, submitted to USENIX 2004
  - Andy Huang and Armando Fox: Free Recovery: A Step Towards Self-Managing State, submitted to USENIX 2004
  - Emre Kiciman and Armando Fox: Detecting and Localizing Anomalous Behavior to Discover Failures in Component-Based Internet Services, submitted to USENIX 2004
  - Yee-Jiun Song, Jeff Raymakers, Wendy Tobagus, Armando Fox: Is MTTR More Important Than MTTF For User-Perceived Availability?, submitted to DSN-IPDS 2004

Preview of some upcoming talks
- Benchmarking
  - Evaluating undo: human-aware recovery benchmarks
  - Benchmarking distributed services
  - Including latency & data quality in performability evaluation of a web-based service
- Making recovery nearly free
  - Evaluating the effect of micro-reboots on end users
  - How cheap recovery simplifies persistent state management
- Embracing statistical analysis
  - Using statistical learning to detect and localize faults in componentized Internet services
  - A statistical learning approach to failure diagnosis for eBay
  - Toward generalized APIs for statistical monitoring

ROC => RADS
- Generalize ROC approaches that focus on statistical anomaly detection as a way of detecting conditions that require a response
- Generalize “recovery” to “adaptation”
  - System is “always recovering”/“always adapting”
  - Some early examples of this will be featured in talks
- Insight: statistical pattern recognition provides a degree of application-generic failure detection; nearly-free recovery means we can tolerate some false positives
- Kickoff panel this evening

Other Highlights
- Poster advertisements before the poster session
- Three talks from industrial visitors
  - Moises Goldszmidt: statistical pattern recognition applied to systems management
  - Chris Overton: modeling large-scale IT systems
  - Paul Brett: real-world failures, a systemic view