1 Berkeley RAD Lab: Robust, Adaptive, Distributed Systems
Armando Fox, Randy Katz, Michael Jordan, Dave Patterson, Scott Shenker, Ion Stoica
November 2005
2 RAD Lab
The 5-year Vision: A single person can go from vision to a next-generation IT service (“the Fortune 1 million”)
- E.g., over a long holiday weekend in 1995, Pierre Omidyar created eBay v1.0
The Vehicle: An interdisciplinary center creates the core technical competency to demo 10X to 100X
- Researchers are leaders in machine learning, networking, and systems
- Industrial participants: leading companies in HW, systems SW, and online services
- Called “RAD Lab” for Reliable, Adaptable, Distributed systems
3 RAD Lab
The Science: Both shorter-term and longer-term solutions
- Develop using primitives to enable functions (MapReduce) and services (Craigslist)
- Assess/debug using deterministic replay and finding new metrics
- Deploy using “Internet-in-a-Box” via FPGAs under failure/slowdown workloads
- Operate using Statistical Learning Theory-friendly, Control Theory-friendly software architectures and visualization tools
Added Value to Industrial Participants:
- Working with leading people and companies from different industries on long-range, pre-competitive technology
- Training of dozens of future leaders of IT in multiple disciplines, and their recruitment by industrial participants
- Working with researchers with a successful track record of rapid transfer of new technology
[Diagram labels: Cap, Dado (the section of a pedestal between cap and base), Base]
4 Steps vs. Process
Steps: Traditional, Static Handoff Model, N groups
Process: Support DADO Evolution, 1 group
[Diagram: Develop, Assess, Deploy, Operate, shown once for each model]
5 DADO - Develop
- Create abstractions, primitives, & toolkit for large-scale systems that make it easy to invent/deploy functions (e.g., MapReduce)
- For example, Distributed Hash Tables (OpenDHT)
- Already setting the trend for IETF standards
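As a rough illustration of the kind of primitive this slide has in mind, here is a minimal single-process sketch of the MapReduce pattern in Python. It is not Google's implementation or any RAD Lab toolkit; the names `map_reduce`, `count_words`, and `sum_counts` are hypothetical, and a real system would spread the map and reduce phases across many machines.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run mapper over every record, group emitted values by key, then reduce each group.

    Single-process illustration only: no distribution, no fault tolerance.
    """
    groups = defaultdict(list)
    for record in records:
        # The mapper emits zero or more (key, value) pairs per record.
        for key, value in mapper(record):
            groups[key].append(value)
    # The reducer folds all values that share a key into one result.
    return {key: reducer(key, values) for key, values in groups.items()}


def count_words(line):
    """Hypothetical mapper: emit (word, 1) for each whitespace-separated word."""
    for word in line.split():
        yield word.lower(), 1


def sum_counts(_key, values):
    """Hypothetical reducer: total the counts emitted for one word."""
    return sum(values)


if __name__ == "__main__":
    lines = ["develop assess deploy operate", "Develop again"]
    print(map_reduce(lines, count_words, sum_counts))
```

The appeal of the abstraction is that the author of `count_words` and `sum_counts` never touches grouping or scheduling, which is the "easy to invent/deploy functions" point of the slide.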
6 DADO - Assess
“We improve what we can measure”
- Inspect box visibility into networks, usually data poor
- Servers are data rich; data often discarded
- Statistical and Machine Learning (SML) to the rescue. It works well when:
  - You have lots of raw data
  - You have reason to believe the raw data is related to some high-level effect you’re interested in
  - You don’t have a model of what that relationship is
- Note: SML advances fast analysis
7 DADO - Deploy
- Re-engineer RAMP to act like a 1000+ node distributed system under realistic failure and slowdown workloads
- RAMP emulates data center & wide-area systems as well as MPP
- Collect and apply failure data from the real world
- RAMP vs. clusters: larger scale, easier to develop/debug, flexible HW/SW configuration, inexpensive so no need to share
- Explore via repeatable experiments while varying parameters and configurations, vs. observations on a single (aging) cluster that is often idiosyncratic
8 DADO - Operate
- Idea: when a site misbehaves, users notice and change their behavior; use this as a “failure detector”
- Approach: combine visualization with Statistical and Machine Learning analysis so operators see anomalies too
- Experiment: does the distribution of hits to various pages match the “historical” distribution?
- Each minute, compare hit counts of the top N pages to hit counts over the last 6 hours using Bayesian networks and a χ² test, on real Ebates data
To learn more, see “Combining Visualization and Statistical Analysis to Improve Operator Confidence and Efficiency for Failure Detection and Localization,” In Proc. 2nd IEEE Int’l Conf. on Autonomic Computing, June 2005, by Peter Bodik, Greg Friedman, Lukas Biewald, Helen Levine (Ebates.com), George Candea, Kayur Patel, Gilman Tolle, Jon Hui, Armando Fox, Michael I. Jordan, David Patterson.
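The per-minute comparison described on this slide can be sketched roughly as follows. This covers only the χ² part (not the Bayesian-network analysis) and is an illustration under stated assumptions, not the method from the cited paper: the function names, the sample page names and counts, and the default threshold are all hypothetical. The threshold 9.21 is the standard χ² critical value for 2 degrees of freedom at the 1% level, which fits the three-page example only.

```python
def chi_square_statistic(current_hits, historical_hits):
    """Pearson chi-square statistic of this minute's hits against the historical page mix.

    Both arguments map page name -> hit count. Expected counts are the
    historical proportions scaled to this minute's total traffic.
    """
    total_hist = sum(historical_hits.values())
    total_now = sum(current_hits.get(page, 0) for page in historical_hits)
    if total_hist == 0 or total_now == 0:
        return 0.0  # nothing to compare
    stat = 0.0
    for page, hist_count in historical_hits.items():
        expected = total_now * hist_count / total_hist
        if expected > 0:
            observed = current_hits.get(page, 0)
            stat += (observed - expected) ** 2 / expected
    return stat


def is_anomalous(current_hits, historical_hits, threshold=9.21):
    """Flag the minute when the statistic exceeds a (hypothetical) threshold."""
    return chi_square_statistic(current_hits, historical_hits) > threshold


if __name__ == "__main__":
    # Made-up counts: hits over the last 6 hours for the top 3 pages.
    history = {"home": 600, "account": 300, "search": 100}
    normal_minute = {"home": 60, "account": 30, "search": 10}
    broken_minute = {"home": 60, "account": 0, "search": 40}
    print(chi_square_statistic(normal_minute, history))  # 0.0
    print(chi_square_statistic(broken_minute, history))  # 120.0
    print(is_anomalous(broken_minute, history))          # True
```

In the broken minute, users stop reaching the account page and traffic shifts elsewhere, so the page mix departs sharply from its historical proportions even though total traffic is unchanged. That is the sense in which user behavior serves as a failure detector.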
9 Novel Visualization
Winning operator trust: “I see and understand”
[Figure: anomaly score over time; annotations: “Account page problem”, “11:07am start of anomalies”, “11:33am – 11:56am site crash”]
10 Founding the RAD Lab; Start 12/1
- Looking for 3 to 5 founding companies to fund 5 years @ cost of $0.5M / year
- 25 grad students + 15 undergrads + 6 faculty + 2 staff
- Founding companies: Google, Microsoft, Sun Microsystems
- RADS Consortium model
- Preference to founding partner technology in prototypes
- Designate employees to act as consultants
- Head start for participants on research results
- Putting IP in the public domain so partners can use it and not be sued
- Press release of founding RAD Lab partners December 1?
- Mid-project review after 3 years by founding partners
11 RAD Lab Opportunity: New Research Model
- Chance to partner with the top university in computer systems on the “Next Great Thing”
- National Academy of Engineering mentions Berkeley in 7 of 19 $1B+ industries that came from IT research
- NAE mentions Berkeley 7 times, Stanford 5 times, MIT 5, CMU 3
- Timesharing (SDS 940), Client-Server Computing (BSD Unix), Graphics, Entertainment, Internet, LANs, Workstations, GUI, VLSI Design (Spice) [ECAD $5B?/yr], RISC [$10B?/yr], Relational DB (Ingres/Postgres) [RDB $15B?/yr], Parallel DB, Data Mining, Parallel Computing, RAID [$15B?/yr], Portable Communication (BWRC), WWW, Speech Recognition, Broadband
- Berkeley is one of the top suppliers of systems students to industry and academia
- US News & World Report ranking of CS Systems universities: 1 Berkeley, 2 CMU, 2 MIT, 4 Stanford
12 RAD Lab: Interdisciplinary Center for Reliable, Adaptive, Distributed Systems
- Working with different industries on long-range, pre-competitive technology
- Training of dozens of future leaders of IT, plus their recruitment
- Working with researchers with track records of successful technology transfer
Capability (Desired): 1 person can invent & run the next-gen IT service
- Develop using primitives to enable functions and services
- Assess using deterministic replay and statistical and machine learning (SML)
- Deploy via “Internet-in-a-Box” FPGAs
- Operate SML-friendly, Control Theory-friendly architectures and operator-centric visualization and analysis tools
Base Technology: Server Hardware, System Software, Middleware, Networking
13 Backup Slides
14 References
To learn more, see:
- “Combining Visualization and Statistical Analysis to Improve Operator Confidence and Efficiency for Failure Detection and Localization,” Peter Bodik, Greg Friedman, Lukas Biewald, Helen Levine (Ebates.com), George Candea, Kayur Patel, Gilman Tolle, Jon Hui, Armando Fox, Michael I. Jordan, and David Patterson. In Proc. 2nd IEEE Int’l Conf. on Autonomic Computing, June 2005.
- “Microreboot -- A Technique for Cheap Recovery,” George Candea, Shinichi Kawamoto, Yuichi Fujiki, Greg Friedman, and Armando Fox. In Proc. 6th Symp. on Operating Systems Design and Implementation (OSDI), San Francisco, CA, Dec. 2004.
- “Path-Based Failure and Evolution Management,” Mike Y. Chen, Anthony Accardi, Emre Kiciman, Jim Lloyd, Dave Patterson, Armando Fox, and Eric Brewer. In Proc. 1st USENIX/ACM Symp. on Networked Systems Design and Implementation (NSDI '04), San Francisco, CA, March 2004.
- “Scalable Statistical Bug Isolation,” Ben Liblit, M. Naik, Alice X. Zheng, Alex Aiken, and Michael I. Jordan. PLDI, 2005.
15 Sustaining Innovation/Training Engine in 21st Century
- Replicate research centers based primarily on industrial funding to expand the IT market and to train the next generation of IT leaders
- Berkeley Wireless Research Center (BWRC): 50 grad students, 30 undergrads @ $5M per year
- Stanford Network Research Center (SNRC): 50 grad students @ $5M per year
- MIT Tparty: $4M per year (100% $ from Quanta)
- Industry largely funds N companies, where N is 5?
- Exciting, long-term technical vision
- Demonstrated by prototype(s)
16 State of Research Funding Today
- Most industry research is shorter term
- DARPA exiting long-term (exp.) IT research
  - ’03-’05 BAAs IPTO: 9 AI, 2 classified, 1 SW radio, 1 sensor net, 1 reliability; all have 12 to 18 month “go/no go” milestones
  - Academic-led funding reduced 50% (so far) 2001 to 2004
  - Faculty ≈ consultants in consortia led by defense contractor, get grants ≈ support 1-2 students (~ NSF funding level)
- NSF swamped with proposals, conservative
  - 2000 to 6500 proposals in 5 years
  - IT has lowest acceptance rate at NSF (between 8% and 16%)
  - “Ambitious proposal” is a negative review
  - Even if get NSF funding, proposal reduced to stretch NSF $; e.g., got 3 x 1/3 faculty, 6 grad students, 0 staff, 3 years
(To learn more, see www.cra.org/research)
17 RAD Lab Timeline
2005: Launch RAD Lab 12/1
2006: Collect workloads, Internet in a Box
2007: SLT/CT distributed architectures, Iboxes, annotative layer, class testing
2008: Development toolkit 1.0, tuple space, class testing; Mid Project Review
2009: RAD Lab software suite 1.0, class testing
2010: End of Project Party