Berkeley RAD Lab: Robust, Adaptive, Distributed Systems. Armando Fox, Randy Katz, Michael Jordan, Dave Patterson, Scott Shenker, Ion Stoica. November 2005.

Presentation transcript:

1 Berkeley RAD Lab: Robust, Adaptive, Distributed Systems Armando Fox, Randy Katz, Michael Jordan, Dave Patterson, Scott Shenker, Ion Stoica November 2005

2 RAD Lab
The 5-year Vision: A single person can go from vision to a next-generation IT service (“the Fortune 1 million”)
- E.g., over a long holiday weekend in 1995, Pierre Omidyar created eBay v1.0
The Vehicle: Interdisciplinary Center creates core technical competency to demo 10X to 100X
- Researchers are leaders in machine learning, networking, and systems
- Industrial Participants: leading companies in HW, systems SW, and online services
- Called “RAD Lab” for Reliable, Adaptable, Distributed systems

3 RAD Lab
The Science: Both shorter-term and longer-term solutions
- Develop using primitives to enable functions (MapReduce) and services (Craigslist)
- Assess/debug using deterministic replay and finding new metrics
- Deploy using “Internet-in-a-Box” via FPGAs under failure/slowdown workloads
- Operate using Statistical Learning Theory-friendly, Control Theory-friendly software architectures and visualization tools
[Pedestal diagram labels: Cap; Dado (the section of a pedestal between cap and base); Base]
Added Value to Industrial Participants:
- Working with leading people and companies from different industries on long-range, pre-competitive technology
- Training of dozens of future leaders of IT in multiple disciplines, and their recruitment by industrial participants
- Working with researchers with a successful track record of rapid transfer of new technology

4 Steps vs. Process
- Steps: Traditional, Static Handoff Model, N groups (Develop, Assess, Deploy, Operate)
- Process: Support DADO Evolution, 1 group (Develop, Assess, Deploy, Operate)

5 DADO - Develop
Create abstractions, primitives, & toolkit for large-scale systems that make it easy to invent/deploy functions (e.g., MapReduce)
- For example, Distributed Hash Tables (OpenDHT)
- Already setting the trend for IETF standards
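
To make the idea of a reusable primitive concrete, here is a minimal single-process sketch of the MapReduce pattern in Python. It is an illustration only, not the RAD Lab toolkit or any production MapReduce API; the names `map_reduce`, `word_count_mapper`, and `sum_reducer` are hypothetical.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run a mapper over every record, group values by key, then reduce each group.

    Single-process illustration of the pattern; a real system would
    distribute the map, shuffle, and reduce phases across machines.
    """
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


def word_count_mapper(line):
    """Emit (word, 1) for each whitespace-separated word in a line."""
    for word in line.lower().split():
        yield word, 1


def sum_reducer(key, values):
    """Combine all counts emitted for one word."""
    return sum(values)


lines = ["develop assess deploy operate", "develop and deploy"]
counts = map_reduce(lines, word_count_mapper, sum_reducer)
```

The application author writes only the mapper and the reducer; grouping (and, in a real deployment, distribution and fault handling) stays inside the primitive.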

6 DADO - Assess
“We improve what we can measure”
- Inspect box visibility into networks, usually data poor
- Servers data rich; data often discarded
Statistical and Machine Learning (SML) to the rescue. It works well when
- You have lots of raw data
- You have reason to believe the raw data is related to some high-level effect you’re interested in
- You don’t have a model of what that relationship is
Note: SML advances → fast analysis

7 DADO - Deploy
Re-engineer RAMP to act like node distributed system under realistic failure and slowdown workloads
- RAMP emulates data center & wide area systems as well as MPP
- Collect and apply failure data from real world
- RAMP vs. Clusters: larger scale, easier to develop/debug, flexible HW/SW configuration, inexpensive so no need to share
Explore via repeatable experiments as we vary parameters and configurations, vs. observations on a single (aging) cluster that is often idiosyncratic
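
The "repeatable experiments" point can be sketched in software, independent of RAMP hardware: if failure and slowdown injection is driven by a seeded random generator, the same seed reproduces the same run while parameters are varied one at a time. The following is a toy simulation under assumed parameters; `run_experiment`, `fail_prob`, and `slow_prob` are hypothetical names, and the probabilities are placeholders rather than measured failure data.

```python
import random


def run_experiment(seed, num_nodes=100, fail_prob=0.01, slow_prob=0.05, steps=50):
    """Simulate failure/slowdown injection across nodes for a number of time steps.

    All randomness comes from a generator seeded by `seed`, so the same
    arguments always produce the same result (a repeatable experiment).
    """
    rng = random.Random(seed)
    failed = set()       # nodes that have crashed and stay down
    slow_events = 0      # count of injected slowdowns on live nodes
    for _ in range(steps):
        for node in range(num_nodes):
            if node in failed:
                continue
            draw = rng.random()
            if draw < fail_prob:
                failed.add(node)
            elif draw < fail_prob + slow_prob:
                slow_events += 1
    return {"failed_nodes": len(failed), "slow_events": slow_events}


# Same seed, same configuration: identical outcome.
baseline = run_experiment(seed=42)
repeat = run_experiment(seed=42)

# Vary one parameter while holding the seed fixed.
no_failures = run_experiment(seed=42, fail_prob=0.0)
```

On a shared physical cluster the failure sequence cannot be replayed this way, which is the contrast the slide draws.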

8 DADO - Operate
Idea: when a site misbehaves, users notice and change their behavior; use this as a “failure detector”
Approach: combine visualization with Statistical and Machine Learning analysis so operators see anomalies too
Experiment: does the distribution of hits to various pages match the “historical” distribution?
- Each minute, compare hit counts of top N pages to hit counts over last 6 hours using Bayesian networks and χ² test, real Ebates data
To learn more, see “Combining Visualization and Statistical Analysis to Improve Operator Confidence and Efficiency for Failure Detection and Localization,” In Proc. 2nd IEEE Int’l Conf. on Autonomic Computing, June 2005, by Peter Bodik, Greg Friedman, Lukas Biewald, Helen Levine (Ebates.com), George Candea, Kayur Patel, Gilman Tolle, Jon Hui, Armando Fox, Michael I. Jordan, David Patterson.
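
The per-minute comparison can be sketched as a Pearson chi-square test of the current minute's hit counts against the proportions seen in the historical window. This is a simplified illustration, not the method of the cited paper: it covers only the χ² part (not the Bayesian-network analysis), the hit counts are invented, and the function names are hypothetical. The threshold 13.277 is the standard χ² critical value for 4 degrees of freedom at significance level 0.01, which matches the assumed top N = 5 pages.

```python
def chi_square_statistic(current_hits, historical_hits):
    """Pearson chi-square statistic of current counts vs. historical proportions.

    current_hits[i] and historical_hits[i] are hit counts for the same page.
    Expected counts scale the historical proportions to the current total.
    """
    if len(current_hits) != len(historical_hits):
        raise ValueError("both lists must cover the same pages")
    if any(h <= 0 for h in historical_hits):
        raise ValueError("every page needs a positive historical count")
    total_now = sum(current_hits)
    total_hist = sum(historical_hits)
    stat = 0.0
    for now, hist in zip(current_hits, historical_hits):
        expected = total_now * hist / total_hist
        stat += (now - expected) ** 2 / expected
    return stat


# Chi-square critical value for 4 degrees of freedom (5 pages) at alpha = 0.01.
CRITICAL_DF4_P01 = 13.277


def is_anomalous(current_hits, historical_hits, critical=CRITICAL_DF4_P01):
    """Flag the minute when the statistic exceeds the critical value."""
    return chi_square_statistic(current_hits, historical_hits) > critical


# Invented counts: hits to the top 5 pages over the last 6 hours,
# a typical minute, and a minute where page 4 suddenly spikes.
last_six_hours = [6000, 3000, 1500, 900, 600]
normal_minute = [50, 25, 13, 7, 5]
odd_minute = [50, 25, 13, 40, 5]

normal_flag = is_anomalous(normal_minute, last_six_hours)
odd_flag = is_anomalous(odd_minute, last_six_hours)
```

A fixed critical value only fits a fixed N; for other values of N the threshold would need to come from a χ² table or a statistics library.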

9 Novel Visualization: Winning operator trust
[Figure: anomaly score over time. Annotations: 11:07am start of anomalies (account page problem); 11:33am – 11:56am site crash. Operator reaction: “I see and understand”]

10 Founding the RADLab; Start 12/1
Looking for 3 to 5 founding companies to fund 5 cost of $0.5M / year
- 25 grad students + 15 undergrads + 6 faculty + 2 staff
- Founding companies: Google, Microsoft, Sun Microsystems
RADS Consortium model
- Preference to founding partner technology in prototypes
- Designate employees to act as consultants
- Head start for participants on research results
- Putting IP in Public Domain so partners use & not sued
Press release of founding RAD Lab partners December 1?
Mid project review after 3 years by founding partners

11 RAD Lab Opportunity: New Research Model
Chance to Partner with the Top University in Computer Systems on the “Next Great Thing”
- National Academy of Engineering mentions Berkeley in 7 of 19 $1B+ industries that came from IT research
- NAE mentions Berkeley 7 times, Stanford 5 times, MIT 5, CMU 3
- Timesharing (SDS 940), Client-Server Computing (BSD Unix), Graphics, Entertainment, Internet, LANs, Workstations, GUI, VLSI Design (Spice) [ECAD $5B?/yr], RISC [$10B?/yr], Relational DB (Ingres/Postgres) [RDB $15B?/yr], Parallel DB, Data Mining, Parallel Computing, RAID [$15B?/yr], Portable Communication (BWRC), WWW, Speech Recognition, Broadband
Berkeley one of the top suppliers of systems students to industry and academia
- US News & World Report ranking of CS Systems universities: 1 Berkeley, 2 CMU, 2 MIT, 4 Stanford

12 RAD Lab: Interdisciplinary Center for Reliable, Adaptive, Distributed Systems
- Working with different industries on long-range, pre-competitive technology
- Training of dozens of future leaders of IT, plus their recruitment
- Working with researchers with track records of successful technology transfer
Capability (Desired): 1 person can invent & run the next-gen IT service
- Develop using primitives to enable functions and services
- Assess using deterministic replay and statistical and machine learning (SML)
- Deploy via “Internet-in-a-Box” FPGAs
- Operate SML-friendly, Control Theory-friendly architectures and operator-centric visualization and analysis tools
Base Technology: Server Hardware, System Software, Middleware, Networking

13 Backup Slides

14 References
- “Combining Visualization and Statistical Analysis to Improve Operator Confidence and Efficiency for Failure Detection and Localization,” Peter Bodik, Greg Friedman, Lukas Biewald, Helen Levine (Ebates.com), George Candea, Kayur Patel, Gilman Tolle, Jon Hui, Armando Fox, Michael I. Jordan, David Patterson. In Proc. 2nd IEEE Int’l Conf. on Autonomic Computing, June 2005.
- “Microreboot -- A Technique for Cheap Recovery,” George Candea, Shinichi Kawamoto, Yuichi Fujiki, Greg Friedman, and Armando Fox. Proc. 6th Symp. on Operating Systems Design and Implementation (OSDI), San Francisco, CA, Dec
- “Path-Based Failure and Evolution Management,” Mike Y. Chen, Anthony Accardi, Emre Kiciman, Jim Lloyd, Dave Patterson, Armando Fox, and Eric Brewer. In Proc. 1st USENIX/ACM Symp. on Networked Systems Design and Implementation (NSDI '04), San Francisco, CA, March
- “Scalable Statistical Bug Isolation,” Ben Liblit, M. Naik, Alice X. Zheng, Alex Aiken, and Michael I. Jordan. PLDI, 2005.
To learn more, see

15 Sustaining Innovation/Training Engine in 21st Century
Replicate research centers based primarily on industrial funding to expand IT market and to train next generation of IT leaders
- Berkeley Wireless Research Center (BWRC): 50 grad students, 30 $5M per year
- Stanford Network Research Center (SNRC): 50 Grad $5M per year
- MIT Tparty $4M per year (100% $ from Quanta)
- Industry largely funds N companies, where N is 5?
- Exciting, long term technical vision
Demonstrated by prototype(s)

16 State of Research Funding Today
Most industry research shorter term
DARPA exiting long-term (exp.) IT research
- ’03-’05 BAAs IPTO: 9 AI, 2 classified, 1 SW radio, 1 sensor net, 1 reliability, all have 12 to 18 month “go/no go” milestones
- Academic led funding reduced 50% (so far) 2001 to 2004
- Faculty ≈ consultants in consortia led by defense contractor, get grants ≈ support 1-2 students (~ NSF funding level)
NSF swamped with proposals, conservative
- 2000 to 6500 proposals in 5 years
- IT has lowest acceptance rate at NSF (between 8% and 16%)
- “Ambitious proposal” is a negative review
- Even if get NSF funding, proposal reduced to stretch NSF $, e.g., got 3 x 1/3 faculty, 6 grad students, 0 staff, 3 years
(To learn more, see

17 RAD Lab Timeline
- 2005 Launch RAD Lab 12/ Collect workloads, Internet in a Box
- 2007 SLT/CT distributed architectures, Iboxes, annotative layer, class testing
- 2008 Development toolkit 1.0, tuple space, class testing; Mid Project Review
- 2009 RAD Lab software suite 1.0, class testing
- 2010 End of Project Party