Introduction to CS739: Distributed Systems
UNIVERSITY of WISCONSIN-MADISON, Computer Sciences Department
CS 739 Distributed Systems
Andrea C. Arpaci-Dusseau

  What are distributed systems?
  What are the benefits and challenges?
  How will CS739 be structured?
    Readings, Writeups, Presentations
    Projects

Goals of Course
  Learn about challenges and existing techniques for building distributed systems and services
  Read and discuss influential papers from SOSP, OSDI, NSDI
  Gain some experience programming in a distributed environment
    Warm-up project
    Final project

What is a Distributed System?
  Leslie Lamport says: “You know you have one when the crash of a computer you never heard of stops you from doing any work”
  More technical definition: “Collection of independent computers that appears to its users as a single coherent system”
  How are parallel, distributed, networked systems different?
    All contain nodes (processing, memory, disk) connected with a network
    [Figure: spectrum from more unified to less unified: parallel, distributed, networked]
    Consider distributed services as well…

Benefits of Distributed Systems
  Great price/performance
    Leverage commodity components (nodes and networks)
    Use many, many of them
  Incremental scalability
    Can add x% new nodes (or disks or memory) to improve performance x%
  Improved availability
    Continue operating when some nodes stop working
  Improved reliability
    Deliver correct results when some nodes misbehave or corrupt data
  Allow geographically-distributed individuals to share data or cooperate

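The availability benefit can be made concrete with a small back-of-the-envelope sketch. It assumes (this is not stated on the slide) that each node is up independently with the same probability and that the service works as long as at least one replica is up; the function name `availability` is made up for illustration.

```python
def availability(p_node_up: float, replicas: int) -> float:
    """Probability that at least one of `replicas` nodes is up.

    Assumption: nodes fail independently, each being up with
    probability `p_node_up`. Real failures are often correlated,
    so treat the result as an optimistic estimate.
    """
    if not 0.0 <= p_node_up <= 1.0:
        raise ValueError("p_node_up must be in [0, 1]")
    if replicas < 1:
        raise ValueError("replicas must be >= 1")
    # The service is down only if every replica is down at once.
    return 1.0 - (1.0 - p_node_up) ** replicas


if __name__ == "__main__":
    # Compare one, two, and three replicas of a 99%-available node.
    for n in (1, 2, 3):
        print(n, availability(0.99, n))
```

Under these assumptions each extra replica multiplies the probability of total outage by (1 - p), which is one way to see why many cheap commodity nodes can beat one expensive node.
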
Distributed System Challenges
  Lack of global state information
    Different nodes have different views of the system
      –What are the contents of file A?
      –How many jobs are running on node X?
      –Which nodes are currently part of the system?
    See delays, different ordering of messages, lost messages, network partitions
    Tension with goal of “single coherent system”
  Handling slow, failed and misbehaving nodes
    How do you avoid slow nodes?
    How do you get back data or work from a failed node?
    When nodes disagree, how do you know who is wrong?
    Tension with goal of “available and reliable”
  When is it okay to have some centralized components?
    Simplifies state management, but single point-of-failure and performance bottleneck

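A minimal sketch of the "different ordering of messages" problem: two nodes receive the same two writes to file A but in different orders, and end up with different views of its contents. The names `apply_updates`, `node_x`, and `node_y` are hypothetical, and the last-write-wins rule is an assumption chosen only to keep the example short.

```python
def apply_updates(updates):
    """Apply (key, value) writes in arrival order; the last write wins."""
    state = {}
    for key, value in updates:
        state[key] = value
    return state


# Two clients write to "file A" at about the same time.
write_1 = ("file A", "hello")
write_2 = ("file A", "world")

# The network delivers the same messages to each node in a different order.
node_x = apply_updates([write_1, write_2])
node_y = apply_updates([write_2, write_1])


def views_agree(a, b):
    """True if two nodes hold identical state."""
    return a == b


if __name__ == "__main__":
    print("node X sees:", node_x["file A"])   # world
    print("node Y sees:", node_y["file A"])   # hello
    print("views agree:", views_agree(node_x, node_y))  # False
```

Neither node misbehaved, yet "What are the contents of file A?" now has two answers, which is the tension with the "single coherent system" goal.
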
Content of 739
  Distributed system courses can be very different…
    Theoretical: distributed algorithms (e.g., to allow nodes to come to consensus or agreement)
      4 lectures
    Practical: distributed programming (e.g., using RPC, Java RMI, CORBA, DCOM, MPI, PVM)
      Warm-up project
    Research systems: new ideas for making distributed systems better
      Focus of course
      Implemented systems with new conceptual ideas
      Recent papers in top systems conferences (SOSP, OSDI, NSDI)

Learning by Reading
  Intense reading list; assume sophisticated reader (736)
  Usually cover 1 fascinating paper per class
  No exams
  Three types of classes
    1) Formal lecture: Only for 4 theory topics
    2) Discussions: Most papers
      –I ask questions, expect everyone to enthusiastically participate; fairly casual
      –Task 1: Read paper 2-3 times before class
      –Task 2: Write-up to me BEFORE class
      –Task 3: Take turns being scribe (about 2 times in semester)
        Write up notes from discussion in LaTeX
        Post to web page within 72 hours

Learning by Reading (cont)
  Types of classes (cont)
    3) Group-led lectures: 4 topics
      –Small group gives overview of about 3-4 related papers
      –Topics:
        Distributed system analysis
        Process migration
        Programming environments
        Specialized distributed services
      –Advantages
        Good practice for giving presentations
        Learn about topic in slightly more depth
      –Tasks
        Group:
          »Finalize related papers (1 week before)
          »Present to me (2 days before)
          »Use slides
        Everyone else: Skim papers
      –Handout: State preferences by next week

Course Topics: Reading List
  Distributed Operating Systems (Survey, Amoeba vs Sprite)
  Network File Systems (NFS, Coda, LBFS)
  Theory: Time, Ordering, and Distributed Snapshots (2 Lamport papers)
  Analysis of Distributed Systems (1 + Group Presentation)
  Programming Environments (DSM, MapReduce, Group)
  Process Migration (1 + Group)
  Specialized Distributed Services (Porcupine + Group)
  SPRING BREAK
  Theory: Consensus (Byzantine failures and fail-stop processors)
  Cluster-based File Systems (Petal+Frangipani and GoogleFS)
  Communication Primitives (RPC vs U-Net)
  P2P Systems (Measurement, CFS, Amazon, Pangaea, LOCKSS)
  Miscellaneous: Trust, Recovery, Mistakes, Speculation, Sensor Networks

Learning by Doing
  Warm-up Project
    Goal: Become familiar with existing distributed programming environments
      Examples: Hadoop (open-source MapReduce), MPI, PVM
    Task 0: Get environment running
    Task 1: Implement simple application (e.g., sorting)
    Task 2: Report enough measurements to show that you did something
  Final Project
    Goal 1: Experience with “research process” in general
      –Work on open-ended project, unknown result
      –New idea where you don’t know if it will work
    Goal 2: Learn about specific topic in depth
    Topic from my list or your own choice; work with project partner
    Deliverables: 20 minute talk, short research paper

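To give a feel for what Task 1 might involve, here is a single-process sketch of a MapReduce-style sort. It does not use Hadoop, MPI, or PVM APIs; it only mimics the map, shuffle, and reduce phases in plain Python. The range-partitioning approach and all names (`map_phase`, `shuffle`, `reduce_phase`, `mapreduce_sort`, `boundaries`) are assumptions made for illustration, not something the course prescribes.

```python
from bisect import bisect_right
from collections import defaultdict


def map_phase(records, boundaries):
    """Emit (partition_id, record) pairs using range partitioning.

    `boundaries` must be sorted; partition i holds the records that
    fall between boundaries[i-1] and boundaries[i].
    """
    for record in records:
        yield bisect_right(boundaries, record), record


def shuffle(pairs):
    """Group records by partition id, as the framework would between phases."""
    groups = defaultdict(list)
    for partition_id, record in pairs:
        groups[partition_id].append(record)
    return groups


def reduce_phase(groups):
    """Each reducer sorts only its own partition."""
    return {pid: sorted(records) for pid, records in groups.items()}


def mapreduce_sort(records, boundaries):
    """Concatenate the sorted partitions in partition order."""
    reduced = reduce_phase(shuffle(map_phase(records, boundaries)))
    output = []
    for pid in sorted(reduced):
        output.extend(reduced[pid])
    return output


if __name__ == "__main__":
    data = [5, 3, 9, 1, 7, 3]
    print(mapreduce_sort(data, boundaries=[4, 8]))
```

Because the partitions cover disjoint, ordered key ranges, sorting each one independently and concatenating them yields a globally sorted result; in a real environment each partition would be handled by a different node.
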
Agenda for Next Class
  See website:
  Read: Survey: Distributed Operating Systems
    Andrew S. Tanenbaum and Robbert Van Renesse
    ACM Computing Surveys, Volume 17, Issue 4 (December 1985), pp
    Long paper: Focus on Sections 1 and 2
  Answer questions:
    What were the goals of distributed systems at this time?
    Which design issue (i.e., communication primitives, naming and protection, resource management, fault tolerance, services) seems most challenging (or interesting)? Why?
    Send answer to me with Subject cs739: Survey
  Think about group presentation papers