Data Management in Distributed Systems
Minqi Zhou
Software Engineering Institute
Office: Room 111, Mathematics Building
E-mail: mqzhou@sei.ecnu.edu.cn
Phone: 32204750-167
2010-09-16

Course Introduction
Data Management in P2P Systems
– Weeks 1-4
Data Management in Cloud Systems
– Weeks 5-10
Computational Advertising
– Weeks 11-18

Final Grades
Regular Grades (60%)
– Attendance
– Presentation
Final Report (40%), English preferred
– Survey
– Paper

A Brief Introduction to Distributed Systems

What Is a Distributed System?
Multiple computers (“machines,” “hosts,” “boxes,” etc.)
– Each with CPU, memory, disk, network interface
– Interconnected by LAN or WAN (e.g., Internet)
Application runs across this dispersed collection of networked hardware
But the user sees a single, unified system

What Is a Distributed System? (Alternate Take)
“A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.”
– Leslie Lamport, Microsoft Research (ex-DEC)

Start Simple: Centralized System
Suppose you run Gmail
Workload:
– Inbound email arrives; store on disk
– Users retrieve, delete their email
You run Gmail on one server with disk
[Figure: Gmail Server (PC) with three Email Senders and three Email Readers]
What are the shortcomings of this design?

Why Distribute? For Availability
Suppose the Gmail server goes down, or the network between a client and the server goes down
No incoming mail is delivered; no users can read their inboxes
Fix: replicate the data on several servers
– Increased chance that some server will be reachable
– Consistency? One server is down when a message is deleted, then comes back up; the message reappears in the inbox
– Latency? Replicas should be far apart, so they fail independently
– Partition resilience? E.g., an airline seat database splits with one seat remaining; the seat is bought twice, once in each half!
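The consistency bullet above can be made concrete with a small sketch. The `Replica` class and the union-based `naive_sync` below are hypothetical and are not taken from the lecture; they only illustrate how a replica that missed a delete can bring the message back once it rejoins.

```python
class Replica:
    """Hypothetical mail replica holding the message IDs of one inbox."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.inbox = set()

    def deliver(self, msg_id):
        if self.up:
            self.inbox.add(msg_id)

    def delete(self, msg_id):
        if self.up:  # a replica that is down never sees the delete
            self.inbox.discard(msg_id)


def naive_sync(a, b):
    """Naive reconciliation (an assumption): both replicas take the union."""
    merged = a.inbox | b.inbox
    a.inbox = set(merged)
    b.inbox = set(merged)


r1, r2 = Replica("r1"), Replica("r2")
for r in (r1, r2):
    r.deliver("msg-1")

r2.up = False        # r2 goes down
r1.delete("msg-1")   # the user deletes the message; r1 applies it
r2.delete("msg-1")   # lost, because r2 is down
r2.up = True         # r2 comes back with its stale copy

naive_sync(r1, r2)
print(r1.inbox)      # {'msg-1'}: the deleted message has returned
```

A union merge cannot tell "never deleted" apart from "deleted while I was away", which is exactly the ambiguity the slide points at.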

Why Distribute? For Scalable Capacity
What if Gmail is a huge success?
Workload exceeds the capacity of one server
Fix: spread users across several servers
– Best case: linear scaling; if one box supports U users, N boxes support NU users
– Bottlenecks? If each user’s inbox is on one server, how to route inbound mail to the right server?
– Scaling? How close to linear?
– Load balance? Some users get more mail than others!
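One possible answer to the routing question is to hash the user name to a server index. This is a minimal sketch under that assumption; the function `server_for`, the server count `N`, and the `example.com` addresses are illustrative and do not come from the slides.

```python
import hashlib
from collections import Counter


def server_for(user, n_servers):
    """Map a user name to a server index with a stable hash (assumed scheme)."""
    digest = hashlib.sha256(user.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_servers


N = 4  # hypothetical number of mail servers
users = [f"user{i}@example.com" for i in range(1000)]

# Count how many users land on each server.
load = Counter(server_for(u, N) for u in users)
print(sorted(load.items()))
```

Hashing spreads the number of users fairly evenly, but it does not balance mail volume: a few heavy users can still overload one box. Changing `N` with this simple modulo scheme also remaps most users to different servers.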

Performance Can Be Subtle
Goal: predictable performance under high load
2 employees run a Starbucks
– Employee 1: takes orders from customers, calls them out to Employee 2
– Employee 2: writes down drink orders (5 seconds per order), makes drinks (10 seconds per order)
What is throughput under increasing load?

Starbucks Throughput
Peak system performance: 4 drinks / min
What happens when load > 4 orders / min?
What happens to efficiency as load increases?
What would the preferable curve be?
What design achieves that goal?
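The peak of 4 drinks per minute follows from the 5 s + 10 s = 15 s that Employee 2 spends on each drink. The sketch below explores the overload question under one explicit assumption that the slides do not state: writing down a called-out order always takes priority over making drinks. The function name `drinks_per_minute` is hypothetical.

```python
WRITE_S = 5   # seconds to write down one order (from the slide)
MAKE_S = 10   # seconds to make one drink (from the slide)


def drinks_per_minute(orders_per_minute):
    """Throughput of Employee 2, ASSUMING order-writing preempts drink-making."""
    writing = min(60.0, WRITE_S * orders_per_minute)  # seconds spent writing
    written = writing / WRITE_S                       # orders actually recorded
    left = 60.0 - writing                             # seconds left for drinks
    return min(written, left / MAKE_S)


for load in (2, 4, 6, 8, 12):
    print(load, drinks_per_minute(load))
```

Under this assumption throughput rises to 4 drinks per minute at a load of 4, then falls as more time goes to writing orders, reaching zero at 12 orders per minute. A design that declines or drops excess orders before spending work on them would instead hold throughput flat at the peak.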

Why Are Distributed Systems Hard to Design?
Failure: of hosts, of network
– Remember Lamport’s lament
Heterogeneity
– Hosts may have different data representations
Need consistency (many specific definitions)
– Users expect familiar “centralized” behavior
Need concurrency for performance
– Avoid waiting synchronously, leaving resources idle
– Overlap requests concurrently whenever possible
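The concurrency point can be illustrated with a short sketch that compares issuing requests one at a time with overlapping them. The `fetch` coroutine, the replica names, and the 0.2 s delay are stand-ins chosen for illustration; `asyncio.sleep` simulates network latency rather than contacting a real server.

```python
import asyncio
import time


async def fetch(name, delay):
    """Stand-in for a remote request; the sleep simulates network latency."""
    await asyncio.sleep(delay)
    return f"{name}: ok"


async def sequential(requests):
    # Wait for each reply before sending the next request.
    return [await fetch(n, d) for n, d in requests]


async def overlapped(requests):
    # Start all requests, then wait for all replies together.
    return await asyncio.gather(*(fetch(n, d) for n, d in requests))


requests = [("replica-a", 0.2), ("replica-b", 0.2), ("replica-c", 0.2)]

start = time.perf_counter()
seq = asyncio.run(sequential(requests))
seq_time = time.perf_counter() - start

start = time.perf_counter()
par = asyncio.run(overlapped(requests))
par_time = time.perf_counter() - start

print(f"sequential {seq_time:.2f}s, overlapped {par_time:.2f}s")
```

Both versions return the same replies; the overlapped one finishes in roughly the time of the slowest single request instead of the sum of all of them.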

References
Books
– Dinesh Verma, Legitimate Applications of Peer-to-Peer Networks, Wiley-IEEE, 2004.
– Michael Miller, Cloud Computing: Web-Based Applications That Change the Way You Work and Collaborate Online, Que, 2008.
– David P. Anderson and John Kubiatowicz, The Worldwide Computer, Scientific American, March 2002.
Papers
– Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, Hari Balakrishnan, “Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications”, Proceedings of ACM SIGCOMM’01, San Diego, CA, August 2001.
– Bujor Silaghi, Bobby Bhattacharjee, Pete Keleher, “Query Routing in the TerraDir Distributed Directory”, Proceedings of SPIE ITCOM, Boston, MA, July 2002.
– Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard Karp, Scott Shenker, “A Scalable Content-Addressable Network”, Proceedings of ACM SIGCOMM’01, San Diego, CA, August 2001.
– John Kubiatowicz, David Bindel, Yan Chen, Steven Czerwinski, Patrick Eaton, Dennis Geels, Ramakrishna Gummadi, Sean Rhea, Hakim Weatherspoon, Westley Weimer, Chris Wells, and Ben Zhao, “OceanStore: An Architecture for Global-Scale Persistent Storage”, Proceedings of the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2000), November 2000.
– W. J. Bolosky, J. R. Douceur, D. Ely, M. Theimer, “Feasibility of a Serverless Distributed File System Deployed on an Existing Set of Desktop PCs”, Proceedings of the International Conference on Measurement and Modeling of Computer Systems, 2000, pp. 34-43.
– J. Kleinberg, “The Small-World Phenomenon: An Algorithmic Perspective”, Proc. 32nd ACM Symposium on Theory of Computing, Portland, OR, May 2000.
– R. Albert, H. Jeong, A. Barabasi, “Error and Attack Tolerance of Complex Networks”, Nature, vol. 406, July 2000.
– H. Zhang, A. Goel, R. Govindan, “Using the Small-World Model to Improve Freenet Performance”, Proceedings of IEEE Infocom, New York, NY, June 2002.
– J. Chu, K. Labonte, B. Levine, “Availability and Locality Measurements of Peer-to-Peer File Systems”, Proceedings of SPIE ITCOM, Boston, MA, July 2002.
– R. Bhagwan, S. Savage, G. Voelker, “Understanding Availability”, Proc. 2nd International Workshop on Peer-to-Peer Systems (IPTPS), Berkeley, CA, February 2003.
– S. Saroiu, P. Gummadi, S. Gribble, “A Measurement Study of Peer-to-Peer File Sharing Systems”, Proceedings of Multimedia Computing and Networking 2002 (MMCN'02), San Jose, CA, January 2002.
– F. von Lohmann, “P2P File Sharing and Copyright Law: A Primer for Developers”, IPTPS 2003.
– Antony Rowstron and Peter Druschel, “Pastry: Scalable, Decentralized, Object Location and Routing for Large-scale Peer-to-peer Systems”, Proceedings of IFIP/ACM International Conference on Distributed Systems Platforms (Middleware)’02.
– Ben Y. Zhao, John Kubiatowicz, Anthony Joseph, “Tapestry: An Infrastructure for Fault-tolerant Wide-area Location and Routing”, Technical Report, UC Berkeley.
– A. Rowstron and P. Druschel, “Storage Management and Caching in PAST, a Large-scale, Persistent Peer-to-peer Storage Utility”, 18th ACM SOSP'01, Lake Louise, Alberta, Canada, October 2001.
– S. Iyer, A. Rowstron and P. Druschel, “SQUIRREL: A Decentralized, Peer-to-peer Web Cache”, Principles of Distributed Computing (PODC 2002), Monterey, CA.
– Frank Dabek, M. Frans Kaashoek, David Karger, Robert Morris, and Ion Stoica, “Wide-area Cooperative Storage with CFS”, ACM SOSP 2001, Banff, October 2001.
– Ion Stoica, Daniel Adkins, Shelley Zhuang, Scott Shenker, and Sonesh Surana, “Internet Indirection Infrastructure”, Proceedings of ACM SIGCOMM'02, Pittsburgh, PA, August 2002, pp. 73-86.
– L. Garces-Erce, E. Biersack, P. Felber, K.W. Ross, G. Urvoy-Keller, “Hierarchical Peer-to-Peer Systems”, 2003, http://cis.poly.edu/~ross/publications.html
– Kangasharju, K.W. Ross, D. Turner, “Adaptive Content Management in Structured P2P Communities”, 2002, http://cis.poly.edu/~ross/publications.html
– K.W. Ross, E. Biersack, P. Felber, L. Garces-Erce, G. Urvoy-Keller, “TOPLUS: Topology Centric Lookup Service”, 2002, http://cis.poly.edu/~ross/publications.html
– P. Felber, E. Biersack, L. Garces-Erce, K.W. Ross, G. Urvoy-Keller, “Data Indexing and Querying in P2P DHT Networks”, http://cis.poly.edu/~ross/publications.html
– K.W. Ross, “Hash-Routing for Collections of Shared Web Caches”, IEEE Network Magazine, Nov-Dec 1997.
– A. Keromytis, V. Misra, D. Rubenstein, “SOS: Secure Overlay Services”, Proceedings of ACM SIGCOMM'02, Pittsburgh, PA, August 2002.
– M. Reed, P. P. Syverson, D. Goldschlag, “Anonymous Connections and Onion Routing”, IEEE Journal on Selected Areas of Communications, Volume 16, No. 4, 1998.
– V. Scarlata, B. Levine, C. Shields, “Responder Anonymity and Anonymous Peer-to-Peer File Sharing”, Proc. IEEE Intl. Conference on Network Protocols (ICNP), Riverside, CA, November 2001.
– E. Sit, R. Morris, “Security Considerations for Peer-to-Peer Distributed Hash Tables”, Proc. 1st International Workshop on Peer-to-Peer Systems (IPTPS), Cambridge, MA, March 2002.
– J. Saia, A. Fiat, S. Gribble, A. Karlin, S. Saroiu, “Dynamically Fault-Tolerant Content Addressable Networks”, Proc. 1st International Workshop on Peer-to-Peer Systems (IPTPS), Cambridge, MA, March 2002.
– M. Castro, P. Druschel, A. Ganesh, A. Rowstron, D. Wallach, “Secure Routing for Structured Peer-to-Peer Overlay Networks”, Proceedings of the Fifth Symposium on Operating Systems Design and Implementation (OSDI'02), Boston, MA, December 2002.
– Edith Cohen and Scott Shenker, “Replication Strategies in Unstructured Peer-to-Peer Networks”, Proceedings of ACM SIGCOMM'02, Pittsburgh, PA, August 2002.
– Dan Rubenstein and Sambit Sahu, “An Analysis of a Simple P2P Protocol for Flash Crowd Document Retrieval”, Columbia University Technical Report.
