Introduction Zachary G. Ives University of Pennsylvania CIS 455 / 555 – Internet and Web Systems January 17, 2008.


1 Introduction Zachary G. Ives University of Pennsylvania CIS 455 / 555 – Internet and Web Systems January 17, 2008

2 What this Course Is About
• The focus is NOT on building Web applications in PHP, or servlets, or ASP…
• It’s about how to build services like Google, Akamai, iTunes, …
• What are the principles behind them?
• Distributed systems concepts, with emphasis on scalability and interoperability
• Data representation fundamentals, with emphasis on XML
• Information retrieval concepts, including ranking and indexing
• It’s a course that involves building software, evaluating it, and programming in teams

3 How Does this Relate to Other CIS Courses?
CIS 505: focuses on distributed systems with an emphasis on concurrency
CIS 330/550:
• Data representation and management
• Building a DBMS-backed, servlet-based web site (e.g., mashup)
• 455/555 focuses on data with respect to interoperability
CIS 350: focuses on software development and engineering
CIS 573: software engineering & mashups
Assumptions:
• You know something about threads and synchronization primitives; and you’re at least vaguely familiar with database ideas (or can quickly learn them)

4 Some Things We’ll Look at
• What are the principles behind building systems that work on the Internet?
• How do these relate to many of today’s hot technologies?
• Web servers, DHTML, Servlets, JSP, …
• XML
• Web services
• Peer-to-peer
• Application servers
• Content distribution networks
• Web search
• Mash-ups
• …

5 Staff
• Instructor: Zack Ives, zives@cis
• Office: 576 Levine North
• Office hours T 3:30-4:30 (and by arrangement)
• TA: Mengmeng Liu, mengmeng@seas
• Office: 575 Levine North
• Office hours TBA
• Discussion group:
• cis-455-555-spring08@googlegroups.com
• http://groups.google.com/cis-455-555-spring08

6 Textbooks
• Distributed Systems: Principles and Paradigms, 2nd ed., Tanenbaum and van Steen
• Frequent supplementary handouts
• Excerpts from several books
• Many recent research papers

7 Prerequisites, Workload, etc.
Necessary skills:
• The ability to code in Java – there is a substantial implementation project
• The ability to work as a team with a classmate
• A willingness to “push the envelope”
• Knowledge of threads & sync: CSE 380 – Operating Systems – or equivalent
• Suggested: CIS 330 / 550 – Databases – or equivalent
Workload:
• Several programming/debugging-based homework assignments
• A substantial term project with experimental evaluation and a report
• Midterm and final exam
• WARNING: this course should be considered 1.5 CU (and we’re in the process of making that happen)

8 A Disclaimer…
• This is a “bleeding edge” course!
• Goal 1: give you a look under the covers of today’s hottest topics – in lectures and in projects
• Goal 2: give you a level of comfort in managing large, complex software development with others’ code
• Part of this means doing a substantial implementation project
• As in the real world: learning APIs, dealing with inadequate tools
• Most of you will find this a struggle!
• We will be using some immature technology
• Not everything has been tested and validated ahead of time
• We’ll do the best we can to smooth over the bugs
• We hope it will be a fun course, though… and an interesting one!

9 A Bit of Context for the Course

10 What Exactly Is the Web?
• The Web consists of HTTP servers that publish HTML, XML, and a few other content types
• These are hyperlinked via URLs (a subset of URIs)
• Plus there are a huge number of web clients
• The web is built on a number of Internet protocols:
• DNS, TCP, IP
• The Internet has many other protocols
• SMTP, IMAP, POP, AIM, FTP, …
• Streaming media, music swapping protocols, …
• Web services, custom applications

11 The Internet is Built in Layers
• Link layer (802.11x, 802.3, …)
• IP layer – point-to-point and multicast
• Transport, session layers:
• TCP, UDP -- session-based vs. sessionless; reliable vs. unreliable
• UDP is used as the core of many multimedia protocols (e.g., Real or WMP streaming protocols)
• TCP is used as the basis of most of our session-oriented protocols:
• Telnet, SSH, FTP
• HTTP
• Other protocols are built over HTTP (e.g., XML-RPC)
• IM, P2P protocols, …
• Middleware and application layers
• Sometimes we interpose extra layers that are invisible
• e.g., Akamai

12 What Is an Internet System?
• Not just a web server or web application…
• An application built over the Internet, whose functionality is distributed across more than one machine
• Typically, at least in a client-server or server-to-server fashion, but may have many more participants
• Typically, data and/or code must be exchanged in distributed fashion for the functioning of the application
• Often, the data must be partitioned, replicated, translated, etc.
• Often, the code is written in multiple different environments, languages, etc.
• Often, there are concerns about handling failures, firewalls, …

13 Why Are Internet System Topics Interesting?
• Understanding what’s underneath today’s web
• How does it work?
• What are its shortcomings?
• What are its strengths?
• Understanding distributed algorithms
• Using the right approach when designing new protocols and web systems
• Being able to anticipate what’s actually possible in the future

14 Example: Web Search
[Diagram: components are Web Pages, Crawlers, Index Servers, Search Interface Servers, and a client; arrows are labeled “pages”, “keywords + locations”, “queries”, “query results”, and “HTML forms; results”.]
Uses a model of document/word similarity to rank matches
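
The index-and-rank idea on this slide can be illustrated with a small sketch. It is not the design of any real search engine: the page contents, the function names `build_index` and `search`, and the choice of a plain TF-IDF score as the "document/word similarity" model are all assumptions made for illustration.

```python
import math
from collections import defaultdict

# Hypothetical crawled pages (page id -> text); invented for this example.
PAGES = {
    "a.html": "web search engines crawl the web",
    "b.html": "a crawler fetches pages",
    "c.html": "search the index for keywords",
}

def build_index(pages):
    """Inverted index: keyword -> {page_id: [word positions]}."""
    index = defaultdict(dict)
    for page_id, text in pages.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(page_id, []).append(pos)
    return index

def search(index, num_pages, query):
    """Rank pages by a simple TF-IDF score summed over the query keywords."""
    scores = defaultdict(float)
    for word in query.lower().split():
        postings = index.get(word, {})
        if not postings:
            continue  # keyword appears on no page
        idf = math.log(num_pages / len(postings))  # rarer words weigh more
        for page_id, positions in postings.items():
            scores[page_id] += len(positions) * idf
    # Highest score first; ties broken by page id for a stable order.
    return sorted(scores, key=lambda p: (-scores[p], p))
```

Here the index plays the role of the index servers (keywords plus locations), and `search` stands in for the query path from the search interface.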

15 Example: Information Integration
[Diagram: a client sends queries to a Mediator System and receives results in the “mediated schema”; the mediator connects to XML sources (label: XQuery + XPath over XML), relational sources (labels: SQL, ODBC results), and HTML sources (labels: HTTP POST, HTML).]
Maps all data into a single format and virtual schema
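
A toy version of the mediator idea, under heavy assumptions: the two sources, their field names, and the mediated schema `(name, price)` are invented, and the sketch simply materializes every record before filtering. A real mediator would more likely rewrite the query and push it down to each source.

```python
import xml.etree.ElementTree as ET

# Hypothetical sources with different formats and field names.
XML_SOURCE = "<items><item><name>Widget</name><cost>12</cost></item></items>"
RELATIONAL_ROWS = [("Gadget", 30), ("Gizmo", 5)]  # rows of (product, price)

def from_xml(doc):
    """Wrapper: translate the XML source into mediated-schema records."""
    for item in ET.fromstring(doc).findall("item"):
        yield {"name": item.findtext("name"), "price": int(item.findtext("cost"))}

def from_rows(rows):
    """Wrapper: translate relational rows into mediated-schema records."""
    for product, price in rows:
        yield {"name": product, "price": price}

def mediated_query(min_price):
    """Answer a query posed against the mediated schema (name, price)."""
    records = list(from_xml(XML_SOURCE)) + list(from_rows(RELATIONAL_ROWS))
    return sorted(r["name"] for r in records if r["price"] >= min_price)
```

The client only ever sees `name` and `price`; the per-source wrappers hide that one source calls the field `cost` and the other stores tuples.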

16 Example: SETI@home
[Diagram: components are Problem Partitioning, Data Aggregation, and a client; arrows are labeled “New subproblems” and “Computed subresults”.]
Breaks computation into many parts and distributes them to the clients
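
The partition/compute/aggregate pattern can be sketched in a few lines. This is only an illustration: the "computation" (finding the largest sample) and the names `partition`, `client_compute`, and `aggregate` are assumptions, and everything runs in one process rather than on remote clients.

```python
def partition(data, chunk_size):
    """Server side: split the problem into independent work units."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def client_compute(work_unit):
    """Client side: stand-in for real analysis; returns the strongest sample."""
    return max(work_unit)

def aggregate(subresults):
    """Server side: combine the clients' subresults into a final answer."""
    return max(subresults)
```

The pattern works here because the subproblems are independent and the combining step is cheap; computations that need clients to talk to each other do not partition this easily.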

17 Example: P2P File Sharing
[Diagram labels: client, request, data.]
Processes name-based requests for data; each node can make requests, forward requests, return data
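
One simple way to realize "make requests, forward requests, return data" is flooding with a hop limit. The sketch below is an assumption, not a description of any particular file-sharing protocol: the `Node` class, the `ttl` parameter, and the shared `seen` set (standing in for per-message IDs) are all hypothetical.

```python
class Node:
    """A peer that can answer a request itself or forward it to neighbors."""

    def __init__(self, name, files=()):
        self.name = name
        self.files = set(files)
        self.neighbors = []

    def request(self, filename, ttl=3, seen=None):
        """Return the name of a node holding filename, or None if not found."""
        seen = set() if seen is None else seen
        if self.name in seen:
            return None  # already handled this request; avoid loops
        seen.add(self.name)
        if filename in self.files:
            return self.name  # this node can return the data
        if ttl == 0:
            return None  # hop limit reached; stop forwarding
        for peer in self.neighbors:
            found = peer.request(filename, ttl - 1, seen)
            if found is not None:
                return found
        return None
```

The hop limit bounds how far a request spreads, at the cost of missing data that is more than `ttl` hops away.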

18 What are the Hard Problems?
• Disclaimer: most of the hard problems AREN’T solved (or solvable), and there often isn’t any single BEST solution. Much of systems design is about finding the right compromise for each specific problem.
• We can divide them into:
• Scalability
• Availability / reliability
• Consistency
• Interoperability
• Location and resource discovery

19 Scalability
• How do we support a large number of clients or requests?
• Distribute work!
• Challenges:
• Coordination – takes significant overhead
• Load balancing – avoid having bottlenecks
• Parts of the solution:
• Client-server, multi-tier, P2P architectures
• Data partitioning, replication, remote procedure calls, …
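
As a concrete instance of data partitioning and load balancing, here is a minimal hash-partitioning sketch. The server names and the helpers `pick_server` and `load` are hypothetical, and modulo hashing is just one possible scheme.

```python
import hashlib
from collections import Counter

def pick_server(key, servers):
    """Map a key to one server using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]

def load(keys, servers):
    """Count how many keys land on each server."""
    return Counter(pick_server(k, servers) for k in keys)
```

Every client that hashes the same key picks the same server, so no central coordinator is needed for routing. The weakness of plain modulo hashing is that changing the number of servers remaps most keys; consistent hashing is a common way to reduce that.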

20 Availability/Reliability
• How do we ensure the system is “up” when we want it to be?
• Replication and redundancy
• Security measures against attacks
• Ability to undo/redo
• Challenges:
• Keeping things consistent
• Performance vs. security
• Acknowledgments
• Parts of the solution:
• Data partitioning, replication, …
• Logging, transactions, …
• Redundant hardware, multiple sites, …

21 Consistency
• Replication and distribution make it difficult to keep a unified, consistent view of the world – how do we combat this?
• Locking, concurrency control, and invalidation schemes
• Clock synchronization
• Challenges:
• Locking has huge performance overhead
• Network partitions, disconnected operation
• Parts of the solution:
• Optimistic concurrency control, 2-phase locking
• Conflict resolvers
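
Optimistic concurrency control can be sketched with version numbers: read without locking, then commit only if nothing changed in between. The `VersionedStore` class below is a hypothetical, single-threaded illustration; a multi-threaded version would need the check and the update in `write` to happen atomically.

```python
class VersionedStore:
    """Key-value store where each write must name the version it read."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        """Return (version, value); unknown keys are at version 0."""
        return self._data.get(key, (0, None))

    def write(self, key, expected_version, value):
        """Commit only if nobody else wrote since expected_version was read."""
        current_version, _ = self.read(key)
        if current_version != expected_version:
            return False  # conflict: caller must re-read and retry
        self._data[key] = (current_version + 1, value)
        return True
```

When conflicts are rare this avoids the locking overhead mentioned on the slide; when they are frequent, the retries can cost more than locks would.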

22 Interoperability
• How do we coordinate the efforts of components that have different data formats and/or source languages, and are on different machines?
• Standardization!
• Challenges:
• Everything has a different semantics!
• Parts of the solution:
• Standard data formats: XML, XML schemas
• “Schema mediation” and data translation
• Remote procedure calls: CORBA, XML-RPC, …
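
To make the XML-RPC bullet concrete, the sketch below marshals a call into XML and back using Python's standard `xmlrpc.client` module, without any network traffic. The method name `catalog.lookup` and the wrapper functions are hypothetical.

```python
import xmlrpc.client

def encode_call(method, *args):
    """Marshal a method name and arguments into an XML-RPC request body."""
    return xmlrpc.client.dumps(args, methodname=method)

def decode_call(xml_text):
    """Unmarshal an XML-RPC request body into (method, params)."""
    params, method = xmlrpc.client.loads(xml_text)
    return method, params
```

Because the wire format is plain XML with a small set of standard types, the receiving side could be written in any language with an XML-RPC library.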

23 Location & Resource Discovery
• How do you find what you’re looking for?
• Naming
• Declarative queries over standard schemas
• Advertisements
• Challenges:
• Naming has implicit semantics
• What do you do when you don’t know what to call something?
• Parts of the solution:
• Directory systems – DNS, LDAP, etc.
• Resource discovery and advertising protocols
• Standardized schemas
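
A hierarchical directory can be pictured as a tree walked one name component at a time, loosely in the spirit of DNS. The sketch below is a toy under stated assumptions: the `ROOT` table, the name `www.example.org`, and the address are placeholders, and real DNS adds delegation between servers, caching, and record types.

```python
# Hypothetical directory tree; labels are stored from the top level down.
ROOT = {"org": {"example": {"www": "192.0.2.10"}}}

def resolve(name, root=ROOT):
    """Walk the tree from the rightmost label to the leftmost one."""
    node = root
    for label in reversed(name.split(".")):
        if not isinstance(node, dict) or label not in node:
            return None  # no such name in this directory
        node = node[label]
    # Only a leaf holds an address; an interior node is just a zone.
    return node if isinstance(node, str) else None
```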

24 Our First Focus: Single Machines, aka Servers
• How do you handle large numbers of concurrent users?
• Processes
• Threads
• Events
• Hybrids (e.g., thread pools)
• Staged architectures
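
As a small illustration of the thread-pool hybrid, the sketch below serves many requests with a fixed number of worker threads using Python's standard `concurrent.futures` module. The functions `handle_request` and `serve` and the pool size are assumptions; a real server would read requests from sockets.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def handle_request(request_id):
    """Stand-in for real request handling; reports which worker ran it."""
    return request_id, threading.current_thread().name

def serve(request_ids, pool_size=4):
    """Handle all requests with at most pool_size worker threads."""
    with ThreadPoolExecutor(max_workers=pool_size,
                            thread_name_prefix="worker") as pool:
        # map() returns results in the same order as the inputs.
        return list(pool.map(handle_request, request_ids))
```

The pool caps the number of threads regardless of how many requests arrive, which avoids the cost of creating one thread or process per user.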

25 Next Time…
• We’ll look under the covers of an HTTP server
• Key ideas in building scalable systems
• Principles of HTTP and web servers
• Management of concurrent sessions
• To read:
• Lampson and Saltzer paper
• Tanenbaum Ch. 3.1
• For next week: “HTTP Made Really Easy” and Rexford Ch. 4
• If necessary: Review Tanenbaum “Modern OS,” Ch. 2.3 or a similar OS book on interprocess communication

