Published by Britton Kennedy. Modified over 9 years ago.
1
Security and Replication … and Course Wrap-up
Zachary G. Ives
University of Pennsylvania
CIS 455 / 555 – Internet and Web Systems
September 13, 2015
PNUTS slide content courtesy of Brian Cooper
2
Secure Transactions
Authentication using public/private key pairs is essential today
Consider every Web transaction – we want to know whom we’re conversing with!
  … versus ending up with a phishing attack!
3
Secure Sockets Layer (SSL)
Relies on a trusted third party
  Certificate authority (CA) issues certificates to certify a server and its public key
  Verisign is perhaps the best known of these
A server S generates a public-private keypair
  Sends the public key, other info (plus $$$) to Verisign (etc.)
  Gets back a certificate with:
    CA name
    S’s name, URL, public key
    Timestamp and expiration info
4
Example Certificate
Owner: CN=GTE CyberTrust Root, O=GTE Corporation, C=US
Issuer: CN=GTE CyberTrust Root, O=GTE Corporation, C=US
Serial number: 1a3
Valid from: Fri Feb 23 23:01:00 GMT 1996 until: Thu Feb 23 23:59:00 GMT 2006
Certificate fingerprints:
  MD5: C4:D7:F0:B2:A3:C5:7D:61:67:F0:04:CD:43:D3:BA:58
  SHA1: 90:DE:DE:9E:4C:4E:9F:6F:D8:86:17:57:9D:D3:91:BC:65:A6:89:64
5
The SSL Protocol
Client C connects to server S from enterprise E
S sends E’s certificate (cleartext)
C validates the certificate using the public key of the CA (e.g., Verisign)
C generates and sends to S a session key encrypted with E’s public key
Java has built-in support for SSL (Java Secure Socket Extension, integrated in 1.4) and a tool for managing certificates (keytool)
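The client side of the steps above can be sketched with Python's standard `ssl` module (the slides mention Java's JSSE; Python is used here only for brevity). This is a minimal sketch, not the course's implementation: the helper names `make_client_context` and `fetch_server_certificate` are made up for illustration, and the library performs certificate validation and session-key setup internally rather than exposing each step.

```python
import socket
import ssl


def make_client_context() -> ssl.SSLContext:
    """Build a client-side context that validates the server's certificate."""
    # create_default_context() loads the system's trusted CA certificates and
    # turns on certificate validation and hostname checking.
    return ssl.create_default_context()


def fetch_server_certificate(host: str, port: int = 443) -> dict:
    """Connect to host:port, run the handshake, and return the server's certificate."""
    context = make_client_context()
    with socket.create_connection((host, port), timeout=10) as raw_sock:
        # wrap_socket performs the handshake: the server presents its
        # certificate, which is checked against the trusted CAs, and the
        # two sides agree on session keys.
        with context.wrap_socket(raw_sock, server_hostname=host) as tls_sock:
            return tls_sock.getpeercert()
```

Calling `fetch_server_certificate` with a host name of your choice returns a dictionary with fields such as the subject, issuer, and validity dates, comparable to the example certificate shown earlier.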
6
So…
The client and server know each other given SSL
How do we go ahead and make a purchase?
Most commonly: you enter your credit card number
  Sometimes this is stored in the retailer’s system for future purposes!
Best case:
  The CC info is stored in a special, firewalled server, not part of the web site
  Web server has other account info about you
  When a transaction goes through, web site sends order to this special server, which combines it with CC info and sends it onward
7
Replication… Core of the Cloud
The vision of the “cloud”: a “computing utility” that is geographically distributed
At its core: geographical replication as well as partitioning
  What to replicate (including granularity)
  Where to replicate
  How to maintain consistency (and how fresh data needs to be)
8
What to Replicate
Cost of maintaining consistency if data is changing
  Larger objects, slower networks, frequent updates, freshness requirements → replication is more expensive
  May be able to send a “diff” instead of the whole object
Thus, difference between LAN and WAN replication:
  Local-area / cluster: single-writer, multiple-reader data is often replicated
    e.g., CNN
  Wide-area: need to limit replication to seldom-updated data, or relax the freshness or consistency constraints
    e.g., Akamai (images, video), Google index
9
Where to Place Replicas in the Internet
Want to place them at points where they can handle many requests and reduce traffic in bottlenecks
Commonly, at least one replica in Europe, Asia, US West Coast, US East Coast
[Figure: Server 1, Server 2, and clients C1–C9, with a congested or failure-prone link]
10
Schemes for Maintaining Consistency
Goal is to trade off performance vs. consistency guarantees
  Lock-based protocols
  Invalidation
  Lease
  Time-to-live
11
Lock-Based Protocols
Guarantee strong consistency
Similar to distributed version of what’s done in a database
  Client request for an item requires a read lock at its handling server
  Update to an item requires a write lock
  Multiple read locks can be held concurrently; write lock must be exclusive
What are the potential pitfalls of this approach?
  Is it resilient to network partition?
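The shared-read / exclusive-write rule above can be illustrated with a small non-blocking lock for a single item. This is a single-process sketch under simplifying assumptions: the class `ReadWriteLock` and its method names are hypothetical, it is not thread-safe, and a real distributed protocol would also have to handle message loss, failures, and partitions.

```python
class ReadWriteLock:
    """Non-blocking shared/exclusive lock for one replicated item (illustrative)."""

    def __init__(self) -> None:
        self.readers = 0     # number of read locks currently held
        self.writer = False  # True while the exclusive write lock is held

    def try_read(self) -> bool:
        """Grant a read lock unless a writer holds the item."""
        if self.writer:
            return False
        self.readers += 1
        return True

    def try_write(self) -> bool:
        """Grant the write lock only if nobody else holds any lock."""
        if self.writer or self.readers > 0:
            return False
        self.writer = True
        return True

    def release_read(self) -> None:
        if self.readers == 0:
            raise RuntimeError("no read lock is held")
        self.readers -= 1

    def release_write(self) -> None:
        if not self.writer:
            raise RuntimeError("no write lock is held")
        self.writer = False
```

Because a writer must wait until every reader has released its lock, an unreachable reader can block updates indefinitely, which is one way to approach the partition question raised on the slide.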
12
Invalidation Protocols
If a server is to update an item, it can multicast this to all replicas
  Requires servers to know who all of the other parties are
  May be somewhat weaker than lock-based models – why?
Common variation: lease-based protocol
  A replicated item is “leased” for a particular period
  If the item is updated during its lease, it is invalidated/refreshed
  After it expires, it is dropped
What are the pros and cons of these protocols?
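A lease-holding replica, as described above, might look roughly like the following. This is a sketch under stated assumptions: `LeasedReplica` and its methods are hypothetical names, the origin server is assumed to call `invalidate` when an item changes during its lease, and the clock is injectable so the behavior can be checked without waiting.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class LeasedReplica:
    """Holds copies of items, each valid only for the duration of its lease."""

    def __init__(self, lease_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.lease_seconds = lease_seconds
        self.clock = clock
        self.items: Dict[str, Tuple[str, float]] = {}  # key -> (value, lease expiry)

    def install(self, key: str, value: str) -> None:
        """Store a copy of the item and start a new lease on it."""
        self.items[key] = (value, self.clock() + self.lease_seconds)

    def invalidate(self, key: str) -> None:
        """Called by the origin when the item is updated during its lease."""
        self.items.pop(key, None)

    def get(self, key: str) -> Optional[str]:
        """Return the leased copy, or None if it was invalidated or has expired."""
        entry = self.items.get(key)
        if entry is None:
            return None
        value, expiry = entry
        if self.clock() >= expiry:
            del self.items[key]  # lease expired: drop the copy
            return None
        return value
```

One design point this makes visible: the origin only needs to remember a replica for the length of the lease, so a crashed or unreachable replica stops being a concern once its leases run out.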
13
Time-to-Live-Based Replication
Generally used when freshness constraints aren’t severe
Replicas are provided with an expectation of how long they are likely to be current
After the “time-to-live” expires, they need to be revalidated
How does this compare to the previous approaches?
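The time-to-live scheme above can be sketched as a cache that serves its copy without contacting the origin until the TTL runs out, then revalidates. The names `TTLCache` and `fetch` are illustrative assumptions; `fetch` stands in for whatever request retrieves the current value from the origin server.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Serves cached values until their time-to-live expires, then refetches."""

    def __init__(self, ttl_seconds: float, fetch: Callable[[str], str],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self.fetch = fetch  # hypothetical callback that asks the origin
        self.clock = clock
        self.entries: Dict[str, Tuple[str, float]] = {}  # key -> (value, expiry)

    def get(self, key: str) -> str:
        now = self.clock()
        entry = self.entries.get(key)
        if entry is not None and now < entry[1]:
            # Within the TTL: answer locally, even if the origin has changed.
            return entry[0]
        # Missing or expired: revalidate against the origin.
        value = self.fetch(key)
        self.entries[key] = (value, now + self.ttl_seconds)
        return value
```

Unlike the lease and invalidation schemes, the origin keeps no state about replicas here; the price is that a reader can see a stale value for up to one TTL after an update.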
14
Replication in “Cloud” Services
Yahoo’s PNUTS, Google’s BigTable are based on the notion that there is locality of data access
Consider consistency within each record but ignore cross-record consistency
  e.g., in a social network, we should coordinate accesses to the same user (but don’t care about consistency with unrelated friends)
  … but even here, we might be able to tolerate relaxed consistency among the users
15
Yahoo’s PNUTS Platform
Parallel database
Geographic replication
Indexes and views
Structured, flexible schema
[Figure: a table of six records (A 42342 E, B 42521 W, C 66354 W, D 12352 E, E 75656 C, F 15677 E) shown in three copies]
16
Query model
Per-record operations
  Get
  Set
  Delete
Multi-record operations
  Multiget
  Scan
  Getrange
Web service (RESTful) API
17
System Architecture
[Figure labels: Clients, REST API, Routers, Tablet controller, Storage units, Local region, YMB, Remote regions]
18
Tablet splitting and balancing
Each storage unit has many tablets (horizontal partitions of the table)
Tablets may grow over time
  Overfull tablets split
Storage unit may become a hotspot
  Shed load by moving tablets to other servers
[Figure labels: Storage unit, Tablet]
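The two mechanisms above, splitting an overfull tablet and moving tablets off a hot storage unit, can be sketched as follows. The slides do not say how PNUTS chooses split points or measures load, so everything here is an assumption for illustration: `split_if_overfull` splits at the median key, and `rebalance_once` simply uses the number of tablets per unit as a stand-in for load.

```python
from typing import Dict, List


def split_if_overfull(tablet: List[str], max_records: int) -> List[List[str]]:
    """Return the tablet unchanged, or two halves if it exceeds max_records keys."""
    if len(tablet) <= max_records:
        return [tablet]
    ordered = sorted(tablet)
    mid = len(ordered) // 2
    # Each half is a contiguous key range, so it is still a horizontal partition.
    return [ordered[:mid], ordered[mid:]]


def rebalance_once(units: Dict[str, List[str]]) -> bool:
    """Move one tablet from the fullest storage unit to the emptiest one.

    `units` maps a storage-unit name to the tablet ids it holds.
    Returns True if a tablet was moved.
    """
    busiest = max(units, key=lambda name: len(units[name]))
    idlest = min(units, key=lambda name: len(units[name]))
    if len(units[busiest]) - len(units[idlest]) <= 1:
        return False  # already balanced as well as whole tablets allow
    units[idlest].append(units[busiest].pop())
    return True
```

A real system would more plausibly balance on observed request rate than on tablet count, since a single popular tablet can make a storage unit a hotspot.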
19
Accessing data
[Figure labels: steps 1–4; Get key k; SU; Record for key k]
20
Range queries
[Figure: a router in front of Storage unit 1, Storage unit 2, and Storage unit 3]
Router interval map:
  MIN–Cantaloupe → SU1
  Cantaloupe–Lime → SU3
  Lime–Strawberry → SU2
  Strawberry–MAX → SU1
Keys in the table: Apple, Avocado, Banana, Blueberry, Cantaloupe, Grape, Kiwi, Lemon, Lime, Mango, Orange, Strawberry, Tomato, Watermelon
Query shown: Grapefruit…Pear? is divided into Grapefruit…Lime? and Lime…Pear?
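The router's interval map in the range-query figure can be sketched with a sorted list of boundaries and a binary search. The boundaries and owners are taken from the slide; the function names, the half-open `[low, high)` convention, and the use of `"\uffff"` as a stand-in for MAX are assumptions made for this example.

```python
import bisect
from typing import List, Tuple

MAX_KEY = "\uffff"  # assumed sentinel that sorts after every real key

# Exclusive upper bound of each interval, and the storage unit that owns it.
BOUNDARIES = ["Cantaloupe", "Lime", "Strawberry", MAX_KEY]
OWNERS = ["SU1", "SU3", "SU2", "SU1"]


def route_key(key: str) -> str:
    """Return the storage unit whose interval contains the key."""
    return OWNERS[bisect.bisect_right(BOUNDARIES, key)]


def split_range(low: str, high: str) -> List[Tuple[str, str, str]]:
    """Split the range [low, high) into per-interval pieces with their owners."""
    pieces = []
    start = low
    while start < high:
        index = bisect.bisect_right(BOUNDARIES, start)
        end = min(high, BOUNDARIES[index])  # stop at the interval boundary
        pieces.append((start, end, OWNERS[index]))
        start = end
    return pieces
```

With this map, `split_range("Grapefruit", "Pear")` yields one piece for SU3 (Grapefruit up to Lime) and one for SU2 (Lime up to Pear), matching the two sub-queries in the figure.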
21
Updates
[Figure: numbered steps 1–8 among Routers, SU, and Message brokers; labels: Write key k, SUCCESS, Sequence # for key k]
22
Asynchronous replication and consistency
23
Asynchronous Replication
24
Consistency Model
Goal: make it easier for applications to reason about updates and cope with asynchrony
Consider a single record for Brian Cooper’s Facebook entry:
[Figure: timeline of one record in Generation 1: record inserted, then a series of updates producing v. 1 through v. 8, then delete]
25
Consistency Model (slides 25–30)
[Figures: each slide repeats the timeline of v. 1 through v. 8 in Generation 1, marks the stale versions and the current version, and highlights one operation]
  Read (local)
  Read up-to-date
  Read ≥ v.6
  Write
  Write if = v.7 (ERROR)
Mechanism: per record mastership
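The operations on the consistency-model timeline can be sketched with a versioned record at its master and an asynchronously updated replica. This is a toy model, not the PNUTS API: all class and method names are hypothetical, and `sync` is a stand-in for whatever replication or forwarding the real system performs.

```python
from typing import Tuple


class StaleVersionError(Exception):
    """Raised when a conditional write names a version that is no longer current."""


class MasterRecord:
    """The master copy of one record; every write produces the next version."""

    def __init__(self, value: str) -> None:
        self.version = 1
        self.value = value

    def write(self, value: str) -> int:
        self.version += 1
        self.value = value
        return self.version

    def write_if_version(self, expected: int, value: str) -> int:
        """'Write if = v.N': succeed only if N is still the current version."""
        if self.version != expected:
            raise StaleVersionError(f"current is v.{self.version}, not v.{expected}")
        return self.write(value)


class Replica:
    """A copy that lags the master until it is synchronized."""

    def __init__(self, master: MasterRecord) -> None:
        self.master = master
        self.version, self.value = master.version, master.value

    def sync(self) -> None:
        self.version, self.value = self.master.version, self.master.value

    def read_local(self) -> Tuple[int, str]:
        """Cheapest read: may return a stale version."""
        return self.version, self.value

    def read_at_least(self, min_version: int) -> Tuple[int, str]:
        """'Read >= v.N': catch up only if the local copy is too old."""
        if self.version < min_version:
            self.sync()
        return self.version, self.value

    def read_up_to_date(self) -> Tuple[int, str]:
        """Always return the current version, at the cost of contacting the master."""
        self.sync()
        return self.version, self.value
```

Routing every write for a record through a single master is what lets versions form one ordered timeline, which is the role of per-record mastership on the last slide of this sequence.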
31
PNUTS Recap
An interesting compromise between consistency and performance/availability
Used underneath many of Yahoo’s properties
… And an exemplar of the new generation of cloud services
32
Experiments – Show It’s So!
The general goal: to help demonstrate and show why a real-world artifact provides a benefit
  Versus some benchmark or naïve strategy
  We also want to understand why there’s a benefit
Some common kinds of experiments:
  Usability: some sort of user tests, versus a benchmark
  Performance: as we increase the workload, what happens?
  Scalability: as we increase the data, devices, nodes, what happens?
  Complexity: especially for things like code, what happens as we make the task harder or bigger?
33
Experimentation
In general, experiments should follow the scientific method:
  Hypothesis (e.g., our method will do better than XYZ on workloads like QWV, which are representative of domain ABC)
  Experiment (examine this – may need many trials, random workloads, etc.)
  Conclusion (show, with statistically significant measurements, that the hypothesis is true)
Often, the hypothesis almost goes unsaid in computer science – it’s implicit in the choice of the problem – but it is there!
Note that many attributes, e.g., elegance, style, are not very amenable to experiments
Others, like expressiveness, generally need to be proven rather than run
34
Experimental Workloads
There are generally three kinds of systems experiments:
Synthetic microbenchmark: experimental runs are done over inputs that are generated to stress a specific factor, but are not particularly realistic
  Examples: a hard disk random access test; a web server’s maximum throughput
  Really shows the factor of interest; can be tweaked, scaled, etc.
Synthetic based on real behavior: experimental runs are done over inputs that are modeled after real data, but perhaps generated randomly
  Examples: SPEC benchmarks; TPC-W web transaction benchmark
  Enables us to generate more inputs, testing scalability, etc.
Real-world: traces are collected of real system behavior over real data
  Disadvantage: hard to quantify or control the different factors
35
Experimental Methodology
Consider the important factors that you wish to examine (and demonstrate)
  Scalability – can typically be in terms of running time, size of the problem, space consumed, etc.
  Here: performance is what matters
Break it down into individual parameters
  Crawl & index time; time to answer a query; etc.
Consider a workload that helps measure the parameter
  Crawl 1000 documents; run 50 queries 10 times apiece; etc.
Vary one parameter at a time, study effects
  Number of machines; number of threads per machine; etc.
Run experiment multiple times; average and show 95% confidence intervals in line (continuous) or bar (discrete) chart
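The last step above, averaging repeated runs and reporting a 95% confidence interval, can be computed as in the sketch below. The function name `mean_with_ci95` is made up for illustration, and the sketch uses the normal approximation (a multiplier of about 1.96); with only a handful of runs, a Student's t multiplier gives a wider and more appropriate interval.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_with_ci95(samples: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, half_width) so the interval is mean ± half_width."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two runs to estimate variability")
    mean = statistics.fmean(samples)
    # Standard error of the mean, from the sample standard deviation.
    std_error = statistics.stdev(samples) / math.sqrt(n)
    # Two-sided 95% multiplier under the normal approximation (about 1.96).
    z = statistics.NormalDist().inv_cdf(0.975)
    return mean, z * std_error
```

For example, five timings of a query in milliseconds, `[10, 12, 11, 13, 9]`, give a mean of 11 with a half-width of roughly 1.4, which would be drawn as the error bar on that data point.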
36
Course Recap (Until Next Week’s Midterm 2!)
Distributed, Web-scale systems are here to stay!
They create many issues that are not totally resolved, and for which there is no one answer:
  Heterogeneity
  Timing
  Partitioning and replication
  Consistency and integrity
  Etc.
This course tried to give you a sense of the issues and state-of-the-art – as well as the skills to go out and work in this domain
I hope the amount of work we all sank into the material (and the homeworks) will pay off for you!
And stay tuned – there’s lots more to come!
  Sensor networks, semantic Web, mobile systems, location-based services, …