Published by Annis Welch. Modified over 9 years ago.
1
Efficient and Adaptive Replication using Content Clustering
Yan Chen
EECS Department, UC Berkeley
2
Motivation
The Internet has evolved to become a commercial infrastructure for service delivery
–Web delivery, VoIP, streaming media …
Challenges for Internet-scale services
–Scalability: 600M users, 35M Web sites, 2.1Tb/s
–Efficiency: bandwidth, storage, management
–Agility: dynamic clients/network/servers
–Security, etc.
Focus on content delivery - Content Distribution Network (CDN)
–4 billion Web pages in total, growing by 7M pages daily
–Annual traffic growth of 200% for the next 4 years
3
How CDN Works
4
New Challenges for CDN
Large multimedia files → Efficient replication
Dynamic content → Coherence support
Network congestion/failures → Scalable network monitoring
5
Existing CDNs Fail to Address these Challenges
Non-cooperative replication: inefficient
No coherence for dynamic content
Unscalable network monitoring: O(M × N)
–M: # of client groups, N: # of server farms
6
SCAN: Scalable Content Access Network
Provisioning (replica placement)
–Access/deployment mechanisms: non-cooperative pull (existing CDNs) vs. cooperative push (SCAN)
–Granularity: per Website, per cluster, per object
Coherence support
–Unicast, IP multicast, app-level multicast on P2P DHT
Network monitoring
–Ad hoc pair-wise monitoring O(M×N) vs. tomography-based monitoring O(M+N)
7
SCAN
Coherence for dynamic content
Cooperative clustering-based replication
8
SCAN
Cooperative clustering-based replication
Coherence for dynamic content
Scalable network monitoring: O(M+N)
–M: # of client groups, N: # of server farms
9
Evaluation of Internet-scale Systems
Algorithm design, analytical evaluation, realistic simulation (iterate)
Realistic simulation inputs
–Network topology
–Web workload
–Network end-to-end latency measurement
Real evaluation?
10
Network Topology and Web Workload
Network Topology
–Pure-random, Waxman & transit-stub synthetic topology
–An AS-level topology from 7 widely-dispersed BGP peers
Web Workload
Web Site | Period | Duration | # Requests (avg–min–max) | # Clients (avg–min–max) | # Client groups (avg–min–max)
MSNBC | Aug–Oct 1999 | 10–11am | 1.5M–642K–1.7M | 129K–69K–150K | 15.6K–10K–17K
NASA | Jul–Aug 1995 | All day | 79K–61K–101K | 5940–4781–7671 | 2378–1784–3011
–Aggregate MSNBC Web clients with BGP prefix
»BGP tables from a BBNPlanet router
–Aggregate NASA Web clients with domain names
–Map the client groups onto the topology
11
Network E2E Latency Measurement
NLANR Active Measurement Project data set
–111 sites in America, Asia, Australia and Europe
–Round-trip time (RTT) between every pair of hosts every minute
–17M measurements daily
–Raw data: Jun. – Dec. 2001, Nov. 2002
Keynote measurement data
–Measure TCP performance from about 100 worldwide agents
–Heterogeneous core network: various ISPs
–Heterogeneous access network:
»Dial up 56K, DSL and high-bandwidth business connections
–Targets
»40 most popular Web servers + 27 Internet Data Centers
–Raw data: Nov. – Dec. 2001, Mar. – May 2002
12
Clustering Web Content for Efficient Replication
13
Overview
CDNs use non-cooperative replication: inefficient
Paradigm shift: cooperative push
–Where to push: greedy algorithms can achieve close to optimal performance [JJKRS01, QPV01]
–But what content should be pushed?
–At what granularity?
Clustering of objects for replication
–Close-to-optimal performance with small overhead
Incremental clustering
–Push before accessed: improve availability during flash crowds
14
Outline
Architecture
Problem formulation
Granularity of replication
Incremental clustering and replication
Conclusions
Future Research
15
Conventional CDN: Non-cooperative Pull
Components: CDN name server and Web content server; in each of ISP 1 and ISP 2, a client, a local DNS server and a local CDN server
1. Request for hostname resolution
2. Reply: local CDN server IP address
3. GET request
4. GET request if cache miss
5. Response
6. Response
Inefficient replication
16
SCAN: Cooperative Push
Components: CDN name server and Web content server; in each of ISP 1 and ISP 2, a client, a local DNS server and a local CDN server
0. Push replicas
1. Request for hostname resolution
2. Reply: nearby replica server or Web server IP address
3. GET request (or GET request to the Web server if no replica yet)
4. Response
Significantly reduce the # of replicas and update cost
17
Comparison between Conventional CDNs and SCAN
Metric | Conventional CDNs | SCAN
Average retrieval latency (ms) | 79.2 | 77.9
Number of object replicas deployed | 121,016 | 5,000
Number of update messages | 1,349,655 | 54,564
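The savings implied by the comparison above can be checked with simple arithmetic. The sketch below only divides the figures reported on the slide; the dictionary and function names are illustrative and not part of the original work.

```python
# Figures copied from the comparison above (conventional CDN vs. SCAN).
conventional = {"replicas": 121_016, "update_messages": 1_349_655}
scan = {"replicas": 5_000, "update_messages": 54_564}


def cost_ratio(metric: str) -> float:
    """Return SCAN's cost as a fraction of the conventional CDN's cost."""
    return scan[metric] / conventional[metric]


for metric in conventional:
    # Print each ratio as a percentage with one decimal place.
    print(f"{metric}: {cost_ratio(metric):.1%}")
```

Both ratios come out at roughly 4%, while the average retrieval latency stays nearly the same (77.9 ms vs. 79.2 ms).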
18
Problem Formulation
How to use cooperative push for replication to reduce
–Clients’ average retrieval cost
–Replica location computation cost
–Amount of replica directory state to maintain
Subject to certain total replication cost (e.g., # of object replicas)
19
Outline
Architecture
Problem formulation
Granularity of replication
Incremental clustering and replication
Conclusions
Future Research
20
Figure: replication per Web site vs. per object
21
Replica Placement: Per Site vs. Per Object
60 – 70% average retrieval cost reduction for the per-object scheme
Per object is too expensive for management!
22
Overhead Comparison
Replication scheme: replica directory state to maintain; computation cost
–Per Website: O(R)
–Per Object: O(R × M)
Where R: # of replicas per object, M: total # of objects in the Website
Computing on average 10 replicas/object for the top 1000 objects takes several days on a normal server!
23
Overhead Comparison
Replication scheme: replica directory state to maintain; computation cost
–Per Website: O(R)
–Per Cluster: O(R × K + M); O(R × K)
–Per Object: O(R × M)
Where R: # of replicas per object, K: # of clusters, M: total # of objects in the Website (M >> K)
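To get a feel for the gap between the three granularities, the directory-state expressions above can be evaluated for concrete sizes. This is a rough sketch only: it treats the big-O constants as 1, and the value of K is a hypothetical choice (2% of M), not a figure from the slides.

```python
def directory_state(scheme: str, r: int, m: int, k: int) -> int:
    """Replica directory entries, taking the big-O constants as 1 (assumption)."""
    if scheme == "per_website":
        return r              # one replica list for the whole site
    if scheme == "per_cluster":
        return r * k + m      # one replica list per cluster + object-to-cluster map
    if scheme == "per_object":
        return r * m          # one replica list per object
    raise ValueError(f"unknown scheme: {scheme}")


# R = 10 replicas and M = 1000 objects echo the slide's example; K = 20 is assumed.
R, M, K = 10, 1000, 20
for scheme in ("per_website", "per_cluster", "per_object"):
    print(scheme, directory_state(scheme, R, M, K))
```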
24
Clustering Web Content
General clustering framework
–Define the correlation distance between objects
–Cluster diameter: the max distance between any two members
»Worst correlation in a cluster
–Generic clustering: minimize the max diameter of all clusters
Correlation distance definition based on
–Spatial locality
–Temporal locality
–Popularity
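The diameter-based objective above can be stated in a few lines. The sketch below only evaluates the objective for a given clustering (it does not search for the best one); the function names and the `dist` callback are illustrative assumptions.

```python
from itertools import combinations
from typing import Callable, Iterable, Sequence, TypeVar

T = TypeVar("T")


def cluster_diameter(members: Sequence[T], dist: Callable[[T, T], float]) -> float:
    """Max correlation distance between any two members (0 for 0 or 1 members)."""
    return max((dist(a, b) for a, b in combinations(members, 2)), default=0.0)


def max_diameter(clusters: Iterable[Sequence[T]], dist: Callable[[T, T], float]) -> float:
    """The quantity generic clustering tries to minimize: the worst cluster diameter."""
    return max((cluster_diameter(c, dist) for c in clusters), default=0.0)


# Toy usage with 1-D "objects" and absolute difference as the distance.
print(max_diameter([[0, 1, 2], [10, 14]], lambda a, b: abs(a - b)))
```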
25
Spatial Clustering
Correlation distance between two objects defined as
–Euclidean distance
–Vector similarity
Object spatial access vector
–Blue object
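The slide's formulas did not survive extraction, so the following is a sketch under stated assumptions rather than the author's definitions: each object's spatial access vector is taken to hold its access counts per client group, and "vector similarity" is interpreted as cosine similarity, turned into a distance as one minus the similarity.

```python
import math
from typing import Sequence


def euclidean_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two spatial access vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity (assumed reading of 'vector similarity')."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # an object with no accesses is treated as uncorrelated
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical access counts from four client groups for two objects.
obj_a = [12, 0, 3, 5]
obj_b = [24, 0, 6, 10]
print(euclidean_distance(obj_a, obj_b), cosine_distance(obj_a, obj_b))
```

Under these assumptions the two metrics behave differently: `obj_b` has the same access pattern as `obj_a` at twice the volume, so its cosine distance is near zero while its Euclidean distance is not.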
26
Clustering Web Content (cont’d)
Popularity-based clustering
–OR even simpler, sort them and put the first N/K elements into the first cluster, etc.
Temporal clustering
–Divide traces into multiple individuals’ access sessions [ABQ01]
–In each session,
–Average over multiple sessions in one day
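The simple popularity-based scheme (sort by popularity, then cut the ranked list into K chunks of about N/K objects) can be sketched as follows. The function name, the tie-breaking rule and the sample counts are illustrative assumptions.

```python
from typing import Dict, List


def popularity_clusters(access_counts: Dict[str, int], k: int) -> List[List[str]]:
    """Sort objects by access count (descending) and split them into k chunks."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Ties are broken by object name so the result is deterministic.
    ranked = sorted(access_counts, key=lambda o: (-access_counts[o], o))
    n = len(ranked)
    # Chunk boundaries at i*N/K keep cluster sizes within one of each other.
    bounds = [i * n // k for i in range(k + 1)]
    clusters = [ranked[bounds[i]:bounds[i + 1]] for i in range(k)]
    return [c for c in clusters if c]  # drop empty chunks when k > n


counts = {"a": 50, "b": 40, "c": 30, "d": 20, "e": 10, "f": 5}
print(popularity_clusters(counts, 3))
```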
27
Performance of Cluster-based Replication
Use greedy algorithm for replication
Spatial clustering with Euclidean distance and popularity-based clustering perform the best
–Small # of clusters (with only 1-2% of # of objects) can achieve close to per-object performance, with much less overhead
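The slides do not spell out the greedy algorithm, so the sketch below is one plausible reading rather than the author's implementation: at each step it adds the (cluster, server) replica that reduces total client retrieval cost the most, until a replica budget is used up. The data layout (`demand`, `dist`, `origin`, `budget`) is assumed for illustration.

```python
from typing import Dict, List, Tuple


def greedy_placement(
    demand: Dict[str, Dict[str, int]],   # demand[cluster][client_group] = # requests
    dist: Dict[str, Dict[str, float]],   # dist[client_group][server] = retrieval cost
    origin: str,                         # server that initially holds every cluster
    budget: int,                         # total number of cluster replicas allowed
) -> List[Tuple[str, str]]:
    """Greedily pick (cluster, server) replicas by largest cost reduction."""
    servers = sorted({s for row in dist.values() for s in row})
    # best[c][g]: cheapest cost for group g to fetch cluster c so far.
    best = {c: {g: dist[g][origin] for g in groups} for c, groups in demand.items()}
    placed = set()
    chosen: List[Tuple[str, str]] = []
    for _ in range(budget):
        best_gain, best_pair = 0.0, None
        for c, groups in demand.items():
            for s in servers:
                if s == origin or (c, s) in placed:
                    continue
                # Cost saved if cluster c were also replicated at server s.
                gain = sum(n * max(0.0, best[c][g] - dist[g][s]) for g, n in groups.items())
                if gain > best_gain:
                    best_gain, best_pair = gain, (c, s)
        if best_pair is None:
            break  # no remaining replica reduces the cost
        c, s = best_pair
        placed.add(best_pair)
        chosen.append(best_pair)
        for g in demand[c]:
            best[c][g] = min(best[c][g], dist[g][s])
    return chosen


dist = {"g1": {"origin": 10, "s1": 1, "s2": 8}, "g2": {"origin": 10, "s1": 9, "s2": 2}}
demand = {"hot": {"g1": 100, "g2": 10}, "cold": {"g1": 1, "g2": 1}}
print(greedy_placement(demand, dist, "origin", budget=2))
```

With a fixed budget, this version spends both replicas on the heavily requested cluster, which is the behaviour one would expect from a cost-driven greedy choice.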
28
Outline
Architecture
Problem formulation
Granularity of replication
Incremental clustering and replication
Conclusions
Future Research
29
Static clustering and replication
Two daily traces: training trace and new trace
Methods | Static 1 | Static 2 | Optimal
Traces used for clustering | Training | Training | New
Traces used for replication | Training | New | New
Traces used for evaluation | New | New | New
Static clustering performs poorly beyond a week
Retrieval cost of static clustering is almost double the optimal!
30
Incremental Clustering
Generic framework
1. If new object o matches an existing cluster c, add o to c and replicate o to the existing replicas of c
2. Else create a new cluster for o and replicate it
Two types of incremental clustering
–Online: without any access logs
»High availability
–Offline: with access logs
»Close-to-optimal performance
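The two-step generic framework can be sketched as a single function. This is a minimal illustration, not the author's code: the `matches` predicate, the dictionary layout and the decision to leave placement of a brand-new cluster to a separate step are all assumptions.

```python
from typing import Callable, Dict, List, Set, Tuple


def add_object(
    obj: str,
    clusters: Dict[int, List[str]],            # cluster id -> member objects
    replicas: Dict[int, Set[str]],             # cluster id -> servers holding it
    matches: Callable[[str, List[str]], bool], # hypothetical matching rule
) -> Tuple[int, Set[str]]:
    """Return (cluster id, servers the new object must be pushed to)."""
    # Step 1: join the first existing cluster that matches, inherit its replicas.
    for cid, members in clusters.items():
        if matches(obj, members):
            members.append(obj)
            return cid, set(replicas[cid])
    # Step 2: otherwise open a new cluster; where to replicate it is decided
    # separately (e.g., by the placement algorithm), so it starts with no replicas.
    cid = max(clusters, default=-1) + 1
    clusters[cid] = [obj]
    replicas[cid] = set()
    return cid, set()


# Toy matching rule: same top-level directory as the cluster's first member.
same_dir = lambda o, members: o.split("/")[0] == members[0].split("/")[0]
clusters = {0: ["news/1", "news/2"]}
replicas = {0: {"s1", "s4"}}
print(add_object("news/3", clusters, replicas, same_dir))
print(add_object("sports/1", clusters, replicas, same_dir))
```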
31
Online Incremental Clustering
Predict access patterns based on semantics
Simplify to popularity prediction
Groups of objects with similar popularity? Use hyperlink structures!
–Groups of siblings
–Groups of the same hyperlink depth (smallest # of links from root)
32
Online Popularity Prediction
Experiments
–Crawl http://www.msnbc.com with hyperlink depth 4, then group the objects
–Use corresponding access logs to analyze the correlation
Groups of siblings have better correlation
Measure the divergence of object popularity within a group: access freq span =
33
Semantics-based Incremental Clustering
Put new object into the existing cluster with the largest number of its siblings
–In case of a tie, choose the cluster w/ more replicas
Simulation on MSNBC daily traces
–8-10am trace: static popularity clustering + replication
–At 10am: M new objects - online inc. clustering + replication
–Evaluated with 10am-12pm trace: each new object O(10^3) requests
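The sibling rule and its tie-break translate directly into code. In this sketch the inputs (`siblings`, `cluster_of`, `replica_count`) are hypothetical data structures; how siblings are extracted from the hyperlink structure is outside its scope.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def pick_cluster(
    siblings: Iterable[str],          # objects linked from the same parent page
    cluster_of: Dict[str, str],       # already-clustered object -> cluster id
    replica_count: Dict[str, int],    # cluster id -> # of replicas deployed
) -> Optional[str]:
    """Choose the cluster holding most siblings; break ties by replica count."""
    votes = Counter(cluster_of[s] for s in siblings if s in cluster_of)
    if not votes:
        return None  # no clustered sibling: the caller must fall back to another rule
    return max(votes, key=lambda c: (votes[c], replica_count.get(c, 0)))


cluster_of = {"a": "A", "b": "A", "c": "B", "d": "B", "e": "C"}
replica_count = {"A": 3, "B": 5, "C": 9}
# Clusters A and B each hold two siblings; B wins because it has more replicas.
print(pick_cluster(["a", "b", "c", "d", "e"], cluster_of, replica_count))
```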
34
Online Incremental Clustering and Replication Results
1/8 compared w/ no replication, and 1/5 for random replication
35
Online Incremental Clustering and Replication Results
Double the optimal retrieval cost, but only 4% of its replication cost
36
Conclusions
Cooperative, clustering-based replication
Cooperative push: only 4 - 5% replication/update cost compared with existing CDNs
Clustering reduces the management/computational overhead by two orders of magnitude
–Spatial clustering and popularity-based clustering recommended
Incremental clustering to adapt to emerging objects
–Hyperlink-based online incremental clustering for high availability and performance improvement
37
Tie Back to SCAN
Self-organize replicas into app-level multicast tree for update dissemination
Scalable overlay network monitoring
–O(M+N) instead of O(M × N), given M client groups and N servers
For more info: http://www.cs.berkeley.edu/~yanchen/resume.html#Publications
38
Outline
Architecture
Problem formulation
Granularity of replication
Incremental clustering and replication
Conclusions
Future Research
39
Future Research (I)
Measurement-based Internet study and protocol/architecture design
–Use inference techniques to develop Internet behavior models
»Network operators reluctant to reveal internal network configs
–Root cause analysis: large, heterogeneous data mining
»Leverage graphics/visualization for interactive mining
–Apply deeper understanding of Internet behaviors for reassessment/design of protocol/architecture
–E.g., Internet bottleneck: peering links? How and Why? Implications?
40
Future Research (II)
Network traffic anomaly characterization, identification and detection
–Many unknown flow-level anomalies revealed from real router traffic analysis (AT&T)
–Profile traffic patterns of new applications (e.g., P2P) → benign anomalies
–Understand the causes, patterns and prevalence of other unknown anomalies
–Apply malicious patterns for intrusion detection
–E.g., fight against Sapphire/Slammer Worm
–Leverage Forensix for auditing and querying
41
Backup Materials
42
Tomography-based Network Monitoring
Given O(M+N) end hosts, a power-law degree topology implies O(M+N) links
Transform to the topology matrix: M × N paths over O(M + N) links
For a path with loss rate P over links with loss rates l_0, l_1, l_2: 1 – P = (1 – l_0)(1 – l_1)(1 – l_2)
Pick O(M + N) paths to compute the link loss rates
Use link loss rates to compute the loss rates of other paths
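The path/link relation 1 – P = (1 – l_0)(1 – l_1)(1 – l_2) is what makes inference from a few measured paths possible: taking logarithms turns the product into a sum, so each path contributes one linear equation in the unknowns log(1 – l_j). The sketch below illustrates only this relation, with made-up loss rates; it is not the monitoring system itself, and the function names are assumptions.

```python
import math
from typing import Sequence


def path_loss_rate(link_loss_rates: Sequence[float]) -> float:
    """Loss rate of a path whose links drop packets independently."""
    survive = 1.0
    for loss in link_loss_rates:
        survive *= 1.0 - loss  # a packet must survive every link on the path
    return 1.0 - survive


def log_success(loss_rate: float) -> float:
    """log(1 - loss): the quantity that adds up along a path."""
    return math.log(1.0 - loss_rate)


links = [0.1, 0.0, 0.2]  # hypothetical loss rates l_0, l_1, l_2
p = path_loss_rate(links)
# The path's log-success equals the sum of its links' log-successes.
print(p, log_success(p), sum(log_success(l) for l in links))
```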
43
Path Loss Rate Inference
Ideal case: rank = # of links (K)
Rank deficiency solved through topology transformation
Figure: topology transformation from real links to virtual links
44
Future Research (I)
Internet behavior modeling and protocol/architecture design
–Use inference techniques to develop Internet behavior models
–Root cause analysis: large, heterogeneous data mining
»Leverage graphics/visualization for interactive mining
–Leverage SciClone Cluster for parallel network tomography
–Apply deeper understanding of Internet behaviors for reassessment/design of protocol/architecture
–E.g., Internet bottleneck: peering links? How and Why? Implications?
45
Tomography-based Network Monitoring
Observations
–The # of lossy links is small, but they dominate E2E loss
–Loss rates are stable (on the order of hours to days)
–Routing is stable (on the order of days)
Identify the lossy links and only monitor a few paths to examine lossy links
Make inference for other paths
Figure legend: end hosts, routers, normal links, lossy links
46
SCAN
Coherence for dynamic content
Cooperative clustering-based replication
Scalable network monitoring: O(M+N)
47
Problem Formulation
Subject to certain total replication cost (e.g., # of URL replicas)
Find a scalable, adaptive replication strategy to reduce avg access cost
48
SCAN: Scalable Content Access Network
CDN Applications (e.g. streaming media)
Provision: Cooperative Clustering-based Replication
Coherence: Update Multicast Tree Construction
User Behavior/Workload Monitoring
Network Performance Monitoring
Network Distance/Congestion/Failure Estimation
red: my work, black: out of scope
49
Evaluation of Internet-scale System
Analytical evaluation
Realistic simulation
–Network topology
–Web workload
–Network end-to-end latency measurement
Network topology
–Pure-random, Waxman & transit-stub synthetic topology
–A real AS-level topology from 7 widely-dispersed BGP peers
50
Web Workload
Web Site | Period | Duration | # Requests (avg–min–max) | # Clients (avg–min–max) | # Client groups (avg–min–max)
MSNBC | Aug–Oct 1999 | 10–11am | 1.5M–642K–1.7M | 129K–69K–150K | 15.6K–10K–17K
NASA | Jul–Aug 1995 | All day | 79K–61K–101K | 5940–4781–7671 | 2378–1784–3011
World Cup | May–Jul 1998 | All day | 29M–1M–73M | 103K–13K–218K | N/A
Aggregate MSNBC Web clients with BGP prefix
–BGP tables from a BBNPlanet router
Aggregate NASA Web clients with domain names
Map the client groups onto the topology
51
Simulation Methodology
Network Topology
–Pure-random, Waxman & transit-stub synthetic topology
–An AS-level topology from 7 widely-dispersed BGP peers
Web Workload
Web Site | Period | Duration | # Requests (avg–min–max) | # Clients (avg–min–max) | # Client groups (avg–min–max)
MSNBC | Aug–Oct 1999 | 10–11am | 1.5M–642K–1.7M | 129K–69K–150K | 15.6K–10K–17K
NASA | Jul–Aug 1995 | All day | 79K–61K–101K | 5940–4781–7671 | 2378–1784–3011
–Aggregate MSNBC Web clients with BGP prefix
»BGP tables from a BBNPlanet router
–Aggregate NASA Web clients with domain names
–Map the client groups onto the topology
52
Online Incremental Clustering
Predict access patterns based on semantics
Simplify to popularity prediction
Groups of URLs with similar popularity? Use hyperlink structures!
–Groups of siblings
–Groups of the same hyperlink depth: smallest # of links from root
53
Challenges for CDN
Over-provisioning for replication
–Provide good QoS to clients (e.g., latency bound, coherence)
–Small # of replicas with small delay and bandwidth consumption for update
Replica Management
–Scalability: billions of replicas if replicating per URL
»O(10^4) URLs/server, O(10^5) CDN edge servers in O(10^3) networks
–Adaptation to dynamics of content providers and customers
Monitoring
–User workload monitoring
–End-to-end network distance/congestion/failures monitoring
»Measurement scalability
»Inference accuracy and stability
54
SCAN Architecture
Leverage Decentralized Object Location and Routing (DOLR) - Tapestry for
–Distributed, scalable location with guaranteed success
–Search with locality
Soft state maintenance of dissemination tree (for each object)
Figure labels: data plane, network plane, data source, Web server, SCAN server, client, replica, always update, adaptive coherence, cache, Tapestry mesh, Request Location, Dynamic Replication/Update and Content Management
55
Wide-area Network Measurement and Monitoring System (WNMMS)
Select a subset of SCAN servers to be monitors
E2E estimation for
–Distance
–Congestion
–Failures
Figure labels: network plane, clients, SCAN edge servers, monitors, Cluster A, Cluster B, Cluster C; distance measured from a host to its monitor; distance measured among monitors
56
Dynamic Provisioning
Dynamic replica placement
–Meeting clients’ latency and servers’ capacity constraints
–Close-to-minimal # of replicas
Self-organized replicas into app-level multicast tree
–Small delay and bandwidth consumption for update multicast
–Each node only maintains states for its parent & direct children
Evaluated based on simulation of
–Synthetic traces with various sensitivity analysis
–Real traces from NASA and MSNBC
Publication
–IPTPS 2002
–Pervasive Computing 2002
57
Effects of the Non-Uniform Size of URLs
Replication cost constraint: bytes
Similar trends exist
–Per URL replication outperforms per Website dramatically
–Spatial clustering with Euclidean distance and popularity-based clustering are very cost-effective
58
Diagram of Internet Iso-bar
Figure labels: end hosts, landmark, Cluster A, Cluster B, Cluster C
59
Diagram of Internet Iso-bar
Figure labels: end hosts, monitors, landmark, Cluster A, Cluster B, Cluster C
Distance probes from monitor to its hosts
Distance probes among monitors
60
Real Internet Measurement Data
NLANR Active Measurement Project data set
–119 sites in the US (106 after filtering out most offline sites)
–Round-trip time (RTT) between every pair of hosts every minute
–Raw data: 6/24/00 – 12/3/01
Keynote measurement data
–Measure TCP performance from about 100 agents
–Heterogeneous core network: various ISPs
–Heterogeneous access network:
»Dial up 56K, DSL and high-bandwidth business connections
–Targets
»Web site perspective: 40 most popular Web servers
»27 Internet Data Centers (IDCs)
61
Related Work
Internet content delivery systems
–Web caching
»Client-initiated
»Server-initiated
–Pull-based Content Delivery Networks (CDNs)
–Push-based CDNs
Update dissemination
–IP multicast
–Application-level multicast
Network E2E Distance Monitoring Systems
62
Web Proxy Caching
Components: Web content server; in each of ISP 1 and ISP 2, a client, a local DNS server and a proxy cache server
1. GET request
2. GET request if cache miss
3. Response
4. Response
63
Conventional CDN: Non-cooperative Pull
Components: CDN name server and Web content server; in each of ISP 1 and ISP 2, a client, a local DNS server and a local CDN server
1. GET request
2. Request for hostname resolution
3. Reply: local CDN server IP address
4. Local CDN server IP address
5. GET request
6. GET request if cache miss
7. Response
8. Response
Inefficient replication
64
SCAN: Cooperative Push
Components: CDN name server and Web content server; in each of ISP 1 and ISP 2, a client, a local DNS server and a local CDN server
0. Push replicas
1. GET request
2. Request for hostname resolution
3. Reply: nearby replica server or Web server IP address
4. Redirected server IP address
5. GET request (or GET request to the Web server if no replica yet)
6. Response
Significantly reduce the # of replicas and update cost
65
Internet Content Delivery Systems
Properties compared across: Web caching (client initiated); Web caching (server initiated); Pull-based CDNs (Akamai); Push-based CDNs; SCAN
–Scalability for request redirection: Pre-configured in browser; Use Bloom filter to exchange replica locations; Centralized CDN name server; Decentralized P2P location
–Efficiency (# of caches or replicas): No cache sharing among proxies; Cache sharing; No replica sharing among edge servers; Replica sharing
–Network-awareness: No; Yes, unscalable monitoring system; No; Yes, scalable monitoring system
–Coherence support: No; Yes; No; Yes
66
Previous Work: Update Dissemination
No inter-domain IP multicast
Application-level multicast (ALM) unscalable
–Root maintains states for all children (Narada, Overcast, ALMI, RMX)
–Root handles all “join” requests (Bayeux)
–Root split is a common solution, but suffers consistency overhead
67
Design Principles
Scalability
–No centralized point of control: P2P location services, Tapestry
–Reduce management states: minimize # of replicas, object clustering
–Distributed load balancing: capacity constraints
Adaptation to clients’ dynamics
–Dynamic distribution/deletion of replicas with regard to clients’ QoS constraints
–Incremental clustering
Network-awareness and fault-tolerance (WNMMS)
–Distance estimation: Internet Iso-bar
–Anomaly detection and diagnostics
68
Comparison of Content Delivery Systems (cont’d)
Properties compared across: Web caching (client initiated); Web caching (server initiated); Pull-based CDNs (Akamai); Push-based CDNs; SCAN
–Distributed load balancing: No; Yes; No; Yes
–Dynamic replica placement: Yes; No; Yes
–Network-awareness: No; Yes, unscalable monitoring system; No; Yes, scalable monitoring system
–No global network topology assumption: Yes; No; Yes