Efficient and Adaptive Replication using Content Clustering Yan Chen EECS Department UC Berkeley
Motivation Internet has evolved to become a commercial infrastructure for service delivery –Web delivery, VoIP, streaming media … Challenges for Internet-scale services –Scalability: 600M users, 35M Web sites, 28Tb/s –Efficiency: bandwidth, storage, management –Agility: dynamic clients/network/servers –Security, etc. Focus on content delivery - Content Distribution Network (CDN) –In total, 4 billion Web pages, daily growth of 7M pages –Annual growth of 200% expected for the next 4 years
CDN and its Challenges
–Inefficient replication –No coherence for dynamic content –Unscalable network monitoring - O(M*N)
CDN Applications (e.g. streaming media) SCAN: Scalable Content Access Network –Provision: Cooperative Clustering-based Replication; User Behavior/Workload Monitoring –Coherence: Update Multicast Tree Construction –Network Performance Monitoring: Network Distance/Congestion/Failure Estimation (red: my work, black: out of scope)
SCAN –Coherence for dynamic content –Cooperative clustering-based replication (diagram: replicas placed at servers s1, s4, s5)
SCAN –Coherence for dynamic content –Cooperative clustering-based replication –Scalable network monitoring - O(M+N) (diagram: replicas placed at servers s1, s4, s5)
Internet-scale Simulation Network Topology –Pure-random, Waxman & transit-stub synthetic topologies –An AS-level topology from 7 widely-dispersed BGP peers Web Workload
Web Site | Period | Duration | # Requests (avg–min–max) | # Clients (avg–min–max) | # Client groups (avg–min–max)
MSNBC | Aug–Oct/1999 | 10–11am | 1.5M–642K–1.7M | 129K–69K–150K | 15.6K–10K–17K
NASA | Jul–Aug/1995 | All day | 79K–61K–101K | |
World Cup | May–Jul/1998 | All day | 29M–1M–73M | 103K–13K–218K | N/A
–Aggregate MSNBC Web clients with BGP prefix »BGP tables from a BBNPlanet router –Aggregate NASA Web clients with domain names –Map the client groups onto the topology
Internet-scale Simulation – E2E Measurement NLANR Active Measurement Project data set –111 sites in America, Asia, Australia and Europe –Round-trip time (RTT) between every pair of hosts every minute –17M measurements daily –Raw data: Jun. – Dec. 2001 Keynote measurement data –Measure TCP performance from about 100 worldwide agents –Heterogeneous core network: various ISPs –Heterogeneous access network: »Dial up 56K, DSL and high-bandwidth business connections –Targets »40 most popular Web servers + 27 Internet Data Centers –Raw data: Nov. – Dec. 2001, Mar. – May 2002
Clustering Web Content for Efficient Replication
Overview CDN uses non-cooperative replication - inefficient Paradigm shift: cooperative push –Where to push – greedy algorithms can achieve close to optimal performance [JJKRS01, QPV01] (see the placement sketch below) –But what content should be pushed? –At what granularity? Clustering of objects for replication –Close-to-optimal performance with small overhead Incremental clustering –Push before accessed: improves availability during flash crowds
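The "where to push" step can be made concrete with a small sketch. This is a hedged illustration of greedy placement under a replica budget, not the exact algorithm of [JJKRS01, QPV01]; the distance matrix, per-group load, and origin distances are assumed inputs.

```python
# Hedged sketch of greedy replica placement: repeatedly add the candidate
# server that most reduces the total weighted retrieval cost.
# dist[i][j]      = distance from client group i to candidate server j
# load[i]         = request count of client group i
# origin_dist[i]  = distance from group i to the origin Web server
def greedy_placement(dist, load, origin_dist, budget):
    num_groups, num_servers = len(dist), len(dist[0])
    chosen = []
    best = list(origin_dist)          # current cost of each group's closest copy
    for _ in range(budget):
        best_server, best_gain = None, 0.0
        for j in range(num_servers):
            if j in chosen:
                continue
            # Gain = total weighted cost reduction if server j gets a replica.
            gain = sum(load[i] * max(0.0, best[i] - dist[i][j])
                       for i in range(num_groups))
            if gain > best_gain:
                best_server, best_gain = j, gain
        if best_server is None:       # no remaining server improves the cost
            break
        chosen.append(best_server)
        for i in range(num_groups):
            best[i] = min(best[i], dist[i][best_server])
    return chosen
```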
Outline Architecture Problem formulation Granularity of replication Incremental clustering and replication Conclusions Future Research
Conventional CDN: Non-cooperative Pull (clients, local DNS servers and local CDN servers in ISP 1 and ISP 2; CDN name server; Web content server) 1. GET request 2. Request for hostname resolution 3. Reply: local CDN server IP address 4. Local CDN server IP address 5. GET request 6. GET request if cache miss 7. Response 8. Response Inefficient replication
SCAN: Cooperative Push (clients, local DNS servers and local CDN servers in ISP 1 and ISP 2; CDN name server; Web content server) 0. Push replicas 1. GET request 2. Request for hostname resolution 3. Reply: nearby replica server or Web server IP address 4. Redirected server IP address 5. GET request (to the Web server if no replica yet) 6. Response Significantly reduce the # of replicas and update cost
Comparison between Conventional CDNs and SCAN
Metric | Conventional CDNs | SCAN
Average retrieval latency (ms) | |
Number of URL replicas deployed | 121,016 | 5,000
Number of update messages | 1,349,655 | 54,564
Problem Formulation Find a scalable, adaptive replication strategy to reduce –Clients’ average retrieval cost –Replica location computation cost –Amount of replica directory state to maintain Subject to certain total replication cost (e.g., # of URL replicas)
Outline Architecture Problem formulation Granularity of replication Incremental clustering and replication Conclusions Future Research
Per Web site Per URL
Replica Placement: Per Website vs. Per URL 60 – 70% average retrieval cost reduction for the per-URL scheme But per-URL is too expensive for management!
Overhead Comparison
Replication Scheme | State to Maintain | Computation Cost
Per Website | O(R) | O(R)
Per URL | O(R × M) | O(R × M)
where R: # of replicas per URL, M: # of URLs
To compute, on average, 10 replicas/URL for the top 1000 URLs takes several days on a normal server!
Overhead Comparison
Replication Scheme | State to Maintain | Computation Cost
Per Website | O(R) | O(R)
Per Cluster | O(R × K + M) | O(R × K)
Per URL | O(R × M) | O(R × M)
where R: # of replicas per URL, K: # of clusters, M: # of URLs (M >> K)
Clustering Web Content General clustering framework –Define the correlation distance between URLs –Cluster diameter: the max distance between any two members »Worst correlation in a cluster –Generic clustering: minimize the max diameter of all clusters Correlation distance definition based on –Spatial locality –Temporal locality –Popularity
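As an illustration of the generic framework, the following hedged sketch greedily grows clusters while keeping every cluster's diameter (its worst pairwise correlation distance) under a threshold; the `distance` callback and `max_diameter` knob are assumptions, not values from the talk.

```python
# Hedged sketch of diameter-bounded clustering: a URL joins a cluster only if
# it stays within max_diameter of every current member; otherwise it seeds a
# new cluster. `distance` can be any correlation distance (spatial, temporal,
# or popularity-based).
def diameter_bounded_clustering(urls, distance, max_diameter):
    clusters = []                      # each cluster is a list of URLs
    for u in urls:
        placed = False
        for cluster in clusters:
            if all(distance(u, v) <= max_diameter for v in cluster):
                cluster.append(u)      # diameter stays bounded
                placed = True
                break
        if not placed:
            clusters.append([u])
    return clusters
```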
Spatial Clustering Correlation distance between two URLs defined over their URL spatial access vectors, as –Euclidean distance –Vector similarity (diagram: the access vector of the highlighted URL)
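To make the spatial definitions concrete, here is a hedged sketch (the log record layout is an assumption) that builds each URL's spatial access vector, one entry per client group, and compares two vectors with Euclidean distance or a cosine-similarity-based distance.

```python
import math
from collections import defaultdict

def spatial_access_vectors(log, client_groups):
    """log: iterable of (url, client_group) pairs; returns url -> access vector."""
    index = {g: i for i, g in enumerate(client_groups)}
    vectors = defaultdict(lambda: [0.0] * len(client_groups))
    for url, group in log:
        vectors[url][index[group]] += 1.0   # count accesses per client group
    return vectors

def euclidean_distance(v1, v2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

def cosine_distance(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return 1.0 - (dot / norm if norm else 0.0)
```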
Clustering Web Content (cont’d) Popularity-based clustering –OR even simpler, sort the URLs by popularity and put the first N/K elements into the first cluster, etc. - binary correlation Temporal clustering –Divide traces into multiple individuals’ access sessions [ABQ01] –In each session, compute the correlation distance between each pair of URLs accessed –Average over multiple sessions in one day
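A minimal sketch of the simpler popularity-based variant described above: rank URLs by access count and cut the ranked list into K roughly equal clusters. How the trailing remainder is handled is an implementation choice, not from the talk.

```python
def popularity_clustering(access_counts, k):
    """access_counts: dict url -> request count; returns K popularity clusters."""
    ranked = sorted(access_counts, key=access_counts.get, reverse=True)
    size = max(1, len(ranked) // k)
    clusters = [ranked[i:i + size] for i in range(0, len(ranked), size)]
    if len(clusters) > k:                      # fold the remainder into the last cluster
        clusters[k - 1].extend(sum(clusters[k:], []))
        clusters = clusters[:k]
    return clusters
```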
Performance of Cluster-based Replication Spatial clustering with Euclidean distance and popularity-based clustering perform the best –A small # of clusters (only 1-2% of the # of URLs) can achieve close to per-URL performance, with much less overhead (MSNBC trace, 8/2/1999, 5 replicas/URL)
Outline Architecture Problem formulation Granularity of replication Incremental clustering and replication Conclusions Future Research
Static Clustering and Replication Two daily traces: a training trace and a new trace
Methods | Static 1 | Static 2 | Optimal
Traces used for clustering | Training | Training | New
Traces used for replication | Training | New | New
Traces used for evaluation | New | New | New
Static clustering performs poorly beyond a week: its retrieval cost is almost double the optimal!
Incremental Clustering Generic framework 1. If a new URL u matches an existing cluster c, add u to c and replicate u to the existing replicas of c 2. Else, create a new cluster and replicate it Two types of incremental clustering –Online: without any access logs »High availability –Offline: with access logs »Close-to-optimal performance
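A hedged sketch of this generic framework: `matches` stands in for either the online (semantics-based) or offline (log-based) matching rule, and `replicate` for the push operation; the cluster data layout is illustrative.

```python
def incremental_update(new_urls, clusters, matches, replicate):
    """clusters: list of dicts {'urls': [...], 'replicas': [...]}."""
    for u in new_urls:
        target = next((c for c in clusters if matches(u, c)), None)
        if target is not None:
            target["urls"].append(u)
            for server in target["replicas"]:
                replicate(u, server)            # push u to the cluster's replicas
        else:
            clusters.append({"urls": [u], "replicas": []})  # replicated separately
    return clusters
```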
Online Incremental Clustering Predict access patterns based on semantics Simplify to popularity prediction Groups of URLs with similar popularity? Use hyperlink structures! –Groups of siblings –Groups of the same hyperlink depth (smallest # of links from root)
Online Popularity Prediction Measure the divergence of URL popularity within a group with the access frequency span Experiments –Crawl on 5/3/2002 with hyperlink depth 4, then group the URLs –Use the corresponding access logs to analyze the correlation –Groups of siblings have the best correlation
Semantics-based Incremental Clustering Put each new URL into the existing cluster with the largest # of its siblings –In case of a tie, choose the cluster with more replicas Simulation on the 5/3/2002 MSNBC trace –8-10am trace: static popularity clustering + replication –At 10am: 16 new URLs appear - online incremental clustering + replication –Evaluation with 10-12am trace: the 16 URLs receive 33K requests
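A minimal sketch of the sibling-based assignment rule above, assuming each cluster tracks its member URLs and its replica servers (the data layout is an assumption):

```python
def assign_by_siblings(new_url, siblings, clusters):
    """siblings: set of URLs sharing new_url's parent page; clusters as above."""
    if not clusters:
        return None
    best = max(
        clusters,
        # Primary key: # of siblings already in the cluster; tie-break: # of replicas.
        key=lambda c: (len(siblings & set(c["urls"])), len(c["replicas"])),
    )
    if siblings & set(best["urls"]):
        best["urls"].append(new_url)
        return best
    return None   # no cluster contains a sibling; fall back to creating a new cluster
```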
Online Incremental Clustering and Replication Results Retrieval cost is 1/8 of that with no replication, and 1/5 of that with random replication
Online Incremental Clustering and Replication Results Retrieval cost is double the optimal, but at only 4% of the optimal’s replication cost
Conclusions Cooperative, clustering-based replication –Cooperative push: only 4 - 5% of the replication/update cost of existing CDNs –URL clustering reduces the management/computational overhead by two orders of magnitude »Spatial clustering and popularity-based clustering recommended –Incremental clustering to adapt to emerging URLs »Hyperlink-based online incremental clustering for high availability and performance improvement Self-organize replicas into an app-level multicast tree for update dissemination Scalable overlay network monitoring –O(M+N) instead of O(M*N), given M client groups and N servers
Outline Architecture Problem formulation Granularity of replication Incremental clustering and replication Conclusions Future Research
Future Research (I) Measurement-based Internet study and protocol/architecture design –Use inference techniques to develop Internet behavior models »Network operators reluctant to reveal internal network configurations –Root cause analysis: large, heterogeneous data mining »Leverage graphics/visualization for interactive mining –Apply deeper understanding of Internet behaviors for reassessment/design of protocol/architecture –E.g., Internet bottleneck – peering links? How and Why? Implications?
Future Research (II) Network traffic anomaly characterization, identification and detection –Many unknown flow-level anomalies revealed from real router traffic analysis (AT&T) –Profile traffic patterns of new applications (e.g. P2P) –> benign anomalies –Understand the cause, pattern and prevalence of other unknown anomalies –Identify malicious patterns for intrusion detection –E.g., fight against Sapphire/Slammer Worm
Backup Materials
SCAN –Coherence for dynamic content –Cooperative clustering-based replication –Scalable network monitoring: O(M+N) (diagram: replicas placed at servers s1, s4, s5)
Problem Formulation Subject to certain total replication cost (e.g., # of URL replicas) Find a scalable, adaptive replication strategy to reduce avg access cost
Simulation Methodology Network Topology –Pure-random, Waxman & transit-stub synthetic topologies –An AS-level topology from 7 widely-dispersed BGP peers Web Workload
Web Site | Period | Duration | # Requests (avg–min–max) | # Clients (avg–min–max) | # Client groups (avg–min–max)
MSNBC | Aug–Oct/1999 | 10–11am | 1.5M–642K–1.7M | 129K–69K–150K | 15.6K–10K–17K
NASA | Jul–Aug/1995 | All day | 79K–61K–101K | |
–Aggregate MSNBC Web clients with BGP prefix »BGP tables from a BBNPlanet router –Aggregate NASA Web clients with domain names –Map the client groups onto the topology
Online Incremental Clustering Predict access patterns based on semantics Simplify to popularity prediction Groups of URLs with similar popularity? Use hyperlink structures! –Groups of siblings –Groups of the same hyperlink depth: smallest # of links from root
Challenges for CDN Over-provisioning for replication –Provide good QoS to clients (e.g., latency bound, coherence) –Small # of replicas with small delay and bandwidth consumption for updates Replica Management –Scalability: billions of replicas if replicating per URL »O(10^4) URLs/server, O(10^5) CDN edge servers in O(10^3) networks –Adaptation to dynamics of content providers and customers Monitoring –User workload monitoring –End-to-end network distance/congestion/failure monitoring »Measurement scalability »Inference accuracy and stability
SCAN Architecture Leverage Decentralized Object Location and Routing (DOLR) - Tapestry for –Distributed, scalable location with guaranteed success –Search with locality Soft-state maintenance of a dissemination tree (for each object) (diagram: data plane over a Tapestry mesh on the network plane; the Web server is the data source; SCAN servers hold replicas (always updated) and caches (adaptive coherence); clients issue requests; request location, dynamic replication/update and content management)
Wide-area Network Measurement and Monitoring System (WNMMS) Select a subset of SCAN servers to be monitors E2E estimation for –Distance –Congestion –Failures (diagram: SCAN edge servers and clients grouped into clusters A, B and C on the network plane; distance measured from a host to its monitor and among monitors)
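One plausible way to pick monitors in such a system, sketched here as an assumption rather than the WNMMS design itself: group hosts whose RTT vectors to a small landmark set look similar, then use one host per group as that cluster's monitor.

```python
from collections import defaultdict

def cluster_by_landmarks(dist_to_landmarks, bucket_ms=20.0):
    """dist_to_landmarks: dict host -> list of RTTs (ms) to the landmarks."""
    groups = defaultdict(list)
    for host, rtts in dist_to_landmarks.items():
        # Hosts whose RTT vectors fall into the same buckets share a cluster
        # (a crude similarity rule, purely for illustration).
        signature = tuple(int(r // bucket_ms) for r in rtts)
        groups[signature].append(host)
    # Nominate the first host of each cluster as its monitor.
    return {hosts[0]: hosts for hosts in groups.values()}
```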
Dynamic Provisioning Dynamic replica placement –Meeting clients’ latency and servers’ capacity constraints –Close-to-minimal # of replicas Self-organize replicas into an app-level multicast tree –Small delay and bandwidth consumption for update multicast –Each node only maintains states for its parent & direct children Evaluated based on simulation of –Synthetic traces with various sensitivity analyses –Real traces from NASA and MSNBC Publications –IPTPS 2002 –Pervasive Computing 2002
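A minimal sketch of the per-object dissemination-tree state described above: each replica keeps soft state only for its parent and direct children, so an update propagates recursively without any node holding global membership. Class and method names are illustrative.

```python
class ReplicaNode:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent            # None at the data source (tree root)
        self.children = []              # direct children only

    def join(self, child):
        child.parent = self
        self.children.append(child)

    def disseminate(self, update):
        # Apply locally, then forward to direct children; each child recurses,
        # so per-node state stays proportional to 1 + number of direct children.
        self.apply(update)
        for child in self.children:
            child.disseminate(update)

    def apply(self, update):
        print(f"{self.name} applied update {update!r}")
```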
Effects of the Non-Uniform Size of URLs Replication cost constraint: bytes (rather than # of replicas) Similar trends exist –Per-URL replication outperforms per-Website replication dramatically –Spatial clustering with Euclidean distance and popularity-based clustering are very cost-effective
Diagram of Internet Iso-bar: end hosts grouped into clusters A, B and C around landmarks
Diagram of Internet Iso-bar: end-host clusters A, B and C, each with a monitor (landmark); distance probes from each monitor to its hosts and among monitors
Real Internet Measurement Data NLANR Active Measurement Project data set –119 sites in the US (106 after filtering out mostly-offline sites) –Round-trip time (RTT) between every pair of hosts every minute –Raw data: 6/24/00 – 12/3/01 Keynote measurement data –Measure TCP performance from about 100 agents –Heterogeneous core network: various ISPs –Heterogeneous access network: »Dial up 56K, DSL and high-bandwidth business connections –Targets »Web site perspective: 40 most popular Web servers »27 Internet Data Centers (IDCs)
Related Work Internet content delivery systems –Web caching »Client-initiated »Server-initiated –Pull-based Content Delivery Networks (CDNs) –Push-based CDNs Update dissemination –IP multicast –Application-level multicast Network E2E Distance Monitoring Systems
Web Proxy Caching (clients, local DNS servers and proxy cache servers in ISP 1 and ISP 2; Web content server) 1. GET request 2. GET request if cache miss 3. Response 4. Response
Pull-based CDN (clients, local DNS servers and local CDN servers in ISP 1 and ISP 2; CDN name server; Web content server) 1. GET request 2. Request for hostname resolution 3. Reply: local CDN server IP address 4. Local CDN server IP address 5. GET request 6. GET request if cache miss 7. Response 8. Response
Push-based CDN (clients, local DNS servers and local CDN servers in ISP 1 and ISP 2; CDN name server; Web content server) 0. Push replicas 1. GET request 2. Request for hostname resolution 3. Reply: nearby replica server or Web server IP address 4. Redirected server IP address 5. GET request (to the Web server if no replica yet) 6. Response
Internet Content Delivery Systems
Properties | Web caching (client initiated) | Web caching (server initiated) | Pull-based CDNs (Akamai) | Push-based CDNs | SCAN
Scalability for request redirection | Pre-configured in browser | Use Bloom filters to exchange replica locations | Centralized CDN name server | Centralized CDN name server | Decentralized P2P location
Efficiency (# of caches or replicas) | No cache sharing among proxies | Cache sharing | No replica sharing among edge servers | Replica sharing | Replica sharing
Network-awareness | No | No | Yes, unscalable monitoring system | No | Yes, scalable monitoring system
Coherence support | No | No | Yes | No | Yes
Previous Work: Update Dissemination No inter-domain IP multicast Application-level multicast (ALM) is unscalable –Root maintains states for all children (Narada, Overcast, ALMI, RMX) –Root handles all “join” requests (Bayeux) –Root splitting is a common solution, but suffers from consistency overhead
Design Principles Scalability –No centralized point of control: P2P location services, Tapestry –Reduce management states: minimize # of replicas, object clustering –Distributed load balancing: capacity constraints Adaptation to clients’ dynamics –Dynamic distribution/deletion of replicas with regard to clients’ QoS constraints –Incremental clustering Network-awareness and fault-tolerance (WNMMS) –Distance estimation: Internet Iso-bar –Anomaly detection and diagnostics
Comparison of Content Delivery Systems (cont’d)
Properties | Web caching (client initiated) | Web caching (server initiated) | Pull-based CDNs (Akamai) | Push-based CDNs | SCAN
Distributed load balancing | No | Yes | Yes | No | Yes
Dynamic replica placement | Yes | Yes | Yes | No | Yes
Network-awareness | No | No | Yes, unscalable monitoring system | No | Yes, scalable monitoring system
No global network topology assumption | Yes | Yes | Yes | No | Yes
Network-awareness (cont’d) Loss/congestion prediction –Maximize the true positive rate and minimize the false positive rate Orthogonal loss/congestion path discovery –Without the underlying topology –How stable is such orthogonality? »Degradation of orthogonality over time Reactive and proactive adaptation for SCAN