Clustering of Web Content for Efficient Replication
Yan Chen, Lili Qiu, Wei Chen, Luan Nguyen, and Randy H. Katz
{yanchen, wychen, luann,
CDNs (Content Distribution Networks) improve Web performance by replicating content close to the clients
The greedy algorithm has been shown to be efficient and effective for static replica placement, reducing the response latency of end users
Problem: what content should be replicated?
-- All previous work assumes replication of the whole Web site
-- A per-URL scheme yields a 60-70% reduction in clients' latency, but is too expensive
Goal: exploit this tradeoff so that performance improves significantly without high overhead
Our solution:
1. Hot-data analysis to filter out infrequently used data
2. Cluster URLs based on access patterns, and replicate in units of clusters
3. Incremental clustering and redistribution to adapt to emerging URLs and changes in clients' access patterns
Qiu et al. and Jamin et al. independently reported that a greedy algorithm is close to optimal for static replica placement
There is much prior work on clustering Web content, but it focuses on analyses of individual clients' access patterns; in contrast, we are interested in aggregated client access patterns
Among the first to use stability and performance as figures of merit for Web content clustering
Problem formulation: minimize the total latency of clients, subject to the constraint that the total replication cost, sum over all URLs u of |u|, is bounded by R, where |u| denotes the number of replicas of URL u
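A possible LaTeX rendering of this formulation follows; the summation notation and the latency(c, u) term (client c's latency to its nearest replica of URL u) are assumptions filled in from context, since the slide's own formula is not shown:

\begin{align*}
\min \quad & \sum_{u \in \text{URLs}} \; \sum_{c \in \text{clients}} \text{latency}(c, u) \\
\text{s.t.} \quad & \sum_{u \in \text{URLs}} |u| \;\le\; R
\end{align*}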
Network topology:
-- Pure-random and Transit-Stub models from GT-ITM
-- A real AS-level topology from 7 widely dispersed BGP peers
Real-world traces:
-- Cluster MSNBC Web clients by BGP prefix, using BGP tables from a BBNPlanet router (01/24); choose the top 10% of the resulting client clusters, which cover >70% of requests
-- Cluster NASA Web clients by domain names
-- Map the client clusters randomly onto the topology

Web Site   Period      Duration   Total Requests   Requests/day
MSNBC      8-10/1999   10-11am    10,284,735       1,469,248 (1 hr)
NASA       7/1995      All day    3,461,612        56,748
WorldCup   5-7/1998    All day    1,352,804,107    15,372,774
Top 10% of URLs cover over 85% of requests
Hot data remain stable for a reasonably long time
-- The top 10% of URLs on a given day cover over 80% of requests for at least the subsequent week
Conclusion:
-- Only hot data need to be considered for replication
[Figures: MSNBC — stability of popularity ranking; stability of request coverage]
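A minimal sketch (not from the paper) of how such coverage numbers can be computed from a request log; the log format, one requested URL per entry, is an assumption:

from collections import Counter

def hot_data_coverage(request_log, top_fraction=0.10):
    # Fraction of requests covered by the most popular `top_fraction` of URLs.
    counts = Counter(request_log)                  # requests per URL
    ranked = counts.most_common()                  # URLs sorted by popularity
    n_hot = max(1, int(len(ranked) * top_fraction))
    hot_requests = sum(c for _, c in ranked[:n_hot])
    return hot_requests / sum(counts.values())

# Example on a toy, heavily skewed log
log = ["/front"] * 85 + ["/a", "/b", "/c", "/d", "/e"] * 3
print(hot_data_coverage(log))   # 0.85 for this toy log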
Replication unit: per-Website, per-URL, or cluster of URLs
where M: number of hot objects; R: number of replicas/URL; K: number of clusters; C: number of clients; S: number of CDN servers; f_p: placement adaptation frequency; f_c: clustering frequency

Replication Scheme   States to Maintain   Computation Cost
Per Website          O(R)                 f_p * O(R*S*C)
Per Cluster          O(R*K + M)           f_p * O(K*R*(K+S*C)) + f_c * O(M*K)
Per URL              O(R*M)               f_p * O(M*R*(M+S*C))

Big performance gap between per-Website and per-URL replication
Clustering enables a smooth tradeoff between cost and performance (a toy calculation follows below)
Directory-based clustering provides only marginal improvement
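For a concrete feel of the state tradeoff, a tiny illustration that plugs assumed values into the state formulas above (M, R, K below are made-up numbers, not trace values):

# Illustrative only: assumed values, constants in the big-O terms dropped.
M, R, K = 10_000, 5, 100        # hot URLs, replicas per URL, clusters (assumed)
state = {
    "per-Website": R,            # O(R)
    "per-cluster": R * K + M,    # O(R*K + M)
    "per-URL":     R * M,        # O(R*M)
}
for scheme, s in state.items():
    print(f"{scheme:12s} ~ {s:,} state entries")
# per-Website ~ 5, per-cluster ~ 10,500, per-URL ~ 50,000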
Greedy search: iteratively choose the (object, replica location) pair that gives the largest performance gain per replication
-- An object can be an individual URL or a cluster of URLs
Clustering proceeds in two steps:
-- Define a correlation distance between each pair of URLs
-- Apply one of the generic clustering methods below
Generic clustering algorithms:
-- Algorithm 1: limit the diameter (maximum distance between any two URLs) of a cluster, and minimize the number of clusters
-- Algorithm 2: limit the number of clusters, and minimize the maximum diameter over all clusters (a sketch follows)
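One plausible instantiation of Algorithm 2 (fix the number of clusters, keep the maximum diameter small) is complete-linkage agglomerative clustering over the pairwise correlation distances; this is an assumed stand-in for the generic description above, not necessarily the paper's exact procedure:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_urls(dist_matrix, num_clusters):
    # dist_matrix[i, j] = correlation distance between URL i and URL j
    condensed = squareform(dist_matrix, checks=False)   # pairwise distances as a flat vector
    tree = linkage(condensed, method="complete")        # complete linkage bounds cluster diameter
    return fcluster(tree, t=num_clusters, criterion="maxclust")

# Usage: labels = cluster_urls(D, num_clusters=50); replicate in units of the resulting labels.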
Spatial clustering:
-- Represent the access distribution of a URL as a spatial access vector with K dimensions (K = number of client clusters)
-- Correlation distance defined as either:
   1. The Euclidean distance between two spatial access vectors in the K-dimensional space
   2. The vector similarity of two spatial access vectors A and B (see the sketch below)
Temporal clustering:
-- Divide user requests into sessions, and analyze the access patterns in each session
-- Correlation distance defined over these per-session access patterns
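A minimal sketch of the spatial distance computations; the Euclidean distance matches the slide, while reading "vector similarity" as cosine similarity is an assumption, since the slide's own formula is not shown:

import numpy as np

def spatial_access_vector(requests_by_cluster, num_client_clusters):
    # Access counts of one URL broken down over the K client clusters.
    v = np.zeros(num_client_clusters)
    for cluster_id, count in requests_by_cluster.items():
        v[cluster_id] = count
    return v

def euclidean_distance(a, b):
    return float(np.linalg.norm(a - b))

def cosine_similarity(a, b):
    # Assumed reading of "vector similarity"; a distance can be taken as 1 - similarity.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))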
Performance: spatial clustering > spatial clustering with similarity > temporal clustering
With only 1-2% of the cost of the per-URL scheme, clustering achieves performance close to per-URL replication
[Figure: performance of the various clustering approaches on the MSNBC 8/1/99 trace; (a) with 5 replicas/URL, (b) with up to 50 replicas/URL]
Determine the frequency for re-clustering/replication
Static clustering:
-- The performance gap is mostly due to emerging URLs
(1) Both clusters and replica locations based on old traces
(2) Clusters based on old traces, replica locations based on new traces
(3) Both clusters and replica locations based on new traces
Incremental clustering (sketched below):
-- Reclaim the space of cold URLs/clusters
-- Assign new URLs to existing clusters if their correlation matches, and replicate
-- Generate new clusters for the remaining new URLs, and replicate
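A hedged sketch of the "assign new URLs to existing clusters, else form new ones" step; representing each cluster by a centroid and using a fixed distance threshold are assumptions, not details from the slide:

import numpy as np

def incremental_cluster(new_url_vectors, centroids, max_distance):
    # new_url_vectors: url -> spatial access vector; centroids: cluster id -> centroid vector
    assignments, leftovers = {}, []
    for url, vec in new_url_vectors.items():
        cid, dist = min(((c, np.linalg.norm(vec - ctr)) for c, ctr in centroids.items()),
                        key=lambda x: x[1], default=(None, float("inf")))
        if cid is not None and dist <= max_distance:
            assignments[url] = cid      # join an existing cluster and inherit its replicas
        else:
            leftovers.append(url)       # seed new clusters from these, then replicate them
    return assignments, leftovers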
Backup Slides

Greedy placement algorithm:
(1) currReplicationCost = 0
(2) Initially, all URLs reside at their origin Web servers
(3) currReplicationCost = totalURLs (one origin copy per URL)
(4) For each URL, find its best replication location and the reduction in cost if the URL were replicated to that location
(5) While (currReplicationCost < maxReplicationCost) {
      Choose the URL with the largest reduction in cost, and replicate it to the designated node
      For that URL, find its next best replication location and the reduction in cost if it were replicated to that location
      currReplicationCost++
    }
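A runnable Python rendering of the pseudocode above; the gain(url, placed) callback (returning the latency reduction and best node for one more replica of url) and the heap bookkeeping are assumptions about details the slide leaves open:

import heapq

def greedy_placement(urls, gain, max_replication_cost):
    placed = {u: set() for u in urls}                  # replicas beyond the origin server
    cost = len(urls)                                   # one origin copy per URL
    heap = [(-gain(u, placed[u])[0], u) for u in urls] # max-heap on cost reduction
    heapq.heapify(heap)
    while cost < max_replication_cost and heap:
        neg_reduction, u = heapq.heappop(heap)
        if -neg_reduction <= 0:                        # no placement reduces cost any further
            break
        _, node = gain(u, placed[u])                   # best location for this URL right now
        placed[u].add(node)                            # replicate the URL to that node
        cost += 1
        heapq.heappush(heap, (-gain(u, placed[u])[0], u))  # re-evaluate this URL's next gain
    return placed

Since a URL's gain changes only when that URL itself receives a new replica, the heap entries stay valid without lazy re-evaluation.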
LimitDiameterClustering-Greedy(Uncovered_point N)
While (N is not empty) {
    Choose s ∈ N such that the K-dimensional ball centered at s with the given radius covers the largest number of URLs in N
    Output the new cluster N_s, which consists of all URLs covered by the K-dimensional ball centered at s with that radius
    N = N - N_s
}
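A runnable Python sketch of LimitDiameterClustering-Greedy; representing each URL by its spatial access vector and passing the ball radius as an explicit parameter are assumptions (the slide does not show the radius value):

import numpy as np

def limit_diameter_clustering_greedy(vectors, radius):
    # vectors: url -> K-dimensional spatial access vector
    uncovered = set(vectors)
    clusters = []
    while uncovered:
        # Choose the center whose ball covers the most still-uncovered URLs.
        best_ball = set()
        for s in uncovered:
            ball = {u for u in uncovered
                    if np.linalg.norm(vectors[u] - vectors[s]) <= radius}
            if len(ball) > len(best_ball):
                best_ball = ball
        clusters.append(best_ball)      # output the new cluster N_s
        uncovered -= best_ball          # N = N - N_s
    return clusters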