Clustering of Web Content for Efficient Replication Yan Chen, Lili Qiu, Wei Chen, Luan Nguyen and Randy H. Katz {yanchen, wychen, luann,



• CDNs (Content Distribution Networks) improve Web performance by replicating content close to the clients
• A greedy algorithm has proved efficient and effective for static replica placement, reducing the response latency of end users
• Problem: What content should be replicated?
   -- All previous work assumes replication of the whole Web site
   -- A per-URL scheme yields a 60-70% reduction in clients' latency, but is too expensive
• Goal: Exploit the tradeoff so that performance can be improved significantly without high overhead
• Our Solution:
   1. Hot data analysis to filter out infrequently used data
   2. Cluster URLs based on access patterns and replicate in units of clusters
   3. Incremental clustering + redistribution to adapt to emerging URLs and changes in clients' access patterns

• Qiu et al. and Jamin et al. independently reported that a greedy algorithm is close to optimal for static replica placement
• Much work exists on clustering Web content, but it has focused on analyses of individual client access patterns
• In contrast, we are more interested in aggregated clients
• Among the first to use stability and performance as figures of merit for Web content clustering

Problem formulation:
• Minimize the total latency of clients [formula not preserved in this transcript]
• Subject to the constraint that the total replication cost [formula not preserved in this transcript] is bounded by R, where |u| denotes the number of replicas

• Network topology:
   -- Pure-random and transit-stub models from GT-ITM
   -- A real AS-level topology from 7 widely dispersed BGP peers
• Real-world traces:
   -- Cluster MSNBC Web clients by BGP prefix
      - BGP tables from a BBNPlanet router on 01/24/ [digits lost in extraction] K clusters left; choose the top 10%, covering >70% of requests
   -- Cluster NASA Web clients by domain name
   -- Map the client clusters randomly onto the topology

Web Site   Period      Duration   Total Requests   Requests/day
MSNBC      8-10/1999   10–11am    10,284,735       1,469,248 (1 hr)
NASA       7/1995      All day    3,461,612        56,748
WorldCup   5-7/1998    All day    1,352,804,107    15,372,774

• Top 10% of URLs cover over 85% of requests
• Hot data remain stable for a reasonably long time
   -- Top 10% of URLs on a given day cover over 80% of requests for at least the subsequent week
• Conclusion:
   -- Only hot data need to be considered for replication

[Figures (MSNBC): stability of popularity ranking; stability of request coverage]
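The two coverage measurements above (same-day coverage and coverage on a later day) could be computed from request logs roughly as follows. This is a minimal sketch, not the authors' tooling: the function names `hot_urls` and `coverage`, the log format (one URL string per request), and the toy logs are all assumptions made for illustration.

```python
from collections import Counter
import math

def hot_urls(requests, fraction=0.10):
    """Return the set of the most popular `fraction` of distinct URLs (at least one)."""
    counts = Counter(requests)
    if not counts:
        return set()
    n_hot = max(1, math.ceil(fraction * len(counts)))
    return {url for url, _ in counts.most_common(n_hot)}

def coverage(hot, requests):
    """Return the share of requests that target a URL in the `hot` set."""
    requests = list(requests)
    if not requests:
        return 0.0
    return sum(1 for url in requests if url in hot) / len(requests)

# Hypothetical toy logs: 10 distinct URLs on day 1, one of them dominant.
day1 = ["/index"] * 91 + [f"/page{i}" for i in range(9)]
day2 = ["/index"] * 80 + ["/new"] * 20

hot = hot_urls(day1, 0.10)           # hot set chosen from day 1 only
same_day = coverage(hot, day1)       # coverage on the day the set was chosen
next_day = coverage(hot, day2)       # stability: coverage on a later day
```

Choosing the hot set on one day and scoring it against a later day is what separates the stability measurement from the plain popularity measurement.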

• Replication unit: per Web site, per URL, or cluster of URLs

Replication Scheme   States to Maintain   Computation Cost
Per Website          O(R)                 fp * O(R*S*C)
Per Cluster          O(R*K + M)           fp * O(K*R*(K+S*C)) + fc * O(M*K)
Per URL              O(R*M)               fp * O(M*R*(M+S*C))

Where
   M: number of hot objects
   R: number of replicas/URL
   K: number of clusters
   C: number of clients
   S: number of CDN servers
   fp: placement adaptation frequency
   fc: clustering frequency

[Figures: MSNBC]

• Big performance gap between per-Website and per-URL replication
• Clustering enables a smooth tradeoff between cost and performance
• Directory-based clustering provides only marginal improvement
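To get a feel for how far apart the three schemes in the table are, the expressions can be evaluated with concrete numbers. The sketch below treats each O(...) term as a raw count with constants dropped, and every parameter value is hypothetical, chosen only to make the arithmetic easy to follow.

```python
def states_to_maintain(M, R, K):
    """States per scheme, taken directly from the table (constants dropped)."""
    return {
        "per_website": R,
        "per_cluster": R * K + M,
        "per_url": R * M,
    }

def computation_cost(M, R, K, S, C, fp, fc):
    """Computation cost per scheme, taken directly from the table (constants dropped)."""
    return {
        "per_website": fp * (R * S * C),
        "per_cluster": fp * (K * R * (K + S * C)) + fc * (M * K),
        "per_url": fp * (M * R * (M + S * C)),
    }

# Hypothetical values: 1000 hot objects, 5 replicas/URL, 20 clusters,
# 50 CDN servers, 100 clients, one placement and one clustering pass.
params = dict(M=1000, R=5, K=20, S=50, C=100)
states = states_to_maintain(M=params["M"], R=params["R"], K=params["K"])
costs = computation_cost(fp=1, fc=1, **params)
```

With these made-up values the per-cluster scheme keeps 1,100 states against 5,000 for per-URL, and its computation count is 522,000 against 30,000,000.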

• Greedy search: iteratively choose the (object, location) pair that gives the largest performance gain per URL for replication
   -- An object can be an individual URL or a URL cluster
• Two steps:
   -- Define the correlation distance between each pair of URLs
   -- Apply the generic clustering methods below
• Generic clustering algorithms:
   -- Algorithm 1: Limit the diameter (max distance between any two URLs) of a cluster, and minimize the number of clusters
   -- Algorithm 2: Limit the number of clusters, then minimize the max diameter of all clusters

• Spatial clustering:
   -- Represent the access distribution of a URL using a spatial access vector of K dimensions (K = number of client clusters)
   -- Correlation distance defined as:
      1. Euclidean distance between two spatial access vectors in K-dimensional space
      2. Vector similarity of two spatial access vectors A & B [formula not preserved in this transcript]
• Temporal clustering:
   -- Divide user requests into sessions, and analyze the access patterns in each session
   -- Correlation distance defined as: [formula not preserved in this transcript]
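The two spatial measures can be sketched on small access vectors. The Euclidean distance follows directly from the slide. The vector-similarity formula did not survive in this transcript, so the sketch assumes the common cosine form (dot product divided by the product of the norms), which may differ from the authors' definition. The function names and the example vectors are hypothetical.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two spatial access vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("access vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine similarity (ASSUMED form of 'vector similarity'); 0.0 for a zero vector."""
    if len(a) != len(b):
        raise ValueError("access vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical access counts from K = 3 client clusters.
url_a = [3, 4, 0]
url_b = [0, 0, 5]   # accessed by a disjoint set of client clusters
url_c = [6, 8, 0]   # same access pattern as url_a at twice the volume

d_ab = euclidean_distance(url_a, url_b)
s_ab = cosine_similarity(url_a, url_b)
s_ac = cosine_similarity(url_a, url_c)
```

The example shows the practical difference: `url_a` and `url_c` are 5 apart in Euclidean terms yet have similarity 1 under the cosine form, because the latter ignores request volume and compares only the shape of the access distribution.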

• Performance: spatial clustering > spatial clustering with similarity > temporal clustering
• With only 1-2% of the cost of the URL-based scheme, achieves performance close to URL-based replication

[Figure: Performance of various clustering approaches for the MSNBC 8/1/99 trace. a) With 5 replicas/URL. b) Can run up to 50 replicas/URL]

• Determine the frequency for re-clustering/replicating
• Static clustering:
   -- Performance gap mostly due to the emerging URLs
   (1) Both clusters and replica locations based on old traces
   (2) Clusters based on old traces and replica locations based on new traces
   (3) Both clusters and replica locations based on new traces
• Incremental clustering:
   -- Reclaim the space of cold URLs/clusters
   -- Assign new URLs to existing clusters if the correlation matches, and replicate
   -- Generate new clusters for the remaining new URLs, and replicate
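The three incremental steps can be sketched as a single update pass. The slides do not say how a "correlation match" is tested, so this sketch assumes one plausible reading: a new URL joins the existing cluster whose centroid is nearest in Euclidean distance, provided that distance is within a threshold. The name `incremental_update`, the `max_dist` parameter, and the data layout are hypothetical, and replication itself is left out.

```python
import math

def incremental_update(clusters, vectors, hot_urls, max_dist):
    """clusters: {cluster_id: set of URLs}; vectors: {url: access vector};
    hot_urls: set of currently hot URLs. Returns (updated clusters, unassigned URLs)."""
    # Step 1: reclaim space by dropping cold URLs, then clusters left empty.
    kept = {}
    for cid, members in clusters.items():
        still_hot = set(members) & hot_urls
        if still_hot:
            kept[cid] = still_hot

    # Centroids are computed once, before any new URL is added (a simplification).
    def centroid(members):
        vecs = [vectors[u] for u in members]
        return [sum(col) / len(vecs) for col in zip(*vecs)]
    centers = {cid: centroid(m) for cid, m in kept.items()}

    # Step 2: assign each new URL to the nearest cluster if it is close enough.
    already = set().union(*kept.values())
    leftovers = []
    for url in sorted(hot_urls - already):
        best = min(centers,
                   key=lambda cid: math.dist(centers[cid], vectors[url]),
                   default=None)
        if best is not None and math.dist(centers[best], vectors[url]) <= max_dist:
            kept[best].add(url)
        else:
            # Step 3 input: these URLs still need new clusters of their own.
            leftovers.append(url)
    return kept, leftovers

# Hypothetical example: "/old" went cold, "/b" and "/c" are newly hot.
clusters = {"c1": {"/a", "/old"}}
vectors = {"/a": [1.0, 0.0], "/old": [1.0, 0.0], "/b": [1.0, 0.1], "/c": [0.0, 5.0]}
updated, leftovers = incremental_update(clusters, vectors, {"/a", "/b", "/c"}, max_dist=0.5)
```

The leftover URLs would then be passed to one of the generic clustering algorithms to form new clusters.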

Backup Slides

(1) currReplicationCost = 0
(2) Initially, all the URLs reside at the origin Web servers
(3) currReplicationCost = totalURL
(4) For each URL, find its best replication location and the amount of reduction in cost if the URL were replicated to that location
(5) While (currReplicationCost < maxReplicationCost) {
       Choose the URL that has the largest reduction in cost, and replicate the URL to the designated node
       For that URL, find its best replication location and the amount of reduction in cost if the URL were replicated to that location
       currReplicationCost++
    }
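The greedy placement steps (1) to (5) above can be sketched in runnable form. The slides do not define the cost function, so this sketch assumes that a URL's cost is the request-weighted latency from each client to its nearest replica. The function name, the `demand` and `dist` layouts, and the early exit when no move reduces cost are all assumptions added for illustration.

```python
def greedy_placement(urls, nodes, origin, demand, dist, max_cost):
    """demand: {(client_node, url): request count}; dist: {a: {b: latency}}.
    Returns {url: set of nodes holding a replica}."""
    # Steps (2)-(3): every URL starts with one copy at the origin.
    replicas = {u: {origin} for u in urls}
    cost = len(urls)

    def latency(u, locations):
        # ASSUMED cost model: each client fetches from its nearest replica.
        return sum(n * min(dist[c][r] for r in locations)
                   for (c, url), n in demand.items() if url == u)

    def best_move(u):
        # Best new location for u and the cost reduction it would bring.
        base = latency(u, replicas[u])
        candidates = [(base - latency(u, replicas[u] | {n}), n)
                      for n in nodes if n not in replicas[u]]
        return max(candidates, default=(0, None))

    gains = {u: best_move(u) for u in urls}            # step (4)
    while cost < max_cost:                             # step (5)
        u = max(urls, key=lambda x: gains[x][0])
        gain, node = gains[u]
        if node is None or gain <= 0:
            break  # not in the slide: stop when no placement helps
        replicas[u].add(node)
        gains[u] = best_move(u)                        # refresh only the URL just placed
        cost += 1
    return replicas

# Hypothetical three-node topology and demand.
nodes = ["origin", "east", "west"]
dist = {
    "origin": {"origin": 0, "east": 10, "west": 10},
    "east": {"origin": 10, "east": 0, "west": 20},
    "west": {"origin": 10, "east": 20, "west": 0},
}
demand = {("east", "/hot"): 100, ("west", "/hot"): 10, ("east", "/cold"): 1}
placement = greedy_placement(["/hot", "/cold"], nodes, "origin", demand, dist, max_cost=3)
```

With a budget of three copies (two at the origin plus one replica), the single extra replica goes to `/hot` at `east`, where it removes the most latency.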

LimitDiameterClustering-Greedy(Uncovered_point N)
While (N is not empty) {
   Choose s ∈ N such that the K-dimensional ball centered at s with radius [value not preserved in this transcript] covers the largest number of URLs in N
   Output the new cluster N_s, which consists of all URLs covered by the K-dimensional ball centered at s with that radius
   N = N - N_s
}
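A runnable sketch of this greedy covering loop follows. The radius value is missing from the transcript, so it is exposed here as a plain `radius` parameter. Euclidean distance between access vectors, the function name, and the example data are likewise assumptions. A ball of radius r bounds the cluster diameter by 2r.

```python
import math

def limit_diameter_clustering(vectors, radius):
    """vectors: {url: access vector}. Returns a list of clusters (sets of URLs)."""
    if radius < 0:
        raise ValueError("radius must be non-negative")
    uncovered = dict(vectors)
    clusters = []
    while uncovered:
        def ball(center):
            # All still-uncovered URLs within `radius` of the candidate center.
            return {u for u, v in uncovered.items()
                    if math.dist(uncovered[center], v) <= radius}
        # Pick the center whose ball covers the most uncovered URLs;
        # sorting makes tie-breaking deterministic.
        s = max(sorted(uncovered), key=lambda c: len(ball(c)))
        members = ball(s)
        clusters.append(members)
        for u in members:
            del uncovered[u]
    return clusters

# Hypothetical 2-dimensional access vectors.
vectors = {"/a": [0, 0], "/b": [1, 0], "/c": [0, 1], "/d": [10, 10]}
result = limit_diameter_clustering(vectors, radius=1.0)
```

Each center lies in its own ball, so every iteration removes at least one URL and the loop terminates.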