Published by Ethan Lang. Modified over 8 years ago.
1
On Network-Aware Clustering of Web Clients
Balachander Krishnamurthy, bala@research.att.com, AT&T Labs-Research, Florham Park, NJ, USA
Jia Wang, jiawang@cs.cornell.edu, Cornell University, Ithaca, NY, USA
2
ACM SIGCOMM'2000: On Network-Aware Clustering of Web Clients
Outline
- Introduction
- Simple approaches to clustering
- Network-aware approach
- Applications of client clustering
- Conclusion and future work
3
Introduction
- Original goal: identify the groups of clients that are responsible for a significant portion of a Web site's requests
- Cluster
  - Non-overlapping
  - Topologically close
  - Under common administrative control
- But identifying clusters requires knowledge that is not available to anyone outside the administrative entities
- Network-aware approach: BGP based
4
Simple approaches
Two approaches:
1. Use traditional Class A, Class B and Class C networks
2. Assume prefix length is 24 bits
They are simple, but do not give good results (~50% accuracy).
Counterexample: three hosts that fall in the same Class B network and the same /24, yet belong to three different /28 networks:

| IP address | Name | Prefix/netmask |
|---|---|---|
| 151.198.194.17 | client-151-198-194-17.bellatlantic.net | 151.198.194.16/28 |
| 151.198.194.34 | mailsrv1.wakefern.com | 151.198.194.32/28 |
| 151.198.194.50 | firewall.commonhealthusa.com | 151.198.194.48/28 |
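The counterexample can be checked with a short sketch. It assumes Python's standard `ipaddress` module; the helper names `classful_prefix` and `fixed24_prefix` are illustrative and do not come from the talk. Both simple approaches put the three hosts into a single cluster, although their actual prefixes are three distinct /28 networks.

```python
import ipaddress

def classful_prefix(ip: str) -> ipaddress.IPv4Network:
    """Cluster key under the traditional Class A/B/C rule (illustrative helper)."""
    first_octet = int(ip.split(".")[0])
    if first_octet < 128:      # Class A
        length = 8
    elif first_octet < 192:    # Class B
        length = 16
    else:                      # Class C (Class D/E are lumped in here for brevity)
        length = 24
    return ipaddress.ip_network(f"{ip}/{length}", strict=False)

def fixed24_prefix(ip: str) -> ipaddress.IPv4Network:
    """Cluster key when every prefix is assumed to be 24 bits long."""
    return ipaddress.ip_network(f"{ip}/24", strict=False)

# Rows of the counterexample table: (client IP, actual prefix).
hosts = [
    ("151.198.194.17", "151.198.194.16/28"),
    ("151.198.194.34", "151.198.194.32/28"),
    ("151.198.194.50", "151.198.194.48/28"),
]

classful = {classful_prefix(ip) for ip, _ in hosts}
fixed24 = {fixed24_prefix(ip) for ip, _ in hosts}
actual = {ipaddress.ip_network(prefix) for _, prefix in hosts}

# Number of clusters produced by each rule versus the real prefixes.
print(len(classful), len(fixed24), len(actual))  # prints: 1 1 3
```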
5
Network-aware approach
- Use BGP routing and forwarding table snapshots
- Routing table entries (prefixes) identify clusters
Example snapshot of BGP routing table:

| Prefix | Prefix description | Next hop | AS path | Peer AS description |
|---|---|---|---|---|
| 6.0.0.0/8 | Army Information System Center | cs.ny-nap.vbns.net | 7170 1455 (IGP) | AT&T Government Markets |
| 12.0.48.0/20 | Harvard University | cs.cht.vbns.net | 1742 (IGP) | Harvard University |
| 18.0.0.0/8 | Massachusetts Institute of Technology | cs.cht.vbns.net | 3 (IGP) | Massachusetts Institute of Technology |
6
Automated process
Clustering process (flow diagram):
- Inputs: source of IP addresses; BGP routing tables
- IP address extraction -> IP addresses
- Prefix extraction, unification, merging -> prefix table
- Client cluster identification -> raw client clusters
- Validation (optional) and examining impact of network dynamics -> client clusters
- Self-correction and adaptation
7
Network prefix extraction
- Prefix entry extraction (BGP tables from 14 places via automated scripts)
  - AADS, MAE-EAST, MAE-WEST, PACBELL, PAIX, ARIN, AT&T-Forw, AT&T-BGP, CANET, CERFNET, NLANR, OREGON, SINGAREN, and VBNS
- Prefix format unification and merging
  - Three formats:
    - x1.x2.x3.x4/k1.k2.k3.k4
    - x1.x2.x3.x4/m
    - x1.x2.x3.0
- Assembled a total of 391,497 unique prefix entries (412,109 entries by 7/24/2000)
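A minimal sketch of the unification and merging step, using Python's standard `ipaddress` module. The function names are hypothetical. The slide does not say how the third format (a bare address with no mask) is interpreted; the sketch assumes it abbreviates the classful network, and that assumption should be checked against the actual table sources.

```python
import ipaddress

def unify_prefix(entry: str) -> ipaddress.IPv4Network:
    """Convert one routing-table prefix entry to CIDR form (hypothetical helper)."""
    entry = entry.strip()
    if "/" in entry:
        # Handles both x1.x2.x3.x4/k1.k2.k3.k4 (netmask) and x1.x2.x3.x4/m (length).
        return ipaddress.ip_network(entry, strict=False)
    # ASSUMPTION: a bare address stands for its classful network.
    first_octet = int(entry.split(".")[0])
    length = 8 if first_octet < 128 else 16 if first_octet < 192 else 24
    return ipaddress.ip_network(f"{entry}/{length}", strict=False)

def merge_prefix_tables(*tables):
    """Merge several tables into one set of unique, unified prefixes."""
    merged = set()
    for table in tables:
        for entry in table:
            merged.add(unify_prefix(entry))
    return merged

# Two made-up table excerpts that list overlapping entries in different formats.
table_a = ["12.0.48.0/20", "6.0.0.0/255.0.0.0"]
table_b = ["12.0.48.0/255.255.240.0", "18.0.0.0"]

prefix_table = merge_prefix_tables(table_a, table_b)
for network in sorted(prefix_table):
    print(network)
```

Storing unified prefixes in a set makes duplicates across sources collapse automatically, which is all that "merging" means in this sketch.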
8
Client cluster identification
- Methodology
  1. Extract the client IP addresses from the server log
  2. Perform longest prefix matching on each client IP address
  3. Classify all client IP addresses that have the same longest matched prefix into one client cluster
- Experiments
  - Experiments on a wide range of Web server logs
- Results
  - > 99% of clients can be grouped into clusters
  - ~90% of sampled clusters passed our validation tests
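The three methodology steps can be sketched as follows. This is an illustration only, not the authors' implementation: it tries prefix lengths from 32 down to 0 against a set of known prefixes, which is simple but slower than a trie. The function names, the sample client list, and the 151.198.0.0/16 entry are assumptions added to show that the longest match wins.

```python
import ipaddress
from collections import defaultdict

def longest_match(ip: str, table: set):
    """Return the longest prefix in `table` that contains `ip`, or None."""
    for length in range(32, -1, -1):
        candidate = ipaddress.ip_network(f"{ip}/{length}", strict=False)
        if candidate in table:
            return candidate
    return None

def cluster_clients(client_ips, prefixes):
    """Group client IPs by their longest matched prefix.

    Clients that match no prefix are collected under the key None.
    """
    table = {ipaddress.ip_network(p) for p in prefixes}
    clusters = defaultdict(set)
    for ip in client_ips:
        clusters[longest_match(ip, table)].add(ip)
    return dict(clusters)

# The /16 entry is hypothetical; the /28 entries appear in the counterexample slide.
prefixes = ["151.198.0.0/16", "151.198.194.16/28", "151.198.194.32/28"]
# Hypothetical client IPs, as if already extracted from a server log.
clients = ["151.198.194.17", "151.198.194.18", "151.198.194.34",
           "151.198.7.7", "10.1.2.3"]

clusters = cluster_clients(clients, prefixes)
for prefix, members in clusters.items():
    print(prefix, sorted(members))
```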
9
Server logs used in our experiments

| Log | Description | Date | Duration (days) | # requests | # clients | # clusters |
|---|---|---|---|---|---|---|
| Apache | Apache site | 10/1/99-11/18/99 | 49 | 3,461,361 | 51,536 | 35,563 |
| Ew3 | AT&T content hosting site | 7/1/99-7/31/99 | 31 | 1,199,276 | 21,519 | 7,754 |
| Nagano | 1998 Winter Olympic Games | 2/13/98 | 1 | 11,665,713 | 59,582 | 9,853 |
| Sun | Sun Microsystems site | 9/30/97-10/9/97 | 9 | 13,871,352 | 219,528 | 33,468 |
10
Example: Nagano server log
11
Example: Nagano server log (cont.)
12
Validation of clustering
- Validation is a fundamentally difficult problem
  - A client cluster may be mis-identified by being too large or too small
- Two approaches
  - nslookup-based test
  - Optimized traceroute-based test
- Results on a 1% sample of client clusters
  - A client cluster counts as mis-identified if even one client in the cluster does not share the same name suffix with the others
  - Error rate of network-aware approach: ~10%
  - Error rate of simple approach: ~50%
- Possible reasons for mis-clustering: route aggregation, national gateway proxies
- Effect of BGP prefix changes: < 3% (during 2 weeks)
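The suffix criterion behind the nslookup-based test can be sketched as below. The slide does not define "suffix" precisely, so the sketch assumes it means the last two labels of the resolved host name; it also assumes names were already resolved and that clients without a name are skipped. The function names and the `example.edu` hosts are hypothetical.

```python
def name_suffix(hostname: str, labels: int = 2) -> str:
    """Return the last `labels` dot-separated labels of a host name.

    ASSUMPTION: two labels stand in for the "suffix" mentioned on the slide.
    """
    parts = hostname.lower().rstrip(".").split(".")
    return ".".join(parts[-labels:])

def cluster_passes(hostnames, labels: int = 2) -> bool:
    """A cluster passes only if every resolved client shares one suffix."""
    suffixes = {name_suffix(h, labels) for h in hostnames if h}
    return len(suffixes) <= 1

# The three hosts from the counterexample slide, as the /24 rule would group them.
simple_cluster = [
    "client-151-198-194-17.bellatlantic.net",
    "mailsrv1.wakefern.com",
    "firewall.commonhealthusa.com",
]
# A hypothetical cluster whose clients share an administrative domain.
good_cluster = ["a.cs.example.edu", "b.ee.example.edu", None]

print(cluster_passes(simple_cluster))  # prints: False
print(cluster_passes(good_cluster))    # prints: True
```

A single mismatching client fails the whole cluster, which mirrors the strict rule stated on the slide.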
13
Applications
- Web caching, content distribution, server replication, traffic management and load balancing, Internet map discovery, etc.
- Example: Web caching
  - Client classification: normal client, proxy, and spider
  - Identifying spiders/proxies based on access patterns
14
Detecting proxy/spider
15
Thresholding client clusters
- Metric: number of requests issued from within a client cluster
- Busy clusters: the clusters that together account for 70% of the total requests in the server log
- Web caching simulation

| Log | # requests | # clients | # clusters | # busy clusters | Accuracy |
|---|---|---|---|---|---|
| Apache | 3,461,361 | 51,536 | 35,563 | 2,869 | 92% |
| Ew3 | 1,199,276 | 21,519 | 7,754 | 1,600 | 96% |
| Nagano | 11,665,713 | 59,582 | 9,853 | 717 | 90% |
| Sun | 13,871,352 | 219,528 | 33,468 | 2,536 | 91% |
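One plausible reading of the thresholding step, sketched in Python: rank clusters by request count and keep the top ones until they cover 70% of all requests. The ranking rule, the function name `busy_clusters`, and the sample counts are assumptions for illustration; the slide states only the metric and the 70% figure.

```python
def busy_clusters(requests_per_cluster: dict, fraction: float = 0.70) -> list:
    """Return the highest-volume clusters that together reach `fraction` of requests."""
    total = sum(requests_per_cluster.values())
    ranked = sorted(requests_per_cluster.items(),
                    key=lambda item: item[1], reverse=True)
    busy, covered = [], 0
    for cluster, count in ranked:
        if covered >= fraction * total:
            break  # the clusters chosen so far already cover the target share
        busy.append(cluster)
        covered += count
    return busy

# Hypothetical request counts per cluster (not taken from the server logs).
counts = {"A": 50, "B": 30, "C": 15, "D": 5}
print(busy_clusters(counts))  # prints: ['A', 'B']
```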
16
New dataset
- Altavista server log containing 60,011,458 requests issued by 2,503,974 clients all over the world
  - # clusters: 100,091
  - # busy clusters: 242
  - Accuracy: 91%
- Clustering works on large, general portal site data
- Thanks to Altavista for sharing data with us. The data included only client IP addresses, with no personally identifiable information.
17
Conclusion and future work
- Network-aware client clustering
  - Based on BGP routing table snapshots
  - Ability to cluster > 99% of clients in the server logs
  - Error rate is ~10% (~50% for the simple approach)
  - Largely immune to BGP dynamics (prefix changes affected < 3% over 2 weeks)
  - Variety of applications
- Ongoing work
  - Online algorithm
  - Super/sub clustering
  - Server clustering
  - Server replication application
- Future work
  - Better validation
  - Lower error rate
  - Other applications
18
Acknowledgement
Thanks to the following people for helping us with this project:
- Jennifer Rexford
- Anja Feldmann
- Tim Griffin
- Bill Manning
- Vern Paxson
- Craig Labovitz
- Thomas Narten
- Steven Bellovin
- Emden Gansner
- Nick Duffield
- S. Keshav
- Walter Willinger