Published by Andrea Haynes. Modified over 8 years ago.
1
How to Evaluate the Effectiveness of URL Normalizations
  Sang Ho Lee, Sung Jin Kim, Hyo Sook Jeong
  In Proceedings of the Third International Conference on Human.Society@Internet (HSI)
2
Contents
  Abstract
  Introduction
  URL Normalizations
  Evaluation of a URL Normalization Method
  Empirical Evaluation
  Conclusions and Future Works
3
Abstract
  Syntactically different URLs can represent the same web page
  Duplicate representations cause the same web pages to be handled many times unnecessarily
  URL normalization helps eliminate duplicate URLs
  This paper presents a method for evaluating the effectiveness of a URL normalization method
4
Introduction
  URL (Uniform Resource Locator)
    A string that represents a web resource (a web page)
  Equivalent URLs
    Two or more URLs that locate the same web page
  Failing to recognize that two equivalent URLs are equivalent causes a large amount of processing overhead
5
Introduction (2)
  False negative
    Determining equivalent URLs not to be equivalent
  False positive
    Determining non-equivalent URLs to be equivalent
6
Introduction (3)
  URL normalizations [5]
    Transform syntactically different but equivalent URLs into a syntactically identical string
  The three types of URL normalization
    Syntax-based normalization
    Scheme-based normalization
    Protocol-based normalization
  The first two types reduce false negatives while strictly avoiding false positives
  The standards community does not give specific methods for protocol-based normalization [6]
7
Introduction (4)
  Extended normalization methods (1) [6]
    Changing letters in the path component into lower-case or upper-case letters
      http://acm.org/PUBS/journals.html -> http://acm.org/pubs/journals.html
    Attaching the "www" prefix to, or eliminating it from, the host component
      http://ssu.ac.kr -> http://www.ssu.ac.kr
    Eliminating the last slash symbol from URLs
      http://www.acm.org/pubs/ -> http://www.acm.org/pubs
    Eliminating default page names in the path component
      http://www.acm.org/index.htm -> http://www.acm.org/
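The four extended methods listed on this slide can each be sketched as a small string transformation. The following is a minimal illustration using Python's standard `urllib.parse`, not the authors' implementation. The function names, the `DEFAULT_PAGES` set, and the choice to always attach (rather than eliminate) the "www" prefix are assumptions made for the example.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed list of default page names; real servers may use others.
DEFAULT_PAGES = {"index.htm", "index.html"}


def lowercase_path(url: str) -> str:
    """Change letters in the path component into lower-case letters."""
    p = urlsplit(url)
    return urlunsplit((p.scheme, p.netloc, p.path.lower(), p.query, p.fragment))


def add_www(url: str) -> str:
    """Attach the "www" prefix to the host if it is missing."""
    p = urlsplit(url)
    # Keep any "user@" part untouched and only look at host[:port].
    userinfo, at, hostport = p.netloc.rpartition("@")
    if not hostport.lower().startswith("www."):
        hostport = "www." + hostport
    return urlunsplit((p.scheme, userinfo + at + hostport, p.path, p.query, p.fragment))


def strip_trailing_slash(url: str) -> str:
    """Eliminate the last slash of the path (the root path "/" is kept)."""
    p = urlsplit(url)
    path = p.path[:-1] if len(p.path) > 1 and p.path.endswith("/") else p.path
    return urlunsplit((p.scheme, p.netloc, path, p.query, p.fragment))


def strip_default_page(url: str) -> str:
    """Eliminate a default page name at the end of the path."""
    p = urlsplit(url)
    head, _, last = p.path.rpartition("/")
    path = head + "/" if last.lower() in DEFAULT_PAGES else p.path
    return urlunsplit((p.scheme, p.netloc, path, p.query, p.fragment))


print(lowercase_path("http://acm.org/PUBS/journals.html"))
print(add_www("http://ssu.ac.kr"))
print(strip_trailing_slash("http://www.acm.org/pubs/"))
print(strip_default_page("http://www.acm.org/index.htm"))
```

Each function may produce a false positive: the rewritten URL is not guaranteed to locate the same page as the original, which is why the effect of such methods has to be measured.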
8
Introduction (5)
  Extended normalization methods (2)
    Allow false positives
      May lose, gain, or change web pages unintentionally
    Reduce the total number of URLs in operation
  This paper presents a scheme for evaluating the effectiveness of URL normalization methods
    URL reduction rate
    Web page loss/gain/change rate
    94 million URLs (20,799 web sites in Korea)
    Helps select normalization methods
9
URL Normalizations
  URL components
    scheme: protocol (here, Hypertext Transfer Protocol)
    authority: user information, host, port
    path: directories
    query: parameter names, values
    fragment: particular part of a document
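As a quick illustration of these components, Python's standard `urllib.parse.urlsplit` breaks a URL into the same parts. The sample URL and the helper name `url_components` are made up for this example.

```python
from urllib.parse import urlsplit


def url_components(url: str) -> dict:
    """Return the components of a URL, with the authority split further."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "user": parts.username,   # user information (None when absent)
        "host": parts.hostname,   # host name, lower-cased by urlsplit
        "port": parts.port,       # integer port (None when absent)
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }


components = url_components("http://user@example.com:8080/a/b.htm?x=1&y=2#chap1")
for name, value in components.items():
    print(f"{name}: {value}")
```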
10
Standard URL Normalizations
  A process that transforms a URL into a canonical form
  Syntax-based normalization
    Characters in the scheme and host components are converted into lower-case letters
      HTTP://EXAMPLE.com -> http://example.com
    Percent-encoded unreserved characters (i.e., uppercase and lowercase letters, decimal digits, …) should be decoded
      http://example.com/%7Esmith -> http://example.com/~smith
    Path segments "." and ".." are removed appropriately
      http://example.com/a/b/./../c.htm -> http://example.com/a/c.htm
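The three syntax-based steps can be sketched as follows. This is a simplified illustration under stated assumptions, not a complete RFC 3986 implementation: the function names are hypothetical, only the path is percent-decoded, and the path is assumed to be empty or to start with "/".

```python
import re
import string
from urllib.parse import urlsplit, urlunsplit

# Unreserved characters: letters, decimal digits, "-", ".", "_", "~".
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def decode_unreserved(text: str) -> str:
    """Decode %XX escapes only when they stand for an unreserved character."""
    def repl(match: re.Match) -> str:
        ch = chr(int(match.group(1), 16))
        return ch if ch in UNRESERVED else match.group(0)
    return re.sub(r"%([0-9A-Fa-f]{2})", repl, text)


def remove_dot_segments(path: str) -> str:
    """Resolve "." and ".." segments of an absolute (or empty) path."""
    segments = path.split("/")
    out = []
    for seg in segments:
        if seg == ".":
            continue
        if seg == "..":
            if len(out) > 1:      # never pop the leading empty segment
                out.pop()
            continue
        out.append(seg)
    if segments[-1] in (".", ".."):
        out.append("")            # keep the trailing slash, e.g. "/a/b/.." -> "/a/"
    return "/".join(out)


def syntax_normalize(url: str) -> str:
    parts = urlsplit(url)
    # Lower-case the host (and port) but leave any user information as-is.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    netloc = userinfo + at + hostport.lower()
    path = remove_dot_segments(decode_unreserved(parts.path))
    return urlunsplit((parts.scheme.lower(), netloc, path, parts.query, parts.fragment))


print(syntax_normalize("HTTP://EXAMPLE.com"))
print(syntax_normalize("http://example.com/%7Esmith"))
print(syntax_normalize("http://example.com/a/b/./../c.htm"))
```

Decoding runs before dot-segment removal so that an escaped dot such as `%2E` is treated like a literal ".".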
11
Standard URL Normalizations (2)
  Scheme-based normalization
    The default port number is removed from the URL
      http://example.com:80/ -> http://example.com/
    If the path string is null, it is transformed into "/"
      http://example.com -> http://example.com/
    The fragment in the URL is removed
      http://example.com/list.htm#chap1 -> http://example.com/list.htm
  Protocol-based normalization
    Appropriate only when equivalence is indicated by both the result of accessing the resources and the common conventions of their scheme's dereference algorithm
      http://example.com/a/b -> http://example.com/a/b/
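The scheme-based rules on this slide can be sketched in a few lines. The function name `scheme_normalize` and the `DEFAULT_PORTS` table are assumptions for illustration; the slide only discusses HTTP, and the HTTPS entry is added here for completeness.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports per scheme (the "https" entry is an addition to the slide).
DEFAULT_PORTS = {"http": 80, "https": 443}


def scheme_normalize(url: str) -> str:
    parts = urlsplit(url)
    netloc = parts.netloc
    # Drop the port only when it equals the default port of the scheme.
    host, sep, port = netloc.rpartition(":")
    if sep and port.isdigit() and int(port) == DEFAULT_PORTS.get(parts.scheme):
        netloc = host
    path = parts.path or "/"      # a null path becomes "/"
    # An empty last element removes the fragment.
    return urlunsplit((parts.scheme, netloc, path, parts.query, ""))


print(scheme_normalize("http://example.com:80/"))
print(scheme_normalize("http://example.com"))
print(scheme_normalize("http://example.com/list.htm#chap1"))
```

Protocol-based normalization is not sketched because it depends on the result of actually accessing the resources (for example, observing a redirect), not on the URL string alone.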
12
Extended URL Normalizations
  Standard normalization
    No false positives
    High possibility of false negatives
  Web applications (such as web crawlers) handle a huge number of URLs
    Reducing the possibility of false negatives reduces the number of URLs that need to be considered
      http://www.acm.org/
      http://www.acm.org/index.html
  Extended URL normalization
    Significantly reduces the possibility of false negatives
    Allows false positives to a limited degree
  How can the effectiveness of an extended normalization method be evaluated precisely?
13
Evaluation of a URL Normalization Method
  Two different points of view
    How many URLs are eliminated
    How many pages are lost, gained, or changed
  Suppose a normalization transforms a given URL u1 in its original form into a URL u2 in a canonical form
    u1 and u2 locate web pages p1 and p2 on the web, respectively
  There are ten cases to consider in total
14
Evaluation of a URL Normalization Method (2)
  Lose a web page (2, 4, 9)
  Gain a web page (8) or get a different page (7)
  False positives (2, 4, 7, 8, 9)
15
Evaluation of a URL Normalization Method (3)
  (1) Page p1 exists on the web
    (A) Page p2 does not exist (4, 9)
      False positive; lose one page, p1
    (B) Page p2 exists; p1 and p2 are the same page (1, 6)
      No false positive; save one page request
    (C) Page p2 exists; p1 and p2 are not the same page (7, 2)
      False positive; loss (2) or loss and gain (7)
  (2) Page p1 does not exist
    (A) URL u2 is already known to us (3, 5)
      Do not lose any pages; save one page request
    (B) URL u2 is not known to us (8, 10)
      Gain one web page (8) or lose nothing (10)
      The number of page requests remains unchanged
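The case analysis on this slide can be read as a small decision function. The sketch below is one interpretation, not the authors' code: pages are represented by a content identifier (`None` when the page does not exist), the outcome labels are chosen for the example, and in case (1)(C) it is assumed that what separates a plain loss from a loss and gain is whether u2 was already known.

```python
from typing import Optional


def classify(p1: Optional[str], p2: Optional[str], u2_known: bool) -> str:
    """Classify the effect of normalizing u1 (page p1) into u2 (page p2).

    Returns one of "loss", "gain", "change", or "none".
    """
    if p1 is not None:                 # (1) page p1 exists
        if p2 is None:                 # (A) p2 does not exist
            return "loss"
        if p1 == p2:                   # (B) same page, one request saved
            return "none"
        # (C) different pages: p1 is lost; p2 counts as new only if u2
        # was not already known (assumption, see lead-in).
        return "loss" if u2_known else "change"
    if u2_known:                       # (2)(A) nothing lost, one request saved
        return "none"
    # (2)(B) u2 not known: a page is gained only if p2 exists.
    return "gain" if p2 is not None else "none"


print(classify("page-A", None, False))      # p1 exists, p2 does not
print(classify("page-A", "page-B", False))  # different pages, u2 unknown
print(classify(None, "page-B", False))      # only p2 exists, u2 unknown
```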
16
Evaluation of a URL Normalization Method (4)
  To evaluate the effectiveness of a URL normalization, we propose a number of metrics
  Let N be the total number of URLs considered
    Page loss rate = total number of lost pages / N
    Page gain rate = total number of gained pages / N
    Page change rate = total number of changed pages / N
    Page non-loss rate = total number of non-loss pages / N
  Reduction of URLs
    URL reduction rate = 1 - (number of unique URLs after normalization / number of unique URLs before normalization)
    If we normalize 100 distinct URLs into 90 distinct URLs, the URL reduction rate is 0.1 (1 - 90/100, or 10%)
  A good normalization method has
    A high URL reduction rate
    Low page loss/gain/change rates
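The metrics on this slide are simple ratios and can be computed directly. The sketch below assumes each URL has already been given one outcome label by the caller; the label strings, function names, and sample data are made up for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List

LABELS = ("loss", "gain", "change", "non-loss")


def url_reduction_rate(before: Iterable[str], after: Iterable[str]) -> float:
    """1 - (unique URLs after normalization / unique URLs before)."""
    return 1 - len(set(after)) / len(set(before))


def page_rates(outcomes: List[str]) -> Dict[str, float]:
    """Rate of each outcome label over N, the number of URLs considered."""
    n = len(outcomes)
    counts = Counter(outcomes)
    return {label: counts[label] / n for label in LABELS}


# 100 distinct URLs that collapse into 90 distinct URLs after normalization.
before = [f"http://example.com/p{i}" for i in range(100)]
after = [f"http://example.com/p{i % 90}" for i in range(100)]
rate = url_reduction_rate(before, after)
print(f"URL reduction rate: {rate:.2f}")

# Hypothetical outcomes for N = 10 URLs.
rates = page_rates(["non-loss"] * 7 + ["loss"] * 2 + ["gain"])
print(rates)
```

The reduction rate is formatted to two decimals because the floating-point result of `1 - 90/100` is only approximately 0.1.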