Published by Frederick Toby Fowler. Modified over 9 years ago.
1
Learning URL Patterns for Webpage De-duplication
Authors: Hema Swetha Koppula…
WSDM 2010
Reporter: Jing Chiu
Email: D9815013@mail.ntust.edu.tw
2015/12/5
Data Mining & Machine Learning Lab
2
Outline
Introduction
  ▫Duplicate URLs
  ▫Problem Definition
Related Works
Algorithms
  ▫URL Preprocessing
  ▫Rule Generation
Evaluation
Conclusions
3
Introduction
Duplicate URLs
Problem Definition
4
Duplicate URLs
Making URLs search engine friendly
  ▫http://en.wikipedia.org/wiki/Casino_Royale
  ▫http://en.wikipedia.org/?title=Casino_Royale
Session-id or cookie information present in URLs
  ▫http://cs.stanford.edu/degrees/mscs/faq/index.php?sid=67873&cat=8
  ▫http://cs.stanford.edu/degrees/mscs/faq/index.php?sid=78813&cat=8
Irrelevant or superfluous components in URLs
  ▫http://www.amazon.com/Lord-Rings/dp/B000634DCW
  ▫http://www.amazon.com/dp/B000634DCW
Webmasters construct URL representations with custom delimiters
  ▫http://catalog.ebay.com/The-Grudge_UPC_043396062603_W0QQ_fclsZ1QQ_pcatidZ1QQ_pidZ43973351QQ_tabZ2
  ▫http://catalog.ebay.com/The-Grudge_UPC_043396062603_W0?_fcls=1&_pcatid=1&_pid=43973351&_tab=2
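The two eBay URLs above differ only in how the query arguments are encoded. As a rough illustration, the sketch below rewrites the custom-delimited form into the standard one, assuming that `QQ` separates arguments and `Z` separates each name from its value. That reading is inferred from this single pair of URLs, and the function name `decode_custom_delimiters` is hypothetical.

```python
def decode_custom_delimiters(url, arg_sep="QQ", kv_sep="Z"):
    """Rewrite 'baseQQk1Zv1QQk2Zv2' as 'base?k1=v1&k2=v2'.

    arg_sep and kv_sep are assumed delimiters inferred from one example,
    not a general description of how any site encodes its URLs.
    """
    base, *args = url.split(arg_sep)
    if not args:
        return url  # no custom delimiter found; leave the URL untouched
    pairs = []
    for arg in args:
        # Split each argument into name and value at the first kv_sep.
        name, _, value = arg.partition(kv_sep)
        pairs.append(f"{name}={value}")
    return base + "?" + "&".join(pairs)


custom = ("http://catalog.ebay.com/The-Grudge_UPC_043396062603_W0"
          "QQ_fclsZ1QQ_pcatidZ1QQ_pidZ43973351QQ_tabZ2")
standard = decode_custom_delimiters(custom)
print(standard)
```

Here the delimiters are supplied by hand; learning them automatically is the job of the deep tokenization step covered later in the deck.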
5
Problem Definition
Given a set of duplicate clusters and their corresponding URLs
  ▫Learn Rules from URL strings that can identify duplicates
  ▫Use the learned Rules to normalize unseen duplicate URLs into a unique normalized URL
Applications such as crawlers can apply these generalized Rules to a given URL to generate a normalized URL
6
Related Works
Do not crawl in the DUST: different URLs with similar text
  ▫Authors: Z. Bar-Yossef, I. Keidar, and U. Schonfeld
  ▫Conference: International Conference on World Wide Web, 2007
  ▫DUST algorithm
    Discovers substring substitution rules that transform URLs of similar content into one canonical URL
    Rules are learned, with a confidence measure, from URLs obtained from previous crawl logs or web server logs
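A substring substitution rule of the kind described here can be pictured as a pair (alpha, beta): where alpha occurs in a URL, replacing it with beta is expected to give a URL with similar content. The sketch below applies one such rule. The rule and the example.com URL are made up for illustration, and the confidence measure is not modelled.

```python
def apply_substitution_rule(url, alpha, beta):
    """Replace the first occurrence of alpha in url with beta."""
    return url.replace(alpha, beta, 1)


# Hypothetical rule: a trailing "/index.html" is equivalent to "/".
rule = ("/index.html", "/")
canonical = apply_substitution_rule("http://www.example.com/news/index.html", *rule)
print(canonical)
```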
7
Related Works (cont.)
De-duping URLs via rewrite rules
  ▫Authors: A. Dasgupta, R. Kumar, and A. Sasturkar
  ▫Conference: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
  ▫Considers a broader set of rule types, which subsume the DUST rules
    DUST rules
    Session-id rules
    Irrelevant path components
    Complicated rewrites
  ▫The algorithm learns rules from a cluster of URLs with similar page content
    Such a cluster is referred to as a duplicate cluster or a dup cluster
8
Algorithms
URL Preprocessing
  ▫Basic Tokenization
  ▫Deep Tokenization
Rule Generation
  ▫Pair-wise Rule Generation
  ▫Rule Generalization
9
URL Preprocessing
Basic Tokenization
  ▫Uses the standard delimiters specified in RFC 1738
  ▫Extracted tokens:
    Protocol
    Hostname
    Path components
    Query-args
Deep Tokenization
  ▫Uses an unsupervised technique to learn custom URL encodings used by webmasters
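Basic tokenization can be sketched with Python's standard `urllib.parse` module, which splits a URL on the standard delimiters. This is only an approximation of the step named on the slide: the function name `basic_tokenize` and the dictionary layout are assumptions, and the library follows later URL syntax rules rather than being an RFC 1738 parser written for this purpose.

```python
from urllib.parse import urlsplit, parse_qsl


def basic_tokenize(url):
    """Split a URL into protocol, hostname, path components, and query args."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        "hostname": parts.netloc,
        # Drop empty strings produced by leading or doubled slashes.
        "path": [p for p in parts.path.split("/") if p],
        # Keep arguments as ordered (name, value) pairs.
        "query": parse_qsl(parts.query, keep_blank_values=True),
    }


tokens = basic_tokenize(
    "http://cs.stanford.edu/degrees/mscs/faq/index.php?sid=67873&cat=8"
)
print(tokens)
```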
10
URL Preprocessing (cont.)
11
Rule Generation
Definitions
  ▫URL
  ▫Rule
Example
  ▫u1: http://360.yahoo.com/friends-lttU7d6kIuGq
    u1 = {k(1,3) = http, k(2,2) = 360.yahoo.com, k(3.1,1.3) = friends, k(3.2,1.2) = -, k(3.3,1.1) = lttU7d6kIuGq}
  ▫u2: http://360.yahoo.com/friends-nMfcaJRPUSMQ
    u2 = {k(1,3) = http, k(2,2) = 360.yahoo.com, k(3.1,1.3) = friends, k(3.2,1.2) = -, k(3.3,1.1) = nMfcaJRPUSMQ}
  ▫Rule
    Context (C): c(k(1,3)) = http, c(k(2,2)) = 360.yahoo.com, c(k(3.1,1.3)) = friends, c(k(3.2,1.2)) = -, c(k(3.3,1.1)) = nMfcaJRPUSMQ
    Transformation (T): t(k(3.3,1.1)) = lttU7d6kIuGq
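The example rule rewrites u2 into u1: its context is every key of u2, and its transformation lists the key whose value must change. A minimal sketch of deriving such a rule from one pair of URLs follows. The function name `pairwise_rule` and the use of plain dictionaries keyed by strings such as "3.3,1.1" are assumptions, and choosing which URL is the source and which the target is not modelled.

```python
def pairwise_rule(source, target):
    """Build a (context, transformation) pair that rewrites source into target.

    context: every key/value of the source URL.
    transformation: the target's value for each key on which the URLs differ
    (None stands in for the "absent" symbol when the target lacks the key).
    """
    context = dict(source)
    keys = sorted(set(source) | set(target))
    transformation = {
        k: target.get(k) for k in keys if source.get(k) != target.get(k)
    }
    return context, transformation


u1 = {"1,3": "http", "2,2": "360.yahoo.com", "3.1,1.3": "friends",
      "3.2,1.2": "-", "3.3,1.1": "lttU7d6kIuGq"}
u2 = {"1,3": "http", "2,2": "360.yahoo.com", "3.1,1.3": "friends",
      "3.2,1.2": "-", "3.3,1.1": "nMfcaJRPUSMQ"}

context, transformation = pairwise_rule(source=u2, target=u1)
print(transformation)
```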
12
Rule Generation (cont.)
Pair-wise Rule Generation
  ▫Target Selection
  ▫Source Selection
Rule Generalization
  ▫Pair 1:
    http://www.imdb.com/title/tt0810900/photogallery
    http://www.imdb.com/title/tt0810900/mediaindex
  ▫Pair 2:
    http://www.imdb.com/title/tt0053198/photogallery
    http://www.imdb.com/title/tt0053198/mediaindex
  ▫Rule 1: c(k(1,5)) = http, c(k(2,4)) = www.imdb.com, c(k(3,3)) = title, c(k(4.1,2.2)) = tt, c(k(4.2,2.1)) = 0810900, c(k(5,1)) = photogallery, t(k(5,1)) = mediaindex
  ▫Rule 2: c(k(1,5)) = http, c(k(2,4)) = www.imdb.com, c(k(3,3)) = title, c(k(4.1,2.2)) = tt, c(k(4.2,2.1)) = 0053198, c(k(5,1)) = photogallery, t(k(5,1)) = mediaindex
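Rule 1 and Rule 2 share the same transformation and differ only in the value at k(4.2,2.1), so one way to picture generalization is to replace that value with a wildcard. The sketch below does exactly that for two rules. It is a simplified illustration and not the procedure used by the authors; `generalize`, `imdb_rule`, and the `*` wildcard string are assumptions.

```python
WILDCARD = "*"


def generalize(rule_a, rule_b):
    """Merge two rules whose transformations agree.

    Context values that differ between the rules become WILDCARD.
    Returns None when the rules cannot be merged this way.
    """
    context_a, transform_a = rule_a
    context_b, transform_b = rule_b
    if transform_a != transform_b or context_a.keys() != context_b.keys():
        return None
    context = {
        key: value if context_b[key] == value else WILDCARD
        for key, value in context_a.items()
    }
    return context, dict(transform_a)


def imdb_rule(title_id):
    """Helper that builds Rule 1 / Rule 2 from the slide for a given title id."""
    context = {"1,5": "http", "2,4": "www.imdb.com", "3,3": "title",
               "4.1,2.2": "tt", "4.2,2.1": title_id, "5,1": "photogallery"}
    return context, {"5,1": "mediaindex"}


rule_1 = imdb_rule("0810900")
rule_2 = imdb_rule("0053198")
general = generalize(rule_1, rule_2)
print(general)
```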
13
Evaluation
Dataset
Number of Rules after each step
14
Evaluation (cont.)
Small dataset
15
Evaluation (cont.)
Small dataset
16
Evaluation (cont.)
Large dataset
17
Evaluation (cont.)
Large dataset
18
Conclusion
Presented a set of scalable and robust techniques for de-duplication of URLs
  ▫Basic and deep tokenization
  ▫Rule generation and generalization
Easily adaptable to the MapReduce paradigm
Evaluated effectiveness on both small and large datasets
19
Questions?
Thanks for your attention
20
Algorithm 1
21
Algorithm 2
22
Algorithm 3
23
Algorithm 4
24
Algorithm 5
25
Definitions of URL
URL: A URL u is defined as a function
  ▫u : K → V ∪ {⊥}
  ▫K: keys k(x.i,y.j)
    x, y represent the position index from the start and end of the URL
    i, j represent the deep token index
  ▫V: values
  ▫A key not present in the URL maps to ⊥
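The key scheme can be made concrete with a short sketch that assigns a key to every deep token of a tokenized URL. As a simplification, each key is written as a tuple (x, i, y, j), so a token with no deep split gets i = j = 1 rather than the shorter k(x,y) form. The names `assign_keys`, `lookup`, and `BOTTOM` are hypothetical, and it is assumed that j counts deep tokens from the end, as the earlier 360.yahoo.com example suggests.

```python
BOTTOM = None  # stands in for the "absent" symbol


def assign_keys(tokens):
    """Map positional keys to values.

    tokens is a list of basic tokens, each given as a list of deep tokens.
    x and y are the 1-based position of the basic token from the start and
    from the end of the URL; i and j are the 1-based position of the deep
    token inside it, again from the start and from the end.
    """
    n = len(tokens)
    url = {}
    for x, deep in enumerate(tokens, start=1):
        y = n - x + 1
        m = len(deep)
        for i, value in enumerate(deep, start=1):
            j = m - i + 1
            url[(x, i, y, j)] = value
    return url


def lookup(url, key):
    """Evaluate the URL as a function: absent keys map to BOTTOM."""
    return url.get(key, BOTTOM)


u1 = assign_keys([["http"], ["360.yahoo.com"], ["friends", "-", "lttU7d6kIuGq"]])
print(lookup(u1, (3, 3, 1, 1)))
```

Indexing from both ends means a trailing token keeps the same end position even when two URLs have a different number of tokens before it.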
26
Definitions of Rule
RULE: A Rule r is defined as a function
  ▫r : C → T
  ▫C: context, C : K → V ∪ {∗}
  ▫T: transformation, T : K → V ∪ {⊥, K’}
    K’ = K ∪ ValueConversions
    ValueConversions = {Lowercase(K), Uppercase(K), Encode(K), Decode(K),...}
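To show how a context and a transformation might act on a URL, the sketch below checks the context (with `*` matching any value) and then applies the transformation, where a plain string sets a value, `None` plays the role of ⊥ and removes the key, and a tuple such as `("lowercase", key)` stands for a value conversion. All names and this encoding of rules are assumptions for illustration. Encode and Decode are left out, and the example.com host is made up.

```python
WILDCARD = "*"
ABSENT = None  # stands in for the "absent" symbol in a transformation

# Only two of the value conversions listed on the slide are sketched here.
CONVERSIONS = {"lowercase": str.lower, "uppercase": str.upper}


def matches(context, url):
    """A URL matches when every context key is present with an allowed value."""
    return all(
        key in url and value in (WILDCARD, url[key])
        for key, value in context.items()
    )


def apply_rule(rule, url):
    """Return the rewritten URL (as a key -> value dict), or None if no match."""
    context, transformation = rule
    if not matches(context, url):
        return None
    result = dict(url)
    for key, action in transformation.items():
        if action is ABSENT:
            result.pop(key, None)              # drop the token
        elif isinstance(action, tuple):
            name, source_key = action          # value conversion of another key
            result[key] = CONVERSIONS[name](url[source_key])
        else:
            result[key] = action               # literal replacement value
    return result


rule = ({"1,2": "http", "2,1": WILDCARD}, {"2,1": ("lowercase", "2,1")})
url = {"1,2": "http", "2,1": "WWW.Example.COM"}
normalized = apply_rule(rule, url)
print(normalized)
```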
27
Rule Coverage
28
MapReduce