Learning URL Patterns for Webpage De-duplication Authors: Hema Swetha Koppula… WSDM 2010 Reporter: Jing Chiu 2015/12/5.


1 Learning URL Patterns for Webpage De-duplication
Authors: Hema Swetha Koppula…
WSDM 2010
Reporter: Jing Chiu
Email: D9815013@mail.ntust.edu.tw
2015/12/5
Data Mining & Machine Learning Lab

2 Outline
Introduction
▫Duplicate URLs
▫Problem Definition
Related Works
Algorithms
▫URL Preprocessing
▫Rule Generation
Evaluation
Conclusions

3 Introduction
Duplicate URLs
Problem Definition

4 Duplicate URLs
Making URLs search engine friendly
▫http://en.wikipedia.org/wiki/Casino_Royale
▫http://en.wikipedia.org/?title=Casino_Royale
Session-id or cookie information present in URLs
▫http://cs.stanford.edu/degrees/mscs/faq/index.php?sid=67873&cat=8
▫http://cs.stanford.edu/degrees/mscs/faq/index.php?sid=78813&cat=8
Irrelevant or superfluous components in URLs
▫http://www.amazon.com/Lord-Rings/dp/B000634DCW
▫http://www.amazon.com/dp/B000634DCW
Webmasters construct URL representations with custom delimiters
▫http://catalog.ebay.com/The-Grudge_UPC_043396062603_W0QQ_fclsZ1QQ_pcatidZ1QQ_pidZ43973351QQ_tabZ2
▫http://catalog.ebay.com/The-Grudge_UPC_043396062603_W0?_fcls=1&_pcatid=1&_pid=43973351&_tab=2

5 Problem Definition
Given a set of duplicate clusters and their corresponding URLs:
▫Learn Rules from URL strings that can identify duplicates
▫Use the learned Rules to normalize unseen duplicate URLs into a unique normalized URL
Applications such as crawlers can apply these generalized Rules to a given URL to generate a normalized URL

6 Related Works
Do not crawl in the DUST: different URLs with similar text
▫Authors: Z. Bar-Yossef, I. Keidar, and U. Schonfeld
▫Conference: International Conference on World Wide Web, 2007
▫DUST algorithm
  Discovers substring substitution rules that transform URLs with similar content into one canonical URL
  Rules are learned, with a confidence measure, from URLs obtained from previous crawl logs or web server logs
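
A substring substitution rule of the kind described above can be pictured as a pair (alpha, beta): wherever alpha occurs in a URL, it is rewritten to beta. The sketch below is only an illustration. The function name `apply_substitution` and the specific pair `"/?title="` to `"/wiki/"` are assumptions chosen to match the Wikipedia example on the Duplicate URLs slide; they are not a rule reported by the DUST authors, and the learning step with its confidence measure is not shown.

```python
def apply_substitution(url: str, alpha: str, beta: str) -> str:
    """Rewrite the first occurrence of substring alpha in url to beta."""
    return url.replace(alpha, beta, 1)


# Hypothetical rule, picked by hand for illustration (not learned from logs).
canonical = apply_substitution(
    "http://en.wikipedia.org/?title=Casino_Royale", "/?title=", "/wiki/"
)
print(canonical)
```

Because the rule works on raw substrings, it needs no knowledge of URL structure, which is also why it cannot express position-dependent rewrites.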

7 Related Works (cont.)
De-duping URLs via rewrite rules
▫Authors: A. Dasgupta, R. Kumar, and A. Sasturkar
▫Conference: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
▫Considers a broader set of rule types that subsume the DUST rules
  DUST rules
  Session-id rules
  Irrelevant path components
  Complicated rewrites
▫The algorithm learns rules from a cluster of URLs with similar page content
  Such a cluster is referred to as a duplicate cluster or a dup cluster

8 Algorithms
URL Preprocessing
▫Basic Tokenization
▫Deep Tokenization
Rule Generation
▫Pair-wise Rule Generation
▫Rule Generalization

9 URL Preprocessing
Basic Tokenization
▫Uses the standard delimiters specified in RFC 1738
▫Extracted tokens:
  Protocol
  Hostname
  Path components
  Query-args
Deep Tokenization
▫Uses an unsupervised technique to learn custom URL encodings used by webmasters
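
Basic tokenization as listed above can be approximated with Python's standard `urllib.parse`, which splits on the usual URL delimiters. This is a minimal sketch, not the authors' tokenizer: the function name `basic_tokenize` and the dictionary layout are assumptions, and deep tokenization (learning custom delimiters) is not attempted.

```python
from urllib.parse import parse_qsl, urlsplit


def basic_tokenize(url: str) -> dict:
    """Split a URL into protocol, hostname, path components and query-args."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        "hostname": parts.netloc,
        # Drop the empty strings produced by leading or doubled slashes.
        "path": [p for p in parts.path.split("/") if p],
        # Keep query arguments as ordered (name, value) pairs.
        "query_args": parse_qsl(parts.query, keep_blank_values=True),
    }


tokens = basic_tokenize(
    "http://cs.stanford.edu/degrees/mscs/faq/index.php?sid=67873&cat=8"
)
print(tokens)
```

A URL with custom delimiters, such as the eBay example with `QQ` and `Z`, would come out of this step as one long path component, which is the gap deep tokenization is meant to close.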

10 URL Preprocessing (cont.)

11 Rule Generation
Definitions
▫URL
▫Rule
Example
▫u1: http://360.yahoo.com/friends-lttU7d6kIuGq
  u1 = {k(1,3) = http, k(2,2) = 360.yahoo.com, k(3.1,1.3) = friends, k(3.2,1.2) = -, k(3.3,1.1) = lttU7d6kIuGq}
▫u2: http://360.yahoo.com/friends-nMfcaJRPUSMQ
  u2 = {k(1,3) = http, k(2,2) = 360.yahoo.com, k(3.1,1.3) = friends, k(3.2,1.2) = -, k(3.3,1.1) = nMfcaJRPUSMQ}
▫Rule
  Context (C): c(k(1,3)) = http, c(k(2,2)) = 360.yahoo.com, c(k(3.1,1.3)) = friends, c(k(3.2,1.2)) = -, c(k(3.3,1.1)) = nMfcaJRPUSMQ
  Transformation (T): t(k(3.3,1.1)) = lttU7d6kIuGq
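
The example rule above can be exercised with a few lines of code: the context is checked against the source URL's tokens, and the transformation overwrites the listed keys. This is a sketch under assumptions: keys are stored as strings such as `"k(3.3,1.1)"`, URLs are already tokenized into dictionaries, and the helper name `apply_rule` is not from the paper.

```python
from typing import Dict, Optional

Tokens = Dict[str, str]


def apply_rule(context: Tokens, transformation: Tokens, url: Tokens) -> Optional[Tokens]:
    """Return the rewritten tokens if the context matches the URL, else None."""
    if any(url.get(key) != value for key, value in context.items()):
        return None
    rewritten = dict(url)
    rewritten.update(transformation)
    return rewritten


u1 = {"k(1,3)": "http", "k(2,2)": "360.yahoo.com", "k(3.1,1.3)": "friends",
      "k(3.2,1.2)": "-", "k(3.3,1.1)": "lttU7d6kIuGq"}
u2 = {"k(1,3)": "http", "k(2,2)": "360.yahoo.com", "k(3.1,1.3)": "friends",
      "k(3.2,1.2)": "-", "k(3.3,1.1)": "nMfcaJRPUSMQ"}

context = dict(u2)                               # the source URL u2 supplies the context
transformation = {"k(3.3,1.1)": "lttU7d6kIuGq"}  # the target value comes from u1

print(apply_rule(context, transformation, u2) == u1)
```

A pair-wise rule like this one is tied to a single source URL, since every context value is a literal; it only becomes useful on unseen URLs after generalization.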

12 Rule Generation (cont.)
Pair-wise Rule Generation
▫Target Selection
▫Source Selection
Rule Generalization
▫Pair 1:
  http://www.imdb.com/title/tt0810900/photogallery
  http://www.imdb.com/title/tt0810900/mediaindex
▫Pair 2:
  http://www.imdb.com/title/tt0053198/photogallery
  http://www.imdb.com/title/tt0053198/mediaindex
▫Rule 1:
  c(k(1,5)) = http, c(k(2,4)) = www.imdb.com, c(k(3,3)) = title, c(k(4.1,2.2)) = tt, c(k(4.2,2.1)) = 0810900, c(k(5,1)) = photogallery, t(k(5,1)) = mediaindex
▫Rule 2:
  c(k(1,5)) = http, c(k(2,4)) = www.imdb.com, c(k(3,3)) = title, c(k(4.1,2.2)) = tt, c(k(4.2,2.1)) = 0053198, c(k(5,1)) = photogallery, t(k(5,1)) = mediaindex
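
Rule 1 and Rule 2 above differ only in the title id, which suggests how two pair-wise rules might collapse into one. The slides do not spell out the generalization procedure, so the sketch below is a naive stand-in and not the authors' algorithm: two rules with identical transformations are merged by replacing every context value on which they disagree with the wildcard `*`. The names `generalize` and `make_rule` and the dictionary layout are assumptions.

```python
from typing import Dict, Optional

WILDCARD = "*"
Rule = Dict[str, Dict[str, str]]


def generalize(rule_a: Rule, rule_b: Rule) -> Optional[Rule]:
    """Merge two rules, wildcarding the context keys whose values differ."""
    if rule_a["transformation"] != rule_b["transformation"]:
        return None
    ctx_a, ctx_b = rule_a["context"], rule_b["context"]
    if ctx_a.keys() != ctx_b.keys():
        return None
    context = {k: ctx_a[k] if ctx_a[k] == ctx_b[k] else WILDCARD for k in ctx_a}
    return {"context": context, "transformation": dict(rule_a["transformation"])}


def make_rule(title_id: str) -> Rule:
    """Build the pair-wise rule from the slide for a given title id."""
    return {
        "context": {
            "k(1,5)": "http", "k(2,4)": "www.imdb.com", "k(3,3)": "title",
            "k(4.1,2.2)": "tt", "k(4.2,2.1)": title_id, "k(5,1)": "photogallery",
        },
        "transformation": {"k(5,1)": "mediaindex"},
    }


rule_1 = make_rule("0810900")
rule_2 = make_rule("0053198")
general = generalize(rule_1, rule_2)
print(general["context"])
```

The merged context keeps `photogallery` as a literal and wildcards only `k(4.2,2.1)`, so it would match any title id. Merging on just two examples is aggressive; a real system would want more supporting evidence before wildcarding a key.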

13 Evaluation
Dataset
Rule Numbers after each step

14 Evaluation (cont.)
Small dataset

15 Evaluation (cont.)
Small dataset

16 Evaluation (cont.)
Large dataset

17 Evaluation (cont.)
Large dataset

18 Conclusion
Presented a set of scalable and robust techniques for de-duplication of URLs
▫Basic and deep tokenization
▫Rule generation and generalization
Easy adaptability to the MapReduce paradigm
Evaluated effectiveness on both small and large datasets

19 Questions?
Thanks for your attention

20 Algorithm 1

21 Algorithm 2

22 Algorithm 3

23 Algorithm 4

24 Algorithm 5

25 Definitions of URL
URL: A URL u is defined as a function
▫u : K → V ∪ {⊥}
▫K: keys
  k(x.i, y.j)
  x, y represent the position index from the start and end of the URL
  i, j represent the deep token index from the start and end of the component
▫V: values
▫A key not present in the URL maps to ⊥
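
The key scheme k(x.i, y.j) can be made concrete by indexing an already tokenized URL. The sketch below assumes the tokenization is given as a list of components, each a list of deep tokens, and follows the convention visible in the slide examples that a component with a single deep token is written k(x,y) without the dotted part. The names `index_tokens`, `value` and `BOTTOM` are hypothetical; `None` stands in for ⊥.

```python
from typing import Dict, List, Optional

BOTTOM = None  # stands in for ⊥ (key not present in the URL)


def index_tokens(components: List[List[str]]) -> Dict[str, str]:
    """Assign a key k(x.i,y.j) to every deep token of a tokenized URL."""
    n = len(components)
    url: Dict[str, str] = {}
    for x, deep_tokens in enumerate(components, start=1):
        y = n - x + 1                      # position counted from the end
        m = len(deep_tokens)
        for i, token in enumerate(deep_tokens, start=1):
            j = m - i + 1                  # deep index counted from the end
            key = f"k({x},{y})" if m == 1 else f"k({x}.{i},{y}.{j})"
            url[key] = token
    return url


def value(url: Dict[str, str], key: str) -> Optional[str]:
    """Evaluate u(key); absent keys map to BOTTOM."""
    return url.get(key, BOTTOM)


u1 = index_tokens([["http"], ["360.yahoo.com"], ["friends", "-", "lttU7d6kIuGq"]])
print(u1)
```

Indexing from both ends lets a rule refer to "the last path component" or "the first deep token" regardless of how long the URL is.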

26 Definitions of Rule
Rule: A Rule r is defined as a function
▫r : C → T
▫C: context
  C : K → V ∪ {∗}
▫T: transformation
  T : K → V ∪ {⊥, K'}
  K' = K ∪ ValueConversions
  ValueConversions = {Lowercase(K), Uppercase(K), Encode(K), Decode(K), ...}
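
Read literally, the definition above lets a transformation assign each key one of three things: a literal value from V, ⊥ (drop the key), or a reference to a key in K', possibly passed through a value conversion such as Lowercase. The sketch below illustrates that reading only; the names `KeyRef`, `DELETE` and `apply_transformation` are assumptions, the example tokens are made up, and key positions are not renumbered after a deletion.

```python
from typing import Callable, Dict, NamedTuple, Optional, Union


class KeyRef(NamedTuple):
    """Reference to another key's value, passed through a value conversion."""
    key: str
    convert: Callable[[str], str]


DELETE = None  # stands in for ⊥: the key is removed from the output
Target = Union[str, KeyRef, None]


def apply_transformation(url: Dict[str, str], transformation: Dict[str, Target]) -> Dict[str, str]:
    """Apply literal, delete and key-reference targets to a tokenized URL."""
    out = dict(url)
    for key, target in transformation.items():
        if isinstance(target, KeyRef):
            source: Optional[str] = url.get(target.key)  # read from the original URL
            if source is None:
                out.pop(key, None)
            else:
                out[key] = target.convert(source)
        elif target is DELETE:
            out.pop(key, None)
        else:
            out[key] = target
    return out


# Hypothetical tokens and rule, for illustration only.
tokens = {"k(1,3)": "http", "k(2,2)": "WWW.Example.COM", "k(3,1)": "sid123"}
rule_t = {"k(2,2)": KeyRef("k(2,2)", str.lower), "k(3,1)": DELETE}
result = apply_transformation(tokens, rule_t)
print(result)
```

Reading key references from the original URL rather than the partially rewritten one keeps the outcome independent of the order in which the transformation's entries are visited.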

27 Rule Coverage

28 MapReduce

