1
Web Page Categorization without the Web Page
Author: Min-Yen Kan
WWW-2004
2
Basic Idea
Web page categorization is closely related to text categorization
Some approaches retrieve the whole document
  This yields URLs of additional documents
  Could result in cyclic or non-terminating crawling
Glean information from intuitive URLs instead
  Avoids the bottleneck of retrieving the page
3
An Example
http://cs.cornell.edu/Info/Courses/Current/CS415/CS415.html
Classify the above web page into one of the following categories:
  Course
  Faculty
  Project
  Student
4
Approach
Two-phase URL segmentation
First phase
  Baseline
    Split along scheme://host/path-elements/document.extension
    Further segmentation, e.g. faculty-info → faculty info
  Refined
    Break the URL wherever a transition between uppercase, lowercase and digits is observed
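The two first-phase schemes can be sketched in Python as follows. This is only an illustration, not the author's implementation: the function names are hypothetical, splitting at every non-alphanumeric character is an assumption generalized from the faculty-info example, and treating a capitalized word such as "Info" as a single token (rather than breaking after the capital) is also an assumption.

```python
import re
from urllib.parse import urlsplit


def baseline_segments(url):
    """Baseline: split along scheme://host/path-elements/document.extension,
    then at any remaining non-alphanumeric character (assumed rule)."""
    parts = urlsplit(url)
    raw = [parts.scheme] + parts.netloc.split(".") + parts.path.split("/")
    tokens = []
    for piece in raw:
        tokens.extend(t for t in re.split(r"[^A-Za-z0-9]+", piece) if t)
    return tokens


def refined_segments(url):
    """Refined: additionally break each baseline token at transitions
    between uppercase, lowercase and digits. A capitalized word is kept
    whole (assumption), so 'Info' stays 'Info' but 'CS415' splits."""
    pattern = r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+"
    tokens = []
    for token in baseline_segments(url):
        tokens.extend(re.findall(pattern, token))
    return tokens


url = "http://cs.cornell.edu/Info/Courses/Current/CS415/CS415.html"
print(baseline_segments(url))
print(refined_segments(url))
```

On the example URL from the earlier slide, the baseline keeps CS415 as one token, while the refined scheme separates CS from 415.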
5
Approach
Second phase
  Information content reduction
    Examines all possible partitions of the segment
    Sums the information content (IC) of the tokens in each partition
    Picks the partition with the lowest IC
  Title-token-based finite-state transducer
    What about acronyms?
    A non-deterministic weighted finite-state transducer splits and expands segments based on previously seen web page titles
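A minimal sketch of information content reduction, assuming the IC of a token is the negative log2 of its smoothed relative frequency in some reference corpus. The token counts below are made up for illustration, and the smoothing scheme and function names are assumptions, not details taken from the talk.

```python
import math
from itertools import combinations

# Hypothetical token counts; a real system would estimate these from a corpus.
TOY_COUNTS = {"home": 50, "page": 40, "homepage": 2, "ho": 1, "me": 5}
TOTAL = sum(TOY_COUNTS.values())


def information_content(token):
    """IC = -log2 P(token), with add-one smoothing so unseen tokens
    get a high but finite IC (smoothing choice is an assumption)."""
    p = (TOY_COUNTS.get(token, 0) + 1) / (TOTAL + len(TOY_COUNTS) + 1)
    return -math.log2(p)


def partitions(segment):
    """Yield every way of cutting the segment into contiguous pieces."""
    n = len(segment)
    for k in range(n):
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield [segment[a:b] for a, b in zip(bounds, bounds[1:])]


def best_partition(segment):
    """Pick the partition whose summed token IC is lowest."""
    return min(partitions(segment),
               key=lambda parts: sum(information_content(t) for t in parts))


print(best_partition("homepage"))
```

Exhaustive enumeration visits 2^(n-1) partitions for a segment of length n, which is acceptable for short URL segments; a dynamic-programming search would be the usual replacement for longer strings.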
6
An Example FST
Rule                                                      Score  Output
1. Match the initial letter in the subsequent token       2      |l
2. Match the initial letter in the non-subsequent token   1      |l
3. Match a subsequent letter in the current token         1      l
4. Match the final letter in the current token            3      l
5. Skip a character in the candidate expansion            0      ε

nytimes → New York Times
Ф  N  e  w  Y  o  r  k  T  i  m  e  s
   R1 R5 R5 R1 R5 R5 R5 R1 R3 R3 R3 R4
Score of 12 and outputs |n|y|times
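The nytimes example can be checked by replaying the rule sequence against the candidate expansion. The sketch below assumes the rule table reads as scores 2, 1, 1, 3, 0 for rules 1 to 5, with rules 1 and 2 emitting a boundary marker plus the letter, rules 3 and 4 emitting the letter alone, and rule 5 emitting nothing. It only replays a given path; it does not search for the best path as a real weighted transducer would, and the names are hypothetical.

```python
# Rule scores as read from the slide's table (assumed reading).
RULE_SCORES = {"R1": 2, "R2": 1, "R3": 1, "R4": 3, "R5": 0}


def apply_rules(expansion, rules):
    """Replay one rule per letter of the candidate expansion and
    return the total score and the emitted segmentation string."""
    letters = [c for c in expansion if not c.isspace()]
    if len(letters) != len(rules):
        raise ValueError("need exactly one rule per letter")
    score = 0
    output = []
    for letter, rule in zip(letters, rules):
        score += RULE_SCORES[rule]
        if rule in ("R1", "R2"):
            output.append("|" + letter.lower())  # start of a new token
        elif rule in ("R3", "R4"):
            output.append(letter.lower())        # continue current token
        # R5 skips the character and emits nothing
    return score, "".join(output)


path = "R1 R5 R5 R1 R5 R5 R5 R1 R3 R3 R3 R4".split()
print(apply_rules("New York Times", path))
```

Under these assumptions the replay reproduces the slide's result: three applications of rule 1, three of rule 3 and one of rule 4 sum to 12, and the emitted string is |n|y|times.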
7
Experiments
Dataset used: WebKB (4167 pages)
  Classified under student, faculty, course and project
Classifier used: SVM
Compared with: FOIL-PILFS (based on inductive logic programming)
Evaluation based on (U)RL features {U_b, U_r, U_i, U_f}, (A)nchor text, (T)itle text and page te(X)t
8
Experiments
9
Conclusion
URLs contain tokens that are effective for classification
  It's faster than retrieving the page
Careful URL segmentation boosts classification
URL segmentation is more powerful than expansion
Can assist source-based classification to a limited extent
FST cannot expand what it hasn’t seen
Cryptic URLs are hard to tackle