Using Web Structure for Classifying and Describing Web Pages. Eric J. Glover, Kostas Tsioutsiouliklis, Steve Lawrence, David M. Pennock, Gary W. Flake. International World Wide Web Conference, 2002. Presented by Zaihan Yang, CSE Web Mining.
Introduction. Aim: classification of web pages, and description of web pages (to name clusters of web pages). Using web structure: extracting patterns from hyperlinks on the web. A hyperlink provides the destination page and the associated anchortext describing the link.
Typical Text-based Classification. Uses the words (or phrases) of the target document itself, keeping the most significant features. This is often not effective: for example, the home page of General Motors does not state that they are a car company. Text sources compared: full text, anchortext, extended anchortext, and a combination.
Virtual Document. A virtual document is a collection of anchortexts or extended anchortexts from links pointing to the target document. Anchortext: the words occurring inside a link. Extended anchortext: the set of rendered words occurring up to 25 words before and after an associated link (as well as the anchortext itself).
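A minimal sketch of how an extended-anchortext virtual document could be assembled, assuming each inbound page has already been rendered to a list of words and the position of the link is known. The function names and the (words, start, end) representation are hypothetical and not taken from the paper.

```python
from typing import List, Tuple

WINDOW = 25  # words kept on each side of the link, per the definition above


def extended_anchortext(words: List[str], start: int, end: int,
                        window: int = WINDOW) -> List[str]:
    """Return the anchortext words[start:end] plus up to `window` words on each side."""
    lo = max(0, start - window)
    hi = min(len(words), end + window)
    return words[lo:hi]


def virtual_document(inbound: List[Tuple[List[str], int, int]]) -> List[str]:
    """Concatenate the extended anchortext of every inbound link into one word list."""
    doc: List[str] = []
    for words, start, end in inbound:
        doc.extend(extended_anchortext(words, start, end))
    return doc


# Hypothetical inbound page of 60 words whose link covers words 30 and 31.
page = ["w%d" % i for i in range(60)]
snippet = extended_anchortext(page, 30, 32)
vdoc = virtual_document([(page, 30, 32), (page, 0, 1)])
```

Clamping the window to the page boundaries means a link near the start or end of a page simply contributes fewer context words.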
Main Method. Main procedure: a full-text classifier and a virtual-documents classifier. Two improvement methods. Naming a cluster. Steps of the main procedure: datasets, features, expected entropy loss ranking, train SVM.
Datasets. Positive: a set of web pages downloaded from various Yahoo! categories. Negative: random documents from outside Yahoo!. Also: the WebKB dataset. Features: all words and two- or three-word phrases. For example, from the sentence "My favorite game is scrabble." the possible features include: My, my favorite, my favorite game, favorite, favorite game, etc.
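The feature definition can be sketched as follows. Lowercasing and the simple regular-expression tokenizer are assumptions made for this illustration, and `ngram_features` is a hypothetical name.

```python
import re
from typing import List, Set


def ngram_features(text: str, max_n: int = 3) -> Set[str]:
    """Return every word and every 2..max_n word phrase found in `text`."""
    # Assumption: lowercase the text and keep runs of letters, digits, apostrophes.
    words: List[str] = re.findall(r"[a-z0-9']+", text.lower())
    features: Set[str] = set()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            features.add(" ".join(words[i:i + n]))
    return features


features = ngram_features("My favorite game is scrabble.")
```

For the five-word example sentence this yields 5 single words, 4 two-word phrases, and 3 three-word phrases.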
Dimensionality Reduction. Goal: remove useless features. A two-step process. First, remove all features that do not occur in a specified percentage of documents: a feature f is removed if (|Af|/|A| < T+) and (|Bf|/|B| < T-), where A is the set of positive examples, B is the set of negative examples, Af is the documents in A that contain feature f, Bf is the documents in B that contain feature f, T+ is the threshold for positive features, and T- is the threshold for negative features. Second, rank the remaining features by expected entropy loss.
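A rough sketch of the first (threshold) step, assuming each document is represented as the set of features it contains. The name `prune_features` and the toy documents are hypothetical; the threshold values are illustrative only.

```python
from typing import List, Set


def prune_features(pos_docs: List[Set[str]], neg_docs: List[Set[str]],
                   t_pos: float, t_neg: float) -> Set[str]:
    """Keep a feature unless it is rare in BOTH the positive and negative sets."""
    vocab: Set[str] = set().union(*pos_docs, *neg_docs)
    kept: Set[str] = set()
    for f in vocab:
        frac_pos = sum(f in d for d in pos_docs) / len(pos_docs)  # |Af| / |A|
        frac_neg = sum(f in d for d in neg_docs) / len(neg_docs)  # |Bf| / |B|
        if frac_pos < t_pos and frac_neg < t_neg:
            continue  # below both thresholds: removed
        kept.add(f)
    return kept


# Toy data (hypothetical): three positive and three negative documents.
pos = [{"car", "dealer"}, {"car", "truck"}, {"car", "rare"}]
neg = [{"the", "news"}, {"the", "car"}, {"the", "sports"}]
kept = prune_features(pos, neg, t_pos=0.5, t_neg=0.5)
```

Note that a feature common only in the negative set survives this step; it is the entropy-loss ranking that later decides how useful it is.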
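Expected Entropy Loss. Computed separately for each feature. Let $C$ denote the positive class, $\bar{C}$ the negative class, $f$ the event that the feature is present, and $\bar{f}$ the event that it is absent; $\lg$ is the base-2 logarithm.
The prior entropy of the class distribution: $e = -\Pr(C)\lg\Pr(C) - \Pr(\bar{C})\lg\Pr(\bar{C})$
The posterior entropy of the class when the feature is present: $e_f = -\Pr(C \mid f)\lg\Pr(C \mid f) - \Pr(\bar{C} \mid f)\lg\Pr(\bar{C} \mid f)$
The posterior entropy of the class when the feature is absent: $e_{\bar{f}} = -\Pr(C \mid \bar{f})\lg\Pr(C \mid \bar{f}) - \Pr(\bar{C} \mid \bar{f})\lg\Pr(\bar{C} \mid \bar{f})$
The expected entropy loss: $e - \left(e_f \Pr(f) + e_{\bar{f}} \Pr(\bar{f})\right)$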
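As an illustration, expected entropy loss can be computed from the document counts |A|, |Af|, |B|, |Bf| used in the dimensionality-reduction step. This sketch assumes probabilities are estimated as plain relative frequencies (no smoothing), which is an assumption of the example; the function names are hypothetical.

```python
import math


def entropy(p: float) -> float:
    """Binary entropy in bits; the term 0 * lg(0) is taken to be 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def expected_entropy_loss(a_f: int, a: int, b_f: int, b: int) -> float:
    """Prior class entropy minus the expected posterior entropy for one feature.

    a, b     : number of positive / negative documents (|A|, |B|)
    a_f, b_f : how many of them contain the feature (|Af|, |Bf|)
    """
    n = a + b
    n_f = a_f + b_f          # documents where the feature is present
    n_not_f = n - n_f        # documents where the feature is absent
    prior = entropy(a / n)
    e_f = entropy(a_f / n_f) if n_f else 0.0
    e_not_f = entropy((a - a_f) / n_not_f) if n_not_f else 0.0
    return prior - (n_f / n) * e_f - (n_not_f / n) * e_not_f


perfect = expected_entropy_loss(a_f=10, a=10, b_f=0, b=10)   # feature only in positives
useless = expected_entropy_loss(a_f=5, a=10, b_f=5, b=10)    # feature equally common
```

A feature that separates the classes perfectly recovers the full prior entropy (1 bit for balanced classes), while a feature equally frequent in both classes yields no loss at all.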
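Train SVM. Training data: a set of data points $\{(x_1, y_1), \ldots, (x_N, y_N)\}$, where $x_i$ is an input and $y_i$ is a target output (1 or -1).
Separating hyperplane: $w \cdot \phi(x) + b = 0$, with $w \cdot \phi(x_i) + b \ge 1$ if $y_i = 1$ and $w \cdot \phi(x_i) + b \le -1$ if $y_i = -1$.
In the standard formulation, $w$ and $b$ are found by minimizing $\frac{1}{2}\lVert w \rVert^2$ subject to these constraints.
Output: $f(x) = w \cdot \phi(x) + b$
Kernel function: $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$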
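To make the real-valued output concrete, here is a small sketch of a linear SVM (the mapping φ is the identity) trained by full-batch subgradient descent on the regularized hinge loss. This is a stand-in chosen so the example needs no external library; it is not the solver used by the authors, and the names, hyperparameters, and toy data are all hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def svm_output(w: Vector, b: float, x: Vector) -> float:
    """Real-valued SVM output w.x + b for a linear kernel."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_linear_svm(xs: List[Vector], ys: List[int], lam: float = 0.01,
                     lr: float = 0.1, epochs: int = 200) -> Tuple[List[float], float]:
    """Subgradient descent on (lam/2)*||w||^2 + mean hinge loss."""
    n, dim = len(xs), len(xs[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wi for wi in w]   # gradient of the regularizer
        grad_b = 0.0
        for x, y in zip(xs, ys):
            if y * svm_output(w, b, x) < 1:   # inside the margin or misclassified
                for j in range(dim):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


# Toy, linearly separable data (hypothetical).
xs = [(2.0, 2.0), (3.0, 3.0), (2.0, 3.0), (-2.0, -2.0), (-3.0, -3.0), (-2.0, -3.0)]
ys = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(xs, ys)
outputs = [svm_output(w, b, x) for x in xs]
```

The sign of each output gives the predicted class, and its magnitude indicates how far the point lies from the separating hyperplane.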
Improvement: Uncertainty Sampling. The output of an SVM classifier is a real number between -∞ and +∞. An output in the interval (-1, 1) is less certain than one in the intervals (-∞, -1) and (1, +∞), so the region (-1, 1) is called the "uncertain region". In uncertainty sampling, a human judges the documents whose output falls in the uncertain region.
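Selecting the documents to hand to a human judge could look like the sketch below. The document ids and output values are made up, and treating the endpoints -1 and 1 as certain is an assumption of this example.

```python
from typing import Dict, List


def uncertain_documents(outputs: Dict[str, float],
                        low: float = -1.0, high: float = 1.0) -> List[str]:
    """Return ids of documents whose SVM output lies strictly inside (low, high)."""
    return sorted(doc for doc, out in outputs.items() if low < out < high)


# Hypothetical SVM outputs keyed by document id.
outputs = {"doc_a": 2.3, "doc_b": -0.4, "doc_c": 0.9, "doc_d": -1.7, "doc_e": 1.0}
to_review = uncertain_documents(outputs)
```

Only the documents returned here need manual judgment; the rest keep the classifier's label.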
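Improvement: Combination. Goal: combine results from the extended-anchortext (extended-AT) classifier with the less accurate full-text classifier. Procedure for a web page: run the extended-AT classifier. If its result is not "negative but uncertain", use the result of the extended-AT classifier. If it is negative but uncertain, run the full-text classifier: if the full-text result is positive and |output| > |outputAT|, label the page positive; otherwise label it negative.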
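The combination rule can be sketched as a small decision function. It assumes that "negative but uncertain" means an extended-anchortext output in the interval (-1, 0), following the uncertain region (-1, 1) defined for uncertainty sampling, and that an output of exactly 0 counts as negative; both are assumptions of this example, and `combine` is a hypothetical name.

```python
def combine(output_at: float, output_ft: float) -> int:
    """Return +1 or -1 from the extended-anchortext and full-text SVM outputs."""
    negative_but_uncertain = -1.0 < output_at < 0.0
    if negative_but_uncertain:
        # Defer to the full-text classifier only if it is positive and more confident.
        if output_ft > 0.0 and abs(output_ft) > abs(output_at):
            return 1
        return -1
    # Otherwise the extended-anchortext result stands.
    return 1 if output_at > 0.0 else -1


flipped = combine(output_at=-0.3, output_ft=0.8)     # full-text overrides
kept_negative = combine(output_at=-0.3, output_ft=0.2)
confident_negative = combine(output_at=-1.5, output_ft=3.0)
```

Because the rule can only turn negatives into positives, it is expected to help positive-class accuracy at some cost to negative-class accuracy.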
Name the Cluster. Use the top-ranked features extracted from the extended-anchortext virtual documents to name a cluster. Underlying beliefs: the words near the anchortext describe the target document, and the features ranked highest by expected entropy loss are those that occur in many positive examples and few negative ones.
Results: Classifying. Anchortext alone is comparable to the full text for classification. Classification accuracy improves significantly when extended anchortext is used instead of the document full text. The combination method is highly effective at improving positive-class accuracy, but reduces negative-class accuracy. Uncertainty sampling required examining only 8% of the documents on average, while improving average positive-class accuracy by almost 10 percentage points.
Results: Clustering. For describing a category, the full text appears comparable to the extended anchortext. The anchortext alone appears to do a poor job of describing the category.
Future Work. Include other features of the inbound web pages besides extended anchortext. Examine the effects of the number of inbound links. Examine the nature of the category by expanding to thousands of categories. Study the effects of the positive set size.