1
Website Clustering
Combining Website Lexical Data and Query Semantic Data
Nana Huang, Ray Li
2
Traditional Lexical Features
Traditional website clustering uses lexical data parsed from each webpage to group websites into categories:
Regular text
Meta tags (description, keywords, author)
What if the webpage consists mainly of content generated automatically by scripts?
What if the webpage is an empty frame page with two or more frames?
3
AOL Clickthrough Data
In August 2006, AOL released 2.2 GB of search logs, which include queries, the websites clicked, and the rank of each clicked result.
Query                    Rank  Clicked URL
brochures for business   5     http://www.hp.com
brochures for business   6     http://www.hansonmarketing.com
brochures for business   8     http://www.smallbusinessbrief.com
brochures for business   10    http://www.quickbrochures.com
brochures for business   9     http://www.smallbusinessbrief.com
brochures for business   7     http://www.printingforless.com
4
Query-Website Graph
We parsed a subset of this data to build a query-document bipartite graph, in which each edge is weighted by the number of times a query led to a website being clicked.
[Figure: bipartite graph linking queries Q1 to Q5 with documents D1 to D5]
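The graph construction described on this slide can be sketched as follows. This is a minimal illustration, not the authors' parser: the function name `build_click_graph` and the record layout (one `(query, url)` pair per click) are assumptions, and the second query in the sample data is hypothetical.

```python
from collections import defaultdict


def build_click_graph(records):
    """Aggregate (query, url) click records into weighted bipartite edges.

    Returns a dict mapping each query to a dict of {url: click_count}.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for query, url in records:
        graph[query][url] += 1  # edge weight = number of clicks
    return {query: dict(urls) for query, urls in graph.items()}


# Sample click records; the first query and its URLs come from the log
# excerpt, the second query is a hypothetical addition for illustration.
records = [
    ("brochures for business", "http://www.hp.com"),
    ("brochures for business", "http://www.smallbusinessbrief.com"),
    ("brochures for business", "http://www.smallbusinessbrief.com"),
    ("business brochure printing", "http://www.printingforless.com"),
]

graph = build_click_graph(records)
for query, urls in graph.items():
    print(query, urls)
```

Storing the edges as nested dictionaries keeps the sparse structure explicit; a sparse matrix would be a reasonable alternative once queries and websites are mapped to integer indices.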
5
Query-Website Graph
A graph like this is most likely too sparse to be useful: many plausible ‘clicks’ between queries and other related webpages are never observed.
We use an iterative process to ‘smooth’ the bipartite relationship between queries and websites, based on two observations:
Documents are considered ‘similar’ to some extent if they have been clicked for the same query.
Queries are considered ‘similar’ to some extent if they lead to the same document.
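The slides do not spell out the smoothing procedure, so the following is only one possible reading of the two observations above, written with NumPy. The function name `smooth_clicks`, the mixing weight `alpha`, the iteration count, and the use of cosine similarity are all assumptions made for illustration.

```python
import numpy as np


def smooth_clicks(clicks, alpha=0.5, iterations=2):
    """Spread click weight to unobserved query-document pairs.

    clicks: (n_queries, n_docs) matrix of observed click counts.
    alpha:  assumed mixing weight between observed and propagated clicks.
    """
    C = np.asarray(clicks, dtype=float)
    S = C.copy()
    for _ in range(iterations):
        # Queries are similar if they lead to the same documents (row cosine).
        q = S / np.maximum(np.linalg.norm(S, axis=1, keepdims=True), 1e-12)
        query_sim = q @ q.T
        # Documents are similar if they are clicked for the same queries
        # (column cosine).
        d = S / np.maximum(np.linalg.norm(S, axis=0, keepdims=True), 1e-12)
        doc_sim = d.T @ d
        # Propagate clicks through similar queries and similar documents.
        propagated = 0.5 * (query_sim @ S + S @ doc_sim)
        S = (1.0 - alpha) * C + alpha * propagated
    return S


# Q1 clicked D1 and D2; Q2 clicked D2 and D3. They share D2, so the
# smoothed graph gains weight on the unobserved pairs Q1-D3 and Q2-D1.
observed = np.array([[1, 1, 0],
                     [0, 1, 1]])
smoothed = smooth_clicks(observed)
print(np.round(smoothed, 3))
```

In this sketch, pairs with no shared query or document stay at zero, so smoothing only fills in links that the existing click structure supports.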
6
Query-Website Graph
This produces a more realistic query-website bipartite relationship.
We can then use the list of queries associated with each website as a semantic feature vector.
[Figure: two bipartite graphs over queries Q1, Q2 and documents D1, D2, D3]
7
Combined Feature Vectors
We have three sets of feature vectors for each document:
Lexical features (text and various HTML tags from the webpage itself)
Semantic features (query information related to each webpage)
A combination of both
There are 10,000 words and 2,000 queries, which is too many features.
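A minimal sketch of the combined vector, assuming it is formed by concatenating the lexical and semantic vectors of each website (which matches the 10,000 + 2,000 feature count). The per-block L2 normalization is an added assumption so that neither block dominates by scale; the helper names and the toy matrices are hypothetical.

```python
import numpy as np


def l2_normalize_rows(X):
    """Scale each row to unit length; all-zero rows are left as zeros."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, 1e-12)


def combine_features(lexical, semantic):
    """Concatenate normalized lexical and semantic features per website."""
    return np.hstack([l2_normalize_rows(lexical), l2_normalize_rows(semantic)])


# Toy data: 2 websites, 4 word features, 2 query features.
lexical = np.array([[3, 0, 1, 0],
                    [0, 2, 0, 2]])
semantic = np.array([[5, 0],
                     [0, 1]])

combined = combine_features(lexical, semantic)
print(combined.shape)
```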
8
Latent Semantic Analysis
We then apply Latent Semantic Analysis to reduce the 12,000 features to a lower-rank approximation with 30 ‘virtual concepts’.
{Chicken, Beef, Apple, Oranges} -> {Meat, Fruits}
Each website is transformed from its original feature vector into a new vector of ‘virtual concepts’.
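Latent Semantic Analysis is commonly computed with a truncated singular value decomposition, and the sketch below follows that approach on a toy version of the Chicken/Beef/Apple/Oranges example. The function name `lsa_reduce` and the toy counts are assumptions; the slides use 30 concepts, while this example keeps 2.

```python
import numpy as np


def lsa_reduce(X, n_concepts):
    """Project rows of X onto the top singular directions (truncated SVD)."""
    X = np.asarray(X, dtype=float)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    k = min(n_concepts, len(s))
    # Each row becomes its coordinates in the k-dimensional concept space.
    return U[:, :k] * s[:k]


# Rows are websites; columns are counts of chicken, beef, apple, oranges.
X = np.array([[2, 1, 0, 0],
              [1, 2, 0, 0],
              [0, 0, 1, 2],
              [0, 0, 2, 1]])

Z = lsa_reduce(X, n_concepts=2)
print(Z.shape)
```

In the reduced space the two "meat" websites end up at the same point, as do the two "fruit" websites, while the two groups stay far apart.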
9
K-Means + Results
We then apply K-means in this new vector space to group the websites into categories.
The results show that, while using only the semantic query vector performs worse than using the lexical feature vector, combining both yields slightly better clustering performance.
Features                   F1
Lexical + Semantic Query   0.50
Lexical only               0.47
Queries only               0.30
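The K-means step can be sketched with a plain implementation of Lloyd's algorithm. The slides do not give the number of clusters, the initialization, or how F1 was computed, so the function `kmeans`, its parameters, and the toy points below are assumptions for illustration only.

```python
import numpy as np


def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate nearest-centroid assignment and updates."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n, k).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


# Toy concept-space vectors forming two well-separated groups.
points = np.array([[0.0, 0.0],
                   [0.1, 0.0],
                   [5.0, 5.0],
                   [5.1, 5.0]])

labels, centroids = kmeans(points, k=2)
print(labels)
```

Cluster indices are arbitrary, so only the grouping matters: the first two points share one label and the last two share the other.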