Published by Lindsay Gilbert. Modified over 8 years ago.
1
A Personalized Search Engine Based on Web Snippet Hierarchical Clustering
Paolo Ferragina, Antonio Gulli
Presented by Bin Tan
2
Clustering Web Search Results
Challenges:
- Clustering works on short snippets instead of whole documents
- Clustering must be done on the fly
- Clusters should be labeled with meaningful text (accurate and intelligible)
- Clusters need to be distinctive
Systems: Vivisimo, SnakeT
3
Categorization of Works
- Flat clustering vs. hierarchical clustering
- Label representation: bag of words vs. contiguous phrase vs. non-contiguous phrase ("gapped sentence")
4
Preprocessing
- Fetch snippets from 16 search engines
- Enrich snippets with anchor texts from a crawled database of 200M web pages
5
Identification of Candidate Phrases for Labels
- Enumerate all pairs of words within a certain proximity window (of size 4) in snippets
- Score them based on:
  - NLP features: part of speech (PoS), named entities (NE)
  - ODP occurrences: term frequency (collection frequency * inverse category frequency?), containing category
- Discard low-score pairs
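The enumeration step on this slide can be illustrated with a minimal sketch. It makes several assumptions that are not in the slides: a window of size 4 is read as "a word and the three words that follow it", tokenization is a plain whitespace split, and the score is a raw co-occurrence count standing in for the NLP and ODP features. The names `candidate_pairs` and `discard_low_score` are hypothetical.

```python
from collections import Counter

def candidate_pairs(snippets, window=4):
    """Count ordered word pairs that co-occur within `window` tokens."""
    counts = Counter()
    for snippet in snippets:
        tokens = snippet.lower().split()  # assumption: whitespace tokenization
        for i, first in enumerate(tokens):
            # Pair `first` with each later word that still falls in the window.
            for second in tokens[i + 1 : i + window]:
                counts[(first, second)] += 1
    return counts

def discard_low_score(counts, min_count=2):
    """Keep pairs whose stand-in score (raw count) reaches `min_count`."""
    return {pair: n for pair, n in counts.items() if n >= min_count}

snippets = ["jaguar car dealer", "jaguar car price"]
pairs = candidate_pairs(snippets)
kept = discard_low_score(pairs)
```

Here only the pair ("jaguar", "car") appears in both snippets, so it is the only one that survives the threshold. Ordered pairs are used so that word order is preserved when pairs are later merged into longer phrases.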
6
Identification of Candidate Phrases for Labels (cont.)
- Word pairs are atomic phrases (how about single words?)
- Incrementally merge word pairs into longer phrases (preserve ordering and limit size)
- Score each phrase based on its constituents' scores
- Discard low-score phrases
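One possible way to picture the incremental merge is sketched below. The slides do not say how pairs are joined or how constituent scores are combined, so everything here is an assumption: a phrase is extended when its last word is the first word of a scored pair, a phrase's score is the mean of its constituent pair scores, and `max_len` and `min_score` are hypothetical parameters.

```python
def merge_phrases(pair_scores, max_len=3, min_score=0.5):
    """Grow ordered phrases from scored word pairs (illustrative only)."""
    phrases = dict(pair_scores)   # phrase tuple -> score
    frontier = dict(pair_scores)  # phrases that may still be extended
    while frontier:
        grown = {}
        for phrase, score in frontier.items():
            if len(phrase) >= max_len:
                continue  # size limit reached
            for (a, b), pair_score in pair_scores.items():
                # Extend only at the end, so word order is preserved.
                if phrase[-1] == a and b not in phrase:
                    n_pairs = len(phrase) - 1
                    # Assumption: phrase score = mean of its pair scores.
                    new_score = (score * n_pairs + pair_score) / (n_pairs + 1)
                    if new_score >= min_score:
                        grown[phrase + (b,)] = new_score
        phrases.update(grown)
        frontier = grown
    return phrases

pair_scores = {
    ("web", "search"): 1.0,
    ("search", "engine"): 0.8,
    ("engine", "room"): 0.1,
}
phrases = merge_phrases(pair_scores)
```

In this example ("web", "search", "engine") is kept with a mean score of about 0.9, while ("search", "engine", "room") falls below the threshold and is discarded.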
7
Hierarchical Clustering
- Group all snippets containing a candidate phrase into an atomic cluster (clusters may overlap)
- Primary label: the aforementioned candidate phrase
- Secondary labels: other candidate phrases occurring in 80% of the snippets in the cluster
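The atomic-cluster step might look roughly like the sketch below. It assumes, for simplicity, that candidate phrases are plain strings matched by substring (the slides allow gapped phrases) and that "80%" means "at least 80%". The function name `atomic_clusters` and the dictionary layout are hypothetical.

```python
def atomic_clusters(snippets, phrases, secondary_ratio=0.8):
    """Build one (possibly overlapping) cluster per candidate phrase."""
    lowered = [s.lower() for s in snippets]
    clusters = []
    for phrase in phrases:
        # Indices of all snippets that contain the primary phrase.
        members = [i for i, s in enumerate(lowered) if phrase in s]
        if not members:
            continue
        # Secondary labels: other phrases found in >= 80% of the members.
        secondary = [
            other for other in phrases
            if other != phrase
            and sum(other in lowered[i] for i in members)
                >= secondary_ratio * len(members)
        ]
        clusters.append(
            {"primary": phrase, "secondary": secondary, "snippets": members}
        )
    return clusters

snippets = ["jaguar car dealer", "jaguar car price", "jaguar cat habitat"]
clusters = atomic_clusters(snippets, ["jaguar", "car", "cat"])
```

The snippet "jaguar car dealer" ends up in both the "jaguar" and the "car" clusters, which shows the overlap the slide allows. The "car" cluster gets "jaguar" as a secondary label because every one of its snippets contains it.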
8
Hierarchical Clustering (cont.)
- Merge atomic clusters into candidate second-level clusters if they share primary/secondary labels
- Primary label: the shared label
- Secondary labels: other labels occurring in 80% of the snippets in the cluster
- Prune second-level clusters that have similar coverage or similar labels
- Recursively produce third-level clusters
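A rough sketch of the merge-and-prune step follows. The slides do not define "similar coverage", so Jaccard overlap of the snippet sets with a hypothetical `max_jaccard` threshold is used here as a stand-in. Secondary labels for the merged clusters and the recursion to a third level are left out. The input dictionaries use an assumed layout with `primary`, `secondary`, and `snippets` keys.

```python
from collections import defaultdict

def jaccard(a, b):
    """Overlap of two non-empty sets, between 0 and 1."""
    return len(a & b) / len(a | b)

def second_level(clusters, max_jaccard=0.9):
    """Merge atomic clusters that share a label, then prune near-duplicates."""
    by_label = defaultdict(list)
    for cluster in clusters:
        for label in [cluster["primary"], *cluster["secondary"]]:
            by_label[label].append(cluster)

    parents = []
    for label, children in by_label.items():
        if len(children) < 2:
            continue  # a label must be shared by at least two clusters
        covered = set().union(*(c["snippets"] for c in children))
        parents.append({
            "primary": label,  # the shared label
            "children": [c["primary"] for c in children],
            "snippets": covered,
        })

    # Assumption: prune a parent whose coverage nearly equals a kept one.
    kept = []
    for parent in sorted(parents, key=lambda p: len(p["snippets"]), reverse=True):
        if all(jaccard(parent["snippets"], k["snippets"]) < max_jaccard for k in kept):
            kept.append(parent)
    return kept

atomic = [
    {"primary": "dealer", "secondary": ["car"], "snippets": [0, 1]},
    {"primary": "price", "secondary": ["car"], "snippets": [1, 2]},
    {"primary": "habitat", "secondary": [], "snippets": [3]},
]
merged = second_level(atomic)
```

In this example "dealer" and "price" share the secondary label "car", so they are grouped under a second-level cluster labeled "car" that covers snippets 0, 1, and 2. The "habitat" cluster shares no label and stays ungrouped.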
9
How SnakeT Can Be Used
- Hierarchical browsing for knowledge extraction
- Hierarchical browsing for result selection
- Query reformulation
- Personalized ranking (?)
10
Evaluation
11
Evaluation (cont.)
12
Clustering technology: PageRank of the future?
Pros:
- Ambiguous query: narrow down the result list
- Less-ambiguous query: get a bird's-eye view of different aspects
Cons:
- Clustering is slow but often unnecessary
- It takes time to look at the clusters
- Cluster and label quality still leaves something to be desired