Towards Breaking the Quality Curse: A Web-Querying Approach to Web People Search
Dmitri V. Kalashnikov, Rabia Nuray-Turan, Sharad Mehrotra
Dept. of Computer Science, University of California, Irvine
Searching for People on the Web

Approximately 5-10% of internet searches look for specific people by name [Guha & Garg '04]. Challenges:
- Heterogeneity of representations: the same entity is represented in multiple ways (e.g., "Andrew McCallum's Homepage ...Associate Professor...", "ACM: Andrew Kachites McCallum", "Items for Author: McCallum, R. Andrew").
- Ambiguity in representations: the same representation refers to multiple entities (e.g., pages about "Andrew McCallum" the UMass professor, the musician, and the CROSS president).
WePS systems address both of these challenges.
Search Engines vs. WePS

Compare Google results for "Andrew McCallum" with WePS results for the same query: the search engine returns a flat ranked list, whereas a WePS system returns clusters of web pages in which all pages referring to the same entity (e.g., the UMass professor, the photographer, the physician) are grouped together.
Clustering Search Engines vs. WePS

Examples of clustering search engines:
- Clusty - http://www.clusty.com
- Kartoo - http://www.kartoo.com
Difference: while both return clustered search results (compare Clusty's results for "Andrew McCallum" with a WePS system's), clustering search engines group similar/related pages by broad topic (e.g., research, photos), whereas a WePS system clusters by disambiguating among the various namesakes.
WePS System Architecture

WePS architectures are classified by where the references are disambiguated:
- Client-side solution: the client receives the returned pages and resolves the references itself.
- Proxy solution: a proxy between client and server resolves the references and passes clusters to the client.
- Server-side solution: the server resolves the references and returns clusters directly.
Current WePS Approaches

Commercial systems:
- Zoominfo - http://www.zoominfo.com
- Spock - http://www.spock.com
Multiple research proposals:
- [Kamha et al. '04; Artiles et al. '05 & '07; Bekkerman et al. '05; Bollegala et al. '06; Matsou et al. '06], etc.
The Web people search task was included in the SemEval workshop in 2007, with 16 teams participating.
WePS techniques differ from each other in the data analyzed for disambiguation and in the clustering approach used. By type of data analyzed:
- Baseline: exploit information in the result pages only (keywords, named entities, hyperlinks, emails, ...) [Bekkerman et al. '05; Li et al. '05; Elmachioglu et al. '07].
- Exploit external information: ontologies, encyclopedias, ... [Kalashnikov et al. '07].
- Exploit Web content: crawling [Kanani et al. '07]; search engine statistics.
Contributions of This Paper

- Exploiting co-occurrence statistics from the search engine for WePS.
- A novel classification approach based on skylines to support reference resolution.
- Outperforms existing techniques on multiple datasets.
Approach Overview

The pipeline processes the top-K webpages returned for the person-name query in four stages:
1. Preprocessing: extract named entities (NEs) from each page and filter them.
2. Initial clustering: compute similarity between pages using TF/IDF on the NEs and cluster by that similarity (yielding clusters such as Person1, Person2, Person3).
3. Refining clusters using Web queries: issue web queries, extract features from the results, apply a skyline-based classifier, and refine the clusters accordingly.
4. Post-processing: rank the clusters, extract cluster sketches, and rank the pages within clusters.
Preprocessing: Named Entity Extraction

- Locate the query name on the web page.
- Extract the M neighboring named entities: organizations and people names.
- The Stanford Named Entity Recognizer and GATE were used.
Example extracted NEs: People - Lise Getoor, Ben Taskar, John McCallum, Charles; Organization - UMASS, ACM Queue; Location - Amherst.
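As a rough illustration of this step, here is a minimal sketch using spaCy as a stand-in for the Stanford NER/GATE pipeline the paper actually used; the window parameter m and the label set are assumptions.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # stand-in for Stanford NER / GATE

def neighboring_entities(text, query_name, m=10):
    """Return up to m named entities closest to a mention of query_name."""
    doc = nlp(text)
    # Character offsets of every occurrence of the query name.
    mentions = []
    pos = text.find(query_name)
    while pos != -1:
        mentions.append(pos)
        pos = text.find(query_name, pos + 1)
    # Keep only person/organization entities, excluding the query name itself.
    ents = [e for e in doc.ents
            if e.label_ in ("PERSON", "ORG") and query_name not in e.text]
    # Sort entities by distance to the nearest mention and take the m closest.
    def dist(e):
        return min(abs(e.start_char - p) for p in mentions) if mentions else 0
    return sorted(ents, key=dist)[:m]
```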
Preprocessing: Filtering Named Entities

Input: the extracted NEs. Four filters are applied in sequence:
- Location filter: remove location NEs by looking them up in a dictionary.
- Ambiguous-name filter: remove one-word person names, such as 'John'.
- Ambiguous-organization filter: remove common English words.
- Same-last-name filter: remove person names with the same last name as the queried person.
Output: cleaned NEs. A sketch of this filtering step follows.
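A minimal sketch of these four filters, assuming a location gazetteer and a common-words list are available as plain sets of lowercase strings:

```python
def filter_entities(people, orgs, query_name, location_dict, common_words):
    """Apply the four filters; people and orgs are sets of entity strings."""
    last_name = query_name.split()[-1].lower()
    # Location filter: drop NEs found in the location dictionary.
    people = {p for p in people if p.lower() not in location_dict}
    orgs = {o for o in orgs if o.lower() not in location_dict}
    # Ambiguous-name filter: drop one-word person names such as 'John'.
    people = {p for p in people if len(p.split()) > 1}
    # Ambiguous-organization filter: drop common English words.
    orgs = {o for o in orgs if o.lower() not in common_words}
    # Same-last-name filter: drop names sharing the query's last name.
    people = {p for p in people if p.split()[-1].lower() != last_name}
    return people, orgs
```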
Initial Clustering

Intuition: if two web pages "significantly" overlap in terms of their associated NEs, then the two pages refer to the same individual.
Approach:
- Compute the similarity between web pages using TF/IDF over the named entities.
- Merge all pairs of web pages whose TF/IDF similarity exceeds a threshold (e.g., 0.5).
- The threshold is learned over the training dataset.
This resolves the references between high-similarity pairs while leaving low-similarity pairs (e.g., pairs with similarities of 0.1-0.3) unresolved.
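A minimal sketch of this stage, treating each page's cleaned NEs as a pseudo-document and merging above-threshold pairs with union-find; scikit-learn's TfidfVectorizer stands in for the paper's own TF/IDF computation:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def initial_clusters(page_nes, threshold=0.5):
    """page_nes: one list of NE strings per page. Returns cluster labels."""
    docs = [" ".join(nes) for nes in page_nes]
    sim = cosine_similarity(TfidfVectorizer().fit_transform(docs))
    # Union-find: merge every pair whose similarity exceeds the threshold.
    parent = list(range(len(docs)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if sim[i, j] > threshold:
                parent[find(i)] = find(j)
    return [find(i) for i in range(len(docs))]
```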
Cluster Refinement

The initial clustering based on TF/IDF similarity is:
- Accurate: the references it resolves are usually correct.
- Conservative: it leaves too many references unresolved. For example, a page placing Andrew McCallum at CMU and one placing him at UMASS may have a TF/IDF similarity near 0.01, even though they describe the same person (job at UMASS, PhD from CMU).
Key observation: the Web may contain additional information that can resolve such references, namely the co-occurrence between the contexts associated with the unresolved references. These co-occurrence statistics can be learned by querying the search engine.
Using Co-occurrence to Disambiguate

Consider two result pages D1 and D2 for the query N = "Andrew McCallum": one with context C1 = UMASS and one with context C2 = CMU. The overlap between these contexts over the Web can be measured by the number of web pages containing "Andrew McCallum" AND "UMASS" AND "CMU", obtained from the search engine. The overlap between the contexts associated with two pages of the same entity is expected to be higher than for pages of unrelated entities.
Should we merge web pages Di and Dj based on |N AND Ci AND Cj| / |N|? Not directly: if the context NEs Ci and Cj are very "general", the co-occurrence counts can be high for all namesakes. Solution: normalize |N AND Ci AND Cj| using Dice similarity.
Dice Similarity

Two possible versions of Dice, where |Q| denotes the search engine's hit count for query Q:

Dice1(N, Ci, Cj) = |N AND Ci AND Cj| / (|N AND Ci| + |N AND Cj|)
Dice2(N, Ci, Cj) = |N AND Ci AND Cj| / (|N| + |Ci AND Cj|)

Dice1 will be large for "general context" words, since its denominator never accounts for how frequent Ci and Cj are on the Web at large; Dice2 penalizes such words through the |Ci AND Cj| term.
Dice2 Similarity -- Example

Let an N O1 O2 query be Q: "Andrew McCallum" AND "Yahoo" AND "Google", with hit counts:
- "Andrew McCallum" AND "Yahoo" AND "Google" - 522
- "Andrew McCallum" AND "Yahoo" - 2,470
- "Andrew McCallum" AND "Google" - 8,950
- "Andrew McCallum" - 38,500
- "Yahoo" AND "Google" - 73,000,000
Then Dice1 = 522 / (2,470 + 8,950) = 0.05, while Dice2 = 522 / (38,500 + 73,000,000) = 7.15e-6.
Dice2 captures that Yahoo and Google are too unspecific as organization entities to be a good context for "Andrew McCallum". Dice2 leads to visibly better results in practice!
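A small helper reproducing this computation from raw hit counts; the hard-coded counts below are exactly the slide's example:

```python
def dice1(n_ci_cj, n_ci, n_cj):
    """Dice1: normalizes by the name+context counts only."""
    return n_ci_cj / (n_ci + n_cj)

def dice2(n_ci_cj, n, ci_cj):
    """Dice2: normalizes by the name count and the global context count."""
    return n_ci_cj / (n + ci_cj)

# "Andrew McCallum" with contexts "Yahoo" and "Google":
print(dice1(522, 2470, 8950))        # ~0.046, rounded to 0.05 on the slide
print(dice2(522, 38500, 73000000))   # ~7.15e-6: the contexts are too general
```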
Web Queries to Gather Co-Occurrence Statistics

Given two pages with contexts
- Page i: People NEs {Lise Getoor, Ben Taskar}; Organizations {UMASS, ACM}
- Page j: People NEs {Charles A. Sutton, Ben Wellner}; Organizations {DBLP, KDD}
four queries measure the association between the contexts:
- Q1 = (L.G. OR B.T.) AND (C.A.S. OR B.W.)  (#97)
- Q2 = (L.G. OR B.T.) AND (DBLP OR KDD)  (#2,570)
- Q3 = (UMASS OR ACM) AND (C.A.S. OR B.W.)  (#720)
- Q4 = (UMASS OR ACM) AND (DBLP OR KDD)  (#2,190K)
and four more measure the association among contexts within pages relevant to the original query N: N AND Q1 (#79), N AND Q2 (#49), N AND Q3 (#63), N AND Q4 (#10,100) -- 8 web queries in total.
Ex: Q1 = ("Lise Getoor" OR "Ben Taskar") AND ("Charles A. Sutton" OR "Ben Wellner")
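A sketch of how these eight queries could be assembled; hit_count at the bottom is a hypothetical wrapper around a search engine API, not something the paper specifies:

```python
def or_group(terms):
    """('A', 'B') -> '("A" OR "B")'"""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"

def co_occurrence_queries(name, people_i, orgs_i, people_j, orgs_j):
    """Build Q1..Q4 and their name-restricted variants (8 queries total)."""
    sides_i = [or_group(people_i), or_group(orgs_i)]
    sides_j = [or_group(people_j), or_group(orgs_j)]
    qs = [f"{a} AND {b}" for a in sides_i for b in sides_j]   # Q1..Q4
    return qs + [f'"{name}" AND {q}' for q in qs]             # N AND Q1..Q4

queries = co_occurrence_queries(
    "Andrew McCallum",
    ["Lise Getoor", "Ben Taskar"], ["UMASS", "ACM"],
    ["Charles A. Sutton", "Ben Wellner"], ["DBLP", "KDD"])
# counts = [hit_count(q) for q in queries]  # hypothetical search-engine call
```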
Creating Features from Web Co-Occurrence Counts

From the eight query counts, an 8-feature vector is built for the page pair: a normalized co-occurrence count and a Dice score for each of the four context combinations (here |N| = 152K):
1. |N Pi Pj| = 79/152K = 0.00052
2. Dice(N, Pi Pj) = 0.000918
3. |N Pi Oj| = 49/152K = 0.00032
4. Dice(N, Pi Oj) = 0.000561
5. |N Oi Pj| = 63/152K = 0.00041
6. Dice(N, Oi Pj) = 0.00073
7. |N Oi Oj| = 10,100/152K = 0.066
8. Dice(N, Oi Oj) = 0.008552
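Continuing the sketch above, the 8D vector could be assembled like this; since the slide does not show which sub-counts feed its Dice denominators, the Dice scores are passed in precomputed:

```python
def feature_vector(n_count, nq_counts, dice_scores):
    """8D edge features: interleave normalized counts |N AND Qk| / |N|
    with the corresponding Dice score for each context combination."""
    vec = []
    for nq, dice in zip(nq_counts, dice_scores):
        vec.append(nq / n_count)  # normalized co-occurrence count
        vec.append(dice)          # Dice similarity for the same combination
    return vec

# Counts from the slide: |N| = 152K and |N AND Qk| for k = 1..4.
print(feature_vector(152000, [79, 49, 63, 10100],
                     [0.000918, 0.000561, 0.00073, 0.008552]))
```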
Role of a Classifier

Starting from the initial clusters, web co-occurrence feature generation assigns each inter-cluster edge an 8D feature vector (e.g., (0.1, 0.8, ..., 0.4)). A classifier then marks each edge as "merge" (+) or "do not merge" (-), and the resulting merges produce the final clusters. Since the features are "similarity"-based, the classifier should satisfy the dominance property.
Dominance Property

Let f and g be 8D feature vectors, f = (f1, f2, ..., f8) and g = (g1, g2, ..., g8).
Notation:
- f up-dominates g if f1 ≤ g1, f2 ≤ g2, ..., f8 ≤ g8.
- f down-dominates g if f1 ≥ g1, f2 ≥ g2, ..., f8 ≥ g8.
Dominance property for a classifier: let f and g be the features associated with edges (di, dj) and (dm, dn) respectively. If f ≤ g componentwise and the classifier decides to merge (di, dj), then it must also merge (dm, dn).
Skyline-Based Classification

A skyline classifier consists of an up-dominating classification skyline (CS):
- All points "above" or "on" the skyline are declared "merge" ("+").
- The rest are declared "do not merge" ("-").
- The dominance property is therefore satisfied automatically.
Our goal is to learn a classification skyline that yields a good final clustering of the search results. A sketch of the classification rule follows.
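A minimal sketch of this rule, assuming the skyline is stored as a list of 8D points: an edge is classified "+" exactly when its feature vector lies on or above the skyline, i.e., down-dominates (is componentwise ≥) some skyline point.

```python
def down_dominates(f, g):
    """True if f is componentwise >= g (f 'down-dominates' g)."""
    return all(fi >= gi for fi, gi in zip(f, g))

def classify(edge_features, skyline):
    """'+' (merge) if the edge lies on or above the classification skyline."""
    if any(down_dominates(edge_features, s) for s in skyline):
        return "+"
    return "-"
```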
Learning Classification Skyline

A greedy approach, sketched in code below:
1. Initialize the classification skyline (CS) with the point that dominates the entire space.
2. Maintain a pool of candidate points to add to the CS: the down-dominating skyline of the still-active "+" edges.
3. Maintain the "current" clustering, i.e., the clustering induced by the current CS.
4. Choose a point from the pool to add to the CS in two stages:
   - Stage 1: minimize the loss of quality due to adding new "-" edges.
   - Stage 2: maximize the overall quality.
5. Iterate until done, then output the best skyline discovered.
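A high-level sketch of this greedy loop, assuming helpers candidate_pool, clustering_from_skyline, and a quality metric such as Fp are available; this illustrates the control flow only and collapses the two stages into one scoring call, so it is not the paper's exact procedure:

```python
def learn_skyline(train_edges, ground_truth, quality,
                  candidate_pool, clustering_from_skyline, max_iters=100):
    """Greedily grow a classification skyline on training data."""
    top = tuple([float("inf")] * 8)      # dominates the entire 8D space
    skyline = [top]
    best_skyline, best_q = list(skyline), -1.0
    for _ in range(max_iters):
        pool = candidate_pool(train_edges, skyline)  # active "+" edge skyline
        if not pool:
            break
        scored = []
        for p in pool:
            clustering = clustering_from_skyline(skyline + [p], train_edges)
            # Stage 1 would penalize newly admitted "-" edges; Stage 2 would
            # maximize overall quality. Both are folded into one score here.
            scored.append((quality(clustering, ground_truth), p))
        q, p = max(scored)               # keep the best-scoring candidate
        skyline.append(p)
        if q > best_q:
            best_skyline, best_q = list(skyline), q
    return best_skyline                  # best skyline discovered
```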
Example: Adding a Point to CS

Given a ground-truth clustering, an initial clustering, and the current clustering induced by the current CS = {q, d, c}, the pool of candidate points is {f, e, u, h}.
Example: Considering f

What if we add f to the CS? Stage 1 (adding edge k): Fp = 0.805. Stage 2 (adding edge f): Fp = 0.860.
Example: Considering e

What if we add e to the CS? Stage 1 (adding edges k and v): Fp = 0.718. Stage 2 (adding edge e): Fp = 0.774.
Example: Considering u

What if we add u to the CS? Stage 1 (adding edge i): Fp = 0.805. Stage 2 (adding edge u): Fp = 0.833.
Example: Considering h

What if we add h to the CS? Stage 1 (adding edges i and t): Fp = 0.625. Stage 2 (adding edge h): Fp = 0.741.
Example: Stage 2

After Stage 1, only f and u remain (both with Fp = 0.805). Stage 2 gives Fp = 0.860 for f and Fp = 0.833 for u, so f is added to the CS.
Why a "Skyline-Based" Classifier?

It is a specialized classifier:
- Designed specifically for the problem at hand; it is aware of the clustering underneath.
- Fine-tunes itself to a given clustering quality metric, such as F1, Fp, or FB.
- Takes into account the dominance structure of the 8D data.
- Takes into account the transitivity of merges: classifying one edge as "+" might lead to classifying multiple edges as "+", for instance when two large clusters are merged.
Post Processing

The final stage ranks the output and summarizes each cluster:
- Page ranking: use the original ranking from the search engine.
- Cluster ranking: for each cluster, select its highest-ranked page and order the clusters by that page's rank.
- Cluster sketches: coalesce all pages in the cluster into a single page, remove stop words, compute the TF/IDF of the remaining words, and select the set of terms above a certain threshold (or the top N terms). A sketch of these steps follows.
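A minimal sketch of the cluster-ranking and sketch-extraction steps, again using scikit-learn's TfidfVectorizer as a stand-in; the built-in English stop-word list and the top_n cutoff are assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def rank_clusters(clusters):
    """clusters: list of lists of (search_rank, page_text) pairs.
    Order clusters by the best original search-engine rank inside each."""
    return sorted(clusters, key=lambda c: min(rank for rank, _ in c))

def cluster_sketches(clusters, top_n=10):
    """Coalesce each cluster's pages and keep its top-N TF/IDF terms."""
    docs = [" ".join(text for _, text in c) for c in clusters]
    vec = TfidfVectorizer(stop_words="english")  # drops stop words
    tfidf = vec.fit_transform(docs)
    terms = vec.get_feature_names_out()
    sketches = []
    for row in tfidf.toarray():
        best = row.argsort()[::-1][:top_n]       # highest-weight terms first
        sketches.append([terms[i] for i in best if row[i] > 0])
    return sketches
```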
Experiments: Datasets

- WWW'05, by Bekkerman and McCallum: 12 different person names, all related to each other; each person name has a different number of namesakes.
- SIGIR'05, by Artiles et al.: 9 different person names, each with a different number of namesakes.
- WEPS, for the SemEval workshop on the WePS task (by Artiles et al.): a training dataset (49 different person names) and a test dataset (30 different person names).
Quality Measures

- Fp: the harmonic mean of purity and inverse purity.
- FB: the harmonic mean of B-cubed precision and recall.
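Spelled out, recovering the slide's equations from these harmonic-mean definitions:

```latex
F_P = \frac{2 \cdot \mathrm{Purity} \cdot \mathrm{InversePurity}}
           {\mathrm{Purity} + \mathrm{InversePurity}},
\qquad
F_B = \frac{2 \cdot \mathrm{Precision}_{B^3} \cdot \mathrm{Recall}_{B^3}}
           {\mathrm{Precision}_{B^3} + \mathrm{Recall}_{B^3}}
```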
Experiments: Baselines

- Named-entity-based clustering (NE): the TF/IDF merging scheme that uses the NEs in the web pages.
- Web-based co-occurrence clustering (WebDice): computes the similarity between web pages by summing the individual Dice similarities of their People-People and People-Organization context pairs.
Experiments on WWW & SIGIR Datasets
Experiments on the WEPS Dataset
Effect of Initial Clustering
Conclusion and Future Work

- Web co-occurrence statistics can significantly improve disambiguation quality in the WePS task.
- The skyline classifier is effective in making merge decisions.
However, the current approach has limitations:
- It issues a large number of search engine queries to collect statistics, so given today's query limits it will work only as a server-side solution.
- It is based on the assumption that the contexts of different entities are very different, and may not work as well in the difficult case where the contexts of different entities significantly overlap.
Additional Slides
Techniques Tested in SemEval 2007

CU-COMSEM:
- Uses various token-based and phrase-based features.
- Agglomerative clustering with single linkage.
PSNUS:
- Merging using TF/IDF on named entities.
IRST-BP:
- Features include named entities, frequent words (phrases), time expressions, etc.
- The similarity score is the sum of the individual scores of the common features, which are weighted according to the maximum of the distances between the name of interest and the feature.
Techniques Tested in SemEval 2007 (Cont.)

AUG:
- A combined supervised and unsupervised approach.
- Classification based on feature vectors forms the initial clusters; clustering is based on a weighted keyword matrix; the two clustering results are then combined.
DFKI2:
- An information-extraction-based approach.
- Constructs two feature vectors for each file based on the counts of the NEs and the predicate-argument structures that contain the specific person name.
- Uses hierarchical clustering.
FICO:
- An approach via weighted similarity of entity contexts.
- Uses agglomerative clustering.
Techniques Tested in SemEval 2007 (Cont.)

JHU1:
- Uses web snippets as additional features.
- K-means clustering on rich-feature-enhanced document vectors.
TITPI:
- Uses a semi-supervised clustering approach to integrate similar documents into a labeled document.
- Controls the fluctuation of the centroid of a cluster in order to generate more accurate clusters.
SHEF:
- Agglomerative clustering on feature vectors created by summarization and semantic-tagging analysis components.
Techniques Tested in SemEval 2007 (Cont.)

UA-ZSA:
- Discovers the different underlying meanings of a name on the basis of the title of the web page, the body content, the named entities shared across the documents, and the sub-links.
- Applies K-means clustering.
UC3M_13:
- Web documents are represented as different sets of features or terms of different types (bag of words, URLs, names, and numbers).
- Agglomerative clustering based on the composition of simple bags of typed terms.
UNN-WePS:
- Uses the co-presence of person names to form seed clusters.
- Extends seed clusters with pages that are deemed conceptually similar based on a lexical-chaining analysis computed using Roget's thesaurus.
Techniques Tested in SemEval 2007 (Cont.)

UVA:
- Uses language modeling techniques.
- Single-pass clustering using title, snippet, and body to represent documents.
WIT:
- Uses a graph to model the web pages.
- Computes a similarity matrix for the web pages using random walks over the graph.
TF/IDF Similarity

For a term ti (i.e., an NE) and a document dj:

tf(i,j) = n(i,j) / Σk n(k,j)

where n(i,j) is the number of occurrences of term ti in document dj and the denominator is the number of occurrences of all terms in dj;

idf(i) = log( |D| / |{d : ti ∈ d}| )

where |{d : ti ∈ d}| is the number of documents in which term ti appears and |D| is the total number of documents in the corpus; and

tfidf(i,j) = tf(i,j) × idf(i).
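For completeness, a self-contained version of this computation over the NE pseudo-documents (pure Python, matching the formulas above rather than any library's variant):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists (the NEs of each page).
    Returns one {term: tf*idf} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = sum(counts.values())  # occurrences of all terms in dj
        vectors.append({t: (c / total) * math.log(n_docs / df[t])
                        for t, c in counts.items()})
    return vectors
```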