Disambiguation Algorithm for People Search on the Web


Disambiguation Algorithm for People Search on the Web. Dmitri V. Kalashnikov, Sharad Mehrotra, Zhaoqi Chen, Rabia Nuray-Turan, Naveen Ashish. Computer Science Department, University of California, Irvine. For questions visit: http://www.ics.uci.edu/~dvk. If somebody has questions during the presentation that you cannot answer, please ask them to contact Dmitri V. Kalashnikov via the URL provided. That webpage contains Dmitri V. Kalashnikov's email address, which is not given here to avoid spam.

Entity (People) Search. Problem definition: the goal is to group the webpages that co-refer (talk about the same person). Why? Better (next-generation) search capabilities! [Figure labels: top-K webpages grouped into Person1, Person2, Person3, marked "unknown beforehand".]

Standard Approach to Entity Resolution. So, how do you solve the disambiguation problem? This slide shows the standard approach: using features. A feature in this case can be the TF/IDF of the webpages.
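As a rough illustration of the feature-based approach, the sketch below compares webpages by the cosine similarity of their TF/IDF vectors. The tokenized sample pages, the function names, and the particular TF/IDF weighting (term frequency times log inverse document frequency) are assumptions made for this example, not details taken from the slides.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: tf-idf weight} dict per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical, already-tokenized pages returned for one name query.
pages = [
    "andrew mccallum machine learning professor umass".split(),
    "mccallum umass machine learning research".split(),
    "andrew mccallum customer support contact".split(),
]
vecs = tfidf_vectors(pages)
sim_01 = cosine(vecs[0], vecs[1])  # pages sharing research vocabulary
sim_02 = cosine(vecs[0], vecs[2])  # pages sharing little beyond the name
```

In this toy input the name itself gets zero weight because it occurs on every page, so only the surrounding vocabulary separates the namesakes.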

Key Observation: More Info is Available. This is a key observation for understanding the RelDC framework: relational data can be represented in the form of an Entity-Relationship (ER) graph, in which nodes are entities and edges are relationships.

RelDC Framework. The RelDC framework combines methods for using features, context (features derived from context), and relationships. Intuition: the presence of a path between X and Y might indicate that they co-refer. Thus, analyze paths!

Where is the Graph here? Use Extraction! Unlike in a regular (structured) database, the graph is not readily available here. Solution: from each webpage, extract:
- Named entities (using GATE)
- Hyperlinks/emails, which connect them as explained in the paper (parse links/emails)
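The link and email part of the extraction step could look roughly like the sketch below, which uses only the Python standard library. The class and function names, the simplified email pattern, and the sample HTML are assumptions for illustration; named-entity extraction with GATE is not shown.

```python
import re
from html.parser import HTMLParser

# Deliberately simple email pattern; real-world addresses can be more varied.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links_and_emails(html):
    """Return (hyperlinks, emails) found in one webpage's HTML."""
    collector = LinkCollector()
    collector.feed(html)
    hyperlinks = [u for u in collector.links if not u.startswith("mailto:")]
    emails = sorted(set(EMAIL_RE.findall(html)))
    return hyperlinks, emails

# Hypothetical page fragment.
sample = ('<p>Contact <a href="mailto:jane@example.org">Jane</a> or see '
          '<a href="http://example.org/lab">the lab page</a>.</p>')
links, emails = extract_links_and_emails(sample)
```

Each extracted link or address can then become a node in the graph, with an edge to the webpage it was found on.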

Overall Algorithm Overview
- User Input. A user submits a query to the middleware via a web-based interface.
- Web Page Retrieval. The middleware queries a search engine’s API and gets the top-K Web pages.
- Preprocessing. The retrieved Web pages are preprocessed:
  - TF/IDF. Preprocessing steps for computing TF/IDF are carried out.
  - Ontology. Ontologies are used to enrich the Web page content.
  - Extraction. Named entities and Web-related information are extracted from the Web pages.
- Graph Creation. The Entity-Relationship Graph is generated.
- Enhanced TF/IDF. Ontology-enhanced TF/IDF values are computed.
- Clustering. Correlation clustering is applied.
- Cluster Processing. Each resulting cluster is then processed as follows:
  - Sketches. A set of keywords that represent the Web pages within a cluster is computed for each cluster. The goal is that the user should be able to find the person of interest by looking at the sketch.
  - Cluster Ranking. All clusters are ranked by a chosen criterion, to be presented to the user in a certain order.
  - Web Page Ranking. Once the user homes in on a particular cluster, the Web pages in this cluster are presented in a certain order, computed in this step.
- Visualization of Results. The results are presented to the user in the form of clusters (and their sketches) corresponding to namesakes, which can be explored further.
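To make the sketch step more concrete, here is one simple way a cluster's keyword sketch might be computed: pick the terms that occur on the most pages of the cluster, ignoring the query terms themselves. This selection rule, the function name, and the sample data are assumptions for illustration; the slides do not say how sketches are actually chosen.

```python
from collections import Counter

def cluster_sketch(cluster_pages, query_terms, k=3):
    """Return up to k keywords that appear on the most pages of a cluster."""
    counts = Counter()
    for page in cluster_pages:
        # Count each term once per page so long pages do not dominate.
        counts.update(set(page) - set(query_terms))
    # Most frequent first; ties are broken alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]

# Hypothetical cluster of tokenized pages for the query "andrew mccallum".
cluster = [
    ["andrew", "mccallum", "umass", "machine", "learning"],
    ["mccallum", "umass", "learning", "crf"],
    ["andrew", "mccallum", "umass", "professor"],
]
sketch = cluster_sketch(cluster, query_terms=["andrew", "mccallum"])
```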

Correlation Clustering. In correlation clustering (CC), each pair of nodes (u,v) is labeled with a “+” or “-” edge. Edge labeling is done according to a similarity function s(u,v): if s(u,v) indicates that u and v are similar, the label is “+”, otherwise it is “-”. s(u,v) is typically trained from past data. Clustering looks at the edges and tries to minimize disagreement. The disagreement for an element x placed in cluster C is the number of “-” edges that connect x to other elements in C.
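The labeling and the disagreement count can be sketched as follows. The similarity scores, the 0.5 threshold, and all names are hypothetical. `element_disagreement` follows the per-element definition given on the slide, while `total_disagreements` uses the usual correlation clustering objective, which also counts “+” edges that cross cluster boundaries; that second part is standard for CC but is not spelled out on the slide.

```python
from itertools import combinations

def make_labeler(similarity, threshold=0.5):
    """Turn pairwise similarity scores into a '+'/'-' edge labeling."""
    def label(u, v):
        score = similarity.get((u, v), similarity.get((v, u), 0.0))
        return "+" if score > threshold else "-"
    return label

def element_disagreement(x, cluster, label):
    """Number of '-' edges between x and the other members of its cluster."""
    return sum(1 for y in cluster if y != x and label(x, y) == "-")

def total_disagreements(clusters, label):
    """'-' edges inside clusters plus '+' edges between clusters."""
    cluster_of = {x: i for i, members in enumerate(clusters) for x in members}
    total = 0
    for u, v in combinations(sorted(cluster_of), 2):
        same_cluster = cluster_of[u] == cluster_of[v]
        if same_cluster != (label(u, v) == "+"):
            total += 1
    return total

# Hypothetical similarity scores between four webpages.
similarity = {("a", "b"): 0.9, ("a", "c"): 0.8, ("b", "c"): 0.2,
              ("a", "d"): 0.1, ("b", "d"): 0.1, ("c", "d"): 0.7}
label = make_labeler(similarity)
cost_1 = total_disagreements([{"a", "b", "c"}, {"d"}], label)
cost_2 = total_disagreements([{"a", "b"}, {"c", "d"}], label)
```

Here the second clustering has fewer disagreements than the first, so a clustering procedure that minimizes disagreement would prefer it.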

Similarity Function. Connection strength between u and v: $c(u,v) = \sum_k c_k w_k$, where $c_k$ is the number of u-v paths of type k and $w_k$ is the weight of u-v paths of type k. Similarity s(u,v) is a combination
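A minimal sketch of the connection strength computation: count the u-v paths of each type (the c_k) and sum them with per-type weights (the w_k). Representing a path type as the tuple of edge labels along the path, limiting paths to three edges, and the toy graph and weights are all assumptions made for this example.

```python
from collections import Counter

def build_graph(edges):
    """Build an undirected adjacency list from (node, node, label) triples."""
    graph = {}
    for a, b, edge_label in edges:
        graph.setdefault(a, []).append((b, edge_label))
        graph.setdefault(b, []).append((a, edge_label))
    return graph

def path_type_counts(graph, u, v, max_len=3):
    """Count simple u-v paths of at most max_len edges, grouped by type."""
    counts = Counter()

    def walk(node, visited, labels):
        if node == v:
            counts[tuple(labels)] += 1
            return
        if len(labels) == max_len:
            return
        for neighbor, edge_label in graph.get(node, []):
            if neighbor not in visited:
                walk(neighbor, visited | {neighbor}, labels + [edge_label])

    walk(u, {u}, [])
    return counts

def connection_strength(counts, weights):
    """Weighted sum over path types: sum of c_k * w_k."""
    return sum(c_k * weights.get(k, 0.0) for k, c_k in counts.items())

# Hypothetical graph: two pages that link to each other and mention the
# same organization and the same person.
graph = build_graph([
    ("page1", "OrgA", "mentions"), ("page2", "OrgA", "mentions"),
    ("page1", "PersonB", "mentions"), ("page2", "PersonB", "mentions"),
    ("page1", "page2", "hyperlink"),
])
counts = path_type_counts(graph, "page1", "page2")
weights = {("hyperlink",): 1.0, ("mentions", "mentions"): 0.5}  # assumed w_k
strength = connection_strength(counts, weights)
```

In this toy graph there is one direct hyperlink path and two paths through a shared mention, so the strength is 1 × 1.0 + 2 × 0.5.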

Training s(u,v) on pre-labeled data. It is a linear optimization problem (linear programs have efficient solutions). The system of inequalities says that for edges labeled “+” the similarity should exceed a threshold t, and for edges labeled “-” it should be less than the threshold. Delta is used to require “clearly” more or less than t. Since inequalities constructed this way might not have a solution that satisfies them all, slack is added to each inequality. The goal is to minimize the overall slack.
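One way to write down the optimization described above is the linear program below. The notation is assumed for illustration: $E^{+}$ and $E^{-}$ are the pre-labeled “+” and “-” pairs, $\xi_{uv}$ is the slack for pair (u,v), $t$ is the threshold, $\delta$ is the margin, and $s(u,v)$ is taken to be linear in the weights $w_k$ being learned. The exact formulation used by the authors may differ.

```latex
\begin{aligned}
\min_{w,\,\xi} \quad & \sum_{(u,v) \in E^{+} \cup E^{-}} \xi_{uv} \\
\text{subject to} \quad
  & s(u,v) \ge t + \delta - \xi_{uv} && \text{for all } (u,v) \in E^{+}, \\
  & s(u,v) \le t - \delta + \xi_{uv} && \text{for all } (u,v) \in E^{-}, \\
  & \xi_{uv} \ge 0 && \text{for all } (u,v) \in E^{+} \cup E^{-}.
\end{aligned}
```

A pair whose inequality already holds with the margin $\delta$ needs zero slack, so the objective only pays for pairs that the learned weights cannot separate cleanly.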

Experiments: Quality of Disambiguation. These two experiments show quality on two datasets: one by Artiles et al. from SIGIR’05 and one by Bekkerman & McCallum from WWW’05 (the well-known group of Andrew McCallum). The “+” values in brackets are improvements over what is reported in those papers. In WWW’05, the authors run a different experiment from the one in this paper; we implemented that experiment as well and obtain a 9.5% improvement over them. The measure is computed the same way they compute it, so the results are comparable.

Experiments: Effect on Search. Effects on precision, recall, and F-measure for “Andrew McCallum” for different representative clusters:
- For the UMass professor: his cluster is first and dominant (large).
- For the customer-support person: his cluster is small (3 webpages) and appears towards the end of Google’s search results.
In all experiments, the goal is to find all the pages of a given person. The experiments show that the new interface allows the user to do so more quickly.
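For reference, the three measures can be computed as in the sketch below for a set of pages the user has examined against the true pages of the person of interest. The page identifiers are made up, and the balanced F-measure (F1) is an assumption, since the slides do not say which variant was used.

```python
def precision_recall_f(retrieved, relevant):
    """Precision, recall, and balanced F-measure for two sets of page ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f

# Hypothetical case: the person of interest has 3 pages; the user has
# looked at 4 pages, 2 of which are about that person.
relevant_pages = {"p1", "p2", "p3"}
examined_pages = {"p1", "p2", "p7", "p9"}
precision, recall, f_measure = precision_recall_f(examined_pages, relevant_pages)
```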