Grouping Search-Engine Returned Citations for Person Name Queries Reema Al-Kamha Research Supported by NSF.

Grouping search-engine returned citations for person-name queries Reema Al-Kamha, David W. Embley (Proceedings of the 6th annual ACM international workshop.

2 The Problem Search engines return too many citations. Example: “Kelly Flanagan”. Google returns around 685 citations. Many people are named “Kelly Flanagan”. It would help to group the citations by person. How do we group them?

3 “Kelly Flanagan” Query to Google

4 Our Solution: A Multi-faceted Approach Facets: attributes, links, page similarity. Pipeline: a confidence matrix for each facet, then a final confidence matrix, then the grouping algorithm.

5 A Multi-faceted Approach Gather evidence from each of several different facets. Combine the evidence.

6 Attributes Phone number, address, state, city, zip code. Regular expression for each attribute.
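As a rough illustration of "a regular expression for each attribute", here is a minimal Python sketch. The slides do not show the actual expressions, so the two patterns (US-style phone number and zip code) and the helper names `extract_attributes` and `shared_attributes` are assumptions made for this example only.

```python
import re

# Hypothetical patterns; the original expressions are not given in the slides.
PATTERNS = {
    "phone": re.compile(r"\(?\b\d{3}\)?[-. ]\d{3}[-. ]\d{4}\b"),
    "zip": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
}

def extract_attributes(text):
    """Return, for each attribute name, the set of values found in the text."""
    return {name: set(rx.findall(text)) for name, rx in PATTERNS.items()}

def shared_attributes(attrs_a, attrs_b):
    """Names of attributes for which two citations share at least one value."""
    return {name for name in PATTERNS if attrs_a[name] & attrs_b[name]}

a = extract_attributes("Provo, UT 84604, phone (801) 422-1234")
b = extract_attributes("Office: Provo UT 84604")
print(shared_attributes(a, b))
```

Two citations that share an attribute value would then count as evidence for the attribute facet.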

7 Links People usually post information on only a few host servers, so we look for returned citations that have the same host. People often link one page about a person to another page about the same person, so we also check whether the URL of one citation has the same host as one of the URLs on the web page referenced by the other citation.
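The two kinds of link evidence described above can be sketched as follows. This is only an illustration: the function names and example URLs are hypothetical, and the sketch assumes the outgoing links of each cited page have already been collected.

```python
from urllib.parse import urlparse

def host(url):
    """Host part of a URL, lower-cased for comparison."""
    return urlparse(url).netloc.lower()

def same_host(url_a, url_b):
    """Evidence 1: the two citations live on the same host."""
    return host(url_a) == host(url_b)

def links_to_host(page_links, other_url):
    """Evidence 2: some URL on one cited page has the host of the other citation.

    page_links is the list of URLs found on the first citation's page
    (assumed to be extracted beforehand).
    """
    return host(other_url) in {host(u) for u in page_links}

print(same_host("http://example.edu/a", "http://EXAMPLE.edu/b"))
print(links_to_host(["http://example.org/x"], "http://example.org/home"))
```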

8 Links (Cont)

9 Page Similarity “adjacent cap-word pairs”: Cap-Word (Connector | Preposition (Article)? | (Capital-LetterDot))? Cap-Word.

10 Page Similarity The evidence is the number of shared adjacent cap-word pairs (1, 2, 3, or 4 or more). We ignore adjacent cap-word pairs that occur on many web pages (such as Home Page and Privacy Policy) by constructing a stop-word list.
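A minimal sketch of extracting adjacent cap-word pairs and counting the shared ones. The connector, preposition, and article lists are assumptions (the slides do not enumerate them), the stop list contains only the two examples named above, and this simplified tokenizer ignores punctuation and sentence boundaries.

```python
import re

CONNECTORS = {"and", "&"}            # assumed list
PREPOSITIONS = {"of", "for", "in"}   # assumed list
ARTICLES = {"the", "a", "an"}        # assumed list
STOP_PAIRS = {"Home Page", "Privacy Policy"}

# An initial such as "E." is one token; otherwise words or "&".
TOKEN = re.compile(r"[A-Z]\.|[A-Za-z]+|&")

def is_cap_word(tok):
    return len(tok) > 1 and tok[0].isupper() and tok.isalpha()

def is_initial(tok):
    return len(tok) == 2 and tok[0].isupper() and tok[1] == "."

def cap_word_pairs(text):
    """Cap-Word (Connector | Preposition (Article)? | Initial)? Cap-Word."""
    toks = TOKEN.findall(text)
    pairs = set()
    for i, tok in enumerate(toks):
        if not is_cap_word(tok):
            continue
        j = i + 1
        if j < len(toks):
            mid = toks[j]
            if mid.lower() in CONNECTORS or is_initial(mid):
                j += 1
            elif mid.lower() in PREPOSITIONS:
                j += 1
                if j < len(toks) and toks[j].lower() in ARTICLES:
                    j += 1
        if j < len(toks) and is_cap_word(toks[j]):
            pairs.add(" ".join(toks[i:j + 1]))
    return pairs - STOP_PAIRS

def shared_pair_bucket(text_a, text_b):
    """Number of shared pairs, capped at 4 to mean '4 or more'."""
    return min(len(cap_word_pairs(text_a) & cap_word_pairs(text_b)), 4)

print(cap_word_pairs("Palm Desert Real Estate listings, see our Privacy Policy"))
```

Overlapping pairs are kept on purpose, since the slides list Palm Desert, Desert Real, and Real Estate as separate pairs.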

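11 Confidence Matrix Construction For each facet we construct a confidence matrix over the citations C1, ..., Cn. The matrix is upper triangular with 1 on the diagonal:

$$
\begin{array}{c|cccccccc}
 & C_1 & C_2 & \cdots & C_i & \cdots & C_j & \cdots & C_n \\ \hline
C_1 & 1 & C_{12} & \cdots & C_{1i} & \cdots & C_{1j} & \cdots & C_{1n} \\
C_2 & & 1 & \cdots & C_{2i} & \cdots & C_{2j} & \cdots & C_{2n} \\
\vdots & & & \ddots & \vdots & & \vdots & & \vdots \\
C_i & & & & 1 & \cdots & C_{ij} & \cdots & C_{in} \\
\vdots & & & & & \ddots & \vdots & & \vdots \\
C_j & & & & & & 1 & \cdots & C_{jn} \\
\vdots & & & & & & & \ddots & \vdots \\
C_n & & & & & & & & 1
\end{array}
$$

$$
C_{ij} =
\begin{cases}
P(C_i \text{ and } C_j \text{ refer to the same person} \mid \text{evidence for facet } f) & \text{if there is evidence for facet } f \\
0 & \text{if there is no evidence for facet } f
\end{cases}
$$

A training set is used to compute the conditional probabilities.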

12 Confidence Matrix Construction (Cont) We select 9 person names. For each name we collect the first 50 citations. For 50 citations we have 1,225 comparison pairs. The size of our training set is 11,025.
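The training-set size follows from counting unordered pairs of citations for each name:

```latex
\binom{50}{2} = \frac{50 \cdot 49}{2} = 1225,
\qquad
9 \times 1225 = 11025.
```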

13 Confidence Matrix Construction (Cont) For the attribute facet: P(Same Person = “Yes” | = “yes”); P(Same Person = “Yes” | City = “yes” and State = “Yes”). For the link facet: P(Same Person = “Yes” | Host1 = “yes” and Host1 is non-popular). For the page similarity facet: P(Same Person = “Yes” | Share2 = “yes”).
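One plausible way to obtain such conditional probabilities from a labeled training set is by relative frequency, sketched below. The slides do not say how the probabilities were estimated, so this estimator, the function name, and the toy data are all assumptions.

```python
from collections import Counter

def estimate_confidence(training_pairs):
    """Estimate P(same person | evidence) by relative frequency.

    training_pairs: iterable of (evidence, same_person) where evidence is a
    hashable label such as "city+state" and same_person is a bool.
    """
    seen = Counter()
    same = Counter()
    for evidence, same_person in training_pairs:
        seen[evidence] += 1
        if same_person:
            same[evidence] += 1
    return {e: same[e] / seen[e] for e in seen}

# Toy data, made up purely for illustration.
toy = [("zip", True), ("zip", True), ("zip", False), ("city+state", True)]
print(estimate_confidence(toy))
```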

14 Confidence Matrix for Attribute Facet [10 × 10 matrix over citations C1 to C10; entries not recoverable.] C1 and C2 have the same city, state, and zip, which are “Provo”, “UT”, and “84604”. C1 and C8, and C2 and C8, have the same city and state, which are “Provo” and “UT”. C4 and C7 have the same city and state, which are “Palm Desert” and “California”.

15 Confidence Matrix for Link Facet [10 × 10 matrix over citations C1 to C10; entries not recoverable.] C1 and C2 have the same host name, and C1 refers to the host of C2. C5 and C6 have the same host name. C3 refers to the host of C5, and C3 refers to the host of C6.

16 Confidence Matrix for Page Similarity Facet [10 × 10 matrix over citations C1 to C10; entries not recoverable.] C1 and C2 share Associate Professor, Brigham Young, Performance Evaluation, Trace Collection, Computer Organization, Computer Architecture. C2 and C3 share Memory Hierarchy, Brent E. Nelson, System-Assisted Disk, Simulation Technique, Stochastic Disk, Winter Simulation, Chordal Spoke, Interconnection Network, Transaction Processing, Benchmarks Using, Performance Studies, Incomplete Trace, Heng Zho. C1 and C8, and C2 and C8, share Brigham Young. C4 and C7 share Palm Desert, Real Estate, Desert Real.

17 Final Matrix Combine the confidence matrices for the three facets using the Stanford certainty measure. For some observation B, if CF(E1) is the certainty factor associated with E1 and CF(E2) is the certainty factor associated with E2, then the new certainty factor for B is CF(E1) + CF(E2) - CF(E1) * CF(E2).
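The combination rule above can be applied pairwise across the three facets. A minimal sketch follows; the function names are illustrative, and folding the rule from left to right over the facets is an assumption (the rule is symmetric and associative, so the order does not change the result).

```python
from functools import reduce

def combine(cf1, cf2):
    """Combined certainty factor: CF(E1) + CF(E2) - CF(E1) * CF(E2)."""
    return cf1 + cf2 - cf1 * cf2

def combine_all(cfs):
    """Fold the rule over any number of facet confidences."""
    return reduce(combine, cfs, 0.0)

# Two facets at 0.5 and one facet with no evidence (0).
print(combine_all([0.5, 0.5, 0.0]))
```

A facet with no evidence contributes 0 and leaves the combined value unchanged, while additional evidence can only raise it.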

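18 Final Matrix (Cont) [Figure: worked example combining corresponding entries of the Confidence Matrix for Attributes, the Confidence Matrix for Links, and the Confidence Matrix for Page Similarity; only the operands 0 and 0.78 survive.]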

19 Final Confidence Matrix [10 × 10 matrix over citations C1 to C10; entries not recoverable.]

20 Grouping Algorithm Input: the final confidence matrix. Output: groups of search-engine returned citations, such that each group refers to the same person. The idea is: if we are highly confident about {Ci, Cj} and about {Cj, Ck}, then we form the group {Ci, Cj, Ck}. The threshold we use for “highly confident” is 0.8.

21 Grouping Algorithm (Cont) [10 × 10 final confidence matrix over citations C1 to C10; entries not recoverable.] Highly confident pairs: {C1, C2}, {C2, C3}, {C3, C5}, {C3, C6}, {C4, C7}, {C1, C8}, {C2, C8}. Resulting groups: Group 1: {C1, C2, C3, C5, C6, C8}, Group 2: {C4, C7}, Group 3: {C9}, Group 4: {C10}.
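The transitive grouping idea can be sketched with a union-find structure. This is an illustration under assumptions, not the authors' implementation: it treats "highly confident" as a confidence of at least 0.8, and the 0.9 entries in the example matrix are made-up stand-ins for the matrix values that did not survive in this transcript.

```python
def group_citations(conf, threshold=0.8):
    """Group indices i, j transitively whenever conf[i][j] >= threshold."""
    n = len(conf)
    parent = list(range(n))

    def find(i):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):   # upper triangle only
            if conf[i][j] >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Highly confident pairs from the slide, with a placeholder confidence of 0.9.
pairs = [(1, 2), (2, 3), (3, 5), (3, 6), (4, 7), (1, 8), (2, 8)]
conf = [[0.0] * 10 for _ in range(10)]
for a, b in pairs:
    conf[a - 1][b - 1] = 0.9

print([[i + 1 for i in g] for g in group_citations(conf)])
# [[1, 2, 3, 5, 6, 8], [4, 7], [9], [10]]
```

Citations with no highly confident partner end up in singleton groups, as C9 and C10 do here.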

22 Experimental Results We chose 10 arbitrary, different names. For each name we took the first 50 returned citations, so the size of the test set is 500. We use split and merge measures. Consider 8 returned citations C1, C2, C3, C4, C5, C6, C7, C8. The correct grouping result is Group 1: {C1, C2, C4, C6, C7}, Group 2: {C3, C8}, Group 3: {C5}. The grouping result of our system is Group 1: {C1, C2, C4}, Group 2: {C3, C6, C7}, Group 3: {C5, C8}. The number of splits is 0+1+1=2. The total number of merges is 2. We normalize the split and merge scores.
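The slide gives the counts but not the definitions of the two measures. The sketch below uses one reading that reproduces the counts in the example: a split is counted each time a correct group is spread over one more system group, and a merge each time a system group draws from one more correct group. Both definitions and the function names are assumptions, and the normalization step is not shown because the slide does not specify it.

```python
def count_splits(correct, system):
    """Sum over correct groups of (number of system groups they span - 1)."""
    where = {c: k for k, group in enumerate(system) for c in group}
    return sum(len({where[c] for c in group}) - 1 for group in correct)

def count_merges(correct, system):
    """Sum over system groups of (number of correct groups they mix - 1)."""
    return count_splits(system, correct)

correct = [{1, 2, 4, 6, 7}, {3, 8}, {5}]
system = [{1, 2, 4}, {3, 6, 7}, {5, 8}]
print(count_splits(correct, system), count_merges(correct, system))
# 2 2
```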

23 Experimental Results (Cont) Official College, Sports Network, Student Advantage.

24 Cases that Caused Missing Merges--Attributes Facet No shared attributes pairs (out of 1036 pairs) in 41 groups in Larry Wild. Only the value of attribute State is shared. 6 pairs in 41 groups in Larry Wild.

25 Techniques Used to Judge in Cases of No Evidence or Weak Evidence

26 Conclusions The multi-faceted approach is useful: it achieves a low normalized split score (0.004) and a low normalized merge score (0.014). No individual facet scored better than all facets used together.

27 Contributions Grouped the citations returned for person-name queries by person. Provided an additional tool for search engine queries.