Grouping Search-Engine Returned Citations for Person Name Queries Reema Al-Kamha Research Supported by NSF.


2 The Problem Search engines return too many citations. Example: “Kelly Flanagan”. Google returns around 685 citations, and many different people are named “Kelly Flanagan”. It would help to group the citations by person. How do we group them?

3 “Kelly Flanagan” Query to Google

4 Our Solution A multi-faceted approach: attributes, links, and page similarity. A confidence matrix for each facet. A final confidence matrix. A grouping algorithm.

5 A Multi-faceted Approach Gather evidence from each of several different facets. Combine the evidence.

6 Attributes Phone number, email address, state, city, zip code. A regular expression recognizes each attribute.
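A minimal sketch of how attribute values might be pulled out of page text with regular expressions. The slides do not show the actual expressions, so the patterns below (US-style phone numbers, simple email addresses, five-digit zip codes) and the names `ATTRIBUTE_PATTERNS`, `extract_attributes`, and `shared_attributes` are assumptions for illustration. City and state are left out because they would most likely need a list of known names rather than a pattern alone.

```python
import re

# Illustrative patterns only (assumptions); the real expressions are not in the slides.
ATTRIBUTE_PATTERNS = {
    "phone": re.compile(r"\(?\b\d{3}\)?[-.\s]\d{3}[-.]\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "zip": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
}


def extract_attributes(text):
    """Return, for each attribute name, the set of values found in the text."""
    return {name: set(p.findall(text)) for name, p in ATTRIBUTE_PATTERNS.items()}


def shared_attributes(text_a, text_b):
    """Names of attributes for which the two texts share at least one value."""
    a, b = extract_attributes(text_a), extract_attributes(text_b)
    return {name for name in ATTRIBUTE_PATTERNS if a[name] & b[name]}
```

Two citations that share an attribute value would then count as evidence for the attribute facet.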

7 Links People usually post information on only a few host servers, so two returned citations that have the same host are evidence for the same person. People often link one page about a person to another page about the same person, so we also check whether the URL of one citation has the same host as one of the URLs on the web page referenced by the other citation.
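The two kinds of link evidence can be sketched with the standard library's URL parser. This is an illustration under assumptions: the helper names `host`, `same_host`, and `links_to_host` are hypothetical, the outgoing links of a page are assumed to be already extracted, and the distinction between popular and non-popular hosts is not modeled.

```python
from urllib.parse import urlparse


def host(url):
    """Lower-cased host name of a URL, or '' if the URL has none."""
    return (urlparse(url).hostname or "").lower()


def same_host(url_a, url_b):
    """First kind of evidence: both citations are on the same host."""
    return host(url_a) != "" and host(url_a) == host(url_b)


def links_to_host(citation_url, outgoing_links):
    """Second kind of evidence: some link on the other citation's page
    points to the host of this citation."""
    target = host(citation_url)
    return target != "" and any(host(link) == target for link in outgoing_links)
```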

8 Links (Cont)

9 Page Similarity “adjacent cap-word pairs”: Cap-Word (Connector | Preposition (Article)? | (Capital-LetterDot))? Cap-Word.

10 Page Similarity The evidence is the number of shared adjacent cap-word pairs (1, 2, 3, or 4 or more). We ignore adjacent cap-word pairs that often occur on web pages (such as Home Page and Privacy Policy) by constructing a stop-word list.
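The adjacent cap-word pair pattern and the shared-pair count can be sketched as follows. The connector, preposition, and article word lists, the definition of a cap-word as a capital letter followed by lowercase letters (optionally hyphenated), and the two-entry stop list are assumptions; the slides give only the overall shape of the pattern. Overlapping pairs are kept, which matches the slide example where Palm Desert, Desert Real, and Real Estate are all counted.

```python
import re

# Cap-Word: capital letter plus lowercase letters, optionally hyphenated (assumption).
CAP_WORD = r"[A-Z][a-z]+(?:-[A-Z][a-z]+)*"
# Connector | Preposition (Article)? | Capital-LetterDot; word lists are assumptions.
MIDDLE = r"(?:and|or|of(?: the)?|in(?: the)?|for(?: the)?|at(?: the)?|[A-Z]\.)"
# The lookahead lets pairs overlap; the lookbehind forces a match to start a word.
PAIR = re.compile(rf"(?<![\w-])(?=({CAP_WORD}(?: {MIDDLE})? {CAP_WORD})\b)")

# Hypothetical stop list built from the two examples on the slide.
STOP_PAIRS = {"Home Page", "Privacy Policy"}


def cap_word_pairs(text):
    """Set of adjacent cap-word pairs in the text, minus the stop list."""
    return set(PAIR.findall(text)) - STOP_PAIRS


def shared_pairs(text_a, text_b):
    """Adjacent cap-word pairs that occur in both texts."""
    return cap_word_pairs(text_a) & cap_word_pairs(text_b)


def share_level(text_a, text_b):
    """Number of shared pairs, capped at 4 to mean '4 or more'."""
    return min(len(shared_pairs(text_a, text_b)), 4)
```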

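11 Confidence Matrix Construction For each facet we construct a confidence matrix over the citations C1, ..., Cn. The matrix is upper triangular with 1 on the diagonal:

\[
\begin{array}{c|cccccccc}
       & C_1 & C_2    & \cdots & C_i    & \cdots & C_j    & \cdots & C_n    \\ \hline
C_1    & 1   & C_{12} & \cdots & C_{1i} & \cdots & C_{1j} & \cdots & C_{1n} \\
C_2    &     & 1      & \cdots & C_{2i} & \cdots & C_{2j} & \cdots & C_{2n} \\
\vdots &     &        & \ddots & \vdots &        & \vdots &        & \vdots \\
C_i    &     &        &        & 1      & \cdots & C_{ij} & \cdots & C_{in} \\
\vdots &     &        &        &        & \ddots & \vdots &        & \vdots \\
C_j    &     &        &        &        &        & 1      & \cdots & C_{jn} \\
\vdots &     &        &        &        &        &        & \ddots & \vdots \\
C_n    &     &        &        &        &        &        &        & 1
\end{array}
\]

\[
C_{ij} =
\begin{cases}
P(C_i \text{ and } C_j \text{ refer to the same person} \mid \text{evidence for facet } f) & \text{if there is evidence for facet } f \\
0 & \text{if there is no evidence for facet } f
\end{cases}
\]

A training set is used to compute the conditional probabilities.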
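A minimal sketch of filling such a matrix for one facet. The function name `confidence_matrix` and the callback `pair_confidence` are hypothetical; the callback is assumed to return the trained conditional probability for a pair of citations, or 0 when the facet offers no evidence.

```python
def confidence_matrix(citations, pair_confidence):
    """Build the upper-triangular confidence matrix for one facet.

    pair_confidence(a, b) is assumed to return a probability in [0, 1],
    with 0 meaning the facet has no evidence for that pair.
    """
    n = len(citations)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        matrix[i][i] = 1.0  # a citation always refers to the same person as itself
        for j in range(i + 1, n):
            matrix[i][j] = pair_confidence(citations[i], citations[j])
    return matrix
```

Only the cells above the diagonal are filled, since the confidence for a pair does not depend on the order of its two citations.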

12 Confidence Matrix Construction (Cont) We select 9 person names. For each name we collect the first 50 citations. For 50 citations we have 50 * 49 / 2 = 1,225 comparison pairs. The size of our training set is 9 * 1,225 = 11,025.

13 Confidence Matrix Construction (Cont) For the attribute facet: P(Same Person = “Yes” | Email = “Yes”) and P(Same Person = “Yes” | City = “Yes” and State = “Yes”). For the link facet: P(Same Person = “Yes” | Host1 = “Yes” and Host1 is non-popular). For the page similarity facet: P(Same Person = “Yes” | Share2 = “Yes”).
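One plausible way to obtain such conditional probabilities from a labeled training set is by relative frequency. The slides do not say how the probabilities were estimated, so this is an assumption, and the function name and the evidence labels are hypothetical.

```python
from collections import Counter


def conditional_probabilities(training_pairs):
    """Estimate P(same person | evidence) by relative frequency.

    training_pairs: iterable of (evidence, same_person) tuples, where
    evidence is any hashable label (for example "email") and same_person
    is True when the two citations in the pair refer to the same person.
    """
    totals, positives = Counter(), Counter()
    for evidence, same_person in training_pairs:
        totals[evidence] += 1
        positives[evidence] += bool(same_person)
    return {evidence: positives[evidence] / totals[evidence] for evidence in totals}
```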

14 Confidence Matrix for Attribute Facet Upper-triangular matrix over C1 to C10 with 1 on the diagonal. The off-diagonal entries are 0 except (C1, C2) = 0.99 and (C1, C8) = 0.96; the values for (C2, C8) and (C4, C7) are missing from the transcript. C1 and C2 have the same city, state, and zip code, which are “Provo”, “UT”, and “84604”. C1 and C8, and C2 and C8, have the same city and state, which are “Provo” and “UT”. C4 and C7 have the same city and state, which are “Palm Desert” and “California”.

15 Confidence Matrix for Link Facet Upper-triangular matrix over C1 to C10 with 1 on the diagonal. The only off-diagonal value that survives in the transcript is (C1, C2) = 0.99; the values for the other pairs with link evidence are missing. C1 and C2 have the same host name, and C1 refers to the host of C2. C5 and C6 have the same host name. C3 refers to the host of C5, and C3 refers to the host of C6.

16 Confidence Matrix for Page Similarity Facet Upper-triangular matrix over C1 to C10 with 1 on the diagonal. The off-diagonal entries are 0 except (C1, C2) = 0.95, (C1, C8) = 0.78, (C2, C3) = 0.95, (C2, C8) = 0.78, and (C4, C7) = 0.92. C1 and C2 share Associate Professor, Brigham Young, Performance Evaluation, Trace Collection, Computer Organization, Computer Architecture. C2 and C3 share Memory Hierarchy, Brent E. Nelson, System-Assisted Disk, Simulation Technique, Stochastic Disk, Winter Simulation, Chordal Spoke, Interconnection Network, Transaction Processing, Benchmarks Using, Performance Studies, Incomplete Trace, Heng Zho. C1 and C8, and C2 and C8, share Brigham Young. C4 and C7 share Palm Desert, Real Estate, Desert Real.

17 Final Matrix Combine the confidence matrices for the three facets using the Stanford Certainty Measure. For some observation B, if CF(E1) is the certainty factor associated with evidence E1 and CF(E2) is the certainty factor associated with evidence E2, the new certainty factor for B is CF(E1) + CF(E2) - CF(E1) * CF(E2).

18 Final Matrix (Cont) Combining the entries from the confidence matrices for attributes (0.96), links (0), and page similarity (0.78): 0.96 + 0 + 0.78 - 0.96 * 0 - 0.96 * 0.78 - 0.78 * 0 + 0.96 * 0 * 0.78 = 0.9912.
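The worked arithmetic can be checked with a short sketch that applies the two-factor rule repeatedly. The function names are hypothetical; folding the rule over three factors gives the same expansion as the seven-term sum on the slide.

```python
from functools import reduce


def combine_two(cf1, cf2):
    """Combine two certainty factors: CF1 + CF2 - CF1 * CF2."""
    return cf1 + cf2 - cf1 * cf2


def combine(factors):
    """Fold the two-factor rule over any number of certainty factors."""
    return reduce(combine_two, factors, 0.0)


# Attributes 0.96, links 0, page similarity 0.78, as on the slide.
final = combine([0.96, 0.0, 0.78])
print(round(final, 4))  # 0.9912
```

A facet with no evidence contributes 0 and leaves the combined value unchanged, so a pair needs support from only one facet to receive a nonzero final confidence.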

19 Final Confidence Matrix Upper-triangular matrix over C1 to C10 with 1 on the diagonal. The off-diagonal entries are 0 except (C1, C2) = 0.95, (C1, C8) = 0.99, (C2, C3) = 0.95, and (C2, C8) = 0.99; the values for (C3, C5), (C3, C6), and (C4, C7) are missing from the transcript.

20 Grouping Algorithm Input: the final confidence matrix. Output: groups of search-engine returned citations, such that each group refers to the same person. The idea is: if {Ci, Cj} and {Cj, Ck} are both highly confident pairs, then we form the group {Ci, Cj, Ck}. The threshold we use for “highly confident” is 0.8.

21 Grouping Algorithm (Cont) In the final confidence matrix (slide 19), the highly confident pairs are {C1, C2}, {C2, C3}, {C3, C5}, {C3, C6}, {C4, C7}, {C1, C8}, {C2, C8}. Group 1: {C1, C2, C3, C5, C6, C8}, Group 2: {C4, C7}, Group 3: {C9}, Group 4: {C10}.
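The grouping idea amounts to chaining highly confident pairs, which can be sketched as finding connected components with a union-find structure. Union-find is a swapped-in technique for illustration, not necessarily how the original system was implemented. The function names are hypothetical, indices are 0-based (C1 is index 0), and treating a value equal to the threshold as highly confident is an assumption.

```python
def confident_pairs(matrix, threshold=0.8):
    """Index pairs (i, j), i < j, whose confidence reaches the threshold.

    Using >= rather than > at the threshold is an assumption.
    """
    n = len(matrix)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if matrix[i][j] >= threshold]


def group_citations(n, pairs):
    """Group n citations by chaining pairs: {i, j} and {j, k} give {i, j, k}."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for i, j in pairs:
        parent[find(i)] = find(j)

    groups = {}
    for x in range(n):
        groups.setdefault(find(x), []).append(x)
    return sorted(groups.values())


# Highly confident pairs listed on the slide, converted to 0-based indices.
slide_pairs = [(0, 1), (1, 2), (2, 4), (2, 5), (3, 6), (0, 7), (1, 7)]
groups = group_citations(10, slide_pairs)
print(groups)  # [[0, 1, 2, 4, 5, 7], [3, 6], [8], [9]]
```

The printed groups correspond to {C1, C2, C3, C5, C6, C8}, {C4, C7}, {C9}, and {C10}.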

22 Experimental Results Choose 10 arbitrary different names. For each name we get the first 50 returned citations, so the size of the test set is 500. We use split and merge measures. Consider 8 returned citations C1, C2, C3, C4, C5, C6, C7, C8. The correct grouping result is Group 1: {C1, C2, C4, C6, C7}, Group 2: {C3, C8}, Group 3: {C5}. The grouping result of our system is Group 1: {C1, C2, C4}, Group 2: {C3, C6, C7}, Group 3: {C5, C8}. The number of splits is 0 + 1 + 1 = 2. The total number of merges is 2. We normalize the split and merge scores.
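One reading of the split and merge counts that reproduces the numbers in this example is sketched below. The definitions are an assumption inferred from the example: a system group needs one split for each extra correct group it mixes in, and a correct group needs one merge for each extra system group its members are spread over. The normalization step is not described in the slides and is not shown.

```python
def split_and_merge_counts(correct, system):
    """Count splits and merges needed to turn the system grouping into
    the correct one (assumed definitions, see the text above).

    correct, system: lists of sets that partition the same citation ids.
    """
    splits = sum(sum(1 for c in correct if c & s) - 1 for s in system)
    merges = sum(sum(1 for s in system if c & s) - 1 for c in correct)
    return splits, merges


correct = [{1, 2, 4, 6, 7}, {3, 8}, {5}]
system = [{1, 2, 4}, {3, 6, 7}, {5, 8}]
print(split_and_merge_counts(correct, system))  # (2, 2)
```

For the system groups the split counts are 0, 1, and 1, matching the 0 + 1 + 1 = 2 on the slide.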

23 Experimental Results (Cont) Official College, Sports Network, Student Advantage.

24 Cases that Caused Missing Merges: Attributes Facet No shared attributes: 1030 pairs (out of 1036 pairs) in 41 groups for “Larry Wild”. Only the value of the attribute State is shared: 6 pairs in 41 groups for “Larry Wild”.

25 Techniques Used to Judge in Case of No Evidence or Weak Evidence

26 Conclusions The multi-faceted approach is useful: it yields a low normalized split score (0.004) and a low normalized merge score (0.014). No individual facet scored better than using all facets together.

27 Contributions Grouped the citations returned for person-name queries by person. Provided an additional tool for search engine queries.
