School of Library and Information Science


1 Link Detection
David Eichmann
School of Library and Information Science
The University of Iowa

2 Why?
We focused on link detection this year to vet a new similarity scheme
In building our extraction framework for question answering and bioinformatics we were able to derive:
  A reasonably clean scheme for mapping relationships between entities; and
  A way of decorating those entities with extracted attributes/properties (e.g., person age, relative geographical position, etc.)

3 Our Working Hypothesis
Assessing inter-document linkage using a concept graph derived from the extraction framework could prove to be more robust than term vector methods

4 Technique (in the ideal)
Sentence boundary detect the corpus
Part-of-speech tag sentence terms
Extract named entities and residual noun phrases
Generate a parse for the sentence
Use the resulting dependencies to generate graph fragments
Merge the graph fragments into a single graph for a story
Use a graph similarity scheme to assess story linkage
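The merge step can be sketched as follows. This is a minimal illustration, not the framework described on the slide: it assumes each graph fragment is a list of (head, relation, dependent) triples and that two mentions refer to the same node when their lowercased strings match. The function name `merge_fragments` and the relation labels are hypothetical.

```python
from collections import defaultdict


def merge_fragments(fragments):
    """Merge per-sentence graph fragments into one story graph.

    Assumption: a fragment is a list of (head, relation, dependent) triples,
    and nodes are identified by their lowercased, stripped surface string.
    Returns a dict mapping each node to its set of (relation, target) edges.
    """
    graph = defaultdict(set)
    for fragment in fragments:
        for head, relation, dependent in fragment:
            h = head.strip().lower()
            d = dependent.strip().lower()
            graph[h].add((relation, d))
            # Make sure nodes with no outgoing edges still appear in the graph.
            graph.setdefault(d, set())
    return dict(graph)


# Hypothetical fragments from three sentences of one story.
fragments = [
    [("Bush", "subject_of", "visited")],
    [("visited", "object", "Iowa")],
    [("bush", "apposition", "president")],
]
story = merge_fragments(fragments)
```

Matching nodes by string is the simplest possible merge rule; a real system would likely need coreference or entity normalization to decide when two mentions are the same node.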

5 The graph similarity measure
Generate the Cook-Holder edit distance between two graphs
Graph_sim(g1, g2) = 1 - norm(CHed(g1,g2) / max(|g1|,|g2|))
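The shape of the formula can be illustrated with a small sketch. Several things here are assumptions, since the slide does not define them: |g| is taken as the number of nodes plus edges, norm() is taken as clamping to [0, 1], and the Cook-Holder edit distance is replaced by a much simpler stand-in that matches nodes by label and counts the insertions and deletions needed. The stand-in is not the Cook-Holder algorithm, which performs an inexact graph match.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Graph:
    nodes: frozenset  # node labels
    edges: frozenset  # (source label, target label) pairs


def size(g):
    # Assumption: |g| counts both nodes and edges.
    return len(g.nodes) + len(g.edges)


def label_matched_edit_distance(g1, g2):
    # Stand-in for CHed: with nodes matched by label, the edit cost is the
    # number of nodes and edges present in one graph but not the other.
    return len(g1.nodes ^ g2.nodes) + len(g1.edges ^ g2.edges)


def graph_sim(g1, g2):
    denom = max(size(g1), size(g2))
    if denom == 0:
        return 1.0  # two empty graphs are treated as identical
    ratio = label_matched_edit_distance(g1, g2) / denom
    # Assumption: norm() clamps the ratio to [0, 1].
    return 1.0 - min(1.0, max(0.0, ratio))


g1 = Graph(frozenset({"a", "b"}), frozenset({("a", "b")}))
g2 = Graph(frozenset({"a", "b", "c"}), frozenset({("a", "b"), ("b", "c")}))
g3 = Graph(frozenset({"x", "y"}), frozenset({("x", "y")}))
```

The clamp matters for this stand-in because two graphs with nothing in common can have an edit distance larger than the size of the bigger graph.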

6 Reality sets in
MT text doesn’t parse worth a …
ASR text rarely has clean sentence boundaries
Off-the-shelf parsers aren’t trained for speech grammars
Hence ASR text doesn’t parse worth a …

7 Regrouping
Sentence boundary detect newswire sources
Approximate sentence boundaries with speech pauses longer than a certain threshold
Skip the parse
Generate graph fragments using a window of neighboring NPs
  Submitted run uses the current NP and the two downstream NPs
  This clearly misses syntactically close but lexically distant NP connections…
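The fallback scheme can be sketched roughly as below. The input format (word, start time, end time), the default 0.5 second pause threshold, and the choice to keep the NP window inside one segment are all assumptions for illustration; the slide gives only "a certain threshold" and "the two downstream NPs". The function names are hypothetical.

```python
def segment_by_pauses(words, max_pause=0.5):
    """Split (word, start, end) triples into pseudo-sentences.

    A new segment starts whenever the silence before a word is longer
    than max_pause seconds (an assumed value, not the one actually used).
    """
    segments, current = [], []
    prev_end = None
    for word, start, end in words:
        if prev_end is not None and current and start - prev_end > max_pause:
            segments.append(current)
            current = []
        current.append(word)
        prev_end = end
    if current:
        segments.append(current)
    return segments


def window_edges(noun_phrases, downstream=2):
    """Link each NP to the next `downstream` NPs in the same segment."""
    edges = set()
    for i, np_ in enumerate(noun_phrases):
        for other in noun_phrases[i + 1 : i + 1 + downstream]:
            if other != np_:
                edges.add((np_, other))
    return edges


def story_graph(segments_of_nps, downstream=2):
    """Union the windowed fragments of every segment into (nodes, edges)."""
    nodes, edges = set(), set()
    for nps in segments_of_nps:
        nodes.update(nps)
        edges |= window_edges(nps, downstream)
    return nodes, edges


asr = [("hello", 0.0, 0.4), ("world", 0.5, 0.9), ("next", 2.0, 2.3)]
segments = segment_by_pauses(asr)
nodes, edges = story_graph([["a", "b", "c", "d"], ["d", "e"]])
```

With a window of two, an NP is never linked to one three or more positions away, which is the "lexically distant" limitation the slide points out.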

8 Contrastive Runs
Cosine vector similarity of document term vectors
Cosine vector similarity of document phrase vectors
A strawman edit distance
  Construct a single string for a document comprised of the concatenation of alphabetized NPs for the document
  If the graph scheme doesn’t outperform this, it’s probably not worth pursuing…
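The two kinds of baseline can be sketched as below. Raw counts are used as vector weights, NPs are joined with a single space, and plain Levenshtein distance is used for the string comparison; all three are assumptions, since the slide does not specify term weighting, the separator, or the edit distance variant.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two sparse count vectors (Counter objects)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def np_string(noun_phrases):
    """Concatenate a document's NPs in alphabetical order."""
    return " ".join(sorted(noun_phrases))


def levenshtein(s, t):
    """Character-level edit distance, computed row by row."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]


doc1 = ["white house", "bush", "iowa"]
doc2 = ["bush", "iowa", "senate"]
term_sim = cosine(Counter(" ".join(doc1).split()), Counter(" ".join(doc2).split()))
phrase_sim = cosine(Counter(doc1), Counter(doc2))
strawman = levenshtein(np_string(doc1), np_string(doc2))
```

The term and phrase runs differ only in what counts as a vector dimension: individual words in the first case, whole NPs in the second.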

9 Official Results
Run     Scheme  P(Miss)  P(FA)   Norm Clink
UIowa1  Graph   0.7234   0.0018  0.7320
UIowa2  Edit    0.7308   0.0668  1.0582
UIowa3  Phrase  0.6971   0.0014  0.6984
UIowa4  Word    0.6851   0.0004  0.6871
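For readers unfamiliar with the Norm Clink column, the sketch below shows how a normalized link detection cost is typically derived from a miss rate and a false alarm rate. The constants (miss cost 1.0, false alarm cost 0.1, target prior 0.02) are assumed evaluation defaults and are not stated on the slide. Official scores are also generally averaged over topics, so recomputing from the rates in the table need not reproduce the reported values exactly.

```python
def normalized_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_target=0.02):
    """Normalized detection cost from miss and false alarm probabilities.

    The cost constants and target prior are assumptions, not values taken
    from the slides. The raw cost is divided by the cost of the better of
    the two trivial systems (always "linked" or always "not linked").
    """
    cost = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
    return cost / min(c_miss * p_target, c_fa * (1.0 - p_target))


# Recomputing one row of the table under the assumed constants.
edit_run = normalized_cost(0.7308, 0.0668)
```

Under these constants a false alarm rate weighs 4.9 times as much as the miss rate in the normalized score, which is why the edit distance run's much higher P(FA) pushes its cost above 1.0.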

10 Word Performance

11 Phrase Performance

12 Edit Distance Performance

13 Graph Similarity Performance

14 Word/Phrase Costs

15 Word/Edit Costs

16 Word/Graph Costs

17 Graph/Edit Costs

18 Conclusions
Definitely signal present in the graph similarity scheme
More tuning needed
  Official Run Clink:
  Actual Minimum Clink:
  Official Run P(Miss):
  Actual Minimum Clink P(Miss):

19 Conclusions, cont.
Revisit the graph formation hack
Hybrid scheme
  Use the ideal scheme for newswire
  Use the hack for broadcasts
Alternatively
  Aggressively segment ASR, resulting in smaller fragments
  Parse everything
  Note that we don’t need full sentence structure, only good clausal structure

