1
Relation Extraction for Academic Collaboration
10-709 Project Proposal
Justin Betteridge, Matthew Bilotti, Simon Fung, Sophie Wang
Jan 26, 2006
2
Relation Extraction
We want: CollaboratesWith(x, y), where x and y are of type ‘person’.
Two redundant sources of information for co-training:
- Extraction Patterns find Relations expressed in surface text or tables on the web.
- A rote learner keeps track of the Relations it is told about, aggregating evidence in the form of confidence scores when a Relation is extracted multiple times from different sources.
3
Sketch of a Co-Training Algorithm
Let: R = a set of Relations; P = a set of Extraction Patterns
Initialize: R <- seed Relations, P <- seed Patterns
Do, until a termination condition is reached:
1. For each p in P, where p is of the form ( “before context”, x, “between context”, y, “after context” ), query Google using the literal context strings in the Pattern to retrieve text windows from which a set of Relations (x, y) can be extracted.
2. For each new Relation, compute a new confidence score and add it to R, combining evidence if necessary.
3. Weed out any r in R whose confidence is below a threshold or, optionally, any r whose arguments are unlikely to be of type person.
4. For each r in R, where r is of the form (x, y), query Google to retrieve a set of text windows containing the strings x and y. From these text windows, generalize a set of Patterns ( “before”, x, “between”, y, “after” ).
5. For each new Pattern, compute a new confidence score and add it to P, combining evidence if necessary.
6. Weed out any p in P whose confidence is below a threshold.
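The six steps above can be sketched in Python as follows. This is only an illustration under several assumptions that are not in the proposal: a small in-memory list of strings stands in for Google queries, a crude capitalized-two-word regex stands in for person detection, patterns are generalized from the "between" context only, the evidence-combination rule is a simple "move toward 1.0" update, and the loop stops when a round adds nothing new or after `max_rounds`. All function and variable names are hypothetical.

```python
import re
from typing import Dict, List, Tuple

Relation = Tuple[str, str]        # (x, y)
Pattern = Tuple[str, str, str]    # (before, between, after) context strings

# Crude stand-in for a person-name recognizer: two capitalized words.
NAME = r"([A-Z][a-z]+ [A-Z][a-z]+)"

def search(corpus: List[str], needles: Tuple[str, ...]) -> List[str]:
    """Stand-in for a web search: documents containing every non-empty needle."""
    return [d for d in corpus if all(n in d for n in needles if n)]

def extract_relations(p: Pattern, windows: List[str]) -> List[Relation]:
    before, between, after = p
    rx = re.compile(re.escape(before) + NAME + re.escape(between) + NAME + re.escape(after))
    return [(m.group(1), m.group(2)) for w in windows for m in rx.finditer(w)]

def generalize_patterns(r: Relation, windows: List[str]) -> List[Pattern]:
    """Keep only the text between the two arguments (before/after left empty)."""
    rx = re.compile(re.escape(r[0]) + r"(.{1,40}?)" + re.escape(r[1]))
    return [("", m.group(1), "") for w in windows for m in rx.finditer(w)]

def combine(old: float, evidence: float) -> float:
    """Assumed update rule: move `evidence` of the way from `old` toward 1.0."""
    return old + (1.0 - old) * evidence

def co_train(corpus, seed_relations, seed_patterns, threshold=0.3, max_rounds=3):
    R: Dict[Relation, float] = dict(seed_relations)
    P: Dict[Pattern, float] = dict(seed_patterns)
    seen = set()  # count each (pattern, relation) pairing as evidence only once
    for _ in range(max_rounds):
        sizes_before = (len(R), len(P))
        # Steps 1-2: use patterns to extract relations and score them.
        for p, p_conf in list(P.items()):
            for r in extract_relations(p, search(corpus, p)):
                if ("R", p, r) not in seen:
                    seen.add(("R", p, r))
                    R[r] = combine(R.get(r, 0.0), p_conf)
        # Step 3: drop low-confidence relations.
        R = {r: c for r, c in R.items() if c >= threshold}
        # Steps 4-5: use relations to generalize patterns and score them.
        for r, r_conf in list(R.items()):
            for p in generalize_patterns(r, search(corpus, r)):
                if ("P", r, p) not in seen:
                    seen.add(("P", r, p))
                    P[p] = combine(P.get(p, 0.0), r_conf)
        # Step 6: drop low-confidence patterns.
        P = {p: c for p, c in P.items() if c >= threshold}
        if (len(R), len(P)) == sizes_before:  # assumed termination condition
            break
    return R, P

# Toy data (invented for illustration only).
corpus = [
    "This is joint work of Alice Smith in collaboration with Bob Jones at MIT.",
    "Carol White in collaboration with Dan Brown published a paper.",
    "Carol White and her co-author Dan Brown gave a talk.",
    "Erin Green and her co-author Frank Black gave a talk.",
]
seed_relations = {("Alice Smith", "Bob Jones"): 0.9}
seed_patterns = {("", " in collaboration with ", ""): 0.9}
R, P = co_train(corpus, seed_relations, seed_patterns)
```

On this toy corpus the seed pattern finds a second pair, that pair yields a second pattern, and the second pattern finds a third pair, which is the bootstrapping behavior the algorithm is after.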
4
Coverage as a Confidence Measure
Confidence for an Extraction Pattern p:
- For each r in R, query Google to see if p can extract r.
- Coverage is the number of relations in R extractable by p, divided by |R|.
Confidence for a Relation r:
- For each p in P, query Google to see if p can extract r.
- Similarly, coverage is the number of patterns in P that can extract r, divided by |P|.
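The two coverage ratios can be written down directly. In this minimal sketch the web query is replaced by a caller-supplied predicate `can_extract(p, r)`; that predicate, the lookup table behind it, and the example names are all assumptions made for illustration.

```python
from typing import Callable, Iterable, Tuple

Relation = Tuple[str, str]
Pattern = str
CanExtract = Callable[[Pattern, Relation], bool]

def pattern_confidence(p: Pattern, relations: Iterable[Relation],
                       can_extract: CanExtract) -> float:
    """Fraction of the relations in R that pattern p can extract."""
    rs = list(relations)
    if not rs:
        return 0.0
    return sum(1 for r in rs if can_extract(p, r)) / len(rs)

def relation_confidence(r: Relation, patterns: Iterable[Pattern],
                        can_extract: CanExtract) -> float:
    """Fraction of the patterns in P that can extract relation r."""
    ps = list(patterns)
    if not ps:
        return 0.0
    return sum(1 for p in ps if can_extract(p, r)) / len(ps)

# Toy oracle standing in for "query Google to see if p can extract r".
EXTRACTS = {
    ("in collaboration with", ("alice", "bob")),
    ("in collaboration with", ("carol", "dan")),
    ("joint work with", ("alice", "bob")),
}

def toy_can_extract(p: Pattern, r: Relation) -> bool:
    return (p, r) in EXTRACTS

R = [("alice", "bob"), ("carol", "dan")]
P = ["in collaboration with", "joint work with"]
```

Both functions return 0.0 for an empty set, which is a choice made here to avoid dividing by zero and is not specified in the slides.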
5
Combining Confidence Scores
Given: a Relation with confidence c is extracted again by a pattern with confidence p, and the new extraction has a confidence score of s (which may be < c).
One idea: MYCIN Calculus [Shortliffe 76]
  new confidence = c + ( 1 - c ) * p * s
  Intuitively, this goes a fraction p * s of the way from the old confidence c to the maximal confidence 1.0.
Another idea:
  new confidence = ( c + p * s ) / ( 1 + c * p * s )
  Confidences increase monotonically and stay between 0 and 1.0, but never reach 1.0 (as long as c and p * s are both below 1.0).
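A short sketch of the two update rules, using the slide's symbols c, p, and s. The function names are made up here, and all three inputs are assumed to lie in [0, 1].

```python
def combine_mycin(c: float, p: float, s: float) -> float:
    """MYCIN-style update: go a fraction p*s of the way from c to 1.0."""
    return c + (1.0 - c) * p * s

def combine_ratio(c: float, p: float, s: float) -> float:
    """Alternative update: (c + p*s) / (1 + c*p*s)."""
    x = p * s
    return (c + x) / (1.0 + c * x)

# Repeatedly re-extracting the same relation with p = 0.8, s = 0.5.
c_mycin = c_ratio = 0.5
history = []
for _ in range(5):
    c_mycin = combine_mycin(c_mycin, 0.8, 0.5)
    c_ratio = combine_ratio(c_ratio, 0.8, 0.5)
    history.append((c_mycin, c_ratio))
```

Starting from c = 0.5, one update with p = 0.8 and s = 0.5 gives 0.5 + 0.5 * 0.4 = 0.7 under the first rule and 0.9 / 1.2 = 0.75 under the second. Further updates keep raising both values without reaching 1.0.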
6
Example Seed Data for Co-Training
Extraction Patterns:
- “in collaboration with”
- “joint work with”
- Patterns that extract information from tables, lists of citations, etc.
Relations:
- CollaboratesWith( mbilotti, ehn )
- CollaboratesWith( jbetter, teruko )
- ...
7
Extraction Pattern Examples
Query: “in collaboration with” site:web.mit.edu/biology/www
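A query like the one above is a quoted literal phrase plus a `site:` restriction. The helper below is a hypothetical sketch of how such a string might be assembled and URL-encoded. The function name and the `q` parameter name are assumptions, and no particular search API is implied.

```python
from urllib.parse import parse_qs, urlencode

def build_query(phrase: str, site: str = "") -> str:
    """Quote the literal context string and optionally restrict it to a site."""
    query = f'"{phrase}"'
    if site:
        query += f" site:{site}"
    return query

query = build_query("in collaboration with", "web.mit.edu/biology/www")
encoded = urlencode({"q": query})  # "q" is an assumed parameter name
```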
8
Open Questions
- Additional useful sources of information:
  - Anchor text and link structure: advisor-advisee cross-refs, department or lab organization
  - Heuristics or Named Entity Recognition to weed out relation arguments that are not people
- Confidence metrics for patterns, relations
- Methods of combining confidence scores
- Termination condition