Published by Chester Cain. Modified over 9 years ago.
1
Estimating Importance Features for Fact Mining (With a Case Study in Biography Mining)
Sisay Fissaha Adafre, School of Computing, Dublin City University, sadafre@computing.dcu.ie
Maarten de Rijke, ISLA, University of Amsterdam, mdr@science.uva.nl
2
Outline
- Motivation
- Task
- Approaches
- Experimental Setup
- Results
- Concluding Remarks
3
Motivation
- Over 60% of Web queries are informational: “Tell me about X.”
- Queries are short
- TREC “Other” questions
- DUC 2004 summarization: “Who is X?”
4
Motivation
- Increasing amount of user-annotated data: Wikipedia
  - The largest reference work
  - Open content: anyone can edit it
  - Rich set of categories
- Wikipedia as an “importance model” (Mishne et al. 2005)
  - Nuggets from a newspaper corpus are compared with nuggets from Wikipedia
  - Higher similarity implies higher importance
5
Applications
Uses of sentence importance estimation:
- Information retrieval
- Question answering (Ahn et al., 2004)
- Novelty checking (Allan et al., 2003)
- Summarization: graph-based methods (Erkan & Radev, 2004)
- Topic tracking (Kraaij & Spitters, 2003)
6
Task
Given a topic, identify sentences that are important for the topic in a general newspaper text corpus.
Example: “William H. McNeill” (born 1917, Vancouver, British Columbia) is a Canadian historian. He is currently Professor Emeritus of History at the University of Chicago. McNeill’s most popular work is “The Rise of the West”. The book explored human history in terms of the effect of different old world civilizations on one another, and especially the dramatic effect of western civilization on others in the past 500 years. It had a major impact on historical theory, especially in distinction to Oswald
Scientific aim: to compare techniques for determining important sentences.
7
System Overview (diagram)
Passage retrieval → Sentence extraction → Candidate sentences
Get Wikipedia categories → Select sample articles → Reference corpus
Candidate sentences + Reference corpus → Rank sentences → Ranked sentences
8
Candidate Sentence Selection
- Input: topic (name and category of a person)
- Source corpus: AQUAINT Corpus
- Sentence extraction
  - The source corpus is split into passages and indexed
  - The topic is submitted as a query
  - The top 200 passages are selected
  - Passages are split into sentences
  - Sentences containing the topic words are retained
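The last two extraction steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the passages are taken as already retrieved and ranked, the regex sentence splitter is a naive stand-in, and "containing the topic words" is read as "containing at least one topic word". The function names `split_sentences` and `candidate_sentences` are hypothetical, not the authors' code.

```python
import re

def split_sentences(passage):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    # Abbreviations such as "Fla." would be split incorrectly.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", passage) if s.strip()]

def candidate_sentences(topic, passages, top_k=200):
    # `passages` is assumed to be sorted by retrieval score already.
    topic_words = {w.lower() for w in re.findall(r"\w+", topic)}
    candidates = []
    for passage in passages[:top_k]:
        for sentence in split_sentences(passage):
            words = {w.lower() for w in re.findall(r"\w+", sentence)}
            # Retain sentences sharing at least one word with the topic (assumption).
            if topic_words & words:
                candidates.append(sentence)
    return candidates

# Toy passages invented for illustration, not taken from AQUAINT.
passages = [
    "Brad Pitt stars in the film. The weather was mild.",
    "Critics praised Pitt for the role. Ticket sales rose.",
]
result = candidate_sentences("Brad Pitt", passages)
```

A stricter reading would require every topic word to appear in the sentence, which amounts to replacing the intersection test with a subset test.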
9
Sentence Ranking
- Sentences are ranked by their similarity to reference sentences
- Reference sentences
  - Given a topic and its category, e.g., Brad Pitt and Actor
  - The reference corpus is a set of sentences describing other entities in the same category, i.e., other actors
10
System Overview (diagram)
Passage retrieval → Sentence extraction → Candidate sentences
Get Wikipedia categories → Select sample articles → Reference corpus
Candidate sentences + Reference corpus → Rank sentences → Ranked sentences
11
Ranking Sentences
- Two dimensions
  - Graph-based vs. non-graph-based
  - Using (or not using) a reference corpus
- Five ways
  - Word overlap
  - Language modelling
  - Graph-based methods
    - Generic
    - Graph-based method with reference corpus
    - Graph-based method with reference corpus plus lexical layer
12
Assumptions
Given an entity of some category, we consider other entities of the same category and the properties that are typically described for them. That is, if a property is included in the descriptions of a significant portion of entities in the same category as our input entity, we assume it to be an important one.
13
Sentence Ranking: Similarity Measures
- Word overlap
  - Compute the Jaccard coefficient between candidate and reference sentences
  - Sentences are ranked by their maximum scores
- Language modelling
  - Sentences are ranked by their likelihood w.r.t. the language model of the reference corpus
- Graph-based method …
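The two non-graph measures can be sketched as follows. The slides do not give the tokenization, the type of language model, or its smoothing, so this sketch assumes lowercase word tokens, a unigram model with add-one smoothing, and a length-normalized log-likelihood. All function names are hypothetical.

```python
import math
import re
from collections import Counter

def tokens(text):
    # Assumed tokenization: lowercase alphanumeric word tokens.
    return re.findall(r"\w+", text.lower())

def jaccard(a, b):
    # Jaccard coefficient: |A ∩ B| / |A ∪ B| over the two word sets.
    sa, sb = set(tokens(a)), set(tokens(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def rank_by_overlap(candidates, references):
    # Each candidate is scored by its maximum Jaccard score over all references.
    scored = [(max((jaccard(c, r) for r in references), default=0.0), c)
              for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

def unigram_model(references):
    # Unigram model of the reference corpus with add-one smoothing (assumption).
    counts = Counter(t for r in references for t in tokens(r))
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot shared by all unseen words
    return lambda word: (counts[word] + 1) / (total + vocab)

def avg_log_likelihood(sentence, prob):
    # Per-token average log-probability; the normalization is an assumption
    # made here so that long sentences are not penalized for their length.
    toks = tokens(sentence)
    if not toks:
        return float("-inf")
    return sum(math.log(prob(t)) for t in toks) / len(toks)

# Toy sentences invented for illustration.
references = ["He was born in Ohio and became an actor.",
              "She is an American actor born in Texas."]
candidates = ["The stock market fell sharply.", "Pitt was born in Oklahoma."]

ranked = rank_by_overlap(candidates, references)
prob = unigram_model(references)
lm_scores = {c: avg_log_likelihood(c, prob) for c in candidates}
```

In this toy run both measures place the biographical sentence above the unrelated one: it shares "was", "born", and "in" with the first reference, while the other candidate shares no words with either.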
14
Sentence Ranking: Graph-based method for summarization (Erkan & Radev, 2004)
- Given a text to be summarized
- Construct a graph by linking related sentences (word overlap)
- Assign a score to each sentence using PageRank
- The sentence with the highest PageRank score is assumed to contain the salient information
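The steps above can be sketched in plain Python. This is a simplified illustration and not the formulation of Erkan & Radev: it assumes an unweighted, undirected graph, an edge whenever the Jaccard overlap of two sentences reaches a hypothetical threshold of 0.1, and PageRank by power iteration with a damping factor of 0.85.

```python
import re

def word_set(text):
    return set(re.findall(r"\w+", text.lower()))

def build_graph(sentences, threshold=0.1):
    # Link two sentences when their word overlap (Jaccard) reaches the threshold.
    sets = [word_set(s) for s in sentences]
    neighbours = [[] for _ in sentences]
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            union = sets[i] | sets[j]
            if union and len(sets[i] & sets[j]) / len(union) >= threshold:
                neighbours[i].append(j)
                neighbours[j].append(i)
    return neighbours

def pagerank(neighbours, damping=0.85, iterations=50):
    # Power iteration; isolated nodes spread their score uniformly
    # so that the scores always sum to 1.
    n = len(neighbours)
    if n == 0:
        return []
    scores = [1.0 / n] * n
    for _ in range(iterations):
        dangling = sum(scores[i] for i in range(n) if not neighbours[i])
        new = [(1 - damping) / n + damping * dangling / n] * n
        for i in range(n):
            for j in neighbours[i]:
                new[j] += damping * scores[i] / len(neighbours[i])
        scores = new
    return scores

# Toy sentences invented for illustration.
sentences = [
    "Pitt was born in Oklahoma in 1963.",
    "Pitt was raised in Missouri.",
    "Born in Oklahoma, Pitt moved to Missouri.",
    "Rain is expected tomorrow.",
]
scores = pagerank(build_graph(sentences))
best = max(range(len(sentences)), key=lambda i: scores[i])
```

The three mutually overlapping sentences reinforce one another, while the unrelated fourth sentence has no edges and ends up with the lowest score. This is the redundancy effect the generic method relies on.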
15
Sentence Ranking: Graph-based method
[Figure: three graph configurations]
- Generic method without reference corpus: target sentences only (T1 to T7)
- With reference corpus: target sentences (T1 to T3) linked to reference sentences (R1 to R4)
- With lexical level: target sentences (T1 to T3), reference sentences (R1 to R4), and lexical nodes (W1 to W3)
16
Research questions
- Does the use of reference corpora help in improving importance estimation?
- Do graph-based estimation methods outperform non-graph-based methods?
- Does the additional representation of important lexical items help improve importance estimation for sentences?
17
Experimental Setup
- Data set
  - TREC data set? In a preliminary experiment, some important snippets were not included, e.g.:
    - Fred Durst: Born in Jacksonville, Fla., Durst grew up in Gastonia, N.C., where his love of hip-hop music and break dancing made him an outcast.
    - Eileen Marie Collins: She was born Nov. 19, 1956, in Elmira, N.Y., to Jim and Rose Collins.
  - New data set: 30 topics (persons), 10 occupations
18
Experimental Setup
- Assessment
  - Take the top 20 snippets returned by the different systems
  - Manually assess each snippet for important biographical information
  - Two assessors, who were allowed to examine the topic in Wikipedia or with a general-purpose web search engine
  - Agreement: Kappa = 0.70
- Baseline
  - Rank sentences by their retrieval scores (performed well at TREC 2003)
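For reference, an agreement figure of this kind can be computed as below, assuming the reported value is Cohen's kappa for two assessors, κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The labels in the example are invented for illustration and are not the study's judgments.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    # Cohen's kappa for two assessors labelling the same items.
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each assessor's label proportions, summed.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both assessors used one and the same label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical judgments: 1 = important biographical information, 0 = not.
assessor_1 = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
assessor_2 = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]
kappa = cohen_kappa(assessor_1, assessor_2)  # about 0.6 for these toy labels
```

Here the assessors agree on 8 of 10 snippets (p_o = 0.8) and each marks half of them as important (p_e = 0.5), which gives a kappa of about 0.6.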
19
Results
- 600 snippets in total for each run
- Two scores
  - WOD: without duplicates
  - WD: with duplicates
20
Summary of importance estimation methods
- Word overlap
  - Based on a single sentence
  - Returns several duplicates
- Language modelling
  - Based on the combined corpus
  - Does not distinguish between sentences
  - Less effective
- Generic graph-based method
  - Does not use the reference corpus
  - Based on redundancy in the news corpus
- Graph-based + reference corpus
  - Combines evidence from multiple sentences
21
Concluding Remarks
- Task: estimating the importance of sentences
- Main finding: combining a corpus-based approach to capturing the knowledge encoded in sentences known to be important with a graph-based method for ranking sentences performs best
22
Thank you
23
Result Significant differences?