Centrifuser Output. Min Yen Kan, 2001. SIGIR 2001 – WTS / DUC, 13 Sep 2001.


2 Centrifuser Output (Min Yen Kan, 2001)
Centrifuser's output comes in three parts:
- Navigation
- Informative extract, based on similarities
- Indicative generated text, based on differences
Centrifuser can currently produce this output for documents with the same domain and genre.

3 Part 1: Informative Summaries

4 Informative Summaries
- Informative = replaces the document with a shorter version
Task: Provide the most important aspects of the document(s)
Interaction type: Browsing
Strategy: Since search results are similar, put together similarities across documents

5 Algorithm
1. *Convert each document to a Document Topic Tree
2. *Compute Composite Topic Tree
3. Align query and topics across trees
4. Extract sentences
5. Order into summary

6 1. Document Topic Tree
- Hierarchical view of the document
  - Layout (Hu et al. 99)
  - Lexical chains (Hearst 94, Choi 00)
- Done offline per document
Example tree (from the slide figure):
  High Blood Pressure. Level: 1, Style: Prose, Contents: 3 Headers, …
    AHA Recommendation. Level: 2, Order: 1, Style: Prose, Contents: 1 Table, …
    Related AHA publications. Level: 2, Order: 3, Style: Bulleted, Contents: …
    See also in this guide. Level: 2, Order: 3, Style: Prose, Contents: 5 items, …
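
The tree on this slide can be pictured as a small data structure. The following is a minimal sketch, not Centrifuser's actual representation: the class name `TopicNode` and its field names are assumptions chosen to mirror the attributes shown in the figure (level, order, style), and the values are copied from the slide as they appear there.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class TopicNode:
    """One heading in a document, with layout attributes as shown on the slide."""
    title: str
    level: int
    order: int = 0          # position among siblings; 0 used here for the root
    style: str = "Prose"
    children: List["TopicNode"] = field(default_factory=list)

    def add(self, child: "TopicNode") -> "TopicNode":
        self.children.append(child)
        return child

    def walk(self) -> Iterator["TopicNode"]:
        """Yield this node and all of its descendants, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


# Values taken from the slide figure (hypothetical class, real slide labels).
root = TopicNode("High Blood Pressure", level=1)
root.add(TopicNode("AHA Recommendation", level=2, order=1))
root.add(TopicNode("Related AHA publications", level=2, order=3, style="Bulleted"))
root.add(TopicNode("See also in this guide", level=2, order=3))

for node in root.walk():
    print("  " * (node.level - 1) + node.title)
```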

7 2. Composite Topic Tree
- Norm for a particular type of document
- Create by aligning topics in example trees by similarity
- Stores order, frequency and variants of each topic
- Done offline per domain and genre combination handled
[Figure: doc tree 1 (yellow) and doc tree 2 (blue) are merged level by level. Joining nodes at level 2, then at level 3: a joined node at level 1 (e.g. disease), a newly joined node at level 2 (e.g. symptoms), a newly joined node at level 3 (e.g. nausea).]
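
A rough sketch of how such a composite could be built is given below. It is a simplification under stated assumptions: each document is flattened to a single list of headings rather than a tree, and headings are matched by word-overlap (Jaccard) similarity, which is a stand-in because the slides do not name the metric. The function names, the `threshold` parameter and the sample headings are all hypothetical.

```python
def jaccard(a, b):
    """Word-overlap similarity between two headings (stand-in metric)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def build_composite(docs, threshold=0.5):
    """docs: list of documents, each a list of headings in document order."""
    merged = []  # one entry per composite topic
    for doc_id, headings in enumerate(docs):
        for pos, heading in enumerate(headings):
            match = next(
                (m for m in merged
                 if any(jaccard(heading, v) >= threshold for v in m["variants"])),
                None,
            )
            if match is None:
                match = {"variants": [], "positions": [], "docs": set()}
                merged.append(match)
            if heading not in match["variants"]:
                match["variants"].append(heading)
            match["positions"].append(pos)
            match["docs"].add(doc_id)
    # Keep what the slide says the composite stores: variants, frequency, order.
    return [
        {
            "variants": m["variants"],
            "freq": len(m["docs"]) / len(docs),
            "order": sum(m["positions"]) / len(m["positions"]),
        }
        for m in merged
    ]


docs = [
    ["Symptoms", "Causes", "Treatment"],
    ["Causes", "Common symptoms", "Diet"],
]
composite = build_composite(docs)
for topic in composite:
    print(topic)
```

Here `freq` is the share of example documents that contain the topic and `order` is its average position, so a topic seen in every document gets frequency 1.0.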

8 3. Topic Alignment
- Use similarity metric to map query to composite and document trees
- Focus topic defines 3 regions
- Done online, to find scope of information needed in summary
[Figure: the query "Hypertension" mapped onto the composite tree and the document trees. Examples: root as focus topic (e.g. About hypertension); 2nd level subtopic as focus topic (e.g. Guide to Cardiac Diseases). Legend: focus topic, relevant, irrelevant, too detailed.]
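
The slide names the regions but does not spell out their boundaries. The sketch below assumes one plausible reading: nodes outside the focus topic's subtree are irrelevant, descendants within a depth limit are relevant, and deeper descendants are too detailed. The `max_depth` cut-off, the function name and the sample tree are assumptions for illustration only.

```python
def classify_regions(tree, root, focus, max_depth=1):
    """tree: dict mapping a topic to its list of child topics."""
    regions = {}

    def visit(name, depth):
        # depth is None while we are outside the focus topic's subtree
        if name == focus:
            regions[name], depth = "focus", 0
        elif depth is None:
            regions[name] = "irrelevant"
        else:
            depth += 1
            regions[name] = "relevant" if depth <= max_depth else "too detailed"
        for child in tree.get(name, []):
            visit(child, depth)

    visit(root, None)
    return regions


# Hypothetical tree; topic names borrowed from the slides.
tree = {
    "Cardiac diseases": ["Hypertension", "Angina"],
    "Hypertension": ["Symptoms", "Treatment"],
    "Symptoms": ["Nausea"],
    "Treatment": ["Drugs"],
}
regions = classify_regions(tree, root="Cardiac diseases", focus="Hypertension")
for topic, region in regions.items():
    print(f"{topic}: {region}")
```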

9 4. Sentence Extraction
- Aligned topics chosen in descending typicality
- Use SimFinder to choose sentences
- Cover as many topics as possible to ensure breadth of summary
Composite topic tree (legend: aligned, focus topic, unaligned = no instance in documents):
  *Disease* Freq: 1.0; Diet Freq: 0.6; For more information Freq: 0.7; Treatment Freq: 0.9; Diagnosis Freq: 0.8; Surgery Freq: 0.3; Drugs Freq: 0.7; Definition Freq: 0.2; Causes Freq: 0.8; Symptoms Freq: 0.8; Nausea Freq: 0.2
Extracted sentences:
  1.0 (hypertension): Since blood is carried … / "If a drug that blocks …
  0.9 (treatment): How Can I Reduce High … / How Do I Manage My …
  0.8 (causes): Blood pressure is …
  0.7 (drugs): "Over-the-counter" …
  0.7 (for more information): 2000 Heart and Stroke …
  0.6 (diet): Everybody's looking for …
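
The selection policy on this slide (most typical topics first, breadth before depth) can be sketched as a round-robin over topics sorted by frequency. This is only an illustration: SimFinder's sentence choice is not reproduced, the candidate sentences are placeholder strings, and `extract` and `budget` are hypothetical names.

```python
def extract(topic_freq, candidates, budget):
    """Pick up to `budget` sentences, visiting aligned topics by descending frequency.

    topic_freq: topic -> frequency in the composite tree
    candidates: topic -> candidate sentences found in the documents
    """
    # Unaligned topics (no candidate sentences) are skipped.
    ranked = sorted(
        (t for t in topic_freq if candidates.get(t)),
        key=lambda t: -topic_freq[t],
    )
    summary, round_no = [], 0
    while len(summary) < budget:
        added = False
        for topic in ranked:
            if len(summary) >= budget:
                break
            if round_no < len(candidates[topic]):
                summary.append((topic, candidates[topic][round_no]))
                added = True
        if not added:
            break  # every topic is exhausted
        round_no += 1
    return summary


# Frequencies from the slide; sentences are placeholders, not real extracts.
topic_freq = {"disease": 1.0, "treatment": 0.9, "causes": 0.8, "symptoms": 0.8, "drugs": 0.7}
candidates = {
    "disease": ["d1", "d2"],
    "treatment": ["t1", "t2"],
    "causes": ["c1"],
    "symptoms": [],          # unaligned: no instance in the documents
    "drugs": ["g1"],
}
picked = extract(topic_freq, candidates, budget=6)
print(picked)
```

With a budget of six, every aligned topic contributes one sentence before the two most typical topics contribute a second one.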

10 5. Sentence Ordering
- Order extracted sentences by order in composite tree (by norm)
- Order by norm order to get best results
Extracted sentences (ordered by typicality):
  1.0 (hypertension): Since blood is carried … / "If a drug that blocks …
  0.9 (treatment): How Can I Reduce High … / How Do I Manage My …
  0.8 (causes): Blood pressure is …
  0.7 (drugs): "Over-the-counter" …
  0.7 (for more information): 2000 Heart and Stroke …
  0.6 (diet): Everybody's looking for …
Reordered sentences (ordered by normal first appearance):
  1. (hypertension): Since blood is carried … / "If a drug that blocks …
  1.4 (causes): Blood pressure is …
  1.5 (treatment): How Can I Reduce High … / How Do I Manage My …
  1.5.1 (drugs): "Over-the-counter" …
  1.5.2 (diet): Everybody's looking for …
  1.6 (for more information): 2000 Heart and Stroke …
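
The reordering shown on this slide amounts to sorting topics by their outline position in the composite tree. A minimal sketch follows, using the outline numbers and topic labels from the slide; the helper name `outline_key` is an assumption.

```python
def outline_key(label):
    """Turn an outline number such as '1.5.2' into a sortable tuple (1, 5, 2)."""
    return tuple(int(part) for part in label.strip(".").split("."))


# Outline positions in the composite tree, as shown on the slide.
norm_position = {
    "hypertension": "1.",
    "causes": "1.4",
    "treatment": "1.5",
    "drugs": "1.5.1",
    "diet": "1.5.2",
    "for more information": "1.6",
}

# Topics in the order they were extracted (descending typicality).
by_typicality = ["hypertension", "treatment", "causes", "drugs",
                 "for more information", "diet"]

by_norm = sorted(by_typicality, key=lambda t: outline_key(norm_position[t]))
print(by_norm)
```

Comparing tuples rather than strings keeps a parent such as 1.5 ahead of its children 1.5.1 and 1.5.2, and would still sort 1.10 after 1.9.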

11 Part 2: Indicative Summaries

12 Indicative Summaries
- Indicative = helps decide whether a document is worthwhile for retrieval
Task: Show salient differences from other candidates
Interaction type: Searching
Strategy: Identify content and non-content aspects in which each source is different

13 What goes into an Indicative Summary?
- Examine existing indicative summaries: Library card catalog
- Examine multidocument scenarios

14 Corpus Parameters
- 82 summaries from CU's online catalog
- Healthcare domain
- Catalogued types of information present
  - Document-derived features
  - Metadata features
Example summary: "Practical Interventional Cardiology represents a practical reference for the interventional cardiologist and those in training, as well as the non-invasive cardiologist and physician. […] Rather than providing detailed and exhaustive reviews, the purpose of this book is to present practical information regarding cardiac interventional procedures. […]"

15 Corpus Analysis Results
Document feature (document-derived)   Freq
Topicality                            100%
Content Types                          37%
Readability                            18%
Internal Structure                     17%
Special Content                         7%

Document feature (metadata)           Freq
Title                                  31%
Revised/Edition                        28%
Author/Editor                          21%
Purpose                                18%
Audience                               17%
…                                        …
Example summary: "Practical Interventional Cardiology represents a practical reference for the interventional cardiologist and those in training, as well as the non-invasive cardiologist and physician. […] Rather than providing detailed and exhaustive reviews, the purpose of this book is to present practical information regarding cardiac interventional procedures. […]"

16 Analysis - Multidocument
- Prescriptive Guidelines
  - Open Directory Project – website hierarchy
- Differences are important!
  1. Differences between documents
  2. Differences from the norm
  3. Those relevant to the query (Grice '75)
"Make clear what makes a site different from the rest"

17 Corpus Analysis Discussion
- Topicality (i.e. content) is most important
- Other features have a strong role
- For Centrifuser
  - Design summary around topics
  - When space allows, add other features as needed
    – When feature differs from the norm
    – Future work: mimic the percentages in study
  - Differences drive the text
    – Query and norm should affect the summary content.

18 Algorithm
1. *Make Composite and Document Topic Trees
2. Align query and topics across trees
3. Use region ratios to compute document categories
4. Decide messages to realize
5. Order messages
6. Generate the text

19 2. (recap) Align query and topics
- Map the query to a topic
- Query node divides nodes into relevant, irrelevant and intricate regions
- Attributing the effect of the query on the generated text
[Figure: two example trees. Root as focus topic (Query: Angina); 2nd level subtopic as focus topic (Query: Treatments of Angina). Legend: focus topic, relevant, irrelevant, intricate.]

20 Classifying Topics – By Norm
- Relevant nodes divided into typical and rare
- Attributing the effect of the norm on the generated text
[Figure: composite topic tree and document topic tree. Legend: focus topic; typical node (freq >= .5); rare node (freq < .5); unaligned topic.]
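
The typical/rare split uses the 0.5 frequency threshold given in the legend. A short sketch, with frequencies borrowed from the composite tree on the sentence extraction slide; the function name `classify_by_norm` is an assumption.

```python
def classify_by_norm(topic_freq, threshold=0.5):
    """Label each relevant topic as typical (freq >= threshold) or rare."""
    return {
        topic: "typical" if freq >= threshold else "rare"
        for topic, freq in topic_freq.items()
    }


labels = classify_by_norm(
    {"treatment": 0.9, "diet": 0.6, "surgery": 0.3, "definition": 0.2}
)
print(labels)
```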

21 3. Categorizing Documents
- Ratio of typical, rare, intricate and irrelevant determines category
- 7 categories altogether
Example documents:
  3 typical, 2 rare, 2 intricate and 8 irrelevant
  5 typical, 2 rare, 2 intricate
Example categories:
  Irrelevant Document: 50+% irrelevant
  Specialized Document: 50+% typical, < 50% of all possible typical
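
Only two of the seven categories are defined on this slide, so the sketch below implements just those two and returns "other" for everything else rather than guessing the remaining rules. It reads "50+%" as "at least 50%". The function name, the `all_typical` parameter (number of typical topics in the composite tree) and its value of 12 are assumptions.

```python
def categorize(typical, rare, intricate, irrelevant, all_typical):
    """Assign a document category from its topic counts (two slide rules only)."""
    total = typical + rare + intricate + irrelevant
    if total == 0 or all_typical == 0:
        return "other"
    if irrelevant / total >= 0.5:
        return "irrelevant"
    # Mostly typical topics, but covering under half of the possible typical ones.
    if typical / total >= 0.5 and typical / all_typical < 0.5:
        return "specialized"
    return "other"  # the remaining five categories are not given on the slide


# Topic counts from the slide; all_typical=12 is a made-up composite size.
first = categorize(typical=3, rare=2, intricate=2, irrelevant=8, all_typical=12)
second = categorize(typical=5, rare=2, intricate=2, irrelevant=0, all_typical=12)
print(first, second)
```

Under these assumptions the first example document is 8/15 irrelevant and the second is 5/9 typical while covering 5 of 12 typical topics.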

22 4. Forming Messages
Messages and the text that they eventually realize
- Other messages may include:
  - Number of categories in summary
  - Other optional information (e.g. content type)
Messages:
  Relation: category-description (document category description)
    Args: docCat: atypical
  Relation: category-elements (documents belonging to category)
    Args: docCat: atypical; element: AMA Guide; element: CU Guide
  Relation: has-topics (topics in category)
    Args: docCat: atypical; topic: definition; topic: risks
Realized text: More information on additional topics which are not included in the summary are available in these files (The American Medical Association family medical guide and The Columbia University College of Physicians and Surgeon complete home medical guide).. The topics include “definition” and “what are …
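
The messages on this slide are relation/argument records. One way to hold them, as a sketch only: the `Message` class and its `values` helper are hypothetical, while the relation names and arguments are copied from the slide.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Message:
    """A relation plus an ordered list of (role, value) arguments."""
    relation: str
    args: List[Tuple[str, str]] = field(default_factory=list)

    def values(self, role: str) -> List[str]:
        """All argument values filed under the given role, in order."""
        return [value for r, value in self.args if r == role]


messages = [
    Message("category-description", [("docCat", "atypical")]),
    Message("category-elements", [("docCat", "atypical"),
                                  ("element", "AMA Guide"),
                                  ("element", "CU Guide")]),
    Message("has-topics", [("docCat", "atypical"),
                           ("topic", "definition"),
                           ("topic", "risks")]),
]

for m in messages:
    print(m.relation, m.args)
```

A list of pairs is used instead of a dict because a role such as `element` or `topic` can repeat within one message.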

23 5. Ordering Messages
- Inter-category: by importance of dominant topic type.
- Intra-category: document category and elements before optional information.
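
Both ordering rules fit into a single two-part sort key. In the sketch below the category ranking and the placement of `has-topics` are assumptions, since the slides give neither; `content-type` is a hypothetical name for an optional message.

```python
# Hypothetical importance ranking of document categories (lower comes first).
CATEGORY_RANK = {"specialized": 0, "atypical": 1}

# Category description and elements first; anything unlisted counts as optional.
RELATION_RANK = {"category-description": 0, "category-elements": 1, "has-topics": 2}


def order_messages(messages):
    """messages: list of (document category, relation) pairs."""
    return sorted(
        messages,
        key=lambda m: (CATEGORY_RANK[m[0]],
                       RELATION_RANK.get(m[1], len(RELATION_RANK))),
    )


unordered = [
    ("atypical", "content-type"),
    ("specialized", "category-elements"),
    ("atypical", "category-description"),
    ("specialized", "category-description"),
    ("atypical", "category-elements"),
]
ordered = order_messages(unordered)
for category, relation in ordered:
    print(category, relation)
```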

24 6. Text Generation
- Use a small grammar to realize the messages
- Referring Expression Issues
  - Size of referring expressions
  - Re-ordering documents in the set
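
To give a feel for realization with a small grammar, here is a template-based sketch. It is not Centrifuser's grammar: the templates are loosely modeled on the sample output shown on the message-forming slide, and the function names are assumptions.

```python
def join_list(items):
    """Join items as 'a', 'a and b', or 'a, b and c'."""
    if len(items) <= 1:
        return "".join(items)
    return ", ".join(items[:-1]) + " and " + items[-1]


def realize(relation, values):
    """Turn one message (relation plus argument values) into a sentence."""
    if relation == "category-elements":
        noun = "this file" if len(values) == 1 else "these files"
        return f"More information is available in {noun} ({join_list(values)})."
    if relation == "has-topics":
        quoted = [f'"{v}"' for v in values]
        return f"The topics include {join_list(quoted)}."
    raise ValueError(f"no template for relation {relation!r}")


text = " ".join([
    realize("category-elements", ["AMA Guide", "CU Guide"]),
    realize("has-topics", ["definition", "risks"]),
])
print(text)
```

The singular/plural switch in the first template is a small instance of the agreement decisions a generation grammar has to make once the set of documents is known.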

25 Task Based Evaluation
Scenario: "You've been diagnosed with cancer …"
- Compare against 3 real-world systems
  - IR engine (Google)
  - Hub (Yahoo)
  - Human expert (About.com)
- Goals
  - Evaluate on subjective criteria, use think aloud techniques
  - See which document features best fit user need
- Pilot study complete; full study going on now

26 Conclusion
- An application of summarization for IR
- Performs informative and indicative summarization
- By using extraction and text generation techniques
- To support browsing and searching
http://centrifuser.cs.columbia.edu

