Centrifuser Output (Min Yen Kan, 2001). SIGIR 2001 – WTS / DUC, 13 Sep 2001


Centrifuser Output
Min Yen Kan, 2001
Centrifuser's output comes in three parts:
• Navigation
• Informative extract, based on similarities
• Indicative generated text, based on differences
Centrifuser can currently produce this output for documents with the same domain and genre.

Part 1: Informative Summaries

Informative Summaries
• Informative = replaces the document with a shorter version
  Task: Provide the most important aspects of the document(s)
  Interaction type: Browsing
  Strategy: Since search results are similar, put together similarities across documents

Algorithm
1. *Convert each document to a Document Topic Tree
2. *Compute Composite Topic Tree
3. Align query and topics across trees
4. Extract sentences
5. Order into summary

1. Document Topic Tree
• Hierarchical view of the document
  - Layout (Hu et al. 99)
  - Lexical chains (Hearst 94, Choi 00)
• Done offline per document
[Figure: example document topic tree]
  High Blood Pressure (Level: 1; Style: Prose; Contents: 3 Headers, …)
    AHA Recommendation (Level: 2; Order: 1; Style: Prose; Contents: 1 Table, …)
    Related AHA publications (Level: 2; Order: 3; Style: Bulleted; Contents: …)
    See also in this guide (Level: 2; Order: 3; Style: Prose; Contents: 5 items, …)
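A document topic tree of this kind could be represented roughly as follows. This is a minimal sketch: the class name `TopicNode` and its fields are illustrative assumptions, not Centrifuser's actual data structures, and the example values are the ones shown on the slide.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class TopicNode:
    """One node of a document topic tree (field names are hypothetical)."""
    title: str
    level: int
    style: str
    order: Optional[int] = None  # position recorded on the slide, if any
    children: List["TopicNode"] = field(default_factory=list)

    def add(self, child: "TopicNode") -> "TopicNode":
        self.children.append(child)
        return child

    def walk(self) -> Iterator["TopicNode"]:
        """Yield this node and all of its descendants, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


# Values copied from the slide's example tree.
root = TopicNode("High Blood Pressure", level=1, style="Prose")
root.add(TopicNode("AHA Recommendation", level=2, style="Prose", order=1))
root.add(TopicNode("Related AHA publications", level=2, style="Bulleted", order=3))
root.add(TopicNode("See also in this guide", level=2, style="Prose", order=3))

titles = [node.title for node in root.walk()]
```

Keeping layout attributes such as style on each node leaves room for the non-content features that the indicative summary uses later.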

2. Composite Topic Tree
• Norm for a particular type of document
  - Created by aligning topics in example trees by similarity
  - Stores order, frequency and variants of each topic
  - Done offline per domain and genre combination handled
[Figure: joining doc tree 1 (yellow) and doc tree 2 (blue). A joined node at level 1 (e.g. disease), then joining nodes at level 2 gives a newly joined node (e.g. symptoms), and joining nodes at level 3 gives a newly joined node (e.g. nausea)]
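The alignment step can be illustrated with a small sketch that merges headings from several example documents into composite topics and records their frequency, positions and wording variants. Several things here are assumptions made for illustration: the word-overlap (Jaccard) similarity, the `threshold` value, and the restriction to a single tree level. The slides do not say which similarity metric Centrifuser uses.

```python
from dataclasses import dataclass, field
from typing import List, Set


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two headings (a stand-in metric)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


@dataclass
class CompositeTopic:
    variants: Set[str] = field(default_factory=set)     # heading wordings seen
    doc_count: int = 0                                   # documents containing the topic
    positions: List[int] = field(default_factory=list)  # where the topic appeared

    def frequency(self, n_docs: int) -> float:
        return self.doc_count / n_docs

    def similarity(self, heading: str) -> float:
        return max(jaccard(heading, v) for v in self.variants)


def build_composite(docs: List[List[str]], threshold: float = 0.5) -> List[CompositeTopic]:
    """Align headings across example documents into composite topics."""
    composite: List[CompositeTopic] = []
    for headings in docs:
        counted = set()  # composite topics already counted for this document
        for pos, heading in enumerate(headings):
            scores = [t.similarity(heading) for t in composite]
            best = max(range(len(scores)), key=scores.__getitem__, default=None)
            if best is None or scores[best] < threshold:
                composite.append(CompositeTopic())  # no close match: new topic
                best = len(composite) - 1
            topic = composite[best]
            topic.variants.add(heading)
            topic.positions.append(pos)
            if best not in counted:
                topic.doc_count += 1
                counted.add(best)
    return composite


# Hypothetical example documents, each reduced to its list of headings.
docs = [
    ["Symptoms", "Causes", "Treatment"],
    ["Causes", "Symptoms", "Diet"],
    ["Symptoms", "Treatment options"],
]
composite = build_composite(docs)
frequencies = [round(t.frequency(len(docs)), 2) for t in composite]
```

In this toy run "Treatment" and "Treatment options" end up as variants of one composite topic, while "Diet" stays a topic seen in only one document.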

3. Topic Alignment
• Use similarity metric to map query to composite and document trees
• Focus topic defines 3 regions
• Done online, to find scope of information needed in summary
[Figure: Query: Hypertension, mapped onto the composite tree and the document trees; root as focus topic (e.g. About hypertension) and 2nd level subtopic as focus topic (e.g. Guide to Cardiac Diseases). Legend: focus topic, relevant, irrelevant, too detailed]
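One way to picture the three regions is to label every node of a tree relative to the focus topic. The rule used below is an assumption made for illustration, since the slide only names the regions: nodes within `max_depth` levels below the focus topic are relevant, deeper descendants are too detailed, and everything outside the focus subtree is irrelevant. The function name, the `max_depth` parameter and the example tree are hypothetical.

```python
from typing import Dict, List, Optional

Tree = Dict[str, List[str]]  # topic -> list of child topics


def label_regions(tree: Tree, root: str, focus: str, max_depth: int = 1) -> Dict[str, str]:
    """Label each topic as focus, relevant, too detailed or irrelevant."""
    labels: Dict[str, str] = {}

    def visit(node: str, depth: Optional[int]) -> None:
        # depth is the node's distance below the focus topic,
        # or None while we are outside the focus subtree.
        if node == focus:
            depth = 0
            labels[node] = "focus"
        elif depth is None:
            labels[node] = "irrelevant"
        else:
            labels[node] = "relevant" if depth <= max_depth else "too detailed"
        next_depth = None if depth is None else depth + 1
        for child in tree.get(node, []):
            visit(child, next_depth)

    visit(root, None)
    return labels


# Hypothetical document tree in which the query maps to a 2nd level subtopic.
tree = {
    "Guide to Cardiac Diseases": ["Hypertension", "Angina"],
    "Hypertension": ["Symptoms", "Treatment"],
    "Treatment": ["Drugs"],
    "Angina": ["Causes"],
}
labels = label_regions(tree, "Guide to Cardiac Diseases", "Hypertension")
```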

4. Sentence Extraction
• Aligned topics chosen in descending typicality
• Use SimFinder to choose sentences
• Cover as many topics as possible to ensure breadth of summary
[Figure: composite topic tree with topic frequencies. Legend: focus topic, aligned, unaligned (no instance in documents)]
  *Disease*             Freq: 1.0
  Treatment             Freq: 0.9
  Diagnosis             Freq: 0.8
  Causes                Freq: 0.8
  Symptoms              Freq: 0.8
  Drugs                 Freq: 0.7
  For more information  Freq: 0.7
  Diet                  Freq: 0.6
  Surgery               Freq: 0.3
  Definition            Freq: 0.2
  Nausea                Freq: 0.2
Extracted Sentences
  1.0 (hypertension)          Since blood is carried … / "If a drug that blocks …
  0.9 (treatment)             How Can I Reduce High … / How Do I Manage My …
  0.8 (causes)                Blood pressure is …
  0.7 (drugs)                 "Over-the-counter" …
  0.7 (for more information)  2000 Heart and Stroke …
  0.6 (diet)                  Everybody's looking for …
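The selection strategy can be sketched as a round-robin over aligned topics in descending frequency, so that every topic contributes a sentence before any topic contributes a second one. This is only an illustration of the breadth-first idea: it does not call SimFinder, it simply takes each topic's candidate sentences in the order given. The function name, the `budget` parameter and the sentence identifiers (`H1`, `T1`, ...) are hypothetical placeholders.

```python
from typing import Dict, List, Tuple


def extract(topic_freq: Dict[str, float],
            sentences_by_topic: Dict[str, List[str]],
            budget: int) -> List[Tuple[str, str]]:
    """Pick up to `budget` (topic, sentence) pairs, most typical topics first."""
    # Only topics with at least one sentence in the documents are aligned.
    aligned = [t for t in topic_freq if sentences_by_topic.get(t)]
    ranked = sorted(aligned, key=lambda t: topic_freq[t], reverse=True)

    picked: List[Tuple[str, str]] = []
    round_no = 0
    while len(picked) < budget:
        added = False
        for topic in ranked:
            candidates = sentences_by_topic[topic]
            if round_no < len(candidates) and len(picked) < budget:
                picked.append((topic, candidates[round_no]))
                added = True
        if not added:  # every topic is exhausted
            break
        round_no += 1
    return picked


# Frequencies from the slide; sentence ids are placeholders.
topic_freq = {
    "hypertension": 1.0, "treatment": 0.9, "causes": 0.8, "drugs": 0.7,
    "for more information": 0.7, "diet": 0.6, "surgery": 0.3,
}
sentences_by_topic = {
    "hypertension": ["H1", "H2"], "treatment": ["T1", "T2"], "causes": ["C1"],
    "drugs": ["D1"], "for more information": ["F1"], "diet": ["E1"],
}  # "surgery" has no instance in the documents, so it stays unaligned
summary = extract(topic_freq, sentences_by_topic, budget=8)
```

With a budget of eight the sketch selects one sentence for each of the six aligned topics and then a second sentence for the two most typical ones.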

5. Sentence Ordering
• Order extracted sentences by order in composite tree (by norm)
• Order by norm order to get best results
Extracted Sentences (ordered by typicality)
  1.0 (hypertension)          Since blood is carried … / "If a drug that blocks …
  0.9 (treatment)             How Can I Reduce High … / How Do I Manage My …
  0.8 (causes)                Blood pressure is …
  0.7 (drugs)                 "Over-the-counter" …
  0.7 (for more information)  2000 Heart and Stroke …
  0.6 (diet)                  Everybody's looking for …
Reordered Sentences (ordered by normal first appearance)
  1. (hypertension)           Since blood is carried … / "If a drug that blocks …
  1.4 (causes)                Blood pressure is …
  1.5 (treatment)             How Can I Reduce High … / How Do I Manage My …
      (drugs)                 "Over-the-counter" …
      (diet)                  Everybody's looking for …
  1.6 (for more information)  2000 Heart and Stroke …
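The reordering on this slide amounts to a stable sort of the extracted sentences by the position of their topic in the composite tree. A minimal sketch, assuming the norm order can be flattened into a list (the topic order below is read off the slide; the sentence ids are placeholders):

```python
from typing import List, Tuple

# Topic order in the composite tree, as shown in the "Reordered Sentences" column.
norm_order = ["hypertension", "causes", "treatment", "drugs", "diet",
              "for more information"]
rank = {topic: i for i, topic in enumerate(norm_order)}


def reorder(extracted: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Sort (topic, sentence) pairs by norm order; sorted() is stable,
    so sentences of the same topic keep their relative order."""
    return sorted(extracted, key=lambda pair: rank[pair[0]])


# Extracted sentences in descending typicality (placeholder sentence ids).
extracted = [
    ("hypertension", "H1"), ("hypertension", "H2"),
    ("treatment", "T1"), ("treatment", "T2"),
    ("causes", "C1"), ("drugs", "D1"),
    ("for more information", "F1"), ("diet", "E1"),
]
reordered = reorder(extracted)
```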

Part 2: Indicative Summaries

Indicative Summaries
• Indicative = helps decide whether a document is worthwhile for retrieval
  Task: Show salient differences from other candidates
  Interaction type: Searching
  Strategy: Identify content and non-content aspects in which each source is different

What goes into an Indicative Summary?
• Examine existing indicative summaries: Library card catalog
• Examine multidocument scenarios

Corpus Parameters
• 82 summaries from CU's online catalog
• Healthcare domain
• Catalogued types of information present
  - Document-derived features
  - Metadata features
Example summary: Practical Interventional Cardiology represents a practical reference for the interventional cardiologist and those in training, as well as the non-invasive cardiologist and physician. […] Rather than providing detailed and exhaustive reviews, the purpose of this book is to present practical information regarding cardiac interventional procedures. […]

Corpus Analysis Results
  Document Feature        Freq
  (Document Derived)
    Topicality            100%
    Content Types          37%
    Readability            18%
    Internal Structure     17%
    Special Content         7%
  (Metadata)
    Title                  31%
    Revised/Edition        28%
    Author/Editor          21%
    Purpose                18%
    Audience               17%
    …                       …
Example summary: Practical Interventional Cardiology represents a practical reference for the interventional cardiologist and those in training, as well as the non-invasive cardiologist and physician. […] Rather than providing detailed and exhaustive reviews, the purpose of this book is to present practical information regarding cardiac interventional procedures. […]

Analysis - Multidocument
• Prescriptive Guidelines
  - Open Directory Project – website hierarchy
Differences are important!
1. Differences between documents
2. Differences from the norm
3. Those relevant to the query (Grice '75)
Make clear what makes a site different from the rest

Corpus Analysis Discussion
• Topicality (i.e. content) is most important
• Other features have a strong role
• For Centrifuser
  - Design summary around topics
  - When space allows, add other features as needed
      - When feature differs from the norm
      - Future work: mimic the percentages in study
  - Differences drive the text
      - Query and norm should affect the summary content.

Algorithm
1. *Make Composite and Document Topic Trees
2. Align query and topics across trees
3. Use region ratios to compute document categories
4. Decide messages to realize
5. Order messages
6. Generate the text

2. (recap) Align query and topics
• Map the query to a topic
• Query node divides nodes into relevant, irrelevant and intricate regions
[Figure: two trees, one with the root as focus topic and one with a 2nd level subtopic as focus topic, for the queries "Angina" and "Treatments of Angina". Legend: focus topic, relevant, irrelevant, intricate]
Attributing the effect of the query on the generated text

Classifying Topics – By Norm
• Relevant nodes divided into typical and rare
[Figure: composite topic tree and document topic tree. Legend: focus topic, typical node (freq >= .5), rare node (freq < .5), unaligned topic]
Attributing the effect of the norm on the generated text
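The typical/rare split is a threshold on the composite-tree frequency. A small sketch using the 0.5 cut-off from the legend; the function name is hypothetical and the frequencies are taken from the composite topic tree shown on the sentence extraction slide:

```python
from typing import Dict


def classify_by_norm(topic_freq: Dict[str, float], threshold: float = 0.5) -> Dict[str, str]:
    """Label each relevant topic as typical (freq >= threshold) or rare."""
    return {
        topic: "typical" if freq >= threshold else "rare"
        for topic, freq in topic_freq.items()
    }


topic_types = classify_by_norm({
    "treatment": 0.9, "causes": 0.8, "diet": 0.6,
    "surgery": 0.3, "definition": 0.2,
})
```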

3. Categorizing Documents
• Ratio of typical, rare, intricate and irrelevant determines category
• 7 categories altogether
Examples:
  Irrelevant Document: 50+% irrelevant (e.g. 3 typical, 2 rare, 2 intricate and 8 irrelevant)
  Specialized Document: 50+% typical, < 50% of all possible typical (e.g. 5 typical, 2 rare, 2 intricate)
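The two category rules given on this slide can be written as ratio tests. The sketch below covers only those two rules and returns "other" for the remaining categories, which the slide does not define. It also assumes that "all possible typical" means the number of typical topics in the composite tree; that reading, the function name, and the `possible_typical` values in the example are assumptions.

```python
def categorize(typical: int, rare: int, intricate: int, irrelevant: int,
               possible_typical: int) -> str:
    """Apply the two category rules shown on the slide; other rules are omitted."""
    total = typical + rare + intricate + irrelevant
    if total == 0 or possible_typical <= 0:
        return "other"
    if irrelevant / total >= 0.5:
        return "irrelevant"
    covers_few = typical / possible_typical < 0.5  # small share of the norm's typical topics
    if typical / total >= 0.5 and covers_few:
        return "specialized"
    return "other"  # one of the categories not described on the slide


# Topic counts from the slide; possible_typical=12 is a hypothetical value.
first = categorize(typical=3, rare=2, intricate=2, irrelevant=8, possible_typical=12)
second = categorize(typical=5, rare=2, intricate=2, irrelevant=0, possible_typical=12)
```

In the first example 8 of 15 topics are irrelevant, and in the second 5 of 9 topics are typical while covering fewer than half of the 12 assumed typical topics.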

4. Forming Messages
Messages and the text that they eventually realize:
  Document category description
    Relation: category-description
    Args: docCat: atypical
  Documents belonging to category
    Relation: category-elements
    Args: docCat: atypical; element: AMA Guide; element: CU Guide
  Topics in category
    Relation: has-topics
    Args: docCat: atypical; topic: definition; topic: risks
Realized text: More information on additional topics which are not included in the summary are available in these files (The American Medical Association family medical guide and The Columbia University College of Physicians and Surgeon complete home medical guide). The topics include "definition" and "what are …
• Other messages may include:
  - Number of categories in summary
  - Other optional information (e.g. content type)
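The relation/argument structure of these messages maps naturally onto a small record type. A minimal sketch, in which the class name `Message` and the `values` helper are hypothetical and the relation names and arguments are the ones shown on the slide:

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Message:
    """A message: a relation name plus (role, value) arguments."""
    relation: str
    args: Tuple[Tuple[str, str], ...]  # roles such as "element" may repeat

    def values(self, role: str) -> List[str]:
        """All argument values filed under the given role, in order."""
        return [value for r, value in self.args if r == role]


messages = [
    Message("category-description", (("docCat", "atypical"),)),
    Message("category-elements", (("docCat", "atypical"),
                                  ("element", "AMA Guide"),
                                  ("element", "CU Guide"))),
    Message("has-topics", (("docCat", "atypical"),
                           ("topic", "definition"),
                           ("topic", "risks"))),
]
```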

5. Ordering Messages
• Inter-category: by importance of dominant topic type.
• Intra-category: document category and elements before optional information.
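Both ordering rules can be combined into a single two-part sort key. The sketch below assumes an importance ranking of dominant topic types (typical, then rare, intricate, irrelevant) that the slide does not spell out, and it treats any relation other than category-description and category-elements as optional. The dictionary keys and the `content-type` relation name are hypothetical.

```python
from typing import Dict, List

# Assumed importance of a category's dominant topic type (not given on the slide).
TYPE_RANK = {"typical": 0, "rare": 1, "intricate": 2, "irrelevant": 3}
# Category description first, then its elements, then anything optional.
RELATION_RANK = {"category-description": 0, "category-elements": 1}


def order_messages(messages: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Sort by category importance, then by relation within each category."""
    return sorted(
        messages,
        key=lambda m: (TYPE_RANK[m["dominant"]], RELATION_RANK.get(m["relation"], 2)),
    )


unordered = [
    {"dominant": "rare", "relation": "content-type"},
    {"dominant": "typical", "relation": "category-elements"},
    {"dominant": "rare", "relation": "category-description"},
    {"dominant": "typical", "relation": "category-description"},
]
ordered = order_messages(unordered)
```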

6. Text Generation
• Use a small grammar to realize the messages
• Referring Expression Issues
  - Size of referring expressions
  - Re-ordering documents in the set
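Realization with a small grammar can be approximated by one template per relation. This is a sketch under stated assumptions: the templates below are simplified wordings written for illustration, not Centrifuser's grammar, and the function names are hypothetical.

```python
from typing import Dict, List


def join_and(items: List[str]) -> str:
    """Join items as 'A', 'A and B' or 'A, B and C'."""
    if len(items) <= 1:
        return "".join(items)
    return ", ".join(items[:-1]) + " and " + items[-1]


# One illustrative template per relation (wording is an assumption).
TEMPLATES = {
    "category-elements": "More information is available in these files ({elements}).",
    "has-topics": "The topics include {topics}.",
}


def realize(relation: str, args: Dict[str, List[str]]) -> str:
    """Fill the template for a relation with its argument values."""
    template = TEMPLATES[relation]
    return template.format(
        elements=join_and(args.get("element", [])),
        topics=join_and(['"' + t + '"' for t in args.get("topic", [])]),
    )


text = " ".join([
    realize("category-elements", {"element": ["AMA Guide", "CU Guide"]}),
    realize("has-topics", {"topic": ["definition", "risks"]}),
])
```

The referring-expression issue from the slide shows up even here: short labels such as "AMA Guide" keep the sentence compact, while full document titles would make the parenthesised list much longer.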

Task Based Evaluation
Scenario: "You've been diagnosed with cancer …"
• Compare against 3 real-world systems
  - IR engine (Google)
  - Hub (Yahoo)
  - Human expert (about.com)
• Goals
  - Evaluate on subjective criteria, use think aloud techniques
  - See which document features best fit user need
• Pilot study complete; full study going on now

Conclusion
• An application of summarization for IR
• Performs informative and indicative summarization
• By using extraction and text generation techniques
• To support browsing and searching