Information Extraction and Automatic Summarisation *

How IE fits in with IR
• IR selects a few relevant documents from many.
• IE starts with one or a few relevant documents.
• IE pulls out the words and phrases most central to the meaning of that document (or those documents) to produce an extract.

Two processes associated with information extraction
• Determination of facts to go into structured fields in a database.
• Extraction of text that can be used to summarise an item.
• In the first case, only a subset of the important facts in an item may be identified and extracted. The term "slot" is used to define a particular category of information to be extracted. Slots are organised into semantic frames.

What do we most want to know from a journal article about agriculture?
• AGENT: chemical agent applied
• CV: cultivar (e.g. King Edward)
• HLP: high level property (e.g. yield)
• INF: influence (e.g. drought)
• LAB: site of test (e.g. laboratory)
• LLP: low level property (e.g. root mass)
• LOC: location
• PEST: pest or disease
• SOIL: soil
• SPEC: crop species (e.g. potato)
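The slot list above can be pictured as one semantic frame with a field per slot. The following is a minimal sketch, not the original system's data structure: the class name `AgricultureFrame` and the helper `filled_slots` are hypothetical, and every slot is assumed to hold at most one string.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class AgricultureFrame:
    """One semantic frame; each attribute is a slot from the slide.

    The class name and the one-string-per-slot design are assumptions.
    """
    AGENT: Optional[str] = None  # chemical agent applied
    CV: Optional[str] = None     # cultivar
    HLP: Optional[str] = None    # high level property
    INF: Optional[str] = None    # influence
    LAB: Optional[str] = None    # site of test
    LLP: Optional[str] = None    # low level property
    LOC: Optional[str] = None    # location
    PEST: Optional[str] = None   # pest or disease
    SOIL: Optional[str] = None   # soil
    SPEC: Optional[str] = None   # crop species

    def filled_slots(self) -> dict:
        """Return only the slots for which a value was extracted."""
        return {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if getattr(self, f.name) is not None
        }


# Only a subset of slots is usually filled for any one article.
frame = AgricultureFrame(CV="King Edward", HLP="yield", INF="drought", SPEC="potato")
print(frame.filled_slots())
```

Leaving unfilled slots as `None` reflects the point that only a subset of the important facts may be identified in a given item.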

Automatic Abstracting
• In the second case, rather than trying to determine specific facts, the goal of document summarisation is to extract a summary of an item, maintaining the most important ideas while significantly reducing its size.
• For journal articles, this is called automatic abstracting.
• The abstract is a way for the user to determine the utility of an article without having to read the whole item.

Kupiec’s heuristics
• Sentence length feature that requires the sentence to be over five words in length.
• Fixed phrase feature that looks for the existence of “phrase” cues, e.g. “in conclusion…”.
• Paragraph feature that places emphasis on the first ten and the last five paragraphs in an item and also the location of the sentences within the paragraph.
• Thematic word feature that uses word frequency.
• Uppercase word feature that places emphasis on proper names and acronyms.
• Kupiec et al. discovered that location-based heuristics give better results than the frequency-based features.
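The five features above can be sketched as yes/no tests on a single sentence. This is an illustration under stated assumptions, not the published method: the function names, the cue phrase list, the stopword list and the `top_n` cut-off are all hypothetical, the paragraph feature ignores the sentence's position inside its paragraph, and no attempt is made to combine the features into a trained classifier.

```python
import re
from collections import Counter

# Assumed examples; the slide only gives "in conclusion" as a cue.
CUE_PHRASES = ("in conclusion", "in summary")
STOPWORDS = {"a", "an", "the", "of", "in", "on", "is", "to", "and", "it"}


def words(sentence):
    """Split a sentence into alphabetic tokens."""
    return re.findall(r"[A-Za-z]+", sentence)


def thematic_words(sentences, top_n=5):
    """Most frequent non-stopwords in the document (top_n is an assumption)."""
    counts = Counter(
        w.lower() for s in sentences for w in words(s) if w.lower() not in STOPWORDS
    )
    return {w for w, _ in counts.most_common(top_n)}


def sentence_features(sentence, paragraph_index, n_paragraphs, thematic):
    """Return the five heuristic features as booleans for one sentence."""
    tokens = words(sentence)
    lower = sentence.lower()
    return {
        # Sentence must be over five words long.
        "length": len(tokens) > 5,
        # Sentence contains a fixed cue phrase.
        "fixed_phrase": any(p in lower for p in CUE_PHRASES),
        # Sentence sits in the first ten or last five paragraphs.
        "paragraph": paragraph_index < 10 or paragraph_index >= n_paragraphs - 5,
        # Sentence contains a frequent (thematic) word.
        "thematic": any(t.lower() in thematic for t in tokens),
        # Capitalised word other than the sentence-initial one.
        "uppercase": any(t[0].isupper() for t in tokens[1:]),
    }


example = sentence_features(
    "In conclusion, the IBM system worked well overall.", 0, 20, {"system"}
)
print(example)
```

Skipping the first token in the uppercase test is a design choice to avoid counting the ordinary sentence-initial capital as a proper name.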

Paice’s rules
• Frequency keyword approach: First find a set of index terms for the document (manually, mid-frequency, tf * idf, words occurring in the title, etc.). Then choose the sentences which contain the most keywords.
• Location: The first sentence in a paragraph is most central to the theme of a text. The last sentence is the next most central.
• Cue method: Cue words are not actually keywords, but their presence shows that the sentence is (or is not) important. These may be bonus words, e.g. greatest, significant, or stigma words, e.g. hardly, impossible.
• Indicator phrases, e.g. “The main aim of our paper is to describe …”, “Our investigation has shown that …”.
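As a rough illustration, the keyword, location and cue rules can be combined into a single sentence score. The sketch below is an assumption-laden toy: the function name `score_sentence` and the numeric weights (one point per keyword or bonus word, minus one per stigma word, two for a paragraph-initial sentence, one for a paragraph-final sentence) are invented for the example and do not come from the slides. Indicator phrases are left out.

```python
import re

# Bonus and stigma words taken from the examples on the slide.
BONUS_WORDS = {"greatest", "significant"}
STIGMA_WORDS = {"hardly", "impossible"}


def score_sentence(sentence, keywords, position, paragraph_len):
    """Score one sentence; all weights here are illustrative assumptions."""
    tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", sentence)]
    score = sum(1 for t in tokens if t in keywords)       # keyword rule
    score += sum(1 for t in tokens if t in BONUS_WORDS)   # cue rule: bonus
    score -= sum(1 for t in tokens if t in STIGMA_WORDS)  # cue rule: stigma
    if position == 0:                     # first sentence: most central
        score += 2
    elif position == paragraph_len - 1:   # last sentence: next most central
        score += 1
    return score


paragraph = [
    "Drought has a significant effect on potato yield.",
    "It is hardly surprising.",
    "Yield fell in every plot.",
]
keywords = {"drought", "potato", "yield"}
scores = [
    score_sentence(s, keywords, i, len(paragraph)) for i, s in enumerate(paragraph)
]
print(scores)  # [6, -1, 2]
```

An extract would then be built from the highest-scoring sentences, here the first one.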

Hoey method: cohesion in text
• The most important sentences in a document are those which are related to the largest number of other sentences. Find how many concepts in each sentence are related to concepts in other sentences. Concepts may be related by:
• Exact match, e.g. computer and computer;
• Grammatical variants, e.g. computer, computing;
• Synonyms, e.g. sedate, tranquilise, drug;
• Antonyms, e.g. cold, hot;
• General-specific, e.g. scientists, biologists.

Hoey (2)
• Form a repetition net, with entries in the form s (a, b), such as 26 (6, 4), meaning sentence no. 26 is bonded to 6 earlier sentences and 4 later sentences.
• If a + b is high, the sentence is central to the topic;
• If only b is high, the sentence is a topic opener;
• If only a is high, the sentence is topic closing.
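A repetition net of this shape can be sketched in a few lines. The version below is a simplification under stated assumptions: concepts are linked by exact match only (no grammatical variants, synonyms, antonyms or general-specific relations), two sentences count as bonded when they share at least `min_links` concepts, and both that threshold and the stopword list are hypothetical choices for the example.

```python
import re

STOPWORDS = {"a", "an", "the", "of", "in", "on", "is", "to", "and"}


def concepts(sentence):
    """Lower-cased content words of a sentence (exact-match concepts only)."""
    return {
        w.lower()
        for w in re.findall(r"[A-Za-z]+", sentence)
        if w.lower() not in STOPWORDS
    }


def repetition_net(sentences, min_links=2):
    """Map sentence number s (1-based) to (a, b).

    a = number of earlier sentences bonded to s,
    b = number of later sentences bonded to s.
    """
    sets = [concepts(s) for s in sentences]
    net = {}
    for i, current in enumerate(sets):
        a = sum(1 for j in range(i) if len(current & sets[j]) >= min_links)
        b = sum(
            1 for j in range(i + 1, len(sets)) if len(current & sets[j]) >= min_links
        )
        net[i + 1] = (a, b)
    return net


text = [
    "Potato yield depends on soil.",
    "Drought reduces potato yield.",
    "Sandy soil makes drought worse.",
    "Potato yield in sandy soil fell.",
]
net = repetition_net(text)
print(net)  # {1: (0, 2), 2: (1, 1), 3: (0, 1), 4: (3, 0)}
```

In this toy text, sentence 1 has only later bonds and so behaves like a topic opener, while sentence 4 has only earlier bonds and behaves like a topic-closing sentence.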

Hoey (3)
• Cohesion in text is concerned with explicit references within a sentence which can only be understood by reference to material elsewhere in the text.
• Anaphora come after their explicit mention in the text, e.g. Marie Curie was born in Warsaw. She devoted her life to the study of radioactivity.
• Cataphora come before their explicit mention in the text, e.g. He was to become the best known physicist of his generation. His name was Albert Einstein.

Generating Canned Text
• This paper studies the effect of AGENT on the HLP of SPEC
• OR
• This paper studies the effect of INF on the HLP of SPEC
• when it is infested by PEST.
• An experiment was undertaken
• using cultivars CV
• [in, at] LOC
• where the soil was SOIL.
• The HLP [is, are] measured by analysing the LLP.
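Filling the template above from extracted slot values might look like the following sketch. The function name `canned_summary` and the use of a plain dictionary are assumptions; the code also assumes that HLP, SPEC and at least one of AGENT or INF are present, and it simplifies the bracketed choices by always using "at" and "is".

```python
def canned_summary(frame):
    """Build canned text from a dict of slot values.

    Assumes HLP, SPEC and one of AGENT or INF are filled; optional
    clauses are included only when their slot has a value.
    """
    cause = frame.get("AGENT") or frame.get("INF")
    first = (
        f"This paper studies the effect of {cause} "
        f"on the {frame['HLP']} of {frame['SPEC']}"
    )
    if frame.get("PEST"):
        first += f" when it is infested by {frame['PEST']}"
    sentences = [first + "."]

    opening = "An experiment was undertaken"
    second = opening
    if frame.get("CV"):
        second += f" using cultivars {frame['CV']}"
    if frame.get("LOC"):
        second += f" at {frame['LOC']}"  # simplification of "[in, at]"
    if frame.get("SOIL"):
        second += f" where the soil was {frame['SOIL']}"
    if second != opening:  # skip the sentence if no detail was extracted
        sentences.append(second + ".")

    if frame.get("LLP"):
        # Simplification of "[is, are]": always singular.
        sentences.append(
            f"The {frame['HLP']} is measured by analysing the {frame['LLP']}."
        )
    return " ".join(sentences)


slots = {
    "INF": "drought",
    "HLP": "yield",
    "SPEC": "potato",
    "CV": "King Edward",
    "LLP": "root mass",
}
print(canned_summary(slots))
```

Because the wording comes from the template and not from the article, the result is closer to an abstract than to an extract.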

Extracts vs. Abstracts (Mani, p. 6)
• An extract is a summary consisting entirely of material copied from the input.
• An abstract is a summary at least some of whose material is not present in the input.