IE (Wilks)-1 Information Extraction: Beyond Document Retrieval Robert Gaizauskas and Yorick Wilks Computational Linguistics and Chinese Language Processing.

Presentation transcript:

IE (Wilks)-1 Information Extraction: Beyond Document Retrieval
Robert Gaizauskas and Yorick Wilks
Computational Linguistics and Chinese Language Processing, vol. 3, no. 2, 1998, pp
Journal of Documentation, vol. 54, no. 1, 1998, pp

IE (Wilks)-2 IE and IR
IE
– extracting pre-specified sorts of information from short, natural language texts
– example: from business newswire texts on retirements, appointments, promotions, …, extract the names of the participating companies and individuals, the post involved, the vacancy reason, and so on

IE (Wilks)-3 IE and IR (Continued)
– populating a structured information source (or database) from an unstructured, or free text, information source
– the structured database is then used
  for searching or analysis using conventional database queries or data-mining techniques
  for generating a summary
  for constructing indices into the source texts
  ...

IE (Wilks)-4 IE and IR (Continued)
IR
– given a user query, selects a relevant subset of documents from a larger set
– the user then browses the selected documents in order to fulfil his or her information need
Differences
– IR retrieves relevant documents from collections
– IE extracts relevant information from documents

IE (Wilks)-5 Combining IR and IE
(a) an IR query
chief executive officer had president chairman post succeed name
(b) a retrieved text
Who’s Burns Fry Ltd. 04/13/94 WALL STREET JOURNAL (J), PAGE B10
BURNS FRY Ltd. (Toronto) -- Donald Wright, 46 years old, was named executive vice president and director of fixed income at this brokerage firm. Mr. Wright resigned as president of Merrill Lynch Canada Inc., a unit of Merrill Lynch & Co., to succeed Mark Kassirer, 48, who left Burns Fry last month. A Merrill Lynch spokeswoman said it has named a successor to Mr. Wright, who is expected to begin his new position by the end of the month.

IE (Wilks)-6
(c) an empty template
:=
  DOC_NR:
  CONTENT:
:=
  SUCCESSION_ORG:
  POST:
  IN_AND_OUT:
  VACANCY_REASON:
:=
  IO_PERSON:
  NEW_STATUS:
  ON_THE_JOB:
  OTHER_ORG:
  REL_OTHER_ORG:
:=
  ORG_NAME:
  ORG_ALIAS:
  ORG_DESCRIPTOR:

IE (Wilks)-7
  ORG_TYPE:
  ORG_LOCALE:
  ORG_COUNTRY:
:=
  PER_NAME:
  PER_ALIAS:
  PER_TITLE:
(d) a fragment of the filled template
:=
  DOC_NR: “ ”
  CONTENT:
:=
  SUCCESSION_ORG:
  POST: “executive vice president”
  IN_AND_OUT:
  VACANCY_REASON: OTH_UNK

IE (Wilks)-8
:=
  IO_PERSON:
  NEW_STATUS: IN
  ON_THE_JOB: NO
  OTHER_ORG:
  REL_OTHER_ORG: OUTSIDE_ORG
:=
  ORG_NAME: “Burns Fry Ltd.”
  ORG_ALIAS: “Burns Fry”
  ORG_DESCRIPTOR: “this brokerage firm”
  ORG_TYPE: COMPANY
  ORG_LOCALE: Toronto CITY
  ORG_COUNTRY: Canada
:=
  ORG_NAME: “Merrill Lynch”
  ORG_ALIAS: “Merrill Lynch”
  ORG_DESCRIPTOR: “a unit of Merrill Lynch & Co.”
  ORG_TYPE: COMPANY

IE (Wilks)-9
:=
  PER_NAME: “Donald Wright”
  PER_ALIAS: “Wright”
  PER_TITLE: “Mr.”
:=
  PER_NAME: “Mark Kassirer”
(e) a summary generated from the filled template
BURNS FRY Ltd. named Donald Wright as executive vice president. Donald Wright resigned as president of Merrill Lynch Canada Inc. Mark Kassirer left as president of BURNS FRY Ltd.

IE (Wilks)-10 History of Information Extraction
Early work on template filling
– work carried out or under way before the DARPA programme
Work carried out in response to the DARPA MUC programme
Recent work on IE outside the DARPA programme

IE (Wilks)-11 Early Work on Template Filling
The Linguistic String Project at New York University
– derive information formats (regularised table-like forms) from the profusion of natural language forms
– permit “fact retrieval” (as opposed to document retrieval) on such a database

IE (Wilks)-12 Early Work on Template Filling (Continued)
– the information formats are not predefined a priori by experts in the field
– the information formats are induced by using distributional analysis to discover word classes in a set of texts of a sub-language

IE (Wilks)-13 Early Work on Template Filling (Continued)
Language understanding research at Yale University by Roger Schank
– stories follow certain stereotypical patterns called scripts
– knowing the script, language comprehenders are able to fill in details and make inferential leaps where the information required to make the leap is not present in the text
– first attempt using this approach: FRUMP (Gerald De Jong)

IE (Wilks)-14 Message Understanding Conferences (Continued)
MUC-1 (May 1987, San Diego)
– six systems participated
– tactical naval operations reports on ship sightings and engagements
– 12 training reports, 2 unseen messages
MUC-2 (May 1989, San Diego)
– eight systems participated
– the same domain as MUC-1
– 105 training messages, 20 blind messages (1st run), 5 blind messages (2nd run)
– a template and fill rules for the slots

IE (Wilks)-15 Message Understanding Conferences (Continued)
MUC-3 (May 1991, San Diego)
– fifteen systems participated
– newswire stories about terrorist attacks in nine Latin American countries
– 1,300 development texts, three blind test sets of 100 texts
– a template consisting of 18 slots
– formal evaluation criteria (precision & recall)
– semi-automated scoring program available

IE (Wilks)-16 Message Understanding Conferences (Continued)
MUC-4 (June 1992, McLean, Virginia)
– seventeen sites participated
– domain and template structures unchanged
– changes to the task definitions, corpus, measures of performance, and test protocols

IE (Wilks)-17 Message Understanding Conferences (Continued)
MUC-5 (August 1993, Baltimore, Maryland)
– 17 systems participated (14 American, 1 British, 1 Canadian, 1 Japanese)
– financial newswire stories and microelectronics product announcements
– English and Japanese
– development and test corpora increased
– new evaluation metrics and scoring programs

IE (Wilks)-18 Message Understanding Conferences (Continued)
MUC-6 (Nov 1995, Columbia, Maryland)
– 17 sites took part
– named entity recognition, coreference identification, template element and scenario template extraction tasks
– management succession events in financial news stories

IE (Wilks)-19 Task complexity measures
– text corpus complexity (vocabulary size, average sentence length)
– text corpus dimensions (volume of texts, total number of sentences/words)
– template characteristics (number of object types, number of slots)
– difficulty of tasks (hard to measure, but gauged by the number of pages of relevance rules and template fill definitions)

IE (Wilks)-20 Evaluation Metrics
Recall
– a measure of the fraction of the required information that has been correctly extracted
Precision
– a measure of the fraction of the extracted information that is correct
Beyond Precision and Recall
– each fill is scored as correct, partially correct, incorrect, missing, spurious, or non-committal
– overgeneration: the fraction of the extracted information that is spurious
– undergeneration: the fraction of the information that should have been extracted but is missing
– substitution: the fraction of the non-spurious extracted information that is not correct
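These measures can be computed from per-category tallies of slot fills. The sketch below is a minimal illustration, not the official scorer: the `Tally` class and `scores` function are hypothetical names, half credit for partially correct fills is an assumption based on the usual MUC-style scoring convention, and the non-committal category is left out.

```python
from dataclasses import dataclass


@dataclass
class Tally:
    correct: int
    partial: int
    incorrect: int
    missing: int
    spurious: int

    @property
    def possible(self) -> int:
        # Fills the answer key required.
        return self.correct + self.partial + self.incorrect + self.missing

    @property
    def actual(self) -> int:
        # Fills the system actually produced.
        return self.correct + self.partial + self.incorrect + self.spurious


def _ratio(numerator: float, denominator: float) -> float:
    return numerator / denominator if denominator else 0.0


def scores(t: Tally) -> dict:
    credit = t.correct + 0.5 * t.partial  # assumption: partial fills earn half credit
    non_spurious = t.correct + t.partial + t.incorrect
    return {
        "recall": _ratio(credit, t.possible),
        "precision": _ratio(credit, t.actual),
        "overgeneration": _ratio(t.spurious, t.actual),
        "undergeneration": _ratio(t.missing, t.possible),
        "substitution": _ratio(t.incorrect + 0.5 * t.partial, non_spurious),
    }


# Made-up counts, purely to exercise the formulas.
example = scores(Tally(correct=6, partial=2, incorrect=1, missing=1, spurious=3))
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts, 10 fills were possible and 12 were produced, so recall is 7/10 and precision is 7/12.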

IE (Wilks)-21 MUC-5
Tasks
– two domains: joint ventures and microelectronics
– two languages: Japanese and English
– acronyms: EJV, JJV, EME, JME
Resources
– EJV materials: Wall Street Journal, Lexis/Nexis, Prompt
– gazetteer of place names, list of corporate names and nationalities, list of corporate designators, list of countries, list of nationalities, list of international organizations, definitions of standard industry codes, list of currency names/nationalities, list of female forenames, list of male forenames, CIA world fact book

IE (Wilks)-22 MUC-6
Tasks
– named entity recognition
  recognition and classification of definite named entities such as organizations, persons, locations, dates and monetary amounts
  example: Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan

IE (Wilks)-23 MUC-6 (Continued)
– coreference resolution
  identification of expressions in the text that refer to the same object, set or activity
  example: Galactic Enterprises said it would build a new space station before the year 2016
– template element filling
– scenario template filling

IE (Wilks)-24 The Generic IE System
text zoner
– divide the input text into a set of segments
preprocessor
– convert a text segment into a sequence of sentences, where each sentence is a sequence of lexical items, with associated lexical attributes (e.g., part-of-speech)
filter
– eliminate some of the sentences from the previous stage by filtering out irrelevant ones
preparser
– detect reliable small-scale structures in sequences of lexical items (e.g., noun groups, verb groups, etc.)

IE (Wilks)-25 The Generic IE System
fragment combiner
– turn a set of parse tree or logical form fragments into a parse tree or logical form for the whole sentence
semantic interpreter
– generate a semantic structure, meaning representation or logical form from a parse tree or parse tree fragments
lexical disambiguation
– disambiguate any ambiguous predicates in the logical form
coreference resolution or discourse processing
– build a connected representation of the text by linking different descriptions of the same entity in different parts of the text
template generator
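The generic architecture is a chain of stages, each consuming the output of the one before. The sketch below illustrates that shape for the first three stages only (text zoner, preprocessor, filter). The function names, the blank-line zoning rule, the naive sentence splitter and the `TRIGGERS` word list are all assumptions made for the example; later stages would plug into `STAGES` in the same way.

```python
import re
from functools import reduce


def text_zoner(text):
    """Divide the input text into segments (here: blank-line separated paragraphs)."""
    return [seg.strip() for seg in re.split(r"\n\s*\n", text) if seg.strip()]


def preprocessor(segments):
    """Convert each segment into sentences, each a list of lexical items."""
    sentences = []
    for seg in segments:
        # Naive split after ., ! or ?; abbreviations such as "Mr." would break it.
        for sent in re.split(r"(?<=[.!?])\s+", seg):
            tokens = re.findall(r"\w+|[^\w\s]", sent)
            if tokens:
                sentences.append(tokens)
    return sentences


# Hypothetical relevance cues for a management-succession scenario.
TRIGGERS = {"named", "resigned", "succeed", "left"}


def sentence_filter(sentences):
    """Drop sentences that contain none of the trigger words."""
    return [s for s in sentences if TRIGGERS & {tok.lower() for tok in s}]


STAGES = [text_zoner, preprocessor, sentence_filter]


def run_pipeline(text, stages=STAGES):
    """Feed the text through each stage in order."""
    return reduce(lambda data, stage: stage(data), stages, text)


sample = (
    "Donald Wright was named executive vice president.\n\n"
    "The firm is based in Toronto. Mark Kassirer left last month."
)
relevant = run_pipeline(sample)
for sentence in relevant:
    print(" ".join(sentence))
```

Keeping each stage as a plain function from one representation to the next makes it straightforward to swap in a better component without touching the rest of the chain.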

IE (Wilks)-26 LaSIE: A Case Study
Lexical Processing
– Tokenisation
  text segmentation: distinguish the document header and segment the text into paragraphs
  tokenisation: identify which sequences of characters will be treated as individual tokens
– Sentence splitting
  determine sentence boundaries in the text
  full stops are not sufficient guides, e.g., Allan J. Smith, Mr.
– Part-of-speech tagging
  process one sentence at a time, and associate with each token one of the 48 part-of-speech tags of the University of Pennsylvania tagset
– Morphological analysis
  determine root forms of nouns and verbs
– Gazetteer lookup
  employ 5 gazetteers (lists of names) to facilitate the process of recognizing and classifying named entities
  organization names, location names, personal given names, company designators, and personal titles
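Two of these steps are easy to illustrate: sentence splitting that does not trust every full stop, and gazetteer lookup. The sketch below is not LaSIE's implementation. The abbreviation list and gazetteer entries are tiny hypothetical samples, lookup is restricted to single tokens (so multi-word organization names are not covered), and an abbreviation that happens to end a sentence would be missed.

```python
import re

# Hypothetical abbreviation list; a real system would use a much larger one.
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "Ltd.", "Inc.", "Co."}


def split_sentences(text):
    """Split on sentence-final punctuation, skipping abbreviations and initials."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends = token.endswith((".", "!", "?"))
        is_initial = re.fullmatch(r"[A-Z]\.", token) is not None
        if ends and token not in ABBREVIATIONS and not is_initial:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


# Hypothetical single-token gazetteers, keyed by the class they assign.
GAZETTEERS = {
    "company_designator": {"Ltd.", "Inc.", "Co."},
    "personal_title": {"Mr.", "Mrs.", "Dr."},
    "location": {"Toronto", "Canada"},
    "given_name": {"Donald", "Mark", "Allan"},
}


def gazetteer_lookup(tokens):
    """Pair each token with the first gazetteer class that lists it, else None."""
    tagged = []
    for tok in tokens:
        bare = tok.rstrip(",;")
        label = next((name for name, entries in GAZETTEERS.items() if bare in entries), None)
        tagged.append((tok, label))
    return tagged


text = (
    "Mr. Wright resigned as president of Merrill Lynch Canada Inc., "
    "a unit of Merrill Lynch & Co., to succeed Mark Kassirer. "
    "Allan J. Smith left last month."
)
sentences = split_sentences(text)
tags = gazetteer_lookup(sentences[0].split())
print(sentences)
print([pair for pair in tags if pair[1]])
```

The gazetteer labels are only evidence for the later named entity grammar, which decides how adjacent tokens combine into full names.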

IE (Wilks)-27 LaSIE: Parsing
Parsing with a special named entity grammar
– recognize multi-word structures which identify organizations, persons, locations, dates, and monetary amounts
– ORGAN_NP --> ORGAN_NP LOC_NP CDG
  Merrill Lynch Canada Inc.
– PERSON_NP --> FIRST_NAME NNP
  Donald Wright
– organization(e17), name(e17, “Burns Fry Ltd.”)
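The two rules above can be applied with a very small bottom-up reducer over pre-tagged items. This is a sketch of the idea only, not LaSIE's parser: the `(category, text)` input representation and the function names are assumptions, and the input categories are taken as already assigned by tagging and gazetteer lookup.

```python
# The two named entity rules from the slide, as (left-hand side, right-hand side).
RULES = [
    ("ORGAN_NP", ("ORGAN_NP", "LOC_NP", "CDG")),
    ("PERSON_NP", ("FIRST_NAME", "NNP")),
]


def reduce_once(items):
    """Apply the first rule whose right-hand side matches a run of adjacent items."""
    for lhs, rhs in RULES:
        width = len(rhs)
        for i in range(len(items) - width + 1):
            window = items[i:i + width]
            if tuple(cat for cat, _ in window) == rhs:
                text = " ".join(t for _, t in window)
                return items[:i] + [(lhs, text)] + items[i + width:], True
    return items, False


def parse_named_entities(items):
    """Reduce repeatedly until no rule applies; each step shortens the list."""
    changed = True
    while changed:
        items, changed = reduce_once(items)
    return items


org = parse_named_entities([("ORGAN_NP", "Merrill Lynch"), ("LOC_NP", "Canada"), ("CDG", "Inc.")])
person = parse_named_entities([("FIRST_NAME", "Donald"), ("NNP", "Wright")])
print(org)
print(person)
```

Every rule replaces at least two items with one, so the loop always terminates.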

IE (Wilks)-28 LaSIE: Parsing (Continued)
Parsing with a more general phrasal grammar
– recognize noun phrases, verb phrases, prepositional phrases, adjective phrases, sentences, and relative clauses
– [NP Donald Wright], [ADJP 46 years old], [VP [VP was named] [NP executive vice president and director of fixed income]] [PP at this brokerage firm]
– person(e21), name(e21, “Donald Wright”)
  name(e22), lobj2(e22,e23)
  title(e23, “executive vice president”)
  firm(e24), det(e24, this)

IE (Wilks)-29 LaSIE: Parsing (Continued)
Select a “best parse” from the set of partial, fragmentary, and possibly overlapping phrasal analyses
– choose the sequence of non-overlapping phrases of semantically interpretable categories (sentence, noun phrase, verb phrase and prepositional phrase) which covers the most words and consists of the fewest phrases
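One way to realise this selection criterion is dynamic programming over word positions, maximising words covered and breaking ties by the number of phrases. The sketch below is an assumption about how it could be done, not a description of LaSIE's actual algorithm. Phrases are hypothetical `(start, end, category)` tuples with `end` exclusive, and the category labels `S`, `NP`, `VP`, `PP` stand in for the interpretable categories named on the slide.

```python
INTERPRETABLE = {"S", "NP", "VP", "PP"}


def best_parse(phrases, n_words):
    """Pick non-overlapping phrases covering the most words, then the fewest phrases."""
    by_end = {}
    for start, end, cat in phrases:
        if cat in INTERPRETABLE:
            by_end.setdefault(end, []).append((start, end, cat))

    # best[i] = (words covered, -phrase count, chosen phrases) for words[0:i]
    best = [(0, 0, [])]
    for i in range(1, n_words + 1):
        candidate = best[i - 1]  # option: leave word i-1 uncovered
        for start, end, cat in by_end.get(i, []):
            covered, neg_count, chosen = best[start]
            option = (covered + end - start, neg_count - 1, chosen + [(start, end, cat)])
            # Compare lexicographically: more coverage first, then fewer phrases.
            if option[:2] > candidate[:2]:
                candidate = option
        best.append(candidate)
    return best[n_words][2]


words = "Donald Wright was named executive vice president".split()
candidates = [
    (0, 2, "NP"),    # Donald Wright
    (2, 4, "VP"),    # was named
    (4, 7, "NP"),    # executive vice president
    (2, 7, "VP"),    # was named executive vice president
    (4, 6, "ADJP"),  # ignored: not an interpretable category
]
selected = best_parse(candidates, len(words))
print(selected)
```

Both NP + VP + NP and NP + VP cover all seven words, so the tie is broken in favour of the two-phrase analysis.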

IE (Wilks)-30 LaSIE: Discourse Processing

IE (Wilks)-31

IE (Wilks)-32 Application Areas of Information Extraction
Finance
– categorize newswire stories of relevance to stock traders
Military Intelligence
Medicine
– help classification of patient records and discharge summaries to assist in public health research and in medical treatment auditing
Law
– support intelligent retrieval from legal texts
Police
– extract information about road traffic incidents from police incident logs
Technology/product tracking
– track commodity price changes and factors affecting changes in the relevant newsfeeds

IE (Wilks)-33 Application Areas of Information Extraction (Continued)
Fault Diagnosis
– extract information from reports of car faults
Software system requirements specification
– NLP techniques used to assist in the process of deriving formal software specifications from less formal, natural language specifications
– the formal specification is viewed as a template which needs to be filled from a natural language specification, supplemented with a dialogue with the user
Academic research
– academic journals and publications are increasingly becoming available on-line and offer a prime source of material for IE technology

IE (Wilks)-34 Challenges for the future
Higher precision and recall
User-defined IE
– permit users to define the extraction task, with the system then adapting to the new scenario
Integration with other technologies
– information retrieval
– natural language generation
– machine translation
– data mining