Robust Semantic Processing for Information Extraction
Ann Copestake, Computer Laboratory, University of Cambridge


Outline
- Information Extraction
- Combining deep and shallow processing
- RMRS
  - MRS
  - basic ideas of RMRS
  - RASP-RMRS
- RMRS and IE in Deep Thought
- SciBorg project

Acknowledgements
- Deep Thought (EU funded)
  - Computer Lab: Ann Copestake, Anna Ritchie, Ben Waldron
  - Sussex, Saarland, DFKI, Xtramind, CELI, NTNU
- SciBorg (EPSRC)
  - Computer Lab: Ann Copestake, Simone Teufel, CJ Rupp, Advaith Siddharthan
  - Chemistry: Peter Murray-Rust, Peter Corbett
  - CeSC: Mark Hayes, Andy Parker
- DELPH-IN (informal ongoing collaboration)
- Boeing funding to Computer Lab: Ben Waldron
- Especially Dan Flickinger, Alex Lascarides, Stephan Oepen, John Carroll, Anette Frank

Information extraction
- Classic IE: MUC-style template filling, gene/protein interactions
- IE in general: acquiring specific types of knowledge from text via language processing, e.g.:
  - organic chemistry syntheses
  - ontological relationships
  - relationships between texts (for search)
- Related: IR, QA, I2E

IE from Chemistry texts
- Recipe expressed in CML (Chemical Markup Language): "To a solution of aldimine 1 (1.5 mmol) in THF (5 mL) was added LDA (1 mL, 1.6 M in THF) at 0 °C under argon, the resulting mixture was stirred for 2 h, then was cooled to -78 °C"
- "alkaloids and other complex polycyclic azacycles..."
- "Enamines have been used widely... (citation Y), however,... did not provide the desired products." -> X cites Y (contrast)

Standard IE architecture
1. Preprocessing of markup etc. (specific to text type)
2. Tokenisation (not domain-specific)
3. Named Entity Recognition (domain-specific ontologies, domain-specific patterns)
4. Chunking: detection of noun and verb groups (not domain-specific)
5. Anaphora resolution (domain-specific ontologies)
6. Relationship detection via patterns over chunks (domain- and task-specific)
7. DB instantiation (task-specific)
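
A toy sketch of stages 6 and 7 (the chunk notation, tag names and pattern below are invented for illustration, not from the talk): relation detection in this architecture amounts to hand-written patterns matched over the chunked, NER-tagged stream, and a match instantiates a database record.

    import re

    # Hypothetical chunk/NER-tagged input: [PROT ...] marks a protein NE,
    # [VG ...] a verb group found by the chunker.
    tagged = "[PROT GeneX] [VG activates] [PROT GeneY]"

    # Stage 6: a hand-crafted, domain- and task-specific pattern.
    pattern = re.compile(
        r"\[PROT (?P<agent>[^\]]+)\] \[VG activates\] \[PROT (?P<theme>[^\]]+)\]")

    m = pattern.search(tagged)
    if m:
        # Stage 7: DB instantiation from the match.
        record = {"relation": "activate",
                  "agent": m.group("agent"),
                  "theme": m.group("theme")}
        print(record)  # {'relation': 'activate', 'agent': 'GeneX', 'theme': 'GeneY'}

The brittleness of such hand-written patterns under new text types is exactly the porting cost noted on the next slide.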

State of the art in IE
- Several options for whole IE systems and individual components, especially for English
- Increasing integration of ontologies
- Commercial systems for some applications
- But many IE-style tasks are still done manually, because of:
  - IE performance (especially when high precision is required)
  - IE robustness to different text types
  - IE porting requirements (especially NER and relation patterns)
- Performance of the standard architecture may be reaching a plateau
- More advanced IE tasks are not generally attempted: e.g., the organic synthesis example could be done with an adaptation of the standard architecture, but would take substantial effort by highly trained people (skill set: substantial domain skills plus substantial NLP)

Objectives
- Integrate and adapt tools for language processing in general
- Eventual use by non-NLP people: black box for language processing
- Incorporate deeper processing (DELPH-IN technology): aim to get above the plateau
- Integration with XML and the semantic web
Methodology:
- Combine statistical and symbolic processing, machine learning and hand-crafting
- Open Source where possible, collaborative development
- No toy systems, no artificial evaluations
- Multilingual via collaboration

Deep processing in IE
- Some early IE systems attempted to use deep processing: SRI (and also NYU)
  - FASTUS was originally a shallow preprocessor for TACITUS, but TACITUS was dropped: much too slow, not sufficiently robust
- Often claimed: deep processing failed for IE. But:
  - only two serious attempts(?), both under time pressure, on limited types of IE task
  - deep processing has improved since the early 1990s: speed; empirical coverage (note that hand-built deep grammars do scale, unlike traditional AI knowledge bases); integration of statistical techniques into deep processing
- If the existing IE architecture is approaching a plateau, we have to try something else – i.e., combined deep and shallow processing (DFKI Whiteboard project)

Integrating processing
- No single system can do everything: deep and shallow processing have inherent strengths and weaknesses
  - shallow: speed and robustness, e.g., POS tagging, chunking
  - deep: detail, precision, potential for bidirectional processing, e.g., HPSG-based parsers and generators (DELPH-IN technology)
  - also intermediate: RASP (Robust Accurate Statistical Parser): relatively detailed, but no lexicon
- Domain-dependent and domain-independent processing must be linked
- Desirable to have a common representation language for processing above the sentence level (e.g., anaphora)
- Long-term solutions...

Compositional semantics for component integration
- Need a common representation language for systems: pairwise compatibility between systems is too limiting
- Syntax is theory-specific and unnecessarily language-specific
- The eventual goal of sentence analysis should be semantics
- Core idea: shallow processing gives an underspecified semantic representation, so deep and shallow systems can be integrated
- A full interlingua / common lexical semantics is too difficult (certainly currently), but predicates can be linked to ontologies, etc.

Integration via underspecified semantics
- Integrated parsing:
  - shallow parsed phrases incorporated into deep parsed structures
  - deep parsing invoked incrementally in response to information needs
- Knowledge sources expressed via semantics can be used by multiple components: e.g., NER, IE templates, anaphora resolution
- Advantages over ad hoc representation approaches:
  - ability to link with detailed lexical semantics as it becomes available
  - language generation from the semantic representation
  - explicit logic: formal properties are clearer, representations are more generally usable
  - deep semantics taken as normative: extensibility

Robust Minimal Recursion Semantics
- Minimal Recursion Semantics (MRS): compositional semantics for deep processing (Copestake, Flickinger, Sag and Pollard, 1999, in press)
  - adopted for DELPH-IN and other HPSG work; also compatible with LFG etc.
  - logically well-defined
  - flat semantics (easier to process, allows information to be ignored)
  - underspecification of quantifier scope (avoids ambiguity)
  - novel approach to composition (monostratal)
- Robust MRS (RMRS): an adaptation of MRS allowing processing without a subcategorization lexicon

RMRS: Extreme underspecification
- Goal is to split up the semantic representation into minimal components (cf. Verbmobil VITs):
  - scope underspecification (MRS)
  - splitting up predicate-argument structure
  - explicit equalities
  - hierarchies for predicates and sorts
- Compatibility with deep grammars:
  - sorts and (some) closed-class word information in the SEM-I (API for the grammar; more later)
  - no lexicon for shallow processing (apart from POS tags and possibly closed-class words)

Semantics from POS tagging
every_AT1 cat_NN1 chase_VVD some_AT1 dog_NN1
=> _every_q(x1), _cat_n(x2:sg), _chase_v(e:past), _some_q(x3), _dog_n(x4:sg)
Tag lexicon:
AT1: _lemma_q(x)
NN1: _lemma_n(x:sg)
VVD: _lemma_v(e:past)
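
A minimal sketch of how such a tag lexicon might be applied (hypothetical code; variables here are numbered per token, a slight departure from the slide):

    # Map (lemma, CLAWS tag) pairs to underspecified RMRS predications
    # using the tag lexicon above. ':sg' / ':past' mark sorts on variables.
    def tag_lexicon_rmrs(tagged_tokens):
        preds = []
        for i, (lemma, tag) in enumerate(tagged_tokens, start=1):
            if tag == "AT1":                      # singular determiner
                preds.append(f"_{lemma}_q(x{i})")
            elif tag == "NN1":                    # singular common noun
                preds.append(f"_{lemma}_n(x{i}:sg)")
            elif tag == "VVD":                    # past-tense verb
                preds.append(f"_{lemma}_v(e{i}:past)")
        return preds

    print(tag_lexicon_rmrs([("every", "AT1"), ("cat", "NN1"), ("chase", "VVD"),
                            ("some", "AT1"), ("dog", "NN1")]))
    # ['_every_q(x1)', '_cat_n(x2:sg)', '_chase_v(e3:past)',
    #  '_some_q(x4)', '_dog_n(x5:sg)']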

Deep parser output
- Conventional semantic representation, "Every cat chased some dog":
  every(x:sg, cat(x:sg), some(y:sg, dog1(y:sg), chase(e:sp, x:sg, y:sg)))
  some(y:sg, dog1(y:sg), every(x:sg, cat(x:sg), chase(e:sp, x:sg, y:sg)))
- Compositional: reflects morphology and syntax
- Scope ambiguity is explicit
- May be awkward to process if you don't care about quantifier scope

Modifying syntax of deep grammar semantics: overview
1. Underspecification of quantifier scope: Minimal Recursion Semantics (MRS) – next 6 slides...
2. Robust MRS:
   - separating arguments
   - explicit equalities
   - conventions for predicate names and sense distinctions
   - hierarchy of sorts on variables

PC trees
Every cat chased some dog
[Tree diagrams of the two scoped readings: in one, every(x, cat(x), ...) dominates some(y, dog1(y), ...) with chase(e,x,y) innermost; in the other, some takes wide scope over every.]

PC trees share structure
[The same two scoping trees drawn so that the subtrees they share – cat(x), dog1(y) and chase(e,x,y) – appear once, with both quantifier fragments pointing into them.]

Bits of trees
[Diagram: the trees broken into fragments – every(x, cat(x), _), some(y, dog1(y), _) and chase(e,x,y) – with the scope positions left open.]
Reconstruction conditions: tree-ness, variable binding

Label nodes and holes
[Diagram: each fragment is given a label (lb1:every(x), lb2:cat(x), lb4:some(y), lb5:dog1(y), lb3:chase) and each missing subtree a hole (h6, h7).]
h0 – hole corresponding to the top of the tree
Valid solutions: equate holes and labels

Maximize splitting
[Diagram: the fragments split as far as possible, with additional holes h8 and h9 for the quantifiers' restrictions.]
Constraints: h8=lb5, h9=lb2

MRS: flat representation
- Elementary predications:
  lb1:every(x,h9,h6), lb2:cat(x), lb5:dog1(y), lb4:some(y,h8,h7), lb3:chase(e,x,y)
- Scope constraints: h9=lb2, h8=lb5 (actually qeqs)
- Easy to ignore quantification when not relevant for the application:
  cat(x), dog1(y), chase(e,x,y)
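
The practical payoff of flatness is that the predications are just a bag that can be filtered. A sketch (the data structures are assumed for illustration, not from the talk):

    # Elementary predications as (label, predicate, args) tuples.
    eps = [("lb1", "every", ("x", "h9", "h6")),
           ("lb2", "cat",   ("x",)),
           ("lb5", "dog1",  ("y",)),
           ("lb4", "some",  ("y", "h8", "h7")),
           ("lb3", "chase", ("e", "x", "y"))]
    qeqs = [("h9", "lb2"), ("h8", "lb5")]   # scope constraints, also ignorable

    # An application that does not care about scope just drops the quantifiers:
    content = [ep for ep in eps if ep[1] not in ("every", "some")]
    print(content)   # cat(x), dog1(y), chase(e,x,y)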

RMRS: Separating arguments
lb1:every(x,h9,h6), lb2:cat(x), lb5:dog1(y), lb4:some(y,h8,h7), lb3:chase(e,x,y), h9=lb2, h8=lb5
goes to:
lb1:every(x), RSTR(lb1,h9), BODY(lb1,h6),
lb2:cat(x), lb5:dog1(y),
lb4:some(y), RSTR(lb4,h8), BODY(lb4,h7),
lb3:chase(e), ARG1(lb3,x), ARG2(lb3,y),
h9=lb2, h8=lb5
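
The factoring step is mechanical: an n-ary elementary predication becomes a unary predication plus one binary relation per remaining argument. A sketch (the argument-name table is illustrative):

    # Quantifiers take RSTR/BODY; verbs take ARG1, ARG2, ...
    ARG_NAMES = {"q": ["RSTR", "BODY"], "v": ["ARG1", "ARG2", "ARG3"]}

    def split_ep(label, pred, args, kind):
        first, rest = args[0], args[1:]
        out = [f"{label}:{pred}({first})"]        # unary predication
        for name, arg in zip(ARG_NAMES[kind], rest):
            out.append(f"{name}({label},{arg})")  # one binary ARG relation each
        return out

    print(split_ep("lb3", "chase", ["e", "x", "y"], "v"))
    # ['lb3:chase(e)', 'ARG1(lb3,x)', 'ARG2(lb3,y)']
    print(split_ep("lb1", "every", ["x", "h9", "h6"], "q"))
    # ['lb1:every(x)', 'RSTR(lb1,h9)', 'BODY(lb1,h6)']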

Naming conventions: predicate names without a lexicon
lb1:_every_q(x1:sg), RSTR(lb1,h9), BODY(lb1,h6),
lb2:_cat_n(x2:sg),
lb5:_dog_n_1(x4:sg),
lb4:_some_q(x3:sg), RSTR(lb4,h8), BODY(lb4,h7),
lb3:_chase_v(e:sp), ARG1(lb3,x2:sg), ARG2(lb3,x4:sg),
h9=lb2, h8=lb5, x1:sg=x2:sg, x3:sg=x4:sg
Note also the explicit equalities.

POS output as underspecification
DEEP:
lb1:_every_q(x1:sg), RSTR(lb1,h9), BODY(lb1,h6), lb2:_cat_n(x2:sg), lb5:_dog_n_1(x4:sg), lb4:_some_q(x3:sg), RSTR(lb4,h8), BODY(lb4,h7), lb3:_chase_v(e:sp), ARG1(lb3,x2:sg), ARG2(lb3,x4:sg), h9=lb2, h8=lb5, x1:sg=x2:sg, x3:sg=x4:sg
POS:
lb1:_every_q(x1), lb2:_cat_n(x2:sg), lb3:_chase_v(e:past), lb4:_some_q(x3), lb5:_dog_n(x4:sg)
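
One way to read "underspecification" operationally: every POS-level predication should be subsumed by some deep predication, where deep names may add sense suffixes. A sketch over predicate names only (a full check would also consult the sort hierarchy, e.g. past vs. sp):

    def specialises(deep_pred, shallow_pred):
        # _dog_n_1 specialises _dog_n: deep names extend shallow ones
        return deep_pred == shallow_pred or deep_pred.startswith(shallow_pred + "_")

    deep    = ["_every_q", "_cat_n", "_dog_n_1", "_some_q", "_chase_v"]
    shallow = ["_every_q", "_cat_n", "_chase_v", "_some_q", "_dog_n"]

    # True: every shallow predication is compatible with some deep one.
    print(all(any(specialises(d, s) for d in deep) for s in shallow))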

RMRS principles
- Split up information content as much as possible
- Accumulate information monotonically, by simple operations
- Don't represent what you don't know, but preserve everything you do know
- Use a flat representation to allow pieces to be accessed individually
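
The explicit equalities (x1=x2 etc.) are what make monotonic accumulation easy to implement, e.g. with union-find: identifications are only ever added, never retracted. A minimal sketch (not DELPH-IN code):

    parent = {}

    def find(v):                            # representative of v's equivalence class
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    def equate(a, b):                       # record a = b, monotonically
        parent[find(a)] = find(b)

    equate("x1", "x2")                      # from x1:sg = x2:sg
    equate("x3", "x4")
    print(find("x1") == find("x2"))         # True: one entity, two variable names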

Semantics from RASP
- RASP: robust, domain-independent, statistical parsing (Briscoe and Carroll)
- Can't produce conventional semantics because there is no subcategorization
- Can often identify arguments: in S -> NP VP, the NP supplies ARG1 for the V
- Potential for partial identification (e.g., VP -> V NP, S -> NP S): the NP might be ARG2 or ARG3

RMRS construction
- Deep grammars: MRS -> RMRS converter
- POS-RMRS: tag lexicon
- RASP-RMRS: tag lexicon plus semantic rules associated with RASP rules
  - no lexical subcategorization, so rely on grammar rules to provide the ARGs
  - output aims to match the deep grammar (ERG); developed on the basis of the ERG semantic test suite
  - default composition principles when no rule RMRS is specified
- Composition algebra:
  - MRS composition assumes a lexicalized approach: algebra defined in Copestake, Lascarides and Flickinger (2001)
  - RMRS with non-lexicalised grammars has a similar basic algebra
  - all approaches share common composition principles, so there is compatibility at a phrasal level

Some cat sleeps (in RASP)
sleeps: [h3,e], {h3:_sleep(e)}
some cat: [h,x], {h1:_some(x), RSTR(h1,h2), h2:_cat(x)}
S -> NP VP: Head = VP, ARG1(VP's label, NP's index)
some cat sleeps: [h3,e], {h3:_sleep(e), ARG1(h3,x), h1:_some(x), RSTR(h1,h2), h2:_cat(x)}
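
A sketch of that composition step in code (the (hook, predications) sign shape is an assumption based on this slide, not the full RMRS algebra):

    # Each sign is ((label, index), predications).
    def s_np_vp(np, vp):
        (np_lbl, np_idx), np_preds = np
        (vp_lbl, vp_idx), vp_preds = vp
        # Head = VP; the rule contributes ARG1(VP's label, NP's index).
        arg = f"ARG1({vp_lbl},{np_idx})"
        return (vp_lbl, vp_idx), vp_preds + [arg] + np_preds

    some_cat = (("h2", "x"), ["h1:_some(x)", "RSTR(h1,h2)", "h2:_cat(x)"])
    sleeps   = (("h3", "e"), ["h3:_sleep(e)"])
    print(s_np_vp(some_cat, sleeps))
    # (('h3', 'e'), ['h3:_sleep(e)', 'ARG1(h3,x)',
    #                'h1:_some(x)', 'RSTR(h1,h2)', 'h2:_cat(x)'])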

ERG-RMRS / RASP-RMRS

Inchoative

Infinitival subject (unbound in RASP-RMRS)

Mismatch: Expletive it

SEM-I: semantic interface
- Meta-level: manually specified `grammar' relations (constructions and closed-class)
- Object-level: linked to the lexical database for deep grammars
- Object-level SEM-I auto-generated from expanded lexical entries in deep grammars (because a type can contribute relations)
- Validation of other lexicons
- Closed-class items are needed for RMRS construction from shallow processing

Alignment and XML
- Comparing RMRSs for the same text efficiently requires `characterization': labelling RMRSs according to their source position in the text
  - currently characters, but also XPath plus characters
- RMRS-XML: RMRS seen as levels of mark-up, i.e., standoff annotation
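
A sketch of why characterization makes comparison cheap (the data is invented for illustration): predications from each processor are keyed by the character span they came from, so alignment is set intersection on spans rather than structural matching:

    text = "Every cat chased some dog"

    # (start, end) character spans -> predications, one dict per processor.
    deep    = {(0, 5): "_every_q(x1)", (6, 9): "_cat_n(x2:sg)"}
    shallow = {(0, 5): "_every_q(x)",  (6, 9): "_cat_n(x:sg)"}

    # Align by shared span, not by parsing the predications themselves:
    for span in sorted(deep.keys() & shallow.keys()):
        print(span, text[span[0]:span[1]], "|", deep[span], "<->", shallow[span])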

RMRS approach: current and planned applications
- Question answering:
  - Cambridge CSTIT: deep parse questions, shallow parse answers
  - QA from structured knowledge: Frank et al. (QUETAL project)
- Information extraction:
  - e-mails (Deep Thought)
  - Chemistry texts (SciBorg)
- Dictionary definition parsing for Japanese and English (Bond and Flickinger)
- Rhetorical structure, multi-document summarization...
- Also LOGON: semantic transfer; MRSs from LFG used in an HPSG generator

RMRS in Deep Thought
- Different systems integrated via the HoG (Heart of Gold): invoke shallow or deep parsing, full or partial results, all expressed in RMRS
- Also shallow parsing as a precursor to deep parsing: NER, unknown words
- Preliminary test on an e-mail response application (Xtramind Mailminder):
  - e-mails categorized, then category-specific templates built from RMRS
  - increase in precision of automatically instantiated templates (up to 29%) with the addition of the deep parser to the system

IE architecture using deeper processing and RMRS
1. Preprocessing of markup etc.
2. Tokenisation
3. Named Entity Recognition: delivers RMRS
4. Shallow processing (including chunking): delivers RMRS
5. Deep parsing: uses shallow processing and NER, delivers RMRS
6. Word sense disambiguation: uses RMRS from the best available source, further instantiates the RMRS according to the ontology
7. Anaphora resolution: uses RMRS from the best available source, further instantiates the RMRS
8. Relationship detection via patterns over the deepest possible RMRSs
9. DB instantiation

SciBorg: Chemistry texts
- eScience project, started in October, at the Cambridge Computer Laboratory, Chemistry, and CeSC
- Partners: Nature Publishing, Royal Society of Chemistry, International Union of Crystallography (supplying papers and publishing expertise)
- Aims:
  1. Develop an NL markup language which will act as a platform for extraction of information; link to semantic web languages.
  2. Develop IE technology and core ontologies for use by publishers, researchers, readers, vendors and regulatory organisations.
  3. Model scientific argumentation and citation purpose in order to support novel modes of information access.
  4. Demonstrate the applicability of this infrastructure in a real-world eScience environment.

Outline architecture
[Architecture diagram: RSC, Nature and IUCr papers, plus Biology and CL papers (PDF), are converted to base XML; sentence splitting, POS tagging, NER, RASP and ERG/PET analyses are merged as RMRS; WSD, anaphora resolution and rhetorical analysis feed the end tasks; everything is stored as standoff annotation.]

Research markup
- Chemistry: "The primary aims of the present study are (i) the synthesis of an amino acid derivative that can be incorporated into proteins via standard solid-phase synthesis methods, and (ii) a test of the ability of the derivative to function as a photoswitch in a biological environment."
- Computational Linguistics: "The goal of the work reported here is to develop a method that can automatically refine the Hidden Markov Models to produce a more accurate language model."

RMRS and research markup
- Specify cues in RMRS: e.g., l1:objective(x), ARG1(l1,y), l2:research(y)
  - the concept objective generalises the predicates for aim, goal etc., and research generalises study, work etc. (ontology for rhetorical structure)
- Deep process possible cue phrases to get RMRSs: feasible because domain-independent
  - more general and reliable than shallow techniques
  - allows for complex interrelationships, e.g., "our goal is not to... but to..."
- Use zones for advanced citation maps (e.g., X cites Y (contrast)) and other enhancements to repositories
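
A sketch of cue matching against a sentence RMRS (the hierarchy contents and the RMRS fragment are illustrative assumptions): because the cue is stated over concepts, one pattern covers "aims of the study", "goal of the work", and so on:

    # Concept -> predicates it generalises (toy fragment of the ontology).
    GENERALISES = {"objective": {"_aim_n", "_goal_n", "_objective_n"},
                   "research":  {"_study_n", "_work_n", "_research_n"}}

    def matches(concept, pred):
        return pred in GENERALISES[concept]

    # RMRS fragment for "The goal of the work reported here is ...":
    preds = {"l1": "_goal_n", "l2": "_work_n"}
    args  = [("ARG1", "l1", "l2")]        # the goal's ARG1 is the work

    hit = any(rel == "ARG1" and matches("objective", preds[a])
              and matches("research", preds[b])
              for (rel, a, b) in args)
    print(hit)   # True: research-objective cue found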

Conclusions
- Information Extraction is more than company mergers or gene-protein interactions!
- Combined deep-shallow processing techniques have potential for IE
- RMRS is a representation language that allows for deep-shallow compatibility via extreme underspecification
  - various systems have been adapted to output RMRS, and further work is ongoing
  - RMRS offers detailed compatibility at a phrasal level
  - RMRS processing can be integrated with ontologies in various ways
  - RMRS tools are distributed as Open Source via DELPH-IN
- SciBorg will further develop this approach for eScience applications, using a generic standoff architecture

Further work on RASP-RMRS
- Fast enough (time not significant compared to RASP processing time, because there is no ambiguity)
- Too many RASP rules! Need to generalise over classes.
- Requires the SEM-I: i.e., an API for MRS/RMRS from the deep grammar
- RASP and the ERG may change:
  - compatible test suites – semi-automatic rule update?
  - alternative technique for composition?
- Parse selection – need to generalise over RMRSs: weighted intersections of RMRSs (cf. RASP grammatical relations)