CS4025: Advanced Information Extraction

Overview
CS4025, Department of Computing Science, University of Aberdeen
Overview of aspects of IE and the General Architecture for Text Engineering (GATE).
Examples.

Main Point
Context: Textual material is expressed in natural language. We understand its structure and meaning, but for a machine it is unstructured information.
Problem: How can the information be structured to support processing, that is, information extraction for queries, reasoning, and further machine processing?
Solution: Annotate the data with semantic markup using natural language processing systems. This makes the data machine readable.

Problems for Annotation
Annotate large legacy corpora.
Address growth of corpora.
Distribution of information.
Reduce number of human annotators and tedious work.
Make annotation systematic, automatic, and consistent.
Annotate fine-grained information: names, locations, addresses, organisations, actions, relations amongst terms.

An Approach
Knowledge heavy, using lists, rules, and processes.
Labour and knowledge intensive.
Transparent.
Decompose large complex problems into smaller, manageable problems for which we can create solutions.
Make implicit information explicit by adding machine readable annotations.
Software engineering approach.

Development Cycle
Source Text
Linguistic Analysis
Tool Construction
Knowledge Extraction
Evaluation

Computational Linguistic Cascade I
Sentence segmentation - divide text into sentences.
Tokenisation - words identified by spaces between them.
Lemmatisation - homogenise 'run', 'ran', 'runs' to 'run'.
Part of speech tagging - noun, verb, adjective, ...
Morphological analysis - singular/plural, tense, nominalisation, ...
Shallow syntactic parsing/chunking - noun phrase, verb phrase, subordinate clause, ...
Identification of relevant terms and constructions.
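
The first three steps of the cascade can be sketched in a few lines of Python. This is a toy illustration only, not how GATE implements them: the regular expressions are naive, and the LEMMAS table and all function names are hypothetical stand-ins for a real morphological lexicon.

```python
import re

# Hypothetical toy lemma table; a real system would use a morphological lexicon.
LEMMAS = {"ran": "run", "runs": "run", "running": "run"}


def split_sentences(text):
    """Naive sentence segmentation: split after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenise(sentence):
    """Return words and punctuation marks as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def lemmatise(token):
    """Map an inflected form to its lemma; default to the lower-cased token."""
    return LEMMAS.get(token.lower(), token.lower())


def cascade(text):
    """Run the three steps in order: one list of lemmas per sentence."""
    return [[lemmatise(t) for t in tokenise(s)] for s in split_sentences(text)]
```

Each function consumes the output of the previous one, which is the sense in which the steps form a cascade.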

Computational Linguistic Cascade II
Named entity recognition - the entities in the text.
Dependency analysis - subordinate clauses, pronominal anaphora, ...
Relationship recognition - X is president of Y; A hit B with a car and killed B.
Enrichment - add lexical semantic information to verbs or nouns.
Each step guided by pattern matching and rule application.
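
Relationship recognition by pattern matching can be illustrated with the "X is president of Y" case. The sketch below assumes, purely for illustration, that X and Y are runs of capitalised words; the NAME pattern, the relation label, and the function name are hypothetical, and a real system would match over named entity annotations rather than raw strings.

```python
import re

# Assumption: an entity is one or more capitalised words separated by spaces.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"
PRESIDENT_OF = re.compile(
    rf"(?P<person>{NAME}) is president of (?P<org>{NAME})"
)


def find_president_relations(text):
    """Return (relation, X, Y) triples for every 'X is president of Y' match."""
    return [
        ("president_of", m.group("person"), m.group("org"))
        for m in PRESIDENT_OF.finditer(text)
    ]
```

The same idea extends to other relations by adding one pattern per relation type, at the cost of writing and maintaining each pattern by hand.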

GATE
General Architecture for Text Engineering (GATE) - an open source framework which supports plug-in NLP components to process a corpus of text.
GATE Training Courses.
A GUI to work with the tools.
A Java library to develop further applications.
Components and sequences of processes, each process feeding the next in a "pipeline".
Annotated text output or other sorts of output.
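
The pipeline idea, each process feeding the next, can be shown in miniature. GATE itself is a Java framework and the Python below does not use or imitate its API; the document dictionary and the two component functions are hypothetical.

```python
def run_pipeline(document, components):
    """Apply each component in order; each one receives the previous output."""
    for component in components:
        document = component(document)
    return document


def tokeniser(doc):
    """Add a whitespace-split token list to the document."""
    return {**doc, "tokens": doc["text"].split()}


def token_counter(doc):
    """Depends on the tokeniser having run earlier in the pipeline."""
    return {**doc, "token_count": len(doc["tokens"])}
```

Running `run_pipeline({"text": "You must test the blood"}, [tokeniser, token_counter])` returns the document enriched by both components. Swapping their order raises a KeyError, which is why component order matters in a pipeline.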

Methodology I
Form the corpus of text.
Identify terminology and sort into 'classes'. Maybe use a spreadsheet for development.
Put sorted terminology into gazetteer lists (GAZ).
Create JAPE rules to 'reveal' the terminology.
Run the pipeline.
Examine results either 'in situ' or query with semantic search.
Refine/revise lists, rules, and queries.
Add further GATE processing modules as needed.

Methodology II

Basic Process Flow

Example Process Flow

Gazetteers
Gazetteers are lookup lists that add features - when a string in the text is located in a lookup list, annotate the string in the text with the feature.
Conceptual covers.
Feature: list of items...
Obligation: ought, must, obliged, obligation, ...
Exception: unless, except, but, apart from, ...
Verbs according to thematic roles: lists of verbs and their associated roles, e.g. run has an agent (Bill ran), rise has a theme (The wind rose).
Easy to change.
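
A gazetteer lookup can be sketched as follows, using the Obligation and Exception lists from the slide. This is an illustrative approximation, not GATE's gazetteer implementation: the GAZETTEER dictionary layout, the tuple format, and the function name are assumptions.

```python
import re

# Feature name -> lookup list, taken from the examples on the slide.
GAZETTEER = {
    "Obligation": ["ought", "must", "obliged", "obligation"],
    "Exception": ["unless", "except", "but", "apart from"],
}


def annotate(text, gazetteer=GAZETTEER):
    """Return (start, end, feature, string) tuples sorted by character offset."""
    annotations = []
    for feature, entries in gazetteer.items():
        for entry in entries:
            # Whole-word, case-insensitive match of the list entry.
            pattern = re.compile(r"\b" + re.escape(entry) + r"\b", re.IGNORECASE)
            for m in pattern.finditer(text):
                annotations.append((m.start(), m.end(), feature, m.group(0)))
    return sorted(annotations)
```

Because the lists are plain data, adding a term or a new feature class needs no change to the code, which is the sense in which gazetteers are easy to change.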

JAPE
JAPE rules (finite state transduction rules) create overt annotations and reuse other annotations (e.g. parser output).
Easy to change.
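
The shape of such a rule, a left-hand side that matches a sequence of existing annotations and a right-hand side that creates a new one, can be mimicked in Python. This is not JAPE syntax and not GATE code; the Annotation class, the token-index offsets, and the annotation type names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int  # index of the first token covered
    end: int    # index just past the last token covered
    type: str   # annotation type, e.g. "Obligation"


def apply_rule(annotations, lhs, new_type):
    """Create a new_type annotation wherever the types in lhs occur back to back."""
    by_start = {}
    for a in annotations:
        by_start.setdefault(a.start, []).append(a)
    created = []
    for first in annotations:
        if first.type != lhs[0]:
            continue
        end = first.end
        matched = True
        for wanted in lhs[1:]:
            # The next annotation must begin exactly where the previous one ended.
            nxt = next((a for a in by_start.get(end, []) if a.type == wanted), None)
            if nxt is None:
                matched = False
                break
            end = nxt.end
        if matched:
            created.append(Annotation(first.start, end, new_type))
    return created
```

For "You must test the blood", an Obligation annotation over "must" followed by a Verb annotation over "test" yields one new annotation spanning both tokens.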

Example 1: Psychology

Corpus

Terminology

Spreadsheet
Facilitates adding, sorting, and classifying the terminology.

GAZ

JAPE

Pipeline

Results in situ

Results with ANNIC (one)

Results with ANNIC (more)

Other Modules (POS, Sentiment, etc.)

Example 2: Rule Extraction from Regulations

Main Points
06/09/2013 Wyner, LEX 2013, Ravenna, Italy (cc) by-nc-sa license
Identify and extract rules from regulations using a rule-based, bottom-up, linguistically expressive, open-source, verifiable tool.
Fine-grained structure to identify rule structure, nouns with their thematic roles, exceptions, and lists.
Carry out an experiment on a portion of a regulation, demonstrating the feasibility of the approach and tools.
Results (start to be) useful for knowledge acquisition and engineering.
Towards computational semantics of natural language.

Use Cases
Legislation and regulations have rules that must be identified to:
create and maintain up-to-date rule books that are used in compliance management, where a company must comply with the rules (ComplianceTrack);
create logic programs that, given input of ground facts, can generate determinations, e.g. whether an individual is due a benefit from the government or owes taxes (Oracle);
exchange machine readable rules (LegalRuleML).

However...
The knowledge engineering bottleneck: It is knowledge, time, and labour intensive to identify, organise, and formalise rules which are expressed in natural language into rules that can be automatically processed.
Solution - apply Natural Language Processing tools.

Research Material
US Code of Federal Regulations, US Food and Drug Administration, Department of Health and Human Services regulation for blood banks on testing requirements for communicable disease agents in human blood, Title 21 part 610 section page document of 1,777 words.

Why Not Parse and Be Done?
Applied the Stanford Parser, which outputs:
parses - sequences of words that form a grammatical phrase;
dependencies - relationships between phrases, e.g. subject of verb;
alternative parses.
It failed to parse the whole text. It succeeds on portions, but many issues remain.

Whazza Problem?
long, complex sentences;
alternative parses;
lists with punctuation;
references with punctuation;
embedded clauses (...that...; ...to be...);
diathesis (e.g. active-passive) and thematic roles (e.g. agent): You must test the blood; the blood must be tested by you.

Proposition
Define and target a more specific task.
Work with simpler materials and build up from there.
Develop a knowledge-based system to identify and extract information.

Target Model
Deontic rules:
Conditional rules:
Start with this. In future work, add punctuation, negation, temporal phrases, generics, Hohfeldian relations, tense in conditionals, references, and so on.
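
The transcript does not preserve the rule templates shown on this slide, so the following is only a guess at how a conditional deontic rule might be held as a record once extracted. The field names echo elements named elsewhere in the talk (antecedent, exception, agent, deontic term, verb, theme), but the DeonticRule class and the example facts are hypothetical and are not taken from the regulation.

```python
from dataclasses import dataclass, field


@dataclass
class DeonticRule:
    antecedents: list  # conditions that must all hold
    agent: str         # who bears the obligation
    modal: str         # deontic term, e.g. "must"
    action: str        # main verb
    theme: str         # what the action applies to
    exceptions: list = field(default_factory=list)  # any one blocks the rule

    def is_triggered(self, facts):
        """True if every antecedent is among the facts and no exception is."""
        facts = set(facts)
        return all(a in facts for a in self.antecedents) and not any(
            e in facts for e in self.exceptions
        )


# Illustrative rule only; the conditions are invented for the example.
rule = DeonticRule(
    antecedents=["blood is donated"],
    agent="you",
    modal="must",
    action="test",
    theme="the blood",
    exceptions=["donation is exempt"],
)
```

A record of this kind is what a logic program or rule book could consume downstream; items deferred to future work on the slide, such as negation and temporal phrases, are deliberately left out.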

Methodology - Materials
Given the problems of working with the source material directly, the data is systematically decomposed into less complex forms:
Source - original materials, unparseable.
Source Sections - sections of source material, parseable, but complex and inaccurate.
Source Derived - edited to remove confounding issues such as long conjunctive sentences, embeddings, and references. Created a Gold Standard in which we annotated the 'correct' elements and parses.
Testing Data - simplified materials focusing on elements of the model.

Source Derived

Gold Standard
The Gold Standard encodes knowledge.

Testing Data
A. B. C. D.

Methodology - Modules and Materials
Develop modules for Testing Data.
Apply modules to Source Derived.
Test modules on Gold Standard.
Identify problems and refine modules.
Apply modules to Source Derived (SD) and Gold Standard (GS).
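
The slides do not say how modules are scored against the Gold Standard. One common way, assumed here, is to compare the set of automatically produced annotations with the gold annotations and report precision, recall, and F1. The tuple format and the function name are hypothetical.

```python
def evaluate(predicted, gold):
    """Return (precision, recall, f1) for exact-match annotation tuples."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Low precision points to rules or lists that over-generate, and low recall to missing list entries or patterns, which gives the "identify problems and refine modules" step something concrete to act on.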

General Architecture for Text Engineering
NLP pipeline

General Architecture for Text Engineering
Gazetteers are lookup lists that add features - when a string in the text is located in a lookup list, annotate the string in the text with the feature.
Conceptual covers.
Feature: list of items...
Obligation: ought, must, obliged, obligation, ...
Exception: unless, except, but, apart from, ...
Verbs according to thematic roles: lists of verbs and their associated roles, e.g. run has an agent (Bill ran), rise has a theme (The wind rose).
Easy to change.

General Architecture for Text Engineering
JAPE rules (finite state transduction rules) create overt annotations and reuse other annotations (e.g. parser output).
Easy to change.

General Architecture for Text Engineering
Have Gazetteer lists and JAPE rules for:
lists in various forms;
exception phrases in various forms;
conditionals in various forms;
deontic terms;
associating grammatical roles (e.g. subject and object) with thematic roles (agent and theme) in various forms.
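
The last item, associating grammatical roles with thematic roles, is what makes "You must test the blood" and "the blood must be tested by you" come out the same. A minimal sketch, assuming the clause has already been analysed into voice, subject, object, and by-phrase (all hypothetical field names), and assuming an agentive transitive verb such as "test":

```python
def thematic_roles(clause):
    """Map grammatical roles to agent and theme according to voice."""
    if clause["voice"] == "active":
        # Active: the subject acts, the object undergoes the action.
        return {"agent": clause["subject"], "theme": clause.get("object")}
    if clause["voice"] == "passive":
        # Passive: the subject undergoes the action, the by-phrase names the actor.
        return {"agent": clause.get("by_phrase"), "theme": clause["subject"]}
    raise ValueError("unknown voice: " + repr(clause["voice"]))


active = {"voice": "active", "subject": "you", "object": "the blood"}
passive = {"voice": "passive", "subject": "the blood", "by_phrase": "you"}
```

Both clauses map to the same agent and theme, so later processing can ignore the active-passive alternation. Verbs whose subject is a theme, like "rise", need the per-verb role lists held in the gazetteers.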

Sample Outputs
Consequence, list structure, and conjuncts of the antecedent.
Exception, agent NP, deontic concept, active main verb, theme.

Sample Output
Theme, deontic modal, passive verb, agent with complex relative clause.

Sample Output - Overall

Sample Output - XML

Sample Output - ANNIC Search