Annotation Types for UIMA Edward Loper



UIMA
Unstructured Information Management Architecture
Analytics framework
–Consists of components that perform specific tasks (tagging, parsing, etc.)
–Each component declares its own interface (input/output, requirements, workflow metadata, etc.)
–All information is communicated using a single standard data format: CAS
–Built-in support for network distribution, clustering, etc.

CAS
Common Analysis Structure
Tends to fall on the “weakly-merged” side of the spectrum (does not require annotations to be modified to ensure consistency).
Annotations are encoded using typed feature structures.
But the type definitions are left unspecified. Cf. XML.
Components can only work together if they use the same type system.

Standard CAS Types
Goal: design standard CAS types for ULA annotations.
–In particular, we’re currently looking at Treebank, Propbank, & Timebank.
Issues:
–Redundancy of information
–Coupling between annotations
–Discontinuous constituents

CAS Types: background
UIMA does provide a couple of top-level types (e.g. Annotation).
These make it clear that UIMA intends:
–Standoff annotations… defined using spans… with character-based offsets
Cf. AGTK
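To make the standoff idea concrete, here is a minimal Python sketch. It is not UIMA's API; the class name, field names, and example text are assumptions chosen for illustration. The annotation holds only a type label and two character offsets, and the text is recovered by slicing the document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """A standoff annotation: a label attached to a character span."""
    type_name: str  # hypothetical type label, e.g. "NP"
    begin: int      # offset of the first character (inclusive)
    end: int        # offset just past the last character (exclusive)

    def covered_text(self, document: str) -> str:
        # The annotation stores no text of its own; it points into the document.
        return document[self.begin:self.end]


document = "Asbestos was used"
subject = Annotation("NP", 0, 8)
print(subject.covered_text(document))  # prints "Asbestos"
```

Because the document text is never modified, several independent annotation layers can point at the same characters.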

Treebank
Typical representation for treebank:
Questions:
–Should children be explicitly marked?
–Should parents be explicitly marked?
These questions have consequences…

Treebank: Explicit children?
How could we not mark children? They can be mostly reconstructed, if we assume…
–All constituents are properly nested
–Unary branch direction can be determined based on node type.
    Not quite true: SBAR/FRAG; S/NP; NP/FRAG; NP/PRN.
Theoretical consequences of (not) marking children:
–Have to assume proper nesting of constituents
–Alternatively, allow for multiple coexisting bracketings (à la chart parse), which is probably not what we want.
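The reconstruction described on this slide can be sketched as follows, under stated assumptions: constituents are given as hypothetical (label, begin, end) triples with non-empty spans, and constituents with identical spans (unary branches) are already listed parent-first in the input. That ordering is exactly the part the slide says cannot always be decided from node types.

```python
def reconstruct_children(spans):
    """Rebuild child lists from properly nested (label, begin, end) spans.

    Returns a dict mapping each span index to the list of its child indices.
    """
    # Visit wider spans first when several start at the same offset, so that
    # parents come before children. Identical spans keep their input order
    # (sorted() is stable), which is an assumption of this sketch.
    order = sorted(range(len(spans)), key=lambda i: (spans[i][1], -spans[i][2]))
    children = {i: [] for i in range(len(spans))}
    stack = []  # indices of the constituents enclosing the current position
    for i in order:
        _, begin, end = spans[i]
        # Drop constituents that ended before this one starts.
        while stack and spans[stack[-1]][2] <= begin:
            stack.pop()
        if stack:
            if end > spans[stack[-1]][2]:
                raise ValueError("constituents are not properly nested")
            children[stack[-1]].append(i)
        stack.append(i)
    return children


spans = [("S", 0, 3), ("NP", 0, 1), ("VP", 1, 3), ("V", 1, 2), ("NP", 2, 3)]
print(reconstruct_children(spans))
```

Crossing brackets raise an error here, which mirrors the slide's point that leaving children unmarked forces the proper-nesting assumption.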

Treebank: Explicit parents?
Parent pointers are redundant: they can be reconstructed.
But they can be very handy to have when working with structures.
Theoretical consequences of marking parents:
–Every constituent has exactly one parent.
–Rules out multi-parented trees. (Fine.)
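A small sketch of why parent pointers are redundant once children are marked. The dictionary layout (node id mapped to a list of child ids) is an assumption made for this example. The check that rejects a second parent corresponds to the "exactly one parent" consequence.

```python
def parents_from_children(children):
    """Derive a child -> parent map from a parent -> children map."""
    parent = {}
    for node, kids in children.items():
        for kid in kids:
            if kid in parent:
                # A second parent would make this a multi-parented structure.
                raise ValueError(f"node {kid!r} has more than one parent")
            parent[kid] = node
    return parent


children = {"S": ["NP", "VP"], "VP": ["V", "NP2"], "NP": [], "V": [], "NP2": []}
print(parents_from_children(children))
```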

Propbank
Propbank’s current annotation…
–Is strongly coupled to treebank
    Argument locations are specified using “tree pointers”
–Includes trace chain information

Propbank: Tree Pointers
Each propbank argument is specified using a tree pointer w:h
–The h-th constituent above the w-th word.
Problems with this strong coupling:
–Propbank can’t be used without trees.
–New propbanking can’t be done unless parsing has been done.
–Changes to trees are annoying to propagate to propbank.
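A rough sketch of resolving a w:h pointer, showing why it cannot be interpreted without the tree. The nested-tuple tree encoding and both function names are assumptions. The sketch also assumes words are counted from 0 and that height 0 is the part-of-speech node directly above the word; the real PropBank conventions (for example, how traces are counted) may differ.

```python
def path_to_word(tree, w):
    """Return the nodes from the root down to the POS node of word number w."""
    counter = 0

    def walk(node, path):
        nonlocal counter
        _, *rest = node
        path = path + [node]
        if len(rest) == 1 and isinstance(rest[0], str):
            # Preterminal node of the form (POS, word).
            if counter == w:
                return path
            counter += 1
            return None
        for child in rest:
            found = walk(child, path)
            if found is not None:
                return found
        return None

    result = walk(tree, [])
    if result is None:
        raise IndexError(f"no word number {w}")
    return result


def resolve_tree_pointer(tree, w, h):
    """Return the constituent h levels above word w (h = 0 is the POS node)."""
    path = path_to_word(tree, w)
    if h >= len(path):
        raise IndexError(f"height {h} is above the root")
    return path[-1 - h]


tree = ("S",
        ("NP", ("NN", "Asbestos")),
        ("VP", ("VBD", "was"), ("VP", ("VBN", "used"))))
print(resolve_tree_pointer(tree, 0, 1))  # the NP above "Asbestos"
```

If the tree is re-bracketed, the same w:h pair can silently point at a different constituent, which is the propagation problem listed on the slide.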

Propbank: spans
Can we get away with using spans instead (UIMA’s preferred approach)?
Do we lose any information?
–Potentially yes, for binary branching nodes.
–In practice:
    99.92% of non-trace args select the low constituent.
    97.9% of trace args select the high constituent.
    The differences appear to just be errors.
… so no (important) lost info!
About 50-55% of split arguments go away.
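The "low" versus "high" choice arises when stacked constituents cover exactly the same words: a tree pointer can tell them apart, but a span cannot. The sketch below lists such spans for a toy tree. The nested-tuple encoding, the use of word offsets rather than character offsets, and the function names are all assumptions made for illustration.

```python
from collections import defaultdict


def constituent_spans(tree):
    """Return (label, first_word, end_word_exclusive) for every node, children first."""
    spans = []

    def walk(node, start):
        label, *rest = node
        if len(rest) == 1 and isinstance(rest[0], str):
            end = start + 1          # a POS node covers exactly one word
        else:
            end = start
            for child in rest:
                end = walk(child, end)
        spans.append((label, start, end))
        return end

    walk(tree, 0)
    return spans


def ambiguous_spans(tree):
    """Map each span shared by several constituents to their labels, lowest first."""
    by_span = defaultdict(list)
    for label, start, end in constituent_spans(tree):
        by_span[(start, end)].append(label)
    return {span: labels for span, labels in by_span.items() if len(labels) > 1}


tree = ("S",
        ("NP", ("NN", "Asbestos")),
        ("VP", ("VBD", "was"), ("VP", ("VBN", "used"))))
print(ambiguous_spans(tree))
```

For spans like these, a span-based argument has to adopt a convention such as always meaning the lowest constituent, which matches what the percentages on the slide suggest annotators mostly intended for non-trace arguments.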

Propbank: trace chains
For arguments that have undergone movement, propbank explicitly marks the trace chain.
–But isn’t this something the tree should give us anyway?
–Treebank & propbank have somewhat different notions of what gets included in trace chains.
1/3 of the Propbank annotation guidelines talk about null elements.

Propbank: trace chains
How much can we recover?
–Using very simple heuristics (e.g., link “NP-2” with “*t*-2”), ~60%
–Using more advanced heuristics, maybe 80%.
–Not close enough to 100% to throw them away.
–Some differences are harder to automate: e.g., propbank (usually) only marks traces that interact with the predicate in some way.
    “Asbestos_i was used t_i … and replaced t_i …”
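One possible form of the "very simple heuristic" is to pair every indexed node label with the null elements that carry the same index. The sketch below does this with regular expressions over a bracketed parse string. The patterns, the function name, and the example sentence are assumptions, not the method used to obtain the figures on the slide, and gapping indices written with "=" are ignored.

```python
import re

# An indexed label such as "(NP-SBJ-1 " or "(NP-2 ".
LABEL_INDEX = re.compile(r"\(([A-Z]+(?:-[A-Z]+)*)-(\d+)\s")
# An indexed null element such as "*-1" or "*T*-2".
TRACE_INDEX = re.compile(r"(\*(?:[A-Z]+\*)?)-(\d+)")


def link_traces(bracketed):
    """Return (antecedent_label, trace, index) triples for co-indexed pairs."""
    antecedents = {}
    for match in LABEL_INDEX.finditer(bracketed):
        antecedents[match.group(2)] = match.group(1)
    links = []
    for match in TRACE_INDEX.finditer(bracketed):
        index = match.group(2)
        if index in antecedents:
            links.append((antecedents[index], match.group(1), int(index)))
    return links


sentence = ("(S (NP-SBJ-1 (NN Asbestos)) "
            "(VP (VBD was) (VP (VBN used) (NP (-NONE- *-1)))))")
print(link_traces(sentence))
```

A heuristic like this follows only the tree's own indices, so it cannot reproduce propbank's choice to mark just those traces that interact with the predicate.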

Propbank: trace chains (questions for discussion)
Should marking trace chains be part of the propbanking task?
–Or should we leave it up to the treebankers?
If it should be part of propbanking, should it be split off as a separate subtask?
–Would that help annotation speed any?
Should the annotation be split off as a separate layer?

Discontinuous constituents
Propbank has provisions for discontinuous constituents: w_1:h_1,w_2:h_2
Discontinuous constituents can appear almost anywhere
–Temporal expressions
–Named entities
–Parse constituents (?)
Want: a uniform way to handle them.

Discontinuous constituents
Goals:
–Make the common case easy
–Make the uncommon case possible
Preferred approach:
–Add an optional property (e.g. “pieces”) that can be used to specify discontinuous chunks.
–If used, then the start/end properties should be treated with appropriate care
Open question:
–Should this property be defined on the top-level type, or on individual types (e.g. PropBankArgument)?
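A minimal sketch of the optional “pieces” idea, assuming character offsets and a hypothetical `SpanAnnotation` class (not a UIMA type). In the common case `pieces` is left unset and the annotation behaves like an ordinary span. When it is set, begin/end give the overall extent and the pieces give the actual chunks.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SpanAnnotation:
    """Hypothetical annotation type with an optional "pieces" property."""
    type_name: str
    begin: int
    end: int
    pieces: Optional[List[Tuple[int, int]]] = None  # None in the common case

    def chunks(self) -> List[Tuple[int, int]]:
        # A continuous annotation is treated as a single chunk.
        if self.pieces is None:
            return [(self.begin, self.end)]
        return list(self.pieces)

    def is_consistent(self) -> bool:
        # There must be at least one piece; pieces must be non-empty, ordered,
        # non-overlapping, and must start and finish at begin and end.
        chunks = self.chunks()
        if not chunks:
            return False
        if chunks[0][0] != self.begin or chunks[-1][1] != self.end:
            return False
        if any(b >= e for b, e in chunks):
            return False
        return all(prev[1] <= nxt[0] for prev, nxt in zip(chunks, chunks[1:]))

    def covered_text(self, document: str) -> str:
        return " ".join(document[b:e] for b, e in self.chunks())


document = "turn the light off"
verb = SpanAnnotation("PhrasalVerb", 0, 18, pieces=[(0, 4), (15, 18)])
print(verb.covered_text(document))  # prints "turn off"
print(verb.is_consistent())         # prints True
```

Code that reads only begin and end would see "turn the light off" here, which is one reason the start/end properties need care when pieces are present.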

A note on consistency
CAS is “weakly merged”: it doesn’t enforce consistency.
But that doesn’t mean we can’t enforce consistency ourselves.
For weakly merged formats, it will be important to:
–Define consistencies that we want
    Both within annotations & between annotations
–Actively check those consistencies during annotation.
Weakly coupled annotations are a good thing.
–But the more weakly coupled the annotations are, the more we’ll need to check consistency
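As one example of a between-annotation check, the sketch below flags argument spans that do not coincide with any constituent span. The constraint itself, the function name, and the sample data are assumptions for illustration. Which consistencies to enforce is the open design question raised on this slide.

```python
def arguments_without_constituent(constituent_spans, argument_spans):
    """Return the argument spans that match no constituent span exactly."""
    known = set(constituent_spans)
    return [span for span in argument_spans if span not in known]


treebank_spans = [(0, 3), (0, 1), (1, 3), (1, 2), (2, 3)]
propbank_args = [(0, 1), (2, 4)]
print(arguments_without_constituent(treebank_spans, propbank_args))  # prints [(2, 4)]
```

A check like this could run while annotators work, so that mismatches between layers are caught early rather than during a later merge.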

Questions/discussion
Strongly vs weakly merged
(When) is redundancy good?
How strongly coupled should annotations be?
Handling discontinuous constituents?
Where is there information overlap between annotations (e.g. coref chains)? What should be done about it?
Any principled way to decide when to mark heads vs spans?
Token offset vs character offset vs tree pointer