1
Annotation Types for UIMA
Edward Loper
2
UIMA
Unstructured Information Management Architecture
Analytics framework
– Consists of components that perform specific tasks (tagging, parsing, etc.)
– Each component declares its own interface (input/output, requirements, workflow metadata, etc.)
– All information is communicated using a single standard data format: CAS
– Built-in support for network distribution, clustering, etc.
3
CAS
Common Analysis Structure
Tends to fall on the “weakly-merged” side of the spectrum (does not require annotations to be modified to ensure consistency).
Annotations are encoded using typed feature structures.
But the type definitions are left unspecified. Cf. XML
Components can only work together if they use the same type system.
4
Standard CAS Types
Goal: design standard CAS types for ULA annotations.
– In particular, we’re currently looking at Treebank, Propbank, & Timebank.
Issues:
– Redundancy of information
– Coupling between annotations
– Discontinuous constituents
5
CAS Types: background
UIMA does provide a couple of top-level types (e.g. Annotation).
These make it clear that UIMA intends:
– Standoff annotations… defined using spans… with character-based offsets
Cf. AGTK
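A minimal sketch of what a standoff, span-based annotation looks like. The `Span` class and its field names are hypothetical (this is not the UIMA API); the point is only that the text is never modified and each annotation records character offsets into it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical standoff annotation: a label plus character offsets."""
    start: int  # offset of the first character covered
    end: int    # offset one past the last character covered
    label: str

    def covered_text(self, text: str) -> str:
        # The annotation never stores the text itself, only offsets into it.
        return text[self.start:self.end]


text = "Asbestos was used"
annotations = [Span(0, 8, "NP"), Span(9, 17, "VP")]
covered = [a.covered_text(text) for a in annotations]
```

Because the annotations live apart from the text, several layers (trees, arguments, temporal expressions) can point into the same string without touching each other.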
6
Treebank
Typical representation for treebank:
Questions:
– Should children be explicitly marked?
– Should parents be explicitly marked?
These questions have consequences…
7
Treebank: Explicit children?
How could we not mark children?
They can be mostly reconstructed, if we assume…
– All constituents are properly nested
– Unary branch direction can be determined based on node type.
  Not quite true: SBAR/FRAG; S/NP; NP/FRAG; NP/PRN.
Theoretical consequences of (not) marking children:
– Have to assume proper nesting of constituents
– Alternatively, allow for multiple coexisting bracketings (a la chart parse) -- probably not what we want.
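The reconstruction described above can be sketched as follows, assuming constituents are given as hypothetical `(start, end, label)` triples over token positions with non-empty spans. Spans with identical offsets (unary branches) are nested in input order here, which is an arbitrary tie-break; this is exactly the spot where a rule based on node type would be needed.

```python
def build_tree(constituents):
    """Rebuild child links from (start, end, label) spans.

    Assumes proper nesting and non-empty spans. Returns the spans in
    outer-first order plus a dict mapping each index to its child indices.
    """
    # Outer spans first: earlier start, then longer span. The sort is
    # stable, so spans with identical offsets keep their input order.
    ordered = sorted(constituents, key=lambda c: (c[0], -c[1]))
    children = {i: [] for i in range(len(ordered))}
    stack = []  # indices of constituents that are still open
    for i, (start, end, _label) in enumerate(ordered):
        # Close every open constituent that ends before this one starts.
        while stack and ordered[stack[-1]][1] <= start:
            stack.pop()
        if stack:
            if end > ordered[stack[-1]][1]:
                raise ValueError("constituents are not properly nested")
            children[stack[-1]].append(i)
        stack.append(i)
    return ordered, children


# "the cat sat", with token positions 0..3
constituents = [(0, 3, "S"), (0, 2, "NP"), (2, 3, "VP"),
                (0, 1, "DT"), (1, 2, "NN"), (2, 3, "VBD")]
ordered, children = build_tree(constituents)
s_children = [ordered[j][2] for j in children[0]]
```

Crossing brackets raise an error rather than being silently accepted, which makes the proper-nesting assumption explicit.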
8
Treebank: Explicit parents?
Parent pointers are redundant -- they can be reconstructed.
But they can be very handy to have when working with structures.
Theoretical consequences of marking parents:
– Every constituent has exactly one parent.
– Rules out multi-parented trees. (fine.)
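A small sketch of why parent pointers are redundant: they can be filled in from the child lists in one traversal. The `Node` class is hypothetical. The same pass also shows the theoretical consequence, since a node reachable from two different parents cannot be given a single parent pointer.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Node:
    label: str
    children: list = field(default_factory=list)
    # Redundant information: derived from the child lists below.
    parent: Optional["Node"] = field(default=None, repr=False)


def set_parents(root: Node) -> None:
    """Fill in every node's parent pointer from the child lists."""
    stack = [root]
    while stack:
        node = stack.pop()
        for child in node.children:
            if child.parent is not None and child.parent is not node:
                raise ValueError(f"{child.label} has more than one parent")
            child.parent = node
            stack.append(child)


dt, nn = Node("DT"), Node("NN")
np_node = Node("NP", [dt, nn])
s = Node("S", [np_node, Node("VP")])
set_parents(s)
```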
9
Propbank
Propbank’s current annotation…
– Is strongly coupled to treebank
  Argument locations are specified using “tree pointers”
– Includes trace chain information
10
Propbank: Tree Pointers
Each propbank argument is specified using a tree pointer w:h
– The h-th constituent above the w-th word.
Problems with this strong coupling:
– Propbank can’t be used without trees.
– New propbanking can’t be done unless parsing has been done.
– Changes to trees are annoying to propagate to propbank.
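To make the w:h notation concrete, here is a sketch of resolving a tree pointer. The tree encoding (nested tuples), the 0-based numbering, and the choice that height 0 means the word's own preterminal node are all assumptions made for illustration, not a statement of Propbank's file format.

```python
def resolve_pointer(tree, w, h):
    """Return the node h levels above the w-th word (both 0-based here).

    Trees are nested tuples: (label, child, ...); a preterminal is
    (tag, "word"). Height 0 is taken to be the word's preterminal node.
    """
    path = []      # nodes from the root down to the target preterminal
    counter = [0]  # number of words passed so far

    def walk(node):
        path.append(node)
        if isinstance(node[1], str):  # preterminal
            if counter[0] == w:
                return True
            counter[0] += 1
        else:
            for child in node[1:]:
                if walk(child):
                    return True
        path.pop()
        return False

    if not walk(tree):
        raise IndexError(f"no word number {w}")
    if h < 0 or h >= len(path):
        raise IndexError(f"no constituent {h} levels above word {w}")
    return path[-1 - h]


def words(node):
    """Words covered by a node, in order."""
    if isinstance(node[1], str):
        return [node[1]]
    return [word for child in node[1:] for word in words(child)]


tree = ("S", ("NP", ("DT", "the"), ("NN", "cat")), ("VP", ("VBD", "sat")))
arg = resolve_pointer(tree, 1, 1)  # one level above "cat"
```

Note that the pointer is meaningless without `tree`: this is the strong coupling the slide describes.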
11
Propbank: spans
Can we get away with using spans instead (UIMA’s preferred approach)?
Do we lose any information?
– Potentially yes -- for binary branching nodes.
– In practice:
  99.92% of non-trace args select the low constituent.
  97.9% of trace args select the high constituent.
  The differences appear to just be errors.
… so no (important) lost info!
About 50-55% of split arguments go away.
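A sketch of where a span can be less specific than a tree pointer: when two constituents cover exactly the same words, the span alone cannot say which one was meant. The nested-tuple tree encoding and the function names are assumptions for illustration. Candidates are returned lowest first, a convention suggested by the figure above for non-trace arguments.

```python
def collect_spans(node, start, depth, out):
    """Append (start, end, depth, label) for node and descendants; return end."""
    if isinstance(node[1], str):  # preterminal covers exactly one word
        end = start + 1
    else:
        end = start
        for child in node[1:]:
            end = collect_spans(child, end, depth + 1, out)
    out.append((start, end, depth, node[0]))
    return end


def candidates(tree, start, end):
    """Labels of all constituents covering words [start, end), lowest first."""
    out = []
    collect_spans(tree, 0, 0, out)
    matches = [s for s in out if s[0] == start and s[1] == end]
    # Deeper nodes are lower in the tree, so sort by decreasing depth.
    return [label for _, _, _, label in sorted(matches, key=lambda s: -s[2])]


tree = ("S", ("NP", ("NNP", "John")), ("VP", ("VBD", "slept")))
subject = candidates(tree, 0, 1)  # more than one node covers "John"
```

A span-based argument would store only `(0, 1)`; picking the first or last candidate is then a matter of convention rather than annotation.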
12
Propbank: trace chains
For arguments that have undergone movement, propbank explicitly marks the trace chain.
– But isn’t this something the tree should give us anyway?
– Treebank & propbank have somewhat different notions of what gets included in trace chains.
1/3 of the Propbank annotation guidelines talk about null elements.
13
Propbank: trace chains
How much can we recover?
– Using very simple heuristics (e.g., link “NP-2” with “*t*-2”), ~60%
– Using more advanced heuristics, maybe 80%.
– Not close enough to 100% to throw them away.
– Some differences are harder to automate: e.g., propbank (usually) only marks traces that interact with the predicate in some way.
  “Asbestos_i was used t_i … and replaced t_i …”
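The "very simple heuristic" of linking nodes that share a numeric index might look like the sketch below. The input format (a flat list of node labels in sentence order) and the function name are assumptions; the code makes no claim to reproduce the recovery rates quoted above.

```python
import re
from collections import defaultdict

# A trailing "-<digits>" is treated as a coindexation marker.
INDEX = re.compile(r"-(\d+)$")


def link_by_index(labels):
    """Group positions of labels that share a trailing numeric index."""
    chains = defaultdict(list)
    for position, label in enumerate(labels):
        match = INDEX.search(label)
        if match:
            chains[int(match.group(1))].append(position)
    # An index that occurs only once does not form a chain.
    return {idx: pos for idx, pos in chains.items() if len(pos) > 1}


# Roughly the shape of "Asbestos was used t and replaced t"
labels = ["NP-2", "VBD", "VBN", "*t*-2", "CC", "VBN", "*t*-2", "NP-1"]
chains = link_by_index(labels)
```

The heuristic links every coindexed trace, whereas the slide notes that propbank usually marks only traces that interact with the predicate, so some filtering step would still be needed.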
14
Propbank: trace chains (?s for discussion)
Should marking trace chains be part of the propbanking task?
– Or should we leave it up to the treebankers?
If it should be part of propbanking, should it be split off as a separate subtask?
– Would that help annotation speed any?
Should the annotation be split off as a separate layer?
15
Discontinuous constituents
Propbank has provisions for discontinuous constituents: w1:h1,w2:h2
Discontinuous constituents can appear almost anywhere
– Temporal expressions
– Named entities
– Parse constituents (?)
Want: a uniform way to handle them.
16
Discontinuous constituents
Goals:
– Make the common case easy
– Make the uncommon case possible
Preferred approach:
– Add an optional property (e.g. “pieces”) that can be used to specify discontinuous chunks.
– If used, then the start/end properties should be treated with appropriate care
Open question:
– Should this property be defined on the top-level type, or on individual types (e.g. PropBankArgument)?
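One way the optional "pieces" property could work, as a sketch under assumptions: the class, the helper, and the choice to set start/end to the outer bounds of the pieces are all hypothetical, not a proposed type definition.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Annotation:
    start: int
    end: int
    label: str
    # Only set for discontinuous annotations: a list of (start, end) chunks.
    pieces: Optional[list] = None

    def chunks(self):
        """Covered (start, end) chunks; a single chunk in the common case."""
        if self.pieces is not None:
            return self.pieces
        return [(self.start, self.end)]

    def covered_text(self, text, sep=" ... "):
        return sep.join(text[s:e] for s, e in self.chunks())


def discontinuous(label, pieces):
    """Build an annotation whose start/end are the outer bounds of its pieces."""
    pieces = sorted(pieces)
    return Annotation(pieces[0][0], max(e for _, e in pieces), label, pieces)


text = "turn the light off"
simple = Annotation(5, 14, "NP")                      # common case
split = discontinuous("VERB", [(15, 18), (0, 4)])     # uncommon case
```

In the common case nothing changes for consumers. In the uncommon case, `text[split.start:split.end]` covers the whole sentence, which is why start/end need to be treated with care once `pieces` is set.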
17
A note on consistency
CAS is “weakly merged” -- it doesn’t enforce consistency.
But that doesn’t mean we can’t enforce consistency ourselves.
For weakly merged formats, it will be important to:
– Define consistencies that we want
  Both within annotations & between annotations
– Actively check those consistencies during annotation.
Weakly coupled annotations are a good thing.
– But the more weakly coupled the annotations are, the more we’ll need to check consistency
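Two illustrative consistency checks over span-based annotations, one within a layer and one between layers. Both the function names and the specific constraints are assumptions chosen as examples; which consistencies are actually wanted is the open design question.

```python
def crossing_pairs(spans):
    """Within-annotation check: pairs of spans that partially overlap."""
    spans = sorted(set(spans))
    # After sorting, a comes before b, so they cross exactly when
    # b starts strictly inside a and ends strictly after it.
    return [(a, b) for i, a in enumerate(spans) for b in spans[i + 1:]
            if a[0] < b[0] < a[1] < b[1]]


def unmatched_arguments(constituent_spans, argument_spans):
    """Between-annotation check: arguments that match no constituent span."""
    known = set(constituent_spans)
    return [arg for arg in argument_spans if arg not in known]


tree_spans = [(0, 3), (0, 2), (2, 3)]  # a properly nested bracketing
arg_spans = [(0, 2), (1, 3)]           # the second argument is suspicious
within = crossing_pairs(tree_spans)
between = unmatched_arguments(tree_spans, arg_spans)
```

Checks like these can run during annotation, so an inconsistency is flagged while the annotator is still looking at the sentence.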
18
Questions/discussion
Strongly vs weakly merged
(When) is redundancy good?
How strongly coupled should annotations be?
Handling discontinuous constituents?
Where is there information overlap between annotations (e.g. coref chains)? What should be done about it?
Any principled way to decide when to mark heads vs spans?
Token offset vs character offset vs tree pointer