Linguistic Annotation Framework SC4 WG 1 Nancy Ide Vassar College USA.

LAF Goal ISO TC37 SC4 - WG 1  Provide a generic means to represent linguistic data and annotations  Based on a formal model  Users map their formats into/out of LAF  User formats must conform to underlying model  Pivot or “dump” format for exchange, machine processing

[Figure: User A’s representation and User B’s representation, each mapped into and out of the dump format (“interlingua”).]

Principles  Separation of data and annotations  Stand-off annotation  Separation of user annotation formats and the exchange (“dump”) format  Mappable to one another  Separation of referential structure and annotation content in dump format  Separation of annotation structure (relationships among parts) and content (data categories) in representation of annotations

LAF Development  LAF has gone through a slow evolution  Model development (GMT as base)  Consideration of processing needs  Application to different annotation types/structures/formats  Adjustments to development in other WGs on specific annotation types and feature structures  “Proof of concept” instantiation in the American National Corpus  Transduction of several different annotation types and formats to LAF format  API to merge, transduce to other formats

LAF Status  Have now  Reduced FS specification  Final XML format / schema  GrAF: Graph Annotation Format  Mapping “rules” and examples  Also  Coordination with UIMA  Header specification including information about annotation, similar to UIMA type definition

Basic Model  Annotation content represented by feature structures  Powerful means to represent any/all annotations  Referential structure represented as a directed acyclic graph (DAG)  Enables exploitation of well-understood graph traversal and manipulation algorithms

Referential Structure  Means by which annotation content is associated with primary data or other annotations  Very simple DAG model  No need to consider internal structure of annotation content (i.e. relations among bits of annotation information)

Primary Data  Primary data contains no annotations  “Read-only”  Modifications can be regarded as annotations  Insistence on the identification of a base segmentation of the primary data  Identifies contiguous sequences of indivisible logical units  For text, usually a character  “Compatible” annotations (i.e. those that can be merged etc.) use common base segmentation

Primary Segmentation  Set of disjoint edges over primary data  Vertices  Virtual, located between each logical unit  Sequentially numbered  Edges  Each edge (x,y) in the graph delimits a non-divisible region of primary data  Conformance to MAF, SynAF  Each of these edges over primary data is called a span

Multiple primary segmentations may be defined over a single primary data set  Specify segmentations at different levels of granularity  A segmentation is “primary” vis-à-vis a given annotation, not the data itself  Edges in a primary segmentation can be defined over any span of contiguous primary data, regardless of its length  No need for spans to be contiguous  For text, most common primary segmentation is the token
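The primary segmentation described above can be sketched in Python. This is only an illustration under stated assumptions, not part of LAF: the primary data is taken to be a text whose logical units are characters, the virtual vertices between characters are numbered 0..n, and the hypothetical helper `token_spans` uses whitespace tokenization to produce one edge (x, y) per token.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # an edge (x, y) between two virtual vertices


def token_spans(text: str) -> List[Span]:
    """Return one (start, end) edge per whitespace-delimited token.

    Vertex i sits just before character i, so text[x:y] is the region
    delimited by edge (x, y). The primary data itself is never modified.
    """
    spans: List[Span] = []
    start = None
    for i, ch in enumerate(text):
        if ch.isspace():
            if start is not None:
                spans.append((start, i))
                start = None
        elif start is None:
            start = i
    if start is not None:
        spans.append((start, len(text)))
    return spans


def are_disjoint(spans: List[Span]) -> bool:
    """Check that no two edges overlap (they need not be contiguous)."""
    ordered = sorted(spans)
    return all(a[1] <= b[0] for a, b in zip(ordered, ordered[1:]))


text = "The clock struck twenty-two"
spans = token_spans(text)
regions = [text[x:y] for x, y in spans]
```

Here the spans skip the whitespace characters, which matches the point that spans in a primary segmentation need not be contiguous.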

Referring to Primary Segmentation  Define an edge graph over the edges (spans) in the primary segmentation  Given an edge set, E, create an edge graph E’ such that for each edge (x,y) in E, there is a vertex xy in E’  Annotations are associated with regions of primary data by referencing the edge graph vertices  Annotations never reference the primary data directly

Edges in E’ are defined when annotations reference vertices in E’  Vertices may or may not be contiguous  An annotation is associated with vertices in E’ as follows: 1. Create a new vertex, v 2. Label it with the FS containing the annotation content 3. Create an edge from v to 0 or more vertices in E’  Zero reference is used in the special case where the annotation applies to information not present in the data  References to 2 or more vertices in E’ by default concatenate the information covered by the referenced vertices (in order)  Can be overridden to specify vertices are to be regarded as an ordered list or “bag”
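The three steps above can be sketched as follows. This is a rough illustration rather than the LAF data model itself: the class `AnnotationNode`, the helpers `build_edge_graph`, `annotate` and `covered_regions`, the identifiers `e1`, `e2`, ... and the `type=note` feature structure are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Span = Tuple[int, int]


@dataclass
class AnnotationNode:
    node_id: str
    fs: Dict[str, str]                      # feature structure (annotation content)
    edges_to: List[str] = field(default_factory=list)  # targets in E'


def build_edge_graph(spans: List[Span]) -> Dict[str, Span]:
    """Create E': one vertex per edge (x, y) of the primary segmentation."""
    return {f"e{i}": span for i, span in enumerate(spans, start=1)}


def annotate(node_id: str, fs: Dict[str, str], targets: List[str],
             edge_graph: Dict[str, Span]) -> AnnotationNode:
    """Steps 1-3: new vertex, labeled with the FS, with edges to 0+ vertices in E'."""
    unknown = [t for t in targets if t not in edge_graph]
    if unknown:
        raise KeyError(f"unknown edge-graph vertices: {unknown}")
    return AnnotationNode(node_id, dict(fs), list(targets))


def covered_regions(node: AnnotationNode, edge_graph: Dict[str, Span],
                    text: str) -> List[str]:
    """Default reading: the referenced regions, taken in the order given."""
    return [text[x:y] for x, y in (edge_graph[t] for t in node.edges_to)]


text = "The clock struck twenty-two"
edge_graph = build_edge_graph([(0, 3), (4, 9), (10, 16), (17, 27)])

token = annotate("t2", {"type": "token", "pos": "nn", "base": "clock"},
                 ["e2"], edge_graph)
pair = annotate("x1", {"type": "np"}, ["e1", "e2"], edge_graph)
# Zero references: an annotation about information not present in the data.
note = annotate("n1", {"type": "note"}, [], edge_graph)
```

Note that the annotation nodes only ever name vertices of the edge graph; the character offsets stay inside `edge_graph`, so annotations never reference the primary data directly.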

Edge graph over primary data  Primary data: |T|h|e| |c|l|o|c|k| |s|t|r|u|c|k| |t|w|e|n|t|y|-|t|w|o| |  Annotations associated with vertices in the primary data edge graph:  “The”: type=token pos=det base=the  “clock”: type=token pos=nn base=clock  “struck”: type=token pos=vbd base=strike  “twenty-two”: type=token pos=cd base=twenty+two

As many annotations as desired can reference the same segmentation or be layered over lower-level annotations  [Figure: primary data with segmentations SEG1 and SEG2 and annotation layers MS1, MS2, MS3, NP, Syn1, Syn2, Sem and Co-Ref.]

Annotating Annotations  Vertices in an annotation may be referenced from other annotations 1. Create a new vertex, v’ 2. Label it with the FS containing the annotation content 3. Create an edge from v’ to one or more vertices associated with an annotation  The strategy described above may be applied recursively, thus creating a DAG whose leaves are the vertices in E’
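The recursive layering described above can be sketched as a small DAG walk. The graph below is a made-up example: only the NP over tokens `t1` and `t2` follows the slides, while `vp1`, `np2`, `s1` and their edges are assumptions added to give the recursion something to traverse. The hypothetical function `leaves` collects the edge-graph vertices that an annotation ultimately covers.

```python
from typing import Dict, List

# node id -> ids of the vertices it has edges to.
# Ids that do not appear as keys are vertices of the edge graph E' (leaves).
edges_to: Dict[str, List[str]] = {
    "t1": ["e1"], "t2": ["e2"], "t3": ["e3"], "t4": ["e4"],  # token annotations
    "np1": ["t1", "t2"],      # annotation over token annotations
    "np2": ["t4"],            # assumed for illustration
    "vp1": ["t3", "np2"],     # assumed for illustration
    "s1": ["np1", "vp1"],     # assumed for illustration
}


def leaves(node_id: str, graph: Dict[str, List[str]]) -> List[str]:
    """Return the E' vertices reachable from node_id, in traversal order."""
    targets = graph.get(node_id)
    if targets is None:          # not an annotation node: a vertex in E'
        return [node_id]
    found: List[str] = []
    for target in targets:
        for leaf in leaves(target, graph):
            if leaf not in found:
                found.append(leaf)
    return found


np_leaves = leaves("np1", edges_to)
s_leaves = leaves("s1", edges_to)
```

An annotation with an empty target list (the zero-reference case) simply yields no leaves under this sketch.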

Annotations associated with token annotations  [Figure: the primary data |T|h|e| |c|l|o|c|k| |s|t|r|u|c|k| |t|w|e|n|t|y|-|t|w|o| | with its token annotations (type=token pos=det base=the; type=token pos=nn base=clock; type=token pos=vbd base=strike; type=token pos=cd base=twenty+two) and three annotations associated with them: type=np number=sing, type=vp tense=past, type=np number=sing.]

XML Instantiation

Token Annotation Creates a new vertex (node) associated with the FS with a single edge to vertex “e2” in the primary segmentation edge graph

NP Annotation Creates a new vertex (node) associated with the FS with two outgoing edges to vertices “t1” and “t2” in the token annotation
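The XML shown on the original slides is not preserved in this transcript, so the following is only a guess at the general shape, written with Python's standard `xml.etree.ElementTree`. The element names `node`, `fs` and `f`, and the attributes `id`, `name` and `value`, are assumptions for illustration; only the `edgesTo` attribute name and the ids `e2`, `t1` and `t2` come from the slides.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def node_element(node_id: str, fs: Dict[str, str],
                 edges_to: List[str]) -> ET.Element:
    """Build one annotation node: an FS plus outgoing edges (hypothetical schema)."""
    node = ET.Element("node", {"id": node_id, "edgesTo": " ".join(edges_to)})
    fs_el = ET.SubElement(node, "fs")
    for name, value in fs.items():
        ET.SubElement(fs_el, "f", {"name": name, "value": value})
    return node


# Token annotation: a single edge to vertex "e2" in the segmentation edge graph.
token_node = node_element(
    "t2", {"type": "token", "pos": "nn", "base": "clock"}, ["e2"])

# NP annotation: two outgoing edges to the token annotations "t1" and "t2".
np_node = node_element("np1", {"type": "np", "number": "sing"}, ["t1", "t2"])

xml_text = ET.tostring(np_node, encoding="unicode")
print(xml_text)
```

Listing the targets in a single attribute keeps the edges unlabeled and the serialization flat, which is consistent with the notes on the dump format elsewhere in the slides.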

Question ISO TC37 SC4 - WG 1 Beijing 2006  When referring to annotations, edge targets typically represent components  E.g. in the example: “the” and “clock” are components of “NP”  But this is not always the case  Could be e.g. a list of co-referents  Others?  Possible solution: let the processor deal with it using the FS type

Note  Edges are never labeled, unlike in many linguistic analyses  Preserves simplicity of the graph  Relations are DatCats  edgesTo attribute can be empty  Can create pseudo-nodes  Implies a flat (non-nested) structure in the dump format

[Figure: example graph with nodes MY, DOG, [DOG], HAVE and FLEA and relations s, head, subj, obj and gen.]

Advantages of DAG  Can apply graph algorithms to traverse the graph  Breadth-first, depth-first traversal, shortest path, minimum spanning tree  Connectedness, articulation vertices  Topological sort  Graph coloring, graph partitioning  Etc.  What can we do with this?  What is all the information on the path to/from node x?  What is the nearest common ancestor of nodes x and y?  Find matching sub-graphs  Identify connected components  Which nodes (phenomena) are most connected, form articulation vertices, etc.?  …
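One of the queries listed above, the nearest common ancestor of two nodes, can be sketched over a toy annotation graph. The graph is invented for illustration, and because a node in a DAG may have several parents the hypothetical function `lowest_common_ancestors` returns a set: the common ancestors that are not themselves ancestors of another common ancestor. This is one reasonable reading of "nearest", not a definition taken from LAF.

```python
from typing import Dict, List, Set

# node id -> ids it has edges to (illustrative graph, not from the slides)
edges_to: Dict[str, List[str]] = {
    "t1": ["e1"], "t2": ["e2"], "t3": ["e3"], "t4": ["e4"],
    "np1": ["t1", "t2"], "np2": ["t4"],
    "vp1": ["t3", "np2"], "s1": ["np1", "vp1"],
}


def parents_of(graph: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Invert the edges: target id -> set of nodes pointing at it."""
    parents: Dict[str, Set[str]] = {}
    for source, targets in graph.items():
        for target in targets:
            parents.setdefault(target, set()).add(source)
    return parents


def ancestors(node: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """All nodes from which `node` is reachable (iterative depth-first walk)."""
    seen: Set[str] = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def lowest_common_ancestors(x: str, y: str,
                            graph: Dict[str, List[str]]) -> Set[str]:
    parents = parents_of(graph)
    common = ancestors(x, parents) & ancestors(y, parents)
    # Keep only common ancestors that lie below every other common ancestor.
    return {c for c in common
            if not any(c in ancestors(d, parents) for d in common if d != c)}


same_phrase = lowest_common_ancestors("e1", "e2", edges_to)
across_phrases = lowest_common_ancestors("e1", "e4", edges_to)
```

With this toy graph, two spans inside the same noun phrase meet at the NP node, while spans from different phrases meet only at the sentence node.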

Feature Structures  Each edge is labeled with a feature value  Can be FS, collection (list, bag, set), atom  Alternation and grouping handled by the FS mechanisms  Need to identify “basic” FS mechanisms  90% of annotations use only these  Annotations may (optionally) use only this set  Ease of use  No need to implement procedures to handle full power of FS  Need to create a FS library for abbreviation

Implications for Other WGs  Should (conceptually at least) separate referential structure from annotation content  E.g. “tlink” in TimeML/SemAF: the link itself is the edge, “tlink” is the annotation content (?)  Need for coordination  Inter-project coordination committee?  Need examples!

Today’s Work  Discuss the format in terms of specific annotation types  Remember that dump format is in principle never seen by the user  Map user format into and out of dump format  Two topics  DAG for referential structure  FS for representing annotation content
