Published by Gervase West. Modified over 9 years ago.
1
Hierarchical XML Layers Representation for Heavily Annotated Corpora
Dan Cristea (dcristea@infoiasi.ro), Cristina Butnariu (cris@infoiasi.ro)
“Al. I. Cuza” University of Iaşi, Faculty of Computer Science
and Romanian Academy – the Iaşi Branch, Institute for Theoretical Computer Science
2
LREC 2004 – Workshop on Richly Annotated Corpora 2/48
XML in LR annotation
A de facto framework to support language annotation
Used to:
– record experts' views on linguistic phenomena in corpora
– store intermediate results in pipeline NLP applications
– post NLP results
BUT:
– annotation schemes are chaotic and not reusable
– many annotations share parts in common
– not all layers are useful for the task at hand
3
Presentation
Motivation for a structural view on annotation schemes
Proposal for a hierarchical representation
– circular references
– classification within the hierarchy
– operations within the hierarchy
Conclusions
4
An annotation session
[Diagram labels: a source (an XML annotated document, a database image of the annotation, or both); DTD file; annotation session]
5
A sequence of annotation sessions
[Diagram labels: DTD1, DTD2, annotation session]
6
Mixing human with automatic annotation
[Diagram labels: DTD1, DTD2, manual annotation, automatic annotation]
7
Multiple parentage of a scheme
8
Multiple parentage
14
The hierarchy – a DAG representation
[Figure: DAG of annotation schemes. Node labels: ST-ROOT, ST-TOK, ST-SEG, ST-PAR, ST-POS, ST-NP, ST-VP, ST-COREF, ST-SEG-NP-VP, ST-PAR-SEG-NP-VP, ST-COREF-IN-SEG]
16
Definition of a scheme … …
17
The subsumption relation
A node A subsumes a node B in the hierarchy (B is a descendant of A) iff:
– any tag-name of A is also in B;
– any attribute in the list of attributes of a tag-name in A is also in the list of attributes of the same tag-name in B;
– any semantic relation which holds in A also holds in B;
– at least one of the following holds: B has a tag-name which is not in A; some tag-name in B has an attribute which is not in the list of attributes of the homonymous tag-name in A; some semantic relation holds in B but not in A.
[Diagram labels: A, B]
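The four conditions above can be read as a strict "B extends A" test. The following is a minimal Python sketch under assumed representations: the `Scheme` class, the tag-name `seg`, and the attribute names `id` and `type` are hypothetical and are not taken from the slides or from the authors' Java system.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass
class Scheme:
    """Hypothetical model of a node: tag-names with attribute lists, plus relations."""
    tags: Dict[str, FrozenSet[str]]          # tag-name -> attribute names
    relations: FrozenSet[str] = frozenset()  # semantic relations, as opaque labels


def subsumes(a: Scheme, b: Scheme) -> bool:
    """Return True if node a subsumes node b (b is a descendant of a)."""
    # Conditions 1 and 2: every tag-name of A is in B with at least A's attributes.
    for tag, attrs in a.tags.items():
        if tag not in b.tags or not attrs <= b.tags[tag]:
            return False
    # Condition 3: every semantic relation of A also holds in B.
    if not a.relations <= b.relations:
        return False
    # Condition 4: B adds a tag-name, an attribute, or a semantic relation.
    extra_tag = any(tag not in a.tags for tag in b.tags)
    extra_attr = any(
        tag in a.tags and not b.tags[tag] <= a.tags[tag] for tag in b.tags
    )
    extra_rel = not b.relations <= a.relations
    return extra_tag or extra_attr or extra_rel


# Illustrative nodes only; the real scheme definitions are not given on the slides.
root = Scheme(tags={})
seg = Scheme(tags={"seg": frozenset({"id"})})
seg_typed = Scheme(tags={"seg": frozenset({"id", "type"})})

print(subsumes(root, seg), subsumes(seg, seg_typed), subsumes(seg, seg))
```

Under this reading a node never subsumes itself, because the fourth condition requires B to add something to A.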
18
Example
Winston was dreaming of his mother. He must, he thought,
19
How can circular references be notated?
Winston was dreaming of his mother
20
Representing circular references
SEG annotation
Winston was dreaming of his mother
[Diagram labels: ST-ROOT, ST-SEG]
21
Representing circular references
VP annotation
Winston was dreaming of his mother
[Diagram labels: ST-ROOT, ST-VP]
22
Representing circular references
SEG refers into VP
Winston was dreaming of his mother
[Diagram labels: ST-ROOT, ST-VP, ST-SEG, ST-SEG-TO-VP]
23
Representing circular references
VP refers into SEG
Winston was dreaming of his mother
[Diagram labels: ST-ROOT, ST-VP, ST-SEG, ST-VP-TO-SEG]
24
Representing circular references
Keeping all references
Winston was dreaming of his mother
[Diagram labels: ST-ROOT, ST-VP, ST-SEG, ST-SEG-TO-VP, ST-VP-TO-SEG, ST-SEG-VP]
25
Representing circular references
Delete unnecessary layers
[Diagram labels, first graph: ST-ROOT, ST-VP, ST-SEG, ST-SEG-TO-VP, ST-VP-TO-SEG, ST-SEG-VP; second graph: ST-ROOT, ST-VP, ST-SEG, ST-SEG-VP]
26
Under what conditions can a document interact with a hierarchy?
Compatibility of names
Matching of semantic relations
27
Under what conditions can a document interact with a hierarchy?
Compatibility of names = tag and attribute names
– simple translation
– expanding/shrinking values
msd="Ncmso" expands into a set of elementary features: pos="noun" type="common" gender="masculine" number="singular" case="oblique"
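The expanding/shrinking of values can be illustrated with the slide's own example. This is a sketch only: it assumes the msd string is positional (one character per feature) and the lookup table covers just the five codes shown on the slide. The names `POSITIONS`, `CODES`, `expand_msd` and `shrink_msd` are hypothetical.

```python
# Feature order assumed from the slide's example; not a full tagset description.
POSITIONS = ["pos", "type", "gender", "number", "case"]

# Only the codes that appear in the example msd="Ncmso".
CODES = {
    "pos": {"N": "noun"},
    "type": {"c": "common"},
    "gender": {"m": "masculine"},
    "number": {"s": "singular"},
    "case": {"o": "oblique"},
}


def expand_msd(msd: str) -> dict:
    """Expand a compact msd value into elementary attribute/value pairs."""
    if len(msd) != len(POSITIONS):
        raise ValueError(f"expected {len(POSITIONS)} characters, got {msd!r}")
    features = {}
    for name, code in zip(POSITIONS, msd):
        if code not in CODES[name]:
            raise ValueError(f"unknown code {code!r} for feature {name!r}")
        features[name] = CODES[name][code]
    return features


def shrink_msd(features: dict) -> str:
    """Shrink elementary features back into the compact msd value."""
    chars = []
    for name in POSITIONS:
        inverse = {value: code for code, value in CODES[name].items()}
        chars.append(inverse[features[name]])
    return "".join(chars)


print(expand_msd("Ncmso"))
print(shrink_msd(expand_msd("Ncmso")))
```

Keeping both directions in one table means a document using the compact value and a hierarchy using elementary features can be compared after translation in either direction.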
28
Under what conditions can a document interact with a hierarchy?
Matching of semantic relations
– only by explicit declaration
– automatic detection (intersection of attribute value ranges) is prone to errors
29
Operations on the lattice: classification
Automatic classification of a document on the lattice proceeds in two steps:
– the witness-collection is formed: the document is parsed, collecting
   tag declarations
   the semantic-relations declaration in the header
   ref declarations
– the witness-collection is “classified” down the hierarchy
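The first step, forming the witness-collection, might look roughly like the sketch below. It collects only tag-names and their attribute names with Python's standard `xml.etree.ElementTree`; how the header's semantic-relations declaration and the ref declarations are encoded is not shown on the slides, so they are left out. The function name and the sample markup (`doc`, `seg`, `np`, `id`, `ref`) are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def witness_collection(xml_text: str) -> dict:
    """Map every tag-name found in the document to the attribute names used on it."""
    witness = defaultdict(set)
    for elem in ET.fromstring(xml_text).iter():
        # Indexing creates the entry even when the element has no attributes.
        witness[elem.tag].update(elem.attrib.keys())
    return {tag: frozenset(attrs) for tag, attrs in witness.items()}


# Hypothetical annotated fragment built around the slides' example sentence.
sample = (
    '<doc><seg id="s1"><np id="n1" ref="n0">Winston</np>'
    " was dreaming of his mother</seg></doc>"
)
for tag, attrs in sorted(witness_collection(sample).items()):
    print(tag, sorted(attrs))
```

The result has the same shape as a node's tag and attribute lists, which is what makes the later comparison against nodes of the hierarchy straightforward.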
30
Operations on the lattice: classification
The “programming by classification” paradigm of Mellish & Reiter (1993)
– the witness collection satisfies the restrictions of a node collection (is classified under it) if the features of the node collection represent a subset of the features of the witness collection
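The subset test above suggests a simple top-down walk. The sketch below is one possible reading, not the authors' algorithm: features are flattened to plain strings, the edges between nodes and the feature sets are invented for illustration (the slides name the nodes but not their contents), and `classify` is a hypothetical name.

```python
def classify(witness, features, children, root="ST-ROOT"):
    """Return the most specific nodes whose features are all found in the witness."""
    # Assumes the root node itself is satisfied (here it has no features).
    most_specific = set()
    stack, seen = [root], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        # A child is satisfied if its features are a subset of the witness features.
        satisfied = [c for c in children.get(node, []) if features[c] <= witness]
        if satisfied:
            stack.extend(satisfied)
        else:
            most_specific.add(node)
    return most_specific


# Invented feature sets and edges over node names that appear on the slides.
features = {
    "ST-ROOT": set(),
    "ST-SEG": {"seg"},
    "ST-NP": {"np"},
    "ST-VP": {"vp"},
    "ST-SEG-NP-VP": {"seg", "np", "vp"},
}
children = {
    "ST-ROOT": ["ST-SEG", "ST-NP", "ST-VP"],
    "ST-SEG": ["ST-SEG-NP-VP"],
    "ST-NP": ["ST-SEG-NP-VP"],
    "ST-VP": ["ST-SEG-NP-VP"],
}

print(sorted(classify({"seg", "np"}, features, children)))
```

With these assumed sets, a witness holding only `seg` and `np` stops at ST-SEG and ST-NP, because ST-SEG-NP-VP additionally requires `vp`.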
31
Operations on the lattice: classification
Automatic classification of a document on the lattice
[Figure: the scheme DAG. Node labels: ST-ROOT, ST-TOK, ST-SEG, ST-PAR, ST-POS, ST-NP, ST-VP, ST-COREF, ST-SEG-NP-VP, ST-PAR-SEG-NP-VP, ST-COREF-IN-SEG]
35
Operations on the lattice: classification
Automatic classification of a document on the lattice
[Figure: the scheme DAG (node labels as on slide 14), with the superior borderline marked]
36
Operations on the lattice: classification
Automatic classification of a document on the lattice
[Figure: the scheme DAG (node labels as on slide 14), with the superior borderline and the inferior borderline marked]
37
Operations on the lattice: classification
Automatic classification of a document on the lattice
[Figure: the scheme DAG (node labels as on slide 14), now also showing a node ST-SEG-NP-VP-1]
41
Operations on the lattice: classification
Automatic classification of a document on the lattice
[Figure: the scheme DAG (node labels as on slide 14), with the superior borderline marked]
42
Operations on the lattice: classification
Automatic classification of a document on the lattice
[Figure: the scheme DAG (node labels as on slide 14), now also showing a node ST-NP-PP]
43
Operations on the lattice: merge
[Figure: the scheme DAG (node labels as on slide 14), now also showing a node ST-NP-SEG]
44
Operations on the lattice: extract
[Figure: the scheme DAG, node labels as on slide 14]
46
Conclusions
We propose a data structure facilitating:
– definition and exploitation of annotation schemes
– visualization of the hierarchy
– representation of circular references
– concurrent annotations
– automatic classification
– operations: initialize-hierarchy, classify, merge, extract
System developed in Java, freely available on request
47
Acknowledgements
The research presented in this paper has been partly supported by the EC IST-2000-29388 Balkanet project and by the Balkanet-MEC project funded by the Romanian Ministry of Education and Research.
48
Thank you…