
1 Hierarchical XML Layers Representation for Heavily Annotated Corpora. Dan Cristea (dcristea@infoiasi.ro), Cristina Butnariu (cris@infoiasi.ro). “Al. I. Cuza” University of Iaşi, Faculty of Computer Science, and Romanian Academy – the Iaşi Branch, Institute for Theoretical Computer Science

2 LREC 2004 – Workshop on Richly Annotated Corpora. XML in LR annotation: A de facto framework to support language annotation. Used to: –record experts' views on linguistic phenomena on corpora –store intermediate results in pipeline NLP applications –post NLP results. BUT: –annotation schemes are chaotic and not reusable –many annotations share parts in common –not all layers are useful for the task at hand

3 Presentation: Motivation for a structural view on annotation schemes. Proposal for a hierarchical representation: –circular references –classification within the hierarchy –operations within the hierarchy. Conclusions.

4 An annotation session [figure; labels: a source XML annotated document, a database image of the annotation, or both; DTD file; annotation session]

5 A sequence of annotation sessions [figure; labels: DTD1, DTD2, annotation session]

6 Mixing human with automatic annotation [figure; labels: DTD1, DTD2, manual annotation, automatic annotation]

7 Multiple parentage of a scheme [figure]

8–13 Multiple parentage [figure sequence]

14–15 The hierarchy – a DAG representation [figure; node labels: ST-NP, ST-SEG-NP-VP, ST-ROOT, ST-TOK, ST-SEG, ST-PAR, ST-POS, ST-VP, ST-COREF, ST-PAR-SEG-NP-VP, ST-COREF-IN-SEG]

16 Definition of a scheme … …

17 The subsumption relation: A node A subsumes a node B in the hierarchy (B is a descendant of A) iff: –any tag-name of A is also in B; –any attribute in the list of attributes of a tag-name in A is also in the list of attributes of the same tag-name of B; –any semantic relation which holds in A also holds in B; –B has at least one tag-name which is not in A, and/or there is at least one tag-name in B such that at least one attribute in its list of attributes is not in the list of attributes of the homonymous tag-name in A, and/or there is at least one semantic relation which holds in B and which doesn’t hold in A. [figure; node labels: A, B]
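The four conditions of the subsumption relation can be sketched in Python. This is only an illustration under stated assumptions: the `Scheme` class, the `subsumes` function, and the tag and attribute names (`seg`, `vp`, `id`) are hypothetical and do not come from the authors' Java system. A scheme is modelled as a mapping from tag-names to attribute sets plus a set of semantic relations.

```python
from dataclasses import dataclass, field


@dataclass
class Scheme:
    """Hypothetical model of a hierarchy node (not the authors' data structure)."""
    tags: dict = field(default_factory=dict)        # tag-name -> set of attribute names
    relations: set = field(default_factory=set)     # names of semantic relations


def subsumes(a: Scheme, b: Scheme) -> bool:
    """Return True if node `a` subsumes node `b` (b is a descendant of a)."""
    # Condition 1: any tag-name of A is also in B.
    if not set(a.tags) <= set(b.tags):
        return False
    # Condition 2: attributes of each tag-name in A also appear on the same tag-name in B.
    if any(not a.tags[t] <= b.tags[t] for t in a.tags):
        return False
    # Condition 3: any semantic relation holding in A also holds in B.
    if not a.relations <= b.relations:
        return False
    # Condition 4: B is strictly richer in tags, attributes, or relations.
    extra_tag = bool(set(b.tags) - set(a.tags))
    extra_attr = any(b.tags[t] - a.tags[t] for t in a.tags)
    extra_rel = bool(b.relations - a.relations)
    return extra_tag or extra_attr or extra_rel


# Illustrative nodes; the contents are assumptions, only the names echo the slides.
st_root = Scheme()
st_seg = Scheme(tags={"seg": {"id"}})
st_vp = Scheme(tags={"vp": {"id"}})
st_seg_vp = Scheme(tags={"seg": {"id"}, "vp": {"id"}})

print(subsumes(st_root, st_seg))    # root is poorer than every other node
print(subsumes(st_seg, st_seg_vp))  # a node with two parents
print(subsumes(st_seg, st_vp))      # unrelated siblings
```

Condition 4 makes the relation strict, so a node never subsumes itself; this keeps the hierarchy a DAG without self-loops.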

18 Example: Winston was dreaming of his mother. He must, he thought,

19 How can circular references be notated? [example text: Winston was dreaming of his mother]

20 Representing circular references: SEG annotation [figure; node labels: ST-ROOT, ST-SEG; example text: Winston was dreaming of his mother]

21 Representing circular references: VP annotation [figure; node labels: ST-ROOT, ST-VP; example text: Winston was dreaming of his mother]

22 Representing circular references: SEG refers into VP [figure; node labels: ST-ROOT, ST-VP, ST-SEG, ST-SEG-TO-VP; example text: Winston was dreaming of his mother]

23 Representing circular references: VP refers into SEG [figure; node labels: ST-ROOT, ST-VP, ST-SEG, ST-VP-TO-SEG; example text: Winston was dreaming of his mother]

24 Representing circular references: keeping all references [figure; node labels: ST-ROOT, ST-VP, ST-SEG, ST-SEG-TO-VP, ST-VP-TO-SEG, ST-SEG-VP; example text: Winston was dreaming of his mother]

25 Representing circular references: delete unnecessary layers [figure; first diagram node labels: ST-ROOT, ST-VP, ST-SEG, ST-SEG-TO-VP, ST-VP-TO-SEG, ST-SEG-VP; second diagram node labels: ST-ROOT, ST-VP, ST-SEG, ST-SEG-VP]

26 Under what conditions can a document interact with a hierarchy? –compatibility of names –matching of semantic relations

27 Under what conditions can a document interact with a hierarchy? Compatibility of names = tag and attribute names: –simple translation –expanding/shrinking values: msd="Ncmso" expands into a set of elementary features pos="noun" type="common" gender="masculine" number="singular" case="oblique"
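The expanding/shrinking of values can be sketched as a positional lookup. The table below covers only the single example given on the slide (`Ncmso`); the names `NOUN_POSITIONS`, `expand_msd`, and `shrink_features` are hypothetical, and a real converter would need the full tagset specification.

```python
# Positional table for noun MSD codes, restricted to the values shown on the slide.
# Each entry is (feature name, {code character: feature value}).
NOUN_POSITIONS = [
    ("pos", {"N": "noun"}),
    ("type", {"c": "common"}),
    ("gender", {"m": "masculine"}),
    ("number", {"s": "singular"}),
    ("case", {"o": "oblique"}),
]


def expand_msd(msd: str) -> dict:
    """Expand a compact MSD code into a dictionary of elementary features."""
    if len(msd) != len(NOUN_POSITIONS):
        raise ValueError(f"unexpected MSD length: {msd!r}")
    features = {}
    for char, (name, values) in zip(msd, NOUN_POSITIONS):
        if char not in values:
            raise ValueError(f"unknown code {char!r} for feature {name!r}")
        features[name] = values[char]
    return features


def shrink_features(features: dict) -> str:
    """Inverse operation: pack elementary features back into an MSD code."""
    chars = []
    for name, values in NOUN_POSITIONS:
        inverse = {value: code for code, value in values.items()}
        chars.append(inverse[features[name]])
    return "".join(chars)


print(expand_msd("Ncmso"))
print(shrink_features(expand_msd("Ncmso")))
```

Keeping one table for both directions means the expansion and the shrinking cannot drift apart.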

28 Under what conditions can a document interact with a hierarchy? Matching of semantic relations: –only by explicit declaration –automatic detection (intersection of attribute value ranges) is prone to errors

29 Operations on the lattice: classification. Automatic classification of a document on the lattice proceeds in two steps: –the witness-collection is formed: the document is parsed → tag declarations; semantic-relations declaration in the header → ref declarations –the witness-collection is “classified” down the hierarchy
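The first step, forming the witness collection from the parsed document, might look like the sketch below. It uses Python's standard `xml.etree.ElementTree` and collects only tag-names and their attribute names; the header format for semantic-relation declarations is not shown on the slides, so that part is left out. The function name `witness_collection` and the sample document are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotated fragment; tag and attribute names are illustrative only.
SAMPLE = (
    '<doc>'
    '<seg id="s1"><np id="n1">Winston</np> was dreaming of '
    '<np id="n2" ref="n1">his mother</np></seg>'
    '</doc>'
)


def witness_collection(xml_text: str) -> dict:
    """Map every tag-name found in the document to the set of attribute names used on it."""
    root = ET.fromstring(xml_text)
    tags = {}
    for elem in root.iter():  # iter() visits the root and all descendants
        tags.setdefault(elem.tag, set()).update(elem.attrib)
    return tags


print(witness_collection(SAMPLE))
```

The result has the same shape as a scheme node (tag-names with attribute lists), which is what allows it to be compared against nodes of the hierarchy in the second step.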

30 Operations on the lattice: classification. The “programming by classification” paradigm of Mellish & Reiter (1993): –the witness collection satisfies the restrictions of a node collection (is classified under it) if the features of the node collection represent a subset of the features of the witness collection
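The subset test that drives classification can be sketched as follows. Each node is reduced to a flat set of features; the node names echo the slides, but the feature contents, the helper names `satisfied_nodes` and `most_specific`, and the witness are all assumptions. Reading the most specific satisfied nodes as the place where the document attaches is also an interpretation, not something the slides state.

```python
# Hypothetical feature sets for a few hierarchy nodes (contents are assumptions).
HIERARCHY = {
    "ST-ROOT": frozenset(),
    "ST-TOK": frozenset({"tok"}),
    "ST-SEG": frozenset({"seg"}),
    "ST-NP": frozenset({"np"}),
    "ST-VP": frozenset({"vp"}),
    "ST-SEG-NP-VP": frozenset({"seg", "np", "vp"}),
}


def satisfied_nodes(hierarchy: dict, witness: frozenset) -> set:
    """Nodes whose features are a subset of the witness collection's features."""
    return {name for name, feats in hierarchy.items() if feats <= witness}


def most_specific(hierarchy: dict, names: set) -> set:
    """Drop every node that is strictly poorer than another node in `names`."""
    return {
        n for n in names
        if not any(hierarchy[n] < hierarchy[m] for m in names)
    }


# A witness collection with seg, np and a new pp layer.
witness = frozenset({"seg", "np", "pp"})
satisfied = satisfied_nodes(HIERARCHY, witness)
print(sorted(satisfied))
print(sorted(most_specific(HIERARCHY, satisfied)))
```

In this toy run the witness is classified under ST-ROOT, ST-SEG and ST-NP, and the last two are the most specific; a new node for the document would plausibly be attached below them.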

31–34 Operations on the lattice: classification. Automatic classification of a document on the lattice [figure sequence; node labels as on slide 14]

35 Operations on the lattice: classification. Automatic classification of a document on the lattice [figure; node labels as on slide 14; marked: superior borderline]

36 Operations on the lattice: classification. Automatic classification of a document on the lattice [figure; node labels as on slide 14; marked: superior borderline, inferior borderline]

37 Operations on the lattice: classification. Automatic classification of a document on the lattice [figure; node labels as on slide 14, plus ST-SEG-NP-VP-1]

38–40 Operations on the lattice: classification. Automatic classification of a document on the lattice [figure sequence; node labels as on slide 14]

41 Operations on the lattice: classification. Automatic classification of a document on the lattice [figure; node labels as on slide 14; marked: superior borderline]

42 Operations on the lattice: classification. Automatic classification of a document on the lattice [figure; node labels as on slide 14, plus ST-NP-PP]

43 Operations on the lattice: merge [figure; node labels as on slide 14, plus ST-NP-SEG]

44–45 Operations on the lattice: extract [figure sequence; node labels as on slide 14]

46 Conclusions: We propose a data structure facilitating: –definition and exploitation of annotation schemes –visualization of the hierarchy –representation of circular references –concurrent annotations –automatic classification –operations: initialize-hierarchy, classify, merge, extract. System developed in Java, freely available on request.

47 Acknowledgements: The research presented in this paper has been partly supported by the EC IST-2000-29388 Balkanet project, funded by the EC, and the Balkanet-MEC project, funded by the Romanian Ministry of Education and Research.

48 Thank you…

