
Hierarchical XML Layers Representation for Heavily Annotated Corpora
Dan Cristea, Cristina Butnariu
“Al. I. Cuza” University of Iaşi, Faculty of Computer Science, and Romanian Academy – the Iaşi Branch, Institute for Theoretical Computer Science

LREC 2004 – Workshop on Richly Annotated Corpora

XML in LR annotation
A de facto framework to support language annotation
Used to:
– record experts' views on linguistic phenomena in corpora
– store intermediate results in pipeline NLP applications
– post NLP results
BUT:
– annotation schemes are chaotic and not reusable
– many annotations share parts in common
– not all layers are useful for the task at hand

Presentation
Motivation for a structural view on annotation schemes
Proposal for a hierarchical representation
– circular references
– classification within the hierarchy
– operations within the hierarchy
Conclusions

An annotation session
[Diagram labels: a source; XML annotated document; a database image of the annotation; or both; DTD file; annotation session]

A sequence of annotation sessions
[Diagram labels: DTD1; DTD2; annotation session]

Mixing human with automatic annotation
[Diagram labels: DTD1; DTD2; manual annotation; automatic annotation]

Multiple parentage of a scheme
[Figure sequence: diagrams not recoverable]

The hierarchy – a DAG representation
[Figure: DAG with nodes ST-ROOT, ST-TOK, ST-SEG, ST-PAR, ST-POS, ST-NP, ST-VP, ST-COREF, ST-SEG-NP-VP, ST-PAR-SEG-NP-VP, ST-COREF-IN-SEG]

Definition of a scheme
… …

The subsumption relation
A node A subsumes a node B in the hierarchy (B is a descendant of A) iff:
– every tag-name of A is also in B;
– every attribute in the list of attributes of a tag-name in A is also in the list of attributes of the same tag-name in B;
– every semantic relation which holds in A also holds in B;
– B is strictly richer than A in at least one respect: B has at least one tag-name which is not in A, and/or at least one tag-name in B has an attribute which is not in the list of attributes of the homonymous tag-name in A, and/or at least one semantic relation holds in B which does not hold in A.
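The four conditions of the subsumption relation can be sketched as a small check. This is an illustrative Python sketch, not the authors' Java system: the `Scheme` class, the tag name `w`, and the attributes `id` and `pos` are hypothetical and chosen only to make the example concrete.

```python
from dataclasses import dataclass, field


@dataclass
class Scheme:
    """Hypothetical model of a hierarchy node (not the authors' data structure)."""
    tags: dict                                    # tag-name -> set of attribute names
    relations: set = field(default_factory=set)   # names of semantic relations


def subsumes(a, b):
    """Return True if node `a` subsumes node `b` under the four listed conditions."""
    # 1. every tag-name of A is also in B
    if not set(a.tags) <= set(b.tags):
        return False
    # 2. every attribute of a tag-name in A is also an attribute of that tag-name in B
    if any(not a.tags[t] <= b.tags[t] for t in a.tags):
        return False
    # 3. every semantic relation of A also holds in B
    if not a.relations <= b.relations:
        return False
    # 4. B is strictly richer: an extra tag-name, an extra attribute, or an extra relation
    extra_tag = bool(set(b.tags) - set(a.tags))
    extra_attr = any(b.tags[t] - a.tags[t] for t in a.tags)
    extra_rel = bool(b.relations - a.relations)
    return extra_tag or extra_attr or extra_rel


# Hypothetical schemes: a token layer and a layer that adds a `pos` attribute.
st_tok = Scheme(tags={"w": {"id"}})
st_pos = Scheme(tags={"w": {"id", "pos"}})

print(subsumes(st_tok, st_pos))   # the token layer subsumes the richer layer
print(subsumes(st_pos, st_tok))   # but not the other way round
```

Because of the fourth condition, a node does not subsume itself, so the relation is a strict ordering.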

Example
Winston was dreaming of his mother. He must, he thought,

How can circular references be notated?
Winston was dreaming of his mother

Representing circular references
SEG annotation: Winston was dreaming of his mother
Hierarchy: ST-ROOT, ST-SEG

Representing circular references
VP annotation: Winston was dreaming of his mother
Hierarchy: ST-ROOT, ST-VP

Representing circular references
SEG refers into VP: Winston was dreaming of his mother
Hierarchy: ST-ROOT, ST-VP, ST-SEG, ST-SEG-TO-VP

Representing circular references
VP refers into SEG: Winston was dreaming of his mother
Hierarchy: ST-ROOT, ST-VP, ST-SEG, ST-VP-TO-SEG

Representing circular references
Keeping all references: Winston was dreaming of his mother
Hierarchy: ST-ROOT, ST-VP, ST-SEG, ST-SEG-TO-VP, ST-VP-TO-SEG, ST-SEG-VP

Representing circular references
Delete unnecessary layers
Before: ST-ROOT, ST-VP, ST-SEG, ST-SEG-TO-VP, ST-VP-TO-SEG, ST-SEG-VP
After: ST-ROOT, ST-VP, ST-SEG, ST-SEG-VP

In what conditions can a document interact with a hierarchy?
– Compatibility of names
– Matching of semantic relations

In what conditions can a document interact with a hierarchy?
Compatibility of names = tag and attribute names
– simple translation
– expanding/shrinking values
msd="Ncmso" expands into a set of elementary features: pos="noun" type="common" gender="masculine" number="singular" case="oblique"
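The expanding/shrinking step can be illustrated with a positional lookup. This is a minimal sketch under stated assumptions: the table `MSD_POSITIONS` and the functions `expand_msd` and `shrink_features` are hypothetical, and the table covers only the single value from the example above, whereas a real tagset defines many more codes per position.

```python
# Hypothetical positional table: one (feature name, code -> value) pair per
# character of the msd string. Only the codes from the "Ncmso" example are listed.
MSD_POSITIONS = [
    ("pos", {"N": "noun"}),
    ("type", {"c": "common"}),
    ("gender", {"m": "masculine"}),
    ("number", {"s": "singular"}),
    ("case", {"o": "oblique"}),
]


def expand_msd(msd):
    """Expand a compact msd value into a dict of elementary features."""
    features = {}
    for (name, codes), char in zip(MSD_POSITIONS, msd):
        features[name] = codes[char]
    return features


def shrink_features(features):
    """Inverse operation: pack elementary features back into a compact msd value."""
    chars = []
    for name, codes in MSD_POSITIONS:
        inverse = {value: code for code, value in codes.items()}
        chars.append(inverse[features[name]])
    return "".join(chars)


expanded = expand_msd("Ncmso")
print(expanded)
print(shrink_features(expanded))
```

Keeping both directions in one table means a document using the compact attribute and a hierarchy node using the elementary attributes can be mapped onto each other without loss.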

In what conditions can a document interact with a hierarchy?
Matching of semantic relations
– only by explicit declaration
– automatic detection (intersection of attribute value ranges) is prone to errors

Operations on the lattice: classification
Automatic classification of a document on the lattice proceeds in two steps:
– the witness-collection is formed:
  the document is parsed → tag declarations
  semantic-relations declaration in the header → ref declarations
– the witness-collection is “classified” down the hierarchy
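The first step, forming the witness-collection, might look like the following sketch. The slides do not show the header syntax, so the `<header>` and `<relation name="…"/>` elements, the sample document, and the function name `witness_collection` are all assumptions made for illustration. Parsing uses Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET


def witness_collection(xml_text):
    """Collect tag/attribute names and declared relations from an annotated document.

    Assumption: semantic relations are declared as <relation name="..."/> elements
    inside a <header> element. This layout is hypothetical.
    """
    root = ET.fromstring(xml_text)
    tags = {}          # tag-name -> set of attribute names seen on that tag
    relations = set()  # names of explicitly declared semantic relations
    for elem in root.iter():
        if elem.tag == "relation":
            relations.add(elem.get("name"))
            continue
        if elem.tag == "header":
            continue
        tags.setdefault(elem.tag, set()).update(elem.attrib)
    return tags, relations


# Hypothetical sample document.
SAMPLE = """<doc>
  <header><relation name="seg-contains-np"/></header>
  <seg id="s1"><np id="n1">Winston</np> was dreaming of his mother</seg>
</doc>"""

tags, relations = witness_collection(SAMPLE)
print(tags)
print(relations)
```

The resulting pair of collections has the same shape as a node definition (tag names, attributes, relations), which is what allows it to be compared against nodes of the hierarchy.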

Operations on the lattice: classification
The “programming by classification” paradigm of Mellish & Reiter (1993)
– the witness collection satisfies the restrictions of a node collection (is classified under it) if the features of the node collection represent a subset of the features of the witness collection
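The subset test can be sketched as a descent through the hierarchy. This is an illustrative sketch only: features are flattened into plain strings, the node names are borrowed from the slides, but the edges between them and the feature sets attached to them are assumptions, and `classify` is a hypothetical function name rather than the system's actual operation.

```python
def classify(hierarchy, features, witness, root="ST-ROOT"):
    """Return the most specific nodes whose features are a subset of the witness."""
    frontier, seen, result = [root], set(), set()
    while frontier:
        node = frontier.pop()
        if node in seen:
            continue
        seen.add(node)
        # Children whose restrictions the witness collection still satisfies.
        matching = [c for c in hierarchy.get(node, []) if features[c] <= witness]
        if matching:
            frontier.extend(matching)
        else:
            result.add(node)  # no child matches, so the descent stops here
    return result


# Assumed edges and feature sets over a few node names from the slides.
HIERARCHY = {
    "ST-ROOT": ["ST-TOK", "ST-SEG"],
    "ST-TOK": ["ST-POS"],
    "ST-SEG": [],
    "ST-POS": [],
}
FEATURES = {
    "ST-ROOT": set(),
    "ST-TOK": {"w"},
    "ST-POS": {"w", "w@pos"},
    "ST-SEG": {"seg"},
}

print(classify(HIERARCHY, FEATURES, {"w", "w@pos", "seg", "np"}))
print(classify(HIERARCHY, FEATURES, {"w"}))
```

A witness collection with features that no existing node accounts for (here `np`) still classifies under the deepest matching nodes; the slides suggest that a new node is then created in the hierarchy for it.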

Operations on the lattice: classification
Automatic classification of a document on the lattice
[Figure sequence over the hierarchy with nodes ST-ROOT, ST-TOK, ST-SEG, ST-PAR, ST-POS, ST-NP, ST-VP, ST-COREF, ST-SEG-NP-VP, ST-PAR-SEG-NP-VP, ST-COREF-IN-SEG. Labels appearing in the sequence: superior borderline, inferior borderline. New nodes appearing in the sequence: ST-SEG-NP-VP-1, then ST-NP-PP.]

Operations on the lattice: merge
[Figure: the hierarchy with nodes ST-ROOT, ST-TOK, ST-SEG, ST-PAR, ST-POS, ST-NP, ST-VP, ST-COREF, ST-SEG-NP-VP, ST-PAR-SEG-NP-VP, ST-COREF-IN-SEG, and ST-NP-SEG]

Operations on the lattice: extract
[Figure: the hierarchy with nodes ST-ROOT, ST-TOK, ST-SEG, ST-PAR, ST-POS, ST-NP, ST-VP, ST-COREF, ST-SEG-NP-VP, ST-PAR-SEG-NP-VP, ST-COREF-IN-SEG]

Conclusions
Propose a data structure facilitating:
– Definition and exploitation of annotation schemes
– Visualization of the hierarchy
– Representation of circular references
– Concurrent annotations
– Automatic classification
– Operations: initialize-hierarchy, classify, merge, extract
System developed in Java, freely available on request

Acknowledgements
The research presented in this paper has been partly supported by the EC IST Balkanet project, funded by the EC, and by the Balkanet-MEC project, funded by the Romanian Ministry of Education and Research.

Thank you…