Relations between Data Categories

Relations between Data Categories
Jan Odijk, CLARIN-NL/UiL-OTS
January 8, 2010, MPI, Nijmegen

Relations between Data Categories
- Data Categories & ISOCAT
- Relations for Search
- Data Category Sets
- Relations for Mapping DCs
- Structured Elements
- Mapping & Structure
- Conclusions

Defining the Topic: Data Categories
- Data Categories (DCs) are defined in a Registry
- (Some) Data Categories are made part of a standard
- (Contribution to) Semantic Interoperability by
  - Using the standardized DCs, or
  - Mapping one's own DC to a standardized DC

Interoperability by Mapping DCs
- Example (simple, naïve):
  - LR1 uses DC myDC
  - LR2 uses DC yourDC
  - The standard has ISODC
  - myDC ↔ ISODC
  - yourDC ↔ ISODC
  - Therefore: myDC ↔ yourDC ↔ hisDC
- ISOCAT standardized DCs serve as a pivot (cf. an interlingua, IL)
- With an IL, 2n mappings are needed instead of n*(n-1)
- Gain when n>3
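The mapping counts on this slide can be checked with a few lines of Python. This is only a sketch: the function names and the `TO_PIVOT` table are illustrative assumptions, and mappings are counted as directed (one per direction), which is what makes the figures 2n and n*(n-1) come out.

```python
def direct_mappings(n: int) -> int:
    """Directed pairwise mappings between n DC vocabularies."""
    return n * (n - 1)


def pivot_mappings(n: int) -> int:
    """Directed mappings when every vocabulary maps to and from one pivot."""
    return 2 * n


# Hypothetical mapping table: each local DC points at its pivot DC.
TO_PIVOT = {"myDC": "ISODC", "yourDC": "ISODC"}


def equivalent(dc_a: str, dc_b: str) -> bool:
    """Two local DCs are taken as equivalent if they share a pivot DC."""
    pivot_a = TO_PIVOT.get(dc_a)
    return pivot_a is not None and pivot_a == TO_PIVOT.get(dc_b)


if __name__ == "__main__":
    for n in range(2, 7):
        print(n, direct_mappings(n), pivot_mappings(n))
```

For n=3 both approaches need 6 mappings; from n=4 onward the pivot needs fewer.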

Relations for Searching (Odijk 2009)
- Find closely related DCs in ISOCAT
  - "Grammatical Relation" is used in the definition of transitive
  - Grammatical Relation is not a DC in ISOCAT
  - Grammatical Function is a DC in ISOCAT (DC-1296)
  - syntacticFunction is a DC in ISOCAT (DC-1507)
  - "Syntactic function" is a DC in ISOCAT (DC-2244)
  - Dependency is a DC in ISOCAT (DC-2323)
- Problem:
  - How do I find alternative names for the same concept in ISOCAT?
  - How do I find closely related DCs in ISOCAT?
  - It currently requires a linear manual search, even across different profiles!
- Grouping closely related concepts together would help → e.g. in (multiple) trees, implemented by relations between DCs
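One way to picture "relations between DCs" as a search aid is a small graph lookup. The sketch below uses the DC identifiers named on the slide, but the `RELATED` pairs are purely hypothetical: the slide's point is that ISOCAT did not record such relations at the time.

```python
from collections import defaultdict

# DC identifiers and names as given on the slide.
DCS = {
    "DC-1296": "Grammatical Function",
    "DC-1507": "syntacticFunction",
    "DC-2244": "Syntactic function",
    "DC-2323": "Dependency",
}

# Hypothetical "closely related" relations between DCs (not from ISOCAT).
RELATED = [
    ("DC-1296", "DC-1507"),
    ("DC-1507", "DC-2244"),
    ("DC-2244", "DC-2323"),
]


def build_index(pairs):
    """Turn undirected relation pairs into an adjacency index."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def related_to(start, graph):
    """Return all DCs reachable from start via relations, sorted by id."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    seen.discard(start)
    return sorted(seen)


if __name__ == "__main__":
    index = build_index(RELATED)
    for dc_id in related_to("DC-1296", index):
        print(dc_id, DCS[dc_id])
```

With such an index, looking up one DC returns its whole group instead of requiring a linear manual search.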

Data Category Sets
- For each coherent data category set a DC must exist to identify it
  - E.g. in the value domain of the DC morphosyntacticTagSet: STTS, Penn tagSet, CGN tagSet
- ISOCAT must represent/group them as a set
- Data Category Selections (DCS) appear suited for this
  - They should be reusable by anyone
  - But no PIDs are provided for DCS

Implicit Semantics: 'Mime'-like approach
- Pragmatic option
  - Resource/Tool 1 specifies: tagSet=STTS
  - Resource/Tool 2 specifies: tagSet=STTS
  - Match is found → interoperability
- Semantics of STTS is left implicit; identity of semantics suffices
- Occurs often, is simple and must be supported
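The 'Mime'-like match amounts to comparing declared labels without interpreting them. A minimal sketch, assuming each resource or tool exposes its declarations as a simple dictionary (the dictionary layout and the function name are assumptions):

```python
def interoperable(resource_a: dict, resource_b: dict, key: str = "tagSet") -> bool:
    """Match on an identical declared value; its meaning stays implicit."""
    value_a = resource_a.get(key)
    return value_a is not None and value_a == resource_b.get(key)


if __name__ == "__main__":
    corpus = {"tagSet": "STTS"}
    tagger = {"tagSet": "STTS"}
    parser = {"tagSet": "CGN tagSet"}
    print(interoperable(corpus, tagger))
    print(interoperable(corpus, parser))
```

Nothing here knows what STTS means; identity of the label is all that is checked.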

Semantics Explicit: Mapping: where?
- Option 1: Directly in an XML file schema
  - "PID can be embedded in the schemata of linguistic resources" (http://www.csc.fi/english/pages/neeri09/workshop/materials/windhouwer.pdf, slide 8)
  - Will that allow complex mappings as given above?
- Option 2: In separate files
  - Needed for commonly used coherent subsets (e.g. Penn Treebank Tagset, STTS, CGN Tagset, etc.)
  - To avoid duplication, inconsistency, etc.
  - Is that possible now?

Relations for Mapping DCs
- Option 1:
  - myDC, yourDC outside ISOCAT
  - ISODC inside ISOCAT
  - All ISOCAT DCs are part of the DC IL
- Option 2:
  - myDC, yourDC inside ISOCAT
  - ISODC in ISOCAT
  - Only a subset of ISOCAT DCs is part of the DC IL

Relations for Mapping DCs
- Option 1 is most natural, but
- Option 2 is desirable for members of de facto standard data category sets
- Mapping between ISOCAT DCs can be implemented by relations between ISOCAT DCs

Structured Elements (1)
- ISOCAT has no provisions for this, except for
  - Strings (sequences of characters)
  - Regular expressions (REs) over strings
- But many are actually in use:
  - Attribute-Value Pairs (AV-Pairs)
    - The attribute is a DC
    - The value must be of the attribute DC's type and from the attribute DC's Conceptual Domain
  - Records/AV matrices
    - Which AV-Pairs are possible/mandatory for noun, verb, etc.
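The two constraints named here (a value must come from its attribute's Conceptual Domain, and a record states which AV-Pairs are mandatory) can be sketched as simple checks. All domain values and the `MANDATORY` table below are hypothetical; a real registry would supply them.

```python
# Hypothetical conceptual domains per attribute DC.
CONCEPTUAL_DOMAIN = {
    "partOfSpeech": {"noun", "verb", "adjective"},
    "degree": {"positive", "comparative", "superlative"},
}

# Hypothetical record constraints: attributes required per part of speech.
MANDATORY = {"adjective": {"degree"}, "noun": set(), "verb": set()}


def valid_av_pair(attribute: str, value: str) -> bool:
    """A value is valid only if it lies in the attribute's conceptual domain."""
    return value in CONCEPTUAL_DOMAIN.get(attribute, set())


def valid_record(record: dict) -> bool:
    """Every AV-Pair must be valid and all mandatory attributes present."""
    if not all(valid_av_pair(attr, val) for attr, val in record.items()):
        return False
    required = MANDATORY.get(record.get("partOfSpeech"), set())
    return required.issubset(record)


if __name__ == "__main__":
    print(valid_record({"partOfSpeech": "adjective", "degree": "comparative"}))
    print(valid_record({"partOfSpeech": "adjective"}))
```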

Structured Elements (2)
- Lists
  - E.g. HPSG SUBCAT attribute: [NPnom, NPacc]
- Trees/Tree Models
  - E.g. DUELME database (Dutch Multiword Expressions)
  - SAID (LDC2003T10)
  - Treebanks
- Structured categories as in Categorial Grammar
  - np\s/np, np/np, etc.

Structured Elements (3)
- Sets
  - E.g. set of verb patterns in Rosetta
  - Subcat patterns in Alpino: {intransitive, transitive, pc_pp(aan)} (breien 'to knit')
- Parameterized values
  - E.g. Alpino: pc_pp(aan), i.e. prepositional complement of syntactic category PP with aan as head

Mapping & Structure (1)
- Mapping of DCs is actually mapping of DC combinations → often requires structure
- Structures are also needed if there is to be a pivot
- Examples
  - Combination: Atomic DC ↔ A-V pair combination
  - ISOCAT ↔ Rosetta: Transitive ↔ thetavp=vp120 & synvps=[synNP] & caseAssigner=True

Mapping & Structure (2)
- Penn Treebank ↔ ISOCAT
  - JJR ↔ partOfSpeech=adjective & degree=comparative
- STTS Tagset => ISOCAT
  - VVIMP ↔ partOfSpeech=verb & main verb & mood=imperative
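These two examples map one atomic tag to a combination of attribute-value pairs, which a lookup table can express directly. In the sketch below the table contents come from the slide, except `verbType`, a hypothetical attribute name standing in for the slide's bare "main verb"; the function names are also assumptions.

```python
# Tag-to-feature tables taken from the slide examples.
PENN_TO_ISOCAT = {
    "JJR": {"partOfSpeech": "adjective", "degree": "comparative"},
}

# "verbType" is a hypothetical attribute name: the slide only says "main verb".
STTS_TO_ISOCAT = {
    "VVIMP": {"partOfSpeech": "verb", "verbType": "main", "mood": "imperative"},
}


def expand(tag: str, table: dict) -> dict:
    """Map an atomic tag to its combination of attribute-value pairs."""
    return dict(table[tag])


def tags_for(features: dict, table: dict) -> list:
    """Reverse direction: find tags whose feature bundle equals the input."""
    return sorted(tag for tag, bundle in table.items() if bundle == features)


if __name__ == "__main__":
    print(expand("JJR", PENN_TO_ISOCAT))
    print(tags_for(expand("VVIMP", STTS_TO_ISOCAT), STTS_TO_ISOCAT))
```

The reverse lookup only succeeds when the whole feature bundle matches, which illustrates why single DCs cannot be mapped in isolation.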

Mapping & Structure (3)
- List: Atomic DC ↔ List
  - ISOCAT, Alpino ↔ HPSG: Transitive ↔ [NPnom, NPacc]
- Combination ↔ parameterized value
  - Rosetta ↔ Alpino: synPREPNP in synvps & prepkey1=aan ↔ pc_pp(aan)
  - (in fact: subcats U= {pc_pp(aan)})
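The Rosetta-to-Alpino example combines two attribute-value pairs into one parameterized value. A rough sketch of that step follows; the dictionary layout of a Rosetta entry and both function names are assumptions, and only the attribute names and values are from the slide.

```python
import re

# A parameterized value has the shape name(argument), e.g. pc_pp(aan).
PARAM_VALUE = re.compile(r"^(\w+)\((\w+)\)$")


def parse_param(value: str):
    """Split a parameterized value into (name, argument); argument may be None."""
    match = PARAM_VALUE.match(value)
    if match:
        return match.group(1), match.group(2)
    return value, None


def rosetta_to_alpino(entry: dict):
    """Map the combination 'synPREPNP in synvps & prepkey1=X' to pc_pp(X)."""
    if "synPREPNP" in entry.get("synvps", []) and "prepkey1" in entry:
        return f"pc_pp({entry['prepkey1']})"
    return None


if __name__ == "__main__":
    subcats = {"intransitive", "transitive"}
    value = rosetta_to_alpino({"synvps": ["synPREPNP"], "prepkey1": "aan"})
    if value is not None:
        subcats |= {value}  # the slide's "subcats U= {pc_pp(aan)}"
    print(sorted(subcats))
    print(parse_param("pc_pp(aan)"))
```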

Mapping & Structure (4)
- Union: German Adjectives
- Morphosyntactic features: Gender (3), Case (4), Number (2), Declension type (3)
- In theory 3*4*2*3 = 72 distinctions
- Gender is neutralized in plural
- So: 3*4*1*3 + 4*1*3 = 36+12 = 48 distinctions
- Only 5 forms are used: eForm, erForm, esForm, emForm, enForm are the corresponding tags
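The arithmetic on this slide can be reproduced by enumerating the feature combinations. The abbreviations m, n, sg, pl, acc, dat, gen, str, mixed and weak appear in the slides; f and nom are filled in here as assumptions to complete the three genders and four cases.

```python
from itertools import product

GENDERS = ["m", "f", "n"]          # "f" is an assumed abbreviation
CASES = ["nom", "acc", "dat", "gen"]  # "nom" is an assumed abbreviation
NUMBERS = ["sg", "pl"]
DECLENSIONS = ["str", "mixed", "weak"]


def neutralize(gender, case, number, declension):
    """Drop the gender distinction in the plural."""
    return (None if number == "pl" else gender, case, number, declension)


ALL_COMBINATIONS = list(product(GENDERS, CASES, NUMBERS, DECLENSIONS))
DISTINCT = {neutralize(*combo) for combo in ALL_COMBINATIONS}

if __name__ == "__main__":
    print(len(ALL_COMBINATIONS))  # theoretical distinctions
    print(len(DISTINCT))          # after neutralizing gender in the plural
```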

Mapping & Structure (4)
- Map enForm to a union of combinations of morphosyntactic features:
  enForm ↔
    m sg acc str ∨
    m sg gen str ∨
    n sg gen str ∨
    dat pl str ∨
    dat sg mixed ∨
    gen sg mixed ∨
    pl mixed ∨
    m sg acc mixed ∨
    dat sg weak ∨
    gen sg weak ∨
    pl weak ∨
    m sg acc weak
- (using underspecification for gender in some cases)
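The union above can be read as a list of partial feature bundles, where an unspecified feature matches any value. The sketch below encodes the twelve disjuncts from the slide; the attribute names (`gender`, `number`, `case`, `decl`) and the abbreviations f and nom are assumptions added for illustration.

```python
# Which attribute each abbreviation belongs to ("f" and "nom" are assumed).
FEATURE_OF = {
    "m": "gender", "f": "gender", "n": "gender",
    "sg": "number", "pl": "number",
    "nom": "case", "acc": "case", "dat": "case", "gen": "case",
    "str": "decl", "mixed": "decl", "weak": "decl",
}

# The disjuncts for enForm as listed on the slide.
EN_FORM = [
    "m sg acc str", "m sg gen str", "n sg gen str", "dat pl str",
    "dat sg mixed", "gen sg mixed", "pl mixed", "m sg acc mixed",
    "dat sg weak", "gen sg weak", "pl weak", "m sg acc weak",
]


def parse(disjunct: str) -> dict:
    """Turn 'dat pl str' into {'case': 'dat', 'number': 'pl', 'decl': 'str'}."""
    return {FEATURE_OF[token]: token for token in disjunct.split()}


EN_FORM_BUNDLES = [parse(disjunct) for disjunct in EN_FORM]


def is_en_form(entry: dict) -> bool:
    """True if the entry matches any disjunct; missing features match anything."""
    return any(
        all(entry.get(attr) == value for attr, value in bundle.items())
        for bundle in EN_FORM_BUNDLES
    )


if __name__ == "__main__":
    print(is_en_form({"gender": "f", "number": "sg", "case": "dat", "decl": "weak"}))
    print(is_en_form({"gender": "f", "number": "sg", "case": "nom", "decl": "weak"}))
```

The first call matches the underspecified disjunct "dat sg weak"; the second matches none of the twelve.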

Mapping & Structure (5)
- Conclusion:
  - One often cannot map DCs in isolation
  - But must map a whole entry (record) to a new entry (or set of entries)
  - Entry = a combination of Data Categories: a lexicon entry or an annotated text entry
  - Or even more complex: multiple entries ↔ multiple sets of entries

Mapping & Structure (6)
- Questions:
  - Additional means are needed to provide structures as pivot
  - Does LMF provide part of this for lexicons?
  - Does anything exist for 'entries' in text corpora?
  - How can we specify relations between combinations of DCs?

Conclusions
- Relations between DCs are needed for grouping synonymous/closely related DCs (→ Easier Search)
- De facto standard DC sets must be included in ISOCAT (cf. Erhard Hinrichs 2009)
- A subset of ISOCAT DCs is to be marked as member of the IL
- Data Category Selections need a PID
- Mapping requires relations between DC combinations
- Mapping via IL requires
  - Standardized lexicon entry model
  - Standardized annotated text entry model