Relations between Data Categories

Presentation on theme: "Relations between Data Categories"— Presentation transcript:

1 Relations between Data Categories
Jan Odijk, CLARIN-NL/UiL-OTS
January 8, 2010
MPI, Nijmegen

2 Relations between Data Categories
Data Categories & ISOCAT
Relations for Search
Data Category Sets
Relations for Mapping DCs
Structured Elements
Mapping & Structure
Conclusions

3 Defining the Topic: Data Categories
Data Categories (DC) are defined in a Registry
(Some) Data Categories are made part of a standard
(Contribution to) Semantic Interoperability by
  Using the standardized DCs, or
  Mapping one’s own DC to a standardized DC

4 Interoperability by Mapping DCs
Example (simple, naïve):
  LR1: uses DC myDC
  LR2: uses DC yourDC
  Standard has ISODC
  myDC <=> ISODC
  yourDC <=> ISODC
  Therefore: myDC <=> yourDC <=> hisDC
ISOCAT standardized DCs serve as a pivot (cf. an interlingua, IL)
Via the IL, 2n mappings are needed instead of n*(n-1)
Gain when n>3
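The mapping counts on this slide can be checked with a small sketch. It assumes directed mappings, so a pair of resources needs two direct mappings and each resource needs one mapping to and one from the pivot. The function names are illustrative only.

```python
def direct_mappings(n: int) -> int:
    """Directed pairwise mappings between n resources: n * (n - 1)."""
    return n * (n - 1)


def pivot_mappings(n: int) -> int:
    """Directed mappings via a pivot: one to and one from the pivot per resource."""
    return 2 * n


# Compare both counts for a few values of n; the pivot wins only when n > 3.
for n in range(2, 7):
    print(n, direct_mappings(n), pivot_mappings(n))
```

At n=3 both approaches need 6 mappings, which is why the gain only starts at n>3.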

5 Relations for Searching (Odijk 2009)
Find closely related DCs in ISOCAT
  Grammatical Relation used in definition of transitive
  Grammatical Relation is not a DC in ISOCAT
  Grammatical Function is a DC in ISOCAT (DC-1296)
  syntacticFunction is a DC in ISOCAT (DC-1507)
  “Syntactic function” is a DC in ISOCAT (DC-2244)
  Dependency is a DC in ISOCAT (DC-2323)
Problem:
  How do I find alternative names for the same concept in ISOCAT?
  How do I find closely related DCs in ISOCAT?
  It currently requires a linear manual search, even across different profiles!
Grouping closely related concepts together would help, e.g. in (multiple) trees, implemented by relations between DCs
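A minimal sketch of how relations between DCs could support this kind of search. The DC identifiers and names are the ones listed on the slide; the relation instances and the function names are hypothetical and do not reflect anything ISOCAT actually stores.

```python
from collections import defaultdict, deque

# DC identifiers and names as listed on the slide.
DC_NAMES = {
    "DC-1296": "Grammatical Function",
    "DC-1507": "syntacticFunction",
    "DC-2244": "Syntactic function",
    "DC-2323": "Dependency",
}

# Hypothetical "closely related" links, assumed for illustration only.
RELATIONS = [
    ("DC-1296", "DC-1507"),
    ("DC-1507", "DC-2244"),
    ("DC-2244", "DC-2323"),
]


def build_graph(relations):
    """Store each relation symmetrically so it can be followed both ways."""
    graph = defaultdict(set)
    for a, b in relations:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def related(dc_id, graph):
    """Return all DCs reachable from dc_id via relations, sorted, excluding dc_id."""
    seen = {dc_id}
    queue = deque([dc_id])
    while queue:
        current = queue.popleft()
        for neighbour in graph[current]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return sorted(seen - {dc_id})


graph = build_graph(RELATIONS)
for dc_id in related("DC-1296", graph):
    print(dc_id, DC_NAMES[dc_id])
```

With such links in place, one lookup replaces the linear manual search across profiles.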

6 Data Category Sets
For each coherent data category set a DC must exist to identify it
  E.g. in the value domain of the DC morphosyntacticTagSet: STTS, Penn tagSet, CGN tagSet
ISOCAT must represent/group them as a set
Data Category Selections (DCS) appear suited for this
They should be reusable by anyone
But no PIDs are provided for DCS

7 Implicit Semantics: ‘Mime’-like approach
Pragmatic option
  Resource/Tool 1 specifies: tagSet=STTS
  Resource/Tool 2 specifies: tagSet=STTS
  Match is found => interoperability
Semantics of STTS is left implicit; identity of semantics suffices
Occurs often, is simple and must be supported

8 Semantics Explicit: Mapping: where?
Option 1: Directly in an XML file Schema
  “PID can be embedded in the schemata of linguistic resources” (slide 8)
  Will that allow complex mappings as given above?
Option 2: In separate files
  Needed for commonly used coherent subsets (e.g. Penn Treebank Tagset, STTS, CGN Tagset, etc.)
  To avoid duplication, inconsistency, etc.
  Is that possible now?

9 Relations for Mapping DCs:
Option 1:
  myDC, yourDC outside ISOCAT
  ISODC inside ISOCAT
  All ISOCAT DCs are part of the DC IL
Option 2:
  myDC, yourDC inside ISOCAT
  ISODC in ISOCAT
  Only a subset of ISOCAT DCs is part of the DC IL

10 Relations for Mapping DCs:
Option 1 is the most natural, but
Option 2 is desirable for members of de facto standard data category sets
Mapping between ISOCAT DCs can be implemented by relations between ISOCAT DCs

11 Structured Elements (1)
ISOCAT has no provisions for this except for
  Strings (sequences of characters)
  Regular expressions (REs) over strings
But many are actually in use:
  Attribute Value Pairs (AV-Pairs)
    Attribute is a DC
    Value must be of attribute DC type and from attribute DC Conceptual Domain
  Records/AV matrices
    Which AV-Pairs are possible/mandatory for noun, verb etc.
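The two constraints named here (a value must come from the attribute's conceptual domain, and a record allows or requires certain AV-pairs per category) can be sketched as follows. The domains, the record schemas and all names are made up for illustration and are not taken from ISOCAT.

```python
# Hypothetical conceptual domains: attribute DC -> allowed value DCs.
CONCEPTUAL_DOMAIN = {
    "partOfSpeech": {"noun", "verb", "adjective"},
    "degree": {"positive", "comparative", "superlative"},
    "mood": {"indicative", "imperative"},
}

# Hypothetical record schemas: which attributes are mandatory/possible per category.
RECORD_SCHEMA = {
    "adjective": {
        "mandatory": {"partOfSpeech"},
        "possible": {"partOfSpeech", "degree"},
    },
    "verb": {
        "mandatory": {"partOfSpeech", "mood"},
        "possible": {"partOfSpeech", "mood"},
    },
}


def valid_av_pair(attribute, value):
    """A value is valid only if it is in the attribute's conceptual domain."""
    return value in CONCEPTUAL_DOMAIN.get(attribute, set())


def valid_record(record):
    """Check mandatory/possible attributes and every AV-pair of a record."""
    schema = RECORD_SCHEMA.get(record.get("partOfSpeech"))
    if schema is None:
        return False
    attributes = set(record)
    return (
        schema["mandatory"] <= attributes <= schema["possible"]
        and all(valid_av_pair(a, v) for a, v in record.items())
    )


print(valid_record({"partOfSpeech": "adjective", "degree": "comparative"}))
print(valid_record({"partOfSpeech": "verb", "degree": "comparative"}))
```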

12 Structured Elements (2)
Lists
  E.g. HPSG SUBCAT attribute: [NPnom, NPacc]
Trees/Tree Models
  E.g. DUELME database (Dutch Multiword Expressions)
  SAID (LDC2003T10)
  Treebanks
Structured categories as in Categorial Grammar
  np\s/np, np/np, etc.

13 Structured Elements (3)
Sets
  E.g. set of verb patterns in Rosetta
  Subcat patterns in Alpino: {intransitive, transitive, pc_pp(aan)} (breien ‘to knit’)
Parameterized values
  E.g. Alpino: pc_pp(aan), i.e. prepositional complement of syntactic category PP with aan as head

14 Mapping & Structure (1)
Mapping of DCs is actually mapping of DC combinations => often requires structure
Structures are also needed if there is to be a pivot
Examples
  Combination: Atomic DC => A-V pair combination: ISOCAT => Rosetta
    Transitive => thetavp=vp120 & synvps=[synNP] & caseAssigner=True

15 Mapping & Structure (2)
Penn Treebank => ISOCAT
  JJR => partOfSpeech=adjective & degree=comparative
STTS Tagset => ISOCAT
  VVIMP => partOfSpeech=verb & main verb & mood=imperative
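The two tag mappings on this slide can be written as a lookup from an atomic tag to a combination of attribute-value pairs. This is only a sketch: the table layout and function names are assumptions, and the attribute name verbType is invented here because the slide gives "main verb" without an attribute.

```python
# Tag -> A-V pair combination, following the two examples on the slide.
# "verbType" is a made-up attribute name for the slide's bare "main verb".
TAG_TO_FEATURES = {
    ("PennTreebank", "JJR"): {
        "partOfSpeech": "adjective",
        "degree": "comparative",
    },
    ("STTS", "VVIMP"): {
        "partOfSpeech": "verb",
        "verbType": "main",
        "mood": "imperative",
    },
}


def to_pivot(tagset, tag):
    """Expand an atomic tag into its pivot-side combination of A-V pairs."""
    return dict(TAG_TO_FEATURES[(tagset, tag)])


def from_pivot(tagset, features):
    """Find the tags of a tagset whose combination equals the given features."""
    return sorted(
        tag
        for (ts, tag), feats in TAG_TO_FEATURES.items()
        if ts == tagset and feats == features
    )


print(to_pivot("PennTreebank", "JJR"))
print(from_pivot("STTS", to_pivot("STTS", "VVIMP")))
```

The right-hand side is a whole combination, which is the slide's point: a single atomic DC on the pivot side would not be enough.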

16 Mapping & Structure (3)
List: Atomic DC => List: ISOCAT, Alpino => HPSG
  Transitive => [NPnom, NPacc]
Combination => parameterized value: Rosetta => Alpino
  synPREPNP in synvps & prepkey1=aan => pc_pp(aan)
  (in fact: subcats U= {pc_pp(aan)})
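The Rosetta-to-Alpino example can be sketched as a rule that reads a combination of attributes and adds a parameterized value to a set. The dictionary layout of the Rosetta-style entry and the function name are assumptions; only the attribute names and values come from the slide.

```python
def rosetta_to_alpino(entry, subcats):
    """Return subcats extended with pc_pp(<prep>) when the entry calls for it.

    The rule fires when synPREPNP occurs in the entry's synvps and prepkey1
    names a preposition; this mirrors "subcats U= {pc_pp(aan)}".
    """
    result = set(subcats)
    if "synPREPNP" in entry.get("synvps", []) and entry.get("prepkey1"):
        result.add(f"pc_pp({entry['prepkey1']})")
    return result


# Hypothetical entry for breien 'to knit', using the slide's attribute names.
entry = {"synvps": ["synPREPNP"], "prepkey1": "aan"}
print(sorted(rosetta_to_alpino(entry, {"intransitive", "transitive"})))
```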

17 Mapping & Structure (4) Union: German Adjectives
Morphosyntactic features: Gender (3), Case (4), Number (2), Declension type (3)
In theory 3*4*2*3 = 72 distinctions
Gender is neutralized in plural
So: 3*4*1*3 + 4*1*3 = 36+12 = 48 distinctions
Only 5 forms are used: eForm, erForm, esForm, emForm, enForm are the corresponding tags
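The two counts can be reproduced by enumerating the feature combinations. The value labels below are abbreviations chosen for this sketch; only the number of values per feature comes from the slide.

```python
from itertools import product

GENDERS = ["m", "f", "n"]
CASES = ["nom", "gen", "dat", "acc"]
NUMBERS = ["sg", "pl"]
DECLENSIONS = ["strong", "mixed", "weak"]

# All theoretical combinations: 3 * 4 * 2 * 3.
all_combinations = list(product(GENDERS, CASES, NUMBERS, DECLENSIONS))

# Neutralize gender in the plural by replacing it with None, then deduplicate.
distinctions = {
    (None if number == "pl" else gender, case, number, declension)
    for gender, case, number, declension in all_combinations
}

print(len(all_combinations), len(distinctions))
```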

18 Mapping & Structure (4)
Map enForm to a union of combinations of morphosyntactic features:
enForm =>
  m sg acc str
  ∨ m sg gen str
  ∨ n sg gen str
  ∨ dat pl str
  ∨ dat sg mixed
  ∨ gen sg mixed
  ∨ pl mixed
  ∨ m sg acc mixed
  ∨ dat sg weak
  ∨ gen sg weak
  ∨ pl weak
  ∨ m sg acc weak
(using underspecification for gender in some cases)
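A sketch of this union as data: each disjunct is a partial feature bundle, and a missing feature means it is underspecified. The bundles transcribe the twelve disjuncts on the slide; the attribute names and the matching function are assumptions made for illustration.

```python
# One partial feature bundle per disjunct on the slide; absent keys are underspecified.
EN_FORM = [
    {"gender": "m", "number": "sg", "case": "acc", "decl": "strong"},
    {"gender": "m", "number": "sg", "case": "gen", "decl": "strong"},
    {"gender": "n", "number": "sg", "case": "gen", "decl": "strong"},
    {"case": "dat", "number": "pl", "decl": "strong"},
    {"case": "dat", "number": "sg", "decl": "mixed"},
    {"case": "gen", "number": "sg", "decl": "mixed"},
    {"number": "pl", "decl": "mixed"},
    {"gender": "m", "number": "sg", "case": "acc", "decl": "mixed"},
    {"case": "dat", "number": "sg", "decl": "weak"},
    {"case": "gen", "number": "sg", "decl": "weak"},
    {"number": "pl", "decl": "weak"},
    {"gender": "m", "number": "sg", "case": "acc", "decl": "weak"},
]


def matches(bundle, features):
    """A bundle matches if every feature it specifies has the same value."""
    return all(features.get(key) == value for key, value in bundle.items())


def is_en_form(features):
    """True if a fully specified feature set falls under any disjunct."""
    return any(matches(bundle, features) for bundle in EN_FORM)


print(is_en_form({"gender": "f", "number": "pl", "case": "nom", "decl": "weak"}))
print(is_en_form({"gender": "f", "number": "sg", "case": "nom", "decl": "strong"}))
```

Mapping in the other direction is one-to-many: the single tag enForm expands to the whole list, so a target record cannot be chosen without further context.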

19 Mapping & Structure (5) Conclusion:
One can often not map DCs in isolation
But must map a whole entry (record) to a new entry (set of entries)
  Entry = a combination of Data Categories
  Lexicon entry or annotated text entry
Or even more complex: multiple entries => multiple sets of entries

20 Mapping & Structure (6) Questions:
Additional means are needed to provide structures as pivot
Does LMF provide part of this for lexicons?
Does anything exist for ‘entries’ in text corpora?
How can we specify relations between combinations of DCs?

21 Conclusions
Relations between DCs are needed for grouping synonymous/closely related DCs (=> Easier Search)
De facto standard DC sets must be included in ISOCAT
  Cf. Erhard Hinrichs 2009
  A subset of ISOCAT DCs to be marked as member of IL
Data Category Selections need a PID
Mapping requires relations between DC combinations
Mapping via IL requires
  Standardized lexicon entry model
  Standardized annotated text entry model

