CMD and TEI
CMDI interoperability workshop, Utrecht
Matej Ďurčo, ICLTT, Vienna

TEI at ICLTT
AAC – Austrian Academy Corpus
  – diachronic corpus, ~ 500 million tokens
  – being converted into TEI
C4
  – distributed corpus of 20th-century German
  – Basel, Berlin, Bozen, Wien
  – harmonized format (TEI/teiHeader)
Dict-Gate
  – TEI-encoded multilingual lexicons (Persian, Arabic, German, English)
  – however described with LexicalResourceProfile
Abacus – Austrian Baroque Corpus
  – 3 (5) historical texts encoded in TEI
  – elaborate teiHeader

TEI (and friends?) in CMD
overview of currently existing TEIish CMD-profiles
Deutsches Text Archiv (author, year: ?)
  – profile: teiHeader #clarin.eu:cr1:p_ (NOT in CompReg!)
  – Comp/Elem/Datcats, instances: 56/82/10857
ICLTT (Ďurčo, 2010)
  – profile: teiHeader #clarin.eu:cr1:p_
  – Comp/Elem/Datcats, instances: /35/13 (7 dublincore, 6 isocat), 467
Leipzig Corpora (Eckart, 2012)
  – profile: TEIDocumentDescription #clarin.eu:cr1:p_
  – Comp/Elem/Datcats, instances: /17/17 (isocat), ?
Nederlab (Zhang, 2013 ?)
  – profiles: DBNL_Tekst #clarin.eu:cr1:p_ and DBNL_Tekst_Onzelfstandig #clarin.eu:cr1:p_ (private)
  – Comp/Elem/Datcats, instances: 20/38/15 and 20/47/21, ?

teiHeader (ICLTT)
[Figure: size = reuse in other profiles]

teiHeader (DTA)
[Figure: size = number of elements in instance data]

datcats in teiHeader (DTA)

TEI and ISOcat
a special DCS: TEI Header (2.1.0)
  – Windhouwer, 2012
  – a datcat for every element of the teiHeader (135 datcats)
  – based on an ODD file (ODD2DCIF.xsl and DCIF2ODD.xsl available)
  – owed to CLARIN-NL projects using the TEI header
an enriched schema was generated
  – annotated with these new data categories (dcr:datcat attribute)
  – put in SCHEMAcat
define relations between TEI and other data categories in RELcat (the relation registry)

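A minimal sketch of how the dcr:datcat annotations in such an enriched schema could be read back out, assuming the schema is available as XSD and that the dcr prefix is bound to http://www.isocat.org/ns/dcr. The sample schema, the example.org datcat URIs and the function name datcat_links are hypothetical and only illustrate the idea.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
DCR = "http://www.isocat.org/ns/dcr"  # assumed namespace for dcr:datcat

# Hypothetical stand-in for the enriched schema; the datcat URIs are placeholders.
SAMPLE_SCHEMA = f"""<xs:schema xmlns:xs="{XS}" xmlns:dcr="{DCR}">
  <xs:element name="teiHeader" dcr:datcat="http://example.org/datcat/DC-0001"/>
  <xs:element name="fileDesc" dcr:datcat="http://example.org/datcat/DC-0002"/>
  <xs:element name="note"/>
</xs:schema>"""


def datcat_links(schema_text):
    """Map each annotated element name to its data category URI."""
    root = ET.fromstring(schema_text)
    links = {}
    for element in root.iter(f"{{{XS}}}element"):
        datcat = element.get(f"{{{DCR}}}datcat")
        if datcat is not None:  # skip elements without an annotation
            links[element.get("name")] = datcat
    return links


if __name__ == "__main__":
    for name, uri in datcat_links(SAMPLE_SCHEMA).items():
        print(name, "->", uri)
```

Elements without a dcr:datcat attribute are simply skipped, so the result also shows which parts of the schema are not yet linked to a data category.
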
Next Step(s) ?
create (or adapt existing) teiHeader profile
  – as a union of the existing profiles ?
  – based on the enriched schema
  – i.e. linking to the new TEI data categories
define a relation set in RELcat between TEI and ISOcat (and dublincore) data categories

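A rough sketch of what such a relation set could make possible: given equivalence relations between data categories, a query on one category can be expanded to all related ones. This is not the RELcat data model or API. The triples, the relation name sameAs and the identifiers (tei:title, isocat:resourceTitle, ...) are hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical relation set: (datcat, relation, datcat) triples.
RELATIONS = [
    ("tei:title", "sameAs", "dc:title"),
    ("dc:title", "sameAs", "isocat:resourceTitle"),
    ("tei:author", "sameAs", "dc:creator"),
]


def build_index(relations, accepted=("sameAs",)):
    """Build an undirected adjacency index over the accepted relation types."""
    index = defaultdict(set)
    for source, relation, target in relations:
        if relation in accepted:
            index[source].add(target)
            index[target].add(source)
    return index


def expand(datcat, index):
    """Return datcat plus every category reachable through accepted relations."""
    seen = {datcat}
    stack = [datcat]
    while stack:
        current = stack.pop()
        for neighbour in index.get(current, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen


index = build_index(RELATIONS)

if __name__ == "__main__":
    print(sorted(expand("tei:title", index)))
```

Treating the relations as symmetric and transitive is a simplification chosen for this sketch; a real relation set may distinguish weaker relation types that should not be followed blindly.
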
profile: data (LINDAT)
dublincore + metashare

profile: data (LINDAT)
resourceInfo-component

dublincore I
  – 2 profiles with dc-terms (55 data categories)
  – 2 profiles with dc-elements (called "dc-terms")
as of

dublincore II
currently ( )
  – 4 DCMI-terms profiles

dublincore III
(almost) all datcats shared by all profiles

dublincore IV
1 profile has an extra component: DANS-DC-metadata
example: language