Linking Etymological Database: A case study in Germanic
Christian Chiarcos, Maria Sukhareva
Goethe University Frankfurt am Main
LDL 2014, LREC, Reykjavik


Linking Etymological Database: A case study in Germanic
Christian Chiarcos, Maria Sukhareva
Goethe University Frankfurt am Main
LDL 2014, LREC, Reykjavik, Iceland, 27 May 2014

Overview
1. Background
2. Linked Etymological Dictionaries
3. Enriching Linked Etymological Dictionaries
4. Application
5. Conclusion

Background

Background
Processing of Old Germanic languages at Goethe University Frankfurt, in collaboration between:
1. Empirical Linguistics: Thesaurus of Indo-European Text and Language Materials (TITUS)
2. ACoLi Lab (Applied Computational Linguistics)
3. LOEWE Cluster “Digital Humanities”
4. DFG-funded Old German Reference Corpus (DDD, Referenzkorpus Althochdeutsch)

Linked Etymological Data

Conversion of etymological dictionaries to RDF
– Linkability: representation of relations within and beyond lexicons
– Interoperability: (meta)data representation through community-maintained vocabularies (lexvo, Glottolog, OLiA, lemon)
– Inference: filling the logical gaps of the original XML representation
  – Symmetric closure of cross-references

Linked Etymological Data
– lemonet:translates: a relation between lemon:LexicalEntry instances
– lemonet:etym: links between languages, transitive and symmetric; a subproperty of lemon:lexicalVariant
– All language identifiers were mapped from the original abbreviations and assigned ISO codes wherever possible.
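The mapping of language abbreviations to ISO codes and the creation of an etymological link could be sketched roughly as follows. This is a minimal illustration, not the authors' conversion code: the `LEMONET` namespace URI, the abbreviation table, and the function names are assumptions made for the example (the slides give neither the namespace nor the abbreviation list). Only the ISO 639-3 codes and the lexvo URI pattern are standard.

```python
# Hypothetical namespace: the slides do not state the actual lemonet URI.
LEMONET = "http://example.org/lemonet#"
# lexvo identifies languages by ISO 639-3 code under this prefix.
LEXVO = "http://lexvo.org/id/iso639-3/"

# Assumed dictionary abbreviations (German philological convention)
# mapped to ISO 639-3 codes.
ABBREV_TO_ISO = {
    "ahd.": "goh",  # Old High German
    "as.": "osx",   # Old Saxon
    "got.": "got",  # Gothic
    "ae.": "ang",   # Old English
    "an.": "non",   # Old Norse
}


def language_uri(abbrev):
    """Return a lexvo URI for a known abbreviation, or None if unmapped."""
    iso = ABBREV_TO_ISO.get(abbrev.strip().lower())
    return LEXVO + iso if iso else None


def etym_triple(source_entry, target_entry):
    """Build one (subject, predicate, object) triple for an etymological link."""
    return (source_entry, LEMONET + "etym", target_entry)


if __name__ == "__main__":
    print(language_uri("ahd."))
    # Entry identifiers below are invented for illustration only.
    print(etym_triple("urn:example:osx/dag", "urn:example:goh/tag"))
```

Returning `None` for an unknown abbreviation mirrors the "wherever possible" caveat: identifiers without an ISO code are left unmapped instead of being guessed.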

Linked Etymological Data
[Figure: original XML (lemma) converted to RDF triples; symmetric closure of etymological relations generated by a SPARQL pattern; links to external resources]
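The symmetric closure mentioned here can be illustrated with a small sketch. The slides do not show the actual SPARQL pattern, so the query in the comment is only an assumed form of it, and the Python function is a plain in-memory equivalent over (subject, predicate, object) tuples. The `lemonet:` prefix URI and the entry identifiers are hypothetical.

```python
# An assumed SPARQL formulation of the same idea (prefix URI is hypothetical):
#
#   PREFIX lemonet: <http://example.org/lemonet#>
#   CONSTRUCT { ?b lemonet:etym ?a }
#   WHERE     { ?a lemonet:etym ?b }

ETYM = "http://example.org/lemonet#etym"  # hypothetical property URI


def symmetric_closure(triples, prop):
    """Return the triples plus the inverse of every triple that uses `prop`."""
    closed = set(triples)
    for subj, pred, obj in triples:
        if pred == prop:
            closed.add((obj, pred, subj))
    return closed


if __name__ == "__main__":
    # Invented identifiers for Old Saxon "dag" and Old High German "tag".
    data = [("urn:example:osx/dag", ETYM, "urn:example:goh/tag")]
    for triple in sorted(symmetric_closure(data, ETYM)):
        print(triple)
```

Only the symmetric step is shown; the transitive closure of lemonet:etym would need an additional fixed-point iteration.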

Enriching Etymological Dictionaries

Germanic parallel Bible corpus
[Table not preserved in the transcript; parentheses indicate marginal fragments with fewer than 50,000 tokens]

Enriching Etymological Dictionaries

Application

Thematic alignment of Bible paraphrases
– E.g., cross-references within the Bible and between the Bible and gospel harmonies: an interlinked index of thematically similar sections in the gospels and the Old Saxon (OS) and Old High German (OHG) gospel harmonies
– The section-level alignment of the OS Heliand and the OHG Tatian (Sievers, 1872) has been digitized
– 4560 inter-text groups based on the Eusebian canon
Basis for a more fine-grained level of alignment

Application
Similarity metrics $\delta(w_{OS}, w_{OHG})$ for every OS word $w_{OS}$ and its potential OHG cognate $w_{OHG}$
Character-based similarity measures:
– GEOMETRY: $\delta$ = difference between the relative positions of $w_{OS}$ and $w_{OHG}$
– IDENTITY: $\delta(w_{OS}, w_{OHG}) = 1$ iff $w_{OHG} = w_{OS}$ (0 otherwise)
– ORTHOGRAPHY: relative Levenshtein distance and statistical character replacement probability (Neubig et al., 2012)
– NORMALIZATION: $\mathrm{norm}(w_{OS}, w_{OHG}) = \delta(w'_{OS}, w_{OHG})$, with $w'_{OS}$ being the OHG ‘normalization’ of $w_{OS}$ (Bollmann et al., 2011)
– COOCCURRENCES: $\delta(w_{OS}, w_{OHG}) = P(w_{OS} \mid w_{OHG}) \, P(w_{OHG} \mid w_{OS})$
Lexicon-based similarity measures:
$\delta_{lex}(w_{OS}, w_{OHG}) = 1$ iff $w_{OHG} \in W$ (0 otherwise), where $W$ is the set of possible OHG translations for $w_{OS}$ suggested by a lexicon, i.e., either:
– ETYM: etymological link in (the symmetric closure of) the etymological dictionaries
– ETYM-INDIRECT: shared German gloss in the etymological dictionaries
– TRANSLATIONAL DIRECT: link in the translational dictionaries
– TRANSLATIONAL INDIRECT: indirectly linked in the translational dictionaries through a third language
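A few of the measures above can be sketched in Python to make the definitions concrete. This is an illustrative reading, not the authors' implementation: dividing the Levenshtein distance by the length of the longer word, estimating the conditional probabilities from raw counts, measuring relative position as token index divided by section length, and all function and parameter names are assumptions. The statistical character replacement model and the normalization step are not covered.

```python
def levenshtein(a, b):
    """Classic edit distance with unit costs, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def identity(w_os, w_ohg):
    """IDENTITY: 1 iff both word forms are the same string, else 0."""
    return 1 if w_os == w_ohg else 0


def relative_levenshtein(w_os, w_ohg):
    """ORTHOGRAPHY (part): edit distance divided by the longer length (assumed)."""
    longest = max(len(w_os), len(w_ohg))
    return levenshtein(w_os, w_ohg) / longest if longest else 0.0


def geometry(pos_os, len_os, pos_ohg, len_ohg):
    """GEOMETRY: absolute difference of relative positions (assumed definition)."""
    return abs(pos_os / len_os - pos_ohg / len_ohg)


def cooccurrence(joint, count_os, count_ohg):
    """COOCCURRENCES: P(w_OS | w_OHG) * P(w_OHG | w_OS), estimated from counts."""
    if not count_os or not count_ohg:
        return 0.0
    return (joint / count_ohg) * (joint / count_os)


def lexicon_match(w_os, w_ohg, lexicon):
    """Lexicon-based measure: 1 iff w_ohg is in the set W listed for w_os."""
    return 1 if w_ohg in lexicon.get(w_os, set()) else 0


if __name__ == "__main__":
    print(relative_levenshtein("dag", "tag"))
    print(lexicon_match("dag", "tag", {"dag": {"tag"}}))
```

Here `lexicon` stands for any of the four sources (ETYM, ETYM-INDIRECT, TRANSLATIONAL DIRECT, TRANSLATIONAL INDIRECT), each represented as a mapping from an OS word to its set of candidate OHG words.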


Conclusion & Discussion

Conclusion
1. Application of the Linked Data paradigm to the modeling of etymological dictionaries
2. Adoption of the lemon core model
3. Representation of Köbler’s dictionary in a machine-readable format
4. Enrichment of etymological dictionaries with automatically obtained translation pairs
5. Initial experiment on the use of dictionaries for quasi-parallel alignment

lemon & etymology: A square peg for a round hole?
lemon gained a lot of popularity as a shared vocabulary for lexical resources in the LLOD ...
– ... but many of these resources are created by (or for) linguists rather than ontologists.
– The original motivation for lemon was to lexicalize ontologies, quite a different problem from the interoperability issues that linguists are trying to solve by using it.

lemon & etymology: A square peg for a round hole?
But obviously, our usage of lemon is slightly abusive.
1. Etymological and translational links between word forms?
2. No external ontology to ground senses?
3. No word senses at all?
But that is symptomatic of linguistic resources in a strict sense.
– Similar problems were observed by Cysouw & Moran on multilingual dictionaries for South American indigenous languages.
What can we do about this state of affairs? Would there have been alternative ways to model our data? Shall we extend/abandon/replace/adjust lemon?

Takk fyrir! (Thank you!)