Computational Lexicography
Frank Van Eynde
Centre for Computational Linguistics



OUTLINE
1. The token/type distinction
2. Lexicographic practice
3. Computational lexica
4. Lexical databases
5. Lexical knowledge acquisition
6. The use of lexica in text-to-speech

1. Tokens vs. types
(1) The girl gave the flowers to the athlete.
- 3 tokens of "the": properties are context-specific
- 1 type: properties are generalizations over the various uses
Heracleitos vs. Plato
(2) The sooner they come, the better it is.
"the" in (1): NL de, het vs. "the" in (2): NL hoe
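
The token/type count for example (1) can be reproduced with a few lines of Python. This is only a minimal sketch: the `tokenize` helper is a hypothetical name, and the sketch assumes that lower-casing is enough to treat "The" and "the" as one type, which is exactly the kind of abstraction decision the slide is about.

```python
import re
from collections import Counter


def tokenize(sentence: str) -> list[str]:
    """Lower-case the sentence and return its word tokens (punctuation is dropped)."""
    return re.findall(r"[a-z']+", sentence.lower())


sentence = "The girl gave the flowers to the athlete."
tokens = tokenize(sentence)      # every occurrence counts as a token
type_counts = Counter(tokens)    # one key per distinct type

print(len(tokens))               # number of tokens in the sentence
print(len(type_counts))          # number of types in the sentence
print(type_counts["the"])        # number of tokens of the type "the"
```

Note that a purely orthographic count like this cannot tell the article in (1) from the correlative "the" in (2); both collapse into one spelling.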

1. Tokens vs. types
(3) I do not think that the dog of that girl is really that dangerous.
that: FR que vs. FR ce/cette vs. FR si
(4) Je ne pense pas que le chien de cette fille est vraiment si dangereux.

1. Tokens vs. types
The abstraction problem: given a word W, how many types do we have to distinguish?
(5) It is not far from here.
(6) We didn't go far.
(7) He's living in the Far West.
(8) Paris is far more expensive than Dublin.
far: NL ver vs. NL veel

1. Tokens vs. types
(9) De bal van de finale wordt verkocht op het bal van de FIFA.
['The ball of the final is being sold at the FIFA ball.']
bal: IT palla vs. IT ballo
(10) La palla del finale sarà venduta al ballo della FIFA.

1. Tokens vs. types
(11) That girl has been very lucky.
(12) That girl has a lot of hair.
has in (11): IT avere/essere; has in (12): IT avere

1. Tokens vs. types
(13) The pen is in my pocket.
(14) The pig is in the pen.
pen in (13): NL pen; pen in (14): NL hok

2. Lexicographic practice
The entries of pen and peg in the Oxford Advanced Learner's Dictionary of Current English.
Homonymy vs. polysemy
Problem: for any given ORTH, how many n and how many m does one have to distinguish?
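
One way to picture the homonymy/polysemy problem is as a two-level entry structure. The sketch below rests on an assumption the transcript does not spell out, namely that n numbers the homonyms of one spelling (ORTH) and m numbers the senses within a homonym. The class names and glosses are illustrative paraphrases, not text from either dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class Sense:
    number: int   # m: sense number within one homonym (assumed reading of "m")
    gloss: str    # illustrative paraphrase, not a dictionary quotation


@dataclass
class Homonym:
    orth: str     # the spelling shared by all homonyms
    number: int   # n: homonym number for this spelling (assumed reading of "n")
    senses: list[Sense] = field(default_factory=list)


# Homonymy: two unrelated words spelled "pen".
# Polysemy: one of them has more than one related sense.
entries = [
    Homonym("pen", 1, [Sense(1, "instrument for writing with ink")]),
    Homonym("pen", 2, [Sense(1, "small enclosure for farm animals"),
                       Sense(2, "playpen for a small child")]),
]


def lookup(orth: str) -> list[Homonym]:
    """Return every homonym stored under the given spelling."""
    return [entry for entry in entries if entry.orth == orth]


for homonym in lookup("pen"):
    print(homonym.orth, homonym.number, [s.gloss for s in homonym.senses])
```

Different lexicographers can defend different values of n and m for the same spelling, since the structure itself does not decide where one word ends and the next begins.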

2. Lexicographic practice
The entries of pen and peg in the Collins Cobuild Dictionary of the English Language.
There is no one-to-one correspondence between the senses in the two dictionaries.

3. Computational Lexica
Dictionaries are made for people who already understand (much of) the language.
Computational lexica are made for machines that do not understand (anything of) the language.
Consequence: an NLP system can only make sense of information that is presented in the notation (or format) it employs for processing the language.

3. Computational Lexica
POS tagger
The entry for ik in Van Dale
The entry for ik in the lexicon of the Spoken Dutch Corpus
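
The contrast between the two entries for ik can be illustrated by what a tagger needs to do with a lexicon entry: read it mechanically. The sketch below is hedged in two ways. The tag string is modelled on the attribute-list notation used for Dutch corpus tagging and is an assumption, since the transcript does not reproduce the actual entry; and `parse_tag` is a hypothetical helper.

```python
import re


def parse_tag(tag: str) -> dict:
    """Split a tag of the form POS(feat1,feat2,...) into its part of speech and features."""
    match = re.fullmatch(r"(\w+)\((.*)\)", tag)
    if match is None:
        # A bare tag without a feature list.
        return {"pos": tag, "features": []}
    pos, feats = match.groups()
    return {"pos": pos, "features": [f for f in feats.split(",") if f]}


# Assumed entry: a personal pronoun, nominative, full form, first person, singular.
lexicon = {"ik": "VNW(pers,pron,nomin,vol,1,ev)"}

entry = parse_tag(lexicon["ik"])
print(entry["pos"])
print(entry["features"])
```

A human reader of a printed dictionary needs none of this explicit structure, whereas the tagger can use nothing else.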

4. Lexical databases
Computational lexica are often task-specific and application-dependent.
The need for reusability, maintainability, extensibility
Creation of a lexical database which is sufficiently general and abstract to be reusable, maintainable and easily extensible
Two aspects of abstractness: theory-neutral and level-independent

4. Lexical databases
Lexical knowledge representation languages
- DATR (Gazdar and Evans)
- Typed feature structures (HPSG)
The number of lexical entries for any given natural language is enormous.
The information to be captured in each lexical entry is detailed and complex.

4. Lexical databases
WordNet
- English nouns, verbs, adjectives and adverbs
- Inspired by psycholinguistic and computational theories of human lexical memory
- Organized into synonym sets, each representing one underlying concept
- Example: call
Extension to other languages: EuroWordNet
Application to Dutch: Cornetto
Other initiatives: FrameNet and VerbNet
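
The organisation into synonym sets can be sketched with a toy index. The three sets below are hand-made for illustration and are not copied from WordNet; the identifiers and the `synonyms` function are likewise hypothetical.

```python
from collections import defaultdict

# Toy synonym sets: each set stands for one underlying concept.
# Hand-made illustration, not actual WordNet data.
SYNSETS = {
    "call.v.telephone": {"call", "telephone", "phone", "ring"},
    "call.v.name": {"call", "name"},
    "call.v.visit": {"call", "visit"},
}

# Inverted index: word form -> identifiers of the synsets it belongs to.
index = defaultdict(set)
for synset_id, words in SYNSETS.items():
    for word in words:
        index[word].add(synset_id)


def synonyms(word: str) -> set[str]:
    """Collect all words that share at least one synset with the given word."""
    related = set()
    for synset_id in index.get(word, ()):
        related |= SYNSETS[synset_id]
    return related - {word}


print(sorted(index["call"]))      # the concepts that "call" can express
print(sorted(synonyms("phone")))  # words sharing a concept with "phone"
```

In this layout a polysemous form such as call is simply a word that occurs in several sets, so the sense inventory falls out of the set membership.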

5. Lexical knowledge acquisition
- from scratch
- from a machine-readable dictionary
- from an agency for the distribution of resources (TST, ELRA and LDC)
- inductive: from a partial lexicon and a corpus
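
The inductive route can be sketched as follows: find corpus words that the partial lexicon lacks and propose a category for each by analogy with known words. The suffix-voting heuristic used here is an assumption chosen for brevity, not the method of the presentation, and the lexicon, corpus and category labels are toy data.

```python
from collections import Counter

# Toy partial lexicon (word -> category) and toy corpus; both are illustrative.
partial_lexicon = {
    "walked": "V", "talked": "V",
    "quickly": "ADV", "slowly": "ADV",
    "the": "DET", "dog": "N",
}
corpus = "the dog jumped quickly and the cat barked loudly".split()


def guess_category(word, lexicon, suffix_len=2):
    """Propose the majority category of known words ending in the same suffix, else None."""
    suffix = word[-suffix_len:]
    votes = Counter(cat for known, cat in lexicon.items() if known.endswith(suffix))
    return votes.most_common(1)[0][0] if votes else None


# Words seen in the corpus but missing from the lexicon, with a proposed category.
proposals = {
    word: guess_category(word, partial_lexicon)
    for word in corpus
    if word not in partial_lexicon
}

for word, category in proposals.items():
    print(word, category)
```

Words that receive no proposal (None) are the ones that would still need manual treatment or a richer source of evidence.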

6. Lexica in text-to-speech
written text
-> text normalisation
expanded graphemic representation
-> tagging & syntactic analysis
graphemic representation with prosody
-> grapheme-to-phoneme
sequence of phonemes, incl. lexical stress
-> speech synthesis
fluent speech
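
Two of these stages lean directly on lexical lookup and can be sketched in a few lines: text normalisation and grapheme-to-phoneme conversion. The tables below are toy data (the transcriptions follow an ARPAbet-style notation with stress digits), the function names are hypothetical, and tagging, prosody and synthesis are left out of the sketch altogether.

```python
# Toy normalisation table: abbreviations and digits to their expanded form.
EXPANSIONS = {"dr.": "doctor", "2": "two"}

# Toy pronunciation lexicon with lexical stress marked by digits (illustrative).
PRONUNCIATIONS = {
    "doctor": "D AA1 K T ER0",
    "two": "T UW1",
    "pens": "P EH1 N Z",
}


def normalise(text: str) -> list[str]:
    """Written text -> expanded graphemic representation (lower-cased word list)."""
    return [EXPANSIONS.get(token, token) for token in text.lower().split()]


def to_phonemes(words: list[str]) -> list[str]:
    """Expanded words -> phoneme strings; words missing from the lexicon are flagged."""
    return [PRONUNCIATIONS.get(word, "<unknown>") for word in words]


words = normalise("Dr. Smith has 2 pens")
phonemes = to_phonemes(words)

print(words)
print(phonemes)
```

The "<unknown>" slots mark where a real system would need either a larger lexicon or letter-to-sound rules.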