The contribution of NLP
- Corpus processing
- Ontologies and terminologies
Kivik 2013 NLP. Corpus processing, Ontologies
What is NLP?
- Natural Language Processing: natural language vs. computer languages
- Other names:
  - Computational Linguistics (emphasizes the scientific, not the technological)
  - Language Engineering
  - Language Technology
NLP and linguistics
- Linguistics to NLP: supply ideas, interpret results
- NLP to linguistics: test theories, expose gaps
- Plus: NLP turns the results into technology
Example: regular morphology
- LINGUISTICS: rules mapping stems -> inflected forms
- NLP:
  - program the rules
  - apply the rules to a lexicon of stems
  - Is the output correct? Errors? Refine the theory
- Needed for: web search, spell-checkers, machine translation, speech recognition systems etc.
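The loop of programming the rules, applying them to a lexicon and checking the output can be sketched in a few lines of Python. This is a minimal illustration, not a full morphology of English: the function name `inflect`, the rule set and the toy lexicon are assumptions made for this example.

```python
VOWELS = "aeiou"

def inflect(stem):
    """Apply a few regular English verb rules to a stem (toy rule set)."""
    # True for stems such as "try": final y preceded by a consonant.
    consonant_y = len(stem) > 1 and stem.endswith("y") and stem[-2] not in VOWELS

    # Third person singular: -es after sibilants, y -> ies, otherwise -s.
    if stem.endswith(("s", "sh", "ch", "x", "z")):
        third = stem + "es"
    elif consonant_y:
        third = stem[:-1] + "ies"
    else:
        third = stem + "s"

    # Drop a final "e" before -ed and -ing (arrive -> arrived, arriving).
    base = stem[:-1] if stem.endswith("e") else stem
    past = stem[:-1] + "ied" if consonant_y else base + "ed"
    return {"3sg": third, "past": past, "ing": base + "ing"}

# Hypothetical lexicon of stems.
for stem in ["help", "arrive", "try", "stop"]:
    print(stem, inflect(stem))
```

Running the rules over the lexicon gives correct forms for help, arrive and try, but "stoped" for stop. That is the kind of error that sends us back to refine the theory, here with a consonant-doubling rule.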
Applications
- web search
  - basic search
  - filtering results
- spelling and grammar checking
- machine translation (MT)
- talk to computers
  - speech processing as well
- information extraction
  - finding facts in a database of documents
  - answering questions
How can NLP make better dictionaries?
- By pre-processing a corpus:
  - tokenization
  - sentence splitting
  - lemmatization
  - POS-tagging
  - parsing
- Each step builds on its predecessors
Tokenization
- "Identifying the words"
- from: He didn't arrive.
- to: He did n't arrive .
Automatic tokenization
- Western writing systems
  - easy! space is the separator
- Chinese, Japanese, some other writing systems
  - do not use a word separator
  - hard, like POS-tagging (below)
Why isn't space=separator enough (even for English)?
- What is a space?
  - linebreaks, paragraph breaks, tabs
- Punctuation
  - characters do not form parts of words but may be attached to words (with no spaces)
  - brackets, quotation marks
- Hyphenation
  - is co-op one word or two?
  - is well-managed?
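A small tokenizer that goes beyond splitting on spaces might look like the sketch below. It is an illustration under stated assumptions, not a production tokenizer: the name `tokenize`, the decision to keep hyphenated forms such as co-op as one token, and the treatment of n't as a separate token are choices made for this example.

```python
import re

# A word is a run of letters/digits, optionally continued by internal
# hyphens or apostrophes (co-op, didn't). Any other non-space character
# is a one-character punctuation token. Spaces, tabs and linebreaks are
# all skipped, so they all act as separators.
TOKEN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")

def tokenize(text):
    text = text.replace("\u2019", "'")  # treat the curly apostrophe like '
    tokens = []
    for tok in TOKEN.findall(text):
        # Split the clitic n't off its host word: didn't -> did + n't
        if len(tok) > 3 and tok.lower().endswith("n't"):
            tokens.extend([tok[:-3], tok[-3:]])
        else:
            tokens.append(tok)
    return tokens

print(tokenize("He didn't arrive."))
print(tokenize('(a "well-managed"\tco-op)\n'))
```

Treating co-op and well-managed as single tokens is only one possible answer to the hyphenation question. A different policy would change the regular expression, not the overall approach.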
Sentence splitting
- "Identifying the sentences"
- from: He didn't arrive.
- to: <s> He did n't arrive . </s>
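As a rough sketch, sentence splitting can be run over the token list, closing a sentence at each sentence-final punctuation mark. The function names and the set of end marks are assumptions for this example. Abbreviations such as "Dr." would need extra handling that is not attempted here.

```python
SENTENCE_END = {".", "?", "!"}

def split_sentences(tokens):
    """Group a flat list of tokens into sentences (lists of tokens)."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in SENTENCE_END:
            sentences.append(current)
            current = []
    if current:  # trailing tokens with no final punctuation
        sentences.append(current)
    return sentences

def mark_up(sentences):
    """Wrap each sentence in <s> ... </s> tags."""
    return " ".join("<s> " + " ".join(s) + " </s>" for s in sentences)

print(mark_up(split_sentences(["He", "did", "n't", "arrive", "."])))
```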
Lemmatization
- Mapping from text-word to lemma
- Lemma: help (verb)
- text-word -> lemma
  - help -> help (v)
  - helps -> help (v)
  - helping -> help (v)
  - helped -> help (v)
Lemmatization
- Mapping from text-word to lemma
- Lemmas: help (verb), help (noun), helping (noun)
- text-word -> lemma
  - help -> help (v), help (n)
  - helps -> help (v), help (n)**
  - helping -> help (v), helping (n)
  - helped -> help (v)
  - helpings -> helping (n)
- ** help (n) is usually a mass noun, but it is part of the compound home help, which is a count noun and takes the "s" ending.
Lemmatization
- Dictionary entries are for lemmas
- Matching a text-word to a dictionary-word: lemmatization
Lemmatization
- Searching by lemma
  - English: little inflection
  - French: 36 forms per verb
  - Finno-Ugric: 2000
- Not always wanted: English royalty
  - singular royalty: kings and queens
  - plural royalties: payments to authors
Automatic lemmatization
- Write rules:
  - if the word ends in "ing", delete "ing"; if the remainder is a verb lemma, add it to the list of possible lemmas
- If a detailed grammar is available, use it
- A full lemma list is also required
  - often available from dictionary companies
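The "ing" rule and the lemma-list check can be sketched as follows. The lemma list, the rule table and the name `lemmatize` are hypothetical and cover only a handful of words. A real system would take its lemma list from a dictionary and its rules from a grammar of the language.

```python
# Toy lemma list: (lemma, part of speech).
LEMMAS = {("help", "v"), ("help", "n"), ("helping", "n"), ("arrive", "v")}

# Each rule: (suffix to delete, text to add back, POS of the candidate lemma).
RULES = [
    ("", "", "v"), ("", "", "n"),        # the word may already be a lemma
    ("s", "", "v"), ("s", "", "n"),
    ("ing", "", "v"), ("ing", "e", "v"), # helping -> help, arriving -> arrive
    ("ed", "", "v"), ("ed", "e", "v"),   # helped -> help, arrived -> arrive
]

def lemmatize(word):
    """Return every (lemma, POS) that some rule licenses for this word."""
    word = word.lower()
    candidates = []
    for suffix, add_back, pos in RULES:
        if word.endswith(suffix):
            remainder = word[:len(word) - len(suffix)] + add_back
            # Only keep the remainder if it is a known lemma of that POS.
            if (remainder, pos) in LEMMAS and (remainder, pos) not in candidates:
                candidates.append((remainder, pos))
    return candidates

for w in ["help", "helps", "helping", "helped", "helpings"]:
    print(w, lemmatize(w))
```

The output is a list of possible lemmas, not a single answer: helping comes back as both help (v) and helping (n). Choosing between them needs context, which is where POS-tagging comes in.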
Part-of-speech (POS) tagging
- "Identifying parts of speech"
- from: He didn't arrive.
- to:
  - <s>
  - He: PNP (personal pronoun)
  - did: VVD (past tense verb)
  - n't: XNOT (not)
  - arrive: VV (base form of verb)
  - .: C (punctuation)
  - </s>
Tagsets
- The set of part-of-speech tags to choose between
  - Basic: noun, verb, pronoun …
  - Advanced: examples from the CLAWS English tagset
    - NN2: plural noun
    - VVG: -ing form of lexical verb
- Based on the linguistics of the language
POS-tagging: why?
- Use grammar when searching
  - Nouns modified by buckle
  - Verbs that buckle is object of
POS-tagging: how?
- Big topic for computational linguistics
  - well understood
  - taggers available for major languages
- Some taggers use lemmatized input, others do not
- Methods
  - Constraint-based: a set of rules of the form: if the previous word is "the" and VERB is one of the possibilities, delete VERB
  - Statistical: machine learning from a tagged corpus; various methods
- Ref: Manning and Schütze, Foundations of Statistical Natural Language Processing, MIT Press 1999.
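The constraint-based method can be illustrated with the single rule quoted above. In this sketch each word starts with every tag the lexicon allows, and the rule then deletes readings. The toy lexicon, the tag names and the function `tag` are assumptions for the example. The guard that never deletes a word's last remaining reading is also an added assumption, not something stated on the slide.

```python
# Hypothetical lexicon: each word maps to the set of tags it could take.
LEXICON = {
    "the": {"DET"},
    "they": {"PRON"},
    "buckle": {"NOUN", "VERB"},
}

def tag(tokens):
    """Return (token, set of surviving tags) pairs after applying the rule."""
    readings = [set(LEXICON.get(t.lower(), {"UNKNOWN"})) for t in tokens]
    for i in range(1, len(tokens)):
        # Rule: if the previous word is "the" and VERB is one of the
        # possibilities, delete VERB (unless it is the only reading left).
        if (tokens[i - 1].lower() == "the"
                and "VERB" in readings[i]
                and len(readings[i]) > 1):
            readings[i].discard("VERB")
    return list(zip(tokens, readings))

print(tag(["the", "buckle"]))   # VERB reading removed after "the"
print(tag(["they", "buckle"]))  # still ambiguous: no rule applies
```

A real constraint-based tagger has many such rules. Words that remain ambiguous after all rules have applied are where statistical methods are often brought in.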
Parsing
- Find the structure:
  - Phrase structure (trees). Example: The cat sat on the mat
  - Dependency structure (links). Example: The cat sat on the mat
Automatic parsing
- Big topic
  - see Jurafsky and Martin or another NLP textbook
- Many methods are too slow for large corpora
- Sketch Engine usually uses "shallow parsing"
  - patterns of POS-tags
  - regular expressions
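The idea of shallow parsing with regular expressions over POS-tags can be sketched as below. This is not Sketch Engine's own grammar formalism. The tag names, the one-letter encoding and the noun-phrase pattern are assumptions chosen to keep the example short.

```python
import re

# Encode each tag as one character so that a match position in the
# encoded string is also a position in the token list.
TAG_CODES = {"DET": "D", "ADJ": "A", "NOUN": "N", "VERB": "V", "PREP": "P"}

# Noun phrase pattern: optional determiner, any adjectives, one or more nouns.
NOUN_PHRASE = re.compile(r"D?A*N+")

def noun_phrases(tagged):
    """Return the noun phrases found in a list of (word, tag) pairs."""
    codes = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged)
    return [
        " ".join(word for word, _ in tagged[m.start():m.end()])
        for m in NOUN_PHRASE.finditer(codes)
    ]

sentence = [("The", "DET"), ("cat", "NOUN"), ("sat", "VERB"),
            ("on", "PREP"), ("the", "DET"), ("mat", "NOUN")]
print(noun_phrases(sentence))
```

Because the pattern is matched in a single left-to-right pass over the tag string, this kind of analysis stays fast on large corpora, at the cost of finding only flat chunks, not full trees.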
Summary
- What is NLP?
- How can it help?
  - Tokenizing
  - Sentence splitting
  - Lemmatizing
  - POS-tagging
  - Parsing
Ontologies and Terminology, and how they relate to lexicography
Terminology
- Contains terms for the objects and concepts in a domain
  - organized according to relations between objects
- Different language
  - same objects, so same organization
  - different terms
Ontology
- Artificial Intelligence
- Like terminology, with reasoning
  - Tweety is-a swallow
  - A swallow is-a bird
  - Birds fly
  - Inference: Tweety flies
- The rationalist dream of automated reasoning
- Hierarchy: Bird (flies) > swallow, robin, … > Tweety
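The Tweety inference amounts to following is-a links upward and inheriting properties along the way. A minimal sketch, assuming a single-parent hierarchy with no cycles; the dictionary layout and the function names are made up for this example.

```python
# is-a links from the slide (each item has at most one parent here).
IS_A = {"Tweety": "swallow", "swallow": "bird", "robin": "bird"}

# Properties stated directly on a node.
PROPERTIES = {"bird": {"flies"}}

def ancestors(item):
    """Follow is-a links upward: Tweety -> swallow -> bird."""
    chain = []
    while item in IS_A:
        item = IS_A[item]
        chain.append(item)
    return chain

def holds(item, prop):
    """A property holds if it is stated on the item or on any ancestor."""
    return any(prop in PROPERTIES.get(node, set())
               for node in [item] + ancestors(item))

print(ancestors("Tweety"))
print(holds("Tweety", "flies"))
```

Note that this simple inheritance has no way to express exceptions, such as a bird that does not fly.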
Ontology
- Chris is-a dentist
- Chris has-practice in Lancing
- Chris works 9am-3pm Mon-Fri
- …
- You live-near Lancing
- You want-to-visit dentist
- You are-available …
- Inference: Appointment, you, Chris, Lancing, 10am, Thursday
Items in an ontology
- Defined by relations in the ontology
- Labelled (only) by words/phrases in various languages
  - X1: EN: bird, FR: oiseau
  - X2: EN: swallow, FR: hirondelle
  - …
- Ontology/things: language independent
Mismatches and gaps
- Y1: EN: body parts, SP: …
  - Y2: SP: dedo (no single English word)
    - Y3: EN: finger
    - Y4: EN: toe
  - Y5: EN: arm, SP: brazo
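One way to picture a gap is to store the ontology items separately from their per-language labels, so that a missing label is simply absent. The sketch below uses three of the items from the slide. The parent links (finger and toe placed under the item labelled dedo) and the function names are assumptions made for illustration.

```python
# Language-independent items, labelled per language where a word exists.
LABELS = {
    "Y2": {"SP": "dedo"},
    "Y3": {"EN": "finger"},
    "Y4": {"EN": "toe"},
}

# Assumed hierarchy: finger (Y3) and toe (Y4) are kinds of Y2.
PARENT = {"Y3": "Y2", "Y4": "Y2"}

def label(item, lang):
    """Return the word for an item in a language, or None if there is a gap."""
    return LABELS.get(item, {}).get(lang)

def nearest_label(item, lang):
    """Climb the hierarchy until some item has a label in the language."""
    while item is not None:
        word = label(item, lang)
        if word is not None:
            return item, word
        item = PARENT.get(item)
    return None

print(label("Y2", "EN"))          # a gap: no English word for Y2
print(nearest_label("Y3", "SP"))  # Spanish falls back to the broader item
```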
Thesaurus (e.g. Roget)
- Looks like a simple ontology
  - hierarchy only
- Supports inference?
  - usually fudged
- Language independent?
WordNet
- Princeton University project, from ca. 1990
- Thesaurus
  - synonym sets, or synsets
  - hyponyms/hyperonyms, antonyms, part-of, other lexical relations
- Free, online and available for download
- Very widely used
- Replicated for many languages (Global WordNet Association)
Lexicon/dictionary
- About words
- Organized by words
- Language specific
Rationalists and Empiricists
- Rationalists: Structure, Depth, Logic, Semantic Web, Terminology
- Empiricists: Data, Breadth, Statistics, Google, Lexicography
Terminology and Lexicography
- Terminology
  - What is the thing called in languages x, y, z?
  - What kind of thing is it?
    - is-a link
    - its place in the ontology
  - Well-structured hierarchy
- Lexicography
  - How does the word behave?
  - What does it denote?
  - Where does it occur?
Synthesis
- Thesis
  - Ontology, terminology, taxonomical lexicography
  - Semantic Web, Roget, WordNets
- Antithesis
  - Corpus lexicography
- Synthesis: integrating
  - language-independent structure
  - language-specific word/phrase behaviour
  - Examples: corpus-based terminology, FrameNet
Summary
- A scale from words to things:
  - Lexicon (words)
  - Thesaurus/Terminology
  - Ontology (things)