Published by Norah Harmon. Modified over 9 years ago.
1
The contribution of NLP
Corpus processing
Ontologies and terminologies
Kivik 2013 NLP. Corpus processing, Ontologies
2
What is NLP?
Natural Language Processing
  natural language vs. computer languages
Other names
  Computational Linguistics
    emphasizes scientific not technological
  Language Engineering
  Language Technology
3
NLP and linguistics
LING:
  supply ideas
  interpret results
NLP:
  test theories
  expose gaps
  plus turn into technology
4
Example: regular morphology
LINGUISTICS:
  Rules: stems -> inflected forms
NLP:
  program the rules
  apply rules to a lexicon of stems
  Is the output correct? Errors?
  refine the theory
Needed for: web search, spell-checkers, machine translation, speech recognition systems etc.
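The loop on this slide (program the rules, apply them to a lexicon of stems, inspect the errors) can be sketched in a few lines of Python. This is only an illustrative sketch: the rules are a deliberately simplified account of regular English verb inflection, and the function name `inflect_regular_verb` and the five-stem `LEXICON` are hypothetical, not taken from the presentation.

```python
import re

def inflect_regular_verb(stem):
    """Apply simplified regular English verb rules to a stem (illustrative only)."""
    consonant_y = re.search(r"[^aeiou]y$", stem) is not None
    # Past tense and -ing form
    if stem.endswith("e"):
        past, ing = stem + "d", stem[:-1] + "ing"
    elif consonant_y:
        past, ing = stem[:-1] + "ied", stem + "ing"
    else:
        past, ing = stem + "ed", stem + "ing"
    # Third person singular
    if re.search(r"(s|x|z|ch|sh)$", stem):
        third = stem + "es"
    elif consonant_y:
        third = stem[:-1] + "ies"
    else:
        third = stem + "s"
    return {"base": stem, "3sg": third, "past": past, "ing": ing}

# Hypothetical mini-lexicon of stems
LEXICON = ["help", "arrive", "watch", "try", "stop"]

for stem in LEXICON:
    print(inflect_regular_verb(stem))
```

Running this over the lexicon gives correct forms for help, arrive, watch and try, but produces "stoped" and "stoping" for stop. That is the kind of error the slide has in mind: the output exposes a gap (consonant doubling) and the rules have to be refined.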
5
Applications
web search
  Basic search
  Filtering results
spelling and grammar checking
machine translation (MT)
talk to computers
  speech processing as well
information extraction
  finding facts in a database of documents
  answering questions
6
How can NLP make better dictionaries?
By pre-processing a corpus:
  tokenization
  sentence splitting
  lemmatization
  POS-tagging
  parsing
Each step builds on predecessors
7
Tokenization
“identifying the words”
from: He didn't arrive.
to:
  He
  did
  n’t
  arrive
  .
8
Automatic tokenization
Western writing systems
  easy! space is separator
Chinese, Japanese, some other writing systems
  do not use word-separator
  hard
  like POS-tagging (below)
9
Why isn't space=separator enough (even for English)?
what is a space
  linebreaks, paragraph breaks, tabs
Punctuation
  characters do not form parts of words but may be attached to words (with no spaces)
  brackets, quotation marks
Hyphenation
  is co-op one word or two?
  is well-managed?
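The points above (any whitespace counts as a space, punctuation is attached to words, hyphenated forms are a judgment call) can be made concrete with a small regular-expression tokenizer. This is a minimal sketch for English only. The pattern and the name `tokenize` are assumptions for illustration, and the choice to keep hyphenated forms as one token is just one possible answer to the question the slide raises.

```python
import re

# Alternatives are tried left to right at each position in the text.
TOKEN_PATTERN = re.compile(
    r"n['’]t\b"            # the clitic n't as its own token
    r"|\w+(?=n['’]t\b)"    # the word part before n't, e.g. "did" in "didn't"
    r"|\w+(?:-\w+)*"       # ordinary words; hyphenated forms stay together
    r"|[^\w\s]"            # any other non-space character, one per token
)

def tokenize(text):
    """Return the list of tokens; spaces, tabs and linebreaks are all skipped."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("He didn't arrive."))
print(tokenize("a well-managed\tco-op\n(here)"))
```

The first call splits "didn't" into "did" and "n't" and detaches the full stop, matching the tokenization example earlier in the presentation. The second shows tabs and linebreaks treated as separators and brackets split off from the word they are attached to.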
10
Sentence splitting
“identifying the sentences”
from: He didn't arrive.
to: <s> He did n’t arrive . </s>
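A minimal sketch of sentence splitting over an already tokenized text follows. It assumes that a sentence ends at every ".", "?" or "!" token, which is a simplification: abbreviations such as "Dr." would be split wrongly. The names `split_sentences` and `mark_up` are hypothetical.

```python
SENTENCE_FINAL = {".", "?", "!"}  # assumed sentence-ending tokens

def split_sentences(tokens):
    """Group a flat token list into sentences, closing one at each final token."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in SENTENCE_FINAL:
            sentences.append(current)
            current = []
    if current:  # trailing tokens with no final punctuation
        sentences.append(current)
    return sentences

def mark_up(sentences):
    """Wrap each sentence in <s> ... </s>, as in the slide's example."""
    return " ".join("<s> " + " ".join(s) + " </s>" for s in sentences)

tokens = ["He", "did", "n't", "arrive", ".", "Why", "not", "?"]
print(mark_up(split_sentences(tokens)))
```

The input is a token list rather than raw text, which reflects the point that each pre-processing step builds on its predecessors.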
11
Lemmatization
Mapping from text-word to lemma
help (verb)
  text-word   lemma
  help        help (v)
  helps       help (v)
  helping     help (v)
  helped      help (v)
12
Lemmatization
Mapping from text-word to lemma
help (verb), help (noun), helping (noun)
  text-word   lemma
  help        help (v), help (n)
  helps       help (v), help (n)**
  helping     help (v), helping (n)
  helped      help (v)
  helpings    helping (n)
** help (n): usually a mass noun, but part of the compound "home help", which is a count noun and takes the "s" ending.
13
Lemmatization
Dictionary entries are for lemmas
Match between text-word and dictionary-word: lemmatization
14
Lemmatization
Searching by lemma
  English: little inflection
  French: 36 forms per verb
  Finno-Ugric: 2000
Not always wanted: English royalty
  singular: kings and queens
  plural royalties: payments to authors
15
Automatic lemmatization
Write rules:
  if word ends in "ing", delete "ing"; if the remainder is verb lemma, add to list of possible lemmas
If detailed grammar available, use it
Full lemma list is also required
  Often available from dictionary companies
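The "ing" rule above, together with analogous rules for "ed" and "s", can be sketched as follows. The tiny `VERB_LEMMAS` and `NOUN_LEMMAS` sets stand in for the full lemma list the slide says is required; they and the function name `possible_lemmas` are hypothetical. Trying the remainder plus "e" (so that "arriving" finds "arrive") is an added assumption, not a rule stated in the presentation.

```python
# Hypothetical stand-ins for a full lemma list from a dictionary
VERB_LEMMAS = {"help", "arrive", "sing"}
NOUN_LEMMAS = {"help", "helping", "king"}

def possible_lemmas(word):
    """Return all (lemma, part-of-speech) candidates licensed by the rules."""
    word = word.lower()
    candidates = set()
    # The word may already be a lemma.
    if word in VERB_LEMMAS:
        candidates.add((word, "v"))
    if word in NOUN_LEMMAS:
        candidates.add((word, "n"))
    # Verb rules: delete the ending; keep the remainder only if it is a verb lemma.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            for verb in (stem, stem + "e"):
                if verb in VERB_LEMMAS:
                    candidates.add((verb, "v"))
    # Noun rule: plural "s".
    if word.endswith("s") and word[:-1] in NOUN_LEMMAS:
        candidates.add((word[:-1], "n"))
    return sorted(candidates)

for text_word in ["help", "helps", "helping", "helped", "helpings", "king"]:
    print(text_word, possible_lemmas(text_word))
```

The check against the lemma list is what stops "king" from being reduced to a non-existent verb "k". Ambiguous words such as "helping" keep more than one candidate, so a later step is still needed to choose between them.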
16
Part-of-speech (POS) tagging
“identifying parts of speech”
from: He didn't arrive.
to:
  <s>
  He       PNP    pers pronoun
  did      VVD    past tense verb
  n’t      XNOT   not
  arrive   VV     base form of verb
  .        C      punctuation
  </s>
17
Tagsets
The set of part-of-speech tags to choose between
Basic: noun, verb, pronoun …
Advanced: examples - CLAWS English tagset
  NN2   plural noun
  VVG   -ing form of lexical verb
Based on linguistics of the language.
18
POS-tagging: why?
Use grammar when searching
  Nouns modified by buckle
  Verbs that buckle is object of
19
POS-tagging: how?
Big topic for computational linguistics
  well understood
  taggers available for major languages
Some taggers use lemmatized input, others do not
Methods
  Constraint-based: set of rules of the form
    if previous word is "the" and VERB is one of the possibilities, delete VERB
  Statistical:
    Machine learning from tagged corpus
    Various methods
Ref: Manning and Schütze, Foundations of Statistical Natural Language Processing, MIT Press 1999.
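The constraint-based rule quoted above can be sketched directly. Each word starts with every tag its lexicon entry allows, and the rule then deletes VERB after "the". The four-word `LEXICON`, the coarse tag names and the function `tag_with_constraints` are hypothetical. The guard that never deletes a word's last remaining tag is an added assumption.

```python
# Hypothetical lexicon: each word lists all the tags it could carry
LEXICON = {
    "the": {"DET"},
    "buckle": {"NOUN", "VERB"},
    "can": {"NOUN", "VERB"},
    "broke": {"VERB"},
}

def tag_with_constraints(words):
    """Start from all possible tags per word, then delete tags ruled out by context."""
    readings = [set(LEXICON.get(w.lower(), {"UNKNOWN"})) for w in words]
    for i in range(1, len(words)):
        # Rule: if previous word is "the" and VERB is one of the possibilities, delete VERB.
        after_the = words[i - 1].lower() == "the"
        if after_the and "VERB" in readings[i] and len(readings[i]) > 1:
            readings[i].discard("VERB")
    return list(zip(words, readings))

for word, tags in tag_with_constraints(["The", "buckle", "broke"]):
    print(word, sorted(tags))
```

A real constraint-based tagger would apply many such rules, and some words would still be left with more than one tag. A statistical tagger instead learns which tag is most likely in context from a tagged corpus.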
20
Parsing
Find the structure:
  Phrase structure (trees)
    The cat sat on the mat
  Dependency structure (links)
    The cat sat on the mat
21
Automatic parsing
Big topic
  see Jurafsky and Martin or other NLP textbook
Many methods too slow for large corpora
Sketch Engine usually uses “shallow parsing”
  Patterns of POS-tags
  Regular expressions
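The idea of shallow parsing as regular expressions over POS-tag patterns can be illustrated with a small noun-phrase finder. This is a generic sketch and not Sketch Engine's actual grammar formalism: the one-letter tag codes, the `NOUN_PHRASE` pattern and the function `noun_chunks` are all assumptions made for the example.

```python
import re

# Each coarse tag is mapped to one letter, so that a position in the
# code string is also a position in the token list.
TAG_CODES = {"DET": "D", "ADJ": "A", "NOUN": "N", "VERB": "V", "PREP": "P"}

# Optional determiner, any number of adjectives, one or more nouns.
NOUN_PHRASE = re.compile(r"D?A*N+")

def noun_chunks(tagged):
    """Return the word sequences whose tag sequence matches NOUN_PHRASE."""
    codes = "".join(TAG_CODES.get(tag, "X") for _, tag in tagged)
    return [
        " ".join(word for word, _ in tagged[m.start():m.end()])
        for m in NOUN_PHRASE.finditer(codes)
    ]

sentence = [
    ("The", "DET"), ("cat", "NOUN"), ("sat", "VERB"),
    ("on", "PREP"), ("the", "DET"), ("mat", "NOUN"),
]
print(noun_chunks(sentence))  # ['The cat', 'the mat']
```

The pattern finds the two noun phrases without building a full tree, which is why this kind of matching stays fast enough for large corpora.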
22
Summary
What is NLP?
How can it help?
  Tokenizing
  Sentence splitting
  Lemmatizing
  POS-tagging
  Parsing
23
Ontologies and Terminology
and how they relate to lexicography
24
Terminology
Contains terms for the objects and concepts in a domain
  organized according to relations between objects
Different language
  Same objects, so
  Same organization
  Different terms
25
Ontology
Artificial Intelligence
Like terminology, with reasoning
  Tweety is-a swallow
  A swallow is-a bird
  Birds fly
Inference
  Tweety flies
the rationalist dream of automated reasoning
  Bird (flies)
    swallow
      Tweety
    robin
    …
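The Tweety inference can be sketched as property inheritance along is-a links. This is a minimal illustration assuming each item has at most one parent. The dictionaries `IS_A` and `PROPERTIES` and the function `inferred_properties` are hypothetical names, and real ontology reasoners are far richer.

```python
# is-a links from the slide: each item points to its parent
IS_A = {"Tweety": "swallow", "swallow": "bird", "robin": "bird"}

# Properties stated directly on an item
PROPERTIES = {"bird": {"flies"}}

def inferred_properties(item):
    """Collect properties of the item and of everything above it in the hierarchy."""
    found, seen = set(), set()
    while item is not None and item not in seen:  # 'seen' guards against cycles
        seen.add(item)
        found |= PROPERTIES.get(item, set())
        item = IS_A.get(item)
    return found

print(inferred_properties("Tweety"))  # {'flies'}
```

Nothing states directly that Tweety flies. The conclusion follows from walking Tweety is-a swallow and swallow is-a bird, then picking up the property stated for birds.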
26
Ontology
  Chris is-a dentist
  Chris has-practice in Lancing
  Chris works 9am-3pm Mon-Fri
  …
  You live-near Lancing
  You want-to-visit dentist
  You are-available …
Inference
  Appointment, you, Chris, Lancing, 10am, Thursday
27
Items in an ontology
Defined by relations in ontology
Labelled (only) by words/phrases in various languages
  X1   EN: bird      FR: oiseau
  X2   EN: swallow   FR: hirondelle
  …
Ontology/things: language independent
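One way to picture language-independent items that are only labelled by words is a table keyed by item identifier. The sketch below uses the two items from the slide; the dictionary `LABELS` and the function `translate` are hypothetical, and looking a term up by going through the item is just one use of such a structure.

```python
# Items are identified by language-independent IDs; words are only labels.
LABELS = {
    "X1": {"EN": "bird", "FR": "oiseau"},
    "X2": {"EN": "swallow", "FR": "hirondelle"},
}

def translate(term, source, target):
    """Find the item labelled `term` in `source` and return its `target` label.

    Returns None if no item has that label or the item has no label in `target`.
    """
    for labels in LABELS.values():
        if labels.get(source) == term:
            return labels.get(target)
    return None

print(translate("swallow", "EN", "FR"))  # hirondelle
```

The lookup never compares an English word with a French word directly. Both are reached only through the item they label.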
28
Mismatches and gaps
  Y1   EN: body parts   SP: …
    Y2   SP: dedo
      Y3   EN: finger
      Y4   EN: toe
    Y5   EN: arm   SP: brazo
29
Thesaurus (eg Roget)
Looks like a simple ontology
  hierarchy only
  supports inference? usually fudged
Language independent?
30
WordNet
Princeton Univ project, from ca 1990
Thesaurus
  Synonym sets or synsets
  Hyponyms/hyperonyms, antonyms, part-of, other lexical relations
Free, online and available for download
Very widely used
Replicated for many languages, Global WN Assn
31
Lexicon/dictionary
About words
Organized by words
Language specific
32
Rationalists     Empiricists
Structure        Data
Depth            Breadth
Logic            Statistics
Semantic Web     Google
Terminology      Lexicography
33
Terminology
  What is the thing called in languages x, y, z
  What kind of thing is it
    Is-a link
    Its place in ontology
  Well-structured hierarchy
Lexicography
  How does the word behave?
  What does it denote?
  Where does it occur?
34
Synthesis
Thesis
  Ontology, terminology, taxonomical lexicography
  Semantic web, Roget, WordNets
Antithesis
  Corpus lexicography
Synthesis: integrating
  language-independent structure
  language-specific word/phrase behaviour
  Corpus-based terminology
  FrameNet
35
Summary
words
  Lexicon
  Thesaurus/Terminology
  Ontology
things