Published by Conrad Goodman. Modified over 9 years ago.
1
Kilgarriff: NLP and Corpus Processing (Auckland 2012)
The contribution of NLP: corpus processing
2
What is NLP?
Natural Language Processing
– natural language vs. computer languages
Other names
– Computational Linguistics: emphasizes the scientific, not the technological
– Language Engineering: official European Union term, ca. 1996-99
– Language Technology
3
NLP and linguistics
Linguistics to NLP: supply ideas, interpret results
NLP to linguistics: test theories, expose gaps
Plus, NLP: turn into technology
4
Example: regular morphology
LINGUISTICS:
– Rules: stems -> inflected forms
NLP:
– program the rules
– apply rules to a lexicon of stems
– Is the output correct? Errors?
LINGUISTICS:
– refine the theory
Needed for: web search, spell-checkers, machine translation, speech recognition systems, etc.
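The loop above (program the rules, apply them to a lexicon, inspect the errors) can be sketched in a few lines. This is a minimal illustration, not the rules from any particular grammar: the function name `inflect`, the three rules, and the four-word lexicon are all assumptions made for the example.

```python
def inflect(stem):
    """Apply deliberately naive regular English verb rules to a stem."""
    if stem.endswith("e"):
        past, ing = stem + "d", stem[:-1] + "ing"
    else:
        past, ing = stem + "ed", stem + "ing"
    # Stems ending in a sibilant take "es" in the third person singular.
    sibilant = stem.endswith(("s", "x", "z", "ch", "sh"))
    third = stem + ("es" if sibilant else "s")
    return {"3sg": third, "past": past, "ing": ing}


# Hypothetical lexicon of stems; a real one would come from a dictionary.
lexicon = ["help", "arrive", "watch", "stop"]

for stem in lexicon:
    print(stem, inflect(stem))
```

The output for "stop" ("stoped", "stoping") is wrong, which is exactly the kind of gap the NLP step exposes: the rule set needs a consonant-doubling rule, and the theory goes back to linguistics for refinement.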
5
Applications
web search
– Basic search
– Filtering results
spelling and grammar checking
machine translation (MT)
talking to computers
– speech processing as well
information extraction (IE)
– finding facts in a database of documents; populating a database, answering questions
6
How can NLP make better dictionaries?
By pre-processing a corpus:
– tokenization
– sentence splitting
– lemmatization
– POS-tagging
– parsing
Each step builds on its predecessors
7
Tokenization
"identifying the words"
from: He didn't arrive.
to: He did n’t arrive .
8
Automatic tokenization
Western writing systems
– easy! space is separator
Chinese, Japanese
– do not use a word-separator
– hard, like POS-tagging (below)
9
Why isn't space=separator enough (even for English)?
– What is a space?
  Line breaks, paragraph breaks, tabs
– Punctuation
  No space between it and the word
  brackets, quotation marks
– Hyphenation
  co-op? well-managed?
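A minimal sketch of a tokenizer that addresses these points, using only Python's standard `re` module. The pattern and the function name `tokenize` are assumptions for illustration. It treats any whitespace as a separator, splits punctuation from the adjacent word, splits off "n't" as in the earlier example, and keeps hyphenated words whole, which is only one possible answer to the "co-op?" question.

```python
import re

TOKEN = re.compile(r"""
    \w+?(?=n['’]t\b)   # word stem before a negative clitic, e.g. did + n't
  | n['’]t\b           # the clitic itself, as its own token
  | \w+(?:-\w+)*       # ordinary words, keeping internal hyphens
  | [^\w\s]            # any other single non-space character (punctuation)
""", re.VERBOSE)


def tokenize(text):
    """Return the list of tokens in text; all whitespace is discarded."""
    return TOKEN.findall(text)


print(tokenize("He didn't arrive."))
print(tokenize("(a well-managed\tco-op)"))
```

Tabs and line breaks are covered by `\s`, so they need no special handling here.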
10
Sentence splitting
"identifying the sentences"
from: he didn't arrive.
to: He did n’t arrive.
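As a rough sketch of automatic sentence splitting, one common heuristic is to break after ".", "!" or "?" when followed by whitespace and a capital letter. The heuristic, the pattern `BOUNDARY`, and the function name `split_sentences` are assumptions for illustration, not the method used in the slides.

```python
import re

# A boundary is whitespace preceded by . ! or ? and followed by a capital.
BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(text):
    """Split text into sentences with a simple punctuation heuristic."""
    return [s.strip() for s in BOUNDARY.split(text) if s.strip()]


print(split_sentences("He didn't arrive. She left early! Why?"))
print(split_sentences("Dr. Smith arrived."))
```

The second call shows the heuristic's weakness: the abbreviation "Dr." is wrongly treated as the end of a sentence, so practical splitters add abbreviation lists or learned models.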
11
Lemmatization
Mapping from text-word to lemma: help (verb)
text-word   lemma
help        help (v)
helps       help (v)
helping     help (v)
helped      help (v)
12
Lemmatization
Mapping from text-word to lemma: help (verb), help (noun), helping (noun)
text-word   lemma
help        help (v), help (n)
helps       help (v), help (n)**
helping     help (v), helping (n)
helped      help (v)
helpings    helping (n)
** help (n): usually a mass noun, but part of the compound "home help", which is a count noun and takes the "s" ending.
13
Lemmatization
Dictionary entries are for lemmas
Match between text-word and dictionary-word: lemmatization
14
Lemmatization
Searching by lemma
– English: little inflection
– French: 36 forms per verb
– Finno-Ugric: 2000
Not always wanted:
– English royalty
  singular: kings and queens
  plural royalties: payments to authors
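A small sketch of what searching by lemma does, and why it is not always wanted. The lookup table `LEMMA_OF`, the function name `search_by_lemma`, and the token list are invented for the example.

```python
# Hypothetical form-to-lemma table; a real one would cover the whole lexicon.
LEMMA_OF = {
    "royalty": "royalty",
    "royalties": "royalty",
    "king": "king",
    "kings": "king",
}


def search_by_lemma(tokens, lemma):
    """Return the positions of every token whose lemma matches."""
    return [
        i for i, tok in enumerate(tokens)
        if LEMMA_OF.get(tok.lower(), tok.lower()) == lemma
    ]


tokens = ["Royalties", "go", "to", "authors", ";", "royalty", "means", "kings"]
print(search_by_lemma(tokens, "royalty"))
```

Both "Royalties" (payments) and "royalty" (kings and queens) are returned, so a lemma search merges two senses that a search on the exact word form would keep apart.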
15
Automatic lemmatization
Write rules:
– if word ends in "ing", delete "ing";
– if the remainder is a verb lemma, add it to the list of possible lemmas
If a detailed grammar is available, use it
A full lemma list is also required
– Often available from dictionary companies
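The rule-plus-lemma-list approach can be sketched as follows. The two lemma sets, the suffix rules, and the function name `possible_lemmas` are assumptions for illustration; a real system would use a full lemma list and a fuller grammar.

```python
# Tiny hypothetical lemma lists standing in for a full dictionary list.
VERB_LEMMAS = {"help", "arrive", "sing"}
NOUN_LEMMAS = {"help", "helping", "king"}

# (suffix to delete, string to add back) pairs for verb forms.
VERB_RULES = [("ing", ""), ("ing", "e"), ("ed", ""), ("ed", "e"), ("s", "")]


def possible_lemmas(word):
    """Return every (lemma, part of speech) the rules and lists allow."""
    candidates = set()
    if word in VERB_LEMMAS:
        candidates.add((word, "v"))
    if word in NOUN_LEMMAS:
        candidates.add((word, "n"))
    for suffix, replacement in VERB_RULES:
        if word.endswith(suffix):
            remainder = word[: -len(suffix)] + replacement
            # Only accept the remainder if it is a known verb lemma.
            if remainder in VERB_LEMMAS:
                candidates.add((remainder, "v"))
    if word.endswith("s") and word[:-1] in NOUN_LEMMAS:
        candidates.add((word[:-1], "n"))
    return sorted(candidates)


for form in ["help", "helps", "helping", "helped", "helpings", "sing"]:
    print(form, possible_lemmas(form))
```

The check against the lemma list is what stops "sing" from being reduced to "s": the suffix rule fires, but the remainder is not a verb lemma, so it is discarded.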
16
Part-of-speech (POS) tagging
"identifying parts of speech"
from: he didn't arrive.
to:
He       PNP    pers pronoun
did      VVD    past tense verb
n’t      XNOT   not
arrive   VV     base form of verb
.        C      punctuation
17
Tagsets
The set of part-of-speech tags to choose between
– Basic: noun, verb, pronoun …
– Advanced: examples from the CLAWS English tagset
  NN2   plural noun
  VVG   -ing form of lexical verb
Based on linguistics of the language.
18
POS-tagging: why?
Use grammar when searching
– Nouns modified by buckle
– Verbs that buckle is object of
19
POS-tagging: how?
Big topic for computational linguistics
– well understood
– taggers available for major languages
Some taggers use lemmatized input, others do not
Methods
– Constraint-based: set of rules of the form
  if previous word is "the" and VERB is one of the possibilities, delete VERB
– Statistical: machine learning from a tagged corpus
  Various methods
Ref: Manning and Schütze, Foundations of Statistical Natural Language Processing, MIT Press 1999.
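The constraint-based rule quoted above can be sketched directly. The data layout (each word paired with its set of possible tags), the tag names, and the function name `apply_the_constraint` are assumptions for illustration; the guard that never deletes a word's last remaining tag is also an addition, not part of the rule as stated.

```python
def apply_the_constraint(tokens):
    """Delete VERB from a word's possible tags when the previous word is 'the'.

    tokens is a list of (word, set of possible tags) pairs.
    """
    result = []
    previous_word = None
    for word, tags in tokens:
        tags = set(tags)  # copy, so the caller's data is not modified
        after_the = previous_word is not None and previous_word.lower() == "the"
        # Added safeguard: keep at least one tag for every word.
        if after_the and "VERB" in tags and len(tags) > 1:
            tags.discard("VERB")
        result.append((word, tags))
        previous_word = word
    return result


ambiguous = [
    ("The", {"DET"}),
    ("buckle", {"NOUN", "VERB"}),
    ("broke", {"VERB", "ADJ"}),
]
print(apply_the_constraint(ambiguous))
```

Only "buckle" is disambiguated here. "broke" stays ambiguous until some other rule applies, which is why constraint-based taggers need a large rule set.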
20
Parsing
Find the structure:
– Phrase structure (trees): The cat sat on the mat
– Dependency structure (links): The cat sat on the mat
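To make the two kinds of structure concrete, here is one way each might be represented as data. The particular analyses (category labels, and the preposition as head of its phrase) follow one common convention and are assumptions; the slide's own diagrams are not available here. The helper names `leaves` and `dependents` are hypothetical.

```python
# Phrase structure: a nested tuple (label, child, child, ...).
phrase_structure = (
    "S",
    ("NP", ("DET", "The"), ("N", "cat")),
    ("VP",
     ("V", "sat"),
     ("PP", ("P", "on"), ("NP", ("DET", "the"), ("N", "mat")))),
)

# Dependency structure: a flat list of (head, dependent) links.
dependencies = [
    ("sat", "cat"),
    ("cat", "The"),
    ("sat", "on"),
    ("on", "mat"),
    ("mat", "the"),
]


def leaves(tree):
    """Read the words back off a phrase-structure tree, left to right."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [children[0]]
    words = []
    for child in children:
        words.extend(leaves(child))
    return words


def dependents(head):
    """List the words that depend directly on head."""
    return [dep for h, dep in dependencies if h == head]


print(leaves(phrase_structure))
print(dependents("sat"))
```

The tree groups words into nested phrases, while the dependency list only links word to word, with no phrase nodes at all.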
21
Automatic parsing
Big topic
– see Jurafsky and Martin or other NLP textbook
Many methods too slow for large corpora
Sketch Engine usually uses "shallow parsing"
– Patterns of POS-tags
– Regular expressions
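A minimal sketch of shallow parsing as a regular expression over POS-tags, here finding verb and object pairs such as the verbs that "buckle" is the object of. This is not Sketch Engine's grammar formalism: the simplified tagset, the hand-tagged sentence, the pattern, and the function name `verb_object_pairs` are all assumptions for illustration.

```python
import re

# Hand-tagged example sentence with a simplified, hypothetical tagset.
tagged = [
    ("She", "PRON"), ("fastened", "VERB"), ("the", "DET"),
    ("old", "ADJ"), ("buckle", "NOUN"), (".", "PUNCT"),
]

# A verb, an optional determiner, any number of adjectives, then a noun.
OBJECT_PATTERN = re.compile(r"\b(VERB) (?:DET )?(?:ADJ )*(NOUN) ")


def verb_object_pairs(tagged_tokens):
    """Return (verb, noun) word pairs matched by the tag pattern."""
    # Join the tags into one string and remember where each tag starts.
    tag_string = ""
    index_at_offset = {}
    for i, (_, tag) in enumerate(tagged_tokens):
        index_at_offset[len(tag_string)] = i
        tag_string += tag + " "
    pairs = []
    for match in OBJECT_PATTERN.finditer(tag_string):
        verb = tagged_tokens[index_at_offset[match.start(1)]][0]
        noun = tagged_tokens[index_at_offset[match.start(2)]][0]
        pairs.append((verb, noun))
    return pairs


print(verb_object_pairs(tagged))
```

Because the pattern runs over tags rather than building a full tree, it is fast enough for large corpora, at the cost of missing objects that are not adjacent to their verb.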
22
Summary
What is NLP?
How can it help?
– Tokenizing
– Sentence splitting
– Lemmatizing
– POS-tagging
– Parsing