Introduction to Natural Language Processing and Text Mining: The Basic Building Blocks

Ambiguity
"At last, a computer that understands you like your mother." (1985 McDonnell-Douglas ad)
Different interpretations:
- The computer understands you as well as your mother understands you.
- The computer understands that you like your mother.
- The computer understands you as well as it understands your mother.
In speech: "... a computer that understands your lie cured mother ..."

Why is NLP difficult?
Natural language is highly ambiguous.
Syntactic ambiguity:
- "The president spoke to the nation about the problem of drug use in the schools from one coast to the other." has 720 parses.
- Example: "to the other" can attach to any of the previous NPs (e.g. "the problem") or to the head verb, giving 6 places.
- "from one coast" has 5 places to attach.
- …
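The figure of 720 parses can be checked with a line of arithmetic. The sketch below assumes, beyond what the slide states, that the pattern continues down the sentence: the six prepositional phrases have 6, 5, 4, 3, 2 and 1 attachment sites respectively, and the choices are independent. Only the first two counts come from the slide; the rest are an assumption chosen because their product matches the stated total.

```python
import math

# Attachment sites per prepositional phrase, last phrase first.
# 6 and 5 are from the slide; 4, 3, 2, 1 are an ASSUMED continuation.
attachment_sites = [6, 5, 4, 3, 2, 1]

# If the choices are independent, the number of parses is the product.
total_parses = math.prod(attachment_sites)
print(total_parses)  # 720
```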

Why is NLP difficult?
- Word category ambiguity: book --> verb or noun?
- Word sense ambiguity: bank --> financial institution, building, or riverside?
- Words can mean more than the sum of their parts: "make up a story"
- Fictitious worlds: "People on Mars can fly."
- Defining scope: "People like ice-cream." Does this mean that all people, or only some, like ice cream?
- Language is changing and evolving: "I'll email you my answer." "This new S.U.V. has a compartment for your mobile phone." "Googling", …

Why is NLP hard?
Natural language:
- is highly ambiguous at all levels
- is complex
- is probabilistic and fuzzy
- involves reasoning about the world
- deals with complex social interactions
Why is text tough?
- Abstract concepts are difficult to represent
- Countless combinations of subtle, abstract relationships among concepts
- Many ways to represent similar concepts
- Concepts are difficult to visualize
- High dimensionality: tens or hundreds of thousands of features

How is NLP doable?
In some senses NLP is quite easy:
- Rough text features are good enough for many useful tasks
Why is text easy?
- Highly redundant data
- Just about any simple algorithm can get "good" results for simple tasks:
  - Pull out "important" phrases
  - Find "meaningfully" related words
  - Create some sort of summary from documents

Levels of Text Processing
- Word Level: word properties, stop-words, stemming, frequent n-grams, thesaurus (WordNet)
- Sentence Level
- Document Level
- Document-Collection Level
- Linked-Document-Collection Level
- Application Level
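Two of the word-level operations listed above, stop-word removal and frequent n-gram counting, can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the stop-word list and the function names `content_words` and `frequent_ngrams` are hypothetical, and the input is assumed to be already tokenized and lowercased.

```python
from collections import Counter

# Hypothetical stop-word list for illustration; real lists are much longer.
STOP_WORDS = {"the", "a", "of", "to", "in", "and", "is"}

def content_words(tokens):
    """Drop stop-words from a list of lowercase tokens."""
    return [t for t in tokens if t not in STOP_WORDS]

def frequent_ngrams(tokens, n=2, top=3):
    """Count contiguous n-grams and return the `top` most frequent ones."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams).most_common(top)

tokens = "the problem of drug use in the schools is the problem".split()
print(content_words(tokens))
print(frequent_ngrams(tokens, n=2, top=1))
```

Note that the n-grams here are counted before stop-word removal; whether to filter first is a design choice that depends on the task.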

Models and Algorithms
Models: formalisms used to capture the various kinds of linguistic structure.
- State machines (FSAs, transducers, Markov models)
- Formal rule systems (context-free grammars, feature systems)
- Logic (predicate calculus, inference)
- Probabilistic versions of all of these, plus others (Gaussian mixture models, probabilistic relational models, etc.)
Algorithms: used to manipulate representations to create structure.
- Search (A*, dynamic programming)
- EM
- Supervised learning, etc.
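As a concrete instance of the first model family, here is a minimal sketch of a deterministic finite-state automaton (FSA). The language it accepts ("b", then two or more "a", then "!") is an illustrative choice that does not come from the slides, and the names `TRANSITIONS`, `ACCEPTING` and `accepts` are hypothetical.

```python
# Transition table for a deterministic FSA accepting: b a a+ !
# (state, input symbol) -> next state. The language is an assumed example.
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,   # loop: any number of extra a's
    (3, "!"): 4,
}
ACCEPTING = {4}

def accepts(string):
    """Run the FSA from state 0; reject on any missing transition."""
    state = 0
    for ch in string:
        state = TRANSITIONS.get((state, ch))
        if state is None:
            return False
    return state in ACCEPTING

print(accepts("baa!"), accepts("baaaa!"), accepts("ba!"))
```

A transducer extends the same table with an output symbol per transition, which is how morphological analyzers of the kind discussed later are often built.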

Language Processing Pipeline
Input:
- speech -> Phonetic/Phonological Analysis
- text -> OCR/Tokenization
Stages:
- Morphological and lexical analysis (POS tagging, WSD)
- Syntactic analysis (shallow parsing, deep parsing)
- Semantic interpretation
- Discourse processing (anaphora resolution, integration)

The Big Picture
Source-language speech signal -> Speech recognition -> Source text -> Analysis -> Generation -> Target text -> Speech synthesis -> Target-language speech signal

Some Building Blocks
Source Language Analysis      Target Language Generation
Text Normalization            Text Rendering
Morphological Analysis        Morphological Synthesis
POS Tagging                   Phrase Generation
Parsing                       Role Ordering
Semantic Analysis             Lexical Choice
Discourse Analysis            Discourse Planning

Two Approaches
Symbolic:
- Encode all the necessary knowledge
- Good when annotated data is not available
- Allows steady development
- The development can be monitored
- Fits well with logic and reasoning in AI
Statistical:
- Learn language from its usage
- Supervised learning requires large collections manually annotated with meta-tags
- Development is almost blind
- Few ways to check the correctness
- Debugging is very frustrating

Resolve Ambiguities
We will introduce models and algorithms to resolve ambiguities at different levels.
- Part-of-speech tagging: deciding whether "duck" is a verb or a noun.
- Word-sense disambiguation: deciding whether "make" means create or cook.
- Lexical disambiguation: part-of-speech tagging and word-sense disambiguation are two important kinds of lexical disambiguation.
- Syntactic ambiguity: "her duck" is an example of syntactic ambiguity, which can be addressed by probabilistic parsing.

Languages
- Languages: 39,000 languages and dialects (22,000 dialects in India alone)
- Top languages: Chinese/Mandarin (885M), Spanish (332M), English (322M), Bengali (189M), Hindi (182M), Portuguese (170M), Russian (170M), Japanese (125M)
  Source: www.sil.org/ethnologue, www.nytimes.com
- Internet: English (128M), Japanese (19.7M), German (14M), Spanish (9.4M), French (9.3M), Chinese (7.0M)
- Usage: English (1999: 54%, 2001: 51%, 2003: 46%, 2005: 43%)
  Source: www.computereconomics.com

Tokenization, Segmentation, Stemming/Lemmatization
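A minimal sketch of the first two steps, sentence segmentation and tokenization, using regular expressions. The rules are deliberately naive and the function names `split_sentences` and `tokenize` are hypothetical; real tokenizers handle abbreviations, numbers, URLs and language-specific conventions.

```python
import re

def split_sentences(text):
    """Naive segmentation: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Words (allowing internal apostrophes or hyphens) or single punctuation marks."""
    return re.findall(r"\w+(?:['’-]\w+)*|[^\w\s]", sentence)

text = "I'll email you my answer. People like ice-cream."
for sentence in split_sentences(text):
    print(tokenize(sentence))
```

The segmentation rule shows why this step is harder than it looks: an abbreviation such as "S.U.V." followed by a space would be wrongly treated as a sentence end.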

Morphology
Morphology is the field of linguistics that studies the internal structure of words: how words are built up from smaller meaningful units called morphemes (morph = shape, logos = word).
We can usefully divide morphemes into two classes:
- Stems: the core meaning-bearing units
- Affixes: bits and pieces that adhere to stems to change their meanings and grammatical functions
  - Prefix: un-, anti-, etc. (a-, ati-, pra-, etc.)
  - Suffix: -ity, -ation, etc. (-taa, -ke, -ka, etc.)
  - Infix: inserted inside the stem. Tagalog: um + hingi -> humingi
  - Circumfix: precedes and follows the stem
Turkish can have words with a lot of suffixes (agglutinative language).
Many Indian languages also have agglutinative suffixes.

Examples (English)
"unladylike": 3 morphemes, 4 syllables
- un- 'not'
- lady '(well behaved) female adult human'
- -like 'having the characteristics of'
- Can't break any of these down further without distorting the meaning of the units
"dogs": 2 morphemes, 1 syllable
- dog
- -s, a plural marker on nouns
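The decomposition of "unladylike" and "dogs" can be mimicked by recursive affix stripping against a lexicon. This is a toy sketch: the affix lists, the stem set and the function name `segment` are all hypothetical and cover only the examples above, and spelling changes at morpheme boundaries are ignored.

```python
# Hypothetical toy lexicon covering only the slide's examples.
PREFIXES = ["un", "anti"]
SUFFIXES = ["like", "ity", "ation", "s"]
STEMS = {"lady", "dog"}

def segment(word):
    """Return [prefixes..., stem, suffixes...], or None if no analysis is found."""
    if word in STEMS:
        return [word]
    for p in PREFIXES:
        if word.startswith(p):
            rest = segment(word[len(p):])
            if rest:
                return [p + "-"] + rest
    for s in SUFFIXES:
        if word.endswith(s):
            rest = segment(word[:-len(s)])
            if rest:
                return rest + ["-" + s]
    return None

print(segment("unladylike"))  # ['un-', 'lady', '-like']
print(segment("dogs"))        # ['dog', '-s']
```

An analysis is accepted only if stripping bottoms out in a known stem, which is what stops the procedure from cutting a word like "lady" any further.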

Examples (Bengali)
"chhelederTaakei": 5 morphemes
- chhele 'boy'
- -der 'plural genitive'
- -Taa 'classifier'
- -ke 'dative'
- -i 'emphasizer'
- Can't break any of these down further without distorting the meaning of the units
"atipraakrritake"
- ati-
- praakrrita
- -ke

Inflectional & Derivational Morphology
We can also divide morphology up into two broad classes:
- Inflectional
- Derivational
Inflectional morphology is grammatical: it concerns number, tense, case, gender.
Derivational morphology concerns word building:
- part-of-speech derivation
- words with related meaning

Inflectional Morphology
Variation in the form of a word, typically by means of an affix, that expresses a grammatical contrast.
- Doesn't change the word class
- Usually produces a predictable, nonidiosyncratic change of meaning, e.g. may add tense, number, person, mood, aspect
- Serves a grammatical/semantic purpose different from the original
- Highly systematic, though there may be irregularities and exceptions
  - Simplifies the lexicon: only exceptions need to be listed
  - Unknown words may be guessable
After combination with an inflectional morpheme, the meaning and class of the actual stem usually do not change.
- eat / eats
- pencil / pencils
- helaa / khele / khelchhila
- bai / baiTAke / baiyera

Derivational Morphology
The formation of a new word or inflectable stem from another word or stem.
After combination with a derivational morpheme, the meaning and the class of the actual stem usually change.
- compute / computer
- do / undo
- friend / friendly
- uygar / uygarlaş
- kapı / kapıcı
- udaara (J) / udaarataa (N)
- bhadra / abhadra
- baayu / baayabiiya
Irregular changes may happen with derivational affixes.
- Fairly systematic, and predictable up to a point
- Simplifies description of the lexicon: regularly derived words need not be listed
- Unknown words may be guessable
But …
- Apparent derivations may have specialised meanings
- Some derivations are missing

Morphological processes
- Affixes: prefix, suffix, infix, circumfix
- Vowel change (umlaut, ablaut)
- Gemination, (partial) reduplication
- Root and pattern
- Stress (or tone) change
- Sandhi

Concatenative Morphology
Morpheme+Morpheme+Morpheme+…
Stems: also called lemma, base form, root, lexeme
- hope+ing -> hoping
- hop+ing -> hopping
Affixes:
- Prefixes: anti-, dis- in antidisestablishmentarianism
- Suffixes: -ment, -arian, -ism in antidisestablishmentarianism
- Infixes: hingi (borrow) -> humingi (borrower) in Tagalog
- Circumfixes: sagen (say) -> gesagt (said) in German
Agglutinative languages:
- uygarlaştıramadıklarımızdanmışsınızcasına
- uygar+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına
- "Behaving as if you are among those whom we could not cause to become civilized"
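The contrast between hope+ing and hop+ing shows that concatenation is not plain string joining: spelling rules apply at the morpheme boundary. The sketch below implements two such rules, e-deletion and final-consonant doubling, under simplifying assumptions. The function name `add_ing` is hypothetical, and doubling is restricted to one-syllable stems because stress is not modelled, so a stem such as "begin" would not be doubled.

```python
import re

VOWELS = "aeiou"

def add_ing(stem):
    """Attach +ing using two spelling rules: e-deletion and consonant doubling."""
    # e-deletion: hope + ing -> hoping (the e is kept after e, o or y, as in "see")
    if stem.endswith("e") and len(stem) > 2 and stem[-2] not in "eoy":
        return stem[:-1] + "ing"
    # Consonant doubling: one-syllable stems ending consonant-vowel-consonant,
    # where the final consonant is not w, x or y: hop + ing -> hopping
    one_syllable = len(re.findall(f"[{VOWELS}]+", stem)) == 1
    if (one_syllable and len(stem) >= 3
            and stem[-1] not in VOWELS + "wxy"
            and stem[-2] in VOWELS
            and stem[-3] not in VOWELS):
        return stem + stem[-1] + "ing"
    return stem + "ing"

for stem in ["hope", "hop", "lift", "swim", "box", "rain"]:
    print(stem, "->", add_ing(stem))
```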

Morphophonemics
Morphemes and allomorphs
- e.g. {plur}: +(e)s, vowel change, y -> ies, f -> ves, um -> a, ...
Morphophonemic variation
- Affixes and stems may have variants which are conditioned by context
  - e.g. +ing in lifting, swimming, boxing, raining, hoping, hopping
- Rules may be generalisable across morphemes
  - e.g. +(e)s in cats, boxes, tomatoes, matches, dishes, buses
  - Applies to both {plur} (nouns) and {3rd sing pres} (verbs)
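The point that one rule generalises across morphemes can be made concrete: a single function can spell out +(e)s whether it marks a noun plural or a verb's third person singular present. This is a sketch under assumptions. The function name `add_s` and the exception table are hypothetical, and "tomato" is listed as an exception because the choice between -os and -oes is lexically conditioned.

```python
# Hypothetical exception lexicon: forms the spelling rules below cannot predict.
EXCEPTIONS = {"tomato": "tomatoes"}

def add_s(stem):
    """Attach +(e)s, shared by {plur} on nouns and {3rd sing pres} on verbs."""
    if stem in EXCEPTIONS:
        return EXCEPTIONS[stem]
    if stem.endswith(("s", "x", "z", "ch", "sh")):
        return stem + "es"            # box -> boxes, match -> matches
    if stem.endswith("y") and len(stem) > 1 and stem[-2] not in "aeiou":
        return stem[:-1] + "ies"      # study -> studies
    return stem + "s"                 # cat -> cats

for stem in ["cat", "box", "tomato", "match", "dish", "bus", "study"]:
    print(stem, "->", add_s(stem))
```

Because the rule is stated over the stem's spelling rather than its word class, `add_s("match")` gives "matches" for both the noun and the verb.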

Templatic Morphology
Roots and patterns. Example: Hebrew verbs
- Root: consists of 3 consonants (CCC); carries the basic meaning
- Template: gives the ordering of consonants and vowels; specifies semantic information about the verb (active, passive, middle voice)
Example: lmd (to learn or study)
- CaCaC -> lamad (he studied)
- CiCeC -> limed (he taught)
- CuCaC -> lumad (he was taught)
Psycholinguistic reality: format -> فرمت (farmat)
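Root-and-pattern combination is easy to express as slot filling: each C in the template takes the next consonant of the root. A minimal sketch follows; the function name `apply_template` is hypothetical, and the representation of templates as strings with literal "C" slots follows the slide's notation.

```python
def apply_template(root, template):
    """Fill each C slot in the template with the next root consonant."""
    if template.count("C") != len(root):
        raise ValueError("template must have one C slot per root consonant")
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in template)

for template in ["CaCaC", "CiCeC", "CuCaC"]:
    print(template, "->", apply_template("lmd", template))
```

Unlike the concatenative examples earlier, the root and the pattern are interleaved, so the stem never appears as a contiguous substring of the surface form.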

Syntax and Morphology
Phrase-level agreement:
- Subject-Verb: John studies hard (STUDY+3SG)
- Noun-Adjective: Achchhi Ladki
In some languages, like Sanskrit, morphology contains a lot of information about structure.

Morphology in NLP
Analysis vs synthesis:
- Analysis: what does "dogs" mean?
- Synthesis: what is the plural of "dog"?
Analysis:
- Need to identify the lexeme
  - Tokenization
  - To access lexical information
- Inflections (etc.) carry information that will be needed by other processes (e.g. agreement is useful in parsing; inflections can carry meaning, e.g. tense, number)
- Morphology can be ambiguous
  - May need another process to disambiguate (e.g. German -en)
Synthesis:
- Need to generate appropriate inflections from the underlying representation
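The two directions can be sketched as a pair of functions over a shared lexicon. Everything here is a toy assumption: the lexicon entries, the feature labels `BASE`, `PL` and `3SG`, and the function names `analyze` and `synthesize` are hypothetical, and only the plain +s spelling is handled. The example also shows how analysis can be ambiguous when a stem belongs to more than one word class.

```python
# Hypothetical toy lexicon: lemma -> possible word classes.
LEXICON = {"dog": ["N"], "book": ["N", "V"]}
# The feature that +s expresses depends on the word class.
SUFFIX_FEATURE = {"N": "PL", "V": "3SG"}

def analyze(surface):
    """Analysis: surface form -> all (lemma, class, feature) readings."""
    readings = [(surface, cat, "BASE") for cat in LEXICON.get(surface, [])]
    if surface.endswith("s"):
        lemma = surface[:-1]
        for cat in LEXICON.get(lemma, []):
            readings.append((lemma, cat, SUFFIX_FEATURE[cat]))
    return readings

def synthesize(lemma, cat, feature):
    """Synthesis: (lemma, class, feature) -> surface form."""
    if cat not in LEXICON.get(lemma, []):
        raise KeyError(f"{lemma!r} is not listed as {cat}")
    return lemma + "s" if feature in ("PL", "3SG") else lemma

print(analyze("dogs"))               # one reading
print(analyze("books"))              # two readings: noun plural or verb 3SG
print(synthesize("dog", "N", "PL"))
```

Choosing between the two readings of "books" is left to a later process such as part-of-speech tagging, as the slide notes.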