1 Pacific University Sheldon Liang, Ph D Computer Science Department.

Presentation transcript:

1 Pacific University Sheldon Liang, Ph D Computer Science Department

2 Pacific University Sense, Communicate, Actuate

3 Natural?
- Natural Language?
  - Refers to the language spoken by people, e.g. English, Chinese, Swahili, as opposed to artificial languages such as C++ or Java
- Natural Language Processing
  - Applications that deal with natural language in one way or another; a subfield of Artificial Intelligence
- Computational Linguistics
  - Doing linguistics on computers
  - More on the linguistic side than NLP, but closely related

4 What is Artificial Intelligence?
- The use of computer programs and programming techniques to cast light on the principles of intelligence in general and human thought in particular (Boden)
- AI is the study of how to do things which at the moment people do better (Rich & Knight)
- AI is the science of making machines do things that would require intelligence if done by men (Minsky)

5 Why Natural Language Processing?
- kJfmmfj mmmvvv nnnffn333
- Uj iheale eleee mnster vensi credur
- Baboi oi cestnitze
- Coovoel2^ ekk; ldsllk lkdf vnnjfj?
- Fgmflmllk mlfm kfre xnnn!

6 Computers Lack Knowledge!
- Computers “see” text in English the same way you saw the previous text!
- People have no trouble understanding language
  - Common sense knowledge
  - Reasoning capacity
  - Experience
- Computers have
  - No common sense knowledge
  - No reasoning capacity
- Unless we teach them!

7 Why Natural Language Processing?
- Huge amounts of data
  - Internet = at least 8 billion pages
  - Intranet
- Applications for processing large amounts of text
  - Require NLP expertise
- Classify text into categories
- Index and search large texts
- Automatic translation
- Speech understanding
  - Understand phone conversations
- Information extraction
  - Extract useful information from resumes
- Automatic summarization
  - Condense 1 book into 1 page
- Question answering
- Knowledge acquisition
- Text generation / dialogs

8 Where does it fit in the CS taxonomy?
[Diagram: taxonomy tree, node labels listed level by level from top to bottom]
- Computers & Applications
- Artificial Intelligence, Algorithms, Databases, Networking
- Robotics, Search, Natural Language Processing
- Information Retrieval, Machine Translation, Language Analysis
- Semantics, Parsing

9 Situating NLP
[Diagram: NLP and related fields]
- computer science
- psychology/cognitive science
- linguistics
- math/statistics
- philosophy
- communication

10 Theoretical foundations
- math: statistics, calculus, algebra, modeling
- computational paradigms: connectionist, rule-based, cognitively plausible
- linguistics: LFG, HPSG, GB, OT, CG, etc.
- architectures: stacks, automata, networks, compilers

11 Some areas of research
- Corpora, tools, resources, standards
- Language/grammar engineering
- Machine (assisted) translation, tools
- Language modeling
- Lexicography
- Speech

12 Linguistics Essentials

13 The Description of Language
- Language = Words and Rules
  - Dictionary (vocabulary) + Grammar
- Dictionary: set of words defined in the language; open (dynamic)
  - Traditional: paper-based
  - Electronic: machine-readable dictionaries; can be obtained from paper-based ones
- Grammar: set of rules which describe what is allowable in a language
  - Classic grammars: meant for humans who know the language
    - definitions and rules are mainly supported by examples
    - no (or almost no) formal description tools; cannot be programmed
  - Explicit grammars (CFG, Dependency Grammars, Link Grammars, ...): formal description; can be programmed & tested on data (texts)
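
The point that an explicit grammar can be programmed and tested on text can be illustrated with a minimal sketch. The toy grammar, the lexicon and the function name below are assumptions invented for illustration (they are not from the slides); the recognizer uses the standard CYK procedure for a grammar in Chomsky normal form.

```python
from itertools import product

# Hypothetical toy grammar in Chomsky normal form (illustrative only).
BINARY_RULES = {            # (B, C) -> set of A such that A -> B C
    ("NP", "VP"): {"S"},
    ("DT", "NN"): {"NP"},
    ("VB", "NP"): {"VP"},
}
LEXICON = {                 # word -> set of preterminal categories
    "the": {"DT"}, "a": {"DT"},
    "dog": {"NN"}, "bear": {"NN"},
    "chased": {"VB"}, "saw": {"VB"},
}

def recognizes(sentence):
    """Return True if the toy grammar derives the sentence (CYK)."""
    words = sentence.lower().split()
    n = len(words)
    if n == 0:
        return False
    # table[i][j] holds the categories that span words[i..j] inclusive
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(words):
        table[i][i] = set(LEXICON.get(word, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):          # split point
                for b, c in product(table[i][k], table[k + 1][j]):
                    table[i][j] |= BINARY_RULES.get((b, c), set())
    return "S" in table[0][n - 1]

print(recognizes("the dog chased the bear"))   # grammatical under the toy grammar
print(recognizes("dog the chased bear the"))   # not derivable
```

Because the rules are data, the same recognizer can be run over a whole text collection to test how much of it the grammar covers.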

14 Linguistics Levels of Analysis
- Speech
- Written language
  - Phonology: sounds / letters / pronunciation
  - Morphology: the structure of words
  - Syntax: how these sequences are structured
  - Semantics: meaning of the strings
- Interaction between levels, where each level has an input and an output

15 Phonetics/Orthography
- Input:
  - acoustic signal (phonetics) / text (orthography)
- Output:
  - phonetic alphabet (phonetics) / text (orthography)
- Deals with:
  - Phonetics:
    - consonant & vowel (& others) formation in the vocal tract
    - classification of consonants, vowels, ... in relation to frequencies, shape & position of the tongue and various muscles
    - intonation
  - Orthography: normalization, punctuation, etc.

16 Phonology -- pronunciation
- Input:
  - sequence of phones/sounds (in a phonetic alphabet); or “normalized” text (sequence of (surface) letters in one language’s alphabet) [NB: phones vs. phonemes]
- Output:
  - sequence of phonemes (~ (lexical) letters; in an abstract alphabet)
- Deals with:
  - relation between sounds and phonemes (units which might have some function on the upper level)
  - e.g.: [u] ~ oo (as in book), [æ] ~ a (cat); i ~ y (flies)

17 Morphology -- the structure of words
- Input: sequence of phonemes (~ (lexical) letters)
- Output:
  - sequence of pairs (lemma, (morphological) tag)
- Deals with:
  - composition of phonemes into word forms and their underlying lemmas (lexical units) + morphological categories (inflection, derivation, compounding)
  - e.g. quotations ~ quote/V + -ation(der.V->N) + NNS.

18 ...and Beyond
- Input:
  - sentence structure (tree): annotated nodes (autosemantic lemmas, (morphosyntactic) tags, deep functions)
- Output:
  - logical form, which can be evaluated (true/false)
- Deals with:
  - assignment of objects from the real world to the nodes of the sentence structure
  - e.g.: (I/Sg/Pat/t (see/Perf/Pred/t) Tom/Sg/Ag/f) ~ see(Mark-Twain[SSN:...],Tom-Sawyer[SSN:...])

19 Phonology
- (Surface ↔ Lexical) Correspondence
- “symbol-based” (no complex structures)
- Ex.: (stem-final change)
  - lexical: b a b y + s (+ denotes start of ending)
  - surface: b a b i e s (phonetic-related: b é b ì 0s )
- Arabic: (interfixing, inside-stem doubling)
  - lexical: kTb+uu+CVCCVC (CVCC... vowel/consonant pattern)
  - surface: kuttub
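
The stem-final change in the English example can be sketched as a symbol-level rewrite rule. This is only a rough illustration: the function name and the single rule (consonant + y before the ending s becomes consonant + ies) are simplifying assumptions, not a full account of English spelling, and the lexical form is written without spaces.

```python
import re

def lexical_to_surface(lexical):
    """Map a lexical form such as 'baby+s' to its surface form.

    Simplified, assumed rule set (English plural only):
      1. consonant + 'y' + '+s'  ->  consonant + 'ies'
      2. any remaining '+' boundary marker is deleted
    """
    surface = re.sub(r"([^aeiou])y\+s$", r"\1ies", lexical)
    return surface.replace("+", "")

for form in ["baby+s", "boy+s", "dog+s"]:
    print(form, "->", lexical_to_surface(form))
```

The rule looks only at neighbouring symbols, which is what "symbol-based (no complex structures)" suggests; "boy+s" is left alone because the y follows a vowel.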

20 Phonology Examples
- German (umlaut) (Satz ~ sentence)
  - lexical: s A t z + e (A denotes “umlautable” a)
  - surface: s ä t z e (phonetic: z æ c e, vs. zac )
- Turkish (vowel harmony)
  - lexical: e v + l A r (~house)
  - surface: e v l e r
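
The Turkish example can be sketched in a few lines. The sketch assumes the usual two-way harmony for the plural suffix: the abstract vowel A surfaces as e after a front vowel and as a after a back vowel. The function name and the test words other than ev are illustrative assumptions.

```python
FRONT_VOWELS = set("eiöü")
BACK_VOWELS = set("aıou")

def realize_plural(lexical):
    """Resolve the abstract vowel 'A' in a form like 'ev+lAr'."""
    stem, suffix = lexical.split("+")
    # The last vowel of the stem decides the harmony class.
    last_vowel = next(
        (ch for ch in reversed(stem) if ch in FRONT_VOWELS | BACK_VOWELS),
        None,
    )
    if last_vowel is None:
        raise ValueError(f"no vowel found in stem {stem!r}")
    vowel = "e" if last_vowel in FRONT_VOWELS else "a"
    return stem + suffix.replace("A", vowel)

print(realize_plural("ev+lAr"))     # front-vowel stem
print(realize_plural("kitap+lAr"))  # back-vowel stem
```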

21 Morphology: Morphemes & Order
- Scientific study of forms of words
- Grouping of phonemes into morphemes
  - sequence deliverables → deliver, able and s (3 units)
  - could as well be some “ID” numbers:
    - e.g. deliver ~ 23987, s ~ 12, able ~ 3456
- Morpheme Combination
  - certain combinations/sequencing possible, others not:
    - deliver+able+s, but not able+deliver+s; noun+s, but not noun+ing
  - typically fixed (in any given language)
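
The idea that only certain morpheme sequences are possible can be sketched as a small state machine. The numeric IDs are the ones given on the slide; the morpheme classes, the ordering table and the function name are assumptions made for this illustration.

```python
# Morpheme inventory; the numeric IDs are those shown on the slide,
# the class labels are assumed for illustration.
MORPHEMES = {
    "deliver": {"id": 23987, "cls": "stem"},
    "able": {"id": 3456, "cls": "derivational"},
    "s": {"id": 12, "cls": "inflectional"},
}

# Assumed ordering constraint: stem, then derivation, then inflection.
ALLOWED_NEXT = {
    "START": {"stem"},
    "stem": {"derivational", "inflectional", "END"},
    "derivational": {"inflectional", "END"},
    "inflectional": {"END"},
}

def is_valid_sequence(morphemes):
    """Check a morpheme sequence against the ordering table."""
    state = "START"
    for morpheme in morphemes:
        entry = MORPHEMES.get(morpheme)
        if entry is None or entry["cls"] not in ALLOWED_NEXT[state]:
            return False
        state = entry["cls"]
    return "END" in ALLOWED_NEXT[state]

print(is_valid_sequence(["deliver", "able", "s"]))  # allowed order
print(is_valid_sequence(["able", "deliver", "s"]))  # suffix before stem
```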

22 The Dictionary (or Lexicon)
- Repository of information about words:
  - Morphological:
    - description of morphological “behavior”: inflection patterns/classes
  - Syntactic:
    - Part of Speech
    - relations to other words:
      - subcategorization (or “surface valency frames”)
  - Semantic:
    - semantic features
    - frames
  - ...and any other! (e.g., translation)

23 Pacific University Sense, Communicate, Actuate

24 (Surface) Syntax
- Input:
  - sequence of pairs (lemma, (morphological) tag)
- Output:
  - sentence structure (tree) with annotated nodes (all lemmas, (morphosyntactic) tags, functions), of various forms
- Deals with:
  - the relation between lemmas & morphological categories and the sentence structure
  - uses syntactic categories such as Subject, Verb, Object, ...
  - e.g.: I/PP1 see/VB a/DT dog/NN ~ ((I/sg)SB ((see/pres)V (a/ind dog/sg)OBJ)VP)S

25 Issues in Syntax
“the dog ate my homework” - Who did what?
1. Identify the part of speech (POS)
   - Dog = noun; ate = verb; homework = noun
   - English POS tagging: 95%. Can be improved!
   - Part-of-speech tagging for other languages is almost nonexistent
2. Identify collocations
   - mother in law, hot dog
   - Compositional versus non-compositional collocates

26 Issues in Syntax
- Shallow parsing: “the dog chased the bear”
  - “the dog” “chased the bear”: subject - predicate
  - Identify basic structures: NP-[the dog] VP-[chased the bear]
- Shallow parsing on new languages
- Shallow parsing with little training data
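
A rough sketch of shallow parsing over already tagged words: it groups determiner/adjective/noun runs into NP chunks and leaves everything else as single-word chunks. The tag names, the chunking rule and the function name are assumptions chosen for illustration; real chunkers are usually learned from annotated data.

```python
NP_TAGS = {"DT", "JJ", "NN"}

def np_chunks(tagged):
    """Group determiner/adjective/noun runs into NP chunks.

    `tagged` is a list of (word, tag) pairs; every other tag
    becomes a one-word chunk labelled with the tag itself.
    """
    chunks, current = [], []
    for word, tag in tagged:
        if tag in NP_TAGS:
            if tag == "DT" and current:   # a determiner opens a new NP
                chunks.append(("NP", current))
                current = []
            current.append(word)
        else:
            if current:
                chunks.append(("NP", current))
                current = []
            chunks.append((tag, [word]))
    if current:
        chunks.append(("NP", current))
    return chunks

sentence = [("the", "DT"), ("dog", "NN"), ("chased", "VB"),
            ("the", "DT"), ("bear", "NN")]
print(np_chunks(sentence))
```

For the example sentence this yields an NP for "the dog", the verb, and an NP for "the bear"; building the VP from the verb and the second NP would be a further step.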

27 Issues in Syntax
- Full parsing: John loves Mary
- Current precision: 85-88%
- Helps answer (automatically) questions like: Who did what and when?

28 Meaning (semantics)
- Input:
  - sentence structure (tree) with annotated nodes (lemmas, (morphosyntactic) tags, surface functions)
- Output:
  - sentence structure (tree) with annotated nodes (semantic lemmas, (morpho-syntactic) tags, deep functions)
- Deals with:
  - relation between categories such as “Subject”, “Object” and (deep) categories such as “Agent”, “Effect”; adds other categories
  - e.g. ((I)SB ((was seen)V (by Tom)OBJ)VP)S ~ (I/Sg/Pat/t (see/Perf/Pred/t) Tom/Sg/Ag/f)

29 Issues in Semantics
- Understand language! How?
  - “plant” = industrial plant
  - “plant” = living organism
- Words are ambiguous
- Importance of semantics?
  - Machine Translation: wrong translations
  - Information Retrieval: wrong information
  - Anaphora Resolution: wrong referents

30 Why Semantics?
- English: The sea is home to millions of plants and animals
- English → French [commercial MT system]: Le mer est a la maison de billion des usines et des animaux
- French → English: The sea is at the home for billions of factories and animals

31 Issues in Semantics
- How to learn the meaning of words?
- From dictionaries:
  - plant, works, industrial plant -- (buildings for carrying on industrial labor; "they built a large plant to manufacture automobiles")
  - plant, flora, plant life -- (a living organism lacking the power of locomotion)
- They are producing about 1,000 automobiles in the new plant
- The sea flora consists of 1,000 different plant species
- The plant was close to the farm of animals.

32 Issues in Semantics
- Learn from annotated examples:
  - Assume 100 examples containing “plant” previously tagged by a human
  - Train a learning algorithm
  - Precision in the range 60%-70%-(80%)
- How to choose the learning algorithm?
- How to obtain the 100 tagged examples?

33 Issues in Learning Semantics
- Learning?
  - Assume a (large) amount of annotated data = training
  - Assume a new text not annotated = test
  - Learn from previous experience (training) to classify new data (test)
  - Decision trees, memory-based learning, neural networks → Machine Learning
- Which one performs best?
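
The training/test setup can be sketched with the simplest of the listed methods, memory-based learning: store the tagged examples and give a new sentence the tag of the most similar stored one. The four training sentences, the sense labels and the function names are assumptions for illustration, and raw word overlap is used as a deliberately crude similarity measure.

```python
import re

# Hypothetical hand-tagged training examples for the word "plant".
TRAINING = [
    ("They are producing about 1,000 automobiles in the new plant", "industrial"),
    ("They built a large plant to manufacture automobiles", "industrial"),
    ("The sea flora consists of 1,000 different plant species", "living"),
    ("A plant is a living organism lacking the power of locomotion", "living"),
]

def words(text):
    """Lower-cased set of alphabetic word forms in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def classify(sentence, training=TRAINING):
    """1-nearest-neighbour: copy the tag of the most similar example."""
    target = words(sentence)
    _, best_tag = max(training, key=lambda ex: len(target & words(ex[0])))
    return best_tag

print(classify("The plant will manufacture automobiles next year"))
print(classify("Many species of plant live in the sea"))
```

With so few examples the classifier is fragile, which is one reason the question of how to obtain enough tagged data matters.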

34 Issues in Semantics
- Automatic annotation of data
- Active learning
  - Identify only the hard examples
- Co-training
  - Identify the examples where several techniques agree on the semantic tag
- Collecting from Web users
  - Open Mind Word Expert

35 Problems faced by Natural Language-Understanding Systems

36 Key NLP problem: ambiguity
Human language is highly ambiguous at all levels
- acoustic level: recognize speech vs. wreck a nice beach
- morphological level: saw: to see (past), saw (noun), to saw (present, inf)
- syntactic level: I saw the man on the hill with a telescope
- semantic level: One book has to be read by every student

38 Language Model
- A formal model about language
- Two types
  - Non-probabilistic
    - Allows one to compute whether a certain sequence (sentence or part thereof) is possible
    - Often grammar based
  - Probabilistic
    - Allows one to compute the probability of a certain sequence
    - Often extends grammars with probabilities

39 Example of Bad Language Model

40 Example of Bad Language Model

41 Example of Bad Language Model

42 A Good Language Model
- Non-Probabilistic
  - “I swear to tell the truth” is possible
  - “I swerve to smell de soup” is impossible
- Probabilistic
  - P(I swear to tell the truth) ~ .0001
  - P(I swerve to smell de soup) ~ 0
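
A minimal sketch of a probabilistic language model, assuming a bigram model with unsmoothed maximum-likelihood estimates trained on a two-sentence toy corpus. The corpus, the function names and the absence of smoothing are assumptions for illustration; the slides do not say which kind of model they have in mind.

```python
from collections import Counter

# Hypothetical toy training corpus.
CORPUS = [
    "I swear to tell the truth",
    "I swear to tell the whole truth",
]

def train_bigram(sentences):
    """Count context words and word pairs, with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens[:-1])             # every token used as a context
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def probability(sentence, unigrams, bigrams):
    """Product of P(word | previous word); 0.0 for any unseen bigram."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        if bigrams[(prev, cur)] == 0:
            return 0.0
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p

UNIGRAMS, BIGRAMS = train_bigram(CORPUS)
print(probability("I swear to tell the truth", UNIGRAMS, BIGRAMS))
print(probability("I swerve to smell de soup", UNIGRAMS, BIGRAMS))
```

On this toy corpus the first sentence gets probability 0.5 and the second 0.0. The non-probabilistic view falls out of the same model: a sequence counts as possible exactly when its probability is greater than zero. A realistic model would be trained on far more text and smoothed so that unseen but plausible sequences are not assigned zero.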

43 Language Model Application
- Spelling correction
- Mobile phone texting
- Speech recognition
- Handwriting recognition
- Disabled users
- …

44 Speech & Text segmentation
- In spoken language, sounds representing successive letters blend into each other
- This makes the conversion of the analog signal to discrete characters very difficult
- Regarding text segmentation, some written languages such as Chinese, Japanese and Thai do not mark word boundaries explicitly
- So any significant text parsing requires identifying word boundaries, which is often a non-trivial task
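
One simple baseline for finding word boundaries is greedy maximum matching against a word list. The sketch below is an assumption about how this could be illustrated, not a method named on the slide; it uses unspaced English as a stand-in for a language written without word boundaries, and the vocabulary and function name are hypothetical.

```python
def max_match(text, vocabulary):
    """Greedy longest-match segmentation.

    At each position, take the longest vocabulary word that matches;
    if none matches, emit a single character and move on.
    Assumes `vocabulary` is a non-empty set of strings.
    """
    longest = max(map(len, vocabulary))
    words, i = [], 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                i += size
                break
    return words

VOCABULARY = {"the", "dog", "chased", "bear"}
print(max_match("thedogchasedthebear", VOCABULARY))
```

Greedy matching can commit to a long word too early and then fail on the rest of the string, which is one reason the task is non-trivial.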

45 Word sense disambiguation
- Word sense disambiguation is the problem of selecting a sense for a word from a set of predefined possibilities.
  - Sense inventory usually comes from a dictionary or thesaurus.
  - Knowledge-intensive methods, supervised learning, and (sometimes) bootstrapping approaches
- Word sense discrimination is the problem of dividing the usages of a word into different meanings, without regard to any particular existing sense inventory.
  - Unsupervised techniques

46 Word sense disambiguation: Computers versus Humans
- Polysemy – most words have many possible meanings.
- A computer program has no basis for knowing which one is appropriate, even if it is obvious to a human…
- Ambiguity is rarely a problem for humans in their day-to-day communication, except in extreme cases…

47 Word sense disambiguation: Ambiguity for a Computer
- The fisherman jumped off the bank and into the water.
- The bank down the street was robbed!
- Back in the day, we had an entire bank of computers devoted to this problem.
- The bank in that road is entirely too steep and is really dangerous.
- The plane took a bank to the left, and then headed off towards the mountains.
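
A knowledge-based way to pick a sense from a predefined inventory can be sketched with a simplified Lesk-style overlap: choose the sense whose dictionary gloss shares the most content words with the sentence. The three glosses, the sense labels, the stopword list and the function name are invented for illustration and are not taken from any real dictionary.

```python
import re

STOPWORDS = {"a", "an", "and", "as", "in", "into", "is", "of", "or",
             "such", "the", "this", "to", "was", "we", "had", "where"}

# Hypothetical sense inventory for "bank" with made-up glosses.
SENSES = {
    "riverside": "sloping land beside a body of water such as a river",
    "financial": "institution or building where money is kept and lent",
    "array": "row or set of similar objects such as computers or switches",
}

def lesk(sentence, senses=SENSES):
    """Return the sense whose gloss overlaps most with the sentence,
    or None when no gloss shares a content word with it."""
    context = set(re.findall(r"[a-z]+", sentence.lower())) - STOPWORDS
    scores = {
        name: len(context & (set(gloss.split()) - STOPWORDS))
        for name, gloss in senses.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(lesk("The fisherman jumped off the bank and into the water."))
print(lesk("Back in the day, we had an entire bank of computers devoted to this problem."))
print(lesk("The bank down the street was robbed!"))
```

The third sentence returns None because nothing in it overlaps with any gloss, even though a human reader resolves it at once; that gap is the knowledge a computer lacks.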

48 Syntactic ambiguity
- There are often multiple possible parse trees for a given sentence.
- Choosing the most appropriate one usually requires semantic and contextual information.
- Specific problem components here are:
  1. Sentence boundary disambiguation
  2. Imperfect input
  3. Foreign or regional accents, etc.

49 Syntactic ambiguity

50 Statistical NLP
• Statistical NLP uses stochastic, probabilistic, and statistical methods to resolve some of the difficulties of NLP.
• Methods for disambiguation often involve the use of corpora and Markov models.
• The technology for statistical NLP comes from machine learning and data mining, both of which involve learning from data.

51 Statistical NLP – Corpus
• Corpus: a text collection for linguistic purposes
• Tokens: How many words are contained in Tom Sawyer?
• Types: How many different words are contained in Tom Sawyer?
• Hapax legomena: words appearing only once
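The three corpus measures can be computed with a few lines of code. The following is a minimal sketch, assuming a deliberately naive tokenizer (lowercased runs of letters); the function name `corpus_stats` and the sample sentence are made up for illustration and are not from Tom Sawyer.

```python
import re
from collections import Counter

def corpus_stats(text):
    """Return (number of tokens, number of types, sorted hapax legomena)."""
    # Naive tokenization: lowercase the text and keep runs of letters.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    hapaxes = sorted(word for word, n in counts.items() if n == 1)
    return len(tokens), len(counts), hapaxes

sample = "Tom said that Tom and Huck saw the cave and the river"
n_tokens, n_types, hapaxes = corpus_stats(sample)
print(n_tokens, n_types, hapaxes)
```

On a full text, `Counter.most_common(n)` on the same counts would list the `n` most frequent words, which is how a frequency table of function words can be produced.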

52 Statistical NLP – Word Counts
• The most frequent words are function words

word   freq    word   freq
the    3332    in     906
and    2972    that   877
a      1775    he     877
to     1725    I      783
of     1440    his    772
was    1161    you    686
it     1027    Tom    679

53 Major Tasks in NLP
• Speech Recognition
• Natural Language Generation
• Machine Translation
• Information Retrieval
• Information Extraction
• Text Simplification
• Automatic Summarization
• Foreign Language Reading & Writing Aid

54 Speech Recognition
• It is the process of converting a speech signal to a sequence of words by means of an algorithm (implemented as a computer program).
• Applications are:
1. Voice dialing
2. Call routing
3. Simple data entry
4. Preparation of structured documents

55 Natural Language Generation
• It is the task of generating natural language from a machine representation system such as a knowledge base or a logical form.
Ex: Choose randomly among outputs:
– Visitant which came into the place where it will be Japanese has admired that there was Mount Fuji.
• Top 10 outputs according to bigram probabilities:
– Visitors who came in Japan admire Mount Fuji.
– Visitors who came in Japan admires Mount Fuji.
– Visitors who arrived in Japan admire Mount Fuji.
– A visitor who came in Japan admire Mount Fuji.
– The visitor who came in Japan admire Mount Fuji.
– Visitors who came in Japan admire Mount Fuji.
– The visitor who came in Japan admires Mount Fuji.
– Mount Fuji is admired by a visitor who came in Japan.
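Ranking candidate outputs by bigram probability can be sketched as follows. This is a toy illustration under stated assumptions: the four training sentences, the add-one smoothing, and the names `train`, `log_prob`, and `rank` are all invented for this example and do not reproduce the system that generated the outputs above.

```python
import math
from collections import Counter

# Tiny hand-written training corpus (an assumption for this sketch).
TRAINING = [
    "visitors admire the mountain",
    "visitors who came to japan admire mount fuji",
    "the visitor admires mount fuji",
    "many visitors admire mount fuji",
]

def train(sentences):
    """Count bigrams and their left-hand histories, with sentence markers."""
    histories, bigrams, vocab = Counter(), Counter(), {"<s>", "</s>"}
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(words)
        histories.update(words[:-1])
        bigrams.update(zip(words, words[1:]))
    return histories, bigrams, len(vocab)

def log_prob(sentence, model):
    """Log probability of a sentence under an add-one smoothed bigram model."""
    histories, bigrams, vocab_size = model
    words = ["<s>"] + sentence.lower().rstrip(".").split() + ["</s>"]
    return sum(math.log((bigrams[(a, b)] + 1) / (histories[a] + vocab_size))
               for a, b in zip(words, words[1:]))

def rank(candidates, model):
    """Sort candidate sentences from most to least probable."""
    return sorted(candidates, key=lambda c: log_prob(c, model), reverse=True)

model = train(TRAINING)
for candidate in rank(["Visitors admires Mount Fuji.",
                       "Visitors admire Mount Fuji."], model):
    print(candidate)
```

Because "visitors admire" occurs in the training data and "visitors admires" does not, the grammatical candidate scores higher, which is the effect the ranked list above relies on.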

56 Conclusion
• Overview of some probabilistic and machine learning methods for NLP
• Also very relevant to bioinformatics!
• Analogy between parsing a sentence and parsing a biological string (DNA, protein, mRNA, …)

57 Sense, Communicate, Actuate

58 Machine Translation
Machine Translation (MT) is a sub-field of computational linguistics that investigates the use of computer software to translate text or speech from one natural language to another.

59 Issues in Machine Translation
• Text-to-text machine translation
• Speech-to-speech machine translation
• Most of the work has addressed pairs of widely spoken languages, such as English-French and English-Chinese
• How to translate text?
• Learn from previously translated data
• Need parallel corpora
• French-English, Chinese-English have the Hansards
• Reasonable translations
• Chinese-Hindi – no such tools available today!

60 Issues in Machine Translation
• How to obtain parallel texts?
• From the Web! How?
• From Web users! How?
• Once we have the texts, how to get the most out of them?
• Word alignments
• Obtain lexicons
• Import knowledge from well-studied languages
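As a rough illustration of how a lexicon might be obtained from parallel text, the sketch below pairs each source word with the target word it co-occurs with most strongly, using the Dice coefficient. This is a deliberately simple stand-in for real statistical word alignment models; the three sentence pairs and the function name `dice_lexicon` are assumptions made for this example.

```python
from collections import Counter
from itertools import product

# Toy English-French parallel corpus (hand-written for this sketch).
PAIRS = [
    ("the house", "la maison"),
    ("the car", "la voiture"),
    ("a house", "une maison"),
]

def dice_lexicon(pairs):
    """Map each source word to the target word with the highest Dice score.

    Dice(s, t) = 2 * cooccurrences(s, t) / (count(s) + count(t)),
    where counts are the number of sentence pairs containing the word.
    """
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        joint.update(product(src_words, tgt_words))
    lexicon = {}
    for s in src_count:
        lexicon[s] = max(
            tgt_count,
            key=lambda t: 2 * joint[(s, t)] / (src_count[s] + tgt_count[t]),
        )
    return lexicon

print(dice_lexicon(PAIRS))
```

Even on three sentence pairs the co-occurrence statistics separate "house"/"maison" from "the"/"la", which hints at why larger parallel corpora are so valuable.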

61 Information Extraction
• It is a type of information retrieval whose goal is to automatically extract structured or semi-structured information from unstructured machine-readable documents.
• Its significance is determined by the growing amount of information available in unstructured form, for instance on the Internet.

62 Issues in Information Extraction
• “There was a group of about 8-9 people close to the entrance on Highway 75”
• Who? “8-9 people”
• Where? “Highway 75”
• Extract information
• Detect new patterns:
• Detect hacking / hidden information / etc.
• Government and military agencies put a lot of money into IE research
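A pattern-based extractor for the "Who?" and "Where?" slots of the example sentence could look like the sketch below. The two regular expressions and the slot names are assumptions chosen to fit this one sentence; a practical system would need far broader patterns or a learned model.

```python
import re

# Hand-written patterns for this example only (hypothetical slot names).
PATTERNS = {
    "who": re.compile(r"\b\d+(?:-\d+)?\s+people\b", re.IGNORECASE),
    "where": re.compile(r"\bhighway\s+\d+\b", re.IGNORECASE),
}

def extract(text):
    """Return a dict mapping each slot to its first matching span, or None."""
    record = {}
    for slot, pattern in PATTERNS.items():
        match = pattern.search(text)
        record[slot] = match.group(0) if match else None
    return record

sentence = ("There was a group of about 8-9 people close to "
            "the entrance on Highway 75")
print(extract(sentence))
```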

63 Information Retrieval
Information Retrieval (IR) is the science of searching
• for information in documents,
• for documents themselves,
• for metadata, or
• within databases (of any kind).

64 Issues in Information Retrieval
• Index meaning
• A search for plant (= living organism) should not retrieve texts with plant (= industrial plant)
• But it should retrieve documents including “flora” or other related terms
• Index parsed relations
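Indexing by meaning rather than by surface word can be sketched with a small concept index. Everything here is a toy assumption: the concept labels, the cue-word lists used to guess the sense of "plant", and the three documents are invented for illustration.

```python
from collections import defaultdict

# Words that map directly to a concept (hypothetical labels).
CONCEPTS = {
    "flora": "PLANT_ORGANISM",
    "vegetation": "PLANT_ORGANISM",
    "factory": "PLANT_INDUSTRIAL",
}

# Cue words used to guess which sense of "plant" a document uses.
PLANT_CUES = {
    "PLANT_ORGANISM": {"leaves", "grows", "garden", "water"},
    "PLANT_INDUSTRIAL": {"workers", "steel", "production", "factory"},
}

def concepts_in(text):
    """Return the set of concepts mentioned in a document."""
    words = set(text.lower().split())
    found = {CONCEPTS[w] for w in words if w in CONCEPTS}
    if "plant" in words:
        # Pick the sense whose cue words overlap the document the most.
        found.add(max(PLANT_CUES, key=lambda c: len(words & PLANT_CUES[c])))
    return found

def build_index(docs):
    """Build an inverted index from concept to document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for concept in concepts_in(text):
            index[concept].add(doc_id)
    return index

DOCS = {
    1: "the plant grows new leaves in spring",
    2: "the steel plant hired more workers",
    3: "alpine flora of the region",
}
index = build_index(DOCS)
print(sorted(index["PLANT_ORGANISM"]))
```

A query for the living-organism sense then retrieves the "flora" document and skips the steel plant, even though only the latter shares the literal word "plant" with document 1.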

65 Issues in Information Retrieval
• Retrieve specific information
• Question Answering
• “What is the height of Mount Everest?”
• About 29,000 feet
• Current state of the art: 40-50%
• Improve precision with the use of more common-sense knowledge
• Perform domain-specific question answering

66 Issues in Information Retrieval
• Find information across languages!
• Cross-Language Information Retrieval
• “What is the minimum age requirement for car rental in Italy?”
• Also search Italian texts for “età minima per noleggio macchine”
• Integrate a large number of languages
• Integrate into high-performance IR engines

67 Automatic Summarization
• It is the creation of a shortened version of a text by a computer program.
• As access to data has increased, so has interest in automatic summarization. One example of summarization technology in use is search engines such as Google.
• Technologies that can produce a coherent summary of any kind of text need to take into account several variables, such as length, writing style, and syntax.
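One very simple way to shorten a text is extractive: score each sentence by how frequent its content words are in the whole text and keep the top-scoring sentences. The sketch below assumes that approach; the stop-word list, the sentence splitter, and the function name `summarize` are illustrative choices, and the method ignores style and syntax entirely.

```python
import re
from collections import Counter

STOP_WORDS = {"a", "an", "the", "of", "was", "is", "each", "and", "to", "in"}

def content_words(text):
    """Lowercased words of the text with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def summarize(text, max_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(content_words(text))

    def score(sentence):
        words = content_words(sentence)
        # Average word frequency, so long sentences are not favoured.
        return sum(freq[w] for w in words) / max(len(words), 1)

    top = sorted(sentences, key=score, reverse=True)[:max_sentences]
    return [s for s in sentences if s in top]

text = ("Summarization shortens a text. "
        "Summarization systems score each sentence of the text. "
        "The weather was nice.")
print(summarize(text, max_sentences=1))
```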

68 Foreign Language Writing Aid
• It is a computer program that assists a non-native language user in their target language.
• Assistive operations can be classified into two categories: on-the-fly prompts and post-writing checks.
• Assisted aspects of writing include lexical syntax, lexical semantics, idiomatic expression transfer, etc.
• Online dictionaries can also be considered a type of foreign language writing aid.

69 Language & speech technology have advanced rapidly in the last decades.

70 EveR-2 Muse, a robot version of a Korean woman in her twenties (Eve + R for robot), can hold a conversation, sing a song, make eye contact, and express anger, sorrow, and joy. But according to her creator, most Koreans found her homely in comparison to her predecessor.

71 Achievements of AI/NLP
• Sphinx can recognise continuous speech without training for each speaker. It operates in near real time using a vocabulary of 1000 words and has 94% word accuracy.
• Deep Thought is an international grandmaster chess player.
• Navlab is a truck that can drive along a road at 55 mph in normal traffic.
• Carlton and United Breweries use an AI planning system to plan production of their beer.
• Natural language interfaces to databases can be obtained on a PC.
• Machine learning methods have been used to build expert systems.
• Expert systems are used regularly in finance, medicine, manufacturing, and agriculture.

72 If this dream comes alive…
• Even a person with no knowledge of computers will be able to interact with one in everyday conversation.
• Almost all systems will be automated.
• Many problems will have found a solution.
• No one will need to learn computer languages any more; instead, people will interact with computers in their own natural (regional) languages.
• It would be a matter of jubilation for the world as a whole…

73 So let's await that wonderful day and work in this direction…