NATURAL LANGUAGE PROCESSING

Applications  Classification ( spam )  Clustering ( news stories, twitter )  Input correction ( spell checking )  Sentiment analysis ( product reviews )  Information retrieval ( web search )  Question answering ( web search, IBM ’ s Watson )  Machine translation ( english to spanish )  Speech recognition ( Siri )

Language Models
Two ways to think about modeling language:
1. Sequences of letters/words: probabilistic, word-based, learned
2. Tree-based grammar models: logical, boolean, often hand-coded

Bag-of-words Model
Transform documents into sparse numeric vectors, then manipulate them with linear algebra operations

Bag - of - words Model  One of the most common ways to deal with documents  Forgets everything about the linguistic structure within the text  Useful for classification, clustering, visualization etc.

Similarity between document vectors
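The slide does not name a measure, but cosine similarity, the cosine of the angle between two document vectors, is the standard choice for bag-of-words vectors because it ignores document length. A sketch reusing the bag_of_words helper from the previous example:

    import math

    def cosine_similarity(v1, v2):
        # Dot product over shared words, divided by the product of the norms
        dot = sum(v1[w] * v2[w] for w in v1 if w in v2)
        norm1 = math.sqrt(sum(c * c for c in v1.values()))
        norm2 = math.sqrt(sum(c * c for c in v2.values()))
        return dot / (norm1 * norm2)

    print(cosine_similarity(bag_of_words("to be or not to be"),
                            bag_of_words("to be is to do")))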

Bag of Words with Word Weighting
 A word is more important if it appears several times in the target document
 A word is more important if it appears in fewer documents
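These two intuitions correspond to the term-frequency and inverse-document-frequency components of TF-IDF weighting. The slides never name the scheme, so treat that label as an assumption, though the weights in the next example look TF-IDF-like. A minimal sketch:

    import math
    from collections import Counter

    def tfidf(docs):
        # docs: list of tokenized documents (lists of words)
        n = len(docs)
        # Document frequency: how many documents contain each word
        df = Counter(word for doc in docs for word in set(doc))
        weights = []
        for doc in docs:
            tf = Counter(doc)
            # Weight grows with in-document count, shrinks with corpus-wide spread
            weights.append({w: tf[w] * math.log(n / df[w]) for w in tf})
        return weights

    docs = [["trump", "resorts", "shares"], ["resorts", "casino"], ["shares", "stock"]]
    print(tfidf(docs)[0])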

Example document and its vector representation

Original text:
TRUMP MAKES BID FOR CONTROL OF RESORTS
Casino owner and real estate developer Donald Trump has offered to acquire all Class B common shares of Resorts International Inc, a spokesman for Trump said. The estate of late Resorts chairman James M. Crosby owns 340,783 of the 752,297 Class B shares. Resorts also has about 6,432,000 Class A common shares outstanding. Each Class B share has 100 times the voting power of a Class A share, giving the Class B stock about 93 pct of Resorts' voting power.

Bag-of-words representation (high-dimensional sparse vector):
[RESORTS:0.624] [CLASS:0.487] [TRUMP:0.367] [VOTING:0.171] [ESTATE:0.166] [POWER:0.134] [CROSBY:0.134] [CASINO:0.119] [DEVELOPER:0.118] [SHARES:0.117] [OWNER:0.102] [DONALD:0.097] [COMMON:0.093] [GIVING:0.081] [OWNS:0.080] [MAKES:0.078] [TIMES:0.075] [SHARE:0.072] [JAMES:0.070] [REAL:0.068] [CONTROL:0.065] [ACQUIRE:0.064] [OFFERED:0.063] [BID:0.063] [LATE:0.062] [OUTSTANDING:0.056] [SPOKESMAN:0.049] [CHAIRMAN:0.049] [INTERNATIONAL:0.041] [STOCK:0.035] [YORK:0.035] [PCT:0.022] [MARCH:0.011]

What happens if some words do not appear in the training corpus?
 Smoothing: assigning very low, but greater than zero, probabilities to previously unseen words
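The slide does not name a particular scheme; the simplest is add-one (Laplace) smoothing, sketched here under that assumption:

    from collections import Counter

    def laplace_prob(word, counts, vocab_size):
        # Add 1 to every count so unseen words get a small nonzero probability
        total = sum(counts.values())
        return (counts[word] + 1) / (total + vocab_size)

    counts = Counter("to be or not to be".split())
    print(laplace_prob("to", counts, vocab_size=10000))      # seen word
    print(laplace_prob("hamlet", counts, vocab_size=10000))  # unseen word, still > 0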

Wouldn't it be helpful to reason about word order?

n-grams
 A probabilistic language model based on contiguous sequences of n items
 n = 1 is a unigram model ("bag of words")
 n = 2 is a bigram model
 n = 3 is a trigram model
 … etc.
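Formally, an n-gram model approximates the exact chain-rule factorization of a sentence probability by conditioning each word on only the preceding n − 1 words:

    P(w1, …, wm) = ∏ P(wi | w1, …, wi−1)        (chain rule, exact)
                 ≈ ∏ P(wi | wi−n+1, …, wi−1)    (n-gram approximation)

For a bigram model (n = 2) this reduces to P(w1, …, wm) ≈ ∏ P(wi | wi−1).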

 Source text: to be or not to be
 Unigrams: to, be, or, not, to, be
 Bigrams: to be, be or, or not, not to, to be
 Trigrams: to be or, be or not, or not to, not to be
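A minimal sketch of the extraction that produces these lists (assuming simple whitespace tokenization):

    def ngrams(text, n):
        words = text.split()
        # Slide a window of length n across the word sequence
        return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

    text = "to be or not to be"
    print(ngrams(text, 1))  # ['to', 'be', 'or', 'not', 'to', 'be']
    print(ngrams(text, 2))  # ['to be', 'be or', 'or not', 'not to', 'to be']
    print(ngrams(text, 3))  # ['to be or', 'be or not', 'or not to', 'not to be']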

 In September 2006 Google announced the availability of an n-gram corpus:
http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
 Some statistics of the corpus:
File sizes: approx. 24 GB compressed (gzip'ed) text files
Number of tokens: 1,024,908,267,229
Number of sentences: 95,119,665,584
Number of unigrams: 13,588,391
Number of bigrams: 314,843,401
Number of trigrams: 977,069,902
Number of fourgrams: 1,313,818,354
Number of fivegrams: 1,176,470,663

 Sample 3-grams from the corpus beginning with "ceramics" (each followed by its count):
ceramics collectables collectibles 55
ceramics collectables fine 130
ceramics collected by 52
ceramics collectible pottery 50
ceramics collectibles cooking 45
ceramics collection, 144
ceramics collection. 247
ceramics collection 120
ceramics collection and 43
ceramics collection at 52
ceramics collection is 68
ceramics collection of 76
ceramics collection | 59
ceramics collections, 66
ceramics collections. 60
ceramics combined with 46
ceramics come from 69
ceramics comes from 660
ceramics community, 109
ceramics community. 212
ceramics community for 61
ceramics companies. 53
ceramics companies consultants 173
ceramics company ! 4432
ceramics company, 133
ceramics company. 92
ceramics company 41
ceramics company facing 145
ceramics company in 181
ceramics company started 137
ceramics company that 87
ceramics component ( 76
ceramics composed of 85
 Sample 4-grams beginning with "serve as the":
serve as the incoming 92
serve as the incubator 99
serve as the independent 794
serve as the index 223
serve as the indication 72
serve as the indicator 120
serve as the indicators 45
serve as the indispensable 111
serve as the indispensible 40
serve as the individual 234
serve as the industrial 52
serve as the industry 607
serve as the info 42
serve as the informal 102
serve as the information 838
serve as the informational 41
serve as the infrastructure 500
serve as the initial 5331
serve as the initiating 125
serve as the initiation 63
serve as the initiator 81
serve as the injector 56
serve as the inlet 41
serve as the inner 87
serve as the input 1323
serve as the inputs 189
serve as the insertion 49
serve as the insourced 67
serve as the inspection 43
serve as the inspector 66
serve as the inspiration 1390
serve as the installation 136
serve as the institute 187
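Counts like these are exactly what an n-gram model conditions on: the probability that "initial" follows "serve as the" is estimated as 5331 divided by the total count of all 4-grams beginning "serve as the". A sketch of that maximum-likelihood estimate, using a small toy subset of the listing above:

    # Toy subset of the "serve as the ..." continuation counts listed above
    counts = {"initial": 5331, "inspiration": 1390, "input": 1323, "index": 223}

    def next_word_prob(word, counts):
        # Maximum-likelihood estimate: this continuation / all continuations seen
        return counts[word] / sum(counts.values())

    print(next_word_prob("initial", counts))  # ~0.64 within this toy subset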

Sentences generated from n-gram models trained on Shakespeare:
 Unigrams:
Every enter now severally so, let
Hill he late speaks; or! a more to leg less first you enter
 Bigrams:
What means, sir. I confess she? then all sorts, he is trim, captain.
Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry.
 Trigrams:
Sweet prince, Falstaff shall die.
This shall forbid it should be branded, if renown made it empty.

 Quadrigrams (4-grams):
What! I will go seek the traitor Gloucester.
Will you not tell me who I am?
 Note: As we increase the value of n, the accuracy of an n-gram model increases, since the choice of the next word becomes increasingly constrained.
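Samples like the ones above come from repeatedly drawing the next word from the model's conditional distribution. A minimal bigram-generation sketch (trained here on one toy sentence, so its output is far less varied than the Shakespeare samples):

    import random
    from collections import defaultdict

    def train_bigrams(words):
        # Map each word to the list of words observed immediately after it
        followers = defaultdict(list)
        for w1, w2 in zip(words, words[1:]):
            followers[w1].append(w2)
        return followers

    def generate(followers, start, length=6):
        out = [start]
        for _ in range(length - 1):
            choices = followers.get(out[-1])
            if not choices:
                break
            # Sampling from the raw list weights choices by bigram count
            out.append(random.choice(choices))
        return " ".join(out)

    model = train_bigrams("to be or not to be".split())
    print(generate(model, "to"))  # e.g. "to be or not to be"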