1 Introduction to Computational Natural Language Learning
Linguistics 79400 (Under: Topics in Natural Language Processing)
Computer Science 83000 (Under: Topics in Artificial Intelligence)
The Graduate School of the City University of New York, Fall 2001
William Gregory Sakas
Hunter College, Department of Computer Science
Graduate Center, PhD Programs in Computer Science and Linguistics, The City University of New York

2 Suppose we have a single search word, "egg." All documents in our corpus (e.g. web pages) are organized into the following categories: dishwasher, poultry and pregnancy. What is the likelihood that the keyword is intended to "key into" each of the 3 categories of documents? That is, which category would be the best prediction for a search engine to make? This can easily be based on word frequencies in a bag-of-words approach. Say, for example:

In documents classified as...    the word "egg" appears...
dishwasher related               379 times
poultry related                  1,617 times
pregnancy related                824 times
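A minimal sketch of this raw-frequency prediction in Python (the counts are the illustrative figures from the table above):

```python
# Occurrences of the search word "egg" per document category (figures from the slide).
egg_counts = {"dishwasher": 379, "poultry": 1_617, "pregnancy": 824}

# Best single-word prediction: the category in which "egg" occurs most often.
best_category = max(egg_counts, key=egg_counts.get)
print(best_category)  # -> poultry
```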

3 Clearly, without any other words, poultry would be the best prediction. Let's formalize this a bit. We want

argmax_c p( c | "egg" )

That is, the category c that maximizes the probability of c given "egg." Let's take an example: what's p(poultry | "egg")? Take all occurrences of "egg" from all documents in our collection (in our example that would be 2,820 = 379 + 1,617 + 824) and partition them into their categories.

[Figure: the 2,820 occurrences of "egg", partitioned into the number of occurrences of "egg" in documents in the dishwasher category, the poultry category and the pregnancy category.]

4 p(dishwasher | "egg") = 379/2,820
p(poultry | "egg") = 1,617/2,820
p(pregnancy | "egg") = 824/2,820

In fact, since the denominator is the same, when calculating the argmax we can just drop it and simply take the maximum number of occurrences of "egg" in each category. That is, we want:

argmax_c count( "egg" in category c )

Unfortunately, calculating this is quite expensive. We have to go through EVERY document in every category. So instead we apply Bayes' Rule:

p( A | B ) = p( B | A ) p( A ) / p( B )

or in our example:

p( category | word ) = p( word | category ) p( category ) / p( word )

But since we are finding the maximum probability, we can drop the denominator:

argmax_{c ∈ categories} p( c | word ) = argmax_{c ∈ categories} p( word | c ) p( c )
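A small numerical check of this identity (only the "egg" counts come from the slide; the per-category word totals are invented for illustration, since the slides do not give them):

```python
# Counts of "egg" per category (from the slide) and hypothetical per-category word totals.
egg_counts = {"dishwasher": 379, "poultry": 1_617, "pregnancy": 824}
total_words = {"dishwasher": 50_000, "poultry": 80_000, "pregnancy": 70_000}  # invented
corpus_size = sum(total_words.values())

def posterior_direct(c):
    # p(c | "egg") = count of "egg" in category c / count of "egg" anywhere
    return egg_counts[c] / sum(egg_counts.values())

def bayes_numerator(c):
    # p("egg" | c) * p(c): the right-hand side with the denominator p("egg") dropped
    p_word_given_c = egg_counts[c] / total_words[c]
    p_c = total_words[c] / corpus_size
    return p_word_given_c * p_c

categories = list(egg_counts)
print(max(categories, key=posterior_direct))  # poultry
print(max(categories, key=bayes_numerator))   # poultry -- the argmax is the same
```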

5 It is easy to extend a single word to multiple words, and we get the basic version of the NAIVE BAYES algorithm:

(1) argmax_{c ∈ categories} p( words | c ) p( c )

p( c ) is simply the probability of category c being chosen independent of any words, estimated for example by the formula:

p( c ) = (total number of words in all documents categorized as c) / (total number of words in the entire corpus)

(BTW, why is (1) easier to compute than argmax_{c ∈ categories} p( c | words )? Because in order to compute the second equation we would need to compute 2^n entries, where n = the number of words, to obtain a joint probability distribution. See the next slide. The more computer-oriented students: see me after class or e-mail me if interested in further discussion.)
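A short sketch of this word-count estimate of the prior (the per-category totals are the same invented figures as above, not values from the slides):

```python
# p(c) per the formula above: words in documents of category c / words in the whole corpus.
total_words = {"dishwasher": 50_000, "poultry": 80_000, "pregnancy": 70_000}  # invented
corpus_size = sum(total_words.values())

priors = {c: n / corpus_size for c, n in total_words.items()}
print(priors)  # {'dishwasher': 0.25, 'poultry': 0.4, 'pregnancy': 0.35}
```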

, "egg" Pregnancy Poultry Dishwashers p(Poultry | "egg") = 1,617 / ( , )

7 In any event, the best predictor of a category, given a bunch of words, is given by (1) above. A final equation: if words contains the words w_1, w_2, ..., w_n, then (1) can be rewritten as below, assuming the words are independently likely to occur (pretty naive, but it works fairly well in practice):

(2) argmax_{c ∈ categories} [ p(w_1 | c) * p(w_2 | c) * ... * p(w_n | c) ] * p( c )

And this is the way the implementation works:

A) A corpus of documents is fed to the learner; the words are counted up and stored so that the probabilities in (2) can be effectively calculated.
B) An unseen document is given to the learner, (2) is calculated where w_1, w_2, ..., w_n are the words in the document, and the category that maximizes (2) is returned by the learner.
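A self-contained sketch of steps A and B, implementing equation (2) on a toy corpus (the documents are invented; the add-one smoothing of p(w | c) is an extra beyond the slide, added so that an unseen word does not zero out the whole product):

```python
from collections import Counter, defaultdict

# Step A: count words per category from a small labeled corpus (invented toy data).
corpus = [
    ("poultry", "the egg hatched and the chicken grew"),
    ("dishwasher", "rinse the egg residue before loading the dishwasher"),
    ("pregnancy", "the egg is fertilized during early pregnancy"),
]

word_counts = defaultdict(Counter)   # word_counts[c][w] = count of w in category c
category_sizes = Counter()           # total number of words in category c
for category, text in corpus:
    words = text.lower().split()
    word_counts[category].update(words)
    category_sizes[category] += len(words)

corpus_size = sum(category_sizes.values())
vocab_size = len({w for counts in word_counts.values() for w in counts})

def classify(document):
    """Step B: return argmax_c [ prod_i p(w_i | c) ] * p(c), i.e. equation (2)."""
    words = document.lower().split()
    best_category, best_score = None, -1.0
    for c in category_sizes:
        score = category_sizes[c] / corpus_size          # p(c), word-count prior
        for w in words:
            # add-one (Laplace) smoothed estimate of p(w | c)
            score *= (word_counts[c][w] + 1) / (category_sizes[c] + vocab_size)
        if score > best_score:
            best_category, best_score = c, score
    return best_category

print(classify("the chicken laid an egg"))  # "poultry" on this toy data
```

In practice the product in (2) is usually computed as a sum of log probabilities to avoid numerical underflow on long documents.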

8 The vector space model.

         speech   language   processing
Doc 1      6         0           1
Doc 2      0         5           1
Doc 3      1         2           1

Shorter notation:
Doc 1 = <6, 0, 1>
Doc 2 = <0, 5, 1>
Doc 3 = <1, 2, 1>

But we need to normalize. For example, two documents with the same relative word frequencies but very different lengths should be considered very similar. Easy to do.
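A sketch of these vectors with length normalization (cosine similarity is used here as one standard way to compare normalized vectors; the slide itself only says the vectors need to be normalized):

```python
import math

# Term-count vectors over the vocabulary (speech, language, processing), from the table.
docs = {
    "Doc 1": [6, 0, 1],
    "Doc 2": [0, 5, 1],
    "Doc 3": [1, 2, 1],
}

def normalize(vec):
    """Scale a count vector to unit length so that document length stops mattering."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec] if length else vec

def cosine(u, v):
    """Cosine similarity: the dot product of the two unit-length vectors."""
    u, v = normalize(u), normalize(v)
    return sum(a * b for a, b in zip(u, v))

# A document ten times as long, with the same proportions, is maximally similar.
print(cosine(docs["Doc 1"], [60, 0, 10]))    # ~1.0
print(cosine(docs["Doc 1"], docs["Doc 2"]))  # ~0.03: little vocabulary overlap
```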