1
CS460/IT632 Natural Language Processing / Language Technology for the Web
Lecture 3 (10/01/06)
Prof. Pushpak Bhattacharyya, IIT Bombay
Statistical Formulation of the Part of Speech (PoS) Tagging Problem
2
Techniques for PoS Tagging
Statistical – use probabilistic methods
Rule-based – use linguistic / machine-learnt rules for tagging
3
Uses of PoS Tagging
Parsing
Machine Translation
Question Answering
Text-to-Speech systems – homography: same orthography (spelling) but different pronunciation, e.g. 'lead' as a verb vs. as a noun
4
Noisy Channel Based Modeling
[Diagram: the tag sequence C passes through a noisy channel and is observed as the word sequence W]
C* = best tag sequence = argmax_C P(C|W)
5
Applying Bayes' Theorem
C* = argmax_C P(C|W) = argmax_C P(C) · P(W|C)
where P(C) is the prior and P(W|C) is the likelihood
6
Prior – Bigram Probability
P(C) = P(C_1|C_0) · P(C_2|C_1 C_0) · P(C_3|C_2 C_1 C_0) · … · P(C_n|C_{n-1} C_{n-2} … C_0)
k-gram approximation (Markov assumption); with k = 2 (bigram assumption):
P(C) = ∏_{i=1 to n} P(C_i|C_{i-1})
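A minimal sketch of the bigram prior above, assuming the transition probabilities are already available; the dictionary, its values, and the start symbol C0 are purely illustrative, not from the lecture:

```python
# Sketch: bigram prior P(C) = product over i of P(C_i | C_{i-1}).
# The transition table below is a made-up placeholder.
transition = {
    ("C0", "NNS"): 0.00083,   # P(NNS | C0), with C0 the sentence-start symbol
    ("NNS", "VBP"): 0.12,
    ("VBP", "JJ"): 0.05,
}

def bigram_prior(tags, start="C0"):
    """P(C) under the bigram (k = 2) Markov assumption."""
    prob, prev = 1.0, start
    for tag in tags:
        prob *= transition.get((prev, tag), 0.0)
        prev = tag
    return prob

print(bigram_prior(["NNS", "VBP", "JJ"]))
```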
7
Likelihood – Lexical Generation Probability
P(W|C) = P(W_1|C_1 C_2 … C_n) · P(W_2|W_1 C_1 C_2 … C_n) · … · P(W_n|W_{n-1} W_{n-2} … W_1 C_1 C_2 … C_n)
Approximation: W_i depends only on C_i, so P(W_i|W_{i-1} W_{i-2} … W_1 C_1 C_2 … C_n) = P(W_i|C_i)
Hence P(W|C) = ∏_{i=1 to n} P(W_i|C_i), and
C* = argmax_C ∏_{i=1 to n} P(C_i|C_{i-1}) · P(W_i|C_i)
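As a sketch of this final formula, the snippet below scores every candidate tag assignment by the product of transition and lexical probabilities and keeps the best one; the table names (transition, emission) and per-word candidate sets are illustrative assumptions, and the brute-force enumeration is exponential (the Viterbi algorithm computes the same argmax efficiently):

```python
from itertools import product

def score(words, tags, transition, emission, start="C0"):
    # prod_i P(C_i | C_{i-1}) * P(W_i | C_i) for one candidate tag sequence
    prob, prev = 1.0, start
    for w, t in zip(words, tags):
        prob *= transition.get((prev, t), 0.0) * emission.get((t, w), 0.0)
        prev = t
    return prob

def best_tag_sequence(words, candidate_tags, transition, emission):
    # candidate_tags[i] is the set of possible tags for words[i]
    return max(product(*candidate_tags),
               key=lambda tags: score(words, tags, transition, emission))
```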
8
Tagging Situation
Input – "Humans are fond of animals and birds. They keep pets at home."
Output – Humans_NNS are_VBP fond_JJ of_IN animals_NNS and_CC birds_NNS ._. They_PRP keep_VB pets_NNS at_IN home_NN ._.
Note: the tags are Penn Treebank tags.
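For reference, output of this kind can be reproduced with an off-the-shelf tagger such as NLTK's; a small sketch, assuming NLTK and its tokenizer/tagger resources are installed (resource names may vary by NLTK version):

```python
import nltk
# Requires the "punkt" and "averaged_perceptron_tagger" resources,
# e.g. via nltk.download(...).
text = "Humans are fond of animals and birds. They keep pets at home."
tokens = nltk.word_tokenize(text)
print(nltk.pos_tag(tokens))
# e.g. [('Humans', 'NNS'), ('are', 'VBP'), ('fond', 'JJ'), ...]
```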
9
Formulating the Problem
Humans   are   fond   of   animals
C'_k1 C'_k2 … C'_k10
Let C'_ki be the possible tags for the corresponding words (each word above has its own set of candidate tags)
10
Formulating the Problem (contd.)
Suppose the word "Humans" has two possible tags – NNS and JJ. Then the probabilities involved are:
P(NNS|C_0) = 0.00083
P(JJ|C_0) = 0.000074
P(Humans|NNS) = 0.0000093
P(Humans|JJ) = 0.0000001
From the start state C_0 there are two paths, scored P(NNS|C_0)·P(Humans|NNS) and P(JJ|C_0)·P(Humans|JJ). Should we choose the maximum product path?
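Plugging in the numbers above, the two path scores work out as follows, which answers the "maximum product path" question for this word:

```python
# Path scores from the start state C0 for the word "Humans".
p_nns = 0.00083 * 0.0000093     # P(NNS|C0) * P(Humans|NNS) ≈ 7.72e-09
p_jj  = 0.000074 * 0.0000001    # P(JJ|C0)  * P(Humans|JJ)  ≈ 7.40e-12
print("NNS path:", p_nns, "JJ path:", p_jj,
      "->", "NNS" if p_nns > p_jj else "JJ")
```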
11
Calculating Probabilities
We calculate the probabilities by counting:
P(NNS|C_0) = #(NNS following C_0) / #C_0
P(Humans|NNS) = #(Humans tagged NNS) / #NNS
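A minimal sketch of these counts over a tagged corpus; the toy corpus, the start symbol "C0", and the counter names are illustrative assumptions:

```python
from collections import Counter

# Each sentence is a list of (word, tag) pairs; "C0" marks the sentence start.
corpus = [[("Humans", "NNS"), ("are", "VBP"), ("fond", "JJ")]]

tag_count, transition_count, emission_count = Counter(), Counter(), Counter()
for sentence in corpus:
    prev = "C0"
    tag_count["C0"] += 1
    for word, tag in sentence:
        transition_count[(prev, tag)] += 1   # counts for P(tag | prev)
        emission_count[(tag, word)] += 1     # counts for P(word | tag)
        tag_count[tag] += 1
        prev = tag

# P(NNS|C0) = #(NNS following C0) / #C0 ; P(Humans|NNS) = #(Humans, NNS) / #NNS
p_trans = transition_count[("C0", "NNS")] / tag_count["C0"]
p_emit  = emission_count[("NNS", "Humans")] / tag_count["NNS"]
print(p_trans, p_emit)
```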
12
Languages – Rich and Poor
Resource-rich languages have annotated corpora, tools, language knowledge bases, etc.
Resource-poor languages lack these resources.
13
Theoretical Foundations
Hidden Markov Model (HMM) – a non-deterministic finite state machine with a probability associated with each arc
Viterbi Algorithm – will be covered in the coming lectures
[Diagram: a two-state machine (S_0, S_1) whose arcs carry output symbols with probabilities, e.g. a: 0.1, a: 0.2, b: 0.5, b: 0.2, a: 0.4, b: 0.3, a: 0.2, b: 0.1]
14
What is 'Hidden' in an HMM
Given an output sequence, we do not know which states the machine has transited through. For example, if the output sequence is 'aaba', it could have been produced via S_0 → S_0 → S_1 → …, or via S_0 → S_1 → S_0 → …, and so forth.
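The snippet below makes the same point concrete by enumerating every state path of the right length for the output 'aaba'; treating the two states as the ones in the diagram above, with each state emitting one symbol, is an illustrative simplification:

```python
from itertools import product

# The 'hidden' part: many state sequences could have produced the same output.
states, output = ["S0", "S1"], "aaba"
for path in product(states, repeat=len(output)):
    print(output, "could have come from state sequence", " ".join(path))
```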
15
HMM and PoS Tagging
In PoS tagging:
Alphabet symbols correspond to words
States correspond to tags
After seeing the symbol sequence (Humans are fond of animals), find the state sequence that generated it (the PoS tag sequence)
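The Viterbi algorithm itself is deferred to later lectures; purely as a preview, here is a minimal dynamic-programming sketch that finds the tag sequence maximising ∏_i P(C_i|C_{i-1}) · P(W_i|C_i). The transition and emission tables are the same illustrative placeholders assumed earlier, not material from the lecture:

```python
def viterbi(words, tagset, transition, emission, start="C0"):
    """Return the best tag sequence under prod_i P(C_i|C_{i-1}) * P(W_i|C_i)."""
    # best[tag] = (probability of the best path ending in tag, that path)
    best = {start: (1.0, [])}
    for w in words:
        new_best = {}
        for t in tagset:
            e = emission.get((t, w), 0.0)
            score, path = max(
                ((s * transition.get((prev, t), 0.0) * e, p + [t])
                 for prev, (s, p) in best.items()),
                key=lambda x: x[0])
            new_best[t] = (score, path)
        best = new_best
    return max(best.values(), key=lambda x: x[0])[1]
```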