
1 – Lecture 12: Classifiers Part 2
Topics: Classifiers, Maxent Classifiers, Maximum Entropy Markov Models, Information Extraction and chunking intro
Readings: Chapters 6, 7.1
February 25, 2013 – CSCE 771 Natural Language Processing

2 – CSCE 771 Spring 2013 – Overview
Last time: confusion matrix, Brill demo, NLTK Ch 6 text classification
Today: confusion matrix, Brill demo, NLTK Ch 6 text classification (continued)
Readings: NLTK Ch 6

3 – CSCE 771 Spring 2013 – Evaluation of classifiers again
Last time: recall, precision, F-measure, accuracy (definitions below)
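For reference, the standard definitions in terms of true/false positive and negative counts (not from the slide, but the usual formulation):

\[
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2PR}{P + R}, \qquad
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
\]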

4 – CSCE 771 Spring 2013 – Reuters data set
21,578 documents, 118 categories; a document can belong to multiple classes, so the task is handled with 118 binary classifiers (one per category).

5 – CSCE 771 Spring 2013 – Confusion matrix
c_ij – the number of documents that are really in class C_i but are classified as C_j.
c_ii – the number of documents in class C_i that are correctly classified (the diagonal).

6 – CSCE 771 Spring 2013 – Micro-averaging vs macro-averaging
Macro-averaging: average the performance of the individual classifiers (an average of averages, so every class counts equally).
Micro-averaging: sum all correct decisions, false positives, and false negatives across classes, then compute a single score (so frequent classes dominate); see the sketch below.
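As a minimal sketch of the difference (the per-class counts below are made up for illustration):

counts = {'earn': (900, 100, 50),   # class: (true positives, false positives, false negatives)
          'acq': (80, 20, 40),
          'wheat': (5, 1, 20)}

def precision(tp, fp):
    return float(tp) / (tp + fp)

# macro-averaging: average the per-class precisions, so every class weighs equally
macro_p = sum(precision(tp, fp) for (tp, fp, fn) in counts.values()) / len(counts)

# micro-averaging: pool all decisions first, so frequent classes dominate
total_tp = sum(tp for (tp, fp, fn) in counts.values())
total_fp = sum(fp for (tp, fp, fn) in counts.values())
micro_p = precision(total_tp, total_fp)

print "macro precision:", macro_p   # ~0.84
print "micro precision:", micro_p   # ~0.89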

7 – CSCE 771 Spring 2013 – Training, Development and Test Sets (see the sketch below)
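The figure for this slide did not survive the transcript; as a minimal sketch, a common 80/10/10 split of labeled data looks like this (the variable names and stand-in data are illustrative):

import random

labeled_data = [({'suffix(1)': 'e'}, 'NN')] * 100   # stand-in for real (featureset, label) pairs
random.shuffle(labeled_data)
n = len(labeled_data)
train_set = labeled_data[:int(0.8 * n)]               # fit model parameters
dev_set   = labeled_data[int(0.8 * n):int(0.9 * n)]   # tune features and hyperparameters
test_set  = labeled_data[int(0.9 * n):]               # report final accuracy, used once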

8 – CSCE 771 Spring 2013 – code_consecutive_pos_tagger.py revisited to trace history development

def pos_features(sentence, i, history):   # [_consec-pos-tag-features]
    if debug == 1: print "pos_features *********************************"
    if debug == 1: print " sentence=", sentence
    if debug == 1: print " i=", i
    if debug == 1: print " history=", history
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    if i == 0:
        features["prev-word"] = "<START>"   # sentence-initial sentinel (the angle brackets were stripped in the transcript)
        features["prev-tag"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
        features["prev-tag"] = history[i-1]
    if debug == 1: print "pos_features features=", features
    return features
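This feature extractor is driven by the NLTK book's ConsecutivePosTagger (Ch 6), which builds up the history argument from its own predictions at tagging time; reproduced here with brief comments:

class ConsecutivePosTagger(nltk.TaggerI):
    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = pos_features(untagged_sent, i, history)
                train_set.append((featureset, tag))
                history.append(tag)   # during training, history holds the gold tags
        self.classifier = nltk.NaiveBayesClassifier.train(train_set)

    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = pos_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)   # at tagging time, history holds the model's own predictions
        return zip(sentence, history)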

9 – CSCE 771 Spring 2013 – Trace of one sentence (SIGINT to interrupt)

sentence= ['Rookie', 'southpaw', 'George', 'Stepanovich', 'relieved', 'Hyde', 'at', 'the', 'start', 'of', 'the', 'ninth', 'and', 'gave', 'up', 'the', "A's", 'fifth', 'tally', 'on', 'a', 'walk', 'to', 'second', 'baseman', 'Dick', 'Howser', ',', 'a', 'wild', 'pitch', ',', 'and', 'Frank', "Cipriani's", 'single', 'under', 'Shortstop', 'Jerry', "Adair's", 'glove', 'into', 'center', '.']
i= 0
history= []
pos_features features= {'suffix(3)': 'kie', 'prev-word': '<START>', 'suffix(2)': 'ie', 'prev-tag': '<START>', 'suffix(1)': 'e'}

10 – CSCE 771 Spring 2013 – Trace continued

pos_features *************************************
sentence= ['Rookie', … '.']
i= 1
history= ['NN']
pos_features features= {'suffix(3)': 'paw', 'prev-word': 'Rookie', 'suffix(2)': 'aw', 'prev-tag': 'NN', 'suffix(1)': 'w'}

pos_features *************************************
sentence= ['Rookie', 'southpaw', … '.']
i= 2
history= ['NN', 'NN']
pos_features features= {'suffix(3)': 'rge', 'prev-word': 'southpaw', 'suffix(2)': 'ge', 'prev-tag': 'NN', 'suffix(1)': 'e'}

11 – CSCE 771 Spring 2013 – nltk.tag
Classes: AffixTagger, BigramTagger, BrillTagger, BrillTaggerTrainer, DefaultTagger, FastBrillTaggerTrainer, HiddenMarkovModelTagger, HiddenMarkovModelTrainer, NgramTagger, RegexpTagger, TaggerI, TrigramTagger, UnigramTagger
Functions: batch_pos_tag, pos_tag, untag

12 – CSCE 771 Spring 2013 – Module nltk.tag.hmm
Source code for module nltk.tag.hmm (demo calls on the next slide).

13 – CSCE 771 Spring 2013 – HMM demo

import nltk
nltk.tag.hmm.demo()
nltk.tag.hmm.demo_pos()
nltk.tag.hmm.demo_pos_bw()
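The demos train on built-in samples; to train an HMM tagger on your own data, NLTK 2's trainer API can be used directly. A minimal sketch (small training sample for speed):

import nltk
from nltk.corpus import brown

train_sents = brown.tagged_sents(categories='news')[:500]
trainer = nltk.tag.hmm.HiddenMarkovModelTrainer()
hmm_tagger = trainer.train_supervised(train_sents)
# note: with the default MLE estimator, words unseen in training can behave badly
print hmm_tagger.tag(['the', 'dog', 'barked'])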

14 – CSCE 771 Spring 2013 – Common suffixes

import nltk
from nltk.corpus import brown

suffix_fdist = nltk.FreqDist()
for word in brown.words():
    word = word.lower()
    suffix_fdist.inc(word[-1:])
    suffix_fdist.inc(word[-2:])
    suffix_fdist.inc(word[-3:])
common_suffixes = suffix_fdist.keys()[:100]   # NLTK 2 keys() are sorted by decreasing frequency
print common_suffixes
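In the book these common suffixes then feed a feature extractor and a decision-tree tagger; the continuation (NLTK Ch 6) is:

def pos_features(word):
    features = {}
    for suffix in common_suffixes:
        features['endswith(%s)' % suffix] = word.lower().endswith(suffix)
    return features

tagged_words = brown.tagged_words(categories='news')
featuresets = [(pos_features(n), g) for (n, g) in tagged_words]
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.DecisionTreeClassifier.train(train_set)   # slow: this takes a while on full Brown news
print nltk.classify.accuracy(classifier, test_set)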

15 – CSCE 771 Spring 2013 – RTE feature extraction

rtepair = nltk.corpus.rte.pairs(['rte3_dev.xml'])[33]
extractor = nltk.RTEFeatureExtractor(rtepair)
print extractor.text_words
set(['Russia', 'Organisation', 'Shanghai', …
print extractor.hyp_words
set(['member', 'SCO', 'China'])
print extractor.overlap('word')
set([])
print extractor.overlap('ne')
set(['SCO', 'China'])
print extractor.hyp_extra('word')
set(['member'])

16 – CSCE 771 Spring 2013 – Three ways to split Brown into training and test data

tagged_sents = list(brown.tagged_sents(categories='news'))
random.shuffle(tagged_sents)
size = int(len(tagged_sents) * 0.1)
train_set, test_set = tagged_sents[size:], tagged_sents[:size]

file_ids = brown.fileids(categories='news')
size = int(len(file_ids) * 0.1)
train_set = brown.tagged_sents(file_ids[size:])
test_set = brown.tagged_sents(file_ids[:size])

train_set = brown.tagged_sents(categories='news')
test_set = brown.tagged_sents(categories='fiction')

classifier = nltk.NaiveBayesClassifier.train(train_set)   # fails: see the traceback on the next slide

17 – CSCE 771 Spring 2013 – Resulting traceback

Traceback (most recent call last):
  File "C:\Users\mmm\Documents\Courses\771\Python771\ch06\ch06d.py", line 80, in <module>
    classifier = nltk.NaiveBayesClassifier.train(train_set)
  File "C:\Python27\lib\site-packages\nltk\classify\naivebayes.py", line 191, in train
    for featureset, label in labeled_featuresets:
ValueError: too many values to unpack

The error arises because NaiveBayesClassifier.train expects a list of (featureset, label) pairs, but train_set here holds tagged sentences (lists of (word, tag) pairs), so the unpacking fails.

18 – CSCE 771 Spring 2013 – Combining taggers with backoff

import nltk
from nltk.corpus import brown

brown_tagged_sents = brown.tagged_sents(categories='news')
size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
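The combined tagger can then be scored on the held-out 10% with TaggerI.evaluate:

print t2.evaluate(test_sents)   # fraction of test tokens tagged correctly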

19 – CSCE 771 Spring 2013 – Confusion matrix for the backoff tagger

def tag_list(tagged_sents):
    return [tag for sent in tagged_sents for (word, tag) in sent]

def apply_tagger(tagger, corpus):
    return [tagger.tag(nltk.tag.untag(sent)) for sent in corpus]

gold = tag_list(brown.tagged_sents(categories='editorial'))
test = tag_list(apply_tagger(t2, brown.tagged_sents(categories='editorial')))
cm = nltk.ConfusionMatrix(gold, test)
print cm.pp(sort_by_count=True, show_percents=True, truncate=9)

20 – CSCE 771 Spring 2013 – Confusion matrix output

    |     N     I     A     J           N           V     N |
    |     N     N     T     J     .     S     ,     B     P |
----+--------------------------------------------------------+
 NN | <...>  0.0%     .  0.2%     .  0.0%     .  0.3%  0.0% |
 IN |  0.0% <...>     .     .     .  0.0%     .     .     . |
 AT |     .     . <...>     .     .     .     .     .     . |
 JJ |  1.7%     .     . <...>     .     .     .  0.0%  0.0% |
  . |     .     .     .     . <...>     .     .     .     . |
NNS |  1.5%     .     .     .     . <...>     .     .  0.0% |
  , |     .     .     .     .     .     . <...>     .     . |
 VB |  0.9%     .     .  0.0%     .     .     . <...>     . |
 NP |  1.0%     .     .  0.0%     .     .     .     . <...> |
----+--------------------------------------------------------+
(row = reference; col = test)

(The diagonal cells, which NLTK prints inside angle brackets and which hold the correctly tagged percentages, were stripped from the transcript; they are shown as <...> above.)

21 – CSCE 771 Spring 2013 – Entropy

import math
import nltk

def entropy(labels):
    freqdist = nltk.FreqDist(labels)
    probs = [freqdist.freq(l) for l in freqdist]
    return -sum([p * math.log(p, 2) for p in probs])

22 – CSCE 771 Spring 2013 – Entropy examples

print entropy(['male', 'male', 'male', 'male'])          # -0.0
print entropy(['male', 'female', 'male', 'male'])        # 0.811278124459
print entropy(['female', 'male', 'female', 'male'])      # 1.0
print entropy(['female', 'female', 'male', 'female'])    # 0.811278124459
print entropy(['female', 'female', 'female', 'female'])  # -0.0

(The -0.0 for the single-label cases is a floating-point artifact: negating 1.0 * log2(1.0) = 0.0 yields -0.0.)

23 – CSCE 771 Spring 2013 – The rest of NLTK Chapter 06
6.5 Naïve Bayes Classifiers
6.6 Maximum Entropy Classifiers
    nltk.classify.maxent.BinaryMaxentFeatureEncoding(labels, mapping, unseen_features=False, alwayson_features=False)
6.7 Modeling Linguistic Patterns
6.8 Summary
But no more code?!?

24 – CSCE 771 Spring 2013 – Maximum Entropy Models (again)
Features are elements of evidence that connect observations d with categories c:
    f: C × D → R
Example feature: f(c, d) = { c = LOCATION  &  w−1 = "in"  &  isCapitalized(w) }
An "input-feature" is a property of an unlabeled token. A "joint-feature" is a property of a labeled token.
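As a sketch, the example joint feature is just an indicator function over a (class, observation) pair; the function name and dictionary keys below are illustrative, not an NLTK API:

def f_location(c, d):
    # fires (returns 1) only for the labeled configuration:
    # class LOCATION, previous word "in", current word capitalized
    if c == 'LOCATION' and d['prev-word'] == 'in' and d['word'][:1].isupper():
        return 1
    return 0

print f_location('LOCATION', {'prev-word': 'in', 'word': 'Quebec'})   # 1
print f_location('PERSON',   {'prev-word': 'in', 'word': 'Quebec'})   # 0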

25 – CSCE 771 Spring 2013 – Feature-Based Linear Classifiers

\[
P(c \mid d, \lambda) = \frac{\exp\Big(\sum_i \lambda_i f_i(c, d)\Big)}{\sum_{c'} \exp\Big(\sum_i \lambda_i f_i(c', d)\Big)}
\]

26 – CSCE 771 Spring 2013 – Maxent Model revisited (figure slide)

27 – CSCE 771 Spring 2013 – Maximum Entropy Markov Models (MEMM)
A maxent classifier is applied repeatedly to tag a sequence left to right; each local decision conditions on the previous tags (see the equation below).
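Concretely, the MEMM factors the tag sequence into locally normalized maxent decisions, each conditioned on the previous tag, and decodes (e.g., with Viterbi):

\[
\hat{T} = \operatorname*{argmax}_{t_1 \ldots t_n} \prod_{i=1}^{n} P(t_i \mid w_i, t_{i-1})
\]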

28 – CSCE 771 Spring 2013 (figure slide)

29 – CSCE 771 Spring 2013 – Named Entity Recognition (NER)
entity –
1. a: being, existence; especially: independent, separate, or self-contained existence
   b: the existence of a thing as contrasted with its attributes
2. something that has separate and distinct existence and objective or conceptual reality
3. an organization (as a business or governmental unit) that has an identity separate from those of its members
A named entity is one of those with a name.
http://nlp.stanford.edu/software/CRF-NER.shtml

30 – CSCE 771 Spring 2013 – Classes of named entities
Person (PERS), Location (LOC), Organization (ORG), DATE
Example: "Jim bought 300 shares of Acme Corp. in 2006."
Producing an annotated block of text, such as:
"[Jim]PERS bought 300 shares of [Acme Corp.]ORG in [2006]DATE."
(The annotation markup was stripped from the transcript; the bracketed tags above use the slide's own class labels.)
http://nlp.stanford.edu/software/CRF-NER.shtml

31 – CSCE 771 Spring 2013 – IOB tagging
B – beginning of a chunk, e.g., B-LOC
I – inside a chunk
O – outside any chunk
Example:

text = '''
he PRP B-NP
accepted VBD B-VP
the DT B-NP
position NN I-NP
of IN B-PP
vice NN B-NP
chairman NN I-NP
, , O

32 – CSCE 771 Spring 2013 (figure slide)

33 – CSCE 771 Spring 2013 – Chunking: partial parsing
Chunking finds flat, non-overlapping segments (e.g., base noun phrases) without building a full parse tree.

34 – CSCE 771 Spring 2013 – NLTK ch07.py

def ie_preprocess(document):
    sentences = nltk.sent_tokenize(document)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    return sentences

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),   # [_chunkex-sent]
            ("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

grammar = "NP: {<DT>?<JJ>*<NN>}"   # [_chunkex-grammar] (tag pattern restored; the transcript stripped the angle brackets)
cp = nltk.RegexpParser(grammar)
result = cp.parse(sentence)
print result

35 – CSCE 771 Spring 2013 – Output

(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))

36 – CSCE 771 Spring 2013 – chunkex-draw

grammar = "NP: {<DT>?<JJ>*<NN>}"   # [_chunkex-grammar]
cp = nltk.RegexpParser(grammar)    # [_chunkex-cp]
result = cp.parse(sentence)        # [_chunkex-test]
print result                       # [_chunkex-print]
result.draw()

37 – CSCE 771 Spring 2013 – Chunk two consecutive nouns

nouns = [("money", "NN"), ("market", "NN"), ("fund", "NN")]
grammar = "NP: {<NN><NN>}  # Chunk two consecutive nouns"
cp = nltk.RegexpParser(grammar)
print cp.parse(nouns)

(S (NP money/NN market/NN) fund/NN)

38 – CSCE 771 Spring 2013 – Searching for verb-TO-verb chunks

cp = nltk.RegexpParser('CHUNK: {<V.*> <TO> <V.*>}')   # pattern restored; the transcript stripped the angle brackets
brown = nltk.corpus.brown
for sent in brown.tagged_sents():
    tree = cp.parse(sent)
    for subtree in tree.subtrees():
        if subtree.node == 'CHUNK': print subtree

(CHUNK combined/VBN to/TO achieve/VB)
…
(CHUNK serve/VB to/TO protect/VB)
(CHUNK wanted/VBD to/TO wait/VB)
…

39 – CSCE 771 Spring 2013 – nltk.chunk.accuracy example

from nltk.corpus import conll2000
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
print nltk.chunk.accuracy(cp, test_sents)
0.41745994892

40 – CSCE 771 Spring 2013 – First attempt ?!?

from nltk.corpus import conll2000
cp = nltk.RegexpParser("")
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
print cp.evaluate(test_sents)

ChunkParse score:
    IOB Accuracy: 43.4%
    Precision:     0.0%
    Recall:        0.0%
    F-Measure:     0.0%

With an empty grammar every token is tagged O; O tags alone account for 43.4% of the IOB tags, but no chunks are proposed, hence 0% precision and recall.

41 – CSCE 771 Spring 2013 – Chunking with conll2000 (continuation of the example on the next slide)

Carlyle NNP B-NP
Group NNP I-NP
, , O
a DT B-NP
merchant NN I-NP
banking NN I-NP
concern NN I-NP
. . O
'''
nltk.chunk.conllstr2tree(text, chunk_types=['NP']).draw()

from nltk.corpus import conll2000
print conll2000.chunked_sents('train.txt')[99]

42 – CSCE 771 Spring 2013 – Chunking using conll2000

text = '''
he PRP B-NP
accepted VBD B-VP
the DT B-NP
position NN I-NP
of IN B-PP
vice NN B-NP
chairman NN I-NP
…
. . O
'''
nltk.chunk.conllstr2tree(text, chunk_types=['NP']).draw()

from nltk.corpus import conll2000
print conll2000.chunked_sents('train.txt')[99]

43 – CSCE 771 Spring 2013 – Output

(S
  (PP Over/IN)
  (NP a/DT cup/NN)
  (PP of/IN)
  (NP coffee/NN)
  ,/,
  (NP Mr./NNP Stone/NNP)
  (VP told/VBD)
  (NP his/PRP$ story/NN)
  ./.)

44 – CSCE 771 Spring 2013 – A Real Attempt

grammar = r"NP: {<[CDJNP].*>+}"   # pattern restored from NLTK Ch 7; the transcript stripped the angle brackets
cp = nltk.RegexpParser(grammar)
print cp.evaluate(test_sents)

ChunkParse score:
    IOB Accuracy: 87.7%
    Precision:    70.6%
    Recall:       67.8%
    F-Measure:    69.2%

45 – CSCE 771 Spring 2013 – Information extraction
A step toward understanding:
1. Find the named entities.
2. Figure out what is being said about them; in practice, just the relations among the named entities.
http://en.wikipedia.org/wiki/Information_extraction

46 – CSCE 771 Spring 2013 – Outline of natural language processing
1 What is NLP?
2 Prerequisite technologies
3 Subfields of NLP
4 Related fields
5 Processes of NLP: Applications, Components
6 History of NLP
    6.1 Timeline of NLP software
7 General NLP concepts
8 NLP software
    8.1 Chatterbots
    8.2 NLP toolkits
    8.3 Translation software
9 NLP organizations
10 NLP publications: Books, Journals
11 Persons
12 See also
13 References
14 External links
http://en.wikipedia.org/wiki/Outline_of_natural_language_processing

47 – CSCE 771 Spring 2013 – Persons influential in NLP
Alan Turing – originator of the Turing Test.
Noam Chomsky – author of the seminal work Syntactic Structures, which revolutionized linguistics with 'universal grammar', a rule-based system of syntactic structures. [15]
Daniel Bobrow
Joseph Weizenbaum – author of the ELIZA chatterbot.
Roger Schank – introduced the conceptual dependency theory for natural language understanding. [16]
Terry Winograd
Kenneth Colby
Rollo Carpenter
David Ferrucci – principal investigator of the team that created Watson, IBM's AI computer that won the quiz show Jeopardy!
William Aaron Woods

