
1 Lecture 12 Classifiers Part 2
Topics: Classifiers, Maxent Classifiers, Maximum Entropy Markov Models, Information Extraction and chunking intro
Readings: Chapters 6, 7.1
February 25, 2013 CSCE 771 Natural Language Processing

2 – 2 – CSCE 771 Spring 2013 Overview
Last Time: Confusion Matrix, Brill Demo, NLTK Ch 6 - Text Classification
Today: Confusion Matrix, Brill Demo, NLTK Ch 6 - Text Classification
Readings: NLTK Ch 6

3 – 3 – CSCE 771 Spring 2013 Evaluation of classifiers again
Last time: Recall, Precision, F value, Accuracy
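
As a quick reference, a minimal sketch of those four measures in plain Python (the counts below are made up for illustration):

def precision(tp, fp):
    return float(tp) / (tp + fp)

def recall(tp, fn):
    return float(tp) / (tp + fn)

def f_measure(p, r, beta=1.0):
    # weighted harmonic mean of precision and recall (beta=1 gives F1)
    return (1 + beta**2) * p * r / (beta**2 * p + r)

def accuracy(tp, tn, fp, fn):
    return float(tp + tn) / (tp + tn + fp + fn)

p, r = precision(tp=40, fp=10), recall(tp=40, fn=20)
print p, r, f_measure(p, r)    # 0.8  0.666...  0.727...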

4 – 4 – CSCE 771 Spring 2013 Reuters Data Set
The Reuters-21578 data set: 21,578 documents, 118 categories. A document can belong to multiple classes, so we train 118 binary (one-vs-rest) classifiers.
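
A hedged sketch of that one-vs-rest setup; all_categories and docs are hypothetical placeholders (a category list and (featureset, category-set) pairs), not part of the Reuters distribution:

import nltk

classifiers = {}
for cat in all_categories:    # 118 categories
    # label each document True/False for this one category
    train = [(features, cat in cats) for (features, cats) in docs]
    classifiers[cat] = nltk.NaiveBayesClassifier.train(train)

def classify_doc(features):
    # a document gets every category whose binary classifier says True
    return set(c for c in classifiers if classifiers[c].classify(features))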

5 – 5 – CSCE 771 Spring 2013 Confusion matrix
C_ij – documents that are really in class C_i but are classified as C_j.
C_ii – documents that are really in C_i and are correctly classified.

6 – 6 – CSCE 771 Spring 2013 Micro averaging vs Macro averaging
Macro averaging – average the performance of the individual classifiers (an average of averages).
Micro averaging – sum all the correct decisions, false positives, and false negatives across classes, then compute the measures from the pooled counts (see the sketch below).
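
A small sketch of the difference, with made-up per-class (tp, fp) counts:

counts = {'acq': (90, 10), 'corn': (2, 2), 'earn': (190, 10)}

# macro: average the per-class precisions (every class weighs the same)
macro_p = sum(float(tp) / (tp + fp) for (tp, fp) in counts.values()) / len(counts)

# micro: pool all decisions, then compute one precision (big classes dominate)
tp_sum = sum(tp for (tp, fp) in counts.values())
fp_sum = sum(fp for (tp, fp) in counts.values())
micro_p = float(tp_sum) / (tp_sum + fp_sum)

print macro_p    # 0.783... (dragged down by the tiny 'corn' class)
print micro_p    # 282/304 = 0.927...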

7 – 7 – CSCE 771 Spring 2013 Training, Development and Test Sets
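
A minimal sketch of such a split; labeled_data is a hypothetical list of labeled examples:

import random

random.shuffle(labeled_data)
n = len(labeled_data)
train_set = labeled_data[:int(0.8 * n)]              # fit the model here
dev_set   = labeled_data[int(0.8 * n):int(0.9 * n)]  # tune features/parameters here
test_set  = labeled_data[int(0.9 * n):]              # report final numbers here, once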

8 – 8 – CSCE 771 Spring 2013 nltk.tag
Classes: AffixTagger, BigramTagger, BrillTagger, BrillTaggerTrainer, DefaultTagger, FastBrillTaggerTrainer, HiddenMarkovModelTagger, HiddenMarkovModelTrainer, NgramTagger, RegexpTagger, TaggerI, TrigramTagger, UnigramTagger
Functions: batch_pos_tag, pos_tag, untag

9 – 9 – CSCE 771 Spring 2013 Module nltk.tag.hmm
Source code for module nltk.tag.hmm

10 – 10 – CSCE 771 Spring 2013 HMM demo
import nltk
nltk.tag.hmm.demo()
nltk.tag.hmm.demo_pos()
nltk.tag.hmm.demo_pos_bw()

11 – 11 – CSCE 771 Spring 2013 Common Suffixes
import nltk
from nltk.corpus import brown

suffix_fdist = nltk.FreqDist()
for word in brown.words():
    word = word.lower()
    suffix_fdist.inc(word[-1:])
    suffix_fdist.inc(word[-2:])
    suffix_fdist.inc(word[-3:])
common_suffixes = suffix_fdist.keys()[:100]   # NLTK 2: keys() are sorted by decreasing frequency
print common_suffixes

12 – 12 – CSCE 771 Spring 2013
rtepair = nltk.corpus.rte.pairs(['rte3_dev.xml'])[33]
extractor = nltk.RTEFeatureExtractor(rtepair)
print extractor.text_words         # set(['Russia', 'Organisation', 'Shanghai', …
print extractor.hyp_words          # set(['member', 'SCO', 'China'])
print extractor.overlap('word')    # set([])
print extractor.overlap('ne')      # set(['SCO', 'China'])
print extractor.hyp_extra('word')  # set(['member'])

13 – 13 – CSCE 771 Spring 2013
# 1. Random shuffle of tagged sentences
tagged_sents = list(brown.tagged_sents(categories='news'))
random.shuffle(tagged_sents)
size = int(len(tagged_sents) * 0.1)
train_set, test_set = tagged_sents[size:], tagged_sents[:size]

# 2. Split by file, so train and test come from different documents
file_ids = brown.fileids(categories='news')
size = int(len(file_ids) * 0.1)
train_set = brown.tagged_sents(file_ids[size:])
test_set = brown.tagged_sents(file_ids[:size])

# 3. Split by genre, for a stricter test of generalization
train_set = brown.tagged_sents(categories='news')
test_set = brown.tagged_sents(categories='fiction')

classifier = nltk.NaiveBayesClassifier.train(train_set)

14 – 14 – CSCE 771 Spring 2013
Traceback (most recent call last):
  File "C:\Users\mmm\Documents\Courses\771\Python771\ch06\ch06d.py", line 80, in <module>
    classifier = nltk.NaiveBayesClassifier.train(train_set)
  File "C:\Python27\lib\site-packages\nltk\classify\naivebayes.py", line 191, in train
    for featureset, label in labeled_featuresets:
ValueError: too many values to unpack
The train_set above holds tagged sentences (lists of (word, tag) pairs), but NaiveBayesClassifier.train expects (featureset, label) pairs, hence the unpacking error.

15 – 15 – CSCE 771 Spring 2013
from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories='news')
size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)

16 – 16 – CSCE 771 Spring 2013
def tag_list(tagged_sents):
    return [tag for sent in tagged_sents for (word, tag) in sent]

def apply_tagger(tagger, corpus):
    return [tagger.tag(nltk.tag.untag(sent)) for sent in corpus]

gold = tag_list(brown.tagged_sents(categories='editorial'))
test = tag_list(apply_tagger(t2, brown.tagged_sents(categories='editorial')))
cm = nltk.ConfusionMatrix(gold, test)
print cm.pp(sort_by_count=True, show_percents=True, truncate=9)

17 – 17 – CSCE 771 Spring 2013
     |                                         N                      |
     |      N      I      A      J             N             V      N |
     |      N      N      T      J      .      S      ,      B      P |
-----+----------------------------------------------------------------+
  NN | <11.8%>  0.0%      .   0.2%      .   0.0%      .   0.3%   0.0% |
  IN |   0.0%  <9.0%>     .      .      .   0.0%      .      .      . |
  AT |      .      .  <8.6%>     .      .      .      .      .      . |
  JJ |   1.7%      .      .  <3.9%>     .      .      .   0.0%   0.0% |
   . |      .      .      .      .  <4.8%>     .      .      .      . |
 NNS |   1.5%      .      .      .      .  <3.2%>     .      .   0.0% |
   , |      .      .      .      .      .      .  <4.4%>     .      . |
  VB |   0.9%      .      .   0.0%      .      .      .  <2.4%>     . |
  NP |   1.0%      .      .   0.0%      .      .      .      .  <1.8%>|
-----+----------------------------------------------------------------+
(row = reference; col = test)

18 – 18 – CSCE 771 Spring 2013 Entropy
import math
def entropy(labels):
    freqdist = nltk.FreqDist(labels)
    probs = [freqdist.freq(l) for l in freqdist]
    return -sum([p * math.log(p, 2) for p in probs])

19 – 19 – CSCE 771 Spring 2013
print entropy(['male', 'male', 'male', 'male'])          # -0.0
print entropy(['male', 'female', 'male', 'male'])        # 0.811278124459
print entropy(['female', 'male', 'female', 'male'])      # 1.0
print entropy(['female', 'female', 'male', 'female'])    # 0.811278124459
print entropy(['female', 'female', 'female', 'female'])  # -0.0

20 – 20 – CSCE 771 Spring 2013 The Rest of NLTK Chapter 06
6.5 Naïve Bayes Classifiers
6.6 Maximum Entropy Classifiers
    nltk.classify.maxent.BinaryMaxentFeatureEncoding(labels, mapping, unseen_features=False, alwayson_features=False)
6.7 Modeling Linguistic Patterns
6.8 Summary
But no more code?!?
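
Since the chapter shows no further code, here is a hedged sketch (not from the chapter) of training NLTK's built-in maxent classifier; it takes the same (featureset, label) pairs as NaiveBayesClassifier.train:

import nltk
# 'iis' is the default pure-Python training algorithm; max_iter keeps the demo fast
classifier = nltk.MaxentClassifier.train(train_set, algorithm='iis',
                                         trace=0, max_iter=10)
classifier.show_most_informative_features(10)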

21 – 21 – CSCE 771 Spring 2013 Maximum Entropy Models (again)
Features are elements of evidence that connect observations d with categories c:
f : C × D → R
Example feature: f(c, d) = { c = LOCATION and w_-1 = "in" and isCapitalized(w) }
An "input-feature" is a property of an unlabeled token. A "joint-feature" is a property of a labeled token.
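
A sketch of that example feature as code. The datum layout (a dict with the token and its left neighbor) is a hypothetical representation of the observation, not an NLTK API:

def f(c, datum):
    # fires (value 1) only for class LOCATION when the previous word
    # is "in" and the current word is capitalized
    return (c == 'LOCATION'
            and datum['prev_word'].lower() == 'in'
            and datum['word'][:1].isupper())

print f('LOCATION', {'word': 'Paris', 'prev_word': 'in'})   # True
print f('PERSON',   {'word': 'Paris', 'prev_word': 'in'})   # False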

22 – 22 – CSCE 771 Spring 2013 Feature-Based Linear Classifiers
p(c | d, λ) = exp( Σ_i λ_i f_i(c, d) ) / Σ_{c'} exp( Σ_i λ_i f_i(c', d) )
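
The same formula as a small illustrative function, with each feature given as a callable f_i(c, d) paired with a weight λ_i:

import math

def maxent_prob(c, d, features, weights, classes):
    # unnormalized score: exp of the weighted sum of feature values
    def score(cls):
        return math.exp(sum(w * f(cls, d) for (f, w) in zip(features, weights)))
    # normalize over all candidate classes
    return score(c) / sum(score(cls) for cls in classes)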

23 – 23 – CSCE 771 Spring 2013 Maxent Model revisited

24 – 24 – CSCE 771 Spring 2013 Maximum Entropy Markov Models (MEMM)
Apply a Maxent classifier repeatedly along the sequence, one position at a time; each prediction becomes part of the evidence for the next position (see the sketch below).
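
A sketch of the simplest (greedy) decoder; real MEMM taggers usually run Viterbi over the per-position distributions instead. extract_features is a hypothetical feature function that may look at the tags already assigned in history:

def memm_tag(words, classifier):
    history = []
    for i in range(len(words)):
        # the featureset can include history[-1], history[-2], ... as evidence
        featureset = extract_features(words, i, history)
        tag = classifier.classify(featureset)
        history.append(tag)
    return zip(words, history)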

25 – 25 – CSCE 771 Spring 2013

26 – 26 – CSCE 771 Spring 2013 Named Entity Recognition (NER)
entity –
1. a: being, existence; especially: independent, separate, or self-contained existence
   b: the existence of a thing as contrasted with its attributes
2. something that has separate and distinct existence and objective or conceptual reality
3. an organization (as a business or governmental unit) that has an identity separate from those of its members
A named entity is one of those with a name.
http://nlp.stanford.edu/software/CRF-NER.shtml

27 – 27 – CSCE 771 Spring 2013 Classes of Named Entities
Person (PERS), Location (LOC), Organization (ORG), DATE
Example: Jim bought 300 shares of Acme Corp. in 2006.
Producing an annotated block of text, such as this one:
[Jim]_PERS bought 300 shares of [Acme Corp.]_ORG in [2006]_DATE.
http://nlp.stanford.edu/software/CRF-NER.shtml
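
For comparison, NLTK ships its own NE chunker; a minimal sketch on the example sentence (requires NLTK's bundled tagger and chunker models, whose label set may differ from the classes above):

import nltk
sent = "Jim bought 300 shares of Acme Corp. in 2006."
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)))
print tree    # named entities show up as subtrees, e.g. (PERSON Jim/NNP)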

28 – 28 – CSCE 771 Spring 2013 IOB tagging
Each token is tagged B-X (begins a chunk of type X), I-X (inside a chunk of type X), or O (outside any chunk), e.g. We/B-NP saw/O the/B-NP little/I-NP yellow/I-NP dog/I-NP.
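
A sketch using NLTK's helper that converts a chunk tree into (word, POS, IOB) triples:

import nltk
from nltk.chunk import tree2conlltags

tagged = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
          ("dog", "NN"), ("barked", "VBD")]
cp = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
print tree2conlltags(cp.parse(tagged))
# [('the', 'DT', 'B-NP'), ('little', 'JJ', 'I-NP'), ('yellow', 'JJ', 'I-NP'),
#  ('dog', 'NN', 'I-NP'), ('barked', 'VBD', 'O')]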

29 – 29 – CSCE 771 Spring 2013

30 – 30 – CSCE 771 Spring 2013 Chunking - partial parsing
Segment a sentence into non-overlapping phrases (e.g., base noun phrases) without building a full parse tree.

31 – 31 – CSCE 771 Spring 2013 NLTK ch07.py
def ie_preprocess(document):
    sentences = nltk.sent_tokenize(document)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    return sentences

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
            ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),
            ("the", "DT"), ("cat", "NN")]
grammar = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(grammar)
result = cp.parse(sentence)
print result

32 – 32 – CSCE 771 Spring 2013
(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))

(S (NP money/NN market/NN) fund/NN)

33 – 33 – CSCE 771 Spring 2013 Chunks matching the tag pattern <V.*> <TO> <V.*> ("verb to verb") in the Brown corpus:
(CHUNK combined/VBN to/TO achieve/VB)
(CHUNK continue/VB to/TO place/VB)
(CHUNK serve/VB to/TO protect/VB)
(CHUNK wanted/VBD to/TO wait/VB)

34 – 34 – CSCE 771 Spring 2013
from nltk.corpus import conll2000
print conll2000.chunked_sents('train.txt')[99]
print " B********************************************"
print conll2000.chunked_sents('train.txt', chunk_types=['NP'])[99]
print " C********************************************"

cp = nltk.RegexpParser("")
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
print cp.evaluate(test_sents)

35 – 35 – CSCE 771 Spring 2013 Information extraction
A step towards understanding:
Find the named entities.
Figure out what is being said about them; in practice, just the relations among the named entities (a hedged sketch follows).
http://en.wikipedia.org/wiki/Information_extraction
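
A hedged sketch of the relation step, following the NLTK book's chapter 7 example over the IEER corpus (in later NLTK releases the printer show_raw_rtuple is named rtuple):

import re
import nltk

# find ORG-in-LOC relations whose connecting text matches the pattern
IN = re.compile(r'.*\bin\b(?!\b.+ing)')
for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
    for rel in nltk.sem.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
        print nltk.sem.show_raw_rtuple(rel)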

36 – 36 – CSCE 771 Spring 2013 Outline of natural language processing http://en.wikipedia.org/wiki/Natural_language_processing_toolkits

