590 Web Scraping – NLTK IE - II


590 Web Scraping – NLTK IE - II
Topics: NLTK – Information Extraction (IE), continued
Readings: Online Book – http://www.nltk.org/book/ch07.html, Section 7.3
March 30, 2017

Information Extraction Architecture
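
The architecture figure itself is not reproduced in this transcript. As a reminder, the front end of that pipeline (sentence segmentation, tokenization, POS tagging) is the NLTK book's ie_preprocess function:

import nltk

def ie_preprocess(document):
    # sentence segmentation -> word tokenization -> part-of-speech tagging
    sentences = nltk.sent_tokenize(document)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    return sentences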

Ch07_3.2_Baseline.py
import nltk
from nltk.corpus import conll2000

# Baseline: an empty grammar creates no chunks at all
cp = nltk.RegexpParser("")
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
print(cp.evaluate(test_sents))
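
The evaluation compares the parser's output against the gold-standard IOB tags in the CoNLL-2000 corpus. A quick way to see what that gold data looks like (a small inspection snippet added here, not from the slides):

import nltk
from nltk.corpus import conll2000

# Each token is represented as a (word, POS tag, IOB chunk tag) triple
sent = conll2000.chunked_sents('test.txt', chunk_types=['NP'])[0]
print(nltk.chunk.tree2conlltags(sent)[:5])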

Ch07_3.2_BaselineII.py
import nltk
from nltk.corpus import conll2000

# Naive regexp chunker: any sequence of tags beginning with C, D, J, N or P
# (e.g. CD, DT, JJ, NN, PRP) is chunked as an NP
grammar = r"NP: {<[CDJNP].*>+}"
cp = nltk.RegexpParser(grammar)
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
print(cp.evaluate(test_sents))
"""
Results
ChunkParse score:
    IOB Accuracy:  87.7%
    Precision:     70.6%
    Recall:        67.8%
    F-Measure:     69.2%
"""

Unigram
class UnigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        # convert each training tree to (POS tag, IOB chunk tag) pairs
        train_data = [[(t, c) for w, t, c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.UnigramTagger(train_data)

    def parse(self, sentence):
        pos_tags = [pos for (word, pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag)
                     for ((word, pos), chunktag) in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)

>>> test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
>>> train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])
>>> unigram_chunker = UnigramChunker(train_sents)
>>> print(unigram_chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  92.9%
    Precision:     79.9%
    Recall:        86.8%
    F-Measure:     83.2%

The Unigram tagger mapping
>>> postags = sorted(set(pos for sent in train_sents
...                      for (word, pos) in sent.leaves()))
>>> print(unigram_chunker.tagger.tag(postags))
[('#', 'B-NP'), ('$', 'B-NP'), ("''", 'O'), ('(', 'O'), (')', 'O'), (',', 'O'),
 ('.', 'O'), (':', 'O'), ('CC', 'O'), ('CD', 'I-NP'), ('DT', 'B-NP'),
 ('EX', 'B-NP'), ('FW', 'I-NP'), ('IN', 'O'), ('JJ', 'I-NP'), ('JJR', 'B-NP'),
 ('JJS', 'I-NP'), ('MD', 'O'), ('NN', 'I-NP'), ('NNP', 'I-NP'), ('NNPS', 'I-NP'),
 ('NNS', 'I-NP'), ('PDT', 'B-NP'), ('POS', 'B-NP'), ('PRP', 'B-NP'),
 ('PRP$', 'B-NP'), ('RB', 'O'), ('RBR', 'O'), ('RBS', 'B-NP'), ('RP', 'O'),
 ('SYM', 'O'), ('TO', 'O'), ('UH', 'O'), ('VB', 'O'), ('VBD', 'O'), ('VBG', 'O'),
 ('VBN', 'O'), ('VBP', 'O'), ('VBZ', 'O'), ('WDT', 'B-NP'), ('WP', 'B-NP'),
 ('WP$', 'B-NP'), ('WRB', 'O'), ('``', 'O')]

Bigram
import nltk
from nltk.corpus import conll2000

class BigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        train_data = [[(t, c) for w, t, c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.BigramTagger(train_data)

    def parse(self, sentence):
        pos_tags = [pos for (word, pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag)
                     for ((word, pos), chunktag) in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)

test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])
bigram_chunker = BigramChunker(train_sents)
print(bigram_chunker.evaluate(test_sents))

>>> bigram_chunker = BigramChunker(train_sents)
>>> print(bigram_chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  93.3%
    Precision:     82.3%
    Recall:        86.8%
    F-Measure:     84.5%

The Bigram tagger mapping
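
The mapping shown on this slide is not reproduced in the transcript. Assuming bigram_chunker has been trained as on the previous slides, it can be regenerated with the same idiom used for the unigram chunker (note that a bigram tagger conditions on the preceding tag, so some entries may come back as None for unseen tag bigrams):

postags = sorted(set(pos for sent in train_sents
                     for (word, pos) in sent.leaves()))
print(bigram_chunker.tagger.tag(postags))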

Maximum Entropy Models
Entropy may be understood as a measure of disorder within a macroscopic system. The Shannon entropy is the expected value (average) of the information contained in each message. The principle of maximum entropy states that, subject to precisely stated prior data (such as a proposition that expresses testable information), the probability distribution which best represents the current state of knowledge is the one with the largest entropy. https://en.wikipedia.org/wiki/Principle_of_maximum_entropy
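
For reference, the two quantities mentioned above in standard notation (added here for clarity; not from the slides):

\[
H(X) = -\sum_{x} p(x)\,\log_2 p(x),
\qquad
p^{*} = \operatorname*{arg\,max}_{p \,\in\, \mathcal{C}} H(p)
\]

where $\mathcal{C}$ is the set of probability distributions consistent with the stated constraints (the "testable information").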

ConsecutiveNPChunkTagger – an NP chunk tagger based on consecutive (history-based) classification
"""
@author: http://www.nltk.org/book/ch07.html
"""
import nltk
from nltk.corpus import conll2000

ConsecutiveNPChunkTagger
class ConsecutiveNPChunkTagger(nltk.TaggerI):
    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = npchunk_features(untagged_sent, i, history)
                train_set.append((featureset, tag))
                history.append(tag)
        # 'megam' requires the external megam binary; if it is not installed,
        # drop the algorithm argument to fall back on NLTK's built-in trainers.
        self.classifier = nltk.MaxentClassifier.train(
            train_set, algorithm='megam', trace=0)

ConsecutiveNPChunkTagger
    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = npchunk_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return zip(sentence, history)

ConsecutiveNPChunker
class ConsecutiveNPChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        tagged_sents = [[((w, t), c) for (w, t, c) in nltk.chunk.tree2conlltags(sent)]
                        for sent in train_sents]
        self.tagger = ConsecutiveNPChunkTagger(tagged_sents)

    def parse(self, sentence):
        tagged_sents = self.tagger.tag(sentence)
        conlltags = [(w, t, c) for ((w, t), c) in tagged_sents]
        return nltk.chunk.conlltags2tree(conlltags)

ConsecutiveNPChunkTagger – Feature Extractor. Baseline: the only feature is the POS tag
def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    return {"pos": pos}

chunker = ConsecutiveNPChunker(train_sents)
print(chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  92.9%
    Precision:     79.9%
    Recall:        86.7%
    F-Measure:     83.2%

Next variation – include the previous POS tag
def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    if i == 0:
        prevword, prevpos = "<START>", "<START>"
    else:
        prevword, prevpos = sentence[i-1]
    return {"pos": pos, "prevpos": prevpos}

chunker = ConsecutiveNPChunker(train_sents)
print(chunker.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  93.6%
    Precision:     81.9%
    Recall:        87.2%
    F-Measure:     84.5%

Adding a feature for the current word
def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    if i == 0:
        prevword, prevpos = "<START>", "<START>"
    else:
        prevword, prevpos = sentence[i-1]
    return {"pos": pos, "word": word, "prevpos": prevpos}

chunker = ConsecutiveNPChunker(train_sents)
print(chunker.evaluate(test_sents))

ChunkParse score:
    IOB Accuracy:  94.5%
    Precision:     84.2%
    Recall:        89.4%
    F-Measure:     86.7%

One more set of features – return a dictionary of features:
    "pos": pos,
    "word": word,
    "prevpos": prevpos,
    "nextpos": nextpos,
    "prevpos+pos": "%s+%s" % (prevpos, pos),
    "pos+nextpos": "%s+%s" % (pos, nextpos),
    "tags-since-dt": tags_since_dt(sentence, i)

Feature Extractor
def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    if i == 0:
        prevword, prevpos = "<START>", "<START>"
    else:
        prevword, prevpos = sentence[i-1]
    if i == len(sentence)-1:
        nextword, nextpos = "<END>", "<END>"
    else:
        nextword, nextpos = sentence[i+1]
    return {"pos": pos,
            "word": word,
            "prevpos": prevpos,
            "nextpos": nextpos,
            "prevpos+pos": "%s+%s" % (prevpos, pos),
            "pos+nextpos": "%s+%s" % (pos, nextpos),
            "tags-since-dt": tags_since_dt(sentence, i)}

def tags_since_dt(sentence, i):
    tags = set()
    for word, pos in sentence[:i]:
        if pos == 'DT':
            tags = set()   # reset at each determiner
        else:
            tags.add(pos)
    return '+'.join(sorted(tags))
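
A small illustration (toy tagged sentence invented here, not from the slides) of what tags_since_dt contributes as a feature value:

sent = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
        ("dog", "NN"), ("barked", "VBD")]
# POS tags seen since the most recent determiner, joined in sorted order
print(tags_since_dt(sent, 4))   # -> 'JJ+NN'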

chunker = ConsecutiveNPChunker(train_sents)
print(chunker.evaluate(test_sents))
ChunkParse score:
    IOB Accuracy:  96.0%
    Precision:     88.6%
    Recall:        91.0%
    F-Measure:     89.8%
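
Recap of the F-measures reported on the preceding slides (all numbers copied from the slides above):

Regexp baseline (<[CDJNP].*>+)        69.2%
Unigram chunker                       83.2%
Bigram chunker                        84.5%
Classifier, POS only                  83.2%
Classifier, POS + prev POS            84.5%
Classifier, POS + word + prev POS     86.7%
Classifier, full feature set          89.8%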

Cascaded Chunkers Recursive

Ch07_4_cascadedChunkers.py
import nltk

grammar = r"""
  NP: {<DT|JJ|NN.*>+}          # Chunk sequences of DT, JJ, NN
  PP: {<IN><NP>}               # Chunk prepositions followed by NP
  VP: {<VB.*><NP|PP|CLAUSE>+$} # Chunk verbs and their arguments
  CLAUSE: {<NP><VP>}           # Chunk NP, VP
  """
cp = nltk.RegexpParser(grammar)
sentence = [("Mary", "NN"), ("saw", "VBD"), ("the", "DT"), ("cat", "NN"),
            ("sit", "VB"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]
print(cp.parse(sentence))

But this misses "saw …": in a single pass the VP rule is tried before the CLAUSE covering "the cat sit on the mat" exists, so "saw" is never grouped into a VP.
(S
  (NP Mary/NN)
  saw/VBD
  (CLAUSE
    (NP the/DT cat/NN)
    (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))

And it misses more:
>>> sentence = [("John", "NNP"), ("thinks", "VBZ"), ("Mary", "NN"),
...             ("saw", "VBD"), ("the", "DT"), ("cat", "NN"), ("sit", "VB"),
...             ("on", "IN"), ("the", "DT"), ("mat", "NN")]
>>> print(cp.parse(sentence))
(S
  (NP John/NNP)
  thinks/VBZ
  (NP Mary/NN)
  saw/VBD
  (CLAUSE
    (NP the/DT cat/NN)
    (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))

Loop over patterns
cp = nltk.RegexpParser(grammar, loop=2)
sentence = [("John", "NNP"), ("thinks", "VBZ"), ("Mary", "NN"),
            ("saw", "VBD"), ("the", "DT"), ("cat", "NN"), ("sit", "VB"),
            ("on", "IN"), ("the", "DT"), ("mat", "NN")]
print(cp.parse(sentence))
(S
  (NP John/NNP)
  thinks/VBZ
  (CLAUSE
    (NP Mary/NN)
    (VP
      saw/VBD
      (CLAUSE
        (NP the/DT cat/NN)
        (VP sit/VB (PP on/IN (NP the/DT mat/NN)))))))

Limitations of Cascading Parsers
This cascading process enables us to create deep structures. However, creating and debugging a cascade is difficult, and there comes a point where it is more effective to do full parsing. Also, the cascading process can only produce trees of fixed depth (no deeper than the number of stages in the cascade), and this is insufficient for complete syntactic analysis.

4.2 Tree structures revisited
(S
  (NP Alice)
  (VP
    (V chased)
    (NP (Det the) (N rabbit))))

Ch07_4_tree.py
import nltk

tree1 = nltk.Tree('NP', ['Alice'])
print(tree1)
tree2 = nltk.Tree('NP', ['the', 'rabbit'])
print(tree2)
tree3 = nltk.Tree('VP', ['chased', tree2])
tree4 = nltk.Tree('S', [tree1, tree3])
print(tree4)
print(tree4[1])
print(tree4[1].label())
print(tree4.leaves())
print(tree4[1][1][1])
tree3.draw()
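
For reference, the print calls above should produce output along these lines (the draw() call opens a separate window; this output is reproduced here as a convenience, not from the slides):

(NP Alice)
(NP the rabbit)
(S (NP Alice) (VP chased (NP the rabbit)))
(VP chased (NP the rabbit))
VP
['Alice', 'chased', 'the', 'rabbit']
rabbit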

Ch07_4_treeTraversal.py
def traverse(t):
    try:
        t.label()
    except AttributeError:
        print(t, end=" ")
    else:
        # Now we know that t is a Tree and t.label() is defined
        print('(', t.label(), end=" ")
        for child in t:
            traverse(child)
        print(')', end=" ")

treeTraversal
tree1 = nltk.Tree('NP', ['Alice'])
tree2 = nltk.Tree('NP', ['the', 'rabbit'])
tree3 = nltk.Tree('VP', ['chased', tree2])
tree4 = nltk.Tree('S', [tree1, tree3])
traverse(tree4)
( S ( NP Alice ) ( VP chased ( NP the rabbit ) ) )

5 Named Entities
NE Type        Examples
ORGANIZATION   Georgia-Pacific Corp., WHO
PERSON         Eddy Bonte, President Obama
LOCATION       Murray River, Mount Everest
DATE           June, 2008-06-29
TIME           two fifty a m, 1:30 p.m.
MONEY          175 million Canadian Dollars, GBP 10.40
PERCENT        twenty pct, 18.75 %
FACILITY       Washington Monument, Stonehenge
GPE            South East Asia, Midlothian

Gazetteers
A gazetteer is a geographical dictionary or directory used in conjunction with a map or atlas. Gazetteers typically contain information concerning the geographical makeup, social statistics and physical features of a country, region, or continent.

Blind Gazetteers – blind use of …
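
A minimal sketch (the gazetteer contents and sentence are invented here for illustration) of why blind gazetteer lookup over-generates: every token that happens to appear in the place-name list is tagged LOCATION, regardless of context.

# Toy gazetteer and sentence; real gazetteers are far larger
gazetteer = {"Reading", "Nice", "Mobile", "Houston"}
sentence = "Reading a book on a nice mobile device in Houston".split()
tags = [(w, "LOCATION" if w.title() in gazetteer else "O") for w in sentence]
print(tags)
# 'Reading', 'nice' and 'mobile' are all tagged LOCATION even though
# only 'Houston' names a place in this sentence.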

Ch07_5_neChunker.py
import nltk
from nltk.corpus import conll2000

sent = nltk.corpus.treebank.tagged_sents()[22]
print(nltk.ne_chunk(sent, binary=True))
print(nltk.ne_chunk(sent))

>>> print(nltk.ne_chunk(sent))
(S
  The/DT
  (GPE U.S./NNP)
  is/VBZ
  one/CD
  ...
  according/VBG
  to/TO
  (PERSON Brooke/NNP T./NNP Mossman/NNP)
  ...)

NLTK Ch07 sect 6 - Relation Extraction

import nltk
import re
from nltk.corpus import conll2000

sent = nltk.corpus.treebank.tagged_sents()[22]
print(nltk.ne_chunk(sent))

# match fillers containing the word "in", but not when "in" is followed by a gerund
IN = re.compile(r'.*\bin\b(?!\b.+ing)')
for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
    for rel in nltk.sem.extract_rels('ORG', 'LOC', doc,
                                     corpus='ieer', pattern=IN):
        print(nltk.sem.rtuple(rel))
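
A quick sanity check of what the IN pattern accepts and rejects (toy strings chosen here, not corpus output): it matches a filler containing the word "in", while the negative lookahead rejects fillers in which "in" is followed by a gerund.

import re

IN = re.compile(r'.*\bin\b(?!\b.+ing)')
print(bool(IN.search('headquartered in Atlanta')))               # True  - plain "in"
print(bool(IN.search('success in supervising the transition')))  # False - "in" + gerund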

NLTK Book Ch 07 section 8 Further Reading
Extra materials for this chapter are posted at http://nltk.org/, including links to freely available resources on the web. For more examples of chunking with NLTK, please see the Chunking HOWTO at http://nltk.org/howto.

The popularity of chunking is due in great part to pioneering work by Abney, e.g., (Church, Young, & Bloothooft, 1996). Abney's Cass chunker is described in http://www.vinartus.net/spa/97a.pdf. The word chink initially meant a sequence of stopwords, according to a 1975 paper by Ross and Tukey (Church, Young, & Bloothooft, 1996).

The IOB format (sometimes called the BIO format) was developed for NP chunking by (Ramshaw & Marcus, 1995), and was used for the shared NP bracketing task run by the Conference on Natural Language Learning (CoNLL) in 1999. The same format was adopted by CoNLL 2000 for annotating a section of Wall Street Journal text as part of a shared task on NP chunking.

Section 13.5 of (Jurafsky & Martin, 2008) contains a discussion of chunking. Chapter 22 covers information extraction, including named entity recognition. For information about text mining in biology and medicine, see (Ananiadou & McNaught, 2006).