CSCE 590 Web Scraping - NLTK
Topics: Introduction to NLTK; Parsing with the NLTK
Readings: Online book
February 21, 2017

http://www.nltk.org/book/
0. Preface
1. Language Processing and Python
2. Accessing Text Corpora and Lexical Resources
3. Processing Raw Text
4. Writing Structured Programs
5. Categorizing and Tagging Words (minor fixes still required)
6. Learning to Classify Text
7. Extracting Information from Text
8. Analyzing Sentence Structure
9. Building Feature Based Grammars
10. Analyzing the Meaning of Sentences (minor fixes still required)
11. Managing Linguistic Data (minor fixes still required)
12. Afterword: Facing the Language Challenge
Bibliography
Term Index
http://textminingonline.com/dive-into-nltk-part-i-getting-started-with-nltk

Installing NLTK
Install Setuptools: http://pypi.python.org/pypi/setuptools
Install Pip: run sudo easy_install pip
Install Numpy (optional): run sudo pip install -U numpy
Install PyYAML and NLTK: run sudo pip install -U pyyaml nltk
Test installation: run python, then type import nltk

Installing NLTK Data
    >>> import nltk
    >>> nltk.download()

Test NLTK Installation
1) Test Brown Corpus:
    >>> from nltk.corpus import brown
    >>> brown.words()[0:10]
    ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of']
    >>> brown.tagged_words()[0:10]
    [('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN')]
    >>> len(brown.words())
    1161192

Sentence tokenization (sentence boundary detection, sentence segmentation), word tokenization, and POS tagging:
    >>> from nltk import sent_tokenize, word_tokenize, pos_tag
    >>> text = "Machine learning …"
    >>> sents = sent_tokenize(text)
    >>> sents
    >>> tokens = word_tokenize(text)
    >>> tokens

Part of Speech Tagging
    >>> len(tokens)
    161
    >>> tagged_tokens = pos_tag(tokens)
    >>> tagged_tokens
    [('Machine', 'NN'), ('learning', 'NN'), ('is', 'VBZ'), ('the', 'DT'), ('science', 'NN'), ('of', 'IN'), ('getting', 'VBG'), ('computers', 'NNS'), ('to', 'TO'), ('act', 'VB'), …

Parsing
Recursive Descent Parsing with NLTK
Parsers
    nltk.parse_cfg(grammar)            # build a CFG from a string (NLTK 3: nltk.CFG.fromstring)
    nltk.ChartParser(g)                # build a chart parser from grammar g
    nltk.RecursiveDescentParser(g)     # build a recursive descent parser from grammar g
    nltk.app.rdparser_app.RecursiveDescentApp
    nltk.app.srparser_app.ShiftReduceApp
Imports
    import string
    import nltk
    from nltk import parse, tokenize, Tree, in_idle
    from nltk.draw.util import *
    from nltk.draw.tree import *
    from nltk.draw.cfg import *

Groucho Grammar
    groucho_grammar = nltk.parse_cfg("""
    S -> NP VP
    PP -> P NP
    NP -> Det N | Det N PP | 'I'
    VP -> V NP | VP PP
    Det -> 'an' | 'my'
    N -> 'elephant' | 'pajamas'
    V -> 'shot'
    P -> 'in'
    """)

The ChartParser program
    sent = ['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas']
    print sent
    parser = nltk.ChartParser(groucho_grammar)
    trees = parser.nbest_parse(sent)   # NLTK 3: parser.parse(sent)
    for tree in trees:
        print tree

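Groucho Output
    ['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas']
    (S
      (NP I)
      (VP
        (V shot)
        (NP (Det an) (N elephant) (PP (P in) (NP (Det my) (N pajamas))))))
    (S
      (NP I)
      (VP
        (VP (V shot) (NP (Det an) (N elephant)))
        (PP (P in) (NP (Det my) (N pajamas)))))
The sentence is ambiguous: the PP "in my pajamas" attaches to the NP in the first tree and to the VP in the second.
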
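To see where the two trees come from without relying on any particular NLTK version, the following is a minimal pure-Python sketch that enumerates every parse of the Groucho sentence. It is not NLTK's chart parser. The names GRAMMAR, SENT, parses and expand are made up for this illustration, and the approach assumes the grammar has no cycles of single-symbol rules.

```python
from functools import lru_cache

# Groucho grammar from the slide: left-hand side -> list of right-hand sides.
# Quoted terminals from the slide are stored as plain words (never dict keys).
GRAMMAR = {
    "S": [("NP", "VP")],
    "PP": [("P", "NP")],
    "NP": [("Det", "N"), ("Det", "N", "PP"), ("I",)],
    "VP": [("V", "NP"), ("VP", "PP")],
    "Det": [("an",), ("my",)],
    "N": [("elephant",), ("pajamas",)],
    "V": [("shot",)],
    "P": [("in",)],
}

SENT = ("I", "shot", "an", "elephant", "in", "my", "pajamas")


@lru_cache(maxsize=None)
def parses(symbol, start, end):
    """Return a tuple of bracketed trees for `symbol` covering SENT[start:end]."""
    if symbol not in GRAMMAR:
        # Terminal: it must cover exactly one word and equal that word.
        ok = end - start == 1 and SENT[start] == symbol
        return (symbol,) if ok else ()
    found = []
    for rhs in GRAMMAR[symbol]:
        for children in expand(rhs, start, end):
            found.append("(" + symbol + " " + " ".join(children) + ")")
    return tuple(found)


def expand(rhs, start, end):
    """Yield lists of subtrees, one per symbol in `rhs`, covering SENT[start:end]."""
    if len(rhs) == 1:
        for tree in parses(rhs[0], start, end):
            yield [tree]
        return
    # Each symbol must cover at least one word, so the left-recursive
    # rule VP -> VP PP always recurses on a strictly shorter span.
    for mid in range(start + 1, end - len(rhs) + 2):
        for first in parses(rhs[0], start, mid):
            for rest in expand(rhs[1:], mid, end):
                yield [first] + rest


trees = parses("S", 0, len(SENT))
for tree in trees:
    print(tree)
```

The search finds two trees for S, one for each attachment of the PP, which is the same ambiguity the ChartParser output shows.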
Loading grammars
Contents of mygrammar.cfg:
    # NLTK - mygrammar.cfg - to illustrate loading of grammars
    # grammar1 = nltk.data.load('file:mygrammar.cfg')
    S -> NP VP
    VP -> V NP
    NP -> N | DET N
    N -> 'Mary' | 'Bob' | 'dog'
    V -> 'saw'
    DET -> 'the' | 'a'

Example loading "mygrammar.cfg"
    grammar1 = nltk.data.load('file:mygrammar.cfg')
    print grammar1
    sent = "Mary saw Bob".split()
    print sent
    rd_parser = nltk.RecursiveDescentParser(grammar1)
    for tree in rd_parser.nbest_parse(sent):
        print tree

Checking the grammar
    # to dump the grammar
    grammar1 = nltk.data.load('file:mygrammar.cfg')
    print grammar1
    # or you can iterate through the productions
    for p in grammar1.productions():
        print p

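Iterating through the productions gives one entry per alternative, so a rule such as NP -> N | DET N counts as two productions. The sketch below illustrates that idea in plain Python on the text of mygrammar.cfg. It is not how nltk.data.load reads the file, and read_productions and GRAMMAR_TEXT are hypothetical names introduced here. It also assumes that no terminal contains '#', '|' or '->'.

```python
GRAMMAR_TEXT = """
# NLTK - mygrammar.cfg - to illustrate loading of grammars
S -> NP VP
VP -> V NP
NP -> N | DET N
N -> 'Mary' | 'Bob' | 'dog'
V -> 'saw'
DET -> 'the' | 'a'
"""


def read_productions(text):
    """Split grammar text into (lhs, rhs) pairs, one pair per alternative."""
    productions = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        lhs, alternatives = line.split("->")
        for alt in alternatives.split("|"):
            # Terminals keep their quotes, e.g. "'Mary'".
            productions.append((lhs.strip(), tuple(alt.split())))
    return productions


productions = read_productions(GRAMMAR_TEXT)
for lhs, rhs in productions:
    print(lhs, "->", " ".join(rhs))
```

For this file the six rule lines expand to ten productions.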
Extending the grammar
    sent = 'Mary saw a cat'.split()
    for t in rd_parser.nbest_parse(sent):
        print t

    Traceback (most recent call last):
      File "C:/Python25/PythonCodeExamplesMMM/rdparser.py", line 59, in <module>
        for t in rd_parser.nbest_parse(sent):
      File "C:\Python25\lib\site-packages\nltk\parse\rd.py", line 77, in nbest_parse
        self._grammar.check_coverage(tokens)
      File "C:\Python25\lib\site-packages\nltk\grammar.py", line 431, in check_coverage
        "input words: %r." % missing)
    ValueError: Grammar does not cover some of the input words: "'cat'".

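The traceback shows the parser refusing to start because 'cat' is not a terminal of the grammar. A minimal sketch of that kind of coverage check, and of the one-line extension that fixes it, might look like the following. It is plain Python and not NLTK's check_coverage. The names lexicon, missing_words and EXTENDED_TEXT are invented for this example, and the added rule N -> 'cat' is one plausible extension, not necessarily the one used in class.

```python
import re

GRAMMAR_TEXT = """
S -> NP VP
VP -> V NP
NP -> N | DET N
N -> 'Mary' | 'Bob' | 'dog'
V -> 'saw'
DET -> 'the' | 'a'
"""


def lexicon(grammar_text):
    """Collect every quoted terminal that appears in the grammar text."""
    return set(re.findall(r"'([^']*)'", grammar_text))


def missing_words(grammar_text, tokens):
    """Return the tokens that no lexical rule of the grammar can produce."""
    known = lexicon(grammar_text)
    return [tok for tok in tokens if tok not in known]


sent = "Mary saw a cat".split()
print(missing_words(GRAMMAR_TEXT, sent))    # words the original grammar lacks

# Hypothetical extension: add one lexical rule for the missing noun.
EXTENDED_TEXT = GRAMMAR_TEXT + "N -> 'cat'\n"
print(missing_words(EXTENDED_TEXT, sent))   # nothing missing after the fix
```

Running a check like this before parsing gives a clearer message than a failed parse, because a word outside the lexicon can never be matched by any rule.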
Tracing
RecursiveDescentParser() takes an optional parameter trace. If trace is greater than zero, the parser reports the steps it takes as it parses a text.
    rd_parser = nltk.RecursiveDescentParser(grammar1, 2)

    Parsing 'Mary saw a dog'
        [ * S ]
      E [ * NP VP ]
      E [ * N VP ]
      E [ * 'Mary' VP ]
      M [ 'Mary' * VP ]
      E [ 'Mary' * V NP ]
      E [ 'Mary' * 'saw' NP ]
      M [ 'Mary' 'saw' * NP ]
      E [ 'Mary' 'saw' * N ]
      E [ 'Mary' 'saw' * 'Mary' ]
      E [ 'Mary' 'saw' * 'Bob' ]
      E [ 'Mary' 'saw' * 'dog' ]
      E [ 'Mary' 'saw' * DET N ]
      E [ 'Mary' 'saw' * 'the' N ]
      …
    (S (NP (N Mary)) (VP (V saw) (NP (DET a) (N dog))))

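In the trace, E marks an expand step (the leftmost nonterminal is replaced by the right-hand side of one of its productions) and M marks a match step (the leftmost word on the frontier equals the next input token). The sketch below is a small pure-Python recognizer that records the same two kinds of step. It is an illustration of top-down search with backtracking, not NLTK's RecursiveDescentParser. The names GRAMMAR and recognize are invented here, and, like any recursive descent parser, it would loop forever on a left-recursive grammar.

```python
# mygrammar.cfg as a dict; words are plain strings that are not dict keys.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "VP": [["V", "NP"]],
    "NP": [["N"], ["DET", "N"]],
    "N": [["Mary"], ["Bob"], ["dog"]],
    "V": [["saw"]],
    "DET": [["the"], ["a"]],
}


def recognize(tokens, trace=0):
    """Return (accepted, steps) for a top-down, backtracking search from S."""
    steps = []

    def search(pos, frontier):
        # pos: number of tokens matched so far; frontier: symbols still to derive.
        if not frontier:
            return pos == len(tokens)
        head, rest = frontier[0], frontier[1:]
        if head in GRAMMAR:
            # Expand: try each production of the leftmost nonterminal in order.
            for rhs in GRAMMAR[head]:
                steps.append(("E", tokens[:pos], rhs + rest))
                if search(pos, rhs + rest):
                    return True
            return False  # every expansion failed, so backtrack
        # Match: the leftmost symbol is a word and must equal the next token.
        if pos < len(tokens) and tokens[pos] == head:
            steps.append(("M", tokens[:pos + 1], rest))
            return search(pos + 1, rest)
        return False

    accepted = search(0, ["S"])
    if trace > 0:
        for op, done, todo in steps:
            print(op, "[", *done, "*", *todo, "]")
    return accepted, steps


accepted, steps = recognize("Mary saw a dog".split(), trace=1)
print(accepted)
```

The failed attempts to expand N to 'Mary', 'Bob' and 'dog' before trying DET N appear in the recorded steps, which is the backtracking visible in the NLTK trace above.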
Example grammar L0 based on the ATIS corpus
    S -> NP VP
    NP -> Pronoun | Proper-noun | Det Nominal
    Nominal -> Nominal Noun | Noun
    VP -> Verb | Verb NP | Verb NP PP | Verb PP
    PP -> Preposition NP

Lexicon for L0
    Noun -> flights | breeze | trip | morning
    Verb -> is | prefer | like | need | want | fly
    …

nltk.app.rdparser_app, lines 864-886
    def app():
        """
        Create a recursive descent parser demo, using a simple grammar and
        text.
        """
        from nltk import parse_cfg
        grammar = parse_cfg("""
        # Grammatical productions.
            S -> NP VP
            NP -> Det N PP | Det N
            VP -> V NP PP | V NP | V
            PP -> P NP
        # Lexical productions.
            NP -> 'I'
            Det -> 'the' | 'a'
            N -> 'man' | 'park' | 'dog' | 'telescope'
            V -> 'ate' | 'saw'
            P -> 'in' | 'under' | 'with'
        """)

Example nltk.app.rdparser
    import string
    import nltk
    from nltk import parse, tokenize, Tree, in_idle
    from nltk.draw.util import *
    from nltk.draw.tree import *
    from nltk.draw.cfg import *
    from nltk.app.rdparser_app import RecursiveDescentApp

    # grammar is the demo grammar built with parse_cfg on the previous slide
    sent = 'the dog saw a man in the park'.split()
    RecursiveDescentApp(grammar, sent).mainloop()

Example nltk.app.srparser
    #import string
    import nltk
    from nltk import parse, tokenize, Tree, in_idle
    from nltk.draw.util import *
    from nltk.draw.tree import *
    from nltk.draw.cfg import *
    from nltk import parse_cfg
    from nltk.app import *

    nltk.app.srparser()