CSCE 590 Web Scraping – NLTK Topics
The Natural Language Toolkit (NLTK)
Readings: Online Book – http://www.nltk.org/book/
March 23, 2017
Natural Language Toolkit (NLTK)
- Part-of-speech taggers
- Statistical libraries
- Parsers
- Corpora
Installing NLTK – http://www.nltk.org/
Mac/Unix:
1. Install NLTK: run sudo pip install -U nltk
2. Install NumPy (optional): run sudo pip install -U numpy
3. Test installation: run python, then type import nltk
For older versions of Python it may be necessary to install setuptools (see http://pypi.python.org/pypi/setuptools) and to install pip (sudo easy_install pip).
nltk.download()
>>> import nltk
>>> nltk.download()
Test of download
>>> from nltk.corpus import brown
>>> brown.words()
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
>>> len(brown.words())
1161192
Examples from the NLTK Book
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Locations 3364-3367). O'Reilly Media. Kindle Edition.
Simple statistical analysis using NLTK
>>> len(text6)/len(set(text6))
7.833333333333333
>>> from nltk import FreqDist
>>> fdist = FreqDist(text6)
>>> fdist.most_common(10)
[(':', 1197), ('.', 816), ('!', 801), (',', 731), ("'", 421), ('[', 319), (']', 312), ('the', 299), ('I', 255), ('ARTHUR', 225)]
>>> fdist["Grail"]
34
Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Locations 3375-3385). O'Reilly Media. Kindle Edition.
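Under the hood, FreqDist is essentially a token counter. A minimal pure-Python sketch of the same idea using collections.Counter — the token list below is an invented stand-in for text6, not the real corpus:

```python
from collections import Counter

# Hypothetical token list standing in for text6
tokens = ["the", "Grail", "the", "ARTHUR", "the", "Grail"]

fdist = Counter(tokens)
print(fdist.most_common(2))  # highest-frequency tokens first
# [('the', 3), ('Grail', 2)]
print(fdist["Grail"])        # count of a single token
# 2
```

Like FreqDist, Counter supports most_common(n) and dictionary-style lookup of individual token counts.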
Bigrams – ngrams
from nltk.book import *
from nltk import ngrams
fourgrams = ngrams(text6, 4)
for fourgram in fourgrams:
    if fourgram[0] == "coconut":
        print(fourgram)
Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Locations 3407-3412). O'Reilly Media. Kindle Edition.
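Conceptually, nltk.ngrams just slides a fixed-size window over the token sequence. A pure-Python sketch of that behavior — the token list here is an invented example, not text6:

```python
def ngrams(tokens, n):
    """Yield every consecutive n-token window, like nltk.ngrams."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

tokens = ["a", "coconut", "migrates", "south"]
print(list(ngrams(tokens, 2)))
# [('a', 'coconut'), ('coconut', 'migrates'), ('migrates', 'south')]
```

Because the result is a generator of tuples, it can be consumed once in a for loop or fed directly to FreqDist, exactly as the slides do with fourgrams.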
nltkFreqDist.py – BeautifulSoup + NLTK example
from nltk import FreqDist, word_tokenize
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/exercises/exercise1.html")
bsObj = BeautifulSoup(html.read(), "lxml")
#print(bsObj.h1)
mytext = bsObj.get_text()
# Tokenize first: FreqDist over a raw string counts characters, not words
fdist = FreqDist(word_tokenize(mytext))
print(fdist.most_common(10))
FreqDist of ngrams (fourgrams)
>>> from nltk import ngrams
>>> fourgrams = ngrams(text6, 4)
>>> fourgramsDist = FreqDist(fourgrams)
>>> fourgramsDist[("father", "smelt", "of", "elderberries")]
1
Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Locations 3398-3403). O'Reilly Media. Kindle Edition.
Penn Tree Bank Tagging (default)
POS tagging
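nltk.pos_tag returns a list of (token, tag) pairs using the Penn Treebank tagset (NN, NNP, VBZ, DT, ...). A toy lookup tagger sketching that output format — the mini-lexicon below is invented for illustration, and the real pos_tag uses a trained statistical model rather than a lookup table:

```python
# Hypothetical mini-lexicon; real taggers are statistical, not lookup tables
LEXICON = {"Google": "NNP", "is": "VBZ", "a": "DT", "company": "NN"}

def toy_pos_tag(tokens):
    # Fall back to NN for unknown tokens, mimicking a common default tag
    return [(tok, LEXICON.get(tok, "NN")) for tok in tokens]

print(toy_pos_tag(["Google", "is", "a", "company"]))
# [('Google', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('company', 'NN')]
```

The (token, tag) pair shape is what the NltkAnalysis.py example on the next slide relies on when it checks word[1] against a list of noun tags.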
NltkAnalysis.py
from nltk import word_tokenize, sent_tokenize, pos_tag

sentences = sent_tokenize("Google is one of the best companies in the world. I constantly google myself to see what I'm up to.")
nouns = ['NN', 'NNS', 'NNP', 'NNPS']
for sentence in sentences:
    if "google" in sentence.lower():
        taggedWords = pos_tag(word_tokenize(sentence))
        for word in taggedWords:
            if word[0].lower() == "google" and word[1] in nouns:
                print(sentence)