CSCE 590 Web Scraping – NLTK


1 CSCE 590 Web Scraping – NLTK
Topics: The Natural Language Toolkit (NLTK)
Readings: Online Book – March 23, 2017

2

3 Natural Language Toolkit (NLTK)
- Part-of-speech taggers
- Statistical libraries
- Parsers
- Corpora

4 Installing NLTK (Mac/Unix)
Install NLTK: run sudo pip install -U nltk
Install NumPy (optional): run sudo pip install -U numpy
Test installation: run python, then type import nltk
For older versions of Python it may be necessary to install setuptools and pip (sudo easy_install pip).

5 nltk.download()
>>> import nltk
>>> nltk.download()

6 Test of download
>>> from nltk.corpus import brown
>>> brown.words()
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
>>> len(brown.words())

7 Examples from the NLTK Book
Loading text1, ..., text9 and sent1, ..., sent9
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G. K. Chesterton 1908
Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web. O'Reilly Media. Kindle Edition.

8 Simple Statistical Analysis Using NLTK
>>> len(text6)/len(set(text6))
>>> from nltk import FreqDist
>>> fdist = FreqDist(text6)
>>> fdist.most_common(10)
[(':', 1197), ('.', 816), ('!', 801), (',', 731), ("'", 421), ('[', 319), (']', 312), ('the', 299), ('I', 255), ('ARTHUR', 225)]
>>> fdist["Grail"]
34
Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web. O'Reilly Media. Kindle Edition.
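FreqDist behaves like a frequency counter over tokens. The same ideas can be sketched in plain Python with collections.Counter, using a made-up token list (the slide's text6 requires the NLTK book data, so the tokens below are illustrative only):

```python
from collections import Counter

# Stand-in token list; on the slide, text6 is the Monty Python script from nltk.book.
tokens = ["the", "Holy", "Grail", "the", "Grail", "ARTHUR", "the"]

fdist = Counter(tokens)        # analogous to FreqDist(text6)
print(fdist.most_common(2))    # most frequent tokens with their counts
print(fdist["Grail"])          # count of a single token, like fdist["Grail"]

# Lexical diversity, as on the slide: total tokens / distinct tokens.
print(len(tokens) / len(set(tokens)))
```

The first line of the slide, len(text6)/len(set(text6)), is this same ratio: on average, how many times each distinct word is reused.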

9 Bigrams - ngrams
from nltk.book import *
from nltk import ngrams
fourgrams = ngrams(text6, 4)
for fourgram in fourgrams:
    if fourgram[0] == "coconut":
        print(fourgram)
Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web. O'Reilly Media. Kindle Edition.
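nltk.ngrams slides a window of n tokens across the sequence. A plain-Python equivalent, sketched with zip and an illustrative token list (not the real text6):

```python
def ngrams(tokens, n):
    # Slide an n-token window across the sequence, like nltk.ngrams.
    return zip(*(tokens[i:] for i in range(n)))

tokens = ["she", "carried", "a", "coconut", "to", "the", "castle"]
for fourgram in ngrams(tokens, 4):
    if fourgram[0] == "coconut":
        print(fourgram)  # ('coconut', 'to', 'the', 'castle')
```

Note that ngrams returns a generator: it can be looped over once, after which it is exhausted.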

10 nltkFreqDist.py – BeautifulSoup + NLTK example
from nltk import FreqDist
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("")  # URL missing from the original slide
bsObj = BeautifulSoup(html.read(), "lxml")
#print(bsObj.h1)
mytext = bsObj.get_text()
fdist = FreqDist(mytext)
print(fdist.most_common(10))
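One caveat with the example above: mytext is a raw string, and iterating a string yields characters, so FreqDist(mytext) tallies characters, not words. Tokenizing first (e.g. with nltk.word_tokenize) gives word counts instead. A plain-Python sketch of the difference, using Counter and a simple whitespace split as a stand-in tokenizer:

```python
from collections import Counter

text = "to be or not to be"
char_counts = Counter(text)          # like FreqDist(mytext): per-character tallies
word_counts = Counter(text.split())  # like FreqDist(word_tokenize(mytext)): per-word tallies

print(char_counts.most_common(2))
print(word_counts.most_common(2))
```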

11 FreqDist of ngrams (4-grams)
>>> from nltk import ngrams
>>> fourgrams = ngrams(text6, 4)
>>> fourgramsDist = FreqDist(fourgrams)
>>> fourgramsDist[("father", "smelt", "of", "elderberries")]
1
Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web. O'Reilly Media. Kindle Edition.
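Counting n-grams works the same way as counting words: build the n-gram tuples, then feed them to a frequency counter. A plain-Python sketch with a made-up token list (the slide uses text6 from the NLTK book data):

```python
from collections import Counter

def ngrams(tokens, n):
    # n-token sliding windows, like nltk.ngrams.
    return zip(*(tokens[i:] for i in range(n)))

# Illustrative tokens in which one 4-gram repeats.
tokens = ["father", "smelt", "of", "elderberries",
          "father", "smelt", "of", "elderberries"]

dist = Counter(ngrams(tokens, 4))  # analogous to FreqDist(fourgrams)
print(dist[("father", "smelt", "of", "elderberries")])
```

Tuples are hashable, so they work directly as counter (or FreqDist) keys.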

12 Penn Treebank Tagging (default)

13 POS tagging

14 NltkAnalysis.py
from nltk import word_tokenize, sent_tokenize, pos_tag

sentences = sent_tokenize("Google is one of the best companies in the world. "
                          "I constantly google myself to see what I'm up to.")
nouns = ['NN', 'NNS', 'NNP', 'NNPS']
for sentence in sentences:
    if "google" in sentence.lower():
        taggedWords = pos_tag(word_tokenize(sentence))
        for word in taggedWords:
            if word[0].lower() == "google" and word[1] in nouns:
                print(sentence)
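The filtering step above can be sketched without NLTK. pos_tag returns a list of (word, tag) tuples; the hand-written tuples below are illustrative stand-ins for that output, not produced by a real tagger:

```python
# Plain-Python sketch of the noun filter in NltkAnalysis.py.
# In real code, `tagged` would come from nltk.pos_tag(word_tokenize(sentence)).
nouns = ['NN', 'NNS', 'NNP', 'NNPS']

tagged = [("Google", "NNP"), ("is", "VBZ"), ("one", "CD"),
          ("of", "IN"), ("the", "DT"), ("best", "JJS"),
          ("companies", "NNS"), ("in", "IN"), ("the", "DT"),
          ("world", "NN"), (".", ".")]

# Keep the sentence only if "google" appears tagged as a noun.
is_google_noun = any(word.lower() == "google" and tag in nouns
                     for word, tag in tagged)
print(is_google_noun)  # True: "Google" is tagged NNP here
```

This is what distinguishes the two sentences in the slide: "Google" the company is a proper noun (NNP), while "google" used as a verb gets a VB* tag and is filtered out.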

15

16


Download ppt "CSCE 590 Web Scraping – NLTK"

Similar presentations


Ads by Google