CSCE 590 Web Scraping – NLTK
Topics: The Natural Language Toolkit (NLTK). Readings: Online Book – March 23, 2017
Natural Language Toolkit (NLTK)
Part-of-speech taggers, statistical libraries, parsers, and corpora
Installing NLTK (Mac/Unix)
Install NLTK: run sudo pip install -U nltk
Install NumPy (optional): run sudo pip install -U numpy
Test the installation: run python, then type import nltk
For older versions of Python it may be necessary to install setuptools and pip (sudo easy_install pip).
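The steps above can be run as a short shell session (assuming pip and python point at the interpreter you want; on many systems the commands are pip3 and python3 instead):

```shell
# Install NLTK and (optionally) NumPy, then verify the import (Mac/Unix).
sudo pip install -U nltk
sudo pip install -U numpy
python -c "import nltk; print(nltk.__version__)"
```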
nltk.download()
>>> import nltk
>>> nltk.download()
Test of download
>>> from nltk.corpus import brown
>>> brown.words()
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
>>> len(brown.words())
Examples from the NLTK Book
Loading text1, ..., text9 and sent1, ..., sent9
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G. K. Chesterton 1908
Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web. O'Reilly Media. Kindle Edition.
Simple Statistical analysis using NLTK
>>> len(text6)/len(set(text6))
>>> from nltk import FreqDist
>>> fdist = FreqDist(text6)
>>> fdist.most_common(10)
[(':', 1197), ('.', 816), ('!', 801), (',', 731), ("'", 421), ('[', 319), (']', 312), ('the', 299), ('I', 255), ('ARTHUR', 225)]
>>> fdist["Grail"]
34
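The same statistics can be reproduced on a toy token list standing in for text6, so the example runs without downloading the NLTK book corpora (the sentence below is illustrative, not from the slides):

```python
from nltk import FreqDist

# Toy token list standing in for text6.
tokens = "the Knights who say Ni demand a shrubbery the Knights say Ni".split()

# Lexical diversity: average number of uses per distinct token.
print(len(tokens) / len(set(tokens)))  # 12 tokens / 8 distinct = 1.5

fdist = FreqDist(tokens)
print(fdist.most_common(3))
print(fdist["Ni"])  # 2
```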
Bigrams – ngrams
from nltk.book import *
from nltk import ngrams

fourgrams = ngrams(text6, 4)
for fourgram in fourgrams:
    if fourgram[0] == "coconut":
        print(fourgram)
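The same loop can be tried on a toy token list, so it runs without the NLTK book corpora (the tokens are illustrative, not from the slides):

```python
from nltk import ngrams

# Toy token list standing in for text6.
tokens = ["she", "carried", "a", "coconut", "by", "the", "husk"]

# ngrams() yields tuples of n consecutive tokens.
for fourgram in ngrams(tokens, 4):
    if fourgram[0] == "coconut":
        print(fourgram)  # ('coconut', 'by', 'the', 'husk')
```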
nltkFreqDist.py – BeautifulSoup + NLTK example
from nltk import FreqDist
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("...")  # URL elided on the original slide
bsObj = BeautifulSoup(html.read(), "lxml")
#print(bsObj.h1)
mytext = bsObj.get_text()
fdist = FreqDist(mytext)  # note: over a raw string this counts characters, not words
print(fdist.most_common(10))
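Because the slide's URL is elided, here is an offline sketch of the same pipeline on a literal HTML string (the snippet is illustrative, not from the slide), splitting into words before counting so FreqDist tallies words rather than characters:

```python
from bs4 import BeautifulSoup
from nltk import FreqDist

# Stand-in HTML, since the slide's URL is elided.
html = "<html><body><h1>Spam</h1><p>spam spam eggs and spam</p></body></html>"
soup = BeautifulSoup(html, "html.parser")
mytext = soup.get_text(" ")

# Split into words before counting; FreqDist over a raw string counts characters.
fdist = FreqDist(mytext.lower().split())
print(fdist.most_common(3))  # [('spam', 4), ('eggs', 1), ('and', 1)]
```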
FreqDist of ngrams (4-grams)
>>> from nltk import ngrams
>>> fourgrams = ngrams(text6, 4)
>>> fourgramsDist = FreqDist(fourgrams)
>>> fourgramsDist[("father", "smelt", "of", "elderberries")]
1
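The same count can be reproduced on a toy token list, so it runs without the NLTK book corpora (the tokens are illustrative, not from the slides):

```python
from nltk import FreqDist, ngrams

# Toy token list standing in for text6.
tokens = "your father smelt of elderberries and your father smelt of cheese".split()

# FreqDist accepts any iterable, including the generator ngrams() returns;
# note the generator is consumed once it has been counted.
fourgramsDist = FreqDist(ngrams(tokens, 4))
print(fourgramsDist[("father", "smelt", "of", "elderberries")])  # 1
```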
Penn Tree Bank Tagging (default)
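NLTK's default tagger emits Penn Treebank tags. A few of the most common ones, collected here as a reference (the dict name is illustrative; meanings follow the Penn Treebank tag set):

```python
# A few common Penn Treebank POS tags, as emitted by nltk.pos_tag by default.
PENN_TAGS = {
    "NN":  "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "VB":  "verb, base form",
    "VBD": "verb, past tense",
    "JJ":  "adjective",
    "RB":  "adverb",
    "DT":  "determiner",
    "IN":  "preposition or subordinating conjunction",
    "PRP": "personal pronoun",
}
for tag, meaning in PENN_TAGS.items():
    print(tag, "-", meaning)
```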
POS tagging
NltkAnalysis.py
from nltk import word_tokenize, sent_tokenize, pos_tag

sentences = sent_tokenize("Google is one of the best companies in the world. I constantly google myself to see what I'm up to.")
nouns = ['NN', 'NNS', 'NNP', 'NNPS']

for sentence in sentences:
    if "google" in sentence.lower():
        taggedWords = pos_tag(word_tokenize(sentence))
        for word in taggedWords:
            if word[0].lower() == "google" and word[1] in nouns:
                print(sentence)