1
NLTK http://www.nltk.org
Natural Language Processing with Python, Steven Bird, Ewan Klein, and Edward Loper, O'Reilly, 2009.
2
NLTK Open-source Python modules and linguistic data for Natural Language Processing applications. Developed by a group of contributors; project leaders: Steven Bird, Edward Loper, Ewan Klein
3
NLP Natural Language Processing: a field of computer science and linguistics that works out the interactions between computers and human (natural) languages. Main uses: machine translation, automatic summarization, information extraction, transliteration, question answering, opinion mining. Basic tasks: stemming, POS tagging, chunking, parsing. Involves everything from simple frequency counts to understanding and generating complex text.
4
NLP/NLTK vocabulary Tokenization: splitting text into words and punctuation. Stemming: reducing a word to its (inflectional) root; plays, playing, played : play. POS (Part of Speech) tagging: Ram NNP killed VBD Ravana NNP. POS tags: NNP proper noun, VBD verb, past tense
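A minimal interactive sketch of these three steps, assuming the punkt and averaged_perceptron_tagger NLTK data packages have already been downloaded (the sentence is just an illustration):
>>> import nltk
>>> from nltk.stem import PorterStemmer
>>> tokens = nltk.word_tokenize("Ram killed Ravana")        # tokenization
>>> tokens
['Ram', 'killed', 'Ravana']
>>> [PorterStemmer().stem(w) for w in ["plays", "playing", "played"]]   # stemming
['play', 'play', 'play']
>>> nltk.pos_tag(tokens)                                     # POS tagging
[('Ram', 'NNP'), ('killed', 'VBD'), ('Ravana', 'NNP')]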
5
NLP/NLTK vocabulary Chunking: grouping POS-tagged words into phrases (chunks), e.g. noun phrases
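A small sketch of chunking with NLTK's RegexpParser, assuming a simple noun-phrase grammar (the grammar and sentence are illustrative, not from the slides):
>>> import nltk
>>> sentence = [("the", "DT"), ("little", "JJ"), ("dog", "NN"), ("barked", "VBD")]
>>> grammar = "NP: {<DT>?<JJ>*<NN>}"     # chunk determiner + adjectives + noun into an NP
>>> cp = nltk.RegexpParser(grammar)
>>> print(cp.parse(sentence))
(S (NP the/DT little/JJ dog/NN) barked/VBD)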
6
Installing NLTK data We need data to experiment with: $ python
>>> import nltk
>>> nltk.download()
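If you prefer to skip the interactive downloader window, the collection used by the following slides can be fetched directly; this assumes the 'book' collection identifier, which bundles the corpora loaded by nltk.book:
>>> import nltk
>>> nltk.download('book')   # downloads the texts used by `from nltk.book import *`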
7
Basic commands Data from the book: >>> from nltk.book import *
See what it is: >>> text1
Search for a word: >>> text1.concordance("monstrous")
Find similar words: >>> text1.similar("monstrous")
8
Basic commands Find common contexts:
>>> text2.common_contexts(["monstrous", "very"])
Find the positions of words across the entire text:
>>> text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
Counting tokens:
>>> len(text4)   # gives the number of tokens (words and punctuation)
9
Basic commands, example
Vocabulary (unique words) used by the author: >>> len(set(text1))
See the sorted vocabulary (uppercase first!): >>> sorted(set(text1))
Measuring richness of the text (average repetition): >>> len(text1) / len(set(text1))
(on Python 2, first run from __future__ import division to get floating-point division)
Counting occurrences: >>> text5.count("lol")
Percentage of the text taken by a word: >>> 100 * text5.count('a') / len(text5)
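These one-liners are often wrapped in small helper functions; a minimal sketch (the names lexical_diversity and percentage are illustrative, not defined by NLTK):
>>> def lexical_diversity(text):
...     return len(text) / len(set(text))    # average number of times each word type is used
...
>>> def percentage(count, total):
...     return 100 * count / total           # share of the text taken by one word
...
>>> lexical_diversity(text1)
>>> percentage(text5.count('a'), len(text5))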
10
Count_words.py: NLTK Frequency Distribution
>>> fdist1 = FreqDist(text1)
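The slide only builds the distribution; a short sketch of what it is typically used for next (the word 'whale' is just an illustrative key; most_common is the NLTK 3 API):
>>> fdist1.most_common(10)               # the ten most frequent tokens
>>> fdist1['whale']                      # how many times one particular word occurs
>>> fdist1.plot(50, cumulative=True)     # cumulative frequency plot of the top 50 tokens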
11
Scraping the web with NLTK
What about HTML?
>>> from urllib import urlopen
>>> url = " m"
>>> html = urlopen(url).read()
>>> html[:60]
It's HTML; clean the tags...
>>> raw = nltk.clean_html(html)
>>> tokens = nltk.word_tokenize(raw)
>>> tokens[:60]
>>> text = nltk.Text(tokens)
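Note that nltk.clean_html was removed in NLTK 3, which directs users to an HTML parser such as BeautifulSoup instead; a rough Python 3 sketch of the same pipeline (the url value is a placeholder for whatever page you want to scrape):
>>> from urllib.request import urlopen
>>> from bs4 import BeautifulSoup
>>> html = urlopen(url).read()
>>> raw = BeautifulSoup(html, "html.parser").get_text()   # strip the HTML tags
>>> tokens = nltk.word_tokenize(raw)
>>> text = nltk.Text(tokens)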
12
POS-tagging and stemming
First, define the lemmatizer: >>> wnl = nltk.WordNetLemmatizer()
Second, POS-tag the tokenized sentence: >>> pos_tagged = nltk.pos_tag(tokens)
>>> stem_words = [wnl.lemmatize(w, p) for (w, p) in pos_tagged]
(lemmatize expects WordNet POS tags such as 'n' or 'v', not Penn Treebank tags like 'VBD', so the tags have to be mapped first; see the sketch below)
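A minimal sketch of that tag mapping, assuming a helper name penn_to_wordnet that is not in the slides:
>>> from nltk.corpus import wordnet
>>> def penn_to_wordnet(tag):
...     if tag.startswith('J'): return wordnet.ADJ    # JJ, JJR, JJS
...     if tag.startswith('V'): return wordnet.VERB   # VB, VBD, VBG, ...
...     if tag.startswith('R'): return wordnet.ADV    # RB, RBR, RBS
...     return wordnet.NOUN                           # default for everything else
...
>>> wnl = nltk.WordNetLemmatizer()
>>> pos_tagged = nltk.pos_tag(nltk.word_tokenize("Ram killed Ravana"))
>>> [wnl.lemmatize(w, penn_to_wordnet(p)) for (w, p) in pos_tagged]   # e.g. ['Ram', 'kill', 'Ravana']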
13
Extras To walk over directories: import glob
files = glob.glob('*.html')
To convert HTML entity codes to UTF-8 characters:
from BeautifulSoup import BeautifulStoneSoup
clean = nltk.clean_html(text)
words = BeautifulStoneSoup(clean, convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]
(install the python-beautifulsoup package: sudo apt-get install python-beautifulsoup)
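BeautifulStoneSoup and convertEntities come from the old BeautifulSoup 3 package; on Python 3 the standard library can unescape HTML entities directly, a small sketch under that assumption:
>>> import glob, html
>>> for path in glob.glob('*.html'):                    # walk over the HTML files in a directory
...     text = html.unescape(open(path, encoding='utf-8').read())   # &amp; -> &, &eacute; -> é, ...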