NLTK http://www.nltk.org Natural Language Processing with Python, Steven Bird, Ewan Klein, and Edward Loper, O'REILLY, 2009.

1 NLTK http://www.nltk.org
Natural Language Processing with Python, Steven Bird, Ewan Klein, and Edward Loper, O'REILLY, 2009.

2 NLTK Open-source Python modules and linguistic data for Natural Language Processing applications. Developed by a group of contributors; project leaders: Steven Bird, Edward Loper, Ewan Klein.

3 NLP Natural Language Processing: a field of computer science and linguistics concerned with the interactions between computers and human (natural) languages.
Main uses: machine translation, automatic summarization, information extraction, transliteration, question answering, opinion mining.
Basic tasks: stemming, POS tagging, chunking, parsing.
It ranges from simple frequency counts to understanding and generating complex text.

4 NLP/NLTK vocabulary
Tokenization: splitting text into words and punctuation marks.
Stemming: reducing a word to its (inflectional) root; plays, playing, played -> play.
POS (part-of-speech) tagging: Ram/NNP killed/VBD Ravana/NNP
POS tags: NNP = proper noun; VBD = verb, past tense.
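To make the first two terms concrete, here is a minimal sketch in plain Python. The helpers `tokenize` and `naive_stem` are hypothetical names made up for illustration; they are not NLTK's tokenizer or the Porter stemmer, only a rough approximation of the idea.

```python
import re


def tokenize(text):
    # A word (optionally with an internal apostrophe) or a single punctuation mark.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)


def naive_stem(word):
    # Hypothetical suffix stripper: removes a few common inflectional endings,
    # keeping at least three characters of the word.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


tokens = tokenize("Ram killed Ravana.")
stems = [naive_stem(w) for w in ("plays", "playing", "played")]
print(tokens)
print(stems)
```

Real stemmers handle many more cases (doubled consonants, irregular forms), so in practice one would likely reach for an NLTK stemmer or lemmatizer instead.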

5 NLP/NLTK vocabulary Chunking: grouping adjacent tokens into phrases (chunks) according to their POS tags.
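A very rough sketch of the grouping idea, assuming the simplest possible rule: consecutive tokens that share a tag form one chunk. The function `chunk_by_tag` and the sample sentence are made up for illustration; real chunkers work from grammar patterns over tag sequences rather than exact tag equality.

```python
from itertools import groupby


def chunk_by_tag(tagged_tokens):
    # Group runs of adjacent (word, tag) pairs that carry the same tag.
    return [
        (tag, [word for word, _ in group])
        for tag, group in groupby(tagged_tokens, key=lambda pair: pair[1])
    ]


tagged = [("Steven", "NNP"), ("Bird", "NNP"), ("wrote", "VBD"), ("NLTK", "NNP")]
chunks = chunk_by_tag(tagged)
print(chunks)
```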

6 Installing NLTK data
We need data to experiment with:
$ python
>>> import nltk
>>> nltk.download()

7 Basic commands
Load the data from the book:
>>> from nltk.book import *
See what a text is:
>>> text1
Search for a word in context:
>>> text1.concordance("monstrous")
Find words used in similar contexts:
>>> text1.similar("monstrous")
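For intuition about what a concordance view shows, here is a plain-Python sketch, not NLTK's implementation. The function name `concordance`, the `width` parameter (context size in tokens), and the sample sentence are all assumptions made for this example.

```python
def concordance(tokens, word, width=3):
    """Return each occurrence of `word` with `width` tokens of context per side."""
    target = word.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            left = tokens[max(0, i - width): i]
            right = tokens[i + 1: i + 1 + width]
            hits.append(" ".join(left + [tok] + right))
    return hits


tokens = "the monstrous whale was a most monstrous size".split()
lines = concordance(tokens, "monstrous", width=2)
for line in lines:
    print(line)
```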

8 Basic commands
Find contexts shared by two or more words:
>>> text2.common_contexts(["monstrous", "very"])
Plot the positions of words across the entire text:
>>> text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
Count tokens (words and punctuation):
>>> len(text4)
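A dispersion plot is essentially a picture of token offsets. The sketch below computes those offsets in plain Python without plotting; `word_offsets` and the toy sentence are hypothetical and only meant to show what data ends up on the x-axis.

```python
def word_offsets(tokens, targets):
    # Map each target word to the list of token positions where it occurs.
    offsets = {target: [] for target in targets}
    for position, tok in enumerate(tokens):
        if tok in offsets:
            offsets[tok].append(position)
    return offsets


tokens = "citizens value freedom and citizens defend freedom".split()
offsets = word_offsets(tokens, ["citizens", "democracy", "freedom"])
print(offsets)
```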

9 Basic commands, example
Vocabulary (unique tokens) used by the author:
>>> len(set(text1))
Sorted vocabulary (upper case sorts first):
>>> sorted(set(text1))
Measuring the richness of a text (average number of times each token is used):
>>> len(text1) / len(set(text1))
(In Python 2, run "from __future__ import division" first to get floating-point division; Python 3 does this by default.)
Counting occurrences:
>>> text5.count("lol")
Percentage of the text taken up by a word:
>>> 100 * text5.count('a') / len(text5)
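The two ratios above can be wrapped in small helpers. This is a sketch assuming Python 3, where `/` is already true division; the names `lexical_diversity` and `percentage` and the toy token list are chosen for this example.

```python
def lexical_diversity(tokens):
    # Average number of times each distinct token is used.
    return len(tokens) / len(set(tokens))


def percentage(count, total):
    # Share of the text taken up by `count` occurrences, in percent.
    return 100 * count / total


tokens = ["a", "b", "a", "c", "a", "b", "a", "d"]
richness = lexical_diversity(tokens)
share_of_a = percentage(tokens.count("a"), len(tokens))
print(richness, share_of_a)
```

The same helpers should accept an `nltk.Text` object, since they only rely on `len()`, `set()` and `count()`, which the slides already use on `text1` and `text5`.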

10 Count_words.py with NLTK: Frequency Distribution
>>> fdist1 = FreqDist(text1)
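A frequency distribution is a mapping from each token to its count. As a stand-in that runs without NLTK data, the sketch below uses `collections.Counter` from the standard library (in NLTK 3, `FreqDist` is built on `Counter`, so `most_common` behaves the same way). The toy sentence is made up.

```python
from collections import Counter

tokens = "the whale the sea the ship a whale".split()

# Count how often each token occurs.
fdist = Counter(tokens)

# The two most frequent tokens with their counts.
top_two = fdist.most_common(2)
print(top_two)
print(fdist["sea"])
```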

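11 Scraping the web with NLTK
What about HTML?
>>> from urllib.request import urlopen  # Python 3; in Python 2: from urllib import urlopen
>>> url = " m"
>>> html = urlopen(url).read()
>>> html[:60]
It's HTML; clean the tags...
>>> raw = nltk.clean_html(html)  # NLTK 2 only; not implemented in NLTK 3, which recommends BeautifulSoup's get_text()
>>> tokens = nltk.word_tokenize(raw)
>>> tokens[:60]
>>> text = nltk.Text(tokens)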
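Where `nltk.clean_html` is unavailable, tag stripping can be approximated with the standard library. This is a minimal sketch, assuming Python 3 and reasonably well-formed HTML; `TextExtractor` and `strip_tags` are hypothetical names, and the HTML snippet is invented for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the text nodes and collapse runs of whitespace.
    return " ".join(" ".join(parser.parts).split())


page = ("<html><body><h1>NLTK</h1><p>Natural <b>Language</b> Processing</p>"
        "<script>var x = 1;</script></body></html>")
raw = strip_tags(page)
print(raw)
```

The resulting string can then be passed to a tokenizer in place of the output of `clean_html`.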

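12 POS-tagging and stemming
First, define the lemmatizer:
>>> wnl = nltk.WordNetLemmatizer()
Second, POS-tag the tokens:
>>> pos_tagged = nltk.pos_tag(tokens)
Third, lemmatize each word. lemmatize() expects a WordNet POS ('n', 'v', 'a', 'r'), so map each Penn Treebank tag by its first letter:
>>> wn_pos = {'N': 'n', 'V': 'v', 'J': 'a', 'R': 'r'}
>>> stem_words = [wnl.lemmatize(w, wn_pos.get(p[0], 'n')) for (w, p) in pos_tagged]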

13 Extras
To walk through directories:
import glob
files = glob.glob('*.html')
To convert HTML entities to UTF-8 characters:
from BeautifulSoup import BeautifulStoneSoup
clean = nltk.clean_html(text)
words = BeautifulStoneSoup(clean, convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]
(install the python-beautifulsoup package)
sudo apt-get install python-beautifulsoup
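On Python 3 the same two extras can be done with the standard library alone, which may be convenient since `BeautifulStoneSoup` belongs to the older BeautifulSoup 3 package. This is a sketch under that assumption; `decode_entities` is a hypothetical wrapper and the sample string is invented.

```python
import glob
import html


def decode_entities(text):
    # Turn HTML entities such as &eacute; or &amp; into the characters they stand for.
    return html.unescape(text)


# List the HTML files in the current directory (may be empty).
html_files = sorted(glob.glob("*.html"))

decoded = decode_entities("caf&eacute; &amp; t&eacute;")
print(html_files)
print(decoded)
```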

