1
NLTK http://www.nltk.org
Natural Language Processing with Python, Steven Bird, Ewan Klein, and Edward Loper, O'Reilly, 2009.
2
NLTK Open-source Python modules and linguistic data for Natural Language Processing applications. Developed by a group of contributors; project leaders: Steven Bird, Edward Loper, Ewan Klein
3
NLP Natural Language Processing: a field of computer science and linguistics that works out the interactions between computers and human (natural) languages. Main uses: machine translation, automatic summarization, information extraction, transliteration, question answering, opinion mining. Basic tasks: stemming, POS tagging, chunking, parsing. Involves everything from simple frequency counts to understanding and generating complex text.
4
NLP/NLTK vocabulary Tokenization: splitting text into words and punctuation. Stemming: reducing a word to its (inflectional) root; plays, playing, played : play. POS (Part of Speech) tagging: Ram NNP killed VBD Ravana NNP. POS tags: NNP proper noun, VBD verb, past tense
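A minimal interactive sketch of these three steps, assuming the punkt and averaged_perceptron_tagger NLTK data packages have already been downloaded (the sentence is just an illustration):
>>> import nltk
>>> from nltk.stem import PorterStemmer
>>> tokens = nltk.word_tokenize("Ram killed Ravana")        # tokenization
>>> tokens
['Ram', 'killed', 'Ravana']
>>> [PorterStemmer().stem(w) for w in ["plays", "playing", "played"]]   # stemming
['play', 'play', 'play']
>>> nltk.pos_tag(tokens)                                     # POS tagging
[('Ram', 'NNP'), ('killed', 'VBD'), ('Ravana', 'NNP')]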
5
NLP/NLTK vocabulary Chunking: grouping POS-tagged words into phrases (chunks), e.g. noun phrases
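A small sketch of chunking with NLTK's RegexpParser, assuming a simple noun-phrase grammar (the grammar and sentence are illustrative, not from the slides):
>>> import nltk
>>> sentence = [("the", "DT"), ("little", "JJ"), ("dog", "NN"), ("barked", "VBD")]
>>> grammar = "NP: {<DT>?<JJ>*<NN>}"     # chunk determiner + adjectives + noun into an NP
>>> cp = nltk.RegexpParser(grammar)
>>> print(cp.parse(sentence))
(S (NP the/DT little/JJ dog/NN) barked/VBD)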
6
Installing NLTK data We need data to experiment with: $ python
>>> import nltk
>>> nltk.download()
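If you prefer to skip the interactive downloader window, the collection used by the following slides can be fetched directly; this assumes the 'book' collection identifier, which bundles the corpora loaded by nltk.book:
>>> import nltk
>>> nltk.download('book')   # downloads the texts used by `from nltk.book import *`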
7
Basic commands Data from the book: >>> from nltk.book import *
See what it is: >>> text1
Search for a word: >>> text1.concordance("monstrous")
Find similar words: >>> text1.similar("monstrous")
8
Basic commands Find common contexts:
>>> text2.common_contexts(["monstrous", "very"])
Find the positions of words across the entire text:
>>> text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
Counting tokens:
>>> len(text4)   # gives the number of tokens (words and punctuation)
9
Basic commands, example
Vocabulary (unique words) used by the author: >>> len(set(text1))
See the sorted vocabulary (uppercase first!): >>> sorted(set(text1))
Measuring richness of the text (average repetition): >>> len(text1) / len(set(text1))
(on Python 2, first run from __future__ import division to get floating-point division)
Counting occurrences: >>> text5.count("lol")
Percentage of the text taken by a word: >>> 100 * text5.count('a') / len(text5)
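These one-liners are often wrapped in small helper functions; a minimal sketch (the names lexical_diversity and percentage are illustrative, not defined by NLTK):
>>> def lexical_diversity(text):
...     return len(text) / len(set(text))    # average number of times each word type is used
...
>>> def percentage(count, total):
...     return 100 * count / total           # share of the text taken by one word
...
>>> lexical_diversity(text1)
>>> percentage(text5.count('a'), len(text5))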
10
Count_words.py: NLTK Frequency Distribution
>>> fdist1 = FreqDist(text1)
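The slide only builds the distribution; a short sketch of what it is typically used for next (the word 'whale' is just an illustrative key; most_common is the NLTK 3 API):
>>> fdist1.most_common(10)               # the ten most frequent tokens
>>> fdist1['whale']                      # how many times one particular word occurs
>>> fdist1.plot(50, cumulative=True)     # cumulative frequency plot of the top 50 tokens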
11
Scraping the web with NLTK
What about HTML?
>>> from urllib import urlopen
>>> url = " m"
>>> html = urlopen(url).read()
>>> html[:60]
It's HTML; clean the tags...
>>> raw = nltk.clean_html(html)
>>> tokens = nltk.word_tokenize(raw)
>>> tokens[:60]
>>> text = nltk.Text(tokens)
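Note that nltk.clean_html was removed in NLTK 3, which directs users to an HTML parser such as BeautifulSoup instead; a rough Python 3 sketch of the same pipeline (the url value is a placeholder for whatever page you want to scrape):
>>> from urllib.request import urlopen
>>> from bs4 import BeautifulSoup
>>> html = urlopen(url).read()
>>> raw = BeautifulSoup(html, "html.parser").get_text()   # strip the HTML tags
>>> tokens = nltk.word_tokenize(raw)
>>> text = nltk.Text(tokens)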
12
POS-tagging and stemming
First, define the lemmatizer: >>> wnl = nltk.WordNetLemmatizer()
Second, POS-tag the tokenized sentence: >>> pos_tagged = nltk.pos_tag(tokens)
>>> stem_words = [wnl.lemmatize(w, p) for (w, p) in pos_tagged]
(lemmatize expects WordNet POS tags such as 'n' or 'v', not Penn Treebank tags like 'VBD', so the tags have to be mapped first; see the sketch below)
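A minimal sketch of that tag mapping, assuming a helper name penn_to_wordnet that is not in the slides:
>>> from nltk.corpus import wordnet
>>> def penn_to_wordnet(tag):
...     if tag.startswith('J'): return wordnet.ADJ    # JJ, JJR, JJS
...     if tag.startswith('V'): return wordnet.VERB   # VB, VBD, VBG, ...
...     if tag.startswith('R'): return wordnet.ADV    # RB, RBR, RBS
...     return wordnet.NOUN                           # default for everything else
...
>>> wnl = nltk.WordNetLemmatizer()
>>> pos_tagged = nltk.pos_tag(nltk.word_tokenize("Ram killed Ravana"))
>>> [wnl.lemmatize(w, penn_to_wordnet(p)) for (w, p) in pos_tagged]   # e.g. ['Ram', 'kill', 'Ravana']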
13
Extras To walk over directories: import glob
files = glob.glob('*.html')
To convert HTML entity codes to UTF-8 characters:
from BeautifulSoup import BeautifulStoneSoup
clean = nltk.clean_html(text)
words = BeautifulStoneSoup(clean, convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]
(install the python-beautifulsoup package: sudo apt-get install python-beautifulsoup)
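BeautifulStoneSoup and convertEntities come from the old BeautifulSoup 3 package; on Python 3 the standard library can unescape HTML entities directly, a small sketch under that assumption:
>>> import glob, html
>>> for path in glob.glob('*.html'):                    # walk over the HTML files in a directory
...     text = html.unescape(open(path, encoding='utf-8').read())   # &amp; -> &, &eacute; -> é, ...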