NLTK http://www.nltk.org Natural Language Processing with Python, Steven Bird, Ewan Klein, and Edward Loper, O'Reilly, 2009.
NLTK Open-source Python modules and linguistic data for Natural Language Processing applications. Developed by a group of contributors; project leaders: Steven Bird, Edward Loper, Ewan Klein
NLP Natural Language Processing: a field of computer science and linguistics that works out the interactions between computers and human (natural) languages. Main uses: machine translation, automatic summarization, information extraction, transliteration, question answering, opinion mining. Basic tasks: stemming, POS tagging, chunking, parsing. Ranges from simple frequency counts to understanding and generating complex text
NLP/NLTK vocabulary Tokenization: splitting text into words and punctuation Stemming: finding the (inflectional) root of a word; plays, playing, played : play POS (Part of Speech) tagging: Ram NNP killed VBD Ravana NNP POS tags: NNP proper noun VBD verb, past tense
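The stemming example above can be run directly with NLTK's Porter stemmer, a minimal sketch that needs no corpus download (the stemmer is pure Python):

```python
from nltk.stem import PorterStemmer

# The Porter stemmer reduces inflected forms to a common stem
stemmer = PorterStemmer()
for word in ["plays", "playing", "played"]:
    print(word, "->", stemmer.stem(word))  # all three stem to "play"
```

Note that a stemmer works by stripping suffixes, so its output is not always a dictionary word; the WordNet lemmatizer shown later returns true dictionary lemmas instead.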
NLP/NLTK vocabulary Chunking: groups words into phrases (e.g. noun phrases) based on their POS tags
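Chunking can be sketched with NLTK's `RegexpParser`; the sentence below is hand-tagged (reusing the slide's example) so no tagger model has to be downloaded, and the grammar is a simple illustrative one:

```python
import nltk

# Hand-tagged sentence (tags written by hand to avoid downloading models)
sentence = [("Ram", "NNP"), ("killed", "VBD"), ("Ravana", "NNP")]

# Chunk grammar: a noun phrase (NP) is an optional determiner,
# any adjectives, then one or more nouns
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(sentence)
print(tree)  # (S (NP Ram/NNP) killed/VBD (NP Ravana/NNP))
```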
Installing NLTK data We need data to experiment with: $ python >>> import nltk >>> nltk.download()
Basic commands Data from the book: >>> from nltk.book import * See what it is: >>> text1 Search for a word: >>> text1.concordance("monstrous") Find similar words: >>> text1.similar("monstrous")
Basic commands Find common contexts: >>> text2.common_contexts(["monstrous", "very"]) Find the positions of words in the entire text: >>> text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"]) Counting vocabulary: >>> len(text4) # gives tokens (words and punctuation)
Basic commands, example Vocabulary (unique words) used by the author: >>> len(set(text1)) See the sorted vocabulary (upper case sorts first!): >>> sorted(set(text1)) Measuring richness of the text (average repetition): >>> len(text1)/len(set(text1)) (on Python 2, first enable floating-point division: from __future__ import division; Python 3 does this by default) Counting occurrences: >>> text5.count("lol") % of the text taken: >>> 100 * text5.count('a') / len(text5)
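The same counts can be tried on a toy token list, so the sketch runs without downloading the book corpus (the list stands in for a text such as `text1`):

```python
# Toy token list standing in for a corpus text
tokens = ["to", "be", "or", "not", "to", "be"]

print(len(tokens))                             # 6 tokens
print(len(set(tokens)))                        # 4 unique words (vocabulary)
print(len(tokens) / len(set(tokens)))          # average repetition: 1.5
print(tokens.count("to"))                      # 2 occurrences
print(100 * tokens.count("to") / len(tokens))  # % of the text
```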
Count_words.py with NLTK Frequency Distribution >>> fdist1 = FreqDist(text1)
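`FreqDist` counts token frequencies and works on any token list, not only the book texts; a minimal sketch on a made-up list:

```python
from nltk import FreqDist

# FreqDist behaves like a Counter over tokens
tokens = ["the", "cat", "sat", "on", "the", "mat"]
fdist = FreqDist(tokens)
print(fdist["the"])           # 2
print(fdist.most_common(1))   # [('the', 2)]
```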
Scraping the web with NLTK What about HTML? >>> from urllib import urlopen >>> url = "http://news.bbc.co.uk/2/hi/health/2284783.stm" >>> html = urlopen(url).read() >>> html[:60] It's HTML; clean the tags... >>> raw = nltk.clean_html(html) >>> tokens = nltk.word_tokenize(raw) >>> tokens[:60] >>> text = nltk.Text(tokens)
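Note that `nltk.clean_html` was removed in NLTK 3 (it now raises an error pointing you to BeautifulSoup). A minimal regex-based sketch of the tag-cleaning step, run on a made-up inline snippet rather than the fetched BBC page:

```python
import re

# Made-up HTML standing in for the downloaded page
html = "<html><body><h1>Blood test</h1><p>BBC Health news.</p></body></html>"

# Crude tag removal: replace every <...> tag with a space
raw = re.sub(r"<[^>]+>", " ", html)
tokens = raw.split()
print(tokens)  # ['Blood', 'test', 'BBC', 'Health', 'news.']
```

A regex strip like this is fine for a quick experiment; for real pages, an HTML parser such as BeautifulSoup is more robust.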
POS-tagging and stemming First, define the lemmatizer: >>> wnl = nltk.WordNetLemmatizer() Second, POS-tag the tokens: >>> pos_tagged = nltk.pos_tag(tokens) Third, lemmatize each (word, tag) pair: >>> stem_words = [wnl.lemmatize(w, p) for (w, p) in pos_tagged]
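One caveat: `lemmatize()` expects WordNet POS codes ('n', 'v', 'a', 'r'), not Penn Treebank tags such as "VBD", so the `pos_tag` output must be mapped first. A sketch, where the `penn_to_wordnet` helper is my own and the `lemmatize` call needs the WordNet data from `nltk.download('wordnet')`:

```python
from nltk.stem import WordNetLemmatizer

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun (WordNet's default)

wnl = WordNetLemmatizer()
pos_tagged = [("Ram", "NNP"), ("killed", "VBD"), ("Ravana", "NNP")]
try:
    lemmas = [wnl.lemmatize(w, penn_to_wordnet(p)) for (w, p) in pos_tagged]
    print(lemmas)  # "killed" lemmatizes to "kill" as a verb
except LookupError:
    print("Run nltk.download('wordnet') first")
```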
Extras To walk over directories: import glob files = glob.glob('*.html') To convert HTML entity codes to UTF-8 characters (old BeautifulSoup 3 API): from BeautifulSoup import BeautifulStoneSoup clean = nltk.clean_html(text) words = BeautifulStoneSoup(clean, convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0] (install the python-beautifulsoup package): sudo apt-get install python-beautifulsoup
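`BeautifulStoneSoup` belongs to the long-obsolete BeautifulSoup 3; on Python 3 the standard library's `html.unescape` does the same entity conversion without any extra package, and `glob` works as on the slide. A sketch:

```python
import glob
import html

# Decode HTML entities with the standard library
# (replaces the BeautifulSoup-3 HTML_ENTITIES trick)
text = "caf&eacute; &amp; bar"
print(html.unescape(text))  # café & bar

# List all HTML files in the current directory
for path in glob.glob("*.html"):
    print(path)
```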