
1 A PRACTICAL GUIDE TO NATURAL LANGUAGE PROCESSING Emily Daniels May 2014

2 Agenda
Introduction
NLTK
WordNet & SentiWordNet
Pattern
Collocation
Concordance
Word Frequency
Dictionary Tagging
Entity Recognition
Document Categorization
Corpora Comparison
Visualization
Conclusion
References
Questions

3 Introduction This talk is an introductory selection of natural language processing concepts, with working examples that can be integrated into text and content analysis projects. The coverage of Python and NLP is selective and based on an ongoing project of mine that incorporates many of the following methods and goals. Please excuse any non-standard code or approaches to Python; this has largely been a self-taught endeavor.

4 Introduction Natural language processing gives machines the ability to read and understand the languages that humans speak. A sufficiently powerful natural language processing system would enable natural language user interfaces and the acquisition of knowledge directly from human-written sources. Human-level natural language processing is an AI-complete problem, which means it is equivalent to solving the central artificial intelligence problem of making computers as intelligent as people. NLP's future is therefore tied closely to the development of AI in general.

5 Introduction Current major areas of NLP research and development include:
Search Engines
Advanced Text Editors
Machine Translation Systems
Computational Advertising
Fraud Detection
Sentiment Analysis
Opinion Mining
Advanced Speech Processing

6 Introduction An information extraction example: a structured template filled in from an unstructured news transcript. [1]
Source text: QUAKE IN AFGHANISTAN. Thousands of people are feared dead following... (voice-over)... a powerful earthquake that hit Afghanistan today. The quake registered 6.9 on the Richter scale, centered in a remote part of the country. (on camera) Details now hard to come by, but reports say entire villages were buried by the quake.
Extracted template:
Disaster Type: Earthquake
Location: Afghanistan
Date: 05/30/1998
Magnitude: 6.9
Epicenter: a remote part of the country
Damage:
  Human-effect: Victim: Thousands of people; Number: Thousands; Outcome: dead
  Physical-effect: Object: entire villages; Outcome: damaged

7 NLTK NLTK is a leading platform for building Python programs to work with human language data. It provides interfaces to over 50 corpora and a suite of text processing libraries for:
classification
tokenization
stemming
tagging
parsing
semantic reasoning
Natural Language Processing with Python was written by the creators of NLTK and it provides a practical introduction to programming for language processing [2].
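
As a quick taste of these tools, here is a minimal sketch of my own (not from the original slides) that tokenizes, stems, and part-of-speech tags a sentence. It assumes the relevant NLTK data packages (such as the punkt tokenizer models and a tagger model) have already been fetched with nltk.download(), and it follows the same Python 2 print style as the rest of the examples.

import nltk
from nltk.stem import PorterStemmer

text = "NLTK provides tokenizers, stemmers, taggers, and parsers out of the box."

tokens = nltk.word_tokenize(text)            # tokenization
print tokens

stemmer = PorterStemmer()                    # stemming
print [stemmer.stem(t) for t in tokens]

print nltk.pos_tag(tokens)                   # part-of-speech tagging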

8 WordNet & SentiWordNet WordNet is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. [3] SentiWordNet is a lexical resource for opinion mining. It assigns to each synset of WordNet three sentiment scores: positivity, negativity, and objectivity. [4]
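
As a quick illustration (my own sketch, not part of the original slides), NLTK ships corpus readers for both resources. The calls below follow the current NLTK API, so method names may differ slightly in older releases, and the wordnet and sentiwordnet data packages are assumed to be downloaded.

from nltk.corpus import wordnet as wn
from nltk.corpus import sentiwordnet as swn

# all WordNet synsets that contain the word "good"
print wn.synsets('good')

# SentiWordNet positivity, negativity, and objectivity scores
# for one adjective sense of "good"
sense = swn.senti_synset('good.a.01')
print sense.pos_score(), sense.neg_score(), sense.obj_score()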

9 Pattern Pattern [5] is a web mining module that has tools for:
data mining: Google, Twitter, and Wikipedia APIs, a web crawler, an HTML DOM parser
natural language processing: part-of-speech taggers, n-gram search, sentiment analysis, WordNet
machine learning: vector space model, clustering, SVM
network analysis and visualization
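
A minimal sketch of two of these tools (my own, not from the slides), assuming Pattern 2.x is installed with pip install pattern: parse() returns a part-of-speech-tagged and chunked string, and sentiment() returns a (polarity, subjectivity) pair.

from pattern.en import parse, sentiment

# shallow parse with part-of-speech and chunk tags
print parse("The quick brown fox jumps over the lazy dog.")

# polarity in [-1.0, 1.0], subjectivity in [0.0, 1.0]
print sentiment("A wonderfully clear and entertaining introduction.")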

10 Collocation Word pattern recognition is an important concept in language learning and attaining fluency. A collocation is a sequence of words or terms that occur together unusually often and resist substitution with words that have similar senses: “crystal clear”, “middle management”, “red wine”. These sequences are instantly recognizable to native speakers of a language, but are difficult for second language learners to acquire and use properly.

11 Collocation
from nltk import bigrams

words = ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
print bigrams(words)

>>> [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ('fox', 'jumps'), ('jumps', 'over'), ('over', 'the'), ('the', 'lazy'), ('lazy', 'dog')]

12 Collocation To find significant bigrams, we can use nltk.collocations.BigramCollocationFinder along with nltk.metrics.BigramAssocMeasures. The BigramCollocationFinder maintains two internal FreqDists, one for individual word frequencies and another for bigram frequencies. Once it has these frequency distributions, it can score individual bigrams using a scoring function provided by BigramAssocMeasures, such as chi-square, a goodness-of-fit statistic that summarizes the discrepancy between observed values and the values expected under the model in question. These scoring functions measure the association between two words: essentially, whether the bigram occurs more often than the frequencies of its individual words would predict. [6]

13 Collocation
import itertools
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

words = ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
bigram_finder = BigramCollocationFinder.from_words(words)
bigrams = bigram_finder.nbest(BigramAssocMeasures.chi_sq, 200)
print dict([(ngram, True) for ngram in itertools.chain(words, bigrams)])

14 Collocation >>> {'brown': True, 'lazy': True, ('brown', 'fox'): True, ('jumps', 'over'): True, 'over': True, 'fox': True, 'dog': True, ('the', 'quick'): True, ('the', 'lazy'): True, ('quick', 'brown'): True, ('lazy', 'dog'): True, ('over', 'the'): True, 'quick': True, ('fox', 'jumps'): True, 'the': True, 'jumps': True}

15 Concordance There are many ways to examine the context of a text apart from simply reading it. In the publishing world, a concordance is an alphabetical list of the principal words used in a book or body of work, with their immediate contexts. It is more than an index; additional material such as commentary, definitions, and topical cross-indexing makes producing one a labor-intensive process. A concordance view using NLP shows us every occurrence of a given word, together with some context. Here we look up the word monstrous in Moby Dick:

16 Concordance
from nltk.book import *

# concordance() prints its matches directly, so no print statement is needed
text1.concordance("monstrous")

>>> Building index...
Displaying 11 of 11 matches:
ong the former, one was of a most monstrous size.... This came towards us,
ON OF THE PSALMS. " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears. Some were thick
d as you gazed, and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable, or still worse and more de
th of Radney.'" CHAPTER 55 Of the Monstrous Pictures of Whales. I shall ere l
ing Scenes. In connexion with the monstrous pictures of whales, I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling. But
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u

17 Concordance We can find out what other words appear in a similar range of contexts by calling the similar method on the text in question, passing the relevant word as its argument:
from nltk.book import *

# Moby Dick by Herman Melville 1851
text1.similar("monstrous")

# Sense and Sensibility by Jane Austen 1811
text2.similar("monstrous")

18 Concordance
>>> Building word-context index...
abundant candid careful christian contemptible curious delightfully determined doleful domineering exasperate fearless few gamesome horrible impalpable imperial lamentable lazy loving
Building word-context index...
very exceedingly heartily so a amazingly as extremely good great remarkably sweet vast
Austen uses this word quite differently from Melville; for her, monstrous has positive connotations, and sometimes functions as an intensifier like the word very. [2]

19 Word Frequency Distributional cues are useful for categorizing high-frequency words, which are encountered many times in the same context. The cues are less useful for low-frequency words, unless those words are phonetically complex, in which case they can carry more in-depth information. We use the book Moby Dick again to show different ways to return the frequency distribution of vocabulary items in the story:

20 Word Frequency
from nltk.book import *

# Moby Dick by Herman Melville 1851
fdist = FreqDist(text1)
vocabulary = fdist.keys()
print vocabulary[:50]

>>> [',', 'the', '.', 'of', 'and', 'a', 'to', ';', 'in', 'that', "'", '-', 'his', 'it', 'I', 's', 'is', 'he', 'with', 'was', 'as', '"', 'all', 'for', 'this', '!', 'at', 'by', 'but', 'not', '--', 'him', 'from', 'be', 'on', 'so', 'whale', 'one', 'you', 'had', 'have', 'there', 'But', 'or', 'were', 'now', 'which', '?', 'me', 'like']

>>> print fdist['whale']
906

21 Dictionary Tagging Part-of-speech tagging is harder than just having a list of words and their parts of speech because some words can represent more than one part of speech at different times, and because some parts of speech are complex or unspoken. In natural languages a large percentage of word-forms are ambiguous. For example, even "dogs", which is usually thought of as just a plural noun, can also be a verb in this sentence: “The sailor dogs the hatch.” Correct grammatical tagging will reflect that "dogs" is here used as a verb, not as the more common plural noun.

22 Dictionary Tagging A part-of-speech tagger, or POS-tagger, processes a sequence of words and attaches a part of speech tag to each word:
from nltk import pos_tag, word_tokenize

sentence = "The quick brown fox jumps over the lazy dog."
tokens = word_tokenize(sentence)
print pos_tag(tokens)

>>> [('The', 'DT'), ('quick', 'NN'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'NN'), ('dog', 'NN'), ('.', '.')]

23 Entity Recognition Named-entity recognition (NER) is a subtask of information extraction that seeks to locate and classify elements in text into pre-defined categories such as the names of persons, organizations, locations, etc. Most research on NER systems has been structured as taking an unannotated block of text, such as:
Jim bought 300 shares of Acme Corp. in 2006.
and producing an annotated block of text that highlights the names of entities:
[Jim](Person) bought 300 shares of [Acme Corp.](Organization) in [2006](Time).
In this example, a one-token person name, a two-token company name, and a temporal expression have been detected and classified.

24 Entity Recognition The architecture of a simple information extraction system. [2]

25 Entity Recognition
import nltk

# Saul Bellow, Herzog (1964)
sentence = "If I am out of my mind, it's all right with me, thought Moses Herzog."
tokens = nltk.word_tokenize(sentence)
pos_tags = nltk.pos_tag(tokens)
print nltk.ne_chunk(pos_tags, binary=False)

26 Entity Recognition
>>> (S
  If/IN I/PRP am/VBP out/RP of/IN my/PRP$ mind/NN ,/,
  it/PRP 's/VBZ all/DT right/RB with/IN me/PRP ,/,
  thought/VBD
  (PERSON Moses/NNS Herzog/NNP)
  ./.)

27 Entity Recognition Instead of printing the tree, we can render it graphically with draw():
# print nltk.ne_chunk(pos_tags, binary=False)
nltk.ne_chunk(pos_tags, binary=False).draw()

28 Document Categorization In document categorization, the primary task is to assign a document to one or more classes or categories. Content-based classification is classification in which the weight given to particular subjects in a document determines the class to which the document is assigned; this could be determined by the number of times given words appear in a document. Request-oriented classification is classification in which the anticipated request from users influences how documents are classified; this could be classification targeted towards a particular audience or user group.

29 Document Categorization Using a corpus, we can build classifiers that will automatically tag new documents with appropriate category labels. To do this, we first construct a list of documents, labeled with the appropriate categories. This example uses the Movie Reviews Corpus, which categorizes each review as positive or negative. [2]
import nltk
import random
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

30 Document Categorization Next, we define a feature extractor for documents, so the classifier will know which aspects of the data it should pay attention to:
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = all_words.keys()[:2000]

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

31 Document Categorization Now that we've defined our feature extractor, we can use it to train a classifier to label new movie reviews.
featuresets = [(document_features(d), c) for (d, c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
classifier.show_most_informative_features(5)

32 Document Categorization A Naive Bayes classifier assumes that the value of a particular feature is unrelated to the presence or absence of any other feature, given the class variable. For example, a fruit may be considered to be an apple if it is red, round, and about 3" in diameter. The classifier considers each of these features to contribute independently to the probability that this fruit is an apple, regardless of the presence or absence of the other features. To check how reliable the resulting classifier is, we can compute its accuracy on the test set:
>>> print nltk.classify.accuracy(classifier, test_set)
0.79

33 Document Categorization Apparently in this corpus, a review that mentions "Seagal" is almost 8 times more likely to be negative than positive, while a review that mentions "Damon" is about 6 times more likely to be positive.
>>> Most Informative Features
    contains(outstanding) = True    pos : neg = 13.6 : 1.0
    contains(mulan) = True          pos : neg =  9.0 : 1.0
    contains(seagal) = True         neg : pos =  7.8 : 1.0
    contains(wonderfully) = True    pos : neg =  7.7 : 1.0
    contains(damon) = True          pos : neg =  5.9 : 1.0
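
To make the independence assumption concrete, here is a toy calculation of my own (the likelihood numbers are invented for illustration and are not taken from the movie review corpus): each feature multiplies in its own factor for every class, and normalizing the products gives the posterior probability of each label.

# invented likelihoods P(feature | class) for two word features
p_word_given_class = {
    'pos': {'contains(outstanding)': 0.050, 'contains(seagal)': 0.002},
    'neg': {'contains(outstanding)': 0.004, 'contains(seagal)': 0.016},
}
prior = {'pos': 0.5, 'neg': 0.5}          # equal class priors

review_features = ['contains(outstanding)']
score = {}
for label in ('pos', 'neg'):
    score[label] = prior[label]
    for f in review_features:
        # independence: each feature contributes its factor regardless of the others
        score[label] *= p_word_given_class[label][f]

# normalize to get the posterior probability that the review is positive
print score['pos'] / (score['pos'] + score['neg'])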

34 Corpora Comparison It is often interesting to discover which words are markedly different in their distribution between two texts or corpora. NLTK includes a small selection of texts from the Project Gutenberg electronic text archive, which contains 25,000 free electronic books, hosted at http://www.gutenberg.org/. The NLTK also contains a corpus of every inaugural speech made by a US president from George Washington in 1789 to Barack Obama in 2009.

35 Corpora Comparison
import nltk
from nltk.corpus import inaugural

wash = inaugural.words('1789-Washington.txt')
obama = inaugural.words('2009-Obama.txt')

wash_word_lengths = [len(w) for w in set(v.lower() for v in wash)]
print sum(wash_word_lengths) * 1.0 / len(wash_word_lengths)

obama_word_lengths = [len(w) for w in set(v.lower() for v in obama)]
print sum(obama_word_lengths) * 1.0 / len(obama_word_lengths)

>>> 6.94370860927
6.05555555556

36 Visualization Text visualization is an approach to data summarization that builds on human visual perception for ease of comprehension. One simple form of text visualization is formatting strings as tabular data, which can help us see general patterns in the use of words across categories of text. First we'll import bodies of text that fall into particular genres:
import nltk
from nltk.corpus import brown

cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))

genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']

37 Visualization Then we'll define a way to tabulate the text:
def tabulate(cfdist, words, categories):
    print '%-16s' % 'Category',
    for word in words:
        print '%6s' % word,
    print
    for category in categories:
        print '%-16s' % category,
        for word in words:
            print '%6d' % cfdist[category][word],
        print

tabulate(cfd, modals, genres)

38 Visualization
>>> Category            can  could    may  might   must   will
news                      93     86     66     38     50    389
religion                  82     59     78     12     54     71
hobbies                  268     58    131     22     83    264
science_fiction           16     49      4     12      8     16
romance                   74    193     11     51     45     43
humor                     16     30      8      8      9     13

39 Conclusion In view of the complexity of language and the broad range of interest in studying it from different angles, it's clear that we have barely scratched the surface here. Hopefully this talk has given you a solid base for working with large datasets, creating models of linguistic phenomena, and extending them into components of practical language technologies.

40 Conclusion Just remember, we don’t _have_ to teach them to read lips.

41 References
[1] rohitnayak, "Introduction to Natural Language Processing," 28 Dec 2009. http://www.slideshare.net/rohitnayak/introduction-to-natural-language-processing
[2] S. Bird, E. Klein, and E. Loper, Natural Language Processing with Python, 15 Oct 2012. http://www.nltk.org/book/
[3] The Trustees of Princeton University, "WordNet," 2014. http://wordnet.princeton.edu/
[4] "SentiWordNet," 2010. http://sentiwordnet.isti.cnr.it/
[5] T. De Smedt and W. Daelemans, "Pattern for Python," 2012. http://www.clips.ua.ac.be/pattern
[6] J. Perkins, "Text Classification for Sentiment Analysis: Stopwords and Collocations," 24 May 2010. http://streamhacker.com/2010/05/24/text-classification-sentiment-analysis-stopwords-collocations/

42 Questions Thank you! Emily Daniels @emdaniels (twitter | github) emilydaniels.com (blog) emily@emilydaniels.com (email)

