WordNet, Raw Text; Pinker, continuing Chapter 2
Today's Class
- WordNet
- Raw text
- Pinker: continuing Chapter 2 (Millar)
WordNet NLTK includes the English WordNet, with 155,287 words and 117,659 synonym sets.
WordNet
We can explore these words with the help of WordNet. Looking up motorcar, we find it has just one possible meaning, identified as car.n.01: the first noun sense of car. The entity car.n.01 is called a synset, or "synonym set": a collection of synonymous words (or "lemmas"). Synsets also come with a prose definition and some example sentences.
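A sketch of that lookup, assuming the NLTK 2.x attribute-style API used elsewhere in these slides (in NLTK 3, lemma_names, definition, and examples became methods):

from nltk.corpus import wordnet as wn

print wn.synsets('motorcar')              # [Synset('car.n.01')]
print wn.synset('car.n.01').lemma_names   # the lemmas in the synset
print wn.synset('car.n.01').definition    # prose definition
print wn.synset('car.n.01').examples      # example sentences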
WordNet
Unlike the words automobile and motorcar, which are unambiguous and have one synset each, the word car is ambiguous, having five synsets:
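A minimal sketch of listing those senses, under the same NLTK 2.x assumption:

from nltk.corpus import wordnet as wn

# each noun sense of 'car' is a separate synset with its own lemmas
for synset in wn.synsets('car'):
    print synset.name, ':', synset.lemma_names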
The WordNet Hierarchy
WordNet synsets correspond to abstract concepts, and they don't always have corresponding words in English. These concepts are linked together in a hierarchy. Some concepts are very general, such as Entity, State, and Event; these are called unique beginners or root synsets. Others, such as gas guzzler and hatchback, are much more specific. A small portion of a concept hierarchy is illustrated in Figure 2.11.
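To see the unique beginner above a concept, we can ask for its root hypernyms (a sketch; for WordNet nouns this is always entity.n.01):

from nltk.corpus import wordnet as wn

motorcar = wn.synset('car.n.01')
print motorcar.root_hypernyms()   # [Synset('entity.n.01')]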
The WordNet Hierarchy
It's very easy to navigate between concepts. For example, given a concept like motorcar, we can look at the concepts that are more specific: the (immediate) hyponyms.
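For example (a sketch following the NLTK book's motorcar example):

from nltk.corpus import wordnet as wn

motorcar = wn.synset('car.n.01')
types_of_motorcar = motorcar.hyponyms()   # more specific concepts
print sorted(lemma.name for synset in types_of_motorcar
             for lemma in synset.lemmas)  # ambulance, hatchback, limousine, ...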
The WordNet Hierarchy
We can also navigate up the hierarchy by visiting hypernyms. Some words have multiple paths, because they can be classified in more than one way: there are two paths between car.n.01 and entity.n.01 because wheeled_vehicle.n.01 can be classified as both a vehicle and a container. Hypernyms and hyponyms are called lexical relations because they relate one synset to another. These two relations navigate up and down the "is-a" hierarchy.
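A sketch of walking upward:

from nltk.corpus import wordnet as wn

motorcar = wn.synset('car.n.01')
print motorcar.hypernyms()          # immediate hypernym(s)
paths = motorcar.hypernym_paths()
print len(paths)                    # 2: one path via vehicle, one via container
print [synset.name for synset in paths[0]]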
WordNet: More Lexical Relations
Another important way to navigate the WordNet network is from items to their components (meronyms) or to the things they are contained in (holonyms). For example, the parts of a tree are its trunk, crown, and so on; these are its part_meronyms(). The substance a tree is made of includes heartwood and sapwood; its substance_meronyms(). A collection of trees forms a forest; its member_holonyms().
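A sketch of those three calls:

from nltk.corpus import wordnet as wn

tree = wn.synset('tree.n.01')
print tree.part_meronyms()       # burl, crown, limb, stump, trunk, ...
print tree.substance_meronyms()  # heartwood, sapwood
print tree.member_holonyms()     # forest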
WordNet: More Lexical Relations
Some lexical relationships hold between lemmas, e.g., antonymy. There are also relationships between verbs: for example, the act of walking involves the act of stepping, so walking entails stepping. Some verbs have multiple entailments.
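A sketch of both relations (antonyms attach to lemmas, entailments to verb synsets):

from nltk.corpus import wordnet as wn

print wn.lemma('supply.n.02.supply').antonyms()  # [Lemma('demand.n.02.demand')]
print wn.synset('walk.v.01').entailments()       # [Synset('step.v.01')]
print wn.synset('eat.v.01').entailments()        # swallowing and chewing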
WordNet: Semantic Similarity
Knowing which words are semantically related is useful for indexing a collection of texts, so that a search for a general term like vehicle will match documents containing specific terms like limousine. Two synsets linked to the same root may have several hypernyms in common. If two synsets share a very specific hypernym (one that is low down in the hypernym hierarchy), they must be closely related.
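We can find the most specific shared hypernym directly (a sketch using the whale examples from the following slides):

from nltk.corpus import wordnet as wn

right = wn.synset('right_whale.n.01')
print right.lowest_common_hypernyms(wn.synset('minke_whale.n.01'))  # baleen_whale.n.01
print right.lowest_common_hypernyms(wn.synset('tortoise.n.01'))     # vertebrate.n.01
print right.lowest_common_hypernyms(wn.synset('novel.n.01'))        # entity.n.01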
WordNet: Semantic Similarity
Of course we know that whale is very specific (and baleen whale even more so), while vertebrate is more general and entity is completely general. We can quantify this concept of generality by looking up the depth of each synset:
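For example (exact depths vary with the WordNet version; these are the NLTK book's values):

from nltk.corpus import wordnet as wn

print wn.synset('baleen_whale.n.01').min_depth()  # 14
print wn.synset('whale.n.02').min_depth()         # 13
print wn.synset('vertebrate.n.01').min_depth()    # 8
print wn.synset('entity.n.01').min_depth()        # 0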
WordNet: Semantic Similarity
Similarity measures have been defined over the collection of WordNet synsets that incorporate the above insight. For example, path_similarity assigns a score in the range 0 to 1 based on the shortest path that connects the concepts in the hypernym hierarchy. The numbers don't mean much by themselves, but they decrease as we move away from the semantic space of sea creatures to inanimate objects.
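A sketch of a few pairwise scores (the commented values are from the NLTK book and may vary slightly by WordNet version):

from nltk.corpus import wordnet as wn

right = wn.synset('right_whale.n.01')
print right.path_similarity(wn.synset('minke_whale.n.01'))  # 0.25
print right.path_similarity(wn.synset('tortoise.n.01'))     # ~0.077
print right.path_similarity(wn.synset('novel.n.01'))        # ~0.043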
Computing Semantic Similarity
__author__ = 'guinnc'

import nltk
from nltk.corpus import wordnet as wn

words = ['right_whale', 'orca', 'minke_whale', 'tortoise', 'novel']
listOfSynsets = []
for word in words:
    firstSynset = wn.synsets(word, 'n')[0]  # first noun sense of each word
    print firstSynset
    listOfSynsets.append(firstSynset)

# print the header row of lemma names
print '%15s' % ' ',
for synset1 in listOfSynsets:
    firstLemma = synset1.lemma_names[0]
    print '%15s' % firstLemma,
print

# print one row of pairwise path similarities per synset
for synset1 in listOfSynsets:
    print '%15s' % synset1.lemma_names[0],
    for synset2 in listOfSynsets:
        print '%15.2f' % synset1.path_similarity(synset2),
    print
VerbNet: A Verb Lexicon
VerbNet is a hierarchical verb lexicon linked to WordNet. It can be accessed with nltk.corpus.verbnet. "VerbNet is the largest on-line verb lexicon currently available for English. It is a hierarchical, domain-independent, broad-coverage verb lexicon with mappings to other lexical resources such as WordNet and FrameNet." (Adapted from the VerbNet website.)
VerbNet: A Verb Lexicon
Each VerbNet class contains a set of syntactic descriptions depicting the possible surface realizations of the argument structure for constructions such as transitive, intransitive, prepositional phrases, etc. Semantic restrictions (such as animate, human, organization) are used to constrain the types of thematic roles allowed by the arguments. Syntactic frames may also be constrained in terms of which prepositions are allowed. Each frame is associated with explicit semantic information. [Figure: a complete entry for a frame in VerbNet class Hit-18.1.] (Adapted from the VerbNet website.)
VerbNet: A Verb Lexicon
Each verb argument is assigned one (usually unique) thematic role within the class.
NLTK and VerbNet

__author__ = 'guinnc'

import nltk
from nltk.corpus import verbnet as vn

theWord = raw_input("Type a verb: ")
verb_uses = vn.classids(theWord)  # VerbNet class ids for this lemma
print verb_uses
for verb in verb_uses:
    print vn.pprint(vn.vnclass(verb))
#print vn.pprint_subclasses(vn.vnclass(vn.classids('spray')[0]))
Processing “raw” Text in NLTK
As mentioned before, Project Gutenberg has tens of thousands of free books. Only a small subset is included with the NLTK download. Suppose you want to access a text online: how do you do it?
URLs
If you know the URL of a text, you can read it!

from __future__ import division
__author__ = 'guinnc'

import nltk, re, pprint
from urllib import urlopen

url = "..."  # URL elided on the original slide; any plain-text URL works
raw = urlopen(url).read()
print type(raw)
print len(raw)

# break it into tokens
tokens = nltk.word_tokenize(raw)
print len(tokens)

# to use some of nltk's functions, we need to run Text on this
text = nltk.Text(tokens)
text.collocations()  # prints directly; its return value is None
What about HTML files?
You can read them as "raw" files with all the HTML tags, or clean them up:

url = "..."  # URL elided on the original slide
html = urlopen(url).read()
raw = nltk.clean_html(html)  # strip HTML tags (NLTK 2.x)
tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens)
text.concordance('gene')
Web searches
Google prohibits web searches from programs unless you get a developer's license. I was able to do this with Bing (and also ask.com):

from __future__ import division
__author__ = 'guinnc'

import nltk, re, pprint
from urllib import urlopen

url = "..."  # search-results URL elided on the original slide
html = urlopen(url).read()
raw = nltk.clean_html(html)
tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens)
text.concordance("NLTK")  # prints directly; no need to print its return value
Text Files on your local computer
Just use:

f = open('document.txt')
raw = f.read()

If you want to read a line at a time:

f = open('document.txt', 'rU')
for line in f:
    print line,  # trailing comma suppresses the extra newline
Useful string methods (Table 3-2)
Method            Functionality
s.find(t)         index of first instance of string t inside s (-1 if not found)
s.rfind(t)        index of last instance of string t inside s (-1 if not found)
s.index(t)        like s.find(t), except it raises ValueError if not found
s.rindex(t)       like s.rfind(t), except it raises ValueError if not found
s.join(text)      combine the words of the text into a string, using s as the glue
s.split(t)        split s into a list wherever a t is found (whitespace by default)
s.splitlines()    split s into a list of strings, one per line
s.lower()         a lowercased version of the string s
s.upper()         an uppercased version of the string s
s.title()         a titlecased version of the string s
s.strip()         a copy of s without leading or trailing whitespace
s.replace(t, u)   replace instances of t with u inside s
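A quick demonstration of a few of these (plain Python, nothing NLTK-specific):

s = 'Monty Python'
print s.find('Python')                # 6
print s.split()                       # ['Monty', 'Python']
print '-'.join(['Monty', 'Python'])   # 'Monty-Python'
print s.lower()                       # 'monty python'
print '  spam  '.strip()              # 'spam'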
Unicode
Not all files (on the web, for instance) use Unicode. Sometimes we have to translate into Unicode (decoding), and sometimes we need to go from Unicode to some other encoding (encoding).
What happens if you use the wrong encoding?
import nltk
import codecs

path = nltk.data.find('corpora/unicode_samples/polish-lat2.txt')

# reading the Latin-2 file without decoding garbles the accented characters
f = open(path, 'rU')
for line in f:
    print line,
print

# opening it with the right codec decodes each line to Unicode
f = codecs.open(path, encoding='latin2')
What's Next?
- Continuing Chapter 2
- Homework 3 is assigned
- Regular expressions on Tuesday