CSA2050 Assignment Notes Mike Rosner. Aim Get text Identify people names Print frequency ranked list of names Assess accuracy.


1 CSA2050 Assignment Notes Mike Rosner

2 Aim
– Get text
– Identify people's names
– Print a frequency-ranked list of names
– Assess accuracy

3 Pipeline Processing
download → tokenize → tag → chunk → count → show

4 Download File
Use urllib2. Read about urllib in the Python documentation, which can be opened from IDLE.
>>> import urllib2
>>> reply = urllib2.urlopen('http://…')
>>> s = reply.read()
N.B. s is just one big string.
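On Python 3, urllib2 no longer exists; its functionality lives in urllib.request, and read() returns bytes rather than a string. A minimal sketch of the same download step under that assumption follows. The helper name fetch_text and the UTF-8 default are illustrative choices, not part of the original notes.

```python
from urllib.request import urlopen


def fetch_text(url, encoding="utf-8"):
    """Download `url` and return its body as one big string.

    Hypothetical helper: the name and the default encoding are assumptions.
    """
    with urlopen(url) as reply:
        raw = reply.read()  # bytes on Python 3
    # Undecodable bytes are replaced rather than raising an error.
    return raw.decode(encoding, errors="replace")
```

Calling fetch_text with the address of the assignment text would give the same kind of string s as the session above.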

5 Tokenize File
See http://nltk.org/doc/guides/ for notes on tokenisation. Tokenisation turns a string into a list of tokens.
Things to try:
– Eliminate HTML tags
– Break into sentences
You can use RegexpTokenizer.
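The two "things to try" can be sketched with the standard re module alone. This is a rough illustration, not NLTK's tokenizer: the helper names and both patterns are assumptions, and the sentence splitter is deliberately naive (it will also split after abbreviations such as "Mr.", which matters when looking for names).

```python
import re

# Anything that looks like <...> is treated as markup (assumed pattern).
TAG = re.compile(r"<[^>]+>")
# Split on whitespace that follows '.', '!' or '?' (naive sentence boundary).
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def strip_tags(text):
    """Replace each markup tag with a single space."""
    return TAG.sub(" ", text)


def split_sentences(text):
    """Break text into sentences, dropping empty pieces."""
    return [part.strip() for part in SENTENCE_END.split(text) if part.strip()]
```

For example, split_sentences(strip_tags(s)) gives a list of sentence strings that can then be tokenised one at a time.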

6 Example
>>> from nltk.tokenize import *
>>> regexp_tokenize(s, pattern=r'\w+')
['DOC', 'id', 'APW20010911', '0566', 'type', 'other', 'DATELINE', 'NEW', 'YORK', 'DATELINE', 'TEXT', 'Plane', 'crashes', 'into', 'World', 'Trade', 'Center', 'according', 'to', 'television', 'reports', 'TEXT', 'DOC', 'DOC', 'id', 'APW20010911', '0571', 'type', 'story', 'HEADLINE', 'BULLETIN', 'HEADLINE', 'DATELINE', 'NEW', 'YORK', 'AP', 'DATELINE', 'TEXT', 'P', 'Smoke', 'poured', 'out', 'of', 'a', 'gaping', 'hole', 'in', 'the', 'upper', 'floors', 'of', 'the', 'World', 'Trade', 'Center', 'on', 'Tuesday', 'and', 'there', 'were', 'broadcast', 'reports', 'a', 'plane', 'had', 'struck', 'it', 'P', 'P', 'MORE', 'P', 'TEXT', 'DOC']

7 Special Sequences
\b Word boundary (zero width)
\d Any decimal digit (equivalent to [0-9])
\D Any non-digit character (equivalent to [^0-9])
\s Any whitespace character (equivalent to [ \t\n\r\f\v])
\S Any non-whitespace character (equivalent to [^ \t\n\r\f\v])
\w Any alphanumeric character (equivalent to [a-zA-Z0-9_])
\W Any non-alphanumeric character (equivalent to [^a-zA-Z0-9_])
See also http://nltk.org/doc/en/regexps.pdf
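A few of these sequences in action, using Python's re module. The sample strings are made up for illustration (loosely modelled on the tokens in the earlier example) and the variable names are arbitrary.

```python
import re

# Made-up sample strings for illustration only.
header = 'id="APW20010911.0566" type="other"'
headline = "Plane crashes into World Trade Center"

words = re.findall(r"\w+", header)             # runs of alphanumerics/underscore
digits = re.findall(r"\d+", header)            # runs of decimal digits
capitalised = re.findall(r"\b[A-Z]\w*", headline)  # words starting with a capital
fields = re.split(r"\s+", "NEW  YORK\tAP")     # split on any whitespace run

print(words)        # ['id', 'APW20010911', '0566', 'type', 'other']
print(digits)       # ['20010911', '0566']
print(capitalised)  # ['Plane', 'World', 'Trade', 'Center']
print(fields)       # ['NEW', 'YORK', 'AP']
```

The \b[A-Z]\w* pattern is a crude first guess at name candidates; it also picks up sentence-initial words such as "Plane".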

8 Tag the Tokens
See http://nltk.org/doc/guides/tag.html
>>> import nltk.tag
>>> tagger = nltk.RegexpTagger([(' ','TG')])
>>> tagger.tag([' ','the','Bible',' '])
[(' ', 'TG'), ('the', None), ('Bible', None), (' ', None)]
Other (pattern, tag) pairs can be added to the list passed to RegexpTagger.
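To show how a list of (pattern, tag) pairs behaves, here is a plain-re stand-in that mimics the first-match-wins idea of a regular-expression tagger: each token gets the tag of the first pattern that matches it, or None if nothing matches. It is a sketch, not NLTK's implementation. The function name regexp_tag, the sample tokens '<P>' and '</P>', and the CAP and UNK patterns are all assumptions, and it presumes a tokeniser that keeps markup tokens whole.

```python
import re


def regexp_tag(tokens, patterns):
    """Tag each token with the first matching pattern's tag, else None.

    Hypothetical stand-in for a regexp tagger; patterns is a list of
    (regular expression, tag) pairs tried in order.
    """
    compiled = [(re.compile(pattern), tag) for pattern, tag in patterns]
    tagged = []
    for token in tokens:
        tag = None
        for regex, candidate in compiled:
            if regex.match(token):
                tag = candidate
                break  # first match wins
        tagged.append((token, tag))
    return tagged


# Assumed example patterns, ordered from most to least specific.
PATTERNS = [
    (r"^</?\w+>$", "TG"),       # markup tokens such as <P> or </P>
    (r"^[A-Z][a-z]+$", "CAP"),  # capitalised word: a candidate name part
    (r".*", "UNK"),             # catch-all so no token is left as None
]

print(regexp_tag(["<P>", "the", "Bible", "</P>"], PATTERNS))
```

Putting the catch-all pattern last is what turns the None tags in the session above into UNK, the tag used in the chunking slides.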

9 Chunking
See http://nltk.org/doc/guides/chunk.html and http://nltk.org/doc/en/chunk.pdf
>>> from nltk.chunk import *
>>> from nltk.chunk.util import *
>>> from nltk.chunk.regexp import *
>>> from nltk import Tree

10 Chunk Example
>>> tt = [(' ', 'TG'), ('the', 'UNK'), ('Bible', 'UNK'), (' ', 'TG'), ('.', 'UNK')]
>>> cp = nltk.RegexpParser("Z: { * }")
>>> cp.parse(tt)
Tree('S', [Tree('Z', [(' ', 'TG'), ('the', 'UNK'), ('Bible', 'UNK'), (' ', 'TG')]), ('.', 'UNK')])
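The idea behind a chunk rule can be illustrated without NLTK: write the tag sequence as a string such as <TG><UNK><UNK><TG><UNK>, match a regular expression over it, and map each match back to the tokens it covers. The sketch below does that with the re module. The tag pattern <TG>(?:<UNK>)*<TG> is an assumption chosen to reproduce the grouping shown in the tree above, and the tokens '<b>' and '</b>' and the name chunk_spans are likewise illustrative.

```python
import re


def chunk_spans(tagged, tag_pattern):
    """Return the sub-lists of `tagged` whose tag sequence matches tag_pattern.

    Hypothetical helper: tags are encoded as '<TAG>' strings and the
    pattern is an ordinary regular expression over that encoding.
    """
    offsets = {}  # character offset in the tag string -> token index
    position = 0
    for index, (_, tag) in enumerate(tagged):
        offsets[position] = index
        position += len("<%s>" % tag)
    offsets[position] = len(tagged)  # offset just past the last tag

    tag_string = "".join("<%s>" % tag for _, tag in tagged)
    return [
        tagged[offsets[match.start()]:offsets[match.end()]]
        for match in re.finditer(tag_pattern, tag_string)
    ]


# Assumed sample input: markup tokens tagged TG, everything else UNK.
tt = [("<b>", "TG"), ("the", "UNK"), ("Bible", "UNK"), ("</b>", "TG"), (".", "UNK")]
chunks = chunk_spans(tt, r"<TG>(?:<UNK>)*<TG>")
print(chunks)
```

Each chunk here plays the role of the Z subtree; dropping its first and last element leaves the words enclosed by the markup.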

11 Extract Leaves
>>> tr = Tree('S', [Tree('Z', [(' ', 'TG'), ('the', 'UNK'), ('Bible', 'UNK'), (' ', 'TG')]), ('.', 'UNK')])
>>> tr.leaves()
[(' ', 'TG'), ('the', 'UNK'), ('Bible', 'UNK'), (' ', 'TG'), ('.', 'UNK')]
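Once name strings have been pulled out of the chunks, the remaining count and show steps of the pipeline can be sketched with collections.Counter. The function names and the sample names below are hypothetical; how the names are extracted from the leaves is left to the earlier steps.

```python
from collections import Counter


def rank_names(names):
    """Return (name, count) pairs, most frequent first."""
    return Counter(names).most_common()


def show(ranked):
    """Print one line per name: count, then name."""
    for name, count in ranked:
        print("%4d  %s" % (count, name))


# Hypothetical sample input standing in for the extracted names.
sample = ["Ann Smith", "Bob Jones", "Ann Smith"]
ranked = rank_names(sample)
show(ranked)
```

Comparing the ranked list against a hand-checked list of the names in the text is one simple way to approach the accuracy assessment.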

