Published by Mabel Beasley. Modified over 9 years ago.
1
CSA2050 Assignment Notes Mike Rosner
2
Aim
– Get text
– Identify people's names
– Print a frequency-ranked list of names
– Assess accuracy
3
Pipeline Processing
Stages: download, tokenize, tag, chunk, show, count
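The count and show stages can be sketched with the standard library alone. This is a minimal sketch in Python 3, not part of the original notes: the function names `ranked_counts` and `show` and the sample names are hypothetical, and it assumes the earlier stages have already produced a flat list of name strings.

```python
from collections import Counter


def ranked_counts(names):
    """Return (name, count) pairs, most frequent first."""
    return Counter(names).most_common()


def show(ranked):
    """Format the ranked pairs as one 'count  name' line each."""
    return '\n'.join('%5d  %s' % (count, name) for name, count in ranked)


# Hypothetical output of the chunk stage: one string per name occurrence.
names = ['Alice', 'Bob', 'Alice', 'Carol', 'Alice', 'Bob']
ranked = ranked_counts(names)
report = show(ranked)
```

Calling `print(report)` would list the names with the most frequent one first.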
4
Download File
urllib2
Read about urllib in the Python documentation, which is found under IDLE.
>>> import urllib2
>>> reply = urllib2.urlopen('http://…')
>>> s = reply.read()
N.B. s is just a big string.
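The session above uses urllib2, which exists only in Python 2. For readers on Python 3, a rough equivalent might look like the sketch below. The helper name `fetch_text` and the UTF-8 fallback are assumptions made for illustration, not something the notes prescribe.

```python
import urllib.request


def fetch_text(url, default_encoding='utf-8'):
    """Download url and return its body as one decoded string."""
    with urllib.request.urlopen(url) as reply:
        raw = reply.read()  # bytes in Python 3
        # Use the charset declared by the server, if any (assumed fallback: UTF-8).
        charset = reply.headers.get_content_charset() or default_encoding
    return raw.decode(charset, errors='replace')
```

As in the original session, the result is just one big string; pass it the same URL you would have given to `urllib2.urlopen`.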
5
Tokenize File
See http://nltk.org/doc/guides/ for notes on tokenisation.
Tokenisation turns a string into a list of tokens.
Things to try:
– Eliminate HTML tags
– Break into sentences
Can use RegexpTokenizer.
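The two "things to try" can be prototyped with plain regular expressions before reaching for NLTK. The sketch below is one possible approach, not the method the notes require: `strip_tags` and `split_sentences` are hypothetical helper names, and the sentence splitter is deliberately naive (it will break on abbreviations such as "Mr.").

```python
import re


def strip_tags(text):
    """Replace anything that looks like an HTML/SGML tag with a space."""
    return re.sub(r'<[^>]+>', ' ', text)


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]


clean = strip_tags('<P>Smoke poured out. A plane had struck it.</P>')
sentences = split_sentences(clean)
```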
6
Example
>>> from nltk.tokenize import *
>>> regexp_tokenize(s, pattern='\w+')
['DOC', 'id', 'APW20010911', '0566', 'type', 'other', 'DATELINE', 'NEW', 'YORK', 'DATELINE', 'TEXT', 'Plane', 'crashes', 'into', 'World', 'Trade', 'Center', 'according', 'to', 'television', 'reports', 'TEXT', 'DOC', 'DOC', 'id', 'APW20010911', '0571', 'type', 'story', 'HEADLINE', 'BULLETIN', 'HEADLINE', 'DATELINE', 'NEW', 'YORK', 'AP', 'DATELINE', 'TEXT', 'P', 'Smoke', 'poured', 'out', 'of', 'a', 'gaping', 'hole', 'in', 'the', 'upper', 'floors', 'of', 'the', 'World', 'Trade', 'Center', 'on', 'Tuesday', 'and', 'there', 'were', 'broadcast', 'reports', 'a', 'plane', 'had', 'struck', 'it', 'P', 'P', 'MORE', 'P', 'TEXT', 'DOC']
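With the pattern `\w+`, the same kind of token list can be obtained from the standard `re` module, which is handy for experimenting without NLTK installed. In the sketch below the `sample` string is an assumption: it is a guess at the sort of marked-up input that would produce the first tokens in the output above.

```python
import re

# Assumed input: a small SGML-like fragment (hypothetical reconstruction).
sample = ('<DOC id="APW20010911.0566" type="other">\n'
          '<TEXT>\nPlane crashes into World Trade Center\n</TEXT>')

# Every maximal run of word characters becomes one token.
tokens = re.findall(r'\w+', sample)
```

Note that markup names such as DOC and TEXT come out as ordinary tokens, which is why eliminating tags before tokenising is worth trying.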
7
Special Sequences
\b  Word boundary (zero width)
\d  Any decimal digit (equivalent to [0-9])
\D  Any non-digit character (equivalent to [^0-9])
\s  Any whitespace character (equivalent to [ \t\n\r\f\v])
\S  Any non-whitespace character (equivalent to [^ \t\n\r\f\v])
\w  Any alphanumeric character (equivalent to [a-zA-Z0-9_])
\W  Any non-alphanumeric character (equivalent to [^a-zA-Z0-9_])
See also http://nltk.org/doc/en/regexps.pdf
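A few of these sequences in action, as a small sketch using Python's `re` module. The test string is made up for illustration, loosely modelled on the document header seen in the tokenisation example.

```python
import re

s = 'DOC id APW20010911.0566\tNEW YORK'

digit_runs = re.findall(r'\d+', s)            # maximal runs of digits
non_space = re.findall(r'\S+', s)             # maximal runs of non-whitespace
word_split = re.split(r'\W+', s)              # split on non-alphanumerics
three_caps = re.findall(r'\b[A-Z]{3}\b', s)   # whole words of exactly 3 capitals
```

The last pattern shows what `\b` buys you: APW is not matched because it is followed directly by a digit, so there is no word boundary after it.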
8
Tag the Tokens
See http://nltk.org/doc/guides/tag.html
>>> import nltk.tag
>>> tagger = nltk.RegexpTagger([(' ','TG') ])
>>> tagger.tag([' ','the','Bible',' '])
[(' ', 'TG'), ('the', None), ('Bible', None), (' ', None)]
Other (pattern, tag) pairs can be added to the list.
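To make the idea of regular-expression tagging concrete without depending on NLTK, here is a small stand-in written with `re`. It is a sketch of the general technique (first matching pattern wins, unmatched tokens get None), not NLTK's implementation. The function name `regexp_tag`, the markup tokens `<b>` and `</b>`, both patterns, and the CAP tag are assumptions chosen for illustration.

```python
import re


def regexp_tag(tokens, patterns):
    """Tag each token with the tag of the first pattern that matches it."""
    compiled = [(re.compile(rx), tag) for rx, tag in patterns]
    tagged = []
    for tok in tokens:
        tag = None  # tokens that match no pattern stay untagged
        for rx, candidate in compiled:
            if rx.match(tok):  # match is anchored at the start of the token
                tag = candidate
                break
        tagged.append((tok, tag))
    return tagged


patterns = [
    (r'</?\w+>', 'TG'),        # assumed: opening or closing markup tag
    (r'[A-Z][a-z]+$', 'CAP'),  # assumed extra pair: capitalised word
]
tagged = regexp_tag(['<b>', 'the', 'Bible', '</b>'], patterns)
```

A pair such as the hypothetical CAP rule is one way to start marking candidate names for the later stages.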
9
Chunking
See http://nltk.org/doc/guides/chunk.html and http://nltk.org/doc/en/chunk.pdf
>>> from nltk.chunk import *
>>> from nltk.chunk.util import *
>>> from nltk.chunk.regexp import *
>>> from nltk import Tree
10
Chunk Example
>>> tt = [(' ', 'TG'), ('the', 'UNK'), ('Bible', 'UNK'), (' ', 'TG'), ('.', 'UNK')]
>>> cp = nltk.RegexpParser("Z: { * }")
>>> cp.parse(tt)
Tree('S', [Tree('Z', [(' ', 'TG'), ('the', 'UNK'), ('Bible', 'UNK'), (' ', 'TG')]), ('.', 'UNK')])
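The idea behind regular-expression chunking can be sketched without NLTK: write the tag sequence out as a string such as `<TG><UNK><UNK><TG><UNK>`, run an ordinary regular expression over it, and group the tokens under each match. The code below is a simplified illustration of that idea, not NLTK's RegexpParser. The names `Chunk` and `chunk`, the markup tokens, and the tag pattern are assumptions; the slide's own values may differ. It also assumes the pattern is built from whole `<TAG>` units, so that matches line up with token boundaries.

```python
import re
from collections import namedtuple

Chunk = namedtuple('Chunk', 'label leaves')


def chunk(tagged, label, tag_pattern):
    """Group runs of tokens whose tag sequence matches tag_pattern."""
    tag_string = ''
    boundaries = {0: 0}  # character offset in tag_string -> token index
    for i, (_, tag) in enumerate(tagged):
        tag_string += '<%s>' % tag
        boundaries[len(tag_string)] = i + 1
    result, last = [], 0
    for m in re.finditer(tag_pattern, tag_string):
        start, end = boundaries[m.start()], boundaries[m.end()]
        if start == end:  # ignore empty matches
            continue
        result.extend(tagged[last:start])  # tokens outside any chunk
        result.append(Chunk(label, tagged[start:end]))
        last = end
    result.extend(tagged[last:])
    return result


# Hypothetical tagged tokens; every token needs a string tag (no None).
tt = [('<b>', 'TG'), ('the', 'UNK'), ('Bible', 'UNK'), ('</b>', 'TG'), ('.', 'UNK')]
chunks = chunk(tt, 'Z', r'<TG>(?:<UNK>)*<TG>')
```

Here the four tokens from the opening tag to the closing tag end up in one Z chunk, and the final full stop stays outside it, mirroring the shape of the tree in the example above.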
11
Extract Leaves
>>> tr = Tree('S', [Tree('Z', [(' ', 'TG'), ('the', 'UNK'), ('Bible', 'UNK'), (' ', 'TG')]), ('.', 'UNK')])
>>> tr.leaves()
[(' ', 'TG'), ('the', 'UNK'), ('Bible', 'UNK'), (' ', 'TG'), ('.', 'UNK')]
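What `leaves()` does can be shown with a tiny stand-in tree class. This is a sketch for illustration only: `Node` is a hypothetical class, not NLTK's Tree, and the markup tokens `<b>` and `</b>` are assumed. The last step, keeping only the words inside Z chunks, is one possible way to feed the counting stage.

```python
class Node:
    """Minimal labelled tree: children are Nodes or (word, tag) leaves."""

    def __init__(self, label, children):
        self.label = label
        self.children = children

    def leaves(self):
        """Return all leaves in left-to-right order, flattening subtrees."""
        out = []
        for child in self.children:
            if isinstance(child, Node):
                out.extend(child.leaves())
            else:
                out.append(child)
        return out


tr = Node('S', [
    Node('Z', [('<b>', 'TG'), ('the', 'UNK'), ('Bible', 'UNK'), ('</b>', 'TG')]),
    ('.', 'UNK'),
])
all_leaves = tr.leaves()

# Words inside Z chunks only, with the markup tokens dropped.
z_words = [word
           for child in tr.children
           if isinstance(child, Node) and child.label == 'Z'
           for word, tag in child.leaves()
           if tag != 'TG']
```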