CSA2050 Assignment Notes
Mike Rosner

Aim
– Get text
– Identify people's names
– Print frequency-ranked list of names
– Assess accuracy

Pipeline Processing
download -> tokenize -> tag -> chunk -> show -> count

Download File
urllib2: read about urllib in the Python documentation, which is found under IDLE.
>>> import urllib2
>>> reply = urllib2.urlopen('http://…')
>>> s = reply.read()
N.B. s is just a big string.
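The slide uses urllib2, which exists only in Python 2. As a minimal sketch for Python 3, assuming urllib.request as the replacement, the download step might look like the following. The helper name fetch_text and the UTF-8 fallback are assumptions made for illustration, and a data: URL stands in for the real corpus address so the example runs offline.

```python
# Python 3 sketch: urllib2 from the slide became urllib.request in Python 3.
from urllib.request import urlopen


def fetch_text(url, default_charset="utf-8"):
    """Download url and return its body decoded to a str (hypothetical helper)."""
    with urlopen(url) as reply:
        raw = reply.read()  # bytes in Python 3, not a str
        # Prefer the charset declared in the response headers;
        # falling back to UTF-8 is an assumption, not part of the slides.
        charset = reply.headers.get_content_charset() or default_charset
    return raw.decode(charset, errors="replace")


# A data: URL keeps the demo offline; a real run would pass the corpus URL.
s = fetch_text(
    "data:text/plain;charset=utf-8,Plane%20crashes%20into%20World%20Trade%20Center"
)
print(s)
```

Unlike the Python 2 version, read() returns bytes here, so the explicit decode step is what turns the reply into "just a big string".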

Tokenize File
See http://nltk.org/doc/guides/ for notes on tokenisation.
Tokenisation turns a string into a list of tokens.
Things to try:
– Eliminate HTML tags
– Break into sentences
RegexpTokenizer can be used for both.
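The two "things to try" can be sketched with the standard re module alone. This is a rough illustration rather than the NLTK route: the helper names strip_tags and split_sentences and both patterns are assumptions, and the sample string is made up in the style of the corpus.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")              # anything between < and >
SENT_END_RE = re.compile(r"(?<=[.!?])\s+")   # whitespace that follows . ! or ?


def strip_tags(s):
    """Replace markup tags with spaces, then normalise whitespace."""
    return " ".join(TAG_RE.sub(" ", s).split())


def split_sentences(text):
    """Naive splitter: break after sentence-final punctuation."""
    return [part for part in SENT_END_RE.split(text) if part]


raw = "<TEXT><P>Smoke poured out of a gaping hole. A plane had struck it.</P></TEXT>"
clean = strip_tags(raw)
print(clean)
print(split_sentences(clean))
```

A splitter this simple also breaks after abbreviations such as "Mr.", which matters for a task about people's names, so it is best treated as a starting point.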

Example
>>> from nltk.tokenize import *
>>> regexp_tokenize(s, pattern=r'\w+')
['DOC', 'id', 'APW20010911', '0566', 'type', 'other', 'DATELINE', 'NEW', 'YORK',
 'DATELINE', 'TEXT', 'Plane', 'crashes', 'into', 'World', 'Trade', 'Center',
 'according', 'to', 'television', 'reports', 'TEXT', 'DOC', 'DOC', 'id',
 'APW20010911', '0571', 'type', 'story', 'HEADLINE', 'BULLETIN', 'HEADLINE',
 'DATELINE', 'NEW', 'YORK', 'AP', 'DATELINE', 'TEXT', 'P', 'Smoke', 'poured',
 'out', 'of', 'a', 'gaping', 'hole', 'in', 'the', 'upper', 'floors', 'of', 'the',
 'World', 'Trade', 'Center', 'on', 'Tuesday', 'and', 'there', 'were', 'broadcast',
 'reports', 'a', 'plane', 'had', 'struck', 'it', 'P', 'P', 'MORE', 'P', 'TEXT', 'DOC']
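For a pattern such as \w+, the same token list should be obtainable without NLTK through re.findall. The sketch below assumes that equivalence for this simple pattern only; word_tokens is a hypothetical helper, and the sample line is reconstructed from the token output above rather than copied from the corpus.

```python
import re


def word_tokens(s):
    """Return every maximal run of word characters in s."""
    return re.findall(r"\w+", s)


# Reconstructed sample line (an assumption, based on the tokens shown above).
sample = '<DOC id="APW20010911.0566" type="other">'
print(word_tokens(sample))
```

Note that the markup names (DOC, id, type) survive as tokens, which is why eliminating tags before tokenising is worth trying.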

Special Sequences
\b  Word boundary (zero width)
\d  Any decimal digit (equivalent to [0-9])
\D  Any non-digit character (equivalent to [^0-9])
\s  Any whitespace character (equivalent to [ \t\n\r\f\v])
\S  Any non-whitespace character (equivalent to [^ \t\n\r\f\v])
\w  Any alphanumeric character (equivalent to [a-zA-Z0-9_])
\W  Any non-alphanumeric character (equivalent to [^a-zA-Z0-9_])
See also http://nltk.org/doc/en/regexps.pdf
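A few of these sequences in action, as a small sketch using Python's re module. The sample text is invented in the style of the corpus, and the capitalised-run pattern is only one possible heuristic for spotting name candidates, not a method given in the slides. In Python 3, \w and \d also match non-ASCII letters and digits on str input, so the ASCII equivalences above are approximate there.

```python
import re

# Invented sample text in the style of the corpus.
text = "NEW YORK (AP) 2001-09-11 Smoke poured out of the World Trade Center on Tuesday"

# \d+ : runs of decimal digits
digits = re.findall(r"\d+", text)

# \S+ : runs of non-whitespace characters (punctuation stays attached)
chunks = re.findall(r"\S+", "NEW YORK (AP)")

# \b and \s : runs of capitalised words, a crude name-candidate heuristic
cap_runs = re.findall(r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*\b", text)

print(digits)
print(chunks)
print(cap_runs)
```

The heuristic picks up "Smoke" and "Tuesday" alongside "World Trade Center", which illustrates why the later tagging and accuracy steps are needed.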

Tag the Tokens
See http://nltk.org/doc/guides/tag.html
>>> import nltk.tag
>>> tagger = nltk.RegexpTagger([(' ','TG') ])
>>> tagger.tag([' ','the','Bible',' '])
[(' ', 'TG'), ('the', None), ('Bible', None), (' ', None)]
Other (pattern, tag) pairs can be added to the list passed to RegexpTagger.
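The idea behind a regular-expression tagger can be sketched in plain Python: try each (pattern, tag) pair in order, take the first pattern that matches at the start of the token, and fall back to None. The function regexp_tag, both patterns, the CAP tag and the <s> tokens below are all made up for illustration; they are not the pairs used on the slide.

```python
import re


def regexp_tag(tokens, patterns):
    """Tag each token with the first pattern that matches it, else None."""
    compiled = [(re.compile(pattern), tag) for pattern, tag in patterns]
    tagged = []
    for token in tokens:
        tag = None
        for regex, candidate in compiled:
            if regex.match(token):  # anchored at the start of the token
                tag = candidate
                break
        tagged.append((token, tag))
    return tagged


# Hypothetical pairs: markup tokens get 'TG', capitalised words get 'CAP'.
PATTERNS = [
    (r"</?\w+>$", "TG"),
    (r"[A-Z][a-z]+$", "CAP"),
]

tagged = regexp_tag(["<s>", "the", "Bible", "</s>"], PATTERNS)
print(tagged)
```

Because the first matching pair wins, more specific patterns belong earlier in the list.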

Chunking
See http://nltk.org/doc/guides/chunk.html and http://nltk.org/doc/en/chunk.pdf
>>> from nltk.chunk import *
>>> from nltk.chunk.util import *
>>> from nltk.chunk.regexp import *
>>> from nltk import Tree

Chunk Example
>>> tt = [(' ', 'TG'), ('the', 'UNK'), ('Bible', 'UNK'), (' ', 'TG'), ('.', 'UNK')]
>>> cp = nltk.RegexpParser("Z: { * }")
>>> cp.parse(tt)
Tree('S', [Tree('Z', [(' ', 'TG'), ('the', 'UNK'), ('Bible', 'UNK'), (' ', 'TG')]), ('.', 'UNK')])
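Chunking groups adjacent tagged tokens into larger units. As a simplified stand-in for a regular-expression chunk parser, the sketch below just collects runs of consecutive tokens that share one tag, using itertools.groupby. The function chunk_runs, the CAP tag and the sample tokens are assumptions for illustration; a real chunk grammar can express far richer patterns than this.

```python
from itertools import groupby


def chunk_runs(tagged, target_tag):
    """Join each run of consecutive tokens tagged target_tag into one string."""
    chunks = []
    for tag, run in groupby(tagged, key=lambda pair: pair[1]):
        if tag == target_tag:
            chunks.append(" ".join(token for token, _ in run))
    return chunks


# Hypothetical tagger output: 'CAP' marks capitalised words.
tagged_tokens = [
    ("the", None), ("World", "CAP"), ("Trade", "CAP"), ("Center", "CAP"),
    ("on", None), ("Tuesday", "CAP"),
]
print(chunk_runs(tagged_tokens, "CAP"))
```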

Extract Leaves
>>> tr = Tree('S', [Tree('Z', [(' ', 'TG'), ('the', 'UNK'), ('Bible', 'UNK'), (' ', 'TG')]), ('.', 'UNK')])
>>> tr.leaves()
[(' ', 'TG'), ('the', 'UNK'), ('Bible', 'UNK'), (' ', 'TG'), ('.', 'UNK')]
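Once name candidates have been extracted, the remaining aims are a frequency-ranked list and an accuracy check. A minimal sketch, assuming accuracy is measured as precision and recall against a hand-made list of correct names (the slides do not say which measure to use). The names and the gold list below are invented.

```python
from collections import Counter


def rank_names(names):
    """Return (name, count) pairs, most frequent first."""
    return Counter(names).most_common()


def precision_recall(found, gold):
    """Compare the set of extracted names with a hand-checked gold set."""
    found, gold = set(found), set(gold)
    correct = found & gold
    precision = len(correct) / len(found) if found else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall


# Invented extractor output and gold standard, for illustration only.
candidates = ["Alice Smith", "Bob Jones", "Alice Smith",
              "World Trade Center", "Alice Smith", "Bob Jones"]
gold = {"Alice Smith", "Bob Jones", "Carol White"}

ranking = rank_names(candidates)
for name, count in ranking:
    print(count, name)

precision, recall = precision_recall(candidates, gold)
print("precision:", round(precision, 2), "recall:", round(recall, 2))
```

Precision drops when non-names such as "World Trade Center" are extracted, and recall drops when real names such as "Carol White" are missed.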
