Published by Νικίας Κουβέλης. Modified over 5 years ago.
1
LING/C SC 581: Advanced Computational Linguistics
Lecture 4 Jan 22nd
2
Administrivia
I'll be away for four lectures (too long) starting next week.
Topic: WordNet
I'll post slides (as usual) and one homework.
You should read the slides and do the homework.
3
Administrivia
Homework 3 remark: some people had trouble with the graphical tree display.
One alternative: .pretty_print() (see end of today's lecture), which draws the tree with ASCII graphics.
4
2019 HLT Lecture Series (Speaker; Title; Date)
Tatjana Scheffler; Analyzing Discourse Structure on Social Media; Friday Feb 15th, 3pm, Comm 311
Marcos Zampieri; Language Variation and Automatic Language Identification. The Case of Dialects and Similar Languages; Wednesday Feb 20th, noon, room TBA
Adriana Picoral; Investigating Multilingualism through Computational Linguistics; Wednesday Feb 27th, noon, room TBA
Gus Hahn-Powell; TBA; Wednesday Mar 13th, noon, room TBA
Miikka Silfverberg; (title not listed); Wednesday Mar 20th, noon, room TBA
5
Homework 2 Review
import nltk
import numpy as np                 # np.array() a.max() a.min()
import matplotlib.pyplot as plt    # .hist(density=True) .show() .xlabel() .ylabel() .legend()
from urllib import request
url = "
response = request.urlopen(url)
raw = response.read().decode('latin-1')
raw = raw[431:]
raw = raw[1217:]
raw = raw[:368976]
6
Homework 2 Review
Plot legend: blue = Mrs. Dalloway; orange = Brown corpus
7
Penn Treebank (version 3)
USB stick: Homework 4: install and test it (no need to report)
8
nltk: Corpus Readers
The NLTK data package includes a 10% sample of the Penn Treebank (in treebank), which you used in Homework 3, as well as the Sinica Treebank (in sinica_treebank).
Reading the Penn Treebank (Wall Street Journal sample):
>>> from nltk.corpus import treebank
>>> print(treebank.fileids()) # doctest: +ELLIPSIS
['wsj_0001.mrg', 'wsj_0002.mrg', 'wsj_0003.mrg', 'wsj_0004.mrg', ...]
>>> print(treebank.words('wsj_0003.mrg'))
['A', 'form', 'of', 'asbestos', 'once', 'used', ...]
>>> print(treebank.tagged_words('wsj_0003.mrg'))
[('A', 'DT'), ('form', 'NN'), ('of', 'IN'), ...]
>>> print(treebank.parsed_sents('wsj_0003.mrg')[0]) # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
(S
  (S-TPC-1
    (NP-SBJ
      (NP (NP (DT A) (NN form)) (PP (IN of) (NP (NN asbestos))))
      (RRC ...)...)...)
  ...
  (VP (VBD reported) (SBAR (-NONE- 0) (S (-NONE- *T*-1))))
  (. .))
9
nltk: Corpus Readers
If you have access to a full installation of the Penn Treebank, NLTK can be configured to load it as well. Download the ptb package, and in the directory nltk_data/corpora/ptb place the BROWN and WSJ directories of the Treebank installation (symlinks work as well). Then use the ptb module instead of treebank:
>>> from nltk.corpus import ptb
>>> print(ptb.fileids()) # doctest: +SKIP
['BROWN/CF/CF01.MRG', 'BROWN/CF/CF02.MRG', 'BROWN/CF/CF03.MRG', 'BROWN/CF/CF04.MRG', ...]
>>> print(ptb.words('WSJ/00/WSJ_0003.MRG')) # doctest: +SKIP
['A', 'form', 'of', 'asbestos', 'once', 'used', '*', ...]
>>> print(ptb.tagged_words('WSJ/00/WSJ_0003.MRG')) # doctest: +SKIP
[('A', 'DT'), ('form', 'NN'), ('of', 'IN'), ...]
10
Penn Treebank (PTB) with nltk
Source archive: TREEBANK_3.zip
Put your wsj directory (from mrg) here: ~/nltk_data/corpora/ptb
Filename case problem! The ptb reader expects upper-case paths such as WSJ/00/WSJ_0003.MRG.
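A likely reason for the case problem is that NLTK selects the PTB files by matching their relative paths against an upper-case pattern. The sketch below illustrates this with plain `re`. The pattern is quoted from memory of NLTK's `ptb` loader, so treat it as an assumption and check it against the `nltk/corpus/__init__.py` in your own installation; the helper name `is_ptb_fileid` is made up for this example.

```python
import re

# Assumed pattern (recalled from NLTK's ptb corpus loader; verify locally).
PTB_FILEIDS = r"(WSJ/\d\d/WSJ_\d\d|BROWN/C[A-Z]/C[A-Z])\d\d.MRG"

def is_ptb_fileid(path):
    """Return True if a path relative to corpora/ptb matches the pattern."""
    return re.fullmatch(PTB_FILEIDS, path) is not None

# Upper-case paths match; the lower-case names from the archive do not.
for path in ["WSJ/00/WSJ_0003.MRG", "wsj/00/wsj_0003.mrg", "BROWN/CF/CF01.MRG"]:
    print(path, is_ptb_fileid(path))
```

If the pattern is right, files left in lower case are simply invisible to `ptb.fileids()`, which is why the renaming step is needed.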
11
Penn Treebank (PTB) with nltk
~/nltk_data/corpora/ptb
12
Penn Treebank (PTB) with nltk
Rename files to uppercase:
for f in `find wsj`; do mv -v "$f" "`echo $f | tr '[a-z]' '[A-Z]'`"; done
(found on stackoverflow.com)
Seems to work, but it is not clean.
The directory name needs to be uppercased too!
13
Penn Treebank (PTB) with nltk
Note: you may run into problems with file permissions when renaming.
Change permissions (recursively):
chmod -R u+w atis
14
Penn Treebank (PTB) with nltk
Renaming script courtesy of Sandeep Suntwal (from last year's class):
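The script itself is not reproduced in this transcript. As a stand-in (this is not Sandeep's script), here is a minimal Python sketch of the same task: it renames every file and directory under a given root to upper case, working bottom-up so that a directory is renamed only after its contents. The function name `uppercase_tree` is hypothetical.

```python
import os

def uppercase_tree(root):
    """Rename all files and directories under root, and root itself, to upper case.

    Returns the new path of the root directory.
    """
    # topdown=False visits children before their parent directory, so the
    # paths we build are still valid when each rename happens.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames + dirnames:
            if name != name.upper():
                os.rename(os.path.join(dirpath, name),
                          os.path.join(dirpath, name.upper()))
    # Finally rename the top-level directory itself (e.g. wsj -> WSJ).
    parent, base = os.path.split(os.path.abspath(root))
    new_root = os.path.join(parent, base.upper())
    if base != base.upper():
        os.rename(os.path.join(parent, base), new_root)
    return new_root

# Example call (commented out so nothing is renamed by accident):
# uppercase_tree(os.path.expanduser("~/nltk_data/corpora/ptb/wsj"))
```

Try it on a copy of the directory first; if permissions block the renames, fix them before running it.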
15
Penn Treebank (PTB) with nltk
Checking the install:
class BracketParseCorpusReader
The installed corpus seems to be the Brown corpus + the Wall Street Journal corpus …
16
Penn Treebank (PTB) with nltk
WSJ only:
The categories are defined in ~/nltk_data/corpora/ptb/allcats.txt:
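To make the role of the category file concrete, here is a small sketch that reads such a file into a category-to-files mapping. It assumes the usual layout for NLTK's categorized readers, one file id per line followed by its space-separated categories. The three sample lines are illustrative only, not copied from the real allcats.txt, and `read_categories` is a hypothetical helper.

```python
from collections import defaultdict

# Illustrative lines only; the real allcats.txt ships with the ptb package.
SAMPLE = """\
WSJ/00/WSJ_0001.MRG news
WSJ/00/WSJ_0002.MRG news
BROWN/CF/CF01.MRG lore
"""

def read_categories(text, delimiter=" "):
    """Map each category name to the list of file ids assigned to it."""
    by_category = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # First field is the file id, the rest are its categories.
        fileid, categories = line.split(delimiter, 1)
        for category in categories.split(delimiter):
            by_category[category].append(fileid)
    return dict(by_category)

cats = read_categories(SAMPLE)
print(sorted(cats))
print(cats["news"])
```

Under this reading, restricting a call to categories=['news'] selects the WSJ files and leaves out the Brown files.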
17
Penn Treebank (PTB) with nltk
Validation: methods words(), tagged_words()
18
Penn Treebank (PTB) with nltk
Validation: methods sents(), tagged_sents()
19
Penn Treebank (PTB) with nltk
Validation: method parsed_sents()
20
Penn Treebank (PTB) with nltk
print function and method draw()
ptb.parsed_sents(categories=['news'])[0].draw()
21
Penn Treebank (PTB) with nltk
Class nltk.tree methods
>>> s = ptb.parsed_sents(categories=['news'])[0]
>>> s.productions()
[S -> NP-SBJ VP ., NP-SBJ -> NP , ADJP ,, NP -> NNP NNP, NNP -> 'Pierre', NNP -> 'Vinken', , -> ',', ADJP -> NP JJ, NP -> CD NNS, CD -> '61', NNS -> 'years', JJ -> 'old', , -> ',', VP -> MD VP, MD -> 'will', VP -> VB NP PP-CLR NP-TMP, VB -> 'join', NP -> DT NN, DT -> 'the', NN -> 'board', PP-CLR -> IN NP, IN -> 'as', NP -> DT JJ NN, DT -> 'a', JJ -> 'nonexecutive', NN -> 'director', NP-TMP -> NNP CD, NNP -> 'Nov.', CD -> '29', . -> '.']
>>> type(s)
<class 'nltk.tree.Tree'>
22
Penn Treebank (PTB) with nltk
Class nltk.tree methods
>>> s.productions()
[S -> NP-SBJ VP ., NP-SBJ -> NP , ADJP ,, NP -> NNP NNP, NNP -> 'Pierre', NNP -> 'Vinken', , -> ',', ADJP -> NP JJ, NP -> CD NNS, CD -> '61', NNS -> 'years', JJ -> 'old', , -> ',', VP -> MD VP, MD -> 'will', VP -> VB NP PP-CLR NP-TMP, VB -> 'join', NP -> DT NN, DT -> 'the', NN -> 'board', PP-CLR -> IN NP, IN -> 'as', NP -> DT JJ NN, DT -> 'a', JJ -> 'nonexecutive', NN -> 'director', NP-TMP -> NNP CD, NNP -> 'Nov.', CD -> '29', . -> '.']
s.words() is not defined
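The same calls can be tried without the full treebank by building a small tree from a bracketed string. The toy sentence below is made up for the example; the import guard only lets the snippet load on a machine without nltk.

```python
try:
    from nltk.tree import Tree
except ImportError:  # nltk not installed: the demonstration is skipped
    Tree = None

if Tree is not None:
    # A made-up toy tree, not a treebank sentence.
    t = Tree.fromstring("(S (NP (DT the) (NN board)) (VP (VB join)))")

    # productions() lists one context-free rule per node, top-down, left to right.
    rules = [str(p) for p in t.productions()]
    print(rules)

    # Tree objects have no words() method; leaves() returns the word list.
    words = t.leaves()
    print(words)
    print(hasattr(t, "words"))
```

Corpus readers offer words(); a single Tree offers leaves() for the same purpose.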
23
Penn Treebank (PTB) with nltk
Class nltk.tree methods
>>> len(s)
3
>>> s[0]
Tree('NP-SBJ', [Tree('NP', [Tree('NNP', ['Pierre']), Tree('NNP', ['Vinken'])]), Tree(',', [',']), Tree('ADJP', [Tree('NP', [Tree('CD', ['61']), Tree('NNS', ['years'])]), Tree('JJ', ['old'])]), Tree(',', [','])])
>>> s[1]
Tree('VP', [Tree('MD', ['will']), Tree('VP', [Tree('VB', ['join']), Tree('NP', [Tree('DT', ['the']), Tree('NN', ['board'])]), Tree('PP-CLR', [Tree('IN', ['as']), Tree('NP', [Tree('DT', ['a']), Tree('JJ', ['nonexecutive']), Tree('NN', ['director'])])]), Tree('NP-TMP', [Tree('NNP', ['Nov.']), Tree('CD', ['29'])])])])
>>> s[2]
Tree('.', ['.'])
24
Penn Treebank (PTB) with nltk
Class nltk.tree methods
>>> s.label()
'S'
>>> s.leaves()
['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']
>>> s.flatten()
Tree('S', ['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.'])
>>> s.height()
7
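To see where the height of 7 comes from, the sketch below re-implements leaves and height from scratch on the same bracketed sentence. It deliberately does not use nltk: `parse`, `leaves` and `height` are hypothetical helpers written for this illustration, with a word counted as height 1, which matches the value shown above.

```python
import re

PTB_SENTENCE = (
    "(S (NP-SBJ (NP (NNP Pierre) (NNP Vinken)) (, ,) "
    "(ADJP (NP (CD 61) (NNS years)) (JJ old)) (, ,)) "
    "(VP (MD will) (VP (VB join) (NP (DT the) (NN board)) "
    "(PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director))) "
    "(NP-TMP (NNP Nov.) (CD 29)))) (. .))"
)

def parse(text):
    """Parse a bracketed string into (label, children); words stay plain str."""
    tokens = re.findall(r"[()]|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # consume ")"
        return (label, children)

    return node()

def leaves(tree):
    """Words of the tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [word for child in tree[1] for word in leaves(child)]

def height(tree):
    """A word has height 1; a node is one more than its tallest child."""
    if isinstance(tree, str):
        return 1
    return 1 + max(height(child) for child in tree[1])

tree = parse(PTB_SENTENCE)
print(len(tree[1]), len(leaves(tree)), height(tree))
```

The longest path runs S, VP, VP, PP-CLR, NP, DT, 'a', which is seven levels.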
25
Penn Treebank (PTB) with nltk
Class nltk.tree methods
>>> for t in s.subtrees():
...     print(t)
(S (NP-SBJ (NP (NNP Pierre) (NNP Vinken)) (, ,) (ADJP (NP (CD 61) (NNS years)) (JJ old)) (, ,)) (VP (MD will) (VP (VB join) (NP (DT the) (NN board)) (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director))) (NP-TMP (NNP Nov.) (CD 29)))) (. .))
(NP-SBJ (NP (NNP Pierre) (NNP Vinken)) (, ,) (ADJP (NP (CD 61) (NNS years)) (JJ old)) (, ,))
(NP (NNP Pierre) (NNP Vinken))
(NNP Pierre)
(NNP Vinken)
(, ,)
(ADJP (NP (CD 61) (NNS years)) (JJ old))
(NP (CD 61) (NNS years))
(CD 61)
(NNS years)
(JJ old)
(, ,)
(VP (MD will) (VP (VB join) (NP (DT the) (NN board)) (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director))) (NP-TMP (NNP Nov.) (CD 29))))
(MD will)
(VP (VB join) (NP (DT the) (NN board)) (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director))) (NP-TMP (NNP Nov.) (CD 29)))
(VB join)
(NP (DT the) (NN board))
(DT the)
(NN board)
(PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
(IN as)
(NP (DT a) (JJ nonexecutive) (NN director))
(DT a)
(JJ nonexecutive)
(NN director)
(NP-TMP (NNP Nov.) (CD 29))
(NNP Nov.)
(CD 29)
(. .)
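subtrees() also accepts a filter function, which is handy when only some constituents are of interest. A short sketch on a made-up toy tree, collecting the words of every NP (the import guard only lets the snippet load without nltk):

```python
try:
    from nltk.tree import Tree
except ImportError:  # nltk not installed: the demonstration is skipped
    Tree = None

if Tree is not None:
    # A made-up toy tree, not a treebank sentence.
    t = Tree.fromstring(
        "(S (NP (DT the) (NN board)) (VP (VB join) (NP (DT a) (NN director))))"
    )

    # Without a filter, every node (including t itself) is visited once.
    n_subtrees = sum(1 for _ in t.subtrees())

    # With a filter, only subtrees for which the function returns True are kept.
    nps = [" ".join(np.leaves())
           for np in t.subtrees(lambda st: st.label() == "NP")]
    print(n_subtrees, nps)
```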
26
Penn Treebank (PTB) with nltk
Class nltk.tree methods
>>> s.pos()
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]
Other methods (see the source code): chomsky_normal_form(), fromstring(), pretty_print()
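A brief sketch of the three methods just listed, again on a made-up toy tree rather than a treebank sentence. pretty_print() writes an ASCII drawing to standard output, so it works where the graphical draw() window does not; chomsky_normal_form() rewrites the tree in place so that no node has more than two children. The import guard only lets the snippet load without nltk.

```python
try:
    from nltk.tree import Tree
except ImportError:  # nltk not installed: the demonstration is skipped
    Tree = None

if Tree is not None:
    # fromstring() builds a Tree from bracketed notation.
    t = Tree.fromstring(
        "(S (NP (NNP Vinken)) "
        "(VP (VB join) (NP (DT the) (NN board)) (NP-TMP (NNP Nov.) (CD 29))))"
    )

    # ASCII drawing on stdout; no GUI needed.
    t.pretty_print()

    words_before = t.leaves()
    # Binarize in place: the three-child VP gets an intermediate node.
    t.chomsky_normal_form()
    widest = max(len(st) for st in t.subtrees())
    print(widest, t.leaves() == words_before)
    t.pretty_print()
```

Binarization changes the node structure only; the words and their order stay the same.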
27
Penn Treebank (PTB) with nltk