Methods in Computational Linguistics II
Queens College
Lecture 5: List Comprehensions
Split into words
sent = "That isn't the problem, Bob."
sent.split() vs. nltk.word_tokenize(sent)
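For concreteness, here is roughly what the two calls return (a small sketch; it assumes nltk is installed and its 'punkt' tokenizer data has been downloaded):

import nltk

sent = "That isn't the problem, Bob."

print(sent.split())
# ['That', "isn't", 'the', 'problem,', 'Bob.']   punctuation stays glued to the words

print(nltk.word_tokenize(sent))
# ['That', 'is', "n't", 'the', 'problem', ',', 'Bob', '.']   punctuation and the
# clitic "n't" become separate tokens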
List Comprehensions
A compact way to process every item in a list.
[x for x in array]
is equivalent to:
dest = []
for x in array:
    dest.append(x)
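A quick concrete run of the equivalence, using a made-up word list:

words = ['this', 'is', 'a', 'test']

copy1 = [w for w in words]   # list comprehension

copy2 = []                   # equivalent explicit loop
for w in words:
    copy2.append(w)

print(copy1 == copy2)        # True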
Methods
Using the iteration variable x, methods and functions can be applied; their return values are stored in the resulting list.
[len(x) for x in array]
is equivalent to:
dest = []
for x in array:
    dest.append(len(x))
Conditionals
Elements from the original list can be omitted from the resulting list, using conditional statements.
[x for x in array if len(x) == 3]
is equivalent to:
dest = []
for x in array:
    if len(x) == 3:
        dest.append(x)
Building up
These can be combined to build up complicated lists.
[x.upper() for x in array if len(x) > 3 and x.startswith('t')]
is equivalent to:
dest = []
for x in array:
    if len(x) > 3 and x.startswith('t'):
        dest.append(x.upper())
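For example, with a hypothetical word list the combined comprehension behaves like this:

words = ['the', 'that', 'type', 'dog', 'token']

result = [w.upper() for w in words if len(w) > 3 and w.startswith('t')]
print(result)   # ['THAT', 'TYPE', 'TOKEN']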
Lists Containing Lists
Lists can contain lists:
[[a, 1], [b, 2], [d, 4]]
...or tuples:
[(a, 1), (b, 2), (d, 4)]
[[d, d*d] for d in array if d < 4]
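A concrete run of the pair-building comprehension, on a made-up list of numbers:

numbers = [1, 2, 3, 4, 5]

pairs = [[d, d * d] for d in numbers if d < 4]
print(pairs)         # [[1, 1], [2, 4], [3, 9]]

# The same idea with tuples as the inner elements:
tuple_pairs = [(d, d * d) for d in numbers if d < 4]
print(tuple_pairs)   # [(1, 1), (2, 4), (3, 9)]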
Using multiple lists
Multiple lists can be processed simultaneously in a list comprehension.
[x*y for x in array1 for y in array2]
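The nested for clauses walk over every combination of the two lists (the left-hand for acts as the outer loop). A small made-up example makes the iteration order visible:

array1 = [1, 2, 3]
array2 = [10, 100]

products = [x * y for x in array1 for y in array2]
print(products)   # [10, 100, 20, 200, 30, 300]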
List Comprehension Exercises
–Make a list of the first ten multiples of ten (10, 20, 30 ... 90, 100) using a list comprehension.
–Make a list of the first ten cubes (1, 8, 27 ... 1000) using a list comprehension.
–Store five names in a list. Make a second list that adds the phrase "is awesome!" to each name, using a list comprehension.
–Write out the following code without using a list comprehension:
  plus_thirteen = [number + 13 for number in range(1, 11)]
Exercises from: http://introtopython.org/all_exercises_challenges.html#ex_ch_12
Lists within lists are often called 2-d arrays
This is another way we store tables. Similar to nested dictionaries.
a = [[0, 1], [1, 0]]
a[1][1]
a[0][0]
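Indexing into the nested list works outer list first, then inner list:

a = [[0, 1],
     [1, 0]]

print(a[1][1])   # 0  (second inner list, second element)
print(a[0][0])   # 0  (first inner list, first element)
print(a[0][1])   # 1  (first inner list, second element)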
NumPy & Arrays
NumPy is a commonly used package for numerical calculations in Python. Its main object is a multidimensional array.
A[1]            List
A[1][2]         'Rectangular' 2-d Matrix
A[1][2][3]      'Cube/Prism' 3-d Matrix
A[1][2][3][4]   4-d Matrix
etc.
NumPy arrays
from numpy import *
a = array([1, 2, 3, 4])
a = array([[1, 2], [3, 4]])
a.ndim    Number of dimensions
a.shape   Length of each dimension
a.size    Total number of elements
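For example, using the more common import numpy as np style, the three attributes look like this:

import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])

print(a.ndim)    # 2        two dimensions (rows and columns)
print(a.shape)   # (2, 3)   2 rows, 3 columns
print(a.size)    # 6        total number of elements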
NumPy array initialization
>>> zeros( (3,4) )
array([[ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.]])
>>> ones( (2,3,4), dtype=int16 )
array([[[1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1]],

       [[1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1]]], dtype=int16)
>>> empty( (2,3) )
array([[  3.73603959e-262,   6.02658058e-154,   6.55490914e-260],
       [  5.30498948e-313,   3.14673309e-307,   1.00000000e+000]])
(empty() leaves the memory uninitialized, so its contents are arbitrary.)
Content Types
Arrays are homogeneous (ndarray)
–array([1, 3, 4], dtype=int16)
Lists are not homogeneous
–['abc', 123, [list1, list2]]
dtype describes the "type" of object in the array
–str, tuple, int, etc.
–numpy.int16, numpy.int32, numpy.float64, etc.
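A small sketch of how dtype constrains an array's contents: mixing types forces NumPy to pick a single common dtype for every element, whereas a plain list is free to mix types.

import numpy as np

ints = np.array([1, 3, 4], dtype=np.int16)
print(ints.dtype)    # int16

mixed = np.array([1, 2.5, 3])
print(mixed.dtype)   # float64   the ints were upcast to floats

anything = ['abc', 123, [1, 2]]   # a plain list happily holds anything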
zip
zip allows you to "zip" two lists together, creating a list of tuples.
names = ['Andrew', 'Beth', 'Charles']
ages = [35, 34, 33]
name_age = zip(names, ages)
–[('Andrew', 35), ('Beth', 34), ('Charles', 33)]
foreach vs. indexed for loops
"More pythonic":
for n, a in zip(names, ages):
    print "%s -- %s" % (n, a)
vs.
for i in xrange(len(names)):
    print "%s -- %s" % (names[i], ages[i])
map
map allows you to apply the same function to every item in a list.
a = ['1', '2', '4']
map(int, a)
map
Any function can be mapped over a list, but each element of the list must be a valid argument to that function.
def uppercase(s):
    return s.upper()
a = ['abc', 'def', 'ghi']
map(uppercase, a)
Functions as objects
A function name can be assigned to a variable. map is an example of this, where the first argument to map is a function object.
a = [1, 3, 4]
len(a)
sum(a)
functions = [len, sum]
for fn in functions:
    print str(fn), fn(a)
lambda
Lambda functions are single-use functions that do not need to be 'def'-ed. Using the uppercase example again:
def uppercase(s):
    return s.upper()
a = ['abc', 'def', 'ghi']
map(uppercase, a)
lambda
Lambda functions are single-use functions that do not need to be 'def'-ed. These are "anonymous" functions. Using the uppercase example again:
a = ['abc', 'def', 'ghi']
map(lambda s: s.upper(), a)
By design, a lambda is limited to a single expression.
Aside: glob
Construct a list of all filenames matching a pattern.
from glob import glob
glob('*.txt')
glob('/Users/andrew/Documents/*/*.ppt')
Linguistic Annotation
Text only takes us so far. People are reliable judges of linguistic behavior. We can model with machines, but for "gold-standard" truth, we ask people to make judgments about linguistic qualities.
Example Linguistic Annotations
–Sentence boundaries
–Part-of-speech tags
–Phonetic transcription
–Syntactic parse trees
–Speaker identity
–Semantic role
–Speech act
–Document topic
–Argument structure
–Word sense
–...and many, many more
We need...
Techniques to process these. Every corpus has its own format for linguistic annotation, so we need to parse annotation formats.
Constructing a linguistic corpus
Decisions that need to be made:
–Why are you doing this?
–What material will be collected?
–How will it be collected?
  Automatically or manually? Found material vs. laboratory language?
–What meta-information will be stored?
–What manual annotations are required?
  How will each annotation be defined? How many annotators will be used? How will agreement be assessed? How will disagreements be resolved?
–How will the material be disseminated?
  Is this covered by your IRB if the material is the result of a human-subjects protocol?
Part-of-Speech Tagging
Task: Given a string of words, identify the part of speech of each word.
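A minimal sketch of running an off-the-shelf tagger in NLTK (assumes nltk is installed and its tokenizer and tagger models have been downloaded; the exact tags depend on the tagger model):

import nltk

tokens = nltk.word_tokenize("The dog is fast.")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('dog', 'NN'), ('is', 'VBZ'), ('fast', 'JJ'), ('.', '.')]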
Part-of-Speech Tagging
Surface-level syntax. A primary operation for:
–Parsing
–Word sense disambiguation
–Semantic role labeling
–Segmentation (discourse, topic, sentence)
How is it done?
Learn from data.
–Annotated data: text in which each word is paired with its tag.
–Unlabeled data: raw text with no tags.
Learn the association from Tag to Word
Limitations
–Unseen tokens
–Uncommon interpretations
–Long-term dependencies
Format Conversion Exercise
Inline word/TAG format:
The/DET dog/NN is/VB fast/JJ ./.
Character-offset format (start, end, TAG) for "The dog is fast.":
1, 3, DET
5, 7, NN
9, 10, VB
12, 15, JJ
16, 16, .
Parsing
Generate a parse tree.
Parsing
Generate a parse tree from:
–The surface form (words) of the text
–Part-of-speech tags
Parsing Styles
Context-Free Grammars for Parsing
S → VP
S → NP VP
NP → Det Nom
Nom → Noun
Nom → Adj Nom
VP → Verb Nom
Det → "A", "The"
Noun → "I", "John", "Address"
Verb → "Gave"
Adj → "My", "Blue"
Adv → "Quickly"
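A sketch of how a grammar in this style can be loaded and run with NLTK (assumes a recent NLTK, which provides nltk.CFG.fromstring and nltk.ChartParser; the grammar below is a trimmed, lowercased variant of the one on this slide):

import nltk

grammar = nltk.CFG.fromstring("""
S -> VP
S -> NP VP
NP -> Det Nom
Nom -> Noun
Nom -> Adj Nom
VP -> Verb Nom
Det -> 'a' | 'the'
Noun -> 'i' | 'john' | 'address'
Verb -> 'gave'
Adj -> 'my' | 'blue'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse(['gave', 'my', 'blue', 'address']):
    print(tree)
# roughly: (S (VP (Verb gave) (Nom (Adj my) (Nom (Adj blue) (Nom (Noun address))))))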
Limitations
–The grammar must be built by hand.
–Can't handle ungrammatical sentences.
–Can't resolve ambiguity.
Probabilistic Parsing
–Assign each transition (rule) a probability.
–Find the parse with the greatest likelihood.
–Build a table and count: how many times does each transition occur?
–This is structured learning.
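A toy sketch of the counting step: estimate the probability of each rule given its left-hand side from rule counts gathered over a treebank (the counts below are invented).

from collections import Counter, defaultdict

# Pretend these counts were collected from parsed training trees.
rule_counts = Counter({
    ('NP', ('Det', 'Nom')): 60,
    ('NP', ('Noun',)): 40,
    ('Nom', ('Noun',)): 70,
    ('Nom', ('Adj', 'Nom')): 30,
})

lhs_totals = defaultdict(int)
for (lhs, rhs), count in rule_counts.items():
    lhs_totals[lhs] += count

for (lhs, rhs), count in rule_counts.items():
    prob = count / float(lhs_totals[lhs])
    print('%s -> %s : %.2f' % (lhs, ' '.join(rhs), prob))
# NP -> Det Nom : 0.60
# NP -> Noun : 0.40
# Nom -> Noun : 0.70
# Nom -> Adj Nom : 0.30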
Segmentation
–Sentence segmentation
–Topic segmentation
–Speaker segmentation
–Phrase chunking (NP, VP, PP, subclauses, etc.)