Python NumPy AILab Batselem Jagvaral 2016 March
What is NumPy?
NumPy (Numerical Python) is a package for scientific computing. It provides tools for handling n-dimensional arrays (especially vectors and matrices). All elements of a NumPy array have the same type. The package offers a large number of routines for fast access to data (e.g. search, extraction), for various manipulations (e.g. sorting), for calculations (e.g. statistical computing), etc.
Overview
Broadcasting
    Array Broadcasting
    Broadcasting rules
Fancy indexing and index tricks
    Indexing with Arrays of indices
    Indexing with Boolean Arrays
    The ix_() function
    Indexing with strings
Linear Algebra
    Simple Array Operations
Tricks and Tips
    “Automatic” Reshaping
    Vector Stacking
    Histograms
Broadcasting
Broadcasting
Broadcasting allows us to deal with inputs that do not have exactly the same shape. NumPy operations are usually done on pairs of arrays on an element-by-element basis. In the simplest case, the two arrays must have exactly the same shape, as in the first example below. NumPy's broadcasting rule relaxes this constraint when the arrays' shapes meet certain conditions.
Multiplying two arrays of the same shape:
>>> a = np.array([[1, 2], [3, 4]])
>>> b = np.array([[2, 2], [2, 2]])
>>> a * b
array([[2, 4],
       [6, 8]])
Multiplying an array by a scalar:
>>> a = np.array([[1, 2], [3, 4]])
>>> b = 2
>>> a * b
array([[2, 4],
       [6, 8]])
The result is equivalent to the previous example, where b was an array. We can think of the scalar b being stretched during the arithmetic operation into an array with the same shape as a. The new elements in b are simply copies of the original scalar. NumPy is smart enough to use the original scalar value without actually making copies, so broadcasting operations are as memory and computationally efficient as possible. The code in the second example is more efficient than that in the first because broadcasting moves less memory around during the multiplication (b is a scalar rather than an array).
[Figure: 2x2 * 2x2 = 2x2 without broadcasting, and 2x2 * 1x1 = 2x2, where broadcasting occurs.]
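The claim that the scalar is stretched without being copied can be checked directly. The following is a minimal sketch; the names b_scalar, b_full and b_view are illustrative and not from the slides.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b_scalar = 2
b_full = np.full_like(a, 2)          # explicit 2x2 array filled with 2

# Both forms give the same element-wise product.
same = np.array_equal(a * b_scalar, a * b_full)

# broadcast_to returns a read-only view of the scalar with shape (2, 2).
# Strides of 0 mean every element points at the same memory location.
b_view = np.broadcast_to(b_scalar, a.shape)

print(same)              # True
print(b_view.strides)    # (0, 0)
```

Zero strides are how the "stretching" is represented internally: the shape grows, but no new data is allocated.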
Broadcasting
In the following example, both operands have an axis of length one that is expanded to a larger size during the broadcast operation: A[:, None] has shape 4x1 and B is treated as 1x3, so the result is 4x3.
>>> A = np.array([0,10,20,30])
>>> B = np.array([0,1,2])
>>> y = A[:, None] + B
The smaller array is “broadcast” across the larger array so that they have compatible shapes.
[Figure: a 4x1 array and a 1x3 array are each stretched to 4x3.]
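The slide does not show the value of y. A short sketch that prints the intermediate shapes and the result:

```python
import numpy as np

A = np.array([0, 10, 20, 30])
B = np.array([0, 1, 2])

col = A[:, None]          # None inserts a new axis: shape (4,) becomes (4, 1)
y = col + B               # (4, 1) + (3,) broadcasts to (4, 3)

print(col.shape, B.shape, y.shape)   # (4, 1) (3,) (4, 3)
print(y)
# [[ 0  1  2]
#  [10 11 12]
#  [20 21 22]
#  [30 31 32]]
```

Each row of y is one element of A added to all of B, which is the same table np.add.outer(A, B) would produce.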
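Broadcasting Arrays
[Figure: three additions that give the same 4x3 result: 4x3 + 4x3 (no stretching), 4x3 + length-3 (the second operand is stretched), and 4x1 + length-3 (both operands are stretched).]
The result is equivalent to the previous examples.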
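Broadcasting rules
When operating on two arrays, NumPy compares their shapes element-wise, starting with the trailing dimensions and working backward. Two dimensions are compatible when they are equal, or when one of them is 1.
If these conditions are not met, a ValueError is raised, indicating that the arrays have incompatible shapes. Current NumPy versions report "operands could not be broadcast together"; older versions reported "frames are not aligned".
[Figure: a 4x3 array and a length-4 array; the trailing dimensions 3 and 4 do not match.]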
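The rule can be written out as a few lines of Python. This is a sketch for illustration only: compatible is a hypothetical helper, not a NumPy function, and NumPy applies the rule internally.

```python
import numpy as np

def compatible(shape_a, shape_b):
    """Hypothetical helper: True if the two shapes can be broadcast together."""
    # Walk both shapes from the trailing dimension backward.
    for m, n in zip(reversed(shape_a), reversed(shape_b)):
        if m != n and m != 1 and n != 1:
            return False
    return True   # missing leading dimensions are treated as length 1

print(compatible((4, 3), (3,)))    # True
print(compatible((4, 1), (3,)))    # True
print(compatible((4, 3), (4,)))    # False

# The mismatching case from the slide raises ValueError in NumPy.
try:
    np.ones((4, 3)) + np.ones(4)
    raised = False
except ValueError:
    raised = True
print(raised)                      # True
```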
Fancy indexing and index tricks
Indexing with Arrays of Indices
NumPy offers more indexing facilities than regular Python sequences. In addition to indexing by integers and slices, arrays can be indexed by arrays of integers and arrays of booleans.
>>> a = np.arange(6)**2 # the first 6 square numbers
>>> i = np.array([ 0,0,1,3 ]) # an array of indices
>>> a[i] # the elements of a at the positions i
array([0, 0, 1, 9])
>>> j = np.array( [ [ 1, 2], [ 0, 4 ] ] ) # a bidimensional array of indices
>>> a[j] # the same shape as j
array([[ 1,  4],
       [ 0, 16]])
Unlike slicing, fancy indexing creates a copy instead of a view into the original array.
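The copy-versus-view difference is easy to observe by writing into the result. A minimal sketch (the names s and f are illustrative):

```python
import numpy as np

a = np.arange(6) ** 2        # [ 0  1  4  9 16 25]

s = a[1:4]                   # basic slice: a view into a
f = a[[1, 2, 3]]             # fancy indexing: a new array

print(np.shares_memory(a, s))   # True
print(np.shares_memory(a, f))   # False

s[0] = -1                    # writes through to a[1]
f[1] = -99                   # changes only the copy
print(a)                     # [ 0 -1  4  9 16 25]
```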
Indexing with Arrays of Indices
When the indexed array a is multidimensional, a single array of indices refers to the first dimension of a. The following example shows this behavior by converting an image of labels into a color image using a palette.
>>> palette = np.array( [ [0,0,0],         # black
...                       [255,0,0],       # red
...                       [0,255,0],       # green
...                       [0,0,255],       # blue
...                       [255,255,255] ] ) # white
>>> image = np.array( [ [ 0, 1, 2, 0 ],  # each value corresponds to a color in the palette
...                     [ 0, 3, 4, 0 ] ] )
>>> palette[image] # the (2,4,3) color image
array([[[  0,   0,   0],
        [255,   0,   0],
        [  0, 255,   0],
        [  0,   0,   0]],
       [[  0,   0,   0],
        [  0,   0, 255],
        [255, 255, 255],
        [  0,   0,   0]]])
[Figure: the 5x3 palette indexed by the 2x4 label image gives a 2x4x3 color image.]
Indexing with Arrays of Indices
We can also give indexes for more than one dimension. The arrays of indices for each dimension must have the same shape.
Indexing arrays for multiple dimensions:
>>> a = np.arange(12).reshape(3,4)
>>> a
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> i = np.array( [ [0,1],   # indices for the first dim of a
...                 [1,2] ] )
>>> j = np.array( [ [2,1],   # indices for the second dim of a
...                 [3,3] ] )
>>> a[i,j] # i and j must have equal shape
array([[ 2,  5],
       [ 7, 11]])
Indexing with a fixed index:
>>> a[i,2]
array([[ 2,  6],
       [ 6, 10]])
Indexing with complete row slices:
>>> a[:,j] # i.e., a[ : , j]
array([[[ 2,  1],
        [ 3,  3]],
       [[ 6,  5],
        [ 7,  7]],
       [[10,  9],
        [11, 11]]])
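Indexing with Arrays of Indices
We can also put i and j in a sequence and index with that sequence, but it must be a tuple. Recent NumPy versions interpret a list of arrays as a single index array for the first dimension rather than as one index per dimension, so a list no longer behaves like a[i,j].
>>> l = (i, j)
>>> a[l] # equivalent to a[i,j]
array([[ 2,  5],
       [ 7, 11]])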
Indexing with Arrays of Indices
Another common use of indexing with arrays is the search for the maximum value of time-dependent series:
>>> time = np.linspace(20, 145, 5) # time scale
>>> data = np.sin(np.arange(20)).reshape(5,4) # 4 time-dependent series
>>> time
array([ 20.  ,  51.25,  82.5 , 113.75, 145.  ])
>>> data
array([[ 0.        ,  0.84147098,  0.90929743,  0.14112001],
       [-0.7568025 , -0.95892427, -0.2794155 ,  0.6569866 ],
       [ 0.98935825,  0.41211849, -0.54402111, -0.99999021],
       [-0.53657292,  0.42016704,  0.99060736,  0.65028784],
       [-0.28790332, -0.96139749, -0.75098725,  0.14987721]])
>>> ind = data.argmax(axis=0) # index of the maxima for each series
>>> ind
array([2, 0, 3, 1])
>>> time_max = time[ ind ] # times corresponding to the maxima
>>> time_max
array([ 82.5 ,  20.  , 113.75,  51.25])
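The same index array can also pull out the maximum values themselves by pairing each row index in ind with its column number. A sketch continuing the example (data_max is an illustrative name):

```python
import numpy as np

time = np.linspace(20, 145, 5)
data = np.sin(np.arange(20)).reshape(5, 4)

ind = data.argmax(axis=0)                       # row of the maximum in each column
data_max = data[ind, np.arange(data.shape[1])]  # pairs (ind[k], k) for each column k

# The result matches the direct column-wise maximum.
print(np.array_equal(data_max, data.max(axis=0)))   # True
print(time[ind])
```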
Indexing with Arrays of Indices
You can also use indexing with arrays as a target to assign to:
>>> a = np.arange(5)
>>> a
array([0, 1, 2, 3, 4])
>>> a[[1,3,4]] = 0
>>> a
array([0, 0, 2, 0, 0])
However, when the list of indices contains repetitions, the assignment is done several times, leaving behind the last value:
>>> a = np.arange(5)
>>> a
array([0, 1, 2, 3, 4])
>>> a[[0,0,2]]=[1,2,3]
>>> a
array([2, 1, 3, 3, 4])
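Repeated indices have a related consequence for in-place operators such as +=, which the slide does not cover. The sketch below contrasts += with np.add.at, which accumulates once per occurrence:

```python
import numpy as np

a = np.arange(5)
a[[0, 0, 2]] += 1          # index 0 appears twice but is incremented only once
print(a)                   # [1 1 3 3 4]

b = np.arange(5)
np.add.at(b, [0, 0, 2], 1) # unbuffered: index 0 is incremented twice
print(b)                   # [2 1 3 3 4]
```

The += form expands to a[[0,0,2]] = a[[0,0,2]] + 1, so the right-hand side is computed once and the repeated assignment simply writes the same value twice.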
Boolean or “mask” Index Arrays
When we index arrays with arrays of (integer) indices, we provide the list of indices to pick. With boolean indices the approach is different: we explicitly choose which items in the array we want and which ones we don't. The most natural way to do boolean indexing is to use a boolean array that has the same shape as the original array:
>>> a = np.arange(12).reshape(3,4)
>>> mask = a > 4
>>> mask # mask is a boolean array with a's shape
array([[False, False, False, False],
       [False,  True,  True,  True],
       [ True,  True,  True,  True]])
>>> a[mask] # 1d array with the selected elements
array([ 5,  6,  7,  8,  9, 10, 11])
Unlike with integer index arrays, in the boolean case the result is a 1-D array containing all the elements of the indexed array that correspond to True elements in the boolean array.
Boolean or “mask” Index Arrays
This property can be very useful in assignments:
>>> a[mask] = 0 # all elements of a greater than 4 become 0
>>> a
array([[0, 1, 2, 3],
       [4, 0, 0, 0],
       [0, 0, 0, 0]])
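Masks can also be combined before they are used for selection or assignment. The following sketch goes slightly beyond the slide and uses a second condition (even values) chosen purely for illustration:

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

# Element-wise AND of two boolean arrays; the parentheses are required
# because & binds more tightly than the comparison operators.
mask = (a > 4) & (a % 2 == 0)

selected = a[mask]
print(selected)        # [ 6  8 10]

a[mask] = 0            # assign through the combined mask
print(a)
# [[ 0  1  2  3]
#  [ 4  5  0  7]
#  [ 0  9  0 11]]
```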
Indexing with Boolean Arrays
The second way of indexing with booleans is more similar to integer indexing: for each dimension of the array we give a 1D boolean array selecting the slices we want.
>>> a = np.arange(12).reshape(3,4)
>>> b1 = np.array([False,True,True]) # first dim selection
>>> b2 = np.array([True,False,True,False]) # second dim selection
>>> a[b1,:] # selecting rows
array([[ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
[Figure: the row mask (F, T, T) keeps the second and third rows of the 3x4 array.]
Indexing with Boolean Arrays
>>> a = np.arange(12).reshape(3,4)
>>> b1 = np.array([False,True,True]) # first dim selection
>>> b2 = np.array([True,False,True,False]) # second dim selection
>>> a[:,b2] # selecting columns
array([[ 0,  2],
       [ 4,  6],
       [ 8, 10]])
>>> a[b1,b2]
array([ 4, 10])
>>> a[np.ix_(b1,b2)]
array([[ 4,  6],
       [ 8, 10]])
np.ix_ takes advantage of NumPy's broadcasting facilities to select every combination of the chosen rows and columns. Without the np.ix_ call, the two masks are paired element by element, so only the diagonal elements of that block are selected.
[Figure: mask1 (F, T, T) selects rows and mask2 (T, F, T, F) selects columns; together they pick the 2x2 block 4, 6, 8, 10.]
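The difference between the paired selection and the block selection can be verified in a few lines. A sketch using the same arrays (block, chained and paired are illustrative names):

```python
import numpy as np

a = np.arange(12).reshape(3, 4)
b1 = np.array([False, True, True])         # rows 1 and 2
b2 = np.array([True, False, True, False])  # columns 0 and 2

block = a[np.ix_(b1, b2)]   # every (row, column) combination
chained = a[b1][:, b2]      # same block, selecting rows first and then columns
paired = a[b1, b2]          # pairs (1, 0) and (2, 2) only

print(np.array_equal(block, chained))          # True
print(np.array_equal(paired, np.diag(block)))  # True
```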
The ix_() function: Special indexing field
Basic slicing uses start:stop:step notation inside brackets, for example a[1:3, :], a[2,3], a[5:].
>>> a = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> a[1:7:2]
array([1, 3, 5])
The numpy.ix_ function generates indexes for irregular slices, picking out arbitrary rows and columns:
MATLAB                NumPy                           Description
a( [2,4,5],[3,6] )    a[ np.ix_( [1,3,4],[2,5] ) ]    rows 2, 4 and 5 and columns 3 and 6 (counted from 1; NumPy indices start at 0)
>>> a = np.arange(42).reshape(6,7)
>>> a[np.ix_([1,3,4],[2,5])]
array([[ 9, 12],
       [23, 26],
       [30, 33]])
Reduce Operation
reduce lowers a's dimension by one by applying an arithmetic function along one axis.
Summing up each row:
>>> a = np.arange(12).reshape(3,4)
>>> a
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> np.add.reduce(a, 1)
array([ 6, 22, 38])
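The same reduce method exists on other binary ufuncs and accepts any axis. A short sketch comparing it with the equivalent array methods:

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

row_sums = np.add.reduce(a, 1)       # collapse axis 1: one sum per row
col_sums = np.add.reduce(a, 0)       # collapse axis 0: one sum per column
row_max = np.maximum.reduce(a, 1)    # reduce works for other ufuncs too

print(row_sums)    # [ 6 22 38]
print(col_sums)    # [12 15 18 21]
print(row_max)     # [ 3  7 11]

# a.sum(axis=1) is the more common spelling of the first line.
print(np.array_equal(row_sums, a.sum(axis=1)))   # True
```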
Linear Algebra
Simple Array Operations
All functions we have seen so far operate element-wise on arrays. For linear algebra we need scalar, matrix-vector and matrix-matrix products.
>>> A = np.array([[3, 4], [2, 3]])
>>> print(A)
[[3 4]
 [2 3]]
>>> A.transpose() # the transpose, as in MATLAB
array([[3, 2],
       [4, 3]])
>>> np.linalg.inv(A) # compute the (multiplicative) inverse of a matrix
array([[ 3., -4.],
       [-2.,  3.]])
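A quick way to check a computed inverse is to multiply it back and compare with the identity. A minimal sketch; np.allclose is used because the inverse is computed in floating point:

```python
import numpy as np

A = np.array([[3, 4], [2, 3]])
A_inv = np.linalg.inv(A)

# Both products should equal the 2x2 identity up to rounding error.
left = A @ A_inv
right = A_inv @ A
print(np.allclose(left, np.eye(2)), np.allclose(right, np.eye(2)))   # True True

# Transposing twice returns the original matrix.
print(np.array_equal(A.T.T, A))   # True
```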
Simple Array Operations: Inverse of a 3x3 Array
The inverse of an n x n matrix A satisfies $A A^{-1} = A^{-1} A = I_n$, where $I_n$ is the identity matrix. If the determinant of A is zero, then the inverse of A does not exist.
>>> a = np.arange(9).reshape(3,3)
>>> np.linalg.det(a)
0.0
>>> np.linalg.inv(a)
numpy.linalg.LinAlgError: Singular matrix
Determinant of the array a:
$$\det A = a_{11}a_{22}a_{33} + a_{21}a_{32}a_{13} + a_{31}a_{12}a_{23} - a_{11}a_{32}a_{23} - a_{31}a_{22}a_{13} - a_{21}a_{12}a_{33}$$
$$\det(a) = 0\cdot 4\cdot 8 + 3\cdot 7\cdot 2 + 6\cdot 1\cdot 5 - 0\cdot 7\cdot 5 - 6\cdot 4\cdot 2 - 3\cdot 1\cdot 8 = 0$$
so $a^{-1}$ is undefined.
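The six-term determinant expansion can be coded directly and compared with np.linalg.det. This is a sketch: det3 is a hypothetical helper written for this example, and the second matrix b is an arbitrary invertible example that does not appear in the slides.

```python
import numpy as np

def det3(m):
    """Hypothetical helper: determinant of a 3x3 array by the six-term expansion."""
    return (m[0, 0] * m[1, 1] * m[2, 2] + m[1, 0] * m[2, 1] * m[0, 2]
            + m[2, 0] * m[0, 1] * m[1, 2] - m[0, 0] * m[2, 1] * m[1, 2]
            - m[2, 0] * m[1, 1] * m[0, 2] - m[1, 0] * m[0, 1] * m[2, 2])

a = np.arange(9).reshape(3, 3)
d_a = det3(a)
print(d_a)                                   # 0

b = np.array([[2, 0, 1], [1, 3, 2], [1, 1, 2]])
d_b = det3(b)
print(d_b)                                   # 6
print(np.isclose(d_b, np.linalg.det(b)))     # True

# Guarding an inversion: LinAlgError is what NumPy raises for a singular matrix.
try:
    np.linalg.inv(a)
    inverse_exists = True
except np.linalg.LinAlgError:
    inverse_exists = False
```

Catching np.linalg.LinAlgError is usually more robust than testing the determinant against zero, since floating-point determinants of nearly singular matrices are rarely exactly 0.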
Simple Array Operations
Eye array (identity matrix):
>>> u = np.eye(2) # unit 2x2 matrix; "eye" represents "I"
>>> u
array([[1., 0.],
       [0., 1.]])
Dot product:
>>> a = np.array([[1, 2, 3], [4, 5, 6]])
>>> b = np.array([[7, 8], [9, 10], [11, 12]])
>>> np.dot(a,b)
array([[ 58,  64],
       [139, 154]])
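For 2D arrays, the @ operator computes the same matrix product as np.dot, and it also covers the matrix-vector case. A sketch with an arbitrary vector v that is not part of the slides:

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])        # shape (2, 3)
b = np.array([[7, 8], [9, 10], [11, 12]])   # shape (3, 2)
v = np.array([1, 0, -1])                    # arbitrary length-3 vector

ab = a @ b                  # matrix-matrix product, shape (2, 2)
av = a @ v                  # matrix-vector product, shape (2,)

print(np.array_equal(ab, np.dot(a, b)))     # True
print(av)                                   # [-2 -2]

# Multiplying by the identity leaves the matrix unchanged.
print(np.array_equal(np.eye(2) @ ab, ab))   # True
```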
Tips and Tricks
“Automatic” Reshaping
To change the dimensions of an array, you can omit one of the sizes, which will then be deduced automatically.
Constructing a 3D array:
>>> a = np.arange(30)
>>> a.shape = 2,-1,3 # -1 means "whatever is needed"
>>> a.shape
(2, 5, 3)
>>> a
array([[[ 0,  1,  2],
        [ 3,  4,  5],
        [ 6,  7,  8],
        [ 9, 10, 11],
        [12, 13, 14]],
       [[15, 16, 17],
        [18, 19, 20],
        [21, 22, 23],
        [24, 25, 26],
        [27, 28, 29]]])
The missing size A follows from 2 x A x 3 = 30, so A = 5.
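The same -1 placeholder works with the reshape method, which returns a new array object instead of modifying a in place. A sketch that also shows what happens when no integer size fits:

```python
import numpy as np

a = np.arange(30)

b = a.reshape(2, -1, 3)     # -1 is deduced as 30 / (2 * 3) = 5
print(b.shape)              # (2, 5, 3)
print(a.shape)              # (30,)  a itself is unchanged

flat = b.reshape(-1)        # a single -1 flattens to 1-D
print(flat.shape)           # (30,)

# 30 is not divisible by 4, so no size can be deduced.
try:
    a.reshape(4, -1)
    failed = False
except ValueError:
    failed = True
print(failed)               # True
```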
Vector Stacking
How do we construct a 2D array from a list of equally-sized row vectors? In MATLAB this is quite easy: if x and y are two vectors of the same length, you only need to do m=[x;y]. In NumPy this works via the functions column_stack, dstack, hstack and vstack. For example:
>>> x = np.arange(0,10,2) # x=([0,2,4,6,8])
>>> y = np.arange(5) # y=([0,1,2,3,4])
>>> m = np.vstack([x,y]) # m=([[0,2,4,6,8],
                         #     [0,1,2,3,4]])
>>> xy = np.hstack([x,y]) # xy=([0,2,4,6,8,0,1,2,3,4])
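The slide names four stacking functions but only demonstrates two. The sketch below compares the result shapes of all four for the same pair of vectors:

```python
import numpy as np

x = np.arange(0, 10, 2)     # [0 2 4 6 8]
y = np.arange(5)            # [0 1 2 3 4]

m = np.vstack([x, y])         # vectors become rows
c = np.column_stack([x, y])   # vectors become columns
h = np.hstack([x, y])         # vectors are joined end to end
d = np.dstack([x, y])         # vectors are paired along a third axis

print(m.shape, c.shape, h.shape, d.shape)   # (2, 5) (5, 2) (10,) (1, 5, 2)

# For 1-D inputs, column_stack is the transpose of vstack.
print(np.array_equal(c, m.T))               # True
```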
Numpy Histogram vs Hist matplotlib
Plot a histogram with Matplotlib hist
hist (Matplotlib) plots the histogram automatically. numpy.random.normal(mean, std, size) draws random samples from a normal (Gaussian) distribution.
>>> import numpy as np
>>> import matplotlib.pyplot as plt
# Build a vector of 10000 normal deviates with variance 0.5^2 and mean 2
>>> mu, sigma = 2, 0.5
>>> v = np.random.normal(mu,sigma,10000)
# Plot a normalized histogram with 50 bins
>>> plt.hist(v, bins=50, density=True)
>>> plt.show()
density=True means that the histogram is normalized so that the area under it is 1. Older code used normed=1 for this; that argument has been removed from recent Matplotlib releases.
Numpy Histogram vs Hist matplotlib
numpy.histogram(a, bins, density=True) returns the pair (v, bins): the bin values v and the bin edges bins. Older code used normed=True, which recent NumPy releases no longer accept.
>>> a = np.random.normal(mu, sigma, 10)
>>> (v, bins) = np.histogram(a, bins=5, density=True)
>>> bins
array([ 1.51704794, 1.80359328, 2.09013862, 2.37668396, 2.6632293, 2.94977464])
>>> v
array([ 1.39593965, 0.34898491, 1.04695474, 0.34898491, 0.34898491])
The values shown are one example run; they vary with the random sample.
Length(bins) = Length(v) + 1
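The normalization and the length relation can both be checked numerically. A sketch assuming a seeded generator (np.random.default_rng with an arbitrary seed) so the run is repeatable; the slides use np.random.normal directly.

```python
import numpy as np

rng = np.random.default_rng(0)       # seed chosen arbitrarily for repeatability
mu, sigma = 2, 0.5
a = rng.normal(mu, sigma, 10)

v, bins = np.histogram(a, bins=5, density=True)

# bins holds the edges, so there is one more edge than there are bins.
print(len(v), len(bins))             # 5 6

# With density=True the bar areas (height * width) sum to 1.
area = np.sum(v * np.diff(bins))
print(np.isclose(area, 1.0))         # True
```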
Numpy Histogram vs Hist matplotlib
Beware: matplotlib also has a function to build histograms (called hist, as in MATLAB) that differs from the one in NumPy. The main difference is that hist plots the histogram automatically, while numpy.histogram only generates the data.
>>> mu, sigma = 2, 0.5
>>> a = np.random.normal(mu,sigma,10000)
>>> (v, bins) = np.histogram(a, bins=5, density=True)
>>> plt.plot(bins[1:], v)
Because bins has one more element than v, the plot uses bins[1:] so that both arguments have the same length.
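Plotting against bins[1:] places each value at the right edge of its bin. A common alternative is to use the bin centers. The sketch below computes them and, as a sanity check, estimates the mean from the histogram; the seed and the 50-bin choice are assumptions for illustration, and the plotting call is left out so the snippet runs without Matplotlib.

```python
import numpy as np

rng = np.random.default_rng(0)            # seed chosen arbitrarily
mu, sigma = 2, 0.5
a = rng.normal(mu, sigma, 10000)

v, bins = np.histogram(a, bins=50, density=True)

centers = 0.5 * (bins[:-1] + bins[1:])    # midpoint of each bin, same length as v
widths = np.diff(bins)

# Mean estimated from the histogram: sum of center * probability mass per bin.
est_mean = np.sum(centers * v * widths)
print(len(centers) == len(v))             # True
print(est_mean)                           # close to mu = 2

# To draw the curve: plt.plot(centers, v)
```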
Thank you
The ix_() function: Special indexing field
numpy.ix_(*args)
>>> ax, bx = np.ix_([1,3,4],[2,5])
>>> ax, bx
(array([[1],
       [3],
       [4]]), array([[2, 5]]))
>>> ax.shape, bx.shape
((3, 1), (1, 2))
ix_ works by taking advantage of NumPy's broadcasting facilities. The two arrays used as row and column indices have different shapes; broadcasting repeats each along its length-one axis so that they conform.
[Figure: in the 6x7 array holding 0 to 41, the row indices 1, 3, 4 and the column indices 2, 5 are broadcast against each other to pick a 3x2 block.]
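Since ix_ only reshapes its arguments, the same selection can be written by adding the axes by hand. A sketch (rows, cols and manual are illustrative names):

```python
import numpy as np

a = np.arange(42).reshape(6, 7)
rows = np.array([1, 3, 4])
cols = np.array([2, 5])

ax, bx = np.ix_(rows, cols)                # shapes (3, 1) and (1, 2)
via_ix = a[ax, bx]

# Same effect with explicit new axes: (3, 1) and (1, 2) broadcast to (3, 2).
manual = a[rows[:, None], cols[None, :]]

print(via_ix)
# [[ 9 12]
#  [23 26]
#  [30 33]]
print(np.array_equal(via_ix, manual))      # True
```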
Backup: numpy.linspace
numpy.linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None)
Return evenly spaced numbers over a specified interval. Returns num evenly spaced samples, calculated over the interval [start, stop].
Parameters:
    start : scalar
        The starting value of the sequence.
    stop : scalar
        The end value of the sequence.
    num : int, optional
        Number of samples to generate.
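A short sketch of the endpoint and retstep parameters from the signature above, using the time scale from the earlier slide plus an arbitrary second interval:

```python
import numpy as np

# Default: both ends of the interval are included.
xs = np.linspace(20, 145, 5)
print(xs)                   # [ 20.    51.25  82.5  113.75 145.  ]

# endpoint=False excludes stop; retstep=True also returns the spacing.
xs_open, step = np.linspace(0, 1, 5, endpoint=False, retstep=True)
print(xs_open)              # [0.  0.2 0.4 0.6 0.8]
print(step)                 # 0.2
```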