Scipy 'Ecosystem' containing a variety of scientific packages including iPython, numpy, matplotlib, and pandas. numpy is both a system for constructing multi-dimensional data structures and a scientific library. http://www.numpy.org/ The overarching collection for the numpy and pandas packages is scipy, a grouping of packages that includes matplotlib and the IPython project. At the root of many of these packages is compatibility with numpy, which is a data analysis library, but, perhaps more importantly, provides a structure for the construction of multi-dimensional data arrays.
ndarray ndarray or numpy.array (alias) is the basic data format. a = numpy.array([2,3,4]) Make array with list or tuple (NOT numbers) a = numpy.fromfile(file, dtype=float, count=-1, sep='') dtype Allows the construction of multi-type arrays count Number of values to read (-1 == all) sep Separator To generate using a function that acts on each element of a shape: numpy.fromfunction(function, shape, **kwargs) The core data type is the ndarray, or its alias numpy.array. As we'll see, the key advantage of this data format is the ability to do multi-dimensional slices. Note the potential confusion between numpy.array and array.array in Python, if you import both. The above slide shows three ways of constructing an ndarray. The usual way is from lists or a file. For a dtype example, see: https://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.fromfile.html#numpy.fromfile For a function example, see: https://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.fromfunction.html#numpy.fromfunction
Built in functions a = numpy.zeros( (3,4) ) array([[ 0., 0., 0., 0.], [ 0., 0., 0., 0.], [ 0., 0., 0., 0.]]) Also numpy.ones and numpy.empty (generates very small floats) numpy.random.random((2,3)) More generic : ndarray.fill(value) numpy.putmask() Put values based on a Boolean mask array (True == replace) There are a variety of functions that produce standardised arrays. For example, numpy.zeros takes in the size of an array as a tuple or dimension sizes, and makes an array containing zeros of the right size. numpy.empty does the same thing but doesn't set the contents. In practice this means the array is full of very small floats. If you have an array and you want to fill it with a specific number, use .fill().
arange Like range but generates arrays: a = np.arange( 1, 10, 2 ) array([1, 3, 5, 7, 9]) Can use with floating point numbers, but precision issues mean better to use: a = np.linspace(start, end, numberOfNumbersBetween) Note that with linspace "end" is generated. For a range of numbers, use arrange(). This generates a sequence like the standard "range", but in a 1D ndarray. We'll see in a bit how to then convert that to a multi-dimension array. While this can be used with floating point numbers, because of precision issues, it is better to use linspace to construct a set number of floats falling within a definite interval.
ndarray Options set with numpy.set_printoptions, including ndarray.ndim Number of axes (dimensions) ndarray.shape Length of different dimensions ndarray.size Total data amount ndarray.dtype Data type in the array (standard or numpy) print(array) Will print the array nicely, but if too larger to print nicely will print with "…," across central points. Options set with numpy.set_printoptions, including numpy.set_printoptions(threshold = None) Each ndarray has a set of attributes automatically set up, as above, which can be accessed to determine information about it. Printing an array will 'pretty print' it, more specifically, if it is too large, the middle numbers will be replaced by "…". This can be turned off with the set_printoptions function, as shown. https://docs.scipy.org/doc/numpy/reference/generated/numpy.set_printoptions.html
Platform independent save Save/Load data in numpy .npy / .npz format numpy.save(file, arr, allow_pickle=True, fix_imports=True) arr = numpy.load(file, mmap_mode=None, allow_pickle=True, fix_imports=True, encoding='ASCII') Although one can write out such arrays with standard text methods, there is also a platform-independent format for quick and effective data storage for use with numpy and related packages specifically. Note that in the above "arr" is the array to save/load into. https://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.save.html#numpy.save https://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.load.html#numpy.load
Indexing Data locations are referenced using [row, col] (for 2D): arrayA[1,2] not arrayA[1][2] This means we can slice across multiple dimensions, one of the most useful aspects on numpy arrays: a = arrayA[1:3,:] array of the 2nd and 3rd row, all columns. b = arrayA[:, 1] array of all values in second column. You can also use … to represent "the rest": a[4,...,5,:] == a[4,:,:,5,:]. The most obvious difference between ndarrays and standard Python lists and tuples is that the indexing is with a list of multiple dimensions, not multiple lists of single dimensions. The second is that, using this format, slices can be done across multiple dimesnions. Numpy also allocates the optional (but unused in standard Python) symbol "…" to mean "all the rest". For example, a[4,…] means "all the rest in the 4th row of the 2D array "a". Note that you have to avoid ambiguities with this, so, instead of: a[4,:,:,5,:]. We can say: a[4,...,5,:] but not a[4,...,5,…] As it wouldn't be clear which dimension the "5" referred to, whereas in a[4,...,5,:] it is clearly the second to last.
Indexing Can use numpy arrays to pull out values: j = np.array( [ [ 3, 4], [ 9, 7 ] ] ) a[j] Can also use Boolean arrays, with "True" values indicating values we want: mask = numpy.array([False,True,False]) a = numpy.array([1,2,3]) a[mask] == [2] Numpy has something called 'Structured arrays' which allow named columns, but these are better done through their wrappers in Pandas. We can also pull multiple values out of specific locations within arrays. For more on structured arrays, see: https://docs.scipy.org/doc/numpy-dev/user/basics.rec.html#structured-arrays
Shape changing To take the current values and force them into a different shape, use reshape, for example: a = numpy.arange(12).reshape(3,4) resize changes the underlying array. numpy.squeeze() removes empty dimensions arrayA.flat gives you all elements as an iterator arrayA.ravel() gives you the array flattened arrayA.T gives you array with rows and columns transposed (note not a function) There are a variety of functions for altering the shape of arrays. Note that reshape is how we would generate multi-dimensional arrays using arange.
Concatinating More generic is which allows you to say which axis. a = numpy.vstack((arrayA,arrayB)) Stack arrays vertically a = numpy.hstack((arrayA,arrayB)) Stack arrays horizontally column_stack stacks 1D arrays as columns. More generic is numpy.concatenate((a1, a2, ...), axis=0) which allows you to say which axis. To add arrays together, use the above functions.
Broadcasting The way data is filled or reused if arrays being used together are different shapes. For example, small arrays will usually be "stretched" - the data in them repeated. See: https://docs.scipy.org/doc/numpy-dev/user/basics.broadcasting.html http://scipy.github.io/old-wiki/pages/EricsBroadcastingDoc One of the most Pythonic aspects of ndarrays is that if you pass them into functions that expect arrays of a specific size (for example because it works with two arrays of the same size), arrays will resize temporarily so they work in an intuitive manner. The filling process is known as 'broadcasting'. By and large, it is better not to rely on broadcasting - understand your array sizes and make sure they are right. However, occasionally (for example when multiplying a number of different arrays by a single value array) it means useful shorthands can be used.
Data copies b = a.view() # New array, but referencing the old data This is what a slice returns. c = a.copy() # New array and data Arrays are generally filled with mutable values and respond to the appropriate pass-by-reference style rules we looked at in the core course. However, this isn't always the case, and you want to check out what functions return closely. Are they returning a copy of the original data, or the original data itself. Quite often, functions will return a "view" which is a new array, but containing references to the original data (this allows, for example, for resizing of arrays without affecting the original data). If you want a "deep" copy - that is, a copy where not only the array, but the data in it is copied, use array_name.copy().
Maths on arrays Maths done elementwise and generates a new array. a = arrayb + arrayc a = arrayb * arrayc a = arrayb.dot(arrayc) Matrix dot product (for matrix maths see also numpy.identity and numpy.eye) *= and += can be used to manipulate arrays in place: a += 3 arraya += arrayb You'll remember from the core course that operators like "+" actually call functions in one or other of the variables either side of them. These functions can be overridden to change the functionality of core operators. Here we see an example of how this, apparently confusing, idea comes to fruition. In numpy, the standard operators are overridden to work on ndarrays elementwise - that is, they run through the arrays and operate on each value separately. There are also standard functions for matrix mathematics. We're not going to go into matrix mathematics here, but there's a good introduction for those who need a refresher, here: https://www.mathsisfun.com/algebra/matrix-introduction.html If you want to do matrix maths, you may also want to know how to generate some of the standard matrices that are used in matrix maths: https://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.identity.html#numpy.identity https://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.eye.html#numpy.eye
Built in maths functions a.sum(); a.min(); a.max() a.sum(axis=0) # Array containing sum of columns a.sum(axis=1) # Array containing sum of rows b.cumsum(axis=1) # Array of the start size containing cumulative sums across rows. There are also elementwise functions like numpy.sqrt(arraya) numpy.sin(arraya) There are a large number of functions for data analysis in numpy (and even more in the associated scipy ecosystem). Here are some simple examples (we'll come to where you can find more shortly). Note that many will run over a whole ndarray, but can also be set to generate arrays of values per-row or per-column, depending on the axis set. Here we're assumed the first dimension is taken as rows and the second as columns. In some cases, functions work elementwise to generate new arrays of the same size.
Maths Basic Statistics cov, mean, std, var Basic Linear Algebra cross, dot, outer, linalg.svd, vdot Histogram: generates 1D arrays of counts and bins from array (counts, bins) = np.histogram(arrayIn, bins=50, normed=True) Full list of maths functions: https://docs.scipy.org/doc/numpy/reference/routines.html Here are some of the more useful basic functions.
Scipy functions Special functions (scipy.special) Integration (scipy.integrate) Optimization (scipy.optimize) Interpolation (scipy.interpolate) Fourier Transforms (scipy.fftpack) Signal Processing (scipy.signal) Linear Algebra (scipy.linalg) Sparse Eigenvalue Problems with ARPACK Compressed Sparse Graph Routines (scipy.sparse.csgraph) Spatial data structures and algorithms (scipy.spatial) Statistics (scipy.stats) Multidimensional image processing (scipy.ndimage) <-- Useful for kernel operations File IO (scipy.io) However, as mentioned, across the scipy ecosystem, there are some very powerful libraries. The scipy library, which is one library in the scipy ecosystem, contains a vast number of sub-packages for scientific analysis.