Numpy and Pandas Dr Andy Evans


1 Numpy and Pandas Dr Andy Evans
In this part we'll look at some packages for basic data analysis and visualisation, chiefly the two packages which are very widely used for this and which sit at the base of a number of other packages: numpy and pandas. These packages are used so extensively within Python data analytics that in some areas they are almost synonymous with "Python": you couldn't imagine using Python without using numpy, and some coders probably wouldn't register the difference in their daily use, so naturally do the two fit together.

2 Scipy 'Ecosystem' containing a variety of scientific packages including IPython, numpy, matplotlib, and pandas. numpy is both a system for constructing multi-dimensional data structures and a scientific library. The overarching collection for the numpy and pandas packages is scipy, a grouping of packages that includes matplotlib and the IPython project. At the root of many of these packages is compatibility with numpy, which is a data analysis library but which, perhaps more importantly, provides a structure for the construction of multi-dimensional data arrays.

3 ndarray ndarray or numpy.array (alias) is the basic data format.
a = numpy.array([2,3,4]) Make array with list or tuple (NOT numbers)
a = numpy.fromfile(file, dtype=float, count=-1, sep='')
dtype: allows the construction of multi-type arrays
count: number of values to read (-1 == all)
sep: separator
To generate using a function that acts on each element of a shape:
numpy.fromfunction(function, shape, **kwargs)
The core data type is the ndarray, or its alias numpy.array. As we'll see, the key advantage of this data format is the ability to do multi-dimensional slices. Note the potential confusion between numpy.array and array.array in Python, if you import both. The above slide shows three ways of constructing an ndarray. The usual way is from lists or a file. For dtype and fromfunction examples, see the numpy documentation.
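As a minimal sketch of the first and last of these (the values and the i + j function are purely illustrative):

import numpy as np

a = np.array([2, 3, 4])                          # from a list
b = np.fromfunction(lambda i, j: i + j, (3, 3))  # function applied at each (i, j) position
print(b)
# [[0. 1. 2.]
#  [1. 2. 3.]
#  [2. 3. 4.]]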

4 Built in functions
a = numpy.zeros( (3,4) )
array([[ 0., 0., 0., 0.],
       [ 0., 0., 0., 0.],
       [ 0., 0., 0., 0.]])
Also numpy.ones and numpy.empty (which leaves the contents uninitialised)
numpy.random.random((2,3))
More generic: ndarray.fill(value)
numpy.putmask(): put values based on a Boolean mask array (True == replace)
There are a variety of functions that produce standardised arrays. For example, numpy.zeros takes in the size of an array as a tuple of dimension sizes, and makes an array of the right size containing zeros. numpy.empty does the same thing but doesn't set the contents: in practice the array holds whatever arbitrary values (often very small floats) were already in memory. If you have an array and you want to fill it with a specific number, use .fill().
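A short illustrative sketch of these constructors (the sizes and fill values are arbitrary choices):

import numpy as np

a = np.zeros((3, 4))           # 3 rows x 4 columns of 0.
b = np.ones((2, 2))            # filled with 1.
c = np.empty((2, 3))           # uninitialised: contents are arbitrary
c.fill(7)                      # overwrite every element with 7
r = np.random.random((2, 3))   # uniform random floats in [0, 1)

mask = np.array([[True, False, False], [False, True, False]])
np.putmask(r, mask, 0)         # zero the True positions, in place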

5 arange Like range but generates arrays:
a = np.arange( 1, 10, 2 ) array([1, 3, 5, 7, 9]) Can use with floating point numbers, but precision issues mean it is better to use: a = np.linspace(start, end, numberOfNumbersBetween) Note that with linspace the "end" value is included in the output. For a range of numbers, use arange(). This generates a sequence like the standard "range", but in a 1D ndarray. We'll see in a bit how to then convert that to a multi-dimension array. While this can be used with floating point numbers, because of precision issues, it is better to use linspace to construct a set number of floats falling within a definite interval.
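For example (the endpoints here are illustrative):

import numpy as np

a = np.arange(1, 10, 2)        # array([1, 3, 5, 7, 9])
b = np.linspace(0.0, 1.0, 5)   # array([0., 0.25, 0.5, 0.75, 1.]): note 1.0 is included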

6 ndarray
ndarray.ndim Number of axes (dimensions)
ndarray.shape Length of the different dimensions
ndarray.size Total data amount
ndarray.dtype Data type in the array (standard or numpy)
print(array) will print the array nicely, but if it is too large to print nicely it will print with "…," across the central points. Options are set with numpy.set_printoptions, including numpy.set_printoptions(threshold = None). Each ndarray has a set of attributes automatically set up, as above, which can be accessed to determine information about it. Printing an array will 'pretty print' it; more specifically, if it is too large, the middle numbers will be replaced by "…". This can be turned off with the set_printoptions function, as shown.
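To illustrate (a sketch; sys.maxsize is the documented way to force full printing in current numpy versions):

import sys
import numpy as np

a = np.zeros((3, 4))
print(a.ndim)    # 2
print(a.shape)   # (3, 4)
print(a.size)    # 12
print(a.dtype)   # float64
np.set_printoptions(threshold=sys.maxsize)   # always print in full, never "…"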

7 Platform independent save
Save/Load data in numpy .npy / .npz format numpy.save(file, arr, allow_pickle=True, fix_imports=True) arr = numpy.load(file, mmap_mode=None, allow_pickle=True, fix_imports=True, encoding='ASCII') Although one can write out such arrays with standard text methods, there is also a platform-independent format for quick and effective data storage for use with numpy and related packages specifically. Note that in the above, "arr" is the array to save, or the array returned by load.
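A minimal round-trip sketch (the filename is arbitrary):

import numpy as np

a = np.arange(12).reshape(3, 4)
np.save('mydata.npy', a)    # writes the platform-independent binary file
b = np.load('mydata.npy')   # reads it back into a new array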

8 Indexing Data locations are referenced using [row, col] (for 2D):
arrayA[1,2] not arrayA[1][2] This means we can slice across multiple dimensions, one of the most useful aspects of numpy arrays: a = arrayA[1:3,:] array of the 2nd and 3rd row, all columns. b = arrayA[:, 1] array of all values in second column. You can also use ... to represent "the rest": a[4,...,5,:] == a[4,:,:,5,:]. The most obvious difference between ndarrays and standard Python lists and tuples is that the indexing is with a list of multiple dimensions, not multiple lists of single dimensions. The second is that, using this format, slices can be done across multiple dimensions. Numpy also allocates the optional (but unused in standard Python) symbol "..." to mean "all the rest". For example, a[4,...] means "all the rest in the 4th row of the 2D array a". Note that you have to avoid ambiguities with this: instead of a[4,:,:,5,:] we can say a[4,...,5,:], but not a[4,...,5,...], as it wouldn't be clear which dimension the "5" referred to, whereas in a[4,...,5,:] it is clearly the second to last.
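For example, assuming a small 3x4 array built from a range:

import numpy as np

arrayA = np.arange(12).reshape(3, 4)   # rows: 0-3, 4-7, 8-11
print(arrayA[1, 2])    # 6: row 1, column 2
a = arrayA[1:3, :]     # 2nd and 3rd rows, all columns
b = arrayA[:, 1]       # all rows, 2nd column: array([1, 5, 9])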

9 Indexing Can use numpy arrays to pull out values:
j = np.array( [ [ 3, 4], [ 9, 7 ] ] ) a[j] Can also use Boolean arrays, with "True" values indicating values we want: mask = numpy.array([False,True,False]) a = numpy.array([1,2,3]) a[mask] == [2] Numpy has something called 'structured arrays' which allow named columns, but these are better done through their wrappers in pandas. We can also pull multiple values out of specific locations within arrays, as above. For more on structured arrays, see the numpy documentation.
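A sketch of both forms of indexing (values are illustrative):

import numpy as np

a = np.arange(10) ** 2          # 0, 1, 4, 9, 16, 25, 36, 49, 64, 81
j = np.array([[3, 4], [9, 7]])
print(a[j])                     # [[ 9 16]
                                #  [81 49]]
mask = np.array([False, True, False])
b = np.array([1, 2, 3])
print(b[mask])                  # [2]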

10 Shape changing To take the current values and force them into a different shape, use reshape, for example: a = numpy.arange(12).reshape(3,4)
resize changes the underlying array itself.
numpy.squeeze() removes single-length (size-1) dimensions
arrayA.flat gives you all elements as an iterator
arrayA.ravel() gives you the array flattened
arrayA.T gives you the array with rows and columns transposed (note this is an attribute, not a function)
There are a variety of functions for altering the shape of arrays. Note that reshape is how we would generate multi-dimensional arrays using arange.
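For example, as a sketch:

import numpy as np

a = np.arange(12).reshape(3, 4)   # a range forced into 3 rows x 4 columns
print(a.T.shape)                  # (4, 3): transpose is an attribute, not a call
flat = a.ravel()                  # the array flattened to 1D
for value in a.flat:              # .flat iterates over every element
    pass
b = np.arange(12)
b.resize(3, 4)                    # unlike reshape, changes b itself, in place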

11 Concatenating
a = numpy.vstack((arrayA,arrayB)) Stack arrays vertically
a = numpy.hstack((arrayA,arrayB)) Stack arrays horizontally
column_stack stacks 1D arrays as columns.
More generic is numpy.concatenate((a1, a2, ...), axis=0), which allows you to say which axis to join along. To join arrays together, use the above functions.
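For example, assuming two small 2x2 arrays:

import numpy as np

arrayA = np.array([[1, 2], [3, 4]])
arrayB = np.array([[5, 6], [7, 8]])
np.vstack((arrayA, arrayB))                # 4 rows x 2 columns
np.hstack((arrayA, arrayB))                # 2 rows x 4 columns
np.concatenate((arrayA, arrayB), axis=0)   # same result as vstack here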

12 Broadcasting The way data is filled or reused if arrays being used together are different shapes. For example, small arrays will usually be "stretched", with the data in them repeated. See the numpy documentation on broadcasting. One of the most Pythonic aspects of ndarrays is that if you pass them into operations that expect arrays of a particular size (for example because they work on two arrays of the same size), smaller arrays are treated as if temporarily expanded so that things work in an intuitive manner. This filling process is known as 'broadcasting'. By and large, it is better not to rely on broadcasting: understand your array sizes and make sure they are right. However, occasionally (for example when multiplying a number of different arrays by a single value) it provides useful shorthands.
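A small sketch of broadcasting in action (shapes chosen purely for illustration):

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
b = np.array([10, 20, 30])             # shape (3,)
print(a + b)    # b is "stretched" across each row:
                # [[11 22 33]
                #  [14 25 36]]
print(a * 2)    # a single value broadcasts to every element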

13 Data copies b = a.view() # New array, but referencing the old data This is what a slice returns. c = a.copy() # New array and data Arrays are generally filled with mutable values and respond to the appropriate pass-by-reference style rules we looked at in the core course. However, this isn't always the case, and you should check closely what functions return: are they returning a copy of the original data, or the original data itself? Quite often, functions will return a "view", which is a new array containing references to the original data (this allows, for example, for resizing of arrays without affecting the original data). If you want a "deep" copy, that is, a copy where not only the array but also the data in it is copied, use array_name.copy().
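To see the difference, a minimal sketch:

import numpy as np

a = np.arange(6)
b = a.view()    # new array object, same underlying data
b[0] = 99
print(a[0])     # 99: the change shows through the view
c = a.copy()    # new array AND new data
c[1] = -1
print(a[1])     # 1: the original is unaffected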

14 Maths on arrays Maths done elementwise and generates a new array.
a = arrayb + arrayc
a = arrayb * arrayc
a = arrayb.dot(arrayc) Matrix dot product (for matrix maths see also numpy.identity and numpy.eye)
*= and += can be used to manipulate arrays in place:
a += 3
arraya += arrayb
You'll remember from the core course that operators like "+" actually call functions in one or other of the variables either side of them. These functions can be overridden to change the functionality of core operators. Here we see an example of how this, apparently confusing, idea comes to fruition. In numpy, the standard operators are overridden to work on ndarrays elementwise; that is, they run through the arrays and operate on each value separately. There are also standard functions for matrix mathematics. We're not going to go into matrix mathematics here, but there are good introductions available for those who need a refresher. If you want to do matrix maths, numpy.identity and numpy.eye will generate some of the standard matrices used.
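For example, with two small 2x2 arrays:

import numpy as np

b = np.array([[1, 2], [3, 4]])
c = np.array([[10, 20], [30, 40]])
print(b + c)      # elementwise sum
print(b * c)      # elementwise product (NOT matrix multiplication)
print(b.dot(c))   # matrix dot product: [[ 70 100]
                  #                      [150 220]]
b += 3            # in place: every element increased by 3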

15 Built in maths functions
a.sum(); a.min(); a.max()
a.sum(axis=0) # Array containing sum of columns
a.sum(axis=1) # Array containing sum of rows
b.cumsum(axis=1) # Array of the start size containing cumulative sums across rows.
There are also elementwise functions like numpy.sqrt(arraya) and numpy.sin(arraya). There are a large number of functions for data analysis in numpy (and even more in the associated scipy ecosystem). Here are some simple examples (we'll come to where you can find more shortly). Note that many will run over a whole ndarray, but can also be set to generate arrays of values per-row or per-column, depending on the axis set. Here we've assumed the first dimension is taken as rows and the second as columns. In some cases, functions work elementwise to generate new arrays of the same size.
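For example, assuming a 2x3 array:

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
print(a.sum())           # 21: the whole array
print(a.sum(axis=0))     # [5 7 9]: per column
print(a.sum(axis=1))     # [ 6 15]: per row
print(a.cumsum(axis=1))  # [[ 1  3  6]
                         #  [ 4  9 15]]
print(np.sqrt(a))        # elementwise, same shape as a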

16 Maths
Basic statistics: cov, mean, std, var
Basic linear algebra: cross, dot, outer, linalg.svd, vdot
Histogram: generates 1D arrays of counts and bins from an array:
(counts, bins) = np.histogram(arrayIn, bins=50, normed=True)
A full list of maths functions is in the numpy documentation. Here are some of the more useful basic functions.
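A sketch using some illustrative random data (note that in newer numpy versions the normed argument has been replaced by density):

import numpy as np

data = np.random.normal(0, 1, 1000)    # 1000 samples, mean 0, std 1
print(data.mean(), data.std(), data.var())
counts, bins = np.histogram(data, bins=50)   # counts per bin, plus bin edges
print(len(counts), len(bins))                # 50 and 51: the edges array is one longer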

17 Scipy functions
Special functions (scipy.special)
Integration (scipy.integrate)
Optimization (scipy.optimize)
Interpolation (scipy.interpolate)
Fourier Transforms (scipy.fftpack)
Signal Processing (scipy.signal)
Linear Algebra (scipy.linalg)
Sparse Eigenvalue Problems with ARPACK
Compressed Sparse Graph Routines (scipy.sparse.csgraph)
Spatial data structures and algorithms (scipy.spatial)
Statistics (scipy.stats)
Multidimensional image processing (scipy.ndimage) <-- Useful for kernel operations
File IO (scipy.io)
However, as mentioned, across the scipy ecosystem there are some very powerful libraries. The scipy library, which is one library in the scipy ecosystem, contains a vast number of sub-packages for scientific analysis.

18 Pandas Based on:
Series: 1D labelled single-type arrays
DataFrames: 2D labelled multi-type arrays
Generally in 2D arrays one can have the first dimension as rows or columns; the computer doesn't care. Pandas data is labelled in the sense of having column names and row indices (which can be names). This forces the direction of data (rows are rows, and can't contain columns), which makes things easier. data.info() gives info on labels and datatypes. While numpy forms the solid foundation of many packages involved in data analysis, its multi-dimensional nature makes some aspects of analysis over-complicated. Because of this, many people use pandas, a library specifically set up to work on one and two dimensional data, and timeseries data especially. Pandas is very widely used in the scientific and data analytical communities. It is built on ndarrays. One specific thing which makes pandas attractive is that it forces the first dimension of any data to be considered as rows, and the second as columns, avoiding the general ambiguity with arrays. The second useful thing is that array rows and columns can be given names (for rows, these are called "indices"). While this can be done in numpy, it isn't especially simple; pandas makes it very intuitive. As well as the standard ndarray functions for finding out about arrays, the "info" function will tell you about pandas arrays.

19 Creating Series
data = [1,2,3,numpy.nan,5,6] # nan == Not a Number
unindexed = pandas.Series(data)
indices = ['a', 'b', 'c', 'd', 'e', 'f'] # one label per value
indexed = pandas.Series(data, index=indices)
data_dict = {'a' : 1, 'b' : 2, 'c' : 3}
indexed = pandas.Series(data_dict)
fives = pandas.Series(5, indices) # Fill with 5s.
named = pandas.Series(data, name='mydata')
named = named.rename('my_data') # rename returns a new series
print(named.name)
Pandas data series are 1D arrays. The main difference from creating an ndarray is that each row can have its own 'index' label, and the series can have an overall name. The above shows how to set up: an array without indices; an array with indices using a list of indices (which must be the same length as the data); an array using a dict; an array filled with a single value repeated; and a named series. We'll see shortly how to set up a multi-series dataframe.

20 Datetime data Treated specially
date_indices = pandas.date_range('20000101', periods=6, freq='D') (the start date here is just an example) generates six days, '2000-01-01' to '2000-01-06'. Although dates have hyphens when printed, they should be used without hyphens for indexing etc. Frequency is days by default, but can be 'Y', 'M', 'W' or 'H', 'min', 'S', 'ms', 'us', 'N' and others. Dataseries having indices representing time and date data are treated specially in pandas, especially when part of dataframes. The simplest way to set up a data series with a date index is shown above.
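As a short sketch of how such an index behaves (the start date and data are illustrative), note how hyphen-less date strings can be used to slice rows:

import pandas as pd
import numpy as np

dates = pd.date_range('20000101', periods=6, freq='D')
df = pd.DataFrame(np.random.randn(6, 2), index=dates, columns=['a', 'b'])
print(df.loc['20000102':'20000104'])   # slice by date label, no hyphens needed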

21 Print
As with numpy. Also:
df.head() First few lines
df.tail(5) Last 5 lines
df.index
df.columns
df.values
df.sort_index(axis=1, ascending=False)
df.sort_values(by='col1')
Pandas prints as with numpy, but there are some additional attributes of dataseries, and functions as above, for printing subsets of the information in series.
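For example, with a small illustrative dataframe:

import pandas as pd

df = pd.DataFrame({'col1': [3, 1, 2], 'col2': [30, 10, 20]})
print(df.head(2))                              # first two rows
print(df.sort_values(by='col1'))               # rows reordered by col1
print(df.sort_index(axis=1, ascending=False))  # columns in reverse name order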

22 Indexing
named[0] 1
named[:2]
a 1
b 2
named["a"] 1
"a" in named == True
named.get('f', numpy.nan) Returns default if index not found
Can also take equations:
a[a > a.median()]
Indexing is as with numpy, including being able to pull out specific values using arrays. However, you can also get hold of rows in a dataseries using their index label. Note that you can also use "where"-like statements, as in the last example above: "give me all of a where the value in a is greater than the median of a".
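As a runnable sketch of the above (labels and values illustrative):

import pandas as pd
import numpy as np

named = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
print(named['a'])                     # 1, by label
print('a' in named)                   # True
print(named.get('f', np.nan))         # nan: the default, as 'f' is missing
print(named[named > named.median()])  # "where"-style filtering: just c 3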

23 Looping for value in series: etc. But this is generally unneeded: most operations take in series and process them elementwise. Series will act as iterators, but in general this isn't used, as most functions and operators will act on them elementwise and return either a single value or an array containing the appropriate answers.

24 DataFrames
DataFrames can be built from:
a dict of 1D ndarrays, lists, dicts, or Series
a 2-D numpy.ndarray
a Series
data_dict = {'col1' : [1, 2, 3, 4], 'col2' : [10, 20, 30, 40]}
The lists here could be ndarrays or Series.
indices = ['a', 'b', 'c', 'd']
df = pandas.DataFrame(data_dict, index = indices)
If no indices are passed in, rows are numbered from zero. If data is shorter than the current columns, it is filled with numpy.nan. DataFrames are multiple dataseries with zero, one, or many indices and column/series names. The above shows how to set these up using a dict.
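Putting the above together (a minimal sketch):

import pandas as pd

data_dict = {'col1': [1, 2, 3, 4], 'col2': [10, 20, 30, 40]}
df = pd.DataFrame(data_dict, index=['a', 'b', 'c', 'd'])
print(df)
#    col1  col2
# a     1    10
# b     2    20
# c     3    30
# d     4    40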

25 from_items
a = pandas.DataFrame.from_items([('col1', [1, 2, 3]), ('col2', [4, 5, 6])])
   col1  col2
0     1     4
1     2     5
2     3     6
b = pandas.DataFrame.from_items([('A', [1, 2, 3]), ('B', [4, 5, 6])], orient='index', columns=['col1', 'col2', 'col3'])
   col1  col2  col3
A     1     2     3
B     4     5     6
An alternative when using lists is to use "from_items()", which gives more control over how the list data is orientated. (Note that from_items has been removed from recent versions of pandas; DataFrame.from_dict offers the same functionality there.)

26 I/O
df = pandas.read_csv('data.csv')
df.to_csv('data.csv')
df = pandas.read_excel('data.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
df.to_excel('data.xlsx', sheet_name='Sheet1')
df = pandas.read_json('data.json')
json = df.to_json()
The JSON string can then be written to a text file. There is a wide variety of other formats, including HTML and the local CTRL-C copy clipboard. It is usual to read a dataframe in; the above shows three ways to do this, but there are a large number of others. For Excel details, see the pandas documentation.

27 Adding to DataFrames
concat() adds dataframes
join() joins SQL style
append() adds rows
insert() inserts columns at a specific location
Remove values by subtraction:
df.sub(df['col1'], axis=0) (though you might also see df - df['col1'])
The above shows the various methods for adding and subtracting. For more information, see the pandas documentation on merging and combining dataframes.
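A sketch of a few of these (the dataframe contents are illustrative):

import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2]})
df2 = pd.DataFrame({'col1': [3, 4]})
stacked = pd.concat([df1, df2])     # rows of df2 appended under df1
df1.insert(0, 'col0', [10, 20])     # new column at position 0

df = pd.DataFrame({'col1': [1, 2], 'col2': [10, 20]})
print(df.sub(df['col1'], axis=0))   # col1 subtracted from every column, row by row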

28 Broadcasting When working with a single series and a dataframe, the series will sometimes be assumed to be a row: check. This doesn't happen with time series data. For more details on broadcasting, see the pandas documentation. Again, there are a variety of rules for broadcasting. It is better, again, not to rely on these, but it is worth understanding them so you recognise when broadcasting is happening. In particular, different functions can treat a dataseries as a set of rows or as a column, and this sometimes makes a difference.

29 Indexing By name: named.loc['row1',['col1','col2']] (you may also see at() used) Note that with slicing on labels, both the start and end rows are returned. Note also that loc is used with square brackets; it is not a function call. iloc[1,1] is used for integer positions. Indexing is as with numpy, including being able to pull out specific values using arrays. However, you can also get hold of rows in a dataframe using their index label and column names. Above are the basics of indexing using both names and locations.
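For example, assuming a dataframe with named rows and columns:

import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [10, 20]}, index=['row1', 'row2'])
print(df.loc['row1', ['col1', 'col2']])   # by labels
print(df.loc['row1':'row2', 'col1'])      # label slice: both ends included
print(df.iloc[1, 1])                      # 20: by integer position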

30 Indexing
Operation                         Syntax                  Result
Select column                     df[col]                 Series
Select row by label               df.loc[label]           Series representing row
Select row by integer location    df.iloc[loc]            Series representing row
Slice rows                        df[5:10]                DataFrame
Select rows by boolean 1D array   df[bool_array]          DataFrame
Pull out specific rows            df.query('Col1 < 10')   DataFrame (see also isin())
The above gives more detail, and is taken from the pandas documentation. query and isin allow for filtering of data by values.
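For example, query and isin might be used like this (column names and values are illustrative):

import pandas as pd

df = pd.DataFrame({'Col1': [5, 15, 8], 'Col2': ['a', 'b', 'c']})
print(df.query('Col1 < 10'))            # rows where Col1 is less than 10
print(df[df['Col2'].isin(['a', 'c'])])  # rows whose Col2 value is in the given set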

31 Sparse data
df1.dropna(how='any') Drop rows associated with numpy.nan
df1.fillna(value=5)
a = pd.isna(df1) Boolean mask array
Pandas is especially well set up for sparse data, that is, data where lots of the cells are not numbers (explicitly given the value numpy.nan in pandas). This includes various memory-efficient storage options, but also functions to trim down the datasets (see above) and the fact that functions will skip nans.
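A small sketch of these (contents illustrative):

import pandas as pd
import numpy as np

df1 = pd.DataFrame({'col1': [1, np.nan, 3], 'col2': [4, 5, np.nan]})
print(df1.dropna(how='any'))   # keeps only the fully populated first row
print(df1.fillna(value=5))     # nans replaced by 5
print(pd.isna(df1))            # Boolean mask showing where the nans are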

32 Stack
df.stack() combines the columns into one, with a column of labels. For example:
   A   B
  10  20
  30  40
becomes
  A  10
  B  20
  A  30
  B  40
unstack() does the opposite. For more sophistication, see pivot tables. There are also a wide range of ways of reshaping arrays. These include array_name.T, for getting the array transposed, but also a wide range of ways of producing pivot tables where rows and columns are manipulated against each other. Simple pivot-like tables can be produced with stack (above), but there are more sophisticated versions.
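As a runnable sketch of the stack/unstack round trip:

import pandas as pd

df = pd.DataFrame({'A': [10, 30], 'B': [20, 40]})
stacked = df.stack()       # columns folded into an extra index level
print(stacked)
# 0  A    10
#    B    20
# 1  A    30
#    B    40
print(stacked.unstack())   # back to the original shape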

33 Looping
for col in df.columns:
    series = df[col]
    for index in df.index:
        print(series[index])
But, again, this is generally unneeded: most operations take in series and process them elementwise. If column names are sound variable names, they can be accessed like: df.col1 Again, looping is largely redundant because most functions and operations work elementwise.

34 Columns as results
df['col3'] = df['col1'] + df['col2']
df['col2'] = df['col1'] < 10 # Boolean values
df['col1'] = 10
df2 = df.assign(col3 = someFunction, col4 = df['col1'] + 10)
assign always returns a copy. Note that the new column names are keyword names, not strings. Columns are inserted alphabetically, and you can't use columns created elsewhere in the same statement. Columns can be created as the result of equations or functions using other columns or maths, either by including them in standard assignments, or using the assign() function, which takes in either functions or equations. In all cases, these work elementwise. Note that assign assumes columns for the results will be given using variable names, so the column names have to be good variable names.
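For example (someFunction above is a placeholder; here a lambda stands in for it):

import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [10, 20, 30]})
df['col3'] = df['col1'] + df['col2']   # elementwise equation
df['col4'] = df['col1'] < 3            # Boolean column
df2 = df.assign(col5=df['col1'] * 2,             # an equation
                col6=lambda d: d['col2'] / 10)   # a function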

35 Functions Generally functions on two series result in the union of the labels. Generally both series and dataframes work well with numpy functions. Note that the documentation sometimes calls functions "operations". Operations generally exclude nan.
df.mean() Per column
df.mean(1) Per row
The complete API of functions is in the pandas documentation. Note that most functions work on columns by default, but can be forced to work per-row.
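For example:

import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [10, 20]})
print(df.mean())    # per column: col1 1.5, col2 15.0
print(df.mean(1))   # per row: 5.5 and 11.0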

36 Useful operations
df.describe() Quick stats summary
df.apply(function) Applies a function, lambda etc. to the data in columns
df.value_counts() Histogram
df.T or pandas.transpose(df) Transpose rows for columns
Here are a few useful operations.

37 Categorical data
Categorical data can be:
renamed
sorted
grouped by (histogram-like)
merged and unioned
This is a brief introduction to pandas, but it should be said that there is a lot more sophistication in the data handling, especially if you explicitly define data as either categorical or timeseries. The above outlines some things you can do with categorical data.
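A brief sketch of categorical data (the categories here are illustrative):

import pandas as pd

s = pd.Series(['low', 'high', 'low', 'mid'], dtype='category')
print(s.cat.categories)   # the distinct categories
print(s.value_counts())   # histogram-like counts per category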

38 Time series data
Time series data can be:
resampled at lesser frequencies
converted between time zones and representations
sub-sampled and used in maths and indexing
analysed for holidays and business days
shifted/lagged
While the above does the same for timeseries.
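For instance, resampling and shifting might look like this (the dates and values are illustrative):

import pandas as pd
import numpy as np

idx = pd.date_range('20000101', periods=60, freq='min')
ts = pd.Series(np.arange(60), index=idx)
print(ts.resample('15min').mean())   # resampled at a lesser frequency
print(ts.shift(1).head(3))           # values lagged by one step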

39 Plotting
matplotlib wrapped in pandas. Complex plots may need matplotlib.pyplot.figure() calling first.
series.plot() Line plot a series
dataframe.plot() Line plot a set of series
dataframe.plot(x='col1', y='col2') Plot one column against another
bar or barh: bar plots
hist: histogram
box: boxplot
kde or density: density plots
area: area plots
scatter: scatter plots
hexbin: hexagonal bin plots
pie: pie graphs
Finally, it is worth noting that pandas integrates matplotlib. There are some basic wrappers for easy data plotting, though you may find you need to import pyplot from matplotlib for more complicated layouts and call matplotlib.pyplot.figure().
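For example (a sketch with made-up data):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [1, 4, 9, 16]})
df.plot(x='col1', y='col2')   # line plot of col2 against col1
df.plot(kind='bar')           # bar plot of each column
plt.show()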

40 Packages built on pandas
Statsmodels: statistics and econometrics
sklearn-pandas: Scikit-learn (machine learning) with pandas
Bokeh: big data visualisation
seaborn: data visualisation
yhat/ggplot: Grammar of Graphics visualisation
Plotly: web visualisation using D3.js
IPython/Spyder: both allow integration of pandas dataframes
GeoPandas: pandas for mapping
As noted, pandas is widely used, and has its own ecosystem of packages that are built on it. Above are some of the main packages. Perhaps the most notable for geographers is geopandas, which we'll look at shortly.

41 Mapping packages Unfortunately none come with Anaconda (the only related package included does lat/long to Cartesian conversions). matplotlib integrates with basemap, but this is tricky to get working well, and cartopy offers more sophisticated vector graphics. GeoPandas is probably the simplest to use; it is based on Shapely, which has its roots in PostGIS. There are a wide variety of mapping packages for Python. Unfortunately none of them come with Anaconda, but if you have your own machine you can install them. While matplotlib has a couple of spin-off packages, most notably basemap, these can be hard to get working, especially on Windows with the latest version of Python. Better is geopandas, which works well and is easy to get running. GeoPandas is based on Shapely, a package for generic shape manipulation and plotting in arbitrary coordinate spaces. This ultimately has its roots in the PostGIS spatial database.

42 GeoPandas Based around GeoSeries and GeoDataFrames. These are made up of:
Points / Multi-Points
Lines / Multi-Lines
Polygons / Multi-Polygons
GeoDataFrames always have one geometry column (by default called "geometry"). This can be set/get with:
gdf = gdf.set_geometry('col1')
print(gdf.geometry.name)
GeoPandas basically adds the "geo" to the pandas series and dataframes setup. It does this by including a geometry column containing geographical data.
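As a minimal sketch, a GeoDataFrame can be built by hand from shapely geometries (the points here are illustrative):

import geopandas
from shapely.geometry import Point

gdf = geopandas.GeoDataFrame(
    {'name': ['a', 'b'],
     'geometry': [Point(0, 0), Point(1, 1)]})
print(gdf.geometry.name)   # 'geometry'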

43 I/O Wide range of filetypes importable with:
gdf = geopandas.read_file("filename") Including shapefiles and GeoJSON. Can also use a URL string. To write, use: gdf.to_file() This supports a wide range of formats (underneath it uses fiona; see the fiona manual for supported formats). One nice addition is that geopandas integrates (and simplifies) fiona, a library for reading and writing a wide variety of geographical data formats. This makes reading and writing geographical data files very easy.

44 Mapping Geometry mapped with:
gdf.plot();
You can also plot slices etc. To make a choropleth, give a column other than geometry:
gdf.plot(column='col1');
To plot datapoints, use a basemap plus points:
base = gdf.plot(color='white', edgecolor='black')
pts.plot(ax=base, marker='o', color='blue', markersize=2);
This wraps round matplotlib, so there are other options available. As with pandas, producing a map is often just a matter of calling "plot". This will show the geometry data by default. The above slide gives the basics of also showing additional data columns and layers.

45 Data processing Columns can be referred to by column names, where these are good variable names. There are also attributes:
area: in projection units
bounds: tuple of per-shape max and min coordinates
total_bounds: tuple of per-dataframe max and min coordinates
gdf.col1 = gdf.col1 / gdf.geometry.area
As with pandas, columns can be created as the results of equations containing other columns, working elementwise. There are also a number of pre-calculated attributes associated with geometries.
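For example, normalising a hypothetical population column by shape area (polygons and values are illustrative):

import geopandas
from shapely.geometry import Polygon

gdf = geopandas.GeoDataFrame(
    {'population': [100, 400],
     'geometry': [Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),    # area 1
                  Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])]})  # area 4
gdf['density'] = gdf.population / gdf.geometry.area   # people per unit area
print(gdf.total_bounds)   # [0. 0. 2. 2.]: whole-dataframe bounding box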

46 Other operations
Overlay (union; intersection; symmetrical difference; difference)
Reprojection
Buffering, convex hulls, etc.
Scaling, translations, etc.
Grouping and dissolves, attribute and spatial merges
Geocoding through geocoding services (e.g. Google)
Feature-to-feature distances
Centroid finding
Intersection and feature-in-feature checking
GeoPandas adds a range of additional functions, encompassing much of the work you'd associate with a basic GIS. The documentation is generally good and can be found, with examples, online. The functions have a very Pythonic style, with ease-of-use being of high importance. In general, as numpy gets wrapped deeper and deeper inside packages, the packages get more restricted in the kinds of jobs they'll do, but the process of analysis and visualisation gets easier and easier.

