Pandas Based on: Series: 1D labelled single-type arrays DataFrames: 2D labelled multi-type arrays Generally in 2D arrays, one can have the first dimension.


1 Pandas
Based on:
    Series: 1D labelled single-type arrays
    DataFrames: 2D labelled multi-type arrays
Generally in 2D arrays, one can have the first dimension as rows or columns, the computer doesn't care. Pandas data is labelled in the sense of having column names and row indices (which can be names). This forces the direction of data (rows are rows, and can't contain columns). This makes things easier.
    data.info() gives info of labels and datatypes.
While numpy forms the solid foundation of many packages involved in data analysis, its multi-dimensional nature makes some aspects of analysis over-complicated. Because of this, many people use pandas, a library specifically set up to work on one- and two-dimensional data, and timeseries data especially. Pandas is very widely used in the scientific and data analytical communities. It is built on ndarrays.
One specific thing which makes pandas attractive is that it forces the first dimension of any data to be considered as rows, and the second as columns, avoiding the general ambiguity with arrays. The second useful thing is that the array rows and columns can be given names (for rows, these are called "indices"). While this can be done in numpy, it isn't especially simple. Pandas makes it very intuitive. As well as the standard ndarray functions for finding out about arrays, the "info" function will tell you about pandas arrays.

2 Creating Series
    data = [1, 2, 3, numpy.nan, 5, 6]    # nan == Not a Number
    unindexed = pandas.Series(data)
    indices = ['a', 'b', 'c', 'd', 'e', 'f']    # One label per value.
    indexed = pandas.Series(data, index=indices)
    data_dict = {'a' : 1, 'b' : 2, 'c' : 3}
    indexed = pandas.Series(data_dict)
    fives = pandas.Series(5, indices)    # Fill with 5s.
    named = pandas.Series(data, name='mydata')
    named = named.rename('my_data')    # rename() returns a new Series.
    print(named.name)
Pandas data series are 1D arrays. The main difference from creating an ndarray is that each row can have its own 'index' label, and the series can have an overall name. The list of indices must be the same length as the data. The above shows how to set up: an array without indices; an array with indices using a list of indices; an array using a dict; an array filled with a single value repeated; a named series. We'll see shortly how to set up a multi-series dataframe.
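As a rough check of the points in the notes, the sketch below builds a labelled, named series and shows that rename() hands back a new series rather than changing the original. The variable names labels and renamed are illustrative only.

```python
import numpy
import pandas

data = [1, 2, 3, numpy.nan, 5, 6]
labels = ['a', 'b', 'c', 'd', 'e', 'f']   # must match the data in length

indexed = pandas.Series(data, index=labels, name='mydata')

# rename() returns a new Series; the original keeps its name.
renamed = indexed.rename('my_data')

print(indexed.dtype)                 # the NaN forces a floating-point dtype
print(indexed.name, renamed.name)
```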

3 Datetime data
Treated specially
    date_indices = pandas.date_range(' ', periods=6, freq='D')
Generates six days, ' ' to ' '. Although dates have hyphens when printed, they can also be written without them for indexing etc. Frequency is days by default, but can be 'Y', 'M', 'W' or 'H', 'min', 'S', 'ms', 'us', 'N' and others (recent pandas releases prefer 'YE', 'ME', 'h', 's' and 'ns' for some of these):
Data series with indices representing time and date data are treated specially in pandas, especially when part of dataframes. The simplest way to set up a date series as an index is above.
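A minimal sketch of a date index in use follows. The start date '2024-01-01' is an arbitrary choice for illustration, not the date used on the slide, and the series values are made up.

```python
import pandas

# '2024-01-01' is an assumed, illustrative start date.
date_indices = pandas.date_range('2024-01-01', periods=6, freq='D')
series = pandas.Series(range(6), index=date_indices)

# Date strings work as labels with or without hyphens.
with_hyphens = series.loc['2024-01-02']
without_hyphens = series.loc['20240102']

# Slicing on date labels includes both the start and the end.
first_three = series.loc['2024-01-01':'2024-01-03']
print(first_three)
```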

4 Print
As with numpy. Also:
    df.head()     First few lines
    df.tail(5)    Last 5 lines
    df.index
    df.columns
    df.values
    df.sort_index(axis=1, ascending=False)
    df.sort_values(by='col1')
Pandas prints as with numpy, but there are some additional attributes and functions, as above, for printing subsets of the information in a series or dataframe.
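The calls listed above can be tried on a small frame; the data here is invented purely for illustration. Note that the sort functions return new dataframes and leave df untouched.

```python
import pandas

df = pandas.DataFrame({'col1': [3, 1, 2], 'col2': [30, 10, 20]},
                      index=['c', 'a', 'b'])

top = df.head(2)                                        # first two rows
by_value = df.sort_values(by='col1')                    # rows ordered by col1
by_cols_desc = df.sort_index(axis=1, ascending=False)   # columns in reverse label order

print(list(df.index), list(df.columns))
print(by_value)
```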

5 Indexing
    named[0]        1    (named.iloc[0] is the explicit positional form)
    named[:2]       a 1
                    b 2
    named["a"]      1
    "a" in named    True
    named.get('f', numpy.nan)    Returns default if index not found
Can also take equations:
    a[a > a.median()]
Indexing is as with numpy, including being able to pull out specific values using arrays. However, you can also get hold of rows in a data series using their index label. Note that you can also use "where"-like statements, for example in the above "give me all a, where the value in a is greater than the median of a".
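The indexing forms above can be sketched as follows, using a made-up five-element series. Position-based access is written with iloc here, which is an assumption about style rather than something the slide requires.

```python
import numpy
import pandas

a = pandas.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

first = a.iloc[0]                  # by position
by_label = a['a']                  # by index label
first_two = a.iloc[:2]             # positional slice, end excluded
has_a = 'a' in a                   # membership tests the labels, not the values
missing = a.get('f', numpy.nan)    # default returned when the label is absent

# "Where"-like filtering: keep values above the median (3 here).
above_median = a[a > a.median()]
print(above_median)
```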

6 Looping
    for value in series: etc.
But generally unneeded: most operations take in series and process them elementwise.
Series can be iterated over, but in general this isn't used, as most functions and operators will act on them elementwise and return either a single value or an array containing the appropriate answers.
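To illustrate why explicit loops are rarely needed, this sketch compares an operator applied to a whole series with the equivalent loop. The values are arbitrary.

```python
import numpy
import pandas

s = pandas.Series([1.0, 4.0, 9.0], index=['a', 'b', 'c'])

doubled = s * 2            # operator applied to every element
roots = numpy.sqrt(s)      # numpy functions also work elementwise
total = s.sum()            # reductions return a single value

# The explicit loop gives the same answer with more code.
looped = pandas.Series([value * 2 for value in s], index=s.index)
print(doubled.equals(looped))
```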

7 DataFrames
Can be built from:
    Dict of 1D ndarrays, lists, dicts, or Series
    2-D numpy.ndarray
    Series
    data_dict = {'col1' : [1, 2, 3, 4], 'col2' : [10, 20, 30, 40]}
Lists here could be ndarrays or Series.
    indices = ['a', 'b', 'c', 'd']
    df = pandas.DataFrame(data_dict, index = indices)
If no indices are passed in, rows are numbered from zero. If a Series is shorter than the other columns, the missing cells are filled with numpy.nan (plain lists must all be the same length).
DataFrames are multiple data series with zero, one, or many indices and column/series names. The above shows how to set these up using a dict.
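The difference between Series and plain lists when column lengths differ can be sketched like this; the variable name lists_ok is only for illustration.

```python
import pandas

# Series are aligned on their labels, so a shorter one leaves NaN gaps.
col1 = pandas.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
col2 = pandas.Series([10, 20, 30], index=['a', 'b', 'c'])
df = pandas.DataFrame({'col1': col1, 'col2': col2})
print(df)

# Plain lists carry no labels, so unequal lengths are rejected.
try:
    pandas.DataFrame({'col1': [1, 2, 3, 4], 'col2': [10, 20, 30]})
    lists_ok = True
except ValueError:
    lists_ok = False
```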

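8
    a = pandas.DataFrame.from_items(
        [('col1', [1, 2, 3]), ('col2', [4, 5, 6])])
       col1  col2
    0     1     4
    1     2     5
    2     3     6
    a = pandas.DataFrame.from_items(
        [('A', [1, 2, 3]), ('B', [4, 5, 6])],
        orient='index', columns=['col1', 'col2', 'col3'])
       col1  col2  col3
    A     1     2     3
    B     4     5     6
An alternative when using lists is to use "from_items()", which gives more control over how list data is orientated. Note that from_items() was deprecated in pandas 0.23 and removed in pandas 1.0; from_dict() takes the same orient and columns arguments.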
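Since from_items() is not available in current pandas, here is a sketch of the same two layouts using the plain constructor and from_dict(). This is a substitute for the slide's calls, not the original code.

```python
import pandas

# Each dict entry becomes a column (the default orientation).
by_column = pandas.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})

# orient='index' makes each dict entry a row; columns names the fields.
by_row = pandas.DataFrame.from_dict(
    {'A': [1, 2, 3], 'B': [4, 5, 6]},
    orient='index', columns=['col1', 'col2', 'col3'])

print(by_column)
print(by_row)
```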

9 I/O
    df = pandas.read_csv('data.csv')
    df.to_csv('data.csv')
    df = pandas.read_excel('data.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
    df.to_excel('data.xlsx', sheet_name='Sheet1')
    df = pandas.read_json('data.json')
    json = df.to_json()    Can then be written to a text file.
Wide variety of other formats including HTML and the local CTRL-C copy clipboard:
It is usual to read a dataframe in. The above shows three ways to do this, but there are a large number of others. For excel, see:
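A small round trip shows the read and write functions working together. To keep the sketch self-contained it writes to an in-memory buffer rather than to 'data.csv'; the frame contents are invented.

```python
import io
import json
import pandas

df = pandas.DataFrame({'col1': [1, 2, 3], 'col2': [10, 20, 30]},
                      index=['a', 'b', 'c'])

# Write CSV text to a buffer, then read it back in.
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)
restored = pandas.read_csv(buffer, index_col=0)   # first column holds the row labels

# to_json() with no path returns a string, which could then be written to a file.
json_text = df.to_json()
parsed = json.loads(json_text)
print(restored)
```

Without index_col=0 the row labels would come back as an ordinary column and the rows would be renumbered from zero.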

10 Adding to DataFrames
    concat()    adds dataframes
    join()      joins SQL style
    append()    adds rows (removed in pandas 2.0; use concat() instead)
    insert()    inserts columns at a specific location
Remove
    df.sub(df['col1'], axis=0)
    (though you might also see df - df['col1']; in current pandas the operator form matches the series labels against the column names, so the two are not equivalent)
The above shows the various methods for adding and subtracting. For more information on this, see:
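The adding and subtracting methods can be sketched on a tiny frame. The column name 'mid' and the data are made up for the example.

```python
import pandas

df = pandas.DataFrame({'col1': [1, 2], 'col2': [10, 20]})
more = pandas.DataFrame({'col1': [3], 'col2': [30]})

# concat() stacks dataframes; ignore_index renumbers the rows from zero.
combined = pandas.concat([df, more], ignore_index=True)

# insert() works in place and puts the new column at a given position.
combined.insert(1, 'mid', [0, 0, 0])

# Subtract col1 from every column, matching on row labels.
diff = combined.sub(combined['col1'], axis=0)
print(diff)
```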

11 Broadcasting
When working with a single series and a dataframe, the series will sometimes be assumed to be a row. Check. This doesn't happen with time series data. For more details on broadcasting, see:
Again, there are a variety of rules for broadcasting. It's better, again, not to rely on these, but it is worth understanding them so you recognise when it is happening. In particular, different functions can treat data series as a set of rows, or a column, and this sometimes makes a difference.
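A sketch of the row-versus-column distinction follows, assuming a recent pandas release. The frame and its labels are invented; the point is that the arithmetic operators match a series against the column names, while the named methods let you choose the axis.

```python
import pandas

df = pandas.DataFrame({'col1': [1, 2, 3], 'col2': [10, 20, 30]},
                      index=['a', 'b', 'c'])

# A row pulled out of the frame is labelled by column name,
# so the operator treats it as a row and subtracts it from every row.
row = df.iloc[0]
by_row = df - row

# To treat a series as a column, say so explicitly with axis=0.
by_col = df.sub(df['col1'], axis=0)

# Using the operator with a column matches 'a', 'b', 'c' against
# 'col1', 'col2'; nothing lines up, so every cell is NaN.
naive = df - df['col1']
print(naive)
```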

12 Indexing
By name:
    df.loc['row1', ['col1', 'col2']]
    (you may also see at[] used for single values)
Note that with slicing on labels, both start and end are returned. Note also that loc is not a function: it takes square brackets.
    iloc[1,1] used for positions.
Indexing is as with numpy, including being able to pull out specific values using arrays. However, you can also get hold of rows in a dataframe using their index label and column names. Above are the basics of indexing using both names and locations.
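The contrast between label and position indexing can be sketched as below, with made-up row and column names. The key difference is that a label slice keeps its end point and a position slice does not.

```python
import pandas

df = pandas.DataFrame({'col1': [1, 2, 3], 'col2': [10, 20, 30]},
                      index=['row1', 'row2', 'row3'])

by_name = df.loc['row1', ['col1', 'col2']]   # one row, chosen columns
label_slice = df.loc['row1':'row2']          # both ends included: two rows
position_slice = df.iloc[0:1]                # end excluded: one row
cell = df.iloc[1, 1]                         # second row, second column
single = df.at['row2', 'col2']               # fast access to one labelled cell
print(label_slice)
```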

13 Indexing
    Operation                         Syntax                    Result
    Select column                     df[col]                   Series
    Select row by label               df.loc[label]             Series representing row
    Select row by integer location    df.iloc[loc]              Series representing row
    Slice rows                        df[5:10]                  DataFrame
    Select rows by boolean 1D array   df[bool_array]            DataFrame
    Pull out specific rows            df.query('Col1 < 10')     DataFrame
    (See also isin() )
The above gives more detail, and is taken from:
query and isin allow for filtering of data by values. For more info on the latter, see:
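Filtering by value with query(), a boolean array, and isin() might look like this; the data is illustrative only.

```python
import pandas

df = pandas.DataFrame({'Col1': [5, 10, 15], 'Col2': ['x', 'y', 'z']})

small = df.query('Col1 < 10')                  # expression given as a string
masked = df[df['Col1'] < 10]                   # same filter as a boolean array
chosen = df[df['Col2'].isin(['x', 'z'])]       # rows whose value is in a list
print(chosen)
```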

14 Sparse data
    df1.dropna(how='any')    Drop rows containing numpy.nan
    df1.fillna(value=5)      Replace numpy.nan with 5
    a = pandas.isna(df1)     Boolean mask array
Pandas is especially well set up for sparse data, that is data where lots of the cells are not numbers (explicitly in pandas given the value numpy.nan). This includes various memory-efficient storage options, but also functions to trim down the datasets (see above) and the fact that functions will skip nans.
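The three calls above, plus the nan-skipping behaviour mentioned in the notes, can be sketched on a small frame with made-up values. Both dropna() and fillna() return new dataframes by default.

```python
import numpy
import pandas

df1 = pandas.DataFrame({'col1': [1.0, numpy.nan, 3.0],
                        'col2': [4.0, 5.0, numpy.nan]})

dropped = df1.dropna(how='any')    # keeps only rows with no NaN at all
filled = df1.fillna(value=5)       # NaN cells replaced by 5
mask = pandas.isna(df1)            # True wherever a cell is NaN

# Reductions skip NaN by default: the mean of col1 uses 1.0 and 3.0 only.
col1_mean = df1['col1'].mean()
print(col1_mean)
```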

15 Stack
    df.stack()
Combines the last two columns into one with a column of labels:
    A   B
    10  20
    30  40
becomes
    A  10
    B  20
    A  30
    B  40
unstack() does the opposite. For more sophistication, see pivot tables:
There are also a wide range of ways of reshaping arrays. This includes array_name.T, for getting the array transposed, but also a wide range of ways of producing pivot tables where rows and columns are manipulated against each other. Simple pivot-like tables can be produced with stack (above), but there are more sophisticated versions.
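The stack example can be reproduced with the same four numbers. In this sketch the rows take the default labels 0 and 1, which is an assumption since the slide does not show row labels.

```python
import pandas

df = pandas.DataFrame({'A': [10, 30], 'B': [20, 40]})

# stack() moves the column labels into an inner index level,
# giving one value per (row label, column label) pair.
stacked = df.stack()
print(stacked)

# unstack() moves that inner level back out into columns.
restored = stacked.unstack()
```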

16 Looping
    for col in df.columns:
        series = df[col]
        for index in df.index:
            print(series[index])
But, again, generally unneeded: most operations take in series and process them elementwise.
If column names are sound variable names, they can be accessed like: df.col1
Again, looping is largely redundant because most functions and operations work elementwise.

17 Columns as results
    df['col3'] = df['col1'] + df['col2']
    df['col2'] = df['col1'] < 10    # Boolean values
    df['col1'] = 10
    df2 = df.assign(col3 = someFunction + df['col1'], col4 = 10)
Assign always returns a copy. Note names not strings. In older versions, cols were inserted alphabetically and you couldn't use cols created elsewhere in the statement; since pandas 0.23 (with Python 3.6 or later) the order given is kept, and a function passed in can use columns created earlier in the same call.
Columns can be created as the result of equations or functions using other columns or maths, either by including them in standard assignments, or using the assign() function which takes in either functions or equations. In all cases, these work elementwise. Note that assign assumes columns for the results will be given using variable names, so the column names have to be good variable names.
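A sketch of assign() with an equation, a function, and a constant follows. The lambda stands in for the slide's unspecified function and is purely hypothetical.

```python
import pandas

df = pandas.DataFrame({'col1': [1, 20, 3], 'col2': [10, 20, 30]})

df2 = df.assign(
    col3=df['col1'] + df['col2'],     # an equation on existing columns
    col4=lambda d: d['col1'] * 2,     # a function, called with the dataframe
    col5=10)                          # a single value, repeated down the column

# assign() returns a copy, so df itself is unchanged.
print(list(df.columns))
print(list(df2.columns))
```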

18 Functions
Generally functions on two series result in the union of the labels.
Generally both series and dataframes work well with numpy functions.
Note that the documentation sometimes calls functions "operations".
Operations generally exclude nan.
    df.mean()          Per column
    df.mean(axis=1)    Per row
Complete API of functions at:
Note that most functions work on columns by default, but can be forced to work per-row.
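Both points above, the union of labels and the per-column default, can be sketched with small made-up inputs.

```python
import pandas

x = pandas.Series([1, 2, 3], index=['a', 'b', 'c'])
y = pandas.Series([10, 20, 30], index=['b', 'c', 'd'])

# The result carries every label from either series;
# labels found in only one of them ('a' and 'd') come out as NaN.
total = x + y
print(total)

df = pandas.DataFrame({'col1': [1, 2, 3], 'col2': [10, 20, 30]})
per_column = df.mean()          # one value per column
per_row = df.mean(axis=1)       # one value per row
```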

19 Useful operations
    df.describe()                Quick stats summary
    df.apply(function)           Applies function, lambda etc. to data in cols
    df['col1'].value_counts()    Histogram (counts of each distinct value)
    df.T or df.transpose()       Transpose rows for columns.
Here are a few useful operations.
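These operations can be tried on a small invented frame. The lambda passed to apply() is just an example of a per-column function.

```python
import pandas

df = pandas.DataFrame({'col1': [1, 2, 2, 3], 'col2': [10, 20, 20, 40]})

summary = df.describe()                                # count, mean, std, min, quartiles, max
ranges = df.apply(lambda col: col.max() - col.min())   # called once per column
counts = df['col1'].value_counts()                     # how often each value occurs
flipped = df.T                                         # rows become columns
print(summary)
```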

20 Categorical data
Categorical data can be:
    renamed
    sorted
    grouped by (histogram-like)
    merged and unioned
This is a brief introduction to pandas, but it should be said that there is a lot more sophistication in the data handling, especially if you explicitly define data as either categorical or timeseries. The above outlines some things you can do with categorical data.
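As a rough illustration of the list above, the sketch below declares a categorical series, gives the categories an order, then sorts, counts, and renames. The category names are invented.

```python
import pandas

grades = pandas.Series(['low', 'high', 'medium', 'low'], dtype='category')

# Give the categories a meaningful order, so sorting is not alphabetical.
grades = grades.cat.set_categories(['low', 'medium', 'high'], ordered=True)

sorted_grades = grades.sort_values()      # low, low, medium, high
counts = grades.value_counts()            # histogram-like count per category
renamed = grades.cat.rename_categories(
    {'low': 'L', 'medium': 'M', 'high': 'H'})
print(counts)
```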

21 Time series data
Time series data can be:
    resampled at lesser frequencies
    converted between time zones and representations
    sub-sampled and used in maths and indexing
    analysed for holidays and business days
    shifted/lagged
The above does the same for timeseries.
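A few of the time series operations listed above can be sketched as follows. The start date and the 'Europe/Paris' time zone are arbitrary choices for illustration.

```python
import pandas

# Six daily values starting on an assumed date (a Monday).
idx = pandas.date_range('2024-01-01', periods=6, freq='D')
ts = pandas.Series([1, 2, 3, 4, 5, 6], index=idx)

two_day = ts.resample('2D').sum()     # resample to a lower frequency
lagged = ts.shift(1)                  # each value moved one day later
utc = ts.tz_localize('UTC')           # attach a time zone
paris = utc.tz_convert('Europe/Paris')

# Business days only: the weekend is skipped.
business = pandas.bdate_range('2024-01-01', periods=6)
print(two_day)
```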

22 Plotting
matplotlib wrapped in pandas:
Complex plots may need matplotlib.pyplot.figure() calling first.
    series.plot()                           Line plot a series
    dataframe.plot()                        Line plot a set of series
    dataframe.plot(x='col1', y='col2')      Plot data against another
    bar or barh       bar plots
    hist              histogram
    box               boxplot
    kde or density    density plots
    area              area plots
    scatter           scatter plots
    hexbin            hexagonal bin plots
    pie               pie graphs
Finally, it is worth noting that pandas integrates matplotlib. There are some basic wrappers for easy data plotting, though you may find you need to import pyplot from matplotlib for more complicated layouts and call matplotlib.pyplot.figure().

23 Packages built on pandas
    Statsmodels       Statistics and econometrics
    sklearn-pandas    Scikit-learn (machine learning) with pandas
    Bokeh             Big data visualisation
    seaborn           Data visualisation
    yhat/ggplot       Grammar of Graphics visualisation
    Plotly            Web visualisation using D3.js
    IPython/Spyder    Both allow integration of pandas dataframes
    GeoPandas         Pandas for mapping
As noted, pandas is widely used, and has its own ecosystem of packages that are built on it. Above are some of the main packages. Perhaps the most notable for geographers is geopandas, which we'll look at shortly.

