Pandas John R. Woodward
Ndarray in numpy are like matrix/vector in maths. Pandas – panel data Like spreadsheets.
Import statement import pandas as pd import numpy as np import matplotlib.pyplot as plt
spreadsheet
DataFrame A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row and column index; it can be thought of as a dict of Series (one for all sharing the same index).
Create DataFrame from Dictionary data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 'year': [2000, 2001, 2002, 2001, 2002], 'pop': [1.5, 1.7, 3.6, 2.4, 2.9]} frame = pd.DataFrame(data) print frame
OUTPUT pop state year 0 1.5 Ohio 2000 1 1.7 Ohio 2001 2 3.6 Ohio 2002 3 2.4 Nevada 2001 4 2.9 Nevada 2002
Adding dataframes df1 = DataFrame( np.arange(9.).reshape((3, 3)), columns=list('bcd'), index=['Ohio', 'Texas', 'Colorado']) df2 = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon']) print df1 print df2 print df1 + df2
df1, df2, df1 + df2 b d e b c d Utah 0 1 2 Ohio 0 1 2 Ohio 3 4 5 Texas 6 7 8 Oregon 9 10 11 b c d Ohio 0 1 2 Texas 3 4 5 Colorado 6 7 8 b c d e Colorado NaN NaN NaN NaN Ohio 3 NaN 6 NaN Oregon NaN NaN NaN NaN Texas 9 NaN 12 NaN Utah NaN NaN NaN NaN
fill_value=0 df1 = DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd')) df2 = DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde')) print df1 print df2 print df1 + df2 print df1.add(df2, fill_value=0)
df1 df2 a b c d e 0 0 1 2 3 4 1 5 6 7 8 9 2 10 11 12 13 14 3 15 16 17 18 19 a b c d 0 0 1 2 3 1 4 5 6 7 2 8 9 10 11
Without and with fill_value = 0 a b c d e 0 0 2 4 6 NaN 1 9 11 13 15 NaN 2 18 20 22 24 NaN 3 NaN NaN NaN NaN NaN a b c d e 0 0 2 4 6 4 1 9 11 13 15 9 2 18 20 22 24 14 3 15 16 17 18 19
Operations between DataFrame and Series arr = np.arange(12.).reshape((3, 4)) print arr print arr[0] print arr - arr[0]
Output (broadcasting) [[ 0. 1. 2. 3.] [ 4. 5. 6. 7.] [ 8. 9. 10. 11.]] [ 0. 1. 2. 3.] [[ 0. 0. 0. 0.] [ 4. 4. 4. 4.] [ 8. 8. 8. 8.]]
Function application and mapping frame = DataFrame(np.random.randn(4, 3), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon']) print frame print np.abs(frame)
Calculates the absolute value b d e Utah -0.204708 0.478943 -0.519439 Ohio -0.555730 1.965781 1.393406 Texas 0.092908 0.281746 0.769023 Oregon 1.246435 1.007189 -1.296221 Utah 0.204708 0.478943 0.519439 Ohio 0.555730 1.965781 1.393406 Oregon 1.246435 1.007189 1.296221
Defining new functions f = lambda x: x.max() - x.min() print frame.apply(f) frame.apply(f, axis=1)
Output b 1.802165 d 1.684034 e 2.689627 dtype: float64
Sorting obj = Series([1,3,2,4], index=['d', 'a', 'b', 'c']) print obj print obj.sort_index() print obj.sort_values()
Sorted output d 1 a 3 b 2 c 4 dtype: int64
Sorting Dataframes frame = DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'], columns=['d', 'a', 'b', 'c']) print frame.sort_index() print frame.sort_index(axis=1) print frame.sort_index(axis=1, ascending=False)
Sorted Dataframes d a b c one 4 5 6 7 three 0 1 2 3 a b c d d c b a three 0 3 2 1 one 4 7 6 5
Multiple indexes frame = DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]}) print frame print frame.sort_index(by='b') print frame.sort_index(by=['a', 'b'])
output a b 0 0 4 1 1 7 2 0 -3 3 1 2 a b 2 0 -3 3 1 2 0 0 4 1 1 7 a b 2 0 -3 0 0 4 3 1 2 1 1 7
ranking obj = Series([7, -5, 7, 4, 2, 0, 4]) print obj print obj.rank()
Equal ties are split 0 7 1 -5 2 7 3 4 4 2 5 0 6 4 dtype: int64 0 6.5 1 1.0 2 6.5 3 4.5 4 3.0 5 2.0 6 4.5 dtype: float64
Axis indexes with duplicate values obj = Series(range(5), index=['a', 'a', 'b', 'b', 'c']) print obj print obj.index.is_unique print obj['a'] print obj['c']
Output can be series or scalar dtype: int64 False a 0 a 1 dtype: int64 4
Summarizing and computing descriptive statistics df = DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]], index=['a', 'b', 'c', 'd'], columns=['one', 'two']) print df print df.sum() print df.sum(axis=1) (i.e. we sum horizontally and vertically)
Sum vertical and horizontal one two a 1.40 NaN b 7.10 -4.5 c NaN NaN d 0.75 -1.3 one 9.25 two -5.80 dtype: float64 a 1.40 b 2.60 c NaN d -0.55
skipna=False print df.mean(axis=1, skipna=False) print df.idxmax() print df.cumsum() will not skip Nan calculates mean, id of max, and cumulative sum
Mean, idxmax, cumsum() a NaN b 1.300 c NaN d -0.275 dtype: float64 one two a 1.40 NaN b 8.50 -4.5 c NaN NaN d 9.25 -5.8 one b two d dtype: object
print df.describe() one two count 3.000000 2.000000 mean 3.083333 -2.900000 std 3.493685 2.262742 min 0.750000 -4.500000 25% 1.075000 -3.700000 50% 1.400000 -2.900000 75% 4.250000 -2.100000 max 7.100000 -1.300000
Unique values, value counts, and membership obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c']) print obj uniques = obj.unique() print uniques print obj.value_counts()
Unique and counts ['c' 'a' 'd' 'b'] c 3 a 3 b 2 d 1 dtype: int64 0 c dtype: object ['c' 'a' 'd' 'b'] c 3 a 3 b 2 d 1 dtype: int64
membership mask = obj.isin(['b', 'c']) print mask print obj[mask]
Obj, membership, filtering 0 c 1 a 2 d 3 a 4 a 5 b 6 b 7 c 8 c dtype: object 0 True 1 False 2 False 3 False 4 False 5 True 6 True 7 True 8 True dtype: bool 0 c 5 b 6 b 7 c 8 c dtype: object
Handling missing data string_data = Series(['aardvark', 'artichoke', np.nan, 'avocado']) print string_data print string_data.isnull() string_data[0] = None
Isnull(), np.nan, None 0 aardvark 1 artichoke 2 NaN 3 avocado dtype: object 0 False 1 False 2 True 3 False dtype: bool 0 True 1 False 2 True 3 False dtype: bool
Filtering out missing data from numpy import nan as NA data = Series([1, NA, 3.5, NA, 7]) print data print data.dropna() print data[data.notnull()] (the last two are equivalent)
Drops NaN 0 1.0 0 1.0 1 NaN 2 3.5 2 3.5 4 7.0 3 NaN dtype: float64 0 1.0 1 NaN 2 3.5 3 NaN 4 7.0 0 1.0 2 3.5 4 7.0 dtype: float64
data = DataFrame( [[1., 6.5, 3.], [1., NA, NA], [NA, NA, NA], [NA, 6.5, 3.]]) cleaned = data.dropna() print data print cleaned
With NaN dropped 0 1 2 0 1 2 0 1 6.5 3 0 1 6.5 3 1 1 NaN NaN 0 1 2 0 1 6.5 3 1 1 NaN NaN 2 NaN NaN NaN 3 NaN 6.5 3 0 1 2 0 1 6.5 3 Any row with NaN is dropped
Passing how='all' will only drop rows that are all NA: data = DataFrame([[1., 6.5, 3.], [1., NA, NA], [NA, NA, NA], [NA, 6.5, 3.]]) cleaned = data.dropna() print data print cleaned
Only full NaN rows dropped 0 1 2 0 1 6.5 3 1 1 NaN NaN 2 NaN NaN NaN 3 NaN 6.5 3 0 1 2 0 1 6.5 3 1 1 NaN NaN 3 NaN 6.5 3 Rows with some NaN survive
Dropping columns data[4] = NA print data print data.dropna(axis=1, how='all')
Column 4 is dropped as all NaN 0 1 2 4 0 1 6.5 3 NaN 1 1 NaN NaN NaN 2 NaN NaN NaN NaN 3 NaN 6.5 3 NaN 0 1 2 0 1 6.5 3 1 1 NaN NaN 2 NaN NaN NaN 3 NaN 6.5 3
Filling in missing data data = Series(np.random.randn(10), index=[ ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'], [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]]) print data
Hierarchical indexing 2 0.478943 3 -0.519439 b 1 -0.555730 2 1.965781 3 1.393406 c 1 0.092908 2 0.281746 d 2 0.769023 3 1.246435
MultiIndex print data.index dtype: float64 MultiIndex( levels=[[u'a', u'b', u'c', u'd'], [1, 2, 3]], labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])
Using hierarchical indexes. print data['b'] 1 -0.555730 2 1.965781 3 1.393406 dtype: float64 print data[:, 2] a 0.478943 b 1.965781 c 0.281746 d 0.769023 dtype: float64
unstack() stack() Unstack takes 1D data and converts into 2D 2 0.478943 3 -0.519439 b 1 -0.555730 2 1.965781 3 1.393406 c 1 0.092908 2 0.281746 d 2 0.769023 3 1.246435 dtype: float64 1 2 3 a -0.204708 0.478943 -0.519439 b -0.555730 1.965781 1.393406 c 0.092908 0.281746 NaN d NaN 0.769023 1.246435
Both Axis are hierarchical frame = DataFrame(np.arange(12).reshape((4, 3)),index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]], columns=[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']]) print frame
output Ohio Colorado Green Red Green a 1 0 1 2 2 3 4 5 b 1 6 7 8 2 3 4 5 b 1 6 7 8 2 9 10 11
Give index and columns names frame.index.names = ['key1', 'key2'] frame.columns.names = ['state', 'color'] print print frame
output state Ohio Colorado color Green Red Green key1 key2 a 1 0 1 2 2 3 4 5 b 1 6 7 8 2 9 10 11
print frame['Ohio'] color Green Red key1 key2 a 1 0 1 2 3 4 b 1 6 7 2 3 4 b 1 6 7 2 9 10
We can analyse the MultiIndex <class 'pandas.core. index.MultiIndex'> ('a', 1) ('a', 2) <type 'tuple'> b <type 'str'> <type 'numpy.int64'> 4 type(frame.index) frame.index[0] frame.index[1] type(frame.index[0]) frame.index[3][0] type(frame.index[3][0]) type(frame.index[3][1]) len(frame.index)
Reordering and sorting levels print frame state Ohio Colorado color Green Red Green key1 key2 a 1 0 1 2 2 3 4 5 b 1 6 7 8 2 9 10 11
Swap keys print frame.swaplevel('key1', 'key2') state Ohio Colorado color Green Red Green key2 key1 1 a 0 1 2 2 a 3 4 5 1 b 6 7 8 2 b 9 10 11
Sorting levels print frame.sortlevel(0) state Ohio Colorado color Green Red Green key1 key2 a 1 0 1 2 2 3 4 5 b 1 6 7 8 2 9 10 11 print frame.sortlevel(1) state Ohio Colorado color Green Red Green key1 key2 a 1 0 1 2 b 1 6 7 8 a 2 3 4 5 b 2 9 10 11
Summary statistics by level frame.sum(level='key2') state Ohio Colorado color Green Red Green key2 1 6 8 10 2 12 14 16 frame.sum(level='color', axis=1) color Green Red key1 key2 a 1 2 1 2 8 4 b 1 14 7 2 20 10