Presentation is loading. Please wait.

Presentation is loading. Please wait.

Pandas John R. Woodward.

Similar presentations


Presentation on theme: "Pandas John R. Woodward."— Presentation transcript:

1 Pandas John R. Woodward

2 Ndarray in numpy are like matrix/vector in maths. Pandas – panel data Like spreadsheets.

3 Import statement import pandas as pd import numpy as np
import matplotlib.pyplot as plt

4 spreadsheet

5 DataFrame A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row and column index; it can be thought of as a dict of Series (one for all sharing the same index).

6

7 Create DataFrame from Dictionary
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 'year': [2000, 2001, 2002, 2001, 2002], 'pop': [1.5, 1.7, 3.6, 2.4, 2.9]} frame = pd.DataFrame(data) print frame

8 OUTPUT pop state year 0 1.5 Ohio 2000 1 1.7 Ohio 2001 2 3.6 Ohio 2002
Nevada 2001 Nevada 2002

9

10 Adding dataframes df1 = DataFrame( np.arange(9.).reshape((3, 3)), columns=list('bcd'), index=['Ohio', 'Texas', 'Colorado']) df2 = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon']) print df1 print df2 print df1 + df2

11 df1, df2, df1 + df2 b d e b c d Utah 0 1 2 Ohio 0 1 2 Ohio 3 4 5
Texas Oregon b c d Ohio Texas Colorado b c d e Colorado NaN NaN NaN NaN Ohio NaN NaN Oregon NaN NaN NaN NaN Texas NaN NaN Utah NaN NaN NaN NaN

12 fill_value=0 df1 = DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd')) df2 = DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde')) print df1 print df2 print df1 + df2 print df1.add(df2, fill_value=0)

13 df1 df2 a b c d e a b c d

14 Without and with fill_value = 0
a b c d e NaN NaN NaN 3 NaN NaN NaN NaN NaN a b c d e

15 Operations between DataFrame and Series
arr = np.arange(12.).reshape((3, 4)) print arr print arr[0] print arr - arr[0]

16 Output (broadcasting)
[[ ] [ ] [ ]] [ ] [[ ] [ ] [ ]]

17 Function application and mapping
frame = DataFrame(np.random.randn(4, 3), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon']) print frame print np.abs(frame)

18 Calculates the absolute value
b d e Utah Ohio Texas Oregon Utah Ohio Oregon

19 Defining new functions
f = lambda x: x.max() - x.min() print frame.apply(f) frame.apply(f, axis=1)

20 Output b d e dtype: float64

21 Sorting obj = Series([1,3,2,4], index=['d', 'a', 'b', 'c']) print obj
print obj.sort_index() print obj.sort_values()

22 Sorted output d 1 a 3 b 2 c 4 dtype: int64

23 Sorting Dataframes frame = DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'], columns=['d', 'a', 'b', 'c']) print frame.sort_index() print frame.sort_index(axis=1) print frame.sort_index(axis=1, ascending=False)

24 Sorted Dataframes d a b c one 4 5 6 7 three 0 1 2 3 a b c d
d c b a three one

25 Multiple indexes frame = DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]}) print frame print frame.sort_index(by='b') print frame.sort_index(by=['a', 'b'])

26 output a b a b a b

27 ranking obj = Series([7, -5, 7, 4, 2, 0, 4]) print obj
print obj.rank()

28 Equal ties are split 0 7 1 -5 2 7 3 4 4 2 5 0 6 4 dtype: int64 dtype: float64

29 Axis indexes with duplicate values
obj = Series(range(5), index=['a', 'a', 'b', 'b', 'c']) print obj print obj.index.is_unique print obj['a'] print obj['c']

30 Output can be series or scalar
dtype: int64 False a 0 a 1 dtype: int64 4

31 Summarizing and computing descriptive statistics
df = DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]], index=['a', 'b', 'c', 'd'], columns=['one', 'two']) print df print df.sum() print df.sum(axis=1) (i.e. we sum horizontally and vertically)

32 Sum vertical and horizontal
one two a NaN b c NaN NaN d one two dtype: float64 a b c NaN d

33 skipna=False print df.mean(axis=1, skipna=False) print df.idxmax()
print df.cumsum() will not skip Nan calculates mean, id of max, and cumulative sum

34 Mean, idxmax, cumsum() a NaN b 1.300 c NaN d -0.275 dtype: float64
one two a NaN b c NaN NaN d one b two d dtype: object

35 print df.describe() one two count 3.000000 2.000000
mean std min 25% 50% 75% max

36 Unique values, value counts, and membership
obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c']) print obj uniques = obj.unique() print uniques print obj.value_counts()

37 Unique and counts ['c' 'a' 'd' 'b'] c 3 a 3 b 2 d 1 dtype: int64 0 c
dtype: object ['c' 'a' 'd' 'b'] c 3 a 3 b 2 d 1 dtype: int64

38 membership mask = obj.isin(['b', 'c']) print mask print obj[mask]

39 Obj, membership, filtering
0 c 1 a 2 d 3 a 4 a 5 b 6 b 7 c 8 c dtype: object 0 True 1 False 2 False 3 False 4 False 5 True 6 True 7 True 8 True dtype: bool 0 c 5 b 6 b 7 c 8 c dtype: object

40 Handling missing data string_data = Series(['aardvark', 'artichoke', np.nan, 'avocado']) print string_data print string_data.isnull() string_data[0] = None

41 Isnull(), np.nan, None 0 aardvark 1 artichoke 2 NaN 3 avocado
dtype: object 0 False 1 False 2 True 3 False dtype: bool 0 True 1 False 2 True 3 False dtype: bool

42 Filtering out missing data
from numpy import nan as NA data = Series([1, NA, 3.5, NA, 7]) print data print data.dropna() print data[data.notnull()] (the last two are equivalent)

43 Drops NaN 0 1.0 0 1.0 1 NaN 2 3.5 2 3.5 4 7.0 3 NaN dtype: float64
1 NaN 3 NaN dtype: float64

44 data = DataFrame( [[1., 6.5, 3.], [1., NA, NA], [NA, NA, NA], [NA, 6.5, 3.]]) cleaned = data.dropna() print data print cleaned

45 With NaN dropped 0 1 2 0 1 2 0 1 6.5 3 0 1 6.5 3 1 1 NaN NaN
NaN NaN 2 NaN NaN NaN 3 NaN Any row with NaN is dropped

46 Passing how='all' will only drop rows that are all NA:
data = DataFrame([[1., 6.5, 3.], [1., NA, NA], [NA, NA, NA], [NA, 6.5, 3.]]) cleaned = data.dropna() print data print cleaned

47 Only full NaN rows dropped
NaN NaN 2 NaN NaN NaN 3 NaN NaN NaN 3 NaN Rows with some NaN survive

48 Dropping columns data[4] = NA print data
print data.dropna(axis=1, how='all')

49 Column 4 is dropped as all NaN
NaN NaN NaN NaN 2 NaN NaN NaN NaN 3 NaN NaN NaN NaN 2 NaN NaN NaN 3 NaN

50 Filling in missing data
data = Series(np.random.randn(10), index=[ ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'], [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]]) print data

51 Hierarchical indexing
b c d

52 MultiIndex print data.index dtype: float64 MultiIndex(
levels=[[u'a', u'b', u'c', u'd'], [1, 2, 3]], labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])

53 Using hierarchical indexes.
print data['b'] dtype: float64 print data[:, 2] a b c d dtype: float64

54 unstack() stack() Unstack takes 1D data and converts into 2D
b c d dtype: float64 a b c NaN d NaN

55 Both Axis are hierarchical
frame = DataFrame(np.arange(12).reshape((4, 3)),index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]], columns=[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']]) print frame

56 output Ohio Colorado Green Red Green a 1 0 1 2 2 3 4 5 b 1 6 7 8
b

57 Give index and columns names
frame.index.names = ['key1', 'key2'] frame.columns.names = ['state', 'color'] print print frame

58 output state Ohio Colorado color Green Red Green key1 key2 a 1 0 1 2
b

59 print frame['Ohio'] color Green Red key1 key2 a 1 0 1 2 3 4 b 1 6 7
b

60 We can analyse the MultiIndex
<class 'pandas.core. index.MultiIndex'> ('a', 1) ('a', 2) <type 'tuple'> b <type 'str'> <type 'numpy.int64'> 4 type(frame.index) frame.index[0] frame.index[1] type(frame.index[0]) frame.index[3][0] type(frame.index[3][0]) type(frame.index[3][1]) len(frame.index)

61 Reordering and sorting levels
print frame state Ohio Colorado color Green Red Green key1 key2 a b

62 Swap keys print frame.swaplevel('key1', 'key2') state Ohio Colorado
color Green Red Green key2 key1 1 a 2 a 1 b 2 b

63 Sorting levels print frame.sortlevel(0) state Ohio Colorado color Green Red Green key1 key2 a b print frame.sortlevel(1) state Ohio Colorado color Green Red Green key1 key2 a b a b

64 Summary statistics by level
frame.sum(level='key2') state Ohio Colorado color Green Red Green key frame.sum(level='color', axis=1) color Green Red key1 key2 a b


Download ppt "Pandas John R. Woodward."

Similar presentations


Ads by Google