Download presentation
Presentation is loading. Please wait.
1
Pandas John R. Woodward
2
Ndarray in numpy are like matrix/vector in maths. Pandas – panel data Like spreadsheets.
3
Import statement import pandas as pd import numpy as np
import matplotlib.pyplot as plt
4
spreadsheet
5
DataFrame A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row and column index; it can be thought of as a dict of Series (one for all sharing the same index).
7
Create DataFrame from Dictionary
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 'year': [2000, 2001, 2002, 2001, 2002], 'pop': [1.5, 1.7, 3.6, 2.4, 2.9]} frame = pd.DataFrame(data) print frame
8
OUTPUT pop state year 0 1.5 Ohio 2000 1 1.7 Ohio 2001 2 3.6 Ohio 2002
Nevada 2001 Nevada 2002
10
Adding dataframes df1 = DataFrame( np.arange(9.).reshape((3, 3)), columns=list('bcd'), index=['Ohio', 'Texas', 'Colorado']) df2 = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon']) print df1 print df2 print df1 + df2
11
df1, df2, df1 + df2 b d e b c d Utah 0 1 2 Ohio 0 1 2 Ohio 3 4 5
Texas Oregon b c d Ohio Texas Colorado b c d e Colorado NaN NaN NaN NaN Ohio NaN NaN Oregon NaN NaN NaN NaN Texas NaN NaN Utah NaN NaN NaN NaN
12
fill_value=0 df1 = DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd')) df2 = DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde')) print df1 print df2 print df1 + df2 print df1.add(df2, fill_value=0)
13
df1 df2 a b c d e a b c d
14
Without and with fill_value = 0
a b c d e NaN NaN NaN 3 NaN NaN NaN NaN NaN a b c d e
15
Operations between DataFrame and Series
arr = np.arange(12.).reshape((3, 4)) print arr print arr[0] print arr - arr[0]
16
Output (broadcasting)
[[ ] [ ] [ ]] [ ] [[ ] [ ] [ ]]
17
Function application and mapping
frame = DataFrame(np.random.randn(4, 3), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon']) print frame print np.abs(frame)
18
Calculates the absolute value
b d e Utah Ohio Texas Oregon Utah Ohio Oregon
19
Defining new functions
f = lambda x: x.max() - x.min() print frame.apply(f) frame.apply(f, axis=1)
20
Output b d e dtype: float64
21
Sorting obj = Series([1,3,2,4], index=['d', 'a', 'b', 'c']) print obj
print obj.sort_index() print obj.sort_values()
22
Sorted output d 1 a 3 b 2 c 4 dtype: int64
23
Sorting Dataframes frame = DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'], columns=['d', 'a', 'b', 'c']) print frame.sort_index() print frame.sort_index(axis=1) print frame.sort_index(axis=1, ascending=False)
24
Sorted Dataframes d a b c one 4 5 6 7 three 0 1 2 3 a b c d
d c b a three one
25
Multiple indexes frame = DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]}) print frame print frame.sort_index(by='b') print frame.sort_index(by=['a', 'b'])
26
output a b a b a b
27
ranking obj = Series([7, -5, 7, 4, 2, 0, 4]) print obj
print obj.rank()
28
Equal ties are split 0 7 1 -5 2 7 3 4 4 2 5 0 6 4 dtype: int64 dtype: float64
29
Axis indexes with duplicate values
obj = Series(range(5), index=['a', 'a', 'b', 'b', 'c']) print obj print obj.index.is_unique print obj['a'] print obj['c']
30
Output can be series or scalar
dtype: int64 False a 0 a 1 dtype: int64 4
31
Summarizing and computing descriptive statistics
df = DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]], index=['a', 'b', 'c', 'd'], columns=['one', 'two']) print df print df.sum() print df.sum(axis=1) (i.e. we sum horizontally and vertically)
32
Sum vertical and horizontal
one two a NaN b c NaN NaN d one two dtype: float64 a b c NaN d
33
skipna=False print df.mean(axis=1, skipna=False) print df.idxmax()
print df.cumsum() will not skip Nan calculates mean, id of max, and cumulative sum
34
Mean, idxmax, cumsum() a NaN b 1.300 c NaN d -0.275 dtype: float64
one two a NaN b c NaN NaN d one b two d dtype: object
35
print df.describe() one two count 3.000000 2.000000
mean std min 25% 50% 75% max
36
Unique values, value counts, and membership
obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c']) print obj uniques = obj.unique() print uniques print obj.value_counts()
37
Unique and counts ['c' 'a' 'd' 'b'] c 3 a 3 b 2 d 1 dtype: int64 0 c
dtype: object ['c' 'a' 'd' 'b'] c 3 a 3 b 2 d 1 dtype: int64
38
membership mask = obj.isin(['b', 'c']) print mask print obj[mask]
39
Obj, membership, filtering
0 c 1 a 2 d 3 a 4 a 5 b 6 b 7 c 8 c dtype: object 0 True 1 False 2 False 3 False 4 False 5 True 6 True 7 True 8 True dtype: bool 0 c 5 b 6 b 7 c 8 c dtype: object
40
Handling missing data string_data = Series(['aardvark', 'artichoke', np.nan, 'avocado']) print string_data print string_data.isnull() string_data[0] = None
41
Isnull(), np.nan, None 0 aardvark 1 artichoke 2 NaN 3 avocado
dtype: object 0 False 1 False 2 True 3 False dtype: bool 0 True 1 False 2 True 3 False dtype: bool
42
Filtering out missing data
from numpy import nan as NA data = Series([1, NA, 3.5, NA, 7]) print data print data.dropna() print data[data.notnull()] (the last two are equivalent)
43
Drops NaN 0 1.0 0 1.0 1 NaN 2 3.5 2 3.5 4 7.0 3 NaN dtype: float64
1 NaN 3 NaN dtype: float64
44
data = DataFrame( [[1., 6.5, 3.], [1., NA, NA], [NA, NA, NA], [NA, 6.5, 3.]]) cleaned = data.dropna() print data print cleaned
45
With NaN dropped 0 1 2 0 1 2 0 1 6.5 3 0 1 6.5 3 1 1 NaN NaN
NaN NaN 2 NaN NaN NaN 3 NaN Any row with NaN is dropped
46
Passing how='all' will only drop rows that are all NA:
data = DataFrame([[1., 6.5, 3.], [1., NA, NA], [NA, NA, NA], [NA, 6.5, 3.]]) cleaned = data.dropna() print data print cleaned
47
Only full NaN rows dropped
NaN NaN 2 NaN NaN NaN 3 NaN NaN NaN 3 NaN Rows with some NaN survive
48
Dropping columns data[4] = NA print data
print data.dropna(axis=1, how='all')
49
Column 4 is dropped as all NaN
NaN NaN NaN NaN 2 NaN NaN NaN NaN 3 NaN NaN NaN NaN 2 NaN NaN NaN 3 NaN
50
Filling in missing data
data = Series(np.random.randn(10), index=[ ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'], [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]]) print data
51
Hierarchical indexing
b c d
52
MultiIndex print data.index dtype: float64 MultiIndex(
levels=[[u'a', u'b', u'c', u'd'], [1, 2, 3]], labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])
53
Using hierarchical indexes.
print data['b'] dtype: float64 print data[:, 2] a b c d dtype: float64
54
unstack() stack() Unstack takes 1D data and converts into 2D
b c d dtype: float64 a b c NaN d NaN
55
Both Axis are hierarchical
frame = DataFrame(np.arange(12).reshape((4, 3)),index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]], columns=[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']]) print frame
56
output Ohio Colorado Green Red Green a 1 0 1 2 2 3 4 5 b 1 6 7 8
b
57
Give index and columns names
frame.index.names = ['key1', 'key2'] frame.columns.names = ['state', 'color'] print print frame
58
output state Ohio Colorado color Green Red Green key1 key2 a 1 0 1 2
b
59
print frame['Ohio'] color Green Red key1 key2 a 1 0 1 2 3 4 b 1 6 7
b
60
We can analyse the MultiIndex
<class 'pandas.core. index.MultiIndex'> ('a', 1) ('a', 2) <type 'tuple'> b <type 'str'> <type 'numpy.int64'> 4 type(frame.index) frame.index[0] frame.index[1] type(frame.index[0]) frame.index[3][0] type(frame.index[3][0]) type(frame.index[3][1]) len(frame.index)
61
Reordering and sorting levels
print frame state Ohio Colorado color Green Red Green key1 key2 a b
62
Swap keys print frame.swaplevel('key1', 'key2') state Ohio Colorado
color Green Red Green key2 key1 1 a 2 a 1 b 2 b
63
Sorting levels print frame.sortlevel(0) state Ohio Colorado color Green Red Green key1 key2 a b print frame.sortlevel(1) state Ohio Colorado color Green Red Green key1 key2 a b a b
64
Summary statistics by level
frame.sum(level='key2') state Ohio Colorado color Green Red Green key frame.sum(level='color', axis=1) color Green Red key1 key2 a b
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.