Pandas John R. Woodward.

An ndarray in NumPy is like a matrix or vector in mathematics. Pandas (the name comes from "panel data") provides spreadsheet-like data structures on top of it.

Import statements

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pandas import DataFrame, Series   # the later examples use the bare names

spreadsheet

DataFrame A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row and a column index; it can be thought of as a dict of Series, all sharing the same index.

Create DataFrame from a dictionary

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = pd.DataFrame(data)
print(frame)

OUTPUT

   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002
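The "dict of Series" view from the previous slide can be checked directly; a minimal sketch using the same frame:

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = pd.DataFrame(data)

# Selecting a column gives a Series that shares the DataFrame's row index
col = frame['state']
print(type(col).__name__)                    # Series
print(list(col.index) == list(frame.index))  # True
```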

Adding DataFrames

df1 = DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                index=['Ohio', 'Texas', 'Colorado'])
df2 = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                index=['Utah', 'Ohio', 'Texas', 'Oregon'])
print(df1)
print(df2)
print(df1 + df2)

df1, df2, df1 + df2

          b  c  d
Ohio      0  1  2
Texas     3  4  5
Colorado  6  7  8

        b   d   e
Utah    0   1   2
Ohio    3   4   5
Texas   6   7   8
Oregon  9  10  11

           b   c   d   e
Colorado NaN NaN NaN NaN
Ohio       3 NaN   6 NaN
Oregon   NaN NaN NaN NaN
Texas      9 NaN  12 NaN
Utah     NaN NaN NaN NaN

fill_value=0

df1 = DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))
print(df1)
print(df2)
print(df1 + df2)
print(df1.add(df2, fill_value=0))

df1 and df2

   a  b   c   d
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11

    a   b   c   d   e
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19

Without and with fill_value=0

    a   b   c   d   e
0   0   2   4   6 NaN
1   9  11  13  15 NaN
2  18  20  22  24 NaN
3 NaN NaN NaN NaN NaN

    a   b   c   d   e
0   0   2   4   6   4
1   9  11  13  15   9
2  18  20  22  24  14
3  15  16  17  18  19
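add is one of a family of flexible arithmetic methods (sub, mul, div, and r-prefixed reversed variants) that all accept fill_value; a small sketch using the same df1 and df2:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))

# sub/mul/div take fill_value just like add
print(df1.sub(df2, fill_value=0))

# r-prefixed methods flip the arguments: df1.rdiv(1) is 1 / df1
print(df1.rdiv(1))
```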

Operations between DataFrame and Series

arr = np.arange(12.).reshape((3, 4))
print(arr)
print(arr[0])
print(arr - arr[0])

Output (broadcasting)

[[  0.   1.   2.   3.]
 [  4.   5.   6.   7.]
 [  8.   9.  10.  11.]]
[ 0.  1.  2.  3.]
[[ 0.  0.  0.  0.]
 [ 4.  4.  4.  4.]
 [ 8.  8.  8.  8.]]
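The same broadcasting applies to pandas objects: by default, DataFrame-Series arithmetic matches the Series index against the DataFrame's columns and broadcasts down the rows. A sketch (the index labels are assumed for illustration):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((3, 4)),
                     columns=list('abcd'),
                     index=['Utah', 'Ohio', 'Texas'])
series = frame.iloc[0]   # the first row as a Series

# The Series index ('a'..'d') is matched against the columns,
# then the subtraction is broadcast down the rows
print(frame - series)
```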

Function application and mapping

frame = DataFrame(np.random.randn(4, 3), columns=list('bde'),
                  index=['Utah', 'Ohio', 'Texas', 'Oregon'])
print(frame)
print(np.abs(frame))

Calculates the absolute value

             b         d         e
Utah   -0.204708  0.478943 -0.519439
Ohio   -0.555730  1.965781  1.393406
Texas   0.092908  0.281746  0.769023
Oregon  1.246435  1.007189 -1.296221

             b         d         e
Utah    0.204708  0.478943  0.519439
Ohio    0.555730  1.965781  1.393406
Texas   0.092908  0.281746  0.769023
Oregon  1.246435  1.007189  1.296221

Defining new functions

f = lambda x: x.max() - x.min()
print(frame.apply(f))
print(frame.apply(f, axis=1))

Output

b    1.802165
d    1.684034
e    2.689627
dtype: float64
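The slide also calls frame.apply(f, axis=1), whose output is not shown; since the frame above holds random numbers, here is a deterministic sketch of column-wise versus row-wise apply:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

f = lambda x: x.max() - x.min()
print(frame.apply(f))          # down each column: all 9.0
print(frame.apply(f, axis=1))  # across each row: all 2.0
```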

Sorting

obj = Series([1, 3, 2, 4], index=['d', 'a', 'b', 'c'])
print(obj)
print(obj.sort_index())
print(obj.sort_values())

obj, sort_index(), sort_values()

d    1
a    3
b    2
c    4
dtype: int64

a    3
b    2
c    4
d    1
dtype: int64

d    1
b    2
a    3
c    4
dtype: int64

Sorting DataFrames

frame = DataFrame(np.arange(8).reshape((2, 4)),
                  index=['three', 'one'], columns=['d', 'a', 'b', 'c'])
print(frame.sort_index())
print(frame.sort_index(axis=1))
print(frame.sort_index(axis=1, ascending=False))

Sorted DataFrames

       d  a  b  c
one    4  5  6  7
three  0  1  2  3

       a  b  c  d
three  1  2  3  0
one    5  6  7  4

       d  c  b  a
three  0  3  2  1
one    4  7  6  5

Sorting by multiple columns

frame = DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
print(frame)
print(frame.sort_values(by='b'))
print(frame.sort_values(by=['a', 'b']))

(older pandas spelled this frame.sort_index(by='b'))

output

   a  b
0  0  4
1  1  7
2  0 -3
3  1  2

   a  b
2  0 -3
3  1  2
0  0  4
1  1  7

   a  b
2  0 -3
0  0  4
3  1  2
1  1  7

Ranking

obj = Series([7, -5, 7, 4, 2, 0, 4])
print(obj)
print(obj.rank())

Ties are split (each gets the average of their ranks)

0    7
1   -5
2    7
3    4
4    2
5    0
6    4
dtype: int64

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64
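rank's method parameter controls tie handling; besides the default 'average' shown above, 'first' breaks ties by order of appearance and 'min' gives the whole tied group its smallest rank. A sketch with the same Series:

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

print(obj.rank(method='first'))  # ties broken by position: the first 7 ranks 6, the second 7
print(obj.rank(method='min'))    # both 7s get rank 6, both 4s get rank 4
```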

Axis indexes with duplicate values

obj = Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
print(obj)
print(obj.index.is_unique)
print(obj['a'])
print(obj['c'])

Output can be a Series or a scalar

a    0
a    1
b    2
b    3
c    4
dtype: int64

False

a    0
a    1
dtype: int64

4
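A quick sketch confirming that indexing a duplicated label returns a Series while a unique label returns a scalar:

```python
import pandas as pd

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

# 'a' is duplicated, so obj['a'] is a Series; 'c' is unique, so obj['c'] is a scalar
print(isinstance(obj['a'], pd.Series))   # True
print(obj['c'])                          # 4
```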

Summarizing and computing descriptive statistics

df = DataFrame([[1.4, np.nan], [7.1, -4.5],
                [np.nan, np.nan], [0.75, -1.3]],
               index=['a', 'b', 'c', 'd'], columns=['one', 'two'])
print(df)
print(df.sum())        # sums each column (vertically)
print(df.sum(axis=1))  # sums each row (horizontally)

Sum, vertical and horizontal

    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3

one    9.25
two   -5.80
dtype: float64

a    1.40
b    2.60
c     NaN
d   -0.55
dtype: float64

skipna=False

print(df.mean(axis=1, skipna=False))  # NaN is not skipped, so any row containing NaN gives NaN
print(df.idxmax())                    # index label of each column's maximum
print(df.cumsum())                    # cumulative sum down each column

Mean, idxmax, cumsum

a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64

one    b
two    d
dtype: object

    one  two
a  1.40  NaN
b  8.50 -4.5
c   NaN  NaN
d  9.25 -5.8

print(df.describe())

            one       two
count  3.000000  2.000000
mean   3.083333 -2.900000
std    3.493685  2.262742
min    0.750000 -4.500000
25%    1.075000 -3.700000
50%    1.400000 -2.900000
75%    4.250000 -2.100000
max    7.100000 -1.300000
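describe is not limited to numeric columns; on object data it reports count, unique, top and freq instead. A small sketch:

```python
import pandas as pd

obj = pd.Series(['a', 'a', 'b', 'c'])
print(obj.describe())   # count 4, unique 3, top 'a', freq 2
```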

Unique values, value counts, and membership

obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
print(obj)
uniques = obj.unique()
print(uniques)
print(obj.value_counts())

Unique values and counts

0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object

['c' 'a' 'd' 'b']

c    3
a    3
b    2
d    1
dtype: int64

Membership

mask = obj.isin(['b', 'c'])
print(mask)
print(obj[mask])

obj, membership mask, filtered result

0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

0    c
5    b
6    b
7    c
8    c
dtype: object

Handling missing data

string_data = Series(['aardvark', 'artichoke', np.nan, 'avocado'])
print(string_data)
print(string_data.isnull())
string_data[0] = None          # None is also treated as missing
print(string_data.isnull())

isnull(), np.nan, None

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

0    False
1    False
2     True
3    False
dtype: bool

0     True
1    False
2     True
3    False
dtype: bool

Filtering out missing data

from numpy import nan as NA
data = Series([1, NA, 3.5, NA, 7])
print(data)
print(data.dropna())
print(data[data.notnull()])   # the last two are equivalent

Drops NaN

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

0    1.0
2    3.5
4    7.0
dtype: float64

(dropna() and the notnull() filter print the same result)

data = DataFrame([[1., 6.5, 3.], [1., NA, NA],
                  [NA, NA, NA], [NA, 6.5, 3.]])
cleaned = data.dropna()
print(data)
print(cleaned)

With NaN dropped

     0    1   2
0    1  6.5   3
1    1  NaN NaN
2  NaN  NaN NaN
3  NaN  6.5   3

   0    1  2
0  1  6.5  3

Any row containing a NaN is dropped.

Passing how='all' will only drop rows that are all NA:

data = DataFrame([[1., 6.5, 3.], [1., NA, NA],
                  [NA, NA, NA], [NA, 6.5, 3.]])
cleaned = data.dropna(how='all')
print(data)
print(cleaned)

Only all-NaN rows are dropped

     0    1   2
0    1  6.5   3
1    1  NaN NaN
2  NaN  NaN NaN
3  NaN  6.5   3

     0    1   2
0    1  6.5   3
1    1  NaN NaN
3  NaN  6.5   3

Rows with only some NaN survive.
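Between dropping any-NaN rows and only all-NaN rows sits the thresh parameter, which keeps rows having at least that many non-NaN values; a sketch on the same data:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])

# Keep only rows with at least two non-NaN values (rows 0 and 3)
print(data.dropna(thresh=2))
```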

Dropping columns

data[4] = NA
print(data)
print(data.dropna(axis=1, how='all'))

Column 4 is dropped as it is all NaN

     0    1   2   4
0    1  6.5   3 NaN
1    1  NaN NaN NaN
2  NaN  NaN NaN NaN
3  NaN  6.5   3 NaN

     0    1   2
0    1  6.5   3
1    1  NaN NaN
2  NaN  NaN NaN
3  NaN  6.5   3
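Instead of dropping missing data you can fill it; fillna accepts a constant or a per-column dict. A sketch on the same frame:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1., 6.5, 3.], [1., np.nan, np.nan],
                     [np.nan, np.nan, np.nan], [np.nan, 6.5, 3.]])

print(data.fillna(0))                   # replace every NaN with 0
print(data.fillna({1: 0.5, 2: -1.0}))   # different fill value per column
```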

Hierarchical indexing

data = Series(np.random.randn(10),
              index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                     [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])
print(data)

Hierarchically indexed output

a  1   -0.204708
   2    0.478943
   3   -0.519439
b  1   -0.555730
   2    1.965781
   3    1.393406
c  1    0.092908
   2    0.281746
d  2    0.769023
   3    1.246435
dtype: float64

MultiIndex

print(data.index)

MultiIndex(levels=[[u'a', u'b', u'c', u'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3],
                   [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])

Using hierarchical indexes

print(data['b'])

1   -0.555730
2    1.965781
3    1.393406
dtype: float64

print(data[:, 2])

a    0.478943
b    1.965781
c    0.281746
d    0.769023
dtype: float64

unstack() and stack()

unstack() pivots 1D hierarchically indexed data into a 2D DataFrame:

a  1   -0.204708
   2    0.478943
   3   -0.519439
b  1   -0.555730
   2    1.965781
   3    1.393406
c  1    0.092908
   2    0.281746
d  2    0.769023
   3    1.246435
dtype: float64

          1         2         3
a -0.204708  0.478943 -0.519439
b -0.555730  1.965781  1.393406
c  0.092908  0.281746       NaN
d       NaN  0.769023  1.246435
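stack() is the inverse of unstack(); a deterministic sketch of the round trip (the slide's values are random):

```python
import pandas as pd

data = pd.Series([1, 2, 3, 4],
                 index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]])

wide = data.unstack()    # inner index level becomes the columns
print(wide)
print(wide.stack())      # back to the hierarchically indexed Series
```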

Both axes can be hierarchical

frame = DataFrame(np.arange(12).reshape((4, 3)),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=[['Ohio', 'Ohio', 'Colorado'],
                           ['Green', 'Red', 'Green']])
print(frame)

output

      Ohio     Colorado
     Green Red    Green
a 1      0   1        2
  2      3   4        5
b 1      6   7        8
  2      9  10       11

Give the index and columns names

frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']
print(frame)

output

state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11

print(frame['Ohio'])

color      Green  Red
key1 key2
a    1         0    1
     2         3    4
b    1         6    7
     2         9   10

We can analyse the MultiIndex

type(frame.index)        # <class 'pandas.core.index.MultiIndex'>
frame.index[0]           # ('a', 1)
frame.index[1]           # ('a', 2)
type(frame.index[0])     # <type 'tuple'>
frame.index[3][0]        # 'b'
type(frame.index[3][0])  # <type 'str'>
type(frame.index[3][1])  # <type 'numpy.int64'>
len(frame.index)         # 4

Reordering and sorting levels

print(frame)

state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11

Swap keys

print(frame.swaplevel('key1', 'key2'))

state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
2    a        3   4        5
1    b        6   7        8
2    b        9  10       11

Sorting levels

print(frame.sort_index(level=0))   # older pandas: frame.sortlevel(0)

state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11

print(frame.sort_index(level=1))

state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
b    1        6   7        8
a    2        3   4        5
b    2        9  10       11

Summary statistics by level

frame.sum(level='key2')   # deprecated spelling in modern pandas

state  Ohio     Colorado
color Green Red    Green
key2
1         6   8       10
2        12  14       16

frame.sum(level='color', axis=1)

color      Green  Red
key1 key2
a    1         2    1
     2         8    4
b    1        14    7
     2        20   10
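sum(level=...) was removed in pandas 2.0; groupby(level=...) is the current spelling, with a transpose for the column axis since axis=1 groupby was also removed. A sketch using the frame from the earlier slides:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# Row-level summary: group on the 'key2' index level
print(frame.groupby(level='key2').sum())

# Column-level summary: transpose, group on 'color', transpose back
print(frame.T.groupby(level='color').sum().T)
```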