Python Pandas DataFrame
By: Mrs Sangeeta M Chauhan, Gwalior
https://pythonclassroomdiary.wordpress.com
What is a DataFrame?
A DataFrame is a 2D (two-dimensional) data structure: data is arranged in tabular form, in rows and columns. In other words, a pandas DataFrame is similar to an Excel sheet. Let's understand it through an example.
Create DataFrame
A pandas DataFrame can be created using the following constructor:
pandas.DataFrame(data[, index, columns, dtype, copy])
The parameters of the constructor are as follows:
Sr.No  Parameter  Description
1      data       Takes various forms such as ndarray, Series, map, lists, dict, constants, and also another DataFrame.
2      index      Row labels for the resulting frame. Optional; defaults to np.arange(n) if no index is passed.
3      columns    Column labels for the resulting frame. Optional; defaults to np.arange(n) if no column labels are passed.
4      dtype      Data type of each column.
5      copy       Whether to copy the data from the input. The default is False.
A pandas DataFrame can be created using various inputs like
1. Lists
2. Dictionary
3. Series
4. NumPy ndarrays
5. Another DataFrame
Creating an Empty DataFrame
>>> import pandas as pd
>>> df = pd.DataFrame()
>>> df
Empty DataFrame
Columns: []
Index: []
Create a DataFrame from Lists
Example 1 (Simple List)
>>> MyList = [10, 20, 30, 40]
>>> MyFrame = pd.DataFrame(MyList)
>>> MyFrame
    0
0  10
1  20
2  30
3  40
Example 2 (Nested List)
>>> Friends = [['Shraddha', 'Doctor'], ['Shanti', 'Teacher'], ['Monica', 'Engineer']]
>>> MyFrame = pd.DataFrame(Friends, columns=['Name', 'Occupation'])
>>> MyFrame
       Name Occupation
0  Shraddha     Doctor
1    Shanti    Teacher
2    Monica   Engineer
Creation of a DataFrame from Dictionary of ndarrays / Lists
All the ndarrays must be of the same length. If an index is passed, its length must equal the length of the arrays. If no index is passed, the index defaults to range(n), where n is the array length.
Example 1 (without index)
>>> data = {'Name': ['Shraddha', 'Shanti', 'Monica', 'Yogita'],
...         'Age': [28, 34, 29, 39]}
>>> df = pd.DataFrame(data)
>>> df
       Name  Age
0  Shraddha   28
1    Shanti   34
2    Monica   29
3    Yogita   39
Example 2 (With Index)
>>> data = {'Name': ['Shraddha', 'Shanti', 'Monica', 'Yogita'],
...         'Age': [28, 34, 29, 39]}
>>> df = pd.DataFrame(data,
...                   index=['Friend1', 'Friend2', 'Relative1', 'Relative2'])
>>> df
               Name  Age
Friend1    Shraddha   28
Friend2      Shanti   34
Relative1    Monica   29
Relative2    Yogita   39
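The slide title also mentions ndarrays, while both examples use lists. The following is a minimal sketch of the same idea with NumPy arrays and with Series. The City column and its values are made up for illustration; they are not part of the original data.

```python
import numpy as np
import pandas as pd

# Dictionary of equal-length NumPy arrays: the keys become column names.
arrays = {
    "Name": np.array(["Shraddha", "Shanti", "Monica"]),
    "Age": np.array([28, 34, 29]),
}
df_arrays = pd.DataFrame(arrays, index=["Friend1", "Friend2", "Relative1"])

# Dictionary of Series: rows are aligned on the Series index labels, and a
# label missing from one Series gives NaN in that column.
# The City values are hypothetical.
series = {
    "Age": pd.Series([28, 34], index=["Friend1", "Friend2"]),
    "City": pd.Series(["Gwalior", "Bhopal", "Indore"],
                      index=["Friend1", "Friend2", "Relative1"]),
}
df_series = pd.DataFrame(series)

print(df_arrays)
print(df_series)
```

Unlike lists and ndarrays, Series do not need to be the same length, because pandas aligns them by label.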
Create a DataFrame from List of Dictionaries
Here we pass a list of dictionaries to create a DataFrame. The dictionary keys are taken as column names by default.
Example 1:
>>> Mydict = [{'Won': 15, 'Loose': 2}, {'Won': 5, 'Loose': 10},
...           {'Won': 8, 'Loose': 9}, {'Won': 4}]
>>> df = pd.DataFrame(Mydict)
>>> df
   Loose  Won
0    2.0   15
1   10.0    5
2    9.0    8
3    NaN    4
Notice that the missing value is stored as NaN (Not a Number). Newer pandas versions keep the keys in insertion order, so the columns may appear as Won, Loose.
Example 2: Changing Index
>>> Mydict = [{'Won': 15, 'Loose': 2}, {'Won': 5, 'Loose': 10}, {'Won': 8, 'Loose': 9}]
>>> df = pd.DataFrame(Mydict, index=['India', 'Pakistan', 'Australia'])
>>> df
           Loose  Won
India          2   15
Pakistan      10    5
Australia      9    8
Example 3
We can also create a DataFrame by specifying a list of dictionaries, row indices, and column indices.
>>> L_dict = [{'Maths': 78, 'Chemistry': 78, 'Physics': 87},
...           {'Maths': 67, 'Chemistry': 70},
...           {'Physics': 77, 'Maths': 87}]
>>> df1 = pd.DataFrame(L_dict, index=['Student1', 'Student2', 'Student3'],
...                    columns=['Physics', 'Chemistry', 'Maths'])
>>> df1
          Physics  Chemistry  Maths
Student1     87.0       78.0     78
Student2      NaN       70.0     67
Student3     77.0        NaN     87
df2 is created with only 2 columns
>>> df2 = pd.DataFrame(L_dict, index=['Student1', 'Student2', 'Student3'],
...                    columns=['Chemistry', 'Maths'])
>>> df2
          Chemistry  Maths
Student1       78.0     78
Student2       70.0     67
Student3        NaN     87
Creating df3 by specifying three column names (new column English)
>>> df3 = pd.DataFrame(L_dict, index=['Student1', 'Student2', 'Student3'],
...                    columns=['English', 'Chemistry', 'Maths'])
>>> df3
          English  Chemistry  Maths
Student1      NaN       78.0     78
Student2      NaN       70.0     67
Student3      NaN        NaN     87
English does not occur as a key in any dictionary, so the whole column is NaN.
Addition of New Column & Row
Column Addition
>>> L_dict = [{'Maths': 78, 'Chemistry': 78, 'Physics': 87},
...           {'Maths': 67, 'Chemistry': 70},
...           {'Physics': 77, 'Maths': 87, 'Chemistry': 90}]
>>> df3 = pd.DataFrame(L_dict, index=['Student1', 'Student2', 'Student3'],
...                    columns=['English', 'Chemistry', 'Maths'])
>>> df3['Physics'] = [45, 56, 65]
>>> df3
          English  Chemistry  Maths  Physics
Student1      NaN         78     78       45
Student2      NaN         70     67       56
Student3      NaN         90     87       65
New column Physics added
We can also update column data using the same method
>>> df3['English'] = [78, 98, 89]
>>> df3
          English  Chemistry  Maths  Physics
Student1       78         78     78       45
Student2       98         70     67       56
Student3       89         90     87       65
The NaN values of column English are replaced with the new values
We can also add a new column using data stored in the existing frame
>>> df3['Total'] = df3.English + df3.Chemistry + df3.Maths + df3.Physics
>>> df3
          English  Chemistry  Maths  Physics  Total
Student1       78         78     78       45    279
Student2       98         70     67       56    291
Student3       89         90     87       65    331
A new column Total has been added, holding the total of the marks in the other subjects
ASSIGNING AND COPYING OF DATAFRAME
Assigning: changes are reflected in both DataFrames
Copying: changes are reflected in the 2nd DataFrame only
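The difference between assigning and copying can be sketched as follows. The frame and the names alias and clone are illustrative assumptions; the original slide's own example is not available.

```python
import pandas as pd

df1 = pd.DataFrame({"Maths": [78, 67], "Physics": [87, 77]},
                   index=["Student1", "Student2"])

alias = df1          # assignment: both names refer to the same object
clone = df1.copy()   # copy(): an independent DataFrame (deep copy by default)

alias.loc["Student1", "Maths"] = 100   # also visible through df1
clone.loc["Student2", "Physics"] = 0   # df1 is left untouched

print(df1)
print(clone)
```

Use copy() whenever the original data must stay unchanged while you experiment on a second frame.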
SELECTION AND INDEXING
Methods covered in this section are:
Selecting data by row numbers (.iloc)
Selecting data by label or by a conditional statement (.loc)
Selecting data at a particular row and column (.at)
loc gets rows (or columns) with particular labels from the index.
iloc gets rows (or columns) at particular positions in the index (so it only takes integers).
ACCESSING DIFFERENT ROWS/ COLUMNS
Use of iloc
Here end index is excluded
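A small sketch of position-based and label-based selection. The XI frame below is an assumed stand-in with made-up names and marks; only the row labels stu1, stu2 and the column names follow the slides.

```python
import pandas as pd

# Hypothetical class data for illustration.
XI = pd.DataFrame(
    {"Name": ["Asha", "Ravi", "Meena", "Karan"],
     "Physics": [78, 65, 90, 82],
     "Chemistry": [88, 70, 75, 91]},
    index=["stu1", "stu2", "stu3", "stu4"],
)

by_position = XI.iloc[0:2]            # positions 0 and 1; position 2 is excluded
by_label = XI.loc["stu1":"stu2"]      # label slices include the end label
cell_block = XI.iloc[1:3, 0:2]        # rows 1-2 and columns 0-1
passed = XI.loc[XI["Physics"] > 80]   # conditional selection with a boolean mask

print(by_position)
print(by_label)
print(cell_block)
print(passed)
```

Note the asymmetry: iloc slices exclude the end position, while loc slices include the end label.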
Use of at with DataFrame
Access a single value for a row/column label pair.
<df>.at[rowname, colname]
<df>.loc[rowname].at[colname]
XI.at['stu2', 'Name']
XI.at['stu2', 'Physics']
XI.loc['stu2'].at['Chemistry']
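The calls above can be tried on a small frame. The contents of XI here are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical class data for illustration.
XI = pd.DataFrame(
    {"Name": ["Asha", "Ravi", "Meena"],
     "Physics": [78, 65, 90],
     "Chemistry": [88, 70, 75]},
    index=["stu1", "stu2", "stu3"],
)

name = XI.at["stu2", "Name"]             # single value by row and column label
chem = XI.loc["stu2"].at["Chemistry"]    # same idea, via the row Series
XI.at["stu2", "Physics"] = 70            # .at can also set a single value

print(name, chem, XI.at["stu2", "Physics"])
```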
Row Addition
To add a row, assign a list of values to a new row index with loc (here df3 has only the three columns English, Chemistry and Maths)
>>> df3.loc['Student4'] = [45, 67, 45]
>>> df3
          English  Chemistry  Maths
Student1       78         78     78
Student2       98         70     67
Student3       89         90     87
Student4       45         67     45
Use of iloc
Deletion of an Existing Column/Row from Data Frame
Row Deletion
>>> df3.drop('Student3')
          English  Chemistry  Maths  Physics
Student1       78         78     78       45
Student2       98         70     67       56
drop() returns a new DataFrame; df3 itself is unchanged unless the result is assigned back.
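A minimal sketch of deleting rows and columns with drop(), using the marks from the earlier slides. The variable names without_row and without_col are illustrative.

```python
import pandas as pd

df3 = pd.DataFrame(
    {"English": [78, 98, 89], "Chemistry": [78, 70, 90],
     "Maths": [78, 67, 87], "Physics": [45, 56, 65]},
    index=["Student1", "Student2", "Student3"],
)

without_row = df3.drop("Student3")            # new frame; df3 keeps all three rows
without_col = df3.drop(columns=["Physics"])   # drop a column by label
df3 = df3.drop(index=["Student1"])            # assign back to keep the change

print(without_row)
print(without_col)
print(df3)
```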
Descriptive Statistics with Pandas
Pandas also offers many useful statistical and aggregate functions, of which we are going to discuss the following.
DataFrame.min(axis=None, skipna=None, numeric_only=None)
Parameters:
axis : The axis along which the minimum is computed (0 or 'index' gives one result per column, 1 or 'columns' gives one result per row).
skipna : Exclude NA/null values when computing the result.
numeric_only : Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
Use of min()
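The effect of the axis and skipna parameters can be sketched like this. The marks are made-up values, with one NaN included on purpose.

```python
import numpy as np
import pandas as pd

# Hypothetical marks; stu3 has no Physics mark.
XI = pd.DataFrame(
    {"Physics": [78, 65, np.nan], "Chemistry": [88, 70, 75]},
    index=["stu1", "stu2", "stu3"],
)

col_min = XI.min()               # column-wise minimum; NaN is skipped
row_min = XI.min(axis=1)         # row-wise minimum
strict = XI.min(skipna=False)    # NaN propagates when skipna=False

print(col_min)
print(row_min)
print(strict)
```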
Calculating Mode (the most frequent value)
Compare both Data & Result
Calculating median
count() counts the non-null elements of each column (by default)
sum() by column
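The four functions named on these slides can be tried together on one small frame. The marks below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical marks with one missing Physics value.
marks = pd.DataFrame({"Physics": [78, 65, 78, np.nan],
                      "Chemistry": [88, 70, 75, 70]})

modes = marks.mode()          # DataFrame: one row per mode value
medians = marks.median()      # middle value of each column
counts = marks.count()        # non-null values per column
col_sums = marks.sum()        # sum of each column
row_sums = marks.sum(axis=1)  # sum of each row

print(modes)
print(medians)
print(counts)
print(col_sums)
print(row_sums)
```

mode() returns a DataFrame rather than a Series because a column can have more than one mode.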
APPLYING FUNCTIONS ON SUBSET OF DATA FRAME
<DF>[[<COL NAME1>, <COL NAME2>, ….]].<fun_nm>()
Example: XI['Chemistry'].min()
APPLYING FUNCTIONS ON ROW OF DATA FRAME <DF>.loc[<row_index>,….].<function_name>
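Both patterns, a function on selected columns and a function on one row, might look like this in practice. The frame contents are assumed for illustration.

```python
import pandas as pd

# Hypothetical class data for illustration.
XI = pd.DataFrame(
    {"Name": ["Asha", "Ravi", "Meena"],
     "Physics": [78, 65, 90],
     "Chemistry": [88, 70, 75]},
    index=["stu1", "stu2", "stu3"],
)

chem_min = XI["Chemistry"].min()                             # one column
subset_max = XI[["Physics", "Chemistry"]].max()              # several columns
row_total = XI.loc["stu2", ["Physics", "Chemistry"]].sum()   # one row

print(chem_min)
print(subset_max)
print(row_total)
```

Restricting the row to the numeric columns avoids mixing the Name string into the sum.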
ADVANCED OPERATIONS ON DATAFRAME
1. PIVOTING: In pandas, the pivot table function takes a simple DataFrame as input and performs grouped operations that provide a multidimensional summary of the data.
pivot() function
DataFrame.pivot(index, columns, values)
Produces a pivot table based on 3 columns of the DataFrame. It uses the unique values from index / columns to form the axes and fills the cells with values.
Parameters:
index : Labels to use to make the new frame's index
columns : Labels to use to make the new frame's columns
values : Values to use for populating the new frame's values
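A minimal sketch of pivot() on a small frame. The column names follow the slides' result data, but the rows and percentages are made up.

```python
import pandas as pd

# Hypothetical result data: each (Class, Subject) pair occurs once.
result = pd.DataFrame({
    "Class": ["XI", "XI", "XII", "XII"],
    "Subject": ["Physics", "Maths", "Physics", "Maths"],
    "ResultPer": [92.0, 88.5, 98.0, 99.5],
})

# Unique Class values become the rows, unique Subject values the columns.
wide = result.pivot(index="Class", columns="Subject", values="ResultPer")
print(wide)
```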
CSV file: resultAnalyiTr
S.no  Class  Teacher Name   Subject    ResultPer
1     XII    RK Saxena      Physics    98
2            Neelam Sharma  Chemistry  100
3            PP Singh       Maths      99.5
4            Ravita         CS
What if we have duplicate values in the column and index? pivot() cannot reshape such data, but we can use pivot_table() instead, which aggregates the duplicates.
pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True)
Let's consider another table (with duplicate values)
Let's consider a modified CSV file
pivotTableEx2.py
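The contents of pivotTableEx2.py are not shown, so the following is only a sketch of the idea under assumed data: pivot() fails on duplicate index/column pairs, while pivot_table() aggregates them.

```python
import pandas as pd

# Hypothetical data: (XII, Physics) occurs twice.
result = pd.DataFrame({
    "Class": ["XII", "XII", "XII", "XI"],
    "Subject": ["Physics", "Physics", "Maths", "Maths"],
    "ResultPer": [98.0, 90.0, 99.5, 80.0],
})

# pivot() cannot decide which duplicate to keep and raises ValueError.
try:
    result.pivot(index="Class", columns="Subject", values="ResultPer")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table() combines duplicates with aggfunc and fills empty cells.
table = result.pivot_table(values="ResultPer", index="Class",
                           columns="Subject", aggfunc="mean", fill_value=0)
print(pivot_failed)
print(table)
```

Here the two Physics results for class XII are averaged, and the missing XI/Physics cell is filled with 0 instead of NaN.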
SORTING
DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
Parameters:
by : name or list of names of the columns on which the data is sorted
axis : {0 or 'index', 1 or 'columns'}, default 0. Sort index/rows versus columns
ascending : default True (ascending). Specify a list for multiple sort orders
inplace : boolean, default False. If True, sort the DataFrame without creating a new instance
kind : {'quicksort', 'mergesort', 'heapsort'}, optional. This option is only applied when sorting on a single column or label
na_position : {'first', 'last'}, optional, default 'last'. 'first' puts NaNs at the beginning, 'last' puts NaNs at the end
Use of df.sort_values()
[Outputs shown: DataFrame df, ascending order, descending order]
Descending order with NaN values at first position
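The three orderings named above can be reproduced roughly as follows. The teacher names come from the slides' CSV; treating Ravita's missing result as NaN is an assumption made here to show na_position.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Teacher Name": ["RK Saxena", "Neelam Sharma", "PP Singh", "Ravita"],
    "ResultPer": [98.0, 100.0, 99.5, np.nan],   # NaN is an assumed value
})

asc = df.sort_values(by="ResultPer")                     # NaN goes last
desc = df.sort_values(by="ResultPer", ascending=False)   # NaN still last
desc_nan_first = df.sort_values(by="ResultPer", ascending=False,
                                na_position="first")     # NaN moved to the top

print(asc)
print(desc)
print(desc_nan_first)
```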
Sorting on Multiple Columns
print("Sorting on Multiple(2) Columns")
print(df.sort_values(by=['Class', 'Teacher Name']))
print("Sorting on Multiple(3) Columns")
print(df.sort_values(['Class', 'Teacher Name', 'Year']))
Aggregate Functions
Function  Description
count     Number of non-null observations
sum       Sum of values
mean      Mean of values
mad       Mean absolute deviation (removed in pandas 2.0)
median    Arithmetic median of values
min       Minimum
max       Maximum
mode      Mode
std       Unbiased standard deviation
<Df>.<agg_fun_name>()
FUNCTION APPLICATION
A function (user-defined or from a library) can be applied on a DataFrame in multiple ways:
On the whole DataFrame: pipe()
Row/column wise: apply()
On individual elements: applymap()
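The three levels listed above can be sketched side by side. The helper add_allowance is hypothetical, and the fallback between map() and applymap() is an assumption added so the sketch runs on both older and newer pandas (applymap() was renamed to DataFrame.map() in pandas 2.1).

```python
import pandas as pd

df = pd.DataFrame({"Sal": [50000, 60000, 55000], "bonus": [3000, 4000, 5000]})

def add_allowance(frame, amount):
    """Hypothetical helper: add a flat amount to every value."""
    return frame + amount

# pipe(): the function receives the whole frame as its first argument.
whole = df.pipe(add_allowance, 1000)

# apply(): the function receives one column (or one row with axis=1).
col_range = df.apply(lambda col: col.max() - col.min())
row_total = df.apply(lambda row: row.sum(), axis=1)

# Element-wise: applymap() in older pandas, DataFrame.map() in pandas 2.1+.
elementwise = df.map if hasattr(df, "map") else df.applymap
halved = elementwise(lambda value: value / 2)

print(whole)
print(col_range)
print(row_total)
print(halved)
```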
[Diagram: the data is the input to the first pipe; the output of each pipe is the input to the next pipe]
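pipe(func_name, *args)
import numpy as np
import pandas as pd
d1 = {'Sal': [50000, 60000, 55000], 'bonus': [3000, 4000, 5000]}
df1 = pd.DataFrame(d1)
# Equivalent to: divide(power(add(df1, 2), 2), 3)
print(df1.pipe(np.add, 2).pipe(np.power, 2).pipe(np.divide, 3))
# pipe() passes the frame as the first argument, so use the unbound methods
print(df1.pipe(pd.DataFrame.add, 2).pipe(pd.DataFrame.divide, 3))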
pandas.DataFrame.info() is used to get a concise summary of the DataFrame.
pandas.DataFrame.describe() generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset's distribution, excluding NaN values.
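A short sketch of both calls on a small frame. The salary figures reuse the pipe() example, with one bonus replaced by a missing value as an assumption to show how NaN is excluded.

```python
import pandas as pd

df = pd.DataFrame({"Sal": [50000, 60000, 55000],
                   "bonus": [3000, 4000, None]})   # None is stored as NaN

df.info()                 # prints index range, column dtypes, non-null counts
summary = df.describe()   # count, mean, std, min, quartiles, max per column
print(summary)
```

info() prints its report and returns None, while describe() returns a DataFrame that can be indexed like any other.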
Grouping on a DataFrame's Column
<DF>.groupby(by=None, axis=0)
Any groupby operation involves one of the following operations on the original object:
1. Splitting the object
2. Applying a function
3. Combining the results
In many situations, we split the data into sets and apply some functionality on each subset. In the apply step, we can perform the following operations:
Aggregation: computing a summary statistic
Transformation: performing some group-specific operation
Filtration: discarding the data with some condition
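The three kinds of apply step listed above can be sketched on a small in-memory frame. The teacher names appear in the slides, but the percentages are made up.

```python
import pandas as pd

# Hypothetical results; not the contents of the slides' CSV file.
df = pd.DataFrame({
    "Teacher Name": ["Namrata", "Namrata", "PP Singh", "PP Singh", "Ravita"],
    "ResultPer": [90.0, 80.0, 99.0, 95.0, 70.0],
})
grp = df.groupby(by="Teacher Name")["ResultPer"]

aggregated = grp.mean()                             # aggregation: one value per group
centered = grp.transform(lambda s: s - s.mean())    # transformation: one value per row
kept = df.groupby("Teacher Name").filter(lambda g: len(g) > 1)  # filtration

print(aggregated)
print(centered)
print(kept)
```

Aggregation shrinks the data to one row per group, transformation keeps the original length, and filtration drops whole groups (here, teachers with a single result).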
import pandas as pd
import numpy as np
df = pd.read_csv("resultAnalysis.csv")
print(df)
print("Grouping on Teacher Name")
grp = df.groupby(by='Teacher Name')
print(grp)
print(grp.groups)
for name, group in grp:
    print(name)
    print(group)
SELECT A GROUP
Using the get_group() method, we can select a single group.
print(grp.get_group('Namrata'))
Aggregation
print(grp['ResultPer'].agg(np.mean))
Applying Multiple Aggregation Functions at Once
print(grp['ResultPer'].agg([np.sum, np.mean, np.std]))
To see the size of each group, apply the size() function
print(grp.agg(np.size))
Grouping on multiple columns and aggregation
gp = df.groupby(by=['Teacher Name', 'Class', 'Year']).agg(np.mean)
transform()
It transforms the aggregate data by repeating the summary result for each row of the group, so that the result has the same shape as the original data.
Output before/after transform
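The before/after difference can be sketched as below, comparing agg() with transform() on the same grouping. The data and the GroupMean column name are assumptions for illustration.

```python
import pandas as pd

# Hypothetical results for illustration.
df = pd.DataFrame({
    "Teacher Name": ["Namrata", "Namrata", "PP Singh"],
    "ResultPer": [90.0, 80.0, 99.0],
})

# agg(): one summary row per group (2 rows here).
per_group = df.groupby("Teacher Name")["ResultPer"].agg("mean")

# transform(): the group mean repeated for every row (3 rows, like df).
df["GroupMean"] = df.groupby("Teacher Name")["ResultPer"].transform("mean")

print(per_group)
print(df)
```

Because the transformed result lines up with the original rows, it can be stored directly as a new column.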
Reindexing and Altering Labels
Changing column labels
rename(): It simply renames the index and/or column labels in a DataFrame.
Changing Column label and row index
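A minimal sketch of rename() for column labels alone and for column labels together with the row index. The abbreviated labels Phy and Chem are made up so that there is something to rename.

```python
import pandas as pd

# Hypothetical frame with abbreviated labels.
XI = pd.DataFrame({"Phy": [78, 65], "Chem": [88, 70]},
                  index=["stu1", "stu2"])

# Changing column labels only.
renamed_cols = XI.rename(columns={"Phy": "Physics", "Chem": "Chemistry"})

# Changing a column label and a row index label in one call.
renamed_both = XI.rename(index={"stu1": "Student1"},
                         columns={"Phy": "Physics"})

print(renamed_cols)
print(renamed_both)
```

rename() returns a new DataFrame; labels not mentioned in the mapping are left as they are.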
reindex(): It lets you specify a new order for the existing index and column labels, so that records are displayed in that order.
<DF>.reindex(index=None, columns=None, fill_value=nan)
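A small sketch of reindex() used for reordering and for introducing a label that does not exist yet. The marks and the label stu4 are assumptions for illustration.

```python
import pandas as pd

# Hypothetical marks for illustration.
XI = pd.DataFrame({"Physics": [78, 65, 90], "Chemistry": [88, 70, 75]},
                  index=["stu1", "stu2", "stu3"])

# Reorder existing rows and columns.
reordered = XI.reindex(index=["stu3", "stu1", "stu2"],
                       columns=["Chemistry", "Physics"])

# A label that does not exist (stu4) gets fill_value instead of NaN.
extended = XI.reindex(index=["stu1", "stu4"], fill_value=0)

print(reordered)
print(extended)
```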
reindex_like(): Used to create index/column labels based on another DataFrame object.
Observe that the result contains the values of the common column 'Physics'.
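The observation about the common column can be reproduced with two small frames. Both frames and their values are assumptions, chosen so that only Physics is shared.

```python
import pandas as pd

# Hypothetical frames: only the Physics column is common to both.
XI = pd.DataFrame({"Physics": [78, 65], "Chemistry": [88, 70]},
                  index=["stu1", "stu2"])
XII = pd.DataFrame({"Physics": [91, 84, 77], "Maths": [95, 80, 68]},
                   index=["stu1", "stu2", "stu3"])

# XI takes the row and column labels of XII; the values still come from XI.
shaped = XI.reindex_like(XII)
print(shaped)
```

Labels that exist only in XII (the Maths column and row stu3) are filled with NaN, and the Chemistry column is dropped because XII does not have it.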
Knowledge is of no value unless you put it into practice. So, keep practicing!
THANKS