1
Python Pandas DataFrame
By: Mrs Sangeeta M Chauhan, Gwalior
2
What is a DataFrame? A DataFrame is a 2D (two-dimensional) data structure, i.e., data is arranged in tabular form, in rows and columns. In other words, a pandas DataFrame is similar to an Excel sheet. Let's understand it through an example.
3
Create DataFrame
A pandas DataFrame can be created using the following constructor:
pandas.DataFrame(data[, index, columns, dtype, copy])
The parameters of the constructor are as follows:
Sr.No  Parameter  Description
1      data       Takes various forms like ndarray, Series, map, lists, dict, constants and also another DataFrame.
2      index      Row labels to use for the resulting frame. Optional; defaults to np.arange(n) if no index is passed.
3      columns    Column labels to use for the resulting frame. Optional; defaults to np.arange(n) if no column labels are passed.
4      dtype      Data type to force on the columns; only a single dtype is allowed.
5      copy       Whether to copy the data from the input. The slide gives the default as False; in current pandas the default depends on the input type.
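The constructor parameters can be tried out with a small sketch. The marks data below is hypothetical and only illustrates how `index`, `columns` and `dtype` shape the result.

```python
import pandas as pd

# Hypothetical marks data, used only to illustrate the constructor parameters.
marks = [[78, 87], [67, 70], [87, 77]]

df = pd.DataFrame(
    data=marks,                                   # nested list: one inner list per row
    index=["Student1", "Student2", "Student3"],   # row labels
    columns=["Maths", "Physics"],                 # column labels
    dtype="float64",                              # force every column to float
)
print(df)

# Without index/columns, pandas numbers rows and columns from 0.
default_df = pd.DataFrame(marks)
print(list(default_df.index), list(default_df.columns))
```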
4
A pandas DataFrame can be created using various inputs like:
1. Lists
2. Dictionary
3. Series
4. NumPy ndarrays
5. Another DataFrame
Creating an Empty DataFrame
>>> import pandas as pd
>>> df = pd.DataFrame()
>>> df
Empty DataFrame
Columns: []
Index: []
5
Create a DataFrame from Lists
Example 1 (Simple List)
>>> MyList = [10, 20, 30, 40]
>>> MyFrame = pd.DataFrame(MyList)
>>> MyFrame
Example 2 (Nested List)
>>> Friends = [['Shraddha', 'Doctor'], ['Shanti', 'Teacher'], ['Monica', 'Engineer']]
>>> MyFrame = pd.DataFrame(Friends, columns=['Name', 'Occupation'])
>>> MyFrame
       Name  Occupation
0  Shraddha      Doctor
1    Shanti     Teacher
2    Monica    Engineer
6
Creation of a DataFrame from Dictionary of ndarrays / Lists
All the ndarrays must be of the same length. If an index is passed, its length must equal the length of the arrays. If no index is passed, the index defaults to range(n), where n is the array length.
7
Example 1 (without index)
>>> data = {'Name': ['Shraddha', 'Shanti', 'Monica', 'Yogita'],
...         'Age': [28, 34, 29, 39]}
>>> df = pd.DataFrame(data)
>>> df
       Name  Age
0  Shraddha   28
1    Shanti   34
2    Monica   29
3    Yogita   39
8
Example 2 (with index)
>>> data = {'Name': ['Shraddha', 'Shanti', 'Monica', 'Yogita'],
...         'Age': [28, 34, 29, 39]}
>>> df = pd.DataFrame(data,
...                   index=['Friend1', 'Friend2', 'Relative1', 'Relative2'])
>>> df
               Name  Age
Friend1    Shraddha   28
Friend2      Shanti   34
Relative1    Monica   29
Relative2    Yogita   39
9
Create a DataFrame from List of Dictionaries
Here we pass a list of dictionaries to create a DataFrame. The dictionary keys are taken as column names by default.
Example 1:
>>> Mydict = [{'Won': 15, 'Loose': 2}, {'Won': 5, 'Loose': 10}, {'Won': 8, 'Loose': 9}, {'Won': 4}]
>>> df = pd.DataFrame(Mydict)
>>> df
   Loose  Won
0    2.0   15
1   10.0    5
2    9.0    8
3    NaN    4
Notice that the missing value is stored as NaN (Not a Number). Recent pandas versions keep the insertion order of the keys, so the columns may appear as Won, Loose.
10
Example 2: Changing Index
>>> Mydict = [{'Won': 15, 'Loose': 2}, {'Won': 5, 'Loose': 10}, {'Won': 8, 'Loose': 9}]
>>> df = pd.DataFrame(Mydict, index=['India', 'Pakistan', 'Australia'])
>>> df
           Loose  Won
India          2   15
Pakistan      10    5
Australia      9    8
11
Example 3
We can also create a DataFrame by specifying a list of dictionaries, row indices, and column indices.
>>> L_dict = [{'Maths': 78, 'Chemistry': 78, 'Physics': 87}, {'Maths': 67, 'Chemistry': 70}, {'Physics': 77, 'Maths': 87}]
>>> df1 = pd.DataFrame(L_dict, index=['Student1', 'Student2', 'Student3'], columns=['Physics', 'Chemistry', 'Maths'])
>>> df1
          Physics  Chemistry  Maths
Student1     87.0       78.0     78
Student2      NaN       70.0     67
Student3     77.0        NaN     87
12
df2 is created with only 2 columns.
>>> df2 = pd.DataFrame(L_dict, index=['Student1', 'Student2', 'Student3'], columns=['Chemistry', 'Maths'])
>>> df2
          Chemistry  Maths
Student1       78.0     78
Student2       70.0     67
Student3        NaN     87
13
Creating df3 by specifying 3 column names (new column English)
>>> df3 = pd.DataFrame(L_dict, index=['Student1', 'Student2', 'Student3'], columns=['English', 'Chemistry', 'Maths'])
>>> df3
          English  Chemistry  Maths
Student1      NaN       78.0     78
Student2      NaN       70.0     67
Student3      NaN        NaN     87
14
Addition of New Column & Row
Column Addition
>>> L_dict = [{'Maths': 78, 'Chemistry': 78, 'Physics': 87}, {'Maths': 67, 'Chemistry': 70}, {'Physics': 77, 'Maths': 87, 'Chemistry': 90}]
>>> df3 = pd.DataFrame(L_dict, index=['Student1', 'Student2', 'Student3'], columns=['English', 'Chemistry', 'Maths'])
>>> df3['Physics'] = [45, 56, 65]
>>> df3
          English  Chemistry  Maths  Physics
Student1      NaN         78     78       45
Student2      NaN         70     67       56
Student3      NaN         90     87       65
New column Physics added.
15
We can also update column data using the same method.
>>> df3['English'] = [78, 98, 89]
>>> df3
          English  Chemistry  Maths  Physics
Student1       78         78     78       45
Student2       98         70     67       56
Student3       89         90     87       65
The NaN values of column English are replaced with the new values.
16
We can also add a new column using data stored in the existing frame.
>>> df3['Total'] = df3.English + df3.Chemistry + df3.Maths + df3.Physics
>>> df3
          English  Chemistry  Maths  Physics  Total
Student1       78         78     78       45    279
Student2       98         70     67       56    291
Student3       89         90     87       65    331
Look, a new column Total has been added with the total of marks in the other subjects.
17
ASSIGNING AND COPYING OF DATAFRAME
18
Changes reflected in both dataframes
Changes reflected in 2nd dataframe only
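The two captions above contrast plain assignment with copying. A minimal sketch of the difference follows; the frame and the names `alias` and `clone` are hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({"Maths": [78, 67, 87]},
                   index=["Student1", "Student2", "Student3"])

alias = df1           # plain assignment: both names refer to the same object
clone = df1.copy()    # copy(): an independent DataFrame (deep copy by default)

alias.loc["Student1", "Maths"] = 100   # visible through df1 as well
clone.loc["Student2", "Maths"] = 0     # visible only in clone

print(df1)
print(clone)
```

Assignment only creates a second name for the same object, so a change made through either name shows up in both. A change made to the copy stays in the copy.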
19
SELECTION AND INDEXING
Methods covered in this section are:
Selecting data by row numbers (.iloc)
Selecting data by label or by a conditional statement (.loc)
Selecting data at a particular row and column (.at)
loc gets rows (or columns) with particular labels from the index. iloc gets rows (or columns) at particular positions in the index (so it only takes integers).
20
ACCESSING DIFFERENT ROWS/ COLUMNS
21
Use of iloc
Here the end index is excluded.
22
Use of at with DataFrame
Access a single value for a row/column label pair.
<df>.at[rowname, colname]
<df>.loc[rowname].at[colname]
XI.at['stu2', 'Name']
XI.at['stu2', 'Physics']
XI.loc['stu2'].at['Chemistry']
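The three accessors can be compared side by side in a short sketch. The frame name `XI` and the `stu` labels follow the examples above, while the names and marks in it are hypothetical.

```python
import pandas as pd

# Hypothetical data; only the name XI and the 'stu2' label come from the slides.
XI = pd.DataFrame(
    {"Name": ["Asha", "Ravi", "Meena"],
     "Physics": [87, 76, 91],
     "Chemistry": [78, 70, 88]},
    index=["stu1", "stu2", "stu3"],
)

by_label = XI.loc["stu1":"stu2", ["Name", "Physics"]]  # label slice: end label is included
by_position = XI.iloc[0:2, 0:2]                        # position slice: end index is excluded
high_physics = XI.loc[XI["Physics"] > 80]              # conditional selection
single_value = XI.at["stu2", "Physics"]                # one cell by row/column label

print(by_label)
print(by_position)
print(high_physics)
print(single_value)
```

Note that a label slice with loc includes its end label, whereas a position slice with iloc excludes its end index.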
23
Row Addition
To add a row, assign to a new row index:
>>> df3.loc['Student4'] = [45, 67, 45]
>>> df3
[Output: df3 with columns English, Chemistry, Maths and rows Student1 to Student4]
24
Use of iloc
25
Deletion of an Existing Column/Row from Data Frame
26
Row Deletion
>>> df3.drop('Student3')
[Output: df3 with columns English, Chemistry, Maths, Physics and rows Student1 and Student2]
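Column deletion works along the same lines. The sketch below uses hypothetical marks and shows `drop()`, which returns a new frame, next to `del` and `pop()`, which change the frame in place.

```python
import pandas as pd

df3 = pd.DataFrame(
    {"English": [78, 98, 89], "Chemistry": [78, 70, 90], "Maths": [78, 67, 87]},
    index=["Student1", "Student2", "Student3"],
)

no_row = df3.drop("Student3")            # drop a row by label (axis=0 is the default)
no_col = df3.drop(columns=["English"])   # drop a column by name
print(df3.shape)                         # drop() returned new frames; df3 is unchanged

del df3["Maths"]                 # removes the column in place
chem = df3.pop("Chemistry")      # removes the column in place and returns it as a Series
print(df3)
```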
27
Descriptive Statistics with Pandas
Pandas also offers many useful statistical and aggregate functions. We are going to discuss the following ones.
28
DataFrame.min(axis=None, skipna=None, numeric_only=None)
Parameters:
axis : {0 or 'index', 1 or 'columns'}. The axis along which the minimum is computed.
skipna : Exclude NA/null values when computing the result.
numeric_only : Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
29
Use of min()
30
Calculating Mode
31
Compare both Data & Result
32
Calculating median
33
count() counts the non-null elements of each column (by default)
34
sum() by column
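The statistical functions named on the preceding slides can be tried together on one small frame. The marks are hypothetical, and one value is left missing to show how NaN is skipped.

```python
import numpy as np
import pandas as pd

# Hypothetical marks; stu4 has no Physics mark.
XI = pd.DataFrame(
    {"Physics": [87, 76, 87, np.nan], "Chemistry": [78, 70, 88, 70]},
    index=["stu1", "stu2", "stu3", "stu4"],
)

col_min = XI.min()           # minimum of each column; NaN is skipped by default
row_min = XI.min(axis=1)     # minimum of each row
col_mode = XI.mode()         # most frequent value(s) per column, returned as a DataFrame
col_median = XI.median()     # median of each column
col_count = XI.count()       # number of non-null values per column
col_sum = XI.sum()           # sum of each column

print(col_min, row_min, col_mode, col_median, col_count, col_sum, sep="\n\n")
```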
35
APPLYING FUNCTIONS ON SUBSET OF DATA FRAME
<DF>[[<colname1>, <colname2>, ...]].<function_name>()
Example: XI['Chemistry'].min()
36
APPLYING FUNCTIONS ON ROW OF DATA FRAME
<DF>.loc[<row_index>, ...].<function_name>()
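Both patterns, a function on selected columns and a function on one row, can be sketched as follows. The frame is hypothetical and keeps only numeric columns so that the row sum is well defined.

```python
import pandas as pd

# Hypothetical marks, numeric columns only.
XI = pd.DataFrame(
    {"Physics": [87, 76, 91], "Chemistry": [78, 70, 88]},
    index=["stu1", "stu2", "stu3"],
)

one_col_min = XI["Chemistry"].min()                         # scalar for a single column
two_col_max = XI[["Physics", "Chemistry"]].max()            # Series: one value per column
row_total = XI.loc["stu2", ["Physics", "Chemistry"]].sum()  # function applied across one row

print(one_col_min, two_col_max, row_total, sep="\n")
```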
37
ADVANCED OPERATION ON DATAFRAME
1. PIVOTING: In pandas, the pivot table function takes a simple DataFrame as input and performs grouped operations that provide a multidimensional summary of the data.
38
pivot() function
pandas.pivot(data, index, columns, values)
The function produces a pivot table based on 3 columns of the DataFrame. It uses the unique values from index / columns and fills the cells with values.
Parameters:
data : The DataFrame to reshape
index : Labels to use to make the new frame's index
columns : Labels to use to make the new frame's columns
values : Values to use for populating the new frame's values
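A minimal sketch of pivot() on hypothetical result data follows. The column names mirror the CSV layout used in the slides, but the values are invented.

```python
import pandas as pd

# Hypothetical result data; every (Class, Subject) pair occurs exactly once.
df = pd.DataFrame({
    "Class": ["XI", "XI", "XII", "XII"],
    "Subject": ["Physics", "Maths", "Physics", "Maths"],
    "ResultPer": [95.0, 97.5, 98.0, 99.5],
})

# One row per Class, one column per Subject, cells filled from ResultPer.
table = df.pivot(index="Class", columns="Subject", values="ResultPer")
print(table)
```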
39
Csv file : resultAnalyiTr
S.no  Class  Teacher Name   Subject    ResultPer
1     XII    RK Saxena      Physics    98
2            Neelam Sharma  Chemistry  100
3            PP Singh       Maths      99.5
4            Ravita         CS
40
What if we have duplicate values in column and index?
Yes, we have a solution: we can use pivot_table() instead of pivot().
pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True)
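The difference shows up as soon as an index/column pair repeats. In the hypothetical data below, (XII, Physics) occurs twice, so pivot() raises an error while pivot_table() aggregates the duplicates.

```python
import pandas as pd

# Hypothetical data with a duplicated (Class, Subject) pair.
df = pd.DataFrame({
    "Class": ["XII", "XII", "XII", "XI"],
    "Subject": ["Physics", "Physics", "Maths", "Physics"],
    "ResultPer": [98.0, 90.0, 99.5, 85.0],
})

# df.pivot(index="Class", columns="Subject", values="ResultPer") would raise
# ValueError here because (XII, Physics) occurs twice.
table = df.pivot_table(index="Class", columns="Subject", values="ResultPer",
                       aggfunc="mean", fill_value=0)
print(table)
```

Here `aggfunc="mean"` averages the two Physics results of class XII, and `fill_value=0` fills the missing (XI, Maths) cell.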
41
Let's consider another table (with duplicate values)
43
Let's consider a modified csv file
44
pivotTableEx2.py
45
SORTING
DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
Parameters:
by : column name or list of column names on which the data is sorted
ascending : default True (ascending). Specify a list for multiple sort orders.
axis : {0 or 'index', 1 or 'columns'}, default 0. Sort index/rows versus columns.
inplace : boolean, default False. If True, sort the DataFrame without creating a new instance.
kind : {'quicksort', 'mergesort', 'heapsort'}, optional. This option is only applied when sorting on a single column or label.
na_position : {'first', 'last'} (optional, default 'last'). 'first' puts NaNs at the beginning, 'last' puts NaNs at the end.
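The effect of `ascending` and `na_position` can be seen in a short sketch. The teacher names come from the CSV above, but the missing value for Ravita is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the NaN for Ravita is assumed for this example.
df = pd.DataFrame({
    "Teacher Name": ["RK Saxena", "Neelam Sharma", "PP Singh", "Ravita"],
    "ResultPer": [98.0, 100.0, 99.5, np.nan],
})

ascending = df.sort_values(by="ResultPer")   # NaN is placed last by default
descending_nan_first = df.sort_values(by="ResultPer", ascending=False,
                                      na_position="first")

print(ascending)
print(descending_nan_first)
```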
46
Use of df.sort_values()
[Figure: DataFrame df, then df sorted in ascending order and in descending order]
47
Descending order with NaN values at first position
48
Sorting on Multiple Columns
print("Sorting on Multiple(2) Columns")
print(df.sort_values(by=['Class', 'Teacher Name']))
print("Sorting on Multiple(3) Columns")
print(df.sort_values(['Class', 'Teacher Name', 'Year']))
49
Aggregate Functions
Function  Description
count     Number of non-null observations
sum       Sum of values
mean      Mean of values
mad       Mean absolute deviation (removed in pandas 2.0)
median    Arithmetic median of values
min       Minimum
max       Maximum
mode      Mode
std       Unbiased standard deviation
50
<Df>.<agg_fun_name>()
51
FUNCTION APPLICATION
A function (user-defined or from a library) can be applied on a DataFrame in multiple ways:
On the whole data frame – pipe()
Row/column wise – apply()
On individual elements – applymap() (renamed DataFrame.map() in pandas 2.1)
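A minimal sketch of the three levels of function application follows. The helper `add_allowance` is a hypothetical user-defined function, and the element-wise step picks `DataFrame.map` when the installed pandas provides it and falls back to `applymap` otherwise.

```python
import pandas as pd

df1 = pd.DataFrame({"Sal": [50000, 60000, 55000], "bonus": [3000, 4000, 5000]})

def add_allowance(frame, amount):
    """Hypothetical helper: return a new frame with `amount` added to every value."""
    return frame + amount

# pipe(): the function receives the whole DataFrame as its first argument.
whole = df1.pipe(add_allowance, 1000)

# apply(): the function receives one column (axis=0) or one row (axis=1) at a time.
col_range = df1.apply(lambda col: col.max() - col.min())
row_totals = df1.apply(lambda row: row["Sal"] + row["bonus"], axis=1)

# Element-wise: DataFrame.map in newer pandas, applymap in older versions.
elementwise = df1.map if hasattr(df1, "map") else df1.applymap
in_thousands = elementwise(lambda x: x / 1000)

print(whole, col_range, row_totals, in_thousands, sep="\n\n")
```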
52
[Figure: chained pipes. The data is the input to the first pipe; the output of the previous pipe is the input to the next pipe.]
53
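pipe(func_name, *args)
import numpy as np
import pandas as pd
d1 = {'Sal': [50000, 60000, 55000], 'bonus': [3000, 4000, 5000]}
df1 = pd.DataFrame(d1)
# Equivalent to: divide(power(add(df1, 2), 2), 3)
print(df1.pipe(np.add, 2).pipe(np.power, 2).pipe(np.divide, 3))
# pipe() passes the frame as the first argument, so use the unbound DataFrame methods
print(df1.pipe(pd.DataFrame.add, 2).pipe(pd.DataFrame.divide, 3))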
56
pandas.DataFrame.info() is used to get a concise summary of the DataFrame.
pandas.DataFrame.describe() generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset's distribution, excluding NaN values.
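Both functions can be tried on a small hypothetical frame. The subjects follow the CSV used in the slides, and the missing value is assumed for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing result.
df = pd.DataFrame({
    "Subject": ["Physics", "Chemistry", "Maths", "CS"],
    "ResultPer": [98.0, 100.0, 99.5, np.nan],
})

df.info()                 # prints the index range, column dtypes, non-null counts, memory usage
summary = df.describe()   # numeric columns only by default; NaN values are excluded
print(summary)
```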
57
Grouping on DataFrame's Column
<DF>.groupby(by=None, axis=0)
Any groupby operation involves one of the following operations on the original object. They are −
Splitting the object
Applying a function
Combining the results
In many situations, we split the data into sets and apply some functionality on each subset. In the apply functionality, we can perform the following operations −
Aggregation − computing a summary statistic
Transformation − performing some group-specific operation
Filtration − discarding the data with some condition
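The split, apply and combine steps can be sketched on an in-memory frame instead of the CSV file. The data below is hypothetical; only the column names and the teacher name Namrata are taken from the slides.

```python
import pandas as pd

# Hypothetical result data.
df = pd.DataFrame({
    "Teacher Name": ["Namrata", "Namrata", "Ravita", "Ravita", "PP Singh"],
    "Class": ["XI", "XII", "XI", "XII", "XII"],
    "ResultPer": [90.0, 96.0, 80.0, 88.0, 99.5],
})

grp = df.groupby(by="Teacher Name")                  # split: one group per teacher

mean_per_teacher = grp["ResultPer"].mean()           # aggregation: one value per group
multi_class = grp.filter(lambda g: len(g) > 1)       # filtration: keep groups with 2+ rows
namrata = grp.get_group("Namrata")                   # select a single group

print(mean_per_teacher)
print(multi_class)
print(namrata)
```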
58
import pandas as pd
import numpy as np
df = pd.read_csv("resultAnalysis.csv")
print(df)
print("Grouping on Teacher Name")
grp = df.groupby(by='Teacher Name')
print(grp)
59
print(grp.groups)
60
for name, group in grp:
    print(name)
    print(group)
61
SELECT A GROUP
Using the get_group() method, we can select a single group.
print(grp.get_group('Namrata'))
62
Aggregation
print(grp['ResultPer'].agg(np.mean))
63
Applying Multiple Aggregation Functions at Once
print(grp['ResultPer'].agg([np.sum, np.mean, np.std]))
64
To see the size of each group, apply the size function:
print(grp.agg(np.size))
65
Grouping on multiple columns and aggregation
gp = df.groupby(by=['Teacher Name', 'Class', 'Year']).agg(np.mean)
66
transform()
It transforms the aggregate data by repeating the summary result for each row of the group, so that the result has the same shape as the original data.
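The contrast with plain aggregation is easiest to see on a tiny hypothetical frame: the aggregate has one row per group, while the transformed result has one row per original record.

```python
import pandas as pd

# Hypothetical data: two teachers with two results each.
df = pd.DataFrame({
    "Teacher Name": ["Namrata", "Ravita", "Namrata", "Ravita"],
    "ResultPer": [90.0, 80.0, 96.0, 88.0],
})

grp = df.groupby("Teacher Name")["ResultPer"]
aggregated = grp.mean()               # one value per group (2 rows)
transformed = grp.transform("mean")   # group mean repeated for every original row (4 rows)

df["TeacherMean"] = transformed       # aligns with df because the index is preserved
print(aggregated)
print(df)
```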
67
Output before/after transform
68
Reindexing and Altering Labels
69
rename(): It simply renames the index and/or column labels in a DataFrame.
Changing column labels
70
Changing Column label and row index
71
reindex(): It displays records in a new order by reordering the existing index and column labels.
<DF>.reindex(index=None, columns=None, fill_value=nan)
72
reindex_like(): Used to create index/column labels based on another DataFrame object.
Observe that it contains the values of the common column 'Physics'.
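The three relabelling methods can be compared in one sketch. Both frames below are hypothetical; `other` shares only the Physics column with `df1`.

```python
import pandas as pd

df1 = pd.DataFrame({"Physics": [87, 76], "Maths": [78, 67]}, index=["stu1", "stu2"])

# rename(): change labels only, the data stays where it is.
renamed = df1.rename(columns={"Maths": "Mathematics"}, index={"stu1": "Student1"})

# reindex(): reorder or extend labels; labels not present before get fill_value.
reordered = df1.reindex(index=["stu2", "stu1", "stu3"],
                        columns=["Maths", "Physics"], fill_value=0)

# reindex_like(): take the row/column labels of another frame, keep df1's values.
other = pd.DataFrame({"Physics": [0, 0, 0], "Chemistry": [0, 0, 0]},
                     index=["stu1", "stu2", "stu3"])
like_other = df1.reindex_like(other)

print(renamed, reordered, like_other, sep="\n\n")
```

In `like_other`, only the common column Physics carries values from `df1`; the Chemistry column and the new row stu3 are filled with NaN.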
73
Knowledge is of no value unless you put it into practice.
So, keep practicing. THANKS