By : Mrs Sangeeta M Chauhan , Gwalior


Python Pandas DataFrame
By: Mrs Sangeeta M Chauhan, Gwalior
https://pythonclassroomdiary.wordpress.com

What is a DataFrame?
A DataFrame is a 2D (two-dimensional) data structure: data is arranged in tabular form, in rows and columns. In other words, a pandas DataFrame is similar to an Excel sheet. Let's understand it through an example.

Create DataFrame
A pandas DataFrame can be created using the following constructor:
pandas.DataFrame(data, index, columns, dtype, copy)
The parameters of the constructor are as follows:
Sr.No  Parameter  Description
1      data       Takes various forms such as ndarray, Series, map, lists, dict, constants and also another DataFrame.
2      index      Row labels for the resulting frame. Optional; defaults to np.arange(n) if no index is passed.
3      columns    Column labels for the resulting frame. Optional; defaults to np.arange(n) if no column labels are passed.
4      dtype      Data type of each column.
5      copy       Whether to copy the data from the input (default False).

A pandas DataFrame can be created using various inputs like
1. Lists
2. Dictionary
3. Series
4. NumPy ndarrays
5. Another DataFrame
Creating an Empty DataFrame
>>> import pandas as pd
>>> df = pd.DataFrame()
>>> df
Empty DataFrame
Columns: []
Index: []

Create a DataFrame from Lists
Example 1 (Simple List)
>>> MyList = [10, 20, 30, 40]
>>> MyFrame = pd.DataFrame(MyList)
>>> MyFrame
    0
0  10
1  20
2  30
3  40
Example 2 (List of Lists)
>>> Friends = [['Shraddha', 'Doctor'], ['Shanti', 'Teacher'], ['Monica', 'Engineer']]
>>> MyFrame = pd.DataFrame(Friends, columns=['Name', 'Occupation'])
>>> MyFrame
       Name Occupation
0  Shraddha     Doctor
1    Shanti    Teacher
2    Monica   Engineer

Creation of a DataFrame from Dictionary of ndarrays / Lists
All the ndarrays must be of the same length. If an index is passed, its length must equal the length of the arrays. If no index is passed, the index defaults to range(n), where n is the array length.

Example 1 (without index)
>>> data = {'Name': ['Shraddha', 'Shanti', 'Monica', 'Yogita'],
...         'Age': [28, 34, 29, 39]}
>>> df = pd.DataFrame(data)
>>> df
       Name  Age
0  Shraddha   28
1    Shanti   34
2    Monica   29
3    Yogita   39

Example 2 (with index)
>>> data = {'Name': ['Shraddha', 'Shanti', 'Monica', 'Yogita'],
...         'Age': [28, 34, 29, 39]}
>>> df = pd.DataFrame(data,
...                   index=['Friend1', 'Friend2', 'Relative1', 'Relative2'])
>>> df
               Name  Age
Friend1    Shraddha   28
Friend2      Shanti   34
Relative1    Monica   29
Relative2    Yogita   39

Create a DataFrame from List of Dictionaries
Here we pass a list of dictionaries to create a DataFrame. The dictionary keys are taken as column names by default.
Example 1:
>>> Mydict = [{'Won': 15, 'Loose': 2}, {'Won': 5, 'Loose': 10}, {'Won': 8, 'Loose': 9}, {'Won': 4}]
>>> df = pd.DataFrame(Mydict)
>>> df
   Loose  Won
0    2.0   15
1   10.0    5
2    9.0    8
3    NaN    4
Notice that the missing value is stored as NaN (Not a Number).

Example 2: Changing Index
>>> Mydict = [{'Won': 15, 'Loose': 2}, {'Won': 5, 'Loose': 10}, {'Won': 8, 'Loose': 9}]
>>> df = pd.DataFrame(Mydict, index=['India', 'Pakistan', 'Australia'])
>>> df
           Loose  Won
India          2   15
Pakistan      10    5
Australia      9    8

Example 3
We can also create a DataFrame by specifying a list of dictionaries, row indices, and column indices.
>>> L_dict = [{'Maths': 78, 'Chemistry': 78, 'Physics': 87}, {'Maths': 67, 'Chemistry': 70}, {'Physics': 77, 'Maths': 87}]
>>> df1 = pd.DataFrame(L_dict, index=['Student1', 'Student2', 'Student3'], columns=['Physics', 'Chemistry', 'Maths'])
>>> df1
          Physics  Chemistry  Maths
Student1     87.0       78.0     78
Student2      NaN       70.0     67
Student3     77.0        NaN     87

df2 is created with only 2 columns
>>> df2 = pd.DataFrame(L_dict, index=['Student1', 'Student2', 'Student3'], columns=['Chemistry', 'Maths'])
>>> df2
          Chemistry  Maths
Student1       78.0     78
Student2       70.0     67
Student3        NaN     87

Creating df3 by specifying 3 column names (new column English)
>>> df3 = pd.DataFrame(L_dict, index=['Student1', 'Student2', 'Student3'], columns=['English', 'Chemistry', 'Maths'])
>>> df3
          English  Chemistry  Maths
Student1      NaN       78.0     78
Student2      NaN       70.0     67
Student3      NaN        NaN     87

Addition of New Column & Row
Column Addition
>>> L_dict = [{'Maths': 78, 'Chemistry': 78, 'Physics': 87}, {'Maths': 67, 'Chemistry': 70}, {'Physics': 77, 'Maths': 87, 'Chemistry': 90}]
>>> df3 = pd.DataFrame(L_dict, index=['Student1', 'Student2', 'Student3'], columns=['English', 'Chemistry', 'Maths'])
>>> df3['Physics'] = [45, 56, 65]
>>> df3
          English  Chemistry  Maths  Physics
Student1      NaN         78     78       45
Student2      NaN         70     67       56
Student3      NaN         90     87       65
New column Physics added.

We can also update column data using the same method.
>>> df3['English'] = [78, 98, 89]
>>> df3
          English  Chemistry  Maths  Physics
Student1       78         78     78       45
Student2       98         70     67       56
Student3       89         90     87       65
The NaN values of column English are replaced with the new values.

We can also add a new column using data stored in the existing frame.
>>> df3['Total'] = df3.English + df3.Chemistry + df3.Maths + df3.Physics
>>> df3
          English  Chemistry  Maths  Physics  Total
Student1       78         78     78       45    279
Student2       98         70     67       56    291
Student3       89         90     87       65    331
Look, a new column Total has been added with the total of marks in the other subjects.

ASSIGNING AND COPYING OF DATAFRAME

Assigning: changes are reflected in both dataframes.
Copying: changes are reflected in the 2nd dataframe only.
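The difference between assigning and copying can be sketched as follows. The frame contents and the names df_alias and df_copy are made up for illustration; the point is only that plain assignment creates a second name for the same object, while copy() creates an independent frame.

```python
import pandas as pd

# Hypothetical sample data, not taken from the slides.
df1 = pd.DataFrame({'Name': ['Shraddha', 'Shanti'], 'Age': [28, 34]})

# Plain assignment: both names refer to the same DataFrame object.
df_alias = df1
df_alias.loc[0, 'Age'] = 30      # the change is visible through df1 as well

# copy(): an independent DataFrame (a deep copy by default).
df_copy = df1.copy()
df_copy.loc[1, 'Age'] = 99       # df1 is not affected by this change

print(df1)
print(df_copy)
```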

SELECTION AND INDEXING
Methods covered in this section are:
Selecting data by row numbers (.iloc)
Selecting data by label or by a conditional statement (.loc)
Selecting data at a particular row and column (.at)
loc gets rows (or columns) with particular labels from the index.
iloc gets rows (or columns) at particular positions in the index (so it only takes integers).
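A minimal sketch of the difference between loc and iloc. The frame XI and its row labels follow the naming used elsewhere in this presentation, but the marks and student names are invented for illustration.

```python
import pandas as pd

# Hypothetical marks table; values are made up.
XI = pd.DataFrame(
    {'Name': ['Asha', 'Bina', 'Charu', 'Divya'],
     'Physics': [87, 65, 77, 90],
     'Chemistry': [78, 70, 81, 66]},
    index=['stu1', 'stu2', 'stu3', 'stu4'])

# loc: label-based; a label slice includes BOTH end points.
by_label = XI.loc['stu2':'stu3', ['Name', 'Physics']]

# iloc: position-based; the end position of a slice is excluded.
by_position = XI.iloc[1:3, 0:2]

# loc also accepts a boolean condition.
by_condition = XI.loc[XI['Physics'] > 80]

print(by_label)
print(by_position)
print(by_condition)
```

Here the label slice and the position slice select the same two rows and two columns, which shows how the end-point rules differ.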

ACCESSING DIFFERENT ROWS/ COLUMNS

Use of iloc: here the end index is excluded.

Use of at with a DataFrame
Access a single value for a row/column label pair.
<df>.at[rowname, colname]
<df>.loc[rowname].at[colname]
XI.at['stu2', 'Name']
XI.at['stu2', 'Physics']
XI.loc['stu2'].at['Chemistry']
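The calls above can be tried on a small frame. This is only a sketch: the contents of XI are invented, and the variable names physics_stu2 and chem_stu2 are hypothetical.

```python
import pandas as pd

# Hypothetical marks table; values are made up.
XI = pd.DataFrame(
    {'Name': ['Asha', 'Bina', 'Charu'],
     'Physics': [87, 65, 77],
     'Chemistry': [78, 70, 81]},
    index=['stu1', 'stu2', 'stu3'])

# Read a single value by row label and column label.
physics_stu2 = XI.at['stu2', 'Physics']

# Same idea in two steps: pick the row with loc, then the column with at.
chem_stu2 = XI.loc['stu2'].at['Chemistry']

# at can also be used to set a single value.
XI.at['stu2', 'Physics'] = 70

print(physics_stu2, chem_stu2)
print(XI)
```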

Row Addition
To add a row, specify its row index with loc.
>>> df3.loc['Student4'] = [45, 67, 45]
>>> df3
          English  Chemistry  Maths
Student1       78         78     78
Student2       98         70     67
Student3       89         90     87
Student4       45         67     45

Use of iloc

Deletion of an Existing Column/Row from Data Frame

Row Deletion
>>> df3.drop('Student3')
          English  Chemistry  Maths  Physics
Student1       78         78     78       45
Student2       98         70     67       56
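Since the slide only shows row deletion, here is a hedged sketch of the common ways to delete rows and columns. The marks reuse the df3 values from the earlier slides; the variable names without_row, without_col and maths are hypothetical.

```python
import pandas as pd

df3 = pd.DataFrame(
    {'English': [78, 98, 89], 'Chemistry': [78, 70, 90],
     'Maths': [78, 67, 87], 'Physics': [45, 56, 65]},
    index=['Student1', 'Student2', 'Student3'])

# drop() returns a NEW frame; df3 itself is unchanged unless reassigned.
without_row = df3.drop('Student3')             # drop a row by label
without_col = df3.drop(columns=['Physics'])    # drop a column by label

# del and pop() remove a column from df3 in place.
del df3['English']
maths = df3.pop('Maths')                       # pop() also returns the column

print(without_row)
print(without_col)
print(df3)
```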

Descriptive Statistics with Pandas
Pandas also offers many useful statistical and aggregate functions, of which we are going to discuss the following.

DataFrame.min(axis=None, skipna=None, numeric_only=None)
Parameters:
axis : Axis along which the minimum is computed (0 gives the minimum of each column, 1 the minimum of each row).
skipna : Exclude NA/null values when computing the result.
numeric_only : Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Use of min()

Calculating mode

Compare both Data & Result

Calculating median

count() counts the non-null elements of each column (by default)

sum() by column
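The functions named on the preceding slides can be tried together on one small frame. This is a sketch with invented marks (including one missing value); the result variable names are hypothetical.

```python
import pandas as pd

# Hypothetical marks; None becomes NaN and is skipped by default.
XI = pd.DataFrame(
    {'Physics': [87, 65, 77, 65],
     'Chemistry': [78, 70, None, 66]},
    index=['stu1', 'stu2', 'stu3', 'stu4'])

col_min = XI.min()           # minimum of each column
row_min = XI.min(axis=1)     # minimum of each row
mode_df = XI.mode()          # most frequent value(s) of each column
col_median = XI.median()     # median of each column
col_count = XI.count()       # number of non-null values in each column
col_sum = XI.sum()           # sum of each column

for result in (col_min, row_min, mode_df, col_median, col_count, col_sum):
    print(result, end='\n\n')
```

Note that mode() returns a DataFrame rather than a Series, because a column can have more than one mode.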

APPLYING FUNCTIONS ON SUBSET OF DATA FRAME
<DF>[[<COLNAME1>, <COLNAME2>, ...]].<fun_nm>()
Example: XI['Chemistry'].min()

APPLYING FUNCTIONS ON ROW OF DATA FRAME
<DF>.loc[<row_index>, ...].<function_name>()
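Both patterns, applying a function to selected columns and to a selected row, might look like this in practice. The marks are invented and the result names are hypothetical.

```python
import pandas as pd

# Hypothetical marks table; values are made up.
XI = pd.DataFrame(
    {'Physics': [87, 65, 77], 'Chemistry': [78, 70, 81]},
    index=['stu1', 'stu2', 'stu3'])

# Function on a single column.
chem_min = XI['Chemistry'].min()

# Function on a subset of columns (note the double brackets).
subset_max = XI[['Physics', 'Chemistry']].max()

# Function on one row, selected by label with loc.
stu2_total = XI.loc['stu2', ['Physics', 'Chemistry']].sum()

print(chem_min)
print(subset_max)
print(stu2_total)
```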

ADVANCED OPERATIONS ON DATAFRAME
1. PIVOTING: In pandas, the pivot table function takes a simple data frame as input and performs grouped operations that provide a multidimensional summary of the data.

pivot() function
pandas.pivot(data, index, columns, values), or equivalently <DF>.pivot(index, columns, values)
Produces a pivot table based on 3 columns of the DataFrame. Uses unique values from index / columns and fills with values.
Parameters:
index : Labels to use to make the new frame's index
columns : Labels to use to make the new frame's columns
values : Values to use for populating the new frame's values
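A minimal sketch of pivot(), assuming a small result table with the column names Class, Subject and ResultPer; the rows themselves are invented.

```python
import pandas as pd

# Hypothetical long-format data: one row per (Class, Subject) pair.
df = pd.DataFrame({
    'Class': ['XII', 'XII', 'XI', 'XI'],
    'Subject': ['Physics', 'Chemistry', 'Physics', 'Chemistry'],
    'ResultPer': [98.0, 100.0, 91.0, 95.5]})

# Classes become the row index, subjects become the columns,
# and ResultPer fills the cells.
table = df.pivot(index='Class', columns='Subject', values='ResultPer')
print(table)
```

pivot() only reshapes; it needs each (index, columns) pair to occur once, which holds for this sample.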

Csv file : resultAnalyiTr
S.no  Class  Teacher Name   Subject    ResultPer
1     XII    RK Saxena      Physics    98
2            Neelam Sharma  Chemistry  100
3            PP Singh       Maths      99.5
4            Ravita         CS

What if we have duplicate values in the column and index? There is a solution: we can use pivot_table() instead of pivot().
pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True)
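The following sketch shows pivot_table() handling a duplicated (index, column) pair by aggregating it. The data is invented; only the column names echo the presentation's CSV file.

```python
import pandas as pd

# Hypothetical data in which (XII, Physics) occurs twice.
df = pd.DataFrame({
    'Class': ['XII', 'XII', 'XII', 'XI'],
    'Subject': ['Physics', 'Physics', 'Chemistry', 'Physics'],
    'ResultPer': [98.0, 90.0, 100.0, 80.0]})

# df.pivot(...) would raise a ValueError here because of the duplicate
# entries; pivot_table() combines them with aggfunc instead.
table = pd.pivot_table(df, values='ResultPer', index='Class',
                       columns='Subject', aggfunc='mean', fill_value=0)
print(table)
```

The two XII Physics results are averaged, and the missing XI Chemistry cell is filled with the fill_value.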

Let's consider another table (with duplicate values)

Let's consider a modified csv file

pivotTableEx2.py

SORTING
DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
Parameters:
by : column name or list of column names on which the data is sorted
ascending : default True (ascending). Specify a list for multiple sort orders
axis : {0 or 'index', 1 or 'columns'}, default 0. Sort index/rows versus columns
inplace : boolean, default False. If True, sort the DataFrame itself without creating a new instance
kind : {'quicksort', 'mergesort', 'heapsort'}, optional. This option is only applied when sorting on a single column or label
na_position : {'first', 'last'}, optional, default 'last'. 'first' puts NaNs at the beginning, 'last' puts NaNs at the end

Use of df.sort_values(): DataFrame df, ascending order, descending order

Descending order with NaN values at first position
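The three sort orders described on these slides can be reproduced with a sketch like the one below. The rows are invented (one ResultPer value is deliberately missing), and the result names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing result.
df = pd.DataFrame({
    'Class': ['XII', 'XI', 'XII', 'XI'],
    'Teacher Name': ['RK Saxena', 'Neelam Sharma', 'PP Singh', 'Ravita'],
    'ResultPer': [98.0, 100.0, 99.5, np.nan]})

asc = df.sort_values(by='ResultPer')                    # ascending, NaN last
desc = df.sort_values(by='ResultPer', ascending=False)  # descending, NaN last
desc_nan_first = df.sort_values(by='ResultPer', ascending=False,
                                na_position='first')    # NaN row comes first

print(asc)
print(desc)
print(desc_nan_first)
```

Each call returns a new sorted frame; df keeps its original order because inplace is left at its default of False.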

Sorting on Multiple Columns
print("Sorting on Multiple(2) Columns")
print(df.sort_values(by=['Class', 'Teacher Name']))
print("Sorting on Multiple(3) Columns")
print(df.sort_values(['Class', 'Teacher Name', 'Year']))

Aggregate Functions
Function  Description
count     Number of non-null observations
sum       Sum of values
mean      Mean of values
mad       Mean absolute deviation (removed in pandas 2.0)
median    Arithmetic median of values
min       Minimum
max       Maximum
mode      Mode
std       Unbiased standard deviation

<Df>.<agg_fun_name>()

FUNCTION APPLICATION
A function (user-defined or from a library) can be applied on a dataframe in multiple ways:
On the whole data frame: pipe()
Row/column wise: apply()
On individual elements: applymap()

[Diagram: the data input enters the first pipe; the output of each pipe is the input to the next pipe]

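pipe(func_name, *args)
import numpy as np
import pandas as pd
d1 = {'Sal': [50000, 60000, 55000], 'bonus': [3000, 4000, 5000]}
df1 = pd.DataFrame(d1)
# Equivalent to divide(power(add(df1, 2), 2), 3)
print(df1.pipe(np.add, 2).pipe(np.power, 2).pipe(np.divide, 3))
# The same chaining with DataFrame methods, passed unbound so that pipe supplies the frame
print(df1.pipe(pd.DataFrame.add, 2).pipe(pd.DataFrame.divide, 3))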
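To contrast pipe() with apply(), here is a small sketch using the same salary data. The helper add_allowance and the lambda functions are hypothetical user-defined functions written for this illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Sal': [50000, 60000, 55000],
                    'bonus': [3000, 4000, 5000]})

def add_allowance(frame, amount):
    """Hypothetical UDF: add a fixed amount to every value of the frame."""
    return frame + amount

# pipe(): each function receives the WHOLE frame returned by the previous step.
piped = df1.pipe(add_allowance, 1000).pipe(np.divide, 2)

# apply(): the function receives one column at a time (axis=0, the default) ...
col_range = df1.apply(lambda col: col.max() - col.min())

# ... or one row at a time (axis=1).
row_total = df1.apply(lambda row: row['Sal'] + row['bonus'], axis=1)

print(piped)
print(col_range)
print(row_total)
```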

pandas.DataFrame.info() is used to get a concise summary of the dataframe.
pandas.DataFrame.describe() generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset's distribution, excluding NaN values.
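A short sketch of both calls on an invented marks table; the name summary is hypothetical.

```python
import pandas as pd

# Hypothetical marks; None becomes NaN.
df = pd.DataFrame({'Physics': [87, 65, 77, 65],
                   'Chemistry': [78.0, 70.0, None, 66.0]})

# info() prints the index range, column names, non-null counts and dtypes.
df.info()

# describe() returns a DataFrame with count, mean, std, min,
# the quartiles and max for each numeric column.
summary = df.describe()
print(summary)
```

Because NaN values are excluded, the count reported for Chemistry is one less than the number of rows.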

Grouping on a DataFrame's Column
<DF>.groupby(by=None, axis=0)
Any groupby operation involves one of the following operations on the original object:
Splitting the object
Applying a function
Combining the results
In many situations, we split the data into sets and apply some functionality on each subset. In the apply step, we can perform the following operations:
Aggregation: computing a summary statistic
Transformation: performing some group-specific operation
Filtration: discarding the data based on some condition

import pandas as pd
import numpy as np
df = pd.read_csv("resultAnalysis.csv")
print(df)
print("Grouping on Teacher Name")
grp = df.groupby(by='Teacher Name')
print(grp)

print(grp.groups)

for name, group in grp:
    print(name)
    print(group)

SELECT A GROUP
Using the get_group() method, we can select a single group.
print(grp.get_group('Namrata'))

Aggregation print (grp['ResultPer'].agg(np.mean))

Applying Multiple Aggregation Functions at Once
print(grp['ResultPer'].agg([np.sum, np.mean, np.std]))

To see the size of each group, apply the size function:
print(grp.agg(np.size))

Grouping on multiple columns and aggregation
gp = df.groupby(by=['Teacher Name', 'Class', 'Year']).agg(np.mean)
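Since the slides read their data from a CSV file that is not reproduced here, the following self-contained sketch builds a small invented frame with the same column names to illustrate the split, apply and combine steps. It passes function names as strings ('sum', 'mean', 'std') instead of NumPy functions; this is a choice made for the sketch, not what the slides use.

```python
import pandas as pd

# Hypothetical stand-in for the CSV data; rows are invented.
df = pd.DataFrame({
    'Teacher Name': ['Namrata', 'Namrata', 'Ravita', 'Ravita', 'PP Singh'],
    'Class': ['XI', 'XII', 'XI', 'XII', 'XII'],
    'ResultPer': [90.0, 96.0, 80.0, 100.0, 99.5]})

grp = df.groupby(by='Teacher Name')                   # split

namrata = grp.get_group('Namrata')                    # select one group
mean_by_teacher = grp['ResultPer'].mean()             # aggregation
stats = grp['ResultPer'].agg(['sum', 'mean', 'std'])  # several aggregations
sizes = grp.size()                                    # rows per group

# Filtration: keep only the groups whose mean result is above 92.
high = grp.filter(lambda g: g['ResultPer'].mean() > 92)

print(mean_by_teacher)
print(stats)
print(sizes)
print(high)
```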

transform()
It transforms the aggregate data by repeating the summary result for each row of the group, so that the result has the same shape as the original data.

Output before/after transform
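A minimal sketch of transform(), using invented rows and a hypothetical new column GroupMean. Each row receives the mean of its own group, so the result lines up with the original frame.

```python
import pandas as pd

# Hypothetical data; rows are invented.
df = pd.DataFrame({
    'Teacher Name': ['Namrata', 'Ravita', 'Namrata', 'Ravita'],
    'ResultPer': [90.0, 80.0, 96.0, 100.0]})

# agg would return one row per group; transform returns one value per
# ORIGINAL row, repeating the group's mean for every member of the group.
df['GroupMean'] = df.groupby('Teacher Name')['ResultPer'].transform('mean')
print(df)
```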

Reindexing and Altering Labels

Changing column labels
rename(): It simply renames the index and/or column labels in a dataframe.

Changing Column label and row index

reindex(): It lets you specify a new order for the existing row indexes and column labels, so that the records are displayed in that order.
<DF>.reindex(index=None, columns=None, fill_value=nan)

reindex_like(): Used to create index/column labels based on another dataframe object.
Observe that it contains the values of the common column 'Physics'.
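The three relabelling methods can be compared side by side in a sketch like this one. The marks, the new labels and the frame called other are all invented for illustration.

```python
import pandas as pd

# Hypothetical marks table; values are made up.
df = pd.DataFrame({'Physics': [87, 65, 77], 'Chemistry': [78, 70, 81]},
                  index=['stu1', 'stu2', 'stu3'])

# rename(): change labels only; the data stays where it is.
renamed = df.rename(columns={'Physics': 'Phy'}, index={'stu1': 'Student1'})

# reindex(): reorder existing labels; labels that do not exist yet
# ('stu4' here) are created and filled with fill_value.
reordered = df.reindex(index=['stu3', 'stu1', 'stu4'],
                       columns=['Chemistry', 'Physics'], fill_value=0)

# reindex_like(): take the index and columns of another frame.
other = pd.DataFrame({'Physics': [0, 0], 'Maths': [0, 0]},
                     index=['stu1', 'stu2'])
like_other = df.reindex_like(other)

print(renamed)
print(reordered)
print(like_other)
```

In like_other, only the common column Physics carries values from df; the Maths column does not exist in df, so it is filled with NaN.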

Knowledge is of no value unless you put it into practice. So, keep practicing.
THANKS