DATAFRAME.

A DataFrame is a pandas data structure that stores data in two dimensions, as rows and columns, much like a two-dimensional array (an array in which each element is itself an array).

Example of a DataFrame:

                    COLUMNS
                 0     1     2
INDEX /    0    10    20    30
ROWS       1    40    50    60
           2    70    80    90

CHARACTERISTICS OF DATAFRAME
* It has two indexes (axes): a row index and a column index.
* Each value is identified by the combination of its row index and column index.
* Indexes can be numbers or strings.
* Its columns can hold data of different types.
* It is both value mutable and size mutable. (Mutable means changeable.)
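The characteristics above can be seen in a small sketch. The data and the extra "Grade" column are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: one text column and one numeric column.
dtf = pd.DataFrame(
    {"Student": ["Ajay", "Aman", "Anita"], "Marks": [40, 20, 30]},
    index=["One", "Two", "Three"],
)

# Columns can hold different types.
print(dtf.dtypes)

# A value is identified by its row label plus its column label.
print(dtf.loc["Two", "Marks"])

# Value mutable: change an existing cell.
dtf.loc["Two", "Marks"] = 25

# Size mutable: add a new column (illustrative values).
dtf["Grade"] = ["B", "C", "C"]
print(dtf.shape)
```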

CREATING & DISPLAYING DATAFRAME
A DataFrame can be created from any of the following:
1) A 2-D dictionary (having items as key:value pairs)
2) A 2-D ndarray (NumPy array)
3) Series type objects
4) Another DataFrame

1) Creating DataFrame from a 2-D Dictionary:
Example:
    import pandas as pd
    d1 = {'Student': ['Ajay', 'Aman', 'Anita'], 'Marks': [40, 20, 30]}
    dtf = pd.DataFrame(d1, index=['One', 'Two', 'Three'])
    print(dtf)

Output:
          Student  Marks
    One      Ajay     40
    Two      Aman     20
    Three   Anita     30

Note: The keys of the dictionary become the column labels. If we don't specify the index, the rows are labelled 0, 1, 2, ... by default.

2) Creating a DataFrame object from 2-D ndarray:
Example:
    import pandas as pd
    import numpy as np
    a1 = np.array([[10, 20, 30], [40, 50, 60]])
    dtf = pd.DataFrame(a1, columns=['One', 'Two', 'Three'])
    print(dtf)

Output:
       One  Two  Three
    0   10   20     30
    1   40   50     60

Note: If we don't specify columns, the columns are labelled 0, 1, 2, ... by default. Also, be careful while writing DataFrame: the D and the F are capital, and the keyword argument is columns (plural).

3) Creating a DataFrame from Series type object:
Example (1):
    import pandas as pd
    salary = pd.Series([10000, 20000, 30000])
    bonus = pd.Series([100, 200, 300])
    dtf = pd.DataFrame({0: salary, 1: bonus})
    print(dtf)

Output:
           0    1
    0  10000  100
    1  20000  200
    2  30000  300

Note: The S in Series is always capital. The dictionary values must be the Series objects themselves (salary, bonus), not their names in quotes.

Example (2):
    import pandas as pd
    salary = pd.Series([10000, 20000, 30000])
    bonus = pd.Series([100, 200, 300])
    total_salary = salary + bonus
    dtf = pd.DataFrame({0: salary, 1: bonus, 2: total_salary})
    print(dtf)

Output:
           0    1      2
    0  10000  100  10100
    1  20000  200  20200
    2  30000  300  30300
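Since dictionary keys become column labels, descriptive string keys give a more readable result than 0, 1, 2. A minimal sketch of the same Series example with named columns (the key names are chosen here for illustration):

```python
import pandas as pd

salary = pd.Series([10000, 20000, 30000])
bonus = pd.Series([100, 200, 300])
total_salary = salary + bonus  # element-wise addition of two Series

# The dictionary keys become the column labels.
dtf = pd.DataFrame(
    {"salary": salary, "bonus": bonus, "total_salary": total_salary}
)
print(dtf)
```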

4) Creating DataFrame using another DataFrame:
We can pass an existing DataFrame object to DataFrame() and it will create another DataFrame object holding the same data. Here we use the DataFrame 'dtf' that we created from the NumPy array.

Example:
    import pandas as pd
    dtf1 = pd.DataFrame(dtf)
    print(dtf1)

Output:
       One  Two  Three
    0   10   20     30
    1   40   50     60

DATAFRAME ATTRIBUTES

Attribute   Description
index       The index (row labels) of the DataFrame.
columns     The column labels of the DataFrame.
axes        Returns a list representing both axes of the DataFrame.
dtypes      Returns the data types of the data in the DataFrame.
size        Returns the number of elements.
shape       Returns the number of rows and the number of columns.
values      Returns the values of the DataFrame.
empty       Checks whether the DataFrame is empty or not.
ndim        Returns the number of dimensions.
T           Transposes index and columns.
hasnans     Checks for the presence of NaNs. This is an attribute of a Series, so it is used on one column at a time, for example dtf['One'].hasnans.
len(dtf)    Returns the number of rows in the DataFrame. len() is a built-in function, not an attribute.

Attribute names are case-sensitive and are written in lowercase (except T).
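A short sketch that reads several of these attributes from the 2 x 3 DataFrame built earlier from a NumPy array:

```python
import numpy as np
import pandas as pd

dtf = pd.DataFrame(
    np.array([[10, 20, 30], [40, 50, 60]]),
    columns=["One", "Two", "Three"],
)

print(dtf.index)    # row labels
print(dtf.columns)  # column labels
print(dtf.shape)    # (rows, columns)
print(dtf.size)     # total number of elements
print(dtf.ndim)     # number of dimensions
print(dtf.empty)    # True only if there is no data
print(dtf.T)        # rows and columns swapped
print(len(dtf))     # number of rows

# hasnans belongs to Series, so it is checked per column.
print(dtf["One"].hasnans)
```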

SELECTING / ACCESSING DATA
The examples in this section use the following DataFrame:

dtf
             Population  AvgIncome
    Delhi           100       30.2
    Mumbai          500       40.0
    Kolkata         600       50.0

* To access a single column, write the name of the column as shown below:
    print(dtf['Population'])
OR
    print(dtf.Population)

Output:
    Delhi      100
    Mumbai     500
    Kolkata    600
    Name: Population, dtype: int64

* To access multiple columns, use the following syntax:
    print(dtf[['Population', 'AvgIncome']])

Note the double square brackets: the inner pair is a list of column names.

Output:
             Population  AvgIncome
    Delhi           100       30.2
    Mumbai          500       40.0
    Kolkata         600       50.0

* Accessing a subset using row / column names:
To access rows and columns by name, use loc. It is used as: dtf.loc[<row-name>, <column-name>]. A colon in place of a name means "all", so to get a whole row, make sure not to miss the colon after the comma.

Example (1):
    print(dtf.loc['Delhi', :])

Output:
    Population    100.0
    AvgIncome      30.2
    Name: Delhi, dtype: float64

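Example (2):
    print(dtf.loc['Delhi':'Mumbai', :'Population'])
Here 'Delhi':'Mumbai' is a slice of row names and :'Population' is a slice of column names.

Output:
            Population
    Delhi          100
    Mumbai         500

Example (3):
    print(dtf.iloc[0:2, 0:1])

Output:
            Population
    Delhi          100
    Mumbai         500

Note: loc accesses data by row and column labels, whereas iloc accesses data by integer position (iloc means integer location). A label slice in loc includes its end label; a position slice in iloc excludes its end position, as with ordinary Python slices.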
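The difference between label slices and position slices is easy to miss, so here is a minimal sketch on the same city data: loc includes the end label, iloc stops before the end position.

```python
import pandas as pd

dtf = pd.DataFrame(
    {"Population": [100, 500, 600], "AvgIncome": [30.2, 40.0, 50.0]},
    index=["Delhi", "Mumbai", "Kolkata"],
)

# Label-based: both 'Delhi' and 'Mumbai' are included,
# and both columns from 'Population' through 'AvgIncome'.
by_label = dtf.loc["Delhi":"Mumbai", "Population":"AvgIncome"]

# Position-based: rows 0 and 1, column 0 only (end positions excluded).
by_position = dtf.iloc[0:2, 0:1]

print(by_label)
print(by_position)
```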

* Accessing individual data / value:
To access an individual value, you can use any of the forms given below:

1) print(dtf.Population['Delhi'])
   Output: 100
2) print(dtf.Population.iloc[1])
   Output: 500
   (Older code writes this as dtf.Population[1]; recent pandas versions discourage using a bare integer as a position when the row labels are strings.)
3) print(dtf.at['Delhi', 'Population'])
   Output: 100
4) print(dtf.iat[0, 1])
   Output: 30.2

Note: In examples (1) and (2), we write the name of the column first, and then the row. In examples (3) and (4), we write the row first, and then the column, because at and iat work like loc and iloc: at takes labels and iat takes integer positions.
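A minimal sketch of at and iat on the same data, showing that they can also be used to change a single value (the new value 42.5 is made up for illustration):

```python
import pandas as pd

dtf = pd.DataFrame(
    {"Population": [100, 500, 600], "AvgIncome": [30.2, 40.0, 50.0]},
    index=["Delhi", "Mumbai", "Kolkata"],
)

# Read one value by labels (row first, then column).
value_by_label = dtf.at["Mumbai", "AvgIncome"]

# Read the same value by integer positions.
value_by_position = dtf.iat[1, 1]

# Write one value by labels.
dtf.at["Mumbai", "AvgIncome"] = 42.5
print(dtf)
```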

ADDING & DELETING COLUMNS
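(i) ADD
    (a) dtf['Density'] = 1200
    (b) dtf.loc[:, 'Density'] = 1200
The form dtf.at[:, 'Density'] = 1200 is sometimes shown as well, but at is designed for single values, so loc or plain column assignment is the more dependable choice.

Output (i):
             Population  AvgIncome  Density
    Delhi           100       30.2     1200
    Mumbai          500       40.0     1200
    Kolkata         600       50.0     1200

(ii) DELETE
    del dtf['AvgIncome']

Output (ii):
             Population  Density
    Delhi           100     1200
    Mumbai          500     1200
    Kolkata         600     1200

Note: Column names are strings and must be written in quotes.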
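A runnable sketch of adding and deleting columns. The "Area" column and its value are hypothetical, added only to show a second way of removing a column.

```python
import pandas as pd

dtf = pd.DataFrame(
    {"Population": [100, 500, 600], "AvgIncome": [30.2, 40.0, 50.0]},
    index=["Delhi", "Mumbai", "Kolkata"],
)

# Add a column by plain assignment; the scalar is repeated for every row.
dtf["Density"] = 1200

# Add another column through loc (hypothetical column and value).
dtf.loc[:, "Area"] = 1500

# Delete a column in place.
del dtf["AvgIncome"]

# drop() returns a new DataFrame and leaves dtf unchanged.
without_area = dtf.drop(columns="Area")
print(without_area)
```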

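ADDING & DELETING ROWS

(i) ADD
    dtf.loc['Chennai', :] = 1500
This sets every column of the new row 'Chennai' to 1500. The form dtf.at['Chennai', :] = 1500 is sometimes shown as well, but at is designed for single values, so loc is the more dependable choice.

Output (i):
             Population  AvgIncome
    Delhi           100       30.2
    Mumbai          500       40.0
    Kolkata         600       50.0
    Chennai        1500     1500.0

(ii) DELETE
    dtf = dtf.drop('Delhi')
drop() takes row labels and returns a new DataFrame, so the result must be assigned back. Forms such as dtf.drop(0) or dtf.drop(range(0, 1)) work only when the row labels are the default integers 0, 1, 2, ...; with named rows, use the name, or dtf.drop(dtf.index[0]) to drop by position.

Output (ii):
             Population  AvgIncome
    Mumbai          500       40.0
    Kolkata         600       50.0
    Chennai        1500     1500.0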
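A minimal sketch of adding a row with loc and removing one with drop, assuming the same city data as above:

```python
import pandas as pd

dtf = pd.DataFrame(
    {"Population": [100, 500, 600], "AvgIncome": [30.2, 40.0, 50.0]},
    index=["Delhi", "Mumbai", "Kolkata"],
)

# Add a row: every column of 'Chennai' is set to 1500.
dtf.loc["Chennai", :] = 1500

# drop() works on labels and returns a new DataFrame.
smaller = dtf.drop("Delhi")

# Dropping by position: look the label up through the index first.
also_smaller = dtf.drop(dtf.index[0])

print(smaller)
```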

BINARY FUNCTIONS IN A DATAFRAME
(i) Addition (add)

Example 1:

dtf1                      dtf2
       A  B  C                   A   B   C
    0  1  2  3                0  10  20  30
    1  4  5  6                1  40  50  60
    2  7  8  9                2  70  80  90

    print(dtf1 + dtf2)
OR
    print(dtf1.add(dtf2))

Output:
        A   B   C
    0  11  22  33
    1  44  55  66
    2  77  88  99

Note: add means addition and radd means reverse addition. dtf1.radd(dtf2) computes dtf2 + dtf1.

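Example 2:

dtf1                      dtf2
        A   B   C                A  B  C
    0  10  20  30             0  1  2  3
    1  40  50  60             1  4  5  6
                              2  7  8  9

    print(dtf1 + dtf2)
OR
    print(dtf1.add(dtf2))

Output:
         A    B    C
    0   11   22   33
    1   44   55   66
    2  NaN  NaN  NaN

Note: NaN means "Not a Number". Values are matched by row and column label, so if the DataFrames have unequal numbers of rows or columns, NaN is filled in wherever one of them has no value. (pandas prints the numbers above as 11.0, 22.0, ... because a column that contains NaN is stored as float.)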
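The NaN behaviour in Example 2 can be reproduced with the sketch below. It also shows the fill_value parameter of add(), which is not covered in the text above: it substitutes a value for the missing side before adding.

```python
import pandas as pd

dtf1 = pd.DataFrame([[10, 20, 30], [40, 50, 60]], columns=list("ABC"))
dtf2 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=list("ABC"))

# Row 2 exists only in dtf2, so the sum for that row is NaN.
total = dtf1 + dtf2
print(total)

# Treat the missing row of dtf1 as 0 instead of producing NaN.
filled = dtf1.add(dtf2, fill_value=0)
print(filled)
```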

(ii) Subtraction (sub)

Example:

dtf1                      dtf2
       A  B  C                   A   B   C
    0  1  2  3                0  10  20  30
    1  4  5  6                1  40  50  60
    2  7  8  9                2  70  80  90

    print(dtf1 - dtf2)
OR
    print(dtf1.sub(dtf2))

Output:
         A    B    C
    0   -9  -18  -27
    1  -36  -45  -54
    2  -63  -72  -81

Note: sub means subtraction and rsub means reverse subtraction. dtf1.rsub(dtf2) computes dtf2 - dtf1, that is, dtf1 is subtracted from dtf2.

(iii) Multiplication (mul)

Example:

dtf1                      dtf2
       A  B  C                   A   B   C
    0  1  2  3                0  10  20  30
    1  4  5  6                1  40  50  60
    2  7  8  9                2  70  80  90

    print(dtf1 * dtf2)
OR
    print(dtf1.mul(dtf2))

Output:
         A    B    C
    0   10   40   90
    1  160  250  360
    2  490  640  810

Note: mul means multiplication and rmul means reverse multiplication. dtf1.rmul(dtf2) computes dtf2 * dtf1.

(iv) Division (div)

Example:

dtf1                      dtf2
       A  B  C                   A   B   C
    0  1  2  3                0  11  22  33
    1  4  5  6                1  44  55  66
    2  7  8  9                2  77  88  99

    print(dtf2 / dtf1)
OR
    print(dtf2.div(dtf1))

Output:
          A     B     C
    0  11.0  11.0  11.0
    1  11.0  11.0  11.0
    2  11.0  11.0  11.0

Note: div means division and rdiv means reverse division. dtf1.rdiv(dtf2) computes dtf2 / dtf1, that is, dtf2 is divided by dtf1.
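The reverse variants simply swap the operands. A small sketch checking each one against the equivalent operator expression:

```python
import pandas as pd

dtf1 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=list("ABC"))
dtf2 = pd.DataFrame(
    [[11, 22, 33], [44, 55, 66], [77, 88, 99]], columns=list("ABC")
)

# Each r-method puts the argument on the left of the operator.
same_add = dtf1.radd(dtf2).equals(dtf2 + dtf1)
same_sub = dtf1.rsub(dtf2).equals(dtf2 - dtf1)
same_mul = dtf1.rmul(dtf2).equals(dtf2 * dtf1)
same_div = dtf1.rdiv(dtf2).equals(dtf2 / dtf1)

quotient = dtf1.rdiv(dtf2)  # dtf2 / dtf1
print(quotient)
```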

Functions in dataframe

(i) info():- info() prints information about the DataFrame object: its index, the data columns, the number of non-null values and the data type of each column, the number of rows, and the memory usage. It is used as df.info().

(ii) describe():- describe() provides summary statistics for the numerical columns. If we give the argument include='all' to describe(), it provides a summary of all the columns. It is used as df.describe() or df.describe(include='all').
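A minimal sketch of both functions on the student data used earlier:

```python
import pandas as pd

dtf = pd.DataFrame(
    {"Student": ["Ajay", "Aman", "Anita"], "Marks": [40, 20, 30]}
)

# Prints column names, non-null counts, dtypes and memory usage.
dtf.info()

# Numerical columns only: count, mean, std, min, quartiles, max.
summary = dtf.describe()
print(summary)

# All columns, including the text column.
full = dtf.describe(include="all")
print(full)
```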

Cumulative calculation function
cumsum() calculates the cumulative sum, i.e., in the output of this function, the value in each row is replaced by the sum of all prior rows including this row. It is used as df.cumsum().

Example:

df
       A  B  C
    0  1  2  3
    1  4  5  6
    2  7  8  9

    print(df.cumsum())

Output:
        A   B   C
    0   1   2   3
    1   5   7   9
    2  12  15  18
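The example above sums down each column. As a sketch, the axis parameter of cumsum() (not covered in the text above) switches the direction so that the running total goes across each row instead:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=list("ABC"))

# Default: running total down each column.
down = df.cumsum()
print(down)

# axis=1: running total across each row.
across = df.cumsum(axis=1)
print(across)
```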

Handling missing data

(a) dropna():- This drops all the rows that have NaN values in them, even a row with a single NaN in it. It is used as dtf.dropna(). If we give the argument how='all', it drops only those rows in which all the values are NaN.

(b) fillna():- If we want to fill the missing data with some appropriate value, we can use fillna(). It is used as, for example, dtf.fillna(50). With this, all the NaNs are filled with 50. We can also pass a dictionary to fillna() to specify a fill value for each column separately. It is used as, for example, dtf.fillna({'First_name': 'x', 'last_name': 'y'}).

(c) isnull():- It is used to detect missing values; it returns True or False for each value. It is used as dtf.isnull().
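A small sketch of the three functions. The names in the sample data are invented; the column labels follow the fillna() example above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 has one NaN, row 2 is entirely NaN.
dtf = pd.DataFrame(
    {
        "First_name": ["Asha", np.nan, np.nan],
        "last_name": ["Rao", "Sen", np.nan],
    }
)

any_nan_dropped = dtf.dropna()           # keeps only row 0
all_nan_dropped = dtf.dropna(how="all")  # keeps rows 0 and 1

# A separate fill value for each column.
filled = dtf.fillna({"First_name": "x", "last_name": "y"})

# True where a value is missing.
missing = dtf.isnull()
print(missing)
```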

Combining dataframes

(i) combine_first():- combine_first() combines two dataframes by patching the data. If a certain cell in the first dataframe has missing data and the corresponding cell (the one with the same index and column label) in the other dataframe has valid data, this method picks the data from the second dataframe and patches it into the first dataframe.

For example:

dtf1                        dtf2
       Price  Qty                  Price  Qty  Rate
    0   1000   10               0   1000   10     2
    1   2000  NaN               1   2000   20     4

    print(dtf1.combine_first(dtf2))

Output:
       Price  Qty  Rate
    0   1000   10     2
    1   2000   20     4
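The patching behaviour can be checked with a minimal sketch built from the same two tables:

```python
import numpy as np
import pandas as pd

dtf1 = pd.DataFrame({"Price": [1000, 2000], "Qty": [10, np.nan]})
dtf2 = pd.DataFrame(
    {"Price": [1000, 2000], "Qty": [10, 20], "Rate": [2, 4]}
)

# Values from dtf1 win; gaps in dtf1 are filled from dtf2.
patched = dtf1.combine_first(dtf2)
print(patched)
```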

(ii) concat():- concat() can concatenate two dataframes along the rows or along the columns. The row axis is 0 and the column axis is 1. The dataframes are passed together as a list.

For example:

df1                         df2
       Price  Qty                  Price  Qty
    0   1000   10               0   1000   10
    1   2000  NaN               1   2000   20

    df3 = pd.concat([df1, df2])
    print(df3)

Output:
       Price  Qty
    0   1000   10
    1   2000  NaN
    0   1000   10
    1   2000   20

By default, axis is 0, i.e., by default dataframes are concatenated row-wise. Each row keeps its original index label, which is why 0 and 1 appear twice.
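A sketch of both directions of concatenation. It also uses ignore_index=True, a concat() parameter not mentioned above, to renumber the rows of the result.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"Price": [1000, 2000], "Qty": [10, np.nan]})
df2 = pd.DataFrame({"Price": [1000, 2000], "Qty": [10, 20]})

# axis=0 (default): stack the rows; original labels are kept.
rows = pd.concat([df1, df2])

# Same, but renumber the rows 0..3.
rows_renumbered = pd.concat([df1, df2], ignore_index=True)

# axis=1: place the dataframes side by side, matched on row label.
side_by_side = pd.concat([df1, df2], axis=1)

print(rows)
print(side_by_side)
```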

(iii) merge():- merge() combines two dataframes such that rows with common values are merged together. With the help of merge(), we can specify the field on the basis of which we want to combine the two dataframes.

For example:

dtf1                        dtf2
       Price  Qty                  Name  Qty  Price
    0   1000   10               0   xyz   10   1000
    1   2000  NaN               1   abc   20   2000

    dtf3 = pd.merge(dtf1, dtf2, on='Price')
    print(dtf3)

Output:
       Price  Qty_x  Name  Qty_y
    0   1000     10   xyz     10
    1   2000    NaN   abc     20

Note: The column name given to on is case-sensitive, so it must be 'Price', not 'price'. The column used for merging appears once in the result. Other columns that exist in both dataframes (here Qty) get the suffixes _x (from the first dataframe) and _y (from the second).
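The merge example can be run as the following sketch, which also confirms the column names of the result:

```python
import numpy as np
import pandas as pd

dtf1 = pd.DataFrame({"Price": [1000, 2000], "Qty": [10, np.nan]})
dtf2 = pd.DataFrame(
    {"Name": ["xyz", "abc"], "Qty": [10, 20], "Price": [1000, 2000]}
)

# Rows with the same Price are joined into one row.
dtf3 = pd.merge(dtf1, dtf2, on="Price")
print(dtf3)
print(list(dtf3.columns))
```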

