Pandas

Presentation transcript:

Pandas

Based on:
    Series: 1D labelled single-type arrays
    DataFrames: 2D labelled multi-type arrays

Generally in 2D arrays, one can have the first dimension as rows or columns; the computer doesn't care. Pandas data is labelled in the sense of having column names and row indices (which can be names). This forces the direction of data (rows are rows, and can't contain columns). This makes things easier.

    data.info()    gives info on labels and datatypes.

While numpy forms the solid foundation of many packages involved in data analysis, its multi-dimensional nature makes some aspects of analysis over-complicated. Because of this, many people use pandas, a library specifically set up to work on one and two dimensional data, and timeseries data especially. Pandas is very widely used in the scientific and data analytical communities. It is built on ndarrays.

One specific thing which makes pandas attractive is that it forces the first dimension of any data to be considered as rows, and the second as columns, avoiding the general ambiguity with arrays. The second useful thing is that the array rows and columns can be given names (for rows, these are called "indices"). While this can be done in numpy, it isn't especially simple. Pandas makes it very intuitive. As well as the standard ndarray functions for finding out about arrays, the "info" function will tell you about pandas arrays.

Creating Series

    import numpy
    import pandas

    data = [1, 2, 3, numpy.nan, 5, 6]           # nan == Not a Number
    unindexed = pandas.Series(data)
    indices = ['a', 'b', 'c', 'd', 'e', 'f']    # One label per value.
    indexed = pandas.Series(data, index=indices)
    data_dict = {'a' : 1, 'b' : 2, 'c' : 3}
    indexed = pandas.Series(data_dict)
    fives = pandas.Series(5, indices)           # Fill with 5s.
    named = pandas.Series(data, name='mydata')
    named = named.rename('my_data')             # rename() returns a new Series.
    print(named.name)

Pandas data series are 1D arrays. The main difference from creating an ndarray is that each row can have its own 'index' label, and the series can have an overall name. The above shows how to set up: an array without indices; an array with indices using a list of indices; an array using a dict; an array filled with a single value repeated; a named series. We'll see shortly how to set up a multi-series dataframe.
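As a minimal sketch of the points above (the values and variable names are only illustrative), the following builds an indexed, named series and checks that rename() leaves the original untouched:

```python
import numpy
import pandas

data = [1, 2, 3, numpy.nan, 5, 6]
indices = ['a', 'b', 'c', 'd', 'e', 'f']   # the index needs one label per value

indexed = pandas.Series(data, index=indices, name='mydata')
fives = pandas.Series(5, index=indices)    # a single value repeated for each label

# rename() returns a new Series; the original keeps its name.
renamed = indexed.rename('my_data')

print(indexed.name)
print(renamed.name)
print(indexed.dtype)   # the nan forces a floating point dtype
```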

Datetime data

Treated specially.

    date_indices = pandas.date_range('20130101', periods=6, freq='D')

Generates six days, '2013-01-01' to '2013-01-06'. Although dates have hyphens when printed, they can be given without them for indexing etc. (pandas parses both forms). Frequency is days by default, but can be 'Y', 'M', 'W' or 'H', 'min', 'S', 'ms', 'us', 'N' and others (the exact spellings of some aliases vary between pandas versions): http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases

Dataseries having indices representing time and date data are treated specially in pandas, especially when part of dataframes. The simplest way to set up a date series for use as an index is shown above.
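A short sketch of using such a date range as an index, assuming a made-up series of six integers. It looks up a day with and without hyphens and takes a label slice, which includes both end points:

```python
import pandas

date_indices = pandas.date_range('20130101', periods=6, freq='D')
series = pandas.Series(range(6), index=date_indices)

# Date strings are parsed, so both spellings find the same row.
compact = series.loc['20130103']
hyphenated = series.loc['2013-01-03']

# Label slices on dates include the start and the end.
three_days = series.loc['2013-01-02':'2013-01-04']

print(compact, hyphenated, len(three_days))
```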

Print

As with numpy. Also:

    df.head()       First few lines
    df.tail(5)      Last 5 lines
    df.index
    df.columns
    df.values
    df.sort_index(axis=1, ascending=False)
    df.sort_values(by='col1')

Pandas prints as with numpy, but there are some additional attributes of dataseries, and functions as above, for printing subsets of the information in series.
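The calls above can be tried on a small dataframe. This is only a sketch; the column names and values are invented for the example:

```python
import pandas

df = pandas.DataFrame({'col1': [3, 1, 2], 'col2': [10, 30, 20]},
                      index=['a', 'b', 'c'])

first_two = df.head(2)                                   # first two rows
last_one = df.tail(1)                                    # last row
by_value = df.sort_values(by='col1')                     # rows ordered by col1
cols_reversed = df.sort_index(axis=1, ascending=False)   # columns in reverse label order

print(list(df.index), list(df.columns), df.values.shape)
print(by_value)
```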

Indexing

    named[0]                     1
    named[:2]                    a    1
                                 b    2
    named["a"]                   1
    "a" in named                 True
    named.get('f', numpy.nan)    Returns default if index not found

Can also take equations:

    a[a > a.median()]

Indexing is as with numpy, including being able to pull out specific values using arrays. However, you can also get hold of rows in a dataseries using their index label. Note that you can also use "where"-like statements, for example in the above "give me all a, where the value in a is greater than the median of a".
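A small sketch of these lookups on an invented three-value series. It uses iloc for positions, on the assumption that recent pandas releases discourage plain integer positions in square brackets when the series has string labels:

```python
import numpy
import pandas

named = pandas.Series({'a': 1, 'b': 2, 'c': 3}, name='mydata')

first = named.iloc[0]                     # by position
first_two = named.iloc[:2]                # positional slice
by_label = named['a']                     # by index label
has_a = 'a' in named                      # membership tests the labels
missing = named.get('f', numpy.nan)       # default when the label is absent
above_median = named[named > named.median()]   # boolean "where"-like filter

print(first, by_label, has_a, missing)
print(above_median)
```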

Looping

    for value in series:
        etc.

But generally unneeded: most operations take in series and process them elementwise.

Series will act as iterators, but in general this isn't used, as most functions and operators will act on them elementwise and return a single value or an array containing the appropriate answers.
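To illustrate why explicit loops are rarely needed, here is a brief sketch (with invented values) comparing a loop with the equivalent elementwise operations:

```python
import numpy
import pandas

series = pandas.Series([1.0, 4.0, 9.0], index=['a', 'b', 'c'])

looped = [value * 2 for value in series]   # explicit iteration works...
doubled = series * 2                       # ...but operators act elementwise
roots = numpy.sqrt(series)                 # as do numpy functions
total = series.sum()                       # reductions return a single value

print(doubled)
print(roots)
print(total)
```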

DataFrames

    Dict of 1D ndarrays, lists, dicts, or Series
    2-D numpy.ndarray
    Series

    data_dict = {'col1' : [1, 2, 3, 4],
                 'col2' : [10, 20, 30, 40]}

Lists here could be ndarrays or Series.

    indices = ['a', 'b', 'c', 'd']
    df = pandas.DataFrame(data_dict, index = indices)

If no indices are passed in, numbered from zero. If data is shorter than the current columns, it is filled with numpy.nan (this applies to Series, which are aligned on their labels; plain lists of unequal length raise an error).

DataFrames are multiple dataseries with zero, one, or many indices and column/series names. The above shows how to set these up using a dict.
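The following sketch shows the two behaviours mentioned above: default numbering when no index is given, and nan filling when a column (given here as a Series, with invented values) is shorter than the others:

```python
import numpy
import pandas

indices = ['a', 'b', 'c', 'd']
data_dict = {
    'col1': pandas.Series([1, 2, 3, 4], index=indices),
    'col2': pandas.Series([10, 20, 30], index=['a', 'b', 'c']),   # no value for 'd'
}
df = pandas.DataFrame(data_dict, index=indices)

# Without an index, rows are numbered from zero.
unlabelled = pandas.DataFrame({'col1': [1, 2, 3, 4]})

print(df)
print(list(unlabelled.index))
```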

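    a = pandas.DataFrame.from_items(
            [('col1', [1, 2, 3]), ('col2', [4, 5, 6])])

       col1  col2
    0     1     4
    1     2     5
    2     3     6

    a = pandas.DataFrame.from_items(
            [('A', [1, 2, 3]), ('B', [4, 5, 6])],
            orient='index', columns=['col1', 'col2', 'col3'])

       col1  col2  col3
    A     1     2     3
    B     4     5     6

An alternative when using lists is to use "from_items()", which gives more control over how list data is orientated. Note that from_items() was removed in pandas 1.0; DataFrame.from_dict() accepts the same orient and columns arguments.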
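Since from_items() is not available in current pandas, a minimal sketch of the same two layouts using DataFrame.from_dict() might look like this (same data as the slide):

```python
import pandas

# Each dict entry becomes a column (the default orientation).
by_column = pandas.DataFrame.from_dict({'col1': [1, 2, 3], 'col2': [4, 5, 6]})

# With orient='index', each dict entry becomes a row instead,
# and the column names are supplied separately.
by_row = pandas.DataFrame.from_dict(
    {'A': [1, 2, 3], 'B': [4, 5, 6]},
    orient='index', columns=['col1', 'col2', 'col3'])

print(by_column)
print(by_row)
```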

I/O

    df = pandas.read_csv('data.csv')
    df.to_csv('data.csv')
    df = pandas.read_excel('data.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
    df.to_excel('data.xlsx', sheet_name='Sheet1')
    df = pandas.read_json('data.json')
    json = df.to_json()        Can then be written to a text file.

Wide variety of other formats including HTML and the local CTRL-C copy clipboard: http://pandas.pydata.org/pandas-docs/stable/io.html

It is usual to read a dataframe in. The above shows three ways to do this, but there are a large number of others. For excel, see: http://pandas.pydata.org/pandas-docs/stable/io.html#io-excel
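A rough sketch of a CSV and JSON round trip follows. To keep it self-contained it uses in-memory text buffers (io.StringIO) rather than the data.csv and data.json files named above; that substitution, and the sample values, are assumptions made for the example:

```python
import io
import pandas

df = pandas.DataFrame({'col1': [1, 2], 'col2': [10, 20]}, index=['a', 'b'])

# to_csv() with no path returns the text that would be written to file.
csv_text = df.to_csv()
from_csv = pandas.read_csv(io.StringIO(csv_text), index_col=0)

# to_json() likewise returns a string, which could be written to a text file.
json_text = df.to_json()
from_json = pandas.read_json(io.StringIO(json_text))

print(csv_text)
print(json_text)
```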

Adding to DataFrames

    concat()    adds dataframes
    join()      joins SQL style
    append()    adds rows (removed in pandas 2.0; use concat() instead)
    insert()    inserts columns at a specific location

Remove

    df.sub(df['col1'], axis=0)

(though you might also see df - df['col1'], which matches the series labels against the column names instead and so generally gives a different result)

The above shows the various methods for adding and subtracting. For more information on this, see: https://pandas.pydata.org/pandas-docs/stable/10min.html
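A brief sketch of these methods on invented data. It uses concat() for adding rows, as append() is not available in pandas 2.0 and later:

```python
import pandas

df = pandas.DataFrame({'col1': [1, 2], 'col2': [10, 20]}, index=['a', 'b'])
more_rows = pandas.DataFrame({'col1': [3], 'col2': [30]}, index=['c'])
extra = pandas.DataFrame({'col3': [100, 200]}, index=['a', 'b'])

stacked = pandas.concat([df, more_rows])   # adds the rows of one frame to another
joined = df.join(extra)                    # SQL-style join on the index labels

df.insert(1, 'col1b', [5, 6])              # new column at position 1, changes df in place

# Subtract col1 from every column, matching on the row labels.
diff = df.sub(df['col1'], axis=0)

print(stacked)
print(joined)
print(diff)
```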

Broadcasting

When working with a single series and a dataframe, the series will sometimes be assumed a row. Check. This doesn't happen with time series data.

For more details on broadcasting, see: http://pandas.pydata.org/pandas-docs/stable/dsintro.html#data-alignment-and-arithmetic

Again, there are a variety of rules for broadcasting. It's better, again, not to rely on these, but worth understanding them so you recognise when it is happening. In particular, different functions can treat dataseries as a set of rows, or a column, and this sometimes makes a difference.
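The sketch below, on invented data, shows a series being treated as a row (its labels are matched against the column names), the explicit per-column alternative, and what happens when the labels do not match:

```python
import pandas

df = pandas.DataFrame({'col1': [1, 2], 'col2': [10, 20]}, index=['a', 'b'])

# A series whose labels are column names is treated as a row
# and subtracted from every row of the dataframe.
row_like = pandas.Series({'col1': 1, 'col2': 10})
by_row = df - row_like

# To treat a series as a column, say so explicitly with axis=0.
by_col = df.sub(df['col1'], axis=0)

# Here the series labels ('a', 'b') are matched against the column
# names, nothing lines up, and every cell becomes nan.
surprising = df - df['col1']

print(by_row)
print(by_col)
print(surprising)
```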

Indexing

By name:

    named.loc['row1',['col1','col2']]

(you may also see at[] used)

Note that with slicing on labels, both start and end are returned. Note also that loc is indexed with square brackets; it is not a function.

    iloc[1,1]    used for positions.

Indexing is as with numpy, including being able to pull out specific values using arrays. However, you can also get hold of rows in a dataseries using their index label and column names. Above are the basics of indexing using both names and locations.
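A minimal sketch contrasting label-based and position-based indexing, using an invented three-row dataframe:

```python
import pandas

df = pandas.DataFrame({'col1': [1, 2, 3], 'col2': [10, 20, 30]},
                      index=['row1', 'row2', 'row3'])

by_name = df.loc['row1', ['col1', 'col2']]   # one row, two named columns
scalar = df.at['row1', 'col2']               # single value by labels
single = df.iloc[1, 1]                       # single value by positions

label_slice = df.loc['row1':'row2']          # label slices include the end label
position_slice = df.iloc[0:1]                # position slices exclude the end position

print(by_name)
print(len(label_slice), len(position_slice))
```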

Indexing

    Operation                          Syntax            Result
    Select column                      df[col]           Series
    Select row by label                df.loc[label]     Series representing row
    Select row by integer location     df.iloc[loc]      Series representing row
    Slice rows                         df[5:10]          DataFrame
    Select rows by boolean 1D array    df[bool_array]    DataFrame

Pull out specific rows:

    df.query('Col1 < 10')

(See also isin() )

The above gives more detail, and is taken from: http://pandas.pydata.org/pandas-docs/stable/dsintro.html#indexing-selection

query and isin allow for filtering of data by values. For more info on the latter, see: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.isin.html#pandas.Series.isin
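As a sketch of filtering by value (the column names and contents are invented), query(), a boolean array and isin() can be used like this:

```python
import pandas

df = pandas.DataFrame({'Col1': [5, 15, 8], 'Col2': ['x', 'y', 'z']})

small = df.query('Col1 < 10')            # rows where the expression holds

mask = df['Col1'] < 10                   # the same filter as a boolean 1D array
same = df[mask]

chosen = df[df['Col2'].isin(['x', 'y'])] # rows whose Col2 is one of the listed values

print(small)
print(chosen)
```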

Sparse data

    df1.dropna(how='any')      Drop rows associated with numpy.nan
    df1.fillna(value=5)
    a = pandas.isna(df1)       Boolean mask array

Pandas is especially well set up for sparse data, that is data where lots of the cells are not numbers (explicitly in pandas given the value numpy.nan). This includes various memory-efficient storage options, but also functions to trim down the datasets (see above) and the fact that functions will skip nans.
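The three calls above, plus the nan-skipping behaviour of functions, can be sketched on a small invented dataframe:

```python
import numpy
import pandas

df1 = pandas.DataFrame({'col1': [1.0, numpy.nan, 3.0],
                        'col2': [10.0, 20.0, numpy.nan]})

dropped = df1.dropna(how='any')   # keep only rows with no nan at all
filled = df1.fillna(value=5)      # replace every nan with 5
mask = pandas.isna(df1)           # True wherever a value is missing

# Functions skip nan by default, so each mean uses two values.
col_means = df1.mean()

print(dropped)
print(mask)
print(col_means)
```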

Stack

    df.stack()

Combines the last two columns into one with a column of labels:

    A   B
    10  20
    30  40

becomes

    A  10
    B  20
    A  30
    B  40

unstack() does the opposite. For more sophistication, see pivot tables: http://pandas.pydata.org/pandas-docs/stable/reshaping.html#pivot-tables

There are also a wide range of ways of reshaping arrays. This includes array_name.T, for getting the array transposed, but also a wide range of ways of producing pivot tables where rows and columns are manipulated against each other. Simple pivot-like tables can be produced with stack (above), but there are more sophisticated versions.
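A short sketch of the stack example above in code. The stacked result is a series whose index has two levels, the original row number and the column label:

```python
import pandas

df = pandas.DataFrame({'A': [10, 30], 'B': [20, 40]})

stacked = df.stack()            # one value per (row, column label) pair
restored = stacked.unstack()    # back to one column per label

print(stacked)
print(restored)
```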

Looping

    for col in df.columns:
        series = df[col]
        for index in df.index:
            print(series[index])

But, again, generally unneeded: most operations take in series and process them elementwise.

If column names are sound variable names, they can be accessed like: df.col1

Again, looping is largely redundant because most functions and operations work elementwise.

Columns as results

    df['col3'] = df['col1'] + df['col2']
    df['col2'] = df['col1'] < 10        # Boolean values
    df['col1'] = 10
    df2 = df.assign(col3 = someFunction(df['col1']), col4 = 10)

Assign always returns a copy. Note names not strings. Cols were inserted alphabetically in older pandas versions, where you also couldn't use cols created elsewhere in the statement.

Columns can be created as the result of equations or functions using other columns or maths, either by including them in standard assignments, or using the assign() function which takes in either functions or equations. In all cases, these work elementwise. Note that assign assumes columns for the results will be given using variable names, so the column names have to be good variable names.
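A minimal sketch of both approaches follows. The lambda passed to assign() is a stand-in for whatever function is wanted, and the column names are invented:

```python
import pandas

df = pandas.DataFrame({'col1': [1, 20], 'col2': [10, 20]})

# Standard assignment changes df itself.
df['col3'] = df['col1'] + df['col2']
df['is_small'] = df['col1'] < 10          # Boolean values

# assign() leaves df alone and returns a copy with the new columns.
# A function given to assign() receives the dataframe as its argument.
df2 = df.assign(col4=lambda d: d['col1'] * 2, col5=10)

print(df)
print(df2)
```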

Functions

Generally functions on two series result in the union of the labels. Generally both series and dataframes work well with numpy functions. Note that the documentation sometimes calls functions "operations". Operations generally exclude nan.

    df.mean()     Per column
    df.mean(1)    Per row

Complete API of functions at: http://pandas.pydata.org/pandas-docs/stable/api.html

Note that most functions work on columns by default, but can be forced to work per-row.
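The sketch below, with invented values, shows the union of labels when two series are combined, and the per-column and per-row forms of mean() skipping a nan:

```python
import numpy
import pandas

s1 = pandas.Series([1, 2], index=['a', 'b'])
s2 = pandas.Series([10, 20], index=['b', 'c'])

# The result has labels a, b and c; only b exists in both,
# so a and c come out as nan.
total = s1 + s2

df = pandas.DataFrame({'col1': [1.0, numpy.nan], 'col2': [3.0, 5.0]})
per_column = df.mean()        # one value per column, nan skipped
per_row = df.mean(axis=1)     # one value per row, nan skipped

print(total)
print(per_column)
print(per_row)
```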

Useful operations

    df.describe()             Quick stats summary
    df.apply(function)        Applies function, lambda etc. to data in cols
    df.value_counts()         Histogram
    df.T or df.transpose()    Transpose rows for columns.

Here are a few useful operations.
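A quick sketch of these operations on an invented dataframe. value_counts() is called on a single column here, which is the simplest case:

```python
import pandas

df = pandas.DataFrame({'col1': [1, 2, 2], 'col2': [10, 20, 30]})

summary = df.describe()                                 # count, mean, std, min, quartiles, max
ranges = df.apply(lambda col: col.max() - col.min())    # function applied to each column
counts = df['col1'].value_counts()                      # how often each value occurs
flipped = df.T                                          # rows become columns

print(summary)
print(ranges)
print(counts)
```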

Categorical data

Categorical data can be:
    renamed
    sorted
    grouped by (histogram-like)
    merged and unioned

http://pandas.pydata.org/pandas-docs/stable/categorical.html

This is a brief introduction to pandas, but it should be said that there is a lot more sophistication in the data handling, especially if you explicitly define data as either categorical or timeseries. The above outlines some things you can do with categorical data.
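A small sketch of declaring a series as categorical and then renaming, sorting and counting its categories. The grade labels are invented for the example:

```python
import pandas

grades = pandas.Series(['b', 'a', 'c', 'a'], dtype='category')

renamed = grades.cat.rename_categories({'a': 'A', 'b': 'B', 'c': 'C'})
sorted_grades = grades.sort_values()    # sorted by category order
counts = grades.value_counts()          # histogram-like count per category

print(list(grades.cat.categories))
print(counts)
```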

Time series data

Time series data can be:
    resampled at lesser frequencies
    converted between time zones and representations
    sub-sampled and used in maths and indexing
    analysed for holidays and business days
    shifted/lagged

http://pandas.pydata.org/pandas-docs/stable/timeseries.html

The above does the same for timeseries.
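Three of these operations can be sketched on an invented six-day series. The time zone name is just an example and assumes time zone data is available on the system:

```python
import pandas

idx = pandas.date_range('20130101', periods=6, freq='D')
ts = pandas.Series([1, 2, 3, 4, 5, 6], index=idx)

two_day = ts.resample('2D').sum()   # resample to a lesser frequency, summing each pair of days
lagged = ts.shift(1)                # shift values one step later; the first becomes nan

utc = ts.tz_localize('UTC')                 # attach a time zone to the index
local = utc.tz_convert('Europe/London')     # convert to another time zone

print(two_day)
print(lagged)
print(local.index.tz)
```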

Plotting

matplotlib wrapped in pandas:
Complex plots may need matplotlib.pyplot.figure() calling first.

    series.plot()                         Line plot a series
    dataframe.plot()                      Line plot a set of series
    dataframe.plot(x='col1', y='col2')    Plot data against another

    bar or barh       bar plots
    hist              histogram
    box               boxplot
    kde or density    density plots
    area              area plots
    scatter           scatter plots
    hexbin            hexagonal bin plots
    pie               pie graphs

http://pandas.pydata.org/pandas-docs/stable/visualization.html

Finally, it is worth noting that pandas integrates matplotlib. There are some basic wrappers for easy data plotting, though you may find you need to import pyplot from matplotlib for more complicated layouts and call matplotlib.pyplot.figure().

Packages built on pandas

    Statsmodels       Statistics and econometrics
    sklearn-pandas    Scikit-learn (machine learning) with pandas
    Bokeh             Big data visualisation
    seaborn           Data visualisation
    yhat/ggplot       Grammar of Graphics visualisation
    Plotly            Web visualisation using D3.js
    IPython/Spyder    Both allow integration of pandas dataframes
    GeoPandas         Pandas for mapping

http://pandas.pydata.org/pandas-docs/stable/ecosystem.html

As noted, pandas is widely used, and has its own ecosystem of packages that are built on it. Above are some of the main packages. Perhaps the most notable for geographers is geopandas, which we'll look at shortly.