Pandas: Python Programming for Spreadsheets Pamela Wu Sept. 17 th 2015.

Slides:



Advertisements
Similar presentations
MICS Data Processing Workshop
Advertisements

The essentials managers need to know about Excel
Part 2.  Enter formulas  Select Cells  Format Cell Contents  Insert Borders  Standard Error Values & How to Correct  Format Numbers.
UNLOCKING THE SECRETS HIDDEN IN YOUR DATA Part 3 Data Analysis.
Muhammad Qasim Rafique MS. EXCEL 2007.
1 CA202 Spreadsheet Application Combining Data from Multiple Sources Lecture # 6.
Microsoft Excel Computers Week 4.
Microsoft Excel The Basics. spreadsheet A type of application program which manipulates numerical and string data in rows and columns of cells. The value.
Understanding Microsoft Excel
Microsoft Excel 2013 An Overview. Environment Quick Access Toolbar Customizable toolbar for one-click shortcuts Tabs Backstage View Tools located outside.
Managing Grades with Excel Viewing Help To view Help 1.Open Excel on your computer. 2.In the top right hand corner of the Excel Screen type in the.
Introduction to GTECH 201 Session 13. What is R? Statistics package A GNU project based on the S language Statistical environment Graphics package Programming.
Chapter 7 Data Management. Agenda Database concept Import data Input and edit data Sort data Function Filter data Create range name Calculate subtotal.
Chapter 1 Introduction to Spreadsheet. Agenda Download the practice files Spreadsheet application Workbook and worksheet Toolbar Cell Formatting Printing.
Google Confidential and Proprietary 1 Intro to Docs Google Apps Apps.
Computer Literacy BASICS: A Comprehensive Guide to IC 3, 5 th Edition Lesson 15 Working with Tables 1 Morrison / Wells / Ruffolo.
Advanced Tables Lesson 9. Objectives Creating a Custom Table When a table template doesn’t suit your needs, you can create a custom table in Design view.
Excel 2007 Part (2) Dr. Susan Al Naqshbandi
Mr Shum Spreadsheets eBooklet. Key Words Key Word CellAn individual box on a spreadsheet RowCells going across in an horizontal line. All rows have a.
Introduction to Python Lecture 1. CS 484 – Artificial Intelligence2 Big Picture Language Features Python is interpreted Not compiled Object-oriented language.
Coding for Excel Analysis Optional Exercise Map Your Hazards! Module, Unit 2 Map Your Hazards! Combining Natural Hazards with Societal Issues.
Microsoft Word 2000: Mail Merge Basics Peggy Serfazo Marple Molly Calvello Support Professionals Business Applications - Desktop Microsoft Corporation.
Google Confidential and Proprietary 1 Advanced Docs Google Apps.
Spreadsheet A spreadsheet is the computer equivalent of a paper ledger sheet. It consists of a grid made from columns and rows. It is an environment that.
Handling Lists F. Duveau 16/12/11 Chapter 9.2. Objectives of the session: Tools: Everything will be done with the Python interpreter in the Terminal Learning.
First Program  Open a file  In Shell  Type into the file: 3  You did it!!! You wrote your first instruction, or code, in python!
Lesson 5 Using FunctionUsing Function. Objectives.
1 Lesson 18 Organizing and Enhancing Worksheets Computer Literacy BASICS: A Comprehensive Guide to IC 3, 3 rd Edition Morrison / Wells.
Just as there are many human languages, there are many computer programming languages that can be used to develop software. Some are named after people,
Collecting Things Together - Lists 1. We’ve seen that Python can store things in memory and retrieve, using names. Sometime we want to store a bunch of.
Chapter 17 Creating a Database.
Getting Started With Stata Session 1 Jim Anthony John Troost Department of Epidemiology Michigan State University.
CCS – Mail Merge Mail Merge This presentation is incomplete without the associated discussion 1 Coloma Community Schools In-service 21 March 2014.
Microsoft® Excel Key and format dates and times. 1 Use Date & Time functions. 2 Use date and time arithmetic. 3 Use the IF function. 4 Create.
Key Applications Module Lesson 17 — Organizing Worksheets Computer Literacy BASICS.
Lesson 1 – Microsoft Excel * The goal of this lesson is for students to successfully explore and describe the Excel window and to create a new worksheet.
1 Lesson 13 Organizing and Enhancing Worksheets Computer Literacy BASICS: A Comprehensive Guide to IC 3, 3 rd Edition Morrison / Wells.
Overview Excel is a spreadsheet, a grid made from columns and rows. It is a software program that can make number manipulation easy and somewhat painless.
CTC Guidelines CTC : Casuarina Transcriptome Compendium.
Basic Session. Course contents Overview: A hands-on introduction Section 1: What’s changed, and why Section 2: Get to work in Excel Section 3: Edit data.
Mathematical Formulas and Excel
1 EPIB 698C Lecture 1 Instructor: Raul Cruz-Cano
Functions Commands Programming Steps Examples Printing Homework Hints.
Exporting & Formatting Budgets from FlexGen, NextGen & Zortec into Excel.
Today… Modularity, or Writing Functions. Winter 2016CISC101 - Prof. McLeod1.
VERIFYING SPECIAL ED DATA TAMMY SOLTIS IU 5 DATA SUPERVISOR.
For Datatel and other applications Presented by Cheryl Sullivan.
Computer/LMS Access To log onto one of these computers: Enter your Username, for example: 2014BNS099 followed So a complete login.
Python Fundamentals: Complex Data Structures Eric Shook Department of Geography Kent State University.
Formulas, Functions, and other Useful Features
Finalizing a Worksheet
Key Applications Module Lesson 17 — Organizing Worksheets
Logan-Hocking Schools
Performing Mail Merges
European Computer Driving Licence
Functions CIS 40 – Introduction to Programming in Python
Microsoft Excel 101.
Unit 4: Using Spreadsheets to Make Economic Choices Lessons 20–26
Exporting & Formatting Budgets from NextGen o Excel
Python I/O.
Microsoft Excel 101.
Navya Thum January 30, 2013 Day 5: MICROSOFT EXCEL Navya Thum January 30, 2013.
ICT Spreadsheets Lesson 1: Introduction to Spreadsheets
Lesson 15 Working with Tables
Lesson 19 Organizing and Enhancing Worksheets
Spreadsheets and Data Management
PYTHON PANDAS FUNCTION APPLICATIONS
CSE 231 Lab 15.
Assignment resource Working with Excel Tables, PivotTables, and Pivot Charts Fairhurst pp The commands on these slides work with the Week 2 Excel.
DATAFRAME.
Presentation transcript:

Pandas: Python Programming for Spreadsheets Pamela Wu Sept. 17 th 2015

Getting Started Excel is easy to use, but scientists need more powerful tools GraphPad can only handle so many rows before crashing Today we'll learn how to  Quickly get stats on all of your samples  Merge data from multiple rows (i.e. transcripts to a gene)  Filter data by various criteria  Merge data from multiple sheets (i.e. UCSC annotation to HUGO) All of this comes from a module called “pandas”, which is included in Anaconda, so it should already be installed on your machine import pandas as pd import numpy as np

Pandas Objects Like lists, dictionaries, etc., Pandas has two objects:  Series: like a column in a spreadsheet  DataFrame: like a spreadsheet – a dictionary of Series objects Let's make some columns and spreadsheets s = pd.Series([1,3,5,np.nan,6,8]) # What is np.nan? data = [['ABC', -3.5, 0.01], ['ABC', -2.3, 0.12], ['DEF', 1.8, 0.03], ['DEF', 3.7, 0.01], ['GHI', 0.04, 0.43], ['GHI', -0.1, 0.67]] df = pd.DataFrame(data, columns=['gene', 'log2FC', 'pval']) Now type s and df into your terminal and see what it outputs

Pandas Objects

Viewing Data Try the following: df.head() df.tail() df.tail(2) df['log2FC'] df.columns df.index df.values You should see, in order: the first 5 lines, the last 5 lines, the last 2 lines, only the column 'log2FC', the columns, the indices, and the data Unlike other Python data objects, if you print a Pandas object to the terminal, it won't flood your screen because it was designed to be readable What you'll find in the following sections is that Pandas objects have a logic that is quite different from regular Python For example, operations happen on entire columns and rows The new Pandas rules exist to make your life easier, but it means you have to hold two sets of rules in your head

Basic Operations We'll go back to our df in a moment, but first, create this spreadsheet: nums = [[1, 2], [4, 5], [7, 8], [10, 11]] numdf = pd.DataFrame(nums, columns=['c1', 'c2']) Add a column: numdf['c3'] = [3, 6, 9, 12] Multiply all elements of a column (give just the name of the column): numdf['c1'] = numdf['c1']*2 Divide all elements of multiple columns (give the DF a list of columns): numdf[['c2', 'c3']] = numdf[['c2', 'c3']]/2

Basic Metrics Your DataFrame should look like this: Now try the following: numdf.describe() save_stats = numdf.describe() What if you want to calculate those yourself? numdf.max(axis=0) # across all rows: the default numdf.max(axis=1) # across all columns Now try the above for numdf.min(), numdf.mean(), numdf.std(), numdf.median(), and numdf.sum() Use what we learned to normalize all columns: normdf = (numdf - numdf.mean())/numdf.std()

Indexing and Iterating Remember indexing? How does it work with DFs? numdf.ix[1, 'c2'] numdf.ix[1, ['c1', 'c2']] numdf.ix[1] numdf.ix['c2'] # error Exercise: get me 14 from numdf Exercise: get me the column c2 for real. Hint: on another slide numdf.ix[1, 'c2'] = 5.0 numdf['c2'][1] = 5.0 # How are they different?

Filtering Data Let's go back to our original DF, df We only want to see the p-values that passed df['pval'] < 0.05 # this is a boolean Series df[df['pval'] < 0.05] # this is called boolean indexing df['gene'].isin(['ABC', 'GHI']) df[df['gene'].isin(['ABC', 'GHI'])] df[(df['pval'] < 0.05) & (df['gene'].isin(['ABC', 'GHI']))] How do you save these to variables? Boolean indexing can also do assignments df.ix[df['pval'] > 0.05, 'log2FC'] = np.nan

Concat and Merge If two DFs share the same columns, they can be concatenated: pd.concat([numdf, numdf]) More interestingly, two DFs can be joined by column values: annodf = pd.DataFrame([['DEF', 'Leppard'], ['GHI', 'Ghost Hunters International'], ['ABC', 'Always Be Closing']], columns=['acronym', 'association']) resdf = df.merge(annodf, left_on='gene', right_on='acronym') Write resdf in your console. You've just annotated your dataset!

Sort and Groupby Let's sort by p-value: df = df.sort(columns='pval') # apparently deprecated, but the documentation on sort_values isn't up yet Exercise: sort it by gene name In our data example, we have multiple rows with the same name. How do we easily combine their values? smdf = df.groupby('gene').mean()

Input and Output How do you get data into and out of Pandas as spreadsheets? Pandas can now work with XLS or XLSX files (they didn't use to)  A tab looks like this: '\t', but on your file it looks like a big space  Can also be comma-delimited, but bioinformatics people always like to use tabs because there are sometimes commas in our data  Check which delimiter your file is using before import! Import to Pandas: df = pd.read_csv('data.csv', sep='\t', header=0) # or header=None if there is no header For Excel files, it's the same thing but with read_excel Export to text file: df.to_csv('data.csv', sep='\t', header=True, index=False) # the values of header and index depend on if you want to print the column and/or row names

Assignment Time! You should have two data files, data.csv and anno.csv I want you to:  Import data.csv and anno.csv  Annotate the data with gene names and loci  Assign all log fold changes with unpassed q-values to NaN  Obtain the mean fold change for each unique gene  Make one DataFrame with positive log fold change and q-value < 0.05, sort it by gene name, and save it to file Each step should only be one or two lines, and the trick is looking up and executing the correct commands