MARC: Developing Bioinformatics Programs July 2009 Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez Reference: How to Think Like a Computer Scientist:

Slides:



Advertisements
Similar presentations
Programming Paradigms and languages
Advertisements

CS 100: Roadmap to Computing Fall 2014 Lecture 0.
Chapter 2: Using Objects Part 1. To learn about variables To understand the concepts of classes and objects To be able to call methods To learn about.
Fundamentals of Python: From First Programs Through Data Structures Chapter 2 Software Development, Data Types, and Expressions.
 2005 Pearson Education, Inc. All rights reserved Introduction.
©The McGraw-Hill Companies, Inc. Permission required for reproduction or display. slide 1 CS 125 Introduction to Computers and Object- Oriented Programming.
Python November 14, Unit 7. Python Hello world, in class.
 2002 Prentice Hall. All rights reserved. 1 Chapter 2 – Introduction to Python Programming Outline 2.1 Introduction 2.2 First Program in Python: Printing.
Programming Logic and Design, Introductory, Fourth Edition1 Understanding Computer Components and Operations (continued) A program must be free of syntax.
Chapter 2: Algorithm Discovery and Design
Computer Science 101 Introduction to Programming.
COMPUTER SCIENCE I C++ INTRODUCTION
MARC: Developing Bioinformatics Programs July 2009 Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez Reference: How to Think Like a Computer Scientist:
 The following material is the result of a curriculum development effort to provide a set of courses to support bioinformatics efforts involving students.
2/4/2003CSCI 150: Introduction to Computer Science1 Introduction to Computer Science CSCI 150 Section 002 Session 6 Dr. Richard J. Bonneau IONA Technologies.
Introduction to Python
General Computer Science for Engineers CISC 106 Lecture 02 Dr. John Cavazos Computer and Information Sciences 09/03/2010.
Fundamentals of Python: First Programs
MARC: Developing Bioinformatics Programs July 2009 Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez Reference: How to Think Like a Computer Scientist:
Copyright © 2012 Pearson Education, Inc. Publishing as Pearson Addison-Wesley C H A P T E R 2 Input, Processing, and Output.
Introduction to Programming Peggy Batchelor.
Computer Science 101 Introduction to Programming.
1 CSC 221: Introduction to Programming Fall 2012 Functions & Modules  standard modules: math, random  Python documentation, help  user-defined functions,
An Introduction to Programming and Algorithms. Course Objectives A basic understanding of engineering problem solving process. A basic understanding of.
Lecture 2 Object Oriented Programming Basics of Java Language MBY.
Copyright © 2009 Pearson Education, Inc. Publishing as Pearson Addison-Wesley STARTING OUT WITH Python Python First Edition by Tony Gaddis Chapter 2 Input,
 The following material is the result of a curriculum development effort to provide a set of courses to support bioinformatics efforts involving students.
Input, Output, and Processing
Python From the book “Think Python”
Computer Science 101 Introduction to Programming.
MARC: Developing Bioinformatics Programs July 2009 Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez Reference: How to Think Like a Computer Scientist:
Getting Started with MATLAB 1. Fundamentals of MATLAB 2. Different Windows of MATLAB 1.
Problem Solving Techniques. Compiler n Is a computer program whose purpose is to take a description of a desired program coded in a programming language.
Object-Oriented Program Development Using Java: A Class-Centered Approach, Enhanced Edition.
Copyright © 2012 Pearson Education, Inc. Publishing as Pearson Addison-Wesley C H A P T E R 2 Input, Processing, and Output.
Variables, Expressions and Statements
 The following material is the result of a curriculum development effort to provide a set of courses to support bioinformatics efforts involving students.
Introducing Python CS 4320, SPRING Lexical Structure Two aspects of Python syntax may be challenging to Java programmers Indenting ◦Indenting is.
 The following material is the result of a curriculum development effort to provide a set of courses to support bioinformatics efforts involving students.
Data Structures and Algorithms Dr. Tehseen Zia Assistant Professor Dept. Computer Science and IT University of Sargodha Lecture 1.
8-1 Compilers Compiler A program that translates a high-level language program into machine code High-level languages provide a richer set of instructions.
 2008 Pearson Education, Inc. All rights reserved JavaScript: Introduction to Scripting.
Introduction to Python Dr. José M. Reyes Álamo. 2 Three Rules of Programming Rule 1: Think before you program Rule 2: A program is a human-readable set.
The Hashemite University Computer Engineering Department
BING 6004: Intro to Computational BioEngineering Spring 2016 Lecture 1: Using Python Expressions and Variables Bienvenido Vélez UPR Mayaguez Reference:
Bienvenido Vélez UPR Mayaguez Reference: How to Think Like a Computer Scientist: Learning with Python 1 Introduction to Python programming for Bioinformatics.
Alex Ropelewski Pittsburgh Supercomputing Center National Resource for Biomedical Supercomputing Bienvenido Vélez
MARC: Developing Bioinformatics Programs Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez Reference: How to Think Like a Computer Scientist: Learning.
Lecture 11 Introduction to R and Accessing USGS Data from Web Services Jeffery S. Horsburgh Hydroinformatics Fall 2013 This work was funded by National.
MARC: Developing Bioinformatics Programs June 2012 Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez Reference: How to Think Like a Computer Scientist:
Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez Using Molecular Biology to Teach Computer Science High-level Programming with Python Finding Patterns.
High-level Programming with Python Expressions and Variables Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez Reference: How to Think Like a Computer.
MARC: Developing Bioinformatics Programs Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez Reference: How to Think Like a Computer Scientist: Learning.
1 A Short Introduction to Analyzing Biological Data Using Relational Databases Part II: Creating a Relational Database to Model Biological Data Alex Ropelewski.
MARC: Developing Bioinformatics Programs June 2012 Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez Reference: How to Think Like a Computer Scientist:
Fundamentals of Programming I Overview of Programming
Topics Designing a Program Input, Processing, and Output
Introduction to Python
Introduction to Python programming for biologists
Revision Lecture
The Selection Structure
Variables, Expressions, and IO
Functions CIS 40 – Introduction to Programming in Python
CS 100: Roadmap to Computing
Topics Designing a Program Input, Processing, and Output
Topics Designing a Program Input, Processing, and Output
General Computer Science for Engineers CISC 106 Lecture 03
CS 100: Roadmap to Computing
Presentation transcript:

MARC: Developing Bioinformatics Programs July 2009 Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez Reference: How to Think Like a Computer Scientist: Learning with Python Essential Computing for Bioinformatics First Steps in Computing: Course Overview 1

The following material is the result of a curriculum development effort to provide a set of courses to support bioinformatics efforts involving students from the biological sciences, computer science, and mathematics departments. They have been developed as a part of the NIH funded project “Assisting Bioinformatics Efforts at Minority Schools” (2T36 GM008789). The people involved with the curriculum development effort include: Dr. Hugh B. Nicholas, Dr. Troy Wymore, Mr. Alexander Ropelewski and Dr. David Deerfield II, National Resource for Biomedical Supercomputing, Pittsburgh Supercomputing Center, Carnegie Mellon University. Dr. Ricardo González Méndez, University of Puerto Rico Medical Sciences Campus. Dr. Alade Tokuta, North Carolina Central University. Dr. Jaime Seguel and Dr. Bienvenido Vélez, University of Puerto Rico at Mayagüez. Dr. Satish Bhalla, Johnson C. Smith University. Unless otherwise specified, all the information contained within is Copyrighted © by Carnegie Mellon University. Permission is granted for use, modify, and reproduce these materials for teaching purposes. Most recent versions of these presentations can be found at

 Course Overview  Introduction to Programming (Today)  Why learn to Program?  The Python Interpreter  Software Development Process  Numbers, Strings, Operators, Expressions  Control structures, decisions, iteration and recursion Outline 3

Course Overview Essential Computing for Bioinformatics – Course Description – Educational Objectives – Major Course Modules – Module Descriptions 4 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center

Essential Computing for Bioinformatics Course Description This course provides a broad introductory discussion of essential computer science concepts that have wide applicability in the natural sciences. Particular emphasis will be placed on applications to Bioinformatics. The concepts will be motivated by practical problems arising from the use of bioinformatics research tools such as genetic sequence databases. Concepts will be discussed in a weekly lecture and will be practiced via simple programming exercises using Python, an easy to learn and widely available scripting language. 5 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center

Educational Objectives Awareness of the mathematical models of computation and their fundamental limits Basic understanding of the inner workings of a computer system Ability to extract useful information from various bioinformatics data sources Ability to design computer programs in a modern high level language to analyze bioinformatics data. Experience with commonly used software development environments and operating systems Experience applying computer programming to solve bioinformatics problems 6 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center

Major Course Modules ModuleLecture MARC Lecture First Steps in Computing: Course Overview1 Using Bioinformatics Data Sources2 Mathematical Computing Models35 High-level Programming (Python): Flow Control62 High-level Programming (Python): Container Objects73 High-level Programming (Python): Files84 High-level Programming (Python): BioPython9 7 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center

Main Advantages of Python Familiar to C/C++/C#/Java Programmers Very High Level Interpreted and Multi-platform Dynamic Object-Oriented Modular Strong string manipulation Lots of libraries available Runs everywhere Free and Open Source Track record in bioInformatics (BioPython) 8 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center

Using BioInformatics Data Sources Searching Nucleotide Sequence Databases Searching Amino Acid Sequence Database Performing BLAST Searches Using Specialized Data Sources IDEA: How can we expedite data collection and analysis?... writing programs to automate parts of the process. 9 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center Reference: Bioinformatics for Dummies (Ch 1-4) Goal:Basic Experience

Mathematical Computing Goal:General Awareness What is Computing? Mathematical Models of Computing – Finite Automata – Turing Machines The Limits of Computation Church/Turing Thesis What is an Algorithm? Big O Notation 10 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center

High-Level Programming (Python) Downloading and Installing the Interpreter Values, Expressions and Naming Designing your own Functional Building Blocks Controlling the Flow of your Program String Manipulation (Sequence Processing) Container Data Structures File Manipulation 11 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center Goal:Knowledge and Experience

CS Fundamentals will be Interleaved Throughout the Course Information Representation and Encoding Computer Architecture Programming Language Translation Methods The Software Development Cycle Fundamental Principles of Software Engineering Basic Data Structures for Bioinformatics Design and Analysis of Bioinformatics Algorithms 12 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center

US Department of Labor, Bureau of Labor Statistics Engineers, Life and Physical Scientists and Related Occupations. Occupational Outlook Handbook, Edition. Biological scientists “…usually study allied disciplines such as mathematics, physics, engineering and computer science. Computer courses are beneficial for modeling and simulating biological processes, operating some laboratory equipment and performing research in the emerging field of bioinformatics” Why Learn to Program? 13

Why Learn to Program? 14  Need to compare output from a new run with an old run. (new hits in database search)  Need to compare results of runs using different parameters. (Pam120 vs Blosum62)  Need to compare results of different programs (Fasta, Blast, Smith- Waterman)  Need to modify existing scripts to work with new/updated programs and web sites.  Need to use an existing program's output as input to a different program, not designed for that program:  Database search -> Multiple Alignment  Multiple Alignment -> Pattern search  Need to Organize your data

Bioinformatics Assembly Analyst Responsibilities:  Assembling genome sequence data using a variety of tools and parameters and performing the experiments needed to evaluate sequencing strategies  Using existing software and databases to analyze genomic data and correlating assemblies and sequences with a variety of genetic and physical maps and other biological information  Identifying problems and serving as point of contact for various groups to propose and implement solutions  Proposing and implementing upgrades to existing tools and processes to enhance analysis techniques and quality of results  Developing and implementing scripts to manipulate, format, parse, analyze, and display genome sequence data; and developing new strategies for analysis and presentation of results. Requirements:  A bachelor's degree in biology or related field  At least three years of experience in DNA sequencing and sequence analysis.  Must possess solid knowledge of sequencing software and public sequencing databases.  Knowledge of bioinformatics tools helpful. Why Learn to Program? 15

 C/C++  Language of choice for most large development projects  FORTRAN  Excellent language for math, not used much anymore  Java  Popular modern object oriented language  PERL  Excellent language for text-processing (bioperl.org)  PHP  Popular language used to program web interfaces  Python  Language easy to pick up and learn (biopython.org)  SQL  Language used to communicate with a relational database Good Languages to Learn In no particular order…. 16

 “Object Oriented” is simply a convenient way to organize your data and the functions that operate on that data  A biological example of organizing data:  Human.CytochromeC.protein.sequence  Human.CytochromeC.RNA.sequence  Human.CytochromeC.DNA.sequence  Some things only make sense in the context that they are used:  Human,CytochromeC.DNA.intron  Human.CytochromeC.DNA.exon  Human.CytochromeC.DNA.sequence  Human.CytochromeC.protein.sequence  Human.CytochromeC.protein.intron  Human.CytochromeC.protein.exon Python is Object Oriented 17 Meaningful Meaningless

 Go to  Go to DOWNLOAD section  Click on Python Windows installerPython Windows installer  Save ~10MB file into your hard drive  Double click on file to install  Follow instructions  Start -> All Programs -> Python 2.6 -> Idle Downloading and Installing Python 18

Idle: The Python Shell 19

Python as a Number Cruncher 20 >>> print >>> print 6 * 7 42 >>> print 6 * >>> print * 7 44 >>> print >>> print 6 - ( 2 - 3) 7 >>> print 1 / 3 0 >>> / and * higher precedence than + and - integer division truncates fractional part Operators are left associative Parenthesis can override precedence

Floating Point Expressions 21 >>> print 1.0 / >>> print >>> print 3.3 * >>> print 3.3e23 * 2 6.6e+023 >>> print float(1) / >>> Mixed operations converted to float Scientific notation allowed 12 decimal digits default precision Explicit conversion

String Expressions 22 >>> print "aaa" aaa >>> print "aaa" + "ccc" aaaccc >>> len("aaa") 3 >>> len ("aaa" + "ccc") 6 >>> print "aaa" * 4 aaaaaaaaaaaa >>> "aaa" 'aaa' >>> "c" in "atc" True >>> "g" in "atc" False >>> + concatenates string len is a function that returns the length of its argument string any expression can be an argument * replicates strings a value is an expression that yields itself in operator finds a string inside another And returns a boolean result

MEANINGFUL Values Can Have (MEANINGFUL) Names 23 >>> numAminoAcids = 20 >>> eValue = 6.022e23 >>> prompt = "Enter a sequence ->" >>> print numAminoAcids 20 >>> print eValue 6.022e+023 >>> print prompt Enter a sequence -> >>> print "prompt" prompt >>> >>> prompt = 5 >>> print prompt 5 >>> = binds a name to a value prints the value bound to a name = can change the value associated with a name even to a different type use Camel case for compound names

Values Have Types 24 >>> type("hello") >>> type(3) >>> type(3.0) >>> type(eValue) >>> type (prompt) >>> type(numAminoAcids) >>> type is another function the type of a name is the type of the value bound to it the “type” is itself a value

In Bioinformatics Words … 25 >>> codon=“atg” >>> codon * 3 ’atgatgatg’ >>> seq1 =“agcgccttgaattcggcaccaggcaaatctcaaggagaagttccggggagaaggtgaaga” >>> seq2 = “cggggagtggggagttgagtcgcaagatgagcgagcggatgtccactatgagcgataata” >>> seq = seq1 + seq2 >>> seq 'agcgccttgaattcggcaccaggcaaatctcaaggagaagttccggggagaaggtgaagacggggagtggggagttgagtc gcaagatgagcgagcggatgtccactatgagcgataata‘ >>> seq[1] 'g' >>> seq[0] 'a' >>> “a” in seq True >>> len(seq1) 60 >>> len(seq) 120 First nucleotide starts at 0

More Bioinformatics Extracting Information from Sequences 26 >>> seq[0] + seq[1] + seq[2] ’agc’ >>> seq[0:3] ’agc’ >>> seq[3:6] ’gcc’ >>> seq.count(’a’) 35 >>> seq.count(’c’) 21 >>> seq.count(’g’) 44 >>> seq.count(’t’) 12 >>> long = len(seq) >>> pctA = seq.count(’a’) >>> float(pctA) / long * Find the first codon from the sequence get ’slices’ from strings: How many of each base does this sequence contain? Count the percentage of each base on the sequence.

Additional Note About Python Strings 27 >>> seq=“ACGT” >>> print seq ACGT >>> seq=“TATATA” >>> print seq TATATA >>> seq[0] = seq[1] Traceback (most recent call last): File " ", line 1, in seq[0]=seq[1] TypeError: 'str' object does not support item assignment seq = seq[1] + seq[1:] Can replace one whole string with another whole string Can NOT simply replace a sequence character with another sequence character, but… Can replace a whole string using substrings

 How?  Precede comment with # sign  Interpreter ignores rest of the line  Why?  Make code more readable by others AND yourself?  When?  When code by itself is not evident # compute the percentage of the hour that has elapsed percentage = (minute * 100) / 60  Need to say something but Python cannot express it, such as documenting code changes percentage = (minute * 100) / 60 # FIX: handle float division Commenting Your Code! 28 Please do not over do it X = 5 # Assign 5 to x

 Problem Identification  What is the problem that we are solving  Algorithm Development  How can we solve the problem in a step-by-step manner?  Coding  Place algorithm into a computer language  Testing/Debugging  Make sure the code works on data that you already know the answer to  Run Program  Use program with data that you do not already know the answer to. Software Development Cycle 29

 First, lets learn to SAVE our programs in a file:  From Python Shell: File -> New Window  From New Window: File->Save  Then, To run the program in the new window:  From New Window: Run->Run Module Lets Try It With Some Examples! 30

 What is the percentage composition of a nucleic acid sequence  DNA sequences have four residues, A, C, G, and T  Percentage composition means the percentage of the residues that make up of the sequence Problem Identification 31

 Print the sequence  Count characters to determine how many A, C, G and T’s make up the sequence  Divide the individual counts by the length of the sequence and take this result and multiply it by 100 to get the percentage  Print the results Algorithm Development 32

Coding 33 seq="ACTGTCGTAT" print seq Acount= seq.count('A') Ccount= seq.count('C') Gcount= seq.count('G') Tcount= seq.count('T') Total = len(seq) APct = int((Acount/Total) * 100) print 'A percent = %d ' % APct CPct = int((Ccount/Total) * 100) print 'C percent = %d ' % CPct GPct = int((Gcount/Total) * 100) print 'G percent = %d ' % GPct TPct = int((Tcount/Total) * 100) print 'T percent = %d ' % TPct

 First SAVE the program:  From New Window: File->Save Let’s Test The Program 34

 Six Common Python Coding Errors:  Delimiter mismatch: check for matches and proper use.  Single and double quotes: ‘ ’ “ ”  Parenthesis and brackets: { } [ ] ( )  Spelling errors:  Among keywords  Among variables  Among function names  Improper indentation  Import statement missing  Function calling parameters are mismatched  Math errors:  Automatic type conversion: Integer vs floating point  Incorrect order of operations – always use parenthesis. Testing / Debugging 35

seq='ACTGTCGTAT" print seq; Acount= seq.count('A') Ccount= seq.count('C') Gcount= seq.count('G') Tcount= seq,count('T') Total = Len(seq) APct = int((Acount/Total) * 100) print 'A percent = %d ' % APct CPct = int((Ccount/Total) * 100) print 'C percent = %d ' % Cpct GPct = int(Gcount/Total) * 100) primt 'G percent = %d ' % GPct TPct = int((Tcount/Total) * 100) print 'T percent = %d ' % TPct Testing / Debugging 36

 First, re-SAVE the program:  File->Save  Then RUN the program:  Run->Run Module  Then LOOK at the Python Shell Window:  If successful, the results are displayed  If unsuccessful, error messages will be displayed Let’s Test The Program 37

 The program says that the composition is:  0%A, 0%C, 0%G, 0%T  The real answer should be:  20%A, 20%C, 20%G, 40%T  The problem is in the coding step:  Integer math is causing undesired rounding! Testing/Debugging 38

seq="ACTGTCGTAT" print seq Acount= seq.count('A') Ccount= seq.count('C') Gcount= seq.count('G') Tcount= seq.count('T') Total = float(len(seq)) APct = int((Acount/Total) * 100) print 'A percent = %d ' % APct CPct = int((Ccount/Total) * 100) print 'C percent = %d ' % CPct GPct = int((Gcount/Total) * 100) print 'G percent = %d ' % GPct TPct = int((Tcount/Total) * 100) print 'T percent = %d ' % TPct Testing/Debugging 39

 If the first line was changed to:  seq = “ACUGCUGUAU”  Would we get the desired result? Let’s change the nucleic acid sequence from DNA to RNA… 40

 The program says that the composition is:  20%A, 20%C, 20%G, 0%T  The real answer should be:  20%A, 20%C, 20%G, 40%U  The problem is that we have not defined the problem correctly!  We designed our code assuming input would be DNA sequences  We fed the program RNA sequences Testing/Debugging 41

 What is the percentage composition of a nucleic acid sequence  DNA sequences have four residues, A, C, G, and T  In RNA sequences “U” is used in place of “T”  Percentage composition means the percentage of the residues that make up of the sequence Problem Identification 42

 Print the sequence  Count characters to determine how many A, C, G, T and U’s make up the sequence  Divide the individual A,C,G counts and the sum of T’s and U’s by the length of the sequence and take this result and multiply it by 100 to get the percentage  Print the results Algorithm Development 43

Testing/Debugging 44 seq="ACUGUCGUAU" print seq Acount= seq.count('A') Ccount= seq.count('C') Gcount= seq.count('G') TUcount= seq.count('T') + seq.count(‘U') Total = float(len(seq)) APct = int((Acount/Total) * 100) print 'A percent = %d ' % APct CPct = int((Ccount/Total) * 100) print 'C percent = %d ' % CPct GPct = int((Gcount/Total) * 100) print 'G percent = %d ' % GPct TUPct = int((TUcount/Total) * 100) print 'T/U percent = %d ' % TUPct

 Extend your code to handle the nucleic acid ambiguous sequence characters “N” and “X”  Extend your code to handle protein sequences What’s Next 45