Computational Biology, Part A: More on Sequence Operations. Robert F. Murphy. Copyright © 1997, 2001. All rights reserved.

Representation and Matching of Sequences

Representation of Sequences
- characters
    - simplest
    - easy to read, edit, etc.
- bit-coding
    - more compact, both on disk and in memory
    - comparisons more efficient

Matching one character - with character variables
- Assume two character variables "C" and "Q"
    - test for exact match
        if (Q == C) {...}
    - need complicated statements to handle wildcards
        if (Q == C || (Q == 'A' && (C == 'A' || C == 'R' || C == 'W' || C == 'M' || C == 'D' || C == 'H' || C == 'V' || C == 'N')) || (Q == 'C' && ...)) {...}
    - can build into a function
        if (TestBase(Q, C)) {...}

Efficient method to match one character
- Convert each char to an int in the range 0-25
- Create a 26x26 matrix showing which character matches which
- Look up the two characters to be compared to find the value
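The lookup-table idea can be sketched in C as follows. This is only an illustration: the names iub_bases, build_match_table and test_base are hypothetical (they do not come from the slides), the code assumes upper-case ASCII input, and "match" is taken to mean that the two IUB codes share at least one base.

```c
#include <assert.h>
#include <string.h>

/* Bases that each IUB code stands for; "" marks a letter that is not a code. */
static const char *iub_bases(char code)
{
    switch (code) {
    case 'A': return "A";
    case 'C': return "C";
    case 'G': return "G";
    case 'T': return "T";
    case 'R': return "AG";
    case 'Y': return "CT";
    case 'S': return "CG";
    case 'W': return "AT";
    case 'M': return "AC";
    case 'K': return "GT";
    case 'B': return "CGT";
    case 'D': return "AGT";
    case 'H': return "ACT";
    case 'V': return "ACG";
    case 'N': return "ACGT";
    default:  return "";
    }
}

/* match_table[q][c] is 1 when letters 'A'+q and 'A'+c share a base. */
static unsigned char match_table[26][26];

/* Fill the 26x26 table once, before any comparisons are made. */
void build_match_table(void)
{
    for (int q = 0; q < 26; q++)
        for (int c = 0; c < 26; c++)
            match_table[q][c] =
                strpbrk(iub_bases((char)('A' + q)),
                        iub_bases((char)('A' + c))) != NULL;
}

/* Compare two upper-case letters with a single table lookup. */
int test_base(char q, char c)
{
    assert(q >= 'A' && q <= 'Z' && c >= 'A' && c <= 'Z');
    return match_table[q - 'A'][c - 'A'];
}
```

After one call to build_match_table(), each comparison such as test_base('A', 'R') costs one array access instead of a long chain of character tests.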

Bit-coding
- let the following binary values represent each base
    - A = 0001
    - C = 0010
    - G = 0100
    - T = 1000
- then
    - G = 4
    - A or C = 0011 = 3
    - A, G or T = 1101 = 13
    - etc.

Matching one character - with bit coding
- Assume two integer variables "I" and "J"
    - test for exact match
        if (I == J) {...}
    - test for match with wildcards (no lookup!)
        if (I & J) {...}
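A minimal C sketch of the bit-coded comparison, using the bit values given on the bit-coding slide (A = 1, C = 2, G = 4, T = 8). The function names iub_to_mask and masks_match are assumptions made for this example.

```c
#include <assert.h>

enum { BIT_A = 1, BIT_C = 2, BIT_G = 4, BIT_T = 8 };

/* Convert one upper-case IUB code to its 4-bit mask; 0 means "not a code". */
unsigned iub_to_mask(char code)
{
    switch (code) {
    case 'A': return BIT_A;
    case 'C': return BIT_C;
    case 'G': return BIT_G;
    case 'T': return BIT_T;
    case 'R': return BIT_A | BIT_G;          /* A or G */
    case 'Y': return BIT_C | BIT_T;          /* C or T */
    case 'S': return BIT_C | BIT_G;          /* C or G */
    case 'W': return BIT_A | BIT_T;          /* A or T */
    case 'M': return BIT_A | BIT_C;          /* A or C */
    case 'K': return BIT_G | BIT_T;          /* G or T */
    case 'B': return BIT_C | BIT_G | BIT_T;  /* not A */
    case 'D': return BIT_A | BIT_G | BIT_T;  /* not C */
    case 'H': return BIT_A | BIT_C | BIT_T;  /* not G */
    case 'V': return BIT_A | BIT_C | BIT_G;  /* not T */
    case 'N': return BIT_A | BIT_C | BIT_G | BIT_T;
    default:  return 0;
    }
}

/* Two masks match with wildcards when they have any bit in common. */
int masks_match(unsigned i, unsigned j)
{
    assert(i <= 15 && j <= 15);
    return (i & j) != 0;
}
```

With this encoding a sequence would be converted to masks once, after which every comparison is a single AND.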

Matching more than one character - pattern matching
- Example: recognition site for a restriction enzyme
    - Input sequence string into variable Seq
    - Define Site as string of characters or masks
        - EcoRI recognizes GAATTC
        - AccI recognizes GTMKAC
    - Create function to search a sequence for that site
        - Find(Site,LenSite,Seq,LenSeq)
        - for each position in Seq, see if Site matches starting there
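One way the search might look in C is sketched below. It follows the argument order of the Find(Site,LenSite,Seq,LenSeq) call on the slide, but the name find_site, the return convention (index of the first match, or -1) and the helper iub_mask are assumptions for this example. Characters are compared through the bit masks introduced earlier.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* 4-bit mask for an upper-case IUB code (A=1, C=2, G=4, T=8); 0 if unknown. */
static unsigned iub_mask(char code)
{
    static const char codes[] = "ACGTRYSWMKBDHVN";
    static const unsigned masks[] =
        { 1, 2, 4, 8, 5, 10, 6, 9, 3, 12, 14, 13, 11, 7, 15 };
    const char *p = (code != '\0') ? strchr(codes, code) : NULL;
    return p ? masks[p - codes] : 0;
}

/* Return the index of the first position in seq where site matches, or -1. */
long find_site(const char *site, size_t len_site,
               const char *seq, size_t len_seq)
{
    assert(site != NULL && seq != NULL);
    if (len_site == 0 || len_site > len_seq)
        return -1;
    /* For each position in seq, see if site matches starting there. */
    for (size_t start = 0; start + len_site <= len_seq; start++) {
        size_t k = 0;
        while (k < len_site &&
               (iub_mask(site[k]) & iub_mask(seq[start + k])))
            k++;
        if (k == len_site)
            return (long)start;
    }
    return -1;
}
```

For example, find_site("GTMKAC", 6, "AAGTAGACTT", 10) finds the AccI site GTAGAC starting at index 2, because M matches A and K matches G.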

Automating Probability Calculations using Nucleotide Frequencies

Automating the Calculation
- Goal: Calculate the probability of occurrence of a sequence that may include ambiguous bases
- What we need is a way to consider all possible allowed nucleotides at each position in all allowed combinations
- When using dinucleotide probabilities, we have to be careful about how the probabilities are combined

Illustration
- Question: What is the probability of observing sequence feature ART (A followed by a purine {either A or G}, followed by a T) using dinucleotide probabilities?

Which is right?
- $p_{ART} = p_A (p^*_{AA} + p^*_{AG})(p^*_{AT} + p^*_{GT})$ [eq.1]
- $p_{ART} = p_A (p^*_{AA} p^*_{AT} + p^*_{AG} p^*_{GT})$ [eq.2]

Expansions
- $p_{ART} = p_A (p^*_{AA} + p^*_{AG})(p^*_{AT} + p^*_{GT})$ [eq.1]
    - $p_{ART} = p_A p^*_{AA} p^*_{AT} + p_A p^*_{AA} p^*_{GT} + p_A p^*_{AG} p^*_{AT} + p_A p^*_{AG} p^*_{GT}$
- $p_{ART} = p_A (p^*_{AA} p^*_{AT} + p^*_{AG} p^*_{GT})$ [eq.2]
    - $p_{ART} = p_A p^*_{AA} p^*_{AT} + p_A p^*_{AG} p^*_{GT}$

Proof
- $p_{ART} = p_{AAT} + p_{AGT}$
- $p_{AAT} = p_A p^*_{AA} p^*_{AT}$
- $p_{AGT} = p_A p^*_{AG} p^*_{GT}$
- $p_{ART} = p_A p^*_{AA} p^*_{AT} + p_A p^*_{AG} p^*_{GT}$
- This matches equation 2 on the previous slide

Need further convincing?
- Imagine that $p^*_{AA} = 0$ and $p^*_{GT} = 0$ (but all other $p^*$ are non-zero)
- Then $p_{ART}$ should be zero since there is no way to create either AAT or AGT
- This is predicted by eq. 2 but not by eq. 1
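The zero-probability check above can be tried numerically. The sketch below evaluates both candidate formulas for ART. The table layout cond[x][y] (the starred dinucleotide term for base x followed by base y, indexed in the order A, C, G, T) and the function names are assumptions made for this example.

```c
#include <assert.h>

enum { IDX_A = 0, IDX_C = 1, IDX_G = 2, IDX_T = 3 };

/* eq. 1: add the alternatives at each step, then multiply the sums. */
double p_art_eq1(double p_a, double cond[4][4])
{
    assert(p_a >= 0.0 && p_a <= 1.0);
    return p_a * (cond[IDX_A][IDX_A] + cond[IDX_A][IDX_G])
               * (cond[IDX_A][IDX_T] + cond[IDX_G][IDX_T]);
}

/* eq. 2: multiply along each path (AAT and AGT), then add the paths. */
double p_art_eq2(double p_a, double cond[4][4])
{
    assert(p_a >= 0.0 && p_a <= 1.0);
    return p_a * (cond[IDX_A][IDX_A] * cond[IDX_A][IDX_T]
                + cond[IDX_A][IDX_G] * cond[IDX_G][IDX_T]);
}
```

With the AA and GT entries set to zero and the AG and AT entries non-zero, p_art_eq2 returns 0 while p_art_eq1 returns a positive value, because eq. 1 silently counts the impossible combination "AG followed by AT".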

More complicated probability illustration
- What is the probability of observing the sequence feature ARYT (A followed by a purine {either A or G}, followed by a pyrimidine {either C or T}, followed by a T)?
- Using equal mononucleotide frequencies
    - $p_A = p_C = p_G = p_T = 1/4$
    - $p_{ARYT} = 1/4 \cdot (1/4 + 1/4) \cdot (1/4 + 1/4) \cdot 1/4 = 1/64$

Illustration (continued)
- Using observed mononucleotide frequencies:
    - $p_{ARYT} = p_A (p_A + p_G)(p_C + p_T) p_T$
- Using dinucleotide frequencies:
    - $p_{ARYT} = p_A \left( p^*_{AA} (p^*_{AC} p^*_{CT} + p^*_{AT} p^*_{TT}) + p^*_{AG} (p^*_{GC} p^*_{CT} + p^*_{GT} p^*_{TT}) \right)$

Illustration (continued)
- Using dinucleotide frequencies, ARYT branches into four paths:
    A
      +A = AA
          +C = AAC    +T = AACT
          +T = AAT    +T = AATT
      +G = AG
          +C = AGC    +T = AGCT
          +T = AGT    +T = AGTT

Multiply then add
- We conclude that for such strings our rule should be "multiply dinucleotide probabilities along each allowed path and then add the results"
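Written out for ARYT, the rule gives one product per allowed path (AACT, AATT, AGCT, AGTT). The expansion below uses the same starred dinucleotide terms as the earlier slides and agrees term by term with the factored dinucleotide formula for $p_{ARYT}$.

```latex
\begin{align*}
p_{ARYT} &= p_{AACT} + p_{AATT} + p_{AGCT} + p_{AGTT} \\
         &= p_A\, p^*_{AA}\, p^*_{AC}\, p^*_{CT}
          + p_A\, p^*_{AA}\, p^*_{AT}\, p^*_{TT}
          + p_A\, p^*_{AG}\, p^*_{GC}\, p^*_{CT}
          + p_A\, p^*_{AG}\, p^*_{GT}\, p^*_{TT}
\end{align*}
```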

How do we program this?
- "for" loops?
- Nested "if" structure?
- Other?

Will this work?
    result = monoprob(seq(1));
    for i = 2 to n {
        temp = 0.;
        for j = 1 to 4 {    /* for each base */
            if (seq(i) & mask(j))
                temp = temp + diprob(seq(i-1), seq(i));
        }
        result = result * temp;
    }

No to "for"
- No, it generates "add then multiply" rather than "multiply then add"
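To see the failure concretely, here is one possible C reading of a position-by-position loop. It is an interpretation rather than a transcription of the pseudocode: at each position it adds the dinucleotide terms for every allowed (previous base, current base) pair and multiplies the running result by that sum. The names loop_probability, iub_mask, mono and cond are assumptions, with bases indexed in the order A, C, G, T.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* 4-bit mask for an upper-case IUB code (A=1, C=2, G=4, T=8); 0 if unknown. */
static unsigned iub_mask(char code)
{
    static const char codes[] = "ACGTRYSWMKBDHVN";
    static const unsigned masks[] =
        { 1, 2, 4, 8, 5, 10, 6, 9, 3, 12, 14, 13, 11, 7, 15 };
    const char *p = (code != '\0') ? strchr(codes, code) : NULL;
    return p ? masks[p - codes] : 0;
}

/* "Add then multiply": sum the alternatives per position, multiply the sums. */
double loop_probability(const char *site, double mono[4], double cond[4][4])
{
    assert(site != NULL && site[0] != '\0');
    double result = 0.0;
    for (int b = 0; b < 4; b++)          /* first position: mononucleotides */
        if (iub_mask(site[0]) & (1u << b))
            result += mono[b];
    for (size_t i = 1; site[i] != '\0'; i++) {
        double temp = 0.0;
        for (int a = 0; a < 4; a++)      /* allowed bases at position i-1 */
            for (int b = 0; b < 4; b++)  /* allowed bases at position i   */
                if ((iub_mask(site[i - 1]) & (1u << a)) &&
                    (iub_mask(site[i]) & (1u << b)))
                    temp += cond[a][b];
        result *= temp;
    }
    return result;
}
```

For ART this reproduces eq. 1: with the AA and GT terms set to zero it still returns a positive probability, even though neither AAT nor AGT can occur.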

A recursive solution
- Some programming languages allow recursion - the calling (invoking) of a function by itself
- This is useful here because we can branch when we encounter an ambiguous base and consider all alternatives separately
- Allows multiplication down the branches and then addition
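A minimal recursive sketch in C of the branching idea follows. It is not the course's own program: the names site_probability, extend_paths and iub_mask, and the tables mono[b] (mononucleotide probability) and cond[a][b] (starred dinucleotide term for a followed by b, indexed A, C, G, T), are assumptions for this example. Each call branches over the bases allowed at one position, multiplies by the matching dinucleotide term on the way down, and adds the branch totals on the way back up.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* 4-bit mask for an upper-case IUB code (A=1, C=2, G=4, T=8); 0 if unknown. */
static unsigned iub_mask(char code)
{
    static const char codes[] = "ACGTRYSWMKBDHVN";
    static const unsigned masks[] =
        { 1, 2, 4, 8, 5, 10, 6, 9, 3, 12, 14, 13, 11, 7, 15 };
    const char *p = (code != '\0') ? strchr(codes, code) : NULL;
    return p ? masks[p - codes] : 0;
}

/* Total over every way to complete site[pos..] after base prev was placed. */
static double extend_paths(const char *site, size_t pos, int prev,
                           double cond[4][4])
{
    if (site[pos] == '\0')
        return 1.0;                      /* end of the site: unwind */
    unsigned allowed = iub_mask(site[pos]);
    double total = 0.0;
    for (int b = 0; b < 4; b++)          /* branch on each allowed base */
        if (allowed & (1u << b))
            total += cond[prev][b] * extend_paths(site, pos + 1, b, cond);
    return total;
}

/* Probability of a site written in IUB codes; equals the sum over all
   allowed paths of the product of the terms along each path. */
double site_probability(const char *site, double mono[4], double cond[4][4])
{
    assert(site != NULL && site[0] != '\0');
    unsigned allowed = iub_mask(site[0]);
    double total = 0.0;
    for (int b = 0; b < 4; b++)
        if (allowed & (1u << b))
            total += mono[b] * extend_paths(site, 1, b, cond);
    return total;
}
```

With every entry of mono and cond set to 1/4, site_probability("ARYT", ...) returns 1/64, matching the equal-frequency result for ARYT, and with the AA and GT terms set to zero it returns 0 for ART as eq. 2 requires.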

Site Probability Calculation via Recursion
- Illustration: Make a function that prints out all possible sequences that can match a restriction site
- (Demo Program PossibleSites.c)
    - (found in /afs/andrew.cmu.edu/usr/murphy/CompBiol/DemoPrograms or Mellon: BioServer: Comp. Biol : Demo Programs: PossibleSites ƒ)

PossibleSites.c

/* PossibleSites.c
   Prints out all possible sites that can match a string of IUB codes
   January 22, R.F. Murphy */

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

void PossibleSites(char SiteString[], int Index);
short Test1(char SiteString[], int Index);
short Test2(char SiteString[], int Index);
short Test3(char SiteString[], int Index);
short Test4(char SiteString[], int Index);

int main(void)
{
    char Site[11];  /* up to 10 characters plus the terminating '\0' */

    for (;;) {
        printf("Enter a string of IUB codes (up to 10 characters): ");
        if (scanf("%10s", Site) != 1)
            break;  /* stop at end of input instead of looping forever */
        PossibleSites(Site, 0);
    }
    return 0;
}

void PossibleSites(char SiteString[], int Index)
{
    if ((size_t)Index >= strlen(SiteString)) {
        /* Unwind here: every position has been resolved, so print the site */
        printf("%s\n", SiteString);
        return;
    } else {
        /* Test for each type of ambiguous base */
        if (Test1(SiteString, Index)) ;
        else if (Test2(SiteString, Index)) ;
        else if (Test3(SiteString, Index)) ;
        else if (Test4(SiteString, Index)) ;
        else {
            printf("Illegal character (%c) encountered\n", SiteString[Index]);
            PossibleSites(SiteString, Index + 1);
        }
    }
    return;
}

/* Unambiguous base: nothing to substitute, move on to the next position */
short Test1(char SiteString[], int Index)
{
    /* printf("In Test1: Index %d, SiteString[Index] %c\n",Index,SiteString[Index]); */
    switch (SiteString[Index]) {
    case 'A':
    case 'C':
    case 'G':
    case 'T':
        break;
    default:
        return false;
    }
    PossibleSites(SiteString, Index + 1);
    return true;
}

/* Two-base codes: substitute each alternative in turn and recurse at the
   same Index, where Test1 then accepts the concrete base */
short Test2(char SiteString[], int Index)
{
    char Save;

    /* printf("In Test2: Index %d, SiteString[Index] %c\n",Index,SiteString[Index]); */
    Save = SiteString[Index];
    switch (SiteString[Index]) {
    case 'R':  /* A or G (purine) */
        SiteString[Index] = 'A'; PossibleSites(SiteString, Index);
        SiteString[Index] = 'G'; PossibleSites(SiteString, Index);
        break;
    case 'Y':  /* C or T (pyrimidine) */
        SiteString[Index] = 'C'; PossibleSites(SiteString, Index);
        SiteString[Index] = 'T'; PossibleSites(SiteString, Index);
        break;
    case 'S':  /* G or C */
        SiteString[Index] = 'G'; PossibleSites(SiteString, Index);
        SiteString[Index] = 'C'; PossibleSites(SiteString, Index);
        break;
    case 'W':  /* A or T */
        SiteString[Index] = 'A'; PossibleSites(SiteString, Index);
        SiteString[Index] = 'T'; PossibleSites(SiteString, Index);
        break;
    case 'M':  /* A or C */
        SiteString[Index] = 'A'; PossibleSites(SiteString, Index);
        SiteString[Index] = 'C'; PossibleSites(SiteString, Index);
        break;
    case 'K':  /* G or T */
        SiteString[Index] = 'G'; PossibleSites(SiteString, Index);
        SiteString[Index] = 'T'; PossibleSites(SiteString, Index);
        break;
    default:
        return false;
    }
    SiteString[Index] = Save;  /* restore the ambiguous code for the caller */
    return true;
}

short Test3(char SiteString[], int Index)
{
    char Save;

    /* printf("In Test3: Index %d, SiteString[Index] %c\n",Index,SiteString[Index]); */
    Save = SiteString[Index];
    switch (SiteString[Index]) {
    case 'B':  /* not A */
        SiteString[Index] = 'C'; PossibleSites(SiteString, Index);
        SiteString[Index] = 'G'; PossibleSites(SiteString, Index);
        SiteString[Index] = 'T'; PossibleSites(SiteString, Index);
        break;
    case 'D':  /* not C */
        SiteString[Index] = 'A'; PossibleSites(SiteString, Index);
        SiteString[Index] = 'G'; PossibleSites(SiteString, Index);
        SiteString[Index] = 'T'; PossibleSites(SiteString, Index);
        break;
    case 'H':  /* not G */
        SiteString[Index] = 'A'; PossibleSites(SiteString, Index);
        SiteString[Index] = 'C'; PossibleSites(SiteString, Index);
        SiteString[Index] = 'T'; PossibleSites(SiteString, Index);
        break;
    case 'V':  /* not T/U */
        SiteString[Index] = 'A'; PossibleSites(SiteString, Index);
        SiteString[Index] = 'C'; PossibleSites(SiteString, Index);
        SiteString[Index] = 'G'; PossibleSites(SiteString, Index);
        break;
    default:
        return false;
    }
    SiteString[Index] = Save;
    return true;
}

short Test4(char SiteString[], int Index)
{
    char Save;

    /* printf("In Test4: Index %d, SiteString[Index] %c\n",Index,SiteString[Index]); */
    Save = SiteString[Index];
    switch (SiteString[Index]) {
    case 'N':  /* A,C,G,T/U (iNdeterminate) */
    case 'X':  /* alternate for N */
        SiteString[Index] = 'A'; PossibleSites(SiteString, Index);
        SiteString[Index] = 'C'; PossibleSites(SiteString, Index);
        SiteString[Index] = 'G'; PossibleSites(SiteString, Index);
        SiteString[Index] = 'T'; PossibleSites(SiteString, Index);
        break;
    default:
        return false;
    }
    SiteString[Index] = Save;
    return true;
}