Published by Olivia Benson. Modified over 9 years ago.
1
Computational Biology, Part A
More on Sequence Operations
Robert F. Murphy
Copyright 1997, 2001. All rights reserved.
2
Representation and Matching of Sequences
3
Representation of Sequences
  characters
    simplest
    easy to read, edit, etc.
  bit-coding
    more compact, both on disk and in memory
    comparisons more efficient
4
Matching one character - with character variables
  Assume two character variables "C" and "Q"
  test for exact match
    If(Q=C) {...}
  need complicated statements to handle wildcards
    If(Q=C | (Q='A' & (C='A' | C='R' | C='W' | C='M' | C='D' | C='H' | C='V' | C='N') | Q='C' & ...)) {...}
  can build into a function
    If(TestBase(Q,C)) {...}
5
Efficient method to match one character
  Convert char to int 0-25
  Create 26x26 matrix showing which matches which
  Look up the two characters to be compared to find the value
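The lookup idea above can be sketched in C roughly as follows. This is only an illustration under stated assumptions: the names `iub_codes`, `build_match_matrix` and `test_base_lookup` are hypothetical (the slides name only TestBase), the character set is assumed to be ASCII upper-case letters, and two codes are taken to "match" when their sets of allowed bases overlap.

```c
#include <assert.h>
#include <string.h>

/* match[q][c] is 1 when codes q and c share at least one base. */
static unsigned char match[26][26];

/* Hypothetical table: first character is the IUB code,
   the remaining characters are the bases it allows. */
static const char *const iub_codes[] = {
    "AA", "CC", "GG", "TT",
    "RAG", "YCT", "SCG", "WAT", "MAC", "KGT",
    "BCGT", "DAGT", "HACT", "VACG", "NACGT"
};

/* Fill the 26x26 matrix once, before any lookups. */
void build_match_matrix(void)
{
    size_t n = sizeof iub_codes / sizeof iub_codes[0];

    memset(match, 0, sizeof match);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            /* strpbrk finds the first base of code i that code j also allows */
            if (strpbrk(iub_codes[i] + 1, iub_codes[j] + 1) != NULL)
                match[iub_codes[i][0] - 'A'][iub_codes[j][0] - 'A'] = 1;
}

/* Convert each char to 0-25 and look the pair up in the matrix. */
int test_base_lookup(char q, char c)
{
    assert(q >= 'A' && q <= 'Z' && c >= 'A' && c <= 'Z');
    return match[q - 'A'][c - 'A'];
}
```

After one call to `build_match_matrix()`, each comparison such as `test_base_lookup('A', 'R')` is a single array access, which is the point of the method; letters that are not IUB codes simply never match.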
6
Bit-coding
  let the following binary values represent each base
    A = 0001
    C = 0010
    G = 0100
    T = 1000
  then
    G = 0100 = 4
    A or C = 0011 = 3
    A, G or T = 1101 = 13
    etc.
7
Matching one character - with bit coding
  Assume two integer variables "I" and "J"
  test for exact match
    If(I=J) {...}
  test for match with wildcards (no lookup!)
    If(I&J) {...}
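A minimal C sketch of the bit-coded test, assuming the slide's values A=1, C=2, G=4, T=8. The names `codes_by_bits`, `iub_to_bits` and `bits_match` are hypothetical. The conversion relies on the observation that the fifteen IUB codes can be listed in order of their bit patterns, so a code's position in that list gives its value.

```c
#include <assert.h>
#include <string.h>

/* Position i holds the IUB code whose bit pattern is i + 1,
   e.g. M = A|C = 3 and D = A|G|T = 13. */
static const char codes_by_bits[] = "ACMGRSVTWYHKDBN";

/* Return the 4-bit pattern for an IUB code, or 0 if it is not a code. */
unsigned iub_to_bits(char code)
{
    const char *p = (code != '\0') ? strchr(codes_by_bits, code) : NULL;
    return p ? (unsigned)(p - codes_by_bits) + 1u : 0u;
}

/* Wildcard match: nonzero when the two codes share at least one base. */
int bits_match(unsigned i, unsigned j)
{
    assert(i <= 15u && j <= 15u);
    return (i & j) != 0;
}
```

With this coding, `iub_to_bits('A')` and `iub_to_bits('R')` share the A bit and therefore match, while an exact-match test remains a plain `==` comparison of the two integers.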
8
Matching more than one character - pattern matching
  Example: recognition site for a restriction enzyme
    Input sequence string into variable Seq
    Define Site as string of characters or masks
      EcoRI recognizes GAATTC
      AccI recognizes GTMKAC
    Create function to search a sequence for that site
      Find(Site,LenSite,Seq,LenSeq)
      for each position in Seq, see if Site matches starting there
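One possible shape for such a search function, written as a sketch rather than as the course's actual Find. The name `find_site`, its return convention (index of the first match, or -1) and the helper `code_bits` are assumptions; the argument order follows Find(Site,LenSite,Seq,LenSeq), and characters are compared through the bit coding A=1, C=2, G=4, T=8.

```c
#include <assert.h>
#include <string.h>

/* Bit pattern of an IUB code (A=1, C=2, G=4, T=8); 0 for anything else. */
static unsigned code_bits(char code)
{
    static const char codes[] = "ACMGRSVTWYHKDBN";
    const char *p = (code != '\0') ? strchr(codes, code) : NULL;
    return p ? (unsigned)(p - codes) + 1u : 0u;
}

/* Return the 0-based position of the first match of site in seq, or -1. */
long find_site(const char *site, size_t len_site,
               const char *seq, size_t len_seq)
{
    assert(site != NULL && seq != NULL);
    if (len_site == 0 || len_site > len_seq)
        return -1;

    /* For each position in seq, see if site matches starting there. */
    for (size_t start = 0; start + len_site <= len_seq; start++) {
        size_t k = 0;
        while (k < len_site &&
               (code_bits(site[k]) & code_bits(seq[start + k])) != 0)
            k++;
        if (k == len_site)
            return (long)start;
    }
    return -1;
}
```

For example, searching "CCGAATTCGG" for the EcoRI site GAATTC reports position 2, and the AccI pattern GTMKAC matches GTCTAC because M allows C and K allows T.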
9
Automating Probability Calculations using Nucleotide Frequencies
10
Automating the Calculation
  Goal: Calculate probability of occurrence of a sequence that may include ambiguous bases
  What we need is a way to consider all possible allowed nucleotides at each position in all allowed combinations
  When using dinucleotide probabilities, have to be careful about how the probabilities are combined
11
Illustration
  Question: What is the probability of observing sequence feature ART (A followed by a purine {either A or G}, followed by a T) using dinucleotide probabilities?
12
Which is right?
  $p_{ART} = p_A (p^*_{AA} + p^*_{AG})(p^*_{AT} + p^*_{GT})$  [eq. 1]
  $p_{ART} = p_A (p^*_{AA} p^*_{AT} + p^*_{AG} p^*_{GT})$  [eq. 2]
13
Expansions
  $p_{ART} = p_A (p^*_{AA} + p^*_{AG})(p^*_{AT} + p^*_{GT})$  [eq. 1]
  $p_{ART} = p_A p^*_{AA} p^*_{AT} + p_A p^*_{AA} p^*_{GT} + p_A p^*_{AG} p^*_{AT} + p_A p^*_{AG} p^*_{GT}$
  $p_{ART} = p_A (p^*_{AA} p^*_{AT} + p^*_{AG} p^*_{GT})$  [eq. 2]
  $p_{ART} = p_A p^*_{AA} p^*_{AT} + p_A p^*_{AG} p^*_{GT}$
14
Proof
  $p_{ART} = p_{AAT} + p_{AGT}$
  $p_{AAT} = p_A p^*_{AA} p^*_{AT}$
  $p_{AGT} = p_A p^*_{AG} p^*_{GT}$
  $p_{ART} = p_A p^*_{AA} p^*_{AT} + p_A p^*_{AG} p^*_{GT}$
  This matches equation 2 on previous slide
15
Need further convincing?
  Imagine that $p^*_{AA} = 0$ and $p^*_{GT} = 0$ (but all other $p^*$ are non-zero)
  Then $p_{ART}$ should be zero since there is no way to create either AAT or AGT
  This is predicted by eq. 2 but not by eq. 1
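The thought experiment can also be checked numerically. The sketch below evaluates both candidate formulas; the function names `p_art_eq1` and `p_art_eq2` and the helper `is_prob` are hypothetical, and any probability values passed in are illustrative, not measured frequencies.

```c
#include <assert.h>

/* Precondition helper: a probability must lie in [0, 1]. */
static int is_prob(double p) { return p >= 0.0 && p <= 1.0; }

/* eq. 1: add at each step, then multiply the sums. */
double p_art_eq1(double pA, double pAA, double pAG, double pAT, double pGT)
{
    assert(is_prob(pA) && is_prob(pAA) && is_prob(pAG) &&
           is_prob(pAT) && is_prob(pGT));
    return pA * (pAA + pAG) * (pAT + pGT);
}

/* eq. 2: multiply along each path (AAT, AGT), then add the paths. */
double p_art_eq2(double pA, double pAA, double pAG, double pAT, double pGT)
{
    assert(is_prob(pA) && is_prob(pAA) && is_prob(pAG) &&
           is_prob(pAT) && is_prob(pGT));
    return pA * (pAA * pAT + pAG * pGT);
}
```

Calling both with, say, pA = 0.25, pAA = 0, pAG = 0.3, pAT = 0.2 and pGT = 0 gives exactly zero for eq. 2, while eq. 1 still returns a positive value because it keeps the impossible cross term A-G followed by A-T.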
16
More complicated probability illustration
  What is the probability of observing the sequence feature ARYT (A followed by a purine {either A or G}, followed by a pyrimidine {either C or T}, followed by a T)?
  Using equal mononucleotide frequencies
    $p_A = p_C = p_G = p_T = 1/4$
    $p_{ARYT} = 1/4 \cdot (1/4 + 1/4) \cdot (1/4 + 1/4) \cdot 1/4 = 1/64$
17
Illustration (continued)
  Using observed mononucleotide frequencies:
    $p_{ARYT} = p_A (p_A + p_G)(p_C + p_T) p_T$
  Using dinucleotide frequencies:
    $p_{ARYT} = p_A \left( p^*_{AA} (p^*_{AC} p^*_{CT} + p^*_{AT} p^*_{TT}) + p^*_{AG} (p^*_{GC} p^*_{CT} + p^*_{GT} p^*_{TT}) \right)$
18
Illustration (continued)
  Using dinucleotide frequencies, ARYT expands along four paths:
    A   +A = AA   +C = AAC   +T = AACT
                  +T = AAT   +T = AATT
        +G = AG   +C = AGC   +T = AGCT
                  +T = AGT   +T = AGTT
19
Multiply then add
  We conclude that for such strings our rule should be "multiply dinucleotide probabilities along each allowed path and then add the results"
20
How do we program this?
  "for" loops?
  Nested "if" structure?
  Other?
21
Will this work?
  result=monoprob(seq(1));
  for i=2 to n {
    temp=0.
    for j=1 to 4 /*for each base*/ {
      if(seq(i)&mask(j)) temp=temp+diprob(seq(i-1),seq(i))
    }
    result=result*temp
  }
22
No to "for"
  No, it generates "add then multiply" instead of the required "multiply then add"
23
A recursive solution
  Some programming languages allow recursion - the calling (invoking) of a function by itself
  This is useful here because we can branch when we encounter an ambiguous base and consider all alternatives separately
  Allows multiplication down the branches and then addition
24
Site Probability Calculation via Recursion
  Illustration: Make a function that prints out all possible sequences that can match a restriction site
  (Demo Program PossibleSites.c)
  (found in /afs/andrew.cmu.edu/usr/murphy/CompBiol/DemoPrograms or Mellon: BioServer: Comp. Biol. 03-310: Demo Programs: PossibleSites ƒ)
25
PossibleSites.c
/* PossibleSites.c
   Prints out all possible sites that can match a string of IUB codes
   January 22, 1997 - R.F. Murphy */

#include <stdbool.h>   /* true/false, used by the Test functions */
#include <stdio.h>     /* printf, scanf */
#include <string.h>    /* strlen */

void PossibleSites(char SiteString[], int Index);
short Test1(char SiteString[], int Index);
short Test2(char SiteString[], int Index);
short Test3(char SiteString[], int Index);
short Test4(char SiteString[], int Index);

int main(void)
{
    char Site[11];   /* up to 10 characters plus the terminating '\0' */

    /* Keep prompting until input ends */
    while (1) {
        printf("Enter a string of IUB codes (up to 10 characters): ");
        if (scanf("%10s", Site) != 1)
            break;
        PossibleSites(Site, 0);
    }
    return 0;
}
26
void PossibleSites(char SiteString[], int Index)
{
    if (Index >= (int)strlen(SiteString)) {
        /* Unwind here: every position is resolved, so print the site */
        printf("%s\n", SiteString);
        return;
    }
    else {
        /* Test for each type of ambiguous base */
        if (Test1(SiteString, Index)) ;
        else if (Test2(SiteString, Index)) ;
        else if (Test3(SiteString, Index)) ;
        else if (Test4(SiteString, Index)) ;
        else {
            printf("Illegal character (%c) encountered\n", SiteString[Index]);
            PossibleSites(SiteString, Index+1);
        }
        return;
    }
}

/* Unambiguous base: nothing to expand, move on to the next position */
short Test1(char SiteString[], int Index)
{
    /* printf("In Test1: Index %d, SiteString[Index] %c\n",Index,SiteString[Index]); */
    switch (SiteString[Index]) {
    case 'A': case 'C': case 'G': case 'T':
        break;
    default:
        return false;
    }
    PossibleSites(SiteString, Index+1);
    return true;
}
27
/* Two-base codes: substitute each alternative in turn and recurse */
short Test2(char SiteString[], int Index)
{
    char Save;

    /* printf("In Test2: Index %d, SiteString[Index] %c\n",Index,SiteString[Index]); */
    Save = SiteString[Index];
    switch (SiteString[Index]) {
    case 'R':
        SiteString[Index]='A'; PossibleSites(SiteString, Index);
        SiteString[Index]='G'; PossibleSites(SiteString, Index);
        break;
    case 'Y':
        SiteString[Index]='C'; PossibleSites(SiteString, Index);
        SiteString[Index]='T'; PossibleSites(SiteString, Index);
        break;
    case 'S':
        SiteString[Index]='G'; PossibleSites(SiteString, Index);
        SiteString[Index]='C'; PossibleSites(SiteString, Index);
        break;
    case 'W':
        SiteString[Index]='A'; PossibleSites(SiteString, Index);
        SiteString[Index]='T'; PossibleSites(SiteString, Index);
        break;
    case 'M':
        SiteString[Index]='A'; PossibleSites(SiteString, Index);
        SiteString[Index]='C'; PossibleSites(SiteString, Index);
        break;
    case 'K':
        SiteString[Index]='G'; PossibleSites(SiteString, Index);
        SiteString[Index]='T'; PossibleSites(SiteString, Index);
        break;
    default:
        return false;
    }
    SiteString[Index] = Save;   /* restore the ambiguity code */
    return true;
}
28
/* Three-base codes */
short Test3(char SiteString[], int Index)
{
    char Save;

    /* printf("In Test3: Index %d, SiteString[Index] %c\n",Index,SiteString[Index]); */
    Save = SiteString[Index];
    switch (SiteString[Index]) {
    case 'B':   /* not A */
        SiteString[Index]='C'; PossibleSites(SiteString, Index);
        SiteString[Index]='G'; PossibleSites(SiteString, Index);
        SiteString[Index]='T'; PossibleSites(SiteString, Index);
        break;
    case 'D':   /* not C */
        SiteString[Index]='A'; PossibleSites(SiteString, Index);
        SiteString[Index]='G'; PossibleSites(SiteString, Index);
        SiteString[Index]='T'; PossibleSites(SiteString, Index);
        break;
    case 'H':   /* not G */
        SiteString[Index]='A'; PossibleSites(SiteString, Index);
        SiteString[Index]='C'; PossibleSites(SiteString, Index);
        SiteString[Index]='T'; PossibleSites(SiteString, Index);
        break;
    case 'V':   /* not T/U */
        SiteString[Index]='A'; PossibleSites(SiteString, Index);
        SiteString[Index]='C'; PossibleSites(SiteString, Index);
        SiteString[Index]='G'; PossibleSites(SiteString, Index);
        break;
    default:
        return false;
    }
    SiteString[Index] = Save;   /* restore the ambiguity code */
    return true;
}
29
/* Four-base codes */
short Test4(char SiteString[], int Index)
{
    char Save;

    /* printf("In Test4: Index %d, SiteString[Index] %c\n",Index,SiteString[Index]); */
    Save = SiteString[Index];
    switch (SiteString[Index]) {
    case 'N':   /* A,C,G,T/U (iNdeterminate) */
    case 'X':   /* alternate for N */
        SiteString[Index]='A'; PossibleSites(SiteString, Index);
        SiteString[Index]='C'; PossibleSites(SiteString, Index);
        SiteString[Index]='G'; PossibleSites(SiteString, Index);
        SiteString[Index]='T'; PossibleSites(SiteString, Index);
        break;
    default:
        return false;
    }
    SiteString[Index] = Save;   /* restore the ambiguity code */
    return true;
}