Motif Finding [1]: Ch 4.4-4.6, 4.8-4.10, 5.5, 12.2-12.4.


Biological Motivation
– Organisms are exposed to infection by bacteria and other pathogens (germs).
– Organisms have immunity genes, which are usually dormant.
– When an organism is infected, its immunity genes are “switched on” and produce proteins that destroy the bacteria and pathogens and cure the infection.
– Biologists want to know: “Who turned them on?”
– In the fly, substrings similar to TCGGGGATTTCC within the gene (i.e., the DNA sequence) turn the immunity genes on.
– TCGGGGATTTCC is called a regulatory motif.

Random Sample atgaccgggatactgataccgtatttggcctaggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatactgggcataaggtaca tgagtatccctgggatgacttttgggaacactatagtgctctcccgatttttgaatatgtaggatcattcgccagggtccga gctgagaattggatgaccttgtaagtgttttccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatggcccacttagtccacttatag gtcaatcatgttcttgtgaatggatttttaactgagggcatagaccgcttggcgcacccaaattcagtgtgggcgagcgcaa cggttttggcccttgttagaggcccccgtactgatggaaactttcaattatgagagagctaatctatcgcgtgcgtgttcat aacttgagttggtttcgaaaatgctctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatttcaacgtatgccgaaccgaaagggaag ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttctgggtactgatagca

Implanting Motif AAAAAAAAGGGGGGG atgaccgggatactgatAAAAAAAAGGGGGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataAAAAAAAAGGGGGGGa tgagtatccctgggatgacttAAAAAAAAGGGGGGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga gctgagaattggatgAAAAAAAAGGGGGGGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAAAAAAAAGGGGGGGcttatag gtcaatcatgttcttgtgaatggatttAAAAAAAAGGGGGGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa cggttttggcccttgttagaggcccccgtAAAAAAAAGGGGGGGcaattatgagagagctaatctatcgcgtgcgtgttcat aacttgagttAAAAAAAAGGGGGGGctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatAAAAAAAAGGGGGGGaccgaaagggaag ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttAAAAAAAAGGGGGGGa

Where is the Implanted Motif? atgaccgggatactgataaaaaaaagggggggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataaaaaaaaaggggggga tgagtatccctgggatgacttaaaaaaaagggggggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga gctgagaattggatgaaaaaaaagggggggtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaataaaaaaaagggggggcttatag gtcaatcatgttcttgtgaatggatttaaaaaaaaggggggggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa cggttttggcccttgttagaggcccccgtaaaaaaaagggggggcaattatgagagagctaatctatcgcgtgcgtgttcat aacttgagttaaaaaaaagggggggctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcataaaaaaaagggggggaccgaaagggaag ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttaaaaaaaaggggggga

Implanting Motif AAAAAAAAGGGGGGG with Four Mutations/Changes atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa

Where is the Motif??? atgaccgggatactgatagaagaaaggttgggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacaataaaacggcggga tgagtatccctgggatgacttaaaataatggagtggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga gctgagaattggatgcaaaaaaagggattgtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatataataaaggaagggcttatag gtcaatcatgttcttgtgaatggatttaacaataagggctgggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa cggttttggcccttgttagaggcccccgtataaacaaggagggccaattatgagagagctaatctatcgcgtgcgtgttcat aacttgagttaaaaaatagggagccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatactaaaaaggagcggaccgaaagggaag ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttactaaaaaggagcgga

How to Find Regulatory Motif?
How do we find the regulatory motif in a set of immunity genes? What do we know, what don’t we know, and what do we want to find?
We know:
– There is at least one regulatory motif in each immunity gene DNA sequence
– The motif instances look similar to one another
– The length l of the motif
We don’t know:
– The exact pattern of the motif
– The location of the motif in each sequence
– The number of occurrences
We want to find:
– A substring of length l that is close to all regulatory motif instances

A Similar Problem The Motif Finding Problem is similar to the problem posed by Edgar Allan Poe (1809 – 1849) in his Gold Bug story

The Gold Bug Problem Given a secret message: 53++!305))6*;4826)4+.)4+);806*;48!8`60))85;]8*:+*8!83(88)5*!; 46(;88*96*?;8)*+(;485);5*!2:*+(;4956*2(5*-4)8`8*; );)6 !8)4++;1(+9;48081;8:8+1;48!85;4)485!528806*81(+9;48;(88;4(+?3 4;48)4+;161;:188;+?; Decipher the message encrypted in the fragment

Symbol Frequencies in the Gold Bug Message
Gold Bug message symbols, from most frequent to least frequent: 8 ; 4 ) + * 5 6 ( ! 1 0 2 9 3 : ? ` - ] .
English language letters, from most frequent to least frequent: e t a o i n s r h l d c u m f p g w y b v k x j q z

First Attempt By simply mapping the most frequent symbols to the most frequent letters of the alphabet: sfiilfcsoorntaeuroaikoaiotecrntaeleyrcooestvenpinelefheeosnlt arhteenmrnwteonihtaesotsnlupnihtamsrnuhsnbaoeyentacrmuesotorl eoaiitdhimtaecedtepeidtaelestaoaeslsueecrnedhimtaetheetahiwfa taeoaitdrdtpdeetiwt The result does not make sense
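The first attempt can be sketched in a few lines of Python. This is only an illustration: the function name `frequency_decode` and the tie-breaking rule (equal counts are ordered by symbol) are assumptions, so on the real message the output may differ in detail from the text on the slide.

```python
from collections import Counter

# English letters ordered from most to least frequent, as listed on the slide.
ENGLISH_ORDER = "etaoinsrhldcumfpgwybvkxjqz"

def frequency_decode(ciphertext, reference_order=ENGLISH_ORDER):
    """Map the i-th most frequent cipher symbol to the i-th reference letter.

    Hypothetical helper: whitespace is ignored when counting and copied
    through unchanged when decoding.
    """
    counts = Counter(ch for ch in ciphertext if not ch.isspace())
    # Descending count; ties broken by symbol so the result is deterministic.
    ranked = sorted(counts, key=lambda ch: (-counts[ch], ch))
    mapping = dict(zip(ranked, reference_order))
    decoded = "".join(mapping.get(ch, ch) for ch in ciphertext)
    return decoded, mapping

decoded, mapping = frequency_decode("888;;4")
print(decoded)  # "8" is most frequent, then ";", then "4"
```

As the slide notes, a plain one-symbol frequency mapping is rarely enough on its own for a short message.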

l-tuple count A better approach: –Examine frequencies of l-tuples, combinations of 2 symbols, 3 symbols, etc. –“The” is the most frequent 3-tuple in English and “;48” is the most frequent 3-tuple in the encrypted text –Make inferences of unknown symbols by examining other frequent l-tuples
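Counting l-tuples is a sliding-window count. The sketch below assumes that whitespace in the text is only line wrapping and can be dropped; `most_common_tuples` is a hypothetical name.

```python
from collections import Counter

def most_common_tuples(text, l, top=3):
    """Return the `top` most frequent l-tuples (substrings of length l)."""
    symbols = "".join(text.split())  # assumption: spaces are not cipher symbols
    counts = Counter(symbols[i:i + l] for i in range(len(symbols) - l + 1))
    return counts.most_common(top)

# Toy example in plain English: "the" appears twice, every other 3-tuple once.
print(most_common_tuples("the cat and the hat", 3, top=1))
```

Applied to the encrypted message with l = 3, the same count is what singles out the most frequent 3-tuple as a candidate for "the".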

The ;48 clue Mapping “the” to “;48” and substituting all occurrences of the symbols: 53++!305))6*the26)h+.)h+)te06*the!e`60))e5t]e*:+*e!e3(ee)5*!t h6(tee*96*?te)*+(the5)t5*!2:*+(th956*2(5*h)e`e*th0692e5)t)6!e )h++t1(+9the0e1te:e+1the!e5th)he5!52ee06*e1(+9thet(eeth(+?3ht he)h+t161t:1eet+?t

Second Attempt
Make inferences:
53++!305))6*the26)h+.)h+)te06*the!e`60))e5t]e*:+*e!e3(ee)5*!t h6(tee*96*?te)*+(the5)t5*!2:*+(th956*2(5*h)e`e*th0692e5)t)6!e )h++t1(+9the0e1te:e+1the!e5th)he5!52ee06*e1(+9thet(eeth(+?3ht he)h+t161t:1eet+?t
“thet(ee” most likely means “the tree”
– Infer “(” = “r”
– “th(+?3h” becomes “thr+?3h”
– Can we guess “+” and “?”?

The Solution After figuring out all the mappings, the final message is: AGOODGLASSINTHEBISHOPSHOSTELINTHEDEVILSSEATWENYONEDEGRE ESANDTHIRTEENMINUTESNORTHEASTANDBYNORTHMAINBRANCHSEVENT HLIMBEASTSIDESHOOTFROMTHELEFTEYEOFTHEDEATHSHEADABEELINE FROMTHETREETHROUGHTHESHOTFIFTYFEETOUT

The Solution A GOOD GLASS IN THE BISHOP’S HOSTEL IN THE DEVIL’S SEAT, TWENTY ONE DEGREES AND THIRTEEN MINUTES NORTHEAST AND BY NORTH, MAIN BRANCH SEVENTH LIMB, EAST SIDE, SHOOT FROM THE LEFT EYE OF THE DEATH’S HEAD A BEE LINE FROM THE TREE THROUGH THE SHOT, FIFTY FEET OUT.

Motif Finding is harder than the Gold Bug problem
– We don’t have the complete dictionary of motifs yet
– The “genetic” language does not have a standard “grammar”
– Only a small fraction of nucleotide sequences encode motifs, and the size of the data is enormous

The Motif Finding Problem Given random samples of DNA sequences: cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc Find the pattern/motif of length l that is implanted in each of the individual sequences

The Motif Finding Problem The patterns revealed with no mutations: cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc acgtacgt Consensus String, this is the motif

The Motif Finding Problem The patterns with 2 mutations: cctgatagacgctatctggctatccaGgtacTtaggtcctctgtgcgaatctatgcgtttccaaccat agtactggtgtacatttgatCcAtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc aaacgtTAgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtCcAtataca ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaCcgtacgGc What is the consensus string here?

Parameters
cctgatagacgctatctggctatccaGgtacTtaggtcctctgtgcgaatctatgcgtttccaaccat
agtactggtgtacatttgatCcAtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
aaacgtTAgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtCcAtataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaCcgtacgGc
l = 8 (length of the motif)
t = 5 (number of DNA sequences)
n = 69 (length of each DNA sequence)
s = (s_1, s_2, s_3, s_4, s_5) = (26, 21, 3, 56, 60) (starting positions of the motif instances)

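Scoring Motifs
For s = (s_1, …, s_t) and DNA, the l-mers at the starting positions form an alignment with t rows and l columns:

a G g t a c T t
C c A t a c g t
a c g t T A g t
a c g t C c A t
C c g t a c g G
_________________
A          3 0 1 0 3 1 1 0
C          2 4 0 0 1 4 0 0
G          0 1 4 0 0 0 3 1
T          0 0 0 5 1 0 1 4
_________________
Consensus  a c g t a c g t

Score(s, DNA) = $\sum_{i=1}^{l} \max_{k \in \{A,C,G,T\}} count(k, i)$, where count(k, i) is the number of times nucleotide k occurs in column i.
Score = 3 + 4 + 4 + 5 + 3 + 4 + 3 + 4 = 30
Find s with maximum score.
What is the best/worst score?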
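The scoring step can be checked with a short Python sketch. The function name `profile_and_score` is hypothetical, and ties for the most frequent nucleotide are broken in A, C, G, T order, which is an assumption and not something the slide specifies.

```python
from collections import Counter

def profile_and_score(alignment):
    """Return (profile, consensus, score) for equal-length aligned l-mers."""
    rows = [row.upper() for row in alignment]  # case only marks mutations
    profile = {base: [] for base in "ACGT"}
    consensus, score = [], 0
    for column in zip(*rows):
        counts = Counter(column)
        for base in "ACGT":
            profile[base].append(counts[base])
        best = max("ACGT", key=lambda b: counts[b])  # first maximum wins ties
        consensus.append(best)
        score += counts[best]
    return profile, "".join(consensus), score

# The five 8-mers from the slide's alignment.
alignment = ["aGgtacTt", "CcAtacgt", "acgtTAgt", "acgtCcAt", "CcgtacgG"]
profile, consensus, score = profile_and_score(alignment)
print(consensus, score)
```

With t rows, each column contributes at most t to the score, which is a useful starting point for the best/worst score question.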

BruteForceMotifSearch
BruteForceMotifSearch(DNA, t, n, l)
  bestScore ← 0
  for each s = (s_1, s_2, ..., s_t) from (1, 1, ..., 1) to (n-l+1, ..., n-l+1)
    if Score(s, DNA) > bestScore
      bestScore ← Score(s, DNA)
      bestMotif ← (s_1, s_2, ..., s_t)
  return bestMotif
Cost: there are (n - l + 1)^t possible sets of starting positions. Each iteration takes O(lt) operations for scoring, for a total of O(l t n^t).
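A minimal Python sketch of the brute-force search follows. It assumes all sequences have the same length and use a consistent letter case, and it uses 0-based starting positions where the slides use 1-based ones. The names are hypothetical, and the toy input is tiny because the search space grows as (n - l + 1)^t.

```python
from collections import Counter
from itertools import product

def score(s, dna, l):
    """Consensus score of the l-mers starting at 0-based positions s."""
    total = 0
    for i in range(l):
        column = Counter(seq[start + i] for seq, start in zip(dna, s))
        total += max(column.values())
    return total

def brute_force_motif_search(dna, l):
    """Try every combination of starting positions and keep the best one."""
    n = len(dna[0])
    best_score, best_motif = 0, None
    for s in product(range(n - l + 1), repeat=len(dna)):
        current = score(s, dna, l)
        if current > best_score:
            best_score, best_motif = current, s
    return best_motif, best_score

# Toy input: "cat" is the only 3-mer shared by all three sequences.
print(brute_force_motif_search(["ggcatg", "catggg", "gggcat"], 3))
```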

A Different Look
Given v = “acgtacgt” and starting positions s, compare v with the closest l-mer in each sequence (mismatches are shown in upper case):
cctgatagacgctatctggctatccacgtacAtaggtcctctgtgcgaatctatgcgtttccaaccat
agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
aaaAgtCcgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca
ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtaGgtc
TotalDistance(v, DNA) = $\sum_{i=1}^{t} \min_{s_i} d_H(v, \text{l-mer of sequence } i \text{ starting at } s_i)$, where d_H is the Hamming distance: for each sequence take the minimum over all starting positions, then sum over the sequences.
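TotalDistance can be sketched in Python as follows. The helper names are hypothetical, and comparison is case-insensitive on the assumption that upper case on the slides only marks mutated positions.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def total_distance(v, dna):
    """Sum over sequences of the minimum Hamming distance from v to any l-mer."""
    v = v.lower()
    l = len(v)
    total = 0
    for seq in dna:
        seq = seq.lower()
        total += min(hamming(v, seq[i:i + l]) for i in range(len(seq) - l + 1))
    return total

# Toy input: exact match, one mismatch, three mismatches.
print(total_distance("acgt", ["ttacgt", "acCtaa", "gggg"]))
```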

The Problem Input: A t x n matrix DNA, and l, the length of the pattern to find Output: A string v of l nucleotides that minimizes TotalDistance(v,DNA) over all strings of that length

Median String Search Brute Force Algorithm
MedianStringSearch(DNA, t, n, l)
  bestWord ← AAA…A
  bestDistance ← ∞
  for each l-mer s from AAA…A to TTT…T
    if TotalDistance(s, DNA) < bestDistance
      bestDistance ← TotalDistance(s, DNA)
      bestWord ← s
  return bestWord
Cost: there are 4^l possible l-mers. Computing the minimum distance between one l-mer and one sequence takes O(n) time, so TotalDistance over t sequences takes O(nt). Total: O(n t 4^l).
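The median string search can be sketched in Python along the same lines. This is an illustration with hypothetical names: it enumerates all 4^l candidate l-mers in alphabetical order and keeps the first one with the smallest total distance.

```python
from itertools import product

def total_distance(v, dna):
    """Sum over sequences of the minimum Hamming distance from v to any l-mer."""
    l = len(v)
    total = 0
    for seq in dna:
        seq = seq.lower()
        total += min(
            sum(x != y for x, y in zip(v, seq[i:i + l]))
            for i in range(len(seq) - l + 1)
        )
    return total

def median_string_search(dna, l):
    """Enumerate every l-mer over {a, c, g, t} and return the closest one."""
    best_word, best_distance = None, float("inf")
    for letters in product("acgt", repeat=l):
        word = "".join(letters)
        distance = total_distance(word, dna)
        if distance < best_distance:
            best_word, best_distance = word, distance
    return best_word, best_distance

print(median_string_search(["ggcatg", "catggg", "gggcat"], 3))
```

The loop runs over l-mers and not over combinations of starting positions, which is why its cost depends on 4^l and not on n^t.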

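Motif Finding Problem == Median String Problem
Alignment (t = 5 rows, l = 8 columns):
a G g t a c T t
C c A t a c g t
a c g t T A g t
a c g t C c A t
C c g t a c g G
_________________
Profile:
A              3 0 1 0 3 1 1 0
C              2 4 0 0 1 4 0 0
G              0 1 4 0 0 0 3 1
T              0 0 0 5 1 0 1 4
_________________
Consensus      a c g t a c g t
Score          3 4 4 5 3 4 3 4   Sum = 30
TotalDistance  2 1 1 0 2 1 2 1   Sum = 10
At any column i: Score_i + TotalDistance_i = t
For l columns: Score + TotalDistance = l * t
Therefore Score = l * t - TotalDistance, so maximizing Score is equivalent to minimizing TotalDistance.
Motif Finding = O(l t n^t)
Median String = O(n t 4^l)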
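The identity Score + TotalDistance = l * t can be checked numerically for the alignment on this slide. In the sketch below, the hypothetical helper `score_and_distance` measures the distance column by column against the consensus of the alignment itself.

```python
from collections import Counter

def score_and_distance(alignment):
    """Return (Score, TotalDistance to the consensus) for aligned l-mers."""
    rows = [row.lower() for row in alignment]
    score = distance = 0
    for column in zip(*rows):
        top = Counter(column).most_common(1)[0][1]  # count of the consensus letter
        score += top                    # letters that agree with the consensus
        distance += len(column) - top   # letters that disagree
    return score, distance

alignment = ["aGgtacTt", "CcAtacgt", "acgtTAgt", "acgtCcAt", "CcgtacgG"]
score, distance = score_and_distance(alignment)
l, t = len(alignment[0]), len(alignment)
print(score, distance, score + distance == l * t)
```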

Self Study Can you convert the two brute force algorithms into branch-and-bound algorithms to reduce the number of candidates checked?