Motif discovery and Phylogenetic trees.


1 Motif discovery and Phylogenetic trees

2 Motif Discovery Challenges
How can we recognize a known regulatory motif at genome scale? Can we discover new motifs within the upstream sequences of genes?

3 Scenario 1 : Binding motif is known

4 Building a PSSM for short motifs
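Collect all known sequences that bind a certain transcription factor (TF). Align all the sequences (using multiple sequence alignment). Compute the frequency of each nucleotide at each position, giving a position-specific probability matrix (PSPM). Incorporate the background frequency of each nucleotide, giving a position-specific scoring matrix (PSSM).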
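The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the binding sites are made up, the background is taken to be uniform (0.25 per nucleotide), a pseudocount of 1 is added to avoid taking the log of zero, and the score is a log2 odds ratio. None of these choices come from the slides, and `build_pssm` is a hypothetical helper name.

```python
import math
from collections import Counter

def build_pssm(sites, background=None, pseudocount=1.0):
    """Return (pspm, pssm) for equal-length, already aligned DNA sites.

    pspm[i][nt] is the (pseudocount-smoothed) frequency of nt at position i.
    pssm[i][nt] is log2(frequency / background frequency).
    """
    alphabet = "ACGT"
    if background is None:
        # Assumption: uniform background; a real genome would have its own.
        background = {nt: 0.25 for nt in alphabet}
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    pspm, pssm = [], []
    total = len(sites) + pseudocount * len(alphabet)
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        freqs = {nt: (counts[nt] + pseudocount) / total for nt in alphabet}
        pspm.append(freqs)
        pssm.append({nt: math.log2(freqs[nt] / background[nt]) for nt in alphabet})
    return pspm, pssm

# Made-up aligned binding sites, for illustration only.
sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]
pspm, pssm = build_pssm(sites)
```

Positive PSSM entries mark nucleotides that are more frequent at that position than in the background; negative entries mark nucleotides that are rarer.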

5 Problems
When searching a genome for a motif using a PSSM or other methods, the motif is usually found all over the place. The motif is therefore considered real only if it is found in the vicinity of a gene. When the binding sites of a specific TF are checked experimentally (location analysis), the sites bound by the TF are in some cases similar to the PSSM and in other cases not.
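To see why a motif tends to turn up "all over the place", it helps to look at how a scan works: every window of the sequence gets a score, and every window above a threshold is reported. The sketch below is a toy illustration, not a real scanner. The matrix is a hypothetical one built by `toy_pssm` (+1 for the motif's nucleotide, -1 otherwise), and the sequence and thresholds are invented.

```python
def toy_pssm(motif, match=1.0, mismatch=-1.0):
    """Hypothetical scoring matrix: +1 for the motif's nucleotide, -1 otherwise."""
    return [{nt: (match if nt == m else mismatch) for nt in "ACGT"} for m in motif]

def scan(sequence, pssm, threshold):
    """Return (offset, score) for every window scoring at or above threshold."""
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(column[nt] for column, nt in zip(pssm, window))
        if score >= threshold:
            hits.append((start, score))
    return hits

pssm = toy_pssm("ACG")
sequence = "TTACGTTACCACG"               # made-up sequence
strict = scan(sequence, pssm, 3.0)      # exact matches only
relaxed = scan(sequence, pssm, 1.0)     # also allows one mismatch
```

Lowering the threshold to tolerate inexact sites adds hits, and in a genome-sized sequence the number of such chance hits grows quickly, which is why the slide restricts attention to hits near genes.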

6 Scenario 2 : Binding motif is unknown

7 Finding new Motifs
We are given a group of genes that presumably contain a common regulatory motif. We know nothing about the TF that binds the putative motif. The problem: discover the motif.

8 Motif Discovery

9 Computational Methods
Methods include:
- Probabilistic methods: hidden Markov models (HMMs), expectation maximization (EM), Gibbs sampling, etc.
- Enumeration methods: problematic for inexact motifs of length k> …
Current status: the problem is still open.
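As a rough idea of what an enumeration method does, the sketch below lists every exact k-mer and counts how many of the input sequences contain it. It is deliberately the simplest possible version: it only handles exact matches, so it illustrates the approach without addressing the inexact-motif difficulty mentioned above. The sequences and the helper name `kmer_support` are invented for the example.

```python
from collections import Counter

def kmer_support(seqs, k):
    """For each k-mer, count how many sequences contain it at least once."""
    support = Counter()
    for seq in seqs:
        seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        support.update(seen)  # each sequence contributes at most 1 per k-mer
    return support

# Made-up upstream sequences, for illustration only.
upstream = ["GGTATAATCC", "ACTATAATGG", "TTTATAATAC"]
support = kmer_support(upstream, 6)
shared = sorted(kmer for kmer, n in support.items() if n == len(upstream))
```

Allowing mismatches means considering every variant of every k-mer, and the number of candidates grows rapidly with k, which is where plain enumeration becomes impractical.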

10 Tools on the Web
- MEME: Multiple EM for Motif Elicitation.
- metaMEME: uses an HMM method.
- MAST: Motif Alignment and Search Tool.
- TRANSFAC: database of eukaryotic cis-acting regulatory DNA elements and trans-acting factors.
- eMotif: allows scanning, making and searching for motifs at the protein level.

11 From MSA To Phylogenetic trees

12 Phylogeny is the inference of evolutionary relationships.
Traditionally, phylogeny relied on the comparison of morphological features between organisms. Today, molecular sequence data are mainly used for phylogenetic analyses.
One tree of life: a sketch Darwin made soon after returning from his voyage on HMS Beagle (1831–36) showed his thinking about the diversification of species from a single stock (see Figure, overleaf). This branching, extended by the concept of common descent,

13 Haeckel (1879) Pace (2001)

14 Molecular phylogeny uses trees to depict evolutionary relationships among organisms.
These trees are based on DNA and protein sequence data.
Pre-molecular analysis: the great apes (chimpanzee, gorilla and orangutan) are separate from the human.
Molecular analysis: the chimpanzee is more closely related to the human than to the gorilla.
[Figure: two trees of human, chimpanzee, gorilla and orangutan, one for each analysis]

15 What can we learn from a phylogenetic tree?

16 1. Determine the closest relatives of an organism in which we are interested
Was the extinct quagga more like a zebra or a horse?

17 Which species are closest to Human?
[Figure: two candidate trees relating human, chimpanzee, gorilla and orangutan]

18 2. Help to find the relationships between species and to identify new species
Example: metagenomics, a new field in genomics that aims to study the genomes recovered from environmental samples. It is a powerful tool for accessing the rich biodiversity of native environmental samples.

19 10^6 cells/ml seawater; 10^7 virus particles/ml seawater; >99% uncultivated microbes

20 From : “The Sorcerer II Global Ocean Sampling Expedition: Metagenomic Characterization of Viruses within Aquatic Microbial Samples” Williamson et al, PLOS ONE 2008

21 3. Discover a function of an unknown gene or protein
[Figure: tree with leaves RBP1_HS, RBP2_pig, RBP_RAT, ALP_HS, ALPEC_BV, ALPA1_RAT, ECBLC and hypothetical protein X]

22 Relationships can be represented by Phylogenetic Tree or Dendrogram
[Figure: tree with leaves A, B, C, D, E and F]

23 Phylogenetic Tree Terminology
A graph composed of nodes and branches. Each branch connects two adjacent nodes.
[Figure: tree with root R and leaves A, B, C, D, E and F]

24 Phylogenetic Tree Terminology
[Figure: a rooted tree and an unrooted tree of human, chimp, gorilla and chicken]

25 Rooted vs. unrooted trees
[Figure: a rooted and an unrooted tree, each with leaves 1, 2 and 3]

26 How can we build a tree with molecular data?
- Trees based on DNA sequences (rRNA), e.g. atcgatcgtgatcgatcgtagcatcgatgcatcgtacg
- Trees based on protein sequences, e.g. MWRCPYCGKRQWCMWG

27 Approach 1 - Distance methods
Algorithms:
- UPGMA (rooted)
- Neighbor joining (unrooted)
Approach 2 - State methods
Algorithms:
- Maximum parsimony (MP)
- Maximum likelihood (ML)

28 Basic algorithm for constructing a rooted tree
Unweighted Pair Group Method using Arithmetic Averages (UPGMA)
Assumption: divergence of sequences occurs at a constant rate, so the distance from every leaf to the root is equal.
Sequence a ACGCGTTGGGCGATGGCAAC
Sequence b ACGCGTTGGGCGACGGTAAT
Sequence c ACGCATTGAATGATGATAAT
Sequence d ACACATTGAGTGTGATAATA
[Figure: rooted tree with leaves a, b, c and d]

29 Basic Algorithm UPGMA
Sequences
Sequence a ACGCGTTGGGCGATGGCAAC
Sequence b ACACATTGAGTGTGATCAAC
Sequence c ACACATTGAGTGAGGACAAC
Sequence d ACGCGTTGGGCGACGGTAAT
Distance Table
     a    b    c    d
a    -    8    7    5
b         -    3    9
c              -    8
d                   -
Distances*: Dab = 8, Dac = 7, Dad = 5, Dbc = 3, Dbd = 9, Dcd = 8
* Can be calculated using different distance metrics
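One of the simplest distance metrics is a plain count of mismatching positions between two aligned sequences. The sketch below applies it to the four sequences of this slide. It is an assumption that this is a reasonable stand-in, not a claim about how the slide's table was produced: the mismatch count reproduces some of the entries (for example Dab = 8 and Dbc = 3) but not all of them, which fits the note that different metrics can be used. The helper name `mismatch_count` is made up.

```python
from itertools import combinations

def mismatch_count(s, t):
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(s, t))

seqs = {
    "a": "ACGCGTTGGGCGATGGCAAC",
    "b": "ACACATTGAGTGTGATCAAC",
    "c": "ACACATTGAGTGAGGACAAC",
    "d": "ACGCGTTGGGCGACGGTAAT",
}

# One entry per unordered pair, keyed as ("a", "b"), ("a", "c"), ...
table = {
    (p, q): mismatch_count(seqs[p], seqs[q])
    for p, q in combinations(sorted(seqs), 2)
}
```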

30 Selection step
Choose the nodes with the shortest distance and fuse them. In the distance table of the previous slide this is the pair b and c (Dbc = 3).
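With the distance table held as a dictionary keyed by pairs, the selection step is a single minimum lookup. The dictionary layout is just one convenient choice for illustration; the values are the ones listed on the distance-table slide.

```python
# Pairwise distances from the distance-table slide.
dist = {
    ("a", "b"): 8, ("a", "c"): 7, ("a", "d"): 5,
    ("b", "c"): 3, ("b", "d"): 9, ("c", "d"): 8,
}

# The pair with the smallest distance is the one to fuse.
closest = min(dist, key=dist.get)
```

Here `closest` is the pair b and c, matching the fused node "c,b" on the next slide.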

31 Next Step
Then recalculate the distances from the remaining sequences (a and d) to the new node (e), and remove the fused nodes (c and b) from the table.
D(EA) = (D(AC) + D(AB) - D(CB))/2
D(ED) = (D(DC) + D(DB) - D(CB))/2
Updated distance table
     a    d    e
a    -    5    6
d         -    7
e              -
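The update can be written as a small function that applies the slide's formula to every remaining node. This is a sketch of the rule exactly as given on the slide; note that textbook UPGMA instead takes the arithmetic mean of the two old distances, (D(AC) + D(AB))/2, so the numbers below follow the slide rather than the textbook definition. The function name `fuse` and the dictionary layout are illustrative choices.

```python
def fuse(dist, x, y, new):
    """Fuse nodes x and y into `new`, using the slide's update rule:
    D(new, k) = (D(k, x) + D(k, y) - D(x, y)) / 2 for every other node k.
    """
    def d(p, q):
        return dist[(p, q)] if (p, q) in dist else dist[(q, p)]

    others = sorted({n for pair in dist for n in pair} - {x, y})
    # Keep only the distances that do not involve the fused nodes.
    reduced = {pair: v for pair, v in dist.items()
               if x not in pair and y not in pair}
    for k in others:
        reduced[(k, new)] = (d(k, x) + d(k, y) - d(x, y)) / 2
    return reduced

dist = {
    ("a", "b"): 8, ("a", "c"): 7, ("a", "d"): 5,
    ("b", "c"): 3, ("b", "d"): 9, ("c", "d"): 8,
}
reduced = fuse(dist, "b", "c", "e")
```

The result contains Dad = 5, Dae = 6 and Dde = 7, the values "5 6 7" of the updated table.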

32 Next Step
In order to get a tree, un-fuse c and b by calculating their distances to the new node (e).
The distances Dce and Dde are calculated independently (the formula will be given in the recitation).

33 We want to fuse the next closest nodes
In the updated table (distances 5, 6, 7) the closest pair is a and d, which are fused into a new node (f).

34 Finally
We need to calculate the distance between e and f.
D(EF) = (D(EA) + D(ED) - D(AD))/2 = (6 + 7 - 5)/2 = 4
[Figure: tree with internal nodes e and f and leaves a, b, c and d]
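Putting the selection and update steps in a loop gives the whole procedure walked through on these slides. The sketch below uses the slides' update rule (not the arithmetic mean of textbook UPGMA), ignores branch lengths, and names each fused node by the bracketed pair it contains; these are simplifications for illustration, and `cluster` is a hypothetical name.

```python
def cluster(table):
    """Repeatedly fuse the closest pair until two nodes remain.

    Returns the nested grouping as a string and the distance between
    the last two nodes.
    """
    dist = {frozenset(pair): float(v) for pair, v in table.items()}
    nodes = sorted(set().union(*dist))
    while len(nodes) > 2:
        x, y = sorted(min(dist, key=dist.get))   # selection step
        merged = f"({x},{y})"
        rest = [n for n in nodes if n not in (x, y)]
        for k in rest:                            # slide's update rule
            dist[frozenset((k, merged))] = (
                dist[frozenset((k, x))]
                + dist[frozenset((k, y))]
                - dist[frozenset((x, y))]
            ) / 2
        # Drop every distance that still involves a fused node.
        dist = {p: v for p, v in dist.items() if x not in p and y not in p}
        nodes = sorted(rest + [merged])
    x, y = nodes
    return f"({x},{y})", dist[frozenset((x, y))]

table = {
    ("a", "b"): 8, ("a", "c"): 7, ("a", "d"): 5,
    ("b", "c"): 3, ("b", "d"): 9, ("c", "d"): 8,
}
tree, last_distance = cluster(table)
```

On the slides' table this groups b with c (node e) and a with d (node f), and the final distance between the two groups is 4.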

35 [Figure: the final tree, with a and d joined at node f and c and b joined at node e]

36 IMPORTANT!
Usually we do not assume a constant mutation rate. In order to choose which nodes to fuse, we then have to calculate the relative distance of each node to all other nodes.
Neighbor Joining (NJ) is an algorithm suited to cases where the rate of evolution varies.

37 Neighbor Joining (NJ)
- Reconstructs an unrooted tree
- Calculates branch lengths
- Is based on pairwise distances
In each stage, the two nearest nodes of the tree are chosen and defined as neighbors in our tree. This is done recursively until all of the nodes are paired together.
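The slides do not give the formula for the "relative distance", so the sketch below uses the standard neighbor-joining criterion as an assumption: Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k), where n is the number of nodes, and the pair with the smallest Q is joined. The function name `q_matrix` and the data layout are illustrative; the distances are the ones from the UPGMA example.

```python
from itertools import combinations

def q_matrix(dist, names):
    """Standard NJ criterion: Q(i,j) = (n-2)*d(i,j) - r(i) - r(j),
    where r(i) is the sum of distances from i to every other node."""
    def d(p, q):
        return dist[(p, q)] if (p, q) in dist else dist[(q, p)]

    n = len(names)
    row_sum = {p: sum(d(p, q) for q in names if q != p) for p in names}
    return {
        (p, q): (n - 2) * d(p, q) - row_sum[p] - row_sum[q]
        for p, q in combinations(names, 2)
    }

dist = {
    ("a", "b"): 8, ("a", "c"): 7, ("a", "d"): 5,
    ("b", "c"): 3, ("b", "d"): 9, ("c", "d"): 8,
}
q = q_matrix(dist, ["a", "b", "c", "d"])
lowest = min(q.values())
neighbors = sorted(pair for pair, value in q.items() if value == lowest)
```

With only four sequences the two complementary pairs always tie, so here both (a, d) and (b, c) reach the minimum; either choice describes the same unrooted tree.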

38 Advantages and disadvantages of the neighbor-joining method
Advantages
- It is fast and thus suited for large datasets
- It permits lineages with largely different branch lengths
Disadvantages
- Sequence information is reduced
- It gives only one possible tree

39 Problems with phylogenetic trees
- Using different regions of the same alignment may produce different trees.

40 Problems with phylogenetic trees

41 Problems with phylogenetic trees
[Figure: trees of Bacillus, Burkholderias, Aeromonas, Pseudomonas, Lechevaliera, E.coli and Salmonella in which the same taxa are arranged differently]

42 Problems with phylogenetic trees
What to do ?

43 Bootstrapping
We create new data sets by sampling N positions with replacement. We generate multiple such pseudo-data sets. For each such data set we reconstruct a tree, using the same method. We note the agreement between the tree reconstructed from the pseudo-data set and the original tree.
Note: we do not change the number of sequences!
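The resampling step can be sketched as follows: columns (positions) of the alignment are drawn with replacement, while the rows (sequences) stay exactly as they are. The sketch assumes N is the alignment length and reuses the four sequences from the UPGMA example; the tree reconstruction and the comparison with the original tree are left out. The function name `bootstrap_alignment` and the seed are arbitrary.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement.

    The number of sequences and the alignment length are unchanged;
    only which columns appear (and how often) differs.
    """
    n_cols = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[c] for c in cols)
            for name, seq in alignment.items()}

alignment = {
    "a": "ACGCGTTGGGCGATGGCAAC",
    "b": "ACACATTGAGTGTGATCAAC",
    "c": "ACACATTGAGTGAGGACAAC",
    "d": "ACGCGTTGGGCGACGGTAAT",
}
rng = random.Random(0)  # fixed seed so the example is repeatable
replicate = bootstrap_alignment(alignment, rng)
```

Each replicate would then be passed to the same tree-building method, and the fraction of replicates that recover a given branch is what is reported as that branch's reliability.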

44 Bootstrapped tree
[Figure: bootstrapped tree with one branch marked as less reliable and another as highly reliable]

45 Open Questions
Do DNA and protein sequences from the same gene produce different trees? Can different genes have different evolutionary histories?

46 Tools for tree reconstruction
- CLUSTALX (NJ method)
- PHYLIP (PHYLogeny Inference Package): includes parsimony, distance matrix and likelihood methods, as well as bootstrapping
- PhyML (maximum likelihood method)
- MEGA (Molecular Evolutionary Genetics Analysis)
- More phylogeny programs
