Circumventing sequence alignment: Proportion of string length or vectorization identity as proxies for genetic similarity. Inês Soares, Ana Goios, António Amorim.


Circumventing sequence alignment: Proportion of string length or vectorization identity as proxies for genetic similarity
Inês Soares, Ana Goios, António Amorim
“Genome Anatomy”

State of the Art

Phylogenetic inference comprises three steps:
1) Retrieval of homologous sequences
2) Sequence comparison (the critical step)
3) Phylogenetic tree construction

Sequence comparison is still an unsolved task.

Traditionally, sequence comparison is based on sequence alignment.

Recent literature proposes different approaches to the alignment problem, which shows that the task has not yet been satisfactorily solved and remains a challenge. Why?
1. Sequence quality: documentation/annotation or sequencing errors
2. Uncertainty about homologous characters: only characters of common ancestry can be used to infer evolutionary history
3. Ambiguous evolutionary events: indels (insertions/deletions), mismatches and genomic rearrangements (such as inversions and duplications/replications)
4. Alignment is a highly time-consuming task

To avoid the classical alignment problem, alignment-free methods have been proposed.

Aims
- Compare human mtDNA sequences
- Test the current haplogroup classification
- Circumvent the problem of interpreting indels (by analyzing only coding sequences)
- Reduce running times

Material
- 104 complete human mtDNA sequences:

Haplogroup   Sequences
A            3
B            6
C            9
D            7
F            2
G            3
H            12
J            4
L0           8
L1           9
L2           7
L3           7
M            3
N            7
T            3
UK           4
V            4
W            2
X            2
Y            1
Z            1
Total        104

- Circular genome
- ≈ 16 kb
- 10% non-coding region: D-loop
- 90% coding region:
  - 13 protein-coding regions
  - 22 tRNA
  - 2 rRNA

Many ambiguous evolutionary events (indels)
Mutation model generating diversity
Possibility of biased or erroneous analysis/conclusions

To avoid the complexities and ambiguities that result from recurrence and insertion/deletion phenomena, and thus improve the evolutionary signal-to-noise ratio, the protein-coding regions were extracted and concatenated.

104 complete human mtDNA sequences (13 protein-coding regions, 22 tRNA, 2 rRNA), reduced to 104 human mtDNA protein-coding regions

Gene    Length
ND1     957 bp
ND2     [length lost in extraction]
CO1     [length lost in extraction]
CO2     684 bp
ATP8    207 bp
ATP6    681 bp
CO3     784 bp
ND3     346 bp
ND4L    297 bp
ND4     1378 bp
ND5     1812 bp
ND6     525 bp
CytB    1141 bp
Total   [length lost in extraction]

The resulting sequences all share the same length, which makes the analysis easier.
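A minimal Python sketch of the extract-and-concatenate step, assuming each protein-coding region is given as a 1-based inclusive (start, end) pair. The function name `concat_coding_regions` and the toy genome and coordinates are hypothetical; real coordinates would come from the annotation of each mtDNA sequence, and strand orientation is ignored here.

```python
def concat_coding_regions(genome: str, regions: list[tuple[int, int]]) -> str:
    """Join the given regions of `genome` into one string.

    `regions` holds 1-based inclusive (start, end) pairs, in the order
    in which the pieces should be concatenated.
    """
    return "".join(genome[start - 1:end] for start, end in regions)


# Toy example (hypothetical data, not real mtDNA coordinates).
toy_genome = "AACCGGTTAACC"
toy_regions = [(3, 4), (7, 10)]
coding = concat_coding_regions(toy_genome, toy_regions)
print(coding)
```

If every genome is cut at regions of identical lengths, all concatenated strings have the same length, which is the property the two methods rely on.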

Methods

1. Proportion of string length identity

I. Each sequence X is converted into a long number.
A – 1   C – 3   G – 5   T – 7
Example: CACTACAATCTTCGTAGGAACAACATATGA becomes 313713117377357155113113171751

II. Each pair of numbers X and Y is compared.
Example:
X = CACTACAATCTTCGTAGGAACAACATATGA
Y = CACTATAATCTTCCTAGGAACAACGTATGA
X = 313713117377357155113113171751 (lower number)
Y = 313717117377337155113113571751 (higher number)
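Steps I and II can be sketched in Python as follows. The coding A=1, C=3, G=5, T=7 and the two example sequences come from the slide; the function name `to_long_number` is an assumption made for illustration. Python integers have arbitrary precision, so a sequence of any length fits in one number.

```python
# Coding taken from the slide: A=1, C=3, G=5, T=7.
CODE = {"A": "1", "C": "3", "G": "5", "T": "7"}


def to_long_number(seq: str) -> int:
    """Convert a DNA string into one long integer, one digit per base."""
    return int("".join(CODE[base] for base in seq))


x = to_long_number("CACTACAATCTTCGTAGGAACAACATATGA")
y = to_long_number("CACTATAATCTTCCTAGGAACAACGTATGA")

# Step II: decide which number of the pair is higher and which is lower.
higher, lower = max(x, y), min(x, y)
print(higher)
print(lower)
```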

III. The identical extremal positions are determined.
Example: Higher number minus Lower number [digits lost in extraction]: 10 matches

IV. The identical internal positions are determined.
Example: Higher number minus Lower number [digits lost in extraction]: 20 matches, then 27 matches
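The subtraction on this slide did not survive extraction, so the following Python sketch is one possible reading of steps III and IV, not the authors' procedure. It subtracts the lower number from the higher one, pads the result to the sequence length, and counts leading and trailing zeros as extremal matches and the remaining zeros as internal matches. The function name `subtraction_zeros` is hypothetical.

```python
CODE = {"A": "1", "C": "3", "G": "5", "T": "7"}  # coding from the slides


def encode(seq: str) -> int:
    return int("".join(CODE[base] for base in seq))


def subtraction_zeros(x: str, y: str) -> tuple[int, int]:
    """Return (extremal, internal) counts of 0 digits in higher - lower."""
    if len(x) != len(y):
        raise ValueError("sequences must share the same length")
    a, b = encode(x), encode(y)
    if a == b:
        return len(x), 0  # identical sequences: every position matches
    diff = str(abs(a - b)).zfill(len(x))  # restore the leading zeros
    leading = len(diff) - len(diff.lstrip("0"))
    trailing = len(diff) - len(diff.rstrip("0"))
    internal = diff.strip("0").count("0")
    return leading + trailing, internal


extremal, internal = subtraction_zeros(
    "CACTACAATCTTCGTAGGAACAACATATGA",
    "CACTATAATCTTCCTAGGAACAACGTATGA",
)
print(extremal, extremal + internal)
```

For the slide's example pair this gives 10 extremal zeros and a running total of 20, which is consistent with the first two counts shown on the slide. The pair has 27 identical positions in total; the remaining seven appear as 9s in the difference because of borrowing, and the way the slide recovers them is not recoverable from the transcript, so it is not sketched here.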

V. The similarity between each pair of sequences X and Y is determined.
Example: [formula lost in extraction]

VI. The similarity between each pair of sequences X and Y is converted into a distance.
Example: [formula lost in extraction]
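The formulas for steps V and VI are missing from the transcript. One form that would be consistent with the method's name (a proportion of identical positions) is given below as an assumption, with m the number of matching positions and L the common sequence length; the authors' actual definitions may differ.

```latex
% Assumed definitions, not taken from the slides
s(X, Y) = \frac{m}{L}, \qquad d(X, Y) = 1 - s(X, Y)

% With the slide's example pair (m = 27, L = 30)
s = \frac{27}{30} = 0.9, \qquad d = 1 - 0.9 = 0.1
```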

2. Proportion of vectorization identity

I. Each sequence X is converted into a vector.
A – 1   C – 3   G – 5   T – 7
Example: CACTACAATCTTCGTAGGAACAACATATGA
[3,1,3,7,1,3,1,1,7,3,7,7,3,5,7,1,5,5,1,1,3,1,1,3,1,7,1,7,5,1]

II. The difference between each pair of vectors X and Y is determined.
Example:
X = CACTACAATCTTCGTAGGAACAACATATGA
Y = CACTATAATCTTCCTAGGAACAACGTATGA
X   = [3,1,3,7,1, 3,1,1,7,3,7,7,3,5,7,1,5,5,1,1,3,1,1,3, 1,7,1,7,5,1]
Y   = [3,1,3,7,1, 7,1,1,7,3,7,7,3,3,7,1,5,5,1,1,3,1,1,3, 5,7,1,7,5,1]
X-Y = [0,0,0,0,0,-4,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,-4,0,0,0,0,0]
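A short Python sketch of steps I and II of the vector method, using plain lists. The coding and the example sequences are from the slide; the function names `vectorize` and `difference` are assumptions made for illustration.

```python
CODE = {"A": 1, "C": 3, "G": 5, "T": 7}  # coding from the slide


def vectorize(seq: str) -> list[int]:
    """Convert a DNA string into a list with one integer per base."""
    return [CODE[base] for base in seq]


def difference(x: str, y: str) -> list[int]:
    """Element-wise X - Y for two sequences of the same length."""
    if len(x) != len(y):
        raise ValueError("sequences must share the same length")
    return [a - b for a, b in zip(vectorize(x), vectorize(y))]


diff = difference(
    "CACTACAATCTTCGTAGGAACAACATATGA",
    "CACTATAATCTTCCTAGGAACAACGTATGA",
)
# Report only the positions (0-based) where the two sequences differ.
print({i: d for i, d in enumerate(diff) if d != 0})
```

Unlike the long-number subtraction, each position is handled independently here, so no borrow can carry from one position to the next.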

III. The identical positions between each pair of vectors X and Y are determined.
Example:
  [3,1,3,7,1, 3,1,1,7,3,7,7,3,5,7,1,5,5,1,1,3,1,1,3, 1,7,1,7,5,1]
- [3,1,3,7,1, 7,1,1,7,3,7,7,3,3,7,1,5,5,1,1,3,1,1,3, 5,7,1,7,5,1]
= [0,0,0,0,0,-4,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,-4,0,0,0,0,0]
Runs of identical positions marked on the slide: 5 matches, 7 matches, 10 matches
Total: 27 matches

IV. The similarity between each pair of sequences X and Y is determined.
Example: [formula lost in extraction]

V. The similarity between each pair of sequences X and Y is converted into a distance.
Example: [formula lost in extraction]
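The counting in step III can be sketched in Python from the difference vector shown on the slide. The similarity and distance definitions used here (matches divided by length, and mismatches divided by length) are assumptions, since the slide's formulas were lost; the function names are hypothetical.

```python
from itertools import groupby

# Difference vector X - Y copied from the slide.
DIFF = [0, 0, 0, 0, 0, -4, 0, 0, 0, 0, 0, 0, 0, 2, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, -4, 0, 0, 0, 0, 0]


def zero_runs(diff: list[int]) -> list[int]:
    """Lengths of the consecutive runs of zeros (identical positions)."""
    return [len(list(run)) for is_zero, run in groupby(diff, key=lambda d: d == 0)
            if is_zero]


def similarity(diff: list[int]) -> float:
    """Assumed definition: proportion of positions whose difference is 0."""
    return diff.count(0) / len(diff)


def distance(diff: list[int]) -> float:
    """Assumed definition: proportion of mismatches, i.e. 1 - similarity."""
    return (len(diff) - diff.count(0)) / len(diff)


print(zero_runs(DIFF), DIFF.count(0))
print(similarity(DIFF), distance(DIFF))
```

The zeros fall into runs of 5, 7, 10 and 5 positions, which add up to the 27 matches reported on the slide.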

Advantages
1. These are simple methods to compare and cluster mtDNA sequences.
2. The methods run very fast.
   - The vectorial representation is faster than the numerical one.
3. The methods require an absolute minimum of assumptions about the mutation model generating diversity.
   - Discarding the possible “noise” enhances the analysis.
4. Both methods allow a full set of sequences to be fed in and analysed simultaneously (pairwise comparison of 104 sequences: a total of 5356 pairwise comparisons).
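The figure of 5356 comparisons is the number of unordered pairs among 104 sequences, 104 × 103 / 2. The Python sketch below checks that count and shows how a full set of equal-length sequences could be compared pair by pair. The distance used (proportion of mismatching positions) is an assumption, and the names `identity_distance` and `pairwise_distances` and the toy sequences are hypothetical.

```python
from itertools import combinations
from math import comb


def identity_distance(x: str, y: str) -> float:
    """Assumed distance: proportion of positions at which x and y differ."""
    if len(x) != len(y):
        raise ValueError("sequences must share the same length")
    mismatches = sum(a != b for a, b in zip(x, y))
    return mismatches / len(x)


def pairwise_distances(seqs: list[str]) -> dict[tuple[int, int], float]:
    """Distance for every unordered pair (i, j) with i < j."""
    return {(i, j): identity_distance(seqs[i], seqs[j])
            for i, j in combinations(range(len(seqs)), 2)}


print(comb(104, 2))  # number of pairs among 104 sequences

toy = ["ACGT", "ACGA", "TCGA"]  # toy data, not from the study
print(pairwise_distances(toy))
```

The resulting pair-to-distance mapping is the kind of input a distance-based tree method such as Neighbor Joining takes, once written in the format the chosen tool expects.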

Results

Neighbor Joining, MEGA version 4 (Tamura, Dudley, Nei, and Kumar 2007)
[Figure: Neighbor Joining tree; labels V and B]

The topology of this tree was compared to the canonical haplogroup classification and to a network constructed using the same sequence data.

Canonical Haplogroups Classification
[Figure: canonical haplogroup classification, with groups marked as most ancient haplogroups, “intermediate” haplogroups and most recent haplogroups; labels V and B]
Source: Human Mutation, Mutation in Brief #1039, 30:E386-E394 (2008)

Network
[Figure: network, with groups marked as most ancient haplogroups, “intermediate” haplogroups and most recent haplogroups]

Final Remarks
- Our inferred tree and the network clusters are in agreement.
- The most ancient and most recent haplogroups cluster according to the canonical haplogroup classification.
- “Intermediate” haplogroups are not grouped in the same way as in the canonical haplogroup classification.

We ask: are the haplogroup classification criteria well defined?

Future Perspectives
- Combine the two methods so that they can be applied to sequences of different lengths
- Incorporate evolutionary phenomena other than mismatches, such as indels, into the study
- Test the current criteria for the classification of human haplogroups

Acknowledgements
(grant SFRH/BD/38171/2007 and POCI 2010)
- Prof. Dr. António Guedes de Oliveira
- Nádia Pinto
- Population Genetics Group

Example: A – 1   C – 3   G – 5   T – 7
X = ATTCC   X = 17733 (higher number)
Y = AGTCG   Y = 15735 (lower number)
With this coding, a 0 in the result of the subtraction is always a match.

Example: A – 1   C – 2   G – 3   T – 4
X = ATTCC   X = 14422 (higher number)
Y = AGTCG   Y = 13423 (lower number)
With this coding, a difference is hidden by the operation.
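The contrast between the two codings can be checked with a short Python sketch. It subtracts the two long numbers, pads the result to the sequence length, and lists the positions where the result shows a 0 although the bases differ. The names `padded_difference` and `hidden_mismatches` are hypothetical; the codings and sequences are those of the slide.

```python
ODD = {"A": 1, "C": 3, "G": 5, "T": 7}          # coding used by the method
CONSECUTIVE = {"A": 1, "C": 2, "G": 3, "T": 4}  # alternative shown on the slide


def encode(seq: str, code: dict[str, int]) -> int:
    return int("".join(str(code[base]) for base in seq))


def padded_difference(x: str, y: str, code: dict[str, int]) -> str:
    """Higher number minus lower number, zero-padded to the sequence length."""
    return str(abs(encode(x, code) - encode(y, code))).zfill(len(x))


def hidden_mismatches(x: str, y: str, code: dict[str, int]) -> list[int]:
    """0-based positions where the difference digit is 0 but the bases differ."""
    diff = padded_difference(x, y, code)
    return [i for i, (a, b, d) in enumerate(zip(x, y, diff))
            if d == "0" and a != b]


X, Y = "ATTCC", "AGTCG"
print(padded_difference(X, Y, ODD), hidden_mismatches(X, Y, ODD))
print(padded_difference(X, Y, CONSECUTIVE), hidden_mismatches(X, Y, CONSECUTIVE))
```

With the consecutive coding, the T/G mismatch at the second position differs by 1, and a borrow from the position to its right turns that digit into 0, so the mismatch disappears from the result. With the odd coding, two different bases always differ by an even amount, so a borrow of 1 can never cancel the difference. The reverse does not hold: a nonzero digit such as 9 can still sit on a matching position.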