
1 Circumventing sequence alignment: Proportion of string length or vectorization identity as proxies for genetic similarity
Inês Soares, Ana Goios, António Amorim
“Genome Anatomy”

2 State of the Art
Phylogenetic inference comprises three steps:
1) Retrieval of homologous sequences
2) Sequence comparison
3) Phylogenetic tree construction
Critical step: sequence comparison, still an unsolved task.

3 Traditionally, sequence comparison is based on sequence alignment. Recent literature shows different approaches to the alignment problem, which indicates that this task is not yet satisfactorily solved and remains a challenge. Why?
1. The quality of the sequences: documentation/annotation or sequencing errors
2. Uncertainty of homologous characters: only characters of common ancestry can be used to infer the evolutionary history
3. Ambiguous evolutionary events: indels (insertions/deletions), mismatches and genomic rearrangements (such as inversions and duplications/replications)
4. A heavily time-consuming task

4 To avoid the classical alignment problem, alignment-free methods have been proposed.

5 Aims
- Compare human mtDNA sequences
- Test the current haplogroup classification
- Circumvent the indel interpretation problem (by analyzing only coding sequences)
- Reduce running times

6 Material
104 complete human mtDNA sequences:
Haplogroup   Sequences
A            3
B            6
C            9
D            7
F            2
G            3
H            12
J            4
L0           8
L1           9
L2           7
L3           7
M            3
N            7
T            3
UK           4
V            4
W            2
X            2
Y            1
Z            1

7
- Circular genome
- ≈ 16 kb
- 10% non-coding region: D-loop
- 90% coding region:
  - 13 protein-coding regions
  - 22 tRNA
  - 2 rRNA
Many ambiguous evolutionary events (indels)
Mutation model generating diversity
Possibility of biased or erroneous analysis/conclusions
http://herkules.oulu.fi/isbn9514268490/html/c347.html

8 To avoid the complexities and ambiguities that result from recurrence and insertion/deletion phenomena, and thus improve the evolutionary signal-to-noise ratio, the protein-coding regions were extracted and concatenated.

9 104 complete human mtDNA sequences (13 protein-coding regions, 22 tRNA, 2 rRNA)
Protein-coding regions extracted:
Region   Length (bp)
ND1      957
ND2      1044
CO1      1542
CO2      684
ATP8     207
ATP6     681
CO3      784
ND3      346
ND4L     297
ND4      1378
ND5      1812
ND6      525
CytB     1141
104 human mtDNA protein-coding sequences, 11344 bp
The resulting sequences share the same length, which makes the analysis easier.

10 Methods
1. Proportion of string length identity
I. Each sequence X is converted into a long number, using the code A – 1, C – 3, G – 5, T – 7.
Example:
CACTACAATCTTCGTAGGAACAACATATGA
313713117377357155113113171751
II. Each pair of numbers X and Y is compared.
Example:
X = CACTACAATCTTCGTAGGAACAACATATGA
Y = CACTATAATCTTCCTAGGAACAACGTATGA
X = 313713117377357155113113171751 (lower number)
Y = 313717117377337155113113571751 (higher number)
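Steps I and II can be sketched in a few lines of Python. This is only an illustration of the digit code shown on the slide; the function name `encode` is hypothetical and not taken from the presentation.

```python
# Digit code from the slide: A=1, C=3, G=5, T=7.
CODE = {"A": "1", "C": "3", "G": "5", "T": "7"}


def encode(seq: str) -> int:
    """Turn a DNA string into one long integer, one digit per base."""
    return int("".join(CODE[base] for base in seq.upper()))


x = encode("CACTACAATCTTCGTAGGAACAACATATGA")
y = encode("CACTATAATCTTCCTAGGAACAACGTATGA")

# Step II: decide which number is the lower and which the higher one.
lower, higher = sorted((x, y))
print(lower)   # X, the lower number
print(higher)  # Y, the higher number
```

Python integers have arbitrary precision, so a sequence of thousands of bases can still be held as a single number, although other languages would need a big-integer type.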

11 III. The identical extremal positions are determined.
Example:
Higher number     313717117377337155113113571751
Lower number    – 313713117377357155113113171751
                  000003999999980000000000400000
5 leading zeros and 5 trailing zeros: 5 matches at each end, 10 matches so far
IV. The identical internal positions are determined.
Example:
Higher number     7117377337155113113
Lower number    – 3117377357155113113
                  3999999980000000000
10 trailing zeros: 10 matches, 20 matches so far
Higher number     71173773
Lower number    – 31173773
                  40000000
7 trailing zeros: 7 matches, 27 matches in total
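The subtract-and-trim procedure of steps III and IV can be sketched as follows. This is a reconstruction from the worked example, not the authors' code: the name `count_matches_numeric` is hypothetical, and as a simplification each round trims the mismatching digit at both ends, whereas the slide keeps the leading one. The match count is the same.

```python
CODE = {"A": "1", "C": "3", "G": "5", "T": "7"}


def count_matches_numeric(x: str, y: str) -> int:
    """Count identical positions by repeated subtraction of the digit strings."""
    if len(x) != len(y):
        raise ValueError("sequences must share the same length")
    a = "".join(CODE[base] for base in x.upper())
    b = "".join(CODE[base] for base in y.upper())
    matches = 0
    while a:
        hi, lo = max(int(a), int(b)), min(int(a), int(b))
        if hi == lo:
            # What is left is identical, so every remaining position matches.
            matches += len(a)
            break
        diff = str(hi - lo).zfill(len(a))
        lead = len(diff) - len(diff.lstrip("0"))   # zeros at the left end
        trail = len(diff) - len(diff.rstrip("0"))  # zeros at the right end
        matches += lead + trail
        # Drop the matched ends and the mismatching digit next to each of them.
        a = a[lead + 1 : len(a) - trail - 1]
        b = b[lead + 1 : len(b) - trail - 1]
    return matches


x = "CACTACAATCTTCGTAGGAACAACATATGA"
y = "CACTATAATCTTCCTAGGAACAACGTATGA"
print(count_matches_numeric(x, y))  # 27, as on the slide
```

Only the zeros at the two ends are counted in each round because, after a borrow, matching positions in the middle of the difference show up as 9 rather than 0.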

12 V. The similarity between each pair of sequences X and Y is determined.
Example: (formula not preserved in this transcript)
VI. The similarity between each pair of sequences X and Y is converted into a distance.
Example: (formula not preserved in this transcript)
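The formulas for steps V and VI did not survive in the transcript. The sketch below is therefore an assumption based on the method's name ("proportion of string length identity"): similarity is taken as matches divided by sequence length, and distance as one minus similarity. Both function names are hypothetical.

```python
def similarity(matches: int, length: int) -> float:
    """Assumed similarity: proportion of identical positions."""
    if length <= 0:
        raise ValueError("length must be positive")
    return matches / length


def distance(matches: int, length: int) -> float:
    """Assumed distance: complement of the similarity."""
    return 1.0 - similarity(matches, length)


# The 30-base example pair has 27 identical positions.
print(similarity(27, 30))
print(distance(27, 30))
```

If the authors used a different transformation, only these two functions would change; the match counting is unaffected.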

13 2. Proportion of vectorization identity
I. Each sequence X is converted into a vector, using the code A – 1, C – 3, G – 5, T – 7.
Example:
CACTACAATCTTCGTAGGAACAACATATGA
[3,1,3,7,1,3,1,1,7,3,7,7,3,5,7,1,5,5,1,1,3,1,1,3,1,7,1,7,5,1]
II. The difference between each pair of vectors X and Y is determined.
Example:
X = CACTACAATCTTCGTAGGAACAACATATGA
Y = CACTATAATCTTCCTAGGAACAACGTATGA
X   = [3,1,3,7,1, 3,1,1,7,3,7,7,3,5,7,1,5,5,1,1,3,1,1,3, 1,7,1,7,5,1]
Y   = [3,1,3,7,1, 7,1,1,7,3,7,7,3,3,7,1,5,5,1,1,3,1,1,3, 5,7,1,7,5,1]
X-Y = [0,0,0,0,0,-4,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,-4,0,0,0,0,0]

14 III. The identical positions between each pair of vectors X and Y are determined.
Example:
  [3,1,3,7,1, 3,1,1,7,3,7,7,3,5,7,1,5,5,1,1,3,1,1,3, 1,7,1,7,5,1]
- [3,1,3,7,1, 7,1,1,7,3,7,7,3,3,7,1,5,5,1,1,3,1,1,3, 5,7,1,7,5,1]
  [0,0,0,0,0,-4,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,-4,0,0,0,0,0]
Runs of zeros: 5 matches, 7 matches, 10 matches and 5 matches, 27 matches in total
IV. The similarity between each pair of sequences X and Y is determined.
Example: (formula not preserved in this transcript)
V. The similarity between each pair of sequences X and Y is converted into a distance.
Example: (formula not preserved in this transcript)
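The vector method reduces to an element-wise subtraction followed by a count of zeros. A minimal sketch in plain Python follows; the function names are hypothetical, and an array library could replace the list comprehensions.

```python
CODE = {"A": 1, "C": 3, "G": 5, "T": 7}


def to_vector(seq: str) -> list[int]:
    """Step I: one integer per base."""
    return [CODE[base] for base in seq.upper()]


def vector_difference(x: str, y: str) -> list[int]:
    """Step II: element-wise difference X - Y."""
    vx, vy = to_vector(x), to_vector(y)
    if len(vx) != len(vy):
        raise ValueError("sequences must share the same length")
    return [a - b for a, b in zip(vx, vy)]


def count_matches_vector(x: str, y: str) -> int:
    """Step III: a zero in the difference marks an identical position."""
    return sum(1 for d in vector_difference(x, y) if d == 0)


x = "CACTACAATCTTCGTAGGAACAACATATGA"
y = "CACTATAATCTTCCTAGGAACAACGTATGA"
print(vector_difference(x, y))
print(count_matches_vector(x, y))  # 27, as on the slide
```

Unlike the long-number subtraction, there is no borrow between positions here, so every match appears as a zero after a single pass.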

15 Advantages
1. These are simple methods to compare and cluster mtDNA sequences.
2. The present methods run very fast.
   - The vectorial representation is faster than the numerical one.
3. These methods require an absolute minimum of assumptions about the mutation model generating diversity.
   - Discarding the possible “noise” enhances the analysis.
4. Both methods allow the simultaneous input and analysis of a full set of sequences (pairwise comparison of 104 sequences: 5356 pairwise comparisons in total).
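The figure of 5356 comparisons is the number of unordered pairs among 104 sequences, 104 × 103 / 2. The sketch below checks that count and shows how a full set could be compared in one pass. The toy sequences and names are hypothetical, and the distance is assumed to be one minus the proportion of identical positions.

```python
from itertools import combinations
from math import comb


def identity_distance(x: str, y: str) -> float:
    """Assumed distance: 1 minus the proportion of identical positions."""
    if len(x) != len(y):
        raise ValueError("sequences must share the same length")
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return 1.0 - matches / len(x)


def pairwise_distances(seqs: dict[str, str]) -> dict[tuple[str, str], float]:
    """Distance for every unordered pair of named sequences."""
    return {
        (a, b): identity_distance(seqs[a], seqs[b])
        for a, b in combinations(seqs, 2)
    }


# Toy input; the real data set has 104 sequences of equal length.
toy = {"s1": "ATTCC", "s2": "AGTCG", "s3": "ATTCG"}
distances = pairwise_distances(toy)
n_pairs_104 = comb(104, 2)
print(n_pairs_104)  # 5356
```

The resulting dictionary can be reshaped into the distance matrix that a tree-building program expects as input.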

16 Results
Neighbor Joining, MEGA version 4 (Tamura, Dudley, Nei, and Kumar 2007)
[Figure: inferred tree; labels V and B visible]
The topology of this tree was compared to the canonical haplogroup classification and to a network constructed using the same sequence data.

17 Canonical Haplogroups Classification
[Figure: canonical haplogroup classification, marked as most ancient haplogroups, “intermediate” haplogroups and most recent haplogroups; labels V and B visible]
HUMAN MUTATION Mutation in Brief #1039, 30:E386-E394, (2008)

18 Network
[Figure: network, marked as most ancient haplogroups, “intermediate” haplogroups and most recent haplogroups]
http://www.fluxus-engineering.com/sharenet.htm

19 Final Remarks
- Our inferred tree and the Network clusters are in agreement.
- The most ancient and most recent haplogroups are clustered according to the Canonical Haplogroups Classification.
- “Intermediate” haplogroups are not grouped in the same way as in the Canonical Haplogroups Classification.
We ask: are the haplogroup classification criteria well defined?

20 Future Perspectives
- Combine the two methods developed so that they can be applied to sequences of different lengths
- Incorporate other evolutionary phenomena beyond mismatches, such as indels, in the study
- Test the current criteria for the classification of human haplogroups

21 Acknowledgements
- Prof. Dr. António Guedes de Oliveira
- Nádia Pinto
- Population Genetics Group
(grant SFRH/BD/38171/2007 and POCI 2010)

22 Example: A – 1, C – 3, G – 5, T – 7
X = ATTCC, X = 17733 (higher number)
Y = AGTCG, Y = 15735 (lower number)
  17733
– 15735
  01998
A 0 in the result is always a match.
Example: A – 1, C – 2, G – 3, T – 4
X = ATTCC, X = 14422 (higher number)
Y = AGTCG, Y = 13423 (lower number)
  14422
– 13423
  00999
A difference is hidden by the operation: the second 0 falls on a position where X has T and Y has G.
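The two examples above can be replayed in code to see why the odd digits 1, 3, 5, 7 were chosen. The sketch subtracts the encoded numbers under each code and lists the positions where a 0 appears although the bases differ. The helper names are hypothetical.

```python
ODD_CODE = {"A": 1, "C": 3, "G": 5, "T": 7}
CONSECUTIVE_CODE = {"A": 1, "C": 2, "G": 3, "T": 4}


def subtraction_digits(x: str, y: str, code: dict[str, int]) -> str:
    """Higher minus lower encoded number, zero-padded to the sequence length."""
    nx = int("".join(str(code[base]) for base in x))
    ny = int("".join(str(code[base]) for base in y))
    return str(abs(nx - ny)).zfill(len(x))


def hidden_differences(x: str, y: str, code: dict[str, int]) -> list[int]:
    """0-based positions where the result shows 0 but the bases differ."""
    digits = subtraction_digits(x, y, code)
    return [
        i for i, (d, a, b) in enumerate(zip(digits, x, y))
        if d == "0" and a != b
    ]


x, y = "ATTCC", "AGTCG"
print(subtraction_digits(x, y, ODD_CODE))          # 01998
print(subtraction_digits(x, y, CONSECUTIVE_CODE))  # 00999
print(hidden_differences(x, y, ODD_CODE))          # []
print(hidden_differences(x, y, CONSECUTIVE_CODE))  # [1]
```

With odd digits, two different bases always differ by an even amount, and a borrow can only change that by one, so a mismatch can never produce a 0. With consecutive digits a difference of 1 minus a borrow of 1 gives 0, which is the hidden difference shown above.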

