Reconstructing phylogenetic trees for ultra-large unaligned DNA sequences with Hadoop
Quan Zou (Ph.D. & Prof.), Tianjin Univ., School of Computer
Background: why
Phylogenetic Tree, Genome-Genome, Gene-Gene, Population, Model, Computation
Background: challenge
–Multiple sequence alignment
–Phylogenetic tree
Flow
Flow---Clustering
Sampling
Flow---MSA
A Trie Tree for a Sequence
More tricks in MSA
input sequences, trie trees, search, sum up, update, final result
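The trie tree step can be illustrated with a minimal sketch. It assumes the center sequence is cut into fixed-size segments, the segments are stored in a trie, and every other sequence is scanned for exact occurrences of those segments. The names `build_trie`, `split_segments`, and `find_matches`, and the fixed segment size, are hypothetical and are not taken from the slides or from HAlign itself.

```python
class TrieNode:
    """One node of the trie; children are keyed by nucleotide."""
    def __init__(self):
        self.children = {}
        self.segment_ids = []  # indices of segments that end at this node


def split_segments(center, size):
    """Cut the center sequence into non-overlapping segments of `size` (assumed scheme)."""
    return [center[i:i + size] for i in range(0, len(center) - size + 1, size)]


def build_trie(segments):
    """Insert every segment into a trie and return the root."""
    root = TrieNode()
    for idx, seg in enumerate(segments):
        node = root
        for ch in seg:
            node = node.children.setdefault(ch, TrieNode())
        node.segment_ids.append(idx)
    return root


def find_matches(root, sequence):
    """Return (segment_index, start_position) for each exact segment occurrence."""
    hits = []
    for start in range(len(sequence)):
        node = root
        for pos in range(start, len(sequence)):
            node = node.children.get(sequence[pos])
            if node is None:
                break  # no segment continues with this character
            for seg_id in node.segment_ids:
                hits.append((seg_id, start))
    return hits


segments = split_segments("ACGTTGCA", 4)
trie = build_trie(segments)
matches = find_matches(trie, "GGACGTATGCA")
```

The matched positions would then anchor the pairwise alignment, so that only the regions between matches need dynamic programming. This restart-at-every-position scan is the simplest form; an Aho-Corasick automaton would avoid the restarts.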
Experiments
Data
–Human mtGenome
–16s rRNA
Measurement
–Running time
–Average SP score (For MSA)

dataset             max length  min length  average length  sequence number  file size
mt genome (1x)                                                                MB
mt genome (20x)                                                               MB
mt genome (50x)                                                               MB
mt genome (100x)                                                              GB
16s rRNA (small)                                                              MB
16s rRNA (big)                                                                GB
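The average SP (sum-of-pairs) score listed under Measurement can be sketched as follows. This is a minimal illustration under stated assumptions: a match or a gap-gap pair costs 0, a mismatch and a residue-gap pair each cost 1, and the total is divided by the number of sequences. The slides do not state the scoring values or the normalisation, so both are assumptions.

```python
from itertools import combinations


def sp_score(alignment, mismatch=1, gap=1):
    """Sum-of-pairs penalty over all columns of an alignment (lower is better)."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned rows must have the same length")
    total = 0
    for col in range(length):
        column = [row[col] for row in alignment]
        for a, b in combinations(column, 2):
            if a == b:
                continue  # match or gap-gap pair: no penalty (assumed)
            total += gap if "-" in (a, b) else mismatch
    return total


def average_sp(alignment, **penalties):
    """SP penalty divided by the number of sequences (assumed normalisation)."""
    return sp_score(alignment, **penalties) / len(alignment)


example = ["ACGT", "A-GT", "ACCT"]
total = sp_score(example)
```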
Experiments---phylogenetic tree

                   1x              20x             50x         100x
HPTree             1 m 12 s        3 m 18 s        14 m 28 s   44 m 17 s
IQ-TREE            13 m 7 s        18 m 4 s        39 m 43 s   67 m 3 s
IQ-TREE (8-core)   9 m 39 s        12 m 27 s       26 m        56 m 7 s
phangorn           40 s            More than 3 h   ---
RAxML              33 m 3 s        More than 8 h   ---
STELLS             More than 1 h   ---

                   SmallSet        BigSet
HPTree             207 m 44 s      More than 24 h
IQ-TREE            ---
Experiments---MSA (mtDNA)
Datasets: 10 M (1X), 213 M (20X), 532 M (50X), 1.1G (100X)
HAlign (Trie Tree): 3 m 16 s
HAlign (Hadoop): 2 m 21 s, 10 m 53 s, 14 m 14 s, 28 m 28 s
MAFFT: 1 m 41 s, 175 m, 984 m
KAlign: 170 m 44 s

Datasets: M (1X), 213 M (20X), 532 M (50X), 1.1G (100X)
HAlign (Trie Tree)
HAlign (Hadoop): 191
MAFFT
KAlign
Experiments---MSA (16s rRNA)
Datasets: M, 1.4G
HAlign: 54 m 32 s, 199 m 35 s
MAFFT: 3584 m 52 s

Datasets: M, 1.4G
HAlign
MAFFT
Best Alignment
Experiments
Running time comparison between aligned and unaligned data
Software
Quan Zou, et al. HAlign: Fast Multiple Similar DNA/RNA Sequence Alignment based on Center Star Strategy. Bioinformatics. Doi: /bioinformatics/btv177.
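The center star strategy named in the HAlign title can be sketched in its textbook form: pick as center the sequence with the smallest total distance to all others, then align every other sequence to it. The sketch below covers only the center selection and uses plain edit distance as the pairwise measure. Both choices are assumptions for illustration; HAlign may select its center differently.

```python
def edit_distance(a, b):
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


def pick_center(seqs):
    """Index of the sequence with the smallest summed distance to all others."""
    totals = [sum(edit_distance(s, t) for t in seqs) for s in seqs]
    return min(range(len(seqs)), key=totals.__getitem__)


center_index = pick_center(["ACGT", "ACGA", "TCGT"])
```

Selecting the center this way needs all pairwise distances, which is quadratic in the number of sequences; for very similar sequences a fixed choice of center would avoid that cost.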
Discussion
Summary
–MSA with Hadoop
–NJ phylogenetic tree with Hadoop
From DNA to Protein
RNA secondary structure is ignored
Several complex issues in evolution are ignored
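For reference, one step of the standard neighbor-joining (NJ) method mentioned in the summary can be sketched as below. It computes the Q criterion Q(i, j) = (n - 2) d(i, j) - r_i - r_j, where r_i is the row sum of the distance matrix, picks the pair with the smallest Q, and returns the branch lengths from the two joined taxa to their new parent node. The function name `nj_pick_pair` is hypothetical, and nothing here reflects how the Hadoop version distributes the work.

```python
def nj_pick_pair(dist):
    """Pick the pair to join in one NJ step; needs a symmetric matrix with n >= 3."""
    n = len(dist)
    row_sums = [sum(row) for row in dist]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * dist[i][j] - row_sums[i] - row_sums[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    _, i, j = best
    # Branch lengths from taxa i and j to the newly created internal node.
    len_i = dist[i][j] / 2 + (row_sums[i] - row_sums[j]) / (2 * (n - 2))
    len_j = dist[i][j] - len_i
    return i, j, len_i, len_j


# Small five-taxon distance matrix used only as an illustration.
d = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
pair = nj_pick_pair(d)
```

A full NJ run repeats this step, replacing the joined pair with the new node and updating the distances, until only two nodes remain.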