CS 581 / BIOE 540: Algorithmic Computational Genomics

Slides:

Advertisements

Similar presentations

CS 598AGB What simulations can tell us. Questions that simulations cannot answer Simulations are on finite data. Some questions (e.g., whether a method.

Advertisements

Estimating species trees from multiple gene trees in the presence of ILS Tandy Warnow Joint work with Siavash Mirarab, Md. S. Bayzid, and others.

CS/BIOE 598: Algorithmic Computational Genomics Tandy Warnow Departments of Bioengineering and Computer Science

From Gene Trees to Species Trees Tandy Warnow The University of Texas at Austin.

Phylogenomics Symposium and Software School Co-Sponsored by the SSB and NSF grant

From Gene Trees to Species Trees Tandy Warnow The University of Texas at Austin.

New methods for inferring species trees in the presence of incomplete lineage sorting Tandy Warnow The University of Illinois.

From Gene Trees to Species Trees Tandy Warnow The University of Texas at Austin.

CS 173, Lecture B August 25, 2015 Professor Tandy Warnow.

New methods for inferring species trees in the presence of incomplete lineage sorting Tandy Warnow The University of Illinois.

CS 394C Algorithms for Computational Biology Tandy Warnow Spring 2012.

Orangutan GorillaChimpanzee Human From the Tree of the Life Website, University of Arizona Species Tree.

New methods for estimating species trees from genome-scale data Tandy Warnow The University of Illinois.

New methods for estimating species trees from genome-scale data Tandy Warnow The University of Illinois.

From Gene Trees to Species Trees Tandy Warnow The University of Texas at Austin.

Algorithmic research in phylogeny reconstruction Tandy Warnow The University of Texas at Austin.

394C, October 2, 2013 Topics: Multiple Sequence Alignment

SEPP and TIPP for metagenomic analysis Tandy Warnow Department of Computer Science University of Texas.

BBCA: Improving the scalability of *BEAST using random binning Tandy Warnow The University of Illinois at Urbana-Champaign Co-authors: Theo Zimmermann.

Constructing the Tree of Life: Divide-and-Conquer! Tandy Warnow University of Illinois at Urbana-Champaign.

Using Divide-and-Conquer to Construct the Tree of Life Tandy Warnow University of Illinois at Urbana-Champaign.

SupreFine, a new supertree method Shel Swenson September 17th 2009.

The Mathematics of Estimating the Tree of Life Tandy Warnow The University of Illinois.

CS 395T: Computational phylogenetics January 18, 2006 Tandy Warnow.

CS/BIOE 598: Algorithmic Computational Genomics Tandy Warnow Departments of Bioengineering and Computer Science

Sequence alignment CS 394C: Fall 2009 Tandy Warnow September 24, 2009.

SEPP and TIPP for metagenomic analysis Tandy Warnow Department of Computer Science University of Texas.

Progress and Challenges for Large-Scale Phylogeny Estimation Tandy Warnow Departments of Computer Science and Bioengineering Carl R. Woese Institute for.

New methods for inferring species trees in the presence of incomplete lineage sorting Tandy Warnow The University of Illinois.

New methods for inferring species trees in the presence of incomplete lineage sorting Tandy Warnow The University of Illinois.

Advancing Genome-Scale Phylogenomic Analysis Tandy Warnow Departments of Computer Science and Bioengineering Carl R. Woese Institute for Genomic Biology.

394C: Algorithms for Computational Biology Tandy Warnow Jan 25, 2012.

CS 466 and BIOE 498: Introduction to Bioinformatics

Constrained Exact Optimization in Phylogenetics

Advances in Ultra-large Phylogeny Estimation

Phylogenetic basis of systematics

New Approaches for Inferring the Tree of Life

394C, Spring 2012 Jan 23, 2012 Tandy Warnow.

Chalk Talk Tandy Warnow

Multiple Sequence Alignment Methods

Tandy Warnow Department of Computer Sciences

Challenges in constructing very large evolutionary trees

Techniques for MSA Tandy Warnow.

Algorithm Design and Phylogenomics

Mathematical and Computational Challenges in Reconstructing Evolution

Tandy Warnow The University of Illinois

New methods for simultaneous estimation of trees and alignments

Large-Scale Multiple Sequence Alignment

Mathematical and Computational Challenges in Reconstructing Evolution

CS 581 Tandy Warnow.

CS 581 Algorithmic Computational Genomics

Tandy Warnow Founder Professor of Engineering

Tandy Warnow Department of Computer Sciences

New methods for simultaneous estimation of trees and alignments

Texas, Nebraska, Georgia, Kansas

Ultra-Large Phylogeny Estimation Using SATé and DACTAL

Recent Breakthroughs in Mathematical and Computational Phylogenetics

CS 394C: Computational Biology Algorithms

September 1, 2009 Tandy Warnow

Taxonomic identification and phylogenetic profiling

Algorithms for Inferring the Tree of Life

Sequence alignment CS 394C Tandy Warnow Feb 15, 2012.

Tandy Warnow The University of Texas at Austin

Tandy Warnow The University of Texas at Austin

New methods for simultaneous estimation of trees and alignments

Ultra-large Multiple Sequence Alignment

New methods for estimating species trees from gene trees

Advances in Phylogenomic Estimation

Advances in Phylogenomic Estimation

Presentation transcript:

CS 581 / BIOE 540: Algorithmic Computational Genomics Tandy Warnow Departments of Bioengineering and Computer Science http://tandy.cs.illinois.edu

Course Details Office hours: Tuesdays 12:30-1:30 (Siebel 3235) Course webpage: http://tandy.cs.illinois.edu/581-2017.html Textbook: Computational Phylogenetics, available for download at http://tandy.cs.illinois.edu/textbook.pdf TA: Pranjal Vachaspati (to be confirmed)

Today •  Describe some important problems in computational biology, for which students in this course could develop improved methods •  Explain how the course will be run •  Answer questions

This Course Topics: computational and statistical problems in sequence analysis (e.g., multiple sequence alignment, phylogeny estimation, metagenomics, etc.). Focus: understanding the mathematical foundations, and designing algorithms with outstanding accuracy and speed on large, complex datasets. This is not a course about how to use the tools.

Prerequisites No background in biology is needed. However, the course has the following prerequisites: CS 374: computational complexity, algorithm design techniques, and proving theorems about algorithms CS 361: probability and statistics By recursion, CS 225: programming

If you haven’t satisfied the pre-reqs: You need permission to stay in the course. The first homework is due (by email) on Saturday at 1 PM. See the homework webpage http://tandy.cs.illinois.edu/cs581-2017-hw.html Then make an appointment to see me to review the homework.

This course Phylogeny estimation based on stochastic models of sequence evolution and genome evolution Multiple sequence alignment Applications to metagenomics, protein structure prediction, and other biological problems

Species Tree Orangutan Gorilla Chimpanzee Human From the Tree of the Life Website, University of Arizona

Evolution informs about everything in biology •  Big genome sequencing projects just produce data -‐-‐ so what? •  Evolutionary history relates all organisms and genes, and helps us understand and predict –  interactions between genes (genetic networks) –  drug design –  predicting functions of genes –  inﬂuenza vaccine development –  origins and spread of disease –  origins and migrations of humans

Constructing the Tree of Life: Hard Computational Problems NP-hard problems Large datasets 100,000+ sequences thousands of genes “Big data” complexity: model misspecification fragmentary sequences errors in input data streaming data

Phylogenomic pipeline Select taxon set and markers Gather and screen sequence data, possibly identify orthologs Compute multiple sequence alignments for each locus Compute species tree or network: Compute gene trees on the alignments and combine the estimated gene trees, OR Estimate a tree from a concatenation of the multiple sequence alignments Get statistical support on each branch (e.g., bootstrapping) Estimate dates on the nodes of the phylogeny Use species tree with branch support and dates to understand biology

Phylogenomic pipeline Select taxon set and markers Gather and screen sequence data, possibly identify orthologs Compute multiple sequence alignments for each locus Compute species tree or network: Compute gene trees on the alignments and combine the estimated gene trees, OR Estimate a tree from a concatenation of the multiple sequence alignments Get statistical support on each branch (e.g., bootstrapping) Estimate dates on the nodes of the phylogeny Use species tree with branch support and dates to understand biology

Research Strategies Improved algorithms through: Statistical modelling Divide-and-conquer “Bin-and-conquer” Iteration Bayesian statistics Hidden Markov Models Graph theory Combinatorial optimization Statistical modelling Massive Simulations High Performance Computing

Avian Phylogenomics Project Erich Jarvis, HHMI MTP Gilbert, Copenhagen G Zhang, BGI T. Warnow UT-Austin S. Mirarab Md. S. Bayzid, UT-Austin UT-Austin Plus many many other people… Approx. 50 species, whole genomes 14,000 loci Science, December 2014 (Jarvis, Mirarab, et al., and Mirarab et al.)

1kp: Thousand Transcriptome Project G. Ka-Shu Wong U Alberta J. Leebens-Mack U Georgia N. Wickett Northwestern N. Matasci iPlant T. Warnow, S. Mirarab, N. Nguyen, UIUC UT-Austin UT-Austin Plus many many other people… Plant Tree of Life based on transcriptomes of ~1200 species More than 13,000 gene families (most not single copy) First paper: PNAS 2014 (~100 species and ~800 loci) Gene Tree Incongruence Upcoming Challenges (~1200 species, ~400 loci)

DNA Sequence Evolution -3 mil yrs -2 mil yrs -1 mil yrs today AAGACTT TGGACTT AAGGCCT AGGGCAT TAGCCCT AGCACTT AGCGCTT AGCACAA TAGACTT TAGCCCA AAGACTT TGGACTT AAGGCCT AGGGCAT TAGCCCT AGCACTT AAGGCCT TGGACTT TAGCCCA TAGACTT AGCGCTT AGCACAA AGGGCAT TAGCCCT AGCACTT

Phylogeny Problem AGGGCAT TAGCCCA TAGACTT TGCACAA TGCGCTT U V W X Y X

Performance criteria Running time Space Statistical performance issues (e.g., statistical consistency) with respect to a Markov model of evolution “Topological accuracy” with respect to the underlying true tree or true alignment, typically studied in simulation Accuracy with respect to a particular criterion (e.g. maximum likelihood score), on real data

Quantifying Error FN FP FN: false negative (missing edge) FP: false positive (incorrect edge) 50% error rate FP

Statistical consistency, exponential convergence, and absolute fast convergence (afc)

Phylogenetic reconstruction methods Hill-climbing heuristics for hard optimization criteria (Maximum Parsimony and Maximum Likelihood) Local optimum Cost Global optimum Phylogenetic trees Polynomial time distance-based methods: Neighbor Joining, FastME, etc. 3. Bayesian methods

Solving maximum likelihood (and other hard optimization problems) is… unlikely # of Taxa # of Unrooted Trees 4 3 5 15 6 105 7 945 8 10395 9 135135 10 2027025 20 2.2 x 1020 100 4.5 x 10190 1000 2.7 x 102900

Quantifying Error FN: false negative (missing edge) FP: false positive (incorrect edge) FP 50% error rate

Neighbor joining has poor performance on large diameter trees [Nakhleh et al. ISMB 2001] 0.8 NJ Theorem (Atteson): Exponential sequence length requirement for Neighbor Joining! Error Rate 0.6 0.4 0.2 400 800 No. Taxa 1200 1600

Major Challenges •  Phylogenetic analyses: standard methods have poor accuracy on even moderately large datasets, and the most accurate methods are enormously computationally intensive (weeks or months, high memory requirements)

Phylogeny Problem AGGGCAT TAGCCCA TAGACTT TGCACAA TGCGCTT U V W X Y X

The Real Problem! U V W X Y X U Y V W AGGGCATGA AGAT TAGACTT TGCACAA TGCGCTT X U Y V W

…ACGGTGCAGTTACCA… …ACGGTGCAGTTACC-A… …AC----CAGTCACCTA… …ACCAGTCACCTA… Deletion Substitution …ACGGTGCAGTTACCA… Insertion …ACGGTGCAGTTACC-A… …AC----CAGTCACCTA… …ACCAGTCACCTA… The true multiple alignment –  Reflects historical substitution, insertion, and deletion events –  Defined using transitive closure of pairwise alignments computed on edges of the true tree

Input: unaligned sequences = AGGCTATCACCTGACCTCCA S2 TAGCTATCACGACCGC S3 TAGCTGACCGC S4 TCACGACCGACA

Phase 1: Alignment S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- S4 = -------TCAC--GACCGACA TCACGACCGACA

Phase 2: Construct tree S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- S4 = -------TCAC--GACCGACA TCACGACCGACA S1 S2 S4 S3

Two-phase estimation • Bayesian MCMC • Maximum parsimony Alignment methods •  Clustal •  POY (and POY*) •  Probcons (and Probtree) •  Probalign •  MAFFT •  Muscle •  Di-align •  T-Coffee •  Prank (PNAS 2005, Science 2008) •  Opal (ISMB and Bioinf. 2007) •  FSA (PLoS Comp. Bio. 2009) •  Infernal (Bioinf. 2009) •  Etc. Phylogeny methods •  Bayesian MCMC •  Maximum parsimony •  Maximum likelihood •  Neighbor joining •  FastME •  UPGMA •  Quartet puzzling •  Etc.

Estimated tree and alignment Simulation Studies S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC S4 = TCACGACCGACA Unaligned Sequences S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- S4 = -------TCAC--GACCGACA S1 S2 S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-C--T-----GACCGC-- S4 = T---C-A-CGACCGA----CA S1 S4 Compare S4 S3 True tree and alignment S2 S3 Estimated tree and alignment

1000 taxon models, ordered by diﬃculty (Liu et al., 2009)

Multiple Sequence Alignment (MSA): another grand challenge1 S1 = AGGCTATCACCTGACCTCCA S2 = TAGCTATCACGACCGC S3 = TAGCTGACCGC … Sn = TCACGACCGACA S1 = -AGGCTATCACCTGACCTCCA S2 = TAG-CTATCAC--GACCGC-- S3 = TAG-CT-------GACCGC-- … Sn = -------TCAC--GACCGACA Novel techniques needed for scalability and accuracy NP-hard problems and large datasets Current methods do not provide good accuracy Few methods can analyze even moderately large datasets Many important applications besides phylogenetic estimation 1 Frontiers in Massive Data Analysis, National Academies Press, 2013

Major Challenges •  Phylogenetic analyses: standard methods have poor accuracy on even moderately large datasets, and the most accurate methods are enormously computationally intensive (weeks or months, high memory requirements) •  Multiple sequence alignment: key step for many biological questions (protein structure and function, phylogenetic estimation), but few methods can run on large datasets. Alignment accuracy is generally poor for large datasets with high rates of evolution.

Phylogenomics (Phylogenetic estimation from whole genomes)

Species Tree Estimation requires multiple genes! Orangutan Gorilla Chimpanzee Human From the Tree of the Life Website, University of Arizona

Two basic approaches for species tree estimation •  Concatenate (“combine”) sequence alignments for different genes, and run phylogeny estimation methods •  Compute trees on individual genes and combine gene trees

Using multiple genes gene 3 gene 2 gene 1 S1 S2 S3 S4 S7 S8 TCTAATGGAA GCTAAGGGAA TCTAAGGGAA TCTAACGGAA TCTAATGGAC TATAACGGAA gene 3 TATTGATACA TCTTGATACC TAGTGATGCA CATTCATACC S1 S3 S4 S7 S8 gene 2 GGTAACCCTC GCTAAACCTC GGTGACCATC S4 S5 S6 S7

Concatenation gene 2 gene 3 gene 1 S1 TCTAATGGAA TATTGATACA S2 ? ? ? ? ? ? ? ? ? ? TATTGATACA S2 GCTAAGGGAA ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? S3 TCTAAGGGAA ? ? ? ? ? ? ? ? ? ? TCTTGATACC S4 TCTAACGGAA GGTAACCCTC TAGTGATGCA S5 ? ? ? ? ? ? ? ? ? ? GCTAAACCTC ? ? ? ? ? ? ? ? ? ? S6 ? ? ? ? ? ? ? ? ? ? GGTGACCATC ? ? ? ? ? ? ? ? ? ? S7 TCTAATGGAC GCTAAACCTC TAGTGATGCA S8 TATAACGGAA ? ? ? ? ? ? ? ? ? ? CATTCATACC

Red gene tree ≠ species tree (green gene tree okay)

Gene Tree Incongruence Gene trees can differ from the species tree due to: Duplication and loss Horizontal gene transfer Incomplete lineage sorting (ILS)

Incomplete Lineage Sorting (ILS) Confounds phylogenetic analysis for many groups: Hominids Birds Yeast Animals Toads Fish Fungi There is substantial debate about how to analyze phylogenomic datasets in the presence of ILS.

Lineage Sorting Population-level process, also called the “Multi-species coalescent” (Kingman, 1982) Gene trees can differ from species trees due to short times between speciation events or large population size; this is called “Incomplete Lineage Sorting” or “Deep Coalescence”.

The Coalescent Present Past Courtesy James Degnan

Gene tree in a species tree Courtesy James Degnan

Key observation: Under the multi-species coalescent model, the species tree defines a probability distribution on the gene trees, and is identifiable from the distribution on gene trees Courtesy James Degnan

Two competing approaches gene 1 gene 2 . . . gene k . . . Species Concatenation point out that supertree methods take overlaping trees and produce a tree, and that the whole process of first generating small trees and then applying a supertree method is often referred to as the “supertree approach”. . . . Analyze separately Summary Method

Species tree estimation: difficult, even for small datasets! Orangutan Gorilla Chimpanzee Human From the Tree of the Life Website, University of Arizona

Major Challenges: large datasets, fragmentary sequences •  Multiple sequence alignment: Few methods can run on large datasets, and alignment accuracy is generally poor for large datasets with high rates of evolution. •  Gene Tree Estimation: standard methods have poor accuracy on even moderately large datasets, and the most accurate methods are enormously computationally intensive (weeks or months, high memory requirements). •  Species Tree Estimation: gene tree incongruence makes accurate estimation of species tree challenging. Phylogenetic Network Estimation: Horizontal gene transfer and hybridization requires non-tree models of evolution Both phylogenetic estimation and multiple sequence alignment are also impacted by fragmentary data.

Major Challenges: large datasets, fragmentary sequences •  Multiple sequence alignment: Few methods can run on large datasets, and alignment accuracy is generally poor for large datasets with high rates of evolution. •  Gene Tree Estimation: standard methods have poor accuracy on even moderately large datasets, and the most accurate methods are enormously computationally intensive (weeks or months, high memory requirements). •  Species Tree Estimation: gene tree incongruence makes accurate estimation of species tree challenging. Phylogenetic Network Estimation: Horizontal gene transfer and hybridization requires non-tree models of evolution Both phylogenetic estimation and multiple sequence alignment are also impacted by fragmentary data.

Avian Phylogenomics Project Erich Jarvis, HHMI MTP Gilbert, Copenhagen G Zhang, BGI T. Warnow UT-Austin S. Mirarab Md. S. Bayzid, UT-Austin UT-Austin Plus many many other people… Approx. 50 species, whole genomes 14,000 loci Challenges: Species tree estimation under the multi-species coalescent model, from 14,000 poor estimated gene trees, all with different topologies (we used “statistical binning”) Maximum likelihood estimation on a million-site genome-scale alignment – 250 CPU years Science, December 2014 (Jarvis, Mirarab, et al., and Mirarab et al.)

1kp: Thousand Transcriptome Project G. Ka-Shu Wong U Alberta J. Leebens-Mack U Georgia N. Wickett Northwestern N. Matasci iPlant T. Warnow, S. Mirarab, N. Nguyen, UIUC UT-Austin UT-Austin Plus many many other people… Plant Tree of Life based on transcriptomes of ~1200 species More than 13,000 gene families (most not single copy) First paper: PNAS 2014 (~100 species and ~800 loci) Gene Tree Incongruence Upcoming Challenges (~1200 species, ~400 loci): Species tree estimation under the multi-species coalescent from hundreds of conflicting gene trees on >1000 species (we will use ASTRAL – Mirarab et al. 2014, Mirarab & Warnow 2015) Multiple sequence alignment of >100,000 sequences (with lots of fragments!) – we will use UPP (Nguyen et al., 2015)

Constructing the Tree of Life: Hard Computational Problems NP-hard problems Large datasets 100,000+ sequences thousands of genes “Big data” complexity: model misspecification fragmentary sequences errors in input data streaming data

Research Strategies Improved algorithms through: Statistical modelling Divide-and-conquer “Bin-and-conquer” Iteration Bayesian statistics Hidden Markov Models Graph theory Combinatorial optimization Statistical modelling Massive Simulations High Performance Computing

Evolution informs about everything in biology •  Big genome sequencing projects just produce data -‐-‐ so what? •  Evolutionary history relates all organisms and genes, and helps us understand and predict –  interactions between genes (genetic networks) –  drug design –  predicting functions of genes –  inﬂuenza vaccine development –  origins and spread of disease –  origins and migrations of humans

Metagenomics: Venter et al., Exploring the Sargasso Sea: Scientists Discover One Million New Genes in Ocean Microbes

Metagenomic data analysis NGS data produce fragmentary sequence data Metagenomic analyses include unknown species Taxon identification: given short sequences, identify the species for each fragment Applications: Human Microbiome Issues: accuracy and speed Mihai Pop, Univ Maryland

Metagenomic taxon identification Objective: classify short reads in a metagenomic sample

Possible Indo-European tree (Ringe, Warnow and Taylor 2000) Anatolian Vedic Iranian Greek Italic Celtic Tocharian Germanic Armenian Baltic Slavic Albanian

“Perfect Phylogenetic Network” for IE Nakhleh et al., Language 2005 Anatolian Vedic Iranian Greek Italic Celtic Tocharian Germanic Armenian Baltic Slavic Albanian

Grading Homework: 25% (one hw dropped) Midterm: 40% (March 30) Final Project: 25% (due May 6) Course Participation: 10% No final exam.

Homework Assignments Homework assignments are listed at http://tandy.cs.illinois.edu/cs581-2017-hw.html and are due at 1 PM (in person or via email) – late homeworks have reduced credit and will not be accepted after 48 hours past the deadline. You are encouraged to work with others on your homework, but you must write up solutions by yourself and indicate who you worked with on each homework.

Course Schedule A detailed course schedule is athttp://tandy.cs.illinois.edu/cs581-2017-detailed-syllabus.html This schedule includes material you are expected to have looked at before coming to class: assigned reading (from textbook and/or scientific literature) PPT and/or PDF of my lecture

Final Project and Class Presentation Either research project (can be with another student) or survey paper (done by yourself). Many interesting and publishable problems to address: see http://tandy.cs.illinois.edu/topics.html Your class presentation should be related to your final project.

Academic Integrity Please see course website at http://tandy.cs.illinois.edu/581-2017.html and also http://tandy.cs.illinois.edu/ethics.pdf For this course: Examine the policy about collaboration Learn and understand what plagiarism is (and then don’t do it). This applies to homework, all writing assignments, and the final project.

Course Research Projects Evaluating existing methods on simulated and real (biological or linguistic) datasets Designing a new method, and establishing its performance (using theory and data) Analyzing a biological dataset using several different methods, to address biology

Examples of published course projects Md S. Bayzid, T. Hunt, and T. Warnow. "Disk Covering Methods Improve Phylogenomic Analyses". Proceedings of RECOMB-CG (Comparative Genomics), 2014, and BMC Genomics 2014, 15(Suppl 6): S7. T. Zimmermann, S. Mirarab and T. Warnow. "BBCA: Improving the scalability of *BEAST using random binning". Proceedings of RECOMB-CG (Comparative Genomics), 2014, and BMC Genomics 2014, 15(Suppl 6): S11. J. Chou, A. Gupta, S. Yaduvanshi, R. Davidson, M. Nute, S. Mirarab and T. Warnow. “A comparative study of SVDquartets and other coalescent-based species tree estimation methods”. RECOMB-Comparative Genomics and BMC Genomics, 2015., 2015, 16 (Suppl 10): S2. P. Vachaspati and T. Warnow (2016). FastRFS: Fast and Accurate Robinson-Foulds Supertrees using Constrained Exact Optimization Bioinformatics 2016; doi: 10.1093/bioinformatics/btw600. (Special issue for papers from RECOMB-CG)

Research Projects you could join Phylogenomics projects (Avian and the 1KP) Species tree and network estimation from conflicting genes Large-scale multiple sequence alignment Large-scale maximum likelihood tree estimation Improving gene tree estimation using whole genomes Metagenomics (with Mihai Pop, University of Maryland, and Bill Gropp) Identifying genes and taxa from short sequences Metagenomic assembly Applications to clinical diagnostics Protein Sequence Analysis (with Jian Peng) What function and structure does this protein have? How did structure and function evolve? Historical Linguistics (with Donald Ringe, UPenn) How did Indo-European evolve? Designing and implementing statistical estimation methods for language phylogenies