Computer Science Research for The Tree of Life Tandy Warnow Department of Computer Sciences University of Texas at Austin.

Slides:



Advertisements
Similar presentations
CS 336 March 19, 2012 Tandy Warnow.
Advertisements

Lecture 24 Coping with NPC and Unsolvable problems. When a problem is unsolvable, that's generally very bad news: it means there is no general algorithm.
Challenges in computational phylogenetics Tandy Warnow Radcliffe Institute for Advanced Study University of Texas at Austin.
1 The TSP : Approximation and Hardness of Approximation All exact science is dominated by the idea of approximation. -- Bertrand Russell ( )
Computer Science and Reconstructing Evolutionary Trees Tandy Warnow Department of Computer Science University of Illinois at Urbana-Champaign.
Computational biology and computational biologists Tandy Warnow, UT-Austin Department of Computer Sciences Institute for Cellular and Molecular Biology.
BNFO 602 Phylogenetics Usman Roshan.
The Theory of NP-Completeness
CIS786, Lecture 3 Usman Roshan.
Phylogeny reconstruction BNFO 602 Roshan. Simulation studies.
BNFO 602 Phylogenetics Usman Roshan. Summary of last time Models of evolution Distance based tree reconstruction –Neighbor joining –UPGMA.
Approximation Algorithms Motivation and Definitions TSP Vertex Cover Scheduling.
Computing the Tree of Life The University of Texas at Austin Department of Computer Sciences Tandy Warnow.
Computational and mathematical challenges involved in very large-scale phylogenetics Tandy Warnow The University of Texas at Austin.
Combinatorial and graph-theoretic problems in evolutionary tree reconstruction Tandy Warnow Department of Computer Sciences University of Texas at Austin.
Phylogeny Estimation: Why It Is "Hard", and How to Design Methods with Good Performance Tandy Warnow Department of Computer Sciences University of Texas.
CIPRES: Enabling Tree of Life Projects Tandy Warnow The Program in Evolutionary Dynamics at Harvard University The University of Texas at Austin.
Phylogenetic Tree Reconstruction Tandy Warnow The Program in Evolutionary Dynamics at Harvard University The University of Texas at Austin.
Complexity and The Tree of Life Tandy Warnow The University of Texas at Austin.
1 Introduction to Approximation Algorithms. 2 NP-completeness Do your best then.
Maximum Parsimony Input: Set S of n aligned sequences of length k Output: –A phylogenetic tree T leaf-labeled by sequences in S –additional sequences of.
Complexity Classes (Ch. 34) The class P: class of problems that can be solved in time that is polynomial in the size of the input, n. if input size is.
Phylogenetics Alexei Drummond. CS Friday quiz: How many rooted binary trees having 20 labeled terminal nodes are there? (A) (B)
Rec-I-DCM3: A Fast Algorithmic Technique for Reconstructing Large Evolutionary Trees Usman Roshan Department of Computer Science New Jersey Institute of.
NP-hardness and Phylogeny Reconstruction Tandy Warnow Department of Computer Sciences University of Texas at Austin.
Computational Biology, Part D Phylogenetic Trees Ramamoorthi Ravi/Robert F. Murphy Copyright  2000, All rights reserved.
Software for Scientists Tandy Warnow Department of Computer Science University of Texas at Austin.
Week 10Complexity of Algorithms1 Hard Computational Problems Some computational problems are hard Despite a numerous attempts we do not know any efficient.
CSE 024: Design & Analysis of Algorithms Chapter 9: NP Completeness Sedgewick Chp:40 David Luebke’s Course Notes / University of Virginia, Computer Science.
Benjamin Loyle 2004 Cse 397 Solving Phylogenetic Trees Benjamin Loyle March 16, 2004 Cse 397 : Intro to MBIO.
CS 173, Lecture B August 25, 2015 Professor Tandy Warnow.
CSE332: Data Abstractions Lecture 24.5: Interlude on Intractability Dan Grossman Spring 2012.
CPSC 320: Intermediate Algorithm Design & Analysis Greedy Algorithms and Graphs Steve Wolfman 1.
394C: Algorithms for Computational Biology Tandy Warnow Sept 9, 2013.
394C, Spring 2013 Sept 4, 2013 Tandy Warnow. DNA Sequence Evolution AAGACTT TGGACTTAAGGCCT -3 mil yrs -2 mil yrs -1 mil yrs today AGGGCATTAGCCCTAGCACTT.
More statistical stuff CS 394C Feb 6, Today Review of material from Jan 31 Calculating pattern probabilities Why maximum parsimony and UPGMA are.
CSE 589 Part VI. Reading Skiena, Sections 5.5 and 6.8 CLR, chapter 37.
CSE373: Data Structures & Algorithms Lecture 22: The P vs. NP question, NP-Completeness Lauren Milne Summer 2015.
CIPRES: Enabling Tree of Life Projects Tandy Warnow The University of Texas at Austin.
Introduction to Phylogenetic Estimation Algorithms Tandy Warnow.
1.5 Graph Theory. Graph Theory The Branch of mathematics in which graphs and networks are used to solve problems.
CS 3343: Analysis of Algorithms Lecture 25: P and NP Some slides courtesy of Carola Wenk.
CIRCUITS, PATHS, AND SCHEDULES Euler and Königsberg.
Subtree Prune Regraft & Horizontal Gene Transfer or Recombination.
Week 13 - Monday.  What did we talk about last time?  B-trees  Hamiltonian tour  Traveling Salesman Problem.
LIMITATIONS OF ALGORITHM POWER
Algorithmic research in phylogeny reconstruction Tandy Warnow The University of Texas at Austin.
Constructing the Tree of Life: Divide-and-Conquer! Tandy Warnow University of Illinois at Urbana-Champaign.
598AGB Basics Tandy Warnow. DNA Sequence Evolution AAGACTT TGGACTTAAGGCCT -3 mil yrs -2 mil yrs -1 mil yrs today AGGGCATTAGCCCTAGCACTT AAGGCCTTGGACTT.
Problems with large-scale phylogeny Tandy Warnow, UT-Austin Department of Computer Sciences Center for Computational Biology and Bioinformatics.
CS 598 AGB Supertrees Tandy Warnow. Today’s Material Supertree construction: given set of trees on subsets of S (the full set of taxa), construct tree.
CS 395T: Computational phylogenetics January 18, 2006 Tandy Warnow.
Introduction to NP Instructor: Neelima Gupta 1.
Sequence alignment CS 394C: Fall 2009 Tandy Warnow September 24, 2009.
Simultaneous alignment and tree reconstruction Collaborative grant: Texas, Nebraska, Georgia, Kansas Penn State University, Huston-Tillotson, NJIT, and.
Chapter AGB. Today’s material Maximum Parsimony Fixed tree versions (solvable in polynomial time using dynamic programming) Optimal tree search.
394C: Algorithms for Computational Biology Tandy Warnow Jan 25, 2012.
394C, Spring 2012 Jan 23, 2012 Tandy Warnow.
Challenges in constructing very large evolutionary trees
Professor Tandy Warnow
Spanning Trees Discrete Mathematics.
Graphs Chapter 13.
BNFO 602 Phylogenetics Usman Roshan.
BNFO 602 Phylogenetics – maximum parsimony
CS 581 Tandy Warnow.
CS 581 Tandy Warnow.
Tandy Warnow Department of Computer Sciences
CS 394C: Computational Biology Algorithms
September 1, 2009 Tandy Warnow
Algorithms for Inferring the Tree of Life
Presentation transcript:

Computer Science Research for The Tree of Life Tandy Warnow Department of Computer Sciences University of Texas at Austin

How did life evolve on earth? An international effort to understand how life evolved on earth Biomedical applications: drug design, protein structure and function prediction, biodiversity Phylogenetic estimation is a “Grand Challenge”: millions of taxa, NP-hard optimization problems Courtesy of the Tree of Life project

DNA Sequence Evolution AAGACTT TGGACTTAAGGCCT -3 mil yrs -2 mil yrs -1 mil yrs today AGGGCATTAGCCCTAGCACTT AAGGCCTTGGACTT TAGCCCATAGACTTAGCGCTTAGCACAAAGGGCAT TAGCCCTAGCACTT AAGACTT TGGACTTAAGGCCT AGGGCATTAGCCCTAGCACTT AAGGCCTTGGACTT AGCGCTTAGCACAATAGACTTTAGCCCAAGGGCAT

Molecular Systematics TAGCCCATAGACTTTGCACAATGCGCTTAGGGCAT UVWXY U VW X Y

Computational biology research What is a computational problem? What is an algorithm? How to design and analyze algorithms What NP-hardness means (and what to do about it) Two computational problems in biology –Molecular sequence alignment –Evolutionary history reconstruction

Some computational problems 1.Given a list of numbers, put it into sorted order 2.Given a map and a collection of cities, find the shortest tour that visits every city 3.Given a collection of people, find the largest subset of them that all know each other 4.Given a collection of people, find the smallest number of groups so that no two people in the same group know each other.

Some computational problems 1.Given a list of numbers, put it into sorted order 2.Given a map and a collection of cities, find the shortest tour that visits every city 3.Given a collection of people, find the largest subset of them that all know each other 4.Given a collection of people, find the smallest number of groups so that no two people in the same group know each other. Which ones can be solved in polynomial time?

Sorting Given a list of n numbers, put it into sorted order Algorithm: find smallest number, and put it in the front of the list. Repeat the process on the last n-1 numbers. Running time: O(n 2 ) (polynomial time)

Some computational problems 1.Given a list of numbers, put it into sorted order 2.Given a map and a collection of cities, find the shortest tour that visits every city 3.Given a collection of people, find the largest subset of them that all know each other 4.Given a collection of people, find the smallest number of groups so that no two people in the same group know each other. Which ones can be solved in polynomial time?

Some computational problems 1.Given a list of numbers, put it into sorted order 2.Given a map and a collection of cities, find the shortest tour that visits every city 3.Given a collection of people, find the largest subset of them that all know each other 4.Given a collection of people, find the smallest number of groups so that no two people in the same group know each other. Which ones can be solved in polynomial time?

Is this problem polynomial? Problem: Given a collection of people, determine if they can be put into 2 groups so that no two people in the same group know each other Graph-theoretic representation: Create a graph with vertices for the people, and edges between vertices if the two people know each other! Mary Sue Tom Henry Carol

2-coloring 2-colorability: Given graph G = (V,E), determine if we can assign colors red and blue to the vertices of G so that no edge connects vertices of the same color. Greedy Algorithm. Start with one vertex and make it red, and then make all its neighbors blue, and keep going. If you succeed in coloring the graph without making two nodes of the same color adjacent, the graph can be 2-colored. Running time: O(n+m) time, where n is the number of vertices and m is the number of edges.

2-coloring 2-colorability: Given graph G = (V,E), determine if we can assign colors red and blue to the vertices of G so that no edge connects vertices of the same color. Greedy Algorithm. Start with one vertex and make it red, and then make all its neighbors blue, and keep going. If you succeed in coloring the graph without making two nodes of the same color adjacent, the graph can be 2-colored. Running time: O(n+m) time, where n is the number of vertices and m is the number of edges.

2-coloring 2-colorability: Given graph G = (V,E), determine if we can assign colors red and blue to the vertices of G so that no edge connects vertices of the same color. Greedy Algorithm. Start with one vertex and make it red, and then make all its neighbors blue, and keep going. If you succeed in coloring the graph without making two nodes of the same color adjacent, the graph can be 2-colored. Running time: O(n 2 ) time, where n is the number of vertices.

Can we group this set into two groups so that no two people know each other? Or Can we 2-color the graph? Mary Sue Tom Henry Carol

Can we group this set into two groups so that no two people know each other? Or Can we 2-color the graph? Mary Sue Tom Henry Carol

Can we group this set into two groups so that no two people know each other? Or Can we 2-color the graph? Mary Sue Tom Henry Carol

Can we group this set into two groups so that no two people know each other? Or Can we 2-color the graph? Mary Sue Tom Henry Carol No! We cannot!

What about this? 3-colorability: Given graph G, determine if we can assign red, blue, and green to the vertices in G so that no edge connects vertices of the same color.

What about this? 3-colorability: Given graph G, determine if we can assign red, blue, and green to the vertices in G so that no edge connects vertices of the same color. A brute-force solution seems to require O(3 n ) time, where n is the number of vertices.

Some decision problems can be solved in polynomial time: –Can graph G be 2-colored? Some decision problems seem to not be solvable in polynomial time: –Can graph G be 3-colored? –Does graph G have a Hamiltonian cycle (a cycle that visits every vertex exactly once)?

In fact, some problems are “NP-hard” 3-colorability: Given graph G, determine if we can assign red, blue, and green to the vertices in G so that no edge connects vertices of the same color. 3 -colorability is provably NP-hard. What does this mean?

Most computer scientists are willing to bet that no NP-hard problem can be solved in polynomial time. Therefore, the options are: –Solve the problem exactly (but use lots of time on some inputs) –Use heuristics which may not solve the problem correctly (and which might be computationally expensive, anyway)

Computational problems in Biology are almost always NP-hard! In particular, inferring evolutionary trees generally involves trying to solve NP-hard problems.

Maximum Parsimony Given a set of DNA sequences Find a tree for the sequences with the minimum total number of changes

Maximum parsimony (example) Input: Four sequences –ACT –ACA –GTT –GTA Question: which of the three trees has the best MP scores?

Maximum Parsimony ACT GTTACA GTA ACA ACT GTA GTT ACT ACA GTT GTA

Maximum Parsimony ACT GTT GTA ACA GTA MP score = 5 ACA ACT GTA GTT ACAACT MP score = 7 ACT ACA GTT GTA ACAGTA MP score = 4 Optimal MP tree

Maximum Parsimony ACT ACA GTT GTA ACAGTA MP score = 4 Finding the optimal MP tree is NP-hard Optimal labeling can be computed in polynomial time using Dynamic Programming

Solving NP-hard problems exactly is … unlikely Number of (unrooted) binary trees on n leaves is (2n-5)!! If each tree on 1000 taxa could be analyzed in seconds, we would find the best tree in 2890 millennia #leaves#trees x x x

Problems with techniques for MP and ML Shown here is the performance of a TNT heuristic maximum parsimony analysis on a real dataset of almost 14,000 sequences. (“Optimal” here means best score to date, using any method for any amount of time.) Acceptable error is below 0.01%. Performance of TNT with time

Research: we try to develop better heuristics Comparison of TNT to Rec-I-DCM3(TNT) on one large dataset Current best techniques DCM boosted version of best techniques

Other computational biology research Multiple sequence alignment Protein structure and function prediction Whole genome assembly Systems biology Drug design Human origins Evolution of languages (and the list goes on)

Computational biology research is fun, multi-disciplinary, and collaborative! Software development Mathematics Probability and Statistics Biology Chemistry Linguistics Plus, you will get to travel to far away lands

Computational biology conference locations