Multiple Sequence Alignment with PASTA Michael Nute Austin, TX June 17, 2016.

Slides:

Advertisements

Similar presentations

Lecture 10 Sharing Resources. Basics of File Sharing The core component of any server is its ability to share files. In fact, the Server service in all.

Advertisements

Once you complete the Target Tests you will be returned to the PADDS main menu screen. Usually the next step is to Generate reports. Begin by clicking.

Alignment methods Introduction to global and local sequence alignment methods Global : Needleman-Wunch Local : Smith-Waterman Database Search BLAST FASTA.

MCTS GUIDE TO MICROSOFT WINDOWS 7 Chapter 10 Performance Tuning.

NetAcumen ActiveX Download Instructions

Clustal W and Clustal X version 2.0 김영호, 박준호, 최현희 The 9 th Protein Folding Winter School.

Installing SAS 9.3 Raymond R. Balise Health Research and Policy.

CPSC 231 Sorting Large Files (D.H.)1 LEARNING OBJECTIVES Sorting of large files –merge sort –performance of merge sort –multi-step merge sort.

The Information School of the University of Washington Oct university of washington1 What the Digerati Know INFO/CSE100, Fall 2006 Fluency.

6/29/20151 Efficient Algorithms for Motif Search Sudha Balla Sanguthevar Rajasekaran University of Connecticut.

MCITP Guide to Microsoft Windows Server 2008 Server Administration (Exam #70-646) Chapter 14 Server and Network Monitoring.

Command Console Tutorial BCIS 3680 Enterprise Programming.

Installing SAS 9.3 TS1M1 Raymond R. Balise Health Research and Policy.

Debugged!.  You know that old line about an ounce of prevention?  It’s true for debugging.

Chapter 4: Looping CSCI-UA 0002 – Introduction to Computer Programming Mr. Joel Kemp.

Hands-On Microsoft Windows Server 2008 Chapter 11 Server and Network Monitoring.

1 Chapter Overview Creating User and Computer Objects Maintaining User Accounts Creating User Profiles.

Integrating HADOOP with Eclipse on a Virtual Machine Moheeb Alwarsh January 26, 2012 Kent State University.

Fundamentals of Python: From First Programs Through Data Structures

Every week: Sign in at the door If you are new: Fill in Registration Form Ask a Mentor how to get started Make sure you are on the Athenry Parents/Kids.

Programming For Nuclear Engineers Lecture 12 MATLAB (3) 1.

Computer Systems Week 10: File Organisation Alma Whitfield.

Fundamentals of Python: First Programs

Map Reduce: Simplified Data Processing On Large Clusters Jeffery Dean and Sanjay Ghemawat (Google Inc.) OSDI 2004 (Operating Systems Design and Implementation)

MCTS Guide to Microsoft Windows 7

Hands-On Virtual Computing

Copyright © 2007, Oracle. All rights reserved. Managing Concurrent Requests.

An Introduction to Designing and Executing Workflows with Taverna Katy Wolstencroft University of Manchester.

The VPO Operator. [vpo_operator] 2 The VPO Operator Section Overview The role of the VPO operator Starting and stopping the Motif GUI The VPO Operator.

Chapter 6 SAS ® OLAP Cube Studio. Section 6.1 SAS OLAP Cube Studio Architecture.

Introduction of Geoprocessing Topic 7a 4/10/2007.

Python From the book “Think Python”

MATLAB for Engineers 4E, by Holly Moore. © 2014 Pearson Education, Inc., Upper Saddle River, NJ. All rights reserved. This material is protected by Copyright.

IBM Software Group ® Overview of SA and RSA Integration John Jessup June 1, 2012 Slides from Kevin Cornell December 2008 Have been reused in this presentation.

Guide to Programming with Python Chapter One Getting Started: The Game Over Program.

1 Administering Shared Folders Understanding Shared Folders Planning Shared Folders Sharing Folders Combining Shared Folder Permissions and NTFS Permissions.

An Introduction to Designing and Executing Workflows with Taverna Aleksandra Pawlik materials by: Katy Wolstencroft University of Manchester.

With Windows 7 Introductory© 2011 Pearson Education, Inc. Publishing as Prentice Hall1 Windows 7 Introductory Chapter 3 Advanced File Management and Advanced.

9/2/ CS171 -Math & Computer Science Department at Emory University.

1 Phylogeny Workshop By Eyal PrivmanEyal Privman The Bioinformatics Unit G.S. Wise Faculty of Life Science Tel Aviv University, Israel November 2009

1 Part-1 Chap 5 Configuring Accounts Definitions.

Module 3 Configuring File Access and Printers on Windows 7 Clients.

Build an Automated Workflow Visual Workflow Creator Discovery Environment.

WinCvs. WinCVS WinCvs is a window based version control system. Use WinCvs when  You want to save every version of your file you have ever created. CVS.

Operating Systems Written by: Tim Keyser Georgia CTAE Resource Network 2010.

MySQL Getting Started BCIS 3680 Enterprise Programming.

Object Oriented Programming (OOP) LAB # 1 TA. Maram & TA. Mubaraka TA. Kholood & TA. Aamal.

2007 TAX YEARERO TRAINING - MODULE 61 ERO (Transmitter) Training Module 6 Federal and State Installation and Updates.

Automatic and manual sequence alignment Inferring phylogenetic trees Mining web-based databases Estimating rates of molecular evolution Testing evolutionary.

Using Divide-and-Conquer to Construct the Tree of Life Tandy Warnow University of Illinois at Urbana-Champaign.

Subscribers – List Model

Introduction of Geoprocessing Lecture 9 3/24/2008.

Efficiently Solving Computer Programming Problems Doncho Minkov Telerik Corporation Technical Trainer.

General Troubleshooting Nonlinear Diagnostics. Goal – In this workshop, our goal is to use the nonlinear diagnostics tools available in Solution Information.

CIS 842: Specification and Verification of Reactive Systems Lecture INTRO-Depth-Bounded: Depth-Bounded Depth-first Search Copyright 2004, Matt Dwyer, John.

Software testing techniques Software testing techniques REGRESSION TESTING Presentation on the seminar Kaunas University of Technology.

Product A Product B Product C A1A1 A2A2 A3A3 B1B1 B2B2 B3B3 B4B4 C1C1 C3C3 C4C4 Turret lathes Vertical mills Center lathes Drills From “Fundamentals of.

MySQL Getting Started BCIS 3680 Enterprise Programming.

System Software (1) The Operating System

Some of the utilities associated with the development of programs. These program development tools allow users to write and construct programs that the.

REGRESSION TESTING Audrius Čėsna IFM-0/2. Regression testing is any type of software testing that seeks to uncover new errors, or regressions, in existing.

Scaling BAli-Phy to Large Datasets June 16, 2016 Michael Nute 1.

The need for Programming Languages

Discussion #11 11/21/16.

Installation and Configuration

Introduction to Algorithm Design

New methods for simultaneous estimation of trees and alignments

Git CS Fall 2018.

MAINTAINING SERVER AVAILIBILITY

Lab 8: GUI testing Software Testing LTAT

Presentation transcript:

Multiple Sequence Alignment with PASTA Michael Nute Austin, TX June 17, 2016

Agenda Quick recap of PASTA Algorithm Run the GUI Explore GUI options and what they do in terms of PASTA Run a test alignment Explore PASTA outputs and diagnostics Run a different test alignment Compare the PASTA fill-in-the-blank defaults for the two test alignments

PASTA: Installation We hope everybody has been able to install PASTA based on instructions from our . If not: See detailed installation instructions at: Three Options: 1)MAC –DMG file available at the link above 2)Linux –Detailed instructions available at the link above –Requires JAVA, wxPython, 3)Virtual Machine (Recommended: VirtualBox) –Virtual appliance available at link above –This is the only option for Windows users

Estimate ML tree on new alignment Tree Obtain initial alignment and estimated ML tree Use tree to compute new alignment Alignment Repeat until termination condition, and return the alignment/tree pair with the best ML score SATé and PASTA Algorithms 4

PASTA Algorithm 5 Input: unaligned sequences 1) Get initial alignment 2) Estimate tree on current alignment 3) Break into subsets according to tree 4) Use external aligner to align subsets 5) Use external profile aligner to merge subset alignments 6) Use transitivity to merge subset pairs into a full alignment, scrap the old tree (repeat) ?

PASTA GUI

7 ? Initial Alignment Get a Tree Decompose Align subsets Merge subset alignments pairwise Transitivity merge This applies to the Tree Estimator in particular PASTA Algorithm 1)This is the alignment tool used to align the subsets (several options). 2)Tool for merging two subset alignments. (OPAL or MUSCLE) 3)Tool to estimate a maximum likelihood tree (FastTree or RAxML) PASTA GUI

8 ? Initial Alignment Get a Tree Decompose Align subsets Merge subset alignments pairwise Transitivity merge The basic input to the problem: FASTA file with sequences in need of alignment 4 <-- not implemented yet 5 6 Data type (DNA, RNA or Protein) This should be checked if the sequence file (4) should be treated as aligned. If not checked, PASTA will generate a fast progressive alignment to start. The user can provide a starting tree that will cause the algorithm to skip the initial alignment step. PASTA Algorithm

9 ? Initial Alignment Get a Tree Decompose Align subsets Merge subset alignments pairwise Transitivity merge Basic administrative settings: Job Name – all output files will start with this name. Output Dir – folder where output files will go. CPUs – number of processors Max. Memory (MB) – only applies to Java when OPAL is called. PASTA Algorithm

10 ? Initial Alignment Get a Tree Decompose Align subsets Merge subset alignments pairwise Transitivity merge Stopping criteria for the decomposition. Can be either a fixed size or a percentage of the total taxa. Decomposition Steps: Start by choosing a branch according to the Decomposition option (Centroid or Longest Branch). For each of the two subsets created, if the number of taxa is greater than Max. Subproblem, then repeat on that subset. How to decide where to bisect the tree, (either Centroid Edge or the Longest Branch) PASTA Algorithm

11 ? Initial Alignment Get a Tree Decompose Align subsets Merge subset alignments pairwise Transitivity merge When to Stop Running? Which iteration to return? (Final or Highest Likelihood) Should final tree be RAxML? (see below) Two-Phase search is simply 1) run an alignment, 2) get a tree from it. This is completely different than PASTA and if this is checked, PASTA (formally) will not be run. PASTA Algorithm

Example 1: small.fasta Step 1: Read in the data. Located at /data/small. fasta This is the PASTA install folder on the Virtual Machine Reads in the DATA and sets Type, prints some stats:

Example 1: small.fasta Importing the data caused the GUI to automatically set several settings based on the size, data type, etc… It noticed that the data type was DNA It also noticed that this fasta file contains aligned sequences.

Example 1: small.fasta Step 2: name the job & set the output folder: Recommended: Use the create folder dialog to create a specific folder for these outputs.

Example 1: small.fasta Step 3: Say “GO”

Example 1: Examining the Output Folder Intermediate alignments and trees after the initial search and after each iteration. Useful mainly for diagnostics and debugging = Important File Job Output (Errors): contains PASTA console output when errors are reported. If this file is zero bytes, that is a good thing. Job Output: contains PASTA console output. Always good to examine this file after a run. Final Alignment: always in this name format:.marker001..aln Final Tree Config File: This saves all the settings for this particular job. The same exact job can be re-run from the command line by running “python run_pasta.py” with the path to this file as the ONLY argument

Example 2: BBA0067 (time permitting) (protein data)

Final Tips & Best Practices After running an alignment, it is always a good idea to look at the console outputs generated to verify that PASTA did what it was expected to do. If the error file is non-zero size, read that too. The PASTA default settings are appropriate and well-chosen for most applications. Unless you have a good reason to use something else, this is a good starting point. PASTA scales with the number of cores available, so giving it as many processors as possible is a good idea. There are more settings available than what is in the GUI. Check the config file output for any pasta job to see the full list. Also can type “python run_pasta.py –h” from the pasta folder to see a thorough help menu Approximate running time benchmarks (length=1500 base pairs): –100 Sequences: <10 minutes on a laptop –1000 Sequences: About 1-3 hours on a 16-core server –10000 Sequences: About 8-15 hours on a 16-core server –(Should scale about linearly after this, but will depend on settings…)

Resources PASTA User Group: Link to these slides: Github Repository (which has more documentation, including full install instructions): My