Algorithms for Biological Sequence Analysis ─ Class Presentation Human-Mouse Alignments with BLASTZ Galaxy: A Platform for Interactive Large-scale Genome.


Algorithms for Biological Sequence Analysis ─ Class Presentation Human-Mouse Alignments with BLASTZ Galaxy: A Platform for Interactive Large-scale Genome Analysis 許秉慧、陳怡靜、鄭智懷、宋建均

S. Schwartz, W. J. Kent, A. Smit, Z. Zhang, R. Baertsch, R. C. Hardison, D. Haussler, and W. Miller, “Human-Mouse Alignments with BLASTZ,” Genome Research, 2003; 13: 103–107. B. Giardine, C. Riemer, R. C. Hardison, R. Burhans, L. Elnitski, P. Shah, Y. Zhang, D. Blankenberg, I. Albert, W. Miller, W. J. Kent, and A. Nekrutenko, “Galaxy: A Platform for Interactive Large-scale Genome Analysis,” Genome Research, 2005; 15: 1451–1455.

Methods S. Schwartz, W. J. Kent, A. Smit, Z. Zhang, R. Baertsch, R. C. Hardison, D. Haussler, and W. Miller Human-Mouse Alignments with BLASTZ Genome Research, 2003; 13: 103–107 陳怡靜、許秉慧

Outline Motivation and results BLASTZ and modified BLASTZ Implementation issues and hardware environment Software evaluation

Motivation and Results 陳怡靜

Motivation Several existing programs sacrifice sensitivity to attain very short running times. A program called BLASTZ attains an appropriate level of sensitivity and specificity. A modified BLASTZ program is efficient enough to align entire mammalian genomes and has increased specificity.

Results The BLASTZ alignment program, which is used by the PipMaker web server (Schwartz et al. 2000), was modified. The modified BLASTZ was used to compare all of the human sequence with all of the mouse sequence efficiently.

BLASTZ and Modified BLASTZ 陳怡靜

Homologous Two proteins are orthologous if they belong to different species, evolved from a common ancestral gene by speciation, and retained the same function in the course of evolution. Two proteins are paralogous if they arose by duplication within a genome and evolved new functions.

Human-Mouse Alignments Goal: find orthologous alignments. As a natural consequence, we keep the single best alignment by applying a program called axtBest, which filters out all but the best alignment within a sliding window of 10,000 bases. [Figure: Step 1, mouse sequence aligned to human sequence]

BLASTZ BLASTZ follows the three-step strategy used by Gapped BLAST: 1) Find short near-exact matches. 2) Extend each short match without allowing gaps. 3) Extend each gap-free match that exceeds a certain threshold by a DP procedure that permits gaps.
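
The first two steps above can be sketched in a few lines of Python. This is a minimal illustration, not BLASTZ's implementation: the function names, the +1/-1 scoring, and the `xdrop` cutoff are assumptions chosen for readability, and the gapped DP of step 3 is omitted.

```python
from collections import defaultdict


def find_seeds(seq1, seq2, k):
    """Step 1: return (i, j) pairs where seq1[i:i+k] == seq2[j:j+k]."""
    index = defaultdict(list)
    for i in range(len(seq1) - k + 1):
        index[seq1[i:i + k]].append(i)
    hits = []
    for j in range(len(seq2) - k + 1):
        for i in index.get(seq2[j:j + k], ()):
            hits.append((i, j))
    return hits


def extend_right(seq1, seq2, i, j, match=1, mismatch=-1, xdrop=3):
    """Step 2: gap-free extension to the right of (i, j).

    Stops once the running score falls more than `xdrop` below the best
    score seen so far. Returns (best_score, length_of_best_extension).
    """
    score = best = best_len = n = 0
    while i + n < len(seq1) and j + n < len(seq2):
        score += match if seq1[i + n] == seq2[j + n] else mismatch
        n += 1
        if score > best:
            best, best_len = score, n
        elif best - score > xdrop:
            break
    return best, best_len


seeds = find_seeds("ACGTACGTTT", "GGACGTACGA", k=4)
print(seeds[0], extend_right("ACGTACGTTT", "GGACGTACGA", *seeds[0]))
```

A full seed-and-extend aligner would also extend to the left and pass high-scoring gap-free segments to a gapped dynamic-programming step.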

BLASTZ Two differences between BLASTZ and Gapped BLAST were exploited in the whole-genome alignments. First, BLASTZ has an option to require that the matching regions it reports occur in the same order and orientation in both sequences. [Figure: Sequence 1 vs. Sequence 2]

BLASTZ Two differences between BLASTZ and Gapped BLAST were exploited in the whole-genome alignments. Second, BLASTZ uses an alignment-scoring scheme derived and evaluated by Chiaromonte et al. (2002). Nucleotide substitutions are scored by the matrix below, and a gap of length k is penalized by subtracting 400 + 30k from the score.
       A      C      G      T
A     91   -114    -31   -123
C   -114    100   -125    -31
G    -31   -125    100   -114
T   -123    -31   -114     91
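
As a worked example of this scoring scheme, the sketch below scores an already-aligned pair of rows. It assumes the substitution matrix of Chiaromonte et al. and a gap penalty of 400 + 30k for a gap of length k, as published for BLASTZ; the function name and the use of '-' for gap columns are illustrative choices, not part of BLASTZ itself.

```python
BASES = "ACGT"
# Substitution scores (rows and columns in A, C, G, T order); assumed to be
# the Chiaromonte et al. matrix used by BLASTZ.
ROWS = [
    [91, -114, -31, -123],
    [-114, 100, -125, -31],
    [-31, -125, 100, -114],
    [-123, -31, -114, 91],
]
SCORE = {(a, b): ROWS[i][j]
         for i, a in enumerate(BASES)
         for j, b in enumerate(BASES)}


def alignment_score(row1, row2, gap_open=400, gap_extend=30):
    """Score two equal-length alignment rows; '-' marks a gap column.

    A run of k gap columns in the same row costs gap_open + gap_extend * k.
    """
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    total = 0
    gap_side = None  # which row holds the currently open gap run, if any
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            side = 1 if a == "-" else 2
            if side != gap_side:
                total -= gap_open
                gap_side = side
            total -= gap_extend
        else:
            total += SCORE[(a, b)]
            gap_side = None
    return total


print(alignment_score("ACGT", "ACGT"))      # four matches
print(alignment_score("AC--GT", "ACTTGT"))  # four matches and one gap of length 2
```

With these values, four matched bases cannot pay for even a short gap, which is why gapped extension is only attempted for segments that already score well without gaps.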

Modified BLASTZ The modified BLASTZ algorithm 1)Remove recent repeated elements 2)Run BLASTZ 3)Adjust positions in the alignment to refer to the original sequences 4) Filter the alignments

Modified BLASTZ Step 1 (an addition to BLASTZ) I. Y. Lee, D. Westaway, A. F. Smit, K. Wang, J. Seto, L. Chen, C. Acharya, M. Ankener, D. Baskin, C. Cooper, et al., “Complete Genomic Sequence and Analysis of the Prion Protein Gene Region from Three Mammalian Species,” Genome Research, 1998; 8: 1022 – [Figure: Sequence 1 vs. Sequence 2] Why?

Modified BLASTZ Step 2 (a modification of BLASTZ) Find matching 12-mers between the two sequences; each 12-mer match allows a transition (A-G, G-A, C-T, or T-C) in any one of the 12 positions. Extend the induced alignment in each direction, not allowing gaps. Stop extending when the score decreases by more than some threshold. [Figure: a 12-mer match between Sequence 1 and Sequence 2]
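
The seed rule on this slide (identical 12-mers, or 12-mers that differ by a single transition) can be checked with a small helper. The function name `seed_match` is hypothetical, and the sketch only tests two given words; it does not show how BLASTZ indexes or looks up seeds.

```python
# Transitions swap purines (A <-> G) or pyrimidines (C <-> T).
TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}


def seed_match(word1, word2):
    """True if two equal-length words are identical or differ by one transition."""
    if len(word1) != len(word2):
        return False
    diffs = [(a, b) for a, b in zip(word1, word2) if a != b]
    if not diffs:
        return True
    return len(diffs) == 1 and TRANSITION.get(diffs[0][0]) == diffs[0][1]


print(seed_match("ACGTACGTACGT", "ACGTACATACGT"))  # one G-A transition
print(seed_match("ACGTACGTACGT", "ACGTACCTACGT"))  # one G-C transversion
```

Tolerating one transition raises sensitivity compared with exact 12-mer matching, because transitions are the most common substitution between related sequences.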

Modified BLASTZ Step 2 (a modification of BLASTZ) If the gap-free alignment scores more than 3000, repeat the extension step, but allow for gaps. Retain the alignment if it scores above 5000. [Figure: a 12-mer match between Sequence 1 and Sequence 2]

Modified BLASTZ Step 3 If the separation l between adjacent alignments is less than 50 kb, repeat Step 2, but using a more sensitive seeding procedure (e.g., 7-mer exact matches) and lower score thresholds both for gap-free alignments (instead of 3000) and for gapped alignments (instead of 5000). [Figure: adjacent alignments between Sequence 1 and Sequence 2 separated by a distance l]

Modified BLASTZ Step 4: Adjust sequence positions in the resulting alignments to make them refer to the original sequences. Step 5: Filter the alignments as appropriate for particular purposes. For example, apply axtBest to find the best way to align each aligned human position. [Figure: of several candidate alignments between Sequence 1 and Sequence 2, the best one is chosen]
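
The idea behind the Step 5 filter, keeping only the best alignment for each position of the first sequence, can be sketched as follows. This is not axtBest's actual algorithm: the 10,000-base sliding window is left out, and the tuple layout `(start, end, score)` with half-open coordinates is an assumption made for the example.

```python
def best_per_position(alignments, length):
    """Pick the best-scoring alignment covering each position of sequence 1.

    alignments: list of (start, end, score), half-open on sequence 1.
    Returns a list whose entry p is the index of the winning alignment,
    or None if no alignment covers position p.
    """
    best = [None] * length
    for idx, (start, end, score) in enumerate(alignments):
        for p in range(max(start, 0), min(end, length)):
            if best[p] is None or score > alignments[best[p]][2]:
                best[p] = idx
    return best


candidates = [(0, 5, 3000), (3, 8, 5000)]
print(best_per_position(candidates, 10))
```

Where the two candidates overlap (positions 3 and 4), the higher-scoring one wins, so each covered position ends up assigned to exactly one alignment.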

Modified BLASTZ Two changes to BLASTZ significantly improved its execution speed for aligning entire genomes. When the program realizes that many regions of the mouse genome align to the same human segment, that segment is dynamically masked (Step 1 of the modified BLASTZ). BLASTZ seeds alignments with 8-mers, whereas the modified BLASTZ seeds them with 12-mers (Step 2 of the modified BLASTZ).

Implementation Issues and Hardware Environment 許秉慧

Implementation Issues [Figure: human sequence (labels Base 1, Base 2, Base 3; 10 kb) compared with the mouse sequence, showing gap-free segment scores; numeric labels not recoverable]

Implementation Issues and Hardware Environment Input: 2.8 Gb of human sequence vs. 2.5 Gb of mouse sequence. Hardware: a cluster of Pentium III CPUs. Time: 481 days of CPU time, half a day of wall-clock time.

Software Evaluation 許秉慧

Software Evaluation Different classes of parameters and thresholds might best be tested in different ways. Here, the mouse sequence is reversed to measure specificity.

Reverse Mouse Sequence [Figure: the human sequence aligned against the mouse sequence (true match) and against the reversed mouse sequence (spurious match). A microsatellite sequence such as cacaca reads acacac when reversed, so it can still produce a spurious match.]
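
The control on this slide reverses the mouse sequence without complementing it, which is different from taking the opposite strand. The short sketch below contrasts the two operations; the function names are illustrative, and it only demonstrates why simple repeats survive reversal, not how the coverage figures were computed.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse(seq):
    """Reverse the base order only (the specificity control)."""
    return seq[::-1]


def reverse_complement(seq):
    """Opposite strand: reverse the order and complement each base."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


print(reverse("GATTACA"), reverse_complement("GATTACA"))
# A CA-repeat microsatellite still contains CA repeats after reversal.
print(reverse("CACACA"), "CACA" in reverse("CACACA"))
```

Because reversed DNA is not biologically related to the original, any alignment to it is spurious by construction, while low-complexity runs like the CA repeat above remain a source of such matches.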

Coverage by Outer Alignment
Score   1 Mus   >1 Mus   1 Rev    >1 Rev
…       …%      2.340%   0.084%   0.080%
…       …%      2.230%   0.040%   0.074%
…       …%      1.975%   0.016%   0.059%
…       …%      1.829%   0.013%   0.051%
…       …%      1.697%   0.011%   0.043%
…       …%      1.586%   0.010%   0.037%
…       …%      1.490%   0.008%   0.033%
…       …%      1.405%   0.007%   0.030%
%0.164% % 0.075% % 0.037%

Coverage by Outer Alignment [Figure: labels not recoverable]

Comparison of Genome Coverage
                chr20    CDS      3’ UTR   5’ UTR   upstream
Blastz all      40.5%    98.5%    87.1%    89.0%    87.2%
Blastz tight     5.6%    92.5%    26.0%    39.6%    28.3%
PH all          29.7%    95.5%    55.0%    59.3%    52.5%
PH tight         5.0%    91.2%    25.1%    36.3%    25.2%
Transl. BLAT     5.8%    90.3%    29.2%    38.4%    27.2%

Comparison of Covered Region
               All      Tight
Blastz only    54.1%    12.2%
PH only        10.2%     3.3%
Both           35.7%    85.5%

Resources B. Giardine, C. Riemer, R. C. Hardison, R. Burhans, L. Elnitski, P. Shah, Y. Zhang, D. Blankenberg, I. Albert, W. Miller, W. J. Kent, and A. Nekrutenko Galaxy: A Platform for Interactive Large-scale Genome Analysis Genome Research, 2005; 15: 1451–1455 宋建均、鄭智懷

What is Galaxy? Galaxy is a tool that allows users to gather and manipulate data from existing resources in a variety of ways. It offers three major classes of data manipulation: query operations, sequence analysis tools, and output displays.

Why Do We Need Galaxy? 1. Galaxy differs from existing systems in its specificity for access to, and comparative analysis of, genomic sequences and alignments. 2. Programming experience is not required. 3. Galaxy is web-based software that can handle large sequence data sets.

Query Operations Complement: compiles a list of regions that do not overlap with the current query (requires UCSC library). Restrict: filters data based on chromosome name and region size (requires UCSC library). Merge overlapping regions: overlapping regions within a single query are consolidated into fewer, larger regions (requires UCSC library). Intersect: finds overlapping regions between two queries (requires UCSC library). Union: finds all regions that are covered by either of the queries, and returns either merged regions or the original regions from one of the queries (requires UCSC library).

Query Operations Join Lists: joins two queries side by side to allow performing statistical analyses (requires UCSC library). Cluster: finds clusters of regions within specified distance of each other (requires UCSC library). Proximity: finds regions of one query within a specified distance of regions from another query (requires UCSC library). Subtract: subtracts regions of one query from another query (requires UCSC library). Join Same Coordinates Region: joins two queries, which have the same coordinates, side by side to allow performing statistical analyses (requires UCSC library).
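
Several of these query operations reduce to simple interval arithmetic. The sketch below shows merge, intersect, and subtract on `(start, end)` pairs; it assumes half-open coordinates on a single chromosome and is not Galaxy's implementation, which relies on the UCSC library. All function names are illustrative.

```python
def merge(regions):
    """Consolidate overlapping or touching regions into fewer, larger ones."""
    merged = []
    for start, end in sorted(regions):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def intersect(a, b):
    """Return the overlapping parts of regions in a and regions in b."""
    out = []
    for s1, e1 in a:
        for s2, e2 in b:
            s, e = max(s1, s2), min(e1, e2)
            if s < e:
                out.append((s, e))
    return sorted(out)


def subtract(a, b):
    """Return the parts of each region in a that no region in b covers."""
    out = []
    blockers = merge(b)
    for s, e in a:
        cur = s
        for bs, be in blockers:
            if be <= cur or bs >= e:
                continue
            if bs > cur:
                out.append((cur, bs))
            cur = max(cur, be)
        if cur < e:
            out.append((cur, e))
    return out


print(merge([(5, 10), (1, 6), (20, 30)]))
print(intersect([(1, 10)], [(5, 15)]))
print(subtract([(1, 10)], [(3, 5), (8, 12)]))
```

Under the same assumptions, the union of two queries is `merge(a + b)`, and the complement within a chromosome of a given length is `subtract([(0, length)], a)`.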

Sequence Analysis Tools Extract sequences: uses a Perl wrapper written around fasta-subseq to extract sequences corresponding to BED file coordinates. Uses the alignseq.loc file to locate genomic sequences. Requires PATH to include the fasta-subseq location (requires Perl). Extract blastZ alignments: uses a Perl wrapper for extractAxt (developed by Rico) to extract genomic alignments corresponding to BED file coordinates. Uses alignseq.loc to find axt files. Requires PATH to include the extractAxt location (requires Perl).
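
Extracting sequences for BED coordinates amounts to slicing each chromosome at the listed positions. The sketch below illustrates this with an in-memory dictionary of sequences; it is not the fasta-subseq wrapper described above, and the function name and the `genome` dictionary are assumptions. BED start coordinates are 0-based and end coordinates are exclusive, which maps directly onto Python slices.

```python
def extract_bed_regions(bed_text, genome):
    """Return (chrom, start, end, sequence) for each BED line.

    bed_text: BED-formatted text; only the first three columns are used.
    genome: dict mapping chromosome name to its sequence string.
    """
    out = []
    for line in bed_text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and UCSC track/browser header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        start, end = int(start), int(end)
        out.append((chrom, start, end, genome[chrom][start:end]))
    return out


genome = {"chr1": "ACGTACGTAC"}
bed = "track name=example\nchr1\t2\t6\nchr1\t0\t4\n"
for region in extract_bed_regions(bed, genome):
    print(region)
```

A real tool would read the sequences from indexed FASTA files on disk instead of holding whole chromosomes in memory.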

Output Displays UCSC and Ensembl Genome Browsers; EncodeDB at NHGRI; EnsMart at the Sanger Centre

Language CGI: Perl. Core: C. Database: SQL.

Other Features Asynchronous query User identity: cookies & assigning a sequential ID number to each terminal

Demo