Genomics Method Seminar - BWA

Presentation transcript:

Genomics Method Seminar - BWA. October 15, 2014. Sora Kim, Researcher (sora15@yuhs.ac), Yonsei Biomedical Science Institute, Yonsei University College of Medicine. Hello, I am Sora Kim, and I will be presenting today. As mentioned in my earlier email, this talk will focus on the alignment method of Bowtie. If anything is unclear along the way, please feel free to ask at any time.

Today’s paper. Heng Li, Ph.D., is a research scientist at the Broad Institute, working with David Reich and David Altshuler. He is the principal developer of several projects including SAMtools, BWA, MAQ, TreeSoft and TreeFam, most of which were started when he was a postdoctoral fellow with Richard Durbin at the Wellcome Trust Sanger Institute. First, here is the publication information for the tool presented today, along with its corresponding author. The Bowtie paper was published in Genome Biology in 2009, and the journal currently has an impact factor of 10.5. The corresponding author, Professor Salzberg, is at the Johns Hopkins University School of Medicine; his lab produced Bowtie as well as tools widely used in RNA-seq analysis, such as TopHat and Cufflinks.

Software information
Purpose: BWA-MEM is a new alignment algorithm for aligning sequence reads or assembly contigs against a large reference genome such as human.
Category: aligner
Software URL: http://bio-bwa.sourceforge.net/
License: free, open source (GPLv3)
To start with the basic facts about the program: the purpose statement is taken from the paper itself. As it says, Bowtie is an alignment program built to align short reads to a reference very quickly and with little memory.

ChIP-seq RNA-seq WGS, WES

Previous work: Bowtie (BWT + FM index, LF mapping, backtracking). Depending on the index data structure, a larger index costs more time and memory to build, and mapping against it also takes longer because more data has to be examined. To address this, Bowtie applies an algorithm that had not been used before for this purpose and introduces a new indexing scheme.
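The BWT, FM index and LF mapping named on this slide can be illustrated with a toy backward search. This is a minimal sketch, not code from Bowtie or BWA: the function names are hypothetical, the BWT is built by sorting rotations (only workable for tiny inputs), and the occurrence table is stored in full rather than sampled as a real index would do.

```python
def bwt_from_text(text):
    """Return (bwt, suffix_array) of text + '$' by sorting rotations (toy inputs only)."""
    s = text + "$"
    suffix_array = sorted(range(len(s)), key=lambda i: s[i:] + s[:i])
    bwt = "".join(s[i - 1] for i in suffix_array)
    return bwt, suffix_array


def build_fm_index(bwt):
    """Build C (count of smaller characters) and occ (prefix counts per character)."""
    alphabet = sorted(set(bwt))
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += bwt.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return C, occ


def backward_search(pattern, C, occ, n):
    """Return the half-open suffix-array interval [lo, hi) of rows prefixed by pattern."""
    lo, hi = 0, n
    for c in reversed(pattern):
        if c not in C:
            return 0, 0
        # LF mapping applied to both ends of the current interval
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return 0, 0
    return lo, hi


def find_all(pattern, text):
    """Return sorted start positions of exact occurrences of pattern in text."""
    bwt, sa = bwt_from_text(text)
    C, occ = build_fm_index(bwt)
    lo, hi = backward_search(pattern, C, occ, len(bwt))
    return sorted(sa[k] for k in range(lo, hi))
```

For example, `find_all("ana", "banana")` walks the pattern from its last character to its first and narrows the suffix-array interval at each step; the pattern length, not the text length, bounds the number of steps.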

Conceptual Overview
BWA: for short reads
BWA-SW: for long reads
BWA-MEM: for both
To state the conclusion first: with the algorithm newly applied in Bowtie, the index files are much smaller than those of earlier indexing schemes, and this makes fast alignment possible.

CUSHAW2 - MEMs. Long read alignment based on maximal exact match seeds, Yongchao Liu and Bertil Schmidt, Bioinformatics (2012) 28 (18): i318-i324. CUSHAW2 is a parallelized, accurate, and memory-efficient long read aligner. It is based on the seed-and-extend approach and uses maximal exact matches as seeds to find gapped alignments.

CUSHAW2 - MEMs
1. Estimation of the minimal seed size
2. Generation of maximal exact matches

1. Estimation of the minimal seed size. The q-gram lemma states that two strings P and S with an edit distance of e share at least t q-grams, that is, substrings of length q, where $t = \max(|P|, |S|) - q + 1 - q \cdot e$ (Exact and complete short-read alignment to microbial genomes using Graphics Processing Unit programming, Bioinformatics, Vol. 27 no. 10 2011, pages 1351–1358). In other words, each error may destroy up to q overlapping q-grams, so e errors may destroy up to $q \cdot e$ of them. For non-overlapping q-grams, one error can destroy only the q-gram in which it is located. Given this assumption, we define the length q of the q-grams as the largest value below $\frac{m}{e+1}$ such that

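1. Estimation of the minimal seed size (example). Let A = ACGT and B = ACTT, and assume q = 2 and e = 1. Then q(A) = {AC, CG, GT} and q(B) = {AC, CT, TT}, and
$t = \max(|A|, |B|) - q + 1 - q \cdot e = \max(4, 4) - 2 + 1 - 2 \cdot 1 = 1$.
So q(A) and q(B) must share at least t = 1 q-gram, and they do: both contain AC.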
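The worked example above can be checked mechanically. The sketch below uses hypothetical helper names; it computes the lower bound t from the q-gram lemma and counts the q-grams two strings actually share (as a multiset intersection).

```python
from collections import Counter


def qgrams(s, q):
    """Return all overlapping substrings of length q, left to right."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def qgram_lower_bound(p, s, q, e):
    """Lower bound t on shared q-grams for two strings within edit distance e."""
    return max(len(p), len(s)) - q + 1 - q * e


def shared_qgrams(p, s, q):
    """Count q-grams common to both strings, respecting multiplicity."""
    common = Counter(qgrams(p, q)) & Counter(qgrams(s, q))
    return sum(common.values())
```

With A = ACGT, B = ACTT, q = 2 and e = 1, `qgram_lower_bound` gives 1 and `shared_qgrams` gives 1, so the bound is tight in this case.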

1. Estimation of the minimal seed size. The estimation is based on the pigeonhole principle for non-overlapping q-grams, meaning that at least one q-gram of length Q is shared by S and its aligned substring mate on the genome. Q is bounded by a global lower bound QL (default 13) and a global upper bound QH (default 49). A simplified error model for ungapped alignments is employed to estimate e; in this model, w follows a binomial distribution.
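The pigeonhole argument can be sketched as follows. The clamping rule used here, Q = min(max(floor(|S| / (e + 1)), QL), QH), is an assumption about how the two bounds are applied, and the function names are hypothetical. The second helper splits a read into e + 1 non-overlapping pieces and reports which pieces contain no error, which is the property the seed size relies on.

```python
def minimal_seed_size(read_len, e, q_low=13, q_high=49):
    """Pigeonhole seed size, clamped to [q_low, q_high] (clamping rule is an assumption)."""
    q = read_len // (e + 1)
    return min(max(q, q_low), q_high)


def error_free_pieces(read_len, e, error_positions):
    """Split [0, read_len) into e + 1 pieces and return those with no error inside."""
    k = e + 1
    bounds = [i * read_len // k for i in range(k + 1)]
    pieces = list(zip(bounds[:-1], bounds[1:]))
    return [
        (start, end)
        for start, end in pieces
        if not any(start <= pos < end for pos in error_positions)
    ]
```

With at most e errors spread over e + 1 pieces, the returned list is never empty. Note that when the lower bound QL overrides the pigeonhole value, the guarantee no longer holds strictly, which is a trade-off in favour of seed specificity.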

2. Generation of maximal exact matches. To identify MEMs between S and T, we advance the starting position p in S, from left to right, to find the longest exact matches (LEMs) using the BWT and the FM-index. An LEM is kept as a MEM (maximal on both the right and the left) if it is not part of any previously identified MEM. MEMs whose lengths are less than Q are discarded. For each remaining MEM, only its first h occurrences (h = 1024 by default) are kept and the others are discarded.
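The left-to-right scan described above can be sketched with plain substring search standing in for the BWT and FM-index. This is an illustrative brute-force version with hypothetical function names, not the CUSHAW2 implementation; it is quadratic or worse and only meant for toy inputs.

```python
def longest_exact_match(S, T, p):
    """Length of the longest prefix of S[p:] that occurs somewhere in T."""
    length = 0
    while p + length < len(S) and S[p:p + length + 1] in T:
        length += 1
    return length


def occurrences(sub, T, limit):
    """First `limit` start positions of sub in T, left to right."""
    hits = []
    start = T.find(sub)
    while start != -1 and len(hits) < limit:
        hits.append(start)
        start = T.find(sub, start + 1)
    return hits


def find_mems(S, T, Q, h=1024):
    """Return (start_in_S, length, positions_in_T) for each MEM of length >= Q."""
    mems = []
    furthest_end = 0  # end of the right-most match seen so far in S
    for p in range(len(S)):
        length = longest_exact_match(S, T, p)
        end = p + length
        # An LEM that ends no further than an earlier one lies inside it: skip it.
        if length > 0 and end > furthest_end:
            furthest_end = end
            if length >= Q:
                mems.append((p, length, occurrences(S[p:end], T, h)))
    return mems
```

For S = "ACGTAGGCC" and T = "AAACGTACGTTTGGCCA" with Q = 3, the scan reports the two matches "ACGTA" and "GGCC" and skips the shorter matches nested inside them.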

BWA-MEM
Aligning a single query sequence (SE): seeding and re-seeding; chaining and chain filtering; seed extension.
Paired-end mapping (PE): rescuing missing hits; pairing.
Bowtie can be divided into two parts, indexing and matching. The representative methods are the BWT/FM index for indexing, and LF mapping and backtracking for matching. I will explain each of them in detail.

SE. Seeding and re-seeding. BWA-MEM follows the canonical seed-and-extend paradigm. It seeds an alignment with SMEMs (super-maximal exact matches): for each query position, the SMEM is essentially the longest exact match covering that position. To reduce mismappings caused by missing seeds, BWA-MEM introduces re-seeding. Suppose we have an SMEM of length l with k occurrences in the reference genome. By default, if l is too large (over 28bp), BWA-MEM re-seeds with the longest exact matches that cover the middle base of the SMEM and occur at least k+1 times in the genome.

SE. Chaining and chain filtering. We call a group of seeds that are colinear and close to each other a chain. BWA-MEM greedily chains the seeds while seeding and then filters out short chains that are largely contained in a long chain and are much worse than the long chain (by default, both 50% and 38bp shorter than the long chain). Chain filtering aims to reduce unsuccessful seed extension at a later step. Chains detected here do not need to be accurate.
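The filtering rule can be sketched as below. The chain representation (a dict with query start, query end and a weight in bp), the function name, and the reading of "largely contained" as an overlap of at least half of the shorter chain's query span are all assumptions made for illustration; only the 50% and 38bp thresholds come from the slide.

```python
def filter_chains(chains, drop_ratio=0.5, min_diff=38, mask_level=0.5):
    """Keep chains unless a longer kept chain largely contains and clearly beats them."""
    kept = []
    for chain in sorted(chains, key=lambda c: c["weight"], reverse=True):
        span = chain["qend"] - chain["qbeg"]
        dominated = False
        for longer in kept:
            overlap = min(chain["qend"], longer["qend"]) - max(chain["qbeg"], longer["qbeg"])
            largely_contained = span > 0 and overlap >= mask_level * span
            # "Much worse" requires BOTH conditions: under 50% and at least 38bp shorter.
            much_worse = (
                chain["weight"] < drop_ratio * longer["weight"]
                and longer["weight"] - chain["weight"] >= min_diff
            )
            if largely_contained and much_worse:
                dominated = True
                break
        if not dominated:
            kept.append(chain)
    return kept
```

Because a chain must fail both thresholds to be dropped, a chain that is contained in a longer one but still reasonably long (for example 70bp against 100bp) survives and can be extended later.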

SE. Seed extension. Seeds are ranked by the length of the chain they belong to and then by seed length. Going through the seeds in that order, BWA-MEM drops a seed if it is already contained in an alignment found before, or extends the seed with banded affine-gap-penalty dynamic programming (DP) if it potentially leads to a new alignment.

SE. Seed extension banded affine-gap-penalty dynamic programming
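As an illustration of banded affine-gap-penalty DP, here is a small Gotoh-style scorer. It is a sketch, not BWA-MEM's extension routine: it computes a global alignment score between two whole strings rather than extending from a seed, the function name is hypothetical, and the default scores (match 1, mismatch 4, gap open 6, gap extension 1) are assumed values. A gap of length k costs gap_open + k * gap_ext.

```python
NEG_INF = float("-inf")


def banded_affine_global(query, ref, band=10, match=1, mismatch=-4,
                         gap_open=-6, gap_ext=-1):
    """Global alignment score with affine gaps, restricted to |i - j| <= band."""
    n, m = len(query), len(ref)
    # M: ends in an aligned pair; X: ends with a query base against a gap;
    # Y: ends with a reference base against a gap.
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, min(n, band) + 1):
        X[i][0] = gap_open + i * gap_ext
    for j in range(1, min(m, band) + 1):
        Y[0][j] = gap_open + j * gap_ext
    for i in range(1, n + 1):
        # Only cells inside the band are filled; the rest stay at -inf.
        for j in range(max(1, i - band), min(m, i + band) + 1):
            s = match if query[i - 1] == ref[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + s
            X[i][j] = max(M[i - 1][j] + gap_open + gap_ext,
                          X[i - 1][j] + gap_ext,
                          Y[i - 1][j] + gap_open + gap_ext)
            Y[i][j] = max(M[i][j - 1] + gap_open + gap_ext,
                          Y[i][j - 1] + gap_ext,
                          X[i][j - 1] + gap_open + gap_ext)
    return max(M[n][m], X[n][m], Y[n][m])
```

The band limits work to roughly (2 * band + 1) cells per row. If the two lengths differ by more than the band, no path reaches the final cell and the function returns negative infinity.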

SE. Seed extension. BWA-MEM’s seed extension differs from the standard seed extension in two aspects. First, suppose at a certain extension step we come to reference position x with the best extension score achieved at query position y. Similar to the X-dropoff in BLAST, if the difference between the best score so far and the score at the next position is larger than a value computed from the gap extension penalty and an arbitrarily chosen cutoff Z, extension stops immediately and the result is reported. This can be faster than extending to the very end of the region only to find and return a maximal score that turns out to be poor. Second, while extending a seed, BWA-MEM tries to keep track of the best extension score reaching the end of the query sequence. BWA-MEM favors end-to-end alignment where possible: if the difference between the best score of the local alignment and that of the end-to-end alignment is below a threshold, the end-to-end result is reported in preference.
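The two rules can be written as small predicates. The exact form of the stopping condition used here, best - current > Z + |x - y| * gap_ext, is an assumption about how Z and the gap extension penalty are combined; the function names and the default values for Z and the end-to-end threshold are likewise illustrative.

```python
def should_stop_extension(best_score, current_score, x, y, z_drop=100, gap_ext=1):
    """Dropoff test: stop when the score has fallen too far below the best seen."""
    return best_score - current_score > z_drop + abs(x - y) * gap_ext


def choose_alignment(best_local, best_to_end, threshold=5):
    """Prefer the end-to-end alignment unless the local one is clearly better."""
    if best_local - best_to_end < threshold:
        return "end-to-end"
    return "local"
```

The |x - y| term makes the test more tolerant when the current cell lies far off the diagonal, so that a long gap alone does not trigger an early stop.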

PE. Rescuing missing hits. BWA-MEM estimates the mean $\mu$ and the variance $\sigma^2$ of the insert size distribution from reliable single-end hits. For the top 100 hits (by default) of either end, if the mate is unmapped in a window $[\mu - 4\sigma, \mu + 4\sigma]$ from each hit, BWA-MEM performs SSE2-based Smith-Waterman alignment for the mate within the window. In other words, the top 100 hits are taken for each end, and the SW algorithm is run only for the mate inside the window.
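A minimal sketch of the window computation follows. The helper names are hypothetical, the population standard deviation is an arbitrary choice, and the window is placed only downstream of the hit for simplicity; a real aligner also has to account for strand and orientation.

```python
import statistics


def insert_size_stats(insert_sizes):
    """Mean and standard deviation of insert sizes from reliable pairs."""
    mu = statistics.mean(insert_sizes)
    sigma = statistics.pstdev(insert_sizes)
    return mu, sigma


def rescue_window(hit_pos, mu, sigma, n_sigma=4):
    """Reference interval, relative to a hit, in which the mate is expected."""
    lo = max(0, int(mu - n_sigma * sigma))
    hi = int(mu + n_sigma * sigma)
    return hit_pos + lo, hit_pos + hi


def needs_rescue(mate_pos, window):
    """True if the mate is unmapped (None) or mapped outside the window."""
    if mate_pos is None:
        return True
    return not (window[0] <= mate_pos <= window[1])
```

Only mates for which `needs_rescue` is true would be passed to the Smith-Waterman step, and only the reference sequence inside the window would be searched.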

PE. Rescuing missing hits Hits found from both the single-sequence alignment and SW rescuing will be used for pairing.

PE. Pairing. Given the i-th hit for the first read and the j-th hit for the second read, BWA-MEM computes their distance $d_{ij}$ if the two hits are in the right orientation, or sets $d_{ij}$ to infinity otherwise. It scores the pair (i, j) as
$$S_{ij} = S_i + S_j - \min\{-a \log_4 P(d_{ij}),\ U\}$$
where $S_i$ and $S_j$ are the SW scores of the two hits and a is the matching score. $P(d)$ gives the probability of observing an insert size larger than d assuming a normal distribution. The $\log_4$ arises when the SW score is interpreted as an odds ratio. U is a threshold that controls pairing: if $d_{ij}$ is small enough that $-a \log_4 P(d_{ij}) < U$, BWA-MEM prefers to pair the two ends; otherwise it prefers the unpaired alignments.
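The pairing score can be evaluated directly from its definition. In this sketch, P(d) is the one-sided upper tail of a normal distribution, as stated on the slide; the function names and the default values a = 1 and U = 17 are assumptions for illustration.

```python
import math


def tail_prob(d, mu, sigma):
    """P(insert size > d) under a normal distribution with mean mu and s.d. sigma."""
    return 0.5 * math.erfc((d - mu) / (sigma * math.sqrt(2)))


def pair_score(s_i, s_j, d, mu, sigma, a=1, U=17):
    """S_ij = S_i + S_j - min(-a * log4(P(d)), U); d = inf means wrong orientation."""
    if math.isinf(d):
        penalty = U
    else:
        p = tail_prob(d, mu, sigma)
        # Guard against log(0) when the tail probability underflows.
        penalty = U if p <= 0.0 else min(-a * math.log(p, 4), U)
    return s_i + s_j - penalty
```

At d equal to the mean, P(d) = 0.5 and the penalty is 0.5 * a. As d grows the penalty rises until it is capped at U, at which point the pair scores no better than two unpaired hits charged the fixed penalty.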

Results

Running Operation MEM mode

SAM format - spec

SAM format - example

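Discussion. For reads that are clearly long (100bp or more), the MEM mode is mainly used; for short reads under 100bp, aln is recommended. Because of seed extension and local alignment, alignments can come out split into more pieces than necessary, so the options need tuning to correct or post-process the results.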