VCF format: variants c.f. S. Brown NYU

VCF format: variants (c.f. S. Brown, NYU)
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT (sample)
chr1 901559 . G A 23 . DP=3;VDB=0.0298;AF1=1;AC1=2;DP4=0,0,3,0;MQ=60;FQ=-36 GT:PL:GQ 1/1:55,9,0:15
INFO keys:
- AA: ancestral allele
- AC: allele count in genotypes, for each ALT allele, in the same order as listed
- AF: allele frequency for each ALT allele, in the same order as listed; use this when estimated from primary data, not called genotypes
- AN: total number of alleles in called genotypes
- BQ: RMS base quality at this position
- CIGAR: cigar string describing how to align an alternate allele to the reference allele
- DB: dbSNP membership
- DP: combined depth across samples, e.g. DP=154
- END: end position of the variant described in this record (esp. for CNVs)
- H2: membership in HapMap2
- MQ: RMS mapping quality, e.g. MQ=52
- MQ0: number of MAPQ == 0 reads covering this record
- NS: number of samples with data
- SB: strand bias at this position
- SOMATIC: indicates that the record is a somatic mutation, for cancer genomics
- VALIDATED: validated by follow-up experiment
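To make the column layout concrete, here is a minimal Python sketch that splits the example record into its fields. It assumes tab-separated columns in the standard VCF order; the function name parse_vcf_line and the dictionary layout are illustrative choices, not part of any library.

```python
def parse_vcf_line(line):
    """Split one VCF data line into a dict (hypothetical helper, not a library API)."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]

    # INFO is a ';'-separated list of KEY=VALUE pairs or bare flags (e.g. DB, SOMATIC).
    info_dict = {}
    for item in info.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            info_dict[key] = value
        elif item and item != ".":
            info_dict[item] = True

    record = {
        "CHROM": chrom,
        "POS": int(pos),
        "ID": vid,
        "REF": ref,
        "ALT": alt.split(","),
        "QUAL": None if qual == "." else float(qual),
        "FILTER": flt,
        "INFO": info_dict,
    }

    # Column 9 (FORMAT) names the ':'-separated keys used by every sample column.
    if len(fields) > 9:
        keys = fields[8].split(":")
        record["SAMPLES"] = [dict(zip(keys, s.split(":"))) for s in fields[9:]]
    return record


example = ("chr1\t901559\t.\tG\tA\t23\t.\t"
           "DP=3;VDB=0.0298;AF1=1;AC1=2;DP4=0,0,3,0;MQ=60;FQ=-36\t"
           "GT:PL:GQ\t1/1:55,9,0:15")
rec = parse_vcf_line(example)
print(rec["CHROM"], rec["POS"], rec["INFO"]["DP"], rec["SAMPLES"][0]["GT"])
```

Values are left as strings on purpose; a real parser would use the types declared in the VCF header.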

Alignment of NGS reads: what are the challenges? c.f. S. Brown NYU

Aligning millions of short reads
- Computationally intensive
- Aligning to a reference genome => mapping the reads
- Aligning de novo => genome assembly
- Either way, using something like Smith-Waterman (S-W) or BLAST would take too long, so aligners modify them to take shortcuts (heuristics).
- Heuristics include indexing methods
- Length of each read = 50-300 bases
- Number of reads = 10^7 to 10^8
- Genome length = 3 x 10^9 bases, or double if diploid
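A rough calculation shows why exhaustive alignment is out of reach at this scale. The sketch below counts dynamic-programming cell updates for Smith-Waterman of every read against the whole genome; the read length of 100 and the throughput of 10^9 cell updates per second are assumptions chosen only for illustration.

```python
# Back-of-the-envelope cost of aligning every read against the whole genome
# with full dynamic programming (one cell per read base x genome base).
reads = 10**7               # lower end of the range quoted above
read_length = 100           # assumed; within the 50-300 range
genome_length = 3 * 10**9   # haploid human genome

cells_per_read = read_length * genome_length
total_cells = reads * cells_per_read

cells_per_second = 10**9    # hypothetical throughput, an assumption
seconds_per_year = 3600 * 24 * 365
years = total_cells / cells_per_second / seconds_per_year

print(f"{total_cells:.1e} cell updates, roughly {years:.0f} years on one core")
```

Even if the assumed throughput is off by a factor of a hundred, the estimate stays far beyond practical, which is the motivation for the indexing heuristics.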

A short word about indexing
- Just like the index of a book, which lists words with their page numbers, a genome index lists short sequences with their locations on the genome.
- The index file contains index entries made up of a search key value and a pointer to the block in the data file.
- Hashing (keying) describes the contents by a minimum of characters, called seeds, and records where each seed occurs.
- These techniques are very well defined in computer science.
- The end result is that the files take up a lot less space and take a lot less time to search.
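The book-index analogy can be sketched in a few lines of Python: every k-base seed becomes a key, and the value is the list of genome positions where it occurs. The function name build_kmer_index, the toy genome and k = 3 are illustrative assumptions; real aligners use far longer seeds and compact on-disk structures.

```python
from collections import defaultdict


def build_kmer_index(genome, k):
    """Map every k-mer (seed) to the sorted list of its start positions."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index


genome = "ACGTACGTGACG"
index = build_kmer_index(genome, 3)

# Looking up a seed is a single dictionary access instead of a scan of the genome.
print(index.get("ACG", []))
print(index.get("TTT", []))
```

The lookup cost no longer depends on the genome length, which is the point of paying for the index up front.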

Aligning millions of short reads
- Computationally intensive
- Aligning to a reference genome => mapping the reads
- Aligning de novo => genome assembly
- Heuristics include indexing methods; some use ungapped alignment with short words
- BWT (Burrows-Wheeler Transformation) greatly reduces computational time
- Approaches include hash tables, spaced seeds, contiguous seeds, etc.

Mapping
The step of aligning the reads to the reference genome:
- index the reads, then scan them against the reference
- find the reference match that has the fewest mismatches
- calculate the p-value, and assign each read its location
Problems: accuracy, splice junctions, variants
Challenges: false positives, repeats, parental alleles, HapMap

NGS alignment algorithms
- Smith-Waterman
- BLAST
- Enter BWT
- BLAT is precomputed BLAST

MAQ algorithm: Mapping and Aligning with Qualities
- MAQ first indexes read sequences and scans the reference genome to find hits that are extended and scored (first 28 bp with 2 mismatches max), minimizing the sum of quality scores of mismatches.
- Scheme of indexing in hash tables (six noncontiguous seed templates): e.g., for an 8-base fragment with 1 or 2 variants, you still get segments which are fully matched (2 and 6).
c.f. Shamir 2011
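Why six templates are enough for up to two mismatches can be checked exhaustively on the 8-base example. The sketch below assumes the fragment is cut into four 2-base blocks and that each template keeps two of the four blocks (4 choose 2 = 6 templates); this block layout is an assumption made for illustration, not taken from the MAQ source code.

```python
from itertools import combinations

READ_LEN = 8
BLOCKS = [(0, 2), (2, 4), (4, 6), (6, 8)]  # four 2-base blocks (assumed layout)

# Each template indexes the bases of two of the four blocks: 6 templates in total.
TEMPLATES = [
    set(range(*BLOCKS[a])) | set(range(*BLOCKS[b]))
    for a, b in combinations(range(len(BLOCKS)), 2)
]


def templates_untouched(mismatch_positions):
    """Return the templates whose indexed bases contain no mismatch."""
    hit = set(mismatch_positions)
    return [i for i, t in enumerate(TEMPLATES) if not (t & hit)]


# Two mismatches touch at most two blocks, so two blocks stay clean and the
# template built from exactly those two blocks still matches perfectly.
all_covered = all(
    templates_untouched(pair) for pair in combinations(range(READ_LEN), 2)
)
print(len(TEMPLATES), all_covered)
```

With three mismatches spread over three blocks, no template is guaranteed to survive, which is why the scheme is tied to a maximum of two mismatches in the seed.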

MAQ algorithm: heuristic yet thoughtful
- MAQ loads all reads into memory.
- For each read, MAQ creates an integer from the 28 bp templates and stores the read ID and the integer in a hash table.
- When all reads are processed, many reads share the same integer as key.
- The reference is then scanned using 28 bp subsequences. Using the corresponding hashes of the subsequences, we find reads that match and extend them.
- MAQ calculates a score for each extended match and retains the best-scoring hits.
- Once the reference is scanned, the next template is used, until no more templates are left.
c.f. Shamir 2011
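The hash-the-reads, scan-the-reference loop can be sketched as follows. This is a simplified illustration, not MAQ itself: it uses a single contiguous seed (the first seed_len bases) instead of six noncontiguous templates, and it scores hits by mismatch count rather than by the sum of base qualities. The names map_reads and hamming and all parameter values are assumptions.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_reads(reads, reference, seed_len=4, max_mismatches=2):
    # 1. Index the reads: the seed is the key, the read IDs are the values.
    table = defaultdict(list)
    for read_id, read in enumerate(reads):
        table[read[:seed_len]].append(read_id)

    # 2. Scan the reference; every seed hit is extended over the full read.
    best = {}  # read_id -> (position, mismatches)
    for pos in range(len(reference) - seed_len + 1):
        for read_id in table.get(reference[pos:pos + seed_len], []):
            read = reads[read_id]
            window = reference[pos:pos + len(read)]
            if len(window) < len(read):
                continue  # the read would run off the end of the reference
            mm = hamming(read, window)
            # 3. Keep only the best-scoring hit per read.
            if mm <= max_mismatches and (read_id not in best or mm < best[read_id][1]):
                best[read_id] = (pos, mm)
    return best


reference = "TTACGTACGGATCCA"
reads = ["ACGTACGG", "GATCGA"]
hits = map_reads(reads, reference)
print(hits)
```

A read whose seed itself contains a mismatch is missed by this single-seed version; repeating the scan with further templates is what recovers such reads.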

Burrows Wheeler Transformation
- The Burrows–Wheeler transform is an algorithm used in data compression techniques.
- The output is easier to compress because it has many repeated characters.
- In the transformed string of this example, there are a total of eight runs of identical characters: XX, II, XX, SS, PP, .., II, and III, which together make up 17 of the 44 characters.
- It would be great if we could compress our reads like that. How does it work?

Burrows Wheeler Transformation
First we transform the string. The transformed string can now be compressed nicely (i.e. we can do that with our reads too!)
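As a minimal sketch of the forward transform: append an end marker, sort all rotations of the string, and read off the last column. The "$" terminator and the toy input "banana" are assumptions for illustration; this naive version builds every rotation explicitly, whereas real tools derive the same result from a suffix array.

```python
def bwt(text, terminator="$"):
    """Naive Burrows-Wheeler transform: last column of the sorted rotations."""
    if terminator in text:
        raise ValueError("terminator must not occur in the text")
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations)


print(bwt("banana"))  # annb$aa
```

Identical characters tend to end up next to each other (the "nn" and "aa" runs here), which is what makes the output easier to compress.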

Burrows Wheeler Transformation
However, how do we get back to the original reads (i.e. how does the inverse transformation work)?
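One way to see that the transform is reversible is the textbook reconstruction: repeatedly prepend the transformed string as a new first column and re-sort, until the full table of rotations is rebuilt. The sketch below assumes the transform was computed with a "$" end marker that sorts before all other characters; it is quadratic and meant only to illustrate the idea.

```python
def inverse_bwt(transformed, terminator="$"):
    """Rebuild the original text from its BWT by iterative prepend-and-sort."""
    n = len(transformed)
    table = [""] * n
    for _ in range(n):
        # Prepending the BWT column to the sorted rows and sorting again
        # recovers one more column of the rotation matrix per round.
        table = sorted(transformed[i] + table[i] for i in range(n))
    # The row that ends with the terminator is the original string.
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


print(inverse_bwt("annb$aa"))  # banana
```

Practical implementations avoid rebuilding the table and instead follow the last-to-first column mapping, which is also the basis of the search used by Bowtie.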

BowTie
- The matrix has the characteristic that similar rows are clustered together.
- Starting with the query, we find ranges of rows that match the query.
- As characters are successively added to the query, the ranges become smaller.
- A match is found if the query matches the complete row; there is no match if the range = 0.
- Problems arise when mismatches occur.
- Remedy: backtracking (next slide)
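The shrinking range of rows can be sketched with an exact-match backward search over the BWT. This is a simplified illustration of the idea, not Bowtie's implementation: the suffix array is built by plain sorting, occurrence counts are recomputed on every call, and mismatches (the backtracking case) are not handled. The names fm_index, occ and backward_search are assumptions.

```python
def fm_index(text):
    """Return (suffix array, BWT, C) for text + '$'; C[c] counts characters smaller than c."""
    s = text + "$"
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    bwt = "".join(s[i - 1] for i in sa)  # character preceding each sorted suffix
    counts = {}
    for ch in s:
        counts[ch] = counts.get(ch, 0) + 1
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]
    return sa, bwt, C


def occ(bwt, ch, i):
    """Occurrences of ch in bwt[:i] (recomputed each time; real indexes tabulate this)."""
    return bwt[:i].count(ch)


def backward_search(query, sa, bwt, C):
    """Return sorted start positions of exact matches of query in the text."""
    lo, hi = 0, len(bwt)  # current range of matching rows, half-open
    for ch in reversed(query):  # extend the match one character at a time
        if ch not in C:
            return []
        lo = C[ch] + occ(bwt, ch, lo)
        hi = C[ch] + occ(bwt, ch, hi)
        if lo >= hi:  # range is empty: no match
            return []
    return sorted(sa[lo:hi])


sa, bwt, C = fm_index("banana")
print(backward_search("ana", sa, bwt, C))  # [1, 3]
print(backward_search("nab", sa, bwt, C))  # []
```

Each added character can only keep or narrow the row range, so the search costs one step per query base regardless of the genome size.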

BowTie

Dealing with splice junctions (from Garber et al, Nature, 2011)

TopHat

Pros and Cons
- Identifying splice junctions correctly is non-trivial.
- Splice aligners that use exon-first methods are computationally less intensive (e.g. TopHat, which starts with the results of Bowtie2).
- Seed-extend methods are better at finding new isoforms, but could equally come up with false positive splice junctions (e.g. GSNAP).

Identifying transcripts
- For RNA-Seq, might want to identify novel transcripts
- Not trivial
- Ignore isoforms
- Use genome-guided reconstruction, i.e. using reads that span the potential splice site of known genomes (e.g. Cufflinks)
- Use genome-independent reconstruction, i.e. de novo approaches (e.g. Velvet, Trinity)

Finding the real transcripts