Introduction to Short Read Sequencing Analysis


Introduction to Short Read Sequencing Analysis Jim Noonan GENE 760

Sequence read lengths remain limiting
Figure: Chr1 (249 Mb) compared with a hypothetical 249 Mb sequencing read.
Current platforms produce either a moderate number (~500,000) of long reads (~10 kb) or a very large number (>200 M) of short reads (100 bp).
For most applications, reads are aligned to a reference genome.
Short reads contain inherently limited information.
De novo assembly of short reads is difficult.

Determining the identity and location of short sequence reads in the genome/exome/transcriptome
@HWI-ST974:58:C059FACXX:2:1201:10589:110434 1:N:0:TGACCA
TGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCATGGAAAATGAGAAATACACACTTTAGGACGTG
Short reads are aligned to a much larger reference.
This requires a computationally efficient method that aligns millions of reads accurately.
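
The read header shown above follows the colon-separated layout that CASAVA 1.8 writes to FASTQ files. As a minimal sketch, assuming that layout (instrument, run, flowcell, lane, tile, x, y, then read number, filter flag, control bits, index), the fields can be pulled apart as follows. The function name `parse_fastq_header` and the dictionary keys are illustrative choices, not part of any Illumina tool.

```python
def parse_fastq_header(header):
    """Split a CASAVA 1.8 style FASTQ header into named fields.

    Assumed layout (hypothetical helper, not an Illumina API):
    @instrument:run:flowcell:lane:tile:x:y read:filtered:control:index
    """
    if not header.startswith("@"):
        raise ValueError("a FASTQ header line must start with '@'")
    left, right = header[1:].split(" ", 1)
    instrument, run, flowcell, lane, tile, x, y = left.split(":")
    read, filtered, control, index = right.split(":")
    return {
        "instrument": instrument,
        "run": int(run),
        "flowcell": flowcell,
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x),
        "y": int(y),
        "read": int(read),            # 1 or 2 for paired-end data
        "filtered": filtered == "Y",  # "Y" means the read failed the filter
        "control": int(control),
        "index": index,               # barcode (index) sequence
    }


fields = parse_fastq_header(
    "@HWI-ST974:58:C059FACXX:2:1201:10589:110434 1:N:0:TGACCA"
)
print(fields["flowcell"], fields["lane"], fields["index"])
```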

Read length requirements vary depending on the feature being studied
Exome: 80-120 bp
Splice junctions (connectivity)
Transcriptome: 10,000 bp

Determining the identity and location of short sequence reads in the genome/exome/transcriptome
@HWI-ST974:58:C059FACXX:2:1201:10589:110434 1:N:0:TGACCA
TGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCATGGAAAATGAGAAATACACACTTTAGGACGTG
Aligning short reads to a much larger reference: exome or genome; transcriptome.
Considerations: alignment scoring; source of the reads; sequencing format (PE or SE); read length; error rates.

Topics
Scoring alignments
Error rates and quality scores for short read sequencing
Mapability
Common algorithms for short read sequence alignment
Scoring short read sequence alignments
Uniform data output formats

Scoring alignments
Match (+1); mismatch (-1, -2, etc.)
Correct:
TAGATTACACAGATTAC
|||||||||||||||||
TAGATTACACAGATTAC
Wrong:
TAGATTACTCAGA-TAC
|||||||| |||| |||
TAGATTACACAGATTAC
Gap penalty: P = a + bN, where a = cost of opening a gap, b = cost of extending the gap by 1, and N = length of the gap.
Example gaps against ATTAC: A-TAC (N = 1), A--AC (N = 2)
Many short read alignment algorithms allow a fixed number of mismatches.
Adapted from Mark Gerstein

Scoring alignments
Match (+1); mismatch (-1, -2, etc.)
Correct (polymorphism):
TAGATTACTCAGATTAC
|||||||| ||||||||
TAGATTACACAGATTAC
Wrong:
TAGATTACTCAGA-TAC
|||||||| |||| |||
TAGATTACACAGATTAC
Gap penalty: P = a + bN, where a = cost of opening a gap, b = cost of extending the gap by 1, and N = length of the gap.
Example gaps against ATTAC: A-TAC (N = 1), A--AC (N = 2)
Many short read alignment algorithms allow a fixed number of mismatches.
Adapted from Mark Gerstein
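
To make the scoring rules on this slide concrete, here is a minimal sketch that scores an already-aligned pair of strings with match = +1, mismatch = -1 and the affine gap penalty P = a + bN. The values a = 2 and b = 1 are assumptions chosen for illustration (the slide does not give them), and the function names are hypothetical. As a simplification, adjacent gap columns are treated as one gap regardless of which strand carries the gap.

```python
def gap_penalty(gap_length, open_cost=2, extend_cost=1):
    """Affine gap penalty P = a + b*N (a = open_cost, b = extend_cost)."""
    return open_cost + extend_cost * gap_length


def score_alignment(top, bottom, match=1, mismatch=-1, open_cost=2, extend_cost=1):
    """Score two aligned strings of equal length; '-' marks a gap column."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    score = 0
    gap_run = 0  # length of the gap currently being read
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            gap_run += 1
            continue
        if gap_run:  # a gap just ended: charge it once, as a whole
            score -= gap_penalty(gap_run, open_cost, extend_cost)
            gap_run = 0
        score += match if a == b else mismatch
    if gap_run:  # gap that runs to the end of the alignment
        score -= gap_penalty(gap_run, open_cost, extend_cost)
    return score


reference = "TAGATTACACAGATTAC"
print(score_alignment("TAGATTACACAGATTAC", reference))  # identical read
print(score_alignment("TAGATTACTCAGATTAC", reference))  # one mismatch (polymorphism)
print(score_alignment("TAGATTACTCAGA-TAC", reference))  # mismatch plus a 1-base gap
```

Under these assumed costs the ungapped alignment with a single mismatch outscores the gapped one, which matches the slide's labelling of the gapped alignment as the wrong choice.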

Quality scores
A quality score (or Q-score) expresses the probability that a basecall is incorrect.
Given a basecall, A: the estimated probability that A is not correct is P(~A), and the quality score for A is Q(A) = -10 log10(P(~A)).
A quality score of 10 means a probability of 0.1 that A is the wrong basecall.
P(~A) is platform-specific; Q-scores can be compared across platforms.
Quality scores are logarithmic:
Q-score   Error probability
10        0.1
20        0.01
40        0.0001
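
The relationship Q = -10 log10(P) can be checked numerically. The sketch below converts in both directions and reproduces the table rows; the function names are illustrative.

```python
import math


def phred_to_error_prob(q):
    """Error probability for a given quality score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Quality score for a given error probability: Q = -10 * log10(P)."""
    if not 0 < p <= 1:
        raise ValueError("probability must be in (0, 1]")
    return -10 * math.log10(p)


for q in (10, 20, 40):
    print(q, phred_to_error_prob(q))
```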

Error rates in Illumina sequencing reads
Sequencing by synthesis with reversible dye terminators.
One cycle: add base, scan flow cell, reverse termination; then add the next base, and so on.
Individual synthesis reactions go out of phase.

Error rates in Illumina sequencing reads
Error rates are mismatch rates relative to the reference genome.
Error rates increase with increasing cycle number.
Measured error rates are contingent on reference genome quality.
Reads may be trimmed to improve alignment quality.

Illumina quality score encoding in FASTQ format (CASAVA v1.8)
>90% Q30 bases in a high-quality run
>80% mappable reads
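
CASAVA v1.8 stores each quality score as one ASCII character with an offset of 33 (Phred+33). As a rough sketch under that assumption, the following decodes a quality string and reports the fraction of bases at or above Q30. The quality string in the example is made up for illustration, and the function names are hypothetical.

```python
def decode_qualities(quality_string, offset=33):
    """Convert a FASTQ quality string to a list of integer Q-scores.

    offset=33 assumes Phred+33 encoding (CASAVA 1.8 and later).
    """
    return [ord(char) - offset for char in quality_string]


def fraction_at_least(quality_string, threshold=30, offset=33):
    """Fraction of bases whose Q-score is at or above the threshold."""
    scores = decode_qualities(quality_string, offset)
    if not scores:
        raise ValueError("empty quality string")
    return sum(score >= threshold for score in scores) / len(scores)


example = "!5?I"  # hypothetical quality string: Q0, Q20, Q30, Q40
print(decode_qualities(example))
print(fraction_at_least(example))
```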

Sources of error in single-molecule sequencing
Illumina: consensus signal
TAGATTACACAGATTAC
|||||||||||||||||
TAGATTACACAGATTAC
PacBio: single molecule sequencing; errors appear as gaps
TAGATTA-ACAG-TT-C
||||||| |||| || |
TAGATTACACAGATTAC
One molecule, one read.
Sequence templates multiple times.

Mapability
The genome contains non-unique sequences (repeats, segmental duplications).
Short reads derived from repetitive regions are difficult to map.
Figure: a repeat shared by Chr3 and Chr7, shown for longer reads and for paired reads.

Mapability scores at UCSC
The genome contains non-unique sequences (repeats, segmental duplications).
Short reads derived from repetitive regions are difficult to map.
Tracks shown: 36mers, 75mers, and 100mers, each with 2 mismatches.
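
The effect of read length on mapability can be illustrated on a toy sequence. This is a deliberately simplified sketch: it counts exact k-mer matches on one strand only, whereas the tracks above allow 2 mismatches, and the toy genome is invented for the example.

```python
from collections import Counter


def unique_kmer_fraction(genome, k):
    """Fraction of k-mer start positions whose k-mer occurs exactly once.

    Simplifications (assumptions of this sketch): exact matches only,
    forward strand only.
    """
    if k <= 0 or k > len(genome):
        raise ValueError("k must be between 1 and the genome length")
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    return sum(counts[kmer] == 1 for kmer in kmers) / len(kmers)


toy_genome = "AAAAACGT"  # hypothetical sequence with a short repeat
for k in (2, 5):
    print(k, unique_kmer_fraction(toy_genome, k))
```

Longer k-mers span the repeat and reach unique flanking sequence, so the uniquely mappable fraction rises with k.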

Poorly mappable regions of the genome
Tracks shown: 36mers, 75mers, and 100mers, each with 2 mismatches.

Common algorithms for mapping short reads to a reference genome
Program      Website
ELAND (v2)   N/A – integrated into Illumina pipeline
Bowtie       http://bowtie-bio.sourceforge.net/
BWA          http://bio-bwa.sourceforge.net/
Maq          http://maq.sourceforge.net/
Considerations: alignment scoring method; speed; quality awareness; seeding; gapped alignment.

Seed-based alignment strategy
Single seed alignments: the critical values are seed length and number of mismatches allowed. In ELAND: seed length = 32, number of mismatches = 2.
Multiseed alignments (ELAND v2, others): the seed interval is contingent on read length.
Figure labels: Reference, Seed.

Implementation in ELAND v2
A read must have at least one seed with no more than 2 mismatches and no gaps.
Gapped alignment: extend each alignment to the full length of the read, allowing gaps up to 10 bp.
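
The seeding rule can be sketched in a few lines. The code below is not ELAND: it is a brute-force illustration that scans the whole reference for every seed, whereas real aligners use an index. Seed length, interval, and the toy sequences are assumptions chosen to keep the example small.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def extract_seeds(read, seed_length, interval):
    """Return (offset, seed) pairs taken every `interval` bases along the read."""
    last_start = len(read) - seed_length
    return [(off, read[off:off + seed_length])
            for off in range(0, last_start + 1, interval)]


def candidate_positions(read, reference, seed_length=32, interval=32,
                        max_mismatches=2):
    """Reference positions where the read could start, supported by at least
    one ungapped seed hit with no more than `max_mismatches` mismatches."""
    hits = set()
    for offset, seed in extract_seeds(read, seed_length, interval):
        for pos in range(len(reference) - seed_length + 1):
            window = reference[pos:pos + seed_length]
            if hamming(seed, window) <= max_mismatches:
                start = pos - offset  # implied start of the whole read
                if 0 <= start <= len(reference) - len(read):
                    hits.add(start)
    return sorted(hits)


reference = "CCATGACGTTAGGCTAAC"   # hypothetical toy reference
read = "GACGTTCG"                  # differs from reference[4:12] at one base
print(candidate_positions(read, reference, seed_length=4, interval=4,
                          max_mismatches=0))
```

In the toy run the second seed contains the mismatch and finds nothing, but the first seed matches exactly, so the read is still placed. Each candidate would then be extended to the full read length, with gaps allowed, as the slide describes.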

Resolving ambiguous read alignments with multiple seeds Reference Seed

Resolving ambiguous read alignments with multiple seeds

Utility of gapped alignments
RNA-seq
Insertion and deletion variants in exome and whole genome sequencing

Mapping paired end reads
Insert size
Insert size within specified range

ELAND alignment scoring
Base quality values and mismatch positions in a candidate alignment are used to assign a p value.
P values reflect the probability that the candidate position in the genome would give rise to the observed read if its bases were sequenced at error rates corresponding to the read’s quality values.
The alignment score for a read is computed from the p values of all candidate alignments.
If there are two candidates for a read with p values 0.9 and 0.3:
0.9/(0.9+0.3) = 0.75, the chance that the highest scoring alignment is correct
1 - 0.75 = 0.25, the chance that the highest scoring alignment is wrong
Alignment score = -10 log10(0.25) ≈ 6
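
The worked example above generalizes to any number of candidate alignments. This sketch follows the arithmetic on the slide only; it is not ELAND's code, and the function name is hypothetical.

```python
import math


def alignment_score(p_values):
    """Phred-scaled confidence that the best candidate alignment is correct.

    p_values: one p value per candidate alignment of the same read.
    """
    if not p_values:
        raise ValueError("need at least one candidate alignment")
    p_correct = max(p_values) / sum(p_values)
    p_wrong = 1.0 - p_correct
    if p_wrong <= 0.0:
        return math.inf  # a single candidate: no competing alignment
    return -10.0 * math.log10(p_wrong)


print(round(alignment_score([0.9, 0.3]), 2))
```

A read with one strong candidate and only weak alternatives receives a high score; a read with two near-equal candidates scores close to 3, since the best one is wrong about half the time.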

BaseSpace https://basespace.illumina.com/

Spaced-seed indexing of the reference genome
Need to break up the genome into manageable segments.
Create an index of short sequences.
Match seeds against the genome index.
Trapnell and Salzberg, Nat Biotechnology 27:455 (2009)
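
A toy version of a spaced-seed index is sketched below. A mask such as "1101" means that positions marked 1 must match and positions marked 0 are ignored, so a lookup tolerates a mismatch at an ignored position. The mask, the sequences, and the function names are all assumptions for illustration; production aligners use much longer seeds and several masks.

```python
from collections import defaultdict


def apply_mask(sequence, mask):
    """Keep only the characters at positions where the mask has a '1'."""
    return "".join(base for base, bit in zip(sequence, mask) if bit == "1")


def build_spaced_seed_index(reference, mask):
    """Map each masked window of the reference to its start positions."""
    index = defaultdict(list)
    width = len(mask)
    for pos in range(len(reference) - width + 1):
        key = apply_mask(reference[pos:pos + width], mask)
        index[key].append(pos)
    return index


def lookup(index, seed, mask):
    """Reference positions whose masked window equals the masked seed."""
    return index.get(apply_mask(seed, mask), [])


reference = "CCATGACGTTAGGCTAAC"  # hypothetical toy reference
mask = "1101"                     # third position is a "don't care"
index = build_spaced_seed_index(reference, mask)
print(lookup(index, "GACG", mask))  # exact seed
print(lookup(index, "GATG", mask))  # mismatch at the ignored position
```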

Reference genome indexing using Burrows-Wheeler transform
Reversible encoding scheme.
Simplifies genome sequence.
Results in “indexed” genome.
Very rapid alignments.
Trapnell and Salzberg, Nat Biotechnology 27:455 (2009)
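
The reversibility of the Burrows-Wheeler transform can be shown on a short string. This is the textbook construction by sorting all rotations, which is far too slow and memory-hungry for a genome; aligners build the transform from a suffix array and search it with an FM index. The sketch assumes the terminator character `$` does not occur in the text and sorts before every base.

```python
def bwt(text, terminator="$"):
    """Burrows-Wheeler transform: last column of the sorted rotations."""
    if terminator in text:
        raise ValueError("terminator must not occur in the text")
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed, terminator="$"):
    """Recover the original text by repeatedly prepending and sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(transformed[i] + table[i] for i in range(len(transformed)))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


encoded = bwt("GATTACA")
print(encoded)
print(inverse_bwt(encoded))
```

The transform tends to group identical characters together, which is what makes the indexed genome compact and fast to search.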

Bowtie 2
Pre-built indexed genomes
Bowtie 1 and Bowtie 2 indexes are not compatible.

Alignments in Bowtie 2
@HWI-ST974:58:C059FACXX:2:1201:10589:110434 1:N:0:TGACCA
TGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCATGGAAAATGAGAAATACACACTTTAGGACGTG
Multiseed alignment (ungapped): seed length 16 nt, one seed every 10 nt, 0 mismatches allowed in a seed.
Seeds are extended (gaps allowed) to generate the alignment:
Ref:  TGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCATGGAAAATGAGAAATACACACTTTAGGACGTG
Read: TGCACACTGAAGGTCCTGGAATATGGCGAGAAAACTGAAAATCATGGAAA--GAGAAATACACACTTTAGGACGTG
Scoring: match = 2; mismatch = -6; gap = -11 for the 2 bp gap shown (-5 to open, -3 to extend by 1 bp).
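
Using only the numbers given on this slide, the seed layout and the score of the example alignment can be worked out as follows. This is a sketch of the arithmetic, not Bowtie 2's implementation: the real program's penalties can depend on alignment mode and base quality, which is ignored here, and the function names are hypothetical.

```python
def seed_offsets(read_length, seed_length=16, interval=10):
    """Start offsets of the seeds taken along a read."""
    return list(range(0, read_length - seed_length + 1, interval))


def gap_penalty(gap_length, open_cost=5, extend_cost=3):
    """Affine gap penalty: open_cost plus extend_cost per gapped base."""
    return open_cost + extend_cost * gap_length


def alignment_score(matches, mismatches, gap_lengths,
                    match_bonus=2, mismatch_penalty=6):
    """Total score from counts of matches, mismatches and gap lengths."""
    score = matches * match_bonus - mismatches * mismatch_penalty
    score -= sum(gap_penalty(length) for length in gap_lengths)
    return score


# The alignment on the slide spans 76 reference bases:
# 73 matching columns, 1 mismatch, and one gap of 2 bases.
print(seed_offsets(76))
print(gap_penalty(2))
print(alignment_score(matches=73, mismatches=1, gap_lengths=[2]))
```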

Mapping in highly repetitive regions
ELAND is conservative: non-unique alignments are flagged and only one is reported in export.txt. Post-alignment CASAVA analyses ignore these.
Bowtie will report non-unique alignments; user-specified options determine how these are reported.
http://bowtie-bio.sourceforge.net/manual.shtml
http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml

Sequence Alignment/Map (SAM) format
Standard format for reporting short read alignment data.
BAM is the compressed version.
Sections: header; alignment info.
http://samtools.sourceforge.net/
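
Each alignment line in a SAM file is tab-separated, with 11 mandatory fields followed by optional tags; header lines start with `@`. The sketch below parses one alignment line by hand to show the layout. The record in the example is invented, and in practice a library such as pysam or the samtools command line would normally be used instead.

```python
from dataclasses import dataclass, field


@dataclass
class SamRecord:
    qname: str   # read name
    flag: int    # bitwise flags
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description, e.g. 76M
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities (Phred+33)
    tags: list = field(default_factory=list)  # optional TAG:TYPE:VALUE fields

    @property
    def is_unmapped(self):
        return bool(self.flag & 0x4)

    @property
    def is_reverse(self):
        return bool(self.flag & 0x10)


def parse_sam_line(line):
    """Parse one SAM alignment line (not a header line) into a SamRecord."""
    if line.startswith("@"):
        raise ValueError("header line, not an alignment record")
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM alignment line has at least 11 fields")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10], f[11:])


# Hypothetical record, for illustration only.
record = parse_sam_line("read1\t16\tchr1\t100\t42\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0")
print(record.rname, record.pos, record.cigar, record.is_reverse)
```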

Summary
Read the material posted for this lecture on the class wiki.
Next week: first Regulomics lecture.