Getting the computer set up. Follow the directions on the handout to log in to the server. Type “qsub -I” to get a compute node. The data you will be using is stored in ../shared/

Presentation transcript:

Getting the computer set up
Follow the directions on the handout to log in to the server.
Type “qsub -I” to get a compute node.
The data you will be using is stored in ../shared/

Mapping RNA-seq data
Matthew Young, Alicia Oshlack, Bernie Pope

Illumina Sequencing Technology
[Figure: DNA sample preparation, single molecule array, cluster growth, sequencing, image acquisition, and base calling.]
Slide courtesy of G Schroth, Illumina

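Raw data
Short sequence reads
Quality scores: Q = -10 log10(p), where p is the estimated probability that the base call is wrong
Example read (sequence, separator line, quality string):
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%++)(%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
Handout Reference: Page 2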
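
The quality-score formula on this slide can be inverted to recover an error probability from each quality character. The following is a minimal sketch, not part of the original slides. It assumes the Phred+33 ASCII offset (an assumption: some Illumina pipelines of that era used an offset of 64), and the function name is hypothetical.

```python
def phred_to_error_probs(quality_string, offset=33):
    """Convert a FASTQ quality string to per-base error probabilities.

    Uses Q = -10 * log10(p), so p = 10 ** (-Q / 10).
    The default offset of 33 is an assumption about the encoding.
    """
    probs = []
    for char in quality_string:
        q = ord(char) - offset  # integer quality score for this base
        probs.append(10 ** (-q / 10))
    return probs


if __name__ == "__main__":
    for char, p in zip("!5I", phred_to_error_probs("!5I")):
        print(char, p)
```

Under this encoding, low ASCII characters such as "!" mark bases the machine had essentially no confidence in, which is why the first bases of the example read look so poor.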

Quality checks
Base composition
Quality score
PCR artifacts

Sequencing transcripts, not the genome
[Figure: reads aligned to a gene, its transcript, and the CDS.]
The difficulty here is that reads spanning an exon-exon junction may not get mapped when mapping to the genome.
One strategy: supplement the reference genome with sequences that span all known or possible junctions.

[Figure: coding sequence (CDS) with exons, introns, and splice junctions.]
Aggregate reads based on:
- Exons
- Exons + junctions
- All reads from start to end of transcript
- De novo methods

Mapping tools
Type                          Name                            Link
General aligner               GMAP/GSNAP                      http://research-pub.gene.com/gmap/
                              BFAST                           http://sourceforge.net/apps/mediawiki/bfast/index.php
                              BOWTIE                          http://bowtie-bio.sourceforge.net/index.shtml
                              CloudBurst                      http://sourceforge.net/apps/mediawiki/cloudburst-bio/index.php
                              GNUmap                          http://dna.cs.byu.edu/gnumap/index.shtml
                              MAQ/BWA                         http://maq.sourceforge.net/
                              PerM                            http://code.google.com/p/perm/
                              RazerS                          http://
                              Mrfast/mrsfast                  http://mrfast.sourceforge.net/manual.html
                              SOAP/SOAP2                      http://soap.genomics.org.cn/
                              SHRiMP                          http://compbio.cs.toronto.edu/shrimp/
De Novo annotator             QPALMA/GenomeMapper/PALMapper   http://
                              SpliceMap                       http://
                              SOAPals                         http://soap.genomics.org.cn/
                              G-Mo.R-Se                       http://
                              TopHat                          http://tophat.cbcb.umd.edu/
                              SplitSeek                       http://solidsoftwaretools.com/gf/project/splitseek
De Novo transcript assembler  Oases                           http://
                              MIRA                            http://sourceforge.net/apps/mediawiki/mira-assembler/index.php

Familiarizing yourself with bowtie
The minimum information bowtie needs is your reads (i.e., the fastq file from the machine) and a reference.
The reference is the genome or transcriptome we are trying to map to, transformed using a Burrows-Wheeler transform to allow fast searching.
There are many optional parameters for tweaking the alignment.
Handout Reference: Pages 2-7
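
To make the Burrows-Wheeler transform mentioned on this slide concrete, here is a naive sketch, not part of the original slides. It sorts all rotations of the text, which is far too slow and memory-hungry for a genome; real index builders use much more efficient constructions. The function names and the "$" terminator are illustrative assumptions.

```python
def bwt(text, terminator="$"):
    """Return the Burrows-Wheeler transform of text (naive construction)."""
    if terminator in text:
        raise ValueError("terminator must not occur in the text")
    text = text + terminator
    # Sort every rotation of the text, then read off the last column.
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed, terminator="$"):
    """Recover the original text by repeatedly prepending and sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(transformed[i] + table[i] for i in range(len(transformed)))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


if __name__ == "__main__":
    transformed = bwt("GATTACA")
    print(transformed, inverse_bwt(transformed))
```

The transform is reversible and tends to group identical characters together, which is what makes compressed, searchable indexes of a whole genome practical.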

How do you map 10^9 76 bp sequences to a 10^9 bp reference?
Ideally we’d test every position on the genome for its suitability as a match and assign it a score based on the number of mismatches, indels, etc.
However, this is computationally infeasible, so we have to come up with something else.
Handout Reference: Pages 2-7
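
A rough back-of-the-envelope calculation shows why testing every position is out of reach. This sketch is not part of the original slides; it counts one base comparison per read base per reference position (ignoring indels), and the throughput of 10^9 comparisons per second is a purely hypothetical figure.

```python
def brute_force_comparisons(n_reads, read_length, reference_length):
    """Base comparisons needed to test every read at every reference position."""
    positions = reference_length - read_length + 1
    return n_reads * positions * read_length


SECONDS_PER_YEAR = 365 * 24 * 3600

if __name__ == "__main__":
    total = brute_force_comparisons(10**9, 76, 10**9)
    # Hypothetical throughput: 1e9 base comparisons per second.
    years = total / 1e9 / SECONDS_PER_YEAR
    print(f"{total:.2e} comparisons, roughly {years:,.0f} years")
```

With the numbers from the slide this comes to on the order of 10^19 base comparisons, which at the assumed rate would take thousands of years on a single core.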

Lots of aligners, one general strategy
A quick heuristic is performed to cut down the number of candidate alignment regions for each read.
A more precise algorithm is then employed to decide which of these candidates is a valid alignment.
Handout Reference: Pages 2-7
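
The two-stage strategy can be sketched in a few lines. This is an illustration only, not part of the original slides: it uses a plain dictionary of substrings as the heuristic (real aligners such as bowtie use a Burrows-Wheeler based index instead) and a simple mismatch count as the "more precise" step. All names are hypothetical.

```python
from collections import defaultdict


def build_seed_index(reference, seed_length):
    """Map every seed-length substring of the reference to its start positions."""
    index = defaultdict(list)
    for start in range(len(reference) - seed_length + 1):
        index[reference[start:start + seed_length]].append(start)
    return index


def align_read(read, reference, index, seed_length, max_mismatches):
    """Two-stage alignment: exact seed lookup, then full-read verification."""
    seed = read[:seed_length]
    hits = []
    for start in index.get(seed, []):  # stage 1: cheap candidate lookup
        window = reference[start:start + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the reference
        # Stage 2: compare the whole read against the candidate region.
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((mismatches, start))
    return sorted(hits)


if __name__ == "__main__":
    reference = "ACGTACGTTAGCCGATAGGCTA"
    index = build_seed_index(reference, seed_length=4)
    print(align_read("TAGCCGATAC", reference, index, 4, max_mismatches=1))
```

The expensive comparison only runs at the handful of positions where the seed matches, rather than at every position in the reference.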

A fragment as seen by an aligner
The heuristic acts only on the seed, identifying putative mapping locations.
A more precise algorithm extends each seed to the full read and ranks the candidates.
The fragment is not sequenced directly; its sequence is inferred from the mapping of the read.
[Figure: a fragment, the read sequenced from it, and the seed within the read, with the labels "Burrows-Wheeler acts on this" and "Smith-Waterman acts on this".]
Handout Reference: Pages 2-7
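
The "extend and rank" step can be illustrated with a small Smith-Waterman scorer applied to the region around each seed hit. This is a sketch of the general strategy on the slide, not of bowtie's own internals. The scoring values (match 2, mismatch -1, gap -2), the padding, and the function names are all assumptions chosen for illustration.

```python
def smith_waterman_score(read, candidate, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a read and a candidate region."""
    previous = [0] * (len(candidate) + 1)
    best = 0
    for i in range(1, len(read) + 1):
        current = [0] * (len(candidate) + 1)
        for j in range(1, len(candidate) + 1):
            step = match if read[i - 1] == candidate[j - 1] else mismatch
            current[j] = max(
                0,                      # start a new local alignment
                previous[j - 1] + step, # align the two bases
                previous[j] + gap,      # gap in the candidate
                current[j - 1] + gap,   # gap in the read
            )
            best = max(best, current[j])
        previous = current
    return best


def rank_candidates(read, reference, candidate_starts, padding=2):
    """Score the region around each seed hit; return (score, start), best first."""
    scored = []
    for start in candidate_starts:
        lo = max(0, start - padding)
        hi = min(len(reference), start + len(read) + padding)
        scored.append((smith_waterman_score(read, reference[lo:hi]), start))
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    print(rank_candidates("ACGT", "GGGACGTGGGACCTGGG", [3, 10]))
```

Because the dynamic programming table is only filled for short candidate regions, the quadratic cost of Smith-Waterman stays manageable.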

Our test data set
76 bp single-end reads.
Sequenced using the Illumina Genome Analyzer.
RNA taken from a mouse myoblast cell line.
We only look at a subset of one lane; full data available here gi?acc=GSE2084

A strict mapping
bowtie -t -p 1 -n 0 -e 1 --sam --best --un Strict_Unmapped.fastq ../shared/BowtieIndexes/mm9 ../shared/Sample_reads.fastq Strict_Mapped.sam
-n sets the maximum number of mismatches allowed in the seed
-e sets the limit on the sum of quality scores at ALL mismatched positions (not just the seed); the sum must not exceed this value
--un saves the unmapped reads to the file specified
-t prints timing information
-p sets the number of simultaneous threads
--sam makes bowtie output in SAM format
--best ensures the best alignment is returned
Handout Reference: Page 8
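
To see why -n 0 -e 1 is so strict, it helps to spell out the acceptance rule as code. This is a simplified sketch, not part of the original slides and not bowtie's implementation: bowtie's documentation describes further details (such as how quality values are rounded) that are ignored here. The defaults of 28 for the seed length and 70 for the quality sum are assumed to mirror bowtie's documented defaults, and the function name is hypothetical.

```python
def passes_policy(read, quals, target, seed_length=28,
                  max_seed_mismatches=0, max_quality_sum=70):
    """Check an ungapped alignment against a simplified -n / -e style policy.

    quals holds one integer Phred quality per read base.
    """
    if not (len(read) == len(quals) == len(target)):
        raise ValueError("read, quals and target must have equal length")
    mismatch_positions = [
        i for i, (a, b) in enumerate(zip(read, target)) if a != b
    ]
    # -n style check: mismatches inside the seed only.
    seed_mismatches = sum(1 for i in mismatch_positions if i < seed_length)
    # -e style check: quality sum over ALL mismatched positions.
    quality_sum = sum(quals[i] for i in mismatch_positions)
    return (seed_mismatches <= max_seed_mismatches
            and quality_sum <= max_quality_sum)


if __name__ == "__main__":
    read, target = "ACGTACGT", "ACGTACGA"
    quals = [30] * len(read)
    print(passes_policy(read, quals, target, seed_length=4, max_quality_sum=1))
    print(passes_policy(read, quals, target, seed_length=4, max_quality_sum=70))
```

With a limit of 1, a single mismatch at a base of any reasonable quality is enough to reject the alignment, so the strict run effectively reports near-exact matches only.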

Using the defaults
bowtie -t -p 1 --best --sam ../shared/BowtieIndexes/mm9 ../shared/Sample_reads.fastq test.sam
-t prints timing information
-p sets the number of simultaneous threads
--sam makes bowtie output in SAM format
--best ensures the best alignment is returned
--un saves the unmapped reads to the file specified
Handout Reference: Page 8

Allowing some mismatches
bowtie -t -p 1 -n 3 -e 200 --sam --best --un Loose_Unmapped.fastq ../shared/BowtieIndexes/mm9 Strict_Unmapped.fastq Loose_Mapped.sam
Set the number of seed mismatches to the maximum.
Set -e to a value more appropriate for our read length.
How many reads do you map?
Handout Reference: Pages 8-9

A closer look at our data set: FastQC
Handout Reference: Pages 9-10

Trimming reads
Bowtie allows you to trim reads before it attempts to map them.
You can trim from the left (5’) end of the read with the --trim5 option.
You can trim from the right (3’) end of the read with the --trim3 option.
Handout Reference: Page 10
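
The effect of the two trimming options on a single read can be sketched as follows. This is an illustration only, not part of the original slides and not how bowtie implements trimming. The function name is hypothetical, and the choice to return empty strings for a fully trimmed read is an assumption.

```python
def trim_read(sequence, qualities, trim5=0, trim3=0):
    """Remove trim5 bases from the 5' end and trim3 bases from the 3' end."""
    if trim5 < 0 or trim3 < 0:
        raise ValueError("trim amounts must be non-negative")
    if len(sequence) != len(qualities):
        raise ValueError("sequence and qualities must have equal length")
    end = len(sequence) - trim3
    if end <= trim5:
        return "", ""  # nothing is left after trimming
    # Sequence and quality string must be trimmed identically.
    return sequence[trim5:end], qualities[trim5:end]


if __name__ == "__main__":
    print(trim_read("ACGTACGT", "IIIIHHHH", trim5=2, trim3=3))
```

Trimming the low-quality ends shortens the read but removes the bases most likely to cause mismatches, which is why it can rescue reads that failed to map at full length.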

Sequencing transcripts, not the genome
[Figure: reads aligned to a gene, its transcript, and the CDS.]
A library containing the sequences that span exon-exon junctions (which do not appear in the genome) can help map junction reads. This is called an exon-junction library.
A reference built from this library is in BowtieIndexes, called mm9.UCSC.knownGene.junctions (named for the annotation it was built from).
Handout Reference: Pages 10-12
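
How such a library might be put together can be sketched for a single transcript. This is a hypothetical illustration, not the procedure used to build the mm9.UCSC.knownGene.junctions index. It assumes forward-strand exons given as 0-based, end-exclusive coordinates, and it only joins consecutive exons of one annotated transcript.

```python
def junction_sequences(genome, exons, read_length):
    """Build junction sequences for consecutive exons of one transcript.

    genome: chromosome sequence as a string.
    exons: list of (start, end) pairs, 0-based and end-exclusive, in
           transcript order on the forward strand (assumed convention).
    Each junction takes up to read_length - 1 bases from either side, so
    a read has to cross the junction to match.
    """
    flank = read_length - 1
    junctions = {}
    for (start_a, end_a), (start_b, end_b) in zip(exons, exons[1:]):
        left = genome[max(start_a, end_a - flank):end_a]
        right = genome[start_b:min(end_b, start_b + flank)]
        # Key records where the upstream exon ends and the next one starts.
        junctions[f"{end_a}-{start_b}"] = left + right
    return junctions


if __name__ == "__main__":
    genome = "AAAACCCCGGGGTTTTACGT"
    print(junction_sequences(genome, [(0, 4), (12, 16)], read_length=3))
```

Limiting each flank to read_length - 1 bases keeps reads that lie entirely within one exon from matching the junction library as well as the genome.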

Options to map more reads…
Trim some bases from the ends of the reads using --trim5 and/or --trim3.
Map to the junction library instead of the mouse genome using the mm9.UCSC.knownGene.junctions index.
Handout Reference: Pages 9-12

Trim
$ bowtie -t -p 1 -n 2 -e 70 --trim5 … --trim3 … --sam --best --un Trimmed_Unmapped.fastq ../shared/BowtieIndexes/mm9 Loose_Unmapped.fastq Trimmed_Mapped.sam
Time loading forward index: 00:00:02
Time loading mirror index: 00:00:02
Seeded quality full-index search: 00:00:47
# reads processed:
# reads with at least one reported alignment: (30.91%)
# reads that failed to align: (69.09%)
Reported alignments to 1 output stream(s)
Time searching: 00:00:53
Overall time: 00:00:54
Handout Reference: Page 10

Then map to junctions
$ bowtie -t -p 1 -n 3 -e 200 --sam --best --un Junctions_Unampped.fastq ../shared/BowtieIndexes/mm9.UCSC.knownGene.junctions Trimmed_Unmapped.fastq Junctions_Mapped.sam
Time loading forward index: 00:00:35
Time loading mirror index: 00:00:34
Seeded quality full-index search: 00:00:24
# reads processed:
# reads with at least one reported alignment: (35.07%)
# reads that failed to align: (64.93%)
Reported alignments to 1 output stream(s)
Time searching: 00:01:43
Overall time: 00:01:45
Handout Reference: Page 11
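
The four runs form a chain in which each stage maps the reads the previous stage could not. The sketch below, not part of the original slides, only builds the command lines using the flags shown on the slides and does not run bowtie. The -e value of 200 for the loose and junction stages is taken from the results table, the trim amounts are caller-supplied placeholders because the values are not legible in the slides, and the function names and output file naming are hypothetical.

```python
def bowtie_command(index, reads, mapped_sam, unmapped_fastq, n, e,
                   trim5=None, trim3=None):
    """Build the argument list for one bowtie run, using flags from the slides."""
    args = ["bowtie", "-t", "-p", "1", "-n", str(n), "-e", str(e)]
    if trim5 is not None:
        args += ["--trim5", str(trim5)]
    if trim3 is not None:
        args += ["--trim3", str(trim3)]
    args += ["--sam", "--best", "--un", unmapped_fastq, index, reads, mapped_sam]
    return args


def staged_commands(genome_index, junction_index, reads, trim5, trim3):
    """Chain the four runs so each stage maps the previous stage's leftovers."""
    stages = [
        ("Strict", genome_index, {"n": 0, "e": 1}),
        ("Loose", genome_index, {"n": 3, "e": 200}),
        ("Trimmed", genome_index,
         {"n": 2, "e": 70, "trim5": trim5, "trim3": trim3}),
        ("Junctions", junction_index, {"n": 3, "e": 200}),
    ]
    commands = []
    current_reads = reads
    for name, index, options in stages:
        unmapped = f"{name}_Unmapped.fastq"
        commands.append(bowtie_command(
            index, current_reads, f"{name}_Mapped.sam", unmapped, **options))
        current_reads = unmapped  # next stage starts from these reads
    return commands


if __name__ == "__main__":
    for command in staged_commands(
            "../shared/BowtieIndexes/mm9",
            "../shared/BowtieIndexes/mm9.UCSC.knownGene.junctions",
            "../shared/Sample_reads.fastq",
            trim5=5, trim3=25):  # placeholder trim values
        print(" ".join(command))
```

Each argument list could be passed to subprocess.run(command, check=True) on a machine where bowtie and the indexes are available.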

Number of mapped reads
Mapping strategy  Command line options       No. Mapped Reads    No. Unmapped Reads  Reference
Strict            -n 0 -e 1                  1,049,050 (47.32%)  1,167,992 (52.68%)  Genome
Loose             -n 3 -e 200                1,783,048 (80.42%)  433,994 (19.58%)    Genome
Trimming          -n 3 --trim5 … --trim3 25  1,912,003 (86.24%)  305,039 (13.76%)    Genome
Junctions         -n 3 -e 200                2,007,627 (90.55%)  209,415 (9.45%)     Junction Library
Handout Reference: Page 12
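
The counts in this table are cumulative: mapped plus unmapped reads add up to the same total (2,217,042) in every row. A short sketch, not part of the original slides, recomputes the percentages and shows how many additional reads each strategy contributed. The function name is hypothetical.

```python
TOTAL_READS = 2_217_042  # mapped + unmapped in every row of the table

cumulative_mapped = [
    ("Strict", 1_049_050),
    ("Loose", 1_783_048),
    ("Trimming", 1_912_003),
    ("Junctions", 2_007_627),
]


def summarize(rows, total):
    """Return (strategy, percent mapped so far, reads gained by this step)."""
    summary = []
    previous = 0
    for name, mapped in rows:
        summary.append((name, round(100 * mapped / total, 2), mapped - previous))
        previous = mapped
    return summary


if __name__ == "__main__":
    for name, percent, gained in summarize(cumulative_mapped, TOTAL_READS):
        print(f"{name:<10} {percent:6.2f}%  +{gained:,} reads")
```

Most of the gain comes from relaxing the mismatch settings; trimming and the junction library each recover a smaller, but still substantial, share of the remaining reads.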

Further options
[This slide repeats the table of general aligners, de novo annotators, and de novo transcript assemblers from the "Mapping tools" slide.]
Handout Reference: Pages 12-13