Published by Poppy Robinson. Modified over 9 years ago.
1
Getting the computer set up
Follow the directions on the handout to log in to the server. Type “qsub -I” to get a compute node. The data you will be using is stored in ../shared/
2
Mapping RNA-seq data
Matthew Young, Alicia Oshlack, Bernie Pope
3
Illumina Sequencing Technology
[Figure: Illumina sequencing workflow. Sample preparation (DNA, 0.1-1.0 ug), cluster growth on a single molecule array, sequencing, image acquisition, and base calling.]
Slide courtesy of G Schroth, Illumina
4
Raw data
Short sequence reads
Quality scores = -10log10(p) or similar…
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%++)(%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
Handout Reference: Page 2
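The relation Q = -10log10(p) can be inverted to recover the error probability behind each quality character. The sketch below assumes the Phred+33 (Sanger) ASCII encoding; the slide says "or similar", and other encodings (for example an offset of 64) exist, so the `offset` parameter is an assumption to check against your own data. The function names are illustrative only.

```python
def phred_to_error_prob(q):
    """Invert Q = -10 * log10(p): return the probability p that the base call is wrong."""
    return 10 ** (-q / 10)


def decode_quality_string(qual, offset=33):
    """Turn an ASCII quality string into a list of integer quality scores.

    offset=33 is an assumption (Phred+33 / Sanger encoding).
    """
    return [ord(ch) - offset for ch in qual]


# Three example characters: '!' (lowest), '5' and 'I'.
scores = decode_quality_string("!5I")
error_probs = [phred_to_error_prob(q) for q in scores]
```

Under this encoding a score of 20 corresponds to a 1 in 100 chance of a wrong base call, and a score of 40 to 1 in 10,000.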
5
Quality checks
- Base composition
- Quality score
- PCR artifacts
6
Sequencing transcripts, not the genome
[Figure: reads shown against a gene, its transcript, and the CDS.]
The difficulty here is that reads spanning an exon-exon junction may not get mapped when mapping to the genome. One strategy: supplement the reference genome with sequences that span all known or possible junctions.
7
[Figure: a coding sequence (CDS) with its exons, introns, and splice junctions labelled.]
Aggregate reads based on:
- Exons
- Exons + junctions
- All reads from start to end of transcript
- De novo methods
8
Mapping tools
Type | Name | Link
General aligner | GMAP/GSNAP | http://research-pub.gene.com/gmap/
General aligner | BFAST | http://sourceforge.net/apps/mediawiki/bfast/index.php
General aligner | BOWTIE | http://bowtie-bio.sourceforge.net/index.shtml
General aligner | CloudBurst | http://sourceforge.net/apps/mediawiki/cloudburst-bio/index.php
General aligner | GNUmap | http://dna.cs.byu.edu/gnumap/index.shtml
General aligner | MAQ/BWA | http://maq.sourceforge.net/
General aligner | Perm | http://code.google.com/p/perm/
General aligner | RazerS | http://www.seqan.de/projects/razers.html
General aligner | Mrfast/mrsfast | http://mrfast.sourceforge.net/manual.html
General aligner | SOAP/SOAP2 | http://soap.genomics.org.cn/
General aligner | SHRiMP | http://compbio.cs.toronto.edu/shrimp/
De novo annotator | QPALMA/GenomeMapper/PALMapper | http://www.fml.tuebingen.mpg.de/raetsch/suppl/palmapper
De novo annotator | SpliceMap | http://www.stanford.edu/group/wonglab/SpliceMap/
De novo annotator | SOAPals | http://soap.genomics.org.cn/
De novo annotator | G-Mo.R-Se | http://www.genoscope.cns.fr/externe/gmorse/
De novo annotator | TopHat | http://tophat.cbcb.umd.edu/
De novo annotator | SplitSeek | http://solidsoftwaretools.com/gf/project/splitseek
De novo transcript assembler | Oases | http://www.ebi.ac.uk/~zerbino/oases/
De novo transcript assembler | MIRA | http://sourceforge.net/apps/mediawiki/mira-assembler/index.php
10
Familiarizing yourself with bowtie
The minimum information bowtie needs is your reads (i.e., the fastq file from the machine) and a reference. The reference is the genome or transcriptome we are trying to map to, transformed using a Burrows-Wheeler transform to allow fast searching. Many optional parameters are available to tweak the alignment.
Handout Reference: Pages 2-7
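To give a feel for why a Burrows-Wheeler transformed reference allows fast searching, here is a toy sketch: it builds the transform by sorting all rotations of a short string, then counts exact occurrences of a pattern by backward search using only the transformed string. This is a teaching illustration under simplifying assumptions (tiny input, no compression, no sampled counts), not how bowtie itself is implemented; the function names are hypothetical.

```python
def bwt(text, terminator="$"):
    """Burrows-Wheeler transform built from sorted rotations (toy sizes only)."""
    if terminator in text:
        raise ValueError("terminator must not occur in the text")
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_string, pattern):
    """Count exact matches of pattern in the original text via backward search."""
    # first_index[c] = row of the first rotation that starts with character c.
    first_index = {}
    for row, c in enumerate(sorted(bwt_string)):
        first_index.setdefault(c, row)

    def occ(c, i):
        # Number of occurrences of c in bwt_string[:i] (linear scan for clarity).
        return bwt_string[:i].count(c)

    lo, hi = 0, len(bwt_string)  # current range of matching rotations
    for c in reversed(pattern):
        if c not in first_index:
            return 0
        lo = first_index[c] + occ(c, lo)
        hi = first_index[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


reference = "GATTACAGATTACA"
transformed = bwt(reference)
hits = count_occurrences(transformed, "TTA")
```

The search takes one step per pattern character regardless of how many times the pattern occurs, which is the property that makes this kind of index attractive for mapping many short reads.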
11
How do you map 10^9 76 bp sequences to a 10^9 bp reference?
Ideally we’d test every position on the genome for its suitability as a match and assign it a score based on the number of mismatches, indels, etc. However, this is computationally infeasible, so we have to come up with something else.
Handout Reference: Pages 2-7
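A rough back-of-the-envelope calculation shows the scale of the brute-force approach. The sketch counts only mismatch checks (no indels) and assumes a hypothetical rate of 10^9 base comparisons per second; that rate is an illustrative assumption, not a measurement.

```python
def brute_force_comparisons(n_reads, read_len, ref_len):
    """Base comparisons needed to test every read at every reference position."""
    positions = ref_len - read_len + 1  # start positions where a whole read fits
    return n_reads * positions * read_len


total = brute_force_comparisons(n_reads=10**9, read_len=76, ref_len=10**9)

ASSUMED_COMPARISONS_PER_SECOND = 10**9  # hypothetical throughput, for scale only
SECONDS_PER_YEAR = 3600 * 24 * 365
years = total / ASSUMED_COMPARISONS_PER_SECOND / SECONDS_PER_YEAR
```

Under these assumptions the total is about 7.6 x 10^19 comparisons, which works out to thousands of years on a single core before indels are even considered.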
12
Lots of aligners, one general strategy
A quick “heuristic” is performed to cut down the number of candidate alignment regions for each read. A more precise algorithm is then employed to decide which of these candidates is a valid alignment.
Handout Reference: Pages 2-7
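The two-stage strategy can be sketched in a few lines. Note the substitutions made for brevity: the heuristic here is an exact k-mer lookup in a hash table (standing in for an index-based seed search), and the precise step is a simple mismatch count over the full read (standing in for a proper alignment algorithm). All names are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read, reference, index, k, max_mismatches):
    """Return (position, mismatches) for the best candidate, or None."""
    seed = read[:k]                      # stage 1: heuristic on the seed only
    best = None
    for pos in index.get(seed, []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):      # read would run off the reference
            continue
        mismatches = hamming(read, window)  # stage 2: check the full read
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (pos, mismatches)
    return best


reference = "ACGTACGTTAGCCGATAGGCTA"
index = build_kmer_index(reference, k=4)
result = map_read("TAGCCGATAC", reference, index, k=4, max_mismatches=1)
```

Only positions sharing the read's seed are ever examined in full, which is what cuts the search down from every position on the reference to a handful of candidates.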
13
A fragment as seen by an aligner
The heuristic acts only on the seed, identifying putative mapping locations. A more precise algorithm extends each seed to the full read and ranks the candidates. The fragment is not sequenced directly; its sequence is inferred from the mapping of the read.
[Figure: a fragment with the read and its seed marked, labelled with where Burrows-Wheeler and Smith-Waterman each act.]
Handout Reference: Pages 2-7
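For reference, the scoring part of Smith-Waterman local alignment fits in a short function. This is a minimal sketch with a linear gap penalty; the scoring values (match 2, mismatch -1, gap -2) are arbitrary illustrative choices, and real aligners use tuned and often affine gap schemes.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)   # previous row of the dynamic programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


score = smith_waterman_score("ACGT", "GGACGTGG")
```

Because every cell of the table is filled, the cost grows with the product of the two sequence lengths, which is why this step is reserved for the few candidate regions that survive the seed heuristic.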
14
Our test data set
76 bp single-end reads, sequenced using the Illumina Genome Analyzer.
RNA taken from a mouse myoblast cell line.
We only look at a subset of one lane; the full data set is available here: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE2084
16
A strict mapping
bowtie -t -p 1 -n 0 -e 1 --sam --best --un Strict_Unmapped.fastq ../shared/BowtieIndexes/mm9 ../shared/Sample_reads.fastq Strict_Mapped.sam
-n sets the maximum number of mismatches allowed in the seed
-e sets the limit on the sum of quality scores at ALL mismatched positions (not just those in the seed); the sum must not exceed this value
--un saves the unmapped reads to the file specified
-t prints timing information
-p sets the number of simultaneous threads
--sam makes bowtie output in SAM format
--best ensures the best map is returned
Handout Reference: Page 8
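The interaction between -n and -e can be made concrete with a simplified check. This sketch only mirrors the rule as stated on the slide: count mismatches inside the seed, and sum the quality scores at every mismatched position. It omits details of the real program such as how bowtie rounds quality values, and the function name is hypothetical. The default `seed_len=28` mirrors bowtie's default seed length (-l).

```python
def passes_n_e_policy(read, ref_window, phred, n, e, seed_len=28):
    """Simplified version of the -n / -e acceptance rule.

    read, ref_window: equal-length sequences being compared
    phred: one integer quality score per read base
    """
    mismatched = [i for i, (r, g) in enumerate(zip(read, ref_window)) if r != g]
    seed_mismatches = sum(1 for i in mismatched if i < seed_len)
    quality_sum = sum(phred[i] for i in mismatched)
    return seed_mismatches <= n and quality_sum <= e


read = "ACGTACGT"
ref = "ACGTACGA"        # differs from the read at the last base only
qualities = [30] * 8

strict = passes_n_e_policy(read, ref, qualities, n=0, e=1, seed_len=4)
relaxed = passes_n_e_policy(read, ref, qualities, n=0, e=70, seed_len=4)
```

With -e 1, a single mismatch at a base of quality 30 is already enough to reject the alignment even though it lies outside the seed, which is why the strict settings behave almost like exact matching.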
17
Using the defaults
bowtie -t -p 1 --best --sam ../shared/BowtieIndexes/mm9 ../shared/Sample_reads.fastq test.sam
-t prints timing information
-p sets the number of simultaneous threads
--sam makes bowtie output in SAM format
--best ensures the best map is returned
--un saves the unmapped reads to the file specified
Handout Reference: Page 8
18
Allowing some mismatches
bowtie -t -p 1 -n 3 -e 200 --sam --best --un Loose_Unmapped.fastq ../shared/BowtieIndexes/mm9 Strict_Unmapped.fastq Loose_Mapped.sam
Set the number of seed mismatches to the maximum (-n 3). Set -e to a value more appropriate for our read length.
How many reads do you map?
Handout Reference: Pages 8-9
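Bowtie prints a mapping summary when it finishes, but the count can also be recovered from the SAM output. The sketch below tallies records by the unmapped bit (0x4) in the SAM FLAG field. It assumes single-end data with at most one record per read, and that unaligned reads are written to the SAM file; the function name and the inline example records are made up for illustration.

```python
def count_mapped(sam_lines):
    """Return (mapped, unmapped) record counts from an iterable of SAM lines."""
    mapped = unmapped = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue                      # skip blank lines and header lines
        flag = int(line.split("\t")[1])   # FLAG is the second SAM column
        if flag & 0x4:                    # bit 0x4 set: read is unmapped
            unmapped += 1
        else:
            mapped += 1
    return mapped, unmapped


# Made-up records, trimmed to the leading columns for brevity.
example = [
    "@HD\tVN:1.0\tSO:unsorted",
    "read1\t0\tchr1\t100\t255\t76M\t*\t0\t0",
    "read2\t16\tchr2\t500\t255\t76M\t*\t0\t0",
    "read3\t4\t*\t0\t0\t*\t*\t0\t0",
]
counts = count_mapped(example)

# Usage on a real file (path taken from the command above):
#   with open("Loose_Mapped.sam") as handle:
#       print(count_mapped(handle))
```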
19
A closer look at our data set: FastQC
Handout Reference: Pages 9-10
21
Trimming reads
Bowtie allows you to trim reads before it attempts to map them.
You can trim from the left (5’) end of the read with the --trim5 option.
You can trim from the right (3’) end of the read with the --trim3 option.
Handout Reference: Page 10
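The effect of the two trimming options on a single read can be sketched as below. This only illustrates what is removed from each end; it is not bowtie's code, and the function name is hypothetical. The sequence and its quality string are trimmed together so they stay the same length.

```python
def trim_read(seq, qual, trim5=0, trim3=0):
    """Remove trim5 bases from the 5' (left) end and trim3 from the 3' (right) end."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality string must have equal length")
    # Clamp so that over-trimming yields an empty read rather than a wrong slice.
    end = max(trim5, len(seq) - trim3)
    return seq[trim5:end], qual[trim5:end]


# A 76 bp toy read: 15 A's, 36 C's, 25 G's, with a flat quality string.
seq = "A" * 15 + "C" * 36 + "G" * 25
qual = "I" * 76
trimmed_seq, trimmed_qual = trim_read(seq, qual, trim5=15, trim3=25)
```

With --trim5 15 and --trim3 25, a 76 bp read is reduced to its central 36 bases, so shorter -n and -e budgets apply to a much shorter sequence.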
22
Sequencing transcripts, not the genome
[Figure: reads shown against a gene, its transcript, and the CDS, including reads that span exon-exon junctions.]
A library containing the sequences that span exon-exon junctions (which do not appear in the genome) can help map junction reads. This is called an exon-junction library. A reference built from this library is in BowtieIndexes, called mm9.UCSC.knownGene.junctions (named for the annotation it was built from).
Handout Reference: Pages 10-12
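One plausible way to build such junction sequences is sketched below: for each pair of consecutive exons, join the last (read length - 1) bases of the upstream exon to the first (read length - 1) bases of the downstream exon, so that any read crossing the junction fits inside the joined sequence. This is an assumption for illustration and not necessarily how mm9.UCSC.knownGene.junctions was constructed; the function name and toy exons are hypothetical.

```python
def junction_sequences(exons, read_len):
    """Build one sequence per junction between consecutive exons.

    exons: exon sequences of one transcript, in transcript order
    read_len: length of the reads that will be mapped (must be at least 2)
    """
    if read_len < 2:
        raise ValueError("read_len must be at least 2 to span a junction")
    flank = read_len - 1  # longest stretch a junction read can have on one side
    junctions = []
    for upstream, downstream in zip(exons, exons[1:]):
        junctions.append(upstream[-flank:] + downstream[:flank])
    return junctions


toy_exons = ["AAAACCCC", "GGGGTTTT", "ACGT"]
library = junction_sequences(toy_exons, read_len=4)
```

This covers only junctions between adjacent exons; a library of "all possible" junctions would also pair each exon with every later exon in the gene, at the cost of a larger reference.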
23
Options to map more reads…
- Trim some bases from the end of the reads using --trim5 and/or --trim3.
- Map to the junction library instead of the mouse genome using the mm9.UCSC.knownGene.junctions index.
Handout Reference: Pages 9-12
24
Trim
[myoung@bionode11 Starting]$ bowtie -t -p 1 -n 2 -e 70 --trim5 15 --trim3 25 --sam --best --un Trimmed_Unmapped.fastq ../shared/BowtieIndexes/mm9 Loose_Unmapped.fastq Trimmed_Mapped.sam
Time loading forward index: 00:00:02
Time loading mirror index: 00:00:02
Seeded quality full-index search: 00:00:47
# reads processed: 531414
# reads with at least one reported alignment: 164256 (30.91%)
# reads that failed to align: 367158 (69.09%)
Reported 164256 alignments to 1 output stream(s)
Time searching: 00:00:53
Overall time: 00:00:54
Handout Reference: Page 10
25
Then map to junctions
[myoung@bionode11 Starting]$ bowtie -t -p 1 -n 3 -e 200 --sam --best --un Junctions_Unampped.fastq ../shared/BowtieIndexes/mm9.UCSC.knownGene.junctions Trimmed_Unmapped.fastq Junctions_Mapped.sam
Time loading forward index: 00:00:35
Time loading mirror index: 00:00:34
Seeded quality full-index search: 00:00:24
# reads processed: 367158
# reads with at least one reported alignment: 128756 (35.07%)
# reads that failed to align: 238402 (64.93%)
Reported 128756 alignments to 1 output stream(s)
Time searching: 00:01:43
Overall time: 00:01:45
Handout Reference: Page 11
26
Number of mapped reads
Mapping strategy | Command line options | No. Mapped Reads | No. Unmapped Reads | Reference
Strict | -n 0 -e 1 | 1,049,050 (47.32%) | 1,167,992 (52.68%) | Genome
Loose | -n 3 -e 200 | 1,783,048 (80.42%) | 433,994 (19.58%) | Genome
Trimming | -n 3 --trim5 15 --trim3 25 | 1,912,003 (86.24%) | 305,039 (13.76%) | Genome
Junctions | -n 3 -e 200 | 2,007,627 (90.55%) | 209,415 (9.45%) | Junction Library
Handout Reference: Page 12
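The percentages in the table can be re-derived from the counts. The sketch below reads the mapped counts as cumulative totals over the successive strategies, an interpretation supported by the fact that mapped plus unmapped gives the same total in every row. The variable names are illustrative.

```python
# Total reads = mapped + unmapped from the "Strict" row of the table.
TOTAL_READS = 1_049_050 + 1_167_992

# Cumulative mapped reads after each strategy, copied from the table.
cumulative_mapped = {
    "Strict": 1_049_050,
    "Loose": 1_783_048,
    "Trimming": 1_912_003,
    "Junctions": 2_007_627,
}


def percent_mapped(mapped, total):
    """Percentage of reads mapped, rounded to two decimals as in the table."""
    return round(100 * mapped / total, 2)


percentages = {name: percent_mapped(m, TOTAL_READS) for name, m in cumulative_mapped.items()}

# Additional reads rescued by each step relative to the previous one.
gains = {}
previous = 0
for name, mapped in cumulative_mapped.items():
    gains[name] = mapped - previous
    previous = mapped
```

Seen this way, relaxing the mismatch settings rescues by far the most reads, while trimming and the junction library each add a smaller increment.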
27
Further options
Type | Name | Link
General aligner | GMAP/GSNAP | http://research-pub.gene.com/gmap/
General aligner | BFAST | http://sourceforge.net/apps/mediawiki/bfast/index.php
General aligner | BOWTIE | http://bowtie-bio.sourceforge.net/index.shtml
General aligner | CloudBurst | http://sourceforge.net/apps/mediawiki/cloudburst-bio/index.php
General aligner | GNUmap | http://dna.cs.byu.edu/gnumap/index.shtml
General aligner | MAQ/BWA | http://maq.sourceforge.net/
General aligner | Perm | http://code.google.com/p/perm/
General aligner | RazerS | http://www.seqan.de/projects/razers.html
General aligner | Mrfast/mrsfast | http://mrfast.sourceforge.net/manual.html
General aligner | SOAP/SOAP2 | http://soap.genomics.org.cn/
General aligner | SHRiMP | http://compbio.cs.toronto.edu/shrimp/
De novo annotator | QPALMA/GenomeMapper/PALMapper | http://www.fml.tuebingen.mpg.de/raetsch/suppl/palmapper
De novo annotator | SpliceMap | http://www.stanford.edu/group/wonglab/SpliceMap/
De novo annotator | SOAPals | http://soap.genomics.org.cn/
De novo annotator | G-Mo.R-Se | http://www.genoscope.cns.fr/externe/gmorse/
De novo annotator | TopHat | http://tophat.cbcb.umd.edu/
De novo annotator | SplitSeek | http://solidsoftwaretools.com/gf/project/splitseek
De novo transcript assembler | Oases | http://www.ebi.ac.uk/~zerbino/oases/
De novo transcript assembler | MIRA | http://sourceforge.net/apps/mediawiki/mira-assembler/index.php
Handout Reference: Pages 12-13