Mapping NGS sequences to a reference genome. Why? Resequencing studies (DNA) – Structural variation – SNP identification RNAseq – Mapping transcripts.

1 Mapping NGS sequences to a reference genome

2 Why?
Resequencing studies (DNA)
  – Structural variation
  – SNP identification
RNAseq
  – Mapping transcripts to a genome sequence
Genome annotation
Transcript enumeration
Identification of splice junctions/variants

3 BLAST is too slow
Different alignment algorithms are necessary
Burrows-Wheeler alignment
  – Sequence database (genome) is transformed to produce an index
  – Individual sequence reads are searched against this index
STAR Aligner (Dobin et al. 2012, Bioinformatics)
  – Uncompressed suffix arrays

4 BWT of “banana”
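The transform on this slide can be reproduced with a few lines of Python. This is a minimal sketch that assumes a `$` sentinel is appended to the text and uses naive sorted rotations, which is fine for a toy string but not how a genome-scale index is built. The function names `bwt` and `count_occurrences` are illustrative only. The second function shows, in simplified form, how a read can be searched against the transformed text (backward search).

```python
def bwt(text, sentinel="$"):
    """Return the Burrows-Wheeler transform of text via sorted rotations."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    # Sort all cyclic rotations; the BWT is the last column of that table.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_string, pattern):
    """Count exact matches of pattern using backward search on the BWT."""
    # c_table[ch] = number of characters in the text that sort before ch.
    c_table = {}
    for i, ch in enumerate(sorted(bwt_string)):
        c_table.setdefault(ch, i)

    def occ(ch, k):
        # Occurrences of ch in the first k characters of the BWT (naive scan).
        return bwt_string[:k].count(ch)

    lo, hi = 0, len(bwt_string)
    # Extend the match one character at a time, from the end of the pattern.
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ(ch, lo)
        hi = c_table[ch] + occ(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


transformed = bwt("banana")
print(transformed)                            # annb$aa
print(count_occurrences(transformed, "ana"))  # 2
```

Real aligners precompute the occurrence counts instead of rescanning the string, so each search step takes constant time.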

5 Tophat2
Based on the Bowtie alignment engine
  – Bowtie: matching with no gaps
  – Tophat2: gapped matches
Aligns reads to a Burrows-Wheeler transformed index of the genome
1st pass: non-gapped matches
2nd pass: splits unmapped reads and attempts to align the fragments
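The two-pass idea can be illustrated with a toy sketch. This is not TopHat2's implementation: it assumes exact matching with Python's `str.find` in place of a Bowtie index, a fixed hypothetical `segment_length`, and a made-up genome and read. It only shows the control flow of trying the whole read first and then mapping fragments of reads that failed.

```python
def two_pass_align(read, genome, segment_length=8):
    """Toy two-pass mapping: whole read first, then fixed-length segments.

    Returns a list of (read_offset, length, genome_position) tuples,
    with 0-based positions.
    """
    # Pass 1: try an ungapped match of the entire read.
    position = genome.find(read)
    if position != -1:
        return [(0, len(read), position)]

    # Pass 2: split the unmapped read and map each fragment on its own.
    hits = []
    for start in range(0, len(read), segment_length):
        segment = read[start:start + segment_length]
        segment_position = genome.find(segment)
        if segment_position != -1:
            hits.append((start, len(segment), segment_position))
    return hits


# Hypothetical genome: two "exons" separated by a 20-base "intron".
genome = "AAAACCCCGGGG" + "T" * 20 + "ACGTACGTTGCA"
# Hypothetical read spanning the junction between the two exons.
read = "CCCCGGGG" + "ACGTACGT"

print(two_pass_align(read, genome))  # [(0, 8, 4), (8, 8, 32)]
```

The gap between the two genome positions (12 to 32) is the kind of evidence a spliced aligner uses to propose a splice junction.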

6 The STAR Aligner
Start at the first base of the sequence read
Find the Maximal Mappable Prefix (MMP)
Repeat the process using the unmapped portion of the read
50x faster than other aligners
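The MMP search described on this slide can be sketched as a greedy loop. This is a simplified illustration, not STAR's code: it assumes exact matching and uses `str.find` on a plain string rather than a suffix array, and the function names, genome, and read are hypothetical.

```python
def maximal_mappable_prefix(read, genome):
    """Return (length, position) of the longest read prefix found in genome."""
    best_length, best_position = 0, -1
    for length in range(1, len(read) + 1):
        position = genome.find(read[:length])
        if position == -1:
            # If this prefix is absent, every longer prefix is absent too.
            break
        best_length, best_position = length, position
    return best_length, best_position


def map_by_mmp(read, genome):
    """Repeatedly map the MMP of whatever part of the read is still unmapped."""
    segments = []
    offset = 0
    while offset < len(read):
        length, position = maximal_mappable_prefix(read[offset:], genome)
        if length == 0:
            # This base matches nowhere; record it as unmapped and move on.
            segments.append((offset, 1, None))
            offset += 1
        else:
            segments.append((offset, length, position))
            offset += length
    return segments


# Hypothetical genome with two matching regions separated by 15 bases.
genome = "TTGACCTAGG" + "A" * 15 + "CGTTCAGGCT"
read = "CCTAGG" + "CGTTCA"

print(map_by_mmp(read, genome))  # [(0, 6, 4), (6, 6, 25)]
```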

7 OUTPUTS
TopHat (Bowtie)
  – .bam file (binary alignment/map)
  – .sam (sequence alignment/map)
  – Single .sam file entry:
I8MVR:53:837 0 17_dna:chromosome1409085825521M * 0 0 TAACTACGAATACCTGTCGAT **%-**,00%-*-%---*-*- NM:i:7 XX:Z:C5T3C2T2CT2C XM:Z:h..H......h.H...x...h XR:Z:CT XG:Z:CT

8 .sam fields
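A SAM alignment line has eleven mandatory tab-separated fields followed by optional `TAG:TYPE:VALUE` entries. The sketch below splits one line into named fields. The example record is hypothetical: the read name, reference name, and position are made up for illustration, and only the field layout follows the SAM format.

```python
SAM_COLUMNS = [
    "QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL",
]
INTEGER_COLUMNS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line):
    """Split one SAM alignment line into a dict of fields plus optional tags."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) < len(SAM_COLUMNS):
        raise ValueError("a SAM alignment line needs 11 mandatory fields")
    record = {}
    for name, value in zip(SAM_COLUMNS, parts):
        record[name] = int(value) if name in INTEGER_COLUMNS else value
    # Optional fields look like TAG:TYPE:VALUE, for example NM:i:7.
    tags = {}
    for field in parts[len(SAM_COLUMNS):]:
        tag, tag_type, value = field.split(":", 2)
        tags[tag] = int(value) if tag_type == "i" else value
    record["tags"] = tags
    return record


# Hypothetical record (names and coordinates are illustrative only).
example = "\t".join([
    "read_1", "0", "chr1", "100", "255", "21M", "*", "0", "0",
    "TAACTACGAATACCTGTCGAT", "**%-**,00%-*-%---*-*-", "NM:i:7", "XR:Z:CT",
])
record = parse_sam_line(example)
print(record["RNAME"], record["POS"], record["CIGAR"], record["tags"])
```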

9 .sam flags
1. 1
2. 2
3. 1+2
4. 0+4
5. 1+4
6. 0+2+4
7. 1+2+4
8. 0+8
9. 1+8
10. 0+2+8
11. 1+2+8
12. 0+4+8
13. 1+4+8
14. 0+2+4+8
15. 1+2+4+8
16. …etc.
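Each flag value is a sum of powers of two, so it can be decoded with bitwise tests. The sketch below uses the bit meanings defined in the SAM specification. The short descriptions and the function name `decode_flag` are my own wording, not taken from the slide.

```python
# Bit values and their meanings as defined by the SAM specification.
SAM_FLAG_BITS = {
    1: "read paired",
    2: "read mapped in proper pair",
    4: "read unmapped",
    8: "mate unmapped",
    16: "read on reverse strand",
    32: "mate on reverse strand",
    64: "first in pair",
    128: "second in pair",
    256: "secondary alignment",
    512: "fails quality checks",
    1024: "PCR or optical duplicate",
    2048: "supplementary alignment",
}


def decode_flag(flag):
    """Return the descriptions of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


print(decode_flag(11))  # 1 + 2 + 8
print(decode_flag(0))   # no bits set: unpaired read mapped to the forward strand
```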

10 CIGAR format
I8MVR:104:144 0 7_dna:chromosome120102744 255 62M1I14M * 0 0 GGTTTTTTGGAAGAGTAGTTCGCGTTTCATTAATTAGTTATTTTTTAGTTTTTAAATAAAATAAAATTTTAAAAAAA
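The CIGAR string in this record, `62M1I14M`, reads as 62 aligned bases, a 1-base insertion in the read, then 14 more aligned bases. A minimal sketch for parsing such strings follows. The operation letters come from the SAM specification, while the function names are illustrative.

```python
import re

CIGAR_OPERATION = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that consume bases of the read and of the reference.
CONSUMES_QUERY = set("MIS=X")
CONSUMES_REFERENCE = set("MDN=X")


def parse_cigar(cigar):
    """Turn a CIGAR string such as '62M1I14M' into (length, operation) pairs."""
    if cigar == "*":
        return []
    operations = [(int(n), op) for n, op in CIGAR_OPERATION.findall(cigar)]
    # Reject strings with characters the pattern silently skipped.
    if "".join(f"{n}{op}" for n, op in operations) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return operations


def query_length(operations):
    """Number of read bases described by the CIGAR."""
    return sum(n for n, op in operations if op in CONSUMES_QUERY)


def reference_span(operations):
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in operations if op in CONSUMES_REFERENCE)


operations = parse_cigar("62M1I14M")
print(operations)                  # [(62, 'M'), (1, 'I'), (14, 'M')]
print(query_length(operations))    # 77, the length of the read sequence above
print(reference_span(operations))  # 76, the insertion is absent from the reference
```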

11 Quantifying alignments
How many reads overlap a given interval on a chromosome (scaffold)?
How do these regions correspond to known genes?
  – .gtf file
How many transcripts from my gene of interest?
How confident can I be about a variant call?
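The first question, how many reads overlap an interval, reduces to an interval-overlap test. The sketch below assumes 1-based inclusive coordinates (as in SAM and GTF) and uses hypothetical reads and feature names. It is a brute-force loop, whereas dedicated counting tools use indexed lookups.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two 1-based inclusive intervals share at least one base."""
    return a_start <= b_end and b_start <= a_end


def count_reads_per_feature(alignments, features):
    """Count alignments overlapping each feature.

    alignments: iterable of (chromosome, start, end)
    features:   iterable of (name, chromosome, start, end)
    """
    counts = {name: 0 for name, _, _, _ in features}
    for read_chromosome, read_start, read_end in alignments:
        for name, chromosome, start, end in features:
            if read_chromosome == chromosome and overlaps(
                read_start, read_end, start, end
            ):
                counts[name] += 1
    return counts


# Hypothetical features and alignments for illustration.
features = [
    ("gene_A", "chr1", 90162, 90766),
    ("gene_B", "chr1", 90889, 91620),
]
alignments = [
    ("chr1", 90150, 90170),  # overlaps the start of gene_A
    ("chr1", 90760, 90780),  # overlaps the end of gene_A
    ("chr1", 90800, 90820),  # falls between the two genes
    ("chr1", 91000, 91020),  # inside gene_B
    ("chr2", 90200, 90220),  # same coordinates, different chromosome
]
print(count_reads_per_feature(alignments, features))
```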

12 Annotate regions - GTF files
1 2 3 4 5 6 7 8 9
Chromosome_8.1 Cufflinks transcript 90162 90766 1000 + . gene_id "CUFF.1"; transcript_id "CUFF.1.1"; FPKM "110.6292802224"; frac "1.000000"; conf_lo "41.668327"; conf_hi "132.581041"; cov "6.415537";
Chromosome_8.1 Cufflinks exon 90162 90231 1000 + . gene_id "CUFF.1"; transcript_id "CUFF.1.1"; exon_number "1"; FPKM "110.6292802224"; frac "1.000000"; conf_lo "41.668327"; conf_hi "132.581041"; cov "6.415537";
Chromosome_8.1 Cufflinks exon 90314 90766 1000 + . gene_id "CUFF.1"; transcript_id "CUFF.1.1"; exon_number "2"; FPKM "110.6292802224"; frac "1.000000"; conf_lo "41.668327"; conf_hi "132.581041"; cov "6.415537";
Chromosome_8.1 Cufflinks transcript 90889 91620 1000 . . gene_id "CUFF.2"; transcript_id "CUFF.2.1"; FPKM "49.8117204717"; frac "1.000000"; conf_lo "21.651798"; conf_hi "73.074820"; cov "2.193724";
GTF fields
1. Sequence ID
2. Source
3. Feature
4. Start
5. End
6. Score
7. Strand
8. Frame
9. Attribute
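The nine GTF fields listed on this slide can be parsed with a short function. This sketch assumes the fields are tab-separated, as in standard GTF, and that the sequence ID in the Cufflinks output above is `Chromosome_8.1`, since the spacing in the slide text is ambiguous. The example line is shortened to a few attributes.

```python
import re

GTF_COLUMNS = [
    "seqname", "source", "feature", "start", "end",
    "score", "strand", "frame", "attribute",
]
# Attributes look like: key "value"; key "value";
ATTRIBUTE_PATTERN = re.compile(r'(\S+) "([^"]*)";')


def parse_gtf_line(line):
    """Split one GTF line into its nine fields and a dict of attributes."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != len(GTF_COLUMNS):
        raise ValueError("a GTF line needs exactly 9 tab-separated fields")
    record = dict(zip(GTF_COLUMNS, parts))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    record["attribute"] = dict(ATTRIBUTE_PATTERN.findall(record["attribute"]))
    return record


# First exon of CUFF.1 from the slide, with an assumed sequence ID.
example = "\t".join([
    "Chromosome_8.1", "Cufflinks", "exon", "90162", "90231", "1000", "+", ".",
    'gene_id "CUFF.1"; transcript_id "CUFF.1.1"; exon_number "1"; '
    'FPKM "110.6292802224";',
])
exon = parse_gtf_line(example)
print(exon["feature"], exon["start"], exon["end"], exon["attribute"]["gene_id"])
```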

13 Variant Calling
The .bam/.sam file contains the aligned reads required to call variants
Variant calls can’t be extracted from the .bam file alone
The genome sequence must also be provided
I8MVR:53:837 0 17_dna:chromosome1409085825521M * 0 0 TAACTACGAATACCTGTCGAT **%-**,00%-*-%---*-*- NM:i:7 XX:Z:C5T3C2T2CT2C XM:Z:h..H......h.H...x...h XR:Z:CT XG:Z:CT
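A toy pileup shows why the genome sequence is needed: each read base has to be compared with the reference base at the same position. The sketch below is a deliberately naive illustration, not a real variant caller. It assumes ungapped alignments given as (1-based position, sequence) pairs, ignores base and mapping qualities, and uses hypothetical thresholds `min_depth` and `min_fraction`.

```python
from collections import Counter, defaultdict


def call_variants(reference, reads, min_depth=2, min_fraction=0.5):
    """Report positions where a non-reference base dominates the pileup.

    reference: reference sequence as a string
    reads:     iterable of (position, sequence); 1-based leftmost position,
               ungapped alignments only
    Returns a list of (position, ref_base, alt_base, alt_count, depth).
    """
    pileup = defaultdict(Counter)
    for position, sequence in reads:
        for offset, base in enumerate(sequence):
            pileup[position + offset][base] += 1

    calls = []
    for position in sorted(pileup):
        if position < 1 or position > len(reference):
            continue  # read hangs off the end of the reference
        ref_base = reference[position - 1]
        counts = pileup[position]
        depth = sum(counts.values())
        if depth < min_depth:
            continue
        # Most frequent base that differs from the reference, if any.
        alt_base, alt_count = max(
            ((base, n) for base, n in counts.items() if base != ref_base),
            key=lambda item: item[1],
            default=(None, 0),
        )
        if alt_base is not None and alt_count / depth >= min_fraction:
            calls.append((position, ref_base, alt_base, alt_count, depth))
    return calls


# Hypothetical reference and reads; every read carries a T at position 5.
reference = "ACGTACGTAC"
reads = [(1, "ACGTTCGT"), (3, "GTTCGTAC"), (4, "TTCG")]
print(call_variants(reference, reads))  # [(5, 'A', 'T', 3, 3)]
```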

14 Today’s exercises

15 Variant Analysis
Extract variant information from the provided .bam file
Examine the output file and learn about the information contained in the various fields

16 Introducing…
Dr. Eric Rouchka
Bioinformatics Core Director
Department of Computer Engineering and Computer Science
University of Louisville
Kentucky Biomedical Research Infrastructure Network

