Next Generation Sequencing Bioinformatics Stephen Taylor Computational Biology Research Group.



History
- Sanger
  - Dominant for last ~30 years
  - 1000bp longest read
  - Based on primers so not good for repetitive or SNP sites
- Next Generation Sequencing
  - Much shorter reads, 25 to 300 bp
  - Higher throughput
  - Cheaper cost per Mb
  - Single molecule sequencing (no cloning step)
  - Since Jan 2008 more DNA sequenced than all previous years

Hence We Need High Throughput Bioinformatics

Sanger
- Fred Sanger (1980)
- Dye-terminator sequencing
  - PCR up DNA fragment
  - Separate into 2 strands
  - Polymerase elongates DNA
  - Incorporation of fluorescently labelled ddNTP causes termination of elongation for each base
  - Run DNA fragments on gel/capillary
  - Peak generated for each base

Illumina (Solexa)

Illumina (Solexa) Applications
- Resequencing
  - Characterise different related species or strains
- Transcriptome analysis
  - No chip/array required!
  - Random priming of RNA
- DNA methylation analysis
  - Sequencing bisulfite-converted DNA
  - Methylation-sensitive restriction digest enriched fragments
- Examine chromatin modifications
- Quantify in vivo protein-DNA interactions using the combination of chromatin immunoprecipitation and sequencing (ChIP-Seq)

Price Comparison

Processing and management

Assemble Data - Illumina
- Generates short reads (~35-75bp)
- Good for resequencing
- Difficult to do de novo assembly for all but the smallest organisms

Mapping Illumina Reads
- Acquire and process images and convert to FASTQ*
- Get data
- Quality control**
- Map to genome
- Visualisation
- Post Processing
  - Peak Finding
  - SNP Calling
* Not covered today

FASTQ
TATACAATGCACTTAGTCATCCGCGTATCACTTTAT
+
IIIIIIIIIIIIIIIIIIGIIIIIIIIII4IIII:I
Fields of the read identifier:
1. HWUSI-EAS100R: the unique instrument name
2. 6: flowcell lane
3. 73: tile number within the flowcell lane
4. 'x'-coordinate of the cluster within the tile
5. 'y'-coordinate of the cluster within the tile
6. #0: index number for a multiplexed sample (0 for no indexing)
7. /1: the member of a pair, /1 or /2 (paired-end or mate-pair reads only)
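The seven identifier fields listed on this slide can be pulled apart with a small parser. The following is only a sketch: it assumes the older Illumina layout `@instrument:lane:tile:x:y#index/pair`, and the identifier in the usage example is hypothetical (its x and y coordinates, 100 and 200, are made up for illustration). The names `ReadId` and `parse_read_id` are illustrative and not part of any CBRG tool.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadId:
    instrument: str    # unique instrument name
    lane: int          # flowcell lane
    tile: int          # tile number within the lane
    x: int             # x-coordinate of the cluster within the tile
    y: int             # y-coordinate of the cluster within the tile
    sample_index: str  # index for a multiplexed sample ("0" = no indexing)
    pair: int          # member of a pair: 1 or 2


# Assumed layout: @<instrument>:<lane>:<tile>:<x>:<y>#<index>/<pair>
_READ_ID = re.compile(
    r"^@(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+):"
    r"(?P<x>\d+):(?P<y>\d+)#(?P<index>[^/]+)/(?P<pair>[12])$"
)


def parse_read_id(line: str) -> ReadId:
    """Split one identifier line into its named fields."""
    m = _READ_ID.match(line.strip())
    if m is None:
        raise ValueError(f"not an Illumina-style read identifier: {line!r}")
    return ReadId(
        instrument=m["instrument"],
        lane=int(m["lane"]),
        tile=int(m["tile"]),
        x=int(m["x"]),
        y=int(m["y"]),
        sample_index=m["index"],
        pair=int(m["pair"]),
    )


if __name__ == "__main__":
    # Hypothetical identifier; the coordinates are invented for the example.
    print(parse_read_id("@HWUSI-EAS100R:6:73:100:200#0/1"))
```

Newer Illumina software writes a different identifier layout, so the pattern would need adjusting for such data.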

FASTQ format
- Quality Score
  - ASCII representation of score for each base, e.g. I
  - Convert to ASCII code, e.g. 73
  - Subtract the offset (33 for standard FASTQ) to recover the original Qphred = 40
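The decoding step on this slide is a one-line calculation per base. A minimal sketch, assuming the standard offset of 33 by default (the function name `decode_qualities` is illustrative):

```python
from typing import List


def decode_qualities(quality: str, offset: int = 33) -> List[int]:
    """Turn a FASTQ quality string into one integer score per base.

    Each character is converted to its ASCII code and the encoding
    offset is subtracted (33 for standard FASTQ, 64 for the older
    Illumina encodings).
    """
    return [ord(ch) - offset for ch in quality]


if __name__ == "__main__":
    # 'I' has ASCII code 73, and 73 - 33 = 40.
    print(decode_qualities("I"))  # [40]
```

Passing `offset=64` decodes the Illumina-specific encodings instead.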

Formats – warning!
- FASTQ format appears ‘standard’ but there are 3 types based on the probabilities of the base calls…
    Qphred = -10 x log10(error_prob)
    Qsolexa = -10 x log10(error_prob/(1-error_prob))
1. Standard fastq: ASCII( Qphred + 33 )
2. Illumina pre v1.3: ASCII( Qsolexa + 64 )
3. Illumina post v1.3: ASCII( Qphred + 64 )
- Option 3 should be the main one for the foreseeable future!
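The two score definitions and the three encodings can be written out directly. This is a sketch built from the formulas on the slide; the Solexa-to-Phred conversion is derived from them by eliminating error_prob, and all function names are illustrative.

```python
import math


def phred_from_error(error_prob: float) -> float:
    """Qphred = -10 x log10(error_prob)."""
    return -10 * math.log10(error_prob)


def solexa_from_error(error_prob: float) -> float:
    """Qsolexa = -10 x log10(error_prob / (1 - error_prob))."""
    return -10 * math.log10(error_prob / (1 - error_prob))


def solexa_to_phred(q_solexa: float) -> float:
    """Convert a Solexa score to a Phred score.

    From the two definitions: error_prob = 1 / (10**(Qsolexa/10) + 1),
    so Qphred = 10 x log10(10**(Qsolexa/10) + 1).
    """
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def encode_standard(q_phred: float) -> str:
    """Type 1, standard fastq: ASCII(Qphred + 33)."""
    return chr(round(q_phred) + 33)


def encode_illumina_pre_13(q_solexa: float) -> str:
    """Type 2, Illumina pre v1.3: ASCII(Qsolexa + 64)."""
    return chr(round(q_solexa) + 64)


def encode_illumina_post_13(q_phred: float) -> str:
    """Type 3, Illumina post v1.3: ASCII(Qphred + 64)."""
    return chr(round(q_phred) + 64)
```

The two scales nearly coincide for high-quality bases and diverge for poor ones, which is why mixing up types 2 and 3 mostly affects low-quality calls.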

Convert between formats
- Use sol2std2

Get Data
- May be supplied in a variety of formats
- .prb.txt files
  - Contain probabilities for each base
  - Some SNP callers use this
  - Usually convert to FASTQ
- FASTQ
  - Like FASTA but with quality score associated with each base
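To make "like FASTA but with a quality score for each base" concrete, here is a minimal sketch of a FASTQ reader. It assumes the common four-lines-per-record layout with unwrapped sequence lines; the function name `read_fastq` and the record in the usage example are hypothetical.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) for each four-line FASTQ record."""
    while True:
        header = handle.readline()
        if not header:          # end of file
            return
        header = header.rstrip("\r\n")
        if not header:          # tolerate blank lines between records
            continue
        seq = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        qual = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], seq, qual


if __name__ == "__main__":
    # Hypothetical two-record input, for illustration only.
    demo = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nIIGI\n")
    for name, seq, qual in read_fastq(demo):
        print(name, seq, qual)
```

For a compressed file, `gzip.open(path, "rt")` from the standard library returns a text handle that can be passed to the same function.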

WTCHG
- If data is from WTCHG likely to get an
- E.g. wget the FASTQ file in the GERALD directory: LD_ _johnb/s_2_sequence.txt.gz

Processing reads - Illumina
Mapping Tools
- MAQ
  - Sanger
  - Uses quality scores
- ELAND
  - Comes with the machine and runs as standard
  - Very fast
- NOVOALIGN
  - Slower, more accurate
  - Output option includes pairwise (handy for following up SNP calls)
- TOPHAT
  - For RNA-Seq
  - Can map splice junctions

Notes on Mapping
- What genome? Masking?
- Some tools disregard multiple maps, e.g. ELAND
- Some tools map to one location and adjust probability score, e.g. MAQ
- Can be confusing…
- For ChIP-Seq we normally use DNA heavily masked for repeats (simple/complex/ribosomal)

Databanks Indices
- We have many indexed databanks
- Under /databank/indices/, e.g. for maq:
  - ens_human_chrs/
  - ens_human_chrs_ucsc_rmfull_2/
  - ens_mouse_chrs/
  - ens_mouse_chrs_ucsc_rmfull/
  - ens_human_cdna/
  - ens_mouse_masked_chrs/
- Indices for both maq and novoalign
- If an index you need is not there please ask; don’t make a local one in your account!

ChIP-Seq Pipeline
ChIP-Sequencing Advantages
- Less DNA needed
- Not limited by micro-array content
- More precise site mapping
- More reads increase sensitivity
- Produces higher quality data

ChIP-Seq example
1. NGS reads
2. Map (maq)
3. Peak pick (cisgenome)
4. Extract sequences from features (Motif extract)
5. MEME
6. Weblogo

MAQ
- For simple runs use the ‘easyrun’ option…
    nohup /proj/hts/bin/maq.pl easyrun -d maq.log
- The main output file is all.map
- To convert the binary to something usable:
    maq pileup all.map > all.pileup
- These are quite large files…

Visualization
- all.map file converts to wig using CBRG custom tool
    maq wig all.map > all.wig
- Then we convert to GFF format using custom scripts

GFF format
- General Feature Format
- Developed at the Sanger Institute
- Format for describing features associated with DNA, RNA and Protein sequences
- Easy to parse
- More tools, e.g. EMBOSS, are starting to use this as standard
- GFF3 is more standard and works best with GBrowse

##gff-version 3
chr3 src exon ID=exon00001
chr3 src exon ID=exon00002
chr3 src exon ID=exon00003
chr3 src exon ID=exon00004
chr3 src exon ID=exon00005
- The feature type (exon) is a SOFA term
- Note the ‘=’ in the attributes (ID=...)
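"Easy to parse" can be shown with a short sketch. A full GFF3 feature line has nine tab-separated columns (seqid, source, type, start, end, score, strand, phase, attributes). The coordinates in the usage example below are hypothetical, since the numeric columns did not survive in the slide text, and the names `Feature` and `parse_gff3_line` are illustrative.

```python
from dataclasses import dataclass
from typing import Dict, Optional
from urllib.parse import unquote


@dataclass
class Feature:
    seqid: str
    source: str
    type: str               # should be a SOFA term, e.g. "exon"
    start: int
    end: int
    score: Optional[float]  # None when the column is "."
    strand: str
    phase: Optional[int]    # None when the column is "."
    attributes: Dict[str, str]


def parse_gff3_line(line: str) -> Optional[Feature]:
    """Parse one GFF3 line; return None for blank lines and '#' lines."""
    line = line.rstrip("\r\n")
    if not line or line.startswith("#"):
        return None
    cols = line.split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    attributes: Dict[str, str] = {}
    if cols[8] != ".":
        # Attributes are ';'-separated tag=value pairs (note the '=').
        for pair in cols[8].split(";"):
            if pair:
                key, _, value = pair.partition("=")
                attributes[unquote(key)] = unquote(value)
    return Feature(
        seqid=cols[0],
        source=cols[1],
        type=cols[2],
        start=int(cols[3]),
        end=int(cols[4]),
        score=None if cols[5] == "." else float(cols[5]),
        strand=cols[6],
        phase=None if cols[7] == "." else int(cols[7]),
        attributes=attributes,
    )


if __name__ == "__main__":
    # Hypothetical coordinates (1000-1200), for illustration only.
    print(parse_gff3_line("chr3\tsrc\texon\t1000\t1200\t.\t+\t.\tID=exon00001"))
```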

Wig binary files
- Scripts and modules to handle:
  - UCSC wiggle format (1 column; 2 column; 4 column), or
  - gff3
- Output: binary (.wib) via the GMOD script wiggle_to_wigBinary.pl, plus a gff file
- Function:
  - wiggle_to_wigBinary.pl variables (source / method / trackname / paths / input & output filenames)
  - command line to load binary / gff data into GBrowse (bp_seqfeature_load.pl + all variables: database name, filenames, paths etc)
  - a conf file stanza to display the loaded data
  - construct an intermediate wiggle format file (if input was gff3 or maq binary)

Peak Calling
- Lots of algorithms to do this
- Problems with identifying a good cut off score
  - Over and under prediction
- F-Seq
  - Based on a training set of peaks identified by researcher in specific region
  - Iterate over parameter space until the best TP/FP score is achieved
- cisgenome
  - Uses IP and Non IP ChIP-Seq data, increases accuracy of predictions
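The cut-off problem is easy to see with a toy peak caller. The sketch below is not the F-Seq or cisgenome algorithm; it is a plain threshold over per-base read depth, with made-up coverage values, meant only to show how the choice of cut-off drives over- and under-prediction.

```python
from typing import List, Sequence, Tuple


def call_peaks(
    coverage: Sequence[int], cutoff: int, min_width: int = 1
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) runs where coverage >= cutoff."""
    peaks: List[Tuple[int, int]] = []
    start = None
    for pos, depth in enumerate(coverage):
        if depth >= cutoff:
            if start is None:
                start = pos          # a new run begins here
        elif start is not None:
            if pos - start >= min_width:
                peaks.append((start, pos))
            start = None             # the run has ended
    # Close a run that extends to the end of the region.
    if start is not None and len(coverage) - start >= min_width:
        peaks.append((start, len(coverage)))
    return peaks


if __name__ == "__main__":
    # Made-up per-base depth for a short region.
    depth = [0, 1, 5, 6, 5, 1, 0, 2, 3, 2, 0]
    print(call_peaks(depth, cutoff=5))  # [(2, 5)]
    print(call_peaks(depth, cutoff=2))  # [(2, 5), (7, 10)]
```

Lowering the cut-off from 5 to 2 adds a second, weaker region. Whether that region is a true site or noise is exactly the judgement a training set or a non-IP control is meant to inform.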

Motif Extraction
- Extract underlying DNA from peak calls
- Run using web based motif finders
  - Weeder
  - MEME
- May need to do successive rounds to find weaker motifs
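Extracting the DNA under each peak amounts to slicing the reference sequence and writing FASTA for the motif finder. A minimal sketch, assuming the chromosome sequence is already in memory as a string and peaks are 0-based half-open (start, end) pairs; the function names and the sequence in the usage example are hypothetical, and this is not the CBRG "Motif extract" script.

```python
from typing import List, Sequence, Tuple


def extract_peak_sequences(
    chrom_seq: str, peaks: Sequence[Tuple[int, int]], flank: int = 0
) -> List[str]:
    """Slice out the sequence under each peak, optionally with flanks."""
    out = []
    for start, end in peaks:
        lo = max(0, start - flank)             # clamp at the sequence start
        hi = min(len(chrom_seq), end + flank)  # clamp at the sequence end
        out.append(chrom_seq[lo:hi])
    return out


def to_fasta(seqs: Sequence[str], prefix: str = "peak") -> str:
    """Format sequences as FASTA records named <prefix>_1, <prefix>_2, ..."""
    return "".join(f">{prefix}_{i}\n{seq}\n" for i, seq in enumerate(seqs, start=1))


if __name__ == "__main__":
    # Made-up 12 bp "chromosome" with one peak over positions 4..8.
    seq = "AAAACGTGAAAA"
    print(to_fasta(extract_peak_sequences(seq, [(4, 8)], flank=2)), end="")
```

The resulting FASTA text is the kind of input that can be pasted into a web-based motif finder; for successive rounds, sequences containing the motif already found can be masked or removed before re-running.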

Quick note: SNP Calling
- Often finds errors in the PCR amplification step
- maq cns2snp (run during the easyrun option)
- SNPseeker
- Novoalign + CBRG script
- Worth trying all of the above!

Molbiol Data Structure
- Analyse your data on deva.molbiol.ox.ac.uk
- CBRG set up /proj/hts/data/
- Suggested structure: batch/ fastq/ dbname/
- Contact us if you want a GBrowse database for your data

Future
- Problem
  - In-depth analysis after mapping = bottleneck
  - Need to empower the users to do their own analysis
- Solution
  - Makefiles for bulk data analysis
  - Allow access to NGS data via GBrowse ‘workbench’
  - GBrowse plugins to export data to other tools
  - Galaxy looks promising: http://main.g2.bx.psu.edu/