Next Generation Sequencing Analysis

Slides:

Advertisements

Similar presentations

NGS Bioinformatics Workshop 2.1 Tutorial – Next Generation Sequencing and Sequence Assembly Algorithms May 3rd, 2012 IRMACS Facilitator: Richard.

Advertisements

A. Dereeper, G. Sarah, F. Sabot, Y. Hueber Exploiting SNP polymorphism data Formation Bio-informatique, 9 au 13 février 2015.

Reference mapping and variant detection Peter Tsai Bioinformatics Institute, University of Auckland.

SCHOOL OF COMPUTING ANDREW MAXWELL 9/11/2013 SEQUENCE ALIGNMENT AND COMPARISON BETWEEN BLAST AND BWA-MEM.

DNAseq analysis Bioinformatics Analysis Team

High Throughput Sequencing

SOLiD Sequencing & Data

The Extraction of Single Nucleotide Polymorphisms and the Use of Current Sequencing Tools Stephen Tetreault Department of Mathematics and Computer Science.

MCB Lecture #20 Nov 18/14 Reference alignments.

Pathogen Informatics 21 st Nov 2014 Pathogen Sequencing Informatics Jacqui Keane Pathogen Informatics.

Biological Sequence Analysis BNFO 691/602 Spring 2014 Mark Reimers

Considerations for Analyzing Targeted NGS Data Exome Tim Hague, CTO.

SOAP3-dp Workflow.

Bioinformatics Tips NGS data processing and pipeline writing

NGS Analysis Using Galaxy

Steve Newhouse 28 Jan  Practical guide to processing next generation sequencing data  No details on the inner workings of the software/code &

Whole Exome Sequencing for Variant Discovery and Prioritisation

Mapping NGS sequences to a reference genome. Why? Resequencing studies (DNA) – Structural variation – SNP identification RNAseq – Mapping transcripts.

Considerations for Analyzing Targeted NGS Data Introduction Tim Hague, CTO.

File formats Wrapping your data in the right package Deanna M. Church

GBS Bioinformatics Pipeline(s) Overview

RNAseq analyses -- methods

NGS data analysis CCM Seminar series Michael Liang:

ParSNP Hash Pipeline to parse SNP data and output summary statistics across sliding windows.

Considerations for Analyzing Targeted NGS Data Exome Tim Hague, CTO.

Alexis DereeperCIBA courses – Brasil 2011 Detection and analysis of SNP polymorphisms.

Introductory RNA-seq Transcriptome Profiling. Before we start: Align sequence reads to the reference genome The most time-consuming part of the analysis.

Introduction to RNAseq

Tutorial 6 High Throughput Sequencing. HTS tools and analysis Review of resequencing pipeline Visualization - IGV Analysis platform – Galaxy Tuning up.

IGV tools. Pipeline Download genome from Ensembl bacteria database Export the mapping reads file (SAM) Map reads to genome by CLC Using the mapping.

No reference available

Current Data And Future Analysis Thomas Wieland, Thomas Schwarzmayr and Tim M Strom Helmholtz Zentrum München Institute of Human Genetics Geneva, 16/04/12.

Calling Somatic Mutations using VarScan

Canadian Bioinformatics Workshops

Canadian Bioinformatics Workshops

From Reads to Results Exome-seq analysis at CCBR

DAY 2. GETTING FAMILIAR WITH NGS SANGREA SHIM. INDEX  Day 2  Get familiar with NGS  Understanding of NGS raw read file  Quality issue  Alignment/Mapping.

Canadian Bioinformatics Workshops

071126_EAS56_0057_FC – lanes 1-8 read 2 b a _EAS56_0057_FC – lanes 1-8 read 1 Table S1. Summary tables for a read 1 and b read 2 of a.

Introductory RNA-seq Transcriptome Profiling

Using command line tools to process sequencing data

NGS File formats Raw data from various vendors => various formats

Day 5 Mapping and Visualization

Short Read Sequencing Analysis Workshop

Lesson: Sequence processing

Cancer Genomics Core Lab

Dowell Short Read Class Phillip Richmond

MGmapper A tool to map MetaGenomics data

VCF format: variants c.f. S. Brown NYU

First Bite of Variant Calling in NGS/MPS Precourse materials

Invest. Ophthalmol. Vis. Sci ;57(10): doi: /iovs Figure Legend:

S1 Supporting information Bioinformatic workflow and quality of the metrics Number of slides: 10.

MiSeq Validation Pipeline

EMC Galaxy Course November 24-25, 2014

Assessment of HaloPlex Amplification for Sequence Capture and Massively Parallel Sequencing of Arrhythmogenic Right Ventricular Cardiomyopathy–Associated.

Whole-exome sequencing for RH genotyping and alloimmunization risk in children with sickle cell anemia by Stella T. Chou, Jonathan M. Flanagan, Sunitha.

Yonglan Zheng Galaxy Hands-on Demo Step-by-step Yonglan Zheng

Assessment of HaloPlex Amplification for Sequence Capture and Massively Parallel Sequencing of Arrhythmogenic Right Ventricular Cardiomyopathy–Associated.

ChIP-Seq Data Processing and QC

Data formats Gabor T. Marth Boston College

Learning to count: quantifying signal

Maximize read usage through mapping strategies

BF528 - Genomic Variation and SNP Analysis

Additional file 2: RNA-Seq data analysis pipeline

Canadian Bioinformatics Workshops

BF528 - Sequence Analysis Fundamentals

Computational Pipeline Strategies

Automating NGS Gene Panel Analysis Workflows

RNA-Seq Data Analysis UND Genomics Core.

The Variant Call Format

Presentation transcript:

Next Generation Sequencing Analysis By Basil Khuder Basil.Khuder@rockets.utoledo.edu By Basil Khuder Basil.Khuder@rockets.utoledo.edu

Workflow Part 1: Raw Reads to Analysis-Ready Reads RAW Reads = FASTQ files We convert FASTQ to uBAM, and then back to FASTQ We take this converted FASTQ and map it to the reference using BWA, and run some preparatory procedures using Picard Tools Tools Needed Burrows-Wheeler Aligner Picard Tools SAMTools Genome Analysis Toolkit We finally run base recalibration on the aligned file to get into BAM format, and it’s not ready for variant calling.

Step 1: Creating a Reference Index Why Create a Reference? Since we are mapping many short reads to the reference sequence, we need to build a reference index so that it’s easier to find the exact sequence in the reference. Without reference, process would be slow.

Step 1: Creating a Reference Index (Continued) Creating a Reference index will produce 7 files, based upon the reference genome. bwa index -a bwtsw hg19.fasta java -jar picard.jar CreateSequenceDictionary R= hg19.fasta O= hg19.dict samtools faidx hg19.fasta

Step 2: Converting FASTQ file to an Unmapped BAM Why convert to an Unmapped BAM? According to the Broad Institute: Most sequencing providers generate FASTQ files with the raw unmapped read sequences, This is not ideal because much of the metadata runs cannot be stored in FASTQ files, unlike BAM files.

Step 2: Converting FASTQ file to an Unmapped BAM (Continued) Play close attention to the following options: Sample Name, Library Name, Platform and Read Group Name: Sample_NAME is simply the sample you are working with. In our case, SG4-6. Library_Name is the library preparation that the sequencing data went through it. Our library prep was Agilent SureSelect Human All Exon V5 Kit. Platform is the company who conducted the sequencing. In this case, it’s Illumina. Read_Group_Name is taken from the FASTQ file. Specifically, the information in the third placeholder marked “C7H7” which is the flowcell id. You can ignore the EANXX, and then will also need to include the “5” which refers to the flow cell lane.

Step 2: Converting FASTQ file to an Unmapped BAM (Continued) java -Xmx8G -jar picard.jar FastqToSam \ FASTQ=SG5_R1.fastq \ FASTQ2=SG5_R2.fastq \ OUTPUT=SG5_fastqtosam.bam \ READ_GROUP_NAME=C7H7:5 \ SAMPLE_NAME=SG_5 \ LIBRARY_NAME=Agilent_SureSelect \ PLATFORM=illumina \ RUN_DATE=2017-01-20T00:30:30+00:00 #using iso8601 http://www.timestampgenerator.com/

Step 3: Mark Illumnia Adatprers Because the adapters that were used in the initial sequencing prep are still in the files, we must mark them. We do this using Picard Tools and the Mark Illumina Adapters option. java -jar picard.jar MarkIlluminaAdapters \ I=SG5_fastqtosam.bam\ O=SG5_fastqtosam_markilluminaadapters.bam\ M=SG5_snippet_markilluminaadapters_metrics.txt

Step 4: Revert back to FASTQ, with required meta data. java -jar picard.jarSamToFastq\ I=SG5_fastqtosam_markilluminaadapters.bam \ FASTQ=SG5_interleaved.fq \ CLIPPING_ATTRIBUTE=XT \ #Refers to the marking of adapters CLIPPING_ACTION=2 \ #Refers to the marking of adapters INTERLEAVE=true \ #Refers to paired reads NON_PF=true \ #0x200 failedVendorQualityChecks flag #85

Step 5: Align to Reference Using BWA Bwa is the alignnment tool, mem is the alrgorithm we are using. -M allows MergeBamAlignment, a tool for the next step, to reassign proper secondary alignments. Secondary alignments are ambiguous alignments that might be mapped incorrectly (due to poor read quality, contamination ect.) T refers to the amount of computer threads we want to use. (Use nproc to find number of processor threads.) -P indicates the file contains interleaved paired reads. bwa mem -M -t 30 -p hg19.fasta \ SG5_interleaved.fq > SG5_Hg19.sam

Step 6: Convert aligned SAM file to BAM java -jar picard.jar MergeBamAlignment \ R=hg19.fasta \ UNMAPPED_BAM=SG6_fastqtosam.bam \ ALIGNED_BAM=SG6_Hg19.sam \ O=SG6_final_hg19.bam \ CREATE_INDEX=true \ ADD_MATE_CIGAR=true \ #Cigar is a string of sequences to help indicate mismatches, deletions, ect. CLIP_ADAPTERS=false \ #Already marked, no need to clip. CLIP_OVERLAPPING_READS=true \ #We will clip any overlapping reads SECONDARY_ALIGNMENTS=true \ #Remember this from Step 5 MAX_INSERTIONS_OR_DELETIONS=-1 \ PRIMARY_ALIGNMENT_STRATEGY=MostDistant\ #Recommended Algorithm ATTRIBUTES_TO_RETAIN=XS #Retaining Meta Data

We use HaplotypeCaller to produce the variants based off our BAM file. Part 2: Variant Calling We use HaplotypeCaller to produce the variants based off our BAM file. Variant Calling is identifying the sites where your data displays variation relative to the reference genome, and calculate genotypes for each sample at that site. We use VQSR to recalibrate SNPS and filter variants that don’t pass our threshold.

Using Haplotype Caller java -jar GenomeAnalysisTK.jar \ -T HaplotypeCaller \ -R hg19.fasta \ -I SG6_final_hg19.bam \ -L 20 \ #Only used if you want to call variants for specific chromosome, ect. --genotyping_mode DISCOVERY \ -stand_call_conf 30 \ -o raw_variants.vcf

Variant Filtration Steps 1.) Prepare recalibration parameters for SNPs 2.) Build the SNP recalibration model 3.) Apply the desired level of recalibration to the SNPs in the call set 4.) Prepare recalibration parameters for Indels 5.) Build the Indel recalibration model 6.) Apply the desired level of recalibration to the Indels in the call set

Prepare recalibration parameters for SNPs Purpose: Specify which call sets the program should use as resources to build the recalibration model. Annotation Programs to conduct recalibration: QualByDepth: Produces variance confidence FisherStrand: Measures strand bias (is the variation on forward or reverse strand? More bias = false positives. StrandOddsRatio: Also measures strand bias. MappingQualityRankSumTest: Test for mapping qualities. ReadPostRankSumTest: Also tests for mapping qualities. RMSMappingQuality: Estimates the overall mapping quality of variants. Link: https://software.broadinstitute.org/gatk/guide/topic?name=tutorials

Prepare recalibration parameters for SNPs (Continued) Tranches are essentially slices of variants, ranked by VQSLOD, with specific threshold values. These values refer to the sensitivity we can obtain when we apply them to the call sets that the program uses to train the model. The idea is that the lowest tranche is highly specific but less sensitive (there are very few false positives but potentially many false negatives, i.e. missing calls), and each subsequent tranche in turn introduces additional true positive calls along with a growing number of false positive calls. This allows us to filter variants based on how sensitive we want the call set to be, rather than applying hard filters and then only evaluating how sensitive the call set is using post hoc methods.

Step 2: Apply the desired level of recalibration to the SNPs in the call set

Step 3: Build the Indel Recalibration Model

Step 4: Apply the desired level of recalibration to the Indels in the call set

Homework 1.) Download the hg19 reference genome and describe the process. Explain the steps you would take to index this reference genome and the software used. Lastly, explain how you would conduct reference indexing and why it's a necessary step. 2.) What is the purpose of going from FASTQ to uBAM? Explain the importances of this conversion step and how uBAM differs from a traditional BAM file. 3.) What does variant filtration do? Describe the necessary parameters and procedures that are needed to conduct variant filtration? How does this step differ from variant calling? BONUS: How does the workflow of RNA-Seq to VCF differ than the workflow described today involving using Exome-Sequencing data? Explain the major differences between the two.