MCB3895-004 Lecture #20 Nov 18/14 Reference alignments.



Today:
1. Align reads to a reference genome
2. Correct for misalignments
3. Analyze variants between reads and the reference genome (i.e., differences between the sequenced genome and the reference)
We will roughly follow the default samtools protocol.

Record keeping
As you will see, a huge amount of this work is converting between formats so that different software will work
It is therefore CRUCIAL that you keep records of all the commands that you use
BEWARE: different versions of all of these programs have different syntaxes
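One possible way to keep such records automatically is a small wrapper that writes each command to a notebook file before running it. This is only a sketch: the function name `run` and the file name `lab_notebook.txt` are made up for illustration.

```shell
# Hypothetical notebook file name; override by setting NOTEBOOK beforehand.
NOTEBOOK="${NOTEBOOK:-lab_notebook.txt}"

# run CMD [ARGS...]: append a timestamp and the exact command line
# to the notebook file, then execute the command unchanged.
run() {
    printf '%s\t%s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >> "$NOTEBOOK"
    "$@"
}

# Usage, including a version record since syntax differs between versions:
#   run samtools --version
#   run bwa index E_coli.fasta
```

Output redirection such as `> align.sam` is handled by the shell before `run` sees the command, so it does not appear in the log; steps that redirect output would need to be noted by hand.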

Read mapping
Align reads to a reference genome and determine SNPs
Note: we are aligning reads, not contigs as with nucmer
Computationally more efficient than doing de novo assembly first

Read mapping tools
Many different flavors, but overall dominated by two programs:
bwa (we will use today)
bowtie
Note: early versions of bowtie did not align reads containing indels, whereas bwa did
Some debate about which is best: trade-offs between sensitivity (ability to map everything) and specificity (are the mappings correct)
Also speed and memory considerations

Mapping using bwa
Create an index of the reference genome nucleotide fasta for the alignment software to use for read mapping
syntax: $ bwa index [ref.fasta file]
e.g.: $ bwa index E_coli.fasta
note: use the ".fasta" file ending for a later step
Creates 5 output files: [ref.fasta].amb, .ann, .bwt, .pac, .sa
"Index": a special computer data structure that allows fast searching; software-specific
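Because the index consists of five separate files, it can be worth checking that all of them exist before mapping. A minimal sketch, assuming the five file endings listed above; the function name `index_complete` is hypothetical.

```shell
# index_complete REF: succeed only if all five bwa index files
# (REF.amb, REF.ann, REF.bwt, REF.pac, REF.sa) exist and are non-empty.
index_complete() {
    ref=$1
    for ext in amb ann bwt pac sa; do
        if [ ! -s "$ref.$ext" ]; then
            echo "missing index file: $ref.$ext" >&2
            return 1
        fi
    done
    return 0
}

# Usage: build the index only when it is incomplete.
#   index_complete E_coli.fasta || bwa index E_coli.fasta
```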

Mapping using bwa
bwa mem does the actual mapping step
syntax: $ bwa mem -R '[read group string]' [ref.fasta file] [read file 1] [read file 2] > [outfile]
-R : read group header line to attach to the reads; "read groups" are required for GATK in later steps
e.g.: $ bwa mem -R '[read group string]' E_coli.fasta SRR826450_1.fastq SRR826450_2.fastq > align.sam
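The -R option of bwa mem takes a single string that starts with @RG and separates its fields with a literal backslash-t, for example '@RG\tID:foo\tSM:bar'. The sketch below assembles such a string; the ID, sample, and library values are placeholders chosen for illustration, not the values used in the lecture.

```shell
# make_read_group ID SAMPLE LIBRARY: print an @RG line whose fields are
# separated by a literal backslash followed by "t", the form that
# bwa mem -R accepts.
make_read_group() {
    printf '@RG\\tID:%s\\tSM:%s\\tLB:%s\n' "$1" "$2" "$3"
}

# Hypothetical values for illustration only.
RG=$(make_read_group SRR826450 E_coli_sample lib1)
printf '%s\n' "$RG"

# Usage:
#   bwa mem -R "$RG" E_coli.fasta SRR826450_1.fastq SRR826450_2.fastq > align.sam
```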

samtools : convert .sam to .bam, clean up names
.sam is the plain text output format of most sequence alignment programs
Because these files can be large, most subsequent programs use the compressed ".bam" format instead
bwa sometimes does odd things to read pairing information; this can be cleaned up during conversion
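Since .sam is plain tab-separated text, ordinary text tools can read it. The sketch below uses one made-up alignment record (not taken from the lecture data) to show the 11 mandatory columns and how to pull a few of them out with awk.

```shell
# One made-up alignment record with the 11 mandatory SAM columns:
# QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL
sam_line=$(printf 'read1\t99\tchr\t100\t60\t5M\t=\t200\t105\tACGTA\tIIIII')

# Extract read name, reference name, 1-based leftmost position, and CIGAR.
summary=$(printf '%s\n' "$sam_line" | awk -F'\t' '{ print $1, $3, $4, $6 }')
printf '%s\n' "$summary"
```

The same awk call can be run on a real file after skipping header lines, which start with "@".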

samtools : convert .sam to .bam, clean up names
syntax: $ samtools fixmate -O bam [input.sam file] [output.bam file]
-O : output file type
e.g.: $ samtools fixmate -O bam align.sam align_fixmate.bam

samtools : sort .bam file
samtools and related software use .bam files that are sorted by ascending genomic position, i.e., starting from position #1 on the reference genome and going to the end
syntax: $ samtools sort -O bam -o [output sorted.bam] -T [temp file location] [input unsorted.bam]
-O : output file type
-o : output file name
-T : location for temporary files (required)
e.g.: $ samtools sort -O bam -o align_sorted.bam -T temp align_fixmate.bam
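What "sorted by ascending genomic position" means can be illustrated on a toy table with the standard sort utility. The three records are made up, and this only mimics the ordering; samtools sort itself works on the binary .bam file and orders reference sequences as they are listed in the file header.

```shell
# Three made-up records (read name, reference name, position), unordered.
unsorted=$(printf 'r1\tchr\t500\nr2\tchr\t20\nr3\tchr\t100\n')

# Sort by reference name, then numerically by position.
tab=$(printf '\t')
sorted=$(printf '%s\n' "$unsorted" | sort -t "$tab" -k2,2 -k3,3n)
printf '%s\n' "$sorted"
```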

GATK : realign indels
bwa sometimes misaligns indels in reads
One way to get rid of these is to use the realignment functions in the GATK package
More generally: GATK does many of the same things as samtools, with a strong focus on diploid genomes
Unfortunately, GATK uses java (silly command line syntax)
Unfortunately, GATK needs its own file formats

Picard : index reference
syntax: $ java -jar /export/apps/picard-tools /picard.jar CreateSequenceDictionary REFERENCE=[ref.fasta file] OUTPUT=[output.dict file]
REFERENCE : reference file name
OUTPUT : output index file name, must end in ".dict"
e.g.: $ java -jar /export/apps/picard-tools /picard.jar CreateSequenceDictionary REFERENCE=E_coli.fasta OUTPUT=E_coli.dict

samtools : index reference
syntax: $ samtools faidx [ref.fasta file]
e.g.: $ samtools faidx E_coli.fasta
outputs [ref.fasta].fai index file

samtools : index .bam file
syntax: $ samtools index [sorted.bam file]
e.g.: $ samtools index align_sorted.bam
outputs [sorted.bam].bai index file
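The three indexing steps above each leave a file with a predictable name. A small check like the following can confirm they are all in place before starting GATK. This is a sketch under the naming shown in the examples (E_coli.dict, E_coli.fasta.fai, align_sorted.bam.bai); the function name `gatk_inputs_ready` is hypothetical.

```shell
# gatk_inputs_ready REF BAM: check for the sequence dictionary, the
# faidx index of the reference, and the index of the sorted .bam file.
gatk_inputs_ready() {
    ref=$1
    bam=$2
    # ${ref%.fasta} strips the ".fasta" ending, so E_coli.fasta -> E_coli
    for f in "${ref%.fasta}.dict" "$ref.fai" "$bam.bai"; do
        [ -e "$f" ] || { echo "missing: $f" >&2; return 1; }
    done
}

# Usage:
#   gatk_inputs_ready E_coli.fasta align_sorted.bam && echo "ready for GATK"
```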

GATK : prepare reads for indel realignment
syntax: $ java -Xmx2g -jar /opt/bioinformatics/GATK/GenomeAnalysisTK.jar -T RealignerTargetCreator -R [ref.fasta file] -I [sorted & indexed.bam file] -o [output file name]
-T : program function to use
-R : reference .fasta file name
-I : sorted and indexed .bam file
-o : output intervals file name
e.g.: $ java -Xmx2g -jar /opt/bioinformatics/GATK/GenomeAnalysisTK.jar -T RealignerTargetCreator -R E_coli.fasta -I align_sorted.bam -o align_sorted.intervals

GATK : perform indel realignment
syntax: $ java -Xmx4g -jar /opt/bioinformatics/GATK/GenomeAnalysisTK.jar -T IndelRealigner -R [ref.fasta file] -I [sorted & indexed.bam file] -targetIntervals [intervals file] -o [output.bam file]
-T : program function to use
-R : reference .fasta file
-I : sorted and indexed .bam file
-targetIntervals : intervals file from the last step
-o : output .bam file name
e.g.: $ java -Xmx2g -jar /opt/bioinformatics/GATK/GenomeAnalysisTK.jar -T IndelRealigner -R E_coli.fasta -I align_sorted.bam -targetIntervals align_sorted.intervals -o align_realigned.bam

Picard : remove duplicates
Duplicate reads can arise because of PCR artifacts during sequencing
Because duplicate reads do not provide additional information, it is best to remove them for computational efficiency
Identified by having identical start and end mapping positions
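The idea of identifying duplicates by identical start and end positions can be sketched on a toy table. The reads below are made up, and this is a deliberate simplification: it keeps the first read seen at each position pair, which is not a description of how Picard chooses the read to keep.

```shell
# Made-up mapped reads: name, start, end (tab-separated).
reads=$(printf 'r1\t100\t250\nr2\t100\t250\nr3\t100\t251\nr4\t300\t450\nr5\t100\t250\n')

# Keep the first read seen for each (start, end) pair ...
kept=$(printf '%s\n' "$reads" | awk -F'\t' '!seen[$2 FS $3]++ { print $1 }')
# ... and list every later read with the same pair as a duplicate.
dups=$(printf '%s\n' "$reads" | awk -F'\t' 'seen[$2 FS $3]++ { print $1 }')

printf 'kept: %s\n' "$(echo $kept)"
printf 'duplicates: %s\n' "$(echo $dups)"
```

Here r3 is kept even though it starts at the same position as r1, because its end position differs.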

Picard : remove duplicates
syntax: $ java -Xmx2g -jar /export/apps/picard-tools /picard.jar MarkDuplicates INPUT=[input bam file] OUTPUT=[output bam file] REMOVE_DUPLICATES=true METRICS_FILE=[metrics output file]
INPUT : input .bam file from GATK
OUTPUT : output .bam file lacking duplicates
REMOVE_DUPLICATES : true = remove duplicate reads from the output
METRICS_FILE : summary file of duplicate reads removed
e.g.: $ java -Xmx2g -jar /export/apps/picard-tools /picard.jar MarkDuplicates INPUT=align_realigned.bam OUTPUT=align_nodups.bam REMOVE_DUPLICATES=true METRICS_FILE=nodups.metrics

samtools : index .bam file
samtools requires that the new .bam file be indexed before variant calling
syntax: $ samtools index [.bam file name]
e.g.: $ samtools index align_nodups.bam

samtools : create .bcf file for variant calling
bcftools is a package very similar to samtools that handles variant calling
Of course, it requires its own file format
syntax: $ samtools mpileup -go [output.bcf] -f [ref.fasta] [1 or more indexed.bam]
-g : write output in .bcf format
-o : output file name
-f : reference .fasta file name
e.g.: $ samtools mpileup -go E_coli.bcf -f E_coli.fasta align_nodups.bam

bcftools : call variants
The actual variant calling step uses the call function in bcftools
syntax: $ bcftools call -vmO z -o [output.vcf.gz file] [input.bcf file]
-v : only output variant sites
-m : specify variant calling algorithm (multiallelic)
-O : specify output format, z = .vcf.gz
-o : output file name
e.g.: $ bcftools call -vmO z -o E_coli.vcf.gz E_coli.bcf

bcftools : index .vcf.gz file
tabix is a program distributed with htslib (the library that bcftools is built on) that indexes a .vcf.gz file
syntax: $ tabix -p vcf [input.vcf.gz file]
-p : specifies file type
e.g.: $ tabix -p vcf E_coli.vcf.gz

bcftools : analyze .vcf.gz file
bcftools has handy software to analyze the variants that it has identified
syntax: $ bcftools stats -F [ref.fasta] -s - [input.vcf.gz file] > [output file]
-F : faidx indexed reference .fasta sequence
-s : list of samples to analyze, "-" = all samples
e.g.: $ bcftools stats -F E_coli.fasta -s - E_coli.vcf.gz > E_coli.vcf.gz.stats
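As a rough cross-check on the SNP and indel totals, the variant records can also be counted directly, since a .vcf file is tab-separated text. The sketch below uses four made-up records and a simplified rule (REF and ALT both one base long = SNP, anything else = indel); it ignores multi-allelic sites and is not a replacement for bcftools stats.

```shell
# Made-up VCF records: CHROM POS ID REF ALT QUAL FILTER INFO
vcf=$(printf 'chr\t10\t.\tA\tG\t50\t.\t.\nchr\t20\t.\tAT\tA\t30\t.\t.\nchr\t30\t.\tC\tT\t8\t.\t.\nchr\t40\t.\tG\tGCC\t12\t.\t.\n')

# Skip header lines, then classify each record by REF and ALT length.
counts=$(printf '%s\n' "$vcf" | awk -F'\t' '
    /^#/ { next }
    length($4) == 1 && length($5) == 1 { snps++; next }
    { indels++ }
    END { printf "SNPs=%d indels=%d\n", snps, indels }')
printf '%s\n' "$counts"
```

On a real compressed file the records could be supplied with `gzip -dc E_coli.vcf.gz` in place of the toy variable.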

[Figures: summary stats, indel stats, quality stats, indel types, substitution types]

bcftools : filter variants based on quality score
Generally, one wants to mark low quality variants. Where to draw the cutoff line is somewhat subjective
syntax: $ bcftools filter -O z -o [output.vcf.gz file] -s LOWQUAL -i '%QUAL>10' [input.vcf.gz file]
-O : output type, "z" = .vcf.gz
-o : output file name
-s : label to mark failed variants
-i : condition under which variants pass
e.g.: $ bcftools filter -O z -o E_coli_filtered.vcf.gz -s LOWQUAL -i '%QUAL>10' E_coli.vcf.gz
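The effect of marking variants against a quality cutoff can be sketched with awk on made-up records. The function name `mark_lowqual` is hypothetical, and the sketch only imitates the labelling: records with QUAL above the cutoff get PASS in the FILTER column, the rest get LOWQUAL.

```shell
# Made-up VCF records: CHROM POS ID REF ALT QUAL FILTER INFO
vcf=$(printf 'chr\t10\t.\tA\tG\t50\t.\t.\nchr\t30\t.\tC\tT\t8\t.\t.\nchr\t40\t.\tG\tGCC\t10\t.\t.\n')

# mark_lowqual CUTOFF: read VCF text on stdin and set FILTER (column 7)
# to PASS when QUAL (column 6) is greater than CUTOFF, else LOWQUAL.
mark_lowqual() {
    awk -F'\t' -v OFS='\t' -v cutoff="$1" '
        /^#/ { print; next }
        { $7 = ($6 + 0 > cutoff + 0) ? "PASS" : "LOWQUAL"; print }'
}

marked=$(printf '%s\n' "$vcf" | mark_lowqual 10)
printf '%s\n' "$marked"
```

A record with QUAL exactly equal to the cutoff is marked LOWQUAL, because the condition is strictly greater than. Rerunning with different cutoffs is one way to see how sensitive the variant counts are to the chosen threshold.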

bcftools : calculate stats based on filtered variants
You can tell bcftools stats to only analyze variants that pass the filter
syntax: $ bcftools stats -F [ref.fasta] -f PASS -s - [input filtered.vcf.gz file]
-F : faidx indexed reference .fasta sequence
-f : FILTER label of the variants to include
-s : list of samples to analyze, "-" = all samples
e.g.: $ bcftools stats -F E_coli.fasta -f PASS -s - E_coli_filtered.vcf.gz

Assignment
How do our MiSeq and HiSeq E. coli datasets differ from the reference K-12 genome?
Submit:
1. Number of SNP and indel differences compared to the reference genome
2. Justification for the filtering parameters used
3. Lab notebook file listing all of the exact parameters used
4. Output of the bcftools stats analysis