Dowell Short Read Class Phillip Richmond

ReSequencing

Outline: The Plan
1. Organize and copy data to your own working directory
2. Map reads back to a reference genome
3. Convert SAM to BAM
4. Remove duplicates
5. Run a variant caller
6. Visualize variants

Plan
The first round of variant calling involves cutting the yeast genome Sigma1278b into reads, mapping them back to the S288c reference genome, and then finding all SNP differences between the two genomes.
This data is synthetic.
The reads have already been produced for you in FASTQ format as single-end 50 bp reads (1x50).

Getting started
Organization is KEY! For the resequencing tutorial, this is the organization that will be necessary:
Make a new directory in your home directory called ReSequencing.
Inside ReSequencing, make these subdirectories: GENOME, FASTQ, SAM, VCF, PBS.
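
The layout above can be created in one pass. This is a minimal sketch; the RESEQ_DIR variable is an assumption added for illustration so the location can be overridden, and it defaults to the ReSequencing directory in your home directory.

```shell
# Base directory for the tutorial. RESEQ_DIR is a hypothetical override;
# by default the tree is created under your home directory.
RESEQ_DIR="${RESEQ_DIR:-$HOME/ReSequencing}"

# mkdir -p creates missing parent directories and does not fail
# if a directory already exists, so this is safe to re-run.
for sub in GENOME FASTQ SAM VCF PBS; do
    mkdir -p "$RESEQ_DIR/$sub"
done

# List the result to confirm the layout.
ls "$RESEQ_DIR"
```

Re-running the loop is harmless, which helps if you are not sure whether you already made some of the directories.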

Copying the data
Now we want to copy the data from /projects/sreadgrp/homeworkfiles/ReSequencing/:
Copy the FASTQ file (Sigmav7_50mers.fastq) from the FASTQ directory to your own FASTQ directory.
Copy SGDv4.fasta from GENOME/ to your own GENOME/ directory.
Copy the PBS files to your own PBS directory: IndexGenome.pbs, MapReads.pbs, Sam2Bam.pbs, IndelRealign.pbs, CallSNPs.pbs.
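
The copy steps above could be scripted roughly as follows. The function name copy_inputs is made up for this sketch, and it assumes the destination tree (FASTQ, GENOME, PBS) already exists under $HOME/ReSequencing. The file names and source path are the ones listed on the slide.

```shell
# copy_inputs SRC DEST: copy the tutorial inputs from SRC into an existing
# DEST tree that already has FASTQ, GENOME and PBS subdirectories.
# (copy_inputs is a hypothetical helper, not part of the class materials.)
copy_inputs() {
    src=$1
    dest=$2
    cp "$src/FASTQ/Sigmav7_50mers.fastq" "$dest/FASTQ/" || return 1
    cp "$src/GENOME/SGDv4.fasta" "$dest/GENOME/" || return 1
    for script in IndexGenome MapReads Sam2Bam IndelRealign CallSNPs; do
        cp "$src/PBS/$script.pbs" "$dest/PBS/" || return 1
    done
}

SRC=/projects/sreadgrp/homeworkfiles/ReSequencing
if [ -d "$SRC" ]; then
    copy_inputs "$SRC" "$HOME/ReSequencing"
else
    # The source path only exists on the class cluster.
    echo "Source directory $SRC not found; run this on the class cluster." >&2
fi
```

The function stops at the first failed copy, so a missing file is noticed immediately rather than at the mapping step.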

Index the genome (IndexGenome.pbs)
Command:
/opt/bowtie/bowtie2-2.0.2/bowtie2-build <in.fasta> <out_index>
My command:
/opt/bowtie/bowtie2-2.0.2/bowtie2-build /Users/richmonp/ReSequencing/GENOME/SGDv4.fasta /Users/richmonp/ReSequencing/GENOME/SGDv4_bowtie2_Index
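
Once the indexing job finishes, it is worth checking that the index is complete before mapping. For a small genome, bowtie2-build writes six files that share the output prefix and end in .1.bt2, .2.bt2, .3.bt2, .4.bt2, .rev.1.bt2 and .rev.2.bt2. The helper below (index_complete, a name invented for this sketch) just checks for those files; the prefix under $HOME is an assumption that mirrors the slide's path.

```shell
# index_complete PREFIX: succeed only if all six bowtie2 index files
# exist and are non-empty. (Hypothetical helper for illustration.)
index_complete() {
    prefix=$1
    for ext in 1.bt2 2.bt2 3.bt2 4.bt2 rev.1.bt2 rev.2.bt2; do
        if [ ! -s "$prefix.$ext" ]; then
            echo "missing or empty: $prefix.$ext" >&2
            return 1
        fi
    done
    return 0
}

# Example call; adjust the prefix to wherever you built your index.
if index_complete "$HOME/ReSequencing/GENOME/SGDv4_bowtie2_Index"; then
    echo "index looks complete"
else
    echo "index not built yet"
fi
```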

Map the reads back to the genome (MapReads.pbs)
These reads need to have read groups in order to work with the downstream tools (GATK requires them). It’s best to add them when we map, using the bowtie2 options --rg-id and --rg.
Example:
--rg-id Sigmav7vsS288c_bowtie2 --rg SM:Sigmav7vsS288c_bowtie2
Full command:
/opt/bowtie/bowtie2-2.0.2/bowtie2 --rg-id Sigmav7vsS288c_bowtie2 --rg SM:Sigmav7vsS288c_bowtie2 /Users/richmonp/ReSequencing/GENOME/SGDv4_bowtie2_Index /Users/richmonp/ReSequencing/FASTQ/Sigmav7_50mers.fastq -S /Users/richmonp/ReSequencing/SAM/Sigmav7_vs_S288c_bowtie2.sam 2> /Users/richmonp/ReSequencing/SAM/Sigmav7_vs_S288c_bowtie2.stderr
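
A quick way to confirm that the read group made it into the output is to look for an @RG line in the SAM header. With the options above, that line should carry ID:Sigmav7vsS288c_bowtie2 and SM:Sigmav7vsS288c_bowtie2. The function name show_read_groups and the $HOME-based path are assumptions made for this sketch.

```shell
# show_read_groups SAM: print the @RG header lines of a SAM file.
# grep exits non-zero when nothing matches, so the function fails
# if the file has no read group. (Hypothetical helper.)
show_read_groups() {
    grep '^@RG' "$1"
}

# Example call; adjust the path to your own SAM directory.
sam="$HOME/ReSequencing/SAM/Sigmav7_vs_S288c_bowtie2.sam"
if [ -f "$sam" ]; then
    show_read_groups "$sam" || echo "no @RG lines found in $sam"
fi
```

If no @RG line shows up, re-check the --rg-id and --rg options in MapReads.pbs before moving on, since the GATK steps depend on them.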

Convert your file format using Samtools (Sam2Bam.pbs)
Usage:
samtools view -bS <in.sam> -o <out.bam>
samtools sort <in.bam> <out.sorted>
samtools index <in.sorted.bam>
My commands:
/opt/samtools/0.1.18/samtools view -bS /Users/richmonp/ReSequencing/SAM/Sigmav7_vs_S288c_bowtie2.sam -o /Users/richmonp/ReSequencing/SAM/Sigmav7_vs_S288c_bowtie2.bam
/opt/samtools/0.1.18/samtools sort /Users/richmonp/ReSequencing/SAM/Sigmav7_vs_S288c_bowtie2.bam /Users/richmonp/ReSequencing/SAM/Sigmav7_vs_S288c_bowtie2.sorted
/opt/samtools/0.1.18/samtools index /Users/richmonp/ReSequencing/SAM/Sigmav7_vs_S288c_bowtie2.sorted.bam
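
Because the three file names all derive from the SAM name, they can be generated rather than typed. The sketch below only prints the commands so you can review them before pasting them into a PBS script; it does not run samtools. The function name sam2bam_plan is invented here, and the commands use the samtools 0.1.x syntax from this slide, where sort takes an output prefix and appends .bam itself (newer samtools releases use a different sort syntax).

```shell
# sam2bam_plan SAM: print the view, sort and index commands for one SAM file.
# (Hypothetical helper; it prints commands and does not execute them.)
sam2bam_plan() {
    sam=$1
    base=${sam%.sam}    # strip the trailing .sam to get the shared prefix
    echo "samtools view -bS $sam -o $base.bam"
    echo "samtools sort $base.bam $base.sorted"
    echo "samtools index $base.sorted.bam"
}

sam2bam_plan SAM/Sigmav7_vs_S288c_bowtie2.sam
```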

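Samtools remove duplicates (Sam2Bam.pbs)
Removes duplicate reads, which typically arise from PCR amplification during library preparation. (Note: samtools rmdup assumes paired-end reads by default; for single-end data like ours, its -s option is the single-end mode.)
Usage:
samtools rmdup <in.sorted.bam> <out.rmdup.sorted.bam>
My command:
/opt/samtools/0.1.18/samtools rmdup /Users/richmonp/ReSequencing/SAM/Sigmav7_vs_S288c_bowtie2.sorted.bam /Users/richmonp/ReSequencing/SAM/Sigmav7_vs_S288c_bowtie2_rmdup.sorted.bam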

Realign around indels (IndelRealign.pbs)
GATK has a two-step process for realigning reads around indels:
Step 1: Find candidate locations that may be best represented by an insertion or deletion (GATK’s RealignerTargetCreator).
Step 2: Apply local realignment around the candidate locations to produce a new BAM file (GATK’s IndelRealigner).

Realign around indels: RealignerTargetCreator
Usage:
java -jar /opt/gatk/2.4-9/GenomeAnalysisTK.jar -R <reference genome> -T RealignerTargetCreator (options) -I <in.sorted.rmdup.bam> -o <out.intervals>
My command:
java -jar /opt/gatk/2.4-9/GenomeAnalysisTK.jar -R /Users/richmonp/ReSequencing/GENOME/SGDv4.fasta \
-T RealignerTargetCreator -minReads 5 \
-I /Users/richmonp/ReSequencing/SAM/Sigmav7_vs_S288c_bowtie2_rmdup.sorted.bam -o /Users/richmonp/ReSequencing/SAM/Sigmav7_vs_S288c_bowtie2_rmdup.intervals

Realign around indels: IndelRealigner
Usage:
java -jar /opt/gatk/2.4-9/GenomeAnalysisTK.jar -T IndelRealigner -model USE_READS -targetIntervals <in.intervals> -R <reference.fasta> -I <in.rmdup.sorted.bam> -o <out.rmdup.realigned.sorted.bam>
My command:
java -jar /opt/gatk/2.4-9/GenomeAnalysisTK.jar -T IndelRealigner -model USE_READS \
-targetIntervals /Users/richmonp/ReSequencing/SAM/Sigmav7_vs_S288c_bowtie2_rmdup.intervals \
-R /Users/richmonp/ReSequencing/GENOME/SGDv4.fasta \
-I /Users/richmonp/ReSequencing/SAM/Sigmav7_vs_S288c_bowtie2_rmdup.sorted.bam -o /Users/richmonp/ReSequencing/SAM/Sigmav7_vs_S288c_bowtie2_rmdup_realigned.sorted.bam

Call variants using GATK UnifiedGenotyper (CallSNPs.pbs)
The GATK package is a Java executable, distributed as a .jar file. To run the package you type:
java -jar /opt/gatk/2.4-9/GenomeAnalysisTK.jar
Then you must select a program within the package to run with -T, which in our case is UnifiedGenotyper:
java -jar /opt/gatk/2.4-9/GenomeAnalysisTK.jar -T UnifiedGenotyper

Call variants using GATK UnifiedGenotyper
Usage:
java -jar /opt/gatk/2.4-9/GenomeAnalysisTK.jar -T UnifiedGenotyper -glm BOTH -I <in.sorted.bam> -R <in.fasta> -o <out.vcf>
My command:
java -jar /opt/gatk/2.4-9/GenomeAnalysisTK.jar -T UnifiedGenotyper -glm BOTH -R /Users/richmonp/ReSequencing/GENOME/SGDv4.fasta -I /Users/richmonp/ReSequencing/SAM/Sigmav7_vs_S288c_bowtie2_rmdup_realigned.sorted.bam -o /Users/richmonp/ReSequencing/VCF/Sigmav7_vs_S288c_bowtie2_gatk.vcf
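
Since -glm BOTH asks for SNPs and indels in the same VCF, a rough tally of each is a useful sanity check before opening IGV. The sketch below uses a simple rule that is an assumption of this example, not GATK's own classification: a record counts as a SNP when REF and every ALT allele are a single base, and as an indel otherwise. The function name count_variants and the $HOME-based path are also made up for illustration.

```shell
# count_variants VCF: tally records as SNPs or indels by allele length.
# (Hypothetical helper using a simplified classification rule.)
count_variants() {
    awk -F '\t' '
        /^#/ { next }                        # skip header lines
        {
            is_snp = (length($4) == 1)       # column 4 is REF
            n = split($5, alts, ",")         # column 5 is ALT, comma-separated
            for (i = 1; i <= n; i++)
                if (length(alts[i]) != 1) is_snp = 0
            if (is_snp) snps++; else indels++
        }
        END { printf "snps=%d indels=%d\n", snps, indels }
    ' "$1"
}

# Example call; adjust the path to your own VCF directory.
vcf="$HOME/ReSequencing/VCF/Sigmav7_vs_S288c_bowtie2_gatk.vcf"
if [ -f "$vcf" ]; then
    count_variants "$vcf"
fi
```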

View your VCF in IGV
GATK automatically indexes your VCF files, so now we can visualize both the reads and the SNPs in IGV.
Transfer both the final BAM file (/Users/richmonp/ReSequencing/SAM/Sigmav7_vs_S288c_bowtie2_rmdup_realigned.sorted.bam) and the VCF file (/Users/richmonp/ReSequencing/VCF/Sigmav7_vs_S288c_bowtie2_gatk.vcf) to your student directory, /projects/sreadgrp/student/<username>/.
Open up the visualization VNC window.
Open IGV.
Load the files.

Coffee Break
Then… organize into groups of 5.

Paired-end data
The main difference between paired-end and single-end data occurs when you are mapping.
The two reads of each pair are stored in separate files, denoted by “R1” or “R2” in the file name:
1028_S1_L001_R1_001.fastq
1028_S1_L001_R2_001.fastq

How it changes your bowtie2 command
Open up MapPairedReads.pbs in an editor. Notice:
-1 /data/Avery/FASTQ/1056_S1_L001_R1_001.fastq \
-2 /data/Avery/FASTQ/1056_S1_L001_R2_001.fastq \
The -1 option gives the file of read 1 mates, and -2 gives the file of read 2 mates.
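
Because the R1 and R2 file names differ only in that one tag, the second name can be derived from the first, and the pair can be checked before it goes into the bowtie2 command. This is a sketch under that naming assumption; mate_of and pair_args are hypothetical helper names. The check that both files have the same number of lines reflects the fact that paired FASTQ files should hold the same number of reads.

```shell
# mate_of R1: print the R2 file name that pairs with an R1 file name.
# Only the first "_R1_" in the name is replaced. (Hypothetical helper.)
mate_of() {
    printf '%s\n' "$1" | sed 's/_R1_/_R2_/'
}

# pair_args R1: print "-1 <R1> -2 <R2>" after checking that both files
# exist and contain the same number of lines. (Hypothetical helper.)
pair_args() {
    r1=$1
    r2=$(mate_of "$r1")
    [ -f "$r1" ] && [ -f "$r2" ] || { echo "missing mate for $r1" >&2; return 1; }
    n1=$(( $(wc -l < "$r1") ))
    n2=$(( $(wc -l < "$r2") ))
    [ "$n1" -eq "$n2" ] || { echo "line counts differ: $n1 vs $n2" >&2; return 1; }
    echo "-1 $r1 -2 $r2"
}

# Name derivation only; this does not need the files to exist.
mate_of 1028_S1_L001_R1_001.fastq
```

Calling pair_args on your group's R1 file prints the two options ready to paste into MapPairedReads.pbs, or an error if the mate file is missing or truncated.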

Now…
Copy MapPairedReads.pbs to your own PBS directory (from /projects/sreadgrp/homeworkfiles/ReSequencing/PBS/).
Copy a pair of FASTQ files to your FASTQ directory (only copy the ones listed on your group problem sheet).

The first group to map, call variants, and visualize variants wins! (Prizes are not amazing.)