MCB Lecture #20 Nov 18/14 Reference alignments
Today: 1.Align reads to a reference genome 2.Correct for misalignments 3.Analyze variants between reads and the reference genome (i.e., differences between the sequenced genome and the reference) We will roughly follow the default samtools protocol:
Record keeping As you will see, a huge amount of this work is converting between formats so that different software will work It is therefore CRUCIAL that you keep records of all the commands that you use BEWARE: different versions of all of these software have different syntaxes
Read mapping Align reads to a reference genome and determine SNPs Note aligning reads, not contigs as with nucmer Computationally more efficient than doing de novo assembly first
Read mapping tools Many different flavors, but overall dominated by two programs: bwa (we will use today) bowtie Note: early versions of bowtie did not align reads containing indels, whereas bwa did Some debate about which is best, trade-offs between sensitivity (ability to map everything) and specificity (are mappings correct) Also speed and memory considerations
Mapping using bwa Create an index of the reference genome nucleotide fasta for the alignment software to use for read mapping syntax: $ bwa index [ref.fasta file] e.g.: $ bwa index E_coli.fasta note: use ".fasta " file ending for a later step Creates 5 output files: [ref.fasta].amb,.ann,.bwt,.pac,.sa "Index": special computer data structure that allows fast searching; software-specific
Mapping using bwa bwa mem does the actual mapping step syntax: $ bwa mem -R ry1' [ref.fasta file] [read file 1] [read file 2] > [outfile] -R : indexes "read groups", required for GATK in later steps e.g.: $ bwa mem -R E_coli.fasta SRR826450_1.fastq SRR826450_2.fastq > align.sam
samtools : convert.sam to.bam, clean up names.sam is the plain text output format of most sequence alignment programs Because these can be large, most subsequent programs use the compressed ".bam " format instead bwa sometimes does odd things to read pairing information, can clean up during conversion
samtools : convert.sam to.bam, clean up names syntax: $ samtools fixmate -O bam [input.sam file] [output.bam file] -O : output file type e.g.: $ samtools fixmate -O bam align.sam align_fixmate.bam
samtools : sort.bam file samtools and related software use.bam files that are sorted by ascending genomic position i.e., starts from position #1 on the reference genome and goes to the end syntax: $ samtools sort -O bam -o [output sorted.bam] -T [temp file location] [input unsorted.bam] -O : output file type -o : output file name -T : location for temporary files (required) e.g.: $ samtools sort -O bam -o align_sorted.bam -T temp align_fixmate.bam
GATK : realign indels bwa sometimes misaligns indels in reads One way to get rid of these is to use the realignment functions in the GATK package More generally: GATK does much of the same thing as samtools, strong focus on diploid genomes Unfortunately, GATK uses java (silly command line syntax) Unfortunately, GATK needs its own file formats
Picard : index reference syntax: $ java -jar /export/apps/picard-tools /picard.jar CreateSequenceDictionary REFERENCE=[ref.fasta file] OUTPUT=[output.dict file] REFERENCE : reference file name OUTPUT : output index file name, must be ".dict " e.g.: $ java -jar /export/apps/picard-tools /picard.jar CreateSequenceDictionary REFERENCE=E_coli.fasta OUTPUT=E_coli.dict
samtools : index reference syntax: $ samtools faidx [ref.fasta file] e.g.: $ samtools faidx E_coli.fasta outputs [ref.fasta].fai index file
samtools : index.bam file syntax: $ samtools index [sorted.bam file] e.g.: $ samtools index align_sorted.bam outputs [sorted bam].bai output file
GATK : prepare reads for indel realignment syntax: $ java -Xmx2g -jar /opt/bioinformatics/GATK/GenomeAnalysisTK.jar -T RealignerTargetCreator -R [ref.fasta file] -I [sorted & indexed.bam file] -o [output file name] -R : reference.fasta file name -I : sorted and indexed.bam file -o : output intervals file name e.g.: $ java -Xmx2g -jar /opt/bioinformatics GATK/GenomeAnalysisTK.jar -T RealignerTargetCreator -R E_coli.fasta -I align_sorted.bam -o align_sorted.intervals
GATK : perform indel realignment syntax: $ java -Xmx4g -jar /opt/bioinformatics/GATK/GenomeAnalysisTK.j ar -T IndelRealigner -R [ref.fasta file] -I [sorted & indexed.bam file] - targetIntervals [intervals file] -o [output.bam file] -T : Program function to use -R : Reference.fasta file -I : Intervals file from last step -o : output.bam file name e.g.: $ java -Xmx2g -jar /opt/bioinformatics/GATK/GenomeAnalysisTK.j ar -T IndelRealigner -R E_coli.fasta -I align_sorted.bam -targetIntervals align_sorted.intervals -o align_realigned.bam
Picard : remove duplicates Duplicate reads can arise because of PCR artifacts during sequencing Because duplicate reads to not provide additional information, it is best to remove them for computational efficiency Identified by having identical start and end mapping positions
Picard : remove duplicates syntax: $ java -Xmx2g -jar /export/apps/picard-tools /picard.jar MarkDuplicates INPUT=[input bam file] OUTPUT=[output bam file] REMOVE_DUPLICATES=true METRICS_FILE=[metrics output file] INPUT : input.bam file from GATK OUTPUT : output.bam file lacking duplicates METRICS_FILE : summary file of duplicate reads removed e.g.: $ java -Xmx2g -jar /export/apps/picard-tools /picard.jar MarkDuplicates INPUT=align_realigned.bam OUTPUT=align_nodups.bam REMOVE_DUPLICATES=true METRICS_FILE=nodups.metrics
samtools : index.bam file samtools requires that the new.bam file be indexed before variant calling syntax: $ samtools index [.bam file name] e.g.: $ samtools index align_nodups.bam
samtools : create.bcf file for variant calling bcftools is a package very similar to samtools that handles variant calling Of course, it requires its own file format syntax: $ samtools mpileup -go [output.bcf] -f [ref.fasta] [1 or more indexed.bam] -go : specify output file name and.bcf format -f : reference.fasta file name e.g.: $ samtools mpileup -go E_coli.bcf -f E_coli.fasta align_nodups.bam
bcftools : call variants The actual variant calling step uses the call function in bcftools syntax: $ bcftools call -vmO z -o [output.vcf.gz file] [input.bcf file] -v : only output variant sites -m : specify variant calling algorithm (multiallelic) -O : specify output format, z =.vcf.gz -o : output file name e.g.: $ bcftools call -vmO z -o E_coli.vcf.gz E_coli.bcf
bcftools : index.vcf.gz file tabix is a program included in bcftools that indexes a.vcf.gz file syntax: $ tabix -p vcf [input.vcf.gz file] -p : specifies file type e.g.: $ tabix -p vcf E_coli.vcf.gz
bcftools : analyze.vcf.gz file bcftools has handy software to analyze the variants that it has identified syntax: $ bcftools stats -F [ref.fasta] -s - [input.vcf.gz file] > [output file] -F : faidx indexed reference.fasta sequence -s : list of samples to analyze, " - " = all samples e.g.: $ bcftools stats -F E_coli.fasta -s - E_coli.vcf.gz > E_coli.vcf.gz.stats
Summary stats Indel stats
Quality stats Indel types Substitution types
bcftools : filter variants based on quality score Generally, one wants to mark low quality variants. How to draw a cutoff line is somewhat subjective syntax: $ bcftools filter -O z -o [output.vcf.gz file] -s LOWQUAL -i '%QUAL>10' [input.vcf.gz file] -O : output type, " z " =.vcf.gz -o : output file name -s : label to mark failed variants -i : condition under which sequences pass e.g.: $ bcftools filter -O z -o E_coli_filtered.vcf.gz -s LOWQUAL -i '%QUAL>10' E_coli.vcf.gz
bcftools : calculate stats based on filtered variants You can tell bcftools stats to only analyze variants that pass the filter syntax: $ bcftools stats -F [ref.fasta] -f PASS -s - [input filtered.vcf.gz file] -F : faidx indexed reference.fasta sequence -f : how sequences to include are marked -s : list of samples to analyze, " - " = all samples e.g.: $ bcftools stats -F E_coli.fasta -f PASS -s - E_coli_filtered.vcf.gz
Assignment How do our MiSeq and HiSeq E.coli datasets differ from the reference K-12 genome? Submit: 1.Number of SNP and indel differences compared to the reference genome 2.Justification for the filtering parameters used 3.Lab notebook file listing all of the exact parameters used 4.Output of the bcftools stats analysis