Next Gen. Sequencing Files and pysam

Presentation transcript:

Next Gen. Sequencing Files and pysam BCHB524 Lecture 12 BCHB524 - Edwards

Next Gen. Sequencing (source: Wiki: Genomics)

Next Gen. Sequencing (source: Nature Biotechnology 29, 24–26 (2011))

Python for NGS

NGS data is big!
Use special-purpose tools (tophat, cufflinks, samtools) for aligning.
Use Python for:
- Clean up / filter reads
- Post-process tool output
- Visualization

Count reads from FASTQ file

    # Import BioPython's SeqIO module
    import Bio.SeqIO
    # Import the sys module
    import sys

    # Get first command-line argument
    inputfile = sys.argv[1]

    # Initialize counter
    count = 0
    # Loop through all reads in inputfile
    for read in Bio.SeqIO.parse(inputfile, "fastq"):
        # Increment count
        count += 1

    # Output result
    print(count, "reads")
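
As a quick cross-check on the count above, the number of reads can also be derived from the number of lines, assuming the common layout of exactly four lines per FASTQ record (no wrapped sequence lines). This is a minimal standard-library sketch; the function name `count_fastq_reads` and the toy reads are made up for illustration.

```python
import io


def count_fastq_reads(handle):
    """Count FASTQ records, assuming exactly four lines per record."""
    # Count non-empty lines; blank trailing lines are ignored
    n_lines = sum(1 for line in handle if line.strip())
    if n_lines % 4 != 0:
        raise ValueError("line count is not a multiple of 4; "
                         "file may be truncated or line-wrapped")
    return n_lines // 4


# Toy in-memory FASTQ with two reads (made-up data)
example = io.StringIO(
    "@read1\nGATGACGGTGTACGT\n+\nIIIIIIIIIIIIIII\n"
    "@read2\nACGTACGT\n+\nIIIIIIII\n"
)
print(count_fastq_reads(example), "reads")
```

For a file on disk, pass an open file handle instead of the `io.StringIO` object.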

Filter reads in FASTQ file

    import Bio.SeqIO
    import sys

    # Get command-line arguments
    inputfile = sys.argv[1]
    minlength = int(sys.argv[2])

    # Loop through all reads in inputfile
    for read in Bio.SeqIO.parse(inputfile, "fastq"):
        # Check the length: keep reads longer than minlength
        if len(read.seq) > minlength:
            # Output to standard-out (the record already ends in a newline)
            print(read.format("fastq"), end="")

Filter reads in FASTQ file

    import Bio.SeqIO
    import sys

    # Get command-line arguments
    inputfile = sys.argv[1]
    thr = int(sys.argv[2])

    # Loop through all reads in inputfile
    for read in Bio.SeqIO.parse(inputfile, "fastq"):
        # Check the minimum phred score
        if min(read.letter_annotations["phred_quality"]) >= thr:
            # Output to standard-out (the record already ends in a newline)
            print(read.format("fastq"), end="")
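
To make the phred filter above more concrete, here is a small standard-library sketch of what the quality list holds. It assumes the Sanger encoding (ASCII offset 33) that Biopython's "fastq" format expects; the helper names are hypothetical and the quality strings are made up.

```python
def phred33_scores(quality_string):
    """Decode a Sanger (Phred+33) quality string into integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def error_probability(q):
    """Probability that a base call with phred score q is wrong."""
    return 10 ** (-q / 10)


def passes_min_quality(quality_string, threshold):
    """True if every base in the (non-empty) read scores at least threshold."""
    return min(phred33_scores(quality_string)) >= threshold


print(phred33_scores("II5"))
print(error_probability(30))
print(passes_min_quality("II5", 30))
```

A threshold of 30 therefore demands that every base has an estimated error rate of at most 1 in 1000.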

Remove primer sequence

    import Bio.SeqIO
    import sys

    # Get command-line arguments
    inputfile = sys.argv[1]

    # Loop through all reads in inputfile
    for read in Bio.SeqIO.parse(inputfile, "fastq"):
        # if the primer sequence is present
        if read.seq.startswith('GATGACGGTGT'):
            # remove it (the primer is 11 bases long) and output as FASTA
            read = read[11:]
            print(read.format("fasta"), end="")
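
The trimming step can be sketched on plain strings to show what has to stay in sync: the sequence and its quality string are cut at the same offset, which is derived from the primer itself rather than typed as a number. The function name `trim_primer` and the toy read are assumptions for illustration.

```python
PRIMER = "GATGACGGTGT"  # primer sequence from the slide above


def trim_primer(seq, qual, primer=PRIMER):
    """Return (seq, qual) with a leading primer removed, or None if absent."""
    if not seq.startswith(primer):
        return None  # reads without the primer are dropped
    n = len(primer)
    # Cut sequence and quality string at the same offset
    return seq[n:], qual[n:]


print(trim_primer("GATGACGGTGTACGT", "ABCDEFGHIJKLMNO"))
print(trim_primer("ACGTACGT", "IIIIIIII"))
```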

Dump space-separated-values

    import Bio.SeqIO
    import sys

    # Get command-line arguments
    inputfile = sys.argv[1]

    # Loop through all reads in inputfile
    for read in Bio.SeqIO.parse(inputfile, "fastq"):
        # Output description, and read length
        print(read.description, len(read.seq))

Plot read lengths

    import Bio.SeqIO
    import sys
    from matplotlib.pyplot import *

    # Get command-line arguments
    inputfile = sys.argv[1]

    lengths = []
    # Loop through all reads in inputfile
    for read in Bio.SeqIO.parse(inputfile, "fastq"):
        # Store read length
        lengths.append(len(read.seq))

    # lengths.sort()
    plot(lengths,'.')
    show()
    # savefig('readlengths.png')

Histogram of read lengths

    import Bio.SeqIO
    import sys
    from matplotlib.pyplot import *

    # Get command-line arguments
    inputfile = sys.argv[1]

    lengths = []
    # Loop through all reads in inputfile
    for read in Bio.SeqIO.parse(inputfile, "fastq"):
        # Store read length
        lengths.append(len(read.seq))

    hist(lengths)
    show()
    # savefig('readlengthhist.png')

Plot read lengths and quality

    import Bio.SeqIO
    import sys
    from matplotlib.pyplot import *

    # Get command-line arguments
    inputfile = sys.argv[1]

    lengths1 = []  # length of the leading run of bases with phred >= 30
    lengths2 = []  # full read length
    # Loop through all reads in inputfile
    for read in Bio.SeqIO.parse(inputfile, "fastq"):
        phred_scores = read.letter_annotations["phred_quality"]
        l = 0
        for phsc in phred_scores:
            if phsc < 30:
                break
            l += 1
        lengths1.append(l)
        lengths2.append(len(read.seq))

    plot(lengths2,lengths1,'.')
    show()
    # savefig('readlengths.png')

Plot read lengths and quality

    import Bio.SeqIO
    import sys
    from matplotlib.pyplot import *

    # Get command-line arguments
    inputfile = sys.argv[1]

    lengths1 = []  # length of the leading run of bases with phred >= 30
    lengths2 = []  # full read length
    # Loop through all reads in inputfile
    for read in Bio.SeqIO.parse(inputfile, "fastq"):
        phred_scores = read.letter_annotations["phred_quality"]
        l = 0
        for phsc in phred_scores:
            if phsc < 30:
                break
            l += 1
        lengths1.append(l)
        lengths2.append(len(read.seq))

    plot(sorted(lengths1),'.',sorted(lengths2),'.')
    show()
    # savefig('readlengths.png')
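
The inner loop in the two plotting scripts measures how many bases a read has before quality first drops below 30. The same quantity can be written with `itertools.takewhile`; this is an alternative sketch rather than the slides' own code, and the function name `high_quality_prefix_length` and the score lists are made up.

```python
from itertools import takewhile


def high_quality_prefix_length(phred_scores, threshold=30):
    """Number of leading bases whose phred score is at least threshold."""
    # takewhile stops at the first score below the threshold
    return sum(1 for _ in takewhile(lambda q: q >= threshold, phred_scores))


print(high_quality_prefix_length([40, 38, 35, 29, 40]))
print(high_quality_prefix_length([10, 40, 40]))
```

Note that a later high-scoring base does not count once the run has been broken, which matches the `break` in the loop version.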

Samtools using pysam

- SAM/BAM: popular format for alignment records
- pysam is a lightweight wrapper around the samtools code
- Need to understand samtools alignment data-structures
- BAM indexes permit random access by locus
- Direct access to mate-pairs
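
To give a feel for the alignment records that pysam exposes, here is a standard-library sketch that splits one line of the text SAM format into its eleven mandatory columns plus optional tags. The `SamRecord` tuple, the helper names, and the example line are hypothetical; pysam reads the binary BAM form and returns richer objects, so this only illustrates the data layout.

```python
from collections import namedtuple

# The eleven mandatory SAM columns, followed by a dict of optional tags
SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual tags")


def parse_sam_line(line):
    """Parse one tab-separated SAM alignment line (not a header line)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM record needs at least 11 columns")
    tags = {}
    for item in fields[11:]:
        tag, typ, value = item.split(":", 2)
        tags[tag] = int(value) if typ == "i" else value
    return SamRecord(fields[0], int(fields[1]), fields[2], int(fields[3]),
                     int(fields[4]), fields[5], fields[6], int(fields[7]),
                     int(fields[8]), fields[9], fields[10], tags)


def is_unmapped(rec):
    return bool(rec.flag & 0x4)      # FLAG bit 0x4: read is unmapped


def is_secondary(rec):
    return bool(rec.flag & 0x100)    # FLAG bit 0x100: secondary alignment


# Made-up record for illustration
line = "read1\t0\t21\t9826858\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\tNM:i:0\tNH:i:1"
rec = parse_sam_line(line)
print(rec.rname, rec.pos, rec.cigar, rec.tags)
```

One detail worth keeping in mind: the POS column in a SAM file is 1-based, whereas pysam reports 0-based positions.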

Integrative Genomics Viewer (IGV): chr21:9,826,858-9,827,663

Integrative Genomics Viewer (IGV): chr21:9,907,824-9,907,853

Reads overlapping a region

    # Import the PySam module
    import pysam

    # Open the BAM file
    bf = pysam.Samfile('10_Normal_Chr21.bam')

    # Access the reads overlapping 21:9000000-10000000
    for aligned_read in bf.fetch('21',9000000,10000000):
        # Dump the information about each read
        print(aligned_read.qname,
              aligned_read.seq,
              bf.getrname(aligned_read.tid),
              aligned_read.pos,
              aligned_read.qend)

Determine coverage by locus

    import pysam

    # Open the BAM file
    bf = pysam.Samfile('10_Normal_Chr21.bam')

    # Access the reads overlapping 21:9000000-10000000
    for pileup in bf.pileup('21',9000000,10000000):
        # Dump the position and number of reads
        print(pileup.pos, pileup.n)

    # Plot?
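
Conceptually, coverage at a position is the number of alignments that span it. The toy sketch below computes that from a list of alignment intervals using only the standard library. It assumes ungapped alignments given as 0-based, half-open (start, end) pairs, and the function name and intervals are made up; pysam's pileup engine does considerably more (gaps, filtering, streaming over large files).

```python
from collections import Counter


def coverage_by_position(alignments, start, end):
    """Count how many (aln_start, aln_end) intervals cover each position
    in the half-open region [start, end)."""
    depth = Counter()
    for aln_start, aln_end in alignments:
        # Clip each alignment to the requested region
        for pos in range(max(aln_start, start), min(aln_end, end)):
            depth[pos] += 1
    return depth


# Two made-up overlapping alignments
depth = coverage_by_position([(100, 105), (103, 108)], 100, 110)
for pos in sorted(depth):
    print(pos, depth[pos])
```

Positions with no reads simply do not appear in the output, which mirrors how a pileup skips uncovered positions.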

Look for SNPs

    import pysam

    bf = pysam.Samfile('10_Normal_Chr21.bam')

    # For every position in the reference
    for pileup in bf.pileup('21'):
        counts = {}
        # ...examine every aligned read
        for pileupread in pileup.pileups:
            # ...and get the read-base (position 0 is valid, so test for None)
            if pileupread.query_position is None:
                continue
            readbase = pileupread.alignment.seq[pileupread.query_position]
            # Count the number of each base
            if readbase not in counts:
                counts[readbase] = 0
            counts[readbase] += 1
        # If there is no variation, move on
        if len(counts) < 2:
            continue
        # Otherwise, output the position, coverage and base counts
        print(pileup.pos, pileup.n, end=" ")
        for base in sorted(counts):
            print(base, counts[base], end=" ")
        print()
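
The per-position tally can be tried out without a BAM file. In this sketch a pileup column is modelled as a list of (read sequence, position in read) pairs, with `None` standing in for a read that has no base at this locus (for example a deletion). The function name and data are made up. Because position 0 is a valid index but is falsy in Python, the missing-base check is written with `is None`.

```python
from collections import Counter


def count_bases(column):
    """Tally the read bases observed in one toy pileup column."""
    counts = Counter()
    for seq, query_position in column:
        if query_position is None:
            continue  # no base from this read at this locus
        counts[seq[query_position]] += 1
    return counts


# Made-up column: two reads start here (position 0), one has a deletion
column = [("ACGT", 0), ("CCGT", 0), ("ACGT", None), ("TACG", 1)]
counts = count_bases(column)
print(sorted(counts.items()))
```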

Filter out bad/poor alignments

            # ...check the read and alignment
            if pileupread.indel:
                continue
            if pileupread.is_del:
                continue
            al = pileupread.alignment
            if al.is_unmapped:
                continue
            if al.is_secondary:
                continue
            if int(al.opt('NM')) > 1:
                continue
            if int(al.opt('NH')) > 1:
                continue
            # ...and get the read-base (position 0 is valid, so test for None)
            if pileupread.query_position is None:
                continue
            readbase = al.seq[pileupread.query_position]

        # if not enough observations of minor allele, move on
        if sorted(counts.values())[-2] < 10:
            continue
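
The minor-allele test at the end of the filter can be isolated as a small function. Indexing with `[-2]` only works when at least two distinct bases were seen, so this sketch guards that case explicitly. The function name `has_minor_allele_support` and the example counts are hypothetical; the default threshold of 10 is the value used on the slide.

```python
def has_minor_allele_support(counts, min_count=10):
    """True if the second most common base was seen at least min_count times."""
    ordered = sorted(counts.values())
    if len(ordered) < 2:
        return False  # only one base observed: no variation at all
    return ordered[-2] >= min_count


print(has_minor_allele_support({"A": 40, "G": 12}))
print(has_minor_allele_support({"A": 40, "G": 3}))
```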