Next Gen. Sequencing Files and pysam


Next Gen. Sequencing Files and pysam
BCHB524 Lecture 10
BCHB524 - Edwards

Next Gen. Sequencing
[Figure. Source: Wiki: Genomics]

Next Gen. Sequencing
[Figure. Source: Nature Biotechnology 29, 24–26 (2011)]

Python for NGS
- NGS data is big!
- Use special-purpose tools (tophat, cufflinks, samtools) for aligning
- Use Python for:
    - Clean up / filter reads
    - Post-process tool output
    - Visualization

Count reads from FASTQ file

    # Import BioPython's SeqIO module
    import Bio.SeqIO
    # Import the sys module
    import sys

    # Get first command-line argument
    inputfile = sys.argv[1]

    # Initialize counter
    count = 0

    # Loop through all reads in inputfile
    for read in Bio.SeqIO.parse(inputfile, "fastq"):
        # Increment count
        count += 1

    # Output result
    print(count, "reads")

Filter reads in FASTQ file

    import Bio.SeqIO
    import sys

    # Get command-line arguments
    inputfile = sys.argv[1]
    minlength = int(sys.argv[2])

    # Loop through all reads in inputfile
    for read in Bio.SeqIO.parse(inputfile, "fastq"):
        # Check the length: keep reads longer than minlength
        if len(read.seq) > minlength:
            # Output to standard-out (the record already ends in a newline)
            print(read.format("fastq"), end="")

Filter reads in FASTQ file

    import Bio.SeqIO
    import sys

    # Get command-line arguments
    inputfile = sys.argv[1]
    thr = int(sys.argv[2])

    # Loop through all reads in inputfile
    for read in Bio.SeqIO.parse(inputfile, "fastq"):
        # Check the minimum phred score
        if min(read.letter_annotations["phred_quality"]) >= thr:
            # Output to standard-out (the record already ends in a newline)
            print(read.format("fastq"), end="")
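The `phred_quality` values tested in the filter above are decoded from the quality line of each FASTQ record. As a rough sketch of that decoding, assuming the Sanger-style ASCII offset of 33 that Biopython's "fastq" format expects: each character maps to a score Q, and Q corresponds to an error probability of 10^(-Q/10). The function names and the toy quality string are made up for illustration.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    offset=33 is the Sanger convention (an assumption here); some older
    Illumina files use 64 instead.
    """
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


# Toy quality string, not taken from any real read.
quals = phred_scores("II5+")
print(quals)               # → [40, 40, 20, 10]

# The same test as the slide's filter, with a threshold of 20.
print(min(quals) >= 20)    # → False, because the last base has Q = 10
```

A threshold of 20 therefore tolerates at most a 1 in 100 chance of error per base, and 30 tolerates at most 1 in 1000.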

Remove primer sequence

    import Bio.SeqIO
    import sys

    # Get command-line arguments
    inputfile = sys.argv[1]

    # Primer to strip from the start of each read
    primer = 'GATGACGGTGT'

    # Loop through all reads in inputfile
    for read in Bio.SeqIO.parse(inputfile, "fastq"):
        # if the primer sequence is present
        if read.seq.startswith(primer):
            # remove it and output as FASTA
            read = read[len(primer):]
            print(read.format("fasta"), end="")

Dump space-separated-values

    import Bio.SeqIO
    import sys

    # Get command-line arguments
    inputfile = sys.argv[1]

    # Loop through all reads in inputfile
    for read in Bio.SeqIO.parse(inputfile, "fastq"):
        # Output description, and read length
        print(read.description, len(read.seq))

Plot read lengths

    import Bio.SeqIO
    import sys
    import matplotlib.pyplot as plt

    # Get command-line arguments
    inputfile = sys.argv[1]

    lengths = []
    # Loop through all reads in inputfile
    for read in Bio.SeqIO.parse(inputfile, "fastq"):
        # Store read length
        lengths.append(len(read.seq))

    # lengths.sort()
    plt.plot(lengths, '.')
    plt.show()
    # plt.savefig('readlengths.png')

Histogram of read lengths

    import Bio.SeqIO
    import sys
    import matplotlib.pyplot as plt

    # Get command-line arguments
    inputfile = sys.argv[1]

    lengths = []
    # Loop through all reads in inputfile
    for read in Bio.SeqIO.parse(inputfile, "fastq"):
        # Store read length
        lengths.append(len(read.seq))

    plt.hist(lengths)
    plt.show()
    # plt.savefig('readlengthhist.png')

Plot read lengths and quality

    import Bio.SeqIO
    import sys
    import matplotlib.pyplot as plt

    # Get command-line arguments
    inputfile = sys.argv[1]

    lengths1 = []   # length of the leading run of bases with phred >= 30
    lengths2 = []   # full read length
    # Loop through all reads in inputfile
    for read in Bio.SeqIO.parse(inputfile, "fastq"):
        phred_scores = read.letter_annotations["phred_quality"]
        l = 0
        for phsc in phred_scores:
            # Stop at the first base below the quality threshold
            if phsc < 30:
                break
            l += 1
        lengths1.append(l)
        lengths2.append(len(read.seq))

    # Full read length (x) against high-quality prefix length (y)
    plt.plot(lengths2, lengths1, '.')
    plt.show()
    # plt.savefig('readlengths.png')

Plot read lengths and quality

    import Bio.SeqIO
    import sys
    import matplotlib.pyplot as plt

    # Get command-line arguments
    inputfile = sys.argv[1]

    lengths1 = []   # length of the leading run of bases with phred >= 30
    lengths2 = []   # full read length
    # Loop through all reads in inputfile
    for read in Bio.SeqIO.parse(inputfile, "fastq"):
        phred_scores = read.letter_annotations["phred_quality"]
        l = 0
        for phsc in phred_scores:
            # Stop at the first base below the quality threshold
            if phsc < 30:
                break
            l += 1
        lengths1.append(l)
        lengths2.append(len(read.seq))

    # Both distributions, each sorted independently
    plt.plot(sorted(lengths1), '.', sorted(lengths2), '.')
    plt.show()
    # plt.savefig('readlengths.png')
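The inner loop in the two plotting scripts above measures how many bases at the start of a read stay at or above a phred score of 30. One possible way to express the same computation as a reusable function is sketched below. The function name and the toy score lists are illustrative only, and the threshold of 30 is taken from the slides.

```python
from itertools import takewhile


def high_quality_prefix_length(phred_scores, threshold=30):
    """Count leading bases whose phred score is at least `threshold`.

    Counting stops at the first base below the threshold, matching the
    `break` in the slide's loop.
    """
    return sum(1 for _ in takewhile(lambda q: q >= threshold, phred_scores))


# Toy score lists, not taken from real reads.
print(high_quality_prefix_length([40, 38, 31, 12, 40]))  # → 3
print(high_quality_prefix_length([10, 40, 40]))          # → 0
print(high_quality_prefix_length([30, 30]))              # → 2
```

Note that a high-scoring base after the first low-scoring one does not extend the count, so this measures a prefix and not the total number of good bases.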

Samtools using pysam
- SAM/BAM: popular format for alignment records
- pysam is a lightweight wrapper around the samtools code
- Need to understand samtools alignment data-structures
- BAM indexes permit random access by locus
- Direct access to mate-pairs
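To give a feel for the alignment data structures mentioned above, here is a minimal sketch that splits one text SAM record into its eleven mandatory fields plus optional tags, and tests a few FLAG bits. It uses only the standard library and is not how pysam works internally. The `SamRecord` type, the `parse_sam_line` helper, and the example record are all made up for illustration.

```python
from collections import namedtuple

# The eleven mandatory SAM columns, plus a dict of optional tags.
SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual tags",
)

# A few of the FLAG bits defined by the SAM format.
FLAG_UNMAPPED = 0x4
FLAG_REVERSE = 0x10
FLAG_SECONDARY = 0x100


def parse_sam_line(line):
    """Parse one tab-separated SAM alignment line (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    tags = {}
    for item in fields[11:]:
        tag, typ, value = item.split(":", 2)
        # Only integer tags are converted; everything else stays a string.
        tags[tag] = int(value) if typ == "i" else value
    return SamRecord(
        qname=fields[0], flag=int(fields[1]), rname=fields[2],
        pos=int(fields[3]), mapq=int(fields[4]), cigar=fields[5],
        rnext=fields[6], pnext=int(fields[7]), tlen=int(fields[8]),
        seq=fields[9], qual=fields[10], tags=tags,
    )


# Made-up record for illustration only.
line = ("read1\t99\t21\t9826859\t60\t10M\t=\t9827000\t151\t"
        "ACGTACGTAC\tIIIIIIIIII\tNM:i:0\tNH:i:1")
rec = parse_sam_line(line)

print(rec.rname, rec.pos, rec.cigar)
print("unmapped:", bool(rec.flag & FLAG_UNMAPPED))
print("secondary:", bool(rec.flag & FLAG_SECONDARY))
print("edit distance (NM):", rec.tags["NM"])
```

One detail to keep in mind: POS in a text SAM record is 1-based, whereas the positions pysam reports are 0-based.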

Integrative Genomics Viewer (IGV)
[Screenshot: chr21:9,826,858-9,827,663]

Integrative Genomics Viewer (IGV)
[Screenshot: chr21:9,907,824-9,907,853]

Reads overlapping a region

    # Import the PySam module
    import pysam

    # Open the BAM file
    # (newer pysam releases name this class pysam.AlignmentFile)
    bf = pysam.Samfile('10_Normal_Chr21.bam')

    # Access the reads overlapping 21:9000000-10000000
    for aligned_read in bf.fetch('21', 9000000, 10000000):
        # Dump the information about each read
        print(aligned_read.qname,
              aligned_read.seq,
              bf.getrname(aligned_read.tid),
              aligned_read.pos,
              aligned_read.qend)

Determine coverage by locus

    import pysam

    # Open the BAM file
    bf = pysam.Samfile('10_Normal_Chr21.bam')

    # Access the reads overlapping 21:9000000-10000000
    for pileup in bf.pileup('21', 9000000, 10000000):
        # Dump the position and number of reads
        print(pileup.pos, pileup.n)

    # Plot?
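The idea behind the per-locus counts above can be illustrated without a BAM file. The sketch below computes depth from a list of alignment intervals. It assumes 0-based, half-open (start, end) intervals and ignores gaps within an alignment, so it is a simplification of what a real pileup reports. The function name and the toy intervals are made up.

```python
from collections import Counter


def coverage(alignments):
    """Return a Counter mapping reference position -> number of covering reads.

    `alignments` is an iterable of (start, end) pairs, 0-based and half-open
    (an assumption of this sketch).
    """
    depth = Counter()
    for start, end in alignments:
        for pos in range(start, end):
            depth[pos] += 1
    return depth


# Three toy alignments that overlap around position 104.
depth = coverage([(100, 105), (102, 107), (104, 106)])
for pos in sorted(depth):
    print(pos, depth[pos])
```

Position 104 is the only one covered by all three toy alignments, so it prints with a depth of 3.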

Look for SNPs

    import pysam

    bf = pysam.Samfile('10_Normal_Chr21.bam')

    # For every position in the reference
    for pileup in bf.pileup('21'):
        counts = {}
        # ...examine every aligned read
        for pileupread in pileup.pileups:
            # ...and get the read-base
            # (query_position is None when the read has no base here;
            #  a plain truth test would also skip position 0)
            if pileupread.query_position is None:
                continue
            readbase = pileupread.alignment.seq[pileupread.query_position]
            # Count the number of each base
            if readbase not in counts:
                counts[readbase] = 0
            counts[readbase] += 1
        # If there is no variation, move on
        if len(counts) < 2:
            continue
        # Otherwise, output the position, coverage and base counts
        print(pileup.pos, pileup.n, end=" ")
        for base in sorted(counts):
            print(base, counts[base], end=" ")
        print()

Filter out bad/poor alignments

            # Inside the loop over pileup.pileups:
            # ...check the read and alignment
            if pileupread.indel:
                continue
            if pileupread.is_del:
                continue
            al = pileupread.alignment
            if al.is_unmapped:
                continue
            if al.is_secondary:
                continue
            if int(al.opt('NM')) > 1:
                continue
            if int(al.opt('NH')) > 1:
                continue
            # ...and get the read-base
            if pileupread.query_position is None:
                continue
            readbase = al.seq[pileupread.query_position]

        # After the check for no variation:
        # if not enough observations of minor allele, move on
        if sorted(counts.values())[-2] < 10:
            continue
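The minor-allele test at the end of the slide sorts the base counts and looks at the second-largest one. A small self-contained sketch of that check on toy data follows. The function names and the example columns are invented for illustration, and the threshold of 10 is the one used on the slide.

```python
from collections import Counter


def minor_allele_count(bases):
    """Return the count of the second most frequent base, or 0 if only one base occurs."""
    ranked = sorted(Counter(bases).values())
    return ranked[-2] if len(ranked) >= 2 else 0


def looks_variable(bases, min_minor=10):
    """True when the second most frequent base is seen at least `min_minor` times."""
    return minor_allele_count(bases) >= min_minor


# Toy pileup columns: the bases observed at one position across reads.
print(looks_variable("A" * 30 + "G" * 12))            # → True  (minor allele seen 12 times)
print(looks_variable("A" * 30 + "G" * 3 + "T" * 1))   # → False (likely sequencing errors)
print(looks_variable("A" * 40))                       # → False (no variation)
```

Unlike the slide's `sorted(counts.values())[-2]`, this version returns 0 instead of raising an error when only one base is present, so it does not rely on an earlier check for no variation.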