Bioinformatics Applications

Slides:



Advertisements
Similar presentations
NGS data analysis in R Biostrings and Shortread
Advertisements

In Silico Primer Design and Simulation for Targeted High Throughput Sequencing I519 – FALL 2010 Adam Thomas, Kanishka Jain, Tulip Nandu.
Information Retrieval in Practice
IMGS 2012 Bioinformatics Workshop: File Formats for Next Gen Sequence Analysis.
DNAseq analysis Bioinformatics Analysis Team
Institute for Quantitative & Computational Biosciences Workshop4: NGS- study design and short read mapping.
Sequencing and Sequence Alignment
The Extraction of Single Nucleotide Polymorphisms and the Use of Current Sequencing Tools Stephen Tetreault Department of Mathematics and Computer Science.
Modules, Hierarchy Charts, and Documentation
1 Chapter 2 Reviewing Tables and Queries. 2 Chapter Objectives Identify the steps required to develop an Access application Specify the characteristics.
Biological Sequence Analysis BNFO 691/602 Spring 2014 Mark Reimers
The University of Adelaide Table Talk: Using tables in Word Peter Murdoch March 2014 PREPARING GOOD LOOKING DOCUMENTS.
Workshop Schedule Schedule has links to introductory presentations and the FungiDB workshops Tuesday 3rdWednesday.
HTML: PART ONE. Creating an HTML Document  It is a good idea to plan out a web page before you start coding  Draw a planning sketch or create a sample.
UNIX Filters.
Fundamentals of Python: From First Programs Through Data Structures
Data Formats & QC Analysis for NGS Rosana O. Babu 8/19/20151.
NGS Analysis Using Galaxy
Whole Exome Sequencing for Variant Discovery and Prioritisation
Mapping NGS sequences to a reference genome. Why? Resequencing studies (DNA) – Structural variation – SNP identification RNAseq – Mapping transcripts.
Advanced File Processing
Fundamentals of Python: First Programs
ASP.NET Programming with C# and SQL Server First Edition
Introduction to Short Read Sequencing Analysis
MES Genome Informatics I - Lecture V. Short Read Alignment
File formats Wrapping your data in the right package Deanna M. Church
GBS Bioinformatics Pipeline(s) Overview
RNAseq analyses -- methods
SAGExplore web server tutorial for Module II: Genome Mapping.
CIS 218 Advanced UNIX1 CIS 218 – Advanced UNIX (g)awk.
Galaxy for Bioinformatics Analysis An Introduction TCD Bioinformatics Support Team Fiona Roche, PhD Date: 31/08/15.
DAY 1. GENERAL ASPECTS FOR GENETIC MAP CONSTRUCTION SANGREA SHIM.
NGS data analysis CCM Seminar series Michael Liang:
Next Generation DNA Sequencing
Advanced File Processing. 2 Objectives Use the pipe operator to redirect the output of one command to another command Use the grep command to search for.
Chapter Five Advanced File Processing Guide To UNIX Using Linux Fourth Edition Chapter 5 Unix (34 slides)1 CTEC 110.
Quick introduction to genomic file types Preliminary quality control (lab)
DM ChurchLast Updated: 7 May 2012 Intro to Next Generation Sequencing.
SAGExplore web server tutorial for Module I: Genome Explore.
ParSNP Hash Pipeline to parse SNP data and output summary statistics across sliding windows.
1 Removing Sequential Bottlenecks in Analysis of Next-Generation Sequencing Data Yi Wang, Gagan Agrawal, Gulcin Ozer and Kun Huang The Ohio State University.
IPlant Collaborative Discovery Environment RNA-seq Basic Analysis Log in with your iPlant ID; three orange icons.
Introducing Python CS 4320, SPRING Lexical Structure Two aspects of Python syntax may be challenging to Java programmers Indenting ◦Indenting is.
Programming Fundamentals. Overview of Previous Lecture Phases of C++ Environment Program statement Vs Preprocessor directive Whitespaces Comments.
Overview Excel is a spreadsheet, a grid made from columns and rows. It is a software program that can make number manipulation easy and somewhat painless.
GE3M25: Computer Programming for Biologists Python, Class 5
SAGExplore web server tutorial. The SAGExplore server has three different modules …
Microsoft Office 2013 Try It! Chapter 4 Storing Data in Access.
1 Copyright © 2011 Tata Consultancy Services Limited Virtual Access Storage Method (VSAM) and Numeric Intrinsic Functions (NUMVAL and NUMVAL-C) LG - TMF148.
Accessing and visualizing genomics data
Welcome to the combined BLAST and Genome Browser Tutorial.
Short Read Workshop Day 5: Mapping and Visualization
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
Winter 2016CISC101 - Prof. McLeod1 CISC101 Reminders Quiz 3 this week – last section on Friday. Assignment 4 is posted. Data mining: –Designing functions.
Canadian Bioinformatics Workshops
High Throughput Sequence (HTS) data analysis 1.Storage and retrieving of HTS data. 2.Representation of HTS data. 3.Visualization of HTS data. 4.Discovering.
Canadian Bioinformatics Workshops
DAY 2. GETTING FAMILIAR WITH NGS SANGREA SHIM. INDEX  Day 2  Get familiar with NGS  Understanding of NGS raw read file  Quality issue  Alignment/Mapping.
Storage and File Organization
Using command line tools to process sequencing data
Day 5 Mapping and Visualization
First Bite of Variant Calling in NGS/MPS Precourse materials
BF528 - Biological Data Formats
Learning to count: quantifying signal
Maximize read usage through mapping strategies
Canadian Bioinformatics Workshops
BF528 - Sequence Analysis Fundamentals
RNA-Seq Data Analysis UND Genomics Core.
The Variant Call Format
Presentation transcript:

Bioinformatics Applications Vivek Krishnakumar & Haibao Tang J. Craig Venter Institute Plant Informatics Workshop July 15, 2013 *

Module 1: FASTA

FASTA format - Example Definition Line or Header begins with ‘>’ A single FASTA sequence record: Definition Line or Header begins with ‘>’ Width of sequence rows usually 60 letters. The Multi-FASTA format is composed of FASTA records concatenated together.

FASTA - Definition Line Minimum requirement for definition line is '>' symbol followed by an identifier >Medtr5g073340 Extra comments may be added after the identifier separated by a whitespace Several pre-defined conventions already exist, that are followed by the sequence databases

FASTA - Naming Conventions

faSize htang@htang-lx $ faSize contigs.fasta 344315953 bases (621962 N's 343693991 real 343693991 upper 0 lower) in 177125 sequences in 1 files Total size: mean 1943.9 sd 3181.2 min 200 (contig_177102) max 139442 (contig_26116) median 624 N count: mean 3.5 sd 10.5 U count: mean 1940.4 sd 3180.1 L count: mean 0.0 sd 0.0 %0.00 masked total, %0.00 masked real htang@htang-lx $ faSize contigs.fasta -detailed | sort -k2,2nr | head contig_26116 139442 contig_26133 126994 contig_26222 58825 contig_28230 54642 contig_25859 50265 contig_38555 47349 contig_26213 47226 contig_41638 45958 contig_6779 43393 contig_26783 42041

faOneRecord, faSomeRecords faOneRecord - Extract a single record from a .FA file usage: faOneRecord in.fa recordName faSomeRecords - Extract multiple fa records faSomeRecords in.fa listFile out.fa options: -exclude - output sequences not in the list file. Does not create index like cdbindex/cdbyank (does not create extra files) Sufficiently fast

faSplit faSplit - Split an fa file into several files. usage: faSplit how input.fa count outRoot where how is either 'base' 'sequence' or 'size'. Files split by sequence will be broken at the nearest fa record boundary, while those split by base will be broken at any base. Files broken by size will be broken every count bases. Examples: faSplit sequence estAll.fa 100 est This will break up estAll.fa into 100 files (numbered est001.fa est002.fa, ... est100.fa Files will only be broken at fa record boundaries faSplit base chr1.fa 10 1_ This will break up chr1.fa into 10 files faSplit size input.fa 2000 outRoot This breaks up input.fa into 2000 base chunks faSplit about est.fa 20000 outRoot This will break up est.fa into files of about 20000 bytes each by record. faSplit byname scaffolds.fa outRoot This breaks up scaffolds.fa using sequence names as file names. faSplit gap chrN.fa 20000 outRoot This breaks up chrN.fa into files of at most 20000 bases each, at gap boundaries if possible.

faGapSizes, faGapLocs htang@htang-lx (~) $ faGapSizes medicago.fa gapCount=7870, totalN=51926655, minGap=1, maxGap=250000, avgGap=6598.00 0 < size < 10 : 2402: *********************** size = 10 : 2: 10 < size < 50 : 50: size = 50 : 26: 50 < size < 100 : 38: size = 100 : 4654: ********************************************* 100 < size < 500 : 7: size = 500 : 0: 500 < size < 1000 : 0: size = 1000 : 0: 1000 < size < 5000 : 1: size = 5000 : 305: ** 5000 < size < 10000 : 1: size = 10000 : 1: 10000 < size < 50000 : 0: size = 50000 : 101: size > 50000 : 282: ** htang@htang-lx (~) $ faGapLocs CU914135 stdout 0 CU914135.1_seq_0 70702 CU914135.1 276201 70702 CU914135.1_gap_0 50 CU914135.1 276201 70752 CU914135.1_seq_1 51172 CU914135.1 276201 121924 CU914135.1_gap_1 50 CU914135.1 276201 121974 CU914135.1_seq_2 154227 CU914135.1 276201

Module 2: FASTQ

Format for HTS Data High Throughput Sequencing (HTS) instruments produce large quantities of sequence data Requirement to store sequence quality along with raw sequence arose Thus, logical extension to FASTA was developed, called FASTQ Minimal representation of sequencing read Ability to store numeric quality score

FASTQ format Line 1 begins with the @ character followed by sequence identifier Line 2 consists of the raw sequence Line 3 begins with a + symbol followed by an optional description or repeat of Line 1 Line 4 corresponds to the encoded quality values (one character each for every nucleotide in the sequenced read) Common file extensions: .fastq, .fq, or .txt are used

FASTQ format Illumina Quality Encoding Phred Quality Scores Q: logarithmically related to base calling error probabilities P Phred score: 10 = 10% error 20 = 1% error 30 = 0.1% error 40 = 0.01% error

FASTQ format Illumina Quality Encoding Illumina format encodes a quality score between 0 and 62 using ASCII 64 to 126 (as compared to Sanger which encodes 0 to 93 using ASCII 33 to 126) The phred score + some offset = ASCII code, example: Two types of offsets (phred +33, and phred +64). Most of the FASTQ files are phred +33. How do I know which offset it is? There is a quick tip

FASTQ format Illumina Phred + 33

FASTQ format Illumina Phred + 64

Module 3: GFF, BED

Gene structure

Generic Feature Format GFF = Generic Feature Format Tab delimited, easy for data parsing and processing Many annotation viewers accept this format, isn't very strict Fields: Reference Sequence: base seq to which the coordinates are anchored Source: source of the annotation Type: Type of feature Start coordinate (1-based) End coordinate Score: Used for holding numerical scores (similarity, etc) Strand: "+","-", or "." if unstranded Frame: Signifies codon phase for coding sequence (CDS) features Other attributes or/and comments

GFF3 – Generic Feature Format v3 http://www.sequenceontology.org/gff3.shtml Extension of GFF by the Sequence Ontology (SO) and Generic Model Organism Database (GMOD) Projects - Allows hierarchies more than one level deep - Constrains the feature type field to be taken from controlled vocabulary - Feature can belong to more than one group - Attributes take the form of “Key=Value” pairs separated by a ";" - Uppercase 'Keys' are reserved. Lower case 'keys' are user-defined

BED Format http://genome.ucsc.edu/FAQ/FAQformat.html#format1 Developed primarily for the UCSC genome browser BED lines have 3 required fields and 9 optional fields With just the required fields and a few additional optional fields (bed6), individual features (such as gene/mRNA boundaries) can be represented In order to depict hierarchical features, all 12 columns are necessary (also called bed12 format)

BED Format - Description Three required BED fields are: chrom - The name of the chromosome/reference (e.g. chr3, scaffold_1) chromStart - The starting position of the feature (0-based) chromEnd - The ending position of the feature The 9 additional optional BED fields are: name - Defines the name of the BED line score - A score between 0 and 1000 strand - Defines the strand - either '+' or '-'. thickStart - The starting position at which the feature is drawn thickly (for example, the start codon in gene displays). thickEnd - The ending position at which the feature is drawn thickly (for example, the stop codon in gene displays) itemRgb - An RGB value of the form R,G,B (e.g. 255,0,0) blockCount - The number of blocks (exons) in the BED line blockSizes - A comma-separated list of the block sizes, corresponding to blockCount blockStarts - A comma-separated list of block starts, relative to chromStart, corresponding to blockCount

BED Format - Examples BED6 BED12 BED chr2 0 673342 chr5 0 619924 chr2 136141 136779 Medtr2g005340 0 + chr2 140131 140463 Medtr2g005360 0 + BED12 chr2 140131 140463 Medtr2g005360 0 + \ 140131 140463 255,0,0 2 100,58 0,100

BEDTOOLS Author: Aaron Quinlan, UVA BED format (first three columns `chr`, `start`, `end` are required), 0-based BEDTOOLS operates on genomic coordinates and uses “Bin indexing” to speed up range queries #chromosome start end name score strand ... Mt_chr01 150050 154771 Medtr1g005000 1000 - Mt_chr01 154858 155160 Medtr1g005010 1000 + Mt_chr01 155200 162479 Medtr1g005020 1000 + Mt_chr01 162535 164388 Medtr1g005030 1000 - Mt_chr01 164428 166156 Medtr1g005040 1000 - Mt_chr01 167402 169400 Medtr1g005050 1000 - Mt_chr01 169440 171646 Medtr1g005060 1000 - Mt_chr01 171722 172427 Medtr1g005070 1000 -

BEDTOOLS functionalities Find out what’s overlapping between sets of features. (intersectBed) Find closest genomic features. (closestBed, windowBed) Merge overlapping features. (mergeBed) Computing coverage for alignments based on genome features. (coverageBam, bamToBed) Calculating the depth and breadth of sequence coverage across defined "windows" in a genome. (coverageBed) Sequence output. (fastaFromBed, maskFastaFromBed)

intersectBed: What’s in common? Report the base-pair overlap between sequence alignments and genes. Report those alignments that overlap NO genes. Like "grep -v" Report the number of genes that each alignment overlaps. Report the entire, original alignment and gene entries for each overlap, and number of overlapping bases. Only report an overlap with a repeat if it spans at least 50% of the exon. Read BED A from stdin. For example, find genes that overlap LINEs but not SINEs. Retain only single-end BAM alignments that do not overlap simple sequence repeats. $ intersectBed -a reads.bed -b genes.bed $ intersectBed -a reads.bed -b genes.bed -v $ intersectBed -a reads.bed -b genes.bed -c $ intersectBed -a reads.bed -b genes.bed –wo $ intersectBed -a exons.bed -b repeatMasker.bed –f 0.50 $ intersectBed -a genes.bed -b LINES.bed | intersectBed -a stdin -b SINEs.bed –v $ intersectBed -abam reads.bam -b SSRs.bed -v > reads.noSSRs.bam

windowBed, closestBed: What’s in there? Report all genes that are within 10000 bp upstream or downstream of CNVs. Report all genes that are within 10000 bp upstream or 5000 bp downstream of CNVs. Report all SNPs that are within 5000 bp upstream or 1000 bp downstream of genes. Define upstream and downstream based on strand. closestBed Note: By default, if there is a tie for closest, all ties will be reported. closestBed allows overlapping features to be the closest. Find the closest ALU to each gene. Find the closest ALU to each gene, choosing the first ALU in the file if there is a tie. $ windowBed -a CNVs.bed -b genes.bed -w 10000 $ windowBed -a CNVs.bed -b genes.bed –l 10000 –r 5000 $ windowBed -a genes.bed –b snps.bed –l 5000 –r 1000 -sw $ closestBed -a genes.bed -b ALUs.bed $ closestBed -a genes.bed -b ALUs.bed –t first

mergeBed, coverageBed mergeBed Merge overlapping repetitive elements into a single entry. Merge overlapping repetitive elements into a single entry, returning the number of entries merged. Merge nearby (within 1000 bp) repetitive elements into a single entry. coverageBed Compute the coverage of aligned sequences on 10 kilobase “windows” spanning the genome. $ mergeBed -i repeatMasker.bed $ mergeBed -i repeatMasker.bed -n $ mergeBed -i repeatMasker.bed –d 1000 $ coverageBed -a reads.bed -b windows10kb.bed Default Output: After each entry in B, reports: 1) The number of features in A that overlapped the B interval. 2) The number of bases in B that had non-zero coverage. 3) The length of the entry in B. 4) The fraction of bases in B that had non-zero coverage.

fastaBed, maskFastaFromBed: messing with sequences Program: fastaFromBed (v2.6.1) Summary: Extract DNA sequences into a fasta file based on BED coordinates. Usage: fastaFromBed [OPTIONS] -fi -bed -fo Options: -fi Input FASTA file -bed BED file of ranges to extract from -fi -fo Output file (can be FASTA or TAB-delimited) -name Use the BED name field (#4) for the FASTA header -tab Write output in TAB delimited format. - Default is FASTA format. Program: maskFastaFromBed (v2.6.1) Summary: Mask a fasta file based on BED coordinates. Usage: maskFastaFromBed [OPTIONS] -fi -out -bed Options: -fi Input FASTA file -bed BED file of ranges to mask in -fi -fo Output FASTA file -soft Enforce "soft" masking. That is, instead of masking with Ns, mask with lower-case bases.

Module 4: SAM, BAM

Sequence Alignment Format With the advent of HTS technologies, several requirements arose: need for a generic alignment format to store read alignments support for short and long reads generated by different sequencing platforms compact file size random access ability ability to store various types of alignments (clipped, spliced, multi-mapped, padded) As a result, SAM (Sequence Alignment/Map) format evolved

SAM Format - Example http://samtools.sourceforge.net/SAM1.pdf @HD VN:1.3 SO:coordinate @SQ SN:ref LN:45 r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG * r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA * r003 0 ref 9 30 5H6M * 0 0 AGCTAA * NM:i:1 r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC * r003 16 ref 29 30 6H5M * 0 0 TAGGC * NM:i:0 r001 83 ref 37 30 9M = 7 -39 CAGCGCCAT *

SAM Format - Header Lines Header lines start with @ @ is followed by TAG Known TAGS: @HD - Header @SQ - Reference Sequence dictionary @RG - Read Group Header fields are TYPE:VALUE pairs @SQ SN:ref LN:45 TAG TYPE:VALUE TYPE:VALUE Example: @RG ID:L2 PU:SC_2_12 LB:SC_2 SM:NA12891

SAM Format - Alignment Section 11 mandatory fields Tab-delimited Optional fields (variable) QNAME: Query name of the read or the read pair FLAG: Bitwise flag (multiple segments, unmapped, etc.) RNAME: Reference sequence name POS: 1-Based leftmost position of clipped alignment MAPQ: Mapping quality (Phred-scaled) CIGAR: Extended CIGAR string (operations: MIDNSHP) MRNM: Mate reference name (‘=’ if same as RNAME) MPOS: 1-based leftmost mate position ISIZE: Inferred insert size SEQ: Sequence on the same strand as the reference QUAL: Query quality (ASCII-33 = Phred base quality)

SAM Format - CIGAR string M: match/mismatch I: insertion D: deletion P: padding N: skip S: soft-clip H: hard-clip Ref: GCATTCAGATGCAGTACGC Read: ccTCAG--GCAGTAgtg CIGAR 2S4M2D6M3S POS 5

SAM Format Compressed Binary Version Known as BAM Binary, indexed representation of SAM Uses BGZF (Blocked GZip Format) compression Storage space requirements ~27% of original SAM SAM/BAM can be manipulated using SAMTOOLS package

SAMTOOLS SAM, and BAM format is popular for reporting read mappings onto the reference genome, SAM is human readable SAMTOOLS, author: Heng Li, Broad good for manipulating SAM files @HD VN:1.0 SO:unsorted @SQ SN:contig_27167 LN:26169 GFNQG3Z01EMD0D 65 contig_27167 25164 60 3S24M = 23568 1620 CATTCTCTCTTTCTTCTTTGTGCTTCA * GFNQG3Z01EMD0D 129 contig_27167 23568 60 18M1D8M = 25164 1620 GTCAAACAACCCTCGTAGAAATATGAT *

SAMTOOLS Samtools manipulate SAM/BAM Once you have the bam indexed, you can quickly access any range in the reference (remember bin indexing?) bowtie ref -1 SRR067323_1.fastq -2 SRR067323_2.fastq --best --maxins 1000 -S | \ samtools view -h -S -F 4 - > SRR067323.aligned.sam samtools view -bS SRR067323.aligned.sam –o SRR067323.aligned.bam samtools sort SRR067323.aligned.bam SRR067323.sorted samtools index SRR067323.sorted.bam SAM BAM Sorted BAM Indexed BAM samtools view samtools sort samtools index