Data formats Gabor T. Marth Boston College

Presentation transcript:

Data formats. Gabor T. Marth, Boston College. For folks developing data standards for 1000G analysis. 1000 Genomes Meeting, Philadelphia, November 10-11, 2008.

Why have standard formats? (Slide courtesy of Richard Durbin.)

Standard formats aggregate data from different platforms on a common footing: ABI/capillary, 454 FLX, 454 GS20, Illumina.

Standard formats provide algorithms with a well-defined input and output. This makes it possible to plug alternate tools into the pipeline, compare performance, integrate results across different algorithms, and capture “checkpoints” in the analysis pipeline.

Data types with standard formats

Read data formats – SRF and FASTQ. What is the data: trace information, base calls, base qualities. Produced by base callers, used by read mappers/aligners. Standard formats: SRF, FASTQ.

Read data formats – SRF and FASTQ.
SRF (Sequence Read Format): designed to store machine-specific trace information, alternative base calls, and extended base quality value schemes; a complex format used mostly for archival.
FASTQ: stores only base calls plus one Q-value per base; a simple format that is the same for all platforms; the de facto format for downstream analysis.
Open question: is there information in SRF (but not in FASTQ) that is required by downstream analysis?
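To make "base calls + 1 Q-value per base" concrete, here is a minimal Python sketch of reading FASTQ records. It is an illustration, not part of the slides. It assumes the common four-line record layout and the Phred+33 (Sanger) quality encoding; some platforms of that period used a different offset, so the `offset` parameter is an assumption to adjust. The names `read_fastq` and `phred_to_error` are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fastq(lines: Iterable[str], offset: int = 33) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_name, bases, qualities) from four-line FASTQ records.

    Assumes one record = header, bases, '+' separator, quality string,
    and that qualities are encoded as chr(Q + offset) (Phred+33 by default).
    """
    it = (line.rstrip("\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        bases, plus, quals = next(it, None), next(it, None), next(it, None)
        if quals is None:
            raise ValueError(f"truncated record: {header!r}")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record: {header!r}")
        if len(bases) != len(quals):
            raise ValueError(f"{header!r}: one Q-value per base expected")
        yield header[1:], bases, [ord(c) - offset for c in quals]


def phred_to_error(q: int) -> float:
    """Convert a Phred quality value to a base-call error probability."""
    return 10 ** (-q / 10)


EXAMPLE = "@read1\nACGT\n+\nII5!\n"
records = list(read_fastq(EXAMPLE.splitlines()))
print(records)
```

Each record carries exactly one integer quality per base, which is all that a read mapper needs from the base caller under this format.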

Alignment formats. What is the data? Generated by read mappers / aligners / assemblers; used by, e.g., allele callers and SV callers.

Alignment formats.
A standard format (SAM, TAM, BAM) is being defined (Heng Li [Sanger], Bob Handsaker [Broad], etc.); a standard is within reach.
Compatible with all technologies (AB?); allows aggregation of data from different individuals and different platforms.
“Lean and mean”, so it cannot be all-encompassing.
Remaining issues: gapped / padded alignments, read pairs, compression, indexing.
Extremely high priority for 1000G data analysis.
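For orientation, the sketch below parses one alignment line in the tab-separated layout of the SAM specification as it was later published (eleven mandatory fields: QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL). The format was still being defined at the time of this talk, so the field layout here is taken from the later specification, not from the slides. The names `Alignment`, `parse_sam_line`, and `reference_span`, and the example read, are hypothetical.

```python
import re
from typing import List, NamedTuple, Tuple

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


class Alignment(NamedTuple):
    qname: str                    # read name
    flag: int                     # bitwise flags (0x10 = reverse strand)
    rname: str                    # reference sequence name
    pos: int                      # 1-based leftmost mapping position
    mapq: int                     # mapping quality
    cigar: List[Tuple[int, str]]  # gapped alignment as (length, op) pairs
    seq: str                      # base calls
    qual: str                     # encoded base qualities


def parse_sam_line(line: str) -> Alignment:
    """Parse the mandatory fields of one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("expected at least 11 tab-separated fields")
    qname, flag, rname, pos, mapq, cigar, _rnext, _pnext, _tlen, seq, qual = fields[:11]
    ops = [] if cigar == "*" else [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    return Alignment(qname, int(flag), rname, int(pos), int(mapq), ops, seq, qual)


def reference_span(aln: Alignment) -> int:
    """Number of reference bases covered (operations that consume the reference)."""
    return sum(n for n, op in aln.cigar if op in "MDN=X")


line = "read1\t16\tchr1\t100\t60\t2M1I3M\t*\t0\t0\tACGTAC\tIIIIII"
aln = parse_sam_line(line)
print(aln.rname, aln.pos, aln.cigar, reference_span(aln))
```

The CIGAR string is where gapped alignments are encoded, and the read-pair fields (skipped in this sketch) address the read-pair issue listed above.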

SNP / short-INDEL allele calling. Data: SNP probability, individual genotype probabilities. Produced by the SNP caller, used by downstream analysis.

Genotype likelihood format: GLF
Example site (-----c-----) with alleles a and c; reads observed at the site for individuals 1, i, and n are B1=aacc, Bi=aaaacc, Bn=cccc.
“Genotype likelihoods” (per individual):
P(B1=aacc|G1=aa), P(B1=aacc|G1=cc), P(B1=aacc|G1=ac)
P(Bi=aaaacc|Gi=aa), P(Bi=aaaacc|Gi=cc), P(Bi=aaaacc|Gi=ac)
P(Bn=cccc|Gn=aa), P(Bn=cccc|Gn=cc), P(Bn=cccc|Gn=ac)
Combined with Prior(G1,..,Gi,..,Gn) to give the “genotype probabilities” (conditioned on all reads):
P(G1=aa|B1=aacc; Bi=aaaacc; Bn=cccc), P(G1=cc|B1=aacc; Bi=aaaacc; Bn=cccc), P(G1=ac|B1=aacc; Bi=aaaacc; Bn=cccc)
P(Gi=aa|B1=aacc; Bi=aaaacc; Bn=cccc), P(Gi=cc|B1=aacc; Bi=aaaacc; Bn=cccc), P(Gi=ac|B1=aacc; Bi=aaaacc; Bn=cccc)
P(Gn=aa|B1=aacc; Bi=aaaacc; Bn=cccc), P(Gn=cc|B1=aacc; Bi=aaaacc; Bn=cccc), P(Gn=ac|B1=aacc; Bi=aaaacc; Bn=cccc)
and P(SNP).
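The relationship between genotype likelihoods, the prior, genotype probabilities, and P(SNP) can be sketched numerically with the three individuals from the slide. This is a toy illustration under stated assumptions, not the caller or the GLF layout used by the project: the per-base error model (a fixed two-allele error rate), the joint prior (independent Hardy-Weinberg genotypes at an arbitrary allele frequency), and all function names are hypothetical. P(SNP) is taken here as the posterior probability that the individuals are not all homozygous for the same allele.

```python
from itertools import product
from math import prod

GENOTYPES = ("aa", "cc", "ac")


def genotype_likelihood(reads: str, genotype: str, error: float = 0.01) -> float:
    """P(B | G): each base is drawn from one of the two alleles with equal
    chance and is miscalled as the other allele with probability `error`."""
    return prod(
        sum((1 - error) if base == allele else error for allele in genotype) / 2
        for base in reads
    )


def hwe_prior(genotypes, freq_c: float = 0.5) -> float:
    """Hypothetical joint prior: independent Hardy-Weinberg genotypes.
    freq_c is an arbitrary illustrative allele frequency."""
    p = {"aa": (1 - freq_c) ** 2, "cc": freq_c ** 2, "ac": 2 * freq_c * (1 - freq_c)}
    return prod(p[g] for g in genotypes)


def call_site(reads_per_individual, prior=hwe_prior, error: float = 0.01):
    """Return (likelihoods, posterior genotype probabilities, P(SNP))."""
    n = len(reads_per_individual)
    likes = [{g: genotype_likelihood(r, g, error) for g in GENOTYPES}
             for r in reads_per_individual]
    posts = [dict.fromkeys(GENOTYPES, 0.0) for _ in range(n)]
    total = monomorphic = 0.0
    # Enumerate every joint genotype assignment (3**n terms; fine for a toy).
    for combo in product(GENOTYPES, repeat=n):
        weight = prior(combo) * prod(likes[i][g] for i, g in enumerate(combo))
        total += weight
        for i, g in enumerate(combo):
            posts[i][g] += weight  # marginalise over the other individuals
        if combo[0] != "ac" and len(set(combo)) == 1:
            monomorphic += weight  # everyone aa, or everyone cc
    posts = [{g: w / total for g, w in p.items()} for p in posts]
    return likes, posts, 1 - monomorphic / total


likes, posts, p_snp = call_site(["aacc", "aaaacc", "cccc"])
for i, post in enumerate(posts):
    print(i, {g: round(p, 4) for g, p in post.items()})
print("P(SNP) =", p_snp)
```

Storing the likelihoods rather than the posteriors keeps the file independent of the prior, so a different population model can be applied downstream without going back to the reads.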

Other data types that need a standard format?