A brief guide to sequencing Dr Gavin Band Wellcome Trust Advanced Courses; Genomic Epidemiology in Africa, 21st – 26th June 2015 Africa Centre for Health.


A brief guide to sequencing. Dr Gavin Band. Wellcome Trust Advanced Courses; Genomic Epidemiology in Africa, 21st – 26th June 2015. Africa Centre for Health and Population Studies, University of KwaZulu-Natal, Durban, South Africa

Course topics: Introductions; Epidemiology; Genetics; Bioinformatics; Basic principles of measuring disease in populations; Population genetics; Basic genotype data summaries and analyses; Principal components analyses; GWAS QC; GWAS association analyses; GWAS results and interpretation; Meta-analysis and power of genetic studies; Public databases and resources for genetics; Whole genome sequencing and fine-mapping

A brief guide to sequencing

Why sequence? A key motivation for sequencing is to discover and type more genetic variation, either within a region or across the whole genome. In particular, sequencing can be used to fine-map association signals by identifying putatively functional mutations that may not have been typed in the original data.

Why sequence? E.g. sequencing the whole genome: REDACTED. E.g. 99% of SNPs with at least 1% frequency detected.

Why sequence? E.g. sequencing within a region: sequencing of three genes in 282 Africans.

A comparison of technologies
Genotyping array
What can you type? Previously identified variants.
What do you get out? Chip intensities per variant per sample.
How much data? E.g. 50Mb per sample.
How do you process it? Call genotypes from intensity files.
How much does it cost? $

A comparison of technologies
What can you type?
Genotyping array: previously identified variants.
‘Next-generation’* sequencing: the whole genome! (Except inaccessible bits.)
What do you get out?
Genotyping array: chip intensities per variant per sample.
‘Next-generation’* sequencing: short (e.g. 100bp), accurate reads.
How much data?
Genotyping array: e.g. 50Mb per sample.
‘Next-generation’* sequencing: e.g. 40Gb per 12x coverage.
How do you process it?
Genotyping array: call genotypes from intensity files.
‘Next-generation’* sequencing: map to a reference sequence to call variation.
How much does it cost?
Genotyping array: $
‘Next-generation’* sequencing: $$
* This is typical current-generation sequencing.

A comparison of technologies
What can you type?
Genotyping array: previously identified variants.
‘Next-generation’* sequencing: the whole genome! (Except inaccessible bits.)
Single-molecule sequencing: the whole genome! (Except really inaccessible bits.)
What do you get out?
Genotyping array: chip intensities per variant per sample.
‘Next-generation’* sequencing: short (e.g. 100bp), accurate reads.
Single-molecule sequencing: long (e.g. 4kb), less accurate reads.
How much data?
Genotyping array: e.g. 50Mb per sample.
‘Next-generation’* sequencing: e.g. 40Gb per 12x coverage.
Single-molecule sequencing: similar to short-read data.
How do you process it?
Genotyping array: call genotypes from intensity files.
‘Next-generation’* sequencing: map to a reference sequence to call variation.
Single-molecule sequencing: custom mapping/assembly pipelines.
How much does it cost?
Genotyping array: $
‘Next-generation’* sequencing: $$
Single-molecule sequencing: $$$$$
* This is really the current generation of sequencing.
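
The "40Gb per 12x coverage" figure can be sanity-checked with a little arithmetic. The sketch below assumes a human genome of roughly 3.1 billion bases and reads "Gb" as gigabases; both are assumptions made for illustration, not figures taken from the slides.

```python
# Back-of-envelope check of sequencing data volume versus coverage.
# GENOME_SIZE is an assumed approximate human genome length, not a value from the slides.
GENOME_SIZE = 3.1e9  # bases

def total_bases(coverage, genome_size=GENOME_SIZE):
    """Bases that must be sequenced to reach a given mean coverage."""
    return coverage * genome_size

def n_reads(coverage, read_length=100, genome_size=GENOME_SIZE):
    """Number of reads of a given length needed for that coverage."""
    return total_bases(coverage, genome_size) / read_length

gigabases = total_bases(12) / 1e9
print(f"12x coverage is about {gigabases:.1f} gigabases")
print(f"that is about {n_reads(12) / 1e6:.0f} million 100bp reads")
```

Under these assumptions 12x coverage comes to roughly 37 gigabases, which is in line with the order of magnitude quoted on the slide.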


Typical sequencing pipeline
1. Whole DNA goes in…
2. DNA gets sheared into short fragments (typically bp).
3. DNA gets amplified and adapted.
4. Fragments are sequenced…
5. Reads come out.

Anatomy of a read
Figure labels: direction of read; “insert size”; “paired ends”.

Anatomy of a read
AGGCAGGCTTTAGGCTAGGCAGTCCCTAGGCCGCCGCTTGCATATGC … …GGTGTCTGTATTGAATTGCCGCTAGCTAGACTAGCTAC
Figure label: direction of read.

Anatomy of a read
AGGCAGGCTTTAGGCTAGGCAGTCCCTAGGCCGCCGCTTGCATATGC … …GGTGTCTGTATTGAATTGCCGCTAGCTAGACTAGCTAC
Figure labels: direction of read; lower-quality bases; high-quality bases.


How to assemble it?

Aligning back to the reference sequence
Reference sequence: ...ATAGATAGACCATACTGCATCGCAAGCAGCTACGCTAGCGTTAGGCATGG...
Sequencing reads:
ATAGATAGACCATACTCCAT
GAAAGACCATACTCCATCGCTAGCAGCACGCTAGAG
...ATAGATAGACCAGA
GAAAGACCATACTCCATCGCTAGCAGCTACGCTAGAG
GGC..
Figure label: does this map here?

Aligning back to the reference sequence
Reference sequence: ...ATAGATAGACCATACTGCATCGCAAGCAGCTACGCTAGCGTTAGGCATGG...
Sequencing reads:
ATAGATAGACCATACTCCAT
GAAAGACCATACTCCATCGCTAGCAGC-ACGCTAGAG
...ATAGATAGACCAGA
GAAAGACCATACTCCATCGCTAGCAGCTACGCTAGAG
GGC..
Figure label: does this map here?
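
The question "does this map here?" can be made concrete with a toy example. The sketch below slides a read along the reference without gaps and keeps the offset with the fewest mismatches. This is only an illustration of the idea: real aligners such as BWA use indexed data structures and allow gaps, and the function names here are made up for the example.

```python
# Toy read placement: try every ungapped offset and count mismatches.
# Not how production aligners work; for illustration only.
REFERENCE = "ATAGATAGACCATACTGCATCGCAAGCAGCTACGCTAGCGTTAGGCATGG"

def mismatches(read, ref, offset):
    """Number of positions where the read differs from the reference."""
    return sum(1 for i, base in enumerate(read) if base != ref[offset + i])

def best_placement(read, ref):
    """Return (offset, mismatch_count) of the best ungapped placement."""
    candidates = [
        (mismatches(read, ref, offset), offset)
        for offset in range(len(ref) - len(read) + 1)
    ]
    n_mismatch, offset = min(candidates)  # ties go to the leftmost offset
    return offset, n_mismatch

read = "ATAGATAGACCATACTCCAT"  # first read from the slide
offset, n_mismatch = best_placement(read, REFERENCE)
print(f"best offset {offset} with {n_mismatch} mismatch(es)")
```

A read that carries a deletion, like the second read on the slide, would score badly under this ungapped scheme after the deleted base, which is one reason real aligners model gaps explicitly.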

Aligning back to the reference sequence
Reference sequence: ...ATAGATAGACCATACTGCATCGCAAGCAGCTACGCTAGCGTTAGGCATGG...
Sequencing reads:
ATAGATAGACCATACTCCAT
GAAAGACCATACTCCATCGCTAGCAGC-ACGCTAGAG
...ATAGATAGACCAGA
GAAAGACCATACTCCATCGCTAGCAGCTACGCTAGAG
GGC..
Figure labels: read depth (coverage) = 4; read depth = 3; read depth = 1.
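
Read depth at a position is simply the number of aligned reads that cover it. A minimal sketch, using hypothetical (start, length) alignments rather than the reads on the slide:

```python
# Read depth: count how many aligned reads cover each reference position.
# The (start, length) pairs below are hypothetical toy data.
def read_depth(alignments, ref_length):
    depth = [0] * ref_length
    for start, length in alignments:
        # Clip reads that run past the end of the reference.
        for pos in range(start, min(start + length, ref_length)):
            depth[pos] += 1
    return depth

alignments = [(0, 6), (2, 6), (3, 6), (9, 3)]
depth = read_depth(alignments, ref_length=12)
print(depth)
print("mean coverage:", sum(depth) / len(depth))
```

In practice this information is read from a BAM file rather than computed by hand, but the definition is the same.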

Mapping back to the reference sequence
Reference sequence: ...ATAGATAGACCATACTGCATCGCAAGCAGCTACGCTAGCGTTAGGCATGG...
Sequencing reads:
ATAGATAGACCATACTCCAT
GAAAGACCATACTCCATCGCTAGCAGC-ACGCTAGAG
...ATAGATAGACCAGA
GAAAGACCATACTCCATCGCTAGCAGCTACGCTAGAG
GGC..
Variation: T/A, T/G, A/T, indel, C/A, G/C, ?

Aligning back to the reference sequence
Reference sequence: ...ATAGATAGACCATACTGCATCGCAAGCAGCTACGCTAGCGTTAGGCATGG...
Sequencing reads:
ATAGATAGACCATACTCCAT
GAAAGACCATACTCCATCGCTAGCAGC-ACGCTAGAG
...ATAGATAGACCAGA
GAAAGACCATACTCCATCGCTAGCAGCTACGCTAGAG
GGC..
Variation: T/A, T/G, A/T, indel, C/A, G/C
Filtered variation: T/A, A/T, C/A
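
The step from "variation" to "filtered variation" can be sketched as a pileup followed by a support threshold. The example below uses a short made-up reference and ungapped reads, and filters only on the number of reads supporting the non-reference base. Real callers use probabilistic models, handle indels, and apply further filters; the names and thresholds here are hypothetical.

```python
from collections import Counter, defaultdict

def pileup(alignments):
    """Map each reference position to a Counter of the read bases seen there.
    alignments: (start, read) pairs with ungapped reads (a simplification)."""
    columns = defaultdict(Counter)
    for start, read in alignments:
        for i, base in enumerate(read):
            columns[start + i][base] += 1
    return columns

def call_variants(reference, alignments, min_alt_reads=1):
    """Return (position, ref_base, alt_base, alt_reads, depth) tuples."""
    calls = []
    columns = pileup(alignments)
    for pos in sorted(columns):
        counts = columns[pos]
        depth = sum(counts.values())
        for base, n in sorted(counts.items()):
            if base != reference[pos] and n >= min_alt_reads:
                calls.append((pos, reference[pos], base, n, depth))
    return calls

reference = "ACGTACGTAC"  # toy reference, not the one on the slide
alignments = [(0, "ACGTTCGT"), (2, "GTTCGTAC"), (1, "CGTTCGAA")]

print("all candidates:", call_variants(reference, alignments))
print("filtered:", call_variants(reference, alignments, min_alt_reads=2))
```

Here the candidate seen in a single read is dropped while the one supported by all three reads is kept, which mirrors the idea that isolated mismatches are more likely to be sequencing or mapping errors.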

Typical calling pipeline for discovering and/or typing variation in multiple samples
1. Align reads to a reference sequence (into BAM file).
2. Call genotypes at bases where there’s some variation in the data (into VCF or BCF file).
3. Filter variants, e.g. by read depth, mapping quality, potential biases between ref and non-ref reads… (into VCF or BCF file).
4. Analyse (into high-profile journal).

Typical calling pipeline for discovering and/or typing variation in multiple samples
1. Align reads to a reference sequence (into BAM file).
2. Call genotypes at bases where there’s some variation in the data (into VCF or BCF file).
3. Filter variants, e.g. by read depth, mapping quality, potential biases between ref and non-ref reads… (into VCF or BCF file).
4. Analyse (into high-profile journal).
Callout on the slide: this is often the tricky part.
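
One possible way to express the align, call and filter steps as commands is sketched below. It only builds and prints the command lines; nothing is executed. The file names and the QUAL threshold are placeholders, the sketch handles a single sample for brevity, and it assumes a recent bcftools release in which `mpileup` is a bcftools subcommand (the glossary later in these slides refers to `samtools mpileup`).

```python
import shlex

def pipeline_commands(ref, fastq1, fastq2, sample, min_qual=20):
    """Return shell command strings for an align / call / filter pipeline.
    File names and the QUAL threshold are placeholders, not recommendations."""
    ref, fastq1, fastq2 = (shlex.quote(x) for x in (ref, fastq1, fastq2))
    bam = shlex.quote(f"{sample}.bam")
    raw = shlex.quote(f"{sample}.raw.bcf")
    filtered = shlex.quote(f"{sample}.filtered.bcf")
    return [
        # Build the BWA index for the reference (needed once per reference).
        f"bwa index {ref}",
        # Align paired-end reads and write a coordinate-sorted BAM file.
        f"bwa mem {ref} {fastq1} {fastq2} | samtools sort -o {bam} -",
        # Index the BAM file for fast regional access.
        f"samtools index {bam}",
        # Call variant sites into a BCF file.
        f"bcftools mpileup -f {ref} {bam} | bcftools call -mv -Ob -o {raw}",
        # Exclude low-quality calls.
        f"bcftools filter -e 'QUAL<{min_qual}' -Ob -o {filtered} {raw}",
    ]

for command in pipeline_commands("ref.fa", "reads_1.fastq", "reads_2.fastq", "sample1"):
    print(command)
```

Keeping the commands as data makes it easy to review them before running anything, for example by passing each string to a shell once the paths have been checked.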

(Image provided courtesy of Jason Wendler based on MalariaGEN data)

What can possibly go wrong?

The human genome is complex. Short reads will not align well here

What can possibly go wrong? CRG Align 100 track on UCSC Genome Browser

But I don’t have any sequencing data! Luckily, lots of data is publicly available. In particular, the 1000 Genomes project data can be downloaded from:
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/
ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/

File format glossary
FASTA/FASTQ. What it contains: sequence reads and quality scores; also full genome and regional assemblies. Where it comes from: sequencing; also genome assembly.
BAM files (or SAM files). What it contains: sequence reads mapped to a reference sequence. Where it comes from: alignment algorithm, e.g. BWA.
VCF, BCF. What it contains: genotypes called from BAM files at polymorphic locations. Where it comes from: calling algorithm, e.g. samtools mpileup + bcftools.
.fai, .bai, … What it contains: index files, used to quickly access FASTA, BAM, or other files. Where it comes from: indexing software, e.g. samtools index.
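
To make the FASTQ entry concrete, here is a minimal sketch of reading four-line FASTQ records and decoding the quality string. It assumes Phred+33 encoding and unwrapped records; the record shown is made up for illustration.

```python
def parse_fastq(lines):
    """Yield (name, sequence, qualities) from 4-line FASTQ records.
    Assumes Phred+33 quality encoding and one line each for sequence and quality."""
    lines = [line.rstrip("\n") for line in lines if line.strip()]
    # Step in fixed groups of four: a quality line may itself begin with '@'.
    for i in range(0, len(lines), 4):
        header, seq, _plus, qual = lines[i:i + 4]
        yield header[1:], seq, [ord(c) - 33 for c in qual]

def error_probability(q):
    """Probability that a base call with Phred quality q is wrong."""
    return 10 ** (-q / 10)

record = ["@read1", "ACGT", "+", "II5#"]  # hypothetical record
for name, seq, quals in parse_fastq(record):
    print(name, seq, quals)
    print([round(error_probability(q), 4) for q in quals])
```

A Phred quality of 20 corresponds to a 1 in 100 chance that the base is wrong, and 40 to 1 in 10,000, which is what "lower-quality" and "high-quality" bases mean in practice.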

Plan for the practical
Inspect some mapped read data in the FUT2 gene for a single individual using samtools.
Call variation across a set of individuals using samtools and bcftools.
Use this data to understand LD between these variants and our GWAS ‘hit’.
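
For the last step, one common measure of LD between two biallelic variants is r². The sketch below computes it from phased haplotypes coded 0/1. The haplotype vectors are invented for illustration, and the phasing is an assumption: genotype calls straight from a variant caller are typically unphased, in which case r² is usually estimated from genotype dosages or after statistical phasing.

```python
def r_squared(hap_a, hap_b):
    """r^2 between two biallelic variants from phased haplotypes coded 0/1."""
    n = len(hap_a)
    p_a = sum(hap_a) / n  # frequency of allele 1 at variant A
    p_b = sum(hap_b) / n  # frequency of allele 1 at variant B
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b  # the LD coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both variants must be polymorphic")
    return d * d / denom

gwas_hit = [0, 0, 1, 1, 0, 1, 0, 1]  # hypothetical haplotypes at the GWAS 'hit'
variant_x = [0, 0, 1, 1, 0, 1, 0, 1]  # always travels with the hit
variant_y = [0, 1, 0, 1, 0, 1, 0, 1]  # only weakly correlated with the hit

print("r2(hit, x) =", r_squared(gwas_hit, variant_x))
print("r2(hit, y) =", r_squared(gwas_hit, variant_y))
```

Variants in high r² with the hit are candidates for explaining the association signal, which is the motivation for fine-mapping with sequence data.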