DTL Focus meeting: Using GRCh38 in NGS data analysis


Time slot / Speaker / Subject
12:45-13:00  Coffee/tea
13:00-13:20  Ies Nijman (UMCU): Welcome & Introduction to GRCh38 (hg20)
13:20-13:40  Pieter Neerinx (UMCG): Migration of tools, pipelines to support GRCh38
13:40-14:00  Pjotr Prins: BWA handling of ALT contigs
14:00-14:10  Tea break
14:10-14:30  Zuotian Tatum (LUMC): New insights on Differential Gene Expression using GRCh38
14:30-14:50  Wibowo Arindrarto (LUMC): Comparison of hg19 and GRCh38 in the study of DUX4 gene
14:50-15:30  Ies Nijman (UMCU): Wrap-up and open discussions

GRCh38 / hg20

Human genome build hg20
Basic new assembly released Dec 24th, 2013; now GRCh38.p2 (Dec 8th, 2014)
5-7 megabases of added sequence to primary reference
Many corrected regions (patches) to hg
alternative loci: chromosomal regions with high variability (~66 Mb)
128 large unplaced sequence regions
Human_herpes_virus (EBV) mapping decoy (171 kb)
Centromere sequences: gaps are replaced by sequence models of the centromere repeats
New mitochondrial sequence: Revised Cambridge Reference Sequence (rCRS) from MITOMAP
4 PAR regions
This means that coordinates change! Lift-over strategies will not completely solve it.

Human genome build hg20

New genebuild now available ( coding genes; in alternative loci)
Only a few calling/annotation tools support hg20 yet (VEP, for instance)
Ensembl default genome is hg20!!
The latest hg19 site is being maintained through an archive link.
dbSNP locations available for hg
G data will be remapped and recalled (est. Q1/Q2 2015)

Human genome build hg20: challenges and opportunities
How to use these alternative loci? In hg19 only a few were present, and they were mostly blissfully ignored.
Challenge I: mapping strategy and tools need to be changed
  In prep: iBWA, srprism
  BWA (29 Dec 2014) supports ALTs in a two-step approach
Challenge II: variant callers need to be aware of alternative references (and context)
Challenge III: how to display this data in genome browsers etc., while maintaining context?
Challenge IV: nomenclature
The primary assembly contains all patches and fixes to hg19 and is still a good starting point.

What are these ALT loci?
Scaffolds that provide an alternate representation of a locus found in the primary reference.
Long regions with clustered variations (e.g. LRC/KIR on chr19 and the MHC/HLA loci on chr6)
Besides different haplotype variants of genes, they also contain genes not in the primary assembly (20 protein-coding, ~40 predicted protein-coding, pseudogenes, lincRNAs)
Mind the ALTernative approaches of NCBI and Ensembl: NCBI uses primary chromosomes plus ALT loci, while Ensembl builds a completely new ALT chromosome (so including identical sequence)

Usage scenarios
I: use primary reference (toplevel chrs)
II: use primary reference + mapping decoys (Un + EBV)
  Improves mapping accuracy
  Only feed primary reference to variant caller
III: use primary reference + ALT loci + mapping decoys (Un + EBV)
  Improves mapping accuracy (?)
  A: Only feed primary reference to variant caller
  B: Run variant caller on all loci…
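The three scenarios come down to choosing which contigs go to the mapper and which go to the variant caller. Below is a minimal Python sketch of that choice. It assumes UCSC-style contig names as used in the GRCh38 analysis sets (suffixes `_alt`, `_random`, `_decoy`, prefix `chrUn_`, and `chrEBV`). The function names, the class labels, and the decision to group unlocalized contigs with the mapping decoys are illustrative assumptions, not part of the talk.

```python
import re

# Assumption: contig names follow the UCSC-style convention of the GRCh38
# analysis sets (e.g. chr19_KI270938v1_alt, chrUn_..., chrEBV).
PRIMARY_RE = re.compile(r"chr(?:[1-9]|1[0-9]|2[0-2]|X|Y|M)")


def classify_contig(name: str) -> str:
    """Return a coarse sequence class for one contig name."""
    if name == "chrEBV" or name.endswith("_decoy"):
        return "decoy"
    if name.endswith("_alt"):
        return "alt"
    if name.endswith("_random"):
        return "unlocalized"
    if name.startswith("chrUn_"):
        return "unplaced"
    if PRIMARY_RE.fullmatch(name):
        return "primary"
    return "other"


# Hypothetical mapping of the slide's scenarios to sequence classes.
# Assumption: "mapping decoys (Un + EBV)" also covers unlocalized contigs.
SCENARIOS = {
    "I": {"primary"},
    "II": {"primary", "unlocalized", "unplaced", "decoy"},
    "III": {"primary", "unlocalized", "unplaced", "decoy", "alt"},
}


def mapping_contigs(names, scenario):
    """Contigs to include in the mapping reference for a scenario."""
    wanted = SCENARIOS[scenario]
    return [n for n in names if classify_contig(n) in wanted]


def calling_contigs(names):
    """Contigs passed to the variant caller in scenarios II and III-A."""
    return [n for n in names if classify_contig(n) == "primary"]


if __name__ == "__main__":
    # Illustrative contig names only, not a complete reference.
    example = ["chr1", "chr19_KI270938v1_alt", "chrUn_KI270302v1", "chrEBV"]
    for scenario in ("I", "II", "III"):
        print(scenario, mapping_contigs(example, scenario))
    print("call on:", calling_contigs(example))
```

Scenario III-B would call on everything returned by `mapping_contigs(names, "III")` rather than on `calling_contigs(names)`.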

Adding the mapping decoys
[Table: total bp per sequence class (Primary, Unlocalized, Unplaced, ALT, decoy, Total) for Grch38_full_plus_analysisset and Grch38_full_analysisset]
Graphs based on 11 Xten WGS samples

Personalis, Inc. | Confidential and Proprietary
Improved alignments outside of fix patch regions (Jason Harris)
[Figure: regions outside of fix patches, panels labeled hs37d5 and GRCh37.p13]

Heng Li: BWA approach to ALT mapping
ALTs supported in >v through additional ID-list file $ref.alt
Advised to use the NCBI NGS analysis sets (3 flavors), with slightly modified sequences to facilitate mapping (hardmasked PAR and centromeric regions)
1. The original mapQ of a non-ALT hit is computed across non-ALT hits only. The reported mapQ of an ALT hit is always computed across all hits.
2. An ALT hit is only reported if its score is better than all overlapping non-ALT hits. A reported ALT hit is flagged with 0x800 (supplementary) unless there are no non-ALT hits.
3. The mapQ of a non-ALT hit is reduced to zero if its score is less than 80% (controlled by option -g) of the score of an overlapping ALT hit. In this case, the original mapQ is moved to the om tag.
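Rules 2 and 3 can be modeled in a few lines of Python. This is a simplified sketch and not BWA's implementation. The `Hit` class and the `apply_alt_rules` function are hypothetical, "overlapping" is assumed to mean overlap on the read (query) coordinates, and the mapQ values of rule 1 are taken as given inputs rather than computed.

```python
from dataclasses import dataclass
from typing import List, Optional

SUPPLEMENTARY = 0x800  # SAM flag bit for a supplementary alignment


@dataclass
class Hit:
    """Hypothetical, simplified alignment hit (not a BWA data structure)."""
    contig: str
    start: int          # start on the read, 0-based
    end: int            # end on the read, exclusive
    score: int          # alignment score
    is_alt: bool        # True if the contig is listed in $ref.alt
    mapq: int           # rule 1 is assumed to be applied already
    flag: int = 0
    om: Optional[int] = None  # original mapQ, set only when rule 3 fires


def overlaps(a: Hit, b: Hit) -> bool:
    """Assumption: two hits overlap if they share read coordinates."""
    return a.start < b.end and b.start < a.end


def apply_alt_rules(hits: List[Hit], drop_ratio: float = 0.8) -> List[Hit]:
    """Apply rules 2 and 3 from the slide; returns the reported hits."""
    non_alt = [h for h in hits if not h.is_alt]
    alt = [h for h in hits if h.is_alt]
    reported = list(non_alt)

    # Rule 2: report an ALT hit only if it beats every overlapping
    # non-ALT hit; flag it supplementary unless no non-ALT hit exists.
    for a in alt:
        rivals = [h for h in non_alt if overlaps(a, h)]
        if all(a.score > h.score for h in rivals):
            if non_alt:
                a.flag |= SUPPLEMENTARY
            reported.append(a)

    # Rule 3: zero the mapQ of a non-ALT hit scoring below drop_ratio
    # times an overlapping ALT hit, keeping the original value in om.
    for h in non_alt:
        if any(overlaps(h, a) and h.score < drop_ratio * a.score for a in alt):
            h.om = h.mapq
            h.mapq = 0

    return reported
```

Here `drop_ratio` stands in for the 80% threshold that the slide attributes to option -g.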

Heng Li: BWA approach

Variant calling on ALTs?

By adding the ALT loci in mapping and calling we gain better haplotype-aware mappings/calls, but this is not clearly reflected in the VCF.
Adding 'haplotyping' to the VCF format (A. Quinlan, Virginia, GRC WS 2014)

Variant Annotation on HG20 / ALTs
Ensembl VEP
SnpEff
dbNSFP in next release (~May)

Nomenclature
chr19_KI270938v1_alt
CHR_HSCHR19KIR_G248_BA2_HAP_CTG3_1
GenBank: KI
RefSeq: NT_
hg38 / GRCh38 not hg20 please…

"Everything is in a state of flux, including the status quo." (Robert Byrne)
Even 1.5 years after the release, many things about the use of the full build remain uncertain.
GATK is remarkably silent.
Ewan Birney and Richard Durbin agreed on March 24th to build a new reference/analysis set with a more standardized set of chromosomes, ALTs and decoys (pers. comm.).
Heng Li: "The current BWA-MEM method is just a start. [...] We may make changes. It is also possible that we might make breakthrough on the representation of multiple genomes, in which case, we can even get rid of ALT contigs for good."