GRC Workshop ASHG 22 Oct 2013. Outline Reference Assembly Basics GRC: Assembly management and dataflow GRCh38 Accessing the assembly and data

Presentation transcript:

GRC Workshop ASHG 22 Oct 2013

Outline: Reference Assembly Basics; GRC: Assembly management and dataflow; GRCh38; Accessing the assembly and data

Reference Assembly Basics: What is the Reference Assembly?

An assembly is a MODEL of the genome

Lander and Waterman (1988) Genomics
Assumptions: reads are randomly distributed; overlap between reads does not vary.
Variables:
G = haploid genome length in bp
L = sequence read length in bp
N = number of reads sequenced
T = amount of overlap needed for detection in bp
C = coverage (C = LN/G)
Poisson distribution: $P(Y=y) = \frac{\lambda^{y} e^{-\lambda}}{y!}$, where y = number of events in an interval and $\lambda$ = mean number of events in an interval. For sequence calculations, coverage (C) can be viewed as $\lambda$.
Using this equation, you can calculate the probability that a base has been sequenced y times. By manipulating this formula, you can estimate the number of gaps for any given level of coverage.

Coverage    Not sequenced    Sequenced
1X          37%              63%
5X          0.6%             99.4%
10X         0.005%           99.995%
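The percentages on this slide follow from the Poisson model on the previous one: with coverage C as the mean, the chance that a base is hit by zero reads is e^(-C). The sketch below is illustrative only. The function names are made up for this example, and the contig estimate uses the standard Lander-Waterman result N·e^(-C(1 - T/L)), which the slide alludes to but does not write out.

```python
import math

def poisson_pmf(y: int, lam: float) -> float:
    """P(Y = y) for a Poisson distribution with mean lam."""
    return lam ** y * math.exp(-lam) / math.factorial(y)

def fraction_not_sequenced(coverage: float) -> float:
    """Probability that a base is covered by zero reads: P(Y = 0) = e^-C."""
    return poisson_pmf(0, coverage)

def expected_contigs(n_reads: int, read_len: int, genome_len: int, min_overlap: int) -> float:
    """Lander-Waterman expected number of contigs: N * exp(-C * (1 - T/L))."""
    coverage = read_len * n_reads / genome_len   # C = LN/G
    sigma = 1 - min_overlap / read_len           # fraction of a read usable for detecting overlap
    return n_reads * math.exp(-coverage * sigma)

# Fraction of bases missed and covered at the coverage levels shown on the slide.
for c in (1, 5, 10):
    missed = fraction_not_sequenced(c)
    print(f"{c}X: not sequenced {missed:.3%}, sequenced {1 - missed:.3%}")
```

The number of gaps between contigs is roughly the number of contigs minus one, so `expected_contigs` gives a quick estimate of how many gaps remain at a given depth.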

2009 Sanger cost: shotgun sequence ~$0.01/base; finished sequence ~$0.03/base. This clone: Shotgun = $1500, Finish = $3000.

Captured gap = no sequence, but a sub-clone spans the gap. Uncaptured gap = no sequence, no sub-clone spanning the gap. (Bob Blakesley, NISC)

Biology:
Repetitive sequence (interspersed repeats, segmental duplications)
Variation (regions of high diversity, structural variation)
Kidd et al., 2008

Eugene Yaschenko, NCBI

[Figure: enrichment (observed vs. expected) of human PANTHER classifications (biological process). Evan Eichler, University of Washington]

Technology:
Read length: long reads vs. short reads
Mate lengths: distribution of insert sizes
Read accuracy: error model for your technology
Read depth: coverage at each base
Genome distribution: reads covering entire genome equally
Ajay et al., 2011

Genome Research, May, 1997

WGS: Sanger Reads
Restrict and make libraries: 2, 4, 8, 10, 40, 150 kb.
End-sequence all clones and retain pairing information ("mate-pairs"). Each end sequence is referred to as a read.
Find sequence overlaps.
Figure labels: WGS contig; tails; scaffold

Genome Vocabulary
Contig: a sequence constructed from smaller, overlapping sequences, which contains no gaps.
Scaffold: a sequence constructed from smaller sequences, which may contain gaps.
Typically built from reads, but also from sequences in GenBank/EMBL/DDBJ
Typically built from sequences in GenBank/EMBL/DDBJ

Schatz et al., 2010

A T T T T C C C T T C T G A A A T G A T G A A A G A G T C

Clone based assemblies
BAC insert; BAC vector
Shotgun sequence; Assemble; Fold sequence
Gaps: deeper sequence coverage rarely resolves all gaps
GAPS: "finishers" go in to manually fill the gaps, often by PCR

Ideally… A B C D E F G H I J K L M N O A B C D F G H K L O N Non-sequence based Map (flip) A B C D F G H K L O N

More like… A B C D E F G H I J K L M N O A B C Z Y X W H J M V N O A B H I J C D Y L M N O A B H I J L M N O ?

Sequence vs. Non-sequence based maps Mmu7 WI Genetic WI/MRC RH

Human assemblies available in the NCBI assembly database

N50: Measure of continuity. Half of the bases in the assembly are in contigs of this length or greater.

Fragmented genomes tend to have more partial models. Fragmented genomes have fewer frameshifts. (Alexander Souvorov, NCBI)
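As a rough illustration of the N50 statistic, the sketch below uses the usual base-weighted definition: sort the contigs from longest to shortest and report the length at which the running total first reaches half of all assembled bases. The function name and the example lengths are invented for this illustration.

```python
def n50(lengths):
    """Return the largest length L such that sequences of length >= L
    together contain at least half of the total assembled bases."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length

# Hypothetical contig lengths: total = 290, so half = 145.
# 80 + 70 = 150 is the first running total to reach 145.
example_lengths = [80, 70, 50, 40, 30, 20]
print(n50(example_lengths))
```

The same function works for scaffold N50 if scaffold lengths are passed in instead of contig lengths.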

Outline: Reference Assembly Basics; GRC: Assembly management and dataflow; GRCh38; Accessing the assembly and data

Figure labels: Distributed data; Genome not in INSDC Database; Old Assembly Model; Human Genome Project (HGP)

GRC Assembly Management

Figure labels: Distributed data; Genome not in INSDC Database; Old Assembly Model; Centralized Data

Issue tracking system (based on JIRA)

5 July 2011

Tiling Path File (TPF)
ACCESSION   NAME          CONTIG
GAP         Telomere      10000
AP006221    XX-190A2      Hschr1_ctg1
AL627309    RP11-34P13    Hschr1_ctg1
GAP         type-3
AC114498    RP5-857K21    Hschr1_ctg3
AL669831    RP11-206L10   Hschr1_ctg3
AL645608    RP11-54O7     Hschr1_ctg3
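To make the tiling path idea concrete, here is a minimal sketch that reads rows like the ones on this slide and groups clone accessions by the contig they belong to. It assumes whitespace-separated columns in the order ACCESSION, NAME, CONTIG and treats any row starting with GAP as a gap record. The function name `parse_tpf` is hypothetical, and real TPF files may carry additional header lines or columns that this sketch ignores.

```python
# Rows copied from the slide; the column layout is an assumption for this sketch.
TPF_TEXT = """\
GAP Telomere 10000
AP006221 XX-190A2 Hschr1_ctg1
AL627309 RP11-34P13 Hschr1_ctg1
GAP type-3
AC114498 RP5-857K21 Hschr1_ctg3
AL669831 RP11-206L10 Hschr1_ctg3
AL645608 RP11-54O7 Hschr1_ctg3
"""

def parse_tpf(text):
    """Return (contigs, gaps): contigs maps contig name to an ordered list of
    (accession, clone_name) tuples; gaps is a list of the GAP rows' remaining fields."""
    contigs = {}
    gaps = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if fields[0] == "GAP":
            gaps.append(tuple(fields[1:]))
        else:
            accession, clone_name, contig = fields[:3]
            contigs.setdefault(contig, []).append((accession, clone_name))
    return contigs, gaps

contigs, gaps = parse_tpf(TPF_TEXT)
for name, clones in contigs.items():
    print(name, [accession for accession, _ in clones])
```

Because Python dictionaries keep insertion order, the contigs and the clones within each contig come out in tiling-path order.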

Full dovetail; Half-dovetail; Contained; Short/Blunt

Build sequence contigs based on contigs defined in TPF (Tiling Path File):
Check for orientation consistencies
Select switch points
Instantiate sequence for further analysis
Figure labels: Switch point; Representative chromosome sequence

HschrX_ctg13 HschrX_ctg14

AGP: A Golden Path
Provides instructions for building a sequence
Defines component sequences used to build scaffolds/chromosomes
Switch points
Defines gaps and types
GRC produces: AGP, FASTA
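The sketch below shows how an AGP file can act as build instructions: each row either copies a slice of a component sequence (optionally reverse-complemented) or inserts a run of Ns for a gap. It assumes the standard tab-delimited AGP column order (object, object start, object end, part number, component type, then component ID/start/end/orientation, or gap length/type/linkage for gap rows). The component names, sequences, and the function name `build_from_agp` are invented for illustration, and this is not the GRC's own pipeline.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string; characters other than ACGT are left unchanged."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]

def build_from_agp(agp_lines, components):
    """Instantiate an object sequence from AGP rows and a dict of component sequences."""
    pieces = []
    for line in agp_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        cols = line.rstrip("\n").split("\t")
        comp_type = cols[4]
        if comp_type in ("N", "U"):
            # Gap row: column 6 is the gap length.
            pieces.append("N" * int(cols[5]))
        else:
            comp_id, beg, end, orient = cols[5], int(cols[6]), int(cols[7]), cols[8]
            piece = components[comp_id][beg - 1:end]  # AGP coordinates are 1-based, inclusive
            if orient == "-":
                piece = reverse_complement(piece)
            pieces.append(piece)
    return "".join(pieces)

# Toy data (hypothetical): two clones joined across a 4 bp gap.
components = {"cloneA.1": "ACGTACGTAA", "cloneB.1": "GGGTTTCCCA"}
agp = [
    "chrT\t1\t6\t1\tF\tcloneA.1\t1\t6\t+",
    "chrT\t7\t10\t2\tN\t4\tcontig\tno\tna",
    "chrT\t11\t15\t3\tF\tcloneB.1\t3\t7\t-",
]
print(build_from_agp(agp, components))
```

In this picture a switch point is simply where one row's component end hands over to the next row's component start, so moving a switch point means editing those two coordinates.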

Figure labels: Distributed data; Old Assembly Model; Centralized Data; Updated Assembly Model; Genome not in INSDC Database

Sequences from haplotype 1; sequences from haplotype 2
Old Assembly model: compress into a consensus
New Assembly model: represent both haplotypes

Assembly (e.g. GRCh37): Primary Assembly; Non-nuclear assembly unit (e.g. MT); ALT 1 through ALT 9; PAR; Genomic Region (MHC); Genomic Region (UGT2B17); Genomic Region (MAPT)

UGT2B17 Region (Xue Y et al., 2008)
NCBI36: NC_ (chr4) tiling path, with TMPRSS11E and TMPRSS11E2
GRCh37: NC_ (chr4) tiling path, with TMPRSS11E
GRCh37: NT_ (UGT2B17 alternate locus), with TMPRSS11E2

GRCh37 (hg19)
7 alternate haplotypes at the MHC
Alternate loci released as: FASTA; AGP; Alignment to chromosome
Regions: UGT2B17; MHC; MAPT

Assembly (e.g. GRCh37.p13): Primary Assembly; Non-nuclear assembly unit (e.g. MT); ALT 1 through ALT 9; PAR; Genomic Region (MHC); Genomic Region (UGT2B17); Genomic Region (MAPT); Patches: Genomic Region (ABO); Genomic Region (SMA); Genomic Region (PECAM1); …

GRCh37.p Regions: 3.15% of chromosome sequence
131 FIX patches: add 6.8 Mb novel sequence
73 NOVEL patches: add >800kb novel sequence

MHC (chr6): Chr 6 representation (PGF); Alt_Ref_Locus_2 (COX)

17q deletion: H1, H2 (Zody et al., 2008)

Figure labels: chromosome; alt/patch; reads; on-target alignment; off-target alignments (n=122,922)

Masks and alt-aware aligners reduce the incidence of ambiguous alignments observed when aligning reads to the full assembly.
Mask1: mask chr for fix patches, scaffold for novel/alts.
Mask2: mask only on scaffolds.

Figure labels: Distributed data; Genome not in INSDC Database; Old Assembly Model; Centralized Data; Updated Assembly Model; Genome in INSDC Database; Genome not in INSDC Database

Outline: Reference Assembly Basics; GRC: Assembly management and dataflow; GRCh38; Accessing the assembly and data

GRCh38 Impact GRCh38

GRCh37 Scaff N50: 44,983,201
GRCh37B Scaff N50: 62,124,159
GRCh37 Contig N50: 38,440,852
GRCh37B Contig N50: 49,319,739

Major Features of GRCh38
Modeled centromeres
Individual base updates
Fixed tiling path/assembly errors
Addition of novel sequence

CENTROMERES

61-mer analysis set; kG high-confidence set
Mismatches, MAF = 0: n=15,244
Insertions, MAF = 0: n=834
Deletions, MAF = 0: n=1541
Mismatch in pseudo/pr txpt, MAF < 5%: n=1413
Annotator and clinical requests: n= ~260

Mismatches (n=15,244): Intergenic; Intronic; Upstream; Downstream; Essential splice site: 4; Non-syn coding (delet): 6

Pile-Up Analysis: "Never Seen" Mismatched Bases Originating from RP11 Components. 79% of these bases are heterozygous in RP11 WGS.

GRCh37 Insertions Originating from RP11; GRCh37 Deletions Originating from RP11. 17% heterozygous in RP11 WGS; 18% heterozygous in RP11 WGS.

1q32, 1q21, 1p21. 1p21 patch alignment to chromosome 1. Dennis et al., 2012

HYDIN: chr16 (16q22.2)
HYDIN2: chr1 (1q21.1)
Missing in NCBI35/NCBI36; Unlocalized in GRCh37; Finished in GRCh38
Alignment of HYDIN2 Genomic, 300 Kb, 99.4% ID
Alignment of HYDIN CHM1_1.0, >99.9% ID
Doggett et al., 2006

Other Major Tiling Path Updates
Single CHM1 haplotype paths for:
1p12, 1q21, 1q32: SRGAP2
IGH
LRC/KIR
CCL3L1 (17q21)
OM-guided 10q11
Chr. 9 peri-centromeric inversion

NOVEL GENES! GRCh37.p13: 211 genes found only on alt loci and patches

Sudmant et al., 2010

Genovese et al., 2013

1000G decoy sequence, viewed by: GenBank alignment; Percent Repeat Masked; Repeat Mask type; Sequence Source (HTG, HuRef, ALLPATHS)
In a preliminary analysis, 90% of NA12878 reads that previously aligned uniquely to the decoy sequence had an alignment to the updated assembly.

Where is the decoy sequence in GRCh38?
Alt loci (low repeat content)
Model centromeres (high repeat content)
Unlocalized/Unplaced Scaffolds
Chromosomes

Outline: Reference Assembly Basics; GRC: Assembly management and dataflow; GRCh38; Accessing the assembly and data

Accessing the Data

NCBI Genes, Ensembl Genes, Annotated Clone Problems, Segmental Duplications

GRCh38 in Ensembl: GRCh38 will be incorporated into the existing Ensembl interface. Features such as genes, variation, and regulation will be remade or remapped onto the new genome. Nearly 500 tracks are available. GENCODE gene set.

Alternate sequences in Ensembl
Haplotypes and patches on the chromosome.
A fix patch around the ABO gene.
Use the Region comparison view to see the difference between the patch and primary assembly.
The GRC alignment track indicates edits.

View your data on the Genome
Zoomed in; zoomed out.
Follow the link from the homepage.
Red bases show mismatches.

Transition to GRCh38 in Ensembl
INSDC coordinates identify the assembly as well as the position.
Convert coordinates between assemblies.
Our blog series details our progress with GRCh38: Ensembl.info

Remap Set up slide

1000 Genomes Browser: GeT-RM Browser: Variation Viewer: (coming Fall 2013!)

Tiling Path; Sequence Bar; Segmental Duplications, Eichler Lab; 1000 Genomes strict accessibility mask; Annotated clone assembly problems

dbSNP Build 138 based on annotation run 104; Model based paralogous sequence differences, NCBI annotation run #; Paralogous/pseudo gene alignments, NCBI annotation run #; Single Unique Nucleotide (SUN) map, Sudmant 2010; ClinVar Long Variations; GRC Curation Issues; ClinVar Short Variations
