Class meetings: TR 3:30-4:50 MCGIL 2315


CSE291: Personal genomics for bioinformaticians
Class meetings: TR 3:30-4:50 MCGIL 2315
Office hours: M 3:00-5:00, W 4:00-5:00 CSE 4216
Contact: mgymrek@ucsd.edu

Today’s schedule:
3:30-4:10 Prioritization and Filtering
4:10-4:15 Break
4:15-4:50 Time to work on PS5 (+ command line tips and info for final presentations)

Announcements:
- PS5 due Thursday
- Reading posted for Thursday

Prioritizing and Filtering variants CSE291: Personal Genomics for Bioinformaticians 03/07/17

Outline
- The challenge: needles in haystacks
- Annotation annotation annotation
- Family information
- Prior knowledge of gene function
- Selection
- Remaining challenges

The challenge: needles in haystacks

The challenge: sifting through piles of variants … From the exome alone, THOUSANDS of candidate variants

The challenge: sifting through piles of variants
Example: GAAAATATCATGTGGTGTTTCC vs. GAAAATATCATATGGTGTTTCC
1. Have I seen this variant before?
2. Is this variant in affected family members?
3. Does this mutation affect gene function?
4. Is this position conserved across species?
5. Does the expression pattern of this gene make sense?
6. Is the gene likely relevant to this disease?
Common approach: progressively apply filters from highest to lowest confidence
Caveat: some truly pathogenic variants will fail these filters!
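The progressive filtering approach can be sketched in a few lines of Python. This is only an illustration: the variant fields (`pop_af`, `in_affected_relatives`, `impact`), the 1% frequency cutoff, and the example records are assumptions made up for this sketch, not values from the lecture.

```python
from typing import Callable, Dict, List, Tuple

Variant = Dict[str, object]
Filter = Tuple[str, Callable[[Variant], bool]]


def apply_filters(variants: List[Variant], filters: List[Filter]):
    """Apply filters in the given order; return survivors and per-step counts."""
    remaining = list(variants)
    counts = [("start", len(remaining))]
    for name, keep in filters:
        remaining = [v for v in remaining if keep(v)]
        counts.append((name, len(remaining)))
    return remaining, counts


# Hypothetical annotated variants (all field names and values are invented).
variants = [
    {"id": "var1", "pop_af": 0.20, "in_affected_relatives": True, "impact": "missense"},
    {"id": "var2", "pop_af": 0.0001, "in_affected_relatives": True, "impact": "nonsense"},
    {"id": "var3", "pop_af": 0.0, "in_affected_relatives": False, "impact": "frameshift"},
    {"id": "var4", "pop_af": 0.0002, "in_affected_relatives": True, "impact": "synonymous"},
]

# Filters ordered from (assumed) highest to lowest confidence.
filters = [
    ("rare in population", lambda v: v["pop_af"] < 0.01),
    ("present in affected relatives", lambda v: v["in_affected_relatives"]),
    ("predicted to alter protein",
     lambda v: v["impact"] in {"nonsense", "frameshift", "missense", "splice"}),
]

survivors, counts = apply_filters(variants, filters)
for name, n in counts:
    print(f"{name}: {n} variants remain")
```

Reporting the count after each step makes it easy to see which filter removes the most candidates, and therefore where a truly pathogenic variant is most likely to have been lost.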

Annotation annotation annotation

Annotating impact of mutations on genes

Loss of function variants
LOFTEE (https://github.com/konradjk/loftee): VEP plug-in to annotate LoF
Assesses suspected LoF variants:
- Stop gain (nonsense)
- Splice site disrupting
- Frameshift variants
KEY: filters to identify true vs. false positive LoF annotations, e.g.:
- Nonsense variants in the last 5% of the gene are unlikely to be that damaging (why?)
- Nonsense variants in an exon without canonical splice sites around it are likely false positives (why?)
- Splice sites in very small introns (e.g. <15bp) are likely not that critical
- If the LoF allele matches the ancestral allele, it is likely not really LoF (why?)
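The filter ideas listed above can be written down as simple rules. The sketch below is not LOFTEE's code or output format: the dictionary keys (`consequence`, `cds_position`, `cds_length`, and so on) are hypothetical names chosen for illustration, and only the thresholds (last 5%, <15bp) come from the slide.

```python
def lof_flags(variant):
    """Return reasons to doubt a putative LoF call (empty list = no flags)."""
    flags = []
    if variant["consequence"] == "stop_gained":
        # Fraction of the coding sequence that precedes the premature stop.
        if variant["cds_position"] / variant["cds_length"] > 0.95:
            flags.append("nonsense in last 5% of gene")
        if not variant["exon_has_canonical_splice_sites"]:
            flags.append("exon lacks canonical splice sites")
    if variant["consequence"] == "splice_site" and variant["intron_length"] < 15:
        flags.append("splice site in very small intron")
    # If the "LoF" allele is the ancestral state, the reference is the derived one.
    if variant["alt_allele"] == variant.get("ancestral_allele"):
        flags.append("LoF allele matches ancestral allele")
    return flags


late_stop = {"consequence": "stop_gained", "cds_position": 980, "cds_length": 1000,
             "exon_has_canonical_splice_sites": True,
             "alt_allele": "T", "ancestral_allele": "C"}
tiny_intron = {"consequence": "splice_site", "intron_length": 12,
               "alt_allele": "A", "ancestral_allele": "A"}
early_stop = {"consequence": "stop_gained", "cds_position": 100, "cds_length": 1000,
              "exon_has_canonical_splice_sites": True,
              "alt_allele": "T", "ancestral_allele": "C"}

for v in (late_stop, tiny_intron, early_stop):
    print(lof_flags(v) or "no flags: keep as candidate LoF")
```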

Are missense variants important?
PolyPhen-2: predict impact of an amino acid substitution on protein structure and function
- 8 sequence based features, 3 structure based features
- Classify variants as: probably damaging, possibly damaging, benign
See also: SIFT, MutationTaster, SNAP, and more.
Most people use multiple methods and, e.g., require more than one method to call the mutation damaging
Adzhubei et al. Nature Methods 2010
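Requiring agreement between methods amounts to a simple vote. In the sketch below, the function name, the `min_votes` default, and the choice to count "possibly damaging" as a damaging vote are all assumptions. The label strings are illustrative and would need to match whatever each tool actually reports in your annotation files.

```python
# Labels treated as a "damaging" vote (an assumption for this sketch).
DAMAGING_LABELS = {"probably damaging", "possibly damaging",
                   "deleterious", "disease causing"}


def consensus_damaging(predictions, min_votes=2):
    """True if at least `min_votes` predictors call the variant damaging.

    `predictions` maps a predictor name to its categorical call.
    """
    votes = sum(1 for label in predictions.values()
                if label.lower() in DAMAGING_LABELS)
    return votes >= min_votes


agree = {"PolyPhen2": "probably damaging", "SIFT": "deleterious",
         "MutationTaster": "polymorphism"}
disagree = {"PolyPhen2": "benign", "SIFT": "deleterious"}

print(consensus_damaging(agree))     # two of three methods agree
print(consensus_damaging(disagree))  # only one method calls it damaging
```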

Ensemble annotations
Ensemble: combination of different methods
Idea: there are lots of annotations out there, and some combination of them is probably important. Let’s combine them into a single classifier
CADD: Combined Annotation-Dependent Depletion. Kircher, et al. Nature Genetics 2014 (Shendure Lab)
Features (63 annotations total):
- VEP annotations (e.g. nonsense, missense)
- SIFT, PolyPhen2
- Mappability
- Conservation (PhastCons, PhyloP, GERP)
- Segmental duplications
- Expression
- Histone modifications
SVM classifier
Train on simulated data to distinguish: observed (likely benign) vs. simulated de novos (likely pathogenic)

Family information

Pedigrees help narrow down the disease location
Thousands of candidate variants -> dozens of candidate variants

Pedigrees help narrow down the disease location
Autosomal recessive:
- Causal variant homozygous in affecteds, missing or heterozygous in unaffecteds
- Affected siblings almost always share the region IBD=2
Autosomal dominant:
- Causal variant het (or hom) in affecteds, missing in unaffecteds
- Affected siblings likely share the region IBD=1, both inherited from affected parent
De novo:
- Mutation not present in parents or affected siblings
Example of different types: dominant (heterozygous), de novo
Bigger pedigree=better. Why?
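The genotype patterns above translate directly into checks on a pedigree. The sketch below encodes each genotype as the number of alternate alleles (0, 1, or 2). It assumes full penetrance and error-free genotypes, which real data rarely satisfy. The function names and the example family are hypothetical.

```python
def fits_recessive(genotypes, affected):
    """Hom-alt in every affected individual, never hom-alt in unaffected ones."""
    return all((genotypes[p] == 2) if affected[p] else (genotypes[p] < 2)
               for p in genotypes)


def fits_dominant(genotypes, affected):
    """Het or hom-alt in every affected individual, absent from unaffected ones."""
    return all((genotypes[p] >= 1) if affected[p] else (genotypes[p] == 0)
               for p in genotypes)


def is_de_novo(genotypes, child, father, mother):
    """Variant carried by the child but by neither parent."""
    return genotypes[child] >= 1 and genotypes[father] == 0 and genotypes[mother] == 0


# Hypothetical family: unaffected parents, two affected children.
affected = {"father": False, "mother": False, "child1": True, "child2": True}

variant_a = {"father": 1, "mother": 1, "child1": 2, "child2": 2}
variant_b = {"father": 0, "mother": 0, "child1": 1, "child2": 0}

print("A recessive:", fits_recessive(variant_a, affected))
print("A dominant:", fits_dominant(variant_a, affected))
print("B de novo in child1:", is_de_novo(variant_b, "child1", "father", "mother"))
```

Each additional genotyped relative adds one more constraint that a candidate variant must satisfy, which is one way to see why a bigger pedigree narrows the list faster.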

Prior knowledge of gene function

Databases of clinical consequences of variants
Has my candidate gene been previously implicated in a human disease?
If yes, is it related to the current disease I’m trying to solve?

Gene ontologies
Does the annotated function of my gene make sense?
e.g. for Marfan Syndrome, FBN1:
- Biological process: skeletal/heart/kidney development
- Cellular component: basement membrane, extracellular, microfibril
- Molecular function: calcium ion binding, structural, protein binding
http://waclawikgen677s10.weebly.com/gene-ontology.html

Incorporating gene expression data - tissue
Is this gene expressed in tissues that make sense for this disease?
e.g. if disease is primarily liver related, the causal gene is probably expressed in liver
We now have resources (e.g. GTEx) reporting expression across tissues
CFTR expression by tissue: http://www.gtexportal.org/home/gene/CFTR

Incorporating gene expression data from patients
Identified novel exon in COL6A1 formed by dominantly acting splice gain event, causes external collagen-VI-like dystrophy
Overall, diagnosis rate of 35% in patients with undiagnosed neuromuscular diseases
Can incorporate RNA-seq from family members to leverage traditional pedigree approaches
Cummings et al. 2016

Selection

Types of selection
Positive selection: a new mutation confers a selective advantage and rises in frequency quickly, OR a new environmental factor makes an existing mutation suddenly more advantageous.
- Examples: LCT (lactase persistence), EDAR
- Tests: long haplotypes, high derived allele frequency
Purifying selection: mutations in critical regions of the genome are often deleterious and quickly eliminated
- Examples: protein coding sequence vs. introns, ultra-conserved regions
- Tests: all of these compare observed vs. expected variation: Tajima’s D, Fu and Li test, many others
- Genetic constraint (Tuesday)
Implication: deleterious mutations are rare!
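As one concrete case of comparing observed vs. expected variation, Tajima’s D contrasts two estimates of diversity: the mean number of pairwise differences and the number of segregating sites scaled by a harmonic sum. The sketch below follows the standard formula (Tajima 1989). The input format, a list of alternate allele counts per site among n sampled chromosomes, is an assumption made for illustration.

```python
import math


def tajimas_d(alt_counts, n):
    """Tajima's D from per-site alternate allele counts in n chromosomes."""
    seg = [c for c in alt_counts if 0 < c < n]   # segregating sites only
    s = len(seg)
    if s == 0 or n < 4:
        raise ValueError("need at least one segregating site and n >= 4")

    # Mean number of pairwise differences (pi), summed over sites.
    pi = sum(2 * c * (n - c) for c in seg) / (n * (n - 1))

    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n ** 2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    # Difference between pi and Watterson's estimator (S / a1), standardized.
    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))


# Toy inputs: five singleton sites vs. five intermediate-frequency sites.
d_rare = tajimas_d([1, 1, 1, 1, 1], n=10)
d_common = tajimas_d([5, 5, 5, 5, 5], n=10)
print(f"all singletons: D = {d_rare:.2f}")
print(f"all intermediate frequency: D = {d_common:.2f}")
```

An excess of rare variants pushes D below zero. That pattern is consistent with purifying selection, although population expansion produces it too, so the statistic alone does not prove selection.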

Selection says disease-causing mutations are rare
[Figure: effect size (mild to severe) vs. allele frequency (rare to common)]
- Rare, severe: severe Mendelian disorders, e.g. Tay-Sachs
- Common, severe: nonexistent (removed by selection) (well actually… AD APOE e4. why?)
- Common, mild: e.g. high cholesterol, Crohn’s Disease, Type II Diabetes (many common alleles with small effect sizes)
- Rare, mild: likely many examples, but low power to detect these

Metrics of purifying selection
Site frequency spectrum (SFS): the distribution of allele frequencies of a given set of SNPs in a population or sample
Summarizing the SFS with three metrics:
- % singletons (seen 1 time in the population). Higher=rarer=stronger selection
- Mean MAF. Higher=more common=weaker selection
- % variable sites (how many positions were ever variable; e.g. for de novos, which we’ll talk about Thursday). Higher=weaker selection
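The three summaries can be computed directly from per-site allele counts. This is a minimal sketch: the input format (one alternate allele count per site, with 0 for invariant sites) and the toy numbers are assumptions, not data from the lecture.

```python
from collections import Counter


def sfs_summary(alt_counts, n_chromosomes):
    """Summarize the site frequency spectrum of a set of sites.

    alt_counts: alternate allele count at each site (0 = invariant site).
    n_chromosomes: number of chromosomes sampled.
    """
    variable = [c for c in alt_counts if 0 < c < n_chromosomes]
    if not variable:
        raise ValueError("no variable sites in this set")
    sfs = Counter(variable)  # allele count -> number of sites with that count
    mafs = [min(c, n_chromosomes - c) / n_chromosomes for c in variable]
    return {
        "pct_variable_sites": 100 * len(variable) / len(alt_counts),
        "pct_singletons": 100 * sfs[1] / len(variable),
        "mean_maf": sum(mafs) / len(mafs),
        "sfs": dict(sorted(sfs.items())),
    }


# Toy data: 10 sites genotyped in 20 chromosomes.
summary = sfs_summary([0, 0, 1, 1, 1, 2, 5, 0, 10, 1], n_chromosomes=20)
for key, value in summary.items():
    print(key, value)
```

Computing the same summary separately for, say, synonymous and nonsense sites would show the class-level contrast that these metrics are meant to capture.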

Purifying selection for variant classes
MAPS: mutability adjusted % singletons, compared across variant classes (missense, nonsense, synonymous, splice)
http://www.nature.com/nature/journal/v536/n7616/images_article/nature19057-f2.jpg
Generally, the more variants that are singletons, the more deleterious that mutation class is
Caveat: these metrics describe categories of variants, and may or may not be useful for predicting impact of a specific mutation
Lek et al. 2016

Conservation across species is informative
- PhastCons: compute conservation based on phylogenetic tree + HMM
- phyloP: compute conservation using sequence alignment and model of neutral evolution
- GERP: identify constrained elements in multiple sequence alignment by quantifying “substitution deficits”
Davydov et al. 2010

Remaining challenges

Some examples defy our annotation pipelines
Cystic fibrosis (deltaF508 in CFTR): in-frame deletion! (Usually would prioritize frameshifts)
rs113993960
GAAAATATCATCTTTGGTGTTTCC   I I F G V
GAAAATATCAT---TGGTGTTTCC   I I G V
Synonymous variants can be pathogenic! (e.g. in MDR1, multidrug resistance; UBE1, spinal muscular atrophy)
Deep intronic mutations
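Translating the two deltaF508 sequences shows why this deletion is in-frame: three bases are lost, one phenylalanine disappears, and every other residue is unchanged. The sketch below builds the standard genetic code table and assumes the reading frame starts at the first base shown, which is the frame consistent with the I I F G V labels. The helper name `translate` is hypothetical.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate a DNA string in frame 0, ignoring alignment gaps ('-')."""
    dna = dna.replace("-", "")
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


ref = "GAAAATATCATCTTTGGTGTTTCC"
alt = "GAAAATATCAT---TGGTGTTTCC"   # rs113993960, gaps mark the deleted bases

deleted = alt.count("-")
print("reference:", translate(ref))
print("deltaF508:", translate(alt))
print("in-frame deletion:", deleted % 3 == 0)
```

The isoleucine next to the deletion is preserved because the rejoined codon ATT still encodes isoleucine, like the original ATC.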

The non-coding genome…
With exome sequencing, we analyze 2% of the genome but still have too many variants. It helps that we have a decent idea of how to analyze coding variants.
How will we deal with the overwhelming number of false positives from WGS?
Requires… annotation and prioritization!
More on non-coding regions next Tuesday.

Final projects + PS5