LARVA: An integrative framework for Large-scale Analysis of Recurrent Variants in noncoding Annotations M Gerstein, Yale Slides freely downloadable from.

Slides:



Advertisements
Similar presentations
Methods to read out regulatory functions
Advertisements

Microarrays and Cancer Segal et al. CS 466 Saurabh Sinha.
Introduction to BioInformatics GCB/CIS535
Genome Browsers Ensembl (EBI, UK) and UCSC (Santa Cruz, California)
An analysis of “Alignments anchored on genomic landmarks can aid in the identification of regulatory elements” by Kannan Tharakaraman et al. Sarah Aerni.
[Bejerano Fall09/10] 1 Milestones due today. Anything to report?
Defining the Regulatory Potential of Highly Conserved Vertebrate Non-Exonic Elements Rachel Harte BME230.
In silico cis-analysis promoter analysis - Promoters and cis-elements - Searching for patterns - Searching redundant patterns.
In silico cis-analysis promoter analysis - Promoters and cis-elements - Searching for patterns - Searching redundant patterns.
Whole Genome Polymorphism Analysis of Regulatory Elements in Breast Cancer AAGTCGGTGATGATTGGGACTGCTCT[C/T]AACACAAGCGAGATGAAGAAACTGA Jacob Biesinger Dr.
Chris Chander, Luke Adea BioSci D145 Feb. 12, 2015
CS 374: Relating the Genetic Code to Gene Expression Sandeep Chinchali.
BNFO 602/691 Biological Sequence Analysis Mark Reimers, VIPBG
Motif finding: Lecture 1 CS 498 CXZ. From DNA to Protein: In words 1.DNA = nucleotide sequence Alphabet size = 4 (A,C,G,T) 2.DNA  mRNA (single stranded)
ENCODE enhancers 12/13/2013 Yao Fu Gerstein lab. ‘Supervised’ enhancer prediction Yip et al., Genome Biology (2012) Get enhancer list away to genes DNase.
1 1 - Lectures.GersteinLab.org Overview of ENCODE Elements Mark Gerstein for the "ENCODE TEAM"
BNFO 602/691 Biological Sequence Analysis Mark Reimers, VIPBG
Activities in SVs, focusing on breakpoint characterization Mark Gerstein, Yale.
Epigenome 1. 2 Background: GWAS Genome-Wide Association Studies 3.
Radiogenomics in glioblastoma multiforme
Regulatory Genomics Lab Saurabh Sinha Regulatory Genomics Lab v1 | Saurabh Sinha1 Powerpoint by Casey Hanson.
1 Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine Chenghai Xue, Fei Li, Tao He,
Figure 2: over-representation of neighbors in the fushi-tarazu region of Drosophila melanogaster. Annotated enhancers are marked grey. The CDS is marked.
Motifs BCH364C/391L Systems Biology / Bioinformatics – Spring 2015 Edward Marcotte, Univ of Texas at Austin Edward Marcotte/Univ. of Texas/BCH364C-391L/Spring.
Computational Genomics and Proteomics Lecture 8 Motif Discovery C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E.
Localising regulatory elements using statistical analysis and shortest unique substrings of DNA Nora Pierstorff 1, Rodrigo Nunes de Fonseca 2, Thomas Wiehe.
Signatures of Accelerated Somatic Evolution in Gene Promoters in Multiple Cancer Types Update Talk Kyle Smith De Lab.
Lecture 6. Functional Genomics: DNA microarrays and re-sequencing individual genomes by hybridization.
Starting Monday M Oct 29 –Back to BLAST and Orthology (readings posted) will focus on the BLAST algorithm, different types and applications of BLAST; in.
Algorithms in Bioinformatics: A Practical Introduction
Motif Detection in Yeast Vishakh Joe Bertolami Nick Urrea Jeff Weiss.
MEME homework: probability of finding GAGTCA at a given position in the yeast genome, based on a background model of A = 0.3, T = 0.3, G = 0.2, C = 0.2.
Recombination breakpoints Family Inheritance Me vs. my brother My dad (my Y)Mom’s dad (uncle’s Y) Human ancestry Disease risk Genomics: Regions  mechanisms.
Lecture 11. Topics in Omic Studies (Cancer Genomics, Transcriptomics and Epignomics) The Chinese University of Hong Kong CSCI5050 Bioinformatics and Computational.
Pattern Discovery and Recognition for Genetic Regulation Tim Bailey UQ Maths and IMB.
Alternative Splicing (a review by Liliana Florea, 2005) CS 498 SS Saurabh Sinha 11/30/06.
Thoughts on ENCODE Annotations Mark Gerstein. Simplified Comprehensive (published annotation, mostly in '12 & '14 rollouts)
Overview of ENCODE Elements
Case Study: Characterizing Diseased States from Expression/Regulation Data Tuck et al., BMC Bioinformatics, 2006.
Pattern Discovery and Recognition for Understanding Genetic Regulation Timothy L. Bailey Institute for Molecular Bioscience University of Queensland.
A high-resolution map of human evolutionary constraints using 29 mammals Kerstin Lindblad-Toh et al Presentation by Robert Lewis and Kaylee Wells.
Using public resources to understand associations Dr Luke Jostins Wellcome Trust Advanced Courses; Genomic Epidemiology in Africa, 21 st – 26 th June 2015.
Understanding GWAS SNPs Xiaole Shirley Liu Stat 115/215.
Projects
Presented by: John Lawson Developed by: John Lawson, Nathan Sheffield
Sungkyunkwan University, School of Medicine.
The Transcriptional Landscape of the Mammalian Genome
Of Sea Urchins, Birds and Men
National Human Genome Research Institute
Motifs BCH364C/394P - Systems Biology / Bioinformatics
Content and Labeling of Tests Marketed as Clinical “Whole-Exome Sequencing” Perspectives from a cancer genetics clinician and clinical lab director Allen.
Albert Xue, Binbin Huang, Jianrong Wang
EXTENDING GENE ANNOTATION WITH GENE EXPRESSION
GersteinLab.org Overview
Scaling Computation to Keep Pace with Data Generation
One SNP at a Time: Moving beyond GWAS in Psoriasis
Xin Li, Alexis Battle, Konrad J. Karczewski, Zach Zappala, David A
Deep Learning in Bioinformatics
Challenge 4 – Moving beyond coding regions
Motifs BCH339N Systems Biology / Bioinformatics – Spring 2016
Genomic instability is a core feature of ovarian cancer that frequently involves DNA-damage repair genes. Genomic instability is a core feature of ovarian.
Derek de Rie and Imad Abuessaisa Presented by: Cassandra Derrick
Presentation transcript:

LARVA: An integrative framework for Large-scale Analysis of Recurrent Variants in noncoding Annotations M Gerstein, Yale Slides freely downloadable from Lectures.GersteinLab.org & “tweetable” (via @markgerstein). See last slide for references & more info.

Finding Key Variants in Cancer Genomes: the Needle in the Haystack [Image credit: www.yourgrantauthority.com ‘15] Increasing number of whole genome sequenced for tumor/normal pairs Eg >2500 for PCAWG Lots of somatic mutations in an average tumor (~5K/sample), particularly in non-coding regions A focus is distinguishing drivers & passengers Canonical Drivers are mutations driving cancer progression Thought to be under positive selection Recur in the same position, gene or functional element across tumors in different individuals Passengers are thought not be significant to driving cancer progression Collateral damage Could result from impaired DNA repair processes Most driver work has focused on genes eg Youn & Simon (‘11). Bioinformatics; Lawrence et al. (‘13). Nature

Noncoding Annotations Promoters TF binding sites Transcription start sites DHS sites Enhancers [Jakubowski, csbsju.edu ‘14] Ultra-sensitive & Ultra-conserved elements Non-coding regions more conserved than expectation across the human population & between species [Bejerano et al. (‘04). Science; Khurana et al., Science (‘13)]

Identification of non-coding candidate drivers amongst somatic variants: FunSeq [Khurana et al., Science (‘13)]

Identification of non-coding candidate drivers amongst somatic variants: FunSeq Function Prioritization Funseq 2.0 Better Functional Annotation LARVA [Khurana et al., Science (‘13)]

From Funseq 1.0 to Funseq 2.0 Elaborated features New scoring system Motif disruption score: changes in PWMs Network centrality analysis: PPI, regulatory, and phosphorylation networks New scoring system Entropy based scoring system HOT region Sensitive region Polymorphisms Genome Better schematic …. p = probability of the feature overlapping natural polymorphisms Feature weight: For a variant: 6 [Fu et al., GenomeBiology ('14)]

Mutation recurrence Cancer Type 1 Cancer Type 2 Cancer Type 3

Mutation recurrence Early replicated regions Late replicated regions Cancer Type 1 Cancer Type 2 Cancer Type 3

Noncoding annotations Cancer Type 1 Cancer Type 2 Cancer Type 3 Early replicated regions Late replicated regions

Noncoding annotations Cancer Type 1 Cancer Type 2 Cancer Type 3 Early replicated regions Late replicated regions

Cancer Somatic Mutational Heterogeneity, across cancer types, samples & regions [Lochovsky et al. NAR (’15)]

Cancer Somatic Mutation Modeling 3 models to evaluate the significance of mutation burden Suppose there are k genome elements. For element i, define: ni: total number of nucleotides xi: the number of mutations within the element p: the mutation rate R: the replication timing bin of the element Model 1: Constant Background Mutation Rate (Model from Previous Work) Model 2: Varying Mutation Rate Model 3: Varying Mutation Rate with Replication Timing Correction [Lochovsky et al. NAR (’15)]

LARVA Model Comparison Comparison of mutation count frequency implied by the binomial model (model 1) and the beta-binomial model (model 2) relative to the empirical distribution The beta-binomial distribution is significantly better, especially for accurately modeling the over-dispersion of the empirical distribution [Lochovsky et al. NAR (’15)]

Adding DNA replication timing correction further improves the beta-binomial model Bottom 10% of rep. timing bins requires large correction Top 10% of rep. timing bins requires little correction [Lochovsky et al. NAR (’15)]

LARVA Results TSS LARVA results These have literature-verified cancer associations b-binomial noncoding annotation p-values in sorted order binomial [Lochovsky et al. NAR (’15)]

LARVA Implementation http://larva.gersteinlab.org/ Freely downloadable C++ program Verified compilation and correct execution on Linux A Docker image is also available to download Runs on any operating system supported by Docker Running time on transcription factor binding sites (a worst case input size) is ~80 min Running time scales linearly with the number of annotations in the input

Acknowledgements LARVA.gersteinlab.org L Lochovsky*, J Zhang*, Y Fu, E Khurana FunSeq2.gersteinlab.org Y Fu, Z Liu, S Lou, J Bedford, X Mu, K Yip, E Khurana

Info about content in this slide pack General PERMISSIONS This Presentation is copyright Mark Gerstein, Yale University, 2015. Please read permissions statement at www.gersteinlab.org/misc/permissions.html . Feel free to use slides & images in the talk with PROPER acknowledgement (via citation to relevant papers or link to gersteinlab.org). Paper references in the talk were mostly from Papers.GersteinLab.org. PHOTOS & IMAGES. For thoughts on the source and permissions of many of the photos and clipped images in this presentation see http://streams.gerstein.info . In particular, many of the images have particular EXIF tags, such as kwpotppt , that can be easily queried from flickr, viz: http://www.flickr.com/photos/mbgmbg/tags/kwpotppt