The Human Genome Source Code

CS273A: The Human Genome Source Code. Lecture 16: Comparative + Functional Genomics. TTh 1:30-2:50pm, mostly Always M106*. Prof: Gill Bejerano. CAs: Boyoung (Bo) Yoo & Yatish Turakhia. * Track class on Piazza. http://cs273a.stanford.edu [Bejerano Winter 2018/19]

Announcements

Genome Evolution [slide background: raw genomic DNA sequence]

Life’s Amazing Diversity – On Your Laptop Now. Mammals (250+), Birds (130+), Reptiles (35+), Amphibians (5+), Fish (200+). Genome papers barely scratch the surface of the mysteries genomes hold.

What about a tree of related species? What if we could find evolutionary patterns that were distinct enough to be phenotypically revealing? Genomes: inherited with modifications. Traits: come and go. [Tree diagram: an ancestor at the root, with leaves Species A, Species B, ..., Species H.]

The PG screen Capture the independent genomic switch from purifying selection  neutral evolution in all and only the trait loss species. Robust to: Different trait disabling times. Different trait disabling mutations. http://cs273a.stanford.edu [Bejerano Winter 2018/19]

We quantify divergence by comparing sequences to the reconstructed ancestral sequence
Mutation in species 1 or 2? Insertion in species 1 or deletion in species 2? → Reconstruct the ancestral sequence.
species 1   ACCCTATCGATTGCA
species 2   TCCGTATCG-TT-CA
outgroup    ACTCT-TCGATT-AA
ancestor    ACCCTATCGATT-CA
Compared with the ancestor: species 1 has 14 identical bases, species 2 has 11 identical bases.
Percent of identical bases: species 1 93%, species 2 79% → more diverged.
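The percent-identity numbers on this slide can be reproduced with a short sketch. The function name and the rule for the denominator are assumptions made for illustration: columns where both rows carry a gap are skipped, and every other column counts, which happens to give the slide's 93% and 79%.

```python
def percent_identity(ancestor: str, descendant: str):
    """Compare two rows of one alignment column by column.

    Returns (identical bases, compared columns, percent identity).
    Assumption: a column that is a gap in BOTH rows is not counted.
    """
    if len(ancestor) != len(descendant):
        raise ValueError("aligned sequences must have equal length")
    identical = 0
    compared = 0
    for a, d in zip(ancestor, descendant):
        if a == "-" and d == "-":
            continue  # absent from both sequences, so nothing to compare
        compared += 1
        if a == d:
            identical += 1
    return identical, compared, 100.0 * identical / compared


# Sequences copied from the slide.
ANCESTOR = "ACCCTATCGATT-CA"
SPECIES_1 = "ACCCTATCGATTGCA"
SPECIES_2 = "TCCGTATCG-TT-CA"

for name, seq in [("species 1", SPECIES_1), ("species 2", SPECIES_2)]:
    identical, compared, pct = percent_identity(ANCESTOR, seq)
    print(f"{name}: {identical} identical bases, {pct:.0f}% identity")
```

Species 1 matches the ancestor in 14 of 15 compared columns and species 2 in 11 of 14, so species 2 is the more diverged one.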

We quantify the match to the vitamin C pattern by counting the number of species that violate the pattern. [Figure: two percent-identity plots (axis up to 100), one example with 1 violation and one with 2 violations.]
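One way to read "counting violations" is sketched below. The slide does not say how the pattern is scored, so everything here is an assumption: a species "fits" if it is diverged (percent identity below a threshold) exactly when it has lost the trait, and the threshold is chosen to minimize the number of misfits. Species labels and identity values are invented.

```python
def count_violations(identity, trait_lost, threshold):
    """Count species whose divergence status disagrees with their trait status."""
    violations = 0
    for species, pct in identity.items():
        diverged = pct < threshold
        if diverged != (species in trait_lost):
            violations += 1
    return violations


def best_match(identity, trait_lost):
    """Return (fewest violations, threshold) over all informative thresholds."""
    values = sorted(set(identity.values()))
    candidates = values + [values[-1] + 1.0]  # last one marks every species as diverged
    return min((count_violations(identity, trait_lost, t), t) for t in candidates)


# Hypothetical percent identities to the reconstructed ancestor.
identity = {"sp_A": 97.0, "sp_B": 95.0, "sp_C": 71.0,
            "sp_D": 68.0, "sp_E": 70.0, "sp_F": 96.0}
trait_lost = {"sp_C", "sp_D"}  # hypothetical trait-loss species

print(best_match(identity, trait_lost))
```

In this toy input sp_E is diverged although it keeps the trait, so no threshold separates the two groups cleanly and the best achievable score is one violation.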

ABCB4 is a phospholipid transporter

Find “Cure” Models for Human Disease. Human ABCB4 mutations lower patient biliary phospholipid levels to guinea pig levels but are detrimental. Our discovery: Guinea pig and horse have inactivated the Abcb4 gene in their natural state. How can they do it? [Diagram labels: create KO gene; natural KO; try to fix/treat; find nature’s cure!]

What about gene regulation? Capture the independent genomic switch from purifying selection → neutral evolution in all and only the trait loss species. Robust to: different trait disabling times; different trait disabling mutations.

Combinatorial Regulatory Code. 2,000 different proteins can bind specific DNA sequences. A regulatory region encodes 3-10 such protein binding sites. When all are bound by proteins the regulatory region turns “on”, and the nearby gene is activated to produce protein. [Figure labels: DNA; proteins; protein binding site; gene]

ChIP-Seq: glimpses of the regulatory genome in action. [Figure labels: peak calling; cis-regulatory peak]

What is the transcription factor I just assayed doing?
Collect known literature of the form:
Function A: Gene1, Gene2, Gene3, ...
Function B: Gene1, Gene2, Gene3, ...
Function C: ...
Ask whether the binding sites you discovered are preferentially binding (regulating) any one or more of the functions listed above. Form hypothesis and perform further experiments.
[Figure labels: cis-regulatory peak; gene transcription start site]

Example: inferring functions of Serum Response Factor (SRF) from its ChIP-seq binding profile
ChIP-seq identified 2,429 SRF binding peaks in human Jurkat cells [1] (Jurkat: human T cell lymphoblast-like cell line). SRF is known as a “master regulator of the actin cytoskeleton”. In the ChIP-Seq peaks, we expect to find binding sites regulating (genes involved in) actin cytoskeleton formation.
Description: serum response factor (c-fos serum response
RefSeq Summary (NM_003131): This gene encodes a ubiquitous nuclear protein that stimulates both cell proliferation and differentiation. It is a member of the MADS (MCM1, Agamous, Deficiens, and SRF) box superfamily of transcription factors. This protein binds to the serum response element (SRE) in the promoter region of target genes. This protein regulates the activity of many immediate-early genes, for example c-fos, and thereby participates in cell cycle regulation, apoptosis, cell growth, and cell differentiation. This gene is the downstream target of many pathways; for example, the mitogen-activated protein kinase pathway (MAPK) that acts through the ternary complex factors (TCFs). [provided by RefSeq].
[Figure labels: gene transcription start site; SRF binding ChIP-seq peak]

Example: inferring functions of Serum Response Factor (SRF) from its ChIP-seq binding profile
Existing, gene-based method to analyze enrichment: ignore distal binding events, count affected genes, rank by enrichment hypergeometric p-value.
π = ontology term (e.g. ‘actin cytoskeleton’)
N = 8 genes in genome
K = 3 genes annotated with π
n = 2 genes selected by proximal peaks
k = 1 selected gene annotated with π
P = Pr(k ≥ 1 | n = 2, K = 3, N = 8)
[Figure labels: gene transcription start site; SRF binding ChIP-seq peak; ontology term π]
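The toy hypergeometric probability on this slide can be evaluated directly. This is a minimal sketch using only the standard library; the function name is made up for illustration, and the parameter names follow the slide (N genes in the genome, K annotated, n selected, k selected and annotated).

```python
from math import comb


def hypergeom_tail(N: int, K: int, n: int, k: int) -> float:
    """Pr(X >= k) when n genes are drawn without replacement from N genes,
    K of which carry the annotation."""
    total = comb(N, n)
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


# Numbers from the slide: N = 8, K = 3, n = 2, k = 1.
p = hypergeom_tail(N=8, K=3, n=2, k=1)
print(f"P = {p:.4f}")  # 18/28, about 0.6429
```

With only two selected genes, seeing at least one annotated gene is unsurprising (18 of the 28 possible gene pairs contain one), which is why the toy example yields a large p-value.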

We have (reduced ChIP-Seq into) a gene list! What is the gene list enriched for? Pro: A lot of tools out there for the analysis of gene lists. Con: These tools are built for microarray analysis. Does it matter? [Figure labels: microarray data; gene regulation data; microarray tool]

SRF Gene-based enrichment results. Original authors can only state: “basic cellular processes, particularly those related to gene expression” are enriched [1]. SRF acts on genes both in nucleus and cytoplasm, that are involved in transcription and various types of binding. Where’s the signal? Top “actin” term is ranked #28 in the list.
[1] Valouev A. et al., Nat. Methods, 2008

Associating only proximal peaks loses a lot of information. [Figure: relationship of binding peaks to nearest genes for eight human (H) and mouse (M) ChIP-seq datasets.] Restricting to proximal peaks often leads to complete loss of key enrichments.

Bad Solution: Associating distal peaks brings in many false enrichments
Why bad? 14% of human genes are tagged ‘multicellular organismal development’, but 33% of base pairs have such a gene nearest upstream/downstream. Large “gene deserts” are often next to key developmental genes.
The SRF ChIP-seq set has >2,000 binding events. Throw a random set of 2,000 regions at the genome. What do you get from a gene list analysis?
Term                                    Bonferroni corrected p-value
nervous system development              5×10^-9
system development                      8×10^-9
anatomical structure development        7×10^-8
multicellular organismal development    1×10^-7
developmental process                   2×10^-6

Real Solution: Do not convert to gene list. Analyze the set of genomic regions.
GREAT = Genomic Regions Enrichment of Annotations Tool
π = ontology term (e.g. ‘actin cytoskeleton’)
p = 0.33 of genome annotated with π
n = 6 genomic regions
k = 5 genomic regions hit annotation
P = Pr_binom(k ≥ 5 | n = 6, p = 0.33)
The fraction of the genome resulting in the annotation is explicitly used in the enrichment calculation. Since 33% of base pairs are near a ‘multicellular organismal development’ gene, we now expect 33% of genomic regions to hit this term by chance. => Toss 2,000 random regions at genome, get NO (false) enrichments.
[Figure labels: gene transcription start site; gene regulatory domain; genomic region (ChIP-seq peak)]
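The binomial tail on this slide can be checked with a few lines of standard-library Python. This is only a sketch of the arithmetic, not GREAT's implementation; the function name is an assumption, while n, k and p are the slide's toy values.

```python
from math import comb


def binom_tail(n: int, k: int, p: float) -> float:
    """Pr(X >= k) for X ~ Binomial(n, p): the chance that at least k of n
    regions land in the fraction p of the genome tied to the annotation."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Numbers from the slide: n = 6 regions, k = 5 hits, p = 0.33.
p_value = binom_tail(n=6, k=5, p=0.33)
print(f"P = {p_value:.4f}")  # about 0.0170
```

Because p is the annotated fraction of the genome rather than of the gene list, a term covering a third of all base pairs needs far more than a third of the regions before it looks enriched.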

How does GREAT know how to assign distal binding peaks to genes?
Future: High-throughput assays based on chromosome conformation capture (3C) methods will elucidate complex regulation mechanisms.
Currently: Flexible computational definitions allow assignment of peaks to nearest gene, nearest two genes, etc.
Default: each gene has a “basal regulatory domain” of 5 kb up- and 1 kb downstream of the transcription start site, which extends to the basal domain of the nearest genes within 1 Mb.
Though some associations may be missed or incorrect, in general signal richness and robustness are greatly improved by associating distal peaks.
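The default rule described above can be sketched for a single chromosome as follows. This is an illustrative reading, not GREAT's code: the class and function names, the example genes, and the choice to measure the 1 Mb limit from the transcription start site are all assumptions.

```python
from dataclasses import dataclass

UPSTREAM = 5_000           # basal domain: 5 kb upstream of the TSS
DOWNSTREAM = 1_000         # basal domain: 1 kb downstream of the TSS
MAX_EXTENSION = 1_000_000  # 1 Mb limit, measured from the TSS (assumption)


@dataclass(frozen=True)
class Gene:
    name: str
    tss: int     # transcription start site coordinate
    strand: str  # "+" or "-"


def basal_domain(gene):
    """Basal domain, oriented so that 'upstream' follows the gene's strand."""
    if gene.strand == "+":
        start, end = gene.tss - UPSTREAM, gene.tss + DOWNSTREAM
    else:
        start, end = gene.tss - DOWNSTREAM, gene.tss + UPSTREAM
    return max(0, start), end


def regulatory_domains(genes, chrom_length):
    """Extend each basal domain up to the neighbours' basal domains, within 1 Mb."""
    ordered = sorted(genes, key=lambda g: g.tss)
    basal = [basal_domain(g) for g in ordered]
    domains = {}
    for i, gene in enumerate(ordered):
        left_limit = max((b[1] for b in basal[:i]), default=0)
        right_limit = min((b[0] for b in basal[i + 1:]), default=chrom_length)
        left = max(left_limit, gene.tss - MAX_EXTENSION, 0)
        right = min(right_limit, gene.tss + MAX_EXTENSION, chrom_length)
        start = min(basal[i][0], left)                   # never shrink the basal domain
        end = min(chrom_length, max(basal[i][1], right))
        domains[gene.name] = (start, end)
    return domains


def genes_for_peak(position, domains):
    """Names of all genes whose regulatory domain contains the peak position."""
    return sorted(n for n, (s, e) in domains.items() if s <= position < e)


# Hypothetical genes on a 5 Mb chromosome.
genes = [Gene("A", 100_000, "+"), Gene("B", 400_000, "-"), Gene("C", 3_000_000, "+")]
domains = regulatory_domains(genes, chrom_length=5_000_000)
print(domains)
print(genes_for_peak(250_000, domains))
```

A peak between two genes falls into both of their domains, so it is associated with both, while a peak more than 1 Mb from any transcription start site is associated with no gene.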

GREAT infers many specific functions of SRF from its binding profile
Top GREAT enrichments of SRF:
Ontology          Term                      # Genes   Binomial P-value
Gene Ontology     actin cytoskeleton        30        7×10^-9
Gene Ontology     actin binding             31        5×10^-5
Pathway Commons   TRAIL signaling           32        5×10^-7
Pathway Commons   Class I PI3K signaling    26        2×10^-6
TreeFam           FOS gene family           5         1×10^-8
TF Targets        Targets of SRF            84        5×10^-76
TF Targets        Targets of GABP           28        4×10^-9
TF Targets        Targets of YY1            44        1×10^-6
TF Targets        Targets of EGR1           23        2×10^-4
Experimental support*, in slide order. Gene Ontology: Miano et al. 2007. Pathway Commons: Bertolotto et al. 2000; Poser et al. 2000. TreeFam: Chai & Tarnawski 2002. TF Targets: positive control; ChIP-Seq support; Natesan & Gilman 1995.
Top gene-based enrichments of SRF: top actin-related term is 28th in the list.
SRF is “master regulator of the actin cytoskeleton.” SRF is key regulator of FOS oncogene and has been shown to act in conjunction with YY1 to regulate FOS. Demonstrated associations between SRF and TRAIL signaling. SRF is needed for PI3K-dependent cell proliferation. cFOS and FOSB are known targets of SRF.
* Known from literature: the function is known, SOME of the genes are known, and the binding sites highlighted are NOT.
Similar results for GABP, NRSF, Stat3, p300 ChIP-Seq [McLean et al., Nat Biotechnol., 2010]

Advantages of the GREAT approach. Tailored to the biology of gene regulation: distal sites are incorporated, not ignored; variable length gene regulatory domains; multiple bindings next to the same target gene are rewarded; binding sites are associated to (both) TSS, not gene body; extensive ontologies, some home-made.

Transcription Factor Function Prediction
1. Curate a rich library of TF binding site motifs. Pro: hundreds are now known, from SELEX, PBM, ChIP-seq.
2. Predict cross-species conserved binding sites. Pro: prediction is more accurate, sites are likely functional.
3. Search for extreme binding site concentration next to genes of any particular function. Pro: Leverage the observed phenomenon, and the very large body of knowledge about all (target) genes in the genome. Con: You will not get all binding sites of any function. Pro: You may predict many diverse TF functions.
Example: predict SRF to regulate actin cytoskeleton, muscle development, ...

2. Predict conserved binding sites
Species shown: Human, Chimp, Gorilla, Orangutan, Rhesus, Tarsier, Mouse lemur, Bushbaby, Tree shrew, Mouse, Guinea pig, Squirrel, Rabbit, Alpaca, Cow, Cat, Microbat, Megabat, Hedgehog, Rock hyrax, Tenrec, Armadillo, Sloth.
Alignment rows as captured from the slide ( . = same as human; the species label of each row was not preserved):
TTTCCCTTAAAAGGCTTAAATAAACTCACCAGTGTTTAATT
.........................................
...T.....................................
...T..............G...........G......T...
...T............C........AT..TG.....C....
...T................C....AT...G.....C....
...T.........................TG..........
...T.....................C...TG.....G.G..
...T.........................TG........CG
...T......................T...G......G...
...T........................TTG..........
...T..........................GAC.......A
...T..........................C..........
...T......................T.-.CA.........
...T...............G.....................
TTTCCCTTAAAAGGCTTAAATAAACTCACCAGTGTTTAATT
We in fact allow: imperfect motif matches; binding site / alignment wobble; subset of species support. Guard against alignment fragmentations. Predict efficiently. Improve state of the art using “Excess conservation” scoring.

Compare to shuffled motifs & weed aggressively. [Figure: the SRF motif alongside shuffle #1, shuffle #2, shuffle #3, shuffle #4, …, shuffle #10.] [Wenger et al, Genome Research, 2013]
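A shuffled-motif control can be sketched as below. The slide does not describe how the shuffles are produced, so permuting the columns of a position weight matrix is an assumption here, and the toy matrix is invented rather than the real SRF motif.

```python
import random


def shuffled_motifs(pwm, n_shuffles=10, seed=0):
    """Return n_shuffles column-permuted copies of a motif.

    pwm is a list of columns, each a tuple of (A, C, G, T) probabilities.
    Each copy keeps the same columns, and so the same base composition,
    but in a different order than the original.
    """
    if len(set(pwm)) < 2:
        raise ValueError("need at least two distinct columns to shuffle")
    rng = random.Random(seed)
    shuffles = []
    while len(shuffles) < n_shuffles:
        columns = list(pwm)
        rng.shuffle(columns)
        if columns != list(pwm):  # discard permutations that reproduce the motif
            shuffles.append(columns)
    return shuffles


# Invented 6-column motif (probabilities of A, C, G, T per position).
toy_pwm = [
    (0.05, 0.85, 0.05, 0.05),
    (0.05, 0.85, 0.05, 0.05),
    (0.45, 0.05, 0.05, 0.45),
    (0.45, 0.05, 0.05, 0.45),
    (0.05, 0.05, 0.85, 0.05),
    (0.05, 0.05, 0.85, 0.05),
]
controls = shuffled_motifs(toy_pwm, n_shuffles=10)
print(len(controls))
```

Predictions made with the real motif would then be compared against the same pipeline run on each control, keeping only results that the shuffles cannot match.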

3. Predict binding site/TF functions. Enhancer to gene association → SRF must regulate muscle structure development: predicted to bind next to 157 genes (p = 7.43×10^-41). [Figure: SRF binding sites near the gene transcription start sites of MYOG, BMP4, MYF6, NKX2-5, CAV3, ACTA1.]

PRISM vs. ChIP-seq → GREAT. Table columns: Term, PRISM, ChIP-seq. actin cytoskeleton: Known. [Figure labels: actin cytoskeleton; SRF; T cell]

PRISM vs. ChIP-seq → GREAT. Table columns: Term, PRISM, ChIP-seq. actin cytoskeleton: Known. structural constituent of muscle: Known. dilated heart ventricles: Known. regulation of insulin secretion: Novel. [Figure labels: actin cytoskeleton; SRF; T cell; muscle; heart]

Independently Eroded Binding Sites

Convergent phenotypic evolution. [Figure labels: Trait ✔; convergent lineages; new trait evolves ✔]

Could Collect Individual Convergent Amino Acids. More exciting to study a pathway/group of genes than it is to study individual ones. Even more exciting (including journal-wise) if you spin a second angle. For example, ask: 1. Did convergent molecular evolution contribute much to convergent phenotypic evolution? 2. Could highly constrained amino acids in oft pleiotropic genes have contributed?

Don't Just Collect Individual Events. Best if you can resolve some controversy. For example: echolocating bats and whales share 770 convergent substitutions of otherwise highly conserved amino acids. But all three trees, where one or both echolocating groups is swapped for its non-echolocating sister group, also present 700+ convergent amino acids. So? Is there molecular convergence or not?

1. Identify all conserved amino acids

2. Identify subset of convergent amino acids

3. Ask: which group of genes most affected?

4. Is group of genes most affected relevant?

5. Rejoice!
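Steps 1 and 2 above can be sketched on a toy protein alignment. This is a simplification under stated assumptions: "conserved" here means identical in every background species, and "convergent" means both target groups share one residue that differs from the conserved one. The slides do not give the actual criteria, the function name is made up, and the sequences below are invented.

```python
def convergent_positions(alignment, group_a, group_b):
    """Return (position, conserved residue, shared derived residue) tuples.

    alignment maps species name -> aligned protein sequence.
    group_a and group_b are sets naming the two convergent lineages.
    """
    targets = group_a | group_b
    background = [s for s in alignment if s not in targets]
    length = len(next(iter(alignment.values())))
    hits = []
    for pos in range(length):
        bg = {alignment[s][pos] for s in background}
        if len(bg) != 1 or "-" in bg:
            continue  # step 1: keep only positions conserved in the background
        tg = {alignment[s][pos] for s in targets}
        if len(tg) == 1 and "-" not in tg and tg != bg:
            # step 2: both target groups share the same non-ancestral residue
            hits.append((pos + 1, next(iter(bg)), next(iter(tg))))
    return hits


# Invented toy alignment; species names are borrowed from the slides.
toy_alignment = {
    "Human":    "MKAVLE",
    "Mouse":    "MKAVLE",
    "Cow":      "MKAVLE",
    "Dog":      "MKAVLE",
    "Dolphin":  "MRAVLE",
    "Microbat": "MRAVIE",
}
print(convergent_positions(toy_alignment, {"Dolphin"}, {"Microbat"}))
```

Position 2 is reported because both target lineages carry R where every background species has K, whereas position 5 changes in only one lineage and is ignored. Step 3 would then ask which gene sets collect more such positions than expected.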

The Cochlea. Echolocation independently evolves. [Tree labels: Human, Rhesus, Mouse, Dolphin, Killer whale, Cow, Dog, Black flying fox, Megabat, David’s myotis, Microbat, Big brown bat, Armadillo.] [Max. hearing frequency labels: Dolphin ~100 kHz; Bat; Human ~20 kHz; Dog ~40 kHz.]

Powerful Method

20+ amino acids per function/story

Very Appealing Testable Targets

One of the amino acids found: Prestin (SLC26A5), N7T. [Liu Z. et al., Mol. Biol. Evol., 2014; Winter H. et al., J. Cell Sci. 2006]

Summary: Listen to the Genome, See the blueprint. So much to discover in animal genomes. So many methods to develop. So relevant to human health. Element type × mutation type × species/trait × loss/gain. Mammals, birds, fish (, fly, worm, ...).