CS273A Lecture 11: Comparative Genomics II

CS273A Lecture 11: Comparative Genomics II MW  12:50-2:05pm in Beckman B302 Profs: Serafim Batzoglou & Gill Bejerano TAs: Harendra Guturu & Panos Achlioptas http://cs273a.stanford.edu [BejeranoFall13/14]

Announcements
Some midterm feedback:
"You seem to like us." We like you too!
"Teach us more biology" / "Teach us more algorithms." We'll highlight follow-up classes towards the end of the quarter.
"Give us more references." Start with Wikipedia. Then ask us for any specifics on Piazza.
"How do all the different topics we cover tie together?" They all teach you about the human genome! Its functions, its evolution and its contribution to disease: it's a big canvas.
"What are the most important problems in the field?" Different people will give you different answers. Every topic we introduce to you is not fully resolved!
"Homework is very technical. Hard to focus on the insights." This is part of our daily challenge. We should make you like the taste of it, because we sure do! Your project will give you a taste of real open ended research.

TTATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATACATATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTCAGTAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTCCGTGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACTAGCTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATGATAATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAAAAGCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAATTGTTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAATTCTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGGATTTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGATTTTGATATGCTTTGCGCCGTCAAAGTTTTGAACGATGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAATCTTTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATGAACGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATCATATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAAAAGAAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCAGCATTGGGCAGCTGTCTATATGAATTAGTCAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAACTTTAGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTTTCCTACGCATAA
TAAGAATAGGAGGGAATATCAAGCCAGACAATCTATCATTACATTTAAGCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAAAGAGTATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATACAGCTCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTACAACCAGGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATATCAACACTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCGTTGGTGCTTTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCTTATCATTTGTCATTCCCTATATCATCTAGAGCATCATTCGGTATTTTCTTCTCTTTATGGCCCGTTATTAACAGAGTCGTCATGGCCATCGTTTGGTATAGTGTCCAAGCTTATATTGCGGCAACTCCCGTATCATTAATGCTGAAATCTATCTTTGGAAAAGATTTACAATGATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAGTTCTTGGCAAGTTGCCAACTGACGAGATGCAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGGATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATGTTTTCAATGTAAGAGATTTCGATTATCTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATAATAAAG Genome Evolution 3

Comparative Genomics
"Nothing in Biology Makes Sense Except in the Light of Evolution" (Theodosius Dobzhansky)
"Nothing in Evolution Makes Sense Except in the Light of Computation" (Yours Truly)
While we may not agree about everything…
Species shown: human, mouse, rat, chimp, chicken, fugu, zfish, dog, tetra, opossum, cow, macaque, platypus.
Incomplete list: also cat, armadillo, elephant, rabbit, shrew, tenrec, wallaby (2x).
Some are not mammals, obviously, but they are useful outgroups in phylogenetic analysis of mammals.

Terminology
Orthologs: Genes related via speciation (e.g. C, M, H3)
Paralogs: Genes related through duplication (e.g. H1, H2, H3)
Homologs: Genes that share a common origin (e.g. C, M, H1, H2, H3)
[Figure labels: gene tree; single ancestral gene; species tree; events: speciation, duplication, loss]
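The definitions above can be sketched in code. This is a minimal illustration, assuming a reconciled gene tree in which every internal node is already labeled as a speciation or a duplication. The tree below is hypothetical (the slide's figure is not reproduced in this transcript); it was chosen only so that the resulting labels agree with the examples on the slide.

```python
# Hypothetical reconciled gene tree: a leaf is a gene name, an internal node is
# (event, left_subtree, right_subtree) with event "speciation" or "duplication".
GENE_TREE = (
    "duplication",
    ("duplication", "H1", "H2"),
    ("speciation", "C", ("speciation", "M", "H3")),
)


def leaves(node):
    """Return the set of gene names below a node."""
    if isinstance(node, str):
        return {node}
    _, left, right = node
    return leaves(left) | leaves(right)


def classify_pairs(node, out=None):
    """Label every gene pair by the event at its most recent common ancestor."""
    if out is None:
        out = {}
    if isinstance(node, str):
        return out
    event, left, right = node
    label = "orthologs" if event == "speciation" else "paralogs"
    # Genes on opposite sides of this node have this node as their common ancestor.
    for a in leaves(left):
        for b in leaves(right):
            out[frozenset((a, b))] = label
    classify_pairs(left, out)
    classify_pairs(right, out)
    return out


pairs = classify_pairs(GENE_TREE)
for pair, label in sorted((sorted(p), l) for p, l in pairs.items()):
    print(pair, label)
```

Every pair that appears in the output is a pair of homologs, since all five genes descend from the single gene at the root.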

Conservation implies function
Purifying selection vs. neutral evolution.
Note: Lack of sequence conservation does NOT imply lack of function, NOR does it rule out function conservation.

Dotplots
Dotplots are a simple way of seeing alignments.
We really like to see good visual demonstrations, not just tables of numbers.
It's a grid: put one sequence along the top and the other down the side, and put a dot wherever they match.
You see the alignment as a diagonal.
Note that DNA dotplots are messier because the alphabet has only 4 letters.
Smoothing by windows helps.
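The grid and the window smoothing can be illustrated with a short sketch. The function names and the `window` / `min_matches` parameters are illustrative choices, not taken from the lecture: a dot is drawn at (i, j) only when enough of the next `window` positions agree.

```python
def dotplot(seq_a, seq_b, window=1, min_matches=1):
    """Return the set of (i, j) cells that get a dot.

    A dot is placed at (i, j) when at least `min_matches` of the `window`
    positions starting at seq_a[i] and seq_b[j] are identical.
    """
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= min_matches:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the grid as text: seq_b along the top, seq_a down the side."""
    lines = ["  " + seq_b]
    for i, base in enumerate(seq_a):
        row = "".join("*" if (i, j) in dots else "." for j in range(len(seq_b)))
        lines.append(base + " " + row)
    return "\n".join(lines)


seq = "GATTACA"
raw = dotplot(seq, seq)                                # single-base matches
smoothed = dotplot(seq, seq, window=3, min_matches=3)  # exact 3-base windows
print(render(seq, seq, raw))
print(render(seq, seq, smoothed))
```

With single-base matching the self-comparison of `GATTACA` shows 15 dots, most of them off the diagonal; requiring an exact 3-base window leaves only the 5 diagonal dots.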

Chaining Alignments
Chaining highlights homologous regions between genomes, bridging the gulf between syntenic blocks and base-by-base alignments.
Local alignments tend to break at transposon insertions, inversions, duplications, etc.
Global alignments tend to force non-homologous bases to align.
Chaining is a rigorous way of joining together local alignments into larger structures.

Another Chain Example
[Figure: a human sequence and a mouse sequence built from blocks A, B, C, D, E and a duplicate B', followed by the implicit human and mouse sequences and the resulting mouse chains (as seen in the human browser) and human chains (as seen in the mouse browser).]

Chains join together related local alignments
[Figure labels: likely ortholog; likely paralogs; shared domain? Gene shown: Protease Regulatory Subunit 3]

Note: repeats are a nuisance
[Figure axes: human, mouse]
If, for example, human and mouse each have 10,000 copies of the same repeat, we will obtain and need to output 10^8 alignments of all these copies to each other.
Note that for the sake of this comparison interspersed repeats and simple repeats are equal nuisances. However, note that simple repeats, but not interspersed repeats, violate the assumption that similar sequences are homologous.
Solution:
1. Discover all repetitive sequences in each genome.
2. Mask them when doing genome to genome comparison.
3. Chain your alignments.
4. Add back to the alignments only repeat matches that lie within pre-computed chains. This re-introduces into the chains (mostly) orthologous copies. (Which is valuable!)
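A small sketch of the arithmetic and of step 4 may help. It assumes a deliberately simplified data model that is not from the lecture: each repeat alignment and each chain is reduced to a pair of half-open intervals (human, mouse), and "lies within a chain" is approximated as lying inside the chain's overall span in both genomes.

```python
# 10,000 copies in each genome give 10,000 x 10,000 pairwise alignments.
copies_per_genome = 10_000
total_alignments = copies_per_genome ** 2  # 10^8


def within(inner, outer):
    """True if half-open interval `inner` lies entirely inside `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def repeat_matches_in_chains(repeat_alignments, chains):
    """Keep repeat alignments that fall inside some chain span in both genomes.

    Both arguments are lists of (human_interval, mouse_interval) pairs.
    Simplification: a chain is represented only by its bounding span.
    """
    kept = []
    for human_iv, mouse_iv in repeat_alignments:
        for chain_human, chain_mouse in chains:
            if within(human_iv, chain_human) and within(mouse_iv, chain_mouse):
                kept.append((human_iv, mouse_iv))
                break
    return kept


# Hypothetical coordinates for illustration only.
chains = [((100, 500), (1000, 1400))]
repeat_alignments = [
    ((150, 200), (1050, 1100)),  # inside the chain in both genomes: kept
    ((150, 200), (9000, 9050)),  # same human copy, unrelated mouse copy: dropped
]
print(total_alignments)
print(repeat_matches_in_chains(repeat_alignments, chains))
```

The second alignment is exactly the kind of copy-to-copy match that inflates the output without telling us anything about orthology.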

Chains
A chain is a sequence of gapless aligned blocks, where there must be no overlaps of blocks' target or query coords within the chain.
Within a chain, target and query coords are monotonically non-decreasing (i.e. always increasing or flat).
Double-sided gaps are a new capability (blastz can't do that) that allow extremely long chains to be constructed.
Not just orthologs, but paralogs too, can result in good chains. But that's useful!
Chains should be symmetrical: e.g. swap human-mouse -> mouse-human chains, and you should get approx. the same chains as if you chain swapped mouse-human blastz alignments.
Chained blastz alignments are not single-coverage in either target or query unless some subsequent filtering (like netting) is done.
Chain tracks can contain massive pileups when a piece of the target aligns well to many places in the query. Common causes of this include insufficient masking of repeats and high-copy-number genes (or paralogs).
[Angie Hinrichs, UCSC wiki]
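The two structural rules above (no overlapping blocks, coordinates never going backwards) and the notion of a double-sided gap can be checked mechanically. The sketch below assumes a simplified block representation, `(target_start, query_start, size)` with half-open coordinates on the same strand. It is not the UCSC chain file format.

```python
def is_valid_chain(blocks):
    """Check that gapless blocks, listed in chain order, never overlap or go
    backwards in either target or query coordinates."""
    if any(size <= 0 for _, _, size in blocks):
        return False
    for (t0, q0, n0), (t1, q1, _) in zip(blocks, blocks[1:]):
        if t1 < t0 + n0 or q1 < q0 + n0:
            return False
    return True


def gap_kinds(blocks):
    """Classify the gap between each pair of consecutive blocks."""
    kinds = []
    for (t0, q0, n0), (t1, q1, _) in zip(blocks, blocks[1:]):
        t_gap = t1 - (t0 + n0)
        q_gap = q1 - (q0 + n0)
        if t_gap > 0 and q_gap > 0:
            kinds.append("double-sided")   # unaligned bases on both sides
        elif t_gap > 0 or q_gap > 0:
            kinds.append("single-sided")   # plain insertion or deletion
        else:
            kinds.append("none")
    return kinds


# Hypothetical blocks for illustration.
good = [(0, 0, 10), (10, 15, 5), (20, 30, 8)]
bad = [(0, 0, 10), (5, 20, 5)]  # second block overlaps the first in the target
print(is_valid_chain(good), gap_kinds(good))
print(is_valid_chain(bad))
```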

Before and After Chaining

Chaining Algorithm
Input: blocks of gapless alignments from (b)lastz.
Dynamic program based on the recurrence relationship:
$\mathrm{score}(B_i) = \max_{j<i} \left[ \mathrm{score}(B_j) + \mathrm{match}(B_i) - \mathrm{gap}(B_i, B_j) \right]$
Uses Miller's KD-tree algorithm to minimize which parts of the dynamic programming graph to traverse.
Timing is O(N log N), where N is the number of blocks (which is in the hundreds of thousands).
See [Kent et al, 2003] "Evolution's cauldron: Duplication, deletion, and rearrangement in the mouse and human genomes"
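The recurrence can be tried out with a plain quadratic dynamic program. This sketch deliberately omits the KD-tree speed-up, so it runs in O(N^2) rather than O(N log N). The gap penalty is a made-up placeholder, and the base case (a block may start a new chain with score match(B_i)) is an assumption the slide leaves implicit.

```python
def simple_gap_cost(dt, dq):
    """Hypothetical gap penalty; real chainers use tuned gap-cost tables."""
    if dt == 0 and dq == 0:
        return 0.0
    return 10.0 + 0.5 * (dt + dq)


def best_chain(blocks):
    """blocks: list of (target_start, query_start, size, match_score).

    Returns (score, indices) of the highest-scoring chain, where indices
    refer to positions in `blocks`.
    """
    if not blocks:
        return 0.0, []
    # Process blocks by target start so every legal predecessor comes first.
    order = sorted(range(len(blocks)), key=lambda k: (blocks[k][0], blocks[k][1]))
    score, prev = {}, {}
    for pos, i in enumerate(order):
        ti, qi, _, match_i = blocks[i]
        score[i], prev[i] = match_i, None  # base case: start a new chain here
        for j in order[:pos]:
            tj, qj, nj, _ = blocks[j]
            dt, dq = ti - (tj + nj), qi - (qj + nj)
            if dt < 0 or dq < 0:
                continue  # B_j does not end before B_i in both sequences
            candidate = score[j] + match_i - simple_gap_cost(dt, dq)
            if candidate > score[i]:
                score[i], prev[i] = candidate, j
    end = max(score, key=score.get)
    chain, k = [], end
    while k is not None:
        chain.append(k)
        k = prev[k]
    return score[end], chain[::-1]


# Hypothetical blocks; block 2 sits far off the main diagonal.
blocks = [(0, 0, 10, 100.0), (20, 15, 10, 100.0), (12, 100, 5, 30.0), (40, 35, 10, 100.0)]
print(best_chain(blocks))
```

In this toy input the off-diagonal block is left out of the best chain because linking it would block the two later blocks, which lie before it in the query.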

Netting Alignments
Commonly multiple mouse alignments can be found for a particular human region, particularly for coding regions.
Net finds the best mouse match for each human region.
Highest scoring chains are used first.
Lower scoring chains fill in gaps within chains, inducing a natural hierarchy.
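The "highest scoring chains first, lower scoring chains fill the gaps" idea can be sketched as a greedy interval assignment. This is a strong simplification and not the actual netting program: chains are reduced to a score, a name and their target-side block intervals, and the nesting levels of a real net are not tracked. All names and coordinates are hypothetical.

```python
def subtract(interval, covered):
    """Return the parts of half-open `interval` not overlapped by any of the
    disjoint half-open intervals in `covered`."""
    start, end = interval
    pieces = []
    for c_start, c_end in sorted(covered):
        if c_end <= start or c_start >= end:
            continue
        if c_start > start:
            pieces.append((start, c_start))
        start = max(start, c_end)
        if start >= end:
            break
    if start < end:
        pieces.append((start, end))
    return pieces


def greedy_net(chains):
    """chains: list of (score, name, [(t_start, t_end), ...]).

    Returns a single-coverage list of (t_start, t_end, name) over the target.
    """
    covered = []   # disjoint target intervals already assigned
    assigned = []
    for _, name, target_blocks in sorted(chains, key=lambda c: -c[0]):
        for block in target_blocks:
            for piece in subtract(block, covered):
                covered.append(piece)
                assigned.append((piece[0], piece[1], name))
    return sorted(assigned)


chains = [
    (1000, "chainA", [(0, 50), (80, 120)]),  # best chain, with a gap at 50-80
    (300, "chainB", [(40, 90)]),             # only its 50-80 part fills the gap
    (100, "chainC", [(45, 60)]),             # fully covered already: dropped
]
print(greedy_net(chains))
```

Each target base ends up assigned to at most one chain, which is the single-coverage property that distinguishes a net from the raw chains.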

Net highlights rearrangements
A large gap in the top level of the net is filled by an inversion containing two genes.
Numerous smaller gaps are filled in by local duplications and processed pseudo-genes.

Nets attempt to capture the ortholog (they also hide everything else)

Nets/chains can reveal retrogenes (and when they jumped in!)

Nets
A net is a hierarchical collection of chains, with the highest-scoring non-overlapping chains on top, and their gaps filled in where possible by lower-scoring chains, for several levels.
A net is single-coverage for target but not for query.
Because it's single-coverage in the target, it's no longer symmetrical.
The netter has two outputs, one of which we usually ignore: the target-centric net in query coordinates. The reciprocal best process uses that output: the query-referenced (but target-centric / target single-cov) net is turned back into component chains, and then those are netted to get single coverage in the query too; the two outputs of that netting are reciprocal-best in query and target coords. Reciprocal-best nets are symmetrical again.
Nets do a good job of filtering out massive pileups by collapsing them down to (usually) a single level.
GB: for human inspection always prefer looking at the chains!
[Angie Hinrichs, UCSC wiki]

Before and After Netting

Convert / LiftOver
"LiftOver chains" are actually chains extracted from nets, or chains filtered by the netting process.
LiftOver: batch utility.
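To make the idea of converting coordinates through a chain concrete, here is a minimal sketch. It is not the UCSC LiftOver implementation: it assumes a single chain given as hypothetical `(old_start, new_start, size)` gapless blocks on the same strand, and it simply reports `None` for positions that fall in a gap.

```python
def lift_position(pos, blocks):
    """Map a 0-based position from the old assembly to the new one.

    blocks: list of (old_start, new_start, size) gapless aligned blocks.
    Returns the new coordinate, or None if `pos` is not inside any block.
    """
    for old_start, new_start, size in blocks:
        if old_start <= pos < old_start + size:
            return new_start + (pos - old_start)
    return None


# Hypothetical chain with two blocks separated by a 10-base gap.
chain_blocks = [(100, 1000, 50), (160, 1060, 40)]
print(lift_position(120, chain_blocks))  # inside the first block
print(lift_position(155, chain_blocks))  # inside the gap
```

Because the chains used for this purpose come out of netting, each old position has at most one candidate block, which is what makes a lookup this simple meaningful.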

Drawbacks
Inversions not handled optimally.
[Figure: the chains show forward chr1 segments (> > >), a reverse-strand chr5 segment (< < <) and a reverse-strand chr1 segment (< < <); the nets show only the forward chr1 segments and the chr5 segment.]

Self Chain reveals paralogs (self net is meaningless)

Let’s put the chains and nets to good use…

The Genotype - Phenotype divide
Can we find evolutionary patterns that are distinct enough to be phenotypically revealing?
Problem #1: Too many nucleotide changes between any pair of related species (or individuals). The vast majority of these are near/neutral.
[Figure labels: Species A; Species B]

Matching Genotype to Phenotype is hard
[Figure label: Number of rearrangements]
Most mutations are near/neutral.

What about a tree of related species?
What if we could find evolutionary patterns that were distinct enough to be phenotypically revealing?
Genomes: Inherited with Modifications.
Traits: Come and Go.
[Figure labels: ancestor; Species A; Species B; ...; Species H]

What happens when an ancestral trait “goes”?
[Figure labels: ancestor; ancestral trait information; phenotype; genome]
Trait information is no longer under selection.
Erodes away over evolutionary time.

A lot of DNA and many traits vary between any two species.
What about independent trait loss? Vitamin C synthesis, tail, body hair, dentition features, etc.
[Figure labels: ancestor; ancestral trait information; phenotype; genome]
Trait information is no longer under selection.
Erodes away over evolutionary time.

The PG screen
[Figure label: matches trait presence/absence pattern]
[Hiller et al., 2012a]

The PG screen Capture the independent genomic switch from purifying selection  neutral evolution in all and only the trait loss species. Robust to: Different trait disabling times. Different trait disabling mutations. http://cs273a.stanford.edu [BejeranoFall13/14]

Branding ;-)
Forward Genetics (phenotype -> genotype): Search for mutations that segregate with the trait.
Forward Genomics: Search for regions that are lost only in species lacking the trait.
But does it work?

Vitamin C Synthesis
Rats & mice: synthesize vitamin C.
Human: cannot synthesize vitamin C.

The Vitamin C synthesis “phenotree”
Vitamin C synthesis was lost 3-4 times independently in mammalian evolution.
Fwd Genomics asks: Do one or more genomic loci look like THAT?

Start by using chains and nets!
First we use lastz, chaining & netting to align the reference genome to orthologous sequences in all other species’ genomes.
  species 1  ACCCTATCGATTGCA
  species 2  TCCGTATCG-TT-CA
  outgroup   ACTCT-TCGATT-AA

We quantify divergence by comparing sequences to the reconstructed ancestral sequence
Mutation in species 1 or 2? Insertion in species 1 or deletion in species 2?
Reconstruct ancestral sequence:
  species 1  ACCCTATCGATTGCA
  species 2  TCCGTATCG-TT-CA
  outgroup   ACTCT-TCGATT-AA
  ancestor   ACCCTATCGATT-CA
Percent of identical bases:
  species 1  ACCCTATCGATTGCA  14 identical bases  93%
  species 2  TCCGTATCG-TT-CA  11 identical bases  79%  -> more diverged
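The numbers on this slide can be reproduced with a short sketch. Two assumptions are made here that the slide does not state: the ancestor is reconstructed by a column-wise majority vote over species 1, species 2 and the outgroup (a stand-in for whatever reconstruction method was actually used), and percent identity is computed over columns where the ancestor or the species has a base, skipping columns where both are gaps. Both choices happen to agree with the slide's sequence and percentages.

```python
from collections import Counter


def reconstruct_ancestor(species1, species2, outgroup):
    """Column-wise majority vote over three aligned rows ('-' is a gap).

    Columns where all three rows differ are marked 'N'.
    """
    ancestor = []
    for column in zip(species1, species2, outgroup):
        state, count = Counter(column).most_common(1)[0]
        ancestor.append(state if count >= 2 else "N")
    return "".join(ancestor)


def percent_identity(ancestor, species):
    """Return (identical_bases, percent) for a species row vs. the ancestor.

    Columns where both rows are gaps are skipped; every other column counts.
    """
    identical = compared = 0
    for a, s in zip(ancestor, species):
        if a == "-" and s == "-":
            continue
        compared += 1
        if a == s:
            identical += 1
    percent = 100.0 * identical / compared if compared else 0.0
    return identical, percent


species1 = "ACCCTATCGATTGCA"
species2 = "TCCGTATCG-TT-CA"
outgroup = "ACTCT-TCGATT-AA"

ancestor = reconstruct_ancestor(species1, species2, outgroup)
print(ancestor)
for name, row in (("species 1", species1), ("species 2", species2)):
    identical, percent = percent_identity(ancestor, row)
    print(name, identical, round(percent))
```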

Sequencing errors mimic divergence
  ancestor   ACCCTATCGATT-CAATGG
  species 1  ACCCTATCGATTGCAAGGG  89% identical bases
  species 2  TCCGTAACG--T-CTATCG  61% identical bases
Sequence quality scores: high sequencing error rate -> treat species 2 as missing data

Assembly gaps mimic divergence
[Figure: a conserved region aligned across species 1 to species 5; in species 1 an assembly gap (?????????) between Sanger reads falls on the region.]
-> treat species 1 as missing data

Reconstruct the evolutionary history of all conserved regions, coding and non-coding
544,549 conserved regions; for each one, reconstruct the ancestral locus.
Matrix: 33 species x 544,549 regions (example entries: 93%, 70%, 85%, ...)
Reconstruct ancestral sequence.
Measure extant species divergence.
Avoid low quality sequence and assembly gaps.
Seek perfect phenotree match.

We quantify the match to the vitamin C pattern by counting the number of species that violate the pattern
[Figure: two panels plotting percent identity (axis up to 100) per species, one region with 1 violation and one with 2 violations.]
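One way to make "number of species that violate the pattern" concrete is sketched below. The rule used here is an assumption for illustration, not necessarily the scoring used in the published screen: a region matches perfectly if some percent-identity cutoff puts all trait-loss species below it and all trait-preserving species at or above it, and the violation count is the smallest number of species on the wrong side over all cutoffs. Species names and values are hypothetical; species treated as missing data are simply left out of the dictionary.

```python
def count_violations(identity, trait_lost):
    """identity: dict mapping species -> percent identity to the ancestor.
    trait_lost: set of species that lack the trait.

    Returns the minimum number of species on the wrong side of any cutoff.
    """
    cutoffs = sorted(set(identity.values())) + [float("inf")]
    best = len(identity)
    for cutoff in cutoffs:
        wrong = sum(
            (species in trait_lost) != (pid < cutoff)
            for species, pid in identity.items()
        )
        best = min(best, wrong)
    return best


trait_lost = {"sp_A", "sp_B"}

perfect = {"sp_A": 62.0, "sp_B": 58.0, "sp_C": 95.0, "sp_D": 93.0, "sp_E": 96.0}
one_off = dict(perfect, sp_E=60.0)  # a trait-preserving species that diverged

print(count_violations(perfect, trait_lost))
print(count_violations(one_off, trait_lost))
```

Searching over cutoffs rather than fixing one keeps the count robust to regions that evolve faster or slower overall.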

Regions matching the vitamin C trait are clustered
[Figure: the 544,549 conserved regions arranged by number of violating species, from perfect match through 1 to 10 violations to no match.]
-> these conserved regions are all exons of a single gene

This gene is more diverged in all non-vitamin C synthesizing species

What is the function of this gene?
33 genomes x 544,549 regions, screened for the vitamin C pattern.
Gulo, gulonolactone (L-) oxidase, encodes the enzyme responsible for vitamin C biosynthesis.
Note: No likely shared disabling mutation.
We learned about both evolution and function.

The Power of Forward Genomics
33 genomes x 544,549 regions, vitamin C pattern: Gulo, gulonolactone (L-) oxidase.
Forward genomics works.
Can it work for continuous traits? With only two independent losses? And many unknown values?

Bile
Bile is a fluid produced by the liver that aids the digestion of lipids in the small intestine.

Bile Phospholipids
Different mammals have remarkably different levels of biliary phospholipids.

ABCB4 is a phospholipid transporter

Find “Cure” Models for Human Disease
Human ABCB4 mutations lower patient biliary phospholipid levels to guinea pig levels but are detrimental.
Our discovery: Guinea pig and horse have inactivated the Abcb4 gene in their natural state. How can they do it?
[Figure labels: create KO gene; natural KO; try to fix/treat; find nature’s cure!]

Forward Genomics: How General?
Maybe we just got lucky?
Simulation: our discoveries are not serendipitous.
More losses, more branch length => more likely.
[Hiller et al., 2012a]

Forward Genomics: It’s not just enzymes
We find hundreds of Conserved Non-coding Elements (CNEs) independently lost using just 8 mammalian genomes. [Hiller et al., 2012b]

9 independent CNE losses near DIAPH2 in dog and guinea pig
Diaphanous homolog 2 may play a role in the development and normal function of the ovaries. Mutations of this gene have been linked to premature ovarian failure.

How many independent trait losses?
42% of measured traits! [Hiller et al., 2012a]
