1
CS273A Lecture 11: Comparative Genomics II
MW 12:50-2:05pm in Beckman B302 Profs: Serafim Batzoglou & Gill Bejerano TAs: Harendra Guturu & Panos Achlioptas [BejeranoFall13/14]
2
Announcements
Some midterm feedback: You seem to like us. We like you too!
"Teach us more biology" / "Teach us more algorithms": We’ll highlight follow-up classes towards the end of the quarter.
"Give us more references": Start with Wikipedia. Then ask us for any specifics on Piazza.
"How do all the different topics we cover tie together?": They all teach you about the human genome! Its functions, its evolution and its contribution to disease. It’s a big canvas.
"What are the most important problems in the field?": Different people will give you different answers. None of the topics we introduce to you is fully resolved!
"Homework is very technical. Hard to focus on the insights.": This is part of our daily challenge. We should make you like the taste of it, because we sure do! Your project will give you a taste of real open-ended research.
3
Genome Evolution
[Slide background: a full page of raw genomic DNA sequence (A, C, G, T)]
4
Comparative Genomics
“Nothing in Biology Makes Sense Except in the Light of Evolution” (Theodosius Dobzhansky)
“Nothing in Evolution Makes Sense Except in the Light of Computation” (Yours Truly)
While we may not agree about everything…
[Figure: sequenced species: human, mouse, rat, chimp, chicken, fugu, zfish, dog, tetra, opossum, cow, macaque, platypus]
Incomplete list: also cat, armadillo, elephant, rabbit, shrew, tenrec, wallaby (2x).
Some are not mammals, obviously, but are useful outgroups in phylogenetic analysis of mammals.
5
Terminology
Orthologs: Genes related via speciation (e.g. C, M, H3)
Paralogs: Genes related through duplication (e.g. H1, H2, H3)
Homologs: Genes that share a common origin (e.g. C, M, H1, H2, H3)
[Figure: a gene tree descending from a single ancestral gene, drawn inside the species tree, with speciation, duplication and loss events marked]
6
Conservation implies function
Purifying selection vs. neutral evolution.
Note: Lack of sequence conservation does NOT imply lack of function. NOR does it rule out function conservation.
7
Dotplots
Dotplots are a simple way of seeing alignments. We really like to see good visual demonstrations, not just tables of numbers.
It’s a grid: put one sequence along the top and the other down the side, and put a dot wherever they match. You see the alignment as a diagonal.
Note that DNA dotplots are messier because the alphabet has only 4 letters. Smoothing by windows helps.
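The grid construction and the window smoothing described above can be sketched in a few lines. This is a minimal illustration, not the tool used to draw the slide: the function names and the `min_matches` parameter are assumptions made for the example.

```python
def dotplot(seq_x, seq_y, window=1, min_matches=None):
    """Return a grid of booleans: grid[i][j] is True when the windows starting
    at seq_y[i] and seq_x[j] agree in at least min_matches positions."""
    if min_matches is None:
        min_matches = window  # default: require the whole window to match
    rows = len(seq_y) - window + 1
    cols = len(seq_x) - window + 1
    grid = []
    for i in range(rows):
        row = []
        for j in range(cols):
            matches = sum(seq_y[i + k] == seq_x[j + k] for k in range(window))
            row.append(matches >= min_matches)
        grid.append(row)
    return grid


def count_dots(grid):
    """Number of dots in the plot."""
    return sum(cell for row in grid for cell in row)


def render(grid):
    """Draw the grid as text: '*' for a dot, '.' for no dot."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in grid)
```

With `window=1` every chance base match produces a dot, which is why DNA plots look noisy; a larger window keeps the diagonal and drops most of the off-diagonal dots.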
8
Chaining Alignments
Chaining highlights homologous regions between genomes, bridging the gulf between syntenic blocks and base-by-base alignments.
Local alignments tend to break at transposon insertions, inversions, duplications, etc.
Global alignments tend to force non-homologous bases to align.
Chaining is a rigorous way of joining together local alignments into larger structures.
9
Another Chain Example
[Figure: human and mouse sequences with homologous segments A, B, C, D, E and an additional copy B’. In the human browser, mouse chains are drawn against the implicit human sequence; in the mouse browser, human chains are drawn against the implicit mouse sequence.]
10
Chains join together related local alignments
[Figure: chains at Protease Regulatory Subunit 3, annotated: likely ortholog; likely paralogs; shared domain?]
11
Note: repeats are a nuisance
If, for example, human and mouse each have 10,000 copies of the same repeat, we will obtain, and need to output, 10^8 alignments of all these copies to each other.
Note that for the sake of this comparison interspersed repeats and simple repeats are equal nuisances. However, note that simple repeats, but not interspersed repeats, violate the assumption that similar sequences are homologous.
Solution:
1. Discover all repetitive sequences in each genome.
2. Mask them when doing genome to genome comparison.
3. Chain your alignments.
4. Add back to the alignments only repeat matches that lie within pre-computed chains. This re-introduces into the chains (mostly) orthologous copies. (Which is valuable!)
[Figure labels: human, mouse]
12
Chains
A chain is a sequence of gapless aligned blocks, where there must be no overlaps of blocks' target or query coords within the chain.
Within a chain, target and query coords are monotonically non-decreasing (i.e. always increasing or flat).
Double-sided gaps are a new capability (blastz can't do that) that allow extremely long chains to be constructed.
Not just orthologs, but paralogs too, can result in good chains. But that's useful!
Chains should be symmetrical: e.g. swap human-mouse -> mouse-human chains, and you should get approx. the same chains as if you chain swapped mouse-human blastz alignments.
Chained blastz alignments are not single-coverage in either target or query unless some subsequent filtering (like netting) is done.
Chain tracks can contain massive pileups when a piece of the target aligns well to many places in the query. Common causes of this include insufficient masking of repeats and high-copy-number genes (or paralogs).
[Angie Hinrichs, UCSC wiki]
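The two structural rules above (no overlapping blocks, coordinates never going backwards) are easy to check mechanically. The sketch below is only an illustration: the `Block` type and both function names are hypothetical, and it models blocks on a single strand.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    """One gapless aligned block (hypothetical representation)."""
    t_start: int  # start in the target genome, 0-based
    q_start: int  # start in the query genome, 0-based
    size: int     # gapless, so the length is the same on both sides


def is_valid_chain(blocks: List[Block]) -> bool:
    """True if blocks are non-empty in size, non-overlapping and in order
    on both target and query."""
    if any(b.size <= 0 for b in blocks):
        return False
    for prev, cur in zip(blocks, blocks[1:]):
        t_gap = cur.t_start - (prev.t_start + prev.size)
        q_gap = cur.q_start - (prev.q_start + prev.size)
        if t_gap < 0 or q_gap < 0:
            return False
    return True


def gap_kinds(blocks: List[Block]) -> List[str]:
    """Label each gap between consecutive blocks of a valid chain."""
    kinds = []
    for prev, cur in zip(blocks, blocks[1:]):
        t_gap = cur.t_start - (prev.t_start + prev.size)
        q_gap = cur.q_start - (prev.q_start + prev.size)
        if t_gap > 0 and q_gap > 0:
            kinds.append("double")  # unaligned bases on both sides
        elif t_gap > 0:
            kinds.append("target")  # extra bases in the target only
        elif q_gap > 0:
            kinds.append("query")   # extra bases in the query only
        else:
            kinds.append("none")    # blocks are directly adjacent
    return kinds
```

A "double" gap is the double-sided gap mentioned above: both genomes have unaligned sequence between two blocks that still belong to the same chain.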
13
Before and After Chaining
14
Chaining Algorithm
Input: blocks of gapless alignments from (b)lastz.
Dynamic program based on the recurrence relationship:
score(B_i) = max_{j<i} [ score(B_j) + match(B_i) - gap(B_i, B_j) ]
Uses Miller’s KD-tree algorithm to minimize which parts of the dynamic programming graph to traverse.
Timing is O(N log N), where N is the number of blocks (which is in the hundreds of thousands).
See [Kent et al, 2003] “Evolution's cauldron: Duplication, deletion, and rearrangement in the mouse and human genomes”
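To make the recurrence concrete, here is a minimal sketch that evaluates it directly in O(N^2) time. It is not the KD-tree algorithm from the slide. Three things are assumptions made for the example: `match` is taken to be one point per aligned base, `gap_cost` is a made-up affine penalty, and a block is allowed to start a new chain with score `match(B_i)`.

```python
from typing import List, Optional, Tuple

Block = Tuple[int, int, int]  # (target_start, query_start, size) of a gapless block


def gap_cost(t_gap: int, q_gap: int, gap_open: float = 5.0, gap_extend: float = 0.1) -> float:
    """Hypothetical affine penalty for the unaligned bases between two blocks."""
    total = t_gap + q_gap
    return 0.0 if total == 0 else gap_open + gap_extend * total


def best_chain(blocks: List[Block]) -> Tuple[float, List[Block]]:
    """Return (score, blocks) of the highest-scoring chain."""
    ordered = sorted(blocks)  # by target start, so predecessors have j < i
    if not ordered:
        return 0.0, []
    score: List[float] = []
    back: List[Optional[int]] = []
    for i, (ti, qi, size_i) in enumerate(ordered):
        best, prev = float(size_i), None  # assumption: B_i may start a chain
        for j in range(i):
            tj, qj, size_j = ordered[j]
            t_gap = ti - (tj + size_j)
            q_gap = qi - (qj + size_j)
            if t_gap < 0 or q_gap < 0:
                continue  # B_j overlaps B_i or lies after it: cannot precede it
            candidate = score[j] + size_i - gap_cost(t_gap, q_gap)
            if candidate > best:
                best, prev = candidate, j
        score.append(best)
        back.append(prev)
    # Trace back from the best-scoring end block.
    end = max(range(len(ordered)), key=score.__getitem__)
    chain: List[Block] = []
    cursor: Optional[int] = end
    while cursor is not None:
        chain.append(ordered[cursor])
        cursor = back[cursor]
    chain.reverse()
    return score[end], chain
```

The inner loop over all earlier blocks is the part a KD-tree search replaces: most earlier blocks are too far away to be useful predecessors and never need to be examined.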
15
Netting Alignments
Commonly multiple mouse alignments can be found for a particular human region, particularly for coding regions.
Net finds the best mouse match for each human region.
Highest scoring chains are used first. Lower scoring chains fill in gaps within chains, inducing a natural hierarchy.
16
Net highlights rearrangements
A large gap in the top level of the net is filled by an inversion containing two genes. Numerous smaller gaps are filled in by local duplications and processed pseudo-genes.
17
Nets attempt to capture the ortholog
(they also hide everything else)
18
Nets/chains can reveal retrogenes (and when they jumped in!)
19
Nets
A net is a hierarchical collection of chains, with the highest-scoring non-overlapping chains on top, and their gaps filled in where possible by lower-scoring chains, for several levels.
A net is single-coverage for target but not for query. Because it's single-coverage in the target, it's no longer symmetrical.
The netter has two outputs, one of which we usually ignore: the target-centric net in query coordinates. The reciprocal best process uses that output: the query-referenced (but target-centric / target single-cov) net is turned back into component chains, and then those are netted to get single coverage in the query too; the two outputs of that netting are reciprocal-best in query and target coords. Reciprocal-best nets are symmetrical again.
Nets do a good job of filtering out massive pileups by collapsing them down to (usually) a single level.
GB: for human inspection always prefer looking at the chains!
[Angie Hinrichs, UCSC wiki]
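The "best chains first, lower-scoring chains fill the gaps" idea can be sketched greedily. This is a deliberately simplified illustration, not the UCSC netter: chains are hypothetical dicts holding only target intervals, a chain's level is derived from span containment, and no trimming, query-side bookkeeping or score thresholds are modeled.

```python
from typing import Dict, List, Optional, Tuple


def net_chains(chains: List[dict], target_len: int) -> Tuple[List[Optional[str]], Dict[str, int]]:
    """Greedy single-coverage assignment of target bases to chains.

    Each chain is a dict with keys "name", "score" and "blocks", where
    "blocks" is a start-sorted list of (t_start, t_end) target intervals.
    Returns (owner, levels): owner[pos] is the chain covering that target
    base (or None), levels[name] is the nesting level of each chain used.
    """
    owner: List[Optional[str]] = [None] * target_len
    placed: List[Tuple[int, int, int]] = []  # (span_start, span_end, level)
    levels: Dict[str, int] = {}
    for chain in sorted(chains, key=lambda c: c["score"], reverse=True):
        claimed = 0
        for start, end in chain["blocks"]:
            for pos in range(start, end):
                if owner[pos] is None:  # base not yet taken by a better chain
                    owner[pos] = chain["name"]
                    claimed += 1
        if claimed == 0:
            continue  # fully shadowed by better chains: hidden by the net
        span_start = chain["blocks"][0][0]
        span_end = chain["blocks"][-1][1]
        level = 1
        for s, e, lvl in placed:
            if s <= span_start and span_end <= e:
                level = max(level, lvl + 1)  # sits inside a better chain's span
        placed.append((span_start, span_end, level))
        levels[chain["name"]] = level
    return owner, levels
```

Because every target base ends up with at most one owner, the result is single-coverage in the target only, which is the asymmetry described above.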
20
Before and After Netting
21
Convert / LiftOver
"LiftOver chains" are actually chains extracted from nets, or chains filtered by the netting process.
LiftOver: batch utility
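Conceptually, lifting a coordinate means finding the gapless block that contains it and applying that block's offset. The sketch below shows only this core idea for a single same-strand chain; the function name and block layout are assumptions, and the real LiftOver utility handles much more (intervals, reverse strand, multiple chains, match thresholds).

```python
from typing import List, Optional, Tuple


def lift_position(blocks: List[Tuple[int, int, int]], pos: int) -> Optional[int]:
    """Map a 0-based position from the old assembly to the new one.

    blocks: (old_start, new_start, size) gapless blocks of one chain,
    all on the same strand (hypothetical simplified layout).
    Returns None when the position falls in a gap between blocks.
    """
    for old_start, new_start, size in blocks:
        if old_start <= pos < old_start + size:
            return new_start + (pos - old_start)
    return None
```

Positions inside a chain gap have no aligned counterpart, so they cannot be lifted, which is why some records are reported as unmapped in a batch conversion.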
22
Drawbacks
Inversions not handled optimally
[Figure: Chains panel: chr1 on the forward strand (> > >), plus a chr5 chain and a chr1 chain on the reverse strand (< < <). Nets panel: chr1 on the forward strand and only the reverse-strand chr5 chain; the reverse-strand chr1 chain is absent.]
23
Self Chain reveals paralogs
(self net is meaningless)
24
Let’s put the chains and nets to good use…
25
The Genotype - Phenotype divide
Can we find evolutionary patterns that are distinct enough to be phenotypically revealing?
Problem #1: Too many nucleotide changes between any pair of related species (or individuals). The vast majority of these are near/neutral.
[Figure: Species A, Species B]
26
Matching Genotype to Phenotype is hard
Most mutations are near/neutral.
[Figure: number of rearrangements]
27
What about a tree of related species?
What if we could find evolutionary patterns that were distinct enough to be phenotypically revealing?
Genomes: Inherited with Modifications. Traits: Come and Go.
[Figure: tree from an ancestor to Species A, Species B, ..., Species H]
28
What happens when an ancestral trait “goes”?
[Figure: the ancestor carries the ancestral trait information, in both phenotype and genome]
Once the trait is gone, its trait information in the genome is no longer under selection and erodes away over evolutionary time.
29
A lot of DNA and many traits vary between any two species.
30
A lot of DNA and many traits vary between any two species.
What about independent trait loss? Vitamin C synthesis, tail, body hair, dentition features, etc.
31
[Figure: build of the previous diagram, with ancestor, ancestral trait information, phenotype and genome: trait information that is no longer under selection erodes away over evolutionary time]
32
The PG screen
Look for genomic regions whose pattern across species matches the trait presence/absence pattern. [Hiller et al., 2012a]
33
The PG screen
Capture the independent genomic switch from purifying selection to neutral evolution in all and only the trait-loss species.
Robust to:
Different trait disabling times.
Different trait disabling mutations.
34
Branding ;-)
Forward Genetics (phenotype -> genotype): Search for mutations that segregate with the trait.
Forward Genomics: Search for regions that are lost only in species lacking the trait.
But does it work?
35
Vitamin C Synthesis
Rats & mice: synthesize vitamin C
Human: cannot synthesize vitamin C
36
The Vitamin C synthesis “phenotree”
Vitamin C synthesis was lost 3-4 times independently in mammalian evolution.
Fwd Genomics asks: Do one or more genomic loci look like THAT?
37
Start by using chains and nets!
First we use lastz, chaining & netting to align the reference genome to orthologous sequences in all other species’ genomes.
species 1  ACCCTATCGATTGCA
species 2  TCCGTATCG-TT-CA
outgroup   ACTCT-TCGATT-AA
38
We quantify divergence by comparing sequences to the reconstructed ancestral sequence
Mutation in species 1 or 2? Insertion in species 1 or deletion in species 2? Reconstruct the ancestral sequence to decide.
species 1  ACCCTATCGATTGCA
species 2  TCCGTATCG-TT-CA
outgroup   ACTCT-TCGATT-AA
ancestor   ACCCTATCGATT-CA
Percent of identical bases relative to the ancestor:
species 1: 14 identical bases, 93%
species 2: 11 identical bases, 79% (more diverged)
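The percentages on this slide can be reproduced with a short function. The counting convention below is an assumption inferred from the slide's numbers rather than one stated in the source: columns where both sequences have a gap are skipped, and an insertion or deletion column counts as one compared, non-identical position.

```python
from typing import Optional


def percent_identity(ancestor: str, species: str, gap: str = "-") -> Optional[float]:
    """Percent of compared alignment columns where the species base equals
    the reconstructed ancestral base. Returns None if nothing is comparable."""
    if len(ancestor) != len(species):
        raise ValueError("aligned sequences must have the same length")
    identical = 0
    compared = 0
    for a, s in zip(ancestor, species):
        if a == gap and s == gap:
            continue  # gap in both: the column carries no information
        compared += 1
        if a == s:
            identical += 1  # same base in both (indel columns never match)
    if compared == 0:
        return None
    return 100.0 * identical / compared


ancestor = "ACCCTATCGATT-CA"
species_1 = "ACCCTATCGATTGCA"
species_2 = "TCCGTATCG-TT-CA"
print(round(percent_identity(ancestor, species_1)))  # 14 of 15 columns
print(round(percent_identity(ancestor, species_2)))  # 11 of 14 columns
```

Under this convention species 1 is compared over 15 columns (the inserted G counts against it) and species 2 over 14 columns, giving the 93% and 79% shown above.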
39
Sequencing errors mimic divergence
ancestor   ACCCTATCGATT-CAATGG
species 1  ACCCTATCGATTGCAAGGG   89% identical bases
species 2  TCCGTAACG--T-CTATCG   61% identical bases
Sequence quality scores show a high sequencing error rate in species 2, so treat species 2 as missing data.
40
Assembly gaps mimic divergence
[Figure: a conserved region aligned across species 1 to 5; in species 1 the Sanger reads leave an assembly gap (?????????) over the region]
Treat species 1 as missing data.
41
Reconstruct the evolutionary history of all conserved regions, coding and non-coding
544,549 conserved regions. For each one, reconstruct the ancestral locus and measure percent identity in every species (e.g. 93%, 70%, 85%, ...).
Result: a matrix of 33 species x 544,549 regions.
Steps:
Reconstruct ancestral sequence
Measure extant species divergence
Avoid low quality sequence and assembly gaps
Seek perfect phenotree match
42
We quantify the match to the vitamin C pattern by counting the number of species that violate the pattern.
[Figure: two example regions plotted by percent identity (axis up to 100), one with 1 violation and one with 2 violations]
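One way to turn "number of species that violate the pattern" into a computation is sketched below. The source does not spell out how a violation is defined, so this is an assumption: a region matches perfectly when some percent-identity cutoff separates all trait-preserving species (at or above the cutoff) from all trait-loss species (below it), and the violation count is the smallest number of species on the wrong side of any cutoff. Species with missing data (`None`) are ignored. The function name and the toy values in the usage note are made up.

```python
from typing import Dict, Optional


def count_violations(identity: Dict[str, Optional[float]],
                     has_trait: Dict[str, bool]) -> Optional[int]:
    """Minimum number of species that break the trait pattern for one region.

    identity: species -> percent identity to the ancestor, or None if the
              species is treated as missing data for this region.
    has_trait: species -> True if the species still has the trait.
    """
    scored = [(pid, has_trait[sp]) for sp, pid in identity.items() if pid is not None]
    if not scored:
        return None
    # Candidate cutoffs: every observed value, plus one above all of them.
    cutoffs = sorted({pid for pid, _ in scored}) + [float("inf")]
    best = len(scored)
    for cutoff in cutoffs:
        # Prediction: the trait is present iff identity >= cutoff.
        violations = sum((pid >= cutoff) != trait for pid, trait in scored)
        best = min(best, violations)
    return best
```

For example, with toy values `{"A": 95, "B": 94, "C": 92, "D": 70, "E": 65, "F": None}` and the trait present only in A, B and C, the count is 0; lowering C to 68 gives 1 violation. Ranking all regions by this count puts perfect matches first.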
43
Regions matching the vitamin C trait are clustered
[Figure: the 544,549 conserved regions ordered by number of violating species, from perfect match through 1 to 10 violating species to no match]
The best-matching conserved regions are all exons of a single gene.
44
This gene is more diverged in all non-vitamin C synthesizing species
45
What is the function of this gene ?
33 genomes x 544,549 regions, screened for the vitamin C pattern.
Gulo, gulonolactone (L-) oxidase, encodes the enzyme responsible for vitamin C biosynthesis.
Note: No likely shared disabling mutation.
We learned about both evolution and function.
46
The Power of Forward Genomics
33 genomes x 544,549 regions, vitamin C pattern: Gulo, gulonolactone (L-) oxidase.
Forward genomics works.
Can it work for continuous traits? With only two independent losses? And many unknown values?
47
Bile
Bile is a fluid produced by the liver that aids the digestion of lipids in the small intestine.
48
Bile Phospholipids
Different mammals have remarkably different levels of biliary phospholipids.
49
ABCB4 is a phospholipid transporter
50
Find “Cure” Models for Human Disease
Human ABCB4 mutations lower patient biliary phospholipid levels to guinea pig levels but are detrimental.
Our discovery: Guinea pig and horse have inactivated the Abcb4 gene in their natural state. How can they do it?
Create a gene KO: try to fix/treat. Natural KO: find nature’s cure!
51
Forward Genomics: How General?
Maybe we just got lucky?
Simulation: our discoveries are not serendipitous.
More losses, more branch length => more likely. [Hiller et al., 2012a]
52
Forward Genomics: It’s not just enzymes
We find hundreds of Conserved Non-coding Elements (CNEs) independently lost using just 8 mammalian genomes. [Hiller et al., 2012b]
53
9 independent CNE losses near DIAPH2 in dog and guinea pig
Diaphanous homolog 2 may play a role in the development and normal function of the ovaries. Mutations of this gene have been linked to premature ovarian failure.
54
How many independent trait losses?
42% of measured traits! [Hiller et al., 2012a]
55
http://cs273a.stanford.edu