TITLE OF PRESENTATION Board of Scientific Counselors January 2007 Your Name.

Presentation transcript:


COMPARATIVE GENOMICS Manolis Kellis Board of Scientific Counselors January 2007

TTATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATA CATATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTC AGTAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTC CGTGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACT AGCTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATG ATAATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAA AAGCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAAT TGTTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA TTCTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGG ATTTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGAT TTTGATATGCTTTGCGCCGTCAAAGTTTTGAACGAGAAAAATCCATCCATTACCTTAATAAATGCTGATCCCAAATTTGCTCAAAGGAA GTTCGATTTGCCGTTGGACGGTTCTTATGTCACAATTGATCCTTCTGTGTCGGACTGGTCTAATTACTTTAAATGTGGTCTCCATGTTG CTCACTCTTTTCTAAAGAAACTTGCACCGGAAAGGTTTGCCAGTGCTCCTCTGGCCGGGCTGCAAGTCTTCTGTGAGGGTGATGTACCA ACTGGCAGTGGATTGTCTTCTTCGGCCGCATTCATTTGTGCCGTTGCTTTAGCTGTTGTTAAAGCGAATATGGGCCCTGGTTATCATAT GTCCAAGCAAAATTTAATGCGTATTACGGTCGTTGCAGAACATTATGTTGGTGTTAACAATGGCGGTATGGATCAGGCTGCCTCTGTTT GCGGTGAGGAAGATCATGCTCTATACGTTGAGTTCAAACCGCAGTTGAAGGCTACTCCGTTTAAATTTCCGCAATTAAAAAACCATGAA ATTAGCTTTGTTATTGCGAACACCCTTGTTGTATCTAACAAGTTTGAAACCGCCCCAACCAACTATAATTTAAGAGTGGTAGAAGTCAC TACAGCTGCAAATGTTTTAGCTGCCACGTACGGTGTTGTTTTACTTTCTGGAAAAGAAGGATCGAGCACGAATAAAGGTAATCTAAGAG ATTTCATGAACGTTTATTATGCCAGATATCACAACATTTCCACACCCTGGAACGGCGATATTGAATCCGGCATCGAACGGTTAACAAAG ATGCTAGTACTAGTTGAAGAGTCTCTCGCCAATAAGAAACAGGGCTTTAGTGTTGACGATGTCGCACAATCCTTGAATTGTTCTCGCGA AGAATTCACAAGAGACTACTTAACAACATCTCCAGTGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAAT CTTTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATG 
AACGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATC ATATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAA AAGAAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCA GCATTGGGCAGCTGTCTATATGAATTATAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAACT TTAGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATA ATGTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGG ATACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAG TTCTTGGCAAGTTGCCAACTGACGAGATGCAGTAAAAAGAGATTGCCGTCTTGAAACTTTTTGTCCTTTTTTTTTTCCGGGGACTCTAC GAGAACCCTTTGTCCTACTGATTAATTTTGTACTGAATTTGGACAATTCAGATTTTAGTAGACAAGCGCGAGGAGGAAAAGAAATGACA GAAAAATTCCGATGGACAAGAAGATAGGAAAAAAAAAAAGCTTTCACCGATTTCCTAGACCGGAAAAAAGTCGTATGACATCAGAATGA AAAATTTTCAAGTTAGACAAGGACAAAATCAGGACAAATTGTAAAGATATAATAAACTATTTGATTCAGCGCCAATTTGCCCTTTTCCA TTTTCCATTAAATCTCTGTTCTCTCTTACTTATATGATGATTAGGTATCATCTGTATAAAACTCCTTTCTTAATTTCACTCTAAAGCAT ACCCCATAGAGAAGATCTTTCGGTTCGAAGACATTCCTACGCATAATAAGAATAGGAGGGAATAATGCCAGACAATCTATCATTACATT TAAGCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAAA GAGTATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATA CAGCTCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTACA ACCAGGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATATC AACACTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCGT TGGTGCTTTTGTTGTTTTGGCCTCTAGAGTTGGATCTGCTTATCATTTGTCATTCCCTATATCATCTAGAGCATCATTCGGTATTTTCT TCTCTTTATGGCCCGTTATTAACAGAGTCGTCATGGCCATCGTTTGGTATAGTGTCCAAGCTTATATTGCGGCAACTCCCGTATCATTA ATGCTGAAATCTATCTTTGGAAAAGATTTACAA

Genes: encode proteins
Regulatory motifs: control gene expression

The power of comparative genomics
Figure labels: 32 mammals (human, mouse, rat, chimp, dog), 9 yeasts, 8 Candida, 12 flies
Comparative genomics reveals selection
– Functional elements mostly conserved
– Non-functional regions mostly diverged
→ Functional regions stand out
Comparative genomics reveals function
– Each type of function under unique constraints (proteins, RNA, motifs each evolve differently)
– Discover them by their distinct evolutionary patterns
→ Evolutionary signatures for each type of element

Comparative genomics leads to…
1. Genome interpretation
– Decode the human genome
– Discover all functional elements
→ The building blocks
2. Cell circuitry
– Discover all control constructs
– Regulatory network properties
→ The interconnections
3. Evolutionary innovation
– Emergence of new functions
– Genome and network duplication
→ The dynamics

Distinguishing genes from non-coding regions
Dmel TGTTCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dsec TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dsim TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dyak TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGCCTTCTACCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dere TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-CTTAGCCATGCGGAGTGCCTCCTGCCATTGCCGTGCGGGCGAGCATGT---GGCTCCAGCATCTTT
Dana TGTCCATAAATAAA-----TCTACAACATTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGACCGTTCATG------CGGCCGTGA---GGCTCCATCATCTTA
Dpse TGTCCATAAATGAA-----TTTACAACATTTAGCTG-CTTAGCCAGGCGGAATGGCGCCGTCCGTTCCCGTGCATACGCCCGTGG---GGCTCCATCATTTTC
Dper TGTCCATAAATGAA-----TTTACAACATTTAGCTG-CTTAGCCAGGCGGAATGCCGCCGTCCGTTCCCGTGCATACGCCCGTGG---GGCTCCATTATTTTC
Dwil TGTTCATAAATGAA-----TTTACAACACTTAACTGAGTTAGCCAAGCCGAGTGCCGCCGGCCATTAGTATGCAAACGACCATGG---GGTTCCATTATCTTC
Dmoj TGATTATAAACGTAATGCTTTTATAACAATTAGCTG-GTTAGCCAAGCCGAGTGGCGCC------TGCCGTGCGTACGCCCCTGTCCCGGCTCCATCAGCTTT
Dvir TGTTTATAAAATTAATTCTTTTAAAACAATTAGCTG-GTTAGCCAGGCGGAATGGCGCC------GTCCGTGCGTGCGGCTCTGGCCCGGCTCCATCAGCTTC
Dgri TGTCTATAAAAATAATTCTTTTATGACACTTAACTG-ATTAGCCAGGCAGAGTGTCGCC------TGCCATGGGCACGACCCTGGCCGGGTTCCATCAGCTTT
***** * * ** *** *** *** ******* ** ** ** * * ** * ** ** ** ** **** * **
Figure annotations: splice site, frame-shifting indels, periodic mutations, synonymous substitutions
Protein-coding genes have specific evolutionary constraints
– Gaps are multiples of three (preserve amino acid translation)
– Mutations are largely 3-periodic (silent codon substitutions)
– Specific triplets exchanged more frequently (conservative substitutions)
– Conservation boundaries are sharp (pinpoint individual splicing signals)
Encode as ‘evolutionary signatures’
– Computational test for each of them
– Combine and score systematically
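The first two signatures on this slide (gap lengths that are multiples of three, and substitutions concentrated at one codon position) can be sketched as simple tests on an alignment. This is an illustrative sketch only, not the scoring used in the talk: the function names and toy sequences are hypothetical, and the phase test assumes the reading frame starts at alignment column 0.

```python
import re
from collections import Counter

def gap_lengths(aligned_seq):
    """Lengths of every maximal run of '-' in one aligned sequence."""
    return [len(m.group()) for m in re.finditer(r"-+", aligned_seq)]

def frame_preserving_fraction(aligned_seqs):
    """Fraction of gaps whose length is a multiple of three (None if no gaps)."""
    lengths = [n for seq in aligned_seqs for n in gap_lengths(seq)]
    if not lengths:
        return None
    return sum(1 for n in lengths if n % 3 == 0) / len(lengths)

def mismatch_phase_counts(ref, other):
    """Count substitutions by alignment column modulo 3.

    Assumption: the reading frame starts at column 0; gapped columns are skipped.
    """
    counts = Counter()
    for i, (a, b) in enumerate(zip(ref, other)):
        if a != "-" and b != "-" and a != b:
            counts[i % 3] += 1
    return [counts[0], counts[1], counts[2]]

# Toy data (hypothetical): one in-frame gap, one frame-shifting gap.
toy_gaps = ["ATG---GCT", "ATGAAAG-T"]
# Toy data (hypothetical): every substitution falls on the third codon position.
toy_ref, toy_other = "ATGGCTAAAGGT", "ATGGCCAAGGGA"

print(frame_preserving_fraction(toy_gaps))
print(mismatch_phase_counts(toy_ref, toy_other))
```

A coding region would be expected to show a high frame-preserving fraction and a strongly skewed phase count, while a non-coding region would look closer to uniform.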

Power of evolutionary signatures
Signatures much more precise than level of conservation
– Before: parsing a genome into high-conservation / low-conservation
– Now: parse into protein-coding conservation / RNA-like / motif-like, etc.
Probabilistic framework
Hidden Markov Models (HMMs)
– Generative model, learn emission, transition probabilities
– Easy to train, hard to integrate long-range signals
Conditional Random Fields (CRFs)
– Discriminative dual of HMMs, learn weights on features
– Easy to integrate diverse signals, gradient ascent for training
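To make the HMM bullet concrete, here is a minimal Viterbi decoder over a two-state model. Everything specific is an assumption for illustration: the state names, the two observation symbols ("syn" for a column whose substitutions look synonymous, "non" otherwise), and all probabilities are invented, not taken from the talk.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state path for an observation sequence (log-space Viterbi)."""
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    backpointers = []
    for symbol in obs[1:]:
        prev = scores[-1]
        col, ptr = {}, {}
        for s in states:
            # Best predecessor state for landing in state s at this position.
            best = max(states, key=lambda p: prev[p] + math.log(trans_p[p][s]))
            col[s] = prev[best] + math.log(trans_p[best][s]) + math.log(emit_p[s][symbol])
            ptr[s] = best
        scores.append(col)
        backpointers.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return path[::-1]

# Hypothetical parameters: coding regions emit mostly synonymous-looking columns.
STATES = ["coding", "noncoding"]
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {"coding": {"coding": 0.9, "noncoding": 0.1},
         "noncoding": {"coding": 0.1, "noncoding": 0.9}}
EMIT = {"coding": {"syn": 0.8, "non": 0.2},
        "noncoding": {"syn": 0.3, "non": 0.7}}

toy_obs = ["syn"] * 4 + ["non"] * 4
print(viterbi(toy_obs, STATES, START, TRANS, EMIT))
```

A CRF would replace the emission and transition tables with learned weights on arbitrary features of the alignment, which is what makes it easier to combine diverse signals.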

Known genes stand out
Figure legend: substitution typical of protein-coding regions; substitution typical of intergenic regions

Ability to identify subtle events
Translation start corrected for 200 genes
– Example: CG6664/FBtr
– Figure labels: previously-annotated start codon (ATG), newly-identified start codon
Stop codon read-through
– Hundreds of read-through regions identified
– Figure labels: protein-coding conservation, stop codon, continued protein-coding conservation, 2nd stop codon, no more conservation
– New mechanism of post-transcriptional control. Many questions remain.
– Enriched in brain proteins, ion channels. Under ADAR control.

Summary: Revisiting fly genome annotation
High-accuracy reannotation
– Ability to detect small genes & exons (40 AA: 95|99|99%, 20 AA: 87|96|99%; values are sen | pre | spe)
– Detect subtle events: sequencing errors, start/stop and splice site changes
– Recognize unusual gene structures → read-through, uORFs, RNA editing
Towards a revised genome annotation
– Curation: FlyBase integrates prediction with cDNA, protein, literature
– Experimentation: BDGP large-scale functional validation of novel exons
Figure labels (species): D. simulans, D. erecta, D. persimilis, D. melanog. (…)
Figure labels (counts, in slide order): 454 genes, 800 genes, 668 genes, 12,000 genes
Figure labels (categories, in slide order): Confirmed, Dubious, Novel, Refined
Powerful approach for comprehensive genome annotation

Comparative genomics
1. Genome interpretation
– Decode the human genome
– Discover all functional elements
→ The building blocks
2. Cell circuitry
– Discover all control constructs
– Regulatory network properties
→ The interconnections
3. Evolutionary innovation
– Emergence of new functions
– Genome and network duplication
→ The dynamics

The regulatory code
Multiple levels of regulation
– Temporal and spatial regulation, disease, development
– Chromatin, pre- / post-transcriptional, splicing, translational
Combinatorial coding of individual motifs
– The core: a relatively small number of regulatory motifs
– Regions: diverse motif combinations specify diverse functions
Regulatory motifs
– Summarize information across thousands of sites. Distinguish: regulatory motifs vs. motif instances
– Challenging to discover: small (6-8 nucleotides), subtle (frequent degenerate positions), dispersed (act at a distance), diverse (sequence composition)
Figure labels: enhancer regions, 5’-UTR, promoter motifs, 3’-UTR, splicing signals, motifs at RNA level

Regulatory motif discovery
Study known motifs → Derive conservation rules → Discover novel motifs

Known motifs are preferentially conserved
human CGGGTAGGCCTGGCCGAAAATCTCTCCCGCGCGCCTGACCTTGGGTTGCCCCAGCCAGGC
dog CAGGC---CCGGGCTGCAGACCTGCCCTGAGGGAATGACCTTGGGCGGCCGCAGCGGGGC
mouse CACAAGCCTGTGGCGCGC-CGTGACCTTGGGCTGCCCCAGGCGGGC
rat CACAAGTTTCTC---TGC-CCTGACCTTGGGTTGCCCCAGGCGAG-
* * * ********** *** *** *
Motif labels: Gabpa, Errα
In multi-species alignments: known motifs → conservation islands
– Conserved biology: conserved regulatory code, same words are functional
– Preferential conservation: stand out from surrounding nucleotides
– Good signal for identifying individual instances of known motifs
Need additional power for motif discovery:
– Conservation not limited to exact binding site → additional bases would be found
– Weakly constrained positions can diverge → real motifs will be missed
– How do we discover motifs de novo?
→ Use basic property of regulatory motifs
→ Evaluate genome-wide conservation over thousands of instances
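The idea of evaluating conservation over many instances, rather than one site at a time, can be sketched as a conserved-instance fraction. This is a simplified illustration under stated assumptions, not the method from the talk: `conserved_fraction`, the toy alignment blocks, and the choice of "human" as reference are hypothetical, an instance counts as conserved only if every species matches exactly at the same columns, and instances interrupted by gaps in the reference are ignored.

```python
def conserved_fraction(motif, alignments, reference="human"):
    """Fraction of reference motif instances that are identical in all species.

    alignments: list of dicts mapping species name -> aligned sequence
    (all sequences in one dict have equal length).
    """
    total = conserved = 0
    for aln in alignments:
        ref = aln[reference]
        start = ref.find(motif)
        while start != -1:
            total += 1
            end = start + len(motif)
            # Conserved only if every aligned species carries the same word here.
            if all(seq[start:end] == motif for seq in aln.values()):
                conserved += 1
            start = ref.find(motif, start + 1)
    return conserved / total if total else 0.0

# Toy alignment blocks (hypothetical): the first instance is conserved, the second is not.
toy_alignments = [
    {"human": "CCTGACCTTGGGT", "mouse": "CGTGACCTTGGGC"},
    {"human": "AATGACCTTGGAA", "mouse": "AATGTCCATGGAA"},
]
print(conserved_fraction("TGACCTTGG", toy_alignments))
```

In a de novo setting, the same fraction would be computed for every candidate word and compared against control words of similar composition, so that real motifs stand out by their excess conservation.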

Systematically discover regulatory motifs
Columns: rank, consensus, MCS, matches to known, expression enrichment (promoters, enhancers)
1 CTAATTAAA 65.6 engrailed (en)
TTKCAATTAA 57.3 reversed-polarity (repo)
WATTRATTK 54.9 araucan (ara)
AAATTTATGCK 54.4 paired (prd)
GCAATAAA 51 ventral veins lacking (vvl)
DTAATTTRYNR 46.7 Ultrabithorax (Ubx)
TGATTAAT 45.7 apterous (ap)
YMATTAAAA 43.1 abdominal A (abd-A) 72.2
9 AAACNNGTT
RATTKAATT
GCACGTGT 39.5 fushi tarazu (ftz)
AACASCTG 38.8 broad-Z3 (br-Z3)
AATTRMATTA
TATGCWAAT
TAATTATG 37.5 Antennapedia (Antp)
CATNAATCA
TTACATAA
RTAAATCAA
AATKNMATTT
ATGTCAAHT
ATAAAYAAA
YYAATCAAA
WTTTTATG 33.8 Abdominal B (Abd-B)
TTTYMATTA 33.6 extradenticle (exd)
TGTMAATA
TAAYGAG
AAAKTGA
AAANNAAA
RTAAWTTAT 32.9 gooseberry-neuro (gsb-n)
TTATTTAYR 32.9 Deformed (Dfd) 30.7

Functional clustering of motifs and tissues

Motif discovery in human enhancer regions
Can identify 40% of enhancers with 50 motifs
– 3X enrichment (vs. 15% of intergenic regions)
Motif combinations further improve performance
– 5X enrichment for top 30 motif combinations
Figure panels: chromatin signatures of enhancer regions; motif signatures of enhancer regions
Figure labels: 74 enhancers, 208 promoters; H3K4me3, RNAPII, p300, H3K4me1
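The enrichment figure on this slide is a ratio of two hit rates. As a quick check of the arithmetic (the helper name is hypothetical, and the two percentages are the ones quoted on the slide):

```python
def fold_enrichment(foreground_rate, background_rate):
    """Ratio of the hit rate in the foreground set to the rate in the background."""
    if background_rate <= 0:
        raise ValueError("background rate must be positive")
    return foreground_rate / background_rate

# 40% of enhancers vs. 15% of intergenic regions, as quoted on the slide.
enhancer_enrichment = fold_enrichment(0.40, 0.15)
print(round(enhancer_enrichment, 2))  # 2.67, which the slide rounds to 3X
```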

Evolutionary signatures for microRNA genes
Genome-wide discovery of miRNAs
– 41 novel miRNA genes. Rediscover 81% of known (61 of 74). Reject 4 dubious.
– 454 sequencing of small RNAs confirms 27 of 41 novel miRNAs (66%).
Genomic properties:
– Introns of known genes, including several transcription factors
– Genomic clustering of known and novel miRNAs: poly-cistronic precursors
– Two ‘dubious’ protein-coding genes are in fact miRNAs
→ Improved annotation of miRNA genes

Functional properties of microRNA targets
Refine annotation of known miRNA genes
– Start adjustments suggested by the evolutionary signatures, confirmed by sequencing
– Small change in start (+2 nucleotides) implies great change in target spectrum (>95%)
miRNA targets
– Novel miRNAs include many novel families → distinct groupings of genes.
– Targets of novel miRNAs show large overlap with targets of known miRNAs → denser miRNA network
miR-10* as a master Hox regulator
– For three genes, both miRNA and miRNA* seem functional by evolution and sequencing.
– For miR-10, the star arm shows stronger signal, more sequencing reads, more predicted targets.
– Both miR-10 and miR-10* target several Hox genes, more than any other miRNA.
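Why a 2-nucleotide shift in the annotated start changes the target spectrum so much follows from how targets are usually matched: through the seed, commonly taken as nucleotides 2-8 of the mature miRNA. The sketch below uses that common convention and an invented RNA string. The sequence and function names are hypothetical, not real miR-10 data.

```python
# RNA complement table for reverse-complementing a seed into its target site.
COMPLEMENT = str.maketrans("ACGU", "UGCA")

def seed(arm_sequence, start=0):
    """7-mer seed (nucleotides 2-8) of a mature miRNA whose 5' end is at `start`."""
    return arm_sequence[start + 1 : start + 8]

def target_site(seed_seq):
    """Sequence in a target mRNA that pairs perfectly with the seed."""
    return seed_seq.translate(COMPLEMENT)[::-1]

# Hypothetical precursor arm, for illustration only.
toy_arm = "GGUACCAGUUCGAUACGGAUCA"

annotated = seed(toy_arm, start=0)
shifted = seed(toy_arm, start=2)
print(annotated, target_site(annotated))
print(shifted, target_site(shifted))
```

The two seeds are different 7-mers, so the sets of matching sites in 3’-UTRs would be expected to overlap very little, which is consistent with the large change in target spectrum noted on the slide.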

Comparative genomics
1. Genome interpretation
– Decode the human genome
– Discover all functional elements
→ The building blocks
2. Cell circuitry
– Discover all control constructs
– Regulatory network properties
→ The interconnections
3. Evolutionary innovation
– Emergence of new functions
– Genome and network duplication
→ The dynamics

Resolving power in mammals, flies, fungi
Mammals: Neutral: 2.57 subs/site (opp: sps: 4.87); Coding: 1.16 subs/site; Detect: 6-mer at FP
Flies (12 flies): Neutral: 4.13 subs/site; Coding: 1.65 subs/site; Detect: 6-mer at
Fungi (17 yeasts: 8 Candida, 9 Yeasts): Neutral: 15.5 subs/site (Yeast: 6.5, Candida: 6.5); Coding: 7.91 subs/site; Detect: 3-mer at
Tree labels: Post-duplication, Pre-dup, Diploid, Haploid, P
Scale bars: sub/site, 0.1 sub/site, 0.8 sub/site