Finding genes in human using the mouse / Finding genes in mouse using the human. Lior Pachter, Department of Mathematics, U.C. Berkeley.


The Gene Finding Problem
[Figure: gene structure along the DNA, 5’ to 3’. Labels: promoter (TATA); translation initiation (ATG); exons 1-4 separated by introns 1-3; splice site (GGTGAG); branchpoint (CTGAC); pyrimidine tract; splice site (CAG); stop codon (TAG/TGA/TAA); polyA signal.]

Approaches to Gene Recognition
– Naïve (mid-80s to mid-90s): ORFfinder, BLAST, ...
– Statistical de novo: Genie (96), Genscan (97), FGENESH, ...
– Systems: Ensembl, ...
“Ask not what mathematics can do for biology, ask what biology can do for mathematics” - Stanislaw Ulam

Difficulty of naïve approaches
n = number of acceptor splice sites
m = number of donor splice sites
Number of gene structures = $F_{n+m+1}$ (Fibonacci number): 1, 1, 2, 3, 5, 8, 13, 21, 34, …
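To get a feel for how quickly this count grows, the following minimal sketch evaluates the slide's formula directly. It assumes the indexing F_1 = F_2 = 1 implied by the sequence listed on the slide; the function names are illustrative and do not come from the talk.

```python
def fibonacci(k: int) -> int:
    """Return F_k with F_1 = F_2 = 1 (the indexing of the sequence 1, 1, 2, 3, 5, ...)."""
    if k < 1:
        raise ValueError("k must be >= 1")
    a, b = 1, 1  # a = F_1, b = F_2
    for _ in range(k - 1):
        a, b = b, a + b  # advance one step: a becomes the next Fibonacci number
    return a


def gene_structure_count(n_acceptors: int, m_donors: int) -> int:
    """Evaluate the slide's formula F_{n+m+1}, taken at face value."""
    return fibonacci(n_acceptors + m_donors + 1)


if __name__ == "__main__":
    for sites in (1, 5, 10, 20):
        print(sites, sites, gene_structure_count(sites, sites))
```

With only 20 candidate sites of each type the count already exceeds 10^8, which is why enumerating every structure is impractical.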

Statistical gene finding
Example sequence: TAATATGTCCACGGTTGTACACGGCAG GTATTG AG GTATTG AG ATGTAACTG AA

Using GHMMs for ab initio gene finding
– In practice, we have only the observed sequence.
– Predict genes by estimating the hidden state sequence.
– Usual solution: the single most likely sequence of hidden states (Viterbi).
Example sequence: TAATATGTCCACGGTTGTACACGGCAG GTATTG AG GTATTG AG ATGTAACTG AA
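The Viterbi step can be sketched for a plain HMM as below. This is a simplification, not the generalized HMM used by real gene finders: a GHMM also models the duration of each state explicitly. The two states and every probability here are made-up values chosen only for illustration.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the single most likely hidden state path for obs (log-space Viterbi)."""
    if not obs:
        return []
    # scores[s] = best log-probability of any path that ends in state s
    scores = {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}
    backpointers = []
    for symbol in obs[1:]:
        new_scores, pointers = {}, {}
        for s in states:
            # best predecessor state for s at this position
            prev = max(states, key=lambda r: scores[r] + math.log(trans_p[r][s]))
            new_scores[s] = (scores[prev] + math.log(trans_p[prev][s])
                             + math.log(emit_p[s][symbol]))
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # trace back from the best final state
    path = [max(states, key=lambda s: scores[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical toy model: GC-rich "exon" state, AT-rich "intron" state.
STATES = ("exon", "intron")
START = {"exon": 0.5, "intron": 0.5}
TRANS = {"exon": {"exon": 0.9, "intron": 0.1},
         "intron": {"exon": 0.1, "intron": 0.9}}
EMIT = {"exon": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "intron": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}

if __name__ == "__main__":
    print(viterbi("GGGGAAAA", STATES, START, TRANS, EMIT))
```

Working in log space avoids numerical underflow on long sequences, which matters at genomic scale.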

Results
– High sensitivity / low specificity
– Exon / intron length distributions
– Identification of GC isochore / gene richness dependence
– Splice site models

Comparative Gene Finding

Comparison of 1196 orthologous mRNAs (Makalowski et al., 1996)
Sequence identity:
– exons: 84.6%
– protein: 85.4%
– 5’ UTRs: 67%
– 3’ UTRs: 69%
27 proteins were 100% identical.

Comparison of 117 complete genes (Batzoglou/Pachter et al.)
– …% of genes have an equal number of coding exons (exceptions: Spermidine Synthase, Lymphotoxin Beta)
– 73% of coding exons have equal length
– 95% of coding exons have length equal mod 3
– Intron conservation: 35%
– Intron length ratio (longer/shorter): 1.5
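The two exon-length statistics above can be computed from paired exon lengths as in the following sketch. The input pairs are invented for illustration and are not data from the study; the function name is hypothetical.

```python
def exon_length_agreement(pairs):
    """Given (human_len, mouse_len) pairs for orthologous coding exons, return
    (fraction with equal length, fraction with length equal mod 3)."""
    if not pairs:
        raise ValueError("need at least one exon pair")
    total = len(pairs)
    equal = sum(1 for h, m in pairs if h == m)
    # lengths that differ by a multiple of 3 leave the reading frame unchanged
    same_frame = sum(1 for h, m in pairs if (h - m) % 3 == 0)
    return equal / total, same_frame / total


if __name__ == "__main__":
    # Hypothetical exon pairs, for illustration only
    example = [(120, 120), (87, 90), (200, 201), (150, 150)]
    print(exon_length_agreement(example))
```

A length difference that is a multiple of 3 corresponds to whole codons inserted or deleted, which is why agreement mod 3 is much more common than exact agreement.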

SLAM: alignment & gene finding
Input:
– Pair of syntenic sequences (FASTA).
Output:
– CDS and CNS predictions in both sequences.
– Protein predictions.
– Protein and CNS alignment.

SLAM components
Splice site detector
– VLMM
Intron and intergenic regions
– 2nd order Markov chain
– independent geometric lengths
Coding sequence
– PHMM on protein level
– generalized length distribution
Conserved non-coding sequence
– PHMM on DNA level
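As a rough illustration of the intron/intergenic component, here is a minimal sketch of a second-order Markov chain over DNA combined with a geometric length distribution. It is not SLAM's implementation: the training procedure, the pseudocount, and all names are assumptions made for this example.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_second_order(seqs, pseudocount=1.0):
    """Estimate P(x_i | x_{i-2} x_{i-1}) from training sequences, with pseudocounts."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in seqs:
        for i in range(2, len(seq)):
            counts[seq[i - 2:i]][seq[i]] += 1
    model = {}
    for a in ALPHABET:
        for b in ALPHABET:
            ctx = a + b
            total = sum(counts[ctx][c] for c in ALPHABET) + pseudocount * len(ALPHABET)
            model[ctx] = {c: (counts[ctx][c] + pseudocount) / total for c in ALPHABET}
    return model


def log_prob(seq, model):
    """Log-probability of seq[2:] given its first two bases."""
    return sum(math.log(model[seq[i - 2:i]][seq[i]]) for i in range(2, len(seq)))


def geometric_log_pmf(length, p):
    """log P(L = length) for a geometric distribution on 1, 2, 3, ..."""
    return (length - 1) * math.log(1 - p) + math.log(p)


if __name__ == "__main__":
    model = train_second_order(["ACGACGACG"])
    # score = sequence content + region length (p = 0.01 is an arbitrary choice)
    print(log_prob("ACGACG", model) + geometric_log_pmf(6, 0.01))
```

A geometric length distribution is what an ordinary HMM self-loop produces implicitly; the coding component is listed with a generalized length distribution instead, which requires modelling state durations explicitly.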

Input:

Output:

What have we learned from comparative gene finding?
– Conservation is a stronger splice site indicator than consensus
– Intron lengths have diverged
– Gene structure conservation is more powerful than sequence conservation for prediction
– Consensus for GC splice sites

SLAM whole genome run
– Align the genomes
– Construct a synteny map
– Chop up into SLAMable pieces
– Run SLAM
– Collate results

Alignment project:

Linux cluster with GHz PC, 750 MB of RAM. Three days to align the entire mouse genome against the human genome.

Finding regulatory regions: Godzilla
Gene name: Enolase
– Experimentally defined enhancer (beta-enolase)

Experimental gene verification with RT-PCR
[Figure labels: predicted intron, primer]
– Intron > 1000 bp
– Aligning human/mouse
– Exons > 60 bp
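One way to read the length thresholds on this slide is as a filter on which predicted introns are worth testing by RT-PCR. The sketch below assumes that reading: a candidate is kept when the predicted intron is longer than 1000 bp and both flanking exons are longer than 60 bp. The record type and field names are hypothetical, and the human/mouse alignment criterion is not modelled.

```python
from dataclasses import dataclass


@dataclass
class IntronCandidate:
    """A predicted intron with its two flanking exons (hypothetical record type)."""
    upstream_exon_len: int
    intron_len: int
    downstream_exon_len: int


def select_for_rtpcr(candidates, min_intron=1000, min_exon=60):
    """Keep candidates whose intron and both flanking exons exceed the thresholds."""
    return [
        c for c in candidates
        if c.intron_len > min_intron
        and c.upstream_exon_len > min_exon
        and c.downstream_exon_len > min_exon
    ]


if __name__ == "__main__":
    # Invented candidates, for illustration only
    candidates = [
        IntronCandidate(120, 2500, 95),   # passes all thresholds
        IntronCandidate(45, 3000, 200),   # upstream exon too short
        IntronCandidate(150, 800, 150),   # intron too short
    ]
    print(select_for_rtpcr(candidates))
```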

SLAM CNS data

Single exon data

Acknowledgments
– Marina Alexandersson, Gothenburg, Sweden (SLAM)
– Nick Bray, LBNL/UCB math (Avid alignment program)
– Simon Cawley, Affymetrix (SLAM)
– Olivier Couronne, LBNL (Godzilla)
– Colin Dewey, Berkeley (SLAM)
– Alex Poliakov, LBNL (Godzilla, VISTA)
– Chuck Sugnet, UCSC (SLAM)
– Inna Dubchak, LBNL
– Eddy Rubin, LBNL