Gene prediction roderic guigó i serra IMIM/UPF/CRG.

Presentation transcript:

gene prediction roderic guigó i serra IMIM/UPF/CRG

number of genes in chromosome 22
initial annotation: 545 (Dunham et al., 1999)
genscan+RT-PCR: 590 (Das et al., 2001)
genscan+microarrays: 730 (Shoemaker et al., 2001)
reviewed annotation: 726 (chr22 team, sanger, 2001)
mouse shotgun data: +20 (our data)
geneid predictions: 794
genscan predictions: 1128

number of genes in human genome
Consortium
Celera
Consortium+Celera (Hogenesch et al.)
DB searches (Wright et al., 2001)
Human Genome Sciences (Haseltine, 2001)

decoding the genome
ACTCAGCCCCAGCGGAGGTGAAGGACGTCCTTCCCCAGGAGCCGGTGAGAAGCGCAGTCGGGGGCACGGGGATGAGCTCAGGGGCCTCTAGAAAGAT
GTAGCTGGGACCTCGGGAAGCCCTGGCCTCCAGGTAGTCTCAGGAGAGCTACTCAGGGTCGGGCTTGGGGAGAGGAGGAGCGGGGGTGAGGCCAGCA
GCAGGGGACTGGACCTGGGAAGGGCTGGGCAGCAGAGACGACCCGACCCGCTAGAAGGTGGGGTGGGGAGAGCATGTGGACTAGGAGCTAAGCCACA
GCAGGACCCCCACGAGTTGTCACTGTCATTTATCGAGCACCTACTGGGTGTCCCCAGTGTCCTCAGATCTCCATAACTGGGAAGCCAGGGGCAGCGA
CACGGTAGCTAGCCGTCGATTGGAGAACTTTAAAATGAGGACTGAATTAGCTCATAAATGGAAAACGGCGCTTAAATGTGAGGTTAGAGCTTAGAAT
GTGAAGGGAGAATGAGGAATGCGAGACTGGGACTGAGATGGAACCGGCGGTGGGGAGGGGGAGGGGGTGTGGAATTTGAACCCCGGGAGAGAAAGAT
GGAATTTTGGCTATGGAGGCCGACCTGGGGATGGGGAAATAAGAGAAGACCAGGAGGGAGTTAAATAGGGAATGGGTTGGGGGCGGCTTGGTAACTG
TTTGTGCTGGGATTAGGCTGTTGCAGATAATGGAGCAAGGCTTGGAAGGCTAACCTGGGGTGGGGCCGGGTTGGGGTCGGGCTGGGGGCGGGAGGAG
TCCTCACTGGCGGTTGATTGACAGTTTCTCCTTCCCCAGACTGGCCAATCACAGGCAGGAAGATGAAGGTTCTGTGGGCTGCGTTGCTGGTCACATT
CCTGGCAGGTATGGGGCGGGGCTTGCTCGGTTTTCCCCGCTTCTCCCCCTCTCATCCTCACCTCAACCTCCTGGCCCCATTCAAGCACACCCTGGGC
CCCCTCTTCTTCTGCTGGTCTGTCCCCTGAGGGGAAAGCCCAGGTCTGAGGCTTCTATGCTGCTTTCTGGCTCAGAACAGCGATTTGACGCTCTGTG
AGCCTCGGTTCCTCCCCCGCTTTTTTTTTTTCAGCCAGAGTCTCACTCTGTCGCCCAGGCTGGAGTGCAGTGGCGCAATCTCAGCTCACTGCAAGCT
CCGCCTCCCGGGTTCACGCTATTCTCCCGCCTCAGCCTCCCGAGTAGCTGGGACTACAGGCGCCCGCCACCATGCCCGGCTAATTTTTTGTACTTTG
AGTAGGGAAGGGGTTTCACTGTATTATCCAGGATGGTCTCTATCTCCTGACCTCGTGATCTGCCCGCCTGGCCTCCCAAAGTGCTGGAATTACAGGC
GTGAGCCTCCGCGCCCGGCCTCCCCATCCTTAATATAGGAGTTAGAAGTTTTTGTTTGTTTGTTTTGTTTTGTTTTTGTTTTGTTTTGAGATGAAGT
CCCTCTGTCGCCCAGGCTGGAGTGCAGTGGCTCCCAGGCTGGAGTTCAGTGGCTGGATCTCGGCTCACTGCAAGCTCCGCCTCCCAGGTTCACGCCA
TTCTCCTGCCTCAGCCTCCGGAGTAGCTGGGACTACAGGAACATGCCACCACACCCGACTAACTTTTTTTGTATTTTTAGTAGAGACGGGGTTTCAC
CATGTTGGCCAGGCTGGTCTGGAACTCCTGACCTCAGGTGATCTGCCTGCTTCAACCTCCCAAAGTGCTGGGATTACAGACGTGGGCCACCGCGCCC
GGCTGGGAGTTAAGAGGTTTCTAATGCATTGCATTAGAATACCAGACACGGGACAGCTGTGATCTTTATTCTCCATCACCCCACACAGCCCTGCCTG
GGGCACACAAGGACACTCAATACACGCTTTTCGGGCGCGGTGGCTCAAGCTGTAATCCCAGCACTTTGGGAGGCTGAGGCGGGTGGTACATGAGGTC
AGGAGATCGAGACCATCCTGGCTAACATGGTGAAACCCCGTCTCTACTAAAAATACAAAAAACTAGCCCGGGCGTGGTGGCGGGCGCCTGTAGTCCC
AGCTACTCGGAGGCTGAGGCAGGAGAATGGCGTGAACCTGGGAGGCGGAGCTTGCAGTGAGCCGAGATCGCGCCACTGCACTCCAGCCTGGGTGACA
CAGCGCGAGACTCCGTCTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATACACGCTTTTCCGCTAGGCACGGTGGCTCACCCCTGTAATCCCAGCA
TTTTGGGAGGCCAAGGTGGGAGGATCACTTGAGCCCAGGAGTTCAACACCAGACTCAGCAACATAGTGAGACTCTCTCTACTAAAAATACAAAAATT
AGCCAGGCCTGGTGCCACACACCTGTGGTCCCAGCTACTCAGAAGGCTAAGGCAGGAGGATCGCTTAAGCCCAGAAGGTCAAGGTTGCAGTGAACCA
CGTTCAGGCCACTGCAGTCCAGCCTGGGTGACAGAGCAAGACCCTGTCTGTAAATAAATAACGCTTTTCAAGTGATTAAACAGACTCCCCCCTCACC
CTGCCCACCATGGCTCCAAAGCAGCATTTGTGGAGCACCTTCTGTGTGCCCCTAGGTACTAGCTGCCTGGACGGGGTCAGAAGGAACCTGAACCACC
TTCAACTTGTTCCACACAGGATGCCAGGCCAAGGTGGAGCAACCGGTGGAGCCAGAGACAGAACCCGACGTTCGCCAGCAGGCTGAGTGGCAGAGCG
GCCAGCCCTGGGAGCTGGCACTGGGTCGCTTTTGGGATTACCTGCGCTGGGTGCAGACACTGTCTGAGCAGGTGCAGGAGGAGCTGCTCAGCCCCCA
GGTCACCCAGGAACTGACGTGAGTGTCCCCATCCCGGCCCTTGACCCTCCTGGTGGGCGGCTATACCTCCCCAGGTCCAGGTTTCATTCTGCCCCTG
CCACTAAGTCTTGGGGGCCTGGGTCTCTGCTGGTTCTAGCTTCCTCTTCCCATTTCTGACTCCTGGCTTTAGCTCTCTGGAATTCTCTCTCTCAGTT
CTGTTTCTCCCTCTTCCCTTCTGACTCAGCCTGTCACACTCGTCCTGGCGCTGTCTCTGTCCTTCACTAGCTCTTTTATATAGAGACAGAGAGATGG
GGTCTCACTGTGTTGCCCAGGCTGGTCTTGAACTTCTGGGCTCAAGCGATCCTCCCACCTCGCCTCCCAAAGTGCTGGGAATAGAGACATGAGCCAC
CTTGCTCGGCCTCCTAGCTCTTTCTTCGTCTCTGCCTCTGCTCTCTGCGTCTGTCTTTGTCTCCTCTCTGCCTCTGTCCCGTTCCTTCTCTCTTGGT
TCACTGCCCTTCTGTCTCTCCCTGTTCTCCTTAGGAGACTCTCCTCTCTTCCTTCTCGAGTCTCTCTGGCTGATCCCCATCTCACCCACACCTATCC
the human genome sequence

QIKDLLVSSSTDLDTTLVLVNAIYFKGMW
KTAFNAEDTREMPFHVTKQESKPVQMMCM
NNSFNVATLPAEKMKILELPFASGDLSML
VLLPDEVSDLERIEKTINFEKLTEWTNPN
TMEKRRVKVYLPQMKIEEKYNLTSVLMAL
GMTDLFIPSANLTGISSAESLKISQAVHG
AFMELSEDGIEMAGSTGVIEDIKHSPESE
QFRADHPFLFLIKHNPTNTIVYFGRYWSP
the amino acid sequence of the proteins

gene structure
exons; introns; 'upstream' regulatory element; 'downstream' regulatory element; promoter

from DNA to RNA

from RNA to protein

molecular mechanism

Prediction of splice sites
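
The slide gives only its title, so the following is a minimal sketch of the simplest possible first step, not the method used in the talk. It assumes the canonical intron boundaries (GT at the donor end, AG at the acceptor end) and only lists candidate positions. Gene finders typically go further and score each candidate, for example with position weight matrices. The function name is hypothetical.

```python
def find_splice_site_candidates(seq):
    """List candidate splice sites by canonical dinucleotide (assumed GT...AG rule).

    Returns (donors, acceptors): 0-based positions of the G in each "GT"
    and of the A in each "AG". No scoring is done; every match is a candidate.
    """
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return donors, acceptors


if __name__ == "__main__":
    donors, acceptors = find_splice_site_candidates("CAGGTAAGTTTTTCAGG")
    print("donor candidates:", donors)
    print("acceptor candidates:", acceptors)
```

Most candidates found this way are false positives, which is why a scoring model is needed on top of the scan.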

accuracy of gene prediction programs

comparative gene prediction
rosetta (Batzoglou et al., 2000)
cem (Bafna and Huson, 2000)
sgp1 (Wiehe et al., 2000)
twinscan (Korf et al., 2001)
slam (Pachter et al., 2001)
sgp2 (Guigó et al., in preparation)

syntenic gene prediction (sgp2)
Query Sequence; tblastx HSPs; geneid Exons; HSPs Projections; SGP Exons

benchmarking sgp2 - accuracy scimog mit

Predicting “novel” genes in the human genome
golden path annotations; additional blastn matches to ENSEMBL + REFSEQ; tblastx; geneid exons; tblastx; sgp genes
Golden Path Oct 7, 2000 freeze, RepeatMasked. TraceDB, as of February 2001.

“novel” genes?
48,890 genic regions (known genes or similar)
15,489 genes longer than 100 aa predicted by sgp
13,302 non redundant predictions
8,416 supported by tblastx hits to mouse 1.5
3,331 predicted genes with at least two exons supported by tblastx hits
predicted genes supported by tblastx hits covering at least 75% of the prediction
4,050 supported sgp predictions
25% of them not overlapping genscan predictions
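
The "at least 75% of the prediction" filter can be read as an interval-coverage test. The sketch below is one possible reading, not the pipeline used for these numbers. It assumes a prediction is a single half-open interval and that tblastx hits arrive as (start, end) pairs on the same coordinates. A real gene prediction would be measured over its exons. All names are hypothetical.

```python
def covered_fraction(pred_start, pred_end, hsps):
    """Fraction of the half-open interval [pred_start, pred_end) covered by HSPs.

    hsps is an iterable of (start, end) half-open intervals; overlapping
    hits are counted once (union of intervals).
    """
    length = pred_end - pred_start
    if length <= 0:
        raise ValueError("prediction must have positive length")
    # Clip every hit to the prediction and drop hits that fall outside it.
    clipped = sorted(
        (max(s, pred_start), min(e, pred_end))
        for s, e in hsps
        if min(e, pred_end) > max(s, pred_start)
    )
    covered = 0
    cur_end = pred_start
    for s, e in clipped:
        if e <= cur_end:
            continue  # already fully counted
        covered += e - max(s, cur_end)
        cur_end = e
    return covered / length


def is_supported(pred_start, pred_end, hsps, min_fraction=0.75):
    """True when the hits cover at least min_fraction of the prediction."""
    return covered_fraction(pred_start, pred_end, hsps) >= min_fraction


if __name__ == "__main__":
    hits = [(90, 150), (140, 180), (300, 400)]
    print(covered_fraction(100, 200, hits), is_supported(100, 200, hits))
```

Taking the union of the hits matters here: summing hit lengths directly would count overlapping hits twice and inflate the coverage.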

validation of predictions
EST identity: 18%
NR similarity: 31%
CDD (NCBI): 24%
Mouse ESTs: 28%
Rat ESTs: 19%
Tetraodon: 15%
at least one of the above: 56%

Experimental validation

human genome vs. mouse TraceDB
chr22 chr21

human genome vs. mouse assemblies
columns: SN SP CC SNe SPe SNSP ME WE
rows: chr22.assem; chr22.shot
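
The column labels on this slide are the usual gene-prediction accuracy measures: sensitivity (SN), specificity (SP) and correlation coefficient (CC) at the nucleotide level, then exon-level sensitivity and specificity (SNe, SPe), their average (SNSP), missing exons (ME) and wrong exons (WE). The slide's values did not survive, so the sketch below only shows how such measures are commonly computed. It assumes coding positions are given as sets and exons as half-open (start, end) pairs. Function names are hypothetical.

```python
import math


def nucleotide_accuracy(true_coding, pred_coding, seq_len):
    """Nucleotide-level SN, SP and CC from sets of coding positions."""
    tp = len(true_coding & pred_coding)
    fp = len(pred_coding - true_coding)
    fn = len(true_coding - pred_coding)
    tn = seq_len - tp - fp - fn
    nan = float("nan")
    sn = tp / (tp + fn) if tp + fn else nan
    # "Specificity" in this field is TP / (TP + FP), i.e. precision.
    sp = tp / (tp + fp) if tp + fp else nan
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fn * fp) / denom if denom else nan
    return {"SN": sn, "SP": sp, "CC": cc}


def _overlaps(a, b):
    """True when two half-open intervals share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def exon_accuracy(true_exons, pred_exons):
    """Exon-level SNe, SPe, SNSP, ME and WE; both exon lists must be non-empty."""
    true_set, pred_set = set(true_exons), set(pred_exons)
    correct = len(true_set & pred_set)  # exact match on both boundaries
    sne = correct / len(true_set)
    spe = correct / len(pred_set)
    missing = sum(1 for t in true_set if not any(_overlaps(t, p) for p in pred_set))
    wrong = sum(1 for p in pred_set if not any(_overlaps(p, t) for t in true_set))
    return {
        "SNe": sne,
        "SPe": spe,
        "SNSP": (sne + spe) / 2,
        "ME": missing / len(true_set),
        "WE": wrong / len(pred_set),
    }


if __name__ == "__main__":
    print(nucleotide_accuracy(set(range(10, 20)), set(range(15, 25)), 100))
    print(exon_accuracy([(10, 20), (30, 40), (50, 60)],
                        [(10, 20), (32, 40), (70, 80)]))
```

An exon predicted with one wrong boundary counts as neither correct, missing nor wrong, which is why SNe and ME do not add up to one.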

testing novel predictions experimentally
chr22 chr21 776Predicted known low complexity-5 -26short intronless
In total, 81 predictions. For 40 of them, adjacent exon pairs were selected for RT-PCR.

preliminary results
set                              N    success rate
Positive controls
  refseq                         78   96%
  known tissue specific genes    20   25%
  low expressing genes           13   not ready
  Twinscan with EST support           not ready
Test sets
  Twinscan                            not ready
  SGP                            40   28%

acknowledgments
IMIM-UPF-CRG, Barcelona: Josep F. Abril, Genís Parra, Roderic Guigó
GlaxoSmithKline, King of Prussia: Pankaj Agarwal
Max Planck Institute for Chemical Ecology, Jena: Thomas Wiehe
Whitehead Institute/MIT Center for Genome Research, Cambridge: Gwen Acton, Dan Brown, Kerstin
Mouse Sequence Consortium