Jul-01-08 06/16/08 Bioinformatics Workshop - Malaga
Genome Bioinformatics
Tyler Alioto
Center for Genomic Regulation
Barcelona, Spain

Node 1 of the INB
GN1 Bioinformática y Genómica
Genome Bioinformatics Lab, CRG
Roderic Guigó (PI)

Themes
- Gene prediction
  - ab initio => GeneID
  - dual-genome => SGP2
  - U12 introns => GeneID v1.3 and U12DB
  - combiner => GenePC
- Genome feature visualization: gff2ps
- Alternative splicing: ASTALAVISTA
- Gene expression regulatory elements: meta and mmeta alignment

Eukaryotic gene structure
[Figure: gene diagram labelled with exons, introns, upstream regulator, downstream regulator, promoter, acceptor and donor sites]

The Splicing Code

Gene Prediction Strategies
Expressed sequence (cDNA) or protein sequence available?
- Yes => Spliced alignment: BLAT, Exonerate, est_genome, spidey, GMAP, Genewise
- No => Integrated gene prediction. Informant genome(s) available?
  - Yes => Dual or n-genome de novo predictors: SGP2, Twinscan, NSCAN (Genomescan: same or cross genome protein blastx)
  - No => ab initio predictors: geneid, genscan, augustus, fgenesh, genemark, etc.
Many newer gene predictors can run in multiple modes depending on the evidence available.
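
The decision tree above can be sketched as a small function. This is one reading of the flattened slide (the informant-genome question is asked only when no expressed or protein sequence is available); the function and argument names are hypothetical, and the tool lists are copied from the slide.

```python
# Hypothetical helper: maps the two yes/no questions on the slide to a strategy family.
TOOLS = {
    "spliced alignment": ["BLAT", "Exonerate", "est_genome", "spidey", "GMAP", "Genewise"],
    "dual or n-genome de novo prediction": ["SGP2", "Twinscan", "NSCAN"],
    "ab initio prediction": ["geneid", "genscan", "augustus", "fgenesh", "genemark"],
}


def choose_strategy(has_expressed_or_protein: bool, has_informant_genome: bool) -> str:
    """Return the strategy family suggested by the decision tree."""
    if has_expressed_or_protein:
        return "spliced alignment"
    if has_informant_genome:
        return "dual or n-genome de novo prediction"
    return "ab initio prediction"


if __name__ == "__main__":
    strategy = choose_strategy(has_expressed_or_protein=False, has_informant_genome=True)
    print(strategy, "->", ", ".join(TOOLS[strategy]))
```

In practice the branches are not exclusive: as the slide notes, many newer predictors can combine several kinds of evidence in one run.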

Gene Prediction Strategies

Frameworks for gene prediction
- Hierarchical exon-building and chaining
- Hidden Markov Models (many flavors): HMM, GHMM, GPHMM, Phylo-HMM
- Conditional Random Fields (new!): Conrad, Contrast...
- and, no doubt, more to come
All of them involve parsing the optimal path of exons using dynamic programming (e.g. GenAmic, Viterbi algorithms)
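
As an illustration of the dynamic programming step, here is a minimal Viterbi sketch for a toy two-state HMM. The states ("C" for coding, "N" for noncoding) and all probabilities are invented for the example; real gene finders use far richer state spaces (GHMMs with explicit exon and intron states).

```python
import math


def viterbi(seq, states, log_start, log_trans, log_emit):
    """Return the most probable state path for seq (all parameters in log space)."""
    if not seq:
        return []
    # best[i][s]: log-probability of the best path that ends in state s at position i
    best = [{s: log_start[s] + log_emit[s][seq[0]] for s in states}]
    back = []  # back[i][s]: best predecessor of state s at position i + 1
    for ch in seq[1:]:
        row, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            row[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][ch]
            ptr[s] = prev
        best.append(row)
        back.append(ptr)
    # Trace back from the best final state
    path = [max(states, key=lambda s: best[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


def _log_table(table):
    return {k: {x: math.log(p) for x, p in row.items()} for k, row in table.items()}


# Toy parameters (assumptions, not trained values): coding is GC-rich, noncoding AT-rich.
STATES = ["C", "N"]
LOG_START = {"C": math.log(0.5), "N": math.log(0.5)}
LOG_TRANS = _log_table({"C": {"C": 0.9, "N": 0.1}, "N": {"C": 0.1, "N": 0.9}})
LOG_EMIT = _log_table({
    "C": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "N": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
})

if __name__ == "__main__":
    print("".join(viterbi("GGCCGGAATTAA", STATES, LOG_START, LOG_TRANS, LOG_EMIT)))
```

Working in log space avoids numerical underflow on long sequences, which matters for genome-scale input.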

How does GeneID approach gene prediction?

The gene prediction problem
[Figure: three levels (sites, exons, genes). Acceptor sites a1-a4 and donor sites d1-d5 define candidate exons e1-e8; the example gene is assembled from exons e1, e4 and e8]

GeneID
Geneid follows a hierarchical structure: signal => exon => gene
Exon score: score of exon-defining signals + protein-coding potential (log-likelihood ratios)
Dynamic programming algorithm: maximize score of assembled exons => assembled gene
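
The assembly step can be sketched as a chaining problem: pick the set of non-overlapping candidate exons with the highest total score. This is a simplified stand-in, not geneid's actual algorithm: it ignores reading-frame compatibility, exon types and intron length limits, and the function name and example scores are hypothetical.

```python
from bisect import bisect_left


def best_exon_chain(exons):
    """exons: list of (start, end, score) with inclusive coordinates.

    Returns (total_score, chain) for the best-scoring set of non-overlapping exons.
    """
    exons = sorted(exons, key=lambda e: e[1])  # process exons by increasing end
    ends = [e[1] for e in exons]
    # best[i] = (score, chain) using only the first i exons in sorted order
    best = [(0.0, [])]
    for i, (start, end, score) in enumerate(exons):
        # j = number of earlier exons that end strictly before this one starts
        j = bisect_left(ends, start, 0, i)
        take = (best[j][0] + score, best[j][1] + [(start, end, score)])
        skip = best[i]
        best.append(take if take[0] > skip[0] else skip)
    return best[-1]


if __name__ == "__main__":
    candidates = [(1, 100, 5.0), (50, 150, 6.0), (120, 200, 4.0),
                  (160, 260, 3.0), (210, 300, -1.0)]
    print(best_exon_chain(candidates))
```

Because exon scores are log-likelihood ratios, a candidate with a negative score is less likely under the exon model than under the background and is never worth adding to the chain.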

Training GeneID
[Figure labels: T G C A]
Example splice donor sites:
GAGGTAAAC
TCCGTAAGT
CAGGTTGGA
ACAGTCAGT
TAGGTCATT
TAGGTACTG
ATGGTAACT
CAGGTATAC
TGTGTGAGT
AAGGTAAGT
Example sequence:
ATGGCAGGGACCGTGACGGAAGCCTGGGATGTGGCAGTATTTGCTGCCCGACGGCGCAAT
GATGAAGACGACACCACAAGGGATAGCTTGTTCACTTATACCAACAGCAACAATACCCGG
GGCCCCTTTGAAGGTCCAAACTATCACATTGCGCCACGCTGGGTCTACAATATCACTTCT
GTCTGGATGATTTTTGTGGTCATCGCTTCAATCTTCACCAATGGTTTGGTATTGGTGGCC
ACTGCCAAATTCAAGAAGCTACGGCATCCTCTGAACTGGATTCTGGTAAACTTGGCGATA
GCTGATCTGGGTGAGACGGTTATTGCCAGTACCATCAGTGTCATCAACCAGATCTCTGGC
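
A rough sketch of how a signal model might be trained from aligned sites like the ten donor sites on this slide: count nucleotides per position, then convert to log-likelihood ratios against a background. The pseudocount of 1 and the uniform 0.25 background are assumptions made for the example; geneid's real parameter files are built from much larger training sets and may use higher-order models.

```python
import math

# The ten donor sites shown on the slide (GT at positions 4-5 of each 9-mer).
DONOR_SITES = [
    "GAGGTAAAC", "TCCGTAAGT", "CAGGTTGGA", "ACAGTCAGT", "TAGGTCATT",
    "TAGGTACTG", "ATGGTAACT", "CAGGTATAC", "TGTGTGAGT", "AAGGTAAGT",
]


def train_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a position weight matrix of log2 likelihood ratios from aligned sites."""
    pwm = []
    for pos in range(len(sites[0])):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({base: math.log2(counts[base] / total / background) for base in "ACGT"})
    return pwm


def score_site(pwm, site):
    """Sum the per-position log-likelihood ratios for a candidate site."""
    return sum(column[base] for column, base in zip(pwm, site))


if __name__ == "__main__":
    pwm = train_pwm(DONOR_SITES)
    for candidate in ("AAGGTAAGT", "CCCCCCCCC"):
        print(candidate, round(score_site(pwm, candidate), 2))
```

A positive score means the candidate looks more like the training sites than like background sequence; a negative score means the opposite.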

Running GeneID
command line or on geneid server

NAME
    geneid - a program to annotate genomic sequences

SYNOPSIS
    geneid [-bdaefitnxszr] [-DA] [-Z] [-p gene_prefix] [-G] [-3] [-X] [-M] [-m] [-WCF] [-o] [-j lower_bound_coord] [-k upper_bound_coord] [-O ] [-R ] [-S ] [-P ] [-E exonweight] [-V evidence_exonweight] [-Bv] [-h]

RELEASE
    geneid v 1.3

OPTIONS
    -b: Output Start codons
    -d: Output Donor splice sites
    -a: Output Acceptor splice sites
    -e: Output Stop codons
    -f: Output Initial exons
    -i: Output Internal exons
    -t: Output Terminal exons
    -n: Output introns
    -s: Output Single genes
    -x: Output all predicted exons
    -z: Output Open Reading Frames
    -D: Output genomic sequence of exons in predicted genes
    -A: Output amino acid sequence derived from predicted CDS
    -p: Prefix this value to the names of predicted genes, peptides and CDS
    -G: Use GFF format to print predictions
    -3: Use GFF3 format to print predictions
    -X: Use extended-format to print gene predictions
    -M: Use XML format to print gene predictions
    -m: Show DTD for XML-format output
    -j: Begin prediction at this coordinate
    -k: End prediction at this coordinate
    -W: Only Forward sense prediction (Watson)
    -C: Only Reverse sense prediction (Crick)
    -U: Allow U12 introns (Requires appropriate U12 parameters to be set in the parameter file)
    -r: Use recursive splicing
    -F: Force the prediction of one gene structure
    -o: Only running exon prediction (disable gene prediction)
    -O : Only running gene prediction (not exon prediction)
    -Z: Activate Open Reading Frames searching
    -R : Provide annotations to improve predictions
    -S : Using information from protein sequence alignments to improve predictions
    -E: Add this value to the exon weight parameter (see parameter file)
    -V: Add this value to the score of evidence exons
    -P : Use other than default parameter file (human)
    -B: Display memory required to execute geneid given a sequence
    -v: Verbose. Display info messages
    -h: Show this help

AUTHORS
    geneid_v1.3 has been developed by Enrique Blanco, Tyler Alioto and Roderic Guigo. Parameter files have been created by Genis Parra and Tyler Alioto. Any bug or suggestion can be reported to
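
A small sketch of assembling a geneid command line from Python. Only flags listed in the help text above are used (-G, -W, -P). Passing the sequence file as the final positional argument is an assumption, since the synopsis as transcribed does not show it, and the file names are hypothetical.

```python
import shlex


def geneid_command(sequence_fasta, param_file=None, gff=True, forward_only=False):
    """Build an argument list for geneid; nothing is executed here."""
    cmd = ["geneid"]
    if gff:
        cmd.append("-G")  # print predictions in GFF format
    if forward_only:
        cmd.append("-W")  # forward (Watson) strand only
    if param_file:
        cmd += ["-P", param_file]  # non-default parameter file
    cmd.append(sequence_fasta)  # assumed positional argument
    return cmd


if __name__ == "__main__":
    # Hypothetical file names; pass the list to subprocess.run() to execute it.
    print(shlex.join(geneid_command("my_sequence.fa", param_file="human.param")))
```

Building the command as a list rather than a single string avoids shell-quoting problems when file names contain spaces.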

GeneID output
## gff-version 2
## date Mon Nov 26 14:37:
## source-version: geneid v
# Sequence HS Length = 4514 bps
# Optimal Gene Structure. 1 genes. Score =
# Gene 1 (Forward). 9 exons. 391 aa. Score =
HS307871 geneid_v1.2 Internal HS307871_1
HS307871 geneid_v1.2 Internal HS307871_1
HS307871 geneid_v1.2 Internal HS307871_1
HS307871 geneid_v1.2 Internal HS307871_1
HS307871 geneid_v1.2 Internal HS307871_1
HS307871 geneid_v1.2 Internal HS307871_1
HS307871 geneid_v1.2 Internal HS307871_1
HS307871 geneid_v1.2 Internal HS307871_1
HS307871 geneid_v1.2 Terminal HS307871_1

GFF: a standard annotation format
Stands for: Gene Finding Format -or- General Feature Format
Designed as a single-line record for describing features on DNA sequence -- originally used for gene prediction output
9 tab-delimited fields common to all versions:
    seq source feature begin end score strand frame group
The group field differs between versions, but in every case no tabs are allowed
- GFF2: group is a unique description, usually the gene name.
    NCOA1
- GFF2.5 / GTF (Gene Transfer Format): tag-value pairs introduced; start_codon and stop_codon are required features for CDS
    transcript_id "NM_056789"; gene_id "NCOA1"
- GFF3: Capitalized tags follow Sequence Ontology (SO) relationships; FASTA seqs can be embedded
    ID=NM_056789_exon1; Parent=NM_056789; note="5' UTR exon"
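
The nine-field layout lends itself to a short parser. The sketch below splits a record on tabs and then guesses the flavor of the group field from its syntax; that heuristic (an "=" means GFF3, a quoted value means GTF, anything else is a plain GFF2 group) is a simplification, and the coordinates in the example are made up.

```python
import re

FIELDS = ["seq", "source", "feature", "begin", "end", "score", "strand", "frame", "group"]


def parse_group(group):
    """Parse the ninth field in GFF3, GTF or plain GFF2 style (heuristic)."""
    attrs = {}
    if "=" in group:  # GFF3 style: tag=value;tag=value
        for item in group.split(";"):
            item = item.strip()
            if item:
                tag, _, value = item.partition("=")
                attrs[tag] = value.strip('"')
    elif '"' in group:  # GTF style: tag "value"; tag "value"
        for tag, value in re.findall(r'(\S+)\s+"([^"]*)"', group):
            attrs[tag] = value
    else:  # GFF2 style: a single free-text group name
        attrs["group"] = group
    return attrs


def parse_gff_line(line):
    """Split one GFF record into a dict with typed coordinates and parsed attributes."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 9:
        raise ValueError(f"expected 9 tab-delimited fields, got {len(parts)}")
    record = dict(zip(FIELDS, parts))
    record["begin"] = int(record["begin"])
    record["end"] = int(record["end"])
    record["attributes"] = parse_group(record["group"])
    return record


if __name__ == "__main__":
    example = "chr2\tgeneid_v1.3\tInternal\t100\t200\t1.5\t+\t0\tNCOA1"
    print(parse_gff_line(example))
```

Score, strand and frame are left as strings because GFF allows "." as a placeholder in those columns.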


Visualizing features with gff2ps
generated by Josep Abril

Visualizing features on UCSC genome browser (custom tracks)
If "your" genome is served by UCSC, this is a good option because:
- browsing is dynamic
- access to other annotations
- can view DNA sequence
- can do complex intersections and filtering
gff2ps is good when:
- your genome is not on UCSC
- you want more flexible layout options
- you want to run it 'offline'
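
For the UCSC route, a custom track is a plain-text upload: a "track" header line followed by the annotation records. The sketch below wraps GFF lines that way. The track name, description and example record are hypothetical, and it assumes the sequence names in the first column already match UCSC chromosome names (e.g. chr2).

```python
def make_custom_track(gff_lines, name, description):
    """Return custom-track text: one 'track' header line followed by GFF records."""
    header = f'track name="{name}" description="{description}" visibility=2'
    # Drop blank lines and '#' comment lines such as the geneid output headers
    body = [ln.rstrip("\n") for ln in gff_lines if ln.strip() and not ln.startswith("#")]
    return "\n".join([header] + body) + "\n"


if __name__ == "__main__":
    lines = [
        "## gff-version 2\n",
        "chr2\tgeneid_v1.3\tInternal\t100\t200\t1.5\t+\t0\tgene_1\n",
    ]
    print(make_custom_track(lines, "geneid", "geneid predictions"), end="")
```

The resulting text can be pasted into the browser's custom track form or saved to a file and uploaded.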

Extensions to GeneID
- Syntenic Gene Prediction (dual-genome)
- Evidence-based (constrained) gene prediction
- U12 intron detection
- Combining gene predictions
- Selenoprotein gene prediction

Syntenic Gene Prediction: SGP2

Minor splicing and U12 introns
- U12 introns make up a minor proportion of all introns (~0.33% in human, less in insects)
- But they can be found in 2-3% of genes
- Normally ignored, but this causes annotation problems
- Easy to predict due to highly conserved donor and branch sites
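
The two percentages on this slide are consistent with each other, as a back-of-the-envelope calculation shows. The figure of about 8 introns per gene and the assumption that U12 introns are spread independently across genes are both assumptions made for this illustration, not values from the slides.

```python
def fraction_of_genes_with_u12(intron_fraction, introns_per_gene):
    """Probability that a gene has at least one U12 intron, assuming independence."""
    return 1.0 - (1.0 - intron_fraction) ** introns_per_gene


if __name__ == "__main__":
    # 0.33% of introns (from the slide) and an assumed 8 introns per gene
    share = fraction_of_genes_with_u12(0.0033, 8)
    print(f"{share:.1%} of genes expected to contain a U12 intron")
```

Because each gene has several introns, a per-intron rate well below 1% still translates into a few percent of genes, which is why ignoring U12 introns causes noticeable annotation problems.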

Splice Signal Profiles: major and minor

Gathering U12 Introns
[Figure: workflow for building U12DB. Labels: genome (Human, Fruit Fly); all annotated introns; aln to EST/mRNA; score; predict; merge; published; ENSEMBL?; ortholog search (17 species) + spliced alignment]


Coming Soon: GenePC, a Gene Prediction Combiner

Tutorial Homepage
GBL Homepage