UCSC Known Genes Version 3 Take 9

Known Gene History
- Initially based on Genie predictions constrained by BLAT mRNA alignments.
  – David Kulp got busy at Affy.
- Switched to RefSeq
  – Jim got paranoid Riken RNAs would take over
- Fan built KG 1
  – Mark got annoyed at low quality predictions
- Fan & Mark built KG 2
  – Jim got annoyed at missing genes
- KG 3
  – The perfect set … until KG 4.

Overall Pipeline
- Get alignments etc. from database
- Remove antibody fragments
- Clean alignments, project to genome
- Cluster into splicing graph
- Add EST, Exoniphy, OrthoSplice info.
- Walk unique transcripts out of graph.
- Assign coding regions (CDS) to transcripts.
- Classify into coding, antisense, noncoding.
- Remove weak transcripts.
- Assign accessions.
- Build gene-centric database tables.

Genbank & Alignment Issues
- Using global instead of local near-best alignment, also higher stringency.
- Including all Genbank RNA, not just mRNA
- These changes not yet reflected in Genbank mRNA/RefSeq tracks.
- Collect data such as selenocysteine substitutions and alternative start codons from Genbank.
- These data are in the .ra files but not the SQL database.

Removing Antibody Var Regions
- Chromosomes 2, 14, 22 contain antibody regions.
- Thousands of transcripts for these in Genbank.
- Gaps are from genomic rearrangements, not splicing. Millions of possibilities.
- Identify regions by:
  – Searching for words like ‘immunoglobulin’ ‘variable’ to make initial set of Ab fragments.
  – Treat anything that overlaps these as Ab fragment too.
  – Cluster together putative Ab fragments.
  – Take 4 largest clusters as the 4 variable regions. (One is just a pseudogene of a real variable region.)
- Remove all alignments in Ab clusters.
- Replace with a single noncoding gene for each cluster near end of gene build.
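The region-finding steps above can be sketched roughly as follows. This is a minimal illustration, not the actual gene-build code: the `Aln` record layout, the `find_ab_clusters` name, and the choice to measure cluster size by number of member alignments are all assumptions made for the example.

```python
from collections import namedtuple

# Hypothetical record layout; the slides do not show the real data structures.
Aln = namedtuple("Aln", "name chrom start end description")

AB_WORDS = ("immunoglobulin", "variable")  # example keywords taken from the slide


def overlaps(a, b):
    """True if two alignments share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def find_ab_clusters(alns, n_regions=4):
    # Step 1: seed set from keyword matches in the description text.
    seeds = [a for a in alns
             if any(w in a.description.lower() for w in AB_WORDS)]
    # Step 2: anything overlapping a seed is treated as an Ab fragment too
    # (seeds overlap themselves, so they are included).
    frags = [a for a in alns if any(overlaps(a, s) for s in seeds)]
    # Step 3: cluster fragments by merging overlapping genomic intervals.
    clusters = []
    for a in sorted(frags, key=lambda x: (x.chrom, x.start)):
        last = clusters[-1] if clusters else None
        if last and last["chrom"] == a.chrom and a.start < last["end"]:
            last["end"] = max(last["end"], a.end)
            last["members"].append(a.name)
        else:
            clusters.append({"chrom": a.chrom, "start": a.start,
                             "end": a.end, "members": [a.name]})
    # Step 4: keep the largest clusters (size = member count, an assumption).
    clusters.sort(key=lambda c: len(c["members"]), reverse=True)
    return clusters[:n_regions]
```

Every alignment whose name appears in a returned cluster would then be dropped and later replaced by one noncoding gene per cluster.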

Chr22 Ab Region (lambda light chain)

Cleaning, projecting alignments
- BLAT sometimes leaves messy gappy ends.
- New heuristic:
  – For gaps of 6 bases or fewer on both mRNA and genome, just ignore gap, filling in with genome if necessary.
  – Try to turn other gaps into introns, if they are not already, by wiggling one base on either side of gap.
  – Break up alignments at remaining gaps that are not intronic. Intronic gaps are at least 16 bases, and have gt/ag or gc/ag ends.
  – After break up, throw away any pieces less than 18 bases long.
- For refSeq mRNA only, join pieces back together after breaking up.
- Other mRNA can be joined by other transcripts (which may not suffer the same problems from polymorphism/error)
- Consider applying similar heuristic in mRNA track.
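A simplified sketch of the gap heuristic, under stated assumptions: an alignment is a list of `(mrna_start, genome_start, size)` blocks on the plus strand, the genome is a plain string, and the one-base "wiggle" step and the refSeq re-joining step are omitted. The function names and the block layout are hypothetical, chosen only for illustration.

```python
def is_intron(genome, start, end, min_size=16):
    """True if genome[start:end] is long enough and has gt/ag or gc/ag ends."""
    if end - start < min_size:
        return False
    seq = genome[start:end].lower()
    return seq[:2] in ("gt", "gc") and seq[-2:] == "ag"


def clean_alignment(blocks, genome, small_gap=6, min_piece=18):
    """Return a list of pieces; each piece is a list of [start, end] genome exons."""
    first_q, first_t, first_size = blocks[0]
    exons = [[first_t, first_t + first_size]]
    pieces = []
    for (pq, pt, psize), (cq, ct, csize) in zip(blocks, blocks[1:]):
        q_gap = cq - (pq + psize)            # unaligned bases in the mRNA
        gap_start, gap_end = pt + psize, ct  # unaligned span in the genome
        t_gap = gap_end - gap_start
        if q_gap <= small_gap and t_gap <= small_gap:
            exons[-1][1] = ct + csize        # ignore gap, fill in with genome
        elif q_gap == 0 and is_intron(genome, gap_start, gap_end):
            exons.append([ct, ct + csize])   # keep the gap as an intron
        else:
            pieces.append(exons)             # non-intronic gap: break here
            exons = [[ct, ct + csize]]
    pieces.append(exons)
    # Throw away pieces with fewer than min_piece aligned genome bases.
    return [p for p in pieces if sum(e - s for s, e in p) >= min_piece]
```

Breaking at non-intronic gaps means a single sequencing error or polymorphism costs only a short stretch of one transcript; the splicing graph can still connect the pieces through other transcripts.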

Cleaning and projecting

Cluster into splicing graph
- Make graph where vertices are begin/ends of exons, edges are exons and introns.
- Multiple input transcripts can share vertices and edges.
- Went over this in some detail a few weeks back…
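As a rough illustration of the graph described above, the sketch below stores vertices as genome positions and edges as `(from, to, kind)` keys mapped to the set of transcripts that use them. The representation and the `build_splicing_graph` name are assumptions for this example; a real implementation would also track strand and vertex type.

```python
from collections import defaultdict


def build_splicing_graph(transcripts):
    """transcripts: dict of name -> sorted list of (start, end) exons.

    Returns (vertices, edges) where edges maps (from_pos, to_pos, kind)
    to the set of transcript names supporting that edge.
    """
    vertices = set()
    edges = defaultdict(set)
    for name, exons in transcripts.items():
        for i, (start, end) in enumerate(exons):
            vertices.update((start, end))
            edges[(start, end, "exon")].add(name)
            if i + 1 < len(exons):
                # Intron edge runs from this exon's end to the next exon's start.
                edges[(end, exons[i + 1][0], "intron")].add(name)
    return vertices, edges


vertices, edges = build_splicing_graph({
    "t1": [(0, 100), (200, 300)],
    "t2": [(0, 100), (200, 350)],  # alternative end for the last exon
})
```

Here `t1` and `t2` share the first exon edge and the intron edge, and differ only in the final exon edge.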

Splicing graph and txWalk

Adding Evidence to Graph
- Initial evidence for each edge comes from mRNAs.
- Add EST evidence only if edge is supported by at least 2 ESTs. (Single EST likely is same clone as single RNA…)
- Just use spliced ESTs
- Make graph in mouse and map via chains. Reinforce orthologous human edges.
- Reinforce exon edges that overlap Exoniphy predictions.
- Evidence weight: refSeq 100, each mRNA 2, est pair 1, mouse ortho 1, exoniphy 1.
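The weighting scheme can be written down as a small function. This is only a sketch: the slide does not say whether the refSeq weight applies per record or whether "est pair" scores once or once per pair, so the code assumes 100 per refSeq record and a single point when two or more spliced ESTs support the edge. The function and parameter names are hypothetical.

```python
def edge_weight(n_refseq=0, n_mrna=0, n_spliced_est=0,
                mouse_ortho=False, exoniphy=False):
    """Total evidence weight for one splicing-graph edge."""
    weight = 100 * n_refseq      # assumption: 100 per supporting refSeq
    weight += 2 * n_mrna         # each mRNA counts 2
    if n_spliced_est >= 2:       # assumption: one point once a pair exists
        weight += 1
    if mouse_ortho:              # orthologous edge found in the mouse graph
        weight += 1
    if exoniphy:                 # exon edge overlaps an Exoniphy prediction
        weight += 1
    return weight
```

Under these assumptions, a single mRNA alone gives an edge a weight of 2, so it needs one more independent line of evidence (an EST pair, a mouse ortholog, or an Exoniphy overlap) to reach 3.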

Walking graph
- Weight of 3 on an edge is good enough.
- Rank input RNA by whether refSeq, and number of good edges they use.
- If any good edges, output a transcript consisting of the edges used by the first RNA.
- Output transcript based on next RNA if the good edges it uses have not been output in same order before.
- Continue until reach last RNA.
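A minimal sketch of the walk, assuming each input RNA is a `(name, is_refseq, edges)` tuple and that "output in same order before" means the RNA's good edges already appear as a contiguous run inside a previously output transcript. That reading, and the names `walk_transcripts` and `contains_run`, are assumptions for illustration only.

```python
def contains_run(longer, shorter):
    """True if `shorter` occurs as a contiguous run inside `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter
               for i in range(len(longer) - n + 1))


def walk_transcripts(rnas, edge_weights, good=3):
    """rnas: list of (name, is_refseq, [edge, ...]) with edges in order."""
    def good_edges(edges):
        return tuple(e for e in edges if edge_weights.get(e, 0) >= good)

    # refSeq first, then RNAs using more good edges.
    ranked = sorted(rnas, key=lambda r: (not r[1], -len(good_edges(r[2]))))
    output = []
    for name, _is_refseq, edges in ranked:
        path = good_edges(edges)
        if not path:
            continue  # no good edges: nothing to output
        if any(contains_run(prev, path) for _, prev in output):
            continue  # same edges in the same order were already output
        output.append((name, path))
    return output
```

Ranking refSeq first means that when several RNAs describe the same path, the curated record is the one the output transcript is based on.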

Evidence, Walk, AltSplice

Assigning Coding Regions
- Align UniProt and RefSeq proteins to txWalk transcripts. Mark regions they hit as possible CDS.
- Align Genbank/RefSeq RNAs to txWalk transcripts, map CDS from RNA records as possible CDS.
- Use bestorf program for another possible CDS.
- Assign an ad-hoc score to each possible CDS, choose highest scoring.
- More comparative genomics could really help here someday…

CDS Mapping, Filtering

Classifying and Weeding
- The transcripts are classified into:
  – Coding: CDS survives trimming stage
  – Near-coding: overlap coding by at least 20 bases on same strand
  – Antisense: overlap coding by at least 20 bases on opposite strand
  – Noncoding: other transcripts
- Near-coding transcripts that show signs of incomplete splicing (retained intron, bleeds > 100 bases into intron) are removed.
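The four classes can be illustrated with a small function. This sketch assumes transcripts are dictionaries with `chrom`, `start`, `end`, `strand`, and `has_cds` keys, measures overlap on whole transcript spans rather than exon by exon, and lets a same-strand overlap take precedence over an opposite-strand one. None of those details are stated in the slides.

```python
def overlap_bases(a, b):
    """Number of bases shared by two transcript spans (0 if disjoint)."""
    return max(0, min(a["end"], b["end"]) - max(a["start"], b["start"]))


def classify(tx, coding_txs, min_overlap=20):
    """Return 'coding', 'near-coding', 'antisense', or 'noncoding'."""
    if tx["has_cds"]:
        return "coding"
    hits = [c for c in coding_txs
            if c["chrom"] == tx["chrom"]
            and overlap_bases(tx, c) >= min_overlap]
    if any(c["strand"] == tx["strand"] for c in hits):
        return "near-coding"   # overlaps coding on the same strand
    if hits:
        return "antisense"     # overlaps coding only on the opposite strand
    return "noncoding"
```

The weeding step for incompletely spliced near-coding transcripts would run after this classification and is not shown here.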

Assigning accessions
- Initial temporary identifiers of form..., eg chr AB
- Make permanent identifiers of form TX
  – Find exact match in previous gene set, and reuse previous accession.
  – Find compatible match (all introns alike) in old gene set, reuse accession, bump version.
  – Make up new accession otherwise.
  – Record genes in old set not in new.
- Version 7 -> version 9 mapping actually a good test of this: exact, 4732 lost, 3736 new, 464 compatible.
- Move to UC format in v. 10?
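The matching rules above might be sketched as follows. The data layout (exons as tuples of `(start, end)` pairs), the placeholder accession names `OLD1` and `NEW1`, and the two-pass order are assumptions for illustration; the real accession format is not reproduced here. Single-exon transcripts have no introns, so this sketch only lets them match exactly.

```python
import itertools


def introns(exons):
    """Intron coordinates implied by a sorted tuple of (start, end) exons."""
    return tuple((a_end, b_start)
                 for (_, a_end), (b_start, _) in zip(exons, exons[1:]))


def assign_accessions(new_txs, old_txs, fresh_ids):
    """new_txs: temp_id -> exons; old_txs: accession -> (version, exons)."""
    by_exons = {ex: (acc, ver) for acc, (ver, ex) in old_txs.items()}
    by_introns = {introns(ex): (acc, ver)
                  for acc, (ver, ex) in old_txs.items() if len(ex) > 1}
    assigned, used = {}, set()
    # Pass 1: exact matches reuse the old accession and version unchanged.
    for temp_id, exons in new_txs.items():
        hit = by_exons.get(exons)
        if hit and hit[0] not in used:
            assigned[temp_id] = hit
            used.add(hit[0])
    # Pass 2: compatible matches bump the version; the rest get new accessions.
    for temp_id, exons in new_txs.items():
        if temp_id in assigned:
            continue
        hit = by_introns.get(introns(exons)) if len(exons) > 1 else None
        if hit and hit[0] not in used:
            acc, ver = hit[0], hit[1] + 1
        else:
            acc, ver = next(fresh_ids), 1
        assigned[temp_id] = (acc, ver)
        used.add(acc)
    lost = sorted(set(old_txs) - used)  # old genes absent from the new set
    return assigned, lost


old = {"OLD1": (1, ((0, 100), (200, 300))),
       "OLD2": (2, ((500, 600), (700, 800))),
       "OLD3": (1, ((900, 950),))}
new = {"tmpA": ((0, 100), (200, 300)),      # exact match to OLD1
       "tmpB": ((480, 600), (700, 800)),    # same intron as OLD2
       "tmpC": ((1000, 1100),)}             # no match
fresh = (f"NEW{i}" for i in itertools.count(1))
assigned, lost = assign_accessions(new, old, fresh)
```

Running exact matches first keeps a merely compatible transcript from claiming an accession that an identical transcript should keep.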

Building gene-centric tables
- mmBlastTab, rnBlastTab etc. homolog tables. Blastp best plus syntenic weeding.
- kgXref and knownToXxx tables to relate gene to other databases and tables.
- kgAlias table to help search on gene names.
- gnfAtlas2Distance to measure expression similarity between genes for Gene Sorter.
- 3 other expression distance tables
- humanVidalP2P and humanWankerP2P protein network distance tables.
- knownCanonical/knownIsoform tables to help people selectively view alt-splicing.
- pbXXX tables for proteome browser.
- In all about 10 hours of compute and indexing.

The Plan
- Next week
  – Test preliminary integration on hg18a
  – Resolve issues with proteome browser
  – Tinker on take 10, maybe take 11
- Week after
  – Integration of final gene build into hg18a
  – Move hg18.knownGenes to hg18.knownGenesOld
  – Swap hg18a tables into hg18.
- Coming months
  – Continue to improve gene build.
  – Add new information from build into details pages.
  – Allow user filtering of which genes are shown
  – Allow selection by names as well as IDs in table browser.
  – Present at Cold Spring Harbor. Write up paper.