UCSC Known Genes Version 3 Take 10

Overall Pipeline
Get alignments etc. from database.
Remove antibody fragments.
Clean alignments, project to genome.
Cluster into splicing graph.
Add EST, Exoniphy, OrthoSplice info.
Walk unique transcripts out of graph.
Assign coding regions (CDS) to transcripts.
Classify into coding, antisense, noncoding.
Remove weak transcripts.
Assign accessions.

Removing Antibody Var Regions
Chromosomes 2, 14, 22 contain antibody regions.
Thousands of transcripts for these in GenBank.
Gaps are from genomic rearrangements, not splicing. Millions of possibilities.
Identify regions by:
  – Searching for words like ‘immunoglobulin’ ‘variable’ to make initial set of Ab fragments.
  – Treat anything that overlaps these as Ab fragment too.
  – Cluster together putative Ab fragments.
  – Take 4 largest clusters as the 4 variable regions. (One is just a pseudogene of a real variable region.)
Remove all alignments in Ab clusters.
Replace with a single noncoding gene for each cluster near end of gene build.
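A minimal sketch of the Ab-fragment identification steps, not the actual gene build code. The `Alignment` record, the keyword tuple, the non-transitive overlap rule, and measuring cluster size by number of alignments are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Words named on the slide; the real keyword list may be longer (assumption).
AB_KEYWORDS = ("immunoglobulin", "variable")


@dataclass(frozen=True)
class Alignment:
    """Hypothetical record for one transcript alignment (half-open coordinates)."""
    name: str
    chrom: str
    start: int
    end: int
    description: str


def overlaps(a, b):
    """True if two alignments share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def find_ab_clusters(alignments, n_regions=4):
    """Return the n_regions largest clusters of putative Ab fragments."""
    # Step 1: initial set of Ab fragments from a keyword search.
    seeds = [a for a in alignments
             if any(word in a.description.lower() for word in AB_KEYWORDS)]
    # Step 2: anything overlapping a seed is treated as an Ab fragment too.
    flagged = [a for a in alignments if any(overlaps(a, s) for s in seeds)]
    # Step 3: cluster flagged fragments by merging overlapping intervals.
    flagged.sort(key=lambda a: (a.chrom, a.start))
    clusters = []
    cluster_end = 0
    for a in flagged:
        if clusters and clusters[-1][0].chrom == a.chrom and a.start < cluster_end:
            clusters[-1].append(a)
            cluster_end = max(cluster_end, a.end)
        else:
            clusters.append([a])
            cluster_end = a.end
    # Step 4: keep the largest clusters (size = number of alignments, an assumption).
    clusters.sort(key=len, reverse=True)
    return clusters[:n_regions]
```

Every alignment in a returned cluster would then be removed from the build and replaced by one noncoding gene per cluster.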

Chr22 Ab Region (lambda light chain)

Cleaning and projecting

Cluster into splicing graph
Make graph where vertices are begin/ends of exons, edges are exons and introns.
Multiple input transcripts can share vertices and edges.
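One possible way to represent such a graph, sketched under simplifying assumptions: each transcript is a sorted list of half-open `(start, end)` exon blocks on one strand, vertices are bare genomic positions, and each edge records which input transcripts use it. The function name and data layout are hypothetical.

```python
from collections import defaultdict


def build_splicing_graph(transcripts):
    """Build a splicing graph from {name: [(exon_start, exon_end), ...]}.

    Returns (vertices, edges), where vertices is a set of exon boundary
    positions and edges maps (start, end, kind) to the set of transcript
    names that use that exon or intron.
    """
    vertices = set()
    edges = defaultdict(set)
    for name, exons in transcripts.items():
        for i, (start, end) in enumerate(exons):
            vertices.update((start, end))
            edges[(start, end, "exon")].add(name)
            # The intron runs from this exon's end to the next exon's start.
            if i + 1 < len(exons):
                next_start = exons[i + 1][0]
                edges[(end, next_start, "intron")].add(name)
    return vertices, dict(edges)
```

Two transcripts with identical exon boundaries map onto the same vertices and edges, which is what lets evidence accumulate per edge.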

Make graph Snap soft ends to hard

Extend soft ends to hard

Consensus of soft ends

Walk graph to get nonredundant transcripts

Splicing graph and txWalk

Adding Evidence to Graph
Initial evidence for each edge comes from mRNAs.
ESTs add evidence if edge is supported by at least 2 ESTs. (Single EST likely is same clone as single RNA…)
Just use spliced ESTs.
Make graph in mouse and map via chains. Reinforce orthologous human edges.
Reinforce exon edges that overlap Exoniphy predictions.
Evidence weight: refSeq 100, each mRNA 2, EST pair 1, mouse ortho 1, Exoniphy 1.
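The evidence weights can be combined per edge roughly as follows. This is a sketch: the slide gives only the weights, so treating refSeq as a flat 100, counting each complete pair of spliced ESTs as 1, and the function and parameter names are assumptions.

```python
# Weights as listed on the slide.
REFSEQ_WEIGHT = 100
MRNA_WEIGHT = 2
EST_PAIR_WEIGHT = 1
MOUSE_ORTHO_WEIGHT = 1
EXONIPHY_WEIGHT = 1


def edge_weight(has_refseq=False, n_mrna=0, n_spliced_est=0,
                mouse_ortho=False, exoniphy=False):
    """Total evidence weight for one edge of the splicing graph."""
    weight = 0
    if has_refseq:
        weight += REFSEQ_WEIGHT
    weight += MRNA_WEIGHT * n_mrna
    # A single EST adds nothing; each complete pair adds 1 (assumption).
    weight += EST_PAIR_WEIGHT * (n_spliced_est // 2)
    if mouse_ortho:
        weight += MOUSE_ORTHO_WEIGHT
    if exoniphy:
        weight += EXONIPHY_WEIGHT
    return weight
```

Under these assumptions a lone mRNA scores 2, and one mRNA plus two spliced ESTs scores 3.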

Walking graph
Weight of 3 on an edge is good enough. Single exon gene edges take 4 though.
Rank input RNA by whether refSeq, and number of good edges they use.
If any good edges, output a transcript consisting of the edges used by the first RNA.
Output transcript based on next RNA if the good edges it uses have not been output in same order before.
Continue until reach last RNA.
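The walk can be sketched as below. This is an interpretation, not the txWalk source: RNAs are plain dicts, a single-exon gene is taken to be an RNA with exactly one edge, and "output in same order before" is read as "already a contiguous sub-path of an earlier output transcript".

```python
def is_subpath(path, other):
    """True if path occurs as a contiguous run inside other."""
    n = len(path)
    return any(other[i:i + n] == path for i in range(len(other) - n + 1))


def walk_transcripts(rnas, edge_weights, min_weight=3, single_exon_min_weight=4):
    """Walk nonredundant transcripts out of the graph.

    rnas: list of {"name": str, "is_refseq": bool, "edges": [edge_key, ...]}
    edge_weights: {edge_key: evidence weight}
    Returns a list of (rna_name, tuple_of_good_edges).
    """
    def good_edges(rna):
        # Single-exon genes need weight 4; everything else needs 3.
        threshold = single_exon_min_weight if len(rna["edges"]) == 1 else min_weight
        return tuple(e for e in rna["edges"] if edge_weights.get(e, 0) >= threshold)

    # RefSeq first, then by number of good edges used (most first).
    ranked = sorted(rnas, key=lambda r: (not r["is_refseq"], -len(good_edges(r))))

    output = []
    for rna in ranked:
        path = good_edges(rna)
        if not path:
            continue  # no good edges, nothing to output
        if any(is_subpath(path, prev) for _, prev in output):
            continue  # these edges were already output in this order
        output.append((rna["name"], path))
    return output
```

Ranking RefSeq first means that, when two RNAs trace the same path, the transcript is attributed to the RefSeq.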

Evidence, Walk, AltSplice

Assigning Coding Regions
Score ORF as so:
  – 1 point for each base in ORF
  – 50 points for initial ATG
  – 100 points if ATG follows Kozak rules (G after ATG or A/G 3 bases before)
  – -400 points if nonsense mediated decay (last intron more than 55 bases past stop codon)
  – -0.5 points for each base in upstream ORF
  – -0.5 points each base in upstream Kozak ORF
  – +1 point each base also ORF in other species (Rhesus, mouse, dog)
Scheme agrees with RefSeq reviewed ~96% of the time.
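A rough sketch of this scoring scheme, not the txCdsPredict implementation. It assumes the Kozak bonus is added on top of the ATG bonus, that positions are transcript coordinates, and that the caller supplies the upstream-ORF and cross-species base counts; all function and parameter names are hypothetical.

```python
def follows_kozak(seq, atg_pos):
    """Kozak rule from the slide: G right after ATG, or A/G 3 bases before it."""
    after = seq[atg_pos + 3:atg_pos + 4]
    before = seq[atg_pos - 3:atg_pos - 2] if atg_pos >= 3 else ""
    return after == "G" or before in ("A", "G")


def is_nmd_candidate(stop_end, last_junction):
    """Nonsense mediated decay: last intron more than 55 bases past the stop codon.

    last_junction is the transcript position of the last exon-exon junction,
    or None for an unspliced transcript.
    """
    return last_junction is not None and last_junction - stop_end > 55


def score_orf(orf_len, has_atg, kozak, nmd,
              upstream_orf_bases=0, upstream_kozak_orf_bases=0, ortho_orf_bases=0):
    """Score one candidate ORF using the point values on the slide."""
    score = 1.0 * orf_len                    # 1 point per base in the ORF
    if has_atg:
        score += 50                          # initial ATG
    if kozak:
        score += 100                         # ATG follows Kozak rules
    if nmd:
        score -= 400                         # nonsense mediated decay target
    score -= 0.5 * upstream_orf_bases        # bases in upstream ORFs
    score -= 0.5 * upstream_kozak_orf_bases  # bases in upstream Kozak ORFs
    score += 1.0 * ortho_orf_bases           # bases also ORF in other species
    return score
```

For example, a 300-base ORF with a Kozak ATG and no penalties scores 450 under these assumptions; the highest-scoring ORF in a transcript would be taken as its CDS.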

Comparing ORF Finders
method         same     close    in      out
Big orf        62.9%    30.4%    4.0%    2.7%
Kozak          87.2%    7.4%     2.3%    2.2%
twinOrf *      85.6%    7.5%     2.3%    1.8%
bestOrf        80.9%    14.4%    2.9%    1.9%
txCdsPredict   92.8%    4.7%     1.1%    1.3%
+ ortho        93.3%    4.4%     1.1%    1.3%
Comparison vs. RefSeq reviewed ORF annotations.
* twinOrf only predicts if has homologous sequence. This run with dog, only adds up to 97.2% for this reason.

CDS Mapping, Filtering

Classifying and Weeding
The transcripts are classified into:
  – Coding: CDS survives trimming stage
  – Near-coding: overlap coding by at least 20 bases on same strand
  – Antisense: overlap coding by at least 20 bases on opposite strand
  – Noncoding: other transcripts
Near-coding transcripts that show signs of incomplete splicing (retained intron, bleeds > 100 bases into intron) are removed.
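The four classes can be assigned with a sketch like the one below. It simplifies in ways the slide does not specify: overlap is measured on the whole transcript span rather than exon by exon, same-strand overlap takes precedence over opposite-strand overlap, and transcripts are plain dicts with hypothetical keys.

```python
def classify_transcript(tx, coding_transcripts, min_overlap=20):
    """Classify tx as 'coding', 'nearCoding', 'antisense' or 'noncoding'.

    tx and each coding transcript are dicts with keys
    chrom, start, end (half-open), strand, has_cds.
    """
    if tx["has_cds"]:
        return "coding"
    same_strand = opposite_strand = False
    for c in coding_transcripts:
        if c["chrom"] != tx["chrom"]:
            continue
        # Number of shared bases between the two spans (negative if disjoint).
        shared = min(tx["end"], c["end"]) - max(tx["start"], c["start"])
        if shared >= min_overlap:
            if c["strand"] == tx["strand"]:
                same_strand = True
            else:
                opposite_strand = True
    if same_strand:
        return "nearCoding"
    if opposite_strand:
        return "antisense"
    return "noncoding"
```

The incomplete-splicing filter for near-coding transcripts would run as a separate pass after this classification.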

Take 10 Statistics
class genes transcripts
coding
nearCoding N/A 4469
antisense
uncoding

RefSeq Statistics
class genes transcripts
coding
nearCoding N/A 14
antisense 19
uncoding 590592