

Xt ESTs
32,000 unique transcript set
– 16,000 clusters
– 16,000 singletons
Clusters
– 9,000 (55%) have a blastx hit
– 4,000 might be full-length
– 2,000 ~98% probability of being FL
Singletons
– 5,500 (35%) have a blastx hit
– 1,500 might be full-length
– 200–500 ‘probably’ FL
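The counts on this slide can be tallied with a few lines of Python. This is only a sketch: the dictionary layout and the function name `summarise` are illustrative and not part of the presentation. The recomputed fractions come out near 56% and 34%, so the slide's 55% and 35% appear to be rounded.

```python
# Counts copied from the slide; "maybe_fl" is the slide's "might be full-length".
counts = {
    "clusters":   {"total": 16_000, "blastx_hit": 9_000, "maybe_fl": 4_000},
    "singletons": {"total": 16_000, "blastx_hit": 5_500, "maybe_fl": 1_500},
}

def summarise(counts):
    """Return per-group and overall fractions with a blastx hit / possibly full-length."""
    out = {}
    for name, c in counts.items():
        out[name] = {
            "hit_frac": c["blastx_hit"] / c["total"],
            "fl_frac": c["maybe_fl"] / c["total"],
        }
    total = sum(c["total"] for c in counts.values())
    out["all"] = {
        "hit_frac": sum(c["blastx_hit"] for c in counts.values()) / total,
        "fl_frac": sum(c["maybe_fl"] for c in counts.values()) / total,
    }
    return out

summary = summarise(counts)
for name, s in summary.items():
    print(f"{name:10s} blastx hit {s['hit_frac']:.1%}   might be FL {s['fl_frac']:.1%}")
```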

What are we looking for?
FL perfect
– good enough to spend £500 on a morpholino
FL probable
– likely enough for a gain-of-function experiment
Gene transcript
– good enough to put on an array
For FL, distinguish between
– knowing it’s full-length, and
– being sure of which ATG is the start

Looking for full-length transcripts
Perfect full-length
- Open reading frame
  - defined by clear prior stop codon
  - clear ATG 3’ of STOP codon
  - reasonable run of stop-free sequence before another stop signal or end of ESTs
  - consensus sequence agrees with ESTs
- Blastx data
  - blastx hits indicating coding sequence
  - start of matching proteins exactly aligned with predicted start methionine
  - no other protein alignments
[Figure: schematic alignment. Consensus sequence, four blastx hits (PROTEIN Hs 1e-187, PROTEIN Mm 1e-190, PROTEIN Dr 1e-201, PROTEIN Xl 1e-202) and seven overlapping ESTs that agree with the consensus.]
consensus sequence: CTTCTTCTAGAGTCAGAGCGTCATGAGCTTCTTCTATTAGGATCGCTCGATTGCTAGGCTTAGCTGATGCGGGCTTCTTCTCGAGAGAAACTCGGATTAGCGGCTTCGCGTCTCTAGGATCTGCTTCGCTATTATAGGCTTCGGATTAGGCGCTATTATCGGCG
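The open-reading-frame criteria above (a prior in-frame stop, an ATG 3’ of it, then a stop-free run) can be sketched as a scan over the three forward frames. This is a minimal illustration, not the method used for the Xt set: the function name, the `min_codons` parameter and the toy sequences are all invented here, and a real check would also compare the consensus against the individual ESTs.

```python
STOPS = {"TAA", "TAG", "TGA"}  # standard genetic code stop codons

def orfs_with_prior_stop(seq, min_codons=1):
    """Find ORFs whose ATG follows an in-frame stop codon.

    Returns tuples (prior_stop, atg, end, closed) using 0-based positions.
    `closed` is False when the sequence ends before another stop is seen.
    """
    seq = seq.upper()
    found = []
    for frame in range(3):
        prior_stop = None  # most recent in-frame stop
        atg = None         # first ATG seen after that stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in STOPS:
                if atg is not None and (i - atg) // 3 >= min_codons:
                    found.append((prior_stop, atg, i, True))
                prior_stop, atg = i, None
            elif codon == "ATG" and atg is None and prior_stop is not None:
                atg = i
        if atg is not None:
            # ORF still open at the end of the sequence
            end = frame + 3 * ((len(seq) - frame) // 3)
            if (end - atg) // 3 >= min_codons:
                found.append((prior_stop, atg, end, False))
    return found

# Toy sequences (invented): stop, spacer codon, ATG, two codons, stop
closed_example = orfs_with_prior_stop("TAGGCCATGAAACCCTGA")
open_example = orfs_with_prior_stop("TAGATGAAACC")
print(closed_example, open_example)
```

Raising `min_codons` is one way to express "reasonable run of stop-free sequence"; the slide does not give a number, so any threshold is a judgment call.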

Blast aligned with ATG
Less perfect, but possibly sufficient, indications of full-length
1. Blast hits line up with ATG
- Perfect
[Figure: hits PROTEIN Hs 1e-187, PROTEIN Mm 1e-190, PROTEIN Dr 1e-201 and PROTEIN Xl 1e-202 drawn over seven overlapping ESTs, the 5’-most of which begins AGTCAGAGCGTCATG.]
- Weak hits, maybe several agree
[Figure: a single hit, PROTEIN Ce 8.2e-9, drawn over the ESTs.]
- Strong hits but not clear agreement, predicted proteins confuse
[Figure: hits PROTEIN Hs 1e-187, PROTEIN Mm 1e-190, PREDICTED Dr 1e-201 and PROTEIN Xl 1e-202 (the Xl bar is drawn shorter) over seven overlapping ESTs.]
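The three situations on this slide can be expressed as a small classifier over blastx hits. The sketch below is hypothetical: the `Hit` fields are modelled on the `qstart`, `sstart` and `evalue` columns of BLAST tabular output, the `strong_evalue` cutoff of 1e-50 is an assumption, and the coordinates in the usage example are invented, since the slides give e-values but not positions.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    label: str               # species code from the slide, e.g. "Hs"
    evalue: float
    qstart: int              # 1-based consensus position where the alignment begins
    sstart: int              # 1-based protein residue where the alignment begins
    predicted: bool = False  # True for PREDICTED entries

def starts_at_atg(hit, atg_pos):
    """True when residue 1 of the protein aligns exactly at the candidate ATG."""
    return hit.sstart == 1 and hit.qstart == atg_pos

def start_support(hits, atg_pos, strong_evalue=1e-50):
    """Sort a set of hits into the slide's three cases (cutoff is an assumption)."""
    agree = [h for h in hits if starts_at_atg(h, atg_pos)]
    strong = [h for h in hits if h.evalue <= strong_evalue and not h.predicted]
    if hits and len(agree) == len(hits) and strong:
        return "perfect"
    if agree and not strong:
        return "weak hits agree"
    if strong and len(agree) < len(hits):
        return "strong hits, no clear agreement"
    return "no support"

# Invented coordinates: candidate ATG at consensus position 23
perfect = [Hit("Hs", 1e-187, 23, 1), Hit("Mm", 1e-190, 23, 1),
           Hit("Dr", 1e-201, 23, 1), Hit("Xl", 1e-202, 23, 1)]
weak = [Hit("Ce", 8.2e-9, 23, 1)]
mixed = [Hit("Hs", 1e-187, 23, 1), Hit("Dr", 1e-201, 5, 1, predicted=True)]
print(start_support(perfect, 23), start_support(weak, 23), start_support(mixed, 23))
```

Predicted proteins are excluded from the "strong" set here because the slide flags them as a source of confusion; whether to trust them is a curation decision.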

Protein alignments start within ORF
2. Proteins aligned within well-defined ORF
[Figure: hits PROTEIN Hs 1e-10, PROTEIN Dr 1e-19, FRAGMENT Dm 1e-19, PREDICTED Mm 1e-50 and PROTEIN Dr 1e-87 (the last bar is drawn shorter) over seven overlapping ESTs.]

Protein alignments overlap ORF
3. Proteins aligned, some part overlaps well-defined ORF
- Weak hits, indication of domain homology: quite likely to be FL
[Figure: hits PROTEIN Hs 1e-4, PROTEIN Dr 1e-5, FRAGMENT Dm 1e-6, PREDICTED Mm 1e-8 and PROTEIN Dr 1e-8 over five overlapping ESTs.]
- Strong hits, probably real homolog: ORF may be artefact of sequencing error, or in UTR
[Figure: hits PROTEIN Hs 1e-81, PROTEIN Dr 1e-98 and PROTEIN Xl 1e-107 over five overlapping ESTs.]
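Whether a protein alignment sits within an ORF or merely overlaps it is an interval comparison. The sketch below assumes half-open nucleotide intervals on the consensus; the function names, the 1e-50 cutoff and the example coordinates are invented for illustration, and the returned phrases paraphrase the slide's own interpretations.

```python
def relation(aln, orf):
    """Classify an alignment against an ORF; both are (start, end) half-open intervals."""
    a_start, a_end = aln
    o_start, o_end = orf
    if a_end <= o_start or o_end <= a_start:
        return "disjoint"
    if o_start <= a_start and a_end <= o_end:
        return "within"
    return "overlaps"

def triage(aln, orf, evalue, strong_evalue=1e-50):
    """Attach the slide's reading to each case (the e-value cutoff is an assumption)."""
    rel = relation(aln, orf)
    if rel == "within":
        return "alignment within ORF"
    if rel == "overlaps":
        if evalue <= strong_evalue:
            return "strong hit: ORF may be a sequencing artefact or lie in UTR"
        return "weak hit: domain homology, quite likely FL"
    return "hit does not touch ORF"

orf = (22, 120)  # invented ORF coordinates
print(triage((30, 90), orf, 1e-19))
print(triage((10, 90), orf, 1e-6))
print(triage((10, 90), orf, 1e-98))
```

The asymmetry is deliberate: a strong hit that extends outside the ORF suggests the ORF boundaries, not the homology, are wrong.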

Protein alignment has upstream STOP
4. There are protein alignments and a well-defined STOP codon upstream
[Figure: hits PROTEIN Hs 1e-187, PROTEIN Mm 1e-190, PROTEIN Dr 1e-201 and PROTEIN Xl 1e-202 over two ESTs.]
- Mostly applicable to small clusters where codons are not well agreed
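For this case the scan is anchored at the protein alignment rather than at an ATG: walk 5’ from the alignment start, codon by codon in the alignment's frame, until a stop codon appears. The sketch below is an assumption-laden illustration (0-based positions, invented function names and toy sequence); it also lists the in-frame ATGs between the stop and the alignment, which bears on the earlier question of being sure which ATG is the start.

```python
STOPS = {"TAA", "TAG", "TGA"}  # standard genetic code stop codons

def upstream_stop(seq, aln_start):
    """Return the 0-based index of the nearest in-frame stop 5' of aln_start, or None."""
    seq = seq.upper()
    for i in range(aln_start - 3, -1, -3):
        if seq[i:i + 3] in STOPS:
            return i
    return None

def candidate_starts(seq, stop, aln_start):
    """List in-frame ATG positions between the upstream stop and the alignment start."""
    seq = seq.upper()
    return [i for i in range(stop + 3, aln_start + 1, 3) if seq[i:i + 3] == "ATG"]

toy = "GTAGCCCATGAAATTT"           # invented: TAG at 1, ATG at 7, alignment from 10
stop = upstream_stop(toy, 10)
print(stop, candidate_starts(toy, stop, 10))
```

With more than one candidate ATG in the returned list, the transcript may still be full-length while the true start remains ambiguous.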

Long open reading frame…
5. There is a long open reading frame, but maybe no blastx hits
← more than 500 (?) →
[Figure: five overlapping ESTs with no protein hits.]
- May just be in UTR
  - plenty of long ORFs observed in obvious UTR
- May not even be RNA…
  - what about blastn data?
- ESTscan would also be useful
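A length screen for this last case can be sketched by finding the longest stop-free stretch in each forward frame. The slide's "more than 500 (?)" gives no unit; the example below assumes nucleotides, and the function name and test sequence are invented. As the slide cautions, passing a length threshold does not rule out UTR, so this is a filter rather than evidence of coding sequence.

```python
STOPS = {"TAA", "TAG", "TGA"}  # standard genetic code stop codons

def longest_open_run(seq):
    """Return (length_nt, frame, start) of the longest stop-free stretch, forward frames only."""
    seq = seq.upper()
    best = (0, 0, 0)
    for frame in range(3):
        run_start = frame
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOPS:
                if i - run_start > best[0]:
                    best = (i - run_start, frame, run_start)
                run_start = i + 3
        # stretch that runs to the last complete codon of this frame
        end = frame + 3 * ((len(seq) - frame) // 3)
        if end - run_start > best[0]:
            best = (end - run_start, frame, run_start)
    return best

MIN_LEN = 500                           # assumed to be nucleotides; the slide marks it "(?)"
toy = "ATG" + "GCC" * 200 + "TAA"       # invented 606 nt sequence
length, frame, start = longest_open_run(toy)
print(length, frame, start, length > MIN_LEN)
```

Ties between frames keep the first frame found, and the reverse strand is not examined; both are simplifications of this sketch.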