Slow and Steady: The Sea Urchin Genome Project David A. Schwarz Mentor: Dr. Andrew Cameron Site: California Institute of Technology.


Objective ► Curate the non-annotated, predicted genes of the sea urchin genome. ► Learn to annotate genes and register as many as possible at spbase.org.

Importance ► The purple sea urchin: the only non-chordate deuterostome with a sequenced genome. ► It could help us understand the evolution of biological processes such as odor perception and immunity. ► Developments made in the project could benefit future genome projects.

Strongylocentrotus purpuratus ► Phylum: Echinodermata ► Radially symmetrical shell, 3 – 10 cm. ► Spines can reach 3 cm long. ► Moves slowly, feeding mostly on algae. ► Reproduces by external fertilization.

Phylogeny

Data Flow Estimated Set of 23,300 genes

Genome Sequencing ► WGS = Whole Genome Shotgun Sequencing  Genome assembly named Spur_v0.5 ► CAPSS = Clone-Array Pooled Shotgun Sequencing  Genome assembly named Spur_v2.1

Sequencing ► WGS: ► Extract DNA ► Digest ► Sequence the fragments ► Assemble the genome. ► CAPSS: ► Combines WGS with BAC (bacterial artificial chromosome) clones. ► Uses BACs as the framework for genome assembly.

CAPSS

GLEAN ► Statistical algorithm ► Inputs: gene predictions from Ensembl, Genscan, and Gnomon

Discrepancy ► Spur_v0.5: ► 28,944 predicted ► ~10,044 annotated ► ~18,900 non-annotated ► ~5,700 gene difference possibly due to:  4–5% species polymorphism (E. Davidson et al.)  Assembly error  Prediction error ► Spur_v2.1: ► 23,300 estimated ► Gene number reduced when duplicates overlap
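
The gene counts on this slide can be checked with simple arithmetic. The short sketch below uses only the numbers quoted on the slides; the variable names are illustrative. Subtracting the annotated genes from the predicted set gives 18,900, which is the starting figure of the Data Curation slides, and the gap between the predicted and estimated sets comes to 5,644, which the slide quotes as roughly 5,700.

```python
# Counts quoted on the slides (Spur_v0.5 predictions, Spur_v2.1 estimate).
predicted = 28_944   # predicted gene models in Spur_v0.5
annotated = 10_044   # approximate number already annotated
estimated = 23_300   # estimated gene set for Spur_v2.1

# Genes that still need curation.
non_annotated = predicted - annotated

# Difference between the predicted set and the estimated set.
surplus = predicted - estimated

print(f"non-annotated: {non_annotated:,}")
print(f"predicted minus estimated: {surplus:,}")
```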

Methods ► Python filtering ► Python searching ► Biopython module:  BLAST hit FASTA sequences ► Grep-like functions:  GLEAN models by protein type  FASTA sequences in GLEAN protein database ► Filter flow: Infile (gene list) → check against data file → if conditions are met, print to outfile
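
The filter flow on this slide (gene list in, check against a data file, print matches out) can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual scripts: the tab-separated record format, the function names `load_lookup` and `filter_genes`, and the sample condition are all assumptions made for the example.

```python
def load_lookup(lines):
    """Parse 'gene_id<TAB>value' lines (assumed format) into a dict."""
    lookup = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        gene_id, _, value = line.partition("\t")
        lookup[gene_id] = value
    return lookup


def filter_genes(gene_ids, lookup, condition):
    """Keep gene IDs whose data-file record exists and meets `condition`."""
    kept = []
    for gene_id in gene_ids:
        record = lookup.get(gene_id)
        if record is not None and condition(record):
            kept.append(gene_id)
    return kept


# In-memory stand-ins for the infile (gene list) and the data file.
gene_list = ["GLEAN3_00010", "GLEAN3_00018", "GLEAN3_00032"]
data_file = [
    "GLEAN3_00010\tFLJ11712 protein",
    "GLEAN3_00018\tfailed",
    "GLEAN3_00032\tNfrl",
]

lookup = load_lookup(data_file)
kept = filter_genes(gene_list, lookup, lambda value: value != "failed")

# "Print to outfile": here the kept IDs go to standard output.
for gene_id in kept:
    print(gene_id)
```

Passing the condition as a function keeps the same loop reusable for each filtering stage; only the test applied to the record changes.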

Example List
GLEAN3_00003 ref|NP_ | hypothetical protein [Mesorhizobium loti] >gi|
GLEAN3_00004 ref|NP_ | CG33087-PC [Drosophila melanogaster] >gi|
GLEAN3_00005 ref|NP_ | abnormal NUClease NUC-1, deoxyribonuclease DLAD e-11
GLEAN3_00008 ref|XP_ | similar to RIKEN cDNA B130016O10 gene [Homo sap e-62
GLEAN3_00010 gb|AAH | FLJ11712 protein [Homo sapiens] 86 6e-16
GLEAN3_00011 gb|AAH | FLJ11712 protein [Homo sapiens] 143 3e-32
GLEAN3_00014 ref|NP_ | ubiquitin-conjugating enzyme E2A, RAD6 homolog; e-59
GLEAN3_00018 failed
GLEAN3_00019 failed
GLEAN3_00020 failed
GLEAN3_00021 ref|NP_ | chaperone protein - related [Arabidopsis thalia e-23
GLEAN3_00023 failed
GLEAN3_00024 sp|O42587|PRSA_XENLA 26S protease regulatory subunit 6A (TAT-bin e-29
GLEAN3_00027 gb|AAD | reverse transcriptase-like protein [Takifugu rubr e-41
GLEAN3_00028 gb|AAH | MGC64389 protein [Xenopus laevis] 164 3e-39
GLEAN3_00029 failed
GLEAN3_00030 ref|XP_ | similar to Olfactory receptor 10T2 [Homo sapien e-06
GLEAN3_00032 dbj|BAA | Nfrl [Xenopus laevis] 339 7e-92
GLEAN3_00033 ref|XP_ | RIKEN cDNA D430035D22 gene [Mus musculus] 186 1e-45
GLEAN3_00034 dbj|BAC | unnamed protein product [Homo sapiens] 207 5e-52
GLEAN3_00037 dbj|BAC | zVeph-A [Danio rerio] 112 4e-23
GLEAN3_00038 ref|NP_ | solute carrier family 16, member 3; monocarboxy
GLEAN3_00039 failed
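
Entries in a list like this one can be split into models with a BLAST hit and models marked `failed`. The sketch below assumes that every entry starts with a `GLEAN3_` identifier followed by five digits, as in the entries shown, and that the rest of the line is either the word `failed` or the hit description. The function name `split_hits` is illustrative, and the parser tolerates a missing space after the identifier.

```python
import re

# Assumed entry shape: 'GLEAN3_' + five digits, then the hit text or 'failed'.
ENTRY = re.compile(r"^(GLEAN3_\d{5})\s*(.*)$")


def split_hits(lines):
    """Return (hits, failures): a dict of id -> hit text, and a list of failed ids."""
    hits = {}
    failures = []
    for line in lines:
        match = ENTRY.match(line.strip())
        if match is None:
            continue  # skip headers or lines that are not entries
        gene_id, rest = match.groups()
        if rest == "failed":
            failures.append(gene_id)
        else:
            hits[gene_id] = rest
    return hits, failures


sample = [
    "GLEAN3_00010gb|AAH | FLJ11712 protein [Homo sapiens] 86 6e-16",
    "GLEAN3_00018failed",
    "GLEAN3_00032 dbj|BAA | Nfrl [Xenopus laevis] 339 7e-92",
]

hits, failures = split_hits(sample)
print("with hits:", sorted(hits))
print("failed:", failures)
```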

Data Curation ► Non-annotated genes (18,900) ► Filtering by coordinates (18,761) ► Filtering by mRNA expression (17,159) ► Filtering by BLAST failures (14,014) ► Filtering by sequence (9,469) ► Filtering by reciprocal BLAST (5,519) ► Filtering by protein quality (2,478) ► Condition: different name, same genome coordinates ► Genes removed: 139
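
The coordinate filter removes models that carry different names but sit at the same genome coordinates. A minimal sketch of that idea follows. The record layout (name, scaffold, start, end), the model names, and the choice to keep the first model seen at each location are assumptions for illustration; the slides do not say which duplicate was retained.

```python
def drop_coordinate_duplicates(models):
    """Keep the first model seen at each (scaffold, start, end); report the rest."""
    seen = {}
    kept = []
    removed = []
    for name, scaffold, start, end in models:
        key = (scaffold, start, end)
        if key in seen:
            removed.append(name)  # same coordinates, different name
        else:
            seen[key] = name
            kept.append(name)
    return kept, removed


# Hypothetical models; names and coordinates are made up for the example.
models = [
    ("GLEAN3_A", "Scaffold1", 100, 900),
    ("GLEAN3_B", "Scaffold1", 100, 900),
    ("GLEAN3_C", "Scaffold2", 100, 900),
]

kept, removed = drop_coordinate_duplicates(models)
print("kept:", kept)
print("removed:", removed)
```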

Data Curation ► Non-annotated genes (18,900) ► Filtering by coordinates (18,761) ► Filtering by mRNA expression (17,159) ► Filtering by BLAST failures (14,014) ► Filtering by sequence (9,469) ► Filtering by reciprocal BLAST (5,519) ► Filtering by protein quality (2,478) ► Condition: evidence for gene expression ► Genes removed: 1,603

Data Curation ► Non-annotated genes (18,900) ► Filtering by coordinates (18,761) ► Filtering by mRNA expression (17,159) ► Filtering by BLAST failures (14,014) ► Filtering by sequence (9,469) ► Filtering by reciprocal BLAST (5,519) ► Filtering by protein quality (2,478) ► Condition: no hits ► Genes removed: 3,145

Data Curation ► Non-annotated genes (18,900) ► Filtering by coordinates (18,761) ► Filtering by mRNA expression (17,159) ► Filtering by BLAST failures (14,014) ► Filtering by sequence (9,469) ► Filtering by reciprocal BLAST (5,519) ► Filtering by protein quality (2,478) ► Condition: exactly the same BLAST hit ► Genes removed: 4,545

Data Curation ► Non-annotated genes (18,900) ► Filtering by coordinates (18,761) ► Filtering by mRNA expression (17,159) ► Filtering by BLAST failures (14,014) ► Filtering by sequence (9,469) ► Filtering by reciprocal BLAST (5,519) ► Filtering by protein quality (2,478) ► Condition: successful reciprocal BLAST match ► Genes removed: 3,952

Reciprocal BLAST ► Good reciprocal BLAST [Figure: sea urchin protein database (GLEAN) and NCBI nr database, with proteins labeled A, B, X, Y; hit record: GLEAN_A | NCBI Protein B | (score) | (e-value)]

Reciprocal BLAST ► Bad reciprocal BLAST [Figure: sea urchin protein database (GLEAN) and NCBI nr database, with proteins labeled A, B, X, Y; hit record: GLEAN_A | NCBI Protein B | (score) | (e-value)]
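
A reciprocal BLAST check can be sketched as follows, assuming the usual reciprocal-best-hit criterion: a GLEAN model A and an NCBI protein B form a good match when B is the best hit for A in the NCBI database and A is the best hit for B in the GLEAN database. The two dictionaries stand in for parsed BLAST results, and the identifiers are placeholders modeled on the labels in the figure; none of this is taken from the project's scripts.

```python
def reciprocal_matches(forward_best, reverse_best):
    """Split queries into good and bad reciprocal matches.

    forward_best: GLEAN model -> best hit in the NCBI database
    reverse_best: NCBI protein -> best hit in the GLEAN database
    """
    good = []
    bad = []
    for query, hit in forward_best.items():
        if reverse_best.get(hit) == query:
            good.append(query)  # B points back to A
        else:
            bad.append(query)   # B points to some other model, or to nothing
    return good, bad


# Placeholder best-hit tables.
forward_best = {"GLEAN_A": "B", "GLEAN_X": "Y"}
reverse_best = {"B": "GLEAN_A", "Y": "GLEAN_Q"}

good, bad = reciprocal_matches(forward_best, reverse_best)
print("good reciprocal BLAST:", good)
print("bad reciprocal BLAST:", bad)
```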

Data Curation ► Non-annotated genes (18,900) ► Filtering by coordinates (18,761) ► Filtering by mRNA expression (17,159) ► Filtering by BLAST failures (14,014) ► Filtering by sequence (9,469) ► Filtering by reciprocal BLAST (5,519) ► Filtering by protein quality (2,478) ► Condition: names such as “hypothetical”, “predicted”, “unnamed” ► Genes removed: 3,041
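
The protein quality filter drops hits whose names carry no functional information. One way to express that condition is a case-insensitive keyword test, sketched below. The keyword tuple contains only the three examples named on the slide ("names such as" suggests the real list was longer), and the function name is illustrative.

```python
# The three example keywords from the slide; the project's full list is not given.
UNINFORMATIVE = ("hypothetical", "predicted", "unnamed")


def has_informative_name(description):
    """Return False when the hit description contains an uninformative keyword."""
    text = description.lower()
    return not any(word in text for word in UNINFORMATIVE)


# Descriptions taken from the Example List slide.
descriptions = [
    "hypothetical protein [Mesorhizobium loti]",
    "unnamed protein product [Homo sapiens]",
    "Nfrl [Xenopus laevis]",
]

informative = [d for d in descriptions if has_informative_name(d)]
print(informative)
```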

Annotation Process ► Search for sequences of proteins of similar type or domain (using the GLEAN database and Pfam). ► Build a phylogenetic tree with Clustal X. ► Annotate the gene following the Spbase guidelines. ► If necessary: research the protein type or its domains (using Pfam).

Contributions to Annotation ► AnnotationAssist.py  Automates searching for families in the GLEAN database  Autofetches sequences for Clustal X  Stores everything in a unique directory based on GLEAN model name and family
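
The storage step described for AnnotationAssist.py, one directory per GLEAN model and family holding the fetched sequences, might look like the sketch below. This is not the original script, which the slides do not show: the directory naming scheme, the FASTA file name, the family label, and the short sequences are all assumptions for illustration.

```python
import tempfile
from pathlib import Path


def store_family(root, model_name, family, sequences):
    """Write sequences as FASTA into a directory named after model and family."""
    out_dir = Path(root) / f"{model_name}_{family}"  # assumed naming scheme
    out_dir.mkdir(parents=True, exist_ok=True)
    fasta_path = out_dir / f"{family}.fasta"
    with fasta_path.open("w") as handle:
        for seq_id, seq in sequences.items():
            handle.write(f">{seq_id}\n{seq}\n")
    return fasta_path


# Demonstration in a temporary directory with made-up sequences.
with tempfile.TemporaryDirectory() as root:
    path = store_family(
        root,
        "GLEAN3_00030",
        "olfactory_receptor",
        {"GLEAN3_00030": "MKT", "example_hit": "MRT"},
    )
    relative = path.relative_to(root).as_posix()
    fasta_text = path.read_text()

print(relative)
print(fasta_text)
```

A FASTA file written this way can be loaded directly into Clustal X for alignment and tree building.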

References ► Polymorphism: R. J. Britten, A. Cetta, E. H. Davidson, Cell 15, 1175 (1978). ► CAPSS: W. W. Cai, R. Chen, R. A. Gibbs, A. Bradley, Genome Res. 11, 1619 (2001).

Acknowledgments ► Dr. Andrew Cameron ► David Felt ► Lauren Lee and Nowelle Ibarra ► SoCalBSI Staff and Coordinator ► SoCalBSI Participants ► Funding:  NIH  NSF  DOE  Beckman Institute