Slow and Steady: The Sea Urchin Genome Project
David A. Schwarz
Mentor: Dr. Andrew Cameron
Site: California Institute of Technology
Objective ► Curate the non-annotated, predicted genes of the sea urchin genome. ► Learn to annotate genes and register as many as possible at spbase.org.
Importance ► The purple sea urchin is the only non-chordate deuterostome with a sequenced genome. ► It could help us understand the evolution of biological processes such as odor perception and immunity. ► Developments made in the project could benefit future genome projects.
Strongylocentrotus purpuratus ► Phylum: Echinodermata ► Radially symmetrical shell, 3 – 10 cm. ► Spines can reach 3 cm long. ► Moves slowly, feeding mostly on algae. ► Reproduces by external fertilization.
Phylogeny
Data Flow Estimated Set of 23,300 genes
Genome Sequencing
► WGS = Whole Genome Shotgun Sequencing
   Genome assembly named Spur_v0.5
► CAPSS = Clone-Array Pooled Shotgun Sequencing
   Genome assembly named Spur_v2.1
Sequencing
► WGS:
   ► Extract DNA
   ► Digest
   ► Sequence the fragments
   ► Assemble the genome
► CAPSS:
   ► Combines WGS with BACs
   ► Uses BACs as the framework for genome assembly
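The WGS steps above (fragment, sequence, assemble) can be illustrated with a toy greedy overlap assembler. This is only a sketch under strong assumptions: error-free reads, a single strand, no repeats, and made-up read sequences. The helper names `overlap` and `greedy_assemble` are hypothetical, and the assemblers actually used for Spur_v0.5 and Spur_v2.1 are far more involved.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Return the length of the longest suffix of a that is a prefix of b."""
    max_len = min(len(a), len(b))
    for size in range(max_len, min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Made-up reads that tile a short made-up sequence.
reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
contigs = greedy_assemble(reads)
print(contigs)
```

With repeats or sequencing errors a greedy merge like this can join the wrong reads, which is one reason a BAC framework (as in CAPSS) helps constrain the assembly.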
CAPSS
GLEAN
► GLEAN statistical algorithm
► Gene predictions combined: Ensembl, Genscan, Gnomon
Discrepancy
► Spur_v0.5
   ► 28,944 predicted
   ► ~10,044 annotated
   ► ~18,900 non-annotated
   ► ~5,700 gene difference, possibly due to:
      4 – 5% species polymorphism (E. Davidson et al.)
      Assembly error
      Prediction error
► Spur_v2.1
   ► 23,300 estimated
   ► Gene number reduced when duplicates overlap
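The counts on this slide can be checked with a few lines of arithmetic. The variable names are illustrative, and the annotated count is approximate on the slide.

```python
predicted_v05 = 28_944   # gene models predicted on Spur_v0.5
annotated = 10_044       # approximate number already annotated
estimated_v21 = 23_300   # estimated gene set for Spur_v2.1

non_annotated = predicted_v05 - annotated
difference = predicted_v05 - estimated_v21

print(f"non-annotated models: {non_annotated:,}")
print(f"predicted minus estimated: {difference:,}")
```

The second figure, 5,644, is what the slide rounds to about 5,700.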
Methods
► Python Filtering
► Python Searching
   ► Biopython module:
      BLAST hit FASTA sequences
   ► Grep-like functions:
      GLEAN models by protein type
      FASTA sequences in GLEAN protein database
Flow: Infile (gene list); check against data file; if conditions are met, print to outfile
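The filtering pattern on this slide (read a gene list, check each entry against a data file, print matching entries to an outfile) might look like the sketch below. The function names `load_data` and `filter_genes`, the tab-separated data format, and the example condition are assumptions for illustration, not the project's actual scripts.

```python
import io
from typing import Callable, Iterable, TextIO


def load_data(lines: Iterable[str]) -> dict[str, str]:
    """Map gene ID -> annotation text from tab-separated lines."""
    data = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        gene_id, _, info = line.partition("\t")
        data[gene_id] = info
    return data


def filter_genes(gene_list: Iterable[str], data: dict[str, str],
                 condition: Callable[[str], bool], outfile: TextIO) -> int:
    """Write genes whose data entry meets the condition; return the count."""
    kept = 0
    for gene_id in (g.strip() for g in gene_list):
        if gene_id and condition(data.get(gene_id, "")):
            outfile.write(gene_id + "\n")
            kept += 1
    return kept


# In-memory stand-ins for the infile, data file, and outfile.
infile = io.StringIO("GLEAN3_00018\nGLEAN3_00028\nGLEAN3_00032\n")
datafile = io.StringIO(
    "GLEAN3_00018\tfailed\n"
    "GLEAN3_00028\tMGC64389 protein [Xenopus laevis]\n"
    "GLEAN3_00032\tNfrl [Xenopus laevis]\n"
)
outfile = io.StringIO()

kept = filter_genes(infile, load_data(datafile),
                    lambda info: "Xenopus" in info, outfile)
print(kept, outfile.getvalue().split())
```

Passing the condition as a function keeps one loop reusable across filtering steps; only the predicate changes.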
Example List
GLEAN3_00003  ref|NP_ | hypothetical protein [Mesorhizobium loti] >gi|
GLEAN3_00004  ref|NP_ | CG33087-PC [Drosophila melanogaster] >gi|
GLEAN3_00005  ref|NP_ | abnormal NUClease NUC-1, deoxyribonuclease DLAD e-11
GLEAN3_00008  ref|XP_ | similar to RIKEN cDNA B130016O10 gene [Homo sap e-62
GLEAN3_00010  gb|AAH | FLJ11712 protein [Homo sapiens] 86 6e-16
GLEAN3_00011  gb|AAH | FLJ11712 protein [Homo sapiens] 143 3e-32
GLEAN3_00014  ref|NP_ | ubiquitin-conjugating enzyme E2A, RAD6 homolog; e-59
GLEAN3_00018  failed
GLEAN3_00019  failed
GLEAN3_00020  failed
GLEAN3_00021  ref|NP_ | chaperone protein - related [Arabidopsis thalia e-23
GLEAN3_00023  failed
GLEAN3_00024  sp|O42587|PRSA_XENLA 26S protease regulatory subunit 6A (TAT-bin e-29
GLEAN3_00027  gb|AAD | reverse transcriptase-like protein [Takifugu rubr e-41
GLEAN3_00028  gb|AAH | MGC64389 protein [Xenopus laevis] 164 3e-39
GLEAN3_00029  failed
GLEAN3_00030  ref|XP_ | similar to Olfactory receptor 10T2 [Homo sapien e-06
GLEAN3_00032  dbj|BAA | Nfrl [Xenopus laevis] 339 7e-92
GLEAN3_00033  ref|XP_ | RIKEN cDNA D430035D22 gene [Mus musculus] 186 1e-45
GLEAN3_00034  dbj|BAC | unnamed protein product [Homo sapiens] 207 5e-52
GLEAN3_00037  dbj|BAC | zVeph-A [Danio rerio] 112 4e-23
GLEAN3_00038  ref|NP_ | solute carrier family 16, member 3; monocarboxy
GLEAN3_00039  failed
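A list in this shape can be split into a model name and a BLAST result with a small parser. The sketch below assumes that every model name is `GLEAN3_` followed by five digits (as in the entries shown) and that a failed search is recorded as the literal word `failed`. The `Entry` type and `parse_entry` function are hypothetical names. The pattern accepts entries whether or not whitespace separates the two fields.

```python
import re
from typing import NamedTuple, Optional

# Model name, optional whitespace, then the rest of the line.
LINE_RE = re.compile(r"^(GLEAN3_\d{5})\s*(.*)$")


class Entry(NamedTuple):
    model: str
    hit: Optional[str]  # None when the BLAST search failed


def parse_entry(line: str) -> Optional[Entry]:
    """Parse one list line; return None if it is not a GLEAN entry."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    model, rest = match.groups()
    rest = rest.strip()
    return Entry(model, None if rest == "failed" else rest)


sample = [
    "GLEAN3_00010gb|AAH | FLJ11712 protein [Homo sapiens] 86 6e-16",
    "GLEAN3_00018 failed",
    "GLEAN3_00032  dbj|BAA | Nfrl [Xenopus laevis] 339 7e-92",
]
entries = [e for e in map(parse_entry, sample) if e is not None]
failed = [e.model for e in entries if e.hit is None]
print(failed)
```

Separating failed searches from real hits at parse time makes a later "no hits" filter a one-line check on `hit is None`.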
Data Curation
Non-annotated Genes (18,900)
► Filtering by coordinates (18,761)
► Filtering by mRNA expression (17,159)
► Filtering by BLAST failures (14,014)
► Filtering by sequence (9,469)
► Filtering by Reciprocal Blast (5,519)
► Filtering by Protein Quality (2,478)
Step: Filtering by coordinates
Condition: Different name, same genome coordinates
Genes removed: 139
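The coordinate condition (different name, same genome coordinates) can be sketched as a pass that remembers each coordinate triple it has seen. The `Model` record, the scaffold names, and the choice to keep the first model encountered are all assumptions; the slide does not say which duplicate was retained.

```python
from typing import Iterable, NamedTuple


class Model(NamedTuple):
    name: str
    scaffold: str
    start: int
    end: int


def drop_coordinate_duplicates(
    models: Iterable[Model],
) -> tuple[list[Model], list[Model]]:
    """Split models into (kept, removed) by identical coordinates."""
    seen = set()
    kept, removed = [], []
    for m in models:
        key = (m.scaffold, m.start, m.end)
        if key in seen:
            removed.append(m)  # same place on the genome, different name
        else:
            seen.add(key)
            kept.append(m)
    return kept, removed


# Hypothetical coordinates for illustration only.
models = [
    Model("GLEAN3_00010", "Scaffold1", 100, 900),
    Model("GLEAN3_00011", "Scaffold1", 100, 900),
    Model("GLEAN3_00014", "Scaffold2", 100, 900),
]
kept, removed = drop_coordinate_duplicates(models)
print([m.name for m in removed])
```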
Data Curation: Filtering by mRNA expression (17,159)
Condition: Evidence for gene expression
Genes removed: 1,603
Data Curation: Filtering by BLAST failures (14,014)
Condition: No hits
Genes removed: 3,145
Data Curation: Filtering by sequence (9,469)
Condition: Exactly the same BLAST hit
Genes removed: 4,545
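Models that share exactly the same BLAST hit can be found by grouping on the hit. The sketch below groups on the hit description because the accessions in the example list are truncated; whether the project compared accessions or descriptions, and whether it removed every model in a group or all but one, is not stated on the slide. The function name `group_by_hit` is hypothetical.

```python
from collections import defaultdict


def group_by_hit(hits: dict[str, str]) -> dict[str, list[str]]:
    """Return only the hits shared by more than one model."""
    groups = defaultdict(list)
    for model, hit in hits.items():
        groups[hit].append(model)
    return {hit: names for hit, names in groups.items() if len(names) > 1}


# Descriptions taken from the example list.
hits = {
    "GLEAN3_00010": "FLJ11712 protein [Homo sapiens]",
    "GLEAN3_00011": "FLJ11712 protein [Homo sapiens]",
    "GLEAN3_00028": "MGC64389 protein [Xenopus laevis]",
}
shared = group_by_hit(hits)
print(shared)
```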
Data Curation: Filtering by Reciprocal Blast (5,519)
Condition: Successful Reciprocal BLAST match
Genes removed: 3,952
Reciprocal Blast: Good Reciprocal Blast
[Diagram: sea urchin protein database (GLEAN) and NCBI Nr database, with entries labeled A, B, X, and Y]
List entry format: GLEAN_A, NCBI Protein B, (score), (e-value)
Reciprocal Blast: Bad Reciprocal Blast
[Diagram: sea urchin protein database (GLEAN) and NCBI Nr database, with entries labeled A, B, X, and Y]
List entry format: GLEAN_A, NCBI Protein B, (score), (e-value)
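A reciprocal BLAST check can be expressed as a round trip between the two databases: a GLEAN model's best hit in Nr should, when searched back against the GLEAN database, return the original model. The sketch below assumes the best hits in each direction have already been computed and stored in dictionaries. The function name `is_reciprocal` is hypothetical, and the way the labels A, B, X, and Y are used here is an assumed reading of the diagram.

```python
from typing import Dict


def is_reciprocal(model: str, forward: Dict[str, str],
                  reverse: Dict[str, str]) -> bool:
    """True if the model's best Nr hit points back to the same model."""
    hit = forward.get(model)
    return hit is not None and reverse.get(hit) == model


# Best hit of each GLEAN model in Nr (forward search).
forward = {"GLEAN_A": "B", "GLEAN_X": "Y"}
# Best hit of each Nr protein in the GLEAN database (reverse search).
reverse = {"B": "GLEAN_A", "Y": "GLEAN_A"}

good = is_reciprocal("GLEAN_A", forward, reverse)  # A -> B -> A
bad = is_reciprocal("GLEAN_X", forward, reverse)   # X -> Y -> A
print(good, bad)
```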
Data Curation: Filtering by Protein Quality (2,478)
Conditions: Names such as “hypothetical”, “predicted”, “unnamed”
Genes removed: 3,041
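The name-based condition amounts to a case-insensitive keyword test on the hit description. The keyword tuple below holds only the three words quoted on the slide; the slide says "such as", so the project's full list may have been longer. The function name `is_low_information` is hypothetical.

```python
KEYWORDS = ("hypothetical", "predicted", "unnamed")


def is_low_information(description: str, keywords=KEYWORDS) -> bool:
    """True if the hit description contains any of the keywords."""
    text = description.lower()
    return any(word in text for word in keywords)


# Descriptions taken from the example list.
descriptions = [
    "hypothetical protein [Mesorhizobium loti]",
    "unnamed protein product [Homo sapiens]",
    "Nfrl [Xenopus laevis]",
]
flags = [is_low_information(d) for d in descriptions]
print(flags)
```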
Annotation Process
► Search for sequences of proteins of similar type or domain (using the GLEAN DB and PFAM).
► Build a phylogenetic tree with Clustal X.
► Annotate the gene following the Spbase guidelines.
► If necessary: research the protein type or its domains (using PFAM).
Contributions to Annotation
► AnnotationAssist.py
   Automates searching for families in the GLEAN database
   Auto-fetches sequences for Clustal X
   Stores everything in a unique directory based on GLEAN model name and family
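The storage step can be sketched with `pathlib`. The internals of AnnotationAssist.py are not shown on the slide, so the directory naming scheme (`<model>_<family>`), the character clean-up, and the function name `make_workdir` are assumptions for illustration only.

```python
import tempfile
from pathlib import Path


def make_workdir(base: Path, model: str, family: str) -> Path:
    """Create and return a directory named after the model and family."""
    # Replace characters that are awkward in directory names.
    safe_family = "".join(
        c if c.isalnum() or c in "-_" else "_" for c in family
    )
    workdir = base / f"{model}_{safe_family}"
    workdir.mkdir(parents=True, exist_ok=True)
    return workdir


# A temporary directory stands in for the real project folder.
with tempfile.TemporaryDirectory() as tmp:
    workdir = make_workdir(Path(tmp), "GLEAN3_00014",
                           "ubiquitin-conjugating enzyme")
    created = workdir.is_dir()
    name = workdir.name

print(name)
```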
References
► Polymorphism: R. J. Britten, A. Cetta, E. H. Davidson, Cell 15, 1175 (1978).
► CAPSS: W. W. Cai, R. Chen, R. A. Gibbs, A. Bradley, Genome Res. 11, 1619 (2001).
Acknowledgments
► Dr. Andrew Cameron
► David Felt
► Lauren Lee and Nowelle Ibarra
► SoCalBSI Staff and Coordinator
► SoCalBSI Participants
► Funding: NIH, NSF, DOE, Beckman Institute