1
Slow and Steady: The Sea Urchin Genome Project David A. Schwarz Mentor: Dr. Andrew Cameron Site: California Institute of Technology
2
Objective ► Curate the non-annotated, predicted genes of the sea urchin genome. ► Learn to annotate genes and register as many as possible at spbase.org.
3
Importance ► The purple sea urchin: the only non-chordate deuterostome with a sequenced genome. ► It could help us understand the evolution of biological processes such as odor perception and immunity. ► Developments made in the project could benefit future genome projects.
4
Strongylocentrotus purpuratus ► Phylum: Echinodermata ► Radially symmetrical shell, 3 – 10 cm. ► Spines can reach 3 cm long. ► Moves slowly, feeding mostly on algae. ► Reproduces by external fertilization.
5
Phylogeny
6
Data Flow Estimated Set of 23,300 genes
7
Genome Sequencing ► WGS = Whole Genome Shotgun sequencing; genome assembly named Spur_v0.5 ► CAPSS = Clone-Array Pooled Shotgun Sequencing strategy; genome assembly named Spur_v2.1
9
Sequencing ► WGS: extract DNA, digest, sequence the fragments, assemble the genome. ► CAPSS: combines WGS with BACs and uses the BACs as a framework for genome assembly.
10
CAPSS
12
GLEAN ► Statistical algorithm ► [Figure: gene predictors Ensembl, Genscan, and Gnomon shown alongside GLEAN]
13
Discrepancy ► Spur_v0.5: 28,944 predicted; ~10,044 annotated; 18,944 non-annotated ► ~5,700 gene difference possibly due to: 4–5% species polymorphism (E. Davidson et al.); assembly error; prediction error ► Spur_v2.1: 23,300 estimated; gene number reduced when duplicates overlap
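The counts on this slide can be checked with a few subtractions. The sketch below uses only the numbers quoted above; the variable names are illustrative. Note that subtracting the approximate annotated count from the predicted count gives 18,900, the figure used on the Data Curation slides.

```python
# Gene counts quoted on the Discrepancy slide.
predicted_v05 = 28_944   # predicted genes on Spur_v0.5
annotated_v05 = 10_044   # approximate number already annotated
estimated_v21 = 23_300   # estimated gene set for Spur_v2.1

# Predicted genes that still lack an annotation.
non_annotated = predicted_v05 - annotated_v05

# Gap between the predicted set and the later estimate.
difference = predicted_v05 - estimated_v21

print(non_annotated)  # 18900
print(difference)     # 5644, in line with the ~5,700 quoted
```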
14
Methods ► Python filtering: read a gene list (infile), check each gene against a data file, and print it to the outfile if the conditions are met. ► Python searching ► BioPython module: BLAST hits, FASTA sequences ► Grep-like functions: GLEAN models by protein type; FASTA sequences in the GLEAN protein database
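The filtering pattern on this slide (gene list in, check against a data file, write matches out) might look roughly like the sketch below. It is not the project's script: the function names, the one-identifier-per-line layout, and the example condition are all assumptions made for illustration.

```python
from typing import Callable, Iterable, List, Set


def read_ids(lines: Iterable[str]) -> Set[str]:
    """Collect the first whitespace-separated token of each non-empty line."""
    return {line.split()[0] for line in lines if line.strip()}


def filter_genes(gene_lines: Iterable[str],
                 data_lines: Iterable[str],
                 condition: Callable[[str, Set[str]], bool]) -> List[str]:
    """Return the genes from the gene list that satisfy the condition."""
    data_ids = read_ids(data_lines)
    kept = []
    for line in gene_lines:
        gene = line.strip()
        if gene and condition(gene, data_ids):
            kept.append(gene)
    return kept


# Hypothetical example: keep genes that do NOT appear in the data file.
genes = ["GLEAN3_00003", "GLEAN3_00004", "GLEAN3_00005"]
data = ["GLEAN3_00004 some evidence"]
print(filter_genes(genes, data, lambda gene, ids: gene not in ids))
# ['GLEAN3_00003', 'GLEAN3_00005']
```

Because both inputs are plain iterables of lines, open file objects can be passed in directly, and the returned list can be written to an outfile one gene per line.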
15
Example List
GLEAN model  BLAST hit  Score  E-value
GLEAN3_00003  ref|NP_104627.1| hypothetical protein [Mesorhizobium loti] >gi|1...  38  0.48
GLEAN3_00004  ref|NP_788284.1| CG33087-PC [Drosophila melanogaster] >gi|232403...  40  0.19
GLEAN3_00005  ref|NP_509604.1| abnormal NUClease NUC-1, deoxyribonuclease DLAD...  69  4e-11
GLEAN3_00008  ref|XP_293875.3| similar to RIKEN cDNA B130016O10 gene [Homo sap...  240  5e-62
GLEAN3_00010  gb|AAH36744.1| FLJ11712 protein [Homo sapiens]  86  6e-16
GLEAN3_00011  gb|AAH36744.1| FLJ11712 protein [Homo sapiens]  143  3e-32
GLEAN3_00014  ref|NP_062642.1| ubiquitin-conjugating enzyme E2A, RAD6 homolog;...  229  2e-59
GLEAN3_00018  failed
GLEAN3_00019  failed
GLEAN3_00020  failed
GLEAN3_00021  ref|NP_196259.2| chaperone protein - related [Arabidopsis thalia...  110  4e-23
GLEAN3_00023  failed
GLEAN3_00024  sp|O42587|PRSA_XENLA 26S protease regulatory subunit 6A (TAT-bin...  130  1e-29
GLEAN3_00027  gb|AAD19348.1| reverse transcriptase-like protein [Takifugu rubr...  172  2e-41
GLEAN3_00028  gb|AAH53792.1| MGC64389 protein [Xenopus laevis]  164  3e-39
GLEAN3_00029  failed
GLEAN3_00030  ref|XP_060945.2| similar to Olfactory receptor 10T2 [Homo sapien...  54  5e-06
GLEAN3_00032  dbj|BAA22375.1| Nfrl [Xenopus laevis]  339  7e-92
GLEAN3_00033  ref|XP_354640.1| RIKEN cDNA D430035D22 gene [Mus musculus]  186  1e-45
GLEAN3_00034  dbj|BAC04242.1| unnamed protein product [Homo sapiens]  207  5e-52
GLEAN3_00037  dbj|BAC02921.1| zVeph-A [Danio rerio]  112  4e-23
GLEAN3_00038  ref|NP_004198.1| solute carrier family 16, member 3; monocarboxy...  44  0.008
GLEAN3_00039  failed
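A list like this one can be split into fields with a small parser. The sketch below assumes one record per line, with the model name followed by either the word "failed" or a hit description, a score, and an e-value; the `BlastLine` type and `parse_record` function are hypothetical names, not part of the project's code.

```python
import re
from typing import NamedTuple, Optional

# Model name, then the rest of the record (with or without a separating space).
RECORD = re.compile(r"^(GLEAN3_\d+)\s*(.*)$")


class BlastLine(NamedTuple):
    model: str
    hit: Optional[str]        # None when the BLAST search failed
    score: Optional[int]
    evalue: Optional[float]


def parse_record(line: str) -> BlastLine:
    """Split one record of the example list into its fields."""
    match = RECORD.match(line.strip())
    if match is None:
        raise ValueError(f"not a GLEAN3 record: {line!r}")
    model, rest = match.group(1), match.group(2).strip()
    if rest == "failed":
        return BlastLine(model, None, None, None)
    # The last two whitespace-separated tokens are the score and the e-value.
    hit, score, evalue = rest.rsplit(None, 2)
    return BlastLine(model, hit, int(score), float(evalue))


record = parse_record("GLEAN3_00032 dbj|BAA22375.1| Nfrl [Xenopus laevis] 339 7e-92")
print(record.model, record.score, record.evalue)
```

Records marked "failed" come back with `hit` set to `None`, which makes them easy to drop in a later filtering step.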
16
Data Curation ► Non-annotated genes (18,900) ► Filtering by coordinates (18,761) ► Filtering by mRNA expression (17,159) ► Filtering by BLAST failures (14,014) ► Filtering by sequence (9,469) ► Filtering by reciprocal BLAST (5,519) ► Filtering by protein quality (2,478) ► Condition: different name, same genome coordinates. Genes removed: 139
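The coordinate condition (different name, same genome coordinates) could be applied along the lines of the sketch below. The `(scaffold, start, end)` layout, the model names in the example, and the choice to keep the first name in sorted order are assumptions; the slides do not say which duplicate was retained.

```python
from typing import Dict, List, Tuple

# (scaffold, start, end); this layout is an assumption for illustration.
Coordinates = Tuple[str, int, int]


def drop_coordinate_duplicates(models: Dict[str, Coordinates]) -> List[str]:
    """Keep one model name per distinct genome location."""
    seen = set()
    kept = []
    for name in sorted(models):      # sorted so the result is deterministic
        location = models[name]
        if location not in seen:
            seen.add(location)
            kept.append(name)
    return kept


# Hypothetical models: the second one duplicates the first one's location.
models = {
    "GLEAN3_A": ("Scaffold1", 100, 900),
    "GLEAN3_B": ("Scaffold1", 100, 900),
    "GLEAN3_C": ("Scaffold2", 50, 400),
}
print(drop_coordinate_duplicates(models))  # ['GLEAN3_A', 'GLEAN3_C']
```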
18
Data Curation ► Non-annotated genes (18,900) ► Filtering by coordinates (18,761) ► Filtering by mRNA expression (17,159) ► Filtering by BLAST failures (14,014) ► Filtering by sequence (9,469) ► Filtering by reciprocal BLAST (5,519) ► Filtering by protein quality (2,478) ► Condition: evidence for gene expression. Genes removed: 1,603
19
Data Curation ► Non-annotated genes (18,900) ► Filtering by coordinates (18,761) ► Filtering by mRNA expression (17,159) ► Filtering by BLAST failures (14,014) ► Filtering by sequence (9,469) ► Filtering by reciprocal BLAST (5,519) ► Filtering by protein quality (2,478) ► Condition: no hits. Genes removed: 3,145
20
Data Curation ► Non-annotated genes (18,900) ► Filtering by coordinates (18,761) ► Filtering by mRNA expression (17,159) ► Filtering by BLAST failures (14,014) ► Filtering by sequence (9,469) ► Filtering by reciprocal BLAST (5,519) ► Filtering by protein quality (2,478) ► Condition: exactly the same BLAST hit. Genes removed: 4,545
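Models that return exactly the same BLAST hit can be found by grouping on the hit identifier, as in the sketch below. The function names are hypothetical, and the code only flags the models that share a hit; whether the project then removed all of them or kept one per hit is not stated on the slide.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_hit(best_hits: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each BLAST hit identifier to the models that returned it."""
    groups = defaultdict(list)
    for model, hit in best_hits.items():
        groups[hit].append(model)
    return dict(groups)


def shared_hit_models(best_hits: Dict[str, str]) -> List[str]:
    """Return the models whose hit is shared with at least one other model."""
    shared = []
    for models in group_by_hit(best_hits).values():
        if len(models) > 1:
            shared.extend(models)
    return sorted(shared)


# Hit identifiers taken from the example list earlier in the deck.
best_hits = {
    "GLEAN3_00010": "gb|AAH36744.1|",
    "GLEAN3_00011": "gb|AAH36744.1|",
    "GLEAN3_00028": "gb|AAH53792.1|",
}
print(shared_hit_models(best_hits))  # ['GLEAN3_00010', 'GLEAN3_00011']
```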
21
Data Curation ► Non-annotated genes (18,900) ► Filtering by coordinates (18,761) ► Filtering by mRNA expression (17,159) ► Filtering by BLAST failures (14,014) ► Filtering by sequence (9,469) ► Filtering by reciprocal BLAST (5,519) ► Filtering by protein quality (2,478) ► Condition: successful reciprocal BLAST match. Genes removed: 3,952
22
Reciprocal BLAST: good match ► [Figure: sea urchin protein database (GLEAN) and NCBI nr database, with proteins labeled A, B, X, Y] ► Result row: GLEAN_A, NCBI protein B, (score), (e-value)
23
Reciprocal BLAST: bad match ► [Figure: sea urchin protein database (GLEAN) and NCBI nr database, with proteins labeled A, B, X, Y] ► Result row: GLEAN_A, NCBI protein B, (score), (e-value)
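The good and bad cases on the two Reciprocal BLAST slides can be expressed as a simple check: a model's best hit in the nr database should, when searched back against the GLEAN database, return the original model. The sketch below assumes both searches have already been reduced to best-hit dictionaries; the function name and the labels in the example are hypothetical.

```python
from typing import Dict


def is_reciprocal(model: str,
                  forward_best: Dict[str, str],
                  reverse_best: Dict[str, str]) -> bool:
    """True when the model's best nr hit points back to the same model."""
    hit = forward_best.get(model)
    if hit is None:              # no forward hit at all
        return False
    return reverse_best.get(hit) == model


# Forward: GLEAN model -> best nr protein. Reverse: nr protein -> best GLEAN model.
forward_best = {"GLEAN_A": "NCBI_B", "GLEAN_X": "NCBI_Y"}
reverse_best = {"NCBI_B": "GLEAN_A", "NCBI_Y": "GLEAN_Z"}

print(is_reciprocal("GLEAN_A", forward_best, reverse_best))  # True  (good)
print(is_reciprocal("GLEAN_X", forward_best, reverse_best))  # False (bad)
```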
24
Data Curation ► Non-annotated genes (18,900) ► Filtering by coordinates (18,761) ► Filtering by mRNA expression (17,159) ► Filtering by BLAST failures (14,014) ► Filtering by sequence (9,469) ► Filtering by reciprocal BLAST (5,519) ► Filtering by protein quality (2,478) ► Conditions: names such as “hypothetical”, “predicted”, “unnamed”. Genes removed: 3,041
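The protein-quality condition amounts to a keyword test on the hit description. A minimal sketch follows, using only the three terms named on the slide; the slide says "names such as", so the real list was probably longer, and the function name is an assumption.

```python
# Terms named on the slide; the project's full list is not given.
LOW_QUALITY_TERMS = ("hypothetical", "predicted", "unnamed")


def is_low_quality(description: str) -> bool:
    """True when a hit description contains one of the low-quality terms."""
    text = description.lower()
    return any(term in text for term in LOW_QUALITY_TERMS)


# Descriptions taken from the example list earlier in the deck.
print(is_low_quality("hypothetical protein [Mesorhizobium loti]"))  # True
print(is_low_quality("unnamed protein product [Homo sapiens]"))     # True
print(is_low_quality("Nfrl [Xenopus laevis]"))                      # False
```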
25
Annotation Process ► Search for sequences of proteins of similar type or domain (using the GLEAN DB and PFAM). ► Build a phylogenetic tree with Clustal X. ► Annotate the gene following Spbase guidelines. ► If necessary, research the protein type or its domains (using PFAM).
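The first two steps require collecting related sequences into one FASTA file that Clustal X can read. The sketch below does this in plain Python rather than with the BioPython module mentioned in the Methods; it assumes the protein type or domain name appears in the FASTA header, and the function names and toy sequences are illustrative only.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into a {header: sequence} dictionary."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


def select_by_keyword(records: Dict[str, str], keyword: str) -> Dict[str, str]:
    """Keep records whose header mentions the keyword (case-insensitive)."""
    key = keyword.lower()
    return {h: s for h, s in records.items() if key in h.lower()}


def to_fasta(records: Dict[str, str], width: int = 60) -> str:
    """Format records as FASTA text, wrapping sequences at the given width."""
    chunks = []
    for header, seq in records.items():
        chunks.append(f">{header}")
        chunks.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(chunks) + "\n"


# Toy database with made-up headers and sequences.
database = [">GLEAN3_00001 kinase domain", "MKT", "AAL",
            ">GLEAN3_00002 lectin", "MSS"]
print(to_fasta(select_by_keyword(read_fasta(database), "kinase")), end="")
```

The resulting text can be saved to a file and opened in Clustal X for alignment and tree building.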
27
Contributions to Annotation ► AnnotationAssist.py: automates searching for families in the GLEAN database; auto-fetches sequences for Clustal X; stores everything in a unique directory based on the GLEAN model name and family
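The directory-per-model-and-family idea might be implemented roughly as below. This is not the code of AnnotationAssist.py, which is not shown in the deck; the naming scheme, the function names, and the example family are assumptions.

```python
from pathlib import Path


def annotation_dir_name(model: str, family: str) -> str:
    """Build a directory name from the GLEAN model name and the family."""
    # Replace characters that are awkward in file names with underscores.
    safe_family = "".join(c if c.isalnum() else "_" for c in family)
    return f"{model}_{safe_family}"


def make_annotation_dir(root: Path, model: str, family: str) -> Path:
    """Create (if needed) and return the directory for one model and family."""
    path = root / annotation_dir_name(model, family)
    path.mkdir(parents=True, exist_ok=True)
    return path


print(annotation_dir_name("GLEAN3_00030", "olfactory receptor"))
# GLEAN3_00030_olfactory_receptor
```

Calling `make_annotation_dir` twice with the same arguments returns the same path without error, so search results and fetched sequences for one model can be added incrementally.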
28
References ► Polymorphism: R. J. Britten, A. Cetta, E. H. Davidson, Cell 15, 1175 (1978). ► CAPSS: W. W. Cai, R. Chen, R. A. Gibbs, A. Bradley, Genome Res. 11, 1619 (2001).
29
Acknowledgments ► Dr. Andrew Cameron ► David Felt ► Lauren Lee and Nowelle Ibarra ► SoCalBSI Staff and Coordinator ► SoCalBSI Participants ► Funding: NIH, NSF, DOE, Beckman Institute