Published by Monica Davidson. Modified over 8 years ago.
1
A New Interface to GeneKeyDB: Methods for analyzing relationships among proteins based on shared motifs
Chris Symons & Xinxia Peng
2
Protein domains are distinct units of protein three-dimensional structure, which also carry function. Proteins can be composed of single or multiple domains. A few thousand conserved domain models are sufficient to cover more than two thirds of known protein sequences. Marchler-Bauer A, et al. CDD: a curated Entrez database of conserved domain alignments. Nucleic Acids Research 31:383-387 (2003).
3
The growth in the number of known proteins vs. the growth in the number of unique domains.
Geer, L.Y., Domrachev, M., Lipman, D.J. and Bryant, S.H. (2002) CDART: Protein Homology by Domain Architecture. Genome Res., 12, 1619–1623.
4
Conserved Domain Database (CDD): http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
A curated Entrez database of conserved domain alignments at NCBI. It currently contains domains derived from two popular collections, Smart and Pfam, plus contributions from colleagues at NCBI, such as COG.
5
Data generation using GeneKeyDB

-- Create a master table of associations between
-- LocusLink ID and cdd_key
CREATE TABLE peng_cddlist AS (
  SELECT a.ll_id, b.ll_refseq_nm_id, c.cdd_key, c.cdd_evalue, a.organism
  FROM ll_locus a, ll_refseq_nm b, ll_np_cdd c
  WHERE a.ll_id = b.ll_id
    AND b.ll_refseq_nm_id = c.ll_refseq_nm_id
);
COMMIT;
6
Summary of Data

         Locus    CD
Mouse    6102     1999
Human    8732     2786
7
Looking at groups of domains
Given a list of CDD domains, we return the proteins that are found exclusively in the intersection of those domains. If a second (third, etc.) list of domains is added, we return the proteins found exclusively in the intersection of that list, then combine it with the previous lists and run the same query on the combined list.
8
Looking at groups of domains
[Figure: domain lists A and B, and their combination A + B]
9
Options
The query can be run on either human or mouse data. Exclusivity can be turned off, so that all proteins in the intersection of the list of CDD keys are returned.
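The query and its exclusivity option can be sketched in a few lines of Python. This is only an illustration, not the interface's actual implementation. It assumes that "exclusive" means a locus carries exactly the listed domains and no others, which is consistent with the sample session but is not stated outright. The function name `proteins_for_domains` and the toy rows are hypothetical; in practice the (ll_id, cdd_key) pairs would come from a table such as peng_cddlist.

```python
from collections import defaultdict

def proteins_for_domains(pairs, domains, exclusive=True):
    """Return sorted locus IDs whose domain set matches `domains`.

    pairs:     iterable of (ll_id, cdd_key) rows.
    exclusive: if True, a locus must carry exactly these domains
               (assumed meaning of "exclusively"); if False, it must
               carry at least these domains.
    """
    wanted = set(domains)
    by_locus = defaultdict(set)
    for ll_id, cdd_key in pairs:
        by_locus[ll_id].add(cdd_key)
    if exclusive:
        return sorted(l for l, d in by_locus.items() if d == wanted)
    return sorted(l for l, d in by_locus.items() if wanted <= d)

# Toy rows, made up for illustration (not GeneKeyDB data).
rows = [(101, 1), (101, 438),
        (102, 1), (102, 438), (102, 1825),
        (103, 1825), (103, 7)]

a, b = [1, 438], [1825]
print(proteins_for_domains(rows, a))                   # list A alone
print(proteins_for_domains(rows, b))                   # list B alone
print(proteins_for_domains(rows, a + b))               # combined A + B
print(proteins_for_domains(rows, a, exclusive=False))  # exclusivity off
```

Grouping the rows into one domain set per locus keeps both modes to a single set comparison: equality for the exclusive case, subset for the non-exclusive case.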
10
Sample Input and Output

Input the first list of domains. The domains should be separated by spaces and should all be on one line.
1 438
(1 438):
Input another list of domains separated by spaces (or hit q to quit):
1825
(1825):
(1 438 1825): 28992 83666
Input another list of domains separated by spaces (or hit q to quit):
11
Why useful? A thought 2003
12
?: log[P(k)] ~ -k, where k is the number of CDs per protein
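One way to examine this question empirically is to tabulate P(k) from the locus-to-domain pairs and inspect log P(k) against k. The sketch below is a minimal illustration under stated assumptions: it counts distinct CDs per locus as a stand-in for CDs per protein, the function name `domain_count_distribution` is hypothetical, and the rows are made-up toy data rather than GeneKeyDB content.

```python
import math
from collections import Counter, defaultdict

def domain_count_distribution(pairs):
    """Map k (distinct CDs per locus) to P(k), the fraction of loci with k CDs."""
    by_locus = defaultdict(set)
    for ll_id, cdd_key in pairs:
        by_locus[ll_id].add(cdd_key)
    counts = Counter(len(d) for d in by_locus.values())
    total = sum(counts.values())
    return {k: n / total for k, n in sorted(counts.items())}

# Toy rows, made up for illustration (not GeneKeyDB data).
toy_rows = [(1, 10), (2, 10), (2, 11), (3, 12),
            (4, 10), (4, 11), (4, 12)]

dist = domain_count_distribution(toy_rows)
for k, p in dist.items():
    # A roughly straight line of log P(k) against k would support the relation.
    print(k, p, math.log(p))
```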
13
Redundancy in CDD?
14
Future work:
1. Remove CDD redundancy.
2. Examine the distribution of the minimal set of proteins across different biological processes and subcellular locations (GO terms).
3. Apply the approach to other types of graphs with the same or different datasets, such as genes + TBS.