A New Interface to GeneKeyDB
Methods for analyzing relationships among proteins based on shared motifs
Chris Symons & Xinxia Peng


Protein domains are distinct units of protein three-dimensional structure, which also carry function. Proteins can be composed of single or multiple domains. A few thousand conserved domain models are sufficient to cover more than two thirds of known protein sequences. Marchler-Bauer A, et al. CDD: a curated Entrez database of conserved domain alignments. Nucleic Acids Research 31: (2003).

Figure: growth in the number of known proteins vs. growth in the number of unique domains. Source: Geer, L.Y., Domrachev, M., Lipman, D.J. and Bryant, S.H. (2002) CDART: Protein Homology by Domain Architecture. Genome Res., 12, 1619–1623.

Conserved Domain Database (CDD): a curated Entrez database of conserved domain alignments at NCBI. It currently contains domains derived from two popular collections, SMART and Pfam, plus contributions from colleagues at NCBI, such as COG.

Data generation using GeneKeyDB

-- Create a master table of associations between
-- LocusLink IDs (ll_id) and CDD keys (cdd_key).
CREATE TABLE peng_cddlist AS (
    SELECT a.ll_id, b.ll_refseq_nm_id, c.cdd_key, c.cdd_evalue, a.organism
    FROM ll_locus a, ll_refseq_nm b, ll_np_cdd c
    WHERE a.ll_id = b.ll_id
      AND b.ll_refseq_nm_id = c.ll_refseq_nm_id
);
COMMIT;
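The join behind the master table can be tried on a toy in-memory database. The sketch below uses Python's built-in sqlite3 module; the column types, the sample rows, and the helper name domains_by_locus are assumptions made for illustration and are not taken from GeneKeyDB itself. Only the table and column names follow the slide.

```python
import sqlite3

# Toy stand-ins for the GeneKeyDB tables. Column types and all rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ll_locus (ll_id INTEGER, organism TEXT);
CREATE TABLE ll_refseq_nm (ll_id INTEGER, ll_refseq_nm_id TEXT);
CREATE TABLE ll_np_cdd (ll_refseq_nm_id TEXT, cdd_key INTEGER, cdd_evalue REAL);

INSERT INTO ll_locus VALUES (101, 'human'), (202, 'mouse');
INSERT INTO ll_refseq_nm VALUES (101, 'NM_A'), (202, 'NM_B');
INSERT INTO ll_np_cdd VALUES
    ('NM_A', 1, 1e-20), ('NM_A', 438, 1e-5), ('NM_B', 1825, 1e-8);

-- Same join as on the slide, written with explicit JOINs for SQLite.
CREATE TABLE peng_cddlist AS
SELECT a.ll_id, b.ll_refseq_nm_id, c.cdd_key, c.cdd_evalue, a.organism
FROM ll_locus a
JOIN ll_refseq_nm b ON a.ll_id = b.ll_id
JOIN ll_np_cdd c ON b.ll_refseq_nm_id = c.ll_refseq_nm_id;
""")


def domains_by_locus(conn, organism):
    """Hypothetical helper: map each locus ID to its set of CDD keys."""
    result = {}
    rows = conn.execute(
        "SELECT ll_id, cdd_key FROM peng_cddlist WHERE organism = ?",
        (organism,),
    )
    for ll_id, cdd_key in rows:
        result.setdefault(ll_id, set()).add(cdd_key)
    return result


human = domains_by_locus(conn, "human")
```

A locus-to-domain-set mapping like this is a convenient input for the domain-list queries described on the later slides.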

Summary of Data: counts of loci and CDs for mouse and human (table values not preserved in this transcript).

Looking at groups of domains: Given a list of CDD domains, we return the proteins that are found exclusively in the intersection of those domains. If a second (third, etc.) list of domains is added, we return the proteins found exclusively in the intersection of that list, then combine it with the previous lists and repeat the query on the combined list.

Looking at groups of domains (diagram): list A, list B, and the combined list A + B.

Options: The analysis can be run on either human or mouse data. Exclusivity can be turned off, in which case all proteins in the intersection of the list of CDD keys are returned.
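A minimal sketch of the query logic, under one stated assumption: "exclusive" is read here as "the protein carries exactly the listed domains and no others", while the non-exclusive mode returns every protein that carries all listed domains. The slides do not spell out this definition, and the function name, the toy data, and the protein labels are hypothetical.

```python
def proteins_for_domains(domains_by_protein, query, exclusive=True):
    """Return proteins matching a list of CDD keys.

    exclusive=True  -> protein's domain set equals the query (assumed meaning)
    exclusive=False -> protein's domain set contains the query
    """
    query = set(query)
    hits = set()
    for protein, domains in domains_by_protein.items():
        if exclusive:
            match = domains == query
        else:
            match = query <= domains  # subset test: all query domains present
        if match:
            hits.add(protein)
    return hits


# Invented toy data: protein label -> set of CDD keys.
toy = {
    "P1": {1, 438},
    "P2": {1, 438, 1825},
    "P3": {1825},
    "P4": {1},
}

exclusive_hits = proteins_for_domains(toy, [1, 438])
all_hits = proteins_for_domains(toy, [1, 438], exclusive=False)
combined_hits = proteins_for_domains(toy, [1, 438, 1825])
```

Switching between human and mouse data would then amount to passing a different mapping as the first argument.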

Sample Input and Output
Input the first list of domains. The domains should be separated by spaces and should all be on one line
(1 438):
Input another list of domains separated by spaces (or hit q to quit): 1825
(1825):
( ):
Input another list of domains separated by spaces (or hit q to quit):
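The order of queries in a session like this can be sketched as follows. This is an illustration only: it assumes that after each new list the tool reports that list first and then the combination of all lists entered so far, which is a guess at what the empty "( ):" line stood for. The function names are hypothetical.

```python
def session_queries(lines):
    """Return the domain lists a session would query, in order.

    Each entered list is queried on its own; from the second list onward,
    the running combination of all lists is queried as well (assumption).
    Input stops at a line containing only 'q'.
    """
    combined = []
    queries = []
    for line in lines:
        line = line.strip()
        if line.lower() == "q":
            break
        if not line:
            continue
        current = [int(token) for token in line.split()]
        queries.append(current)
        if combined:
            # Extend the running combination, skipping keys already present.
            combined = combined + [d for d in current if d not in combined]
            queries.append(list(combined))
        else:
            combined = list(current)
    return queries


def label(domains):
    """Format a domain list the way the sample output labels it."""
    return "(" + " ".join(str(d) for d in domains) + "):"


queries = session_queries(["1 438", "1825", "q"])
labels = [label(q) for q in queries]
```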

Why useful? A thought 2003

Question: does log[P(k)] ~ -γk hold, where k is the number of CDs per protein and γ is a positive constant? If so, P(k) decays exponentially with k.
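One way to check a relation of this form is to fit a straight line to log P(k) against k and inspect the slope. The sketch below uses ordinary least squares in plain Python with the natural logarithm; the function name and the toy counts are assumptions, and the slides do not say which fitting method or log base was used.

```python
import math
from collections import Counter


def fit_log_linear(cd_counts):
    """Fit log P(k) = intercept + slope * k by ordinary least squares.

    cd_counts: one integer per protein, the number of CDs it contains.
    Returns (slope, intercept); a negative slope indicates exponential decay.
    """
    freq = Counter(cd_counts)
    total = len(cd_counts)
    ks = sorted(freq)
    if len(ks) < 2:
        raise ValueError("need at least two distinct values of k")
    ys = [math.log(freq[k] / total) for k in ks]
    mean_k = sum(ks) / len(ks)
    mean_y = sum(ys) / len(ys)
    sxx = sum((k - mean_k) ** 2 for k in ks)
    sxy = sum((k - mean_k) * (y - mean_y) for k, y in zip(ks, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_k
    return slope, intercept


# Invented toy data: the number of proteins halves with each extra CD.
toy_counts = [1] * 8 + [2] * 4 + [3] * 2 + [4] * 1
slope, intercept = fit_log_linear(toy_counts)
```

With the toy counts the frequencies halve at each step, so the fitted slope is -ln 2; real data would of course scatter around any fitted line.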

Redundancy in CDD?

Future work: 1. Remove CDD redundancy. 2. Examine the distribution of the minimal set of proteins across different biological processes and subcellular locations (GO terms). 3. Apply the method to other types of graph, with the same or different datasets, such as genes + TBS.