PAT project Advanced bioinformatics tools for analyzing the Arabidopsis genome Proteins of Arabidopsis thaliana (PAT) & Gene Ontology (GO) Hongyu Zhang,

Presentation transcript:

PAT project Advanced bioinformatics tools for analyzing the Arabidopsis genome Proteins of Arabidopsis thaliana (PAT) & Gene Ontology (GO) Hongyu Zhang, Ph.D.

Sequence, Structure, Function (Bioinformatics)

PAT: Structure-aided function annotation
PAT is a collaborative project between Ceres and the San Diego Supercomputer Center.
Importance of structure-aided function annotation:
– Structure carries more functional information than sequence, such as active sites and binding motifs.
– Structure is more conserved than sequence during evolution, so proteins can share similar structures even when no clear sequence similarity is detected. Using advanced structure prediction programs such as PSI-BLAST and threading algorithms, we therefore have a better chance of finding functional relationships through structure similarity than through sequence similarity.
– Structure prediction programs can also predict many structural features of proteins, such as trans-membrane tendency, electrostatic potential distribution, or coiled-coil tendency. These features help biologists infer the possible functions of novel genes.

Fold recognition: frequently implies biochemical function

Highlights in PAT annotations
Domain-based prediction
– Structure domains: PDB, SCOP
– Sequence domains: Pfam
Predictions are strictly benchmarked

Reliability categories

Category   Reliability level   Benchmark
A          Certain             >99.9%
B          Reliable            >99%
C          Probable            >90%
D          Possible            >50%
E          Potential           >10%
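The reliability table can be read as a simple threshold lookup. The sketch below is only an illustration: the thresholds come from the table, while the function name, the 0-to-1 accuracy scale, and the strict ">" comparison are assumptions, not part of the PAT pipeline itself.

```python
# Thresholds copied from the reliability table; everything else
# (names, 0-1 scale, strict ">" comparison) is assumed for illustration.
CATEGORIES = [
    ("A", "Certain", 0.999),
    ("B", "Reliable", 0.99),
    ("C", "Probable", 0.90),
    ("D", "Possible", 0.50),
    ("E", "Potential", 0.10),
]


def reliability_category(benchmark_accuracy):
    """Return (category, level) for an accuracy in [0, 1], or None if below E."""
    for code, level, threshold in CATEGORIES:
        if benchmark_accuracy > threshold:
            return code, level
    return None
```

A prediction benchmarked at 99.5% accuracy would fall in category B under this reading, and anything at or below 10% gets no category.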

Methods
Programs: Protein sequences were analyzed using a spectrum of programs, including structure prediction, function prediction, and feature annotation methods.
Database: All results were organized and stored in an Oracle relational database for ease of data access and processing.
Interface: A web-based interface convenient for both computational and non-computational biologists.

Programs used in PAT pipeline
Protein structure and function
– Homology modeling: BLAST, PSI-BLAST search against protein structure database
– Threading: 123D+ search against a protein fold library
Protein class and features
– COILS, TMHMM, SignalP, PSI-pred, PSORT

Pipeline flowchart (boxes as labeled on the slide)
Input: protein sequences
– Prediction of: signal peptides (SignalP, PSORT), transmembrane regions (TMHMM, PSORT), coiled coils (COILS), low complexity regions (SEG)
– Create PSI-BLAST profiles for protein sequences
– Structural assignment of domains by PSI-BLAST on FOLDLIB
– Only sequences without an A-prediction: structural assignment of domains by 123D on FOLDLIB
– Functional assignment by PFAM, NR, PSIPred assignments
– Domain location prediction by sequence and structure info
– Store assigned regions in the DB
Databases shown: FOLDLIB; NR, PFAM (sequence info); SCOP, PDB (structure info)
Building FOLDLIB
– Sources: PDB chains, SCOP domains, PDP domains, CE matches PDB vs. SCOP
– 90% sequence non-identical
– Minimum size 25 aa
– Coverage (90%, gaps <30, ends <30)
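One reading of the flowchart is that threading with 123D is run only for sequences that did not already receive a category-A structural assignment from PSI-BLAST. The sketch below illustrates that cascade under this assumption; the function name, the tuple layout, and the callable argument are hypothetical and are not taken from the PAT code.

```python
def structural_assignments(psiblast_hits, run_threading):
    """Combine PSI-BLAST and threading domain assignments for one sequence.

    psiblast_hits: list of (fold_id, category) tuples, category in "A".."E".
    run_threading: zero-argument callable returning more (fold_id, category)
                   tuples; it is called only when no category-A hit exists.
    Both argument shapes are assumptions made for this sketch.
    """
    assignments = list(psiblast_hits)
    has_certain_hit = any(category == "A" for _, category in psiblast_hits)
    if not has_certain_hit:
        # Fall back to the slower threading search only when needed.
        assignments.extend(run_threading())
    return assignments
```

Passing the threading step as a callable keeps the expensive search lazy, which matches the flowchart's "only sequences without an A-prediction" branch.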

GUI: Top Level

Example: P450 family
Sequence relatives detected by ordinary BLAST search
– 313 hits, when E-score cutoff is
– 324 hits, when E-score cutoff is 0.01
Sequence relatives detected by PAT
– 367 hits with confidence greater than or equal to 99%

Figure 2. SCOP results, super-family level. The figure shows the number of true positive predictions versus the number of false positive predictions for the SCOP test set. Here, two proteins that share the first three components of their SCOP sccs ids, e.g., d and d , are considered to have the same structure at the super-family level. The results show that PSI-BLAST is superior to both NCBI-BLAST and WU-BLAST in picking up true positives.
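The super-family criterion in the caption can be written as a short comparison. This is a minimal sketch assuming sccs ids are dot-separated strings of the form class.fold.superfamily.family; the example ids in the usage note are made up for illustration and are not the ones elided from the caption.

```python
def same_superfamily(sccs_a, sccs_b):
    """True if two SCOP sccs ids agree on class, fold, and super-family."""
    parts_a = sccs_a.split(".")
    parts_b = sccs_b.split(".")
    if len(parts_a) < 3 or len(parts_b) < 3:
        raise ValueError("sccs id needs at least class.fold.superfamily")
    # Compare only the first three dot-separated components.
    return parts_a[:3] == parts_b[:3]
```

With hypothetical ids, same_superfamily("b.1.1.1", "b.1.1.4") holds, while "b.1.1.1" and "b.1.2.1" differ at the super-family level.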

Acknowledgement
Dr. Nickolai Alexandrov, Dr. Philip E. Bourne, Dr. Wilfred W. Li, Dr. Greg B. Quinn, Dr. Ilya E. Shindyalov

Gene Ontology (GO) project
Gene Ontology Consortium: controlled vocabularies for the description of gene functions.
Three dimensions
– Molecular Function: the tasks performed by individual gene products; examples are transcription factor and DNA helicase
– Biological Process: broad biological goals, such as purine metabolism or mitosis, that are accomplished by ordered assemblies of molecular functions
– Cellular Component: subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and origin recognition complex

Three dimensions of GO describing a gene product: Molecular Function, Biological Process, Cellular Component

Hierarchical structure of GO term tree
GO: : Gene_Ontology
GO: : molecular_function
GO: : binding
GO: : nucleic acid binding
GO: : DNA binding
GO: : transcription factor
GO: : transcription regulator
GO: : transcription factor
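Because a GO term can appear under more than one parent (here "transcription factor" is listed twice), the tree is better handled as a directed graph. The sketch below collects all ancestors of a term. The parent links are an assumption reconstructed from the order of terms on the slide, since the original indentation and GO ids did not survive in the transcript.

```python
# Parent links assumed from the slide's term order; not authoritative.
PARENTS = {
    "Gene_Ontology": [],
    "molecular_function": ["Gene_Ontology"],
    "binding": ["molecular_function"],
    "nucleic acid binding": ["binding"],
    "DNA binding": ["nucleic acid binding"],
    "transcription regulator": ["molecular_function"],
    "transcription factor": ["DNA binding", "transcription regulator"],
}


def ancestors(term):
    """Return the set of all terms reachable by following parent links."""
    seen = set()
    stack = list(PARENTS[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen
```

Under these assumed links, a gene product annotated with "transcription factor" is implicitly annotated with every other term in the listing.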

The evidence codes used in GO

IC    inferred by curator
IDA   inferred from direct assay
IEA   inferred from electronic annotation
IEP   inferred from expression pattern
IGI   inferred from genetic interaction
IMP   inferred from mutant phenotype
IPI   inferred from physical interaction
ISS   inferred from sequence or structural similarity
NAS   non-traceable author statement
ND    no biological data available
TAS   traceable author statement
NR    not recorded

Process to annotate Ceres peptides
Download GO annotations from the TAIR website.
Annotation method:
– If the Ceres peptide matches a GO database sequence by locus name, copy all annotations of that GO database sequence to the Ceres peptide.
– Otherwise, pick the Ceres peptide's best hit that has a TAIR annotation, and copy that hit's annotations to the Ceres peptide.
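The two-step annotation rule can be sketched as follows. This is an illustration only: the function name, the dictionary layouts, and the assumption that hits arrive sorted best-first are all hypothetical, and the identifiers in the usage note are placeholders rather than real loci or GO ids.

```python
def annotate_peptides(peptide_locus, tair_annotations, ranked_hits):
    """Transfer GO annotations to peptides (sketch, assumed data layout).

    peptide_locus:    dict peptide_id -> matching locus name, or None
    tair_annotations: dict locus name -> list of GO terms
    ranked_hits:      dict peptide_id -> list of loci, best hit first
    """
    result = {}
    for peptide_id, locus in peptide_locus.items():
        # Step 1: same sequence by locus name, copy its annotations.
        if locus in tair_annotations:
            result[peptide_id] = list(tair_annotations[locus])
            continue
        # Step 2: otherwise use the best hit that has an annotation.
        result[peptide_id] = []
        for hit in ranked_hits.get(peptide_id, []):
            if hit in tair_annotations:
                result[peptide_id] = list(tair_annotations[hit])
                break
    return result
```

For example, a peptide with no locus match whose hits are ["LOCUS_X", "LOCUS_B"] would inherit the annotations of LOCUS_B if LOCUS_X has none.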

Example: P450 family
Sequence relatives detected by simple BLAST search
– 313 hits, when E-score cutoff is
– 324 hits, when E-score cutoff is 0.01
Sequence relatives detected by PAT
– 367 hits with confidence greater than or equal to 99%
Sequence relatives annotated by GO
– 365 hits
– Number of hits by evidence code:
  295 with ISS (inferred from sequence or structural similarity)
  67 with IEA (inferred from electronic annotation)
  2 with TAS (traceable author statement)
  1 with IDA (inferred from direct assay)
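As a quick consistency check, the per-evidence-code counts on this slide can be tallied to confirm they add up to the 365 GO-annotated hits. The counts are taken from the slide; the variable names are just for illustration.

```python
# Counts per GO evidence code, as listed on the slide.
evidence_counts = {"ISS": 295, "IEA": 67, "TAS": 2, "IDA": 1}

total_hits = sum(evidence_counts.values())
shares = {code: 100.0 * n / total_hits for code, n in evidence_counts.items()}

for code, n in evidence_counts.items():
    print(f"{code}: {n} hits ({shares[code]:.1f}%)")
print("total:", total_hits)
```

The total matches the 365 hits reported, and nearly all annotations rest on the two computational codes (ISS and IEA) rather than on author statements or direct assays.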

Acknowledgement
Dr. Nickolai Alexandrov, Mr. Eric Zetterbaum, Dr. Richard Flavell, etc.