Functional and structural genomics using PEDANT

Functional and structural genomics using PEDANT
Institute of Biotechnology, Yang-Ming University, Bioinformatics Program
林千涵

Introduction
- As biological sequence data grows, genome analysis needs a system able to store and retrieve tens of gigabytes of data, a mature database management system, and good visualization tools.
- The work is moving from case-oriented sequence analysis to automated large-scale genome annotation.

Introduction-PEDANT
- Existing genome analysis programs differ in:
  - protein-oriented vs. DNA-oriented analysis
  - interactive work vs. command-line operation
  - bioinformatics methods applied
  - user interface
  - convenience features, project management and data editors
  - fidelity of the results produced
- Benchmarks may vary depending on the chosen balance between sensitivity and selectivity of the analyses.
- PEDANT (Protein Extraction, Description, and ANalysis Tool) became available in mid-1997 (using FASTA for similarity search). It serves as:
  - a workhorse for general bioinformatics research
  - a common framework for a number of genome analysis projects
  - a complete database of automatically annotated genomes
  - a tool for routine analysis of large amounts of genomic contigs and ESTs

System Architecture Overview
- database module: storing, modifying and accessing data
- processing module: bioinformatics computations
- user interface: web-based communication

System Architecture-Cont.
- Data access
  - primary tables: store raw data (e.g. DNA and protein sequences) and program results (e.g. BLAST output)
  - secondary tables: parsed program results
  - simplified schema
- Operation in command-line mode
  - applying bioinformatics methods to sequences
  - parsing data tables
  - querying the resulting databases
- Web interface
  - no static HTML pages required
  - DNA and protein viewers access the SQL tables directly
- Implementation and system requirements
  - Perl 5, and C++ for the graphical viewers
- Performance
  - parallel capabilities
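The split between primary tables (raw program output) and secondary tables (parsed results) can be illustrated with a minimal sketch. It is not the PEDANT schema: the table names `blast_raw` and `blast_hits`, the tab-separated raw format, and the use of Python with SQLite (instead of Perl 5 and MySQL) are all assumptions made for illustration.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with one primary and one secondary table."""
    conn = sqlite3.connect(":memory:")
    # Primary table (hypothetical name): raw program output stored verbatim.
    conn.execute("CREATE TABLE blast_raw (seq_id TEXT PRIMARY KEY, output TEXT)")
    # Secondary table (hypothetical name): parsed fields that are cheap to query.
    conn.execute("CREATE TABLE blast_hits (seq_id TEXT, hit_id TEXT, evalue REAL)")
    return conn


def parse_raw(conn):
    """Parse the raw output (assumed 'hit_id<TAB>evalue' lines) into blast_hits."""
    rows = conn.execute("SELECT seq_id, output FROM blast_raw").fetchall()
    for seq_id, output in rows:
        for line in output.splitlines():
            if not line.strip():
                continue
            hit_id, evalue = line.split("\t")
            conn.execute(
                "INSERT INTO blast_hits VALUES (?, ?, ?)",
                (seq_id, hit_id, float(evalue)),
            )
    conn.commit()


conn = build_demo_db()
# Placeholder identifiers and values, not real search results.
conn.execute(
    "INSERT INTO blast_raw VALUES (?, ?)",
    ("orf001", "P12345\t1e-50\nQ67890\t0.5\n"),
)
parse_raw(conn)

# Querying the secondary table: hits below an E-value cutoff.
significant = conn.execute(
    "SELECT hit_id FROM blast_hits WHERE seq_id = ? AND evalue < ? ORDER BY evalue",
    ("orf001", 1e-10),
).fetchall()
```

Keeping the raw output in the primary table means the secondary table can be rebuilt whenever the parser changes, without rerunning the search.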

Schema

Bioinformatics Method
- Overview of the PEDANT processing pipeline
  - identification of coding regions and various genetic elements
  - homology search
  - detection of protein motifs, prediction of secondary structure and other protein features, and sensitive fold recognition
  - automatic attribution to pre-defined functional categories
- Prediction of genes and other genetic elements (Table 1)
  - choose one of 15 genetic codes: http://www.ncbi.nlm.nih.gov/htbin-post/Taxonomy/wprintgc?mode=c
- Functional and structural categories
  - similarity search: PSI-BLAST (Position-Specific Iterated BLAST)
  - special datasets: MIPS, COG, PROSITE, PFAM and BLOCKS
  - for significant matches to PIR: annotations, keywords, enzyme classification and superfamily information
  - for proteins with a significant relationship to PDB entries, secondary structure information: STRIDE (upper case), PREDATOR (lower case)
  - low-complexity regions, membrane regions, coiled coils and signal peptides
  - comparison with SCOP using IMPALA
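To show why the choice of genetic code matters for gene prediction, here is a minimal sketch of forward-strand ORF detection under a selectable code. It is not the gene finder PEDANT uses. Only two of the NCBI translation tables are included (1, standard, and 4, where TGA encodes tryptophan), and the function names and the `min_codons` parameter are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Amino acids in NCBI order: codons enumerated TTT, TTC, TTA, TTG, TCT, ...
GENETIC_CODES = {
    1: "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
    4: "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
}


def codon_table(code_id):
    """Map each of the 64 codons to its amino acid for the chosen genetic code."""
    codons = ("".join(c) for c in product(BASES, repeat=3))
    return dict(zip(codons, GENETIC_CODES[code_id]))


def translate(dna, code_id=1):
    """Translate in frame 0; unknown codons become 'X', stops become '*'."""
    table = codon_table(code_id)
    dna = dna.upper()
    return "".join(table.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def find_orfs(dna, code_id=1, min_codons=2):
    """Return (start, end, protein) for ATG-to-stop ORFs in the 3 forward frames.

    The reverse strand and alternative start codons are ignored in this sketch.
    """
    table = codon_table(code_id)
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None:
                if codon == "ATG":
                    start = i
            elif table.get(codon, "X") == "*":
                # Keep the ORF only if it has enough codons before the stop.
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, translate(dna[start:i], code_id)))
                start = None
    return orfs
```

With the toy sequence `ATGGCCTGAAAATAA`, table 1 stops at the TGA codon, while table 4 reads through it to the following TAA, so the same DNA yields a longer predicted protein.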

Table 1

Bioinformatics Method-Cont.
- Yeast biological role categories
  - first system of biological role categories: E. coli
  - MIPS: advanced hierarchical functional catalogue (yeast)
  - multidimensionality: the protein:gene relationship is many-to-many (M:M)
  - automated assignment to MIPS categories is a first approximation, to be refined by manual annotation
  - Distribution of ORFs
- Visualization
  - an integrated, hypertext-linked protein report with calculated parameters and sequences, as a reference for further manual annotation
  - Protein report page

Distribution of ORFs

Protein report page

Bioinformatics Method-Cont.2
- Automatic versus manual annotation
  - problem of error propagation
  - erroneous annotation caused by human error and spurious similarity hits
  - can filtering algorithms and domain structure help?
  - quality improves through manual review by human experts
- Manual annotation
  - catalogue independent
  - flexibility: place a gene in a higher category first and move it to finer categories at a later step
  - 528 categories: 20 main categories and 6 levels
  - confidence levels: "reject", "low", "medium", "high"; the default is "auto"
- Data release management
  - new release data can be intelligently merged with the existing data pool
  - manual annotation is transferred between subsequent data releases
  - "manual" field: "yes" or "no"; the initial default is "no"
  - example: a PFAM domain identified in a new-release ORF has "manual: no" and "conf: auto"
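The "manual" and "conf" fields suggest a simple merge rule between releases: keep what a curator has touched, take everything else from the new automatic run. The sketch below is one possible reading of that rule, not PEDANT's implementation. The `Annotation` record, the `merge_release` function, and matching by an unchanged ORF identifier are assumptions; cases where genes or contigs fuse or boundaries move are not handled.

```python
from dataclasses import dataclass

# Confidence values named on the slide; "auto" is the default.
CONF_LEVELS = ("auto", "reject", "low", "medium", "high")


@dataclass(frozen=True)
class Annotation:
    orf_id: str          # hypothetical stable ORF identifier
    feature: str         # e.g. a PFAM domain accession
    manual: str = "no"   # "yes" once a curator has reviewed the entry
    conf: str = "auto"   # one of CONF_LEVELS


def merge_release(old, new):
    """Merge a new automatic release with the existing annotation pool.

    Manually reviewed entries from the old release replace the matching
    automatic entries of the new release; all other new entries are kept
    with their defaults ("manual: no", "conf: auto").
    """
    curated = {(a.orf_id, a.feature): a for a in old if a.manual == "yes"}
    return [curated.get((a.orf_id, a.feature), a) for a in new]


# Placeholder records for illustration only.
old_release = [
    Annotation("orf1", "PF00001", manual="yes", conf="high"),
    Annotation("orf2", "PF00002"),
]
new_release = [
    Annotation("orf1", "PF00001"),
    Annotation("orf2", "PF00002"),
    Annotation("orf3", "PF00003"),
]
merged = merge_release(old_release, new_release)
```

A manually rejected hit (`conf="reject"`) is carried over the same way, so a spurious automatic match does not reappear as unreviewed in the next release.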

Manual annotation transfer
- two genes fuse into one contig
- two contigs fuse into one
- a gene boundary changes
- a new gene appears

The PEDANT Genome Database
- Annotation of publicly available completely sequenced and unfinished genomes
  - genomes annotated by MIPS
  - completely sequenced and published genomic sequences
  - unfinished and/or unpublished genomic sequences: gene prediction by ORPHEUS, allowing large overlaps between ORFs
- PEDANT as a structural genomics resource (0.3M proteins)
  - class-based approach, cost-saving
  - (i) non-redundant protein sequence database
  - (ii) PSI-BLAST search with SCOP sequences against (i), saving the resulting profiles
  - (iii) construct a SCOP profile library using IMPALA
  - (iv) IMPALA search with each genomic sequence against the SCOP library
  - same procedure for the non-redundant PDB sequence database
  - performance of IMPALA
- Cross-genome comparison
  - treat each genome as an individual contig: create cross-genome datasets without any modification
  - 44 genomes

Performance of IMPALA

Applications
- Arabidopsis thaliana chromosome IV
  - 3744 predicted protein-coding genes
  - roughly 30% are known proteins or strongly similar to known proteins
  - multicellular organisms have a higher proportion of all-alpha and a smaller proportion of mixed alpha/beta structural domains than unicellular species
- Assembled human transcripts
  - human UniGene subjected to PEDANT analysis, comparing over 75000 contigs
  - the MySQL database is close to 8 GB
  - acceptable query times show the suitability of PEDANT for large-scale EST sequencing projects
- Analysis of the GroEL substrates
  - GroEL: a common E. coli chaperonin
  - structural motif common to 52 substrates relying on GroEL for folding in vivo: two or more alpha/beta domains involving buried beta-sheets with large hydrophobic surfaces, which aggregate easily

Classification of predicted genes Classification by the degree of homology to functionally characterized proteins based on BLAST scores
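A score-based classification of this kind can be sketched as a lookup against ordered cutoffs. The slide gives neither the cutoffs nor the full list of class names, so every threshold and label below is a hypothetical placeholder, as is the function name `classify_by_homology`.

```python
# Hypothetical cutoffs and labels, ordered from strictest to loosest.
# The slide does not state the real values used by PEDANT.
DEFAULT_THRESHOLDS = (
    (500.0, "known protein"),
    (200.0, "strong similarity"),
    (50.0, "weak similarity"),
)


def classify_by_homology(best_score, thresholds=DEFAULT_THRESHOLDS):
    """Assign a homology class from the best BLAST score of a predicted gene.

    best_score is the score of the best hit against functionally
    characterized proteins, or None when there is no hit at all.
    """
    if best_score is None:
        return "no similarity"
    for cutoff, label in thresholds:
        if best_score >= cutoff:
            return label
    return "no similarity"
```

Passing the thresholds as a parameter keeps the decision rule separate from the data, which makes it easy to tune the balance between sensitivity and selectivity.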

Summary and Outlook
- PEDANT is a useful tool for genome annotation and bioinformatics research. It offers:
  - automated and manual assignment of gene products to functional and structural categories
  - extensive hyperlinked protein reports and advanced viewers
- Outlook
  - better decision rules need to be employed
  - manual annotation of predicted genetic elements (e.g. LTRs)
  - support for the Oracle RDBMS
  - automatic gene prediction pipeline for higher eukaryotes
  - interactive capabilities