Quality of services of a primary nucleotide sequence database

Slides:



Advertisements
Similar presentations
Bioinformatics Ayesha M. Khan Spring 2013.
Advertisements

Blast outputoutput. How to measure the similarity between two sequences Q: which one is a better match to the query ? Query: M A T W L Seq_A: M A T P.
NCBI data, sliding window programs and dot plots Sept. 25, 2012 Learning objectives-Become familiar with OMIM and PubMed. Understand the difference between.
Creating NCBI The late Senator Claude Pepper recognized the importance of computerized information processing methods for the conduct of biomedical research.
On line (DNA and amino acid) Sequence Information Lecture 7.
Basic Genomic Characteristic  AIM: to collect as much general information as possible about your gene: Nucleotide sequence Databases ○ NCBI GenBank ○
The design, construction and use of software tools to generate, store, annotate, access and analyse data and information relating to Molecular Biology.
BIOINFORMATICS Ency Lee.
Introduction to Web services MSc on Bioinformatics for Health Sciences May 2006 Arnaud Kerhornou Iván Párraga García INB.
Archives and Information Retrieval
Sequence Analysis MUPGRET June workshops. Today What can you do with the sequence? What can you do with the ESTs? The case of SNP and Indel.
Kate Milova MolGen retreat March 24, Microarray experiments: Database and Analysis Tools. Kate Milova cDNA Microarray Facility March 24, 2005.
How to use the web for bioinformatics Molecular Technologies February 11, 2005 Ethan Strauss X 1373
Biological Databases Chi-Cheng Lin, Ph.D. Associate Professor Department of Computer Science Winona State University – Rochester Center
Kate Milova MolGen retreat March 24, Microarray experiments. Database and Analysis Tools. Kate Milova cDNA Microarray Facility March 24, 2005.
IST Computational Biology1 Information Retrieval Biological Databases 2 Pedro Fernandes Instituto Gulbenkian de Ciência, Oeiras PT.
Introduction to Bioinformatics - Tutorial no. 2 Global Alignment Local Alignment FASTA BLAST.
Kate Milova MolGen retreat March 24, Microarray experiments. Database and Analysis Tools. Kate Milova cDNA Microarray Facility March 24, 2005.
EBI is an Outstation of the European Molecular Biology Laboratory. UniProt Jennifer McDowall, Ph.D. Senior InterPro Curator Protein Sequence Database:
We are developing a web database for plant comparative genomics, named Phytome, that, when complete, will integrate organismal phylogenies, genetic maps.
Sequence Analysis. Today How to retrieve a DNA sequence? How to search for other related DNA sequences? How to search for its protein sequence? How to.
How to use the web for bioinformatics Ethan Strauss X 1171
Kate Milova MolGen retreat March 24, Microarray experiments. Database and Analysis Tools. Kate Milova cDNA Microarray Facility March 24, 2005.
Data retrieval BioMart Data sets on ftp site MySQL queries of databases Perl API access to databases Export View.
An Introduction to Bioinformatics Molecular Biology Databases.
BTN323: INTRODUCTION TO BIOLOGICAL DATABASES Day2: Specialized Databases Lecturer: Junaid Gamieldien, PhD
Making Sense of DNA and protein sequence analysis tools (course #2) Dave Baumler Genome Center of Wisconsin,
On line (DNA and amino acid) Sequence Information
Wellcome Trust Workshop Working with Pathogen Genomes Module 3 Sequence and Protein Analysis (Using web-based tools)
Bioinformatics.
Databases in Bioinformatics and Systems Biology Carsten O. Daub Omics Science Center RIKEN, Japan May 2008.
Basic Introduction of BLAST Jundi Wang School of Computing CSC691 09/08/2013.
Introduction to databases Tuomas Hätinen. Topics File Formats Databases -Primary structure: UniProt -Tertiary structure: PDB Database integration system.
Introduction to Bioinformatics CPSC 265. Interface of biology and computer science Analysis of proteins, genes and genomes using computer algorithms and.
Doug Raiford Lesson 3.  More and more sequence data is being generated every day  Useless if not made available to other researchers.
UCSC Genome Browser 1. The Progress 2 Database and Tool Explosion : 230 databases and tools 1996 : first annual compilation of databases and tools.
Searching Molecular Databases with BLAST. Basic Local Alignment Search Tool How BLAST works Interpreting search results The NCBI Web BLAST interface Demonstration.
Part I: Identifying sequences with … Speaker : S. Gaj Date
Genome databases and webtools for genome analysis Become familiar with microbial genome databases Use some of the tools useful for analyzing genome Visit.
CISC667, F05, Lec9, Liao CISC 667 Intro to Bioinformatics (Fall 2005) Sequence Database search Heuristic algorithms –FASTA –BLAST –PSI-BLAST.
Organizing information in the post-genomic era The rise of bioinformatics.
Sequence Search and Analysis SPE 1653 (703)
Function preserves sequences
Motif discovery and Protein Databases Tutorial 5.
Basic Local Alignment Search Tool BLAST Why Use BLAST?
BIOLOGICAL DATABASES. BIOLOGICAL DATA Bioinformatics is the science of Storing, Extracting, Organizing, Analyzing, and Interpreting information in biological.
Genome annotation and search for homologs. Genome of the week Discuss the diversity and features of selected microbial genomes. Link to the paper describing.
Primary vs. Secondary Databases Primary databases are repositories of “raw” data. These are also referred to as archival databases. -This is one of the.
EBI is an Outstation of the European Molecular Biology Laboratory. UniProtKB Sandra Orchard.
Bioinformatics Workshops 1 & 2 1. use of public database/search sites - range of data and access methods - interpretation of search results - understanding.
Central hub for biological data UniProtKB/Swiss-Prot is a central hub for biological data: over 120 databases are cross-referenced (EMBL/DDBJ/GenBank,
Copyright OpenHelix. No use or reproduction without express written consent1.
Tools in Bioinformatics Genome Browsers. Retrieving genomic information Previous lesson(s): annotation-based perspective of search/data Today: genomic-based.
BIOINFORMATICS Ayesha M. Khan Spring 2013 Lec-8.
Information retrieval and sliding window programs April 5, 2011 Hand in Homework #1. Homework #2 due Tuesday, April 12. Learning objectives- Understand.
Web services and genome annotation in GRID by DNA Data Bank of Japan (DDBJ) Center for Information Biology and DNA Data Bank of Japan National Institute.
Using BLAST To Teach ‘E-value-tionary’ Concepts Cheryl A. Kerfeld 1, 2 and Kathleen M. Scott 3 1.Department of Energy-Joint Genome Institute, Walnut Creek,
Bioinformatics Computing 1 CMP 807 – Day 4 Kevin Galens.
Bacterial infection by lytic virus
Bacterial infection by lytic virus
Data Mining with BioMart
Archives and Information Retrieval
생물정보학 Bioinformatics.
Genome Center of Wisconsin, UW-Madison
BLAST.
Introduction to Bioinformatics
Basic Local Alignment Search Tool
Explore Evolution: Instrument for Analysis
Lesson 3 Bioinformatics Laboratory
SUBMITTED BY: DEEPTI SHARMA BIOLOGICAL DATABASE AND SEQUENCE ANALYSIS.
Presentation transcript:

Quality of services of a primary nucleotide sequence database Hideaki Sugawara Center for Information Biology and DDBJ, National Institute of Genetics SOKEN-DAI.

Primary Biological Databases International Nucleotide Sequence Database Collaboration (INSDC) serves communities JPO: Japan Patent Office DDBJ: DNA Data Bank of Japan NIG: National Institute of Genetics EMBL: European Molecular Biology Laboratory EBI: Europen Bioinfomatics Institute NCBI: National Center for Biotechnology Information EPO: European Patent Office USTPO: US Trademark and Patent Offrice JPO DDBJ NIG GenBank NCBI EMBL EBI EPO USTPO 2006/10/23 Primary Biological Databases

Infrastructure for the services Informatics Contents Mangement 2006/10/23 Primary Biological Databases

Today I will introduce: Web services (programmable interface) Derived databases Q&A databases A bug tracking system 2006/10/23 Primary Biological Databases

Mash-up of biological data resources It has bee feasible to mash-up diverse databases and tools by hand to some extent. ① Connect to a journal database to find accession numbers of INSDC A. Web site of Journal database Users with Web browser ②. Move to INSDC to search amino acid sequences by the accession numbers found in the step ① B. Web site of Nucleotide database C. Web site of Protein Data Bank (PDB) ③. Move to the Protein Data Bank (PDB) to get the 3D structure. The three steps in the above are accomplished only by click, copy and paste by mouse. However, It is very difficult when dealing with a lot of data. Nucleic Acid ResearchのDB特別号に掲載されるデータベース数は、500をこえ、 自分の目的に沿った様々な生物情報資源を効率的に取得するためには、 Webブラウザを通じて、様々なデータベースや解析ツールにアクセスする必要がある。 例えば、この例で示しておりますのは、論文に記載されている配列データについて、タンパク立体構造が 判明しているのかどうかを調べたいときのアクセス例を示しております。 一件だけならば、このようなアクセスでも苦はないのですが、複数ですと、 取得は非常に困難になってしまます。より・・・ありと考えております。 Large-scale databases are produced in OMICS. Therefore, a machanical mash-up by program will be required more and more. 2006/10/23 Primary Biological Databases

Web services for the biological problem solving environment We implemented a SOAP (Simple Object Access Protocol) server and Web services that provide a program-friendly interface. We propose standardization of bioinformatics services to improve the interoperability of desperately diverse biological data resources. 2006/10/23 Primary Biological Databases

The world of Web services 「Web services」 provider EBI, NCBI, ----- DDBJ UDDI Registration of Web services SOAP (HTTP/HTTPS) Search Web services WSDL XML Users UDDI (Universal Description, Discovery and Integration) WSDL (Web Services Description Language). 英語で説明を書く!!! 具体的なサービスの例として、Web servicesの提供者によって、UDDI()と呼ばれる利用可能なサービスの登録 が行われ、利用者はそのUDDIから自分が利用したいサービスの検索を行います。 次に、自分の利用したいサービスを見つけた利用者は、サービスが登録されている提供者から、 そのサイトで提供さえているサービスの一覧とアクセス方法が記載されているWSDLを取得し、 実際に、SOAPを介したサービスを使用することが出来ます。 WSDLとSOAPを介したサービスの提供はXMLファイルにより交換が行われます。 2006/10/23 Primary Biological Databases

XML Central of DDBJ: http://xml.nig.ac.jp/index.html As DDBJ Web services server, We open the SOAP interfaces with the several services which offer by DDBJ. 2006/10/23 Primary Biological Databases

Workflow composed of methods provided by Web services 141 methods in 15 services are available. Blast ClustalW DDBJ Ensembl Fasta Getentry GIB GTOP NCBI Genome Annotation PML RefSeq SPS SRS TxSearch VecScreen Open OMIM Gene Ontology Rfam developed Custom made workflows 2006/10/23 Primary Biological Databases

Easy to bind Web services methods from your program Download ActivePerl for WINDOWS and you can call methods in DDBJ Web services. Specify a WDSL file and call a method: In the following example, the method of getXML_DDBJEntry(accession) of the Getentry Web services is called from a Perl program: 1. Include the package needed to use SOAP service. 2. Specifies WSDL file of SOAP service you want to use. 3. Call the service you want to use. 2006/10/23 Primary Biological Databases

Workflow with Online Mendelian Inheritance in Man (OMIM) Web services tool Data Local program 閾値に関しては、記載する。 This workflow reveal homology relationship between human disease genes and genes of other eukaryotes. 2006/10/23 Primary Biological Databases

Environmental DNA sequences automatic annotation workflow In order to make full use of gene information included in nucleotide sequence database, we developed workflow of gene finding of DNA fragments obtained from pooled genome samples of uncultured microbes in environmental samples. BLASTP Environmental Sample EMBOSS (getorf) ORF Selected ORF Attached Product InterPro Scan FLOW Execute getorf in EMBOSS which finds and outputs the sequences of open reading frames (ORFs). The ORFs can be defined as regions of a specified minimum size (longer than 300bp) between START and STOP codons. (The minimum size cut-off for getorf was ) The protein product was attacheded by searching the bacterial division of DDBJ release 62 against predicted ORFs using BLASTP. (The e-value cut-off for BLASTP was 1e-40.) 遺伝子情報 We observed DNA fragments obtained from pooled genome samples of uncultured microbes in environmental and clinical samples. Evidence-based gene finding. Evidence in the form of protein alignments was used to determine the most likely coding frame. The approximate start and stop positions were likewise determined from the bounding coordinates of the alignments and refined to identify specific start and stop codons. This methodology was applied to several different lines of evidence in a series of steps as follows: First evidence was generated by searching the bacterial portion of the nraa dataset against the Sargasso data using tblastn. All blast searches were performed using NCBI blastall version 2.2.26. Unless stated otherwise all blast searches were performed with these generic parameters: -Y 3000000000000 -F "m L" -U T -v 5. The NCBI non-redundant amino acid dataset (nraa) was downloaded from NCBI on September 2nd and contains 1,510,260 peptides of which 626,877 were classified as bacterial. The tblastn searches using the bacterial subset of nraa were performed with these additional parameters: -e 1e-4 -b 10000 -K 10000. The blastx searches against all of nraa were done with these additional parameters: -e 1e-3 -b 5000 -K 10000. The tblastx searches used these parameters: -e 1e-3 -b 5000 -K 5000. The search of the Sargasso data against itself using blastn used these parameters: -W 20 -K 10000 -b 10000 -q -9 -e 1e-40. 2006/10/23 Primary Biological Databases

Access frequency of DDBJ Web services  Examples of methods frequently used: getentry, blast, RefSeq、GO 、 2006/10/23 Primary Biological Databases

Web services in the world EBI, NCBI, ---- DDBJ SOAP(HTTP/HTTPS) 利用者 (Web services map prepared by EBI) 2006/10/23 Primary Biological Databases

Primary Biological Databases Derived databases 2006/10/23 Primary Biological Databases

>60M entries >100G bps ~ 0.5 projects INSDC is not only large-scale but also complex. It is not easy to retrieve what you want to anapyze.

Primary Biological Databases A simple derived database: subset of complete microorganisms genomes from INSDC 2006/10/23 Primary Biological Databases

Gene prediction programs used 2006/10/23 Primary Biological Databases

Parameters: the cutoff length used number >20 >(=)30 >33.3aa (100bp) >40aa >50aa >60aa >66.6aa (200bp) >80 >100aa >150aa >200aa >300aa >400aa 1 25 3 7 4 2 6 2006/10/23 Primary Biological Databases

COG ID in the qualifier of /note Annotation: description of products ~ Hahella chejuensis KCTC 2396 CDS 1023521..1024429 /gene="argB" /locus_tag="HCH_01027" /EC_number="2.7.2.8" /note="COG0548" /codon_start=1 /transl_table=11 /product="Acetylglutamate kinase" /protein_id="ABC27909.1" /db_xref="GI:83631942" /translation="MLDRDNALQVAAVLSKALPYIQRFAGKTIVIKYGGNAMTDEELK NSFARDVVMMKLVGINPIVVHGGGPQIGDLLQRLNIKSSFINGLRVTDSETMDVVEMV LGGSVNKDIVALINRNGGKAIGLTGKDANFITARKLEVTRATPDMQKPEIIDIGHVGE VTGVRKDIITMLTDSDCIPVIAPIGVGQDGASYNINADLVAGKVAEVLQAEKLMLLTN IAGLMNKEGKVLTGLSTKQVDELIADGTIHGGMLPKIECALSAVKNGVHSAHIIDGRV PHATLLEIFTDEGVGTLITRKGCDDA" COG ID in the qualifier of /note ~ Archaeoglobus fulgidus DSM 4304 CDS complement(1141715..1142587) /locus_tag="AF_1280" /note="similar to GB:L77117 SP:Q60382 PID:1592260 percent identity: 56.06; identified by sequence similarity; putative" /codon_start=1 /transl_table=11 /product="acetylglutamate kinase (argB)" /protein_id="AAB89966.1" /db_xref="GI:2649301" /translation="MENVELLIEALPYIKDFHSTTMVIKIGGHAMVNDRILEDTIKDI VLLYFVGIKPVVVHGGGPEISEKMEKFGLKPKFVEGLRVTDKETMEVVEMVLDGKVNS KIVTTFIRNGGKAVGLSGKDGLLIVARKKEMRMKKGEEEVIIDLGFVGETEFVNPEII RILLDNGFIPVVSPVATDLAGNTYNLNADVVAGDIAAALKAKKLIMLTDVPGILENPD DKSTLISRIRLSELENMRSKGVIRGGMIPKVDAVIKALKSGVERAHIIDGSRPHSILI ELFTKEGIGTMVEP" Some details of homology analysis in the /note qualifier ~ Agrobacterium tumefaciens C58 circular chromosome CDS complement(373582..374466) /gene="AGR_C_666" /note="acetylglutamate kinase PA5323 {imported} - Pseudomonas aeruginosa (strain PAO1)" /codon_start=1 /transl_table=11 /product="AGR_C_666p" /protein_id="AAK86197.1" /db_xref="GI:15155294" /translation="MTSSESEIQARLLAQALPFMQKYENKTIVVKYGGHAMGDSTLGK AFAEDIALLKQSGINPIVVHGGGPQIGAMLSKMGIESKFEGGLRVTDAKTVEIVEMVL AGSINKEIVALINQTGEWAIGLCGKDGNMVFAEKAKKTVIDPDSNIERVLDLGFVGEV VEVDRTLLDLLAKSEMIPVIAPVAPGRDGATYNINADTFAGAIAGALHATRLLFLTDV PGVLDKNKELIKELTVSEARALIKDGTISGGMIPKVETCIDAIKAGVQGVVILNGKTP HSVLLEIFTEGAGTLIVP" Product ID in the qualifier of /product Product name in the qualifier of /note 2006/10/23 Primary Biological Databases

Annotation: description of products unknown hypothetical protein probable orf predicted protein putative protein hypothetical conserved protein uncharacterized protein conserved domain protein conseved hypthetical 2006/10/23 Primary Biological Databases

Annotation: inconsistency or biological variation Agrobacterium tumefaciens C58 circular chromosome (Cereon) Agrobacterium tumefaciens C58 circular chromosome (U. Washington) 2006/10/23 Primary Biological Databases

Primary Biological Databases A derived database: contents of complete microorganisms genomes were evaluated Gene Trek in Procayote Space 2006/10/23 Primary Biological Databases

Primary Biological Databases Features of GTPS Simultaneous prediction and evaluation of all the possible protein coding genes (ORFs) in prokaryote genomes All the ORFs are graded The criteria and evidence data are available Web site http://gtps.ddbj.nig.ac.jp/ Updated once a year 2006/10/23 Primary Biological Databases

Primary Biological Databases Short history of GTPS ver. (data froze in) strains Archaea Bacteria 2003 (Jul 2003) 123 14 109 2004 (Sep 2004) 183 17 166 2005 (Feb 2006) 302 25 277 2006/10/23 Primary Biological Databases

Not putative membrane nor unknown protein Grade blastp hit InterProScan hit Coverage Subject Subject alignment subject query AAAA BBBB Function known motif Not putative membrane nor unknown protein ≧ 70% AAA BBB Unknown motif & or AA BB No hit AAAA1-D3 grades Potential genes alignment/ ORF Putative membrane or unknown protein Function known or unknown motif ≧ 70% A B ≧ 70% Putative membrane protein No hit C or Function known or unknown motif No hit ≧ 70% Unknown protein D No hit & ≧ 70% Unknown protein E No hit or X No hit No hit 2006/10/23 Primary Biological Databases

Potential genes (AAAA1 - D3) Number of ORFs sorted by each grade GTPS ver. 2003 GTPS ver. 2004 GTPS ver. 2005 Grades AAAA-A BBBB-B C D E X Total 283,247 7,208 4,680 79,779 6,788 466,681 848,383 431,672 10,250 7,511 107,382 10,225 687,110 1,254,150 752,186 20,755 16,227 137,656 17,739 278,075 1,222,638 Glimmer2 and RBS finder Glimmer 3 Potential genes (AAAA1 - D3) 370,876 551,246 903,845 INSDC 362,828 537,312 904,530 2006/10/23 Primary Biological Databases

Comparison of the number of protein coding genes between GTPS and INSDC GTPS ver. 2003 GTPS ver. 2004 GTPS ver. 2005 Identical 3' matched New ORFs (Not annotated in INSDC) Not predicted by Glimmer Total (potential genes) 261,720 92,206 14,954 1,996 370,876 70.6% 24.9% 4.0% 0.5% 390,557 133,146 23,935 3,608 551,246 70.8% 24.2% 4.3% 0.7% 576,762 252,365 31,438 43,280 903,845 63.8% 27.9% 3.5% 4.8% Glimmer 3 was used for ver.2005. 2006/10/23 Primary Biological Databases

How many genes are out there? The ratio of the clustered ORFs to all the ORFs among the sampled genomes increased with the increasing number of the genomes. The ratio is not yet saturated at the 180 genomes (29.5% of the genes among 180 genomes were clustered and presumed to have the same function.). 2006/10/23 Primary Biological Databases

Primary Biological Databases Summary It is the long term mission of the primary database to archive all the data published for the long term. At the same time, it is requested to provide objective/reliable data and tools for the mash-up. It proposes and maintains standards of biological data processing for the long term. 2006/10/23 Primary Biological Databases