1
APAN e-Science Workshop
e-Bio System for Bio-Knowledge Discovery
2003.8.27
Sangsoo Kim
National Genome Information Center, Korea Research Institute of Bioscience and Biotechnology
2
Bio-Databases & Servers
Contents
– Bibliographic (journal abstracts such as Medline)
– Experimental data (sequences or structures)
– Results from annotation and analyses
– Bioinformatic analysis tools
Purpose
– Storing & managing raw data
– Querying for knowledge discovery
– Sharing information with others
– Serving others with online analysis
3
New Role of Databases
New discoveries of biological knowledge are published in scientific journals
But journal space is limited and not suited to publishing large amounts of high-throughput data
Supplementary information is provided on an accompanying website
Readers can download the supplementary information and analyze it from different perspectives
Combining it with other information may yield unexpected results
Journal publishers require supplementary information to be deposited in public archives
4
Example - Nucleotide Sequence Repositories
Nucleotide sequences discovered by sequencing experiments are deposited in any one of the public archives, and the journal paper lists only the accession numbers (without deposition, you cannot publish a sequence discovery in journals)
Public archives are
– DDBJ, operated by CIB, NIG in Japan
– EMBL, operated by EMBL-EBI in the UK
– GenBank, operated by NCBI, NIH in the USA
The contents of these archives are exchanged daily and are freely accessible to everybody
Now extended to archive DNA chip data as well
5
Growth of GenBank: A Nucleotide Sequence Repository (chart; annotation: Human Genome Project)
6
RTFM Entrez: Home Page
7
GenBank as HTML Entrez: Display FASTA as HTML
8
Example – BLAST Servers
Originally developed to compare my sequence with those in the repository, to check whether mine is novel
Extended to detect distantly related sequences, serving as the major sequence annotation tool
Servers accept various kinds of queries and return alignment results over the WWW
The most widely used bioinformatic tool
For the analysis of many sequences, a local installation is better
9
BLAST (Basic Local Alignment Search Tool)
http://www.ncbi.nlm.nih.gov/BLAST
RTFM

program   query      database
blastn    dna        dna
blastp    protein    protein
blastx    dna (6x)   protein
tblastn   protein    dna (6x)
tblastx   dna (6x)   dna (6x)
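The "(6x)" entries in the program table refer to translating a DNA sequence in all six reading frames (three offsets on each strand) before comparing it at the protein level. The following is a minimal sketch of that idea, assuming the standard genetic code; the function names are illustrative and are not part of BLAST itself.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Translate a DNA sequence in the six reading frames (+1..+3, -1..-3)."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frames("ATGGCCTAA").items():
        print(name, protein)
```

A translated search such as blastx would, conceptually, compare each of these six protein strings against a protein database; tblastx does this for both the query and the database, which is why it is the most expensive of the five programs.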
10
BLASTN (Cont'd): Descriptions, Alignments
11
Example – Derived Databases
Swiss-Prot & PIR
– Proteins are predicted from deposited nucleotide sequences, either mRNA or genomic DNA
– Functions and features of the proteins are annotated manually by experts
Protein motifs
– Prosite, Pfam, BLOCKS, InterPro
– Keyword querying and motif detection in the user’s sequence
Gene Ontology
– Hierarchical organization of biological terms
– Cataloging of associated gene products
12
Expert Protein Analysis System ExPASy (http://www.expasy.ch)
13
NiceProt View
14
Gene Ontology
Systematic classification of biological terminology
– Molecular function
– Biological process
– Cellular component
Controlled vocabulary
Associated GENE list
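To illustrate how a hierarchical controlled vocabulary supports an associated gene list, here is a small sketch. The hierarchy is a heavily simplified toy (real GO has intermediate terms and stable identifiers), and the gene names `geneA` and `geneB` are hypothetical. A gene annotated to a specific term is also treated as annotated to every ancestor of that term.

```python
# Toy hierarchy: each term maps to its parent terms (simplified, not real GO).
# Parents are lists because an ontology term may have more than one parent.
PARENTS = {
    "molecular_function": [],
    "catalytic activity": ["molecular_function"],
    "binding": ["molecular_function"],
    "kinase activity": ["catalytic activity"],
    "protein kinase activity": ["kinase activity"],
}

# Hypothetical direct annotations: gene -> terms assigned by a curator.
ASSOCIATIONS = {
    "geneA": ["protein kinase activity"],
    "geneB": ["binding"],
}


def ancestors(term):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = list(PARENTS[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def genes_annotated_to(term):
    """Genes annotated to the term directly or through a more specific term."""
    hits = []
    for gene, terms in ASSOCIATIONS.items():
        if any(t == term or term in ancestors(t) for t in terms):
            hits.append(gene)
    return sorted(hits)


if __name__ == "__main__":
    print(genes_annotated_to("catalytic activity"))
    print(genes_annotated_to("molecular_function"))
```

Querying a broad term therefore returns a superset of the genes returned for any of its more specific descendants, which is what makes the vocabulary useful for cataloging gene products at different levels of detail.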
16
Data Mining
Objective:
– Discovery of (biological) knowledge by querying information in the databases and comprehending it
Problems:
– Too many databases
– Different protocols for access
– Lack of standards
– Poor quality or propagation of errors
Solutions:
– Data warehousing or federated databases
17
Catalog of Bio-DBs arranged by Data Domain
18
Database of Databases
Data warehousing
– Collect all databases by mirroring
– Store in a unified format
– Entrez (NCBI) or SRS (EBI)
– Powerful but heavy maintenance load
Federated databases
– Maintained by participating members
– Accessed by common protocols
– Bio-DAS or Web Services via SOAP/XML
– Next generation technology, but dependent on both the cooperation of members and Internet bandwidth
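The trade-off between the two approaches can be sketched in a few lines. This is an in-memory illustration only: the `Source`, `Warehouse` and `Federation` classes, the accession `X0001` and the record contents are all hypothetical, and real systems such as Entrez, SRS or Bio-DAS are far more involved.

```python
class Source:
    """A member database exposing a common access protocol (hypothetical)."""

    def __init__(self, name, records):
        self.name = name
        self._records = dict(records)

    def fetch(self, accession):
        return self._records.get(accession)

    def dump(self):
        return dict(self._records)

    def update(self, accession, record):
        self._records[accession] = record


class Warehouse:
    """Mirrors every source into one local store with a unified format."""

    def __init__(self):
        self._store = {}

    def mirror(self, source):
        for accession, record in source.dump().items():
            self._store[accession] = {"source": source.name, "data": record}

    def query(self, accession):
        return self._store.get(accession)


class Federation:
    """Forwards each query to the member sources at query time."""

    def __init__(self, sources):
        self._sources = list(sources)

    def query(self, accession):
        for source in self._sources:
            record = source.fetch(accession)
            if record is not None:
                return {"source": source.name, "data": record}
        return None


if __name__ == "__main__":
    seqs = Source("sequence-db", {"X0001": "ACGT"})
    warehouse = Warehouse()
    warehouse.mirror(seqs)
    federation = Federation([seqs])

    seqs.update("X0001", "ACGTTT")  # the member revises its record
    print(warehouse.query("X0001")["data"])   # stale until mirrored again
    print(federation.query("X0001")["data"])  # sees the revision immediately
```

The warehouse answers from its local copy and needs re-mirroring to stay current (the maintenance load), whereas the federation is always current but depends on every member being reachable when the query is made.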
19
www.ngic.re.kr
20
www.ncbi.nih.gov/LocusLink
21
New Data Types
Textual
– Nucleotide or amino acid sequences
– Associated feature annotation
– Bibliographical texts
Numeric
– Gene expression profiles
– Results from statistical analysis
Graphical
– Protein-protein interaction network
– Genetic network
– Biochemical reaction pathways
25
Building a Nation from a Land of City States
Lincoln D. Stein
Cold Spring Harbor Laboratory
26
Italy in the Middle Ages
27
Bioinformatics, ca. 2002 Bioinformatics In the XXI Century
28
Making Easy Things Hard
Give me all human sequences submitted to GenBank/EMBL last week.
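As one illustration of how such a request might be phrased today, the sketch below only builds a query URL for NCBI's E-utilities `esearch` endpoint; it does not contact the network. The choice of database (`nucleotide`), the organism filter and the date type (`pdat`) are assumptions to check against the current NCBI documentation, and the function name is hypothetical.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Base URL of the NCBI E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def recent_human_sequences_url(days=7, max_ids=100):
    """Build (but do not send) a search for human records from the last N days."""
    params = {
        "db": "nucleotide",                # assumed target database
        "term": "Homo sapiens[Organism]",  # restrict to human records
        "reldate": days,                   # look back this many days
        "datetype": "pdat",                # assumed date field to filter on
        "retmax": max_ids,                 # cap the number of IDs returned
    }
    return ESEARCH + "?" + urlencode(params)


if __name__ == "__main__":
    print(recent_human_sequences_url())
```

The returned identifiers would still have to be fetched, parsed and stored, which is where the proliferation of one-off scripts begins.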
29
Lots of ways to do it
Download weekly update of GenBank/EMBL from FTP site
Use official network-based interfaces to data:
– NCBI toolkit
– EBI CORBA & XEMBL servers
Use friendly web interfaces at NCBI, EBI
30
Perl/Java/Python to the Rescue
One script to do the web fetch
Another to parse the file format
A third to move into private database
A fourth to repeat this weekly
Result:
– 6,719 scripts that do the same thing
– None of them work together
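A typical instance of the parse and load steps might look like the sketch below, assuming FASTA input and a private SQLite database. The fetch step is replaced by an in-memory sample so the example runs offline; the sample records, table layout and function names are all hypothetical.

```python
import sqlite3

# Stand-in for the downloaded file; these records are invented for illustration.
SAMPLE_FASTA = """>seq1 hypothetical example record
ATGGCC
TAA
>seq2 another hypothetical record
GGGTTT
"""


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def load_into_db(conn, records):
    """Store parsed records in a private table, replacing older versions."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sequences "
        "(id TEXT PRIMARY KEY, description TEXT, residues TEXT)"
    )
    for header, residues in records:
        seq_id, _, description = header.partition(" ")
        conn.execute(
            "INSERT OR REPLACE INTO sequences VALUES (?, ?, ?)",
            (seq_id, description, residues),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
load_into_db(conn, parse_fasta(SAMPLE_FASTA))

if __name__ == "__main__":
    for row in conn.execute("SELECT id, residues FROM sequences ORDER BY id"):
        print(row)
```

Every choice here (header convention, table schema, how updates are handled) is private to this script, which is exactly why thousands of such scripts end up incompatible with each other.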
31
What’s Wrong with This?
My EMBL fetcher is poorly documented so you write your own
Your fetcher won’t work with my parser
My parser won’t work with your fetcher
We’ve now wasted 20 hours rather than 10
Multiply this by 6,719
32
What Else is Wrong?
NCBI/EBI tweaks something
6,719 scripts fail at once
6,719 bioinformaticists tear their hair
21,261 biologists curse the bioinformaticists
6,719 bioinformaticists curse their own existence
33
Unifying Bioinformatics Services
MIMBD: Meetings on the Interconnection of Molecular Biology Databases
Federated models: Gaea, Kleisli
Data warehouses: GUS, MODs, Ensembl, UCSC
Ad hoc web services
Formal web services
34
Ad hoc services BioXXX Your Script Conf file
35
Formal Web Services SeqFetch Service BLAT Service Microarray Service BLAST Service SeqFetch Service GO Service
36
Formal Web Services Service Registry SeqFetch Service BLAT Service Microarray Service BLAST Service SeqFetch Service GO Service
37
Formal Web Services Your Script Service Registry BioXXX Microarray Service SeqFetch Service BLAT Service Microarray Service BLAST Service SeqFetch Service GO Service
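The role of the service registry in this picture can be sketched as follows. This is an in-process illustration, not a SOAP or UDDI implementation: the `ServiceRegistry` class, the provider names and the placeholder sequence data are all hypothetical. The script asks the registry for a service type rather than hard-coding one provider, so it keeps working when a provider fails.

```python
class ServiceRegistry:
    """Maps a service type (e.g. 'SeqFetch') to the providers offering it."""

    def __init__(self):
        self._providers = {}

    def register(self, service_type, provider_name, handler):
        self._providers.setdefault(service_type, []).append((provider_name, handler))

    def lookup(self, service_type):
        return list(self._providers.get(service_type, []))


def call_first_available(registry, service_type, *args):
    """Try each registered provider in turn and return the first success."""
    errors = []
    for provider_name, handler in registry.lookup(service_type):
        try:
            return provider_name, handler(*args)
        except Exception as exc:  # a failing provider should not stop the script
            errors.append(f"{provider_name}: {exc}")
    raise LookupError(f"no working provider for {service_type}: {errors}")


def seqfetch_provider_a(accession):
    # Simulates a provider that has changed or gone offline.
    raise ConnectionError("provider A is unavailable")


def seqfetch_provider_b(accession):
    # Placeholder response; a real service would return the actual record.
    return f">{accession}\nACGT"


registry = ServiceRegistry()
registry.register("SeqFetch", "provider-A", seqfetch_provider_a)
registry.register("SeqFetch", "provider-B", seqfetch_provider_b)

if __name__ == "__main__":
    print(call_first_available(registry, "SeqFetch", "X0001"))
```

Because two SeqFetch services are registered, the failure of one is absorbed by the lookup instead of breaking every script that depends on it.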
38
Technical Infrastructure is Here*
Common vocabulary: GO
Transport format: XML
Data definition language: XSD
Wire protocol: SOAP
Service definition language: WSDL
Service registry: UDDI
*(almost)
39
Distributed Annotation System
http://www.biodas.org
Diagram labels: Reference Server; Annotation Server (two); sequences AC003027, AC005122, M10154; WI1029, AFM820, AFM1126, WI443
Thursday 10:30 AM Canyon IV
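The division of labour in this scheme (one reference server defining the coordinate system, independent annotation servers each contributing features on those coordinates) can be sketched as below. This is not the DAS protocol itself: the reference ID `REF1`, the server names and the features are hypothetical stand-ins for what real servers would return.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    start: int
    end: int
    label: str
    source: str


# Stand-in for the reference server: sequence ID -> sequence length.
REFERENCE_LENGTHS = {"REF1": 1000}

# Stand-ins for independent annotation servers, keyed by reference ID.
ANNOTATION_SERVERS = {
    "genes": {"REF1": [(100, 400, "gene-1")]},
    "markers": {"REF1": [(250, 260, "marker-7"), (900, 1200, "marker-9")]},
}


def merged_features(ref_id, region_start, region_end):
    """Collect features from all servers that overlap the requested region."""
    length = REFERENCE_LENGTHS[ref_id]
    merged = []
    for source, tracks in ANNOTATION_SERVERS.items():
        for start, end, label in tracks.get(ref_id, []):
            if end > length:
                continue  # coordinates do not fit the reference sequence
            if start <= region_end and end >= region_start:
                merged.append(Feature(start, end, label, source))
    return sorted(merged, key=lambda f: (f.start, f.end))


if __name__ == "__main__":
    for feature in merged_features("REF1", 200, 300):
        print(feature)
```

The client does the merging, so annotation providers never need to coordinate with each other; they only have to agree on the reference coordinates.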
40
Europe, ca 2000
41
Bioinformatics, ca 2010?
42
NGIC KNIH Human Proteome Animal Ag-Bio Crop Plant Microbial Universities Research Institutes Industry Collection and Sharing of National Genome Information
43
NGIC KNIH Human Proteome Animal Ag-Bio Crop Plant Microbial Data Grid KISTI ETRI Application Grid National Genome Information Network