1
IT-based Protein Sequence Analysis
2004-06-07
Center for Computational Biology & Bioinformatics
Korea Institute of Science & Technology Information
2
Center for Computational Biology & Bioinformatics / KISTI http://www.ccbb.re.kr 2015-06-01 2/ 31
Contents
BLAST (Basic Local Alignment Search Tool)
Information Retrieval/Text Categorization
N-Gram Indexing & Retrieval
KRISTAL-2002 Information System
Bio-KRISTAL
ProSeS (Protein Sequence Retrieval)
Performance of ProSeS
ProSLP (Protein Sequence Categorization)
Performance of ProSLP
Future Works
3
BLAST
BLAST (Basic Local Alignment Search Tool)
– The prevailing sequence retrieval tool
– Searches DNA/protein sequences based on LOCAL homologies among sequences
– Adopts a primitive and limited indexing scheme for 3-grams
– Extends matches from candidate 3-grams to retrieve similar sequences
Pattern matching algorithms have relatively high computational complexity and cause severe retrieval delays in various BLAST services
– Only supports sequence retrieval
4
Information (Text) Retrieval
Text features: words as indexes
Index storage: inverted file
Text search: look up the word-document list in the inverted file
[Figure: Text Collection → Inverted File → Search & Retrieval]
Text Collection (documents numbered (1) to (5)): A ladybug has beautiful … / Bugs hide from enemy … / enemy of aphids is wasps that … / Ladybug as enemy agri… / Night heron has short legs and …
Inverted File: ladybug → 1, 5; enemy → 2, 3, 5; ...
Search & Retrieval: (ladybug) 1, 5; (enemy) 2, 3, 5; (ladybug & enemy) 5; (ladybug | enemy) 5, 1, 2, 3
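The inverted-file lookup on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the KRISTAL implementation: the function name `build_inverted_file`, the tokenizer, and the assignment of document numbers to the five snippets are assumptions, chosen so that the postings match the ones shown on the slide.

```python
import re
from collections import defaultdict

def build_inverted_file(docs):
    """Map each lowercased word to the sorted list of ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            postings[word].add(doc_id)
    return {word: sorted(ids) for word, ids in postings.items()}

# Snippets from the slide; the numbering is assumed so that it agrees
# with the postings "ladybug: 1, 5" and "enemy: 2, 3, 5".
docs = {
    1: "A ladybug has beautiful",
    2: "Bugs hide from enemy",
    3: "enemy of aphids is wasps that",
    4: "Night heron has short legs and",
    5: "Ladybug as enemy agri",
}

index = build_inverted_file(docs)
ladybug = set(index["ladybug"])
enemy = set(index["enemy"])

and_hits = sorted(ladybug & enemy)  # Boolean AND: intersection of posting lists
or_hits = sorted(ladybug | enemy)   # Boolean OR: union of posting lists
```

Boolean queries reduce to set operations on posting lists, which is why lookup cost depends on the length of the lists rather than on the size of the whole collection.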
5
Text Categorization
Text feature: usually words
Feature storage: various methods to store category-feature relations
Text categorization: compare the input document with the category-feature relations
[Figure: Text Collection → Category-Feature Relation DB → Categorization]
Text Collection (categories shown: Insect, Bird, Agricul.): A ladybug has beautiful … / Bugs hide from enemy … / enemy of aphids is wasps that … / Ladybug as enemy agri… / Night heron has short legs and …
Input document: As a natural enemy, a ladybug eats about 400 aphids a day … (features: enemy, ladybug, aphids)
Categorization result: Insect (95%), Agricul. (67%)
6
Protein sequence as a Text?
Is it possible to regard a protein sequence as natural-language text written in an alphabet of 20 amino acids?
– If YES, text retrieval and categorization algorithms can be applied to retrieve and classify protein sequences.
– Cf. Chinese texts, usually written without spaces, have been successfully retrieved with 2-gram indexing.
7
Indexing Protein Sequences
Overlapping N-Gram method
– Candidates: N = 3, 4, 5, 6
– Example: 4-gram indexing for a protein sequence
Sequence: TASHNPGGKEHGDFGIGAPAPEDFTDQI
4-grams: TASH ASHN SHNP HNPG NPGG PGGK GGKE GKEH .... EDFT DFTD FTDQ TDQI
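The overlapping N-gram method amounts to sliding a window of width N along the sequence one residue at a time. A minimal sketch follows; the function name `overlapping_ngrams` is illustrative and not taken from the slides.

```python
def overlapping_ngrams(sequence, n):
    """Return every length-n substring of the sequence, shifting by one position."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]

# The example sequence from the slide, indexed with N = 4.
seq = "TASHNPGGKEHGDFGIGAPAPEDFTDQI"
grams = overlapping_ngrams(seq, 4)

first_eight = grams[:8]  # TASH ASHN SHNP HNPG NPGG PGGK GGKE GKEH
last_four = grams[-4:]   # EDFT DFTD FTDQ TDQI
```

A sequence of length L yields L - N + 1 overlapping N-grams, so the 28-residue example produces 25 index terms for N = 4.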
8
Sequence Retrieval
Vector Space Model
– Sequences are represented as vectors of the n-grams that occur in them
– Similarities are computed as the inner product between vectors
– Advantages
   Fast
   Easy to implement
– Weaknesses
   Does not reflect LOCALITY information among n-grams
   This may give unreliable similarity scores for sequences with low homology
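The vector space model described here can be sketched as follows. The slides do not state the term weighting, so raw n-gram counts are used as an assumption; the names `ngram_vector` and `inner_product` and the toy sequences are illustrative only.

```python
from collections import Counter

def ngram_vector(sequence, n=5):
    """Sparse vector of overlapping n-gram counts (raw counts are an assumed weighting)."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))

def inner_product(u, v):
    """Inner product of two sparse count vectors."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the smaller vector
    return sum(count * v[gram] for gram, count in u.items())

query = ngram_vector("TASHNPGGKEHG")
identical = ngram_vector("TASHNPGGKEHG")
# Same residues with the two halves swapped: the segment order is different,
# yet several 5-grams still match because position is not recorded.
rearranged = ngram_vector("GKEHGTASHNPG")

score_identical = inner_product(query, identical)
score_rearranged = inner_product(query, rearranged)
```

The rearranged sequence still scores half of the identical one, which illustrates the weakness noted on the slide: the vector records which n-grams occur, not where they occur.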
9
Sequence Retrieval: Similarity
10
KRISTAL-2002
General-purpose Information Retrieval & Management System developed by KISTI
Successfully applied to the retrieval and management of bibliographies, full texts, journal articles, and thesis databases.
Operates as the major information system for the Science & Technology services of KISTI.
Supports Boolean, vector space, and extended Boolean retrieval models.
Client/server architecture
Supports some DBMS facilities such as logging, online document editing, and consistency control
11
Bio-KRISTAL
Built on top of KRISTAL-2002
Design
– Sequence indexers, a protein sequence classification engine, and a sequence retrieval model are implemented as part of the KRISTAL-2002 information system.
Status
– Protein Sequence Indexer (completed)
   Applied to ProSeS (http://proses.kisti.re.kr)
– DNA Sequence Indexer (in development)
– Protein Sequence Classifier (completed)
   Applied to ProSLP (http://proslp.kisti.re.kr)
– Novel retrieval model dedicated to DNA/protein sequences (in design)
12
Bio-KRISTAL Architecture
[Figure: Bio-KRISTAL System Architecture]
Components shown in the figure: KRISTAL-2002 Information Retrieval & Management System; Annotation Indexer; DB1, DB2, …, DBn; Catalog Set; Fast Information Retrieval Engine; Data Loader; Set Manager; Retrieval Oriented Storage Engine; DNA Sequence Indexer; Protein Sequence Indexer; DNA/Protein Sequence Retrieval Engine; Protein Sequence Classification Engine
13
ProSeS
Protein Sequence Search
URL: http://proses.kisti.re.kr
Target DB: PIR-NREF
– Protein sequence retrieval based on 5-gram indexing
– Similarity search by the vector space model
– Alternative or supplement to BLAST
Additional services
– Provides related superfamily information
– Provides prediction of subcellular location
– Suggests major keywords to help annotation
Intended to provide an overall sequence ANALYSIS service
14
ProSeS Web interface : FASTA format
15
ProSeS Search result : Abstract view
16
ProSeS Search Result: Alignment –Smith-Waterman algorithm
17
ProSeS Prediction of protein subcellular localization Related protein superfamilies
18
ProSeS Major keywords
19
Performance of ProSeS
Test data
– PIR-NREF Release 1.26: 1.3 million sequences
Test queries: 100 randomly chosen protein sequences
N-gram sizes tested: N = 3, 4, 5, 6
Method: comparison with BLAST results
– Treating the BLAST results as the correct answers, the 11-pt. average precision is measured for each query sequence.
Cf. Although BLAST output cannot be guaranteed to be correct, measuring against BLAST shows the overall performance of ProSeS.
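The 11-pt. average precision used in this evaluation can be sketched as below. This follows the standard interpolated definition from information retrieval (precision at recall levels 0.0, 0.1, ..., 1.0, each taken as the best precision at that recall or higher); whether the authors used exactly this variant is an assumption, and the function name and toy ranking are illustrative.

```python
def eleven_point_average_precision(ranked, relevant):
    """Interpolated 11-point average precision of a ranked list against a reference set."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("the reference answer set must not be empty")

    # (recall, precision) observed after each relevant item in the ranking.
    points = []
    hits = 0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))

    # Interpolated precision at a level = best precision at any recall >= level.
    levels = [i / 10 for i in range(11)]
    interpolated = [
        max((p for r, p in points if r >= level - 1e-12), default=0.0)
        for level in levels
    ]
    return sum(interpolated) / len(levels)

# Toy case: the reference tool returned {"a", "b"}; the ranking under test puts
# one unrelated sequence between them.
score = eleven_point_average_precision(["a", "x", "b"], {"a", "b"})
```

In the setting of this slide, `relevant` would be the BLAST hit list for a query and `ranked` the ProSeS output for the same query, with the per-query scores averaged over the 100 test queries.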
20
Performance of ProSeS
N-Gram Information
– N6-A18: 6-gram indexing over 18 codes, in which V/I and F/Y, the amino acid pairs that show the highest scores in the BLOSUM62 scoring matrix, are each treated as a single code
21
Performance of ProSeS
[Figure: 11-pt. average recall-precision graph]
Precision at recall 0.1 = 0.87
11-pt. average precision = 0.63
22
Performance of ProSeS
Performance of N-Grams tested
– 5-gram shows the best performance.
   Its retrieval is 38 times faster than BLAST.
   Similarity between BLAST and ProSeS outputs is about 63% in terms of 11-pt. average precision.
   However, 5-gram requires 5.3 times the disk space to store the index information.
23
Performance of ProSeS
Conclusion
– Among the n-grams tested, 5-gram showed the best performance for protein sequence retrieval.
– The overall similarity between the outputs of ProSeS and BLAST is about 63%, and the top 10% similarity is 87%.
– ProSeS with 5-gram is 38 times as fast as BLAST.
Discussion
– ProSeS and BLAST produced almost identical results for sequences with strong homologies.
– However, they behaved very differently for sequences with low similarity or small local alignments.
24
ProSLP – Subcellular Localization
ProSLP (Protein Subcellular Localization Prediction)
URL: http://proslp.kisti.re.kr
Data set: 52,000 sequences from the Swiss-Prot DB with annotated subcellular locations
Classifier: kNN (k-Nearest Neighbor) text categorization algorithm
– Predicts subcellular location(s) based on the top k sequences most similar to the input sequence
Additional service
– Functions as a meta agent for other prediction services such as Psort, Ploc, and Predotar
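The kNN prediction step can be sketched as follows. This is not the ProSLP implementation: similarity is taken to be the n-gram inner product from the vector space model, votes are weighted by similarity, and the sequences and location labels are made-up toy data. All three choices are assumptions for illustration.

```python
from collections import Counter, defaultdict

def ngram_vector(sequence, n):
    """Sparse vector of overlapping n-gram counts."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))

def similarity(u, v):
    """Inner product of two sparse count vectors."""
    return sum(count * v[gram] for gram, count in u.items())

def knn_predict(query, training, k=3, n=3):
    """Rank locations by similarity-weighted votes of the k most similar training sequences."""
    q = ngram_vector(query, n)
    scored = sorted(
        ((similarity(q, ngram_vector(seq, n)), location) for seq, location in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = defaultdict(float)
    for score, location in scored[:k]:
        if score > 0:  # neighbours with no shared n-gram do not vote
            votes[location] += score
    return sorted(votes.items(), key=lambda item: item[1], reverse=True)

# Toy training set: invented sequences and labels, not Swiss-Prot data.
training = [
    ("MKTAYIAKQR", "nucleus"),
    ("MKTAYIAKQL", "nucleus"),
    ("GGSLVPRGSH", "cytoplasm"),
    ("AVLLGGSLVP", "cytoplasm"),
]
prediction = knn_predict("MKTAYIAK", training, k=3, n=3)
```

Returning the full ranked list rather than a single label matches the slide's wording "location(s)": a protein can be reported with more than one candidate location.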
25
ProSLP – Subcellular Localization Web Interface : FASTA format
26
ProSLP – Subcellular Localization Prediction Results
27
Performance of ProSLP Data Set
28
Performance of ProSLP Results for data sets
29
Performance of ProSLP Comparison with PLOC
30
Performance of ProSLP
Conclusions
– The kNN classifier for subcellular localization showed 81% precision on the small PLOC data set and 93% on the large SLP data set.
Discussion
– The more training samples, the higher the precision of the kNN classifier.
31
Future Works
Novel retrieval model that supports localities among n-grams in a sequence.
Mixed N-gram indexing: for example, 5,6-grams
– More storage will be required, but partial locality may be retained.
Extension of Bio-KRISTAL to DNA sequences
Development of a practical service for protein sequence analysis
– Fast sequence retrieval, functional/structural classification, prediction of protein subcellular localization, text mining, etc.