
1 IT-based Protein Sequence Analysis
2004-06-07
Center for Computational Biology & Bioinformatics, Korea Institute of Science & Technology Information

2 Center for Computational Biology & Bioinformatics / KISTI http://www.ccbb.re.kr 2015-06-01 2/ 31
Contents
BLAST (Basic Local Alignment Search Tool)
Information Retrieval/Text Categorization
N-Gram Indexing & Retrieval
KRISTAL-2002 Information System
Bio-KRISTAL
ProSeS (Protein Sequence Retrieval)
Performance of ProSeS
ProSLP (Protein Sequence Categorization)
Performance of ProSLP
Future Works

3 BLAST
BLAST (Basic Local Alignment Search Tool)
–The prevailing sequence retrieval tool
–Searches DNA/protein sequences based on LOCAL homologies among sequences
–Adopts a primitive and limited indexing scheme for 3-grams
–Extends matches from candidate 3-grams to retrieve similar sequences; the pattern matching algorithms have relatively high computational complexity and cause severe retrieval delays in various BLAST services
–Only supports sequence retrieval

4 Information (Text) Retrieval
Text features: words as indexes
Index storage: inverted file
Text search: look up the word-document list in the inverted file
Text collection
(1) A ladybug has beautiful …
(2) Bugs hide from enemy …
(3) enemy of aphids is wasps that …
(4) Night heron has short legs and …
(5) Ladybug as enemy agri…
Inverted file
ladybug: 1, 5
enemy: 2, 3, 5
Search & retrieval
(ladybug) 1, 5
(enemy) 2, 3, 5
(ladybug&enemy) 5
(ladybug|enemy) 5, 1, 2, 3
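The inverted-file lookup on this slide can be sketched in a few lines of Python. This is only an illustration, not the KRISTAL implementation: the function name `build_inverted_index` is hypothetical, and the document numbers are assigned so that the posting lists match the ones shown on the slide (ladybug: 1, 5; enemy: 2, 3, 5).

```python
import re
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(doc_id)
    return index

# Toy collection from the slide; numbering assumed to match its posting lists.
docs = {
    1: "A ladybug has beautiful ...",
    2: "Bugs hide from enemy ...",
    3: "enemy of aphids is wasps that ...",
    4: "Night heron has short legs and ...",
    5: "Ladybug as enemy ...",
}

index = build_inverted_index(docs)
ladybug = index["ladybug"]   # posting list for "ladybug"
enemy = index["enemy"]       # posting list for "enemy"

print(sorted(ladybug))           # [1, 5]
print(sorted(enemy))             # [2, 3, 5]
print(sorted(ladybug & enemy))   # AND query: [5]
print(sorted(ladybug | enemy))   # OR query:  [1, 2, 3, 5]
```

Boolean queries reduce to set intersection and union over posting lists, which is why lookup cost depends on the length of the lists rather than on the size of the collection.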

5 Text Categorization
Text feature: usually words
Feature storage: various methods to store category-feature relations (category-feature relation DB)
Text categorization: compare the input document with the category-feature relations
[Figure: a text collection labeled by category (Insect, Bird, Agricul.) feeds the category-feature relation DB, which is used for categorization. The collection is the same five texts: "A ladybug has beautiful …", "Bugs hide from enemy …", "enemy of aphids is wasps that …", "Ladybug as enemy agri…", "Night heron has short legs and …".]
Example input: "As a natural enemy, a ladybug eats about 400 aphids a day …"
Extracted features: (enemy, ladybug, aphids)
Result: Insect (95%), Agricul. (67%)

6 Protein sequence as a Text?
Is it possible to regard a protein sequence as text in a natural language written with an alphabet of 20 amino acids?
–If YES, text retrieval and categorization algorithms can be applied to retrieve and classify protein sequences.
–Cf. Chinese texts, usually written without spaces, have been successfully retrieved with 2-gram indexing.

7 Indexing Protein Sequences
Overlapping N-Gram method
–Candidates: N = 3, 4, 5, 6
–Example: 4-Gram indexing for the protein sequence TASHNPGGKEHGDFGIGAPAPEDFTDQI
TASH ASHN SHNP HNPG NPGG PGGK GGKE GKEH .... EDFT DFTD FTDQ TDQI
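A minimal sketch of overlapping N-gram extraction, using the example sequence from this slide. The function name `overlapping_ngrams` is an assumption made for illustration and is not taken from Bio-KRISTAL.

```python
def overlapping_ngrams(sequence, n):
    """Return every length-n window of the sequence, sliding one residue at a time."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]

sequence = "TASHNPGGKEHGDFGIGAPAPEDFTDQI"
grams = overlapping_ngrams(sequence, 4)

print(grams[:8])   # ['TASH', 'ASHN', 'SHNP', 'HNPG', 'NPGG', 'PGGK', 'GGKE', 'GKEH']
print(grams[-4:])  # ['EDFT', 'DFTD', 'FTDQ', 'TDQI']
print(len(grams))  # 25 index terms for a 28-residue sequence
```

A sequence of length L yields L - N + 1 overlapping N-grams, so the number of index terms per sequence is almost independent of N, while the vocabulary (up to 20^N distinct terms) grows quickly with N.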

8 Sequence Retrieval
Vector Space Model
–Sequences are represented as vectors of the n-grams occurring in them
–Similarities are computed as the inner product between vectors
–Advantages: fast; easy to implement
–Weaknesses: does not reflect LOCALITY information among n-grams; this may give unreliable similarity scores for sequences with low homology
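The vector space comparison can be sketched as follows. The slides do not state the term weighting, so this sketch assumes raw n-gram counts as vector components; the names `ngram_vector` and `inner_product` are hypothetical. The second comparison illustrates the locality weakness: swapping two halves of a sequence still keeps most of the score.

```python
from collections import Counter

def ngram_vector(sequence, n=5):
    """Sparse vector of overlapping n-gram counts (raw counts are an assumption)."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))

def inner_product(u, v):
    """Inner product of two sparse count vectors."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the smaller vector
    return sum(count * v[gram] for gram, count in u.items())

original = ngram_vector("TASHNPGG", n=3)
swapped = ngram_vector("NPGGTASH", n=3)   # same two blocks, order exchanged

print(inner_product(original, original))  # 6: all six 3-grams match
print(inner_product(original, swapped))   # 4: TAS, ASH, NPG, PGG still match
```

Because the vectors record only which n-grams occur and how often, the rearranged sequence keeps four of six matches even though no alignment of the full length exists.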

9 Sequence Retrieval: Similarity

10 KRISTAL-2002
General-purpose Information Retrieval & Management System developed by KISTI
Successfully applied to retrieval and management of bibliographies, full texts, journal articles, and theses databases
Operating as the major information system for the Science & Technology services of KISTI
Supports Boolean, vector space, and extended Boolean retrieval models
Client/Server architecture
Supports some DBMS facilities such as logging, on-line document editing, and consistency control

11 Bio-KRISTAL
Developed on top of KRISTAL-2002
Design
–Sequence indexers, a protein sequence classification engine, and a sequence retrieval model are implemented as part of the KRISTAL-2002 information system.
Status
–Protein Sequence Indexer (completed); applied to ProSeS (http://proses.kisti.re.kr)
–DNA Sequence Indexer (developing)
–Protein Sequence Classifier (completed); applied to ProSLP (http://proslp.kisti.re.kr)
–Novel retrieval model dedicated to DNA/Protein sequences (in design)

12 Bio-KRISTAL Architecture
[Figure: Bio-KRISTAL system architecture, built on the KRISTAL-2002 Information Retrieval & Management System. Labeled components: Annotation Indexer, DNA Sequence Indexer, Protein Sequence Indexer, DNA/Protein Sequence Retrieval Engine, Protein Sequence Classification Engine, Fast Information Retrieval Engine, Retrieval Oriented Storage Engine, Data Loader, Catalog Set, Set Manager, DB1, DB2, …, DBn.]

13 ProSeS
Protein Sequence Search
URL: http://proses.kisti.re.kr
Target DB: PIR-NREF
–Protein sequence retrieval based on 5-Gram indexing
–Similarity search by vector space model
–Alternative or supplementary to BLAST
Additional services
–Provides related superfamily information
–Provides prediction of subcellular location
–Suggests major keywords to help annotation
Intended to provide an overall sequence ANALYSIS service

14 ProSeS Web interface: FASTA format

15 ProSeS Search result: Abstract view

16 ProSeS Search result: Alignment
–Smith-Waterman algorithm

17 ProSeS Prediction of protein subcellular localization; related protein superfamilies

18 ProSeS Major keywords

19 Performance of ProSeS
Test data
–PIR-NREF Release 1.26: 1.3 million sequences
Test queries: 100 randomly chosen protein sequences
Tested N = 3, 4, 5, 6
Method: compare with BLAST results
–Treating BLAST results as the correct answers, the 11-pt. average precision is measured for each query sequence.
–Cf. BLAST output is not guaranteed to be the correct answer, but measuring against it indicates the overall performance of ProSeS.
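The evaluation measure can be sketched as below. The slides do not say which variant of 11-point average precision was used; this sketch assumes the standard interpolated form from information retrieval, where precision at each recall level 0.0, 0.1, …, 1.0 is the best precision reached at that recall or higher. The function name and the toy identifiers are illustrative only; in the described setup the relevant set would be the BLAST hits for a query.

```python
def eleven_point_average_precision(ranked, relevant):
    """Interpolated 11-point average precision of a ranked result list.

    ranked:   result ids in rank order (assumed to contain no duplicates)
    relevant: ids treated as correct answers (here: the BLAST hits)
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    points = []  # (relevant items found so far, precision) at each relevant rank
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            points.append((hits, hits / rank))
    total = 0.0
    for level in range(11):  # recall levels 0.0, 0.1, ..., 1.0
        # Integer comparison avoids floating-point issues: recall >= level / 10
        reachable = [p for h, p in points if 10 * h >= level * len(relevant)]
        total += max(reachable, default=0.0)
    return total / 11

# Two relevant sequences, found at ranks 1 and 3.
score = eleven_point_average_precision(["a", "x", "b", "y"], {"a", "b"})
print(round(score, 3))  # 0.848
```

Averaging this score over all 100 test queries would give a single figure comparable to the 11-pt. average reported in the results.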

20 Performance of ProSeS
N-Gram Information
–N6-A18: 6-grams over an 18-letter alphabet; among the 20 amino acid codes, V → I and F → Y are merged, as these pairs show the highest scores in the BLOSUM62 scoring matrix
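One reading of the N6-A18 configuration is sketched below: V is mapped to I and F to Y before 6-grams are extracted, which leaves 18 distinct letters. The direction of the merge and the helper names are assumptions; the slide only names the two pairs.

```python
# Assumed mapping: V is rewritten as I, and F as Y (20 letters become 18).
MERGE_TABLE = str.maketrans({"V": "I", "F": "Y"})

def reduce_alphabet(sequence):
    """Rewrite a protein sequence in the reduced 18-letter alphabet."""
    return sequence.upper().translate(MERGE_TABLE)

def reduced_ngrams(sequence, n=6):
    """Overlapping n-grams computed on the reduced-alphabet sequence."""
    reduced = reduce_alphabet(sequence)
    return [reduced[i:i + n] for i in range(len(reduced) - n + 1)]

print(reduce_alphabet("VIFY"))    # IIYY
print(reduced_ngrams("AVLFKGT"))  # ['AILYKG', 'ILYKGT']
print(reduced_ngrams("AILYKGT"))  # ['AILYKG', 'ILYKGT'] (same index terms)
```

Merging similar residues lets two sequences that differ only by a conservative substitution share index terms, and it shrinks the 6-gram vocabulary from 20^6 to 18^6 possible terms.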

21 Performance of ProSeS
[Figure: 11-pt. average recall-precision graph]
Precision at recall 0.1 = 0.87
11-pt. average = 0.63

22 Performance of ProSeS
Performance of N-Grams tested
–5-Gram shows the best performance.
–Its retrieval is 38 times as fast as BLAST.
–Similarity between BLAST and ProSeS outputs is about 63% in terms of 11-pt. average precision.
–However, 5-gram requires 5.3 times as much disk space to store the index information.

23 Performance of ProSeS
Conclusion
–Among the n-grams tested, 5-gram showed the best performance for protein sequence retrieval.
–The overall similarity of the outputs of ProSeS and BLAST is about 63%, and the top 10% similarity is 87%.
–ProSeS with 5-gram is 38 times as fast as BLAST.
Discussion
–ProSeS and BLAST presented almost identical results for sequences with strong homologies.
–However, they behaved very differently for sequences with low similarity or small local alignments.

24 ProSLP – Subcellular Localization
ProSLP (Protein Subcellular Localization Prediction)
URL: http://proslp.kisti.re.kr
Data set: 52,000 sequences from Swiss-Prot DB with annotated subcellular locations
Classifier: kNN (k-Nearest Neighbor) text categorization algorithm
–Predicts subcellular location(s) based on the top k sequences most similar to the input sequence
Additional service
–Functions as a meta agent for other prediction services such as PSORT, PLOC, and Predotar
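A minimal sketch of kNN prediction over n-gram vectors follows. Several details are assumptions, since the slides do not give them: similarity is the inner product of raw n-gram counts, each of the top k neighbors votes with its similarity score, and neighbors with zero similarity are ignored. The training sequences and labels are invented toy data, not Swiss-Prot entries.

```python
from collections import Counter, defaultdict

def ngram_vector(sequence, n):
    """Sparse vector of overlapping n-gram counts."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))

def similarity(u, v):
    """Inner product of two sparse count vectors."""
    return sum(count * v[gram] for gram, count in u.items())

def knn_predict(query, training, k=3, n=3):
    """Rank candidate locations by similarity-weighted votes of the top k neighbors.

    training: list of (sequence, location) pairs with known annotations
    """
    q = ngram_vector(query, n)
    scored = sorted(
        ((similarity(q, ngram_vector(seq, n)), label) for seq, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = defaultdict(float)
    for score, label in scored[:k]:
        if score > 0:  # neighbors sharing no n-gram carry no evidence
            votes[label] += score
    return sorted(votes.items(), key=lambda item: item[1], reverse=True)

# Invented toy training set (hypothetical sequences and labels).
training = [
    ("MKKLLAAAGG", "nucleus"),
    ("MKKLLAAAGC", "nucleus"),
    ("TTTTSSSSPP", "membrane"),
]
print(knn_predict("MKKLLAAAG", training))  # [('nucleus', 14.0)]
```

Returning the full ranked list rather than a single label matches the slide's "location(s)": a protein can be reported with more than one candidate location when several categories collect votes.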

25 ProSLP – Subcellular Localization Web interface: FASTA format

26 ProSLP – Subcellular Localization Prediction results

27 Performance of ProSLP Data set

28 Performance of ProSLP Results for data sets

29 Performance of ProSLP Comparison with PLOC

30 Performance of ProSLP
Conclusions
–The kNN classifier for subcellular localization achieved 81% precision on the small PLOC data set and 93% on the large SLP data set.
Discussion
–The more training samples, the higher the precision of the kNN classifier.

31 Future Works
Novel retrieval model that supports locality among n-grams in a sequence
Mixed N-gram indexing: for example, 5-grams and 6-grams together
–More storage will be required, but partial locality may be retained.
Extension of Bio-KRISTAL to DNA sequences
Development of a practical service for protein sequence analysis
–Fast sequence retrieval, functional/structural classification, prediction of protein subcellular localization, text mining, etc.
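The mixed N-gram idea can be illustrated with a short sketch. It is only one possible reading of the proposal: index terms of both lengths are placed in a single term list, and the name `mixed_ngrams` is hypothetical.

```python
def mixed_ngrams(sequence, sizes=(5, 6)):
    """Collect overlapping n-grams of every requested length into one term list."""
    terms = []
    for n in sizes:
        terms.extend(sequence[i:i + n] for i in range(len(sequence) - n + 1))
    return terms

sequence = "TASHNPGGKEHGDFGIGAPAPEDFTDQI"
terms = mixed_ngrams(sequence)

print(len(mixed_ngrams(sequence, sizes=(5,))))  # 24 terms with 5-grams only
print(len(terms))                               # 47 terms with 5- and 6-grams
```

The term count per sequence roughly doubles, which reflects the extra storage the slide anticipates. A matching 6-gram implies that two adjacent 5-grams occur next to each other, which is the partial locality the slide refers to.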

