Published by Drusilla Reeves. Modified over 9 years ago.
1
BioKnOT: Biological Knowledge through Ontology and TFIDF
By: James Costello
Advisor: Mehmet Dalkilic
2
June 11, 2004 Bioinformatics Capstone Project Costello
Outline
- Motivation and Goals
- Background
- Program Architecture
- Populating the Article Database
- Developing an Article Scoring Model
- BioKnOT demonstration
- Summary and Future Work
3
Motivation and Goals
Motivation
- Current online text searching methods are not good enough for highly specific research.
  - Importance
  - Timeliness
  - Relevance
Goal of Project
- Create an online text retrieval system that lets users construct their own set of highly specific, timely, and important research articles, custom fit to each user's needs.
4
Standard Search Model
- $D$ = set of documents
- $D'$ = set of documents that meet some search criteria, with $D' \subseteq D$
- $D' = \{d_1, d_2, \ldots, d_k\}$, where $d_i$ is an individual document and we hope $d_i$ is more interesting than $d_{i+1}$
- $|D'|$ = huge number of documents
- $|D'|$ for a filtered search on PubMed for "apoptosis" is 65,832 articles
5
BioKnOT Search Model
- $D$ = set of documents
- $D'$ = set of documents that meet the initial search criteria, with $D' \subseteq D$
- $D'_t$ = set of documents that pass the filter, with $D'_t \subseteq D'$
- $D'_{tu}$ = set of documents that have been ranked based on semantic content from user input, with $D'_{tu} \subseteq D'_t$
- $D'_{tu} = \{d_1, d_2, \ldots, d_k\}$, where $d_i$ is an individual document and $d_i$ is more interesting than $d_{i+1}$
- $|D'_{tu}|$ = very small and very specific
6
Program Architecture
[Figure: flow diagram of the BioKnOT pages, connected by hyperlinks]
- Initial Search Page: Boolean search
- Filter Page ("Filter Your Search"): terms such as "apoptosis"
- Word Weighting Page ("Add Word Weights"): a Bad/Good weighting for each term
- User Input Page ("Submit Description"): the user's sentences
- Results Page ("Refine Your Search"): numbered article titles, each with "View Word Graph" and "See All Data"
- Linked targets: the actual online article; all stored data on the article (title, author(s), …); an illustration of word relationships in the article
7
Populating the Article Database
Data we need
- Author(s)
- Article title
- Abstract
- Journal title
- Date and year of publication
- Count of how many times the article was cited
- URL of the online full-text article or of the PubMed search results
- Some type of accession number
8
Resources Used in Populating the Database
- Institute for Scientific Information (ISI) Web of Science: http://bert.lib.indiana.edu:2182/portal.cgi
- EndNote 7
- PubMed: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi
9
Steps Taken to Populate the Article Database
[Figure: flow diagram of the population pipeline]
- ISI's Web of Science search interface → EndNote 7: export article information
- EndNote 7 → Article Database (> 2,000): export XML and parse
- PubMed search interface: a web bot searches for URL information using the article title and author(s)
- PubMed article abstract interface: after the PubMed abstract is found, the web bot searches for the online article URL
- Either the PubMed URL or the online article URL is inserted into the Article Database
10
Initial Search
- Boolean search
- Searches all articles in the database with a URL
- Searches an article's title and abstract
11
Filter Page
TFIDF
- LUCAS Web Service: http://lair.indiana.edu/research/lucas/index.html
TFIDF Calculations
- TF = number of occurrences of a term in a document
- IDF = log of the total number of documents over the number of documents that contain the desired term

\[ \mathrm{tf}_{i,d} = \frac{|d_i|}{\left|\sum_{i}^{k} d_i\right|} \]
\[ \mathrm{idf}_{i,D} = \log_2 \frac{|D|}{|\{d_i \mid d_i \in D\}|} \]
\[ \mathrm{tfidf}_{i,d} = (1 + \mathrm{tf}_{i,d})\,\mathrm{idf}_{i,D} \quad \text{if } \mathrm{tf}_{i,d} \ge 1 \]

Here $|d_i|$ is the number of occurrences of term $i$ in document $d$, the sum runs over the $k$ terms of $d$, and $\{d_i \mid d_i \in D\}$ is the set of documents in $D$ that contain term $i$.
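The TFIDF calculation above can be sketched in a few lines of Python. This is a minimal illustration, not the LUCAS web service: the function name `tfidf`, the toy corpus, and the tokenized-list input are assumptions. The condition "tf ≥ 1" is read here as "the term occurs at least once", and the score for an absent term is assumed to be 0.

```python
import math
from collections import Counter

def tfidf(term, doc_tokens, corpus):
    """TFIDF of `term` in one tokenized document, relative to `corpus`."""
    counts = Counter(doc_tokens)
    if counts[term] == 0:
        return 0.0  # assumption: an absent term scores 0
    # tf: occurrences of the term over all term occurrences in the document
    tf = counts[term] / len(doc_tokens)
    # df: number of documents in the corpus that contain the term
    df = sum(1 for d in corpus if term in d)
    if df == 0:
        return 0.0  # assumption: document is not part of the corpus
    idf = math.log2(len(corpus) / df)
    return (1 + tf) * idf

# Hypothetical toy corpus of four tokenized abstracts
corpus = [
    ["apoptosis", "cell", "death"],
    ["cell", "growth"],
    ["protein", "folding"],
    ["apoptosis", "pathway"],
]
score = tfidf("apoptosis", corpus[0], corpus)
```

With this toy corpus, "apoptosis" appears in two of four documents, so its idf is log2(4/2) = 1 and the score for the first document is 1 + 1/3.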
12
Term Relationship Measurements
- Intra-sentence distance: sentence structure taken into account
- Inter-sentence distance: sentence structure ignored
Ex.
- "... and is not present in the mitochondria. Permeability is another..."
- "... mitochondrial permeability is an important aspect of apoptosis..."
13
Inter-sentence vs. Intra-sentence Distance
Searching for the relationship "cell death"
[Figure: documents sorted into three labeled groups: the initial search set of documents, the documents used to construct the random model, and the documents that are scored and returned to the user]
- Doc A: …cell…
- Doc B: …cell death…
- Doc C: …cell. Death…
- Doc D: …death…
- Doc E: …cell death…
14
Visual Representation of Term Relationships
[Figure: Graph M, an example of a term relationship graph specified by the user; Graph N, an example of a term relationship graph taken from an article's abstract]
15
Scoring an Article
- M = user-defined term relationships
- N = term relationships in an individual article's abstract
- S = scoring matrix
- P = presence or absence of a term relationship from M in N
- f = sigmoidal term relationship function

\[ \text{Abstract Score} = \sum_{i,j} P_{M,N}(i,j) \times S_{i,j} \times f_{M_{i,j}}(N_{i,j}) \]
\[ P_{M,N}(i,j) = \begin{cases} 1 & \text{if } M_{i,j} \times N_{i,j} \neq 0 \\ -1 & \text{otherwise} \end{cases} \]
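A minimal Python sketch of the abstract score follows, assuming M, N, and S are square nested lists indexed by the same terms and that the sum runs over every (i, j) pair. The function `abstract_score` and the stand-in `constant_f` are hypothetical names; the term relationship function is passed in as a parameter rather than fixed.

```python
def abstract_score(M, N, S, f):
    """Sum P(i,j) * S[i][j] * f(M[i][j], N[i][j]) over all term pairs."""
    total = 0.0
    for i in range(len(M)):
        for j in range(len(M[i])):
            # P is +1 when the relationship is present in both M and N, else -1
            p = 1 if M[i][j] * N[i][j] != 0 else -1
            total += p * S[i][j] * f(M[i][j], N[i][j])
    return total

def constant_f(m_ij, n_ij):
    """Hypothetical stand-in for the term relationship function f."""
    return 1.0

M = [[0, 5.0], [0, 0]]          # user graph: one relationship, distance 5
S = [[0, 2.0], [0, 0]]          # scoring matrix: weight 2 for that relationship
N_present = [[0, 3.0], [0, 0]]  # abstract contains the relationship
N_absent = [[0, 0], [0, 0]]     # abstract lacks the relationship

score_present = abstract_score(M, N_present, S, constant_f)
score_absent = abstract_score(M, N_absent, S, constant_f)
```

With the stand-in f, an abstract that contains the user's relationship gains S for that pair and one that lacks it loses the same amount, which is the reward/penalty behaviour the P term encodes.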
16
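Sigmoidal Scoring Function
$f_{M_{i,j}}(N_{i,j})$ is defined piecewise over the term distance $x$, with breakpoints $\alpha$, $\beta$, $\gamma$ and one case for each of:
- $x \le \alpha$
- $\alpha < x \le \beta$
- $\beta < x \le \gamma$
- $x > \gamma$, where the value is 0
The middle cases are built from the ratio $\frac{x - \alpha}{\beta - \alpha}$ and the factor ½.
[Figure: plot of % Term Membership (ticks 0, ½, 1) against Term Distance (ticks α, β, γ)]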
17
Scoring Matrix (Random Model)
- Derived from the TFIDF terms that were defined by the user and the abstracts of all the articles returned by the initial term search.
- User-defined term relationships are found in all the abstracts and the log-odds score is taken.
- $P(t_j \mid t_i, \Delta)$ is found by first finding a word, $t_i$, that the user has defined and then opening up a 5-word reading frame, $\Delta$, following $t_i$. A second user-defined word, $t_j$, must be present within $\Delta$.

\[ \text{LOD Score}(t_i, t_j) = \log_2 \frac{P(t_j \mid t_i, \Delta)}{P(t_i) \times P(t_j)} \]
18
Steps to derive the Scoring Matrix
- Determine important terms: cell, death, human
- Look for relationships of those words in the search space.
  - Relationships: cell→death, cell→human, death→cell, death→human, human→cell, human→death
  - Search space (abstract, 20 words): "The effects … cell in a human … in cancer."
- Once an important term is found, a 5-word reading frame is opened. If a relationship is found within the reading frame, the distance between the words is taken: cell→human = 3
- If multiple occurrences of the same relationship are found in the search space, the average is taken.
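The reading-frame step can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the frame is taken to be the 5 tokens following the term, every occurrence of a second term inside the frame is recorded, and the example sentence is an invented stand-in for the elided abstract. The name `relationship_distances` is hypothetical.

```python
from collections import defaultdict

def relationship_distances(tokens, terms, frame=5):
    """Average distance t_i -> t_j for ordered term pairs found inside the frame."""
    found = defaultdict(list)
    for pos, word in enumerate(tokens):
        if word not in terms:
            continue
        # Open a reading frame covering the next `frame` tokens
        for offset in range(1, frame + 1):
            nxt = pos + offset
            if nxt >= len(tokens):
                break
            other = tokens[nxt]
            if other in terms and other != word:
                found[(word, other)].append(offset)
    # Multiple occurrences of the same relationship are averaged
    return {pair: sum(d) / len(d) for pair, d in found.items()}

# Invented stand-in sentence; only "cell in a human" mirrors the slide
tokens = "the effects on a cell in a human are studied in cancer".split()
distances = relationship_distances(tokens, {"cell", "death", "human"})
```

Here "human" is the third token after "cell", so the only relationship found is cell→human with distance 3, matching the slide's example.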
19
Steps to derive the Scoring Matrix
- Lastly, these relationships, along with the individual word probabilities, can be taken, scored, and structured into a matrix.
  - P(cell→human) = 2/12 = .167
  - P(cell) = .03
  - P(human) = .06
  - LOD(cell→human) = 1.97 (this value matches a base-10 logarithm of .167 / (.03 × .06) ≈ 92.8; a base-2 logarithm would give about 6.5)
  - Continue for all relationships

            apoptosis   human   cell
apoptosis   0           1.27    -1.08
human       1.64        0       0
cell        2.35        1.97    0
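The log-odds arithmetic for cell→human can be checked with a short Python sketch. The function name `lod_score` and its `base` parameter are assumptions added for illustration; the probabilities are the ones given on the slide.

```python
import math

def lod_score(p_joint, p_i, p_j, base=2):
    """Log-odds of the observed relationship against independent occurrence."""
    return math.log(p_joint / (p_i * p_j), base)

# Slide values: P(cell->human) = 2/12, P(cell) = .03, P(human) = .06
lod_base10 = lod_score(2 / 12, 0.03, 0.06, base=10)
lod_base2 = lod_score(2 / 12, 0.03, 0.06, base=2)
```

The base-10 result rounds to the 1.97 shown in the matrix, while the base-2 result is roughly 6.5, so the base is worth confirming before reusing the matrix values.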
20
Adding User Weights to Term Matrix
- The user is asked to enter a weight for each word relationship found within the user's expansion statement.
- Weights range over [0, 2].
- The weight is denoted $r_{i,j}$ for term i to term j.
- Weights are multiplied by the matrix values to add the user's input into the random model.
21
Scoring Matrix Before User's Word Weights ($S_{i,j}$)

          cell    death   protein
cell      0.0     2.54    0.0
death     0.98    0.0     0.0
protein   -1.65   3.65    0.0

User's Word Weight Submissions ($r_{i,j}$)
- cell → death: 2.0
- death → cell: 1.0
- protein → cell: 0.5
- protein → death: 1.5

Scoring Matrix After User's Word Weights

          cell    death   protein
cell      0.0     5.08    0.0
death     0.98    0.0     0.0
protein   -3.30   5.48    0.0

\[ \text{Final Score } S_{i,j} = \begin{cases} 0 & \text{if } S_{i,j} = 0 \\ r_{i,j} \times S_{i,j} & \text{if } S_{i,j} > 0 \\ \frac{1}{r_{i,j}} \times S_{i,j} & \text{if } S_{i,j} < 0 \end{cases} \]
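The weighting rule can be reproduced with a short Python sketch using the slide's matrix and weights. The function name `apply_user_weight`, the default weight of 1.0 for pairs the user did not rate, and the error raised for a zero weight on a negative score are assumptions; the slide does not say how those cases are handled.

```python
def apply_user_weight(s, r):
    """Scale one scoring-matrix entry s by the user weight r."""
    if s == 0:
        return 0.0
    if s > 0:
        return r * s
    if r == 0:
        # assumption: a zero weight on a negative score is rejected
        raise ValueError("weight must be non-zero for negative scores")
    return s / r  # negative scores are multiplied by 1/r

terms = ["cell", "death", "protein"]
S = [
    [0.0, 2.54, 0.0],
    [0.98, 0.0, 0.0],
    [-1.65, 3.65, 0.0],
]
weights = {
    ("cell", "death"): 2.0,
    ("death", "cell"): 1.0,
    ("protein", "cell"): 0.5,
    ("protein", "death"): 1.5,
}

# Pairs without a submitted weight default to 1.0 (assumption)
weighted = [
    [apply_user_weight(S[i][j], weights.get((ti, tj), 1.0))
     for j, tj in enumerate(terms)]
    for i, ti in enumerate(terms)
]
```

Dividing negative scores by the weight means a low weight (below 1) makes an unfavourable relationship count more heavily against the article, as in protein→cell going from -1.65 to -3.30.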
22
Visual Representation of Term Relationships
[Figure: Graph M, an example of a term relationship graph specified by the user; Graph N, an example of a term relationship graph taken from an article's abstract]
23
Comparing Term Relationship Graphs
- In order to compare the word graphs, an adjacency matrix must be created for each graph. This is where the values of $M_{i,j}$ and $N_{i,j}$ are taken.

Matrix M
            apoptosis   tumor
apoptosis   0           5.00
tumor       0           0

Matrix N
         fas   induce
fas      0     3.00
induce   0     0
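Building such an adjacency matrix from a set of directed term relationships can be sketched as below. The function name `adjacency_matrix` and the dictionary input format are assumptions; entries hold the term distance, with 0 meaning no relationship.

```python
def adjacency_matrix(terms, distances):
    """Square matrix whose [i][j] entry is the distance from terms[i] to terms[j]."""
    index = {term: k for k, term in enumerate(terms)}
    matrix = [[0.0] * len(terms) for _ in terms]
    for (src, dst), dist in distances.items():
        matrix[index[src]][index[dst]] = dist
    return matrix

# The two example matrices from the slide
M = adjacency_matrix(["apoptosis", "tumor"], {("apoptosis", "tumor"): 5.0})
N = adjacency_matrix(["fas", "induce"], {("fas", "induce"): 3.0})
```

To compare a user graph with an abstract graph entry by entry, both matrices would need to be built over the same ordered term list.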
24
Results and Refinement
- Support Score: given as Citation Frequency, which is the citation count supplied by ISI's Web of Science divided by the number of years between the publication date and now.
- Semantic Score: from the equation

\[ \sum_{i,j} P_{M,N}(i,j) \times S_{i,j} \times f_{M_{i,j}}(N_{i,j}) \]
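The support score is simple enough to sketch directly. The function name `citation_frequency` is hypothetical, and clamping the age to at least one year is an assumption added to avoid dividing by zero for articles published in the current year; the slide does not cover that case.

```python
def citation_frequency(citation_count, publication_year, current_year):
    """Citations per year since publication."""
    # assumption: treat articles from the current year as one year old
    years = max(1, current_year - publication_year)
    return citation_count / years

# Hypothetical article: 40 citations, published in 2000, scored in 2004
support = citation_frequency(40, 2000, 2004)
```

Dividing by age keeps older, heavily cited articles from automatically outranking recent ones with fewer total citations.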
25
Software Demonstration
- BioKnOT: http://biokdd.informatics.indiana.edu/cgi-bin/jccostel/thesis/bioknot.cgi
- PubMed: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed
26
Summary
- BioKnOT offers a new and effective way to search research articles.
- BioKnOT offers many features that aid the user in deciding what factors are important in retrieving articles.
- Currently under submission to the SIGIR Bioinformatics workshop.
27
Future Work
- Add more sophisticated support through citation frequency
- Increase the efficiency of the scoring method
- Usability analysis
- Incorporate BioKnOT into CATPA
- Develop a Bioinformatics Knowledge Base locally using BioKnOT
28
Acknowledgments
- Professor Mehmet Dalkilic
- Professor Javed Mostafa
- Professor Sun Kim