Lindsay & Gordon’s Discovery Support Systems Model


1 Lindsay & Gordon’s Discovery Support Systems Model
John MacMullen SILS Bioinformatics Journal Club Fall 2002

2 Background
Specialization in science leads to fragmented, ‘complementary but disjoint’ or ‘non-interactive’ literatures
Attempt to find ‘undiscovered public knowledge’ in the biomedical literature
Tools are needed to integrate biomedical knowledge; ‘discovery support systems’ are one class
Replication and extension of Swanson & Smalheiser’s model (‘Arrowsmith’)

3 Complementary but Disjoint Literatures
[Figure: the Fish Oil and Raynaud’s (C) literatures are ‘non-interactive’ but linked through three intermediate B literatures: B1 – Blood Viscosity, B2 – Platelet Aggregation, B3 – Vascular Reactivity]
Adapted from Swanson & Smalheiser, 1997 by Weeber, et al., 2001

4 Swanson & Smalheiser’s method
Systematic trial-and-error method
Procedure 1: Citation acquisition
  Search MEDLINE for topical cites (‘C’ list)
  Apply stopword list and extract unique terms (‘B’ list)
  Search MEDLINE for ‘B’ term cites; prune list
  Perform MEDLINE searches for each ‘B’ term
  Classify results into likely categories
  Derive the intersection of each ‘B’ set with the restriction set, and the union of intersection sets (‘U’)
  Search the resulting terms of ‘U’ set in MEDLINE
  ‘U’ list becomes potential ‘A’ terms, with each ‘A’ term attached to the ‘B’ term that generated it
  Rank ‘A’ term results against ‘B’ co-occurrence
Procedure 2: Relationship Mining
  Search for pre-existing A→C &/or A→B→C relationships
  Search for novel A→C relationships
Output: Display of ‘A’ & ‘C’ cites by their common ‘B’ terms
Goal: a plausible testable hypothesis
Human relevance judgments in each step influence future steps
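The intersection, union and ranking steps of Procedure 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Arrowsmith implementation: the function name `candidate_a_terms`, the dictionary input, and the reading of "rank against ‘B’ co-occurrence" as "count how many ‘B’ terms generated each ‘A’ term" are all hypothetical.

```python
from collections import defaultdict

def candidate_a_terms(b_sets, restriction):
    """Sketch of the 'U' list construction (hypothetical helper).

    b_sets:      {b_term: set of terms found in that B term's citations}
    restriction: set of terms allowed as 'A' candidates (the restriction set)

    Returns (a_term, generating_b_terms) pairs, ranked by the number of
    'B' terms that generated each 'A' term (an assumed ranking rule).
    """
    generated_by = defaultdict(set)
    for b_term, terms in b_sets.items():
        # Intersection of this 'B' set with the restriction set.
        for a_term in terms & restriction:
            # Building the union 'U' while remembering which 'B' term
            # generated each potential 'A' term.
            generated_by[a_term].add(b_term)
    # Most 'B' links first; ties broken alphabetically for stable output.
    return sorted(generated_by.items(), key=lambda kv: (-len(kv[1]), kv[0]))
```

Keeping the generating ‘B’ terms attached to each ‘A’ term is what makes the final display of ‘A’ and ‘C’ cites by their common ‘B’ terms possible.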

5 Lindsay & Gordon’s method
Limited to lexical statistics, no syntactic or semantic evaluation
Uses full MEDLINE record instead of title only
1. Identify a source literature (e.g., a topic)
   Download complete MEDLINE records for all docs “mentioning” (i.e., word match or index term) the source topic
2. Find all single words, bi-grams & tri-grams in source corpus; exclude stop words; normalize singular/plural
   General stop word list + common medical terminology
3. Calculate 4 statistics for each token (tf, df, rf, tf*idf)
4. Rank tokens by frequency of occurrence
5. Identify 1 or more intermediate literatures based on stats in step (3), starting with highest ranks from (4)
6. Run process from steps 1-4 for each intermediate literature
Lindsay & Gordon, 1999, pp
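The token extraction and frequency ranking steps can be sketched as follows. Everything here is an assumption made for illustration: the tiny `STOP_WORDS` set stands in for the real stop list, the trailing-"s" rule is a crude stand-in for singular/plural normalization, and dropping any n-gram that contains a stop word is one possible reading of "exclude stop words".

```python
import re
from collections import Counter

# Hypothetical stand-in for the general stop word list plus common
# medical terminology; the actual list is not reproduced here.
STOP_WORDS = {"the", "of", "in", "and", "a", "with", "is", "to", "for"}

def normalize(word):
    """Crude singular/plural folding (assumption: strip one trailing 's')."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def tokens(text, max_n=3):
    """Return single words, bi-grams and tri-grams from one record."""
    words = [normalize(w) for w in re.findall(r"[a-z]+", text.lower())]
    out = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            # Assumed rule: skip any n-gram that contains a stop word.
            if any(w in STOP_WORDS for w in gram):
                continue
            out.append(" ".join(gram))
    return out

def rank_tokens(records):
    """Rank tokens by frequency of occurrence across the corpus."""
    counts = Counter()
    for record in records:
        counts.update(tokens(record))
    return counts.most_common()
```

The highest-ranked tokens from `rank_tokens` would then be the starting candidates for intermediate literatures, and the same functions could be rerun on each intermediate corpus.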

6 Term / Phrase Statistics
tf = token frequency in sub-corpus
df = doc frequency (# of docs in sub-corpus w/ this term)
rf = relative frequency (# of appearances in sub-corpus vs whole corpus)
idf(t) = log(N / f(t)), where N = # of docs in whole corpus and f(t) = # of docs with t in them
tf*idf = token frequency * inverse doc frequency
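A minimal sketch of the four statistics, assuming each document is already a list of tokens and the sub-corpus is a subset of the whole corpus (so every sub-corpus token occurs at least once in the whole corpus). The function name `token_stats`, the natural-log base, and the reading of rf as a simple ratio of counts are assumptions; the slide does not specify them.

```python
import math
from collections import Counter

def token_stats(sub_corpus, whole_corpus):
    """Compute tf, df, rf, idf and tf*idf for every token in the sub-corpus.

    sub_corpus, whole_corpus: lists of documents, each a list of tokens.
    Assumes sub_corpus is drawn from whole_corpus.
    """
    tf = Counter(t for doc in sub_corpus for t in doc)          # token frequency
    df = Counter(t for doc in sub_corpus for t in set(doc))     # docs in sub-corpus with t
    whole_tf = Counter(t for doc in whole_corpus for t in doc)  # appearances in whole corpus
    whole_df = Counter(t for doc in whole_corpus for t in set(doc))  # f(t)
    n_docs = len(whole_corpus)                                   # N

    stats = {}
    for t in tf:
        idf = math.log(n_docs / whole_df[t])  # natural log is an assumption
        stats[t] = {
            "tf": tf[t],
            "df": df[t],
            "rf": tf[t] / whole_tf[t],  # assumed: sub-corpus count / whole-corpus count
            "idf": idf,
            "tfidf": tf[t] * idf,
        }
    return stats
```

A token that is frequent in the sub-corpus but rare elsewhere gets both a high rf and a high tf*idf, which is the contrast between absolute and relative frequency that the later hypotheses turn on.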

7 Experiments
Reproduction of Swanson & Smalheiser’s magnesium / migraine connection
Used their method to try to find the 11 intermediate literatures S&S found, +1 new
1,081 MEDLINE records from on migraine

8 Outcome
Is ranking tokens by frequency effective?
Arbitrary ranking and population size cutoffs
Domain knowledge used (med student)
Lots of manual intervention. Is this reproducible?
Is it really all that different from S&S?

9 Hypotheses
H1: Intermediate literatures are best identified by absolute lexical frequencies
H2: Candidate discoveries are best generated by relative lexical frequencies
Results showed the opposite of H2 ( )
Search for the absence of connections between source and target literatures (Swanson’s disjointness)
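The disjointness check can be illustrated with a small sketch, assuming each literature is represented as a collection of record identifiers. The function names and the `max_shared` tolerance are hypothetical; the slides do not say how much overlap, if any, still counts as disjoint.

```python
def overlap(source_ids, target_ids):
    """Records that appear in both the source and the target literature."""
    return set(source_ids) & set(target_ids)

def is_disjoint(source_ids, target_ids, max_shared=0):
    """True when the literatures share at most max_shared records.

    max_shared is a hypothetical tolerance; 0 means strictly disjoint.
    """
    return len(overlap(source_ids, target_ids)) <= max_shared
```

A candidate target literature that passes this check has no existing direct connection to the source, which is what makes an indirect link through an intermediate literature potentially novel.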

10 Questions
Is this process more automatable?
Is 1 intermediate step enough?
What happens to system complexity when there are i intermediate literatures?
[Diagram: S → I1 → … → Ii → T]

