Where Do You Go for Biomedical Funding?
Yi Liu, Ahmet Altay
Background
Problem
o In biomedical research there are many sources of federal funding.
o How do you choose the right institution to fund a given research idea?
Data
o Biomedical grant summaries from 20 institutions, covering 1972 to 2009
Pre-Processing
Clean up texts by removing mark-up, meta words, and duplicates
Remove institutions with fewer than 5000 grants
Bag-of-words approach with a pre-determined dictionary
o Removed 319 stop words from text
o Used stemming (Porter) to further collapse text
o Dictionary size of with distinct spellings
Use mgrep to annotate our data with dictionary words
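The bag-of-words step above can be sketched roughly as follows. This is a minimal illustration, not the pipeline used in this work: the stop-word list is a tiny hypothetical stand-in for the 319 words, `naive_stem` is a crude suffix stripper standing in for the Porter stemmer, and the dictionary lookup stands in for the mgrep annotation.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list (the slides mention 319 stop words).
STOP_WORDS = {"the", "of", "in", "and", "a", "to", "is", "for"}


def naive_stem(word):
    """Crude suffix stripper; a placeholder for the Porter stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_words(text, dictionary):
    """Count dictionary stems in one abstract."""
    # Lower-case and keep alphabetic tokens only (drops mark-up debris).
    tokens = re.findall(r"[a-z]+", text.lower())
    # Remove stop words, then collapse each token to its stem.
    stems = [naive_stem(t) for t in tokens if t not in STOP_WORDS]
    # Keep only stems that appear in the pre-determined dictionary.
    return Counter(s for s in stems if s in dictionary)


# Usage with a hypothetical three-stem dictionary.
example = bag_of_words(
    "Binding of proteins in the cells and the cell",
    {"protein", "bind", "cell"},
)
```

Each abstract becomes a sparse vector of stem counts, which is the input the TFIDF step needs.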
Histogram for Stems per Abstract
Processing
Generate a TFIDF matrix given the dictionary and abstracts
TFIDF matrix is huge (83435 by )
Reduce TFIDF matrix for computational efficiency
o Remove dictionary entries with zero counts and abstracts with no dictionary words
o Use SVD to represent the data in a smaller sub-space of the original matrix
o Singular values decrease quickly. We used the first 100 eigenvectors without losing much precision.
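A small sketch of the TFIDF and SVD reduction described above, using NumPy. The exact TFIDF weighting used in this work is not stated, so the common form (term frequency times log inverse document frequency) is assumed here; the function names `tfidf` and `reduce_svd` are illustrative only.

```python
import numpy as np


def tfidf(counts):
    """Turn an (abstracts x terms) count matrix into a TFIDF matrix.

    Assumes tf = count / abstract length and idf = log(N / document frequency).
    """
    counts = np.asarray(counts, dtype=float)
    # Drop terms that never occur and abstracts with no dictionary terms.
    counts = counts[:, counts.sum(axis=0) > 0]
    counts = counts[counts.sum(axis=1) > 0]
    tf = counts / counts.sum(axis=1, keepdims=True)
    df = (counts > 0).sum(axis=0)
    idf = np.log(counts.shape[0] / df)
    return tf * idf


def reduce_svd(matrix, k):
    """Project abstracts onto the top-k singular directions."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    k = min(k, len(s))
    # Rows of u[:, :k] * s[:k] are the abstracts in the reduced sub-space.
    return u[:, :k] * s[:k], s


# Usage on a toy count matrix (3 abstracts, 4 dictionary terms).
toy = tfidf([[2, 0, 1, 0], [0, 3, 1, 0], [0, 0, 0, 0]])
reduced, singular_values = reduce_svd(toy, k=2)
```

Plotting `singular_values` shows how quickly they decay, which is the basis for choosing how many dimensions to keep. For a matrix of the size reported above, a sparse truncated solver would be the more practical choice than a dense SVD.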
Distribution of Singular Values
Effect of Using Eigen Sub-space
Tested performance on a smaller data set (400).
Performance of raw TFIDF is similar to that of the eigen sub-space.
Evaluation
For a given test abstract we used kNN search to find the 100 closest abstracts.
Used a custom scoring algorithm to pick the grantor that best represents the 100 nearest neighbors found:
Tested the entire data set using leave-one-out cross-validation
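The evaluation loop can be sketched as below. The custom scoring algorithm is not reproduced in this text, so a similarity-weighted vote over the neighbors is assumed as a stand-in, and cosine similarity is assumed as the distance measure. The function names and the grantor labels "A" and "B" are hypothetical.

```python
import numpy as np
from collections import defaultdict


def predict_grantor(query, vectors, labels, k=100, exclude=None):
    """Pick a grantor from the k nearest abstracts (assumed weighted vote)."""
    vectors = np.asarray(vectors, dtype=float)
    query = np.asarray(query, dtype=float)
    # Cosine similarity between the query and every stored abstract.
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
    sims = vectors @ query / np.where(norms == 0, 1.0, norms)
    # Most similar first, skipping the held-out abstract if there is one.
    order = [i for i in np.argsort(-sims) if i != exclude][:k]
    # Assumed scoring rule: each neighbor votes with its similarity.
    scores = defaultdict(float)
    for i in order:
        scores[labels[i]] += sims[i]
    return max(scores, key=scores.get)


def leave_one_out_accuracy(vectors, labels, k=100):
    """Hold out each abstract in turn and predict it from the rest."""
    vectors = np.asarray(vectors, dtype=float)
    hits = sum(
        predict_grantor(vectors[i], vectors, labels, k, exclude=i) == labels[i]
        for i in range(len(labels))
    )
    return hits / len(labels)


# Usage on four toy abstracts from two hypothetical grantors.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_labels = ["A", "A", "B", "B"]
guess = predict_grantor([1.0, 0.2], toy_vectors, toy_labels, k=3)
accuracy = leave_one_out_accuracy(toy_vectors, toy_labels, k=1)
```

Excluding the held-out index inside the neighbor search keeps an abstract from voting for itself during leave-one-out testing.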
Results (1)
Results (2)