Where Do You Go for Biomedical Funding? Yi Liu, Ahmet Altay.


Background
Problem
  o In biomedical research there are many sources of federal funding.
  o How should one choose the right institution to fund a given research idea?
Data
  o Biomedical grant summaries from 20 institutions, covering 1972 to 2009.

Pre-Processing
Clean up texts: remove mark-up, meta words, and duplicates.
Remove institutions with fewer than 5000 grant records.
Bag-of-words approach with a pre-determined dictionary
  o Removed 319 stop words from the text
  o Used stemming (Porter) to further collapse the text
  o Dictionary size of ... with ... distinct spellings
Use mgrep to annotate our data with dictionary words.
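
The bag-of-words step above can be sketched roughly as follows. This is only an illustration: the stop-word list and dictionary are made-up placeholders (the 319 stop words and the real dictionary are not shown in the slides), and `crude_stem` is a simple suffix stripper standing in for Porter stemming, not the Porter algorithm itself. The mgrep annotation step is not reproduced.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the slides used 319 stop words that are not shown.
STOP_WORDS = {"the", "of", "in", "and", "a", "to", "for", "is", "are"}

# Hypothetical dictionary of stems; the real one was pre-determined by the authors.
DICTIONARY = {"protein", "cell", "tumor", "bind"}


def crude_stem(word):
    """Strip a few common suffixes. A stand-in for Porter stemming, not Porter itself."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_words(text):
    """Return counts of dictionary stems found in one abstract."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = (crude_stem(t) for t in tokens if t not in STOP_WORDS)
    return Counter(s for s in stems if s in DICTIONARY)


counts = bag_of_words("Binding of proteins in tumor cells and the cell membrane")
```

Words outside the dictionary ("membrane" here) are simply dropped, which is what keeps the vocabulary fixed across abstracts.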

Histogram for Stems per Abstract

Processing
Generate a TFIDF matrix from the dictionary and the abstracts.
The TFIDF matrix is huge (83435 by ...).
Reduce the TFIDF matrix for computational efficiency
  o Remove dictionary words with zero counts and abstracts with no dictionary words
  o Use SVD to represent the data in a smaller sub-space of the original matrix
  o Singular values decrease quickly. We used the first 100 eigenvectors without losing much precision.
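
A minimal sketch of the TFIDF construction and the zero-column removal, on toy data. The slides do not state which TFIDF variant was used, so the weighting here (raw term count times log(N / df)) is an assumption, and the documents and vocabulary are invented for illustration.

```python
import math
from collections import Counter


def tfidf_matrix(docs, vocabulary):
    """Build a dense TF-IDF matrix: one row per document, one column per term.

    Weighting (an assumption): tf = raw count, idf = log(N / df).
    """
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]
    # Document frequency: number of documents containing each term.
    df = {term: sum(1 for c in counts if c[term] > 0) for term in vocabulary}
    rows = []
    for c in counts:
        rows.append([
            c[term] * math.log(n_docs / df[term]) if df[term] else 0.0
            for term in vocabulary
        ])
    return rows


# Toy stemmed abstracts and vocabulary (hypothetical).
docs = [["cell", "cell", "tumor"], ["protein", "bind"], ["cell", "protein"]]
vocab = ["bind", "cell", "protein", "tumor", "kidney"]
matrix = tfidf_matrix(docs, vocab)

# Drop all-zero columns (terms that never occur), as in the reduction step.
keep = [j for j in range(len(vocab)) if any(row[j] != 0.0 for row in matrix)]
reduced = [[row[j] for j in keep] for row in matrix]
```

At the scale quoted on the slide a dense list of lists would not fit in memory; a sparse matrix representation would be the natural choice there.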

Distribution of Singular Values

Effect of Using Eigen Sub-space
Tested performance on a smaller data set (400).
Performance of the raw TFIDF matrix is similar to that of the eigen sub-space.
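
One way to check this kind of claim is to project the rows onto the leading singular vectors and compare nearest neighbors before and after. The sketch below uses NumPy on a synthetic low-rank matrix, not the grant data; the matrix sizes, the value of `k`, and the helper names are all assumptions made for illustration.

```python
import numpy as np


def project_to_subspace(tfidf, k):
    """Project rows of a TF-IDF matrix onto the top-k right singular vectors."""
    # full_matrices=False gives the thin SVD: tfidf = U @ diag(s) @ Vt
    U, s, Vt = np.linalg.svd(tfidf, full_matrices=False)
    return tfidf @ Vt[:k].T, s


rng = np.random.default_rng(0)
# Synthetic stand-in for a TF-IDF matrix: rank 3 plus small noise.
tfidf = rng.random((40, 3)) @ rng.random((3, 25)) + 0.001 * rng.random((40, 25))

reduced, singular_values = project_to_subspace(tfidf, k=3)

# Fraction of squared singular-value mass kept by the first 3 components.
energy = (singular_values[:3] ** 2).sum() / (singular_values ** 2).sum()


def nearest(rows, i):
    """Index of the row closest to row i in Euclidean distance (excluding i)."""
    d = np.linalg.norm(rows - rows[i], axis=1)
    d[i] = np.inf
    return int(np.argmin(d))


# Count how many rows keep the same nearest neighbor after the projection.
same = sum(nearest(tfidf, i) == nearest(reduced, i) for i in range(len(tfidf)))
```

When the singular values fall off quickly, `energy` stays close to 1 and most neighbor relations survive the projection, which is the behavior the slide reports.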

Evaluation
For a given test abstract, we used kNN search to find the 100 closest abstracts.
Used a custom scoring algorithm to pick the grantor that best represents the 100 nearest neighbors found.
Tested the entire data set using leave-one-out cross-validation.
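
The evaluation loop can be sketched as below. The custom scoring algorithm is not given in the transcript, so an inverse-distance-weighted vote is used here purely as a placeholder; the vectors, the labels `grantor_A` and `grantor_B`, and the value of `k` are likewise hypothetical.

```python
import math
from collections import defaultdict


def predict_grantor(query, vectors, labels, k, skip=None):
    """Pick the label with the highest inverse-distance-weighted vote among k neighbors."""
    dists = [
        (math.dist(query, v), labels[i])
        for i, v in enumerate(vectors)
        if i != skip
    ]
    dists.sort(key=lambda pair: pair[0])
    scores = defaultdict(float)
    for d, label in dists[:k]:
        scores[label] += 1.0 / (d + 1e-9)  # hypothetical scoring rule, not the authors'
    return max(scores, key=scores.get)


def leave_one_out_accuracy(vectors, labels, k):
    """Classify each vector using all the others and report the hit rate."""
    hits = sum(
        predict_grantor(v, vectors, labels, k, skip=i) == labels[i]
        for i, v in enumerate(vectors)
    )
    return hits / len(vectors)


# Toy reduced vectors forming two well-separated groups (hypothetical data).
vectors = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
labels = ["grantor_A"] * 3 + ["grantor_B"] * 3
accuracy = leave_one_out_accuracy(vectors, labels, k=2)
```

The `skip` argument is what makes this leave-one-out: the abstract being tested is excluded from its own neighbor search, so it cannot vote for itself.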

Results (1)

Results (2)