Reasoning With Data Extracted From the Biomedical Literature
William W. Cohen, with Ni Lao (Google), Ramnath Balasubramanyan, and Dana Moshovitz-Attias (School of Computer Science, Carnegie Mellon University)
John Woolford and Jelena Jakovljevic (Biology Dept, Carnegie Mellon University)

Outline
The scientific literature as something scientists interact with:
–recommending papers (to read, cite, …)
–recommending new entities (genes, algorithms, …) of interest
The scientific literature as a source of data:
–extracting entities, relations, … (e.g., protein-protein interactions)
The scientific literature as a tool for interpreting data:
–and vice versa

Part 1. Recommendations for Scientists

A Graph View of the Literature
Data used in this study:
–Yeast: 0.2M nodes, 5.5M links
–Fly: 0.8M nodes, 3.5M links
[figure: the fly graph]

Defining Similarity on Graphs: PPR/RWR
Given type t* and node x, find y such that T(y)=t* and y~x. Similarity is defined by a "damped" version of PageRank.
Similarity between nodes x and y:
–"Random surfer model": from a node z, with probability α, teleport back to x ("restart"); else pick y' uniformly from { y' : z → y' } and repeat from y'.
–Similarity x~y = Pr( surfer is at y | restart is always to x )
Intuitively, x~y is the sum of the weights of all paths from x to y, where the weight of a path decreases with its length (and also with fanout). This easily extends to a "query" set X={x_1,…,x_k}.
Disadvantages: [more later]
A minimal implementation sketch follows.
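A minimal sketch of RWR/PPR by power iteration. This is not code from the talk; the toy graph, restart probability, and convergence tolerance are illustrative assumptions.

```python
import numpy as np

def rwr(adj, restart_node, alpha=0.15, tol=1e-8, max_iter=1000):
    """Random walk with restart (personalized PageRank).

    adj: dict mapping node -> list of out-neighbors
    restart_node: the query node x; with prob. alpha the surfer teleports back here
    Returns: dict node -> stationary visit probability (the similarity x~y).
    """
    nodes = sorted(adj)
    idx = {n: i for i, n in enumerate(nodes)}
    n = len(nodes)
    # Column-stochastic transition matrix: uniform over out-edges
    M = np.zeros((n, n))
    for u, outs in adj.items():
        for v in outs:
            M[idx[v], idx[u]] = 1.0 / len(outs)
    r = np.zeros(n)
    r[idx[restart_node]] = 1.0          # restart distribution: always back to x
    p = r.copy()
    for _ in range(max_iter):
        p_next = alpha * r + (1 - alpha) * M @ p
        if np.abs(p_next - p).sum() < tol:
            break
        p = p_next
    return {node: p[idx[node]] for node in nodes}

# Toy bibliographic graph: an author writes papers, papers cite papers
graph = {"authorA": ["paper1", "paper2"], "paper1": ["paper2", "paper3"],
         "paper2": ["paper3"], "paper3": ["paper1"]}
print(rwr(graph, "authorA"))
```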

Learning How to Perform BioLiterature Retrieval Tasks
Tasks:
–Gene recommendation: author, year → gene studied
–Citation recommendation: words, year → paper cited/read
–Expert-finding: words, genes → (possible) author
–Literature recommendation: author, [papers read in past] → papers to read
Baseline method:
–Typed RWR proximity methods
Baseline learning method:
–parameterize Prob(walk edge | edge label = L) and tune the parameters for each label L (somehow…)
[figure: a typed graph with one walk probability per edge label, e.g. P(cite)=a, P(write)=b, P(NE)=c, P(bindTo)=d, P(express)=d]

Similarity Queries on Graphs
1) Given type t* and node x in G, find y: T(y)=t* and y~x.
2) Given type t* and node set X, find y: T(y)=t* and y~X.
Evaluation: specific families of tasks for scientific publications:
–"Entity recommendation": given title, author, year, … predict entities mentioned in a paper (e.g. gene-protein entities); can improve NER
–Citation recommendation for a paper: given title, year, … of paper p, what papers should be cited by p?
–Expert-finding: given keywords, genes, …, suggest a possible author
–Literature recommendation: given a researcher and year, suggest papers to read that year
Why is RWR/PPR the right similarity metric?
–it's not; we should use learning to refine it

Learning Similarity Queries on Graphs
Evaluation: specific families of tasks for scientific publications:
–Citation recommendation for a paper: given title, year, … of paper p, what papers should be cited by p?
–Expert-finding: given keywords, genes, …, suggest a possible author
–"Entity recommendation": given title, author, year, … predict entities mentioned in a paper (e.g. gene-protein entities)
–Literature recommendation: given a researcher and year, suggest papers to read that year
For each task: (query 1, ans 1), (query 2, ans 2), … → LEARNER → Sim(s,p), a mapping from query → answers (a variant of RWR; may use RWR)

Learning Proximity Measures for BioLiterature Retrieval Tasks
Tasks:
–Gene recommendation: author, year → gene
–Reference recommendation: words, year → paper
–Expert-finding: words, genes → author
–Literature recommendation: author, [papers read in past] → papers to read
Baseline method:
–Typed RWR proximity methods
Baseline learning method:
–parameterize Prob(walk edge | edge label = L) and tune the parameters for each label L (somehow…)
[figure: the same typed graph with per-label walk probabilities P(cite)=a, P(write)=b, P(NE)=c, P(bindTo)=d, P(express)=d]

Path-based vs Edge-label-based Learning
Learning one parameter per edge label is limited, because the context in which an edge label appears is ignored. E.g. (observed from real data; task: find papers to read):
–author –[read]→ paper –[contain]→ gene –[contain⁻¹]→ paper: don't read about genes I've already read about
–author –[read]→ paper –[write⁻¹]→ author –[write]→ paper: do read papers from my favorite authors
Instead, we will learn path-specific parameters. Paths are interpreted as constrained random walks that give a similarity-like weight to every reachable node:
–Step 0: D_0 = {a}: start at author a
–Step 1: D_1: uniform over all papers p read by a
–Step 2: D_2: authors a' of papers in D_1, weighted by the number of papers in D_1 published by a'
–Step 3: D_3: papers p' written by a', weighted by …

A Limitation of RWR Learning Methods
Learning one parameter per edge label is limited, because the context in which an edge label appears is ignored. E.g. (observed from real data; task: find papers to read):
–author –[read]→ paper –[contain]→ gene –[contain⁻¹]→ paper: don't read about genes I've already read about
–author –[read]→ paper –[write⁻¹]→ author –[write]→ paper: do read papers from my favorite authors
–author –[write]→ paper –[contain]→ gene –[contain⁻¹]→ paper: do read about the genes I'm working on
–author –[write]→ paper –[publish⁻¹]→ institute –[publish]→ paper: don't read papers from my own lab
Instead, we will learn path-specific parameters.

Definitions
A graph G=(T,R,X,E) is:
–a set of entity types T and a set of relations R
–a set of entities (nodes) X, where each node x has a type from T
–a set of edges e=(x,y), where each edge has a relation label from R
A path P=(R_1,…,R_n) is a sequence of relations.
Path Constrained Random Walk:
–Given a query set S of "source" nodes
–Distribution D_0 at time 0 is uniform over s in S
–Distribution D_t at time t>0 is formed by: pick x from D_{t-1}, then pick y uniformly from all things related to x by an edge labeled R_t
–Notation: f_P(s,t) = Prob(s → t; P)
–In our examples the type of t is determined by R_n
A sketch of this walk is given below.
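A minimal sketch of a path-constrained random walk, computing f_P(s,t) for every reachable t. This is an illustrative implementation, not the authors' code; the edge-triple representation and the explicit inverse-relation labels are assumptions.

```python
from collections import defaultdict

def pcrw(edges, sources, path):
    """Path Constrained Random Walk.

    edges: list of (x, relation, y) triples; inverse relations such as
           'write^-1' must be listed explicitly if a path uses them.
    sources: query set S of "source" nodes
    path: sequence of relation labels (R_1, ..., R_n)
    Returns: dict t -> f_P(s, t) = Prob(s -> t; P), for the uniform mix over S.
    """
    # Index outgoing edges by (node, relation)
    out = defaultdict(list)
    for x, r, y in edges:
        out[(x, r)].append(y)
    # D_0: uniform over the source set
    dist = {s: 1.0 / len(sources) for s in sources}
    for r in path:
        nxt = defaultdict(float)
        for x, p in dist.items():
            targets = out[(x, r)]
            for y in targets:          # pick y uniformly among R_t-neighbors
                nxt[y] += p / len(targets)
        dist = dict(nxt)               # walkers at dead ends are dropped
    return dist

edges = [("ann", "read", "p1"), ("ann", "read", "p2"),
         ("p1", "write^-1", "bob"), ("p2", "write^-1", "bob"),
         ("bob", "write", "p1"), ("bob", "write", "p2"), ("bob", "write", "p3")]
# author -[read]-> paper -[write^-1]-> author -[write]-> paper
print(pcrw(edges, ["ann"], ["read", "write^-1", "write"]))
```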

Path Ranking Algorithm (PRA)
A PRA model scores a source-target node pair by a linear function of their path features:
score(s,t) = Σ_P θ_P f_P(s,t)
where P ranges over paths (sequences of link types / relation names) with length ≤ L.
For a relation R and a set of node pairs {(s_i, t_i)}, we construct a training dataset D = {(x_i, y_i)}, where x_i is a vector of all the path features for (s_i, t_i), and y_i indicates whether R(s_i, t_i) is true or not.
θ is estimated using L1+L2-regularized logistic regression. [Lao & Cohen, ECML 2010] (Sketch below.)
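A sketch of PRA under these definitions, reusing the pcrw function from the sketch above. The plain gradient-descent optimizer and the regularization weights are illustrative assumptions, not the paper's exact training setup.

```python
import numpy as np

def pra_features(edges, pairs, paths):
    """x_i[j] = f_{P_j}(s_i, t_i): one PCRW feature per candidate path."""
    X = np.zeros((len(pairs), len(paths)))
    for i, (s, t) in enumerate(pairs):
        for j, path in enumerate(paths):
            X[i, j] = pcrw(edges, [s], path).get(t, 0.0)
    return X

def train_pra(X, y, l1=0.01, l2=0.01, lr=0.1, steps=2000):
    """Logistic regression with L1+L2 (elastic-net style) penalties."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ theta))   # sigmoid(score)
        grad = X.T @ (p - y) / len(y) + l2 * theta + l1 * np.sign(theta)
        theta -= lr * grad
    return theta

def pra_score(edges, s, t, paths, theta):
    """score(s,t) = sum_P theta_P * f_P(s,t)"""
    return sum(th * pcrw(edges, [s], path).get(t, 0.0)
               for th, path in zip(theta, paths))
```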

Experimental Setup for BioLiterature
Data sources for bio-informatics:
–PubMed: on-line archive of over 18 million biological abstracts
–PubMed Central (PMC): full-text copies of over 1 million of these papers
–Saccharomyces Genome Database (SGD): a database for yeast
–Flymine: a database for fruit flies
Tasks:
–Gene recommendation: author, year → gene
–Venue recommendation: genes, title words → journal
–Reference recommendation: title words, year → paper
–Expert-finding: title words, genes → author
Data split:
–2000 training, 2000 tuning, 2000 test queries
Time-variant graph:
–each edge is tagged with a time stamp (year)
–during a random walk, only consider edges that are earlier than the query

BioLiterature: Some Results
Compare the mean average precision (MAP) of PRA to:
–the RWR model
–RWR trained with one parameter per link label
[table of MAP results] Except those marked †, all improvements are statistically significant at p<0.05 using a paired t-test.

Example Path Features and their Weights
A PRA+qip+pop model trained for the citation recommendation task on the yeast data. [figure: learned paths and weights]
–1) papers co-cited with on-topic papers
–6) approx. standard IR retrieval
–7,8) papers cited during the past two years
–12,13) papers published during the past two years

Extension 1: Query-Independent Paths
PageRank (and other query-independent rankings):
–assign an importance score (query-independent) to each web page
–later combined with a relevance score (query-dependent)
We generalize PageRank to heterogeneous graphs:
–We add to each query a special entity e_0 of a special type T_0
–T_0 is related to all other entity types, and each type is related to all instances of that type
–This defines a set of PageRank-like query-independent relation paths
–Compute f(* → t; P) offline for efficiency
[figure: example paths from e_0, e.g. all papers → well-cited papers, all authors → productive authors]
A sketch of the graph augmentation follows.
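A small sketch of how such query-independent paths can be realized by augmenting the edge list; the node name e0 and the relation labels has_type/instance are illustrative assumptions. After augmentation, an ordinary PCRW from e0 gives an offline, query-independent ranking.

```python
def add_query_independent_node(edges, node_types):
    """Attach a special entity e0 to every type, and every type to its instances."""
    aug = list(edges)
    for t in set(node_types.values()):
        aug.append(("e0", "has_type", t))        # e0 -> each entity type
    for node, t in node_types.items():
        aug.append((t, "instance", node))        # type -> each instance
    return aug

edges = [("p1", "cite", "p2"), ("p3", "cite", "p2"), ("p2", "cite", "p1")]
node_types = {"p1": "paper", "p2": "paper", "p3": "paper"}
aug = add_query_independent_node(edges, node_types)
# f(* -> t; P) for P = (has_type, instance, cite): a "well cited papers" ranking
print(pcrw(aug, ["e0"], ["has_type", "instance", "cite"]))
```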

Extension 2: Entity-Specific Rankings
There are entity-specific characteristics which cannot be captured by a general model:
–Some items are interesting to users because of features not captured in the data
–To model this, assume the identity of the entity matters
–Introduce new features f(s → t; P_{s,t}) to account for jumping from s to t, and new features f(* → t; P_{*,t})
–At each gradient step, add a few new features of this sort with the highest gradient, counting on regularization to avoid overfitting

BioLiterature: Some Results
Compare the MAP of PRA to:
–the RWR model
–PRA plus query-independent paths (qip)
–PRA plus popular-entity biases (pop)
[table of MAP results] Except those marked †, all improvements are statistically significant at p<0.05 using a paired t-test.

Example Path Features and their Weights
A PRA+qip+pop model trained for the citation recommendation task on the yeast data. [figure: learned paths and weights, continued]
–9) well-cited papers
–10,11) key early papers about specific genes
–14) old papers

Outline
The scientific literature as something scientists interact with:
–recommending papers (to read, cite, …)
–recommending new entities (genes, algorithms, …) of interest
The scientific literature as a source of data:
–extracting entities, relations, … (e.g., protein-protein interactions)
The scientific literature as a tool for interpreting data:
–and vice versa

Part 2. Extraction from the Scientific Literature: BioNELL
Builds on NELL (Never-Ending Language Learner), a web-based information extraction system:
–a semi-supervised, coupled, multi-view system that learns concepts and relations from a fixed ontology

Examples of what NELL knows

Semi-Supervised (Bootstrapped) Learning
Task: extract cities. Given: four seed examples of the class "city": Paris, Pittsburgh, Seattle, Cupertino.
Bootstrapping alternates between inducing contextual patterns ("mayor of arg1", "live in arg1", "arg1 is home of", "traits such as arg1") and extracting new instances: San Francisco, Austin, Berlin, … but also errors such as denial, anxiety, selfishness.
It's underconstrained!!
A sketch of this loop is given below.
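A minimal sketch of the bootstrapping loop just described; the toy corpus, the candidate contexts, and the promotion rule are illustrative assumptions. It shows the drift the slide warns about: once an unreliable pattern like "traits such as arg1" is promoted, non-cities get extracted.

```python
import re

def bootstrap(corpus, seeds, rounds=2,
              contexts=("mayor of", "live in", "traits such as")):
    """Alternate between promoting 'context arg1' patterns and extracting instances."""
    instances = set(seeds)
    for _ in range(rounds):
        # 1) Promote any context that matches a known instance
        patterns = {c for c in contexts
                    if any(re.search(re.escape(c) + r"\s+" + re.escape(i), corpus)
                           for i in instances)}
        # 2) Promote every instance matched by a promoted pattern
        for c in patterns:
            instances |= set(re.findall(re.escape(c) + r"\s+(\w+)", corpus))
    return instances

corpus = ("the mayor of Paris spoke. people live in Paris. people live in Austin. "
          "traits such as Austin charm. traits such as selfishness.")
print(bootstrap(corpus, {"Paris"}))
# round 1 adds Austin; round 2 promotes 'traits such as' and drags in 'selfishness'
```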

One Key to Accurate Semi-Supervised Learning
1. It is easier to learn many interrelated tasks than one isolated task.
2. It is also easier to learn using many different types of information.
Learning coach(NP) alone, from text like "Krzyzewski coaches the Blue Devils.", is a hard (underconstrained) semi-supervised learning problem. Learning it jointly, coupled with related predicates over noun-phrase pairs (coachesTeam(c,t), playsForTeam(a,t), teamPlaysSport(t,s), playsSport(a,s)) and with type constraints (person, coach, athlete, team, sport), is a much easier (more constrained) semi-supervised learning problem.

SEAL: Set Expander for Any Language*
Another key: use lists and tables as well as text. Given seeds (e.g. ford, toyota, nissan), SEAL learns single-page wrapper patterns around them and extracts new set members (e.g. honda) from web lists and tables.
*Richard C. Wang and William W. Cohen: Language-Independent Set Expansion of Named Entities using the Web. In Proceedings of IEEE International Conference on Data Mining (ICDM 2007), Omaha, NE, USA.

NELL [architecture diagram]: an ontology and populated KB is grown from the Web by several coupled components, combined by evidence integration:
–CPL: text extraction patterns
–SEAL: HTML extraction patterns
–Morph: morphology-based extractor
–RL: learned inference rules

BioNELL [architecture diagram]: the same components (CPL, SEAL, Morph, RL), with evidence integration++, run over a bioText corpus instead of the Web.

Part 2. Extraction from the Scientific Literature: BioNELL
BioNELL vs NELL:
–automatically constructed ontology: GO, ChemBio, …, plus a small number of facts about mutual exclusion
–automatically chosen seeds
–conservative bootstrapping: only use some learned facts in bootstrapping (based on PMI with the concept name)

Part 2. Extraction from the Scientific Literature: BioNELL

Summary of BioNELL
Advantages over traditional IE for BioText:
–Exploits existing ontologies
–"Scaling up" vs "scaling out": coupled semi-supervised learning is easier than uncoupled SSL
–Trivial to introduce a new concept/relation (just add it to the ontology and give seed instances), so it is easy to customize BioNELL for a task
Disadvantages:
–Evaluation is difficult
–Limited recall
Still early work in many ways.

Outline
The scientific literature as something scientists interact with:
–recommending papers (to read, cite, …)
–recommending new entities (genes, algorithms, …) of interest
The scientific literature as a source of data:
–extracting entities, relations, … (e.g., protein-protein interactions)
The scientific literature as a tool for interpreting data:
–and vice versa

Part 3. Interpreting Data With Literature

Case Study: Protein-Protein Interactions in Yeast
Using known interactions between 844 proteins, curated by the Munich Information Center for Protein Sequences (MIPS). Studied by Airoldi et al. in a 2008 JMLR paper (on mixed-membership stochastic block models).
[figure: interaction matrix; axes are the indices of protein 1 and protein 2, a dot marks that p1, p2 do interact; sorted after clustering]

Case Study: Protein-Protein Interactions in Yeast
Using known interactions between 844 proteins from MIPS … and 16k paper abstracts from SGD, annotated with the proteins the papers refer to (all papers about these 844 proteins).
English text: "Vac1p coordinates Rab and phosphatidylinositol 3-kinase signaling in Vps45p-dependent vesicle docking/fusion at the endosome. The vacuolar protein sorting (VPS) pathway of Saccharomyces cerevisiae mediates transport of vacuolar protein precursors from the late Golgi to the lysosome-like vacuole. Sorting of some vacuolar proteins occurs via a prevacuolar endosomal compartment and mutations in a subset of VPS genes (the class D VPS genes) interfere with the Golgi-to-endosome transport step. Several of the encoded proteins, including Pep12p/Vps6p (an endosomal target (t) SNARE) and Vps45p (a Sec1p homologue), bind each other directly [1]. Another of these proteins, Vac1p/Pep7p/Vps19p, associates with Pep12p and binds phosphatidylinositol 3-phosphate (PI(3)P), the product of the Vps34 phosphatidylinositol 3-kinase (PI 3-kinase) …"
Protein annotations: EP7, VPS45, VPS34, PEP12, VPS21, …

Question: Is there information about protein interactions in the text?
[figure: two matrices side by side; MIPS interactions vs. thresholded text co-occurrence counts]
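A minimal sketch of the co-occurrence baseline behind the right-hand panel; the threshold value and the data layout are illustrative assumptions. It counts how often two proteins are annotated on the same abstract, then thresholds.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_interactions(doc_annotations, threshold=2):
    """doc_annotations: list of protein sets, one per abstract.
    Returns the set of protein pairs co-annotated on >= threshold abstracts."""
    counts = Counter()
    for proteins in doc_annotations:
        for pair in combinations(sorted(proteins), 2):
            counts[pair] += 1
    return {pair for pair, c in counts.items() if c >= threshold}

docs = [{"VPS45", "PEP12", "VPS34"}, {"VPS45", "PEP12"}, {"VPS21", "VPS34"}]
print(cooccurrence_interactions(docs))  # {('PEP12', 'VPS45')}
```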

Question: How to model this?
One option: treat each abstract as a bag of words plus a bag of protein annotations (the same example document and annotations as above), and model the two jointly with LinkLDA.

Question: How to model this?
[plate diagram of LinkLDA: per-document topic proportions θ (with prior α) generate a topic z_word for each of the N words and a topic z_prot for each of the L protein annotations, across M documents]

Question: How to model this?
[figure: interaction matrix, as before]
MMSBM of Airoldi et al.:
1. Draw K² Bernoulli distributions
2. Draw a θ_i for each protein
3. For each entry i,j in the matrix:
a) Draw z_{i·} from θ_i
b) Draw z_{·j} from θ_j
c) Draw m_{ij} from the Bernoulli associated with the pair of z's

Question: How to model this?
[figure: interaction matrix, as before]
Sparse block model of Parkinnen et al., 2007:
1. Draw topics over proteins β (these define the "blocks")
2. For each row in the link relation:
a) Draw a class pair (z_L, z_R) from a multinomial over class pairs
b) Draw a protein i from the left multinomial associated with the pair
c) Draw a protein j from the right multinomial associated with the pair
d) Add (i,j) to the link relation
A generative sketch is given below.
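A generative sketch of the sparse block model as just described; K, the Dirichlet hyperparameters, and the protein vocabulary are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_block_generate(proteins, K=3, n_links=100, alpha=1.0, gamma=0.1):
    """Generate a link relation from a sparse block model."""
    V = len(proteins)
    # Topics over proteins (beta): one multinomial per class
    beta = rng.dirichlet([gamma] * V, size=K)
    # Multinomial over the K*K class pairs; these define the "blocks"
    pair_probs = rng.dirichlet([alpha] * (K * K))
    links = []
    for _ in range(n_links):
        pair = rng.choice(K * K, p=pair_probs)
        zL, zR = divmod(pair, K)                    # class pair (z_L, z_R)
        i = proteins[rng.choice(V, p=beta[zL])]     # left protein from beta[z_L]
        j = proteins[rng.choice(V, p=beta[zR])]     # right protein from beta[z_R]
        links.append((i, j))
    return links

print(sparse_block_generate(["VPS45", "PEP12", "VPS34", "VPS21"])[:5])
```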

Gibbs sampler for sparse block model
Sampling the class pair for a link (i,j), conditioned on all other assignments:
P(z_L=k, z_R=l | i, j, rest) ∝ [probability of the class pair (k,l) in the link corpus] × [probability of the two entities in their respective classes]
i.e., a count-based term for how often the class pair is used, times how likely protein i is under class k and protein j under class l. A sketch of this sampling step follows.
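A sketch of that collapsed Gibbs step; the count bookkeeping and smoothing constants are illustrative assumptions. For one link, remove its current assignment, score every class pair by the two factors above, and resample.

```python
import numpy as np

def resample_link(link, z, pair_counts, memb, K, V, alpha=1.0, gamma=0.1,
                  rng=np.random.default_rng()):
    """One collapsed Gibbs step for a single link (i, j).

    z: class pair currently assigned to this link, or None
    pair_counts[k, l]: number of links assigned to class pair (k, l)
    memb[side][k]: dict protein -> count of draws from class k on that side
    V: number of distinct proteins
    """
    i, j = link
    if z is not None:                           # remove the current assignment
        k, l = z
        pair_counts[k, l] -= 1
        memb[0][k][i] -= 1
        memb[1][l][j] -= 1
    scores = np.zeros((K, K))
    for k in range(K):
        for l in range(K):
            # probability of the class pair in the link corpus ...
            pair_term = pair_counts[k, l] + alpha
            # ... times probability of the two entities in their classes
            left = (memb[0][k].get(i, 0) + gamma) / (sum(memb[0][k].values()) + V * gamma)
            right = (memb[1][l].get(j, 0) + gamma) / (sum(memb[1][l].values()) + V * gamma)
            scores[k, l] = pair_term * left * right
    probs = (scores / scores.sum()).ravel()
    k, l = divmod(int(rng.choice(K * K, p=probs)), K)
    pair_counts[k, l] += 1                      # record the new assignment
    memb[0][k][i] = memb[0][k].get(i, 0) + 1
    memb[1][l][j] = memb[1][l].get(j, 0) + 1
    return (k, l)
```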

BlockLDA: Jointly Modeling Blocks and Text
[figure: combined plate diagram] Entity distributions are shared between the "blocks" (link model) and the "topics" (text model).

Sample topics

Recovering the Interaction Matrix
[figure: three matrices side by side; MIPS interactions, Sparse Block model, Block-LDA]

Varying The Amount of Training Data

Another Performance Test
Goal: predict the "functional categories" of proteins.
–15 top-level categories (e.g., metabolism, cellular communication, cell fate, …)
–Proteins have 2.1 categories on average
–Method for predicting categories (sketched below):
Run with 15 topics.
Using held-out labeled data, associate each topic with its closest category.
If a category has n true members, pick the top n proteins by probability of membership in the associated topic.
–Metrics: F1, Precision, Recall
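A sketch of this evaluation protocol; the topic-to-category matching rule used here (match each topic to the category its top proteins overlap most) is an illustrative assumption.

```python
def predict_categories(topic_protein_probs, true_members):
    """topic_protein_probs: dict topic -> dict protein -> P(protein | topic)
    true_members: dict category -> set of proteins (held-out labels)
    Returns predictions per category, plus micro-averaged precision/recall/F1."""
    predictions = {}
    for topic, probs in topic_protein_probs.items():
        ranked = sorted(probs, key=probs.get, reverse=True)
        # associate topic with the closest category: largest overlap in top proteins
        category = max(true_members,
                       key=lambda c: len(set(ranked[:len(true_members[c])])
                                         & true_members[c]))
        n = len(true_members[category])          # category has n true members
        predictions[category] = set(ranked[:n])  # pick top n proteins
    tp = sum(len(predictions[c] & true_members[c]) for c in predictions)
    pred = sum(len(p) for p in predictions.values())
    gold = sum(len(m) for m in true_members.values())
    precision, recall = tp / pred, tp / gold
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return predictions, (precision, recall, f1)
```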

Performance: predicting functional categories of yeast

Varying The Amount of Training Data

Sample topics – do they explain the blocks?

Another test: vetting interaction predictions and/or topics
Procedure:
–hand-labeling by one expert (so far)
–double-blind comparison of models built from: text only; MIPS interactions; a smaller set of pull-downs done in Woolford's wet lab
–Y/N: is the topic a meaningful category?
–Y/N: if so, how many of the top 10 papers (or proteins) are in that category?

Another test: vetting interaction predictions and/or topics
[figure: results for Articles]

Another test: vetting interaction predictions and/or topics
[figure: results for Proteins]

Summary
Big questions:
–can using text lead to more accurate models of data?
–can you do this systematically for many modeling tasks?
–can the literature give us a lens for interpreting the results of statistical modeling?
Advantages:
–Huge potential payoff
But:
–Hard to evaluate!
Still early work in many ways.

Conclusions/Summary
The scientific literature as something scientists interact with:
–recommending papers (to read, cite, …)
–recommending new entities (genes, algorithms, …) of interest
The scientific literature as a source of data:
–extracting entities, relations, … (e.g., protein-protein interactions): GOFIE
The scientific literature as a tool for interpreting data:
–and vice versa
–… all we've evaluated to date
Past usage of the literature is itself data, so this is possibly the most general setting.

Thanks to…
–Ni, Ramnath, Dana and others…
–NIH, NSF, Google
–the AAAI Fall Symposium organizers
–you all, for listening!