Beyond Keyword Search: Discovering Relevant Scientific Literature. Khalid El-Arini and Carlos Guestrin. August 22, 2011.

Presentation transcript:

1 Beyond Keyword Search: Discovering Relevant Scientific Literature. Khalid El-Arini and Carlos Guestrin. August 22, 2011.

2 “It will be almost as convenient to search for some bit of truth concealed in nature as it will be to find it hidden away in an immense multitude of bound volumes.” - Denis Diderot, 1755. Today: 10^7 papers, 10^5 publications [Thomson Reuters Web of Knowledge]

3 Keyword search is dominant… but is it natural?

4 Specific research question: Is there an approximation algorithm for the submodular covering problem that doesn’t require an integral-valued objective function? Any recent papers influenced by this?

5 Literature review: It’s 11:30pm Samoa Time. Your “Related Work” section is a bit sparse. Here are some papers we’ve cited so far. Anything else?

6 Given a set of relevant query papers, what else should I read?

7 An example: Given the query set, a paper cited by all query papers may be a seminal/background paper; a paper that cites all query papers may be a competing approach. However, we are unlikely to find papers directly connected to the entire query set. We need something more general…

8 Select a set of papers A with maximum influence to/from the query set Q.

9 Modeling influence: Ideas flow from cited papers to citing papers.

10 Modeling influence: Ideas flow from prior knowledge of the authors.

11 Influence context: Why do I cite this paper? Generative model of text; variational inference; EM; … We call these concepts.

12 Concept representation: Words, phrases or important technical terms; proteins, genes, or other advanced features. Our assumption: Influence always occurs in the context of concepts.

13 Influence by concept (example concepts: plant, stress). Grayed-out nodes don’t contain the given concept. Which shows more influence? We need to model the strength of each edge.

14 Influence strength (concept: oxygen). Edge types: common authors, direct citation.

15 Influence strength (concept: oxygen). Formula annotation: (for normalization).

16 Influence strength (concept: oxygen). Formula annotation: prevalence of “oxygen”. Direct citations are more indicative of influence than previous papers of the authors.

17 Influence strength (concept: oxygen). Formula annotations: prevalence of “oxygen”; the weight between papers u and v w.r.t. concept c.

18 Influence strength (concept: plant). Formula annotation: prob. of influence between x and y with respect to concept c. Influence exists if there is an active path between x and y (w.r.t. concept c).

19 Computing influence: The definition is intuitive, but intractable to compute exactly (#P-complete: the s-t network reliability problem). Approximations: (1) Sampling: sample complexity is provably logarithmic in the size of the corpus, but can still be slow in practice. (2) Independence heuristic: a fast, dynamic programming-based approach, but with no explicit theoretical guarantees.
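
The sampling approximation can be illustrated with a minimal Monte Carlo sketch. It assumes the concept-specific graph is given as a dictionary of directed edges with activation probabilities; the function name `estimate_influence`, the edge format, and the example weights are hypothetical and not taken from the slides. Each sample keeps every edge independently with its probability and checks whether an active path from x to y exists.

```python
import random

def estimate_influence(edges, x, y, num_samples=10000, seed=0):
    """Estimate P(active path from x to y) by sampling edge activations.

    edges: dict mapping (u, v) -> activation probability (assumed format).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_samples):
        # Keep each edge independently with its activation probability.
        active = {}
        for (u, v), p in edges.items():
            if rng.random() < p:
                active.setdefault(u, []).append(v)
        # Depth-first search from x over the active edges only.
        stack, seen, found = [x], {x}, False
        while stack:
            node = stack.pop()
            if node == y:
                found = True
                break
            for nxt in active.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        hits += found
    return hits / num_samples

# Two edge-disjoint paths x -> a -> y and x -> b -> y, each edge active w.p. 0.5.
diamond = {("x", "a"): 0.5, ("a", "y"): 0.5, ("x", "b"): 0.5, ("b", "y"): 0.5}
print(estimate_influence(diamond, "x", "y"))  # close to the exact value 0.4375
```

One full graph search per sample is what makes this approach slow on a large corpus, which is the motivation for the faster heuristic.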

20 Independence heuristic: In some cases, we can compute the influence probability exactly (example, concept: plant: exact because these two paths are independent). Heuristic: Assume all paths from x to y are independent of each other. Given the influence computed to the parents, we can compute the influence to the child in constant time. We need just one linear pass through the graph in topological order. [In practice, this closely matches results from sampling.]
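
A minimal sketch of the independence heuristic follows, assuming the same hypothetical edge-dictionary format (directed edge to activation probability) on an acyclic graph. The update rule used here, 1 minus the product over parents of (1 - influence(parent) * edge weight), is one natural reading of "assume all paths are independent"; the slides do not show the exact recurrence, so treat it as an assumption.

```python
from collections import deque

def influence_independent(edges, x):
    """Approximate influence from x to every node in one topological pass.

    edges: dict mapping (u, v) -> activation probability (assumed format, DAG).
    """
    parents, children, indeg = {}, {}, {}
    for (u, v), w in edges.items():
        parents.setdefault(v, []).append((u, w))
        children.setdefault(u, []).append(v)
        indeg[v] = indeg.get(v, 0) + 1
        indeg.setdefault(u, 0)
    indeg.setdefault(x, 0)

    # Kahn's algorithm: a node is processed only after all of its parents.
    queue = deque(n for n, d in indeg.items() if d == 0)
    influence = {}
    while queue:
        v = queue.popleft()
        if v == x:
            influence[v] = 1.0
        else:
            # Probability that no parent passes influence on to v,
            # treating the incoming paths as independent.
            miss = 1.0
            for u, w in parents.get(v, []):
                miss *= 1.0 - influence[u] * w
            influence[v] = 1.0 - miss
        for child in children.get(v, []):
            indeg[child] -= 1
            if indeg[child] == 0:
                queue.append(child)
    return influence

diamond = {("x", "a"): 0.5, ("a", "y"): 0.5, ("x", "b"): 0.5, ("b", "y"): 0.5}
print(influence_independent(diamond, "x")["y"])  # 0.4375
```

In this example the two paths share no edges, so the heuristic value is exact; when paths overlap, the result is only an approximation.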

21 Recall: Select a set of papers A with maximum influence to/from the query set Q while maintaining relevance and diversity.

22 Influence + Relevance: Influence should focus on relevant concepts. A relevant concept is prevalent in the query documents Q and should be a main theme of some document in A.

25 Influence + Diversity: Why diversity? There is uncertainty about the user’s information need, and there are different approaches/facets to the same research problem.

26-30 Influence + Diversity: We take a probabilistic max cover approach. Diagram labels: query papers; concepts (plant, oxygen, stress); candidate papers; influence.

31 Set influence

32 Set influence

33 Set influence: probability that there exists influence and relevance

34 Set influence: probability that none of the documents exhibits influence or relevance

35 Set influence: probability that at least one document exhibits influence or relevance
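
The slide formulas were not captured in the transcript. As a hedged reconstruction of the "none" / "at least one" steps, assuming the per-document events are treated as independent, the set-level quantity could be written as below. The symbols $F_c$ and $p_c(d, Q)$ are assumed notation, not the authors': $p_c(d, Q)$ stands for the probability that document $d$ exhibits influence and relevance with respect to the query set $Q$ and concept $c$.

```latex
% Probability that no document in A exhibits influence (independence assumed):
%   \prod_{d \in A} \bigl(1 - p_c(d, Q)\bigr)
% Probability that at least one document in A does:
F_c(A) \;=\; 1 - \prod_{d \in A} \bigl(1 - p_c(d, Q)\bigr)
```

Under this form, adding a second document that covers an already well-covered concept raises $F_c(A)$ only slightly, which is what pushes the selection toward diverse results.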

36 Putting it all together: We can now write an objective function describing exactly what we want: max [formula omitted]. How do we solve this optimization?

37 Optimization: Our objective is submodular, an intuitive diminishing returns property. Using a simple greedy algorithm, we can maximize the objective efficiently and near-optimally.

38 Optimization: Our objective is submodular. Adding s to the smaller set A gives a large improvement; adding s to the larger set B gives a small improvement. For $A \subseteq B$ and $s \notin B$: $F(A \cup \{s\}) - F(A) \ge F(B \cup \{s\}) - F(B)$. [slide credit: Andreas Krause] Using a simple greedy algorithm, we can maximize the objective efficiently and near-optimally.
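
The greedy step can be sketched on a toy probabilistic-coverage objective. Everything here is illustrative: `coverage`, `greedy_select`, the `probs` and `weights` dictionaries, and the numbers are hypothetical stand-ins, not the objective from the talk. For a monotone submodular objective under a cardinality constraint, greedy selection is known to achieve at least a (1 - 1/e) fraction of the optimum, which is the sense of "near-optimally" above.

```python
def coverage(selected, probs, weights):
    """Weighted probability that each target is covered by at least one doc.

    probs[d][t]: probability that document d covers target t (assumed input).
    weights[t]:  importance of target t (assumed input).
    """
    total = 0.0
    for t, w in weights.items():
        miss = 1.0
        for d in selected:
            miss *= 1.0 - probs[d].get(t, 0.0)
        total += w * (1.0 - miss)
    return total

def greedy_select(probs, weights, k):
    """Pick up to k documents, each time adding the largest marginal gain."""
    selected = []
    remaining = set(probs)
    for _ in range(min(k, len(remaining))):
        base = coverage(selected, probs, weights)
        # sorted() makes tie-breaking deterministic.
        best = max(
            sorted(remaining),
            key=lambda d: coverage(selected + [d], probs, weights) - base,
        )
        selected.append(best)
        remaining.remove(best)
    return selected

probs = {"a": {"t1": 0.9}, "b": {"t1": 0.8}, "c": {"t2": 0.5}}
weights = {"t1": 1.0, "t2": 1.0}
print(greedy_select(probs, weights, k=2))  # ['a', 'c']
```

Document "b" is individually stronger than "c", but once "a" is chosen its marginal gain is only 0.08, so the greedy pass prefers "c": diminishing returns in action.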

39 Recap: query set, max, result set

40 But should all users get the same results?

41 Personalized trust: Different communities trust different researchers for a given concept, e.g., for “network”: Kleinberg, Hinton, Pearl. Goal: Estimate personalized trust from limited user input.

42 Specifying trust preferences: Specifying trust should not be an onerous task. Assume we are given a (nonexhaustive!) set of trusted papers B, e.g., a BibTeX file of all the researcher’s previous citations; a short list of favorite conferences and journals; or someone else’s citation history (a committee member? journal editor? someone in another field? a Turing Award winner?).

43 Given trusted set B, how much do I trust author a with respect to concept c?

44 Computing trust: How much do I trust Jon Kleinberg with respect to the concept “network”? Diagram labels: Kleinberg’s papers; trusted set B. An author is trusted if he/she influences the user’s trusted set B.

45 Personalized Objective

46 Personalized Objective: Does the user trust at least one of the authors of d with respect to concept c?

47 networks, graphics, data mining

48 User Study Evaluation: 16 PhD students in machine learning. For each participant: (1) select a recent paper for which we wish to find related work (the study paper); (2) compare our algorithm and three state-of-the-art alternatives: Relational Topic Model, Information Genealogy, Google Scholar; (3) show papers one at a time (double-blind), asking questions, e.g., “Would this paper have been useful to you when writing the study paper?”

49 Usefulness (higher is better): Our approach provides more useful and more must-read papers.

50 Trust (higher is better): Our approach provides more trustworthy papers…

51 Novelty: …but at the expense of some novelty.

52 Diversity: Our approach produces more diverse results.

53 Summary: It is often difficult to phrase information needs as keyword queries. Define the query as a small set of related papers. Efficiently optimize a submodular objective function, based on an intuitive notion of influence, to select highly relevant articles. Incorporate trust preferences to produce personalized results. Participants in the user study found our method to be more useful, trustworthy and diverse than other popular alternatives. Live site coming soon!