The Path Ranking Algorithm for Relational Retrieval Problems
Presented by Ni Lao, 2009.9.3
Outline
–Problem definition
–Related works and our contribution
–Random walk algorithms with path parameterization
–Experiment
IR Trend
Data in current information retrieval tasks is becoming increasingly diverse in entity types and relation types:
–relational databases (Balmin et al., 2004);
–citation networks (e.g., CiteSeer, DBLP);
–movie databases (e.g., IMDB);
–music databases (Konstas et al., 2009);
–homeland security (Lin & Chalupsky, 2008);
–structured retrieval of annotated text (Bilotti et al., 2007);
–Personal Information Management (PIM; Minkov & Cohen, 2007)
Formal Definition
These relational structured data can be represented by an Entity-Relation (ER) graph:
–a set of entity types T = {T};
–a set of entities E = {e}. Each entity is typed, with e.T ∈ T. The instantiation of type T is I(T) = {e | e.T = T};
–a set of typed and ordered relations R = {R}. Each relation is associated with a pair of entity types R.T1, R.T2. We write R(e1, e2) = 1/0 to denote whether e1 and e2 have relation R, and R(e, ·) = {e' | R(e, e') = 1} for the set of entities that have relation R with e;
–the Entity-Relation graph is then G = (T, E, R).
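To make the definition concrete, here is a minimal sketch of an ER graph as a data structure (illustrative Python with names of my choosing, not code from the talk):

```python
from collections import defaultdict

class ERGraph:
    """Minimal Entity-Relation graph G = (T, E, R): typed entities and
    typed, ordered relations, as defined on this slide."""

    def __init__(self):
        self.entity_type = {}              # e -> e.T
        self.relation_types = {}           # R -> (T1, T2)
        self.edges = defaultdict(set)      # (R, e) -> {e' | R(e, e') = 1}

    def add_entity(self, e, t):
        self.entity_type[e] = t

    def add_relation_type(self, rel, t1, t2):
        self.relation_types[rel] = (t1, t2)

    def add_edge(self, rel, e1, e2):
        self.edges[(rel, e1)].add(e2)

    def instantiation(self, t):
        """I(T) = {e | e.T = T}."""
        return {e for e, et in self.entity_type.items() if et == t}

    def neighbors(self, rel, e):
        """R(e, .) = {e' | R(e, e') = 1}."""
        return self.edges[(rel, e)]

# Tiny example: two papers connected by a Cite relation.
g = ERGraph()
g.add_relation_type("Cite", "paper", "paper")
g.add_entity("p1", "paper")
g.add_entity("p2", "paper")
g.add_edge("Cite", "p1", "p2")
print(g.instantiation("paper"))    # {'p1', 'p2'}
print(g.neighbors("Cite", "p1"))   # {'p2'}
```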
Formal Definition
Generally, we define the Relational Retrieval (RR) problem as:
–given a set of query entities E_q = {e'},
–predict the relevance of each entity e of the target entity type T_q;
–we call q = (E_q, T_q) a query.
Related Works
Keyword search in relational databases
–An answer is defined as a tree connecting all query entities, with a target entity as its root
–BANKS (Bhalotia et al., 2002; Bhavana et al., 2008), DBXplorer (Agrawal et al., 2002), Discover (Hristidis & Papakonstantinou, 2002), BLINKS (He et al., 2007)
Ad-hoc retrieval style task definition
–Entities are ranked by their closeness to the query words
–Closeness is defined by random walks on the graph
–PageRank (Brin & Page, 1998), Topic-sensitive PageRank (Haveliwala, 2002), Personalized PageRank (Jeh & Widom, 2003), ObjectRank (Balmin et al., 2004), personal information management (Minkov & Cohen, 2007), gene detection (Arnold & Cohen, 2009)
Related Works
Improving random walk models in a supervised fashion
–quadratic programming (Tsoi et al., 2003),
–simulated annealing (Nie et al., 2005),
–back-propagation (Diligenti et al., 2005; Minkov & Cohen, 2007),
–limited-memory Newton method (Agarwal et al., 2006)
Limitations
Expressive power of the model
–Actual paths (as opposed to the individual relations used during a random walk) can be very indicative (Minkov & Cohen, 2007)
Lack of training data
–18 testing queries by Richardson & Domingos (2001);
–4 testing queries by Balmin et al. (2004);
–10 training and testing queries by Chakrabarti and Agarwal (2006);
–<30 training queries on various tasks by Minkov et al. (2006);
–learning from page orders generated by artificially manipulated models by Tsoi et al. (2003) and Agarwal et al. (2006).
Path Matters Example
This Work
Path Ranking Algorithm (PRA)
–Modify the random walk model to use path parameterization
–Modify PageRank to use path parameterization
–Modify teleport learning to use path parameterization
–Demonstrate the importance of L1 and L2 regularization
–Provide the first large-scale evaluation: several realistic tasks, each having thousands of training and testing queries
PRA: Single Entity Queries
Given an ER graph G = (T, E, R) and a query q = (E_q, T_q):
–A type path P = (T_1, …, T_n) is a sequence of entity types, with the constraint that (T_i, T_i+1) ∈ R.
–Let P(q, l) be the set of type paths that start with the type of the query entity, end with T_q, and have length ≤ l.
–For each type path P = (T_1, …, T_n) in P(q, l), we define a series of distributions h_i(e) over the entities with e.T = T_i (one possible reconstruction of the recursion is given below).
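The recursion defining these distributions is not reproduced in this transcript; a plausible reconstruction in the spirit of the usual PRA formulation (my notation, assuming the walk starts uniformly over the query entities) is:

```latex
\[
h_1(e) =
  \begin{cases}
    1/|E_q| & \text{if } e \in E_q,\\
    0       & \text{otherwise,}
  \end{cases}
\qquad
h_{i+1}(e') \;=\; \sum_{e:\,R_i(e,e')=1} \frac{h_i(e)}{\lvert R_i(e,\cdot)\rvert},
\]
```

where R_i denotes the relation connecting T_i and T_i+1; the final distribution h_P = h_n is what path P contributes to ranking the candidates in I(T_q).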
PRA: Single Entity Queries
All the type paths can be summarized as a prefix tree, with each node corresponding to a distribution h_i(e) over the entities.
A PRA model (G, l, θ) ranks I(T_q) by a scoring function which, in matrix form, is s = Aθ, where s is a (sparse) column vector with the score of each entity, θ is a column vector of weights for the type paths, and each column of A is the distribution h_P(e) of a path P.
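As a concrete illustration of the matrix form s = Aθ (a sketch with made-up numbers, assuming the path distributions in A have already been computed):

```python
import numpy as np
from scipy.sparse import csr_matrix

# A: |I(T_q)| x |paths| matrix; column P holds the path distribution h_P(e).
A = csr_matrix(np.array([[0.5, 0.0, 0.1],
                         [0.3, 0.2, 0.0],
                         [0.2, 0.8, 0.9]]))
theta = np.array([1.2, -0.4, 0.7])   # one weight per type path

s = A @ theta                        # scoring function s = A * theta
ranking = np.argsort(-s)             # candidate entities, best first
print(s)
print(ranking)
```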
Parameter Learning
Given a set of training data D = {(q^(m), A^(m), y^(m))}, m = 1…M, with y^(m)(e) = 1/0, we define a regularized objective function (sketched below).
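The objective itself appears only as an image in the original slides; a reconstruction consistent with the λ_1/λ_2 regularization discussed later in the talk (an assumption, not the slide's verbatim formula) is:

```latex
\[
O(\theta) \;=\; \sum_{m=1}^{M} o^{(m)}(\theta) \;-\; \lambda_1 \lVert\theta\rVert_1 \;-\; \lambda_2 \lVert\theta\rVert_2^2 .
\]
```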
Parameter Learning
The per-query term o^(m)(θ) can take various forms, e.g. log-loss (logistic regression), negative hinge loss (SVM), or negative exponential loss (boosting).
Here we use the log-loss, which is easy to optimize and does not penalize outlier samples as harshly as the exponential loss.
We use orthant-wise L-BFGS (Andrew & Gao, 2007) to tune θ.
Parameter Learning
Let P^(m) be the index set of relevant entities, and N^(m) the index set of irrelevant entities (how to choose them is discussed later).
We use the average log-likelihood of the positive and negative entities as the objective o^(m)(θ); a reconstruction of this objective and its gradient follows below.
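Neither the objective nor its gradient survives in this transcript; under the log-loss choice above, a standard reconstruction (my notation: σ is the logistic function, A^(m)_e the feature row of entity e) is:

```latex
\[
o^{(m)}(\theta) = \frac{1}{|P^{(m)}|}\sum_{e \in P^{(m)}} \ln p_e
               + \frac{1}{|N^{(m)}|}\sum_{e \in N^{(m)}} \ln (1 - p_e),
\qquad p_e = \sigma\!\left(\theta^{\top} A^{(m)}_{e}\right),
\]
\[
\frac{\partial o^{(m)}}{\partial \theta}
  = \frac{1}{|P^{(m)}|}\sum_{e \in P^{(m)}} (1 - p_e)\, A^{(m)}_{e}
  \;-\; \frac{1}{|N^{(m)}|}\sum_{e \in N^{(m)}} p_e\, A^{(m)}_{e}.
\]
```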
Parameter Learning
For a retrieval system we may prefer to optimize pair-wise margins:
–predict, for each pair of entities, whether one should be ranked higher than the other (e_i ≻ e_j).
The pair-wise objective and its gradient take an analogous form (one possible pair-wise loss is sketched below).
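The slide's pair-wise formula is likewise not preserved; one common choice consistent with the description (an assumption, not the original equation) is a logistic loss on score differences:

```latex
\[
o^{(m)}(\theta) = \frac{1}{|P^{(m)}|\,|N^{(m)}|}
  \sum_{i \in P^{(m)}} \sum_{j \in N^{(m)}}
  \ln \sigma\!\left(\theta^{\top}\left(A^{(m)}_{i} - A^{(m)}_{j}\right)\right).
\]
```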
EntityRank: Using Query-Independent Paths
PageRank assigns an importance score (independent of the query terms) to each web page, and this importance score is later combined with a relevance score (query-term dependent).
We add to each query a special entity e_0 of a special type T_0, which has a relation to each entity type in the system; e_0 is linked to each entity in the entity-relation graph. T_0 therefore introduces a set of query-independent type paths, which can be calculated offline.
Modeling Hidden Factors
Entity-specific information is not captured by the model:
–E.g., some documents ranked low for a query may be interesting to users because of features not captured in the data
–E.g., different users may have completely different information needs and goals for the same query
–The identity of the entity matters
Modeling Hidden Factors
Hidden factors
–We introduce a set of hidden factors for each entity, one for each path starting from the entity and leading to the target entity type, e.g.:
papers—(cite)→papers—(written by)→authors
papers—(written by)→authors
authors
authors—(write)→papers—(written by)→authors
–The distribution matrix A is augmented to [A A_hf], where each column of A_hf is the distribution of a certain hidden path.
–Similarly, θ is augmented to [θ; θ_hf].
–Suppose there are 10,000 authors and 10,000 papers in the graph; then the model would include 40,000 such parameters.
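A small sketch of the augmentation (illustrative shapes only, not the SGD/PubMed data):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((5, 3))        # 5 candidate entities, 3 ordinary type paths
A_hf = rng.random((5, 4))     # 4 hidden-factor columns, one per hidden path
theta = rng.random(3)
theta_hf = np.zeros(4)        # hidden-factor weights start at zero

A_aug = np.hstack([A, A_hf])                    # [A  A_hf]
theta_aug = np.concatenate([theta, theta_hf])   # [theta; theta_hf]
s = A_aug @ theta_aug                           # scores unchanged while theta_hf = 0
print(A_aug.shape, theta_aug.shape, s.shape)    # (5, 7) (7,) (5,)
```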
Modeling Hidden Factors
However, hidden factors are
–Large: potentially |E|^2 space
–Redundant: many hidden paths point to the same target entities
Simplified model: Instantiated Relations
–For a task with query type Q_q and target type T_q,
–define a set of special relations from a special entity e_1 to each target entity e in T_q, and a set of special relations from each query entity e' in Q_q to each target entity e in T_q.
–Each such relation has its own weight.
Modeling Hidden Factors
Worried about the efficiency of optimizing so many parameters? Only add important relations to the model:
–importance is measured by |∂O(θ)/∂θ_R|;
–add at most the top b (batch size) relations at each L-BFGS iteration (see the selection sketch below).
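A sketch of that selection step (hypothetical helper and relation names, not the talk's code):

```python
def select_top_b(grad_candidates, active, b):
    """grad_candidates: dict relation -> |dO/dtheta_R| at the current theta.
    Returns up to b not-yet-active relations with the largest gradient magnitude."""
    scored = [(abs(g), r) for r, g in grad_candidates.items() if r not in active]
    scored.sort(reverse=True)
    return [r for _, r in scored[:b]]

active = {"path_1", "path_2"}
grads = {"rel_to_e17": 0.9, "rel_to_e3": 0.05, "rel_to_e8": 0.4, "path_1": 1.2}
print(select_top_b(grads, active, b=2))   # -> ['rel_to_e17', 'rel_to_e8']
```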
Efficiency Considerations
Path blocking
–Forbid the random walk from following a relation right after its reversed relation (e.g., write after write^-1).
Maintaining distribution sparsity
–For time/memory efficiency, sampling has been used to get approximate but sparse estimates of the distributions.
–Here we use a truncation strategy: at each random walk step, h_i(e) = max(0, h_i(e) - γ·E[h_i]), where E[h_i] is the average of h_i(e) over entities with non-zero values.
Few positive entities vs. thousands (or millions) of negative entities?
–First sort all the negative entities with the initial model (all feature weights uniformly set to 1.0), then:
–square sampling takes negative entities at the k(k+1)/2-th positions,
–cubic sampling at the k(k+1)(k+2)/6-th positions,
–exponential sampling at the (2^k - 1)-th positions, k = 0, 1, 2, 3, …,
–top-k sampling takes the top-k negative entities.
(A small sketch of the truncation and sampling steps follows.)
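The two tricks above can be sketched in a few lines (illustrative code with assumed array representations, not the talk's implementation):

```python
import numpy as np

def truncate(h, gamma):
    """One truncation step: h_i(e) <- max(0, h_i(e) - gamma * E[h_i]),
    where E[h_i] is the mean of h_i over entities with non-zero value."""
    nonzero = h[h > 0]
    if nonzero.size == 0:
        return h
    return np.maximum(0.0, h - gamma * nonzero.mean())

def square_sample_positions(n_negatives):
    """Positions k(k+1)/2 (k = 0, 1, 2, ...) into the sorted negative list;
    cubic and exponential sampling follow the same pattern with
    k(k+1)(k+2)/6 and 2**k - 1."""
    positions, k = [], 0
    while k * (k + 1) // 2 < n_negatives:
        positions.append(k * (k + 1) // 2)
        k += 1
    return positions

h = np.array([0.40, 0.001, 0.0, 0.30, 0.002])
print(truncate(h, gamma=1.0))            # small entries get zeroed out
print(square_sample_positions(100))      # [0, 1, 3, 6, 10, 15, ...]
```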
Experiment: Data
Data sources
–PubMed: an on-line archive of over 18 million biological abstracts
–PubMed Central (PMC): full-text copies of over 1 million of these papers
–Saccharomyces Genome Database (SGD): a database of information concerning the yeast
The nodes of our network are:
–48,641 papers contained in SGD
–69,161 authors of SGD papers
–5,816 genes of yeast mentioned in SGD
–58 years, from 1950 through 2008
–1,126 journals
–39,827 unique title terms, after applying a stop word list of size 429
Experiment: Data
The edges of our network are:
–376,010 Citation relations among papers
–1,604 RelatesTo relations from genes to other genes
–178,233 Authorship relations from authors to papers, further distinguished as any author, first author, and last author
–160,621 Mention relations from papers to the genes they discuss, further distinguished into 49 categories in the SGD database, such as "Evolution", "Function/Process", "Mutants/Phenotypes", etc.
–HasTitleTerm, InJournal, InYear relations for each paper
–Before relations from each year to the next year
Experiment: Task
The Paper Completion Tasks (PCT)
–Treat a paper as a big form with fields; the task is to predict one field given some of the other fields. We have 16k query-judgment pairs for each task.
–Y-J: E_q = year, T_q = {journal}; suggest hot journals of a year.
–YGW-J: E_q = year ∪ genes ∪ words, T_q = {journal}; suggest a journal in which to publish a research work.
–YGW-P: E_q = year ∪ genes ∪ words, T_q = {citation}; help literature review.
–YA-G / YA-W: E_q = year ∪ authors, T_q = {gene / title}; suggest topics a researcher might currently be interested in.
Time-variant graph
–each edge is tagged with a time stamp (a year in this case);
–when doing the random walk, only consider edges that are earlier than the query (a small sketch follows).
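A minimal sketch of the time-variant restriction (assumed tuple representation of edges, not the talk's data format):

```python
def visible_edges(edges, query_year):
    """Keep only edges whose time stamp precedes the query year.
    edges: iterable of (source, relation, target, year) tuples."""
    return [(s, r, t, y) for (s, r, t, y) in edges if y < query_year]

edges = [("p1", "Cite", "p2", 1999), ("p1", "Cite", "p3", 2005)]
print(visible_edges(edges, query_year=2003))   # only the 1999 edge survives
```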
Experiment: Main Result
Compare retrieval qualities by MAP.

          Y-J    YGW-J  YGW-P  YA-G   YA-W
RRAb      0.252  0.404  0.127  0.144  0.201
RRA       0.251  0.462  0.151  0.144  0.199
PRAb      0.252  0.404  0.126  0.143  0.201
PRA       0.247  0.447  0.163  0.149  0.200
PRAeb     0.206  0.271  0.154  0.135  0.184
PRAe      0.339  0.464  0.162  0.151  0.202
PRA-ir30  0.386  0.487  0.167  0.141  0.16
PRA-hf5   0.375  0.479  0.155  0.151  0.179
PRA-P     0.246  0.456  0.166  0.148  0.194
Experiment: Main Result
Compare scalabilities by training time (s).

          Y-J    YGW-J  YGW-P    YA-G   YA-W
RRAb      0      0      0        0      0
RRA       32121 3968
PRAb      0      0      0        0      0
PRA       38     133    2,024    190    69
PRAeb     0      0      408      0      0
PRAe      30     146    4,286    274    120
PRA-ir30  75     253    10,512   393    790
PRA-hf5   126    2,660  749*     96*    328*
PRA-P     301    179    520      234    120

Max path length of hidden factors is set to 2, except in the tasks marked *, where it is set to 1.
Experiment: threshold γ
Effect of threshold γ on RRAb: not very sensitive.
Experiment: L2
Usually there is a bump (plots shown for YGW-J and YGW-P).
Experiment: L1
Usually there is a bump (plots: YGW-J with λ_1 = 0; YGW-P with λ_1 = 0).
Experiment: L1
L1 can eliminate paths without reducing MAP (plot: YGW-P, λ_2 = 0.003).
Peek into the Model Weights (Y-J, PRA3-i10)

weight   Path
 26.5    y(_Be)y(_Ye)p(Jo)j
  0.89   >Proc_Natl_Acad_Sci_U_S_A
  0.87   >Mol_Cell_Biol
  0.63   >EMBO_J
  0.59   >Genetics
  …
 -2.26   >Cell
 -2.65   >Nature
 -2.81   >Mol_Microbiol
 -3.14   >Glycobiology
 -3.15   >Nat_Genet
 -3.2    >Eur_J_Biochem
 -3.21   >FEBS_Lett
 -3.21   >Gene
 -3.22   >Mol_Gen_Genet
 -3.23   >J_Biol_Chem
 -3.25   >Yeast
Peek into the Model Weights (YGW-J, PRAe3-i5)

weight   Path
 24.06   w(_Ti)p(_Ci)p(Jo)j
 19.98   w(_Ti)p(Jo)j
 13.90   T(pa)p(_Ci)p(Jo)j
  9.68   g(_Ge)p(_Ci)p(Jo)j
  3.11   w(_Ti)p(Ci)p(Jo)j
  …
 -2.26   >Nature
 -2.63   >Biochim_Biophys_Acta
 -2.68   >J_Biol_Chem
 -2.73   >Science
 -2.79   >Biochem_Biophys_Res_Commun
 -3.01   >Cell
 -3.05   >FEBS_Lett
 -3.81   >Eur_J_Biochem
 -3.84   >Gene
 -4.17   >Mol_Gen_Genet
 -4.23   >Yeast
Peek into the Model Weights (YGW-P, PRAe5)

ID   weight   Path
1    132.1    w(_Ti)p(Ci)p
2    34.18    w(_Ti)p
3    13.49    w(_Ti)p(_Ci)p
4    9.795    g(_Ge)p(Ci)p
5    2.9      g(_Ge)p
     …
6    -0.539   g(_Ge)p(FA)a(_LA)p(Au)a(_LA)p
7    -0.869   g(_Ge)p(FA)a(_LA)p
8    -1.282   g(_Ge)p(FA)a(_LA)p(FA)a(_LA)p
9    -3.9     g(_Ge)p(LA)a(_FA)p
10   -6.589   T(ye)y(_Be)y(_Be)y(_Ye)p
11   -6.591   T(ye)y(_Be)y(_Ye)p
12   -6.594   T(ye)y(_Ye)p
Thanks