Efficient Ranking of Keyword Queries Using P-trees Fei Pan, Imad Rahal, Yue Cui, William Perrizo Computer Science Department North Dakota State University Fargo, ND
Outline The Keyword Ranking problem The P-tree technology EIN-ring Approach Efficient Ranking Algorithm using P-trees Summary
Introduction Keyword ranking is the process of ordering the documents that best match a given query defined by a finite number of keywords. For our purposes, the query is also viewed as a mini-document. Similarity between documents is based entirely on their contents.
Introduction P-RANK proceeds by finding the k nearest neighbors (kNN) of a given query (viewed as a document) using the EIN-ring (described later). After that, the system returns a weighted list of matching documents (the weighting scheme is discussed later).
Introduction Motivation: the number of text documents keeps increasing: medical articles, research publications, e-mails, news reports (e.g. Reuters), and others. Access to this information is in great demand.
Introduction (cont.) Text has no explicit structure, unlike other data (e.g. relational databases). The Vector Space Model was proposed by Salton (1975) for text. Each document is represented as a vector whose dimensions are the terms in the initial document collection. The query is also represented as a document vector in the same space.
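A minimal sketch of the vector representation described on this slide, assuming whitespace tokenization and raw term counts as vector entries. The helper names `build_vocabulary` and `to_vector` and the sample documents are hypothetical and only serve the illustration.

```python
from collections import Counter


def build_vocabulary(docs):
    """Collect the distinct terms of the collection, one dimension per term."""
    return sorted({term for doc in docs for term in doc.lower().split()})


def to_vector(text, vocab):
    """Map a document (or a query) to term counts over the vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocab]


docs = ["gene expression in yeast", "protein folding", "yeast protein gene"]
vocab = build_vocabulary(docs)
doc_vectors = [to_vector(d, vocab) for d in docs]

# The query is treated as a mini-document in the same space.
query_vector = to_vector("yeast gene", vocab)
print(vocab)
print(query_vector)
```

Query terms that do not occur in the collection have no dimension and are simply ignored in this sketch.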
The P-tree technology A tree-like data structure that stores numeric (and categorical) relational data in vertical bit format, by splitting each attribute into bits and representing each bit position by a P-tree. The next example shows the conversion of columns into P-trees (next slide). All columns are converted to binary first.
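The vertical bit format can be illustrated with a small sketch. It assumes flat, uncompressed bit vectors stored as Python integers (bit r of a mask stands for row r), so it shows only the column-to-bit-slice split, not the tree structure built on top of each slice. The names `bit_slices` and `rows` are hypothetical.

```python
def bit_slices(values, width):
    """Return one bit vector per bit position, most significant first.

    Each bit vector is a Python int used as a bitmask: bit r is set
    when row r has a 1 at that bit position.
    """
    slices = []
    for pos in range(width - 1, -1, -1):
        mask = 0
        for row, v in enumerate(values):
            if (v >> pos) & 1:
                mask |= 1 << row
        slices.append(mask)
    return slices


def rows(mask):
    """List the row indices whose bit is set in mask."""
    return [r for r in range(mask.bit_length()) if (mask >> r) & 1]


column = [5, 2, 7, 4]            # binary: 101, 010, 111, 100
p3, p2, p1 = bit_slices(column, 3)
print(rows(p3))                  # rows whose high bit is 1
print(rows(p3 & p1))             # rows whose high and low bits are both 1
```

Once a column is split this way, a predicate over its values reduces to bitwise AND, OR, and complement over the slices.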
The P-tree technology (cont.) Construction of Basic Peano Count trees
The P-tree technology (cont.) Basic P-tree Operations
Predicates using P-trees There are five basic types of predicates: $x > c$, $x \ge c$, $x \le c$, $x < c$, and $x = c$, where $c$ is a bound value. We give a proposition for each of them using P-trees.
The Formula Propositions Proposition for predicate $P_{x>c}$: briefly, for lower bound $c = (b_m \ldots b_i \ldots b_1)_2$. Proposition for predicate $P_{x \le c}$: similarly, for upper bound $c = (b_m \ldots b_i \ldots b_1)_2$.
The Examples Calculation of $P_{x>c}$ for $P_{x>(4)_{10}} = P_{x>(100)_2}$. Crude method: $P_{x>(100)_2} = (P_3 \wedge P_2) \vee (P_3 \wedge P_2' \wedge P_1)$. Our formula: $P_{x>(100)_2} = P_3 \wedge (P_2 \vee P_1)$.
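The example can be checked with a short sketch. It assumes uncompressed bit slices held as Python integers rather than real P-trees. The function `greater_than` is one possible generalization of the slide's example to any bound c (working from the lowest bit up, AND where the bit of c is 1, OR where it is 0); it is an assumption of this sketch, not a transcription of the proposition.

```python
def bit_slices(values, width):
    """Vertical bit vectors, most significant bit first (row r -> bit r)."""
    return [
        sum(1 << row for row, v in enumerate(values) if (v >> pos) & 1)
        for pos in range(width - 1, -1, -1)
    ]


def greater_than(slices, c):
    """Mask of rows whose value is > c, using only AND/OR on bit slices."""
    width = len(slices)
    result = 0  # with no bits left to compare, "greater" is false
    for i in range(width):                 # least significant bit first
        p = slices[width - 1 - i]          # slice for bit position i
        result = (p & result) if (c >> i) & 1 else (p | result)
    return result


column = [5, 2, 7, 4, 6, 1]
full = (1 << len(column)) - 1              # mask with every row set
p3, p2, p1 = bit_slices(column, 3)

crude = (p3 & p2) | (p3 & (~p2 & full) & p1)   # crude method from the slide
ours = p3 & (p2 | p1)                          # shorter formula from the slide
general = greater_than([p3, p2, p1], 4)
brute = sum(1 << r for r, v in enumerate(column) if v > 4)
print(crude == ours == general == brute)
```

The shorter formula needs two bitwise operations where the crude method needs five, counting the complement, and the gap grows with the number of bits.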
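The Examples (cont.) Calculation of $P_{x \le c}$ for $P_{x \le (70)_{10}} = P_{x \le (01000110)_2}$. Crude method: $P_{x \le (01000110)_2} = (P_7' \wedge P_6') \vee (P_7' \wedge P_6 \wedge P_5' \wedge P_4' \wedge P_3' \wedge P_2') \vee (P_7' \wedge P_6 \wedge P_5' \wedge P_4' \wedge P_3' \wedge P_2 \wedge P_1') \vee (P_7' \wedge P_6 \wedge P_5' \wedge P_4' \wedge P_3' \wedge P_2 \wedge P_1 \wedge P_0')$. Our formula: $P_{x \le (01000110)_2} = P_7' \wedge (P_6' \vee (P_5' \wedge (P_4' \wedge (P_3' \wedge (P_2' \vee P_1' \vee P_0')))))$.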
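A small sketch can confirm the upper-bound example against a brute-force scan. It assumes uncompressed bit slices stored as Python integers and 8-bit values; `less_equal` is a hypothetical generalization of the nested formula (on complemented slices, OR where the bit of c is 1, AND where it is 0) and is valid only for 0 <= c < 2**width.

```python
def bit_slices(values, width):
    """Vertical bit vectors, most significant bit first (row r -> bit r)."""
    return [
        sum(1 << row for row, v in enumerate(values) if (v >> pos) & 1)
        for pos in range(width - 1, -1, -1)
    ]


def less_equal(slices, c, full):
    """Mask of rows whose value is <= c, built from complemented slices."""
    width = len(slices)
    result = full  # with no bits left, the values are equal, so "<=" holds
    for i in range(width):                     # least significant bit first
        not_p = ~slices[width - 1 - i] & full  # complement of bit slice i
        result = (not_p | result) if (c >> i) & 1 else (not_p & result)
    return result


column = [70, 71, 12, 200, 69, 64, 100]
full = (1 << len(column)) - 1
slices = bit_slices(column, 8)

# The nested formula for c = 70 = (01000110)_2, written out by hand.
n7, n6, n5, n4, n3, n2, n1, n0 = [~s & full for s in slices]
by_hand = n7 & (n6 | (n5 & (n4 & (n3 & (n2 | n1 | n0)))))

general = less_equal(slices, 70, full)
brute = sum(1 << r for r, v in enumerate(column) if v <= 70)
print(by_hand == general == brute)
```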
The EIN-ring Approach Definition of Equal Interval Neighborhood Ring (EIN-ring)
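The EIN-ring Approach (cont.) Figure: the EIN-ring around point $x$, between inner radius $r$ and outer radius $r + \epsilon$. The outer neighborhood is $P_{x-r-\epsilon < X \le x+r+\epsilon}$, the complement of the inner neighborhood is $P'_{x-r < X \le x+r}$, and the ring is $P_{x-r-\epsilon < X \le x+r+\epsilon} \wedge P'_{x-r < X \le x+r}$.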
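As a rough illustration, the ring can be computed as an outer interval mask ANDed with the complement of an inner interval mask. The sketch below assumes one integer dimension, uncompressed bit slices stored as Python integers, and half-open intervals of the form lo < X <= hi; the ring width is called `eps` here, and all function names are hypothetical.

```python
def bit_slices(values, width):
    """Vertical bit vectors, most significant bit first (row r -> bit r)."""
    return [
        sum(1 << row for row, v in enumerate(values) if (v >> pos) & 1)
        for pos in range(width - 1, -1, -1)
    ]


def greater_than(slices, c, full):
    """Mask of rows with value > c."""
    width = len(slices)
    if c < 0:
        return full                  # every non-negative value exceeds c
    if c >= 1 << width:
        return 0                     # no representable value exceeds c
    result = 0
    for i in range(width):           # least significant bit first
        p = slices[width - 1 - i]
        result = (p & result) if (c >> i) & 1 else (p | result)
    return result


def less_equal(slices, c, full):
    """Mask of rows with value <= c (complement of the > mask)."""
    return ~greater_than(slices, c, full) & full


def interval(slices, lo, hi, full):
    """Mask of rows with lo < value <= hi."""
    return greater_than(slices, lo, full) & less_equal(slices, hi, full)


def ein_ring(slices, x, r, eps, full):
    """Outer neighborhood AND NOT inner neighborhood."""
    outer = interval(slices, x - r - eps, x + r + eps, full)
    inner = interval(slices, x - r, x + r, full)
    return outer & ~inner & full


values = [1, 3, 4, 5, 6, 8, 10, 12]
full = (1 << len(values)) - 1
slices = bit_slices(values, 4)
ring = ein_ring(slices, x=5, r=1, eps=2, full=full)
print([values[i] for i in range(len(values)) if (ring >> i) & 1])
```

Because the intervals are open on the left and closed on the right, the ring in this sketch is not perfectly symmetric around x: the value x - r falls in the ring while x + r does not.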
EIN-ring Formulation
The Examples Calculation of $P_{c_1 < x \le c_2}$: crude method and our formula.
Efficient Ranking Algorithm Simplified Prototype of the Data Model
Efficient Ranking Algorithm Dimension values are the measurements called Term Frequency by Inverse Document Frequency (TFxIDF). TF(t): how many times term t occurs in a document; a local weight measuring the importance of a term to a document. IDF: log(N/Nt), where Nt is the number of documents containing at least one occurrence of t and N is the total number of documents; a global weight measuring the uniqueness of the term. Normalization is then applied to address the problem of differing document sizes: term t might occur 10 times in a 100-page document but only 9 times in a 1-page document, and clearly t is more strongly associated with the second document. All representations are converted to P-trees.
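The weighting can be sketched in a few lines. The slide does not say which normalization or logarithm base is used, so this sketch assumes unit-length (L2) normalization and the natural logarithm; the function name `tfidf_vectors` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, one normalized TFxIDF vector per document)."""
    n = len(docs)
    counts = [Counter(doc.lower().split()) for doc in docs]

    # Nt: number of documents containing at least one occurrence of t.
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())

    vocab = sorted(doc_freq)
    vectors = []
    for c in counts:
        # Local weight TF(t) times global weight log(N / Nt).
        raw = [c[t] * math.log(n / doc_freq[t]) for t in vocab]
        # Assumed normalization: scale each vector to unit length.
        norm = math.sqrt(sum(w * w for w in raw))
        vectors.append([w / norm if norm else 0.0 for w in raw])
    return vocab, vectors


docs = ["gene protein gene", "protein cell", "cell cell membrane"]
vocab, vectors = tfidf_vectors(docs)
for doc, vec in zip(docs, vectors):
    print(doc, [round(w, 3) for w in vec])
```

A term that occurs in every document gets IDF log(1) = 0 and therefore contributes nothing to any vector, which matches the idea of IDF as a uniqueness weight.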
Efficient Ranking Algorithm Step 1. Consider each weight in the W1, W2, and W3 sections of Table 1 as a dimension of the search space X, which gives a space of (k+2) dimensions. Step 2. Let Xstart be the point in X having the largest weight values, i.e., xstart-i = max(Wi). Step 3. Calculate Pmin Pmax, where Pmin = PY<Xstart-i-ri, Pmax = PX<Xstart-i-ri-. Ranking Algorithm Using EIN-ring
Experiment Results We compared P-RANK with a scan-based ranking approach on MEDLINE data from KDD Cup 2002 Task 2, using three size groups, denoted DB1, DB2, and DB3, which contain 1,000, 8,000, and 15,000 documents, respectively.
Architecture of P-RANK
Conclusion We describe the architecture, implementation, and evaluation of the P-RANK system for extracting evidence of gene products from text documents, e.g., biomedical papers. Our contributions in this paper include a new, efficient keyword query system using the P-tree data structure and a fast weighted ranking method using the EIN-ring formulation.
Future direction Accuracy Gene Alignment and Matching
Thanks!