Efficient Ranking of Keyword Queries Using P-trees Fei Pan, Imad Rahal, Yue Cui, William Perrizo Computer Science Department North Dakota State University Fargo, ND
Outline The Document Keyword Ranking problem The Predicate-tree (P-tree) technology EIN-ring Approach Efficient Ranking Algorithm using P-trees Summary
Introduction: keyword ranking is the process of ordering the documents that best match a given query defined by a finite number of keywords. For our purposes, the query is also viewed as a mini-document, and similarity between documents is based entirely on their contents.
P-RANK proceeds by finding the kNN of the query (viewed as a document) using EIN-rings; the system then returns a weighted list of matching documents (the weighting scheme is discussed later). Motivation: massive increases in the number of text documents (medical articles, research publications, e-mails, news reports such as Reuters, and others) mean that access to the information in these libraries is in great demand.
Text has very little of the explicit structure typically found in other data (e.g., a relational database). The Vector Space Model was proposed by Salton (1975) for text: each document is represented as a vector whose dimensions are the terms in the initial document collection. Stop-list words with very high support (e.g., the, a, …) are excluded, as are expected terms, e.g., blood in a corpus of articles addressing blood infections. Typically "term" means stems plus important phrases; some use n-grams (raw sequences of characters). "Case folding" is typically done (converting all characters to a common case). The query is also represented as a document vector in the space.
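A tiny preprocessing sketch along the lines described above, assuming an illustrative (hypothetical) stop list; stemming and phrase detection are omitted for brevity.

STOP_WORDS = {"the", "a", "an", "of", "in"}             # illustrative stop list

def to_terms(text):
    # "case folding": convert to a common case, then drop stop words
    tokens = text.lower().split()
    cleaned = [t.strip(".,;:") for t in tokens]
    return [t for t in cleaned if t and t not in STOP_WORDS]

print(to_terms("The blood infection in a patient"))     # -> ['blood', 'infection', 'patient']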
The P-tree1 technology: a P-tree is a tree data structure that stores numeric (and categorical) relational data in a vertical bit format, splitting each attribute into bits and representing each bit position by a P-tree. The next slide shows the conversion of columns into P-trees. All columns are converted to binary first: numeric attributes by bit position, categorical attributes by a bitmap for each category. 1P-tree technology is patent pending at North Dakota State University.
Predicate-tree technology: vertically project each attribute (current practice in data mining structures data into horizontal records), then vertically project each bit position of each attribute, then compress each bit slice, using a predicate, into a basic P-tree; e.g., compress R11 into P11 using the universal predicate pure1. Then process vertically (vertical scans): horizontally structured records are scanned vertically. [Figure: a relation R(A1, A2, A3, A4) of 3-bit attributes is decomposed into bit slices R11 ... R43, one per bit position of each attribute, and each slice is compressed into a basic P-tree P11 ... P43.] The 1-dimensional P-tree version of R11, P11, is built by recording the truth of the predicate "pure1" in a tree, recursively on halves, until purity is achieved: the whole slice is tested first (pure1? false = 0), then each half, and a half that is pure (pure1 or pure0) ends its branch while an impure half is split again. Counts of any value combination are obtained by ANDing basic P-trees horizontally; e.g., occurrences of (7, 0, 1, 4) are counted with P11^P12^P13^P'21^P'22^P'23^P'31^P'32^P33^P41^P'42^P'43.
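A minimal Python sketch (not the authors' implementation) of the two steps just described, using a hypothetical 3-bit column: vertically project one bit position, then compress the bit slice by recording the "pure1" predicate recursively on halves, ending each branch as soon as it is pure.

def bit_slice(values, bit):
    # Vertically project one bit position of a numeric column
    return [(v >> bit) & 1 for v in values]

def build_ptree(bits):
    # 1 = pure-1 segment, 0 = pure-0 segment (branch ends),
    # (left, right) = mixed segment split into two halves
    if all(b == 1 for b in bits):
        return 1
    if all(b == 0 for b in bits):
        return 0
    half = len(bits) // 2
    return (build_ptree(bits[:half]), build_ptree(bits[half:]))

A1 = [2, 3, 2, 2, 5, 2, 7, 7]        # hypothetical 3-bit column
R11 = bit_slice(A1, bit=2)           # high-order bit slice: [0, 0, 0, 0, 1, 0, 1, 1]
P11 = build_ptree(R11)               # -> (0, ((1, 0), 1))
print(R11, P11)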
2-Dimensional Pure1 P-trees (AKA Peano-trees, since they use the Peano space-filling-curve concept): a node is 1 iff its quadrant is purely 1-bits. E.g., a bit file (from, say, the high-order bit of one band of a 2-D image): 1111110011111000111111001111111011110000111100001111000001110000. Viewed in spatial raster order as an 8×8 grid, it is run-length compressed into a quadrant tree using Peano order: a pure quadrant becomes a single leaf, and a mixed quadrant is split into four sub-quadrants recursively. [Figure: the 8×8 raster and its quadrant tree.]
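A hedged sketch of the 2-D case under the same conventions: the raster is split into four quadrants (Peano/Z order taken here as NW, NE, SW, SE) until each quadrant is purely 1s or purely 0s. The bit file is the one shown on this slide; the builder itself is illustrative, not the authors' code.

def build_quadtree(grid):
    # grid: 2^k x 2^k list of 0/1 rows; returns 1/0 for pure quadrants,
    # otherwise a 4-tuple of child quadrants (NW, NE, SW, SE)
    flat = [b for row in grid for b in row]
    if all(b == 1 for b in flat):
        return 1
    if all(b == 0 for b in flat):
        return 0
    n = len(grid) // 2
    nw = [row[:n] for row in grid[:n]]
    ne = [row[n:] for row in grid[:n]]
    sw = [row[:n] for row in grid[n:]]
    se = [row[n:] for row in grid[n:]]
    return (build_quadtree(nw), build_quadtree(ne),
            build_quadtree(sw), build_quadtree(se))

bits = "1111110011111000111111001111111011110000111100001111000001110000"
grid = [[int(b) for b in bits[r*8:(r+1)*8]] for r in range(8)]
print(build_quadtree(grid))          # NW quadrant is pure1, SE quadrant is pure0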
Logical operations on P-trees (used to get counts of any pattern): ANDing or ORing Ptree 1 and Ptree 2 yields the AND result or OR result directly on the compressed trees. The P-tree AND is faster than a bit-by-bit AND because there are shortcuts: any pure0 operand node means the result node is pure0, so, e.g., only quadrant 2 may need to be loaded to AND Ptree1, Ptree2, etc. The more operands there are in the AND, the greater the benefit of this shortcut (more pure0 nodes).
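A small sketch of the shortcut just described, using the (left, right) node encoding from the earlier 1-D sketch: a pure0 operand node makes the result node pure0 without visiting its children.

def ptree_and(a, b):
    # AND two P-trees built like build_ptree() above
    if a == 0 or b == 0:
        return 0                     # pure0 shortcut: prune the whole branch
    if a == 1:
        return b                     # pure1 is the identity for AND
    if b == 1:
        return a
    left = ptree_and(a[0], b[0])
    right = ptree_and(a[1], b[1])
    if left == 0 and right == 0:     # re-compress if the result became pure0
        return 0
    return (left, right)

For instance, assuming P11, P12, P13 are the basic P-trees of A1's three bit positions (high to low), ptree_and(P11, ptree_and(P12, P13)) would be the mask of tuples whose A1 value is 7, the first factor of the count expression on the earlier slide.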
3-Dimensional Pure1 P-trees are a natural choice (and may produce better compression) if the data is naturally 3-D (e.g., solids).
Other useful predicate trees: there are five basic types of predicates, i.e., x > c, x ≥ c, x ≤ c, x < c, and x = c, where c is a bound value, and a very useful proposition for each of these P-trees. Proposition for the predicate Px>c, with c = (bm ... bi ... b1)2: Px>c = Pm opm (Pm-1 opm-1 ( ... (P2 op2 P1) ... )), where Pi is the ith basic P-tree and opi = AND if bi = 1, OR otherwise. For the predicate tree Px≤c, the complemented basic P-trees Pi' are used instead and opi = AND if bi = 0, OR otherwise.
e.g., calculation of Px>c. Crude method: Px>(4)10 = Px>(100)2 = (P3 ∧ P2) ∨ (P3 ∧ P2' ∧ P1). Our formula: Px>(100)2 = P3 ∧ (P2 ∨ P1). e.g., calculation of Px≤c. Crude: Px≤(70)10 = Px≤(01000110)2 = (P7' ∧ P6') ∨ (P7' ∧ P6 ∧ P5' ∧ P4' ∧ P3' ∧ P2') ∨ (P7' ∧ P6 ∧ P5' ∧ P4' ∧ P3' ∧ P2 ∧ P1') ∨ (P7' ∧ P6 ∧ P5' ∧ P4' ∧ P3' ∧ P2 ∧ P1 ∧ P0'). Ours: Px≤(01000110)2 = P7' ∧ (P6' ∨ (P5' ∧ (P4' ∧ (P3' ∧ (P2' ∨ P1' ∨ P0'))))).
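A sketch of the proposition applied in these examples, with basic P-trees held as uncompressed bit vectors for clarity; the same recurrence applies to compressed P-trees via the node-level AND/OR above. The column and bound below are hypothetical.

def AND(p, q): return [x & y for x, y in zip(p, q)]
def OR(p, q):  return [x | y for x, y in zip(p, q)]

def p_greater_than(P, c, m):
    # P-tree mask of the predicate x > c for an m-bit attribute.
    # P[i] is the basic P-tree (bit vector here) of bit position i,
    # with i = 0 the least-significant bit; c is the bound value.
    result = None
    for i in range(m):                               # from the low-order bit upward
        bi = (c >> i) & 1
        if result is None:                           # lowest position
            result = P[i] if bi == 0 else [0] * len(P[i])
        elif bi == 1:
            result = AND(P[i], result)               # opi = AND when bi = 1
        else:
            result = OR(P[i], result)                # opi = OR  when bi = 0
    return result

values = [2, 3, 2, 2, 5, 2, 7, 7]                    # hypothetical 3-bit column
P = [[(v >> i) & 1 for v in values] for i in range(3)]
print(p_greater_than(P, 4, 3))                       # -> [0, 0, 0, 0, 1, 0, 1, 1]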
Equal Interval Neighborhood ring (EIN-ring) approach: the EIN-ring of center x with radii r and r+δ is the set of points X lying between the inner neighborhood of radius r and the outer neighborhood of radius r+δ. Its P-tree mask is obtained from two interval predicates: Px-r-δ<X≤x+r+δ ∧ P'x-r<X≤x+r, i.e., the outer neighborhood ANDed with the complement of the inner neighborhood.
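A conceptual sketch of the ring mask, assuming the interval predicates P(a < X ≤ b) are available (in practice they would be built from the range formulas on the previous slides rather than by the explicit scan used here); the data, x, r, and δ below are hypothetical.

def interval_ptree(values, a, b):
    # uncompressed stand-in for P(a < X <= b)
    return [1 if a < v <= b else 0 for v in values]

def ein_ring(values, x, r, delta):
    outer = interval_ptree(values, x - r - delta, x + r + delta)
    inner = interval_ptree(values, x - r, x + r)
    return [o & (1 - i) for o, i in zip(outer, inner)]   # outer AND NOT inner

print(ein_ring([1, 3, 5, 6, 8, 12], x=5, r=1, delta=2))  # -> [0, 1, 0, 0, 1, 0]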
Efficient Ranking Algorithm: the dimension values are the measurements, called Term Frequency times Inverse Document Frequency (TF×IDF). TF(t): how many times term t occurs in a document; a local weight measuring the importance of the term to the document, normalized to document size (term t might occur 10 times in a 100-page document but 9 times in a 1-page document; clearly t is more associated with the second document). IDF: log(N/Nt), where Nt is the number of documents containing at least one occurrence of t and N is the total number of documents; a global weight measuring the uniqueness of the term (the log of the reciprocal of its support). All representations are converted to P-trees.
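A brief sketch of the TF×IDF weighting just described, with TF normalized here to token count; tfidf_vectors and the example documents are hypothetical, not the paper's code.

import math
from collections import Counter

def tfidf_vectors(docs):
    # docs: list of token lists; returns one {term: weight} dict per document
    N = len(docs)
    df = Counter(t for doc in docs for t in set(doc))    # Nt per term
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc) or 1
        vectors.append({t: (c / length) * math.log(N / df[t])
                        for t, c in counts.items()})
    return vectors

docs = [["blood", "infection", "blood"], ["gene", "protein", "blood"], ["gene", "gene"]]
print(tfidf_vectors(docs))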
Efficient Ranking Algorithm: [Figure: an example document-by-term matrix of relative term frequencies TF1 ... TFk with a term-weight column (IDF*TF); section weights (e.g., experiment_section = high, educ_section = low) and document weights (e.g., CATA = hi, Science = low) are shown but not used in this analysis.] The data is converted to P-trees and then ranked using P-kNN. P-kNN ranks according to similarity rings emanating from the query (viewed as a document) until at least k documents are found. Each ring is a simple EIN-ring calculation.
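A conceptual sketch (my own simplification, not the authors' P-kNN code) of ranking by expanding similarity rings around the query until at least k documents are found; in the paper each ring is one EIN-ring P-tree computation rather than the explicit distance loop used here, and delta and max_r are hypothetical parameters.

def p_knn_rank(query_vec, doc_vecs, k, delta=0.1, max_r=2.0):
    # query_vec, doc_vecs: {term: weight} dicts (e.g., from tfidf_vectors above);
    # returns the k nearest (doc_index, distance) pairs, found ring by ring
    def dist(a, b):
        terms = set(a) | set(b)
        return sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms) ** 0.5

    dists = [(i, dist(query_vec, v)) for i, v in enumerate(doc_vecs)]
    found, r = [], 0.0
    while len(found) < k and r <= max_r:
        ring = [(i, d) for i, d in dists if r <= d < r + delta]   # one "ring"
        found.extend(ring)
        r += delta
    return sorted(found, key=lambda item: item[1])[:k]            # closest first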
Experiment Results: we compared P-RANK with scan-based ranking on MEDLINE data (ACM KDD Cup 2002, Task 2), using three size groups, denoted DB1, DB2, and DB3, containing 1,000, 8,000, and 15,000 documents, respectively.
Architecture of P-RANK
Conclusion and future directions: we described the P-RANK method for extracting evidence, for example of gene products, from text documents such as biomedical papers. Our contributions in this paper include a new efficient keyword query system using the P-tree data structure and a fast weighted ranking method using the EIN-ring formulations.