Efficient Ranking of Keyword Queries Using P-trees




1 Efficient Ranking of Keyword Queries Using P-trees
Fei Pan, Imad Rahal, Yue Cui, William Perrizo
Computer Science Department, North Dakota State University, Fargo, ND

2 Outline
The Document Keyword Ranking problem
The Predicate-tree (P-tree) technology
The EIN-ring approach
Efficient ranking algorithm using P-trees
Summary

3 Introduction
Keyword ranking is the process of ordering the documents that best match a given query, where the query is defined by a finite number of keywords.
For our purposes, the query is also viewed as a mini-document.
Similarity between documents is based entirely on their contents.

4 P-RANK proceeds by finding the kNN of the query (viewed as a document) using EIN-rings.
The system then returns a weighted list of matching documents (the weighting scheme is discussed later).
Motivation: massive increases in the number of text documents (medical articles, research publications, e-mails, news reports such as Reuters feeds, and others) mean that access to the information in these libraries is in great demand.

5 Text has very little of the explicit structure typically found in other data (e.g., a relational database).
The Vector Space Model was proposed by Salton (1975) for text: each document is represented as a vector whose dimensions are the terms in the initial document collection.
Stop-list words with very high support (e.g., the, a, ...) are excluded, as are expected terms (e.g., blood in a corpus of articles addressing blood infections).
Typically "term" means "stems" plus "important phrases"; some systems use n-grams (raw sequences of characters).
"Case folding" (converting all characters to a common case) is typically done.
The query is also represented as a document vector in the same space.
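A minimal sketch of this preprocessing (case folding, tokenization, stop-word removal, crude stemming), assuming a tiny illustrative stop list and a toy suffix-stripping stemmer rather than the one actually used:

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}  # illustrative stop list

def stem(term):
    # Toy suffix stripping; a real system would use a proper stemmer.
    for suffix in ("ing", "es", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[:-len(suffix)]
    return term

def to_term_vector(text):
    # Case folding, tokenization, stop-word removal, stemming.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(stem(t) for t in tokens if t not in STOP_WORDS)

# The query is treated exactly like a mini-document in the same vector space.
query_vector = to_term_vector("Ranking keyword queries using P-trees")
print(query_vector)
```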

6 The P-tree1 technology
A P-tree is a tree data structure that stores numeric (and categorical) relational data in a vertical bit format by splitting each attribute into bits and representing each bit position by a P-tree.
All columns are converted to binary first: numeric attributes by bit position, categorical attributes by a bit map for each category.
The next slide shows the conversion of columns into P-trees.
1P-tree technology is patent pending at North Dakota State University.
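A minimal sketch of this conversion, assuming small unsigned integer attributes and a plain list-of-bits representation for each slice (illustrative only; these slices are what get compressed into basic P-trees):

```python
def bit_slices(column, width):
    """Vertically project a numeric column: slices[i][row] is bit i
    (0 = least significant) of column[row]."""
    return [[(value >> i) & 1 for value in column] for i in range(width)]

def category_bitmaps(column):
    """Vertically project a categorical column: one bit map per category."""
    return {cat: [1 if value == cat else 0 for value in column]
            for cat in sorted(set(column))}

# A 3-bit numeric attribute and a categorical attribute, column by column.
print(bit_slices([7, 0, 1, 4], 3))          # [[1, 0, 1, 0], [1, 0, 0, 0], [1, 0, 0, 1]]
print(category_bitmaps(["red", "blue", "red"]))
```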

7 The Predicate-tree technology: vertically project each attribute, then vertically project each bit position of each attribute, then compress each bit slice, using a predicate, into a basic P-tree; e.g., compress R11 into P11 (using the universal predicate, pure1). Processing is then done vertically (vertical scans).
Current practice in data mining is different: data is structured into horizontal records, and the horizontally structured records are scanned vertically.
Given a relation R(A1, A2, A3, A4), each attribute is projected into bit slices R11, R12, R13, R21, ..., R43, and each slice is compressed into a basic P-tree P11, P12, ..., P43.
The 1-dimensional P-tree version of R11, P11, is built by recording the truth of the predicate "pure 1" in a tree, recursively on halves, until purity is reached: if the whole slice is not pure1 the node is 0 and the slice is split into halves, each half is tested the same way, and a branch ends as soon as it is pure (pure1 or pure0).
To count occurrences of the tuple (7, 0, 1, 4), the basic P-trees are ANDed horizontally: P11 ∧ P12 ∧ P13 ∧ P'21 ∧ P'22 ∧ P'23 ∧ P'31 ∧ P'32 ∧ P33 ∧ P41 ∧ P'42 ∧ P'43.
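A minimal sketch of the recursive construction, assuming an uncompressed Python representation of a bit slice (a leaf records purity; an interior node splits its segment in half), plus the root count that the AND of basic P-trees ultimately feeds:

```python
def build_ptree(bits):
    """1-D pure1 P-tree: record the truth of "pure 1" recursively on halves.
    A branch ends as soon as the segment is pure (all 1s or all 0s)."""
    if all(bits):
        return ("leaf", 1)
    if not any(bits):
        return ("leaf", 0)
    mid = len(bits) // 2
    return ("node", build_ptree(bits[:mid]), build_ptree(bits[mid:]))

def root_count(tree, length):
    """Number of 1-bits the (sub)tree represents for a segment of `length` rows."""
    if tree[0] == "leaf":
        return length if tree[1] else 0
    half = length // 2
    return root_count(tree[1], half) + root_count(tree[2], length - half)

# An example 8-row bit slice: not pure, so the root is 0 and the halves are split.
p = build_ptree([0, 0, 0, 0, 1, 0, 1, 1])
print(root_count(p, 8))  # 3
```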

8 2-Dimensional Pure1 P-trees (AKA Peano-trees; they use the Peano space-filling curve concept)
A node is 1 iff its quadrant is purely 1-bits.
A bit file (from, e.g., the high-order bit of a band of a 2-D image), viewed in spatial raster order, is run-length compressed into a quadrant tree using Peano order.
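A minimal sketch of the quadrant recursion, assuming a square 2^k by 2^k bit array and the same uncompressed in-memory node representation as above (illustrative; quadrants are visited in Peano/Z order):

```python
def build_quadtree(grid, row, col, size):
    """2-D pure1 P-tree: a node is 1 iff its quadrant is purely 1-bits;
    an impure quadrant splits into four sub-quadrants in Peano (Z) order."""
    cells = [grid[r][c]
             for r in range(row, row + size)
             for c in range(col, col + size)]
    if all(cells):
        return ("leaf", 1)
    if not any(cells):
        return ("leaf", 0)
    h = size // 2
    return ("node",
            build_quadtree(grid, row,     col,     h),   # upper-left
            build_quadtree(grid, row,     col + h, h),   # upper-right
            build_quadtree(grid, row + h, col,     h),   # lower-left
            build_quadtree(grid, row + h, col + h, h))   # lower-right

# A 4x4 bit file, e.g. the high-order bit of one band of a tiny 2-D image.
grid = [[1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 0, 0],
        [1, 1, 1, 0]]
print(build_quadtree(grid, 0, 0, 4))
```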

9 Logical operations on P-trees (used to get counts of any pattern)
The P-tree AND operation is faster than a bit-by-bit AND because there are shortcuts: any pure0 operand node means the result node is pure0, so, e.g., only quadrant 2 needs to be loaded to AND Ptree1, Ptree2, etc.
The more operands there are in the AND, the greater the benefit of this shortcut (more pure0 nodes).
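A minimal sketch of the AND with the pure0 shortcut, using the same illustrative node tuples as the construction sketches above (it works for 1-D and 2-D nodes alike):

```python
from functools import reduce

def ptree_and(a, b):
    """AND two P-trees node by node, short-circuiting on pure0 operands."""
    if a == ("leaf", 0) or b == ("leaf", 0):
        return ("leaf", 0)              # shortcut: the other subtree is never loaded
    if a == ("leaf", 1):
        return b                        # pure1 operand: the result is the other tree
    if b == ("leaf", 1):
        return a
    return ("node",) + tuple(ptree_and(x, y) for x, y in zip(a[1:], b[1:]))

def ptree_and_all(ptrees):
    # The more operands, the more pure0 nodes appear and the bigger the shortcut payoff.
    return reduce(ptree_and, ptrees)
```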

10 3-Dimensional Pure1 P-trees are a natural choice (and may yield better compression) if the data is naturally 3-D (e.g., solids).

11 Other useful predicate trees
There are five basic types of predicates, i.e., x > c, x ≥ c, x ≤ c, x < c, and x = c, where c is a bound value, and there is a very useful proposition for each of these P-trees.
Proposition for the predicate tree Px>c, where c = (bm ... bi ... b1)2:
Px>c = Pm opm (Pm-1 opm-1 ( ... op2 P1 )), where Pi is the ith basic P-tree and opi = AND if bi = 1, else OR.
For the predicate tree Px≤c, the complemented basic P-trees P'i are used and opi = AND if bi = 0, else OR:
Px≤c = P'm opm (P'm-1 opm-1 ( ... op2 P'1 )).

12 e.g., Calculation of Px>c
Crude method: Px>(4)10 = Px>(100)2 = (P3 ∧ P2) ∨ (P3 ∧ P2' ∧ P1)
Our formula: Px>(4)10 = Px>(100)2 = P3 ∧ (P2 ∨ P1)
e.g., Calculation of Px ≤ c
Crude: Px≤(70)10 = Px≤(01000110)2 = (P7' ∧ P6') ∨ (P7' ∧ P6 ∧ P5' ∧ P4' ∧ P3' ∧ P2') ∨ (P7' ∧ P6 ∧ P5' ∧ P4' ∧ P3' ∧ P2 ∧ P1') ∨ (P7' ∧ P6 ∧ P5' ∧ P4' ∧ P3' ∧ P2 ∧ P1 ∧ P0')
Ours: Px≤(70)10 = Px≤(01000110)2 = P7' ∧ (P6' ∨ (P5' ∧ (P4' ∧ (P3' ∧ (P2' ∨ (P1' ∨ P0'))))))
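A minimal sketch of the "our formula" construction, operating on uncompressed bit slices packed into Python integers so that bitwise &, | and ~ stand in for the P-tree AND, OR and complement (the function names and packing scheme are illustrative, not the paper's implementation):

```python
def ptree_greater_than(slices, c, width, mask):
    """P(x > c) for unsigned x: op_i = AND if bit i of c is 1, else OR;
    the innermost operand is the least significant slice."""
    if c < 0:
        return mask                     # every unsigned value exceeds a negative bound
    acc = slices[0]
    for i in range(1, width):
        acc = (slices[i] & acc) if (c >> i) & 1 else (slices[i] | acc)
    return acc

def ptree_leq(slices, c, width, mask):
    """P(x <= c): complemented slices, op_i = AND if bit i of c is 0, else OR."""
    if c >= (1 << width) - 1:
        return mask                     # bound at or above the maximum representable value
    comp = [mask & ~s for s in slices]  # P'_i
    acc = comp[0]
    for i in range(1, width):
        acc = (comp[i] | acc) if (c >> i) & 1 else (comp[i] & acc)
    return acc

# Pack a 3-bit column (rows 0..3 hold 3, 5, 4, 7); row r maps to bit r of each slice.
values, width = [3, 5, 4, 7], 3
mask = (1 << len(values)) - 1
slices = [sum(((v >> i) & 1) << r for r, v in enumerate(values)) for i in range(width)]
hits = ptree_greater_than(slices, 4, width, mask)
print([r for r in range(len(values)) if (hits >> r) & 1])   # rows with value > 4: [1, 3]
```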

13 Equal Interval Neighborhood (EIN-ring) Approach
An EIN-ring around a point x is the set difference of two nested intervals: the outer neighborhood x-r-ε < X ≤ x+r+ε minus the inner neighborhood x-r < X ≤ x+r. Its P-tree mask is therefore P(x-r-ε < X ≤ x+r+ε) ∧ P'(x-r < X ≤ x+r), built directly from the range-predicate P-trees of the previous slides.
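A minimal, self-contained sketch of one EIN-ring mask over a single dimension, again packing bit slices into integers; `interval_mask` rebuilds the P(lo < X ≤ hi) predicate from the formulas on the previous slides (illustrative code, not the paper's implementation):

```python
def _greater_than(slices, c, width, mask):
    # P(x > c): op_i = AND if bit i of c is 1, else OR.
    if c < 0:
        return mask
    acc = slices[0]
    for i in range(1, width):
        acc = (slices[i] & acc) if (c >> i) & 1 else (slices[i] | acc)
    return acc

def _leq(slices, c, width, mask):
    # P(x <= c): complemented slices, op_i = AND if bit i of c is 0, else OR.
    if c >= (1 << width) - 1:
        return mask
    comp = [mask & ~s for s in slices]
    acc = comp[0]
    for i in range(1, width):
        acc = (comp[i] | acc) if (c >> i) & 1 else (comp[i] & acc)
    return acc

def interval_mask(slices, lo, hi, width, mask):
    """Rows with lo < X <= hi, as P(X > lo) AND P(X <= hi)."""
    return _greater_than(slices, lo, width, mask) & _leq(slices, hi, width, mask)

def ein_ring(slices, x, r, eps, width, mask):
    """EIN-ring mask: P(x-r-eps < X <= x+r+eps) AND NOT P(x-r < X <= x+r)."""
    outer = interval_mask(slices, x - r - eps, x + r + eps, width, mask)
    inner = interval_mask(slices, x - r, x + r, width, mask)
    return outer & (mask & ~inner)
```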

14 Efficient Ranking Algorithm
The dimension values are the Term Frequency times Inverse Document Frequency (TFxIDF) measurements.
TF(t): how many times term t occurs in a document; a local weight measuring the importance of the term to that document. It is normalized to document size: term t might occur 10 times in a 100-page document but only 9 times in a 1-page document, yet t is clearly more associated with the second document.
IDF: log(N/Nt), where Nt is the number of documents containing at least one occurrence of t and N is the total number of documents; a global weight measuring the uniqueness of the term (the reciprocal of its support).
All representations are converted to P-trees.
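A minimal sketch of the TFxIDF weighting, assuming each document has already been reduced to a Counter of raw term counts (as in the earlier preprocessing sketch):

```python
import math
from collections import Counter

def tfidf_vectors(doc_term_counts):
    """doc_term_counts: one Counter(term -> raw count) per document.
    Returns, per document, term -> (TF normalized by document length) * log(N / Nt)."""
    n_docs = len(doc_term_counts)
    doc_freq = Counter()                        # Nt: number of documents containing t
    for counts in doc_term_counts:
        doc_freq.update(counts.keys())
    vectors = []
    for counts in doc_term_counts:
        length = sum(counts.values()) or 1      # normalize TF to document size
        vectors.append({t: (c / length) * math.log(n_docs / doc_freq[t])
                        for t, c in counts.items()})
    return vectors

docs = [Counter({"blood": 10, "infection": 3}),
        Counter({"ranking": 2, "keyword": 2, "blood": 1})]
print(tfidf_vectors(docs))
```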

15 Efficient Ranking Algorithm
[Table on slide: relative term frequencies TF1 ... TFk and term weights (IDF*TF) per document. Section weights (e.g., experiment_section = high, educ_section = low) and document weights (e.g., CATA = hi, Science = low) are not used in this analysis.]
The data is converted to P-trees and then ranked using P-kNN. P-kNN ranks according to similarity rings emanating from the query (treated as a document) until at least k documents are found. Each ring is a simple EIN-ring calculation.
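A minimal sketch of this ring-expansion loop; `ein_ring_members` is a hypothetical callback standing in for the P-tree EIN-ring computation, so this is a schematic of the idea rather than the exact P-RANK procedure:

```python
def p_knn_rank(query, k, eps, max_radius, ein_ring_members):
    """Expand EIN-rings of width eps around the query until at least k
    documents have been collected; documents in closer rings rank higher."""
    ranked = []                  # (ring index, document id); earlier ring = more similar
    radius, ring = 0.0, 0
    while len(ranked) < k and radius <= max_radius:
        for doc_id in ein_ring_members(query, radius, eps):   # docs in the ring (radius, radius+eps]
            ranked.append((ring, doc_id))
        radius += eps
        ring += 1
    return ranked
```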

16 Experiment Results
We compared P-RANK with scan-based ranking on MEDLINE data (ACM KDD Cup 2002, task 2).
Three size groups, denoted DB1, DB2, and DB3, contained 1,000, 8,000, and 15,000 documents, respectively.

17 Architecture of P-RANK

18 Conclusion and future directions
We describe the P-RANK method for extracting evidence, for example of gene products, from text documents such as biomedical papers. Our contributions in this paper include a new, efficient keyword query system using the P-tree data structure and a fast weighted ranking method using the EIN-ring formulations.

