Efficient Ranking of Keyword Queries Using P-trees

Fei Pan, Imad Rahal, Yue Cui, William Perrizo
Computer Science Department, North Dakota State University, Fargo, ND

Outline
The keyword ranking problem
The P-tree technology
EIN-ring approach
Efficient ranking algorithm using P-trees
Summary

Introduction
Keyword ranking is the process of ordering the documents that best match a given query, which is defined by a finite number of keywords. For our purposes, the query is also viewed as a mini-document. Similarity between documents is based entirely on their contents.

Introduction
P-RANK proceeds by finding the k nearest neighbors (kNN) of a given query (viewed as a document) using the EIN-ring, described later. The system then returns a weighted list of matching documents; the weighting scheme is also discussed later.

Introduction
Motivation: the number of text documents is increasing (medical articles, research publications, news reports such as Reuters, and others). Access to this information is in great demand.

Introduction (cont.)
Unlike other data (e.g., relational databases), text has no explicit structure. The Vector Space Model was proposed by Salton (1975) for text. Each document is represented as a vector whose dimensions are the terms in the initial document collection. The query is also represented as a document vector in the same space.

The P-tree technology
A P-tree is a tree-like data structure that stores numeric (and categorical) relational data in vertical bit format: each attribute is split into bits, and each bit position is represented by a P-tree. All columns are first converted to binary. The next slide shows the conversion of columns into P-trees.
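The vertical bit format can be illustrated with a minimal Python sketch. It only shows the column-to-bit-slice split; it does not build the compressed tree itself. The function name `bit_slices`, the sample column, and the use of a Python integer as an uncompressed bit vector are assumptions made for illustration, not part of the original slides.

```python
def bit_slices(values, width):
    """Split a column of non-negative ints into one bit vector per bit position.

    Returns the slices most significant bit first. Each slice is a Python int
    used as a bitmask: bit `row` of the mask is set when that row's value has
    a 1 in the corresponding bit position.
    """
    slices = []
    for pos in range(width - 1, -1, -1):          # MSB down to LSB
        mask = 0
        for row, value in enumerate(values):
            if (value >> pos) & 1:
                mask |= 1 << row
        slices.append(mask)
    return slices


column = [5, 2, 7, 0]                 # hypothetical 3-bit attribute values
p3, p2, p1 = bit_slices(column, width=3)
print(format(p3, "04b"), format(p2, "04b"), format(p1, "04b"))
```

Each resulting bit vector is what a P-tree would then represent in compressed, tree-structured form.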

The P-tree technology (cont.)
Construction of Basic Peano Count trees

The P-tree technology (cont.)
Basic P-tree Operations
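As a rough illustration of the logical operations that the later predicate formulas rely on (AND, OR, complement, and counting the 1 bits), here is a small Python sketch on uncompressed bit vectors. The function names, the row count, and the sample vectors are hypothetical; real P-trees apply these operations on the compressed tree structure rather than on plain integers.

```python
N_ROWS = 4                          # hypothetical number of rows in the table
ALL_ROWS = (1 << N_ROWS) - 1        # bit vector with every row set


def p_and(a, b):
    """Rows that are set in both bit vectors."""
    return a & b


def p_or(a, b):
    """Rows that are set in at least one bit vector."""
    return a | b


def p_complement(a):
    """Rows that are not set, restricted to the table's rows."""
    return ALL_ROWS & ~a


def root_count(a):
    """Number of rows set in the bit vector."""
    return bin(a).count("1")


p3 = 0b0101                         # hypothetical bit slices
p2 = 0b0110
both = p_and(p3, p2)
either = p_or(p3, p2)
not_p3 = p_complement(p3)
print(root_count(both), root_count(either), root_count(not_p3))
```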

Predicates using P-trees
There are five basic types of predicates: $x > c$, $x \ge c$, $x \le c$, $x < c$, and $x = c$, where $c$ is a bound value. We derive a useful proposition for each of them using P-trees.

The Formula Propositions
Proposition for predicate $P_{x>c}$: briefly, for lower bound $c = (b_m \ldots b_i \ldots b_1)_2$.
Proposition for predicate $P_{x \le c}$: similarly, for upper bound $c = (b_m \ldots b_i \ldots b_1)_2$.

The Examples
Calculation of $P_{x>c}$ for $c = (4)_{10} = (100)_2$
Crude method: $P_{x > (100)_2} = (P_3 \wedge P_2) \vee (P_3 \wedge P_2' \wedge P_1)$
Our formula: $P_{x > (100)_2} = P_3 \wedge (P_2 \vee P_1)$
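The nested form in this example generalizes to any constant. The sketch below is one way to build it, assuming uncompressed bit vectors stored as Python integers with the most significant slice first. It walks the bits of c from least to most significant, applying AND where the bit of c is 1 and OR where it is 0. The helper names are hypothetical, and this is a standard bit-sliced comparison that reproduces the formula above, not necessarily the exact statement of the authors' proposition.

```python
def bit_slices(values, width):
    """Bit vectors per bit position, MSB first; bit `row` marks that row."""
    return [sum(((v >> pos) & 1) << row for row, v in enumerate(values))
            for pos in range(width - 1, -1, -1)]


def mask_greater_than(slices, c):
    """Bit vector of rows whose value is > c (c must fit in len(slices) bits)."""
    width = len(slices)
    result = 0                               # no bits compared yet: "greater" is false
    for pos in range(width):                 # LSB up to MSB
        p = slices[width - 1 - pos]
        if (c >> pos) & 1:
            result = p & result              # bit of c is 1: AND
        else:
            result = p | result              # bit of c is 0: OR
    return result


values = list(range(8))                      # every 3-bit value, one per row
slices = bit_slices(values, 3)
p3, p2, p1 = slices
gt4 = mask_greater_than(slices, 4)

# Both forms from the slide select the same rows (values 5, 6, 7).
crude = (p3 & p2) | (p3 & ~p2 & p1)
nested = p3 & (p2 | p1)
print(gt4 == crude == nested)
```

The nested form uses one operation per bit of c, while the crude form needs one conjunction per qualifying bit prefix.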

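The Examples (cont.)
Calculation of $P_{x \le c}$ for $c = (70)_{10} = (01000110)_2$
Crude method: $P_{x_j \le (70)_{10}} = P_{x_j \le (01000110)_2} = (P_7' \wedge P_6') \vee (P_7' \wedge P_6 \wedge P_5' \wedge P_4' \wedge P_3' \wedge P_2') \vee (P_7' \wedge P_6 \wedge P_5' \wedge P_4' \wedge P_3' \wedge P_2 \wedge P_1') \vee (P_7' \wedge P_6 \wedge P_5' \wedge P_4' \wedge P_3' \wedge P_2 \wedge P_1 \wedge P_0')$
Our formula: $P_{x_j \le (70)_{10}} = P_{x_j \le (01000110)_2} = P_7' \wedge (P_6' \vee (P_5' \wedge (P_4' \wedge (P_3' \wedge (P_2' \vee P_1' \vee P_0')))))$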
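The upper-bound case can be sketched the same way, using complemented slices. The code below assumes uncompressed bit vectors stored as Python integers (most significant slice first) and hypothetical helper names; it applies OR where the bit of c is 1 and AND where it is 0, which reproduces the nested formula for c = 70 and is checked against a direct scan.

```python
def bit_slices(values, width):
    """Bit vectors per bit position, MSB first; bit `row` marks that row."""
    return [sum(((v >> pos) & 1) << row for row, v in enumerate(values))
            for pos in range(width - 1, -1, -1)]


def mask_less_equal(slices, c, n_rows):
    """Bit vector of rows whose value is <= c (c must fit in len(slices) bits)."""
    width = len(slices)
    all_rows = (1 << n_rows) - 1
    result = all_rows                            # no bits compared yet: "<=" is true
    for pos in range(width):                     # LSB up to MSB
        p_comp = all_rows & ~slices[width - 1 - pos]   # complemented slice P'
        if (c >> pos) & 1:
            result = p_comp | result             # bit of c is 1: OR
        else:
            result = p_comp & result             # bit of c is 0: AND
    return result


values = list(range(256))                        # every 8-bit value, one per row
slices = bit_slices(values, 8)
le70 = mask_less_equal(slices, 70, len(values))

# Reference answer from a plain scan of the column.
scan = sum(1 << row for row, v in enumerate(values) if v <= 70)
print(le70 == scan)
```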

The EIN-ring Approach Definition of Equal Interval Neighborhood Ring (EIN-ring)

The EIN-ring Approach (cont.)
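Figure: a ring in space $X$ centered at $x$, with inner radius $r$ and outer radius $r + \epsilon$.
The ring is computed as $P_{x-r-\epsilon < X \le x+r+\epsilon} \wedge P'_{x-r < X \le x+r}$.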
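A one-dimensional sketch of the ring computation follows: the outer interval mask ANDed with the complement of the inner interval mask. To keep it short, the interval masks are built by a plain scan of the column rather than from P-trees; that scan, the function names, and the sample data are assumptions for illustration only.

```python
def interval_mask(values, low, high):
    """Bit vector of rows with low < value <= high (stand-in for P_{low < X <= high})."""
    return sum(1 << row for row, v in enumerate(values) if low < v <= high)


def ein_ring_mask(values, center, r, eps):
    """Rows inside the outer interval but outside the inner interval."""
    all_rows = (1 << len(values)) - 1
    outer = interval_mask(values, center - r - eps, center + r + eps)
    inner = interval_mask(values, center - r, center + r)
    return outer & (all_rows & ~inner)           # outer AND complement of inner


values = [1, 3, 5, 6, 8, 10]                     # hypothetical one-dimensional data
ring = ein_ring_mask(values, center=5, r=1, eps=2)

# Rows 1 and 4 (values 3 and 8) fall in the ring; 5 and 6 are in the inner interval.
print(format(ring, "06b"))
```

Because both intervals are half-open on the left, a value exactly on a lower boundary is treated differently from one on an upper boundary; the sample data avoids those boundary cases.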

EIN-ring Formulation

The Examples
Calculation of $P_{c_1 < x \le c_2}$
Crude method
Our formula

Efficient Ranking Algorithm
Simplified Prototype of the Data Model

Dimension values are measurements called Term Frequency by Inverse Document Frequency (TFxIDF).
TF(t): the number of times term t occurs in a document. This is a local weight measuring the importance of the term to the document.
IDF: log(N/Nt), where Nt is the number of documents containing at least one occurrence of t and N is the total number of documents. This is a global weight measuring the uniqueness of the term.
Normalization is then applied to account for differing document sizes. For example, term t might occur 10 times in a 100-page document but only 9 times in a 1-page document; t is clearly more strongly associated with the second document.
All representations are converted to P-trees.
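The weighting can be sketched in a few lines of Python. The slides do not say which normalization or logarithm base is used, so the unit-length (cosine) normalization and the natural logarithm below are assumptions, as are the function name and the toy documents.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Nt: number of documents containing at least one occurrence of each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)                                    # local weight TF(t)
        weights = {t: count * math.log(n_docs / doc_freq[t])  # TF x IDF
                   for t, count in tf.items()}
        # Assumed normalization: scale each document vector to unit length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {t: w / norm for t, w in weights.items()}
        vectors.append(weights)
    return vectors


docs = [                                  # hypothetical tokenized documents
    ["gene", "protein", "gene"],
    ["gene", "cell"],
    ["gene", "cell", "cell"],
]
vectors = tfidf_vectors(docs)
print(vectors[0])
```

In this toy collection "gene" appears in every document, so its IDF is log(1) = 0 and it contributes nothing to any vector, which is the "uniqueness" effect of the global weight.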

Ranking Algorithm Using EIN-ring
Step 1. Consider each weight in the W1, W2 and W3 sections of Table 1 as a dimension of the search space X. This gives a space of (k+2) dimensions.
Step 2. Let Xstart be the point in X having the largest weight values, i.e., xstart-i = max(Wi).
Step 3. Calculate Pmin  Pmax, where Pmin = PY<Xstart-i-ri, Pmax = PX<Xstart-i-ri-.

Experiment Results
We compared P-RANK with a scan-based ranking approach on MEDLINE data from KDD Cup 2002 task 2, using three size groups, denoted DB1, DB2, and DB3, which contain 1,000, 8,000, and 15,000 documents, respectively.

Architecture of P-RANK

Conclusion
We describe the architecture, implementation, and evaluation of the P-RANK system for extracting evidence of gene products from text documents, e.g., biomedical papers. Our contributions include a new, efficient keyword query system using the P-tree data structure and a fast weighted ranking method using the EIN-ring formulations.

Future directions
Accuracy
Gene alignment and matching

Thanks!
