Wei Cheng (1), Xiaoming Jin (1), and Jian-Tao Sun (2). (1) Intelligent Data Engineering Group, School of Software, Tsinghua University; (2) Microsoft Research Asia.

ICDM 2009, Miami

Outline
- Motivation & Problem
- Our Solution
- Experiments
- Related Work
- Summary and Future Work

Motivation
Multidimensional data are everywhere:
- Time series: stock data, data collected from sensor monitors
- Feature vectors extracted from images or texts
- ...
Similarity query on multidimensional data is important in:
- data mining
- databases
- information retrieval

Similarity query is challenging when the data is incomplete
Data incompleteness happens when:
- Sensors do not work properly
- Certain features are missing from particular feature vectors
- ...
[Figure: a query compared against sensor data, a text vector, and an image vector, each with missing elements]
To process a similarity query, imputation is necessary, i.e., "completing" the missing data by filling in specific values.

Dimension incomplete data
Dimension incomplete data satisfies:
(a) At least one of its data elements is missing;
(b) The dimension of the missing data element cannot be determined.
E.g., observed data: (3, 6). We know the complete data should have three dimensions, so the missing element might be in the first, second, or third dimension.

Causes of dimension incompleteness
Dimension incompleteness happens when:
- Elements go missing while their order serves as the implicit dimension indicator
- The dimension indicator itself is lost
- ...

Similarity query is more challenging when the dimension is incomplete
To measure the similarity between the query and a dimension incomplete data object, we should first recover the incomplete data.
Enumerating all combinations? Too time-consuming.
E.g., $X_{obs} = (3, 6)$ has lost one dimension, so there are 3 possible results after data recovery, where x is the imputed element: (x, 3, 6), (3, x, 6), (3, 6, x).
For an m-dimensional data object with n elements missing, there are $C_m^n = \binom{m}{n}$ cases to recover it.
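The enumeration described on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `enumerate_recoveries` and the `None` placeholder for imputed positions are assumptions, not part of the original talk.

```python
from itertools import combinations
from math import comb

def enumerate_recoveries(x_obs, m, placeholder=None):
    """Yield every order-preserving placement of x_obs into m dimensions.

    Positions that receive no observed value hold `placeholder`,
    standing in for an element that would have to be imputed.
    """
    for positions in combinations(range(m), len(x_obs)):
        recovered = [placeholder] * m
        for value, pos in zip(x_obs, positions):
            recovered[pos] = value
        yield tuple(recovered)

# The slide's example: two observed values, three true dimensions.
recoveries = list(enumerate_recoveries((3, 6), m=3))
for rv in recoveries:
    print(rv)

# The number of cases matches the binomial coefficient C(m, n).
print(len(recoveries) == comb(3, 1))
```

The count grows combinatorially with m, which is why exhaustive enumeration is described as too costly.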

Problem statement:
Symbol: Description
$D$: Database
$X_{obs}$: The observed part of X
$X_{mis}$: The missing part of X
$X$: The underlying complete multidimensional data object
$X_{rv}$: The recovery version of $X_{obs}$
$Q$: The query
$c$: The confidence threshold
$r$: The distance threshold
$\delta$: The distance function (in this paper, we use Euclidean distance)
The imputation strategy

Two assumptions:
- Every recovery result is equally probable.
- The missing values follow a normal distribution.
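Under these two assumptions, the probability that a dimension incomplete object lies within distance r of the query can be estimated by brute force. The sketch below is a naive Monte Carlo baseline, not the method of the talk; the function name, the single shared `mu` and `sigma` for all missing values, and the sample count are all assumptions made for illustration.

```python
import random
from itertools import combinations
from math import sqrt

def estimate_confidence(x_obs, query, r, mu, sigma,
                        samples_per_case=2000, seed=0):
    """Estimate Pr[distance(query, recovered X) <= r] by sampling.

    Every recovery case gets the same number of samples (equal
    probability), and each missing value is drawn from N(mu, sigma).
    """
    rng = random.Random(seed)
    hits = total = 0
    for positions in combinations(range(len(query)), len(x_obs)):
        observed = dict(zip(positions, x_obs))
        for _ in range(samples_per_case):
            sq_dist = 0.0
            for i, q in enumerate(query):
                x = observed[i] if i in observed else rng.gauss(mu, sigma)
                sq_dist += (x - q) ** 2
            if sqrt(sq_dist) <= r:
                hits += 1
            total += 1
    return hits / total

# Hypothetical inputs: compare the estimate with a confidence threshold c.
confidence = estimate_confidence((3, 6), (3, 5, 6), r=1.5, mu=5.0, sigma=1.0)
print(confidence)
```

The cost is the number of recovery cases times the number of samples, which is the expense that pruning is meant to avoid.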

Efficient approach for PSQ-DID (probabilistic similarity query on dimension incomplete data)
A gradual refinement search strategy including two pruning methods:
- Lower/upper bounds of confidence
- Probability triangle inequality
[Figure: our overall query process]

Lower and upper bounds of confidence
The missing part and the observed part of the dimension incomplete data are treated separately. Since we use Euclidean distance, the squared distance splits into an observed part and a missing part, and we have:
- Lower/upper bounds of the observed part, denoted by $\delta_{LB}^{obs}$ and $\delta_{UB}^{obs}$.
- Lower/upper bounds of the missing part, denoted by $\delta_{LB}^{mis}$ and $\delta_{UB}^{mis}$.

E.g., $X_{obs} = (2, 8, 7)$, $Q = (1, 4, 5, 6, 7)$.
$\delta_{LB}^{obs}(Q, X_{obs})^2 = (2-1)^2 + (8-6)^2 + (7-7)^2 = 5$, with corresponding recovery version $(2, x_1, x_2, 8, 7)$.
For the imputed random variables $X_{mis} = \{x_1, x_2\}$: if the imputation policy uses the mean of the two adjacent observed elements as the expectation of each imputed random variable, then $\delta_{LB}^{mis}(Q, X_{mis})^2 = (4 - x_1)^2 + (5 - x_2)^2$ with $E(x_1) = E(x_2) = 5$, corresponding to $X_{rv} = (2, 5, 5, 8, 7)$.

Lower and upper bounds of confidence
Denoted by: [equation]
We prove that: [equation]

Probability triangle inequality
Given a query Q and a multidimensional data object R ($|Q| = |R|$). For a dimension incomplete data object $X_{obs}$ whose underlying complete version is X, we have: (1) [equation] (2) [equation]
- Calculated in advance and stored in the database: $O(|X_{obs}|(|Q| - |X_{obs}|)^2)$
- Calculated during query processing: $O(|Q|)$
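As background only, the classical deterministic triangle inequality on complete data bounds the distance between Q and X using a reference object R: $|\delta(Q,R) - \delta(R,X)| \le \delta(Q,X) \le \delta(Q,R) + \delta(R,X)$. The sketch below shows that deterministic form, not the probabilistic version from the talk; the function names and the idea of passing a precomputed `d_rx` are assumptions made for illustration.

```python
from math import dist

def triangle_bounds(d_qr, d_rx):
    """Lower and upper bounds on d(Q, X) from d(Q, R) and d(R, X)."""
    return abs(d_qr - d_rx), d_qr + d_rx

def prune_with_reference(query, reference, d_rx, r):
    """Decide about X without touching X itself.

    d_rx is a distance between the reference object and X that is
    assumed to have been computed and stored ahead of time.
    """
    d_qr = dist(query, reference)  # computed at query time
    low, high = triangle_bounds(d_qr, d_rx)
    if low > r:
        return "dismiss"    # X cannot be within distance r of the query
    if high <= r:
        return "accept"     # X must be within distance r of the query
    return "undecided"      # bounds are too loose; check X directly

print(prune_with_reference((0.0, 0.0), (3.0, 4.0), d_rx=1.0, r=3.0))  # dismiss
```

The split between a stored quantity and a query-time quantity mirrors the two cost terms listed on the slide.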

Experiments
Data sets:
- Standard and Poor 500 index historical stock data (S&P500): 251 dimensions
- A new data set with 30 dimensions, obtained by segmenting the S&P500 data set, resulting in 4328 data objects
- Corel Color Histogram data (IMAGE): 68040 images, 32 dimensions
Dimension incomplete data set: created by randomly removing some dimensions of each data object.

Experiment Setup
Ground truth: similarity query results on the complete data
Performance measures: precision, recall, pruning power
- Pruning power = $N_{definite} / N_{processed}$
- $N_{processed}$: number of all data objects
- $N_{definite}$: number of data objects judged as dismissals or search results by the pruner
Query: 100 data objects randomly sampled from the data set
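The three measures can be written down directly. This is a minimal sketch: the function names are hypothetical, and returning 1.0 for an empty result set or empty ground truth is an assumed convention that the slide does not specify.

```python
def precision_recall(returned, ground_truth):
    """Precision and recall of a result set against the ground truth
    (the query results obtained on the complete data)."""
    returned, ground_truth = set(returned), set(ground_truth)
    true_positives = len(returned & ground_truth)
    # Assumed convention: an empty denominator yields 1.0.
    precision = true_positives / len(returned) if returned else 1.0
    recall = true_positives / len(ground_truth) if ground_truth else 1.0
    return precision, recall

def pruning_power(n_definite, n_processed):
    """Fraction of processed objects the pruner decided on its own,
    either as dismissals or as search results."""
    return n_definite / n_processed

# Hypothetical object ids and counts, for illustration only.
print(precision_recall({1, 2, 3, 4}, {2, 3, 5}))
print(pruning_power(n_definite=90, n_processed=100))  # 0.9
```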

Effectiveness of probabilistic similarity query on dimension incomplete data
[Figures: query precision on S&P500 data set; query recall on S&P500 data set]

Effectiveness of probabilistic similarity query on dimension incomplete data
[Figures: query precision on IMAGE data set; query recall on IMAGE data set]

Effect of the confidence threshold
Settings: missing ratio = 0.1; r = 60 for S&P500, r = 0.7 for IMAGE data
[Figure: confidence threshold vs. precision and recall]

Effectiveness of different pruners
[Figure: pruning power of probability triangle inequality]

Pruning Power of Four Pruners
- Pruner1: probability triangle inequality using the confidence lower bound
- Pruner2: probability triangle inequality using the confidence upper bound
- Pruner3: confidence lower bound
- Pruner4: confidence upper bound
Settings: missing ratio = 10%, c = 0.1, number of assistant objects = 20
[Figure: pruning power of four pruners]

Comparison of query quality when neglecting naïve verification
For data objects that the four pruners cannot judge, Pos simply outputs them as query results; Neg, by contrast, judges them as dismissals.
Setting: c = 0.1
[Figure: comparison of query quality]
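The difference between Pos, Neg, and full verification can be sketched as a single decision function. The accept and dismiss rules here (lower bound of confidence at or above c means result, upper bound below c means dismissal) are an assumed reading of how the bounds are used; the function name, the `fallback` parameter, and the `verify` callback are hypothetical.

```python
def classify(conf_lower, conf_upper, c, fallback="verify", verify=None):
    """Return True if the object is reported as a query result.

    conf_lower / conf_upper bound the object's true confidence.
    fallback controls what happens when the bounds cannot decide:
      "pos"    -> report the object as a result
      "neg"    -> treat the object as a dismissal
      "verify" -> call verify() to compute the exact answer
    """
    if conf_lower >= c:
        return True       # confidence is certainly high enough
    if conf_upper < c:
        return False      # confidence is certainly too low
    if fallback == "pos":
        return True
    if fallback == "neg":
        return False
    if verify is None:
        raise ValueError("fallback='verify' needs a verify callback")
    return verify()

# An undecided object (bounds straddle c = 0.1) under each strategy.
print(classify(0.05, 0.4, c=0.1, fallback="pos"))   # True
print(classify(0.05, 0.4, c=0.1, fallback="neg"))   # False
```

Read this way, Pos can only add false positives and Neg can only add false dismissals, relative to full verification.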

Performance analysis
[Figure: time cost]

Related Work
Few research papers discuss similarity search on dimension incomplete data.
Incomplete data
- Recovery: D. Williams et al. [ICML'05], K. Lakshminarayan et al. [Applied Intelligence'99], ...
- Indexing: G. Canahuate et al. [EDBT'06], B. C. Ooi et al. [VLDB'98], ...
Uncertain data
- J. Pei et al. [SIGMOD'08], D. Burdick et al. [VLDB'05], ...
Dimension incomplete data
- Symbolic sequences: J. Gu et al. [DEXA'07]

Summary and Future Work
Problem: tackle the similarity query on a new form of uncertain data (dimension incomplete data)
Solution:
- Lower and upper bounds of confidence, so that we can avoid enumerating all $\binom{|Q|}{|X_{mis}|}$ recovery cases
- Probability triangle inequality, which further boosts performance during query processing
Future work:
- Other similarity measurements
- Indexing dimension incomplete data

Many thanks!
