Differential Analysis on Deep Web Data Sources Tantan Liu, Fan Wang, Jiedan Zhu, Gagan Agrawal December 14, 2010
Outline Introduction Problem Definition Differential Analysis and Approaches Experiment Results Conclusion
Introduction Deep web –Query forms vs. backend databases –Similar information from multiple data sources –What’s their difference? –Application: guiding users’ search process Higher-level knowledge summary –Patterns of values with respect to the same entity
Problem Definition Goal –Difference between multiple data sources in the same domain Patterns of values of the same entity –Different values for the same data entity For example: prices of commodities –How different is the data, and under what conditions? –Differential Rules Capturing the difference of values
Differential Analysis and Approaches Summarizing difference between two data sources Data queried from the deep web –A relational table Attributes –Assumption: data sources have same attributes –Identical attributes Same values for the same data object –Differential attributes Different values for the same data object –Quantitative attributes Differences in values of quantitative attributes
Differential Analysis and Approaches- Useful Identifiers Two data sources A and B –Identical attributes –Differential attributes :attribute in data source –Combining relation tables of A and B –Differential rule where Profile X: the left-hand side of the rule
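The step of combining the relation tables of A and B can be illustrated with a small sketch. It assumes each source is a list of row dictionaries, joins rows on the identical attributes, and keeps both values of one differential attribute side by side. The function name `combine_tables`, the `_A`/`_B` suffixes, and the sample hotel rows are hypothetical and do not come from the slides.

```python
def combine_tables(table_a, table_b, identical_attrs, differential_attr):
    """Join two tables on their identical attributes (illustrative sketch).

    Each table is a list of dicts. Rows describing the same entity agree on
    every identical attribute; the differential attribute is kept twice,
    once per source, under hypothetical "_A" / "_B" suffixes.
    """
    def key(row):
        return tuple(row[attr] for attr in identical_attrs)

    # Index source B by the join key so each lookup is O(1).
    index_b = {key(row): row for row in table_b}

    combined = []
    for row_a in table_a:
        row_b = index_b.get(key(row_a))
        if row_b is None:
            continue  # entity appears in only one source; nothing to compare
        merged = {attr: row_a[attr] for attr in identical_attrs}
        merged[differential_attr + "_A"] = row_a[differential_attr]
        merged[differential_attr + "_B"] = row_b[differential_attr]
        combined.append(merged)
    return combined


# Made-up sample data for illustration only.
site_a = [
    {"hotel_id": 1, "city": "Paris", "star": 4, "price": 210.0},
    {"hotel_id": 2, "city": "Rome", "star": 3, "price": 120.0},
]
site_b = [
    {"hotel_id": 1, "city": "Paris", "star": 4, "price": 195.0},
    {"hotel_id": 3, "city": "Oslo", "star": 5, "price": 300.0},
]
rows = combine_tables(site_a, site_b, ["hotel_id", "city", "star"], "price")
```

Only hotel 1 appears in both sources, so the combined table holds a single row with both prices.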
Differential Analysis and Approaches- Differential Rule Mining Frequent Item Set Mining –Apriori algorithm –A concept hierarchy Identifying patterns for target attributes –For each frequent itemset X Decide –Paired Z-test : difference between two random variables Hypothesis test vs. if >, then – if >0, then
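The paired Z-test step can be sketched as follows. This is a minimal illustration, not the authors' exact procedure: it assumes a two-sided test of the null hypothesis that the mean paired difference is zero, a significance level `alpha` of 0.05, and a direction label taken from the sign of the statistic. The function name, the direction labels, and the sample prices are hypothetical. A Z-test is normally applied to large samples; the short lists here only keep the example readable.

```python
import math
from statistics import NormalDist


def paired_z_test(values_a, values_b, alpha=0.05):
    """Paired Z-test on the differences a_i - b_i (illustrative sketch).

    Returns (z, direction), where direction is "A>B", "A<B", or None when
    the difference is not significant at level alpha (assumed two-sided).
    """
    if len(values_a) != len(values_b) or len(values_a) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")

    diffs = [a - b for a, b in zip(values_a, values_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)

    if variance == 0:
        # All differences are identical: either no difference at all,
        # or a perfectly consistent one.
        z = 0.0 if mean == 0 else math.copysign(math.inf, mean)
    else:
        z = mean / math.sqrt(variance / n)

    critical = NormalDist().inv_cdf(1 - alpha / 2)
    if abs(z) <= critical:
        return z, None
    return z, "A>B" if z > 0 else "A<B"


# Made-up prices of the same eight hotels on two sites.
prices_a = [210, 180, 150, 240, 200, 170, 190, 220]
prices_b = [195, 170, 145, 225, 190, 160, 182, 205]
z, direction = paired_z_test(prices_a, prices_b)
```

In a rule-mining loop, a test like this would run once per frequent itemset X, over the rows of the combined table that match X.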
Differential Analysis and Approaches- Pruning Rules Pruning rules –A large number of rules are generated –Essential rules predict unessential rules –Identifying essential rules Direction of rules
Differential Analysis and Approaches- Ancestors of Rules Rules R1, R2 are complementary ancestors of rule R –R1: Y->d, R2: Z->d –R: X->d, and Rule R is predicted by complementary ancestors R1 and R2
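A check for complementary ancestors might look like the sketch below. The exact condition relating X, Y, and Z is not legible in the slide text, so this assumes that Y and Z are non-empty, disjoint, and together cover X. That reading is an assumption, as are the function name and the (attribute, value) encoding of items.

```python
def are_complementary_ancestors(profile_y, profile_z, profile_x):
    """Return True if Y and Z split X into two non-empty disjoint parts.

    ASSUMPTION: "complementary" is read here as Y ∪ Z = X and Y ∩ Z = ∅;
    the slides do not spell out the condition.
    """
    y, z, x = frozenset(profile_y), frozenset(profile_z), frozenset(profile_x)
    return bool(y) and bool(z) and y.isdisjoint(z) and (y | z) == x


# Items are hypothetical (attribute, value) pairs.
x = {("city", "Paris"), ("star", "4")}
y = {("city", "Paris")}
z = {("star", "4")}
is_pair = are_complementary_ancestors(y, z, x)
```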
Differential Analysis and Approaches- Profile Representation Identifying essential rules –Rules are processed level by level –For rule R in level k, all the rules from level 1 to k-1 are visited –Computation cost is high Profile Representation –Uniquely describes the items contained in the profile X of a rule R –For profile, define would be extremely large when profile X is large –Thus, we modify
Differential Analysis and Approaches- Process of Pruning A hash table is used to store differential rules Each level corresponds to a hash table For each rule R in the k-th level –The ancestor rules from level 1 to k/2 are visited –Identifying complementary rules by profile representation –R is an unessential rule if it is predicted by a pair of complementary ancestor rules –Process the next rule
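The pruning process can be sketched with one dictionary per level. This is an illustration under several assumptions, not the authors' implementation: a `frozenset` of items stands in for the profile representation (whose formula is not legible in the slides), complementary ancestors are taken to be disjoint profiles whose union is X, a rule counts as predicted only when both ancestors have the same direction as R, and ancestors are looked up among all mined rules. The function names and sample rules are hypothetical.

```python
def build_levels(rules):
    """Group rules into one hash table (dict) per level.

    rules maps a frozenset profile to a direction string; the level of a
    rule is the number of items in its profile.
    """
    levels = {}
    for profile, direction in rules.items():
        levels.setdefault(len(profile), {})[profile] = direction
    return levels


def is_predicted(profile, direction, levels):
    """Check whether a pair of complementary ancestors predicts the rule."""
    k = len(profile)
    # Visiting levels 1..k/2 is enough: the complement of a level-j
    # ancestor sits at level k-j and is found by a direct hash lookup.
    for size in range(1, k // 2 + 1):
        for y, dir_y in levels.get(size, {}).items():
            if dir_y != direction or not y < profile:
                continue  # not an ancestor with the same direction
            z = profile - y  # complementary profile (assumed definition)
            if levels.get(k - size, {}).get(z) == direction:
                return True
    return False


def prune_unessential(rules):
    """Return only the rules not predicted by complementary ancestors."""
    levels = build_levels(rules)
    return {
        profile: direction
        for profile, direction in rules.items()
        if not is_predicted(profile, direction, levels)
    }


# Made-up rules: profile -> direction of the price difference.
paris = frozenset({("city", "Paris")})
four_star = frozenset({("star", "4")})
paris_four = frozenset({("city", "Paris"), ("star", "4")})
rome_four = frozenset({("city", "Rome"), ("star", "4")})
rules = {paris: "A>B", four_star: "A>B", paris_four: "A>B", rome_four: "A<B"}
essential = prune_unessential(rules)
```

Here the Paris, 4-star rule is dropped because its two single-item ancestors carry the same direction, while the Rome, 4-star rule is kept because no such pair exists.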
Experiment Results Data Set: four of the most popular travel sites. 120 randomly selected cities all over the world Attributes –Hotel ID, City, Star, Customer Rating, Cleanness Rating, Price, Service Rating Concept Hierarchy for attribute: city
Experiment Results - effectiveness
Experiment Results – Pruning effectiveness
Experiment Results- Efficiency
Experiment Results -Mining-Utility of the Approach
Conclusion A method to extract a high-level summary of the differences among multiple data sources Differential rule mining – A new data mining problem Statistical test for discovering differential rules A method to prune unessential rules A hash table is used to speed up the process. Experiment results on four travel-related deep web data sources show good results.
Questions?