CPT-S 580-06 Advanced Databases 11 Yinghui Wu EME 49.

Presentation transcript:

CPT-S Advanced Databases 11 Yinghui Wu EME 49

Approximate query processing

Approximate query answering
We can't afford to always find exact query answers in big data:
◦ Some queries are expensive (e.g., subgraph isomorphism)
◦ We may have constrained resources and cannot afford unlimited resources
◦ Applications may demand real-time response
1. Query-driven approximation
◦ Feasible query models: from intractable to low polynomial time
◦ Top-k query answering
2. Data-driven approximation: resource-bounded query answering
◦ Querying big data within our available resources

Revised graph query model
Relaxing the semantics of queries: case study
◦ Effectiveness: capture more sensible matches in social graphs
◦ Efficiency: from intractable to low polynomial time
Subgraph isomorphism: NP-complete, exponentially many matches
Revised model: quadratic/cubic time (a polynomial-time algorithm)
Use "cheaper" queries whenever possible; this works better for social network analysis

G-Ray: best-effort graph pattern matching (Hanghang T., KDD 09)
Input: attributed data graph, query graph
Output: matching subgraph

G-Ray: quick overview (for loop)
Step 1: SF
Step 2: NE
Step 3: BR
Step 4: NE
Step 5: BR
Step 6: NE
Step 7: BR
Step 8: BR
SF: Seed-Finder; NE: Neighborhood-Expander; BR: Bridge

G-Ray: example pattern and matches
◦ Best-effort match
◦ Not exact answers
◦ No approximation guarantee
◦ Loses topological information
◦ Runs in time linear in |G|

Top-k query answering
Traditional query answering: compute Q(D)
◦ It is expensive to compute when D is large
◦ The result Q(D) is excessively large for the users to inspect, even larger than D
How many matches do you check when you use, e.g., Google?
Top-k query answering:
◦ Input: query Q, database D and a positive integer k
◦ Output: a top-ranked set of k matches of Q
Early termination: return top-k matches without computing Q(D)

Top-k graph querying
Input: graph G, query Q, integer k, answer quality measure
Output: top-k answer set that maximizes the objective function F
Top-k algorithms
◦ Exact top-k
◦ Approximate top-k
◦ Any-time top-k
◦ Early terminating
Difference between top-k graph problems and top-k table aggregation?
◦ Valid match expansion
◦ Join (no monotonicity)
◦ Hard to show instance optimality (top-k join queries are special cases!)

GraphTA: A template
Initialize a candidate list L for each node/edge of interest in Q
For each list L: sort L with the ranking function
Set a cursor to each list; set an upper bound U
For each cursor c in each list L do
    generate a match that contains c;
    update Q(G, k);
    update threshold H with the lowest score in Q(G, k);
    move all cursors one step ahead;
    update the upper bound U;
    if k matches are identified and H >= U then break;
Return Q(G, k)
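The template can be made concrete under some simplifying assumptions. The sketch below is not the GraphTA implementation: it assumes one sorted candidate list per query node, a match score equal to the sum of its candidates' scores, non-empty candidate lists, and a caller-supplied validity test `is_valid` (a hypothetical name) that stands in for checking the pattern edges. The upper bound U is the sum of the scores at the next cursor positions, which bounds every match that has not been generated yet.

```python
import heapq
from itertools import product


def graph_ta(cand_lists, is_valid, k):
    """Threshold-style top-k matching (illustrative sketch).

    cand_lists: dict mapping each query node to a list of (data_node, score).
    is_valid:   function taking {query_node: data_node}, True for a valid match.
    Returns up to k (score, match_tuple) pairs, best first.
    """
    nodes = list(cand_lists)
    # Sort every candidate list by descending score.
    lists = {u: sorted(cand_lists[u], key=lambda p: -p[1]) for u in nodes}
    top = []      # min-heap holding the current top-k, i.e. Q(G, k)
    seen = set()  # matches already generated
    depth = 0     # all cursors advance in lockstep
    while all(depth < len(lists[u]) for u in nodes):
        for u in nodes:
            c = lists[u][depth]
            # Generate the matches that contain the candidate under cursor c.
            pools = [[c] if w == u else lists[w] for w in nodes]
            for combo in product(*pools):
                match = tuple(v for v, _ in combo)
                if match in seen:
                    continue
                seen.add(match)
                if not is_valid(dict(zip(nodes, match))):
                    continue
                score = sum(s for _, s in combo)
                if len(top) < k:
                    heapq.heappush(top, (score, match))
                elif score > top[0][0]:
                    heapq.heapreplace(top, (score, match))
        depth += 1  # move all cursors one step ahead
        if any(depth >= len(lists[u]) for u in nodes):
            break   # one list is exhausted, so every match was generated
        # Unseen matches only use candidates at or below the new cursors.
        upper = sum(lists[u][depth][1] for u in nodes)
        if len(top) == k and top[0][0] >= upper:
            break   # threshold H >= U: early termination
    return sorted(top, reverse=True)
```

Enumerating all matches that contain a cursor is the expensive part of this sketch; a practical algorithm would generate only the best valid match per cursor.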

Finding best candidates
Querying collaborative networks: we just want top-ranked PMs
Query: find good PM (project manager) candidates who collaborated with PRG (programmer), DB (database developer) and ST (software tester).
Pattern graph Q: Project Manager* (the "query focus"), Programmer, DB manager, Tester
Collaboration network G: PM1 to PM4, BA, PRG1 to PRG4, DB1 to DB3, UD1, UD2, ST1 to ST4
Complete matching relation:
◦ (project manager, PM1), (project manager, PM2), (project manager, PM3), (project manager, PM4)
◦ (programmer, PRG1), (programmer, PRG2), (programmer, PRG3), (programmer, PRG4)
◦ (DB manager, DB1), (DB manager, DB2), (DB manager, DB3)
◦ (tester, ST1), (tester, ST2), (tester, ST3), (tester, ST4)

Graph pattern matching with output node
Input: graph G = (V, E, f_A), pattern Q = (V_Q, E_Q, f_v, u_o), where u_o is the output node
Output: Q(G, u_o) = { v | (u_o, v) ∈ Q(G) }, the matches of the output node
Output: k nodes vs. the entire set Q(G)
Top-k query answering:
◦ Input: pattern Q, data graph G and a positive integer k
◦ Output: top-k matches in Q(G, u_o)
Figure: pattern Q with output node PM* and nodes DB, PRG, ST; data graph with pm1, pm2, pm3, …, pmn, db1 to db3, prg1 to prg3, st1 to st4, …, stm; the top-2 matches are highlighted.
How to rank the answers?

Top-k answers
Top-k matching: top-k matches that maximize the total relevance
Relevant set R(u, v) for a match v of a query node u:
◦ all descendants of v as matches of descendants of u
◦ a unique, maximum relevant set
Relevance function
◦ The more reachable matches, the better
Figure: relevant set example over PM2, DB2, DB3, PRG2, PRG3, PRG4, ST2, ST3, ST4
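As a rough illustration of "the more reachable matches, the better", the sketch below assumes the matches have already been computed and linked into a directed graph `match_graph` (a hypothetical structure: an edge v → w means w is a match of a child of the query node that v matches). The relevant set of v is then everything reachable from v, and its size serves as the relevance score. The edges in the usage note are invented for illustration.

```python
def relevant_set(match_graph, v):
    """Return all matches reachable from v in the match graph (v excluded)."""
    seen = set()
    stack = list(match_graph.get(v, ()))
    while stack:
        w = stack.pop()
        if w in seen:
            continue
        seen.add(w)
        stack.extend(match_graph.get(w, ()))
    return seen


def relevance(match_graph, v):
    """Relevance of v: the number of matches in its relevant set."""
    return len(relevant_set(match_graph, v))
```

For example, with hypothetical edges `{"PM2": ["DB2", "PRG3"], "DB2": ["ST2"], "PRG3": ["ST2", "ST3"]}`, the relevant set of PM2 contains DB2, PRG3, ST2 and ST3.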

Finding Top-k Matches
Finding top-k matches for acyclic patterns
◦ Initializes a heap S, and a vector for each candidate v
◦ Computes a set of matches for some query nodes (can be determined without following steps)
◦ Iteratively updates vectors of other candidates by propagating the partial answers
◦ Termination condition: (1) each v in S is a match of u_o, and (2) min over v ∈ S of l(u_o, v) ≥ max over v′ ∈ can(u_o)\S of h(u_o, v′), where l(u_o, v) and h(u_o, v) denote a lower bound and an upper bound of r(u_o, v).
Vector of a candidate v: X_v (match?), v.R (relevant set), v.lower and v.upper (relevance bounds)
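The termination condition can be checked with a few lines of code. This is a minimal sketch, assuming the per-candidate bounds and match status are kept in plain dictionaries; the names `is_match`, `lower` and `upper` are illustrative rather than taken from the slides.

```python
def can_terminate(S, candidates, is_match, lower, upper):
    """Early-termination test for top-k matches of the output node.

    S:          current set of selected candidates (the heap contents)
    candidates: all candidates of the output node, can(u_o)
    is_match:   dict v -> True once v is confirmed as a match
    lower/upper: dicts v -> lower / upper bound on the relevance of v
    """
    if not S or not all(is_match.get(v, False) for v in S):
        return False  # condition (1) fails
    rest = [v for v in candidates if v not in S]
    if not rest:
        return True   # nothing outside S can overtake it
    # Condition (2): the worst selected match beats the best outsider.
    return min(lower[v] for v in S) >= max(upper[v] for v in rest)
```

When the test succeeds, no candidate outside S can end up with a higher relevance than any member of S, so the remaining candidates need not be examined further.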

Finding Top-k Matches
Pattern: Project Manager*, Programmer, DB manager
Data graph nodes: PM1 to PM4, BA, PRG1, PRG3, PRG4, DB1 to DB3
Table (values not recoverable): the vector v.T of each candidate v (PM1, PM2, PM3, PM4, PRG1, PRGj with j ∈ [3,4], DBk with k ∈ [1,3]), shown after initialization and after propagation from DB2
Result: a valid match, and its relevant set includes the most matches compared with others. The early termination condition is met.

A revision of conventional approximation theory

Traditional approximation theory
Traditional approximation algorithms T: for an NPO (NP-complete optimization problem),
◦ for each instance x, T(x) computes a feasible solution y
◦ quality metric f(x, y)
◦ performance ratio η ≥ 1: for all x,
   Minimization: OPT(x) ≤ f(x, y) ≤ η · OPT(x)
   Maximization: (1/η) · OPT(x) ≤ f(x, y) ≤ OPT(x)
   where OPT(x) is the optimal solution
Does it work when it comes to querying big data?

The approximation theory revisited
Traditional approximation algorithms T: for an NPO,
◦ for each instance x, T(x) computes a feasible solution y
◦ quality metric f(x, y)
◦ performance ratio (minimization): for all x, OPT(x) ≤ f(x, y) ≤ η · OPT(x)
Big data? A quest for revising approximation algorithms for querying big data:
◦ Approximation: for even low PTIME problems, not just NPO
◦ Quality metric: the answer to a query is typically a set, not a number
◦ Approach: it does not help much if T(x) conducts computation on "big" data x directly!

Data-driven: Resource bounded query answering
Input: a class of queries, a resource ratio α ∈ [0, 1), and a performance ratio η ∈ (0, 1]
Question: develop an algorithm that, given any query Q in the class and a dataset D,
◦ accesses a fraction D_Q of D such that |D_Q| ≤ α|D|,
◦ computes Q(D_Q) as approximate answers to Q(D), and
◦ guarantees accuracy(Q, D, α) ≥ η
Diagram: D is reduced to D_Q by dynamic reduction, and Q(D_Q) approximates Q(D)
Accessing α|D| amount of data in the entire process

Resource bounded query answering
Resource bounded: resource ratio α ∈ [0, 1), decided by our available resources: time, space, …
Dynamic reduction: given Q and D, find D_Q for Q
◦ contrast this to synopses, which find one reduced dataset for all Q: histogram, wavelets, sketches, sampling, …
◦ better reduction ratio
In combination with other tricks for making big data small: access schema, distributed, views, …

Accuracy metrics
Performance ratio for approximate query answering: F-measure of precision and recall, to cope with the set semantics of query answers
precision(Q, D, α) = |Q(D_Q) ∩ Q(D)| / |Q(D_Q)|
recall(Q, D, α) = |Q(D_Q) ∩ Q(D)| / |Q(D)|
accuracy(Q, D, α) = 2 · precision(Q, D, α) · recall(Q, D, α) / (precision(Q, D, α) + recall(Q, D, α))
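The accuracy metric is straightforward to compute once both answer sets are available. The sketch below treats the approximate and exact answers as Python sets; the handling of empty answer sets is an assumption, since the slides do not define it.

```python
def accuracy(approx_answers, exact_answers):
    """F-measure of precision and recall between two answer sets."""
    approx, exact = set(approx_answers), set(exact_answers)
    if not approx and not exact:
        return 1.0  # assumption: two empty answers agree perfectly
    common = len(approx & exact)
    if common == 0:
        return 0.0  # covers an empty approx or exact set as well
    precision = common / len(approx)
    recall = common / len(exact)
    return 2 * precision * recall / (precision + recall)
```

With approximate answers {1, 2, 3} and exact answers {2, 3, 4, 5}, precision is 2/3, recall is 1/2, and the accuracy is 4/7.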

Personalized social search
Graph Search, Facebook:
◦ Find me all my friends who live in Pullman and like cycling
◦ Find me restaurants in Seattle my friends have been to
◦ Find me photos of my friends in New York
Personalized social search with α = 0.0015%!
1.5 × 10^-5 × 1PB (10^15 B) = 15 × 10^9 B = 15GB
Making big graphs of PB size as small as 15GB, so that they fit into our memory!
Localized patterns with 100% accuracy!
Add to this access schema, distributed, views, …
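The slide's arithmetic (a resource ratio applied to 1PB giving 15GB) can be checked mechanically. The ratio 1.5 × 10^-5 used below is inferred from 15GB / 1PB; exact fractions are used to avoid floating-point rounding, and decimal units (1GB = 10^9 B) are assumed as on the slide.

```python
from fractions import Fraction

PB = 10**15  # bytes in a petabyte (decimal units, as on the slide)
GB = 10**9   # bytes in a gigabyte


def budget_bytes(alpha, data_size):
    """Amount of data a resource ratio alpha allows us to access."""
    return alpha * data_size


alpha = Fraction(15, 10**6)       # 1.5e-5, i.e. 0.0015%
budget = budget_bytes(alpha, PB)  # 15 * 10**9 bytes
budget_gb = budget / GB           # 15
```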

Localized queries
Localized queries: can be answered locally
◦ Graph pattern queries: revised simulation queries
◦ matching relation over d_Q-neighborhood of a personalized node
Michael: "find cycling fans who know both my friends in cycling club and my friends in hiking groups"
Figure: pattern Q with personalized node Michael (unique match), hiking group, cycling club member and ?cycling lovers; data graph with Michael, hiking groups hg1, hg2, …, hgm, cycling club members cc1, cc2, cc3, and cycling fans cl1, cl2, …, cln-1, cln
Personalized social search, ego network analysis, …

Resource-bounded simulation
Pipeline: preprocessing (auxiliary information), then dynamic reduction (compute reduced subgraph), then approximate query evaluation over the reduced subgraph
Local auxiliary information on G: degree, |neighbor|, …
Dynamically updated auxiliary information, for a query node u and a data node v:
◦ Boolean guarded condition: label matching (does the label of v match that of u?)
◦ Cost function c(u, v): if v is included, the number of additional nodes that also need to be included (budget)
◦ Potential function p(u, v): estimated probability that v matches u (total number of nodes in the neighborhood of v that are candidate matches)
◦ Bound b: determines an upper bound on the number of nodes to be visited
Query-guided search, bounded by the budget
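A heavily simplified sketch of query-guided, budget-bounded reduction follows. It is not the algorithm from the slides: the guard is plain label matching against the set of labels used in the query, the potential of a node is the number of its neighbors passing the guard, the cost function is omitted, and the budget is a hard cap on the number of nodes placed in the reduced subgraph (for example α|G|). All names are illustrative.

```python
import heapq


def reduce_graph(adj, label, start, query_labels, budget):
    """Collect at most `budget` nodes around `start`, guided by the query.

    adj:          dict node -> list of neighbor nodes
    label:        dict node -> label
    query_labels: set of labels that occur in the pattern
    """
    def potential(v):
        # Neighbors that could still match some query node.
        return sum(1 for w in adj.get(v, ()) if label[w] in query_labels)

    visited = {start}
    frontier = [(-potential(start), start)]  # max-heap via negated potential
    while frontier and len(visited) < budget:
        _, v = heapq.heappop(frontier)
        for w in adj.get(v, ()):
            if len(visited) >= budget:
                break                        # budget exhausted
            if w in visited or label[w] not in query_labels:
                continue                     # guard: label must match
            visited.add(w)
            heapq.heappush(frontier, (-potential(w), w))
    return visited
```

The pattern query would then be evaluated over the subgraph induced by the returned nodes instead of over G.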

Resource-bounded simulation
Pipeline: preprocessing, then dynamic reduction (compute reduced subgraph), then approximate query evaluation over the reduced subgraph
Figure: pattern with Michael, hiking group, cycling club and ?cycling lovers; data graph with Michael, hiking groups hg1, hg2, …, hgm, cycling club members cc1, cc2, cc3, and cycling fans cl1, cl2, …, cln-1, cln
Figure annotations on candidates: guard FALSE; guard TRUE, Cost=1, Potential=3, Bound=2; guard TRUE, Cost=1, Potential=2, Bound=2; bound = 14, visited = 16
Match relation: (Michael, Michael), (hiking group, hgm), (cycling club, cc1), (cycling club, cc3), (cycling lover, cln-1), (cycling lover, cln)
Dynamic data reduction and query-guided search

Accuracy
Figure: accuracy on Yahoo, varying α
◦ 89%-100% for simulation queries
◦ both achieve 100% accuracy when α > 0.0015%
◦ 100% accuracy for … × 1PB (10^15 B) = … = 10GB

Non-localized queries
Reachability
◦ Input: a directed graph G, and a pair of nodes s and t in G
◦ Question: does there exist a path from s to t in G?
Non-localized: t may be far from s
Example: is Michael connected to Eric via social links?
Figure: social graph with Michael, cc1, cl3, cl4, cl5, cl6, cl7, cl9, cl16, cln-1, …, Eric
Does dynamic reduction work for non-localized queries?
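For reference, the exact (non-approximate) baseline for the reachability question is a breadth-first search, which may touch the whole graph when t is far from s. A minimal version, assuming the graph is given as an adjacency dictionary:

```python
from collections import deque


def reachable(adj, s, t):
    """Return True if there is a directed path from s to t in adj."""
    if s == t:
        return True
    seen = {s}
    queue = deque([s])
    while queue:
        v = queue.popleft()
        for w in adj.get(v, ()):
            if w == t:
                return True
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return False
```

Its cost is linear in the size of G in the worst case, which is what resource-bounded reachability tries to avoid.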

Resource-bounded reachability
Big graph G → small tree index G_Q, O(|G|)
Reduction size: |G_Q| ≤ α|G|
Reachability query results: approximation (experimentally verified), no false positives, in time O(α|G|)
Yes, dynamic reduction works for non-localized queries

Preprocessing: landmarks
Pipeline: preprocessing, then dynamic reduction (compute landmark index), then approximate query evaluation over the landmark index
Landmarks
◦ a landmark node covers a certain number of node pairs
◦ reachability of the pairs it covers can be computed by landmark labels
Example labels: cc1, "I can reach cl3"; cln-1, "cl3 can reach me"
A revision of 2-hop covers
Search the landmark index instead of G; index size ≤ α|G|
Figure: social graph with Michael, cc1, cl3, cl4, cl5, cl6, cl7, cl9, cl16, cln-1, …, Eric
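The landmark idea can be illustrated with a flat (non-hierarchical) index, which is a simplification of what the slides describe. Each landmark stores the nodes that can reach it and the nodes it can reach; a query answers "yes" only when some landmark lies on a path from s to t. Such an index never reports a false positive, but it can miss pairs that no landmark covers. The function names and the graph in the usage note are assumptions for illustration.

```python
def reach_set(adj, src):
    """All nodes reachable from src in adj, including src itself."""
    seen = {src}
    stack = [src]
    while stack:
        v = stack.pop()
        for w in adj.get(v, ()):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def build_landmark_labels(adj, landmarks):
    """Map each landmark to (nodes that reach it, nodes it reaches)."""
    rev = {}
    for v, ws in adj.items():
        for w in ws:
            rev.setdefault(w, []).append(v)
    return {m: (reach_set(rev, m), reach_set(adj, m)) for m in landmarks}


def landmark_reachable(labels, s, t):
    """True only if some landmark certifies a path from s to t."""
    if s == t:
        return True
    return any(s in anc and t in desc for anc, desc in labels.values())
```

On the hypothetical chain Michael → cc1 → cl3 → Eric with cl3 as the only landmark, the query (Michael, Eric) is answered positively, while a pair connected by a path that avoids cl3 would be reported as unreachable.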

Hierarchical landmark Index
Landmark Index
◦ landmark nodes are selected to encode pairwise reachability
◦ Hierarchical indexing: apply multiple rounds of landmark selection to construct a tree of landmarks
A node v can reach v' if there exist v1, v2, v3 in the index such that v reaches v1, v2 reaches v', and v1 and v2 are connected to v3 at the same level (coding)
Figure: data graph with Michael, cc1, cl3, cl4, cl5, cl6, cl7, cl9, cl16, cln-1, …, Eric; landmark tree over cc1, cl7, cln-1, cl16, cl3, cl5, cl6, cl4, cl9, …

Hierarchical landmark Index
Landmark Index
◦ landmark nodes are selected to encode pairwise reachability
◦ Hierarchical indexing: apply multiple rounds of landmark selection to construct a tree of landmarks
Auxiliary information: cover size, landmark labels/encoding, topological rank/range
◦ Boolean guarded condition (v, vp, v'): whether v can possibly reach v' via vp
◦ Cost function c(v): size of unvisited landmarks in the subtree rooted at v
◦ Potential P(v): total cover size of unvisited landmarks as the children of v
Guided search on landmark index

Resource-bounded reachability
Pipeline: preprocessing, then dynamic reduction (compute landmark index), then approximate query evaluation over the landmark index
Bi-directed guided traversal between Michael and Eric over the landmark tree, using local auxiliary information: "drill down" and "roll up"
Figure annotations: Condition = FALSE; Condition = ?, Cost=9, Potential=46; Condition = ?, Cost=2, Potential=9; Condition = TRUE
Drill down and roll up

Accuracy
Figure: accuracy on Yahoo, varying α
◦ achieves 100% accuracy when α > 0.05%
◦ 100% accuracy for … × 1PB (10^15 B) = … = 100GB

Efficiency: resource bounded reachability
Figure: efficiency on Yahoo, varying α
RBreach is 62.5 times faster than BFS and 5.7 times faster than BFS-OPT
… × 1PB (10^15 B) = … = 100GB

Summing up

Approximate query answering
Challenges: to get real-time answers
◦ Big data and costly queries
◦ Limited resources
Two approaches:
◦ Query-driven approximation: cheaper queries; retain sensible answers
◦ Data-driven approximation: dynamic data reduction; query-guided search
Combined with techniques for making big data small
Reduce data of PB size to GB
Yes, we can query big data within bounded resources!

Papers for you to review
◦ G. Gou and R. Chirkova. Efficient algorithms for exact ranked twig-pattern matching over graphs. In SIGMOD.
◦ H. Shang, Y. Zhang, X. Lin, and J. X. Yu. Taming verification hardness: an efficient algorithm for testing subgraph isomorphism. PVLDB.
◦ R. T. Stern, R. Puzis, and A. Felner. Potential search: A bounded-cost search algorithm. In ICAPS. (search Google Scholar)
◦ S. Zilberstein, F. Charpillet, P. Chassaing, et al. Real-time problem solving with contract algorithms. In IJCAI. (search Google Scholar)
◦ W. Fan, X. Wang, and Y. Wu. Diversified Top-k Graph Pattern Matching. VLDB. (query-driven approximation)
◦ W. Fan, X. Wang, and Y. Wu. Querying big graphs with bounded resources. SIGMOD. (data-driven approximation)

Reading
◦ M. Arenas, L. E. Bertossi, J. Chomicki. Consistent Query Answers in Inconsistent Databases. PODS.
◦ Indrajit Bhattacharya and Lise Getoor. Collective Entity Resolution in Relational Data. TKDD. harya-tkdd.pdf
◦ P. Li, X. Dong, A. Maurino, and D. Srivastava. Linking Temporal Records. VLDB.
◦ W. Fan and F. Geerts. Relative information completeness. PODS.
◦ Y. Cao, W. Fan, and W. Yu. Determining relative accuracy of attributes. SIGMOD.
◦ P. Buneman, S. Davidson, W. Fan, C. Hara and W. Tan. Keys for XML. WWW 2001.