Very Fast Similarity Queries on Semi-Structured Data from the Web
Bhavana Dalvi, William W. Cohen
Language Technologies Institute, Carnegie Mellon University

Poster panels: Contributions; Preprocessing to create PIC-D; PIC-D Representation for Entities on the Web; Query Runtime Speedup vs. Results Quality

Acknowledgements: This work is supported by Google and the Intelligence Advanced Research Projects Activity (IARPA) via Air Force Research Laboratory (AFRL) contract number FA C

Conclusions:
PIC-D is a single low-dimensional representation for entities on the Web, built using Power Iteration Clustering (PIC) by Lin and Cohen (ICML).
#dimensions in PIC-D = √(total number of dimensions).
Time to create PIC-D is linear in the total number of dimensions.
Information extraction tasks are posed as similarity queries on PIC-D.
Precision and recall are comparable to the high-dimensional baseline.
Query run-time improves by up to 2 orders of magnitude, at the cost of a small amount of pre-processing time to create PIC-D.

Figure (preprocessing pipeline):
|X| × n_1 bipartite graph (e.g., entities in HTML tables) -> PIC -> |X| × m PIC embedding, m << n_1
|X| × n_2 bipartite graph (e.g., entities with Hearst patterns) -> PIC -> |X| × m PIC embedding, m << n_2
|X| × n_D bipartite graph (e.g., entities in Subj-Verb-Obj triples) -> PIC -> |X| × m PIC embedding, m << n_D
Concatenate the D embeddings -> |X| × (D × m) PIC-D embedding

Hypothesis: PIC-D embeddings will cluster similar entities (entities belonging to the same class) together.
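The preprocessing pipeline (one PIC embedding per bipartite graph, then concatenation) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation. Each graph is assumed to be a dense nonnegative |X| × n_i matrix F, and the affinity between entities is assumed to be F F^T, applied implicitly so that each iteration costs time linear in the size of F. Each of the m columns is assumed to start from its own random vector, and a fixed small number of iterations stands in for PIC's early-stopping rule. The function names and the toy matrices are hypothetical.

```python
import numpy as np

def pic_embedding(F, m, n_iter=10, seed=0):
    """Sketch of an m-dimensional PIC embedding for one |X| x n graph F.

    Assumption: affinity A = F F^T, never materialised; we only compute
    F @ (F.T @ V), which is linear in the size of F.
    """
    rng = np.random.default_rng(seed)
    n_entities = F.shape[0]
    # Row sums of the implicit affinity matrix: A 1 = F (F^T 1).
    degree = F @ F.sum(axis=0)
    degree[degree == 0] = 1.0  # isolated entities keep a zero row
    # m independent random starting vectors, L1-normalised per column.
    V = rng.random((n_entities, m))
    V /= np.abs(V).sum(axis=0, keepdims=True)
    for _ in range(n_iter):
        # One power-iteration step with W = D^-1 A, then renormalise.
        V = (F @ (F.T @ V)) / degree[:, None]
        V /= np.abs(V).sum(axis=0, keepdims=True)
    return V

def pic_d_embedding(graphs, m, n_iter=10, seed=0):
    """Concatenate one PIC embedding per graph into |X| x (D * m)."""
    parts = [pic_embedding(F, m, n_iter, seed + i)
             for i, F in enumerate(graphs)]
    return np.hstack(parts)

# Hypothetical toy data: rows are USA, India, Football, Hockey, Baseball.
hearst = np.array([[1, 1, 0],   # columns: Country, Location, Sports
                   [1, 1, 0],
                   [0, 0, 1],
                   [0, 0, 1],
                   [0, 0, 1]], dtype=float)
tables = np.array([[1, 1, 0, 0],  # columns: TC-1 .. TC-4
                   [1, 1, 0, 0],
                   [0, 0, 1, 0],
                   [0, 0, 1, 1],
                   [0, 0, 0, 1]], dtype=float)

emb = pic_d_embedding([hearst, tables], m=2)
print(emb.shape)  # (5, 4): D = 2 graphs, m = 2 dimensions each
```

In this toy setup, USA and India have identical rows in both graphs, so their embedding rows coincide, while the sports entities settle on different values.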
Figure (toy example): nodes USA, India, Football, Hockey, Baseball; Country, Location, Sports; TC-1, TC-2, TC-3, TC-4. Panels: entity occurrences in text with Hearst patterns; entity occurrences in HTML table columns; example PIC-3 embedding, m = 2.

Datasets: Toy_Apple, Delicious_Sports, ASIA_INT, Clueweb_Sports
Property (description): values
#HTML pages: 57421K121K918K
|X| (# entities): 15K43815K30K
|C| (# table columns): K78K
|(x, c)| (# (x, c) edges): 70.5K, 5.5K, 91K, 566K
|Ys| (# suchas concepts): 2.3K, 1.6K, 3.8K, 21.4K
|(x, Ys)| (# (x, Ys) edges): 7.7K, 4.8K, 18.3K, 107.8K
|Yn| (# NELL classes): 11323
|(x, Yn)| (# (x, Yn) edges)
|Yc| (# manual column labels)
|(c, Yc)| (# (c, Yc) pairs)
#PIC-D dimensions
Total time to create PIC-D (msec)

Hypothesis: Entities that co-occur in multiple table columns, or with similar suchas concepts, probably belong to the same class label.

IE Tasks as Similarity Queries:
Plots: Set Expansion task on Clueweb_Sports; ASIA task on Clueweb_Sports.
Similarity queries on PIC-D are up to 2 orders of magnitude faster.
PIC-D gives precision/recall comparable to the high-dimensional baseline.
Label propagation achieves better performance, at the cost of much larger query runtimes.

Contributions:
We present PIC-D, a single, efficiently constructible representation for entities on the Web.
IE tasks can be posed as similarity queries on the PIC-D representation: Set Expansion, Automatic Set Instance Acquisition (ASIA), and Column Classification.
PIC-D yields large savings in query run-time with comparable quality of results.

Future work: Using the PIC-D representation with many more views of the data (e.g., SVO triples, properties derived from KBs) for unsupervised class-instance pair acquisition.

Aggregate results over:
Set Expansion: 272 queries (Delicious_Sports) and 152 queries (Toy_Apple)
ASIA: 25 queries (Delicious_Sports)
COL-CLASS: 925 queries (Delicious_Sports) and 156 queries (Toy_Apple)

How many PIC-D dimensions are enough? How much time does it take to create PIC-D?
m = √n and time = O(n)

Result plots: Set Expansion; ASIA; Column Classification
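To make "set expansion as a similarity query on PIC-D" concrete, the sketch below ranks entities by their distance to the centroid of a few seed entities. The poster does not state which similarity measure is used, so Euclidean distance to the seed centroid is an assumption here, and the function name `expand_set` and the embedding values are hypothetical.

```python
import numpy as np

def expand_set(embedding, names, seeds, k=2):
    """Return the k non-seed entities closest to the seed centroid.

    Assumption: similarity is Euclidean distance in the low-dimensional
    embedding space; the poster does not specify the measure.
    """
    index = {name: i for i, name in enumerate(names)}
    seed_rows = [index[s] for s in seeds]
    centroid = embedding[seed_rows].mean(axis=0)
    # Brute-force scan over all entities in the low-dimensional space.
    dist = np.linalg.norm(embedding - centroid, axis=1)
    ranked = [names[i] for i in np.argsort(dist) if names[i] not in seeds]
    return ranked[:k]

# Hypothetical 2-dimensional embedding, invented for illustration only.
names = ["USA", "India", "Football", "Hockey", "Baseball"]
embedding = np.array([[0.90, 0.10],
                      [0.85, 0.15],
                      [0.10, 0.80],
                      [0.12, 0.82],
                      [0.15, 0.90]])

print(expand_set(embedding, names, ["Football"], k=2))  # ['Hockey', 'Baseball']
print(expand_set(embedding, names, ["USA"], k=1))       # ['India']
```

Each query touches only the D × m embedding columns per entity rather than all n original dimensions, which is one plausible reading of where the reported query-time savings come from.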