Very Fast Similarity Queries on Semi-Structured Data from the Web Bhavana Dalvi, William W. Cohen Language Technologies Institute, Carnegie Mellon University.

Poster panels: Contributions; Preprocessing to create PIC-D; PIC-D Representation for Entities on the Web; Query Runtime Speedup vs. Results Quality.

Acknowledgements: This work is supported by Google and the Intelligence Advanced Research Projects Activity (IARPA) via Air Force Research Laboratory (AFRL) contract number FA8650-10-C-7058.

Conclusions
• PIC-D: a single low-dimensional representation for entities on the Web, using Power Iteration Clustering (PIC) by Lin and Cohen, ICML 2010.
• #dimensions in PIC-D = √(total number of dimensions).
• Time to create PIC-D is linear in the total number of dimensions.
• Information extraction tasks are posed as similarity queries on PIC-D.
• Comparable precision and recall w.r.t. the high-dimensional baseline.
• Up to 2 orders of magnitude improvement at query run-time, incurring a small amount of pre-processing time to create PIC-D.

Preprocessing to create PIC-D
[Figure: each view of the data is a bipartite graph over the entity set X; PIC is run on each view and the resulting embeddings are concatenated.]
• |X| * n_1 bipartite graph (e.g., entities in HTML tables) -> PIC -> |X| * m PIC embedding, m << n_1
• |X| * n_2 bipartite graph (e.g., entities with Hearst patterns) -> PIC -> |X| * m PIC embedding, m << n_2
• |X| * n_D bipartite graph (e.g., entities in Subj-Verb-Obj triples) -> PIC -> |X| * m PIC embedding, m << n_D
• Concatenate the D embeddings -> |X| * D * m PIC-D embedding

Hypothesis: PIC-D embeddings will cluster similar entities (entities belonging to the same class) together.
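The preprocessing pipeline above (run PIC on each bipartite view, then concatenate) might be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each view is a dense non-negative |X| * n_i matrix F, takes the affinity between entities to be F F^T, uses random start vectors to obtain the m dimensions, and runs a fixed number of power iterations in place of PIC's early-stopping rule. The function names, the `n_iter` parameter, and the toy edges are all hypothetical.

```python
import numpy as np

def pic_embedding(F, m, n_iter=10, seed=0):
    """Sketch of an |X| x m PIC-style embedding for one bipartite view F (|X| x n)."""
    rng = np.random.default_rng(seed)
    F = np.asarray(F, dtype=float)
    # Row sums of the affinity matrix A = F F^T, computed without building A.
    degree = F @ F.sum(axis=0)
    degree[degree == 0] = 1.0  # guard against entities with no edges
    # m random start vectors, each L1-normalized.
    V = rng.random((F.shape[0], m))
    V /= V.sum(axis=0, keepdims=True)
    for _ in range(n_iter):
        # One step of v <- D^-1 A v, using A v = F (F^T v).
        V = (F @ (F.T @ V)) / degree[:, None]
        V /= np.abs(V).sum(axis=0, keepdims=True)
    return V

def pic_d(views, m, n_iter=10, seed=0):
    """Concatenate the per-view embeddings into an |X| x (D*m) matrix."""
    parts = [pic_embedding(F, m, n_iter, seed + i) for i, F in enumerate(views)]
    return np.hstack(parts)

# Toy data (made up for illustration): 5 entities, two views.
names = ["USA", "India", "Football", "Hockey", "Baseball"]
tables = np.array([[1, 1, 0, 0],   # entity x table-column edges
                   [1, 0, 0, 0],
                   [0, 0, 1, 1],
                   [0, 0, 1, 1],
                   [0, 0, 0, 1]], dtype=float)
hearst = np.array([[1, 1, 0],      # entity x Hearst-concept edges
                   [1, 0, 0],
                   [0, 0, 1],
                   [0, 0, 1],
                   [0, 0, 1]], dtype=float)

embedding = pic_d([tables, hearst], m=2)
print(embedding.shape)
```

Each iteration costs time proportional to the number of entries in F, because F F^T is never materialized; this is one way to stay within the linear-time preprocessing the poster reports.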
PIC-D Representation for Entities on the Web
[Figure: entities (USA, India, Football, Hockey, Baseball) linked to Hearst-pattern concepts (Country, Location, Sports) and to HTML table columns (TC-1, TC-2, TC-3, TC-4). Edge groups: "Entity occurrences in text with Hearst patterns" and "Entity occurrences in HTML table columns".]

Example PIC-3 embedding, m = 2

Entity     X1    X2    Y1    Y2
USA        0.23  0.76  0.43  0.66
India      0.21  0.79  0.41  0.69
Football   0.36  0.80  0.66  0.35
Hockey     0.35  0.82  0.16  0.92
Baseball   0.34  0.79  0.14  0.89

Datasets

Property    Description             Toy_Apple  Delicious_Sports  ASIA_INT  Clueweb_Sports
            # HTML pages            574        21K               121K      918K
|C|         # table columns         156        925               8K        78K
|(x, c)|    # (x, c) edges          70.5K      5.5K              91K       566K
|Ys|        # suchas concepts       2.3K       1.6K              3.8K      21.4K
|(x, Ys)|   # (x, Ys) edges         7.7K       4.8K              18.3K     107.8K
|(c, Yc)|   # (c, Yc) pairs         156        925               -         -

Rows whose per-dataset values are run together in the transcript, reproduced as found:
|X|         # Entities                          15K43815K30K
|Yn|        # NELL classes                      11323
|(x, Yn)|   # (x, Yn) edges                     41939691977
|Yc|        # manual column labels              3130--
            # PIC-D dimensions                  51 110317
            Total time to create PIC-D (msec)   49.75369.70.0576

Hypothesis: Entities co-occurring in multiple table columns or with similar suchas concepts probably belong to the same class label.

IE Tasks as Similarity Queries
[Figure panels: Set Expansion task on Clueweb_Sports; ASIA task on Clueweb_Sports.]
• Similarity queries on PIC-D are up to 2 orders of magnitude faster.
• PIC-D results in comparable precision/recall w.r.t. the high-dimensional baseline.
• Label propagation achieves better performance at the cost of huge query runtimes.

Contributions
• We present a single, efficiently constructible representation for entities on the Web, named the PIC-D representation.
• IE tasks can be posed as similarity queries on the PIC-D representation: Set Expansion, Automatic Set Instance Acquisition (ASIA) and Column Classification.
• PIC-D results in huge savings in query run-time with comparable quality of results.
• Future work: using the PIC-D representation with many more views of the data (e.g., SVO triples, properties derived from KBs) for unsupervised class-instance pair acquisition.
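Posing set expansion as a similarity query on PIC-D could look like the sketch below. The poster does not spell out the query procedure or the similarity measure, so two things here are assumptions: entities are ranked by cosine similarity, and the query point is the mean of the seed entities' unit-normalized embeddings. The function name `set_expansion` and its parameters are hypothetical; the numbers are the example PIC-3 embedding values shown on the poster.

```python
import numpy as np

def set_expansion(embedding, names, seeds, k=3):
    """Rank non-seed entities by cosine similarity to the mean seed vector.

    Assumption: cosine similarity and seed averaging are illustrative
    choices, not taken from the poster.
    """
    # Unit-normalize rows so that dot products are cosine similarities.
    unit = embedding / np.linalg.norm(embedding, axis=1, keepdims=True)
    seed_rows = [names.index(s) for s in seeds]
    query = unit[seed_rows].mean(axis=0)
    scores = unit @ query
    ranked = np.argsort(-scores)  # highest score first
    return [names[i] for i in ranked if names[i] not in seeds][:k]

# Example embedding values from the poster (columns X1, X2, Y1, Y2).
names = ["USA", "India", "Football", "Hockey", "Baseball"]
embedding = np.array([
    [0.23, 0.76, 0.43, 0.66],  # USA
    [0.21, 0.79, 0.41, 0.69],  # India
    [0.36, 0.80, 0.66, 0.35],  # Football
    [0.35, 0.82, 0.16, 0.92],  # Hockey
    [0.34, 0.79, 0.14, 0.89],  # Baseball
])

print(set_expansion(embedding, names, ["USA"], k=1))
print(set_expansion(embedding, names, ["Hockey"], k=1))
```

Because every entity is a short dense vector of D * m numbers, each query reduces to one matrix-vector product over the entity set, which is consistent with the query-time speedups the poster reports relative to the high-dimensional baseline.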
Query Runtime Speedup vs. Results Quality
[Figure panels: Set Expansion; ASIA; Column Classification.]
Aggregate results over:
• Set expansion: 272 queries (Delicious_Sports) and 152 queries (Toy_Apple)
• ASIA: 25 queries (Delicious_Sports)
• COL-CLASS: 925 queries (Delicious_Sports) and 156 queries (Toy_Apple)

How many PIC-D dimensions are enough? How much time does it take to create PIC-D?
m = √n and time = O(n).
