Xiaoyong Chai, Ba-Quy Vuong, AnHai Doan, Jeffrey F. Naughton University of Wisconsin-Madison Efficiently Incorporating User Feedback into Information Extraction.

Presentation transcript:

Xiaoyong Chai, Ba-Quy Vuong, AnHai Doan, Jeffrey F. Naughton University of Wisconsin-Madison Efficiently Incorporating User Feedback into Information Extraction and Integration Programs

2 The Need for Incorporating User Feedback
[Figure; surviving labels: Panels, Chair]

3 Current Approach
[Figure; surviving labels: …, Code, Data]

4 This Is Not Just For DBLife
A growing number of applications use IE and II
Almaden Univ. of Berlin of Washington
– …
A systematic user-feedback solution could significantly benefit them

5 What User Feedback To Incorporate?
Types of User Feedback: Flagging an Error, Fixing an Error, Editing Data, Editing Code
Input, Intermediate Results, Output

6 Challenges
How to expose program data for user feedback?
How to incorporate user feedback?
How to efficiently execute a program?

7 Exposing Program Data for User Feedback
[Figure: Extracting conference services. A workflow over the tables dataSources, roles, and services with the operators crawl, extractConf, extractNames, and findRoles; views over these tables are shown through user interfaces.]
Example tuples: dataSources(date = 09/01/2008, url = http://.../cidr09/); services(name = Joe Hellerstein, role = PC Chair, conf = CIDR 2009)
Views: (url); (name, conf, role); (name, role, page)
User Interfaces: Form, Spreadsheet, Wiki

8 Writing User-Feedback Rules to Expose Program Data
Write extraction program, e.g., in xlog [Shen et al, 07]
  R1: pages(page) :- dataSources(url, date), crawl(url, page)
  R2: conferences(conf, page) :- pages(page), extractConf(page, conf)
  R3: names(name, page) :- pages(page), extractNames(page, name)
  R4: roles(name, role, page) :- names(name, page), findRoles(name, page, role)
  R5: services(name, conf, role) :- conferences(conf, page), roles(name, role, page)
Write user-feedback rules to specify views and user interfaces
  R6: dataSourcesForUserFeedback(url)#form-UI :- dataSources(url, date), date >= "01/01/2009"
  R7: rolesForUserFeedback(pos, page#no-edit)#spreadsheet-UI :- roles(role, page)
  R8: servicesForUserFeedback(name, conf, role)#wiki-UI :- services(name, conf, role)
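To make the rules on this slide concrete, here is a minimal Python sketch of how R1 through R5 could be evaluated bottom-up, plus the R6 view. It is not xlog and not the authors' engine. The operator bodies, the `FAKE_WEB` dictionary, the `example.org` URL, and the MM/DD/YYYY date format are all assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical stand-ins for the blackbox operators; the real ones crawl and parse Web pages.
FAKE_WEB = {"http://example.org/cidr09/": "CIDR 2009 | PC Chair: Joe Hellerstein"}

def crawl(url):
    return FAKE_WEB.get(url)

def extract_conf(page):
    return ["CIDR 2009"] if "CIDR 2009" in page else []

def extract_names(page):
    return ["Joe Hellerstein"] if "Joe Hellerstein" in page else []

def find_roles(name, page):
    return ["PC Chair"] if f"PC Chair: {name}" in page else []

def run_program(data_sources):
    # R1: pages(page) :- dataSources(url, date), crawl(url, page)
    pages = {crawl(url) for url, _date in data_sources} - {None}
    # R2: conferences(conf, page) :- pages(page), extractConf(page, conf)
    conferences = {(c, p) for p in pages for c in extract_conf(p)}
    # R3: names(name, page) :- pages(page), extractNames(page, name)
    names = {(n, p) for p in pages for n in extract_names(p)}
    # R4: roles(name, role, page) :- names(name, page), findRoles(name, page, role)
    roles = {(n, r, p) for n, p in names for r in find_roles(n, p)}
    # R5: services(name, conf, role) :- conferences(conf, page), roles(name, role, page)
    return {(n, c, r) for c, p in conferences for n, r, p2 in roles if p == p2}

def data_sources_for_user_feedback(data_sources, cutoff="01/01/2009"):
    # R6 view: expose only sources dated on or after the cutoff (dates assumed MM/DD/YYYY).
    parse = lambda s: datetime.strptime(s, "%m/%d/%Y")
    return {url for url, date in data_sources if parse(date) >= parse(cutoff)}

sources = [("http://example.org/cidr09/", "09/01/2008")]
services = run_program(sources)
```

With the single source above, `services` holds one tuple for Joe Hellerstein, while the R6 view is empty because the source date precedes the cutoff.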

9 Program Semantics
[Figure: the workflow of slide 7 (dataSources, crawl, extractConf, extractNames, findRoles, roles, services) together with the views (url), (name, conf, role), and (name, role, page) and the user interfaces Form, Spreadsheet, and Wiki.]

10 Incorporating Previous User Feedback
Feedback t → t' on operator p (input I, output O, revised output O')
Interpretation: for operator p, if t is in the output, change t into t'
Example (extractNames):
  Input pages: p1 = "… D. Smith, A. Jones,..."; p2 = "Dr. A. Smith is..."
  Output (name, page): (A. Smith, p1), (A. Jones, p1), (A. Smith, p2)
  Feedback: Change "A. Smith" to "D. Smith"
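The interpretation on this slide can be sketched in a few lines of Python. The function name `apply_feedback` and the choice to store feedback as a dictionary from t to t' are assumptions for illustration. The sketch also assumes the feedback targets the tuple extracted from page p1.

```python
def apply_feedback(output, feedback):
    """Rewrite operator output: every tuple t with recorded feedback becomes t'."""
    return [feedback.get(t, t) for t in output]

# Output O of extractNames as (name, page) tuples.
O = [("A. Smith", "p1"), ("A. Jones", "p1"), ("A. Smith", "p2")]

# Recorded feedback t -> t': change "A. Smith" to "D. Smith" on page p1.
feedback = {("A. Smith", "p1"): ("D. Smith", "p1")}

O_prime = apply_feedback(O, feedback)
```

Only the tuple named in the feedback changes; the identical name extracted from p2 is left alone because it is a different tuple.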

11 Interpreting User Feedback Based On Tuple Provenance
Provenance of output tuple t:
– the set of input tuples that operator p used to produce t
Example (extractNames): (A. Smith, p1) and (A. Jones, p1) have provenance {p1}; (A. Smith, p2) has provenance {p2}
Feedback: Change "A. Smith" to "D. Smith"
If the operator produces {"A. Smith", "A. Jones"} from {p1}, then replace {"A. Smith", "A. Jones"} with {"D. Smith", "A. Jones"}
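One way to read the provenance-based interpretation is as a lookup keyed on (provenance, produced output). The Python sketch below follows that reading. The `corrections` store, the helper names, and the hard-coded extractor are hypothetical, not the system's actual data structures.

```python
def extract_names(page_id):
    # Hypothetical extractor that misreads "D. Smith" on p1 as "A. Smith".
    return {"p1": {"A. Smith", "A. Jones"}, "p2": {"A. Smith"}}[page_id]

# (provenance, produced output) -> corrected output
corrections = {}

def record_feedback(provenance, produced, corrected):
    corrections[(frozenset(provenance), frozenset(produced))] = frozenset(corrected)

def run_with_feedback(page_id):
    produced = frozenset(extract_names(page_id))
    key = (frozenset({page_id}), produced)
    # Replace the output only if the same output arises from the same provenance.
    return set(corrections.get(key, produced))

record_feedback({"p1"}, {"A. Smith", "A. Jones"}, {"D. Smith", "A. Jones"})
names_p1 = run_with_feedback("p1")
names_p2 = run_with_feedback("p2")
```

Keying on provenance is what keeps the correction from leaking onto p2, where "A. Smith" may be a correct extraction.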

12 Challenges
How to expose program data for user feedback?
How to incorporate user feedback?
How to efficiently execute a program?
– Incremental execution
– Improved concurrency control

13 Incrementally Executing the Program
[Figure: extractNames applied to pages p1 and p2, then to p1, p2, and a new page p3, with the new output marked "?"]
Similar problem in incremental view maintenance
Incremental-update properties
– Closed-form insertion
– Closed-form deletion
– Input partitionability
– Partition correlation
– Attribute independence
extractNames(I + ΔI) = extractNames(I) + extractNames(ΔI)
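The identity extractNames(I + ΔI) = extractNames(I) + extractNames(ΔI) holds whenever the operator processes each page independently. The Python sketch below checks it for a toy comma-splitting extractor, which is an assumption standing in for the real blackbox operator. The page contents and the new page p3 are made up.

```python
def extract_names(pages):
    """Toy per-page extractor: pages maps page_id -> text; returns (name, page_id) tuples."""
    out = set()
    for page_id, text in pages.items():
        for token in text.split(","):
            name = token.strip()
            if name:
                out.add((name, page_id))
    return out

I = {"p1": "D. Smith, A. Jones", "p2": "A. Smith"}
delta_I = {"p3": "J. Doe"}  # newly inserted page

cached = extract_names(I)                        # result from the previous run
incremental = cached | extract_names(delta_I)    # reuse cache, process only the new page
full = extract_names({**I, **delta_I})           # rerun from scratch for comparison
```

Here `incremental` and `full` are equal, so only the new page needs to be processed. An operator whose output for one page depends on other pages would not satisfy this property.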

14 Concurrently Executing Transactions
[Figure: transactions T1 and T2 running over the workflow of slide 7 (dataSources, crawl, extractConf, extractNames, findRoles, roles, services).]
Table-Locking: Locks only the input and output tables of the crawl operator
Operator-Skipping: Skips executing the join operator after updating the roles table
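A rough Python sketch of why table-level locks admit more concurrency than locking the whole workflow follows. The operator-to-table map mirrors rules R1 through R5, with the final join given the hypothetical name `join`. Treating graph-locking as "lock every table" is an assumption about the baseline, and read/write lock modes are ignored for brevity.

```python
# operator -> (input tables, output tables), following rules R1-R5
OPERATORS = {
    "crawl":        ({"dataSources"}, {"pages"}),
    "extractConf":  ({"pages"}, {"conferences"}),
    "extractNames": ({"pages"}, {"names"}),
    "findRoles":    ({"names"}, {"roles"}),
    "join":         ({"conferences", "roles"}, {"services"}),
}
ALL_TABLES = set().union(*(ins | outs for ins, outs in OPERATORS.values()))

def table_locks(op):
    # Table-locking: lock only the operator's own input and output tables.
    ins, outs = OPERATORS[op]
    return ins | outs

def graph_locks(op):
    # Assumed baseline: lock every table in the workflow graph.
    return set(ALL_TABLES)

def conflicts(op_a, op_b, lock_fn):
    # Two transactions conflict if their lock sets overlap.
    return bool(lock_fn(op_a) & lock_fn(op_b))

crawl_vs_roles_table = conflicts("crawl", "findRoles", table_locks)
crawl_vs_roles_graph = conflicts("crawl", "findRoles", graph_locks)
```

Under this model a transaction at crawl and one at findRoles can run together with table-level locks, whereas the whole-graph baseline serializes them.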

15 Experiment Setup
Testbed
– A 5-stage DBLife workflow
– 13 blackbox operators: 6 IE operators and 3 II operators
Wrote xlog program and user-feedback rules in < 1 hr
Simulated user-feedback transactions
– On each stage of the workflow
– Each transaction randomly deletes, inserts, or modifies 1/10 of the tuples in a table

16 Incremental-Update Properties are Broadly Applicable
[Table: DBLife operators (rows) against incremental-update properties ci, cd, ip, ai, pc (columns); the per-cell marks are not preserved in this transcript.]
DBLife Operators
– Get Data Pages
– Get People Variations
– Get Publication Variations
– Get Organization Variations
– Find People Variations
– Find Publication Variations
– Find Organization Variations
– Find People Entities
– Find Publication Entities
– Find Organization Entities
– Find Related People
– Find Authorship
– Find Related Organizations

17 Incremental Update Reduces Execution Time

18 Table-Locking and Operator-Skipping Improve Concurrency Degree
Increase transaction throughput by 50% and 500%
Reduce transaction response time by 43% and 98%
                    Min    Max      Average
Graph-locking       ~0s    7,584s   3,203s
Table-locking        1s    5,485s   1,841s   (-43%)
Operator-skipping   ~0s      457s      43s   (-98%)
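The reported reductions can be checked against the average response times on this slide. The short Python calculation below uses only the three averages from the slide. The variable and function names are illustrative.

```python
# Average response times in seconds, as reported on the slide.
avg = {"graph": 3203, "table": 1841, "skipping": 43}

def reduction_pct(baseline, value):
    """Percentage reduction of value relative to baseline."""
    return (baseline - value) / baseline * 100

table_cut = reduction_pct(avg["graph"], avg["table"])        # about 42.5
skipping_cut = reduction_pct(avg["graph"], avg["skipping"])  # about 98.7
```

Both values agree with the slide's 43% and 98% up to rounding, taking graph-locking as the baseline in each case.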

19 Related Work
User feedback in IE and II
– [Doan et al, 01], [Chiticariu et al, 08], [Jeffery et al, 08]
– Leveraging user feedback to improve results of individual operations
Provenance
– [Woodruff & Stonebraker, 97], [Cui & Widom, 01], [Buneman et al, 01], [Bohannon et al, 08], [Huang et al, 08]
Incremental execution
– View maintenance [Blakeley et al, 86], [Griffin & Libkin, 95], [Gupta & Mumick, 95]
– Schema matching [Bernstein et al, 06], IE [Chen et al, 07]

20 Conclusions and Future Work
Incorporating user feedback into IE and II programs is important
Identify key issues and provide initial solutions:
– Write user-feedback rules to expose program data to UIs
– Model and incorporate user feedback
– Efficiently execute program to process user feedback
Future work:
– Handle unreliable user feedback
– Propagate user feedback down in the workflow
– Conduct user study