Approximated Provenance for Complex Applications

Presentation transcript:

Approximated Provenance for Complex Applications
Susan B. Davidson, University of Pennsylvania
Eleanor Ainy, Daniel Deutch, Tova Milo, Tel Aviv University

Notes:
- Intro
  - Science is changing
  - Importance of wf and prov
  - Vision
- Provenance: wf vs db
  - Related work
- Privacy
  - Privacy concerns
  - Initial results on module privacy
  - Composite modules as a hiding technique, can be used for privacy
- Search and access control
  - Hierarchical wf model
  - Search
  - Access control
  - Search w access control
  - Related work

Crowd Sourcing
- The engagement of crowds of Web users for data procurement and knowledge creation.
- Crowd-sourcing: harness the crowd to perform some task
  - Galaxy Zoo: classifying galaxies according to their shape
  - Captcha: Completely Automated Public Turing test to tell Computers and Humans Apart

Why now?
- We are all connected, all the time!

Complexity?
- Many of the initial applications were quite simple
  - Specify a Human Intelligence Task (HIT) using e.g. Mechanical Turk, collect the responses, and aggregate them to form the result.
- Newer ideas are multi-phase and complex, e.g. mining frequent fact sets from the crowd (OASSIS)
  - Model as workflows with global state

Outline
- “State-of-the-art” in crowd data provenance
- New challenges
- A proposal for modeling crowd data provenance

Crowd data provenance?
- TripAdvisor: aggregates reviews and presents average ratings
  - Individual reviews are part of the provenance
- Wikipedia: keeps extensive information about how pages are edited
  - ID of the user who generated the page, as well as changes to the page (when, who, summary)
  - Provides several views of this information, e.g. by page or by editor
- Mainly used for presentation and explanation

The TripAdvisor (TA) rank is based on an algorithm that accounts not only for the ratings given but also for the number of reviews and their age, with newer reviews receiving more weight than older ones. What that specific algorithm is, none of us know.

Challenges for crowd data provenance
- Complexity of processes and number of user inputs involved
- Provenance can be very large, leading to difficulties in viewing and understanding provenance
- Need for
  - Summarization
  - Multidimensional views
  - Provenance mining
  - Compact representation for maintenance and cleaning

Data is collected through a variety of means, ranging from simple questions (e.g. “In which picture is this person the oldest?” or “What is the capital of France?”), to Datalog-style reasoning [7], to dynamic processes that evolve as answers are received from the crowd (e.g. mining association rules from the crowd [4]). The complexity of the process, together with the number of user inputs involved in the derivation of even a single value, leads to an especially large provenance size, which in turn makes the information harder to view and understand.

Summarization
- Large size of provenance → need for abstraction
- E.g., in heavily edited Wikipedia pages:
  - “x1, x2, x3 are formatting changes; y1, y2, y3, y4 add content; z1, z2 represent divergent viewpoints”
  - “u1, u2, u3 represent edits by robots; v1, v2 represent edits by Wikipedia administrators”
- E.g., in a movie-rating application, to summarize the provenance of the average rating for “MatchPoint”:
  - “Audience crowd members gave higher ratings (8-10) whereas critics gave lower ratings (3-5).”

Multidimensional Views
- “Perspective” through which provenance can be viewed or mined
- E.g. in TripAdvisor, if there is an “outlier” review, it would be useful to see other reviews by that person to “calibrate” it.
- A “question” perspective could show which questions are bad or unclear

Maintenance and Cleaning
- May need update propagation to remove certain users, questions and/or answers
  - E.g. spammers or bad questions
- Mining of provenance may lag behind the aggregate calculation
  - E.g., detecting a spammer may only be possible once they have answered enough questions, or once enough answers have been obtained from other users.

Note that the aggregate calculation may in turn have already been used in a downstream calculation, or have been used to direct the process itself.

Crowd Sourcing Workflow

Consider a movie reviews aggregator platform, whose logic is captured by the workflow in Figure 1. Inputs for the platform are reviews (scores) from users, whose identifiers and names are stored in the Users table. Users have different roles (e.g. movie critics, directors, audience, etc.); information about two such roles, Critics and Audience, is shown in the corresponding relations.

Reviews are fed through different reviewing modules, which “crawl” different reviewing platforms such as IMDB, newspaper web sites, etc. Each such module updates statistics in the Stats table, e.g. how many reviews the user has submitted (NumRate), what their average score is (computed as SumRate divided by NumRate), etc. A reviewing module also consults Stats to output a “sanitized review” by implementing some logic. The sanitized reviews are then fed to an aggregator, which computes aggregate movie scores.

There are many plausible logics for the reviewing modules; we exemplify one in which each module “sanitizes” the reviews by joining the users, reviews and audience/critic relation (depending on the module), keeping only reviews of users who are listed under the corresponding role (audience/critic) and are “active”, i.e. have submitted more than 2 reviews. The aggregator combines the reviews obtained from all modules to compute overall movie ratings (sum, num, avg).

Figure 1: Movie reviews aggregator platform
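
The reviewing and aggregation logic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the user ids, names, scores and the function names `sanitize` and `aggregate` are invented here, and the step in which a module updates Stats is omitted.

```python
from collections import defaultdict

# Hypothetical toy data; none of these values come from the talk.
users = {1: "Ann", 2: "Bob", 3: "Cat"}          # Users: id -> name
audience = {1, 2}                                # ids listed in Audience
critics = {3}                                    # ids listed in Critics
stats = {1: (3, 24), 2: (1, 9), 3: (5, 20)}      # Stats: id -> (NumRate, SumRate)
reviews = [(1, "MatchPoint", 9), (2, "MatchPoint", 10), (3, "MatchPoint", 4)]


def sanitize(reviews, role_members, min_reviews=2):
    """Keep reviews by known users in the given role with more than min_reviews reviews."""
    return [
        (uid, movie, score)
        for uid, movie, score in reviews
        if uid in users and uid in role_members and stats[uid][0] > min_reviews
    ]


def aggregate(sanitized):
    """Compute (sum, num, avg) per movie over the sanitized reviews."""
    totals = defaultdict(lambda: [0, 0])
    for _uid, movie, score in sanitized:
        totals[movie][0] += score
        totals[movie][1] += 1
    return {movie: (s, n, s / n) for movie, (s, n) in totals.items()}


# One reviewing module per role, followed by the aggregator.
sanitized = sanitize(reviews, audience) + sanitize(reviews, critics)
ratings = aggregate(sanitized)
print(ratings)
```

In this toy run the review by user 2 is dropped because that user has submitted only one review, so only two reviews reach the aggregator.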

Provenance expression
- We want an expression that captures the users (U), their type, e.g. Audience (A), the logic involved (the Stats entry (S) for the user (U)), their answer, and how these are combined to calculate the final result.

Propagating provenance annotations through joins

R
A   B   C   annotation
…
a   b   c   p

S
D   B   E   annotation
…
d   b   e   r

R ⋈ S (JOIN on B)
A   B   C   D   E   annotation
…
a   b   c   d   e   p * r

The annotation p * r means joint use of data annotated by p and data annotated by r.
[Green, Karvounarakis, Tannen, Provenance Semirings. PODS 2007]
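
As a rough illustration of the join rule, the sketch below represents an annotation as a polynomial over annotation variables, stored as a dict from monomials (sorted tuples of variable names) to integer coefficients. That representation and the helper names `var`, `times` and `join_on_b` are assumptions made for this example, not something taken from the talk.

```python
def var(name):
    """Polynomial consisting of a single annotation variable."""
    return {(name,): 1}


def times(p, q):
    """Multiply two provenance polynomials (joint use of data)."""
    out = {}
    for m1, c1 in p.items():
        for m2, c2 in q.items():
            mono = tuple(sorted(m1 + m2))
            out[mono] = out.get(mono, 0) + c1 * c2
    return out


# Annotated relations: lists of (tuple, annotation) pairs.
R = [(("a", "b", "c"), var("p"))]   # schema (A, B, C)
S = [(("d", "b", "e"), var("r"))]   # schema (D, B, E)


def join_on_b(R, S):
    """Join R and S on B; an output tuple is annotated with the product."""
    result = []
    for (a, b, c), ann_r in R:
        for (d, b2, e), ann_s in S:
            if b == b2:
                result.append(((a, b, c, d, e), times(ann_r, ann_s)))
    return result


joined = join_on_b(R, S)
print(joined)  # the single output tuple carries the monomial p * r
```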

Propagating provenance annotations through unions and projections

R
A   B   C    annotation
…
a   b   c1   p
a   b   c2   r
a   b   c3   s

π_AB R (PROJECT)
A   B   annotation
…
a   b   p + r + s

+ means alternative use of data, which arises in both PROJECT and UNION.
[Green, Karvounarakis, Tannen, Provenance Semirings. PODS 2007]
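
The projection rule can be sketched the same way: tuples that collapse to the same output tuple have their annotations added. The dict-based polynomial representation and the names `var`, `plus` and `project_ab` are assumptions made for this example.

```python
def var(name):
    """Polynomial consisting of a single annotation variable."""
    return {(name,): 1}


def plus(p, q):
    """Add two provenance polynomials (alternative use of data)."""
    out = dict(p)
    for mono, coeff in q.items():
        out[mono] = out.get(mono, 0) + coeff
    return out


# Annotated relation R with schema (A, B, C).
R = [
    (("a", "b", "c1"), var("p")),
    (("a", "b", "c2"), var("r")),
    (("a", "b", "c3"), var("s")),
]


def project_ab(R):
    """Project onto (A, B), summing annotations of tuples that coincide."""
    result = {}
    for (a, b, _c), ann in R:
        key = (a, b)
        result[key] = plus(result.get(key, {}), ann)
    return result


projected = project_ab(R)
print(projected)  # (a, b) is annotated with p + r + s
```

A union of two annotated relations would combine annotations of equal tuples with the same `plus`.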

Annotated Aggregate Expressions

Q = select Dept, sum(Sal) from R group by Dept

R
Eid   Dept   Sal   annotation
1     d1     20    p1
2     d1     10    p2
3     d1     15    p3

The sum salary for d1 could be represented by the expression (20 ⊗ p1 + 10 ⊗ p2 + 15 ⊗ p3).
This provenance-aware value “commutes” with deletion.
[Amsterdamer, Deutch, Tannen, Provenance for Aggregate Queries. PODS 2011]
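
To make “commutes with deletion” concrete, the sketch below keeps the sum as a formal list of (value, annotation) pairs and evaluates it under a valuation that sends each annotation to 1 (tuple present) or 0 (tuple deleted). This is a simplification that assumes 0/1 valuations only; the function names are invented for the example.

```python
# Relation R with schema (Eid, Dept, Sal, annotation), as on the slide.
R = [(1, "d1", 20, "p1"), (2, "d1", 10, "p2"), (3, "d1", 15, "p3")]


def annotated_sum(R, dept):
    """Formal sum for one department: a list of (salary, annotation) pairs."""
    return [(sal, ann) for _eid, d, sal, ann in R if d == dept]


def evaluate(expr, valuation):
    """Evaluate the formal sum; valuation maps annotations to 1 or 0."""
    return sum(sal * valuation[ann] for sal, ann in expr)


expr = annotated_sum(R, "d1")

total = evaluate(expr, {"p1": 1, "p2": 1, "p3": 1})
total_without_p2 = evaluate(expr, {"p1": 1, "p2": 0, "p3": 1})

# Deleting the tuple first and re-running the query gives the same value.
recomputed = sum(sal for _eid, d, sal, ann in R if d == "d1" and ann != "p2")

print(total, total_without_p2, recomputed)
```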

Provenance expression

Provenance expression: Benefits
- Can understand how movie ratings were computed.
- Can be used for data maintenance and cleaning
  - E.g. if U2 is discovered to be a spammer, “map” its provenance annotation to 0
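
A minimal sketch of the cleaning step, assuming each rating contribution is annotated by a product of annotations such as U2 * S2 * A2 (this monomial shape and the scores are invented for illustration). Mapping any annotation in a product to 0 removes that contribution from the aggregate.

```python
# Hypothetical annotated contributions: (score, annotations used jointly).
contributions = [
    (9, ("U1", "S1", "A1")),
    (10, ("U2", "S2", "A2")),
    (4, ("U3", "S3", "C3")),
]


def rating(contributions, zeroed=frozenset()):
    """Return (sum, num, avg) after mapping the annotations in `zeroed` to 0."""
    kept = [
        score
        for score, monomial in contributions
        if not zeroed.intersection(monomial)  # a zero factor kills the product
    ]
    if not kept:
        return None
    return sum(kept), len(kept), sum(kept) / len(kept)


before = rating(contributions)
after = rating(contributions, frozenset({"U2"}))  # U2 found to be a spammer
print(before, after)
```

No reviews need to be re-collected: the stored expression is simply re-evaluated with U2 set to 0.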

Summarizing provenance
- Map annotations to a corresponding “summary”
  - h: Ann → Ann’, where |Ann’| << |Ann|
- E.g. in our example, let h(Ui)=h(Si)=1, h(Ai)=A, h(Ci)=C
  - Reducing the expression to [expression not captured in the transcript]
  - Which simplifies to [expression not captured in the transcript]
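
The example mapping h(Ui)=h(Si)=1, h(Ai)=A, h(Ci)=C can be applied mechanically, as in the hedged sketch below. The contribution data and the grouping into (sum, num) per summary annotation are assumptions made for illustration; the slide's own expressions are not reproduced here.

```python
from collections import defaultdict

# Hypothetical annotated contributions: (score, annotations used jointly).
contributions = [
    (9, ("U1", "S1", "A1")),
    (10, ("U2", "S2", "A2")),
    (4, ("U3", "S3", "C3")),
]


def h(annotation):
    """h(Ui) = h(Si) = 1 (returned as None), h(Ai) = A, h(Ci) = C."""
    kind = annotation[0]
    if kind in ("U", "S"):
        return None  # the multiplicative identity; it drops out of products
    return kind


def summarize(contributions):
    """Apply h to every monomial and merge contributions that become equal."""
    summary = defaultdict(lambda: [0, 0])
    for score, monomial in contributions:
        mapped = tuple(sorted(x for x in map(h, monomial) if x is not None))
        summary[mapped][0] += score
        summary[mapped][1] += 1
    return {key: tuple(value) for key, value in summary.items()}


summary = summarize(contributions)
print(summary)  # one entry per summary annotation: (sum of scores, count)
```

Nine distinct annotations shrink to two (A and C), at the cost of no longer being able to single out an individual user.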

Constructing mappings?
- How do we define and find “good” mappings?
  - Provenance size
  - Semantic constraints (e.g. two annotations can only be mapped to the same annotation if they come from the same input table)
  - Distance between the original provenance expression and the mapped expression (e.g. grouping all young French people and giving them an average rating for some movie)
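
One possible way to score candidate mappings against these three criteria is sketched below. Everything here is an assumption made for illustration: the convention that an annotation's first letter names its input table, the choice of "number of distinct targets" as the size, and the sum of absolute deviations from the group average as the distance.

```python
from collections import defaultdict


def respects_tables(mapping):
    """Annotations sharing a target must come from the same input table."""
    tables = defaultdict(set)
    for annotation, target in mapping.items():
        tables[target].add(annotation[0])  # assumed: first letter = table
    return all(len(t) == 1 for t in tables.values())


def size(mapping):
    """Number of distinct summary annotations the mapping produces."""
    return len(set(mapping.values()))


def distance(mapping, ratings):
    """Total deviation when each rating is replaced by its group's average."""
    groups = defaultdict(list)
    for annotation, score in ratings.items():
        groups[mapping[annotation]].append(score)
    average = {g: sum(v) / len(v) for g, v in groups.items()}
    return sum(abs(score - average[mapping[a]]) for a, score in ratings.items())


# Hypothetical ratings per user annotation and three candidate mappings.
ratings = {"U1": 9, "U2": 10, "U3": 4}
by_role = {"U1": "Aud", "U2": "Aud", "U3": "Crit"}
all_in_one = {"U1": "All", "U2": "All", "U3": "All"}
mixed_tables = {"U1": "X", "S1": "X"}

print(size(by_role), distance(by_role, ratings))
print(size(all_in_one), distance(all_in_one, ratings))
print(respects_tables(by_role), respects_tables(mixed_tables))
```

The two valid candidates show the trade-off: collapsing everything into one group gives the smallest provenance but the largest distance from the original ratings.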

Conclusions
- Provenance is needed for crowd-sourcing applications to help understand the results and reason about their quality.
- Techniques from database/workflow provenance can be used, but there are special challenges and “opportunities”