1 Xiaolan Wang, Xin Luna Dong, Alexandra Meliou Presentation By: Tomer Amir

2 Introduction As computer (data) scientists, we all need data for our computations. The problem is that when collecting data from the real world, errors are almost impossible to avoid. These mistakes can be the result of:

3 Programmer Errors

4 User Input

5 Faulty Sensors And more…

6 There are many methods out there for finding and fixing errors in data, but today we are going to look at a method that, assuming we have a database and know the truth value of every entry, helps us point out the likely causes of the errors, all while solving an NP-Complete problem in linear time. But how will we know the validity of the data?

7 Examples The researchers tested their algorithm on a few data sources and found interesting results.

8 Scanning the Web As you probably expected, the internet is full of data errors. The algorithm was able to find error clusters in data sets extracted from the web that manual investigation revealed to be the result of: 1. Annotation Errors – scanning the site www.besoccer.com produced about 600 athletes with the same wrong date of birth, "Feb 18, 1986", which was probably an HTML copy-paste error. 2. Reconciliation Errors – a search term like "baseball coach" returned 700,000 results with a 90% error rate, since coaches of all types were returned. 3. Extraction Errors – when using a certain extractor on multiple sites, the search term "Olympics" returned 2 million results with a 95% error rate, since the search term was over-generalized and returned all sport-related results.

9 Other Examples Two more examples they presented are: Packet Losses – they ran the algorithm over reports of packet losses in a wireless network, and it pointed out two problematic nodes that were on the side of the building with interference. Traffic Incidents – when running the algorithm over traffic incident reports crossed with weather data, it pointed out the correlation between water levels of 2cm or more and traffic incidents.

10 So let's define the problem Say we have a huge database, where every data element has a truth value. We would like to find a subset of properties (features) that defines as many erroneous elements, and as few correct elements, as possible, in the minimum time possible.

12 So What Are Properties? Properties can be columns of the data rows, or metadata about the rows (like the source of the row). We would like them to have some hierarchy, and we would like every row to be identified exclusively by a single group of properties.

13 Let's Look at an Example Here we have a table from a wiki page with data about musicians and composers. We would like to store this data in a uniform way.

14 Enter – Triplets

15 So the following row will become the following triplets, each made of a Subject, a Predicate, and an Object.

16 Back to Properties So what are the properties of our data? Let's look at the first triplet: what do we know about it? Source of the data – Table 1; Subject – P.Fontaine; Predicate – Profession; Object – Musician. What we created here is the Property Vector of our data element. Property vectors can represent subsets of elements.
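
To make this concrete, here is a minimal Python sketch (the class and function names are mine, not the authors') of how a triplet and its source become one fully specified property vector:

from dataclasses import dataclass

@dataclass(frozen=True)
class Triplet:
    subject: str
    predicate: str
    obj: str

def property_vector(source: str, t: Triplet) -> tuple:
    # one fully specified property vector: (source, subject, predicate, object)
    return (source, t.subject, t.predicate, t.obj)

t1 = Triplet("P.Fontaine", "Profession", "Musician")
print(property_vector("Table 1", t1))
# ('Table 1', 'P.Fontaine', 'Profession', 'Musician')

Keeping the source as just another dimension lets the same vector format describe both data values and metadata about them.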

17 And here is how it’s done We’ll add an ID to each triplet, and here is our result: * The highlighted rows are rows with errors

18 Let's Formulate Property Dimensions – each dimension describes one aspect of the data (an element of the Property Vector). Property Hierarchy – for every dimension, we define a hierarchy of values from "All" down to every specific value we have. Property Vector – a unique identifier of a subset of data elements. A vector can represent all the elements: {All, All, … }, or a specific element (with every dimension fully specified).
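
A quick sketch of how the "All" wildcard makes a property vector describe a subset of elements. The matching rule and the sample elements below are my illustration of the idea, not code or data from the paper:

ALL = "All"

def matches(feature, element):
    # an element belongs to a feature's subset if every dimension
    # is either an exact match or the wildcard "All"
    return all(f == ALL or f == e for f, e in zip(feature, element))

elements = [
    ("Table 1", "P.Fontaine", "Profession", "Musician"),
    ("Table 1", "P.Fontaine", "Era", "Medieval"),
    ("Table 2", "J.Cage", "Profession", "Composer"),
]

# {All, All, Profession, All} covers every "Profession" triplet, from any source
feature = (ALL, ALL, "Profession", ALL)
print([e for e in elements if matches(feature, e)])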

19 So, Are we there yet?

20 Two More Definitions

21 Results Table of Elements:

22 Results Table of Features: * Notice that we created a Directed Acyclic Graph (DAG). It plays a huge part in the time complexity.

23 Questions?

24 Formal Problem

25 What Cost??? As we said before, we want the extracted features to be as accurate as possible, so the cost is the penalty we pay for misses in our features.

26 Formal Cost We start by using Bayesian analysis* to derive the set of features with the highest probability of being associated with the causes of the mistakes in the dataset. We derive our cost function from the Bayesian estimate: the lowest cost corresponds to the highest a posteriori probability that the selected features are the real causes of the errors. The resulting cost function contains three types of penalties, which capture the following three intuitions: Conciseness: simpler diagnoses with fewer features are preferable. Specificity: each feature should have a high error rate. Consistency: diagnoses should not include many correct elements. (*) Bayesian analysis is a statistical procedure that estimates parameters of an underlying distribution based on the observed distribution.
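
To show the shape of such a cost, here is an illustrative Python sketch with one penalty term per intuition. The weights alpha, beta, and gamma are made up for the example; the real weights fall out of the Bayesian derivation in the paper and are not reproduced here:

import math

def feature_cost(covered_total, covered_errors, alpha=0.1, beta=1.0, gamma=1.0):
    # illustrative only, not the paper's exact formula
    error_rate = covered_errors / covered_total if covered_total else 0.0
    conciseness = math.log(1.0 / alpha)                      # fixed price per feature
    specificity = beta * (1.0 - error_rate)                  # prefer high error rates
    consistency = gamma * (covered_total - covered_errors)   # correct elements dragged in
    return conciseness + specificity + consistency

def diagnosis_cost(features):
    # the overall cost is additive over the chosen features
    return sum(feature_cost(total, errors) for total, errors in features)

# a tight feature (10 covered, 9 erroneous) vs. a broad one (100 covered, 20 erroneous)
print(diagnosis_cost([(10, 9)]), diagnosis_cost([(100, 20)]))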

27 So how do we calculate it?

28 Assumptions

29

30

31

32

33 Cost Function

34

35 So Are We There Now????

36 Additive Cost Function (Final Form)

37 Questions?

38 And what about the Algorithm?

39 Feature Hierarchy

40 Parent-Child Features

41 Feature Partitions

42 Feature Hierarchy+Partitions This hierarchy can be represented as a DAG:

43 In Our Example

44 Questions?

45 Back to the Algorithm During the traversal of the DAG, we will maintain 3 sets of features: Unlikely causes U: features that are not likely to be causes. Suspect causes S: features that are possibly the causes. Result diagnosis R: features that are decided to be associated with the causes.

46 The Complete Algorithm

47 Algorithm 1-4 We start simply by initializing the features. Then we begin the top-down traversal.

48 Algorithm 5-7 Create the child features from each parent. If the parent feature is marked as "Covered", an ancestor was already added to R, and going over the children would only produce redundant features, so there is no need to process this parent and its children. We mark the children as "Covered" as well.

49 Algorithm 8-11 First we divide all the current children into partitions (line 8). Next, we compare each partition's total cost with its parent's cost. We add the winner to S and the loser to U.

50 Algorithm 12-15 We now need to consolidate U and S. Parents that are only in S are moved to R, and their children are marked as "Covered". Parent features in U are discarded, since one of their child partitions better explains the problem. Child features in S are sent to nextLevel for further investigation.
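
Putting slides 47-50 together, here is a minimal Python sketch of the traversal. The helpers children, partitions, and cost are assumed to exist and are not the authors' code; they stand in for the DAG construction and the additive cost function from slide 36:

def data_xray(root, children, partitions, cost):
    # children(f)      -> child features of f (one dimension made more specific)
    # partitions(kids) -> the child features grouped into covering partitions
    # cost(features)   -> additive cost of a set of features
    R = set()                 # result diagnosis
    covered = set()           # features already explained by an ancestor in R
    level = {root}            # current level of the feature DAG
    while level:
        S, U = set(), set()   # suspect causes / unlikely causes
        for parent in level:
            kids = children(parent)
            if parent in covered:           # lines 5-7: skip, but propagate the mark
                covered.update(kids)
                continue
            if not kids:                    # a leaf can only explain itself
                S.add(parent)
                continue
            for part in partitions(kids):   # lines 8-11: parent vs. each partition
                if cost(part) < cost({parent}):
                    S.update(part)
                    U.add(parent)
                else:
                    S.add(parent)
                    U.update(part)
        next_level = set()
        for f in S - U:                     # lines 12-15: consolidate U and S
            if f in level:                  # a winning parent is a final answer
                R.add(f)
                covered.update(children(f))
            else:
                next_level.add(f)           # winning children go down one level
        level = next_level
    return R

Each feature is compared only against its parent and the partitions of its children, which is the intuition behind the linear-time claim from slide 6.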

51 Questions?

52 Complexity

53 Optimizations There are optimizations to this algorithm, including pruning and parallel diagnosis in MapReduce, that can significantly improve the actual runtime. We can also improve the accuracy by post-processing the result set with a greedy set-cover step. This greedy step looks for a minimal set of features among those chosen by DATAXRAY. Since the number of features in the DATAXRAY result is typically small, this step is very efficient. Testing shows that, with negligible overhead, DATAXRAY with greedy refinement achieves significant improvements in accuracy.
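
The refinement is a standard greedy set cover. A small sketch, assuming each candidate feature comes with the set of erroneous elements it covers (the feature and element names below are made up):

def greedy_refine(candidates):
    # candidates maps each feature to the set of erroneous elements it covers
    uncovered = set().union(*candidates.values())
    chosen = []
    while uncovered:
        # pick the feature that covers the most still-uncovered errors
        best = max(candidates, key=lambda f: len(candidates[f] & uncovered))
        gain = candidates[best] & uncovered
        if not gain:
            break
        chosen.append(best)
        uncovered -= gain
    return chosen

# two of the three candidate features already cover every error
print(greedy_refine({"f1": {"e1", "e2", "e3"}, "f2": {"e3", "e4"}, "f3": {"e4"}}))
# ['f1', 'f2']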

54 Competitors Greedy, RedBlue, DATAAUDITOR, FEATURESELECTION, DECISIONTREE

55 Metrics Precision measures the portion of features in the derived diagnosis that are correctly identified as part of the optimal diagnosis. Recall measures the portion of features associated with causes of errors that appear in the derived diagnosis. F-measure – the harmonic mean of the previous two.
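
For reference, a tiny sketch of the three metrics computed over feature sets (the feature names are placeholders):

def precision_recall_f(derived, optimal):
    # derived / optimal are sets of features; F-measure is their harmonic mean
    tp = len(derived & optimal)
    precision = tp / len(derived) if derived else 0.0
    recall = tp / len(optimal) if optimal else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

print(precision_recall_f({"f1", "f2", "f3"}, {"f1", "f2", "f4"}))
# each value is 2/3 here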

56 Graphs

57 Execution Time

58 Questions?

59 The End

60 Good Luck on the Exams!

