Xiaolan Wang, Xin Luna Dong, Alexandra Meliou Presentation By: Tomer Amir
Introduction As computer (data) scientists, we all need data to do our computations. The problem is that when collecting data from the real world, errors are almost impossible to avoid. The mistakes could be a result of:
Programmer Errors
User Input
Faulty Sensors
And more…
There are many methods out there for finding and fixing errors in data, but today we are going to look at a method that, assuming we have a database and we know the truth value of every entry, helps us point out the likely causes of the errors, all while (approximately) solving an NP-complete problem in linear time. But how will we know the validity of the data?
Examples The researchers have tested their algorithm on a few data sources and found interesting results
Scanning the Web As you probably expected, the internet is full of data errors. The algorithm was able to find error clusters in data sets extracted from the web, which manual investigation revealed to be the result of:
1. Annotation errors – searching the site www.besoccer.com returned about 600 athletes with the wrong date of birth, “Feb 18, 1986”, which was probably an HTML copy-paste error.
2. Reconciliation errors – a search term like “baseball coach” returned 700,000 results with a 90% error rate, since coaches of all types were returned.
3. Extraction errors – when using a certain extractor over multiple sites, the search term “Olympics” returned 2 million results with a 95% error rate, since the search term was over-generalized and returned all sport-related results.
Other Examples Two more examples they presented are:
Packet losses – they ran the algorithm over reports of packet losses in a wireless network, and it pointed out two problematic nodes, which were on the side of the building with interference.
Traffic incidents – when running the algorithm over traffic incident reports crossed with weather data, it pointed out the correlation between water levels of 2 cm or more and traffic incidents.
So let’s define the problem Say we have a huge database, where every data element has a truth value. We would like to find a subset of properties (features) that covers as many erroneous elements, and as few correct elements, as possible, in the minimum time possible.
So What Are Properties? Properties could be columns of the data rows, or metadata about the rows (like the source of the row). We would like them to form a hierarchy, and we would like every row to be identified uniquely by a single group of properties.
Let’s Look at an Example Here we have a set of data from a wiki page, with data about musicians and composers: We would like to store this data in a uniform way
Enter – Triplets
So the following row will become the following triplets (table columns: Object | Predicate | Values):
Back to Properties So what are the properties of our data? Let’s look at the first triplet. What do we know about it?
Source of the data – Table 1
Subject – P.Fontaine
Predicate – Profession
Object – Musician
What we have created here is the Property Vector of our data element. Property vectors can represent subsets of elements.
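For illustration, here is a minimal Python sketch (not from the paper) of a property vector over four assumed dimensions: source, subject, predicate, and object. The Element type, its field names, and the truth flag are made up for the example.

from collections import namedtuple

# A data element: a triplet plus its metadata and truth value (illustrative schema).
Element = namedtuple("Element", ["source", "subject", "predicate", "object", "is_true"])

def property_vector(elem):
    """Return the most specific property vector of a data element."""
    return (elem.source, elem.subject, elem.predicate, elem.object)

# Example: the first triplet from the slide (truth value assumed True here).
e1 = Element("Table 1", "P.Fontaine", "Profession", "Musician", True)
print(property_vector(e1))  # ('Table 1', 'P.Fontaine', 'Profession', 'Musician')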
And here is how it’s done We’ll add an ID to each triplet, and here is our result: * The highlighted rows are rows with errors
Let’s Formulate
Property Dimensions – each dimension describes one aspect of the data (an element of the Property Vector).
Property Hierarchy – for every dimension, we define a hierarchy of values from “All” down to every specific value we have.
Property Vector – a unique identifier of a subset of data elements. A vector can represent all the elements, {All, All, …}, or a single specific element.
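Continuing the same toy setup, here is a hedged sketch of the property hierarchy idea: assuming a two-level hierarchy per dimension (a specific value or "All"), every specific vector has a set of generalizations, up to {All, All, …}. Real dimensions may have deeper hierarchies.

from itertools import product

ALL = "All"

def ancestor_vectors(vec):
    """All property vectors that generalize `vec`, assuming each dimension
    has a two-level hierarchy: its specific value or ALL."""
    options = [(value, ALL) for value in vec]
    return set(product(*options))

# The specific vector ('Table 1', 'P.Fontaine', 'Profession', 'Musician')
# has 2^4 = 16 generalizations, from itself up to (ALL, ALL, ALL, ALL).
print(len(ancestor_vectors(("Table 1", "P.Fontaine", "Profession", "Musician"))))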
So, Are we there yet?
Two More Definitions
Results Table of Elements:
Results Table of Features: * Notice that we created a Directed Acyclic Graph (DAG). It plays a huge part in the time complexity.
Questions?
Formal Problem
What Cost??? As we said before, we want the extracted features to be as accurate as possible, so the cost is the penalty we pay when our features miss errors or cover correct elements.
Formal Cost We start by using Bayesian analysis(*) to derive the set of features with the highest probability of being associated with the causes of the mistakes in the dataset. We derive our cost function from the Bayesian estimate: the lowest cost corresponds to the highest a-posteriori probability that the selected features are the real causes of the errors. The resulting cost function contains three types of penalties, which capture the following three intuitions:
Conciseness: simpler diagnoses with fewer features are preferable.
Specificity: each feature should have a high error rate.
Consistency: diagnoses should not include many correct elements.
(*) Bayesian analysis is a statistical procedure that estimates parameters of an underlying distribution based on the observed distribution.
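Before the formal derivation on the next slides, here is an illustrative Python sketch of what an additive cost with these three penalties could look like. The structure (a fixed cost per feature, a penalty per covered correct element, and a penalty that grows as a feature's error rate drops) follows the three intuitions above, but the exact terms and the parameters alpha and epsilon are placeholders, not the constants derived in the paper's Bayesian analysis.

import math

def feature_cost(n_errors, n_correct, alpha=0.1, epsilon=0.01):
    """Illustrative per-feature cost; alpha and epsilon are placeholder parameters."""
    fixed = math.log(1 / alpha)                      # conciseness: pay per feature used
    consistency = n_correct * math.log(1 / epsilon)  # consistency: pay per correct element covered
    if n_errors:
        error_rate = n_errors / (n_errors + n_correct)
        specificity = n_errors * math.log(1 / error_rate)  # specificity: low error rate costs more
    else:
        specificity = 0.0
    return fixed + consistency + specificity

def diagnosis_cost(features):
    """Cost of a diagnosis = sum of its features' costs (additive form)."""
    return sum(feature_cost(e, c) for e, c in features)

# A feature covering 90 erroneous / 10 correct elements is much cheaper than 10 / 90.
print(diagnosis_cost([(90, 10)]), diagnosis_cost([(10, 90)]))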
So how do we calculate it?
Assumptions
Cost Function
So Are We There Now????
Additive Cost Function (Final Form)
Questions?
And what about the Algorithm?
Feature Hierarchy
Parent-Child Features
Feature Partitions
Feature Hierarchy+Partitions This hierarchy can be represented as a DAG:
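Here is a hedged sketch of how child features and partitions could be generated, assuming the same toy two-level hierarchy as before: each child feature specializes exactly one "All" dimension of its parent, and the children that specialize the same dimension form one partition. The helper name and representation are made up for the example.

ALL = "All"

def child_features(parent, elements):
    """Group a parent feature's child features into partitions, one partition per
    dimension that is still ALL in the parent (two-level hierarchy assumed).
    `elements` are the property vectors covered by the parent."""
    partitions = {}
    for dim, value in enumerate(parent):
        if value != ALL:
            continue  # this dimension is already specific
        children = set()
        for elem in elements:
            child = list(parent)
            child[dim] = elem[dim]  # specialize one dimension to the element's value
            children.add(tuple(child))
        partitions[dim] = children
    return partitions

root = (ALL, ALL, ALL, ALL)
elems = [("Table 1", "P.Fontaine", "Profession", "Musician"),
         ("Table 1", "P.Fontaine", "Profession", "Composer")]
print(child_features(root, elems))  # one partition per dimension of the root feature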
In Our Example
Questions?
Back to the Algorithm During the traversal of the DAG, we will maintain 3 sets of features:
Unlikely causes U: features that are not likely to be causes.
Suspect causes S: features that are possibly the causes.
Result diagnosis R: features that are decided to be associated with the causes.
The Complete Algorithm
Algorithm 1-4 We start simply by initializing the features. Then we begin the top-down traversal.
Algorithm 5-7 Create the child features from each parent. If the parent feature is marked as “Covered”, an ancestor was already added to R, and going over the children might produce redundant features, so there is no need to go over this parent and its children. We mark them as “Covered” as well.
Algorithm 8-11 First, we get all the current children divided into partitions (line 8). Next, we compare each partition's total cost with its parent's cost. We add the winner to S and the loser to U.
Algorithm We now need to consolidate U and S. Parents that are only in S are moved to R, and their children are marked as “Covered”. Parent features in U are discarded, since one of their child features better explains the problem. Child features in S are sent to nextLevel for further investigation. A simplified sketch of the whole traversal follows below.
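Putting the pieces together, here is a much-simplified Python sketch of the level-wise traversal with the three sets U, S, and R. It is not the paper's exact pseudocode: the `cost` and `partitions` arguments stand in for the kind of helpers sketched earlier, and features that end up in both S and U are simply discarded, which simplifies the consolidation step.

def diagnose(root, cost, partitions):
    """Level-wise traversal maintaining U (unlikely), S (suspect), R (result).
    `cost(f)` returns a feature's cost; `partitions(f)` returns a list of
    partitions, each a list of child features (empty for leaf features)."""
    R, covered = set(), set()
    level = [root]
    while level:
        S, U, next_level = set(), set(), []
        for parent in level:
            parts = partitions(parent)
            if parent in covered:                       # an ancestor is already in R
                covered.update(c for part in parts for c in part)
                continue
            if not parts:                               # leaf feature: nothing to compare
                S.add(parent)
                continue
            for part in parts:
                if sum(cost(c) for c in part) < cost(parent):
                    S.update(part)                      # the child partition wins
                    U.add(parent)
                else:
                    S.add(parent)                       # the parent wins
                    U.update(part)
        for f in S - U:                                 # consolidate U and S
            if f in level:                              # a parent only in S joins the result
                R.add(f)
                covered.update(c for part in partitions(f) for c in part)
            else:                                       # a child only in S moves down a level
                next_level.append(f)
        level = next_level
    return R

With the toy helpers above, `partitions` could wrap child_features and `cost` could wrap feature_cost over the elements each feature covers.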
Questions?
Complexity
Optimizations There are optimizations to this algorithm, including “pruning” and “parallel diagnosis in MapReduce”, that can significantly improve the actual runtime. We can also improve the accuracy by post-processing the result set with a greedy set-cover step. This greedy step looks for a minimal set of features among those chosen by DATAXRAY. Since the number of features in the DATAXRAY result is typically small, this step is very efficient. Testing shows that, with negligible overhead, DATAXRAY with greedy refinement achieves significant improvements in accuracy.
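As an illustration of the greedy refinement idea, here is a sketch of a standard greedy set-cover step over the candidate features: repeatedly pick the feature that covers the most still-uncovered erroneous elements. The feature names and element IDs below are invented for the example.

def greedy_refine(candidate_features, error_elements):
    """Greedy set cover over a (typically small) candidate set:
    candidate_features maps each feature to the set of erroneous elements it covers."""
    uncovered = set(error_elements)
    chosen = []
    while uncovered:
        best = max(candidate_features, key=lambda f: len(candidate_features[f] & uncovered))
        gain = candidate_features[best] & uncovered
        if not gain:
            break                      # remaining errors are not covered by any candidate
        chosen.append(best)
        uncovered -= gain
    return chosen

# Hypothetical example: three candidate features covering erroneous element IDs.
features = {"source=Table 1": {1, 2, 3}, "predicate=Profession": {2, 3, 4}, "subject=P.Fontaine": {4}}
print(greedy_refine(features, {1, 2, 3, 4}))  # ['source=Table 1', 'predicate=Profession']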
Competitors
Greedy
RedBlue
DATAAUDITOR
FEATURESELECTION
DECISIONTREE
Metrics
Precision measures the portion of features that are correctly identified as part of the optimal diagnosis.
Recall measures the portion of features associated with causes of errors that appear in the derived diagnosis.
F-measure – the harmonic mean of the previous two.
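These are the standard definitions; here is a small sketch that computes them for a derived diagnosis against the ground-truth causal features (the feature names in the example are hypothetical).

def evaluate(diagnosis, true_causes):
    """Precision, recall, and F-measure of a derived diagnosis vs. ground-truth causes."""
    diagnosis, true_causes = set(diagnosis), set(true_causes)
    tp = len(diagnosis & true_causes)
    precision = tp / len(diagnosis) if diagnosis else 0.0
    recall = tp / len(true_causes) if true_causes else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

print(evaluate({"f1", "f2", "f3"}, {"f1", "f2", "f4"}))  # (0.666..., 0.666..., 0.666...)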
Graphs
Execution Time
Questions?
The End
Good Luck In The Exams!