Data X-Ray: A Diagnostic Tool for Data Errors. Xiaolan Wang, Xin Luna Dong, Alexandra Meliou. Presentation By: Tomer Amir.


Xiaolan Wang, Xin Luna Dong, Alexandra Meliou Presentation By: Tomer Amir

Introduction As computer (and data) scientists, we all need data for our computations. The problem is that when collecting data from the real world, errors are almost impossible to avoid. These mistakes can be the result of:

Programmer Errors

User Input

Faulty Sensors

And more…

There are many methods out there for finding and fixing errors in data, but today we are going to look at a method that, assuming we have a database in which we know the truth value of every entry, helps us point out the likely causes of the errors, all while solving an NP-Complete problem in linear time. How will we know the validity of the data?

Examples The researchers tested their algorithm on a few data sources and found interesting results.

Scanning the Web As you probably expected, the internet is full of data errors. The algorithm was able to find error clusters in data sets extracted from the web, and manual investigation revealed that they were the result of:
1. Annotation Errors – searching the site www.besoccer.com returned about 600 athletes with the same wrong date of birth, "Feb 18, 1986", which was probably an HTML copy-paste error.
2. Reconciliation Errors – a search term like "baseball coach" returned 700,000 results with a 90% error rate, since coaches of all types were returned.
3. Extraction Errors – when using a certain extractor over multiple sites, the search term "Olympics" returned 2 million results with a 95% error rate, since the search term was over-generalized and returned all sport-related results.

Other Examples Two more examples they presented:
Packet Losses – they ran the algorithm over reports of packet losses in a wireless network, and it pointed out two problematic nodes that were on the side of the building with interference.
Traffic Incidents – when running the algorithm over traffic incident reports crossed with weather data, it pointed out the correlation between water levels of 2 cm or more and traffic incidents.

So let’s define the problem Say we have a huge database where every data element has a truth value. We would like to find a subset of properties (features) that covers as many erroneous elements, and as few correct elements, as possible, in the minimum time possible.

So What Are Properties? Properties can be columns of the data rows, or metadata about the rows (like the source of the row). We would like them to have some hierarchy, and for every row to be identified exclusively by a single group of properties.

Let’s Look at an Example Here we have a data set from a wiki page about musicians and composers. We would like to store this data in a uniform way.

Enter – Triplets

So the following row: will become the following triplets: (Subject, Predicate, Object)

Back to Properties So what are the properties of our data? Let’s look at the first triplet. What do we know about it?
Source of the data – Table 1
Subject – P.Fontaine
Predicate – Profession
Object – Musician
What we have created here is the Property Vector of our data element. Property vectors can represent subsets of elements.
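To make this concrete, here is a minimal sketch (in Python, with hypothetical names; this is not code from the paper) of turning a triplet and its source into a property vector:

```python
from collections import namedtuple

# A knowledge triplet plus its provenance; the field names are illustrative.
Triple = namedtuple("Triple", ["source", "subject", "predicate", "object"])

def property_vector(t: Triple) -> tuple:
    # Each dimension of the vector describes one aspect of the element.
    return (t.source, t.subject, t.predicate, t.object)

t1 = Triple("Table 1", "P.Fontaine", "Profession", "Musician")
print(property_vector(t1))  # ('Table 1', 'P.Fontaine', 'Profession', 'Musician')
```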

And here is how it’s done We’ll add an ID to each triplet, and here is our result: * The highlighted rows are rows with errors

Let’s Formulate
Property Dimensions – each dimension describes one aspect of the data (an element of the Property Vector).
Property Hierarchy – for every dimension, we define a hierarchy of values, from “All” down to every specific value we have.
Property Vector – a unique identifier of a subset of data elements. A vector can represent all the elements: {All, All, …}, or a specific element:
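As an illustration, a coarser (parent) vector can be obtained by replacing one dimension of a vector with “All”. A minimal sketch, reusing the hypothetical layout from the previous example:

```python
ALL = "All"

def parents(vector: tuple) -> list:
    # All vectors obtained by generalizing exactly one dimension to ALL.
    return [
        vector[:i] + (ALL,) + vector[i + 1:]
        for i, value in enumerate(vector)
        if value != ALL
    ]

v = ("Table 1", "P.Fontaine", "Profession", "Musician")
for p in parents(v):
    print(p)
# ('All', 'P.Fontaine', 'Profession', 'Musician'), ('Table 1', 'All', ...), ...
```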

So, Are we there yet?

Two More Definitions

Results Table of Elements:

Results Table of Features: * Notice that we created a Directed Acyclic Graph (DAG). It plays a huge part in the time complexity.

Questions?

Formal Problem

What Cost??? As we said before, we want the extracted features to be as accurate as possible, so the cost is the penalty we pay for misses in our features.

Formal Cost We start by using Bayesian analysis (*) to derive the set of features with the highest probability of being associated with the causes of the mistakes in the dataset. We derive our cost function from the Bayesian estimate: the lowest cost corresponds to the highest a posteriori probability that the selected features are the real causes of the errors. The resulting cost function contains three types of penalties, which capture the following three intuitions:
Conciseness: simpler diagnoses with fewer features are preferable.
Specificity: each feature should have a high error rate.
Consistency: diagnoses should not include many correct elements.
(*) Bayesian analysis is a statistical procedure that estimates the parameters of an underlying distribution based on the observed distribution.
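The exact cost formula appears on the next slides (not preserved in this transcript). Purely as an illustrative sketch of what an additive cost with these three penalties could look like (the symbols and weights below are assumptions, not the paper's notation):

\[
\mathrm{cost}(\mathcal{F}) \;=\;
\underbrace{\lambda_1\,|\mathcal{F}|}_{\text{conciseness}} \;+\;
\underbrace{\lambda_2 \sum_{F \in \mathcal{F}} \bigl(1 - \varepsilon(F)\bigr)}_{\text{specificity}} \;+\;
\underbrace{\lambda_3\,\bigl|\{\, e \in \mathrm{cover}(\mathcal{F}) : e \text{ is correct} \,\}\bigr|}_{\text{consistency}}
\]

where \(\varepsilon(F)\) is the error rate of feature \(F\), \(\mathrm{cover}(\mathcal{F})\) is the set of elements covered by the chosen features, and \(\lambda_1, \lambda_2, \lambda_3\) are weights.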

So how do we calculate it?

Assumptions

Cost Function

So Are We There Now????

Additive Cost Function (Final Form)

Questions?

And what about the Algorithm?

Feature Hierarchy

Parent-Child Features

Feature Partitions

Feature Hierarchy+Partitions This hierarchy can be represented as a DAG:

In Our Example

Questions?

Back to the Algorithm During the traversal of the DAG, we will maintain 3 sets:
Unlikely causes U: features that are not likely to be causes.
Suspect causes S: features that are possibly the causes.
Result diagnosis R: features that are decided to be associated with the causes.

The Complete Algorithm

Algorithm 1-4 We start simple by initializing the features: Then we start the top-down traversal.

Algorithm 5-7 Create the child features from each parent. If the parent feature is marked as “Covered”, an ancestor was already added to R, and going over the children might produce redundant features, so there is no need to process this parent and its children. We mark them as “Covered” as well.

Algorithm 8-11 Now we first get all the current children divided into partitions (line 8). Next, we compare each partition's total cost with its parent's cost. We add the winner to S and the loser to U.

Algorithm We now need to consolidate U and S. Parents that are only in S are moved to R, and their children are marked as “Covered”. Parent features in U are discarded, since one of their child features better explains the problem. Child features in S are sent to nextLevel for further investigation.
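Putting these steps together, here is a rough, simplified Python sketch of the top-down traversal (the cost, partitions, and children_of helpers are assumed to be given; this is a paraphrase of the slides, not the paper's exact pseudocode):

```python
from collections import deque

def data_xray(root, cost, partitions, children_of):
    """Simplified top-down traversal of the feature DAG.

    root        -- the most general feature {All, All, ...}
    cost        -- cost(feature), assumed given
    partitions  -- partitions(feature) -> list of child-feature groups
    children_of -- children(feature) -> child features (used for marking)
    """
    R = set()                # result diagnosis
    covered = set()
    level = deque([root])

    while level:
        next_level = deque()
        S, U = set(), set()  # suspect / unlikely causes at this level

        for parent in level:
            if parent in covered:
                # An ancestor is already in R: skip this parent and
                # mark its children as covered too.
                covered.update(children_of(parent))
                continue
            for part in partitions(parent):
                # Compare the parent's cost with the total cost of one
                # partition of its children; the winner goes to S,
                # the loser to U.
                if sum(cost(c) for c in part) < cost(parent):
                    S.update(part)
                    U.add(parent)
                else:
                    S.add(parent)

        # Consolidate U and S: parents that are only in S move to R and
        # their children are marked as covered; parents in U are discarded;
        # child features in S go to the next level for further inspection.
        for f in S:
            if f in level:
                if f not in U:
                    R.add(f)
                    covered.update(children_of(f))
            else:
                next_level.append(f)

        level = next_level

    return R
```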

Questions?

Complexity

Optimizations There are optimizations to this algorithm, such as pruning and parallel diagnosis in MapReduce, that can significantly improve the actual runtime. We can also improve the accuracy by post-processing the result set with a greedy set-cover step. This greedy step looks for a minimal set of features among those chosen by DATAXRAY. Since the number of features in the DATAXRAY result is typically small, this step is very efficient. Testing shows that, with negligible overhead, DATAXRAY with greedy refinement results in significant improvements in accuracy.
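A minimal sketch of such a greedy set-cover refinement (assuming each candidate feature is represented by the set of erroneous elements it covers; the names here are hypothetical):

```python
def greedy_refine(candidates, erroneous):
    """Pick a small subset of the candidate features that still covers
    every erroneous element covered by the full candidate set.

    candidates -- dict: feature -> set of erroneous elements it covers
    erroneous  -- set of erroneous elements
    """
    uncovered = set(erroneous) & set().union(*candidates.values())
    chosen = []
    while uncovered:
        # Greedily take the feature that covers the most uncovered errors.
        best = max(candidates, key=lambda f: len(candidates[f] & uncovered))
        if not candidates[best] & uncovered:
            break
        chosen.append(best)
        uncovered -= candidates[best]
    return chosen
```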

Competitors
Greedy
RedBlue
DATAAUDITOR
FEATURESELECTION
DECISIONTREE

Metrics
Precision measures the portion of features that are correctly identified as part of the optimal diagnosis.
Recall measures the portion of features associated with causes of errors that appear in the derived diagnosis.
F-measure – the harmonic mean of the previous two.
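In formulas (standard definitions, where \(\mathcal{F}_{\mathrm{ret}}\) is the diagnosis returned by an algorithm and \(\mathcal{F}_{\mathrm{true}}\) is the set of features truly associated with error causes):

\[
\mathrm{Precision} = \frac{|\mathcal{F}_{\mathrm{ret}} \cap \mathcal{F}_{\mathrm{true}}|}{|\mathcal{F}_{\mathrm{ret}}|}, \qquad
\mathrm{Recall} = \frac{|\mathcal{F}_{\mathrm{ret}} \cap \mathcal{F}_{\mathrm{true}}|}{|\mathcal{F}_{\mathrm{true}}|}, \qquad
F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
\]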

Graphs

Execution Time

Questions?

The End

Good Luck In The Exams!