A Linear Method for Deviation Detection in Large Databases


A Linear Method for Deviation Detection in Large Databases
Data Mining
Presented by: Ali Triki
Date: 09/30/1999

Content
- What are Deviations?
- Approach
- Exact Exception Problem
- Sequential Exception Problem
- Algorithm
- Dissimilarity Function
- Experimental Results
- Conclusion

What are Deviations?
Deviations are errors or noise in the data.
Several approaches exist for detecting deviations (or exceptions) in the areas of databases and machine learning:
- Statistical approaches (Hoaglin 1983)
- Extending learning algorithms to cope with a small amount of noise (Aha 1991)
- Studying the impact of erroneous examples on learning results (Quinlan 1986)

Approach
- Use the implicit redundancy in the data to detect deviations.
- Cluster the data into two clusters: deviations and non-deviations.
- Do not discard deviations as noise; instead, try to isolate the small minorities.

Exact Exception Problem
Problem description:
- Set of items: I = {1, 4, 4, 4}
- Cardinality function: C(I)
- Dissimilarity function: the variance of the numbers in the set, D(I) = (1/n) * Σ (xi - x̄)²
- Smoothing factor: SF(Ij) = C(I - Ij) * (D(I) - D(I - Ij))
Computing the smoothing factor for each candidate exception set Ij gives the results summarized on the next slide.

Example
The candidate set Ij = {1} is an exception because it has the largest smoothing factor SF (a worked computation is sketched below).
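The following Python sketch (not part of the original slides; function names are illustrative) reproduces this computation: it evaluates the variance-based dissimilarity D and the smoothing factor SF for each singleton candidate exception set of I = {1, 4, 4, 4}.

```python
from statistics import pvariance

def dissimilarity(items):
    """D(I): variance of the numbers in the set, (1/n) * sum((x - mean)^2)."""
    return pvariance(items) if len(items) > 1 else 0.0

def smoothing_factor(items, candidate):
    """SF(Ij) = C(I - Ij) * (D(I) - D(I - Ij)), with C the cardinality."""
    remaining = list(items)
    for x in candidate:
        remaining.remove(x)  # multiset difference I - Ij
    return len(remaining) * (dissimilarity(items) - dissimilarity(remaining))

I = [1, 4, 4, 4]
for value in sorted(set(I)):
    print(f"candidate {{{value}}}: SF = {smoothing_factor(I, [value]):.4f}")

# Expected output (D(I) = 1.6875):
#   candidate {1}: SF = 5.0625   <- largest SF, so {1} is the exception
#   candidate {4}: SF = -0.9375
```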

Sequential Exception Problem
After seeing a series of similar data, an element disturbing the series is considered an exception.
Given:
- A set of items I
- A sequence S of subsets Ij ⊆ I with Ij-1 ⊆ Ij
- A cardinality function C and a dissimilarity function D
- Smoothing factor: SF(Ij) = C(Ij - Ij-1) * (D(Ij) - D(Ij-1))
The smoothing factor considers the difference with the preceding set instead of the complementary set.

Algorithm
1- Take the first element i1 of the item set I, forming the one-element subset I1 ⊆ I, and compute Ds(I1).
2- For each following element ij in S, create the subset Ij = Ij-1 ∪ {ij} and compute the difference in dissimilarity values dj = Ds(Ij) - Ds(Ij-1).
3- Consider the element ij with the maximal value of dj > 0 to be the answer for this iteration. If dj ≤ 0 for all Ij in S, there is no exception.

Algorithm (continued)
If an exception ij is found:
- For each element ik with k > j, compute
  dk0 = Ds(Ij-1 ∪ {ik}) - Ds(Ij-1)
  dk1 = Ds(Ij ∪ {ik}) - Ds(Ij)
- Add to the exception set Ix those ik for which dk0 - dk1 ≥ dj.
Over m iterations we get m competing exception sets Ix; select the one with the largest difference in dissimilarity dj scaled by the cardinality function C (a sketch of the procedure follows below).
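The Python sketch below illustrates one iteration of the procedure described on the two Algorithm slides. It is a simplified illustration rather than the paper's implementation: the variance from the exact-exception example stands in for the dissimilarity function Ds, the input is scanned in a single fixed order instead of m different orderings, and all names are hypothetical.

```python
from statistics import pvariance

def ds(items):
    """Stand-in for Ds: variance of the values seen so far (an assumption;
    the method allows any domain-specific dissimilarity function)."""
    return pvariance(items) if len(items) > 1 else 0.0

def sequential_exceptions(sequence):
    """One iteration of the sequential exception procedure from the slides.

    Scan the sequence once, track dj = Ds(Ij) - Ds(Ij-1) for each prefix Ij,
    pick the element with the largest positive dj, then add every later
    element ik whose contribution satisfies dk0 - dk1 >= dj."""
    prefix = [sequence[0]]                # I1
    d_prev = ds(prefix)                   # Ds(I1)
    best_j, best_dj = None, 0.0
    for j in range(1, len(sequence)):
        prefix.append(sequence[j])        # Ij = Ij-1 U {ij}
        d_cur = ds(prefix)
        dj = d_cur - d_prev               # dj = Ds(Ij) - Ds(Ij-1)
        if dj > best_dj:
            best_j, best_dj = j, dj
        d_prev = d_cur
    if best_j is None:
        return set()                      # dj <= 0 for all j: no exception
    exceptions = {best_j}
    I_jm1 = sequence[:best_j]             # Ij-1
    I_j = sequence[:best_j + 1]           # Ij
    for k in range(best_j + 1, len(sequence)):
        ik = sequence[k]
        dk0 = ds(I_jm1 + [ik]) - ds(I_jm1)
        dk1 = ds(I_j + [ik]) - ds(I_j)
        if dk0 - dk1 >= best_dj:          # ik disturbs Ij-1 as much as ij did
            exceptions.add(k)
    return exceptions                     # indices of the exception elements

print(sequential_exceptions([4, 4, 4, 1, 4, 4]))  # -> {3}: the value 1 is the exception
```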

Dissimilarity Function
For comparing character strings, the function maintains a regular-expression-like pattern that matches all the character strings seen so far.
Starting with the first string as the pattern, wildcard characters are introduced as more strings need to be covered.
Ds(Ij) = Ds(Ij-1) + j * (Ms(Ij) - Ms(Ij-1)) / Ms(Ij)
Auxiliary function: Ms(Ij) = 1 / (3*c - w + 2)
- c: the total number of characters
- w: the number of wildcards needed
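As a rough illustration of these formulas, the Python sketch below maintains a wildcard pattern and updates Ds incrementally. The per-position wildcard update (strings padded to equal length) and the interpretation of c as the pattern length are assumptions made for this sketch, not necessarily the paper's exact construction.

```python
WILDCARD = "*"

def update_pattern(pattern, s):
    """Introduce wildcards wherever the new string disagrees with the pattern.
    Simplified assumption: compare position by position, pad the shorter side."""
    width = max(len(pattern), len(s))
    pattern, s = pattern.ljust(width, WILDCARD), s.ljust(width, WILDCARD)
    return "".join(p if p == c else WILDCARD for p, c in zip(pattern, s))

def ms(pattern):
    """Auxiliary function from the slide: Ms = 1 / (3*c - w + 2),
    with c taken as the pattern length and w the number of wildcards."""
    c = len(pattern)
    w = pattern.count(WILDCARD)
    return 1.0 / (3 * c - w + 2)

def string_ds(strings):
    """Running Ds values: Ds(Ij) = Ds(Ij-1) + j * (Ms(Ij) - Ms(Ij-1)) / Ms(Ij)."""
    pattern = strings[0]
    ds_prev, ms_prev = 0.0, ms(pattern)
    values = [ds_prev]
    for j, s in enumerate(strings[1:], start=2):   # the j-th string joins Ij
        pattern = update_pattern(pattern, s)
        ms_cur = ms(pattern)
        ds_cur = ds_prev + j * (ms_cur - ms_prev) / ms_cur
        values.append(ds_cur)
        ds_prev, ms_prev = ds_cur, ms_cur
    return values

# Ds barely moves for near-identical strings, then jumps when "jones"
# forces the whole pattern to wildcards.
print(string_ds(["smith", "smyth", "jones"]))
```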

Experimental Results 1

Experimental Results 2

Experimental Results 3

A Failure Example

Why did it fail?
- The dissimilarity function used could not catch the exception.
- Once the two values '..,n,..' and '..,y,..' have been seen, the pattern takes the form '..,*,..'; from then on there is no change in the pattern when '?' appears in the same column, because the wildcard already covers it.
- A more powerful dissimilarity function is needed.

Conclusion
- We presented a linear algorithm for the sequential exception problem.
- Experimental evaluation shows that the effectiveness of the algorithm depends on the dissimilarity function used.
- It seems helpful to have predefined dissimilarity functions that work well for particular kinds of datasets.

References
- A. Arning, R. Agrawal, P. Raghavan: "A Linear Method for Deviation Detection in Large Databases", Proc. of the 2nd Int'l Conference on Knowledge Discovery and Data Mining (KDD), Portland, Oregon, August 1996.
- S. Sarawagi, R. Agrawal, N. Megiddo: "Discovery-driven Exploration of OLAP Data Cubes", Proc. of the 6th Int'l Conference on Extending Database Technology (EDBT), Valencia, Spain, March 1998.
- R. Agrawal, R. Srikant: "Fast Algorithms for Mining Association Rules", Proc. of the Int'l Conference on Very Large Data Bases (VLDB), 1994.