N. Gagunashvili, RAVEN Workshop, Heidelberg. Nikolai Gagunashvili (University of Akureyri, Iceland): Data mining methods in RAVEN network.

Presentation transcript:

▪ Data are collected and stored at enormous speeds.
▪ Traditional techniques are infeasible for raw data.
▪ Data mining methods, combined with visualization techniques, may help scientists in classifying and segmenting data and in hypothesis formation.

Data processing that can be realized on the RAVEN network:
▪ Unsupervised classification (cluster analysis)
▪ Detection of anomalies (outliers)
▪ Supervised classification for the selection of rare events

Algorithm 1_RAVEN: Basic K-means algorithm
1. Select K points as the initial centroids at the user node and share this information across the network.
2. Repeat
3. On each node, form K clusters by assigning every point to its closest centroid.
4. On each node, recompute the centroid of each cluster. The output of each node is the positions of its centroids together with the number of points assigned to each centroid.
5. Recalculate the positions of the centroids for the whole network and share this information across the network.
6. Until the centroid positions no longer change.
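The steps of Algorithm 1_RAVEN can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in RAVEN: nodes are simulated as plain lists of points, the network-wide centroid in step 5 is assumed to be the count-weighted mean of the per-node centroids, and the function names (`local_step`, `merge`, `raven_kmeans`) are hypothetical.

```python
import math

def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda k: math.dist(point, centroids[k]))

def local_step(points, centroids):
    """Steps 3-4 on one node: local centroid positions and point counts."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for p in points:
        k = nearest(p, centroids)
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += p[d]
    # A cluster with no local points reports the old centroid with count 0.
    local = [
        [s / counts[k] for s in sums[k]] if counts[k] else list(centroids[k])
        for k in range(len(centroids))
    ]
    return local, counts

def merge(node_outputs, old_centroids):
    """Step 5 (assumed form): count-weighted mean of the per-node centroids."""
    merged = []
    for k, old in enumerate(old_centroids):
        total = sum(counts[k] for _, counts in node_outputs)
        if total == 0:
            merged.append(list(old))  # assumption: an empty cluster keeps its position
            continue
        merged.append([
            sum(cents[k][d] * counts[k] for cents, counts in node_outputs) / total
            for d in range(len(old))
        ])
    return merged

def raven_kmeans(nodes, centroids, max_iter=100, tol=1e-9):
    """Repeat steps 3-5 until the shared centroids stop moving (step 6)."""
    for _ in range(max_iter):
        outputs = [local_step(points, centroids) for points in nodes]
        updated = merge(outputs, centroids)
        if all(math.dist(a, b) <= tol for a, b in zip(updated, centroids)):
            return updated
        centroids = updated
    return centroids

# Two simulated nodes; the initial centroids are chosen at the "user node".
nodes = [[(0, 0), (0, 2)], [(10, 10), (10, 12), (0, 1)]]
final_centroids = raven_kmeans(nodes, [(0, 0), (10, 10)])
```

Only centroid positions and counts cross the network in this sketch; the raw points never leave their node.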

Anomaly/Outlier Detection
What are anomalies/outliers?
– The set of data points that are considerably different from the remainder of the data
Variants of anomaly/outlier detection problems
– Given a database D, find all the data points x ∈ D with anomaly scores greater than some threshold t
– Given a database D, find all the data points x ∈ D having the top-n largest anomaly scores f(x)
– Given a database D containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D
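The first two problem variants differ only in how points are picked once the anomaly scores f(x) are known. A minimal sketch, assuming the scores have already been computed and are held in a list (the function names and the example scores are hypothetical):

```python
import heapq

def above_threshold(scores, t):
    """Variant 1: indices of all points whose anomaly score exceeds t."""
    return [i for i, s in enumerate(scores) if s > t]

def top_n(scores, n):
    """Variant 2: indices of the n points with the largest anomaly scores."""
    return heapq.nlargest(n, range(len(scores)), key=scores.__getitem__)

scores = [0.1, 2.5, 0.3, 4.0, 1.2]   # made-up scores f(x) for five points
flagged = above_threshold(scores, 1.0)
worst_two = top_n(scores, 2)
```

The threshold variant returns a variable number of points, while the top-n variant always returns n of them, whatever their absolute scores.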

Clustering-based algorithm of anomaly detection
The outlier score is defined as a relative distance: the ratio of the distance of a point from its closest centroid to the median distance of all points in that cluster from the centroid.
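Written as a formula, the relative-distance score reads as follows. The notation is assumed here rather than taken from the slides: c_k is the k-th centroid, C_k the set of points assigned to it, and k(x) the index of the centroid nearest to x.

```latex
\[
  \mathrm{score}(x) =
  \frac{\lVert x - c_{k(x)} \rVert}
       {\operatorname{median}_{y \in C_{k(x)}} \lVert y - c_{k(x)} \rVert},
  \qquad
  k(x) = \arg\min_{k} \lVert x - c_k \rVert .
\]
```

A score close to 1 means the point lies at a typical distance from its centroid; scores much greater than 1 mark candidate outliers.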

Clustering-based algorithm of anomaly detection
Algorithm 2: Basic clustering-based algorithm of anomaly detection
1. Find the positions of the K centroids (Algorithm 1).
2. Find the median distance for each centroid.
3. Calculate the score of each point, that is, (distance of the point to the nearest centroid)/(median distance for that nearest centroid).
4. Order the scores to define outliers.
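Steps 2 to 4 of Algorithm 2 can be sketched as below, assuming the centroids from step 1 are already available and that distances are Euclidean. The function name `outlier_scores` and the handling of a zero median are choices made for this illustration, not part of the original algorithm.

```python
import math
from statistics import median

def outlier_scores(points, centroids):
    """Relative-distance score of every point (steps 2-3 of Algorithm 2)."""
    # Nearest centroid and the distance to it, for each point.
    nearest = [
        min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
        for p in points
    ]
    dists = [math.dist(p, centroids[k]) for p, k in zip(points, nearest)]
    # Step 2: median distance of the points assigned to each centroid.
    med = {
        k: median(d for d, j in zip(dists, nearest) if j == k)
        for k in set(nearest)
    }
    # Step 3: distance divided by the median distance of the point's cluster.
    # Assumption: if the median is 0, a point on the centroid scores 0, any other inf.
    return [
        d / med[k] if med[k] > 0 else (0.0 if d == 0 else math.inf)
        for d, k in zip(dists, nearest)
    ]

points = [(0, 1), (0, -1), (1, 0), (-1, 0), (0, 6)]
scores = outlier_scores(points, centroids=[(0, 0)])
# Step 4: order the points by decreasing score.
ranking = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
```

In this toy data set the point (0, 6) is six times farther from the centroid than the median point, so it comes first in the ranking.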

Clustering-based algorithm of anomaly detection
Algorithm 2_RAVEN: Basic clustering-based algorithm of anomaly detection
1. Find the positions of the K centroids (Algorithm 1_RAVEN).
2. Find the median distance for each centroid of the network. A histogram-based method can be used for that.
3. Calculate the score of each point, that is, (distance of the point to the nearest centroid)/(median distance for that nearest centroid).
4. Order the scores on each node.
5. Use the highest scores from the nodes to find the points with the highest scores for the whole network.
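The slides do not specify the histogram-based method, so the following is one possible reading, offered as a sketch: every node histograms its point-to-centroid distances with bin edges agreed on in advance, the user node adds the histograms and interpolates the median inside the bin where the cumulative count reaches one half. The bin width, the number of bins and all function names are assumptions. The last function illustrates steps 4 and 5 by merging the per-node top scores.

```python
import heapq

def histogram(distances, bin_width, n_bins):
    """Per-node histogram of distances; the last bin collects the overflow."""
    counts = [0] * n_bins
    for d in distances:
        counts[min(int(d / bin_width), n_bins - 1)] += 1
    return counts

def median_from_histograms(node_counts, bin_width):
    """Approximate network-wide median from the summed node histograms."""
    total_counts = [sum(column) for column in zip(*node_counts)]
    half = sum(total_counts) / 2
    cumulative = 0
    for i, c in enumerate(total_counts):
        if c and cumulative + c >= half:
            # Linear interpolation inside the bin that contains the median.
            return (i + (half - cumulative) / c) * bin_width
        cumulative += c
    raise ValueError("no distances were histogrammed")

def network_top(node_scores, n):
    """Steps 4-5: every node sends its n highest scores; the user node merges them."""
    candidates = [s for scores in node_scores for s in heapq.nlargest(n, scores)]
    return heapq.nlargest(n, candidates)

# Distances to one centroid, held on two simulated nodes.
node_a = histogram([0.5, 1.5], bin_width=1.0, n_bins=5)
node_b = histogram([2.5, 3.5], bin_width=1.0, n_bins=5)
approx_median = median_from_histograms([node_a, node_b], bin_width=1.0)
top_scores = network_top([[1.0, 5.0, 2.0], [4.0, 0.5]], n=2)
```

The accuracy of the median is limited by the bin width, which is the price paid for exchanging only bin counts instead of all distances.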

Clustering-based detection: strengths and weaknesses
▪ Some clustering techniques, such as K-means, have linear or close to linear time and space complexity.
▪ It is possible to find both clusters and outliers at the same time.
▪ The set of outliers produced and their scores can depend heavily on the number of clusters as well as on the presence of outliers in the data.
▪ The quality of the outliers produced by a clustering algorithm is heavily affected by the quality of the clusters it produces.
▪ The clustering algorithm needs to be chosen carefully.

Supervised classification for the selection of rare events
In its analysis, LHCb focuses on very specific decay modes of B mesons that are sensitive to quantum effects caused by as yet undiscovered heavy particles (“New Physics”).
▪ Relative frequencies of B-decays
▪ For every B-decay inside an event there are about 5-10 times that number of tracks from non-B-decays
The success of LHCb and the other LHC experiments therefore depends critically on the availability of sufficiently powerful analysis tools.
The main research must therefore address the issue of finding signals in a huge background.

Typical imbalanced problems are credit card fraud detection or AIDS tests. In these problems it is important to detect all rare (signal) instances if possible. In particle selection, on the other hand, the performance criterion is the signal/noise ratio or the significance, as is typical for detector devices. The second difference is the use of real cases in data mining, in contrast to the simulated training and test samples used in particle physics. Classification algorithms must therefore be robust with respect to differences in properties between simulated and real data.
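The difference between the two kinds of criterion can be made concrete with a small sketch. The slides do not say which definition of significance was used; S/sqrt(S+B), one common convention, is assumed here, and the function name and the counts in the example are hypothetical.

```python
import math

def selection_metrics(signal_selected, signal_total, background_selected):
    """Recall (fraud-detection style) versus S/B and significance (particle selection)."""
    selected = signal_selected + background_selected
    # Fraction of all signal instances that survive the selection.
    recall = signal_selected / signal_total
    # Signal/noise ratio of the selected sample.
    s_over_b = signal_selected / background_selected if background_selected else math.inf
    # Assumed convention for the significance: S / sqrt(S + B).
    significance = signal_selected / math.sqrt(selected) if selected else 0.0
    return recall, s_over_b, significance

recall, s_over_b, significance = selection_metrics(
    signal_selected=90, signal_total=100, background_selected=10
)
```

A selection tuned for recall keeps as many signal instances as possible, whereas one tuned for S/B or significance may discard part of the signal if that removes much more background.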

Example of selection algorithm
IPpi, DoCA, IP, IPp, ptpi: attributes of an event
D0, BG: class of an event

▪ Selection algorithms that use methods of supervised classification
▪ The cut-based method of selection, traditional in data analysis
Background/signal ~ 3000
Britsch, XVII International Workshop on Deep-Inelastic Scattering and Related Subjects, 2009, Madrid

▪ The precision of measurements can be improved.
▪ The new method can help find particles that cannot be found by the old method.
▪ The trigger can be organized to register only events of interest for a particular physics analysis.
▪ The new method can find application in the detection of complex events, for example in radar, sonar, and lidar techniques.