1. cluster the data. 2. for the data of a cluster, set up the network. 3. begin at a random vertex as source/sink s, choose its farthest vertex as the.

1. Cluster the data.
2. For the data of a cluster, set up the network.
3. Begin at a random vertex as source/sink s, and choose its farthest vertex as the sink/source t.
4. Use the Maximum-Flow/Minimum-Cut algorithm to find the flow from source to sink, get the cut separating s and t, and use the smaller side as the candidate outlier or outlier group.
5. Remove the candidate outlier or outlier group from the graph.
6. Select the next source and go back to step 3 until the stop criterion is met.
7. Coarsen the graph and run the algorithm again; select outliers from the candidate outliers.

Outlier Detection and Evaluation by Network Flow
Ying Liu
Advisor: Alan P. Sprague
{liuyi, sprague}@cis.uab.edu
Department of Computer and Information Sciences, University of Alabama at Birmingham
http://www.cis.uab.edu/kddm

Applications: fraud detection for credit cards; intrusion detection in computer networks; detection of novelties in images; detection of network bottlenecks.

Experiments and Implementation Details

To repair the poor-quality clusters generated by a clustering algorithm, we remove from each cluster the outliers that do not belong in it.

Theory foundation: Maximum Flow/Minimum Cut

The maximum flow problem is to find a flow f for which the total flow is maximum. The total flow can be measured at the sink, or it can be measured at any cut separating the source from the sink. (Ford-Fulkerson algorithm, 1962.)

Example, with the flow carried by each path:
s -> a -> b -> t: 12
s -> a -> c -> d -> b -> t: 7
s -> c -> b -> t: 9
s -> c -> d -> t: 3
maximum flow = minimum cut = 12 + 3 + 9 + 7 = 31

[Figure: flow network on vertices s, a, b, c, d, t; edge labels (flow/capacity): 19/19, 12/13, 7/10, 9/9, 7/7, 12/12, 28/30, 3/3, 10/11]

Intuition: Suppose t is an outlier and s is the farthest vertex from t. Suppose further that t is far from all other points in the data set. Then each edge between t and the other vertices has small capacity, so ({t}, C-{t}) is a cut of small capacity.

[Figure: cluster No. 20, 591 points]
[Figure: 7-nearest-neighbor network of the cluster: 591 points, 5028 edges; edges run in both directions]

Compute the k nearest neighbors of each vertex, and make sure all vertices are connected.
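As a check on the maximum-flow/minimum-cut example above, here is a minimal Python sketch. It uses the Edmonds-Karp variant of Ford-Fulkerson (breadth-first augmenting paths), which is not necessarily the implementation used by the authors. The function name `max_flow_min_cut` is hypothetical. The assignment of capacities to individual edges is also an assumption: the original figure is lost, so the capacities below were chosen only so that the four path flows listed above (12, 7, 9, 3) are feasible.

```python
from collections import deque


def max_flow_min_cut(capacity, source, sink):
    """Return (max flow value, source side of a minimum cut) via Edmonds-Karp."""
    # Build residual capacities; every edge gets a reverse edge starting at 0.
    residual = {}
    for u, nbrs in capacity.items():
        for v, c in nbrs.items():
            residual.setdefault(u, {})
            residual.setdefault(v, {})
            residual[u][v] = residual[u].get(v, 0) + c
            residual[v].setdefault(u, 0)

    def reachable_from_source():
        # Breadth-first search over edges with positive residual capacity.
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    total = 0
    while True:
        parent = reachable_from_source()
        if sink not in parent:
            # No augmenting path left: the reached vertices form the source side.
            return total, set(parent)
        # Find the bottleneck along the path, then push that much flow.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck


# Hypothetical edge capacities (assumed, see the note above).
CAPACITY = {
    "s": {"a": 19, "c": 12},
    "a": {"b": 13, "c": 10},
    "c": {"b": 9, "d": 11},
    "d": {"b": 7, "t": 3},
    "b": {"t": 30},
}

flow, source_side = max_flow_min_cut(CAPACITY, "s", "t")
print(flow)         # 31
print(source_side)  # {'s'}
```

The returned source side is the set of vertices still reachable from s in the residual network. Step 4 of the algorithm would then take the smaller of the two sides of the cut as the candidate outlier or outlier group.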
Compute the capacity between two vertices from the distance between them.

[Figure: red vertices are the source side; blue vertices are the sink side]

Randomly select a vertex to start, and find its farthest vertex by single-source shortest paths. Use this source and sink to run the Maximum-Flow/Minimum-Cut algorithm. The flow from source to sink is 1269.

We use the maximum flow as the measure of outlier degree. For an outlier or outlier group, a low maximum flow means a strong outlier, and a high maximum flow means a weak outlier.

Vertex with smallest (last_wave + flow_passed)

[Figure: network on vertices s, a, b, c, d, e, f, t; edge labels (flow/capacity): 3/3, 3/10, 3/12, 3/10, 0/8, 0/4, 0/15; the minimum cut is marked]

After all the edges from source to sink are saturated, the source keeps searching for an augmenting path; this final search is the last wave. We use the vertex with the smallest (last_wave + flow_passed) as the next source.

Vertex with maximum average distance

Compute each vertex's average distance to the other vertices, then order these averages from maximum to minimum. Begin the process from the vertex with the maximum average distance. After the minimum cut, some vertices are cut off. The next source is chosen from among the vertices not cut, as the one with the maximum average distance.

Loop      Max Flow
No. 4     1267
No. 1     1269
No. 3     3256
No. 5     3937
No. 8     5939
No. 7     7717
No. 14    8962
No. 9     10148
No. 10    16194
No. 2     16533
No. 13    17793
No. 6     25378
No. 11    63797
No. 12    160515
No. 15    359560
No. 17    427908
No. 16    1307310

Stop criteria:
1. Users input the number of outliers or outlier groups they want.
2. Use the maximum flow as the stop condition: if D_flow < D_avg, then stop, where D_flow = 1/(n-th root of the max_flow) and D_avg = average distance of the remaining data.

Loop     Cut                  Max Flow
No. 1    vertex 4             1267
No. 2    vertex 1             1269
No. 3    vertex 3             3256
No. 4    vertex 5             3937
No. 5    vertex 8             5939
No. 6    vertices 7, 9, 10    16531
No. 7    vertex 2             16533
No. 8    vertex 6             25378
No. 9    vertex 11            52498

Previous work

1. Density-based algorithm: measures the difference in density between an object and its neighboring objects.
2. Distribution-based algorithm: an object O in a dataset T is a UO(p, D)-outlier if at least fraction p of the objects in T are at distance greater than D from O.
3. Distance-based algorithm: the problem of finding all DB(p, D)-outliers can be solved by answering a nearest-neighbor or range query centered at each object O.
4. Depth-based algorithm: depth-based algorithms find the outliers by peeling off the outer layers of convex hulls.
5. Clustering-based algorithm: outliers are a byproduct of the clustering process; they are the objects that do not end up in any cluster.

Outliers: noisy data, novel information, anomaly, deviation.

Scale up the capacity by the n-th power of the original capacity.

Different parameters: different k nearest neighbors (K = 7, K = 10, K = 15). Increasing k gives the network more edges, and outliers are split into more pieces. When maximum average distance is used to select the next source, outliers are also split into more pieces.

Final results after running the algorithm again on the coarse network: because of the order of removal, outliers 13 and 14 have quite different maximum flows. We coarsen the graph by using each cut as a vertex and merging edges.

Section titles: Outlier Detection and Application; Previous work; Repair poor quality of a cluster; Poor quality clusters; Theory foundation; Outlier Detection by Network Flow; Set up the Network; Find an outlier/outlier group; Choose next source; Stop criteria; Different parameters; Outliers and maximum flow results; Algorithm process.
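The "vertex with maximum average distance" rule for choosing the next source, described earlier, can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `average_distances` and `next_source` are hypothetical, the distance is assumed to be Euclidean, and the averages are computed once over all points rather than recomputed after each cut.

```python
import math


def average_distances(points):
    """Average Euclidean distance from each point to every other point (assumed metric)."""
    n = len(points)
    return [
        sum(math.dist(p, q) for j, q in enumerate(points) if j != i) / max(n - 1, 1)
        for i, p in enumerate(points)
    ]


def next_source(points, removed):
    """Index of the not-yet-cut vertex with the largest average distance, or None."""
    avg = average_distances(points)
    candidates = [i for i in range(len(points)) if i not in removed]
    if not candidates:
        return None
    return max(candidates, key=lambda i: avg[i])


# Toy data (hypothetical): four points in a unit square plus one distant point.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]
first = next_source(points, removed=set())
second = next_source(points, removed={first})
print(first, second)
```

On this toy data the distant point is picked first. Once it has been cut off, the choice falls to whichever remaining vertex has the largest average distance.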
