Using Association Rules for Fraud Detection in Web Advertising Networks. Ahmed Metwally, Divyakant Agrawal, Amr El Abbadi. Department of Computer Science.

Presentation transcript:

1 Using Association Rules for Fraud Detection in Web Advertising Networks Ahmed Metwally Divyakant Agrawal Amr El Abbadi Department of Computer Science University of California, Santa Barbara

2 Outline
Introduction
– Motivating Applications
Problem Formalization
– Problem Definition: Association Rules in Data Streams
Which Elements to Count Together?
– The Unique-Count Technique
A Feasible Counting Algorithm
– The Streaming-Rules Algorithm
Experimental Results
Conclusion

3 The Advertising Network Model
Motivated by Internet Advertising Commissioners: detect hit-inflation fraud committed by publishers

4 It Seems Like a Famous Problem
“When Advertisers Pay by the Look, Fraud Artists See Their Chance”, David Vise, Washington Post, April 17, 2005; Page F01
Previous Work [Metwally et al. WWW’05]
– Detecting Duplicates in Click Streams
Fraud (27% of traffic) was detected in live data

5 Hit-Inflation Attack [Anupam et al. WWW’99]

6 Hit-Inflation Characteristics [Anupam et al. WWW’99]
Hit-inflation fraud technique
– Coalition: dishonest Publisher P and dishonest Site S
– Two versions of PageP.html: non-fraudulent and fraudulent
– If Customer C is referred from S, P loads the fraudulent PageP.html. Otherwise, P loads the non-fraudulent PageP.html

7 Why Is It Difficult to Detect?
Duplicate detection does not work
The Commissioner does not know the Referer field value for HTTP calls to Publishers
Hidden from the Customer
A normal visit: non-fraudulent PageP.html

8 Detecting Anupam’s Attack
We call for a coalition between Advertising Commissioners and ISPs.
ISP: Which Websites precede which Websites?
We are interested in popular pairs of elements

9 Mining Association Rules in Streams of Elements
Another Motivation:
– Predictive caching
  – File Servers
  – Search Engines
Model:
– Needs a new way to model streams generated by the activity of more than one customer
– Previous work [Chang et al. SIGKDD’03, Teng et al. VLDB’03, Yu et al. VLDB’04] assumed streams of transactions or sessions

10 Formalizing the Problem
Assumption 1: Stream of Elements
– Previous work [Chang et al. SIGKDD’03, Teng et al. VLDB’03, Yu et al. VLDB’04] assumed streams of transactions or sessions
– This is not always applicable
– ISPs tracking HTTP requests of customers individually:
  – Privacy violation (US Code: Title 18, Part I, Chapter 119, Section 2511)
  – Technically, NAT boxes hide thousands of computers
– Search Engines: not all of them use cookies
– File Servers: distributed applications blur sessions

11 Formalizing the Problem (cont.)
Assumption 2: Causality Span
– Causality holds between temporally close element pairs
Assumption 3: Lost History
– The server cannot store the entire history. It only stores a current window of elements.
Assumption 4: Independent Duplicates
– Duplicate pairs are assumed to be issued by different Customers
Assumption 5: No False Negatives
– Give counting the benefit of the doubt
– Stream = aab ⇒ Count(a,b) = 1
– Stream = aabb ⇒ Count(a,b) = 2
– Stream = abab ⇒ Count(a,b) = 2

12 Problem Definition
Formal Definition
– Given a stream q_1, q_2, …, q_I, …, q_N of size N
– Assume causality holds within a span δ
– An association rule is an implication of the form x → y
– The conditional frequency F(x, y) of x and y is the number of times distinct y’s follow distinct x’s within δ
– The frequency F(x) of x is the number of occurrences of x
Antecedent ≠ Consequent

13 Problem Definition (cont.)
Two Variations
– Forward Association Rules:
  – Motivated by search engines and file servers
  – Focus on Antecedent: F(x) > φN
  – Frequent conditional frequency: F(x, y) > ψ F(x)
– Backward Association Rules:
  – Motivated by detecting Anupam’s fraud technique
  – Focus on Consequent: F(y) > φN
  – Frequent conditional frequency: F(x, y) > ψ F(y)
Both φ and ψ are user specified, 0 ≤ φ, ψ ≤ 1

14 Example
S = x x u u c g d c x f x u, N = 12
F(x) = 4, F(u) = 3, F(f) = 1
Span between g and f is 4
Within span 2, F(c, d) = 1
Within span 3, F(u, g) = 1, only one possible pairing
For any span > 1, F(x, u) = 3, only 3 u’s
User Query: δ = 3, φ = 0.2, and ψ = 0.3
– Min support requirement: frequency > φN = 0.2 × 12 = 2.4, i.e. at least 3 occurrences
– Only x and u can be antecedents for forward association or consequents for backward association
Forward Association:
– For x, min confidence requirement: F(x, y) > ψ F(x) = 0.3 × 4 = 1.2, i.e. at least 2
– For u, min confidence requirement: F(u, y) > ψ F(u) = 0.3 × 3 = 0.9, i.e. at least 1
– Since δ = 3, rules are x → u, u → c, u → g, u → d
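The threshold arithmetic in this example can be checked with a short Python sketch. The function name `select_rules` and the dictionaries are illustrative assumptions, not part of the presentation. The element frequencies come from the stream on the slide. The pair counts are hand-derived for antecedents x and u only, by pairing distinct occurrences within δ = 3.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def select_rules(freq: Dict[str, int], pair_freq: Dict[Pair, int],
                 n: int, phi: float, psi: float,
                 backward: bool = False) -> List[Pair]:
    """Return the rules x -> y that pass the support and confidence tests.

    Forward rules test the antecedent x, backward rules test the consequent y.
    """
    rules = []
    for (x, y), f_xy in pair_freq.items():
        if x == y:  # antecedent must differ from consequent
            continue
        focus = freq[y] if backward else freq[x]
        if focus > phi * n and f_xy > psi * focus:
            rules.append((x, y))
    return sorted(rules)


# Slide 14: S = x x u u c g d c x f x u, N = 12, delta = 3.
FREQ = {"x": 4, "u": 3, "c": 2, "g": 1, "d": 1, "f": 1}
# Assumption: hand-derived pair counts, listed only for antecedents x and u.
PAIR_FREQ = {("x", "u"): 3, ("x", "c"): 1, ("x", "f"): 1,
             ("u", "c"): 1, ("u", "g"): 1, ("u", "d"): 1}

FORWARD = select_rules(FREQ, PAIR_FREQ, n=12, phi=0.2, psi=0.3)
print(FORWARD)
```

With these inputs the sketch keeps x → u, u → c, u → g and u → d, the same rules listed on the slide. The strict inequalities mirror the definitions F(x) > φN and F(x, y) > ψ F(x).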

15 Guidelines on Pairing Elements
Element a cannot cause itself
For any two elements a and b, we cannot count one a for more than one b
Associate causality with the oldest possible element. This avoids underestimating counts.
The server cannot store the entire history. It only stores a current window of elements.
– The current window is at least δ + 1
It is not a simple problem to comply with such rules. WHY?

16 Example
Assume current window = 6, δ = 5
S = a a b
– b will be counted with a at q_1. Hence, a at q_2 can be counted with another b later
S = a a b c d a b
– Since the server cannot see the expired a, it will assume that b at q_3 is counted with a at q_2. Hence, b at q_7 is counted with a at q_6
S = a a b c d a b b
– The server cannot associate the new b at q_8 with any a, since the b at q_7 is counted with a at q_6
A more cautious counting results in F(a,b) = 3 instead of 2
Shall the server keep more history?

17 Example (cont.)
Assume we consider the forward association a → b
δ = 5, S = a a b c d … b
– The server needs the entire history for a correct F(a, b)
δ = 5, S = a a b c d b …
– If current window = 6, the server counts only 2/3 × F(a, b)
Shall the server keep the entire history?

18 The Unique-Count Algorithm
Data Structures:
– For the last element, q_I, keep an Antecedent Set, t_I
  – It contains the elements that arrived before q_I and were counted with q_I
  – The set expires when a new element is observed
– For each element, q_J, in the current window, keep a Consequent Set, s_J
  – It contains the elements that arrived after q_J and were counted with q_J
Space Complexity: O(δ^2)
Processing time per element: O(δ)

19 Unique-Count By Example
Unique-Count Technique
– For each arriving element, q_I, scan the previous δ elements in order of arrival, from old to new. For every scanned element, q_J:
  – If (q_J ≠ q_I) and (q_J ∉ t_I) and (q_I ∉ s_J)
    – Count q_I for q_J
    – Insert q_J into t_I and q_I into s_J
δ = 3, S = a a b
– F(a,b) = 1

20 Unique-Count By Example
δ = 3, S = a a b c
– F(a,b) = 1
– F(a,c) = 1

21 Unique-Count By Example
δ = 3, S = a a b c
– F(a,b) = 1
– F(a,c) = 1
– F(b,c) = 1

22 Unique-Count By Example
δ = 3, S = a a b c b
– F(a,b) = 2
– F(a,c) = 1
– F(b,c) = 1

23 Unique-Count By Example
δ = 3, S = a a b c b
– F(a,b) = 2
– F(a,c) = 1
– F(b,c) = 1
– F(c,b) = 1
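The Unique-Count technique walked through on slides 19 to 23 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `unique_count` and the use of a `deque` holding the previous δ elements are assumptions made here. Each window entry carries its consequent set s_J, and the antecedent set t_I is rebuilt for every arriving element.

```python
from collections import Counter, deque


def unique_count(stream, delta):
    """Count pairs F(x, y), pairing distinct occurrences within span delta."""
    counts = Counter()
    # Previous delta elements, oldest first; each entry is (element, s_J).
    window = deque()
    for q in stream:
        antecedents = set()  # t_I: elements already counted with this arrival
        for elem, consequents in window:  # scan from old to new
            if elem != q and elem not in antecedents and q not in consequents:
                counts[(elem, q)] += 1
                antecedents.add(elem)
                consequents.add(q)
        window.append((q, set()))
        if len(window) > delta:
            window.popleft()  # the oldest element expires
    return counts


# Stream from slides 19-23 (as reconstructed from the counts shown there).
print(dict(unique_count("aabcb", 3)))
```

The same sketch reproduces the counts quoted earlier in the talk: 1, 2 and 2 for the streams aab, aabb and abab, and F(a,b) = 3 for the stream a a b c d a b b with δ = 5. The nested loop makes the O(δ) processing time per element visible, and the per-element consequent sets account for the O(δ^2) space.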

24 Is the Problem Solved?
Yes, we know which elements to count together for association.
No, this is not practical. We cannot keep counters for all possible pairs of elements
We need an efficient algorithm to count frequent elements associated with other frequent elements
We need to count nested frequent elements in data streams

25 Nesting Frequent Elements Algorithms
If we have a counter-based algorithm, Λ, that finds φ-frequent elements in streams, we use it to find antecedents of rules.
For every antecedent, x, we use Λ to find consequents, i.e. elements that occurred after x within δ and satisfy ψ F(x).
Λ can be our algorithm Space-Saving [Metwally et al. ICDT’05], or one of the [Manku et al. VLDB’02] algorithms.

26 Nesting Frequent Elements Data Structure
The Λ algorithm keeps a Γ data structure to estimate the counts of frequent antecedents.
For every frequent antecedent, x, a nested data structure Γ_x is kept to estimate the counts of its frequent consequents.

27 The Space-Saving Algorithm
Space-Saving [Metwally et al. ICDT’05] is a counter-based algorithm
Monitor only m elements in a Stream-Summary data structure
Frequency estimation is more accurate for significant elements
Keep track of the maximum possible overestimation error for each element
Properties:
– No. of counters < 1/ε, where ε is a user-specified error
– An element, x, with F(x) > εN is guaranteed to be monitored

28 Space-Saving By Example
Space-Saving Algorithm
– For every element in the stream S
  – If a monitored element is observed, increment its Count
  – If a non-monitored element is observed:
    – Replace the element with minimum hits, min
    – Increment the minimum Count to min + 1
    – Its maximum possible over-estimation, error, is min
Stream letters shown: ABBACABBDD, E, C, B
Counter states, in order of appearance:
Element   Count   error (max possible)
A B C     2 2 1   0 0 0
A B C     3 2 1   0 0 0
B A C     4 3 1   0 0 0
B A D     4 3 2   0 0 1
B A D     5 3 3   0 0 1
B E A     5 4 3   0 3 0
B E C     5 4 4   0 3 3
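The update rule on this slide can be sketched as a small Python function. Two things here are assumptions. First, the counters live in plain dictionaries and the minimum is found by a linear scan, which stands in for the constant-time Stream-Summary structure. Second, the test stream `ABBACABBDBDEC` is an ordering chosen here so that the counters end in the last state shown on the slide, because the extracted slide text does not preserve the exact arrival order. Ties for the minimum may be broken differently from the slide's animation.

```python
def space_saving(stream, m):
    """Track at most m elements; return (counts, errors) dictionaries."""
    counts = {}
    errors = {}
    for item in stream:
        if item in counts:
            counts[item] += 1            # monitored element: increment
        elif len(counts) < m:
            counts[item] = 1             # free counter available
            errors[item] = 0
        else:
            # Replace the element with the minimum count (linear scan here).
            victim = min(counts, key=counts.get)
            min_count = counts.pop(victim)
            del errors[victim]
            counts[item] = min_count + 1
            errors[item] = min_count     # maximum possible overestimation
    return counts, errors


COUNTS, ERRORS = space_saving("ABBACABBDBDEC", 3)
print(COUNTS, ERRORS)
```

Because every arrival increases exactly one counter, the counts always sum to the number of elements processed, which is a quick sanity check on any trace of the algorithm.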

29 The Streaming-Rules Algorithm
Streaming-Rules Algorithm
– For every arriving element, q_I, in the stream S
  – Update the Antecedent Stream-Summary using Space-Saving
  – If q_I was not monitored before, initialize its Consequent Stream-Summary
  – Identify the elements that q_I should be counted for as a consequent using Unique-Count
  – For each identified element q_J, insert q_I into the Consequent Stream-Summary of q_J using Space-Saving

30 Querying the Nested Structure
Find-Forward Algorithm
– Scan the Antecedent Stream-Summary until the scanned element does not satisfy minsup
– For each scanned element, q_I
  – Scan the Consequent Stream-Summary of q_I until the scanned element, q_J, does not satisfy minconf
  – For each scanned element q_J, output q_I → q_J
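The update and query steps of slides 29 and 30 can be combined into one hedged Python sketch. The class names `Summary` and `StreamingRules` are hypothetical. `Summary` is a dictionary-based stand-in for Stream-Summary, so the query filters every counter instead of scanning a sorted structure until the threshold fails. Dropping the nested summary of an evicted antecedent, and skipping window elements that are no longer monitored, are assumptions made for this illustration.

```python
from collections import deque


class Summary:
    """Space-Saving counters (dictionary stand-in for Stream-Summary)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = {}
        self.errors = {}

    def offer(self, item):
        """Count one occurrence; return the evicted element, if any."""
        if item in self.counts:
            self.counts[item] += 1
            return None
        evicted, base = None, 0
        if len(self.counts) >= self.capacity:
            evicted = min(self.counts, key=self.counts.get)
            base = self.counts.pop(evicted)
            del self.errors[evicted]
        self.counts[item] = base + 1
        self.errors[item] = base
        return evicted


class StreamingRules:
    def __init__(self, delta, m, n):
        self.delta, self.n = delta, n
        self.antecedents = Summary(m)
        self.consequents = {}   # antecedent -> Summary of its consequents
        self.window = deque()   # (element, consequent set), oldest first
        self.seen = 0

    def add(self, q):
        self.seen += 1
        evicted = self.antecedents.offer(q)
        if evicted is not None:
            self.consequents.pop(evicted, None)
        if q not in self.consequents:        # q was not monitored before
            self.consequents[q] = Summary(self.n)
        counted = set()                      # Unique-Count antecedent set
        for elem, cons in self.window:
            if elem != q and elem not in counted and q not in cons:
                counted.add(elem)
                cons.add(q)
                if elem in self.consequents:
                    self.consequents[elem].offer(q)
        self.window.append((q, set()))
        if len(self.window) > self.delta:
            self.window.popleft()

    def find_forward(self, phi, psi):
        rules = []
        for x, fx in self.antecedents.counts.items():
            if fx > phi * self.seen:                       # minsup
                for y, fxy in self.consequents[x].counts.items():
                    if fxy > psi * fx:                     # minconf
                        rules.append((x, y))
        return sorted(rules)


sr = StreamingRules(delta=3, m=16, n=16)
for q in "xxuucgdcxfxu":        # stream from the example on slide 14
    sr.add(q)
RULES = sr.find_forward(phi=0.2, psi=0.3)
print(RULES)
```

With m and n larger than the number of distinct elements no counter is ever replaced, so the estimates are exact and the sketch returns the four forward rules of the earlier example. Smaller summaries trade that exactness for bounded space.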

31 The Streaming-Rules Properties
Streaming-Rules is an algorithm that:
– Detects both forward and backward association between keywords or sites
– Is space efficient
Streaming-Rules inherits some properties from Unique-Count:
– The processing time per element is O(δ)

32 The Streaming-Rules Properties (cont.)
Streaming-Rules inherits some properties from Space-Saving
– Using O(1/ε × 1/η) space, Streaming-Rules has overestimation rates bounded by ε in support and η in confidence. Both ε and η are user-specified errors
– A rule whose guaranteed frequency, count − overestimation, exceeds the thresholds is guaranteed to be correct
– An association rule x → y is guaranteed to be monitored in the Consequent Stream-Summary of x if F(x) > εN and F(x, y) > ηN

33 Experimental Setup
Data: both synthetic data and an obfuscated ISP log
Compare with Omni-Data, which uses the same Unique-Count technique and Stream-Summary data structure, but keeps exact counters
Compare: run time and space usage
For Streaming-Rules, measure:
– Recall: number of correct elements found / number of actual correct elements
– Precision: number of correct elements found / entire output
– Guarantee: number of guaranteed correct elements found / entire output
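The three quality measures can be written down directly from their definitions. In this small sketch the function name `evaluate` and the example rule sets are invented for illustration. It assumes that the guaranteed rules are the output rules whose count minus overestimation still exceeds the thresholds, and that `actual` is the exact answer, such as the one Omni-Data would produce.

```python
def evaluate(output, guaranteed, actual):
    """Return (recall, precision, guarantee) for a set of reported rules."""
    output, guaranteed, actual = set(output), set(guaranteed), set(actual)
    if not output or not actual:
        raise ValueError("output and actual must be non-empty")
    correct = output & actual
    recall = len(correct) / len(actual)
    precision = len(correct) / len(output)
    guarantee = len(guaranteed & output) / len(output)
    return recall, precision, guarantee


# Hypothetical rule sets, for illustration only.
OUTPUT = {("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")}
GUARANTEED = {("a", "b"), ("a", "c")}
ACTUAL = {("a", "b"), ("a", "c"), ("b", "c"), ("x", "y")}
print(evaluate(OUTPUT, GUARANTEED, ACTUAL))
```

Since guaranteed rules are always correct, the guarantee measure can never exceed precision.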

34 Synthetic Data Experiments
Adaptation to data skew:
– Zipfian data: skew parameter = 1, 1.5, 2, 2.5, 3
For all synthetic data, Streaming-Rules
– Recall = Precision = Guarantee = 1
Forward rules. φ = ψ = 0.1, δ = 10, 20
Streaming-Rules used a nested Stream-Summary with m = n = 500 ⇒ ε = 1/500 and η = 1/250

35 The Streaming-Rules Space Efficiency (N = 3 × 10^6)

36 The Streaming-Rules Time Efficiency (N = 3 × 10^6)

37 The Streaming-Rules Space Scalability (N = 10^7)

38 The Streaming-Rules Time Scalability (N = 10^7)

39 Real Data Experiments
Obfuscated ISP data from Anonymous.com, N = 678,191
For the ISP data, Streaming-Rules
– Recall = 1; Precision and Guarantee varied from 0.97 to 0.99
Interesting results:
– A set of suspicious antecedents, and a set of suspicious consequents
– The antecedents are not frequent
Backward rules. φ = 0.02, ψ = 0.5, δ = 10, 20, …, 100
Streaming-Rules used a nested Stream-Summary with m = 1000, n = 500 ⇒ ε = 1/500 and η = 3/1000

40 Space Usage - ISP Data (N = 6 × 10^5)

41 Time Usage - ISP Data (N = 6 × 10^5)

42 Conclusion
Contributions:
– A new model for mining (forward and backward) association between elements in data streams
– A solution to Anupam’s hit-inflation mechanism, which was never detected before
– A new algorithm for solving the proposed problem with limited processing per element and limited space
– Guarantees on results
– Experimental validation