1
School of Computer Science, Carnegie Mellon University
Athens University of Economics & Business
Patterns amongst Competing Task Frequencies: Super-Linearities, & the Almond-DG model
Danai Koutra, B. Aditya Prakash, Vasileios Koutras, Christos Faloutsos
PAKDD, 15-17 April 2013, Gold Coast, Australia
2
CMU AUEB
Questions we answer
Patterns: If Bob executes task x n_x times, how many times does he execute task y?
Modeling: Which 2-d distribution fits 2-d clouds of points?
[Figure: 2-d cloud of points; example point ‘Smith’ (100 calls, 700 sms)]
© Danai Koutra (CMU) - PAKDD'13
4
Let’s peek at our contributions
Patterns: power laws between competing tasks; log-logistic distributions for many tasks
Modeling: Almond-DG distribution for 2-d real datasets
Practical Use: spot outliers; what-if scenarios
[Figure: axes ln(comments), ln(tweets)]
7
Roadmap: Data; Observed Patterns; Related Work; Proposed Distribution; Goodness of Fit; Conclusions
8
Data 1: Tencent Weibo
Micro-blogging website in China; 2.2 million users
Tasks extracted: Tweets, Retweets, Comments, Mentions, Followees
9
Data 2: Phonecall Dataset
Phone-call records; 3.1 million users
Tasks extracted: Calls, Messages, Voice friends, SMS friends, Total minutes of phonecalls
10
Roadmap: Data; Observed Patterns; Related Work; Proposed Distribution; Goodness of Fit; Conclusions
11
Pattern 1 - SuRF: Super Linear Relative Frequency (1)
Logarithmic Binning Fit [Akoglu’10]: 15 log buckets; E[Y|X=x] per bucket; linear regression on conditional means
[Figure: axes ln(tweets), ln(retweets); annotated value 0.23; example point ‘Smith’ (1100 retweets, 7 tweets)]
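The logarithmic binning fit named on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `log_binned_fit`, the use of the geometric bucket centre as the x-coordinate, and taking the log of the per-bucket mean are all assumptions made here.

```python
import numpy as np

def log_binned_fit(x, y, n_buckets=15):
    """Fit a line to per-bucket conditional means in log-log space.

    Returns (slope, intercept) of ln(mean of Y in bucket) against
    ln(bucket centre). Bucket layout and centre choice are assumptions.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = (x > 0) & (y > 0)          # logarithms need positive counts
    x, y = x[keep], y[keep]

    # Bucket edges equally spaced in ln(x).
    edges = np.exp(np.linspace(np.log(x.min()), np.log(x.max()), n_buckets + 1))
    # Bucket index per point; clip so the extremes land in the end buckets.
    idx = np.clip(np.digitize(x, edges) - 1, 0, n_buckets - 1)

    log_centres, log_means = [], []
    for b in range(n_buckets):
        in_bucket = idx == b
        if in_bucket.any():           # skip empty buckets
            log_centres.append(0.5 * (np.log(edges[b]) + np.log(edges[b + 1])))
            log_means.append(np.log(y[in_bucket].mean()))

    # Ordinary least squares on the conditional means.
    slope, intercept = np.polyfit(log_centres, log_means, 1)
    return slope, intercept
```

Calling `log_binned_fit(tweets, retweets)` on two per-user count arrays would return the slope and intercept of the fitted line; a slope above 1 would indicate a super-linear relationship between the two tasks.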
12
Pattern 1 – SuRF (2)
Corr coeff: ++
Intuition: 2x tweets, 4x comments
[Figure: axes ln(tweets), ln(comments); annotated value 0.304]
13
Pattern 1 – SuRF (3)
Corr coeff: ++
[Figure: axes ln(tweets), ln(mentions); annotated value 0.33]
14
Pattern 1 – SuRF (4)
Corr coeff: ++
[Figure: axes ln(followees), ln(retweets); annotated value 0.25]
15
Pattern 1 – SuRF (5)
Super-linear relationship: more calls, even more minutes
Corr coeff: ++
[Figure: axes ln(calls_no), ln(total_mins); annotated value 1.18]
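As a worked reading of the number on this slide, assume the 1.18 is the slope of the fitted line of ln(total_mins) against ln(calls_no); the constant c below is a hypothetical intercept that the slide does not give. Under that assumption, doubling the number of calls more than doubles the total minutes:

```latex
\ln(\text{total\_mins}) \approx 1.18\,\ln(\text{calls\_no}) + c
\;\Longrightarrow\;
\text{total\_mins} \propto \text{calls\_no}^{1.18},
\qquad
\frac{\text{total\_mins}(2n)}{\text{total\_mins}(n)} = 2^{1.18} \approx 2.27 .
```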
16
Pattern 1 – SuRF (6a)
2x friends, 3x phonecalls
[Figure: axes ln(calls_no), ln(voice_friends)]
17
Pattern 1 – SuRF (6b)
Telemarketers?
[Figure: axes ln(calls_no), ln(voice_friends)]
18
Pattern 1 – SuRF (7)
2x friends, 5x sms
[Figure: axes ln(sms_friends), ln(sms_no)]
19
Contributions revisited (1)
Patterns: power laws between competing tasks; log-logistic distributions for many tasks
Modeling: Almond-DG distribution for 2-d real datasets
Practical Use: spot outliers; what-if scenarios
[Figure: axes ln(comments), ln(tweets)]
20
Pattern 2: log-logistic marginals (1)
NOT power law
[Figure: marginal PDF; axes ln(retweets), ln(frequency)]
21
Pattern 2: log-logistic marginals (2)
NOT power law
[Figure: marginal PDF; axes ln(comments), ln(frequency)]
22
Pattern 2: log-logistic marginals (3)
power law
[Figure: marginal PDF; axes ln(mentions), ln(frequency)]
23
Contributions revisited (2)
Patterns: We observe power law relationships between competing tasks and log-logistic distributions for many tasks
Modeling: We propose the Almond-DG distribution for fitting 2-d real world datasets
Practical Use: spot outliers; what-if scenarios
24
Roadmap: Data; Observed Patterns; Proposed Distribution (Problem Definition; Almond-DG; Background: copulas); Goodness of Fit; Conclusions
25
Problem definition
Given: cloud of points
Find: a 2-d PDF, f(x,y), that captures (a) the marginals and (b) the dependency
27
Solutions in the Literature?
Multivariate Logistic [Malik & Abraham, 1973]
Multivariate Pareto Distribution [Mardia, 1962]
Triple Power Law [Akoglu et al., 2012]: bivariate distribution for modeling reciprocity in phonecall networks
BUT none of them captures both the 2-d marginals AND the dependency / correlation!
28
Roadmap: Related Work; Data; Observed Patterns; Proposed Distribution (Problem Definition; Almond-DG; Background: copulas); Goodness of Fit; Conclusions
30
STEP 1: How to model the marginal distributions?
A: Log-logistic!
Q: Why?
A: Because it mimics Pareto, captures the top concavity, and matches reality
[Figure: marginal PDF; axes ln(retweets), ln(frequency)]
31
Reminder: Log-logistic (1) (background)
The longer you survive the disease, the even longer you survive
Not memoryless
2 parameters: scale (α) and shape (β)
[Figure: log-logistic curves; legend α=1 and two values of β, not preserved in the text]
32
Reminder: Log-logistic (2a) (background)
In log-log scales, it looks like a hyperbola
33
Reminder: Log-logistic (2b) (background)
In log-log scales, it looks like a hyperbola
Blank out the top concavity: power law
34
Fact: Log-logistic (3) (background)
Linear log-odds plots, where odds = Prob(X ≤ x) / Prob(X > x)
[Figure: ln(odds) against ln(mentions); α = 2.07, β = 1.27]
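The linearity of the log-odds plot follows directly from the log-logistic CDF. The derivation below assumes the standard parametrization with scale α and shape β, which matches the two parameters named in the reminder slide; F denotes the CDF.

```latex
F(x) = \Pr(X \le x) = \frac{1}{1 + (x/\alpha)^{-\beta}}, \quad x > 0
\;\Longrightarrow\;
\text{odds}(x) = \frac{F(x)}{1 - F(x)} = \left(\frac{x}{\alpha}\right)^{\beta}
\;\Longrightarrow\;
\ln \text{odds}(x) = \beta \ln x - \beta \ln \alpha .
```

So a plot of ln(odds) against ln(x) is a straight line with slope β and intercept −β ln α, and both parameters can be read off the fitted line.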
35
Problem definition
Given: cloud of points
Find: a 2-d PDF, f(x,y), that captures (a) the marginals and (b) the dependency
✔ ✔
36
STEP 2: How to model the dependency?
A: we borrow an idea from survival models, financial risk management, and decision analysis: COPULA!
38
Copulas in a nutshell (background)
Model dependence between r.v.’s (e.g., X and Y are the counts of two tasks)
Create a multivariate distribution s.t.: the marginals are preserved; the correlation (+, -, none) is captured
39
STEP 2: Which copula?
A: among the many copulas, Gumbel’s copula
40
Applications of Gumbel’s copula (background)
Modeling of:
the dependence between loss and lawyer’s fees, in order to calculate reinsurance premiums
the rainfall frequency as a joint distribution of volume, peak, duration, etc.
…
41
Gumbel’s copula: Example 1 (background)
Uniform marginals; no dependence
42
Gumbel’s copula: Example 2 (background)
Skewed marginals; no correlation
43
Gumbel’s copula: Example 3 (background)
Skewed marginals; ρ = 0.7
44
Problem definition
Given: cloud of points
Find: a 2-d PDF, f(x,y), that captures (a) the marginals and (b) the dependency
✔ ✔
45
where θ = (1 – ρ)^(-1) captures the dependence, and ρ = Spearman’s coefficient
[Figure: example panels for ρ=0, ρ=0.4, ρ=0.7 and for ρ=0, ρ=0.2, ρ=0.7; marginal parameters shown as α = ?, β = ?]
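The formula this slide refers to is not part of the text. For reference, the standard bivariate Gumbel copula with dependence parameter θ ≥ 1 is written out below. Plugging the log-logistic marginal CDFs F_X and F_Y from Step 1 into it is an assumption about how the pieces fit together, not a transcription of the slide.

```latex
C_\theta(u, v) = \exp\!\Big( -\big[ (-\ln u)^{\theta} + (-\ln v)^{\theta} \big]^{1/\theta} \Big),
\qquad u, v \in (0, 1],\ \theta \ge 1,
\qquad
F(x, y) = C_\theta\big( F_X(x),\, F_Y(y) \big).
```

With θ = 1 the copula reduces to C(u, v) = uv, i.e., independence, which agrees with the slide's mapping θ = (1 – ρ)^(-1) at ρ = 0.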
46
Almond-DG
If (X,Y) ~ Almond, then (floor(X), floor(Y)) ~ Almond-DG, where X>=1 and Y>=1;
i.e., we discretize the values of Almond and reject the pairs with either X=0 or Y=0.
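The discretize-and-reject step can be sketched end to end as below. This is an illustrative sketch under stated assumptions, not the authors' generator: the continuous pairs are drawn from a Gumbel copula using the Marshall-Olkin construction with Kanter's representation of a positive stable variable (a sampling technique chosen here), the marginals are log-logistic via their inverse CDF, and all function names are hypothetical. The mapping θ = (1 – ρ)^(-1) is the one stated in the slides.

```python
import numpy as np

def sample_gumbel_copula(n, theta, rng):
    """Draw n pairs (u, v) with uniform marginals from a Gumbel copula.

    Marshall-Olkin construction: a positive stable variable s with
    index 1/theta (Kanter's representation) mixes two exponentials.
    """
    if theta < 1:
        raise ValueError("Gumbel copula needs theta >= 1")
    if theta == 1:                         # independence copula
        return rng.uniform(size=n), rng.uniform(size=n)
    a = 1.0 / theta
    ang = rng.uniform(0.0, np.pi, size=n)
    w = rng.exponential(size=n)
    # Positive stable s whose Laplace transform is exp(-t**a).
    s = (np.sin(a * ang) / np.sin(ang) ** (1.0 / a)
         * (np.sin((1.0 - a) * ang) / w) ** ((1.0 - a) / a))
    e1 = rng.exponential(size=n)
    e2 = rng.exponential(size=n)
    u = np.exp(-(e1 / s) ** a)
    v = np.exp(-(e2 / s) ** a)
    return u, v

def loglogistic_ppf(u, alpha, beta):
    """Inverse CDF of the log-logistic with scale alpha and shape beta."""
    u = np.clip(u, 1e-12, 1.0 - 1e-12)     # keep the odds finite
    return alpha * (u / (1.0 - u)) ** (1.0 / beta)

def sample_almond_dg(n, rho, ax, bx, ay, by, seed=0):
    """Discretized pairs: floor both coordinates, reject any pair with a 0."""
    rng = np.random.default_rng(seed)
    theta = 1.0 / (1.0 - rho)              # mapping stated in the slides
    u, v = sample_gumbel_copula(n, theta, rng)
    x = np.floor(loglogistic_ppf(u, ax, bx))
    y = np.floor(loglogistic_ppf(v, ay, by))
    keep = (x >= 1) & (y >= 1)             # rejection step
    return x[keep].astype(np.int64), y[keep].astype(np.int64)
```

Because of the rejection step, `sample_almond_dg(n, ...)` returns at most n pairs; to obtain a fixed number of points one would oversample and truncate.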
47
Contributions revisited (3)
Patterns: We observe power laws between competing tasks and log-logistic distributions for many tasks
Modeling: Almond-DG distribution for 2-d real datasets
Practical Use: spot outliers; what-if scenarios
48
Roadmap: Related Work; Data; Observed Patterns; Proposed Distribution; Goodness of Fit; Conclusions
49
Synthetic Data Generation
Parameter Estimation
Log-logistic: traditionally MLE, MOM; proposed: log-odds plot (2 parameters: intercept + slope of the line)
Copula-based generation: 1 parameter, the dependence θ
Evaluation is hard even for 1-d skewed distributions! [Chakrabarti, 2006]
[Figure: axes ln(mentions), ln(odds)]
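Estimating the two log-logistic parameters from the log-odds plot can be sketched as follows. This is a minimal illustration, assuming an empirical CDF with a half-step offset and an ordinary least-squares line; the function name `fit_loglogistic_log_odds` is hypothetical, and for heavily tied count data one would likely aggregate equal values first.

```python
import numpy as np

def fit_loglogistic_log_odds(samples):
    """Estimate (alpha, beta) from the line ln(odds) = beta*ln(x) - beta*ln(alpha)."""
    x = np.sort(np.asarray(samples, dtype=float))
    x = x[x > 0]                            # logarithm needs positive values
    n = len(x)
    # Empirical CDF at each order statistic, kept strictly inside (0, 1).
    p = (np.arange(1, n + 1) - 0.5) / n
    log_odds = np.log(p / (1.0 - p))
    slope, intercept = np.polyfit(np.log(x), log_odds, 1)
    beta = slope                            # shape = slope of the line
    alpha = np.exp(-intercept / slope)      # scale recovered from the intercept
    return alpha, beta
```

The appeal of this route is that it needs only a line fit: the slope gives the shape β and the intercept gives the scale α, with no iterative likelihood optimization.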
50
Goodness of Fit (1a)
1 ✔ Marginal PDFs: real data vs. synthetic data
[Figures: axes ln(comments), ln(frequency); axes ln(mentions), ln(frequency)]
51
Goodness of Fit (1b)
2 ✔ Contour plots: real data vs. synthetic data
3 ✔ Conditional Means (SuRF)
[Figures: axes ln(mentions), ln(comments)]
52
Goodness of Fit (2a)
1 ✔ Marginal PDFs: real data vs. synthetic data
[Figures: axes ln(retweets), ln(frequency); axes ln(tweets), ln(frequency)]
53
Goodness of Fit (2b)
2 ✔ Contour plots: real data vs. synthetic data
3 ✔ Conditional Means (SuRF)
[Figures: axes ln(retweets), ln(tweets)]
54
Roadmap: Related Work; Data; Observed Patterns; Proposed Distribution; Goodness of Fit; Conclusions
55
Conclusions
Patterns: discovery of power laws between competing tasks and log-logistic distributions for many tasks
Modeling: Almond-DG, which explains (i) super-linearity, (ii) marginals and (iii) conditionals in real 2-d data
Practical Use: anomaly detection; what-if scenarios
[Figure: axes ln(comments), ln(tweets)]
56
Thank you!
Almond-DG
57
Backup slides
Likely question areas: ideas glossed over, shortcomings of methods or results, and future work
Why Gumbel? It fits, it has been used in the past, and it is parsimonious (theta, alpha, beta)
58
Why are we interested in these questions?
We can answer what-if scenarios and spot anomalies.
59
… but
Power Laws: although prevalent, they do not always hold in real data
Logistic & Log-Logistic Distributions: no earlier work provides a 2-d distribution that explains the patterns found in the real datasets, i.e., super-linearity + log-logistic marginals
60
Goodness of Fit: mentions vs. comments
Evaluation is hard even for univariate skewed distributions
61
STEP 2: How to model the dependency?
With copulas … and specifically, Gumbel’s copula!
62
Copulas formally (background)
Add simple cases for Gumbel’s copula + properties of it…