Presentation is loading. Please wait.

Presentation is loading. Please wait.

Relative Linkage Disequilibrium: An intersection between evolution, algebraic statistics, text mining and contingency tables Ron S. Kenett KPA Ltd., Raanana,

Similar presentations


Presentation on theme: "Relative Linkage Disequilibrium: An intersection between evolution, algebraic statistics, text mining and contingency tables Ron S. Kenett KPA Ltd., Raanana,"— Presentation transcript:

1 Relative Linkage Disequilibrium: An intersection between evolution, algebraic statistics, text mining and contingency tables Ron S. Kenett KPA Ltd., Raanana, Israel and Department of Statistics and Applied Mathematics "Diego de Castro", University of Torino, Italy in collaboration with Silvia Salini Department of Economics, Business and Statistics, University of Milan, Italy

2 Text Mining Outline Contingency Tables Graphical Displays “Algebraic Statistics” Evolution

3 Background 1930 - 1975 FISHER, R. A. 1930, The Genetical Theory of Natural Selection. Clarendon, Oxford, U.K.. LEWONTIN, R., and KOJIMA, K., 1960, The evolutionary dynamics of complex polymorphisms. Evolution 14: 458- 472. KARLIN, S. and FELDMAN, M., 1970, Linkage and selection: Two locus symmetric viability model. Theoretical Population Biology 1 : 39-71. KARLIN, S., 1975, General two-locus models: some objectives, results and interpretations. Theoretical Population Biology 7 : 364-398. KARLIN,S. and KENETT, R. 1977, Variable Spatial Selection with Two Stages of Migration and Comparisons Between Different Timings, Theoretical Population Biology, 11, pp. 386-409. Linkage Disequilibrium Evolution R.A. Fisher Sam Karlin The Fundamental Theorem of Natural Selection: “the rate of increase of fitness of any organism at any time is equal to its genetic variance at that time”

4 Background - 1975 Contingency Tables Discrete multivariate analysis: theory and practice “The first comprehensive treatise on the analysis of categorical data using loglinear and related statistical models…”

5 Background - 1983 Graphical Display

6 Background - 2008 Algebraic Statistics

7 254 itemsets: The “e” relationship ……….. female…………..infection………………… ……….. female……………………..………………… ……….. ……….…………..infection………………… ……………..………………………….……..………… 57 40 109 48 Text Mining

8 The Data Contingency Tables e RHS^RHS LHS x 1 =.224 x 2 =.157 ^LHS x 3 =.429 x 4 =.189 d RHS^RHS LHS x 1 =.389 x 2 =.056 ^LHS x 3 =.50 x 4 =.056 h RHS^RHS LHS x 1 =.057 x 2 =.321 ^LHS x 3 =.189 x 4 =.434 LHS ^LHS RHS ^RHS RHS ^RHS eRHS^RHS LHS5740 ^LHS10948

9 “…a man who has carefully investigated a printed table, finds, when done, that he has only a very faint and partial idea of what he has read; and that like a figure imprinted on sand, is soon totally erased and defaced.” William Playfair (1786), The Commercial and Political Atlas, from Edward R. Tufte (1983), The Visual Display of Quantitative Information. Contingency Tables

10 The Simplex Graphical Display x1x1 x2x2 x3x3 x4x4

11 The Simplex Graphical Display

12 Linkage Disequilibrium “Algebraic Statistics” RHS^RHS LHSx1x1 x2x2 ^LHSx3x3 x4x4 D can be extended to more dimensions… Two loci, two alleles each, four genotypes: AB, Ab, aB, ab

13 Linkage Disequilibrium “Algebraic Statistics” An algebraic observation…

14 Relative Linkage Disequilibrium “Algebraic Statistics” D M is the distance from the point corresponding to the contingency table on the surface D=0 in the e  e direction, to the surface of the simplex, in that direction. D is the distance from the point corresponding to the contingency table in the simplex, to the surface D=0 in the e  e direction.

15 Graphical Display

16 RLD Example RLD Contingency Tables h RHS^RHS LHS x 1 =.057 x 2 =.321 ^LHS x 3 =.189 x 4 =.434 e RHS^RHS LHS x 1 =.224 x 2 =.157 ^LHS x 3 =.429 x 4 =.189 d RHS^RHS LHS x 1 =.389 x 2 =.056 ^LHS x 3 =.50 x 4 =.056

17 Association Rules Association rules are one of the most popular unsupervised data mining methods used in applications such as Market Basket Analysis, to measure the associations between products purchased by each consumer, or in web clickstream analysis, to measure the association between the pages seen (sequentially) by a visitor of a site. Mining frequent itemsets and association rules is a popular and well researched method for discovering interesting relations between variables in large databases. The structure of the data to be analyzed is typically referred to as transactional. Once obtained, the list of association rules extractable from a given dataset is compared in order to evaluate their importance level. The measures commonly used to assess the strength of an association rule are the indexes of support, confidence, and lift. Text Mining

18 Support The support for a rule A => B is obtained dividing the number of transactions which satisfy the rule, N A=>B, by the total number of transactions, N. support {A=>B} = N A=>B / N Text Mining

19 Support Text Mining RHS^RHS LHS x1x1 x2x2 g ^LHS x3x3 x4x4 1-g f1-f1 support {A=>B} = {N A=>B / N} = x 1 The higher the support the stronger the information that both type of events occur together.

20 Confidence The confidence of the rule A => B is obtained by dividing the number of transactions which satisfy the rule, N A=>B, by the number of transactions which contain the body of the rule, A. confidence {A=>B} = N A=>B / N A Text Mining

21 Confidence Text Mining RHS^RHS LHS x1x1 x2x2 g ^LHS x3x3 x4x4 1-g f1-f1 confidence {A=>B} = {N A=>B / N A } = x 1 /g A high confidence that the LHS event leads to the RHS event implies causation or statistical dependence.

22 Lift The lift of the rule A => B is the deviation of the support of the whole rule from the support expected under independence given the supports of the LHS (A) and the RHS (B). lift {A=>B} = confidence{A=>B} / support{B} = support{A=>B}/support{A}support{B} Text Mining

23 Relative Linkage Disequilibrium and other measures RLD Sup = 57/254 =.224 Conf = 57/97 =.588 Sup (RHS) = 166/254 =.654 lift =.588/.654 =.90 Contingency Tables eRHS^RHS LHS5740 ^LHS10948

24 The groceries example Association Rules First 20 rules for groceries data, sorted by Lift

25 The groceries example Association Rules Plot of Relative Disequilibrium versus Lift for the 430 rules of groceries data set For “Lift” > 2.5, RLD varies between 1%-40% RLD Lift

26 The groceries example Association Rules

27 The groceries example Association Rules Plot of Relative Disequilibrium versus Support for the 430 rules of groceries data set RLD shows more variability than “support” RLD Support

28 The groceries example Association Rules Plot of Relative Disequilibrium versus Confidence for the 430 rules of groceries data set For RLD of 20%, “Confidence” varies between 1%-40% RLD Confidence

29 The groceries example Association Rules First 20 rules for groceries data, sorted by Lift

30 The groceries example Association Rules Plot of Relative Disequilibrium versus Lift for the 430 rules of groceries data set For “Lift” > 2.5, RLD varies between 1%-40% RLD Lift

31 The groceries example Association Rules Plot of RLD versus Chisquare for the top 20 rules of groceries data set, sorted by RLD RLD ChiSquared

32 The groceries example Association Rules Plot of RLD versus Odds Ratio for the top 20 rules of groceries data set, sorted by RLD RLD Odds Ratio

33 Summary RLD is intuitive RLD yields different answers from “usual” measures RLD can be extended to higher dimensions There are opportunities in considering the relationship between Association Rules and Contingency Tables

34 Text Mining Contingency Tables Graphical Displays Algebraic Statistics Evolution

35 References Bishop, Y.M. M., Fienberg, S.E. and Holland, P.W. (1975). Discrete Multivariate Analysis: Theory and Practice. M.I.T. Press, Cambridge, MA. Paperback edition (1977). Reprinted, by Springer- Verlag, New York (2007). Fisher, R. A. (1930). The Genetical Theory of Natural Selection. Clarendon, Oxford, U.K.. Hahsler, M., Grun, B., and Hornik, K. (2005). arules – A computational environment for mining association rules and frequent item sets. Journal of Statistical Software, 14(15):1–25. ISSN 1548-7660. URL http://www.jstatsoft.org/v14/i15/. Karlin, S. and Feldman, M. (1970). Linkage and selection: Two locus symmetric viability model. Theoretical Population Biology 1 : 39-71. Karlin, S. and Kenett, R.S. (1977). Variable Spatial Selection with Two Stages of Migration and Comparisons Between Different Timings, Theoretical Population Biology, 11, pp. 386-409. Kenett, R. (1983). On an Exploratory Analysis of Contingency Tables. The Statistician, 32, pp. 395-403. Kenett, R. and Salini, S. (2008). Relative Linkage Disequilibrium: A New measure for association rules UNIMI - Research Papers in Economics, Business, and Statistics http://services.bepress.com/unimi/statistics/art32/ Lewontin, R,. C., and Kojima, K. (1960). The evolutionary dynamics of complex polymorphisms. Evolution 14: 458-472. Omiecinski, E. (2003). Alternative interest measures for mining associations in databases. IEEE Transactions on Knowledge and Data Engineering, 15(1):57–69. Piatetsky-Shapiro, G. (1991). Discovery, analysis, and presentation of strong rules. In: Knowledge Discovery in Databases, pages 229–248. Shimada, K., Hirasawa K, and Hu J. (2006) Association Rule Mining with Chi-Squared Test Using Alternate Genetic Network Programming, ICDM2006. Tan, P-N, Kumar, V., and Srivastava, J. (2004). Selecting the right objective measure for association analysis. Information Systems, 29(4):293–313.

36 Backup slides

37 The telecom systems example Association Rules

38 The telecom systems example Association Rules Item Frequency Plot (Support>0.1) of telecom data set

39 The telecom systems example Association Rules 3D Simplex Representation for 200 rules of telecom data set and for the top 10 rules sorted by RLD

40 The telecom systems example Association Rules 2D Simplex Representation for the top 10 rules sorted by RLD of telecom data set

41 The telecom systems example Association Rules Top 10 rules sorted by RLD of telecom data set

42 RLD Statistical Properties Contingency Tables

43 RLD Statistical Properties Contingency Tables


Download ppt "Relative Linkage Disequilibrium: An intersection between evolution, algebraic statistics, text mining and contingency tables Ron S. Kenett KPA Ltd., Raanana,"

Similar presentations


Ads by Google