Berendt: Advanced databases, winter term 2007/08
Advanced databases – Inferring implicit/new knowledge from data(bases): Tying it all together (a start)
Bettina Berendt
Katholieke Universiteit Leuven, Department of Computer Science
Last update: 6 December 2007
Goal 1 for today: wrap up yesterday's lecture and discussion + prepare you for the next assignment
Goal 2 for today: identify "missing links" & point to solution approaches (on the board)
Agenda
* Naïve Bayes [remaining from yesterday]
* Changing representation: LSI [remaining from yesterday]
* Ont.+KDD: Apriori and taxonomies
* KDD+DB: Constrained pattern mining – ex. WUM
* KDD+DB: Inductive databases (very brief)
* KDD+Ont.: Induction and Semantic Web (very brief)
Agenda item: Ont.+KDD: Apriori and taxonomies
Mining association rules
* Apriori: (slides from D. Delic)
* Mining generalized association rules: (Karlsruhe slides)
Main interestingness measures of association rules
* Support of a rule A → B = no. of instances with A and B / no. of all instances
* Confidence of a rule A → B = no. of instances with A and B / no. of instances with A = support(A & B) / support(A)
* Lift of a rule A → B = support(A & B) / [support(A) * support(B)]
  * What does this measure, and in what numerical interval can it lie?
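The three measures above can be computed directly from a list of transactions. The following is a minimal Python sketch, not taken from the lecture: the function names and the five toy transactions are invented for illustration. Lift compares the observed joint support with the value expected if A and B were independent, so it equals 1 under independence and cannot be negative.

```python
from typing import FrozenSet, Iterable, List


def support(transactions: List[FrozenSet[str]], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(transactions, body, head) -> float:
    """support(A & B) / support(A) for the rule body -> head."""
    joint = frozenset(body) | frozenset(head)
    return support(transactions, joint) / support(transactions, body)


def lift(transactions, body, head) -> float:
    """support(A & B) / (support(A) * support(B)) for the rule body -> head."""
    joint = frozenset(body) | frozenset(head)
    return support(transactions, joint) / (
        support(transactions, body) * support(transactions, head)
    )


# Hypothetical toy data (assumption, for illustration only).
transactions = [
    frozenset(t)
    for t in (
        {"jacket", "umbrellas"},
        {"jacket", "umbrellas", "boots"},
        {"jacket"},
        {"boots"},
        {"umbrellas", "boots"},
    )
]

rule_body, rule_head = {"jacket"}, {"umbrellas"}
print(support(transactions, rule_body | rule_head))   # 2 of 5 transactions
print(confidence(transactions, rule_body, rule_head)) # 2 of the 3 jacket transactions
print(lift(transactions, rule_body, rule_head))       # (2/5) / ((3/5) * (3/5))
```

In this toy data the lift is slightly above 1, i.e. jackets and umbrellas co-occur a little more often than independence would predict.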
Interestingness measures
Interestingness as a constraint
* So we're not interested in "show me all patterns"
* But in "show me all patterns that are interesting", i.e. that have properties X
* → constraints!
Examples from MINERULE
    MINE RULE exemple as
    SELECT DISTINCT 1..n Item as BODY, 1..1 Item as HEAD, SUPPORT, CONFIDENCE
    WHERE HEAD.Item = "umbrellas"   // also other fields, e.g. Date
    FROM Purchase
    GROUP BY Tid
    HAVING COUNT(*) < 6
    EXTRACTING RULES WITH SUPPORT: 0.06, CONFIDENCE: 0.9
E.g., jacket, flight_Dublin → umbrellas (0.08, 0.93)
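To make the role of the constraints concrete, here is a hedged Python sketch that mimics the MINE RULE query by brute force. It is not the MINERULE implementation: the function `mine_rules`, its parameters, the cap on body size, and the toy purchases are all assumptions, and support is assumed to be measured over the groups that survive the HAVING filter.

```python
from itertools import combinations


def mine_rules(purchases, head_item, group_limit=6,
               min_support=0.06, min_confidence=0.9, max_body_size=2):
    """Brute-force rules body -> head_item under MINE RULE style constraints."""
    # HAVING COUNT(*) < group_limit: keep only small transactions (groups).
    groups = [frozenset(items) for items in purchases.values()
              if len(items) < group_limit]
    if not groups:
        return []
    # WHERE HEAD.Item = head_item: the head is fixed, bodies use other items.
    body_items = sorted(set().union(*groups) - {head_item})
    rules = []
    for size in range(1, max_body_size + 1):
        for body in combinations(body_items, size):
            body_set = frozenset(body)
            n_body = sum(1 for g in groups if body_set <= g)
            if n_body == 0:
                continue
            n_both = sum(1 for g in groups
                         if body_set <= g and head_item in g)
            supp = n_both / len(groups)
            conf = n_both / n_body
            # EXTRACTING RULES WITH SUPPORT: ..., CONFIDENCE: ...
            if supp >= min_support and conf >= min_confidence:
                rules.append((body, head_item, supp, conf))
    return rules


# Hypothetical purchases keyed by Tid (assumption, for illustration only).
purchases = {
    1: {"jacket", "flight_Dublin", "umbrellas"},
    2: {"jacket", "flight_Dublin", "umbrellas"},
    3: {"jacket", "boots"},
    4: {"flight_Dublin", "boots"},
    5: {"jacket", "umbrellas", "boots", "scarf", "hat", "gloves"},  # 6 items: filtered out
}

rules = mine_rules(purchases, head_item="umbrellas")
for body, head, supp, conf in rules:
    print(", ".join(body), "->", head, (supp, conf))
```

The point of the sketch is that the constraints shrink the search space before any rule is scored: the head is fixed, large groups are dropped, and only rules above both thresholds are returned.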
Agenda item: KDD+DB: Constrained pattern mining – ex. WUM
The site
Business understanding / problem definition:
* How do users search in this online catalog?
* Which search criteria are popular?
* Which are efficient?
[Berendt & Spiliopoulou, VLDB Journal 2000]
The concept hierarchies / site ontology (excerpt)
* SEITE1-...LI (1st page of a list) or SEITEn-...LI (further page); "Seite" = page
* LA ("Land" = country/state)
* SA ("Schulart" = school type)
* SU ("Suche" = search)
Sequence mining – one result pattern: successful search for a school in Germany
[Figure labels: a refinement, a repetition, a continuation]
One example pattern:
    select t
    from node a b, template a * b as t
    where a.url startswith "SEITE1-" and a.occurrence = 1
      and b.url contains "1SCHULE" and b.occurrence = 1
      and (b.support / a.support) >= 0.2
(Berendt & Spiliopoulou, VLDB J. 2000)
Example request URL: /liste.html?offset=920&zeilen=20&anzahl=1323&sprache=de&sw_kategorie=de&erscheint=&suchfeld=&suchwert=&staat=de&region=by&schultyp=
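The query above can be read as: among sessions that reach a first list page (a), how many later reach a school page (b) for the first time, and is that ratio at least 0.2? The Python sketch below evaluates this reading directly on raw sessions. It is a simplification under stated assumptions, not WUM's algorithm: WUM works on an aggregated log and reports supports per matched g-sequence, whereas this sketch counts whole sessions; the function name and the sample URLs are hypothetical.

```python
def template_confidence(sessions, a_pred, b_pred):
    """Count sessions matching `a` and `a * b`, using first visits only.

    `a` matches the first visit of a URL accepted by a_pred; `b` matches a
    first visit, any number of steps later, of a URL accepted by b_pred.
    """
    a_hits = 0
    ab_hits = 0
    for session in sessions:
        seen = set()
        a_found = False
        b_found = False
        for url in session:
            first_visit = url not in seen   # occurrence = 1
            seen.add(url)
            if not first_visit:
                continue
            if a_found and b_pred(url):     # b somewhere after a (wildcard *)
                b_found = True
                break
            if not a_found and a_pred(url):
                a_found = True
        a_hits += a_found
        ab_hits += b_found
    conf = ab_hits / a_hits if a_hits else 0.0
    return a_hits, ab_hits, conf


# Hypothetical sessions; the URL labels are invented for illustration.
sessions = [
    ["SEITE1-demoLI", "SEITEn-demoLI", "1SCHULE-demo"],  # a, then b
    ["SEITE1-demoLI", "SEITEn-demoLI"],                  # a only
    ["1SCHULE-demo", "SEITE1-demoLI"],                   # b before a: no match
    ["home"],                                            # neither
]

a_hits, ab_hits, conf = template_confidence(
    sessions,
    a_pred=lambda u: u.startswith("SEITE1-"),
    b_pred=lambda u: "1SCHULE" in u,
)
print(a_hits, ab_hits, conf >= 0.2)
```

Here three sessions contain a, one of them continues to b, so the ratio b.support / a.support is 1/3 and the pattern passes the 0.2 threshold.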
Sequences
Generalized sequences, navigation patterns, hits in WUM
Aggregated Logs: The basic internal representation in WUM
The confidence measure for generalized sequences
Templates in the query language MINT, g-sequences, and navigation patterns
Interestingness measures: Support (hits) and confidence
Aggregated Logs, queries, and query results
The basic idea of the WUM algorithm
MINT can express 3 types of constraints ("predicates")
The WUM gseqm algorithm (B predicates)
Also for higher-order structures (graphs): Ex. MolFea
Agenda item: KDD+DB: Inductive databases (very brief)
The basic idea (on the board)
Agenda item: KDD+Ont.: Induction and Semantic Web (very brief)
(One) basic idea (on the board)
Next lecture
* Naïve Bayes [remaining from yesterday]
* Changing representation: LSI [remaining from yesterday]
* Ont.+KDD: Apriori and taxonomies
* KDD+DB: Constrained pattern mining – ex. WUM
* KDD+DB: Inductive databases (very brief)
* KDD+Ont.: Induction and Semantic Web (very brief)
* Applications
References and background reading; acknowledgements
* Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. Mining association rules between sets of items in large databases. In Proc. of the ACM SIGMOD Conference on Management of Data, Washington, D.C., May.
  * (presentation from Delic, D. (2002). Mining Association Rules with Rough Sets and Large Itemsets - A Comparative Study.)
* Ramakrishnan Srikant and Rakesh Agrawal. Mining Generalized Association Rules. In Proc. of the 21st Int'l Conference on Very Large Databases, Zurich, Switzerland, September.
  * (presentation from kassel.de/lehre/ss2004/kdd/folien/4Folie_VII.3_Assoziationsregeln.pdf)
* P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the right interestingness measure for association patterns. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, July.
* MINERULE: R. Meo, G. Psaila, and S. Ceri. An extension to SQL for mining association rules. Data Mining and Knowledge Discovery, Vol. 2 (2).
* WUM and the Schulweb study: Berendt, B. & Spiliopoulou, M. (2000). Analysis of navigation behaviour in web sites integrating multiple information systems. The VLDB Journal, 9.
* MolFea (esp. the example): S. Kramer, L. De Raedt, C. Helma. Molecular Feature Mining in HIV Data. In Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM Press.
* De Raedt, L. (2002). A perspective on inductive databases. SIGKDD Explorations, Volume 4, Issue 2.