CMPT 884, SFU, Martin Ester
Special Topics in Database Systems
Martin Ester
Simon Fraser University, School of Computing Science
CMPT 884, Spring 2009

Introduction
  Knowledge discovery in databases (KDD) is the process of (semi-)automatic extraction of knowledge from databases that is valid, previously unknown, and potentially useful [Fayyad, Piatetsky-Shapiro & Smyth 96].
  Remarks
    o (semi-)automatic: in contrast to manual analysis / OLAP. Typically, some user interaction is necessary.
    o valid: in the statistical sense.
    o previously unknown: not explicit, not "common sense knowledge".
    o potentially useful: for some given application.

Introduction
  Statistics [Hand, Mannila & Smyth 2001]
    o representation of uncertainty
    o model-based inferences
    o focus on numeric data
  Machine Learning [Mitchell 1997]
    o knowledge representation
    o search strategies
    o focus on symbolic data
  Database Systems [Han & Kamber 2000]
    o data management
    o integration of data mining with DBS
    o scalability for large databases

Introduction
  KDD Process [Fayyad, Piatetsky-Shapiro & Smyth 1996]
    (figure labels: Database, Focussing, Preprocessing, Transformation, Data Mining, Evaluation, Pattern, Knowledge)
  KDD Process [Han & Kamber 2000]
    (figure labels: Databases, Data Cleaning, Data Integration, Data Warehouse, Selection, Task-relevant Data, Data Mining, Pattern Evaluation, Knowledge)

Data Mining
  Definition [Fayyad, Piatetsky-Shapiro, Smyth 1996]
    Data mining is the application of efficient algorithms to determine the patterns contained in some database.
  Data Mining Tasks
    o clustering
    o classification
    o association rules (figure example: A and B -> C)
    o generalisation
    o other tasks: regression, outlier detection, ...

Trends in KDD Research: KDD 2000 Conference
  o New Data Mining Algorithms
  o Efficiency and Scalability of Data Mining Algorithms
  o Interactive Data Exploration
  o Visualization
  o Constraints and Evaluation in the KDD Process

Trends in KDD Research: KDD 2002 Conference
  o Statistical Methods
  o Frequent Patterns
  o Streams and Time Series
  o Visualization
  o Web Search and Navigation
  o Text and Web Page Classification
  o Intrusion and Privacy
  o Applications

Trends in KDD Research: KDD 2004 Conference
  o Frequent Patterns / Association Rules
  o Clustering
  o Mining Spatio-Temporal Data
  o Mining Data Streams
  o Dimensionality Reduction
  o Privacy-Preserving Data Mining
  o Mining Biological Data
  o Applications (Web, biological data, security, ...)

Trends in KDD Research: KDD 2006 Conference
  o Clustering
  o Classification / supervised ML
  o Privacy
  o Web / Graph Mining
  o Web / Text Mining
  o Frequent Pattern Mining
  o Structured Data

Trends in KDD Research: KDD 2008 Conference
  o Text Mining
  o Data Integration
  o Social Networks
  o Graph Mining
  o Distance Functions and Metric Learning
  o Active and Semi-supervised Learning
  o Pattern Mining
  o Collaborative Filtering

Trends in KDD Research: Some Hot Topics
  Social networks
    o THE hot topic of KDD 08
    o topic of the only panel
  Graph mining
  Text mining and information extraction / integration
  Collaborative filtering
    o more generally, recommender systems
    o $1M Netflix prize

Overview of this Course
  Prerequisites
    o Foundations of database systems and statistics
    o Introductory graduate data mining course or equivalent
  Objectives
    o Introduction to some hot topics of data mining research
    o Training in research methodology
    o Presentation skills
    => start thesis work after this class!

Overview of this Course: Topics
  Graph mining
    o social network analysis and analysis of biological networks as driving applications
  Recommender systems
    o in particular trust-based recommendation
  Information extraction and integration
    o integration with existing databases

Overview of this Course: Format
  Tutorial surveys by instructor
  Written research paper reviews by students
  Research paper presentations by students
    o discussions in class
  Course research projects by students
    o on a topic of their choice

Overview of this Course: Tentative Grading Scheme
  Paper review (20%)
  Paper presentation (20%)
  Course project report (40%)
    o two steps: project proposal, final project report
  Course project presentation (20%)
  Marking criteria: originality, technical quality, presentation

Overview of this Course: Types of Course Projects
  Literature survey
    o summarize the state of the art and identify open research problems
  New problem
    o introduce and analyze a new problem
  New algorithm for known problem
    o implement and evaluate algorithm
  Improvement of existing algorithm
    o implement and compare algorithm
  Comparison of existing algorithms on a new, interesting dataset
    o identify criteria for choice of algorithms / open research problems

Graph Mining: Motivating Applications
  Social network analysis
    o What communities exist?
    o How does information about a new product spread?
    o What customers should be targeted to maximize the profit of a marketing campaign?
  Analysis of biological networks
    o What are the functional modules of an organism?
    o How do biological networks evolve in the course of time?
    o What protein should be targeted to inhibit some virulent bacteria?

Graph Mining: Methods
  Frequent subgraph mining
    o frequent pattern mining approach
  Graph clustering
    o e.g., normalized cut, i.e., minimize the number of edges between graph components / clusters, normalized by the total degree of each cluster
  Graph generative models
    o probabilistic models that generate graphs similar to real graphs / networks
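
The normalized cut criterion can be made concrete with a small sketch. It assumes an undirected, unweighted graph given as an edge list and the common definition Ncut(A, B) = cut(A, B) / vol(A) + cut(A, B) / vol(B), where vol is the sum of node degrees in a part. The function name `normalized_cut` and the toy graph (two triangles joined by one edge) are illustrative assumptions, not taken from the slides.

```python
def normalized_cut(edges, part_a):
    """Normalized cut of the bipartition (part_a, rest) of an undirected graph.

    edges  : iterable of (u, v) pairs, each undirected edge listed once
    part_a : nodes on one side of the partition
    """
    part_a = set(part_a)
    cut = 0      # edges with endpoints on different sides
    vol_a = 0    # sum of degrees of nodes in part_a
    vol_b = 0    # sum of degrees of the remaining nodes
    for u, v in edges:
        # every edge adds 1 to the degree of each endpoint
        for node in (u, v):
            if node in part_a:
                vol_a += 1
            else:
                vol_b += 1
        if (u in part_a) != (v in part_a):
            cut += 1
    if vol_a == 0 or vol_b == 0:
        raise ValueError("both sides of the partition need at least one edge endpoint")
    return cut / vol_a + cut / vol_b


# Hypothetical toy graph: triangles {1, 2, 3} and {4, 5, 6} joined by edge (3, 4).
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (4, 6)]

good = normalized_cut(edges, {1, 2, 3})  # splits at the bridge edge
bad = normalized_cut(edges, {1, 2})      # cuts through a triangle
print(good, bad)
```

On this toy graph the split at the bridge edge scores 2/7 and the split through a triangle scores 0.7, so the criterion prefers the partition that separates the two dense groups. Finding the minimizing partition in general is the hard part and is not attempted here.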

Graph Mining: Challenges
  Complexity of graph algorithms
    o Many graph mining problems are NP-hard.
    o Real graphs tend to be extremely large.
    => need efficient algorithms
  Attribute data
    o Many graphs have attributes associated with the nodes.
    o Transformation into a weighted graph loses a lot of information.
    => need new models / algorithms considering relationship and attribute data

Recommender Systems: Motivating Applications
  Motivation
    o The internet provides a flood of information on all kinds of items.
    o There is a great need for personalized recommendations.
    o The internet also provides a wealth of item ratings / reviews.
  Typical applications
    o Movie recommendation
    o Product recommendation
    o Keyword recommendation

Recommender Systems: Methods
  Collaborative filtering
    o Uses only a database of user-item ratings.
    o Recommendation is based on ratings by users with similar rating patterns.
  Content-based recommender systems
    o Use information about the content of items and / or the properties of users.
    o Recommend items whose content is similar to items liked by the user.
  Trust-based recommender systems
    o Assume a social network / trust network. Trust can be defined explicitly or implicitly.
    o Recommendation is based on ratings by trusted neighbors.
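
A minimal sketch of collaborative filtering as described above: a rating is predicted from the ratings of users with similar rating patterns. The sketch assumes cosine similarity over co-rated items and a similarity-weighted average, which is one common memory-based variant; the slides do not prescribe a particular similarity measure. The names `cosine_similarity` and `predict_rating` and the rating data are hypothetical.

```python
from math import sqrt


def cosine_similarity(ratings_u, ratings_v):
    """Cosine similarity of two users, restricted to items both have rated."""
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return 0.0
    dot = sum(ratings_u[i] * ratings_v[i] for i in common)
    norm_u = sqrt(sum(ratings_u[i] ** 2 for i in common))
    norm_v = sqrt(sum(ratings_v[i] ** 2 for i in common))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def predict_rating(ratings, user, item):
    """Similarity-weighted average of other users' ratings for `item`.

    Returns None when no other user has rated the item (a cold-start case).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        weighted_sum += sim * other_ratings[item]
        weight_total += abs(sim)
    if weight_total == 0:
        return None
    return weighted_sum / weight_total


# Hypothetical user -> {item: rating} database on a 1..5 scale.
ratings = {
    "ann": {"m1": 5, "m2": 1},
    "bob": {"m1": 5, "m2": 1, "m3": 4},
    "cat": {"m1": 1, "m2": 5, "m3": 2},
}

print(predict_rating(ratings, "ann", "m3"))
```

Here ann rates like bob and unlike cat, so the prediction for m3 (about 3.44) lies closer to bob's rating of 4 than to cat's rating of 2. A trust-based variant would replace the similarity weights with trust values taken from the network.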

Recommender Systems: Challenges
  High dimensionality and sparsity of data
    o The overwhelming majority (> 99%) of user-item ratings are unknown.
    o Recommendation is especially hard for cold-start users and controversial items.
    => dimensionality reduction, model-based methods, trust-based approach
  Fraud
    o Memory-based collaborative filtering can easily be manipulated by adding fraudulent ratings.
    => trust-based approach is more robust to fraud
  Privacy issues with trust network data
    o Only very few trust networks are in the public domain.

Information Extraction and Integration: Motivating Applications
  Importance of unstructured text data
    o The overwhelming majority (>= 80%) of human-generated information is not in structured form, but in unstructured text.
  Biomedical literature
    o Contains a wealth of valuable information that cannot be processed / searched automatically.
    o Extraction of entities and relationships such as proteins and their localizations.
  Online product reviews
    o A lot of product "reviews" are available online in community databases or blogs.
    o Companies want to know what customers think of their products.

Information Extraction and Integration: Methods
  Basic NLP methods
    o Part-of-speech tagging
    o Lexica, ontologies, ...
  Machine learning methods
    o Typically, supervised classification.
    o CRFs and similar methods are state of the art.
  Bootstrapping approach
    o Using a small labeled training dataset, find textual extraction patterns.
    o Using these patterns, extract further entities / relationships and continue.
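
The bootstrapping loop above can be sketched as follows. This is a deliberately naive illustration under strong assumptions: a pattern is just the literal text between the two entities of a known pair, entities are single words, and there is no scoring or filtering of patterns, which a practical system would likely need. The function name `bootstrap`, the sentences, and the protein names are hypothetical.

```python
import re


def bootstrap(sentences, seed_pairs, rounds=2):
    """Alternate between learning patterns from known pairs and extracting new pairs."""
    pairs = set(seed_pairs)
    patterns = set()
    for _ in range(rounds):
        # Step 1: a sentence mentioning a known pair yields the text between
        # the two entities as an extraction pattern.
        for x, y in pairs:
            for sentence in sentences:
                match = re.search(re.escape(x) + r"(.+?)" + re.escape(y), sentence)
                if match:
                    patterns.add(match.group(1))
        # Step 2: every pattern is applied to all sentences to extract
        # further (entity, entity) pairs.
        for middle in patterns:
            regex = re.compile(r"(\w+)" + re.escape(middle) + r"(\w+)")
            for sentence in sentences:
                for x, y in regex.findall(sentence):
                    pairs.add((x, y))
    return pairs, patterns


# Hypothetical corpus and a single labeled seed (protein, localization).
sentences = [
    "ProtA is localized in the nucleus.",
    "ProtB is localized in the cytoplasm.",
    "ProtB is found in the cytoplasm.",
    "ProtC is found in the membrane.",
]
seeds = {("ProtA", "nucleus")}

pairs, patterns = bootstrap(sentences, seeds, rounds=2)
print(sorted(pairs))
print(sorted(patterns))
```

With one round, only the pair for ProtB is added, because the seed supports a single pattern. The second round learns a new pattern from the ProtB pair and uses it to reach ProtC, which is the "continue" step of the approach.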

Information Extraction and Integration: Challenges
  Text data is hard to understand
    o Many NLP problems are still essentially unsolved.
    => relatively simple NLP methods are often sufficient for information extraction
  Portability across domains
    o Extraction methods need to be portable from one domain to another.
    o The knowledge engineering approach (a domain expert defines rules) is labor-intensive and expensive.
    => machine learning methods
  Entity mentions need to be resolved
    o Information extraction produces strings referencing an entity of a given type.
    o Without mapping to known real-world entities, extracted information is of limited usefulness.
    => need to integrate extracted information with existing databases

References
  Graph mining
    - X. Yan & Karsten Borgwardt, "Graph Mining and Graph Kernels", Tutorial KDD 08
    - Jure Leskovec & Christos Faloutsos, "Mining Large Graphs: Models, Diffusion and Case Studies", Tutorial ECML/PKDD 2007
  Recommender systems
    - Joseph Konstan, "Introduction to Recommender Systems", Tutorial SIGMOD 2008
  Information extraction and integration
    - Eugene Agichtein & Sunita Sarawagi, "Scalable Information Extraction and Integration", Tutorial KDD 06
    - AnHai Doan & Raghu Ramakrishnan & Shiv Vaithyanathan, "Managing Information Extraction", Tutorial SIGMOD 2006