CMPT 884, SFU, Martin Ester: Special Topics in Database Systems. Simon Fraser University, School of Computing Science, CMPT 884, Spring 2009.

Presentation transcript:

Special Topics in Database Systems
Martin Ester
Simon Fraser University, School of Computing Science
CMPT 884, Spring 2009

Introduction [Fayyad, Piatetsky-Shapiro & Smyth 96]
Knowledge discovery in databases (KDD) is the process of (semi-)automatic extraction of knowledge from databases which is valid, previously unknown, and potentially useful.
Remarks
- (semi-)automatic: distinction from manual analysis / OLAP. Typically, some user interaction is necessary.
- valid: in the statistical sense.
- previously unknown: not explicit, no "common sense knowledge".
- potentially useful: for some given application.

Introduction
Statistics [Hand, Mannila & Smyth 2001]
- representation of uncertainty
- model-based inferences
- focus on numeric data
Machine Learning [Mitchell 1997]
- knowledge representation
- search strategies
- focus on symbolic data
Database Systems [Han & Kamber 2000]
- data management
- integration of data mining with DBS
- scalability for large databases

Introduction
[Figure: two KDD process diagrams]
KDD Process [Fayyad, Piatetsky-Shapiro & Smyth 1996]
Database → Focussing → Preprocessing → Transformation → Data Mining → Pattern → Evaluation → Knowledge
KDD Process [Han & Kamber 2000]
Databases → Data Cleaning / Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation → Knowledge

Data Mining
Definition [Fayyad, Piatetsky-Shapiro, Smyth 1996]
Data Mining is the application of efficient algorithms to determine the patterns contained in some database.
Data-Mining Tasks
- clustering
- classification
- association rules (e.g., A and B → C)
- generalisation
- other tasks: regression, outlier detection, ...

Trends in KDD Research
KDD 2000 Conference
- New Data Mining Algorithms
- Efficiency and Scalability of Data Mining Algorithms
- Interactive Data Exploration
- Visualization
- Constraints and Evaluation in the KDD Process

Trends in KDD Research
KDD 2002 Conference
- Statistical Methods
- Frequent Patterns
- Streams and Time Series
- Visualization
- Web Search and Navigation
- Text and Web Page Classification
- Intrusion and Privacy
- Applications

Trends in KDD Research
KDD 2004 Conference
- Frequent Patterns / Association Rules
- Clustering
- Mining Spatio-Temporal Data
- Mining Data Streams
- Dimensionality Reduction
- Privacy-Preserving Data Mining
- Mining Biological Data
- Applications (Web, biological data, security, ...)

Trends in KDD Research
KDD 2006 Conference
- Clustering
- Classification / supervised ML
- Privacy
- Web / Graph Mining
- Web / Text Mining
- Frequent Pattern Mining
- Structured Data

Trends in KDD Research
KDD 2008 Conference
- Text Mining
- Data Integration
- Social Networks
- Graph Mining
- Distance Functions and Metric Learning
- Active and Semi-supervised Learning
- Pattern Mining
- Collaborative Filtering

Trends in KDD Research
Some Hot Topics
- Social networks: THE hot topic of KDD 08 → topic of the only panel
- Graph mining
- Text mining and information extraction / integration
- Collaborative filtering and, more generally, recommender systems → $1M Netflix prize

Overview of this Course
Prerequisites
- Foundations of database systems and statistics
- Introductory graduate data mining course or equivalent
Objectives
- Introduction to some hot topics of data mining research
- Training in research methodology
- Presentation skills
Start thesis work after this class!

Overview of this Course
Topics
- Graph mining: social network analysis and analysis of biological networks as driving applications
- Recommender systems: in particular, trust-based recommendation
- Information extraction and integration: integration with existing databases

Overview of this Course
Format
- Tutorial surveys by instructor
- Written research paper reviews by students
- Research paper presentations by students, with discussions in class
- Course research projects by students on a topic of their choice

Overview of this Course
Tentative Grading Scheme
- Paper review (20%)
- Paper presentation (20%)
- Course project report (40%); two steps: project proposal, final project report
- Course project presentation (20%)
→ marking criteria: originality, technical quality, presentation

Overview of this Course
Types of Course Projects
- Literature survey: summarize the state of the art and identify open research problems
- New problem: introduce and analyze a new problem
- New algorithm for known problem: implement and evaluate algorithm
- Improvement of existing algorithm: implement and compare algorithm
- Comparison of existing algorithms on a new, interesting dataset: identify criteria for choice of algorithms / open research problems

Graph Mining
Motivating Applications
Social network analysis
- What communities exist?
- How does information about a new product spread?
- Which customers should be targeted to maximize the profit of a marketing campaign?
Analysis of biological networks
- What are the functional modules of an organism?
- How do biological networks evolve over time?
- Which protein should be targeted to inhibit some virulent bacteria?

Graph Mining
Methods
- Frequent subgraph mining: frequent pattern mining approach
- Graph clustering: e.g., normalized cut, i.e., minimize the number of edges between graph components / clusters, relative to the size of the components
- Graph generative models: probabilistic models that generate graphs similar to real graphs / networks
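To make the normalized cut criterion concrete, the following is a minimal sketch for an undirected, unweighted graph and a two-way partition. It uses the common definition Ncut(A, B) = cut(A, B)/vol(A) + cut(A, B)/vol(B), where vol is the sum of node degrees; the function name, the edge-list representation, and the toy graph are assumptions made for illustration, not part of the slides.

```python
from collections import defaultdict


def normalized_cut(edges, part_a):
    """Ncut(A, B) = cut/vol(A) + cut/vol(B) for an undirected, unweighted graph.

    edges  : iterable of (u, v) pairs, each undirected edge listed once
    part_a : set of nodes forming side A; all other nodes form side B
    """
    part_a = set(part_a)
    degree = defaultdict(int)
    cut = 0
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        # An edge is cut when its endpoints lie on different sides.
        if (u in part_a) != (v in part_a):
            cut += 1
    vol_a = sum(d for node, d in degree.items() if node in part_a)
    vol_b = sum(d for node, d in degree.items() if node not in part_a)
    if vol_a == 0 or vol_b == 0:
        raise ValueError("each side of the partition must touch at least one edge")
    return cut / vol_a + cut / vol_b


# Toy graph (illustrative): two triangles joined by the single edge (3, 4).
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
ncut_natural = normalized_cut(edges, {1, 2, 3})   # splits along the bridge
ncut_mixed = normalized_cut(edges, {1, 2, 5})     # mixes the two triangles
```

The partition along the bridge edge gets the lower score, which is the behaviour the clustering criterion rewards. Finding the minimizing partition is the hard part and is not attempted here.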

Graph Mining
Challenges
Complexity of graph algorithms
- Many graph mining problems are NP-hard.
- Real graphs tend to be extremely large.
→ need efficient algorithms
Attribute data
- Many graphs have attributes associated with the nodes.
- Transformation into a weighted graph loses a lot of information.
→ need new models / algorithms considering relationship and attribute data

Recommender Systems
Motivating Applications
Motivation
- The internet provides a flood of information on all kinds of items.
- There is a great need for personalized recommendations.
- The internet also provides a wealth of item ratings / reviews.
Typical applications
- Movie recommendation
- Product recommendation
- Keyword recommendation

Recommender Systems
Methods
Collaborative filtering
- Uses only a database of user-item ratings.
- Recommendation is based on ratings by users with similar rating patterns.
Content-based recommender systems
- Use information about the content of items and / or the properties of users.
- Recommend items whose content is similar to items liked by the user.
Trust-based recommender systems
- Assume a social network / trust network. Trust can be defined explicitly or implicitly.
- Recommendation is based on ratings by trusted neighbors.
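As an illustration of memory-based collaborative filtering, the sketch below predicts a rating from the ratings of users with similar rating patterns. It assumes a mean-centered (Pearson-style) similarity over co-rated items and a similarity-weighted average of neighbors' deviations from their own means; this is one common variant, not necessarily the one covered in the course. User names, item names, and ratings are made up.

```python
from math import sqrt


def mean(user_ratings):
    """Average rating given by one user."""
    return sum(user_ratings.values()) / len(user_ratings)


def similarity(r_u, r_v):
    """Mean-centered cosine similarity over the items both users rated."""
    common = set(r_u) & set(r_v)
    if not common:
        return 0.0
    mu_u, mu_v = mean(r_u), mean(r_v)
    num = sum((r_u[i] - mu_u) * (r_v[i] - mu_v) for i in common)
    den = sqrt(sum((r_u[i] - mu_u) ** 2 for i in common)) * sqrt(
        sum((r_v[i] - mu_v) ** 2 for i in common)
    )
    return num / den if den else 0.0


def predict(ratings, user, item):
    """Predict user's rating of item from all other users who rated it."""
    mu_u = mean(ratings[user])
    num = den = 0.0
    for other, r_other in ratings.items():
        if other == user or item not in r_other:
            continue
        w = similarity(ratings[user], r_other)
        # Weight each neighbor's deviation from their own mean by similarity.
        num += w * (r_other[item] - mean(r_other))
        den += abs(w)
    # Fall back to the user's own mean when no informative neighbor exists.
    return mu_u + num / den if den else mu_u


# Hypothetical user-item rating database (1 to 5 stars).
ratings = {
    "alice": {"m1": 5, "m2": 4, "m3": 1},
    "bob": {"m1": 5, "m2": 5, "m3": 1, "m4": 5},
    "carol": {"m1": 1, "m2": 2, "m3": 5, "m4": 1},
}
alice_m4 = predict(ratings, "alice", "m4")
```

Here bob rates like alice and liked m4, while carol rates in the opposite way and disliked it, so both neighbors push the prediction for alice above her own average. A trust-based variant would replace the similarity weights with trust values taken from the trust network.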

Recommender Systems
Challenges
High dimensionality and sparsity of data
- The overwhelming majority (> 99%) of user-item ratings are unknown.
- Recommendation is especially hard for cold-start users and controversial items.
→ dimensionality reduction, model-based methods, trust-based approach
Fraud
- Memory-based collaborative filtering can be easily manipulated by adding fraudulent ratings.
→ trust-based approach is more robust to fraud
Privacy issues with trust network data
- Only very few trust networks are in the public domain.

Information Extraction and Integration
Motivating Applications
Importance of unstructured text data
- The overwhelming majority (>= 80%) of human-generated information is not in structured form, but in unstructured text.
Biomedical literature
- Contains a wealth of valuable information that cannot be processed / searched automatically.
- Extraction of entities and relationships, such as proteins and their localizations.
Online product reviews
- A lot of product "reviews" are available online in community databases or blogs.
- Companies want to know what customers think of their products.

Information Extraction and Integration
Methods
Basic NLP methods
- Part-of-speech tagging
- Lexica, ontologies, ...
Machine learning methods
- Typically, supervised classification.
- CRFs and similar methods are state-of-the-art.
Bootstrapping approach
- Using a small labeled training dataset, find textual extraction patterns.
- Using these patterns, extract further entities / relationships and continue.
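The bootstrapping loop can be sketched as follows. This is a deliberately naive illustration: a "pattern" is simply the literal text between the two entities of a known pair, and every match of a pattern is accepted. The protein names, sentences, and function names are hypothetical; real systems score patterns and extractions to limit drift.

```python
import re


def learn_patterns(sentences, pairs):
    """Collect the literal text found between the two entities of known pairs."""
    patterns = set()
    for sentence in sentences:
        for left, right in pairs:
            match = re.search(re.escape(left) + r"(.+?)" + re.escape(right), sentence)
            if match:
                patterns.add(match.group(1))
    return patterns


def apply_patterns(sentences, patterns):
    """Extract (word, word) pairs that surround any learned pattern."""
    found = set()
    for sentence in sentences:
        for pattern in patterns:
            match = re.search(r"(\w+)" + re.escape(pattern) + r"(\w+)", sentence)
            if match:
                found.add((match.group(1), match.group(2)))
    return found


def bootstrap(sentences, seeds, rounds=2):
    """Alternate between learning patterns and extracting new pairs."""
    pairs = set(seeds)
    for _ in range(rounds):
        patterns = learn_patterns(sentences, pairs)
        new_pairs = apply_patterns(sentences, patterns) - pairs
        if not new_pairs:
            break  # nothing new was extracted, so further rounds cannot help
        pairs |= new_pairs
    return pairs


# Hypothetical corpus of (protein, localization) statements.
sentences = [
    "ProtA is localized in the nucleus.",
    "ProtB is localized in the membrane.",
    "ProtB accumulates in the membrane.",
    "ProtC accumulates in the cytoplasm.",
]
seeds = {("ProtA", "nucleus")}
extracted = bootstrap(sentences, seeds)
```

In the first round the seed yields the pattern " is localized in the ", which extracts the ProtB pair; in the second round that pair yields " accumulates in the ", which extracts the ProtC pair.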

Information Extraction and Integration
Challenges
Text data is hard to understand
- Many of the NLP problems are still essentially unsolved.
→ relatively simple NLP methods are often sufficient for information extraction
Portability across domains
- Extraction methods need to be portable from one domain to another.
- Knowledge engineering approach (domain expert defines rules) is labor-intensive and expensive.
→ machine learning methods
Entity mentions need to be resolved
- Information extraction produces strings referencing an entity of a given type.
- Without mapping to known real-world entities, extracted information is of limited usefulness.
→ need to integrate extracted information with existing databases
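The last challenge, mapping extracted mention strings to entities in an existing database, can be illustrated with a small sketch. It assumes an alias table and falls back to approximate string matching via Python's standard `difflib`; the entity IDs, the alias table, and the 0.85 cutoff are invented for the example, and practical entity resolution typically also uses context and entity type.

```python
import difflib


def normalize(text):
    """Lowercase, turn hyphens into spaces, and collapse whitespace."""
    return " ".join(text.lower().replace("-", " ").split())


def resolve(mention, alias_to_id, cutoff=0.85):
    """Map a mention string to a known entity ID, or None if nothing is close."""
    key = normalize(mention)
    if key in alias_to_id:
        return alias_to_id[key]  # exact alias match after normalization
    # Approximate match against all known aliases (standard-library difflib).
    close = difflib.get_close_matches(key, list(alias_to_id), n=1, cutoff=cutoff)
    return alias_to_id[close[0]] if close else None


# Hypothetical alias table of an existing database (IDs are made up).
alias_to_id = {
    "tumor protein p53": "ENT-1",
    "tp53": "ENT-1",
    "p53": "ENT-1",
}

exact = resolve("TP53", alias_to_id)
fuzzy = resolve("Tumour protein p53", alias_to_id)
unknown = resolve("kinase XYZ", alias_to_id)
```

Mentions that resolve to no known entity are returned as `None` rather than forced onto the nearest alias, so they can be reviewed or added to the database separately.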

References
Graph mining
- X. Yan & Karsten Borgwardt, "Graph Mining and Graph Kernels", Tutorial KDD 08
- Jure Leskovec & Christos Faloutsos, "Mining Large Graphs: Models, Diffusion and Case Studies", Tutorial ECML/PKDD 2007
Recommender systems
- Joseph Konstan, "Introduction to Recommender Systems", Tutorial SIGMOD 2008
Information extraction and integration
- Eugene Agichtein & Sunita Sarawagi, "Scalable Information Extraction and Integration", Tutorial KDD 06
- AnHai Doan & Raghu Ramakrishnan & Shiv Vaithyanathan, "Managing Information Extraction", Tutorial SIGMOD 2006