Introduction to KDD: Knowledge Discovery in Databases and Data Mining


Introduction to KDD: Knowledge Discovery in Databases and Data Mining
Carolina Ruiz, PhD
Associate Professor, Department of Computer Science
Worcester Polytechnic Institute

Data Mining: What data mining is and why we need it

Need for Data Mining
Data are being gathered and stored extremely fast. According to http://www.internetlivestats.com/one-second/, "in 1 second, each and every second," there are:
- 7,998 Tweets sent
- 839 Instagram photos uploaded
- 1,364 Tumblr posts
- 3,083 Skype calls
- 55,560 GB of Internet traffic
- 66,335 Google searches
- 73,391 YouTube videos viewed
- 2,681,874 emails sent
Computational tools and techniques are needed to help humans summarize, understand, and take advantage of accumulated data.

What is Data Mining?
“Non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” *
Example 1: Recommender Systems
- Raw data: data on library books and users' past reading history
- Data mining task: which book should be recommended next to a given user so that the user is highly likely to enjoy it?
- Flow: Raw Data -> Data Mining -> Patterns, Knowledge
* Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. "From Data Mining to Knowledge Discovery in Databases." AI Magazine, pp. 37-54. Fall 1996.

What is Data Mining? (continued)
Example 2: Resource Allocation
- Raw data: data on library books and users' past reading history
- Data mining task: given a newly acquired book, what is an accurate estimate of the number of users who will read it in the next 12 months?
- Flow: Raw Data -> Data Mining -> Patterns, Knowledge

Data Mining Process: From Data to Knowledge

Knowledge Discovery in DBs (KDD)
The process moves through the following stages:
1. Data Management (spreadsheets, databases, data warehouses): data sources -> data
2. Data Preprocessing (remove noisy/missing data, dimensionality reduction): data -> clean data
3. Data Mining (applying algorithms to find patterns): clean data -> models / patterns
4. Model/Pattern Evaluation (quantitative, qualitative): models / patterns -> “good” model
5. Model/Pattern Deployment (prediction, decision support): “good” model applied to new data
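The KDD stages above can be sketched end to end on a toy example. This is a minimal illustration, not the course's implementation: the record fields (`genre`, `readers`), the function names, and the choice of "mean readers per genre" as the mined model are all assumptions made for the sketch.

```python
from statistics import mean

def preprocess(records):
    """Data preprocessing: drop records with missing values."""
    return [r for r in records
            if r["genre"] is not None and r["readers"] is not None]

def mine(clean_records):
    """Data mining: the 'model' here is simply the mean readers per genre."""
    by_genre = {}
    for r in clean_records:
        by_genre.setdefault(r["genre"], []).append(r["readers"])
    return {genre: mean(values) for genre, values in by_genre.items()}

def evaluate(model, held_out):
    """Quantitative evaluation: mean absolute error on held-out records."""
    errors = [abs(model[r["genre"]] - r["readers"])
              for r in held_out if r["genre"] in model]
    return mean(errors)

def deploy(model, new_book):
    """Deployment: predict readers for a new book (None if genre unseen)."""
    return model.get(new_book["genre"])

# Hypothetical raw data, including two records with missing values.
raw = [
    {"genre": "history", "readers": 100},
    {"genre": "history", "readers": 120},
    {"genre": "art", "readers": 40},
    {"genre": "art", "readers": None},
    {"genre": None, "readers": 75},
    {"genre": "art", "readers": 60},
]
held_out = [
    {"genre": "history", "readers": 130},
    {"genre": "art", "readers": 45},
]

clean = preprocess(raw)
model = mine(clean)
error = evaluate(model, held_out)
prediction = deploy(model, {"genre": "history"})
print(model, error, prediction)
```

Each function corresponds to one stage, so a different mining algorithm or evaluation measure can be swapped in without touching the others.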

Data Mining is Interdisciplinary
- Databases and Information Retrieval: contributes efficient data storage, data cleansing, and data access techniques
- Data Visualization: contributes visual data displays and data exploration
- High Performance Computing: contributes techniques to efficiently handle complexity
- Application Domain: contributes domain knowledge
- Machine Learning and AI: contributes automatic induction of empirical laws from observations and experimentation
- Statistics: contributes language, framework, and techniques
- Pattern Recognition: contributes pattern extraction and pattern matching techniques

Confirmatory vs. Exploratory Data Mining
- Confirmatory (verification): given a hypothesis, verify its validity against the data
- Exploratory (discovery):
  - Predictive patterns: patterns for predicting the behavior of newly encountered entities
  - Descriptive patterns: patterns for presenting the behavior of observed entities in a human-understandable format
In some cases, patterns are both predictive and descriptive.

Data Mining Approaches and Techniques: What kinds of patterns can be mined from data?

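Data Mining Approaches
- Classification
- Regression: parametrically summarize data points
- Clustering
- Summarization
- Dependency/association analysis
- Outlier/deviation detection
[Figure: example patterns for these approaches, including rules (IF A & B THEN ...; IF A & D THEN ...; IF a & b & c THEN d & k; IF k & a THEN e), a dependency graph over A, B, C, D with weights 0.5, 0.75, 0.3, and association rules A, B -> C (80%) and C, D -> A (22%).]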
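Association rules such as "A, B -> C 80%" are usually scored by support and confidence. The sketch below assumes the percentage on the slide is a confidence value, which the slide does not state; the transactions are made-up toy data, constructed so that the first rule reaches 0.8.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transactions."""
    antecedent_support = support(transactions, antecedent)
    if antecedent_support == 0:
        raise ValueError("antecedent never occurs in the transactions")
    joint = support(transactions, set(antecedent) | set(consequent))
    return joint / antecedent_support

# Hypothetical transactions over items A, B, C, D.
transactions = [
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B", "D"},
    {"C", "D"},
    {"A", "C", "D"},
    {"B", "D"},
    {"A", "D"},
    {"A", "C", "D"},
]

conf_ab_c = confidence(transactions, {"A", "B"}, {"C"})
supp_abc = support(transactions, {"A", "B", "C"})
print(f"A, B -> C: support={supp_abc:.2f}, confidence={conf_ab_c:.2f}")
```

Five transactions contain both A and B, and four of those also contain C, which gives the rule a confidence of 4/5.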

Classification: Example
Given Data: a large collection of books. For each book: title, info, full text, and a category (art, history, geography, …).
Automatically derive from these data a
Classification Model: a collection of patterns that map books to their categories.
Classification Techniques:
- Rule Learning, e.g., IF A & B THEN history; IF A & D THEN geography; IF C & D & E THEN art
- Decision Trees
- Neural Networks
such that this model can be used for
- Prediction: given a new book, predict its category
- Description: provide insights into the data
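To make the rule-learning idea concrete, here is a deliberately simple sketch. It is not one of the rule-learning algorithms a data mining package would provide: it only learns single-keyword rules of the form IF keyword THEN category, and it keeps a rule only when the keyword appears in a single category in the training data. The books, keywords, and function names are hypothetical.

```python
from collections import Counter, defaultdict

def learn_keyword_rules(books):
    """Learn IF keyword THEN category rules from (keywords, category) pairs.

    A rule is kept only if the keyword occurs in exactly one category.
    """
    counts = defaultdict(Counter)
    for keywords, category in books:
        for keyword in keywords:
            counts[keyword][category] += 1
    return {keyword: next(iter(per_category))
            for keyword, per_category in counts.items()
            if len(per_category) == 1}

def predict(rules, keywords, default="unknown"):
    """Predict a category by majority vote among the rules that fire."""
    votes = Counter(rules[k] for k in keywords if k in rules)
    return votes.most_common(1)[0][0] if votes else default

# Hypothetical training data: keywords extracted from each book's text.
training_books = [
    ({"war", "empire", "century"}, "history"),
    ({"empire", "revolution", "century"}, "history"),
    ({"river", "mountain", "map"}, "geography"),
    ({"map", "climate", "century"}, "geography"),
    ({"painting", "sculpture"}, "art"),
    ({"painting", "museum", "century"}, "art"),
]

rules = learn_keyword_rules(training_books)
print(predict(rules, {"empire", "century", "treaty"}))
print(predict(rules, {"map", "river", "painting"}))
```

The learned rules serve both purposes named on the slide: they predict the category of a new book, and they can be read directly as a description of the data (for example, "century" yields no rule because it appears in every category).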

Regression: Example
Given Data: a large collection of books. For each book: title, info, full text, and the number of users that accessed the book in the past 12 months (e.g., 134, 275, 531, 73, 97, 321, 115, …).
Automatically derive from these data a
Regression Model: a collection of patterns that map books to their expected number of readers.
Regression Techniques:
- Linear Regression
- Non-linear Regression
- Neural Networks
such that this model can be used for
- Prediction: given a new book, predict the expected number of readers in the next 12 months (e.g., 102)
- Description: provide insights into the data
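As a minimal illustration of linear regression, the sketch below fits a line by ordinary least squares with a single input feature. The feature (readers of the author's previous book) and all numbers are assumptions chosen for the example; real book data would need many more features and would not lie exactly on a line.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict_readers(slope, intercept, x):
    """Apply the fitted line to a new book's feature value."""
    return slope * x + intercept

# Hypothetical data: x = readers of the author's previous book,
# y = readers of this book in the past 12 months.
previous_readers = [10, 20, 30, 40]
readers = [25, 45, 65, 85]

slope, intercept = fit_line(previous_readers, readers)
expected = predict_readers(slope, intercept, 50)
print(slope, intercept, expected)
```

The fitted slope and intercept also act as a description of the data: the slope states how many additional readers to expect per additional reader of the previous book.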

Clustering: Example
Given Data: a large collection of books. For each book: title, info, full text, …
Automatically derive from these data
A set of clusters that group books by similarity.
Clustering Techniques:
- K-means
- Hierarchical Clustering
- Gaussian Mixtures
such that these clusters can be used for
- Description: provide insights into the data
Useful, for example, to recommend books to users or to organize books on (virtual) library shelves.
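K-means, one of the clustering techniques listed above, can be sketched in a few lines. This is a bare-bones version for illustration: the two numeric features per book and the initial centroids are hypothetical, and the initial centroids are passed in explicitly so the run is deterministic (practical implementations usually pick them at random or with a seeding heuristic).

```python
import math

def kmeans(points, initial_centroids, max_iters=100):
    """Plain k-means: alternate assignment and centroid update steps."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return labels, centroids

# Hypothetical 2-D feature vectors, one per book.
books = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
         (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

labels, centroids = kmeans(books, initial_centroids=[books[0], books[3]])
print(labels, centroids)
```

Books that share a label fall in the same cluster, which is the grouping one could use to shelve similar books together or to recommend a book from the cluster a user already reads.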

Data Mining Applications

Sample Data Mining Applications I
- Identifying important groups of microorganisms in the human body
  Knights, D., Costello, E. K., and Knight, R. “Supervised classification of human microbiota.” FEMS Microbiology Reviews, Volume 35, Issue 2, 1 March 2011, pp. 343–359.
- Classifying galaxies in the universe
  Fowler, L., Schawinski, K., & Brandt, B.-E. (2017). Galaxy Classification using Machine Learning. Paper presented at the American Astronomical Society Meeting Abstracts.
  Abstract: We present our current research into the use of machine learning to classify galaxy imaging data with various convolutional neural network configurations in TensorFlow. We are investigating how five-band Sloan Digital Sky Survey imaging data can be used to train on physical properties such as redshift, star formation rate, mass and morphology. We also investigate the performance of artificially redshifted images in recovering physical properties as image quality degrades.

Sample Data Mining Applications II
- Email spam filtering
  Blanzieri, E. and Bryl, A. “A survey of learning-based techniques of email spam filtering.” Artificial Intelligence Review, March 2008, Vol. 29, Issue 1, pp. 63–92.
- Document sentiment analysis
  Liu, B. and Zhang, L. “A Survey of Opinion Mining and Sentiment Analysis.” In: Aggarwal, C. and Zhai, C. (eds), Mining Text Data. Springer, Boston, MA. 2012.

Sample Data Mining Applications III
- Audio and voice processing
- Image and video processing (https://www.classaction.org/blog/facebook-sued-over-face-recognition-feature)
- Personal assistants (Bgr.com/tag/siri)
- Recommender systems

Sample Data Mining Applications IV
- Black and white image colorization
  Zhang, Isola, Efros. Colorful Image Colorization. In ECCV, 2016. http://richzhang.github.io/colorization/
See also https://machinelearningmastery.com/inspirational-applications-deep-learning/

Sample Data Mining Applications V
- Image classification, object recognition, and description generation using deep neural networks
  Andrej Karpathy and Li Fei-Fei. “Deep Visual-Semantic Alignments for Generating Image Descriptions.” CVPR 2015. https://cs.stanford.edu/people/karpathy/deepimagesent/

Data Mining Packages and Platforms: Commercial and Open Source

Commercial Data Mining Systems
- Matlab
- Oracle Data Mining
- and lots more …

Open Source Data Mining Tools
- RapidMiner: Klinkenberg et al., Univ. of Dortmund, Germany
- WEKA: Frank et al., University of Waikato, New Zealand
- Python data mining libraries
- R programming language: Ross Ihaka and Robert Gentleman, Univ. of Auckland, New Zealand
- and many more …

Other Data Mining Resources: books, conferences, journals, data repositories, …

Data Mining Resources: Books
- "Data Mining: Practical Machine Learning Tools and Techniques (4th Edition)". I.H. Witten, E. Frank, M. Hall, C. Pal. Morgan Kaufmann Publishers. 2017.
- "Introduction to Data Mining (2nd Edition)". P.-N. Tan, M. Steinbach, A. Karpatne, V. Kumar. Pearson. 2018.
- "Data Mining: Concepts and Techniques (3rd Edition)". J. Han, M. Kamber, and J. Pei. Morgan Kaufmann Publishers. 2012.
- "Advances in Knowledge Discovery and Data Mining". Eds.: Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy. The MIT Press. 1995.
- …

Data Mining Resources: Journals
- Data Mining and Knowledge Discovery Journal
- ACM SIGKDD Explorations Newsletter
- TKDE: IEEE Transactions on Knowledge and Data Engineering
- TODS: ACM Transactions on Database Systems
- JACM: Journal of the ACM
- Data and Knowledge Engineering
- JIIS: Intl. Journal of Intelligent Information Systems
- …

Data Mining Resources: Conferences
- KDD: ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining
- ICDM: IEEE International Conference on Data Mining
- SDM: SIAM International Conference on Data Mining
- PKDD: European Conference on Principles and Practice of Knowledge Discovery in Databases
- PAKDD: Pacific-Asia Conference on Knowledge Discovery and Data Mining
- DaWaK: Intl. Conference on Data Warehousing and Knowledge Discovery
Other related conferences:
- ICML: Intl. Conf. on Machine Learning
- IDEAL: Intl. Conf. on Intelligent Data Engineering and Automated Learning
- IJCAI: International Joint Conference on Artificial Intelligence
- AAAI: American Association for Artificial Intelligence Conference
- SIGMOD/PODS: ACM Intl. Conference on Data Management
- ICDE: International Conference on Data Engineering
- VLDB: International Conference on Very Large Data Bases

Data Mining Resources: Data
- Univ. of California Irvine Machine Learning Data Repository
- Univ. of California Irvine KDD Data Repository
- Datasets for Data Mining
- Datamob: Public data put to good use
- Time Series Data Library
- CMU's StatLib Datasets Archive
- Stanford Large Network Dataset Collection (SNAP)
- 100+ Interesting Data Sets for Statistics
- …

Data Mining Summary

Summary
- Data mining is the “non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data”
- The KDD process includes data collection and preprocessing, data mining, and evaluation and validation of the resulting patterns
- Data mining is the discovery and extraction of patterns from data, not the extraction of data
- Important challenges in data mining: privacy, security, scalability, real-time processing, and handling non-conventional data

Thank you.
ruiz@wpi.edu
http://www.cs.wpi.edu/~ruiz/