Ch. Eick: Course Information COSC 4335. Introduction --- Part2: 1. Another Introduction to Data Mining 2. Course Information



Knowledge Discovery in Data [and Data Mining] (KDD)
Let us find something interesting!
- Definition := “KDD is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” (Fayyad)
- Frequently, the term data mining is used to refer to KDD.
- Many commercial and experimental tools and tool suites are available (see …)
- The field is dominated more by industry than by research institutions

Motivation: “Necessity is the Mother of Invention”
- Data explosion problem
  - Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories
- We are drowning in data, but starving for knowledge!
- Solution: Data warehousing and data mining
  - Data warehousing and on-line analytical processing (“analyzing and mining the raw data rarely works”); idea: mine summarized, aggregated data
  - Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data collections

YAHOO!’s View of Data Mining
[Figure: “ACME CORP ULTIMATE DATA MINING BROWSER” with the labels “What’s New?”, “What’s Interesting?” and “Predict for me”]

Data Mining: A KDD Process
- Data mining: the core of the knowledge discovery process.
[Figure: Databases → Data Cleaning / Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]

Steps of a KDD Process
1. Learning the application domain: relevant prior knowledge and goals of application
2. Creating a target data set: data selection
3. Data cleaning and preprocessing
4. Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation (the first 4 steps may take 75% of effort!)
5. Choosing functions of data mining: summarization, classification, regression, association, clustering
6. Choosing the mining algorithm(s)
7. Data mining: search for patterns of interest
8. Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
9. Use of discovered knowledge
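
The steps of a KDD process can be sketched as a chain of small functions. The following is only a minimal illustration, not the course's method: the toy records, the field names (`region`, `age`, `spend`) and the `min_count` threshold are all hypothetical, and the "mining" step is a simple summarization (mean spend per region).

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy data; None marks a missing value.
raw_records = [
    {"region": "north", "age": 34, "spend": 120.0},
    {"region": "north", "age": None, "spend": 80.0},
    {"region": "south", "age": 51, "spend": 300.0},
    {"region": "south", "age": 47, "spend": None},
    {"region": "north", "age": 29, "spend": 100.0},
]


def select(records, fields):
    """Create the target data set: keep only the task-relevant fields."""
    return [{f: r[f] for f in fields} for r in records]


def clean(records):
    """Data cleaning: drop records that still contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def mine(records):
    """Data mining (summarization): record count and mean spend per region."""
    groups = defaultdict(list)
    for r in records:
        groups[r["region"]].append(r["spend"])
    return {region: (len(vals), mean(vals)) for region, vals in groups.items()}


def evaluate(patterns, min_count):
    """Pattern evaluation: keep summaries backed by enough records."""
    return {k: v for k, v in patterns.items() if v[0] >= min_count}


def run_pipeline(records, min_count=2):
    """Selection -> cleaning -> mining -> evaluation."""
    target = select(records, ["region", "spend"])
    cleaned = clean(target)
    patterns = mine(cleaned)
    return evaluate(patterns, min_count)


print(run_pipeline(raw_records))
```

Selecting the relevant fields before cleaning keeps the second "north" record, whose missing value is in a field the task does not use.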

Data Mining and Business Intelligence
[Figure: layered pyramid; the potential to support business decisions increases toward the top. Roles shown alongside, top to bottom: End User, Business Analyst, Data Analyst, DBA]
Layers, top to bottom:
- Making Decisions
- Data Presentation: Visualization Techniques
- Data Mining: Information Discovery
- Data Exploration: Statistical Analysis, Querying and Reporting
- Data Warehouses / Data Marts: OLAP, MDA
- Data Sources: Paper, Files, Information Providers, Database Systems, OLTP

Are All the “Discovered” Patterns Interesting?
- A data mining system/query may generate thousands of patterns; not all of them are interesting.
  - Suggested approach: human-centered, query-based, focused mining
- Interestingness measures: a pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
- Objective vs. subjective interestingness measures:
  - Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  - Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty, actionability, etc.
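
The two objective measures named above, support and confidence, can be computed directly from a list of transactions. This is a minimal sketch using the standard definitions; the market-basket transactions and the function names are made up for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent:
    support(antecedent and consequent together) / support(antecedent)."""
    antecedent = frozenset(antecedent)
    both = antecedent | frozenset(consequent)
    sup_antecedent = support(transactions, antecedent)
    if sup_antecedent == 0:
        return 0.0
    return support(transactions, both) / sup_antecedent


# Hypothetical market-basket data.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "diapers", "beer"}),
    frozenset({"milk", "diapers", "beer"}),
    frozenset({"bread", "milk", "diapers"}),
]

print(support(transactions, {"bread", "milk"}))       # 2 of 4 transactions
print(confidence(transactions, {"bread"}, {"milk"}))  # 2 of the 3 bread transactions
```

A rule can have high confidence and still be uninteresting to a user, which is why the slide separates objective measures from subjective ones.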

Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by Machine Learning, Statistics, Applications, Algorithm, Pattern Recognition, High-Performance Computing, Visualization, Database Technology]

KDD Process: A Typical View from ML and Statistics
[Figure: Input Data → Data Pre-Processing → Data Mining → Post-Processing]
- Data pre-processing: data integration, normalization, feature selection, dimension reduction
- Data mining: association analysis, classification, clustering, outlier analysis, summary generation, …
- Post-processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization
- This is a view from typical machine learning and statistics communities
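
Two of the boxes in this view, normalization (pre-processing) and outlier analysis (data mining), can be illustrated in a few lines. This is a rough sketch under stated assumptions: min-max scaling and a z-score rule are just two common choices, and the threshold of 2.0 and the sample values are arbitrary.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map them all to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` population
    standard deviations away from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical sample values.
print(min_max_normalize([10, 20, 30]))
print(zscore_outliers([10, 11, 9, 10, 12, 10, 11, 50]))
```

Note that a single extreme value inflates both the mean and the standard deviation, so a z-score rule can miss outliers in very small samples.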

Data Mining Competitions
- Netflix Prize
- KDD Cup 2009 (orange.com)
- KDD Cup 2011

Summary
- Data mining: discovering interesting patterns from large amounts of data
- A natural evolution of database technology, in great demand, with wide applications
- A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
- Mining can be performed in a variety of information repositories
- Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.

COSC 4335 in a Nutshell
[Figure: Preprocessing → Data Mining → Post Processing. Topics shown: Association Analysis, Pattern Evaluation, Clustering, Visualization, Summarization, Classification & Prediction, Data Analysis, Using R for Data Analytics and Programming]

Prerequisites
The course is basically self-contained; however, the following skills are important to be successful in taking this course:
- Basic knowledge of programming
  - Programming languages of your own choice and data mining tools, particularly R, will be used in the programming projects
- Basic knowledge of statistics
- Basic knowledge of data structures
- Data Management and Discrete Math; these can be taken concurrently with this course.

Course Objectives
Students who complete this course:
- will know what the goals and objectives of data mining are
- will have a basic understanding of how to conduct a data mining project
- will obtain some knowledge and practical experience in data analysis and making sense out of data
- will have sound knowledge of popular classification techniques, such as decision trees, support vector machines and nearest-neighbor approaches
- will know the most important association analysis techniques
- will have detailed knowledge of popular clustering algorithms, such as K-means, DBSCAN, and hierarchical clustering
- will have sound knowledge of R, an open source statistics/data mining environment
- will get some basic background in data visualization and basic statistics
- will learn how to interpret data analysis and data mining results
- will obtain practical experience in applying data mining techniques to real world data sets and in developing software on top of data mining and data analysis algorithms

Order of Coverage (subject to change!)
1. Introduction
2. Data
3. Basic Introduction to R Part 1
4. Exploratory Data Analysis
5. Similarity Assessment
6. Basic Introduction to R Part 2
7. Clustering
8. Programming in R
9. Classification and Prediction
10. Preprocessing
11. How to Conduct a Data Mining Project
12. Association Analysis
13. Outlier Detection
14. Data Warehousing and OLAP
15. Top 10 Data Mining Algorithms
16. Summary

In particular, R will be used for most course projects. The bad news is that it is more challenging to get started with R (compared to Weka, although Weka is a "dead" language), but you should be okay after you have used R for some weeks. On the other hand, the good news about R is that it continues to grow quickly in popularity: a recent poll at KDnuggets found that 34% of respondents do at least half of their data mining in R. Although it is a domain-specific language, it is versatile. As we have not used R in the course before, we expect some startup problems and ask for your patience. On the positive side, knowing R will be a plus when conducting research projects and when looking for jobs after you graduate, due to R's completeness and rising popularity.

Where to Find References?
- Data mining and KDD
  - Conference proceedings: ICDM, KDD, PKDD, PAKDD, SDM, ADMA, etc.
  - Journal: Data Mining and Knowledge Discovery
- Database field (SIGMOD member CD ROM)
  - Conference proceedings: VLDB, ICDE, ACM-SIGMOD, CIKM
  - Journals: ACM-TODS, J. ACM, IEEE-TKDE, JIIS, etc.
- AI and Machine Learning
  - Conference proceedings: ICML, AAAI, IJCAI, ECML, etc.
  - Journals: Machine Learning, Artificial Intelligence, etc.
- Statistics
  - Conference proceedings: Joint Stat. Meeting, etc.
  - Journals: Annals of Statistics, etc.
- Visualization
  - Conference proceedings: CHI, etc.
  - Journals: IEEE Trans. Visualization and Computer Graphics, etc.

Textbooks
- Recommended Text: P.-N. Tan, M. Steinbach, and V. Kumar: Introduction to Data Mining, Addison Wesley.
- Mildly Recommended Text: Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, second edition.

Course Projects
- Project 1: Exploratory Data Analysis (Group Project, groups of 2, 2 weeks; available on January 29)
- Project 2: Traditional Clustering with K-means and DBSCAN, Interpreting Clustering Results, and R-Programming (Individual Project, 4 weeks)
- Project 3: Classification and Prediction (Group Project, groups of 3, 4 weeks)
- Project 4: Association Analysis (Individual Project, 2 weeks)

Teaching Assistant: Raju
Duties:
1. Grading of assignments
2. Helping students with homework, programming projects and problems with the course material
3. Grading of exams (partially)
Office:
Office Hours: …
Meet our TA: Thursday, January 29

Students in my research group
Yongli Zhang, Nguyen Pham and Puja Anchilia
- Will teach 3-5 lectures
- Will proctor exams
- Might or might not be involved with other talks

Web and News Group
- Course Webpage
- UH-DMML Webpage
- COSC 4335 News Group

Exams
- Open textbook and notes (no computers!)
- Count about 50% towards the course grade
- 2-3 exams
- Schedule will be announced on Feb. 4
- There might or might not be a separate R-programming exam in early April

Where to Find References?
- DBLP, CiteSeer, Google
- Data mining and KDD (SIGKDD: CDROM)
  - Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
  - Journals: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD
- Database systems (SIGMOD: ACM SIGMOD Anthology, CD ROM)
  - Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
  - Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
- AI & Machine Learning
  - Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
  - Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-PAMI, etc.
- Web and IR
  - Conferences: SIGIR, WWW, CIKM, etc.
  - Journals: WWW: Internet and Web Information Systems
- Statistics
  - Conferences: Joint Stat. Meeting, etc.
  - Journals: Annals of Statistics, etc.
- Visualization
  - Conference proceedings: CHI, ACM-SIGGraph, etc.
  - Journals: IEEE Trans. Visualization and Computer Graphics, etc.

Teaching Philosophy and Advice
- Read the sections of the textbook and/or slides before you come to the lecture; if you work continuously for the class you will do better and lectures will be more enjoyable.
- Starting to review the material that is covered in this class 1 week before the next exam is not a good idea.
- Do not be afraid to ask questions! I really like interactions with students in the lectures…
- If you do not understand something at all, send me an e-mail before the next lecture!
- If you have a serious problem, talk to me before the problem gets out of hand.
