Data Mining and Knowledge Discovery prof. dr. Bojan Cestnik 1 Data Mining and Knowledge Discovery (DM & KD) prof. dr. Bojan Cestnik Temida d.o.o. & Jozef.

Slides:



Advertisements
Similar presentations
Relational Database and Data Modeling
Advertisements

What is a Database By: Cristian Dubon.
Chapter 5: Introduction to Information Retrieval
Ch2 Data Preprocessing part3 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.
C6 Databases.
Data Mining Methodology 1. Why have a Methodology  Don’t want to learn things that aren’t true May not represent any underlying reality ○ Spurious correlation.
Data preprocessing before classification In Kennedy et al.: “Solving data mining problems”
Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 3 of Data Mining by I. H. Witten, E. Frank and M. A. Hall.
Machine Learning in Practice Lecture 3 Carolyn Penstein Rosé Language Technologies Institute/ Human-Computer Interaction Institute.
QUANTITATIVE DATA ANALYSIS
Data Mining By Archana Ketkar.
1 Chapter 2 Reviewing Tables and Queries. 2 Chapter Objectives Identify the steps required to develop an Access application Specify the characteristics.
Managing Data Resources. File Organization Terms and Concepts Bit: Smallest unit of data; binary digit (0,1) Byte: Group of bits that represents a single.
Overview of Search Engines
Data at the Core of the Enterprise. Objectives  Define of database systems  Introduce data modeling and SQL  Discuss emerging requirements of database.
© 2008 The McGraw-Hill Companies, Inc. All rights reserved. ACCESS 2007 M I C R O S O F T ® THE PROFESSIONAL APPROACH S E R I E S Lesson 3 – Finding, Filtering,
APPENDIX C DESIGNING DATABASES
Data Mining: Concepts & Techniques. Motivation: Necessity is the Mother of Invention Data explosion problem –Automated data collection tools and mature.
OLAM and Data Mining: Concepts and Techniques. Introduction Data explosion problem: –Automated data collection tools and mature database technology lead.
Data Mining Techniques
Data at the Core of the Enterprise. Objectives  Define of database systems.  Introduce data modeling and SQL.  Discuss emerging requirements of database.
Intro to MIS – MGS351 Databases and Data Warehouses Chapter 3.
Copyright R. Weber Machine Learning, Data Mining ISYS370 Dr. R. Weber.
Software School of Hunan University Database Systems Design Part III Section 5 Design Methodology.
Data Mining & Knowledge Discovery Lecture: 2 Dr. Mohammad Abu Yousuf IIT, JU.
Analyzing and Interpreting Quantitative Data
Data Mining Chapter 1 Introduction -- Basic Data Mining Tasks -- Related Concepts -- Data Mining Techniques.
Database Design Principles – Lecture 3
Lecture2: Database Environment Prepared by L. Nouf Almujally & Aisha AlArfaj 1 Ref. Chapter2 College of Computer and Information Sciences - Information.
Database Design and Management CPTG /23/2015Chapter 12 of 38 Functions of a Database Store data Store data School: student records, class schedules,
C6 Databases. 2 Traditional file environment Data Redundancy and Inconsistency: –Data redundancy: The presence of duplicate data in multiple data files.
5-1 McGraw-Hill/Irwin Copyright © 2007 by The McGraw-Hill Companies, Inc. All rights reserved.
Principles of Data Mining. Introduction: Topics 1. Introduction to Data Mining 2. Nature of Data Sets 3. Types of Structure Models and Patterns 4. Data.
Chapter 5 Data Resource Management. 2 I. Why do organizations store data?  Data resources must be structured and organized in some logical manner so.
Data Preprocessing Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2010.
Data Preparation and Description Lecture 25 th. RECAP.
1 STAT 500 – Statistics for Managers STAT 500 Statistics for Managers.
Data Mining – Intro. Course Overview Spatial Databases Temporal and Spatio-Temporal Databases Multimedia Databases Data Mining.
6.1 © 2010 by Prentice Hall 6 Chapter Foundations of Business Intelligence: Databases and Information Management.
Advanced Database Course (ESED5204) Eng. Hanan Alyazji University of Palestine Software Engineering Department.
Chapter 5: Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization DECISION SUPPORT SYSTEMS AND BUSINESS.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining: Data Lecture Notes for Chapter 2 Introduction to Data Mining by Tan, Steinbach,
Part II Tools for Knowledge Discovery Ch 5. Knowledge Discovery in Databases Ch 6. The Data Warehouse Ch 7. Formal Evaluation Technique.
LOGO 1 Corroborate and Learn Facts from the Web Advisor : Dr. Koh Jia-Ling Speaker : Tu Yi-Lang Date : Shubin Zhao, Jonathan Betz (KDD '07 )
Managing Data Resources. File Organization Terms and Concepts Bit: Smallest unit of data; binary digit (0,1) Byte: Group of bits that represents a single.
Data Mining: Knowledge Discovery in Databases Peter van der Putten ALP Group, LIACS Pre-University College LAPP-Top Computer Science February 2005.
BOĞAZİÇİ UNIVERSITY DEPARTMENT OF MANAGEMENT INFORMATION SYSTEMS MATLAB AS A DATA MINING ENVIRONMENT.
Data Preprocessing Compiled By: Umair Yaqub Lecturer Govt. Murray College Sialkot.
Chapter 6: Analyzing and Interpreting Quantitative Data
SqlExam1Review.ppt EXAM - 1. SQL stands for -- Structured Query Language Putting a manual database on a computer ensures? Data is more current Data is.
An Introduction Student Name: Riaz Ahmad Program: MSIT( ) Subject: Data warehouse & Data Mining.
Data Mining and Decision Support
Chapter 17 Preparing Data for Mining. 2 Introduction Just as manufacturing and refining are about transformation of raw materials into finished products,
Data Profiling 13 th Meeting Course Name: Business Intelligence Year: 2009.
Intro to MIS – MGS351 Databases and Data Warehouses
Introduction Multimedia initial focus
Fundamentals & Ethics of Information Systems IS 201
Data Resource Management
Analyzing and Interpreting Quantitative Data
Datamining : Refers to extracting or mining knowledge from large amounts of data Applications : Market Analysis Fraud Detection Customer Retention Production.
Databases and Data Warehouses Chapter 3
Data Mining: Concepts and Techniques Course Outline
Chapter 2 Database Environment Pearson Education © 2009.
Chapter 2 Database Environment.
ERD’s REVIEW DBS201.
Spreadsheets, Modelling & Databases
The ultimate in data organization
Chapter 2 Database Environment Pearson Education © 2009.
Data Pre-processing Lecture Notes for Chapter 2
Chapter 2 Database Environment Pearson Education © 2009.
Presentation transcript:

Data Mining and Knowledge Discovery prof. dr. Bojan Cestnik 1 Data Mining and Knowledge Discovery (DM & KD) prof. dr. Bojan Cestnik Temida d.o.o. & Jozef Stefan Institute Ljubljana

Data Mining and Knowledge Discovery prof. dr. Bojan Cestnik 2 Contents IntroductionIntroduction Basic Data Mining process Data kinds and formats ER diagram Data exploration Data preparation Examples in Excel and MySQL

Data Mining and Knowledge Discovery prof. dr. Bojan Cestnik 3 Study guide and rules Lecture schedule –Tuesday, : :00 –Wednesday, : :00 orange room –Wednesday, : :00 orange room Web page: Literature for study Seminar assignment Exam

Data Mining and Knowledge Discovery prof. dr. Bojan Cestnik 4 Basic Data Mining process Input:Input: transaction data table, relational database, text documents, web pages Goal:Goal: construct a classification model, find interesting patterns in data, etc.

Data Mining and Knowledge Discovery prof. dr. Bojan Cestnik 5 KDD process KDD process involves several steps –Data preparation –Data mining –Evaluation and use of discovered patterns Data Mining is the key step –Only 15%-25% of the entire KDD process

Data Mining and Knowledge Discovery prof. dr. Bojan Cestnik 6 Data kinds and formats Kinds of data: –Descriptive tables: instances, attributes, classes –Texts: documents, paragraphs, sentences, words –Multimedia: pictures, music, movies –… Data formats: –Relational databases –.xls: Excel table format –.csv: comma-separated file –.arff: attribute-relation file format (Weka) –…

Data Mining and Knowledge Discovery prof. dr. Bojan Cestnik 7 Data sources example Local telephone company: –When the call was placed, who called, how long the call lasted, etc. Catalog company: –Items ordered, time and duration of calls, promotion response, credit card used, shipping method, etc. Credit card processor: –Transaction date, amount charged, approval code, vendor number, etc. Credit card issuer: –Billing record, interest rate, available credit update, etc. Package carrier: –Zip code, value of package, time stamp at truck, time stamp at sorting center, etc.

Data Mining and Knowledge Discovery prof. dr. Bojan Cestnik 8 Tables I Single table: instances, attributes, classes instances attributesclass

Data Mining and Knowledge Discovery prof. dr. Bojan Cestnik 9 Tables II Many tables: relations, ER diagram

Data Mining and Knowledge Discovery prof. dr. Bojan Cestnik 10 Texts I Documents, web pages, etc. Transformations: lemmatization, stop-words, named entities, etc. Bag-of-words representation documents words

Data Mining and Knowledge Discovery prof. dr. Bojan Cestnik 11 Texts II Areas of text processing –Semantic web –Semantic web – Knowledge representation and Reasoning –Information retrieval –Information retrieval – Search in DB –Natural language processing –Natural language processing – Computational linguistics –Text mining –Text mining – Data analysis

Data Mining and Knowledge Discovery prof. dr. Bojan Cestnik 12 Texts III TFIDF measure for word relevance –(Term Frequency * Inverse Document Frequency) –Term Frequency: word frequency in a particular document (paragraph) –Inverse Document Frequency: how infrequent a word is in the collection of all documents (paragraphs)

Data Mining and Knowledge Discovery prof. dr. Bojan Cestnik 13 Texts IV – document similarity Ideal: semantic similarity Practical: statistical similarity –Representation of documents as vectors –Cosine similarity between documents x y z v1v1 v2v2 3d example:

Data Mining and Knowledge Discovery prof. dr. Bojan Cestnik 14 Multimedia: music I Finding the right attributes to describe different pieces of music Data preparation and pre-processing The need for special tools for data preparation

Data Mining and Knowledge Discovery prof. dr. Bojan Cestnik 15 Multimedia: music II Widmer et al.: In Search of the Horowitz Factor, AI Magazine, 2004

Data Mining and Knowledge Discovery prof. dr. Bojan Cestnik 16 Multimedia: music III Dynamics curves comparison

Data Mining and Knowledge Discovery prof. dr. Bojan Cestnik 17 Multimedia: music IV Tempo / loudness performance curve

Data Mining and Knowledge Discovery prof. dr. Bojan Cestnik 18 Multimedia: music V Mozart performance “alphabet”

Data Mining and Knowledge Discovery prof. dr. Bojan Cestnik 19 Approaches to data gathering Problem definition –Class variable (dependent variable) –Attributes and values (independent variables) (1) Manual table construction (2) Generation from existing database (3) Combination of (1) and (2)

Data Mining and Knowledge Discovery prof. dr. Bojan Cestnik 20 Models of the real world Real world: objects (entities), properties (attributes), relations Models: abstractions from the real world Data model: ERD diagram Conceptual data model – semantic view Logical data model – business view Physical data model – performance view

Data Mining and Knowledge Discovery prof. dr. Bojan Cestnik 21 ER diagram Entities, attributes, relations

Data Mining and Knowledge Discovery prof. dr. Bojan Cestnik 22 Entity = Table Rows: instances Columns: attributes, class

Data Mining and Knowledge Discovery prof. dr. Bojan Cestnik 23 SQL Queries for ERD model Operations: –Data exploration –Data transformation Examples in MySQL

Data Mining and Knowledge Discovery prof. dr. Bojan Cestnik 24 Data exploration What are the values in each column? –Columns with (almost) only one value –Columns with unique values What unexpected values are in each column? Are there any data format irregularities, such as time stamps missing hours and minutes or names being both upper- and lowercase? What relationships are there between columns? What are frequencies of values in columns and do these frequencies make sense?

Data Mining and Knowledge Discovery prof. dr. Bojan Cestnik 25 Summary for one column I The number of distinct values in the column Minimum and maximum values An example of the most common value (called the mode in statistics) An example of the least common value (called the antimode) Frequency of the minimum and maximum values

Data Mining and Knowledge Discovery prof. dr. Bojan Cestnik 26 Summary for one column II Frequency of the mode and antimode Number of values that occur only one time Number of modes (because the most common value is not necessarily unique) Number of antimodes

Data Mining and Knowledge Discovery prof. dr. Bojan Cestnik 27 Basic statistical concepts The Null Hypothesis Confidence (versus probability) Normal Distribution

Data Mining and Knowledge Discovery prof. dr. Bojan Cestnik 28 Data preparation I Dataflow operations: –Read –Output –Select (chooses the columns for the output; each column is either equal to input column or a function of some input columns) –Filter (removes rows based on the values in one or more columns; each input row either is or is not in the output table) –Append (appends columns to an existing table)

Data Mining and Knowledge Discovery prof. dr. Bojan Cestnik 29 Data preparation II Dataflow operations: –Union (appends equally headed rows to an existing result) –Aggregate (groups columns together based on a common key; all the responding rows are summarized in a single output row) –Lookup (joining small tables) –Join (matches rows in two tables; for every matching pair a new row is created in the output) –Sort

Data Mining and Knowledge Discovery prof. dr. Bojan Cestnik 30 Data types Numeric –Categorical –Rank –Interval –True numerics Date and time String

Data Mining and Knowledge Discovery prof. dr. Bojan Cestnik 31 Derived variables During pre-processing or processing? Often contain very similar information Examples: –height ^2 / weight –debt / earnings –population / area –credit limit – balance Difference, ratio? Summarizations Extracting features from single columns –Date, time

Data Mining and Knowledge Discovery prof. dr. Bojan Cestnik 32 Data sampling Selecting the right level of granularity Depends on the data types –Categorical –Rank –Interval –True numerics Sometimes we have to take what we have and do the best with it

Data Mining and Knowledge Discovery prof. dr. Bojan Cestnik 33 Data variability How much data is enough? –How many rows? –How many columns? –How many bytes? –How much history? Selecting the right sample size Random sampling Beware of biased samples

Data Mining and Knowledge Discovery prof. dr. Bojan Cestnik 34 Data confidence Statistical measures Stratified sampling techniques Example: variables gender and age in questionnaires Handling outliers –Do nothing –Filter the rows –Ignore the column –Replace the outlying values –Bin values into ranges Handling missing data

Data Mining and Knowledge Discovery prof. dr. Bojan Cestnik 35 Data exploration with Excel Summary of a single column –Different values –Frequencies – value distribution –Aggregate functions Pivot tables Visualization: pivot graphs

Data Mining and Knowledge Discovery prof. dr. Bojan Cestnik 36 Overview DM algorithms want data in table format Data comes from warehouses, data marts, OLAP systems, external sources, etc. Data has to be transformed into a DM format: aggregations, joins Useful column types: categories, ranks, intervals, true numerics The art of DM: creating derived variables