CSE5230 - Data Mining, 2004. Lecture 3.1: Pre-processing for Data Mining (CSE5230/DMS/2004/3)

Presentation transcript:

Lecture 3.1 Data Mining - CSE5230: Pre-processing for Data Mining (CSE5230/DMS/2004/3)

Lecture 3.2 Lecture Outline
- Data Preparation
  - Accessing data
  - Data characterization
  - Data selection
  - Useful operations for data clean-up and conversion
  - Integration issues
- Data Modeling
  - Motivation
  - Ten Golden Rules
  - Object modeling
  - Data Abstraction
  - Working with Metadata

Lecture 3.3 Lecture Objectives
- By the end of this lecture, you should be able to:
  - Explain why data preparation is necessary before data mining can commence
  - Give examples of useful operations during the process of data clean-up and conversion, and show how these operations are applied in specific cases
  - Explain why modeling is important in data preparation for data mining, and give examples of such models
  - Explain the notion of data abstraction and why it is useful

Lecture 3.4 Data Preparation for Data Mining - 1
- Before starting to use a data mining tool, the data has to be transformed into a suitable form for data mining
- Many new and powerful data mining tools have become available in recent years, but the law of GIGO still applies: Garbage In → Garbage Out
- Good data is a prerequisite for producing effective models of any type

Lecture 3.5 Data Preparation for Data Mining - 2
- Data preparation and data modeling can therefore be considered as setting up the proper environment for data mining
- Data preparation will involve
  - Accessing the data (transfer of data from various sources)
  - Integrating different data sets
  - Cleaning the data
  - Converting the data to a suitable format

Lecture 3.6 Accessing the data - 1
- Before data can be identified and assessed, two major questions must be answered:
  - Is the data accessible?
  - How does one get it?
- There are many reasons why data might not be readily accessible, particularly in organizations without a data warehouse:
  - legal issues
  - departmental access
  - political reasons
  - data format
  - connectivity
  - architectural reasons
  - timing

Lecture 3.7 Accessing the data - 2
- Transferring from original sources
  - may have to access from: high density tapes, attachments, FTP as bulk downloads
- Repository types
  - Databases
    - Obtain data as separate tables converted to flat files (most databases have the facility).
  - Word processors
    - Text output without any formatting would be the best
  - Spreadsheets
    - Small applications/organizations will store data in spreadsheets. Already in row/column format, so easy to access. Most problems due to inconsistent replications
  - Machine to Machine
    - Possible problems due to different computing architectures
  - The Web
    - Structured, semi-structured or structureless data
      - HTML, XML, free text, images, video, audio, etc.

Lecture 3.8 Data characterization - 1
- After obtaining all the data streams, the nature of each data stream must be characterized
  - This is not the same as the data format (i.e. field names and lengths)
- Detail/Aggregation Level (Granularity)
  - all variables fall somewhere between detailed (e.g. transaction records) and aggregated (e.g. summaries)
  - in general, detailed data is preferred for data mining
  - the level of detail available in a data set determines the level of detail that is possible in the output
  - usually the level of detail of the input stream must be at least one level below that required of the output stream

Lecture 3.9 Data characterization - 2
- Consistency
  - Inconsistency can defeat any modeling technique until it is discovered and corrected
    - different things may have the same name in different systems
    - the same thing may be represented by different names in different systems
    - inconsistent data may be entered in a field in a single system, e.g. auto_type: Merc, Mercedes, M-Benz, Mrcds
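The auto_type example can be sketched in code. This is a minimal illustration, not a method from the lecture: the synonym table, the canonical name "Mercedes-Benz" and the function names are all assumptions made for the example. Unknown spellings are reported rather than guessed.

```python
# Hypothetical synonym table: each known spelling (lowercased) maps to
# one canonical name chosen for this example.
CANONICAL_AUTO_TYPE = {
    "merc": "Mercedes-Benz",
    "mercedes": "Mercedes-Benz",
    "m-benz": "Mercedes-Benz",
    "mrcds": "Mercedes-Benz",
}


def standardize_auto_types(values):
    """Return (cleaned values, list of spellings not found in the table)."""
    cleaned, unknown = [], []
    for raw in values:
        key = raw.strip().lower()
        if key in CANONICAL_AUTO_TYPE:
            cleaned.append(CANONICAL_AUTO_TYPE[key])
        else:
            # Keep the original value and flag it for manual review.
            cleaned.append(raw)
            unknown.append(raw)
    return cleaned, unknown


cleaned, unknown = standardize_auto_types(
    ["Merc", "Mercedes", "M-Benz", "Mrcds", "Volvo"]
)
```

In practice the table would be built by inspecting the frequency counts of the field, so the unknown list is the part worth reviewing by hand.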

Lecture 3.10 Data characterization - 3
- Pollution
  - Data pollution can come from many sources. One of the most common is when users attempt to stretch a system beyond its intended functionality, e.g.
    - “B” in a gender field, intended to represent “Business”. The field was originally intended to only ever be “M” or “F”.
  - Other sources include:
    - copying errors (especially when the format is incorrectly specified)
    - human resistance: operators may enter garbage if they can’t see why they should have to type in all this “extra” data

Lecture 3.11 Data characterization - 4
- Objects
  - precise nature of object being measured by the data must be understood
    - e.g. what is the difference between “consumer spending” and “consumer buying patterns”?
- Domain
  - Every variable has a domain: a range of permitted values
  - Summary statistics and frequency counts can be used to detect erroneous values outside the domain
  - Some variables have conditional domains, violations of which are harder to detect
    - e.g. in a medical database a diagnosis of ovarian cancer is conditional on the gender of the patient being female
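A frequency count and a conditional-domain check can be sketched as follows. The record layout (dictionaries with "gender" and "diagnosis" keys) and the function names are assumptions made for illustration only.

```python
from collections import Counter


def out_of_domain_counts(values, domain):
    """Frequency count of every value that falls outside the permitted domain."""
    counts = Counter(values)
    return {value: n for value, n in counts.items() if value not in domain}


def conditional_domain_violations(records):
    """Records whose diagnosis is only valid when gender is 'F' (hypothetical layout)."""
    return [
        r for r in records
        if r["diagnosis"] == "ovarian cancer" and r["gender"] != "F"
    ]


gender_errors = out_of_domain_counts(["M", "F", "B", "F", "B"], {"M", "F"})

patients = [
    {"id": 1, "gender": "F", "diagnosis": "ovarian cancer"},
    {"id": 2, "gender": "M", "diagnosis": "ovarian cancer"},
    {"id": 3, "gender": "M", "diagnosis": "influenza"},
]
violations = conditional_domain_violations(patients)
```

The simple domain check needs only one column, while the conditional check needs a rule relating two columns, which is why such violations are harder to find with summary statistics alone.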

Lecture 3.12 Data characterization - 5
- Default values
  - if the system has default values for fields, this must be known. Conditional defaults can create apparently significant patterns which in fact represent a lack of data
- Integrity
  - Checking integrity evaluates the relationships permitted between variables
    - e.g. an employee may have multiple cars, but is unlikely to be allowed to have multiple employee numbers
  - related to the domain issue

Lecture 3.13 Data characterization - 6
- Duplicate or redundant variables
  - redundant data can easily result from the merging of data streams
  - occurs when essentially identical data appears in multiple variables, e.g. “date_of_birth”, “age”
  - if not actually identical, will still slow building of model
  - if actually identical can cause significant numerical computation problems for some models, even causing crashes
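Exactly identical variables are the easy case to detect. The sketch below assumes the data is held as a dictionary mapping column names to equal-length lists; that layout and the column names are illustrative. It only finds exact duplicates, so a derived pair such as date_of_birth and age would need a separate check.

```python
from itertools import combinations


def identical_columns(table):
    """Return pairs of column names whose values are identical in every row."""
    return [
        (a, b)
        for a, b in combinations(sorted(table), 2)
        if table[a] == table[b]
    ]


# Hypothetical merged data set in which "age_copy" arrived from a second stream.
table = {
    "age": [30, 41, 57],
    "age_copy": [30, 41, 57],
    "height_cm": [170, 180, 165],
}
duplicates = identical_columns(table)
```

One reason identical inputs cause numerical trouble is that perfectly collinear columns make the matrix inverted in a least-squares fit singular.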

Lecture 3.14 Extracting part of the available data
- In most cases original data sets would be too large to handle as a single entity. There are two ways of handling this problem:
  - Limit the scope of the problem
    - concentrate on particular products, regions, time frames, dollar values etc. OLAP can be used to explore data prior to such limiting
    - if no pre-defined ideas exist, use tools such as Self-Organizing Neural Networks to obtain an initial understanding of the structure of the data
  - Obtain a representative sample of the data
    - Similar to statistical sampling
- Once an entity of interest is identified via initial analysis, one can follow the lead and request more information (“walking the data”)
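One simple way to obtain a sample is a simple random sample without replacement. This is only one of several sampling schemes and is not necessarily the one intended in the lecture; the function name and the fixed seed are assumptions made so that the extraction is repeatable.

```python
import random


def simple_random_sample(records, k, seed=0):
    """Draw k records without replacement; a fixed seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(records, k)


transactions = list(range(1000))  # stand-in for 1000 transaction records
sample = simple_random_sample(transactions, 50)
```

Whether a sample of this kind is representative depends on the data: rare but important groups may need stratified sampling instead.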

Lecture 3.15 Process of Data Access
- Some problems one may encounter:
  - copyright, security, limited front-end menu facilities
[Figure: flow diagram of the data access process. Box labels in extraction order: Data source, Query data source, Obtain sample, Temporary repository, Apply filters or Clustering refining, Data Mining Tool, Request for updates]

Lecture 3.16 Useful operations during data access/preparation - 1
- Text Standardization
  - convert all text to upper- or lowercase
    - This helps to avoid problems due to case differences in different occurrences of the same data (e.g. the names of people or organizations)
  - remove extraneous characters
  - Remove punctuation (in some applications)
  - Word stemming (for text mining/text retrieval applications)
- Concatenation
  - combine data spread across multiple fields e.g. names, addresses. The aim is to produce a unique representation of the data object
- Representation formats
  - some sorts of data come in many formats
    - e.g. dates: 12/05/03, 05-Dec-03
  - transform all to a single, simple format
    - and one that is future-proof (e.g. Y2K)
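The three operations above can be sketched with the Python standard library. The function names are illustrative. The date parser assumes that 12/05/03 means month/day/year (so that it agrees with 05-Dec-03), that two-digit years follow the usual POSIX pivot (03 becomes 2003), and that month abbreviations are in English; all three are assumptions, and a real data set must document which convention each source uses.

```python
import string
from datetime import datetime

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCTUATION = str.maketrans("", "", string.punctuation)


def standardize_text(raw):
    """Uppercase, drop punctuation, and collapse runs of whitespace."""
    return " ".join(raw.upper().translate(_STRIP_PUNCTUATION).split())


def concatenate_fields(*fields):
    """Combine several fields (e.g. first and last name) into one standardized key."""
    return standardize_text(" ".join(fields))


# Assumed input formats: month/day/two-digit year, and day-MonthAbbrev-year.
DATE_FORMATS = ("%m/%d/%y", "%d-%b-%y")


def to_iso_date(raw):
    """Convert a date in any assumed format to a four-digit-year ISO 8601 string."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")


company = standardize_text("  O'Brien,  Pty. Ltd. ")
person = concatenate_fields("mary", "O'Brien")
dates = [to_iso_date("12/05/03"), to_iso_date("05-Dec-03")]
```

Storing a four-digit year in a single format is what makes the result future-proof in the Y2K sense: the ambiguity is resolved once, at conversion time.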

Lecture 3.17 Useful operations during data access/preparation - 2
- Abstraction
  - it can sometimes be useful to reduce the information in a field to simple yes/no values: e.g.
    - flag people as having a criminal record, rather than having a separate category for each possible crime
- Unit conversion
  - choose a standard unit for each field and enforce it: e.g.
    - $A, € → $US
    - yards, feet → metres
  - This can have dramatic consequences if not done!
    - The loss of the Mars Climate Orbiter (admittedly not a DM example)
- Exclusion
  - data processing takes up valuable computation time, so one should exclude unnecessary or unwanted fields where possible
  - fields containing bad, dirty or missing data may also be removed
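Abstraction to a yes/no flag and conversion to a standard unit might look like the sketch below. The unit codes ("yd", "ft", "m") and function names are assumptions; the conversion factors are the exact international definitions of the yard and the foot. Currency conversion is omitted because it needs an exchange rate for a specific date.

```python
# Exact definitions: 1 yard = 0.9144 m, 1 foot = 0.3048 m.
METRES_PER_UNIT = {"m": 1.0, "ft": 0.3048, "yd": 0.9144}


def to_metres(value, unit):
    """Convert a length to metres; refuse units that are not explicitly known."""
    if unit not in METRES_PER_UNIT:
        raise ValueError(f"unknown length unit: {unit!r}")
    return value * METRES_PER_UNIT[unit]


def has_criminal_record(convictions):
    """Abstract a list of specific offences to a single yes/no flag."""
    return len(convictions) > 0


field_length = to_metres(100, "yd")
flag = has_criminal_record(["theft", "fraud"])
```

Raising an error on an unknown unit, instead of passing the number through unchanged, is the design choice that guards against silently mixing units.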

Lecture 3.18 Useful operations during data access/preparation - 3
- Numeric Encoding
  - Many data mining tools require numerical input data (e.g. neural networks), but not all data is numeric! Data variables can be:
    - Numeric (integer or floating point), e.g.
      - 1, 2.6, 56.7, 10e-7, …
    - Nominal: the names of categories or classes, e.g.
      - cat, sheep, goldfish, duck, elephant, goat, …
    - Ordinal: samples from an ordered list of categories, e.g.:
      - Terrible, Bad, OK, Good, Excellent
  - Numerical variables must sometimes be normalized (e.g. to the range [0,1]) before being presented to a data mining tool
    - e.g. to prevent saturation of a neural network node
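Normalization to the range [0,1] is commonly done by min-max scaling, x' = (x - min) / (max - min). The sketch below assumes that technique; mapping a constant column to 0.0 is a choice made here to avoid division by zero, not something specified in the lecture.

```python
def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to 0.0 by convention.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


scaled = min_max_normalize([2, 4, 6])
```

The minimum and maximum should be taken from the training data and reused for new data, otherwise the same raw value would be encoded differently at different times.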

Lecture 3.19 Useful operations during data access/preparation - 4
- Numeric Encoding cont.
  - Care must be taken when encoding nominal variables in numeric form
    - Do not introduce relationships that are not present in the original data. For example, the mapping
      - cat → 1, sheep → 2, goldfish → 3, duck → 4, elephant → 5, goat → 6, … implies that the “distance” between duck and elephant is 1, whereas between sheep and goat it is 4
      - This is not a real relationship: it is an artefact introduced by the encoding
    - One solution is to convert the single original variable into multiple binary variables, one for each category
  - Care must also be taken with ordinal variables, particularly when one value implies the presence of others.
    - Consider a variable describing how often someone watches television, e.g.
      - At least once a day, at least once a week, at least once a month, etc.
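The "multiple binary variables" solution is usually called one-hot encoding. For the television example, one possible treatment is a cumulative ("thermometer") code, in which a value switches on every level it implies. The slide does not name a technique for the ordinal case, so the thermometer code is an assumption, as are the function names.

```python
ANIMALS = ["cat", "sheep", "goldfish", "duck", "elephant", "goat"]

# Ordered from least to most frequent; watching daily implies weekly and monthly.
TV_LEVELS = ["at least once a month", "at least once a week", "at least once a day"]


def one_hot(value, categories):
    """One binary variable per category; exactly one of them is 1."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]


def thermometer(value, ordered_levels):
    """Cumulative code: the value's own level and all levels it implies are 1."""
    rank = ordered_levels.index(value)
    return [1 if i <= rank else 0 for i in range(len(ordered_levels))]


def hamming(a, b):
    """Number of positions in which two codes differ."""
    return sum(x != y for x, y in zip(a, b))


duck = one_hot("duck", ANIMALS)
weekly = thermometer("at least once a week", TV_LEVELS)
```

With one-hot codes every pair of distinct animals differs in exactly two positions, so duck is no closer to elephant than sheep is to goat.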

Lecture 3.20 Data integration issues - 1
- Multi-source
  - Oracle, Excel, Informix, DB2, MySQL, etc.
    - Standardized database drivers help (e.g. ODBC)
    - Data Warehousing helps
- Multi-format
  - relational databases, hierarchical structures, XML, HTML, free text, etc.
- Multi-platform
  - DOS, MS Windows, UNIX, etc.
    - Issues such as end-of-line characters, big-endian/little-endian binary file formats, etc.
- Multi-security
  - copyright, privacy, personal records, government data, etc.
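The two multi-platform issues named above can be illustrated briefly. The sketch shows the same four bytes read as a 32-bit unsigned integer under each byte order, and a helper that maps DOS/Windows (CR LF) and bare CR line endings to the UNIX convention. Function names are illustrative.

```python
import struct


def read_u32(data, byte_order):
    """Interpret 4 bytes as an unsigned 32-bit integer in the given byte order."""
    fmt = ">I" if byte_order == "big" else "<I"
    return struct.unpack(fmt, data)[0]


def normalize_newlines(text):
    """Convert CR LF and bare CR line endings to a single LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


raw = b"\x00\x00\x00\x01"
as_big = read_u32(raw, "big")
as_little = read_u32(raw, "little")
unix_text = normalize_newlines("a\r\nb\rc\n")
```

A binary file that stores the value 1 on one architecture is therefore read as 16777216 on the other unless the byte order is stated explicitly.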

Lecture 3.21 Data integration issues - 2
- Multimedia
  - text, images, audio, video, etc.
    - Features of interest must be defined and extracted from the raw data
    - Cleaning might be required when formats are inconsistent
- Multi-location
  - LAN, WAN, dial-up connections, etc.
- Multi-query
  - whether query format is consistent across data sets
    - again, database drivers are useful here
  - whether multiple extractions are possible
    - i.e. whether a large number of extractions is possible: some systems do not allow batch extractions, so one has to obtain records individually, etc.

Lecture 3.22 Modeling Data for Data Mining - 1
- A major reason for preparing data is so that mining can discover models
- What is modeling?
  - it is assumed that the data set (available or obtainable) contains information that would be of interest if only we could understand what was in it
  - Since we don’t understand the information that is in the data just by looking at it, some tool is needed which will turn the information lurking in the data set into an understandable form

Lecture 3.23 Modeling Data for Data Mining - 2
- The object is to transfer the raw data structure to a format that can be used for mining
- The models created will determine the type of results that can be discovered during the analysis
- With most current data mining tools, the analyst has to have some idea what type of patterns can be identified during the analysis, and model the data to suit these requirements
- If the data is not properly modeled, important patterns may go undetected, thus undermining the likelihood of success

Lecture 3.24 Modeling Data for Data Mining - 3
- To make a model is to express the relationships governing how a change in a variable or set of variables (inputs) affects another variable or set of variables (outputs)
- we also want information about the reliability of these relationships
- the expression of the relationships may have many forms:
  - charts, graphs, equations, computer programs

Lecture 3.25 Ten Golden Rules for Building Models
1. Select clearly defined problems that will yield tangible benefits
2. Specify the required solution
3. Define how the solution is going to be used
4. Understand as much as possible about the problem and the data set (the domain)
5. Let the problem drive the modeling (i.e. tool selection, data preparation, etc.)

Lecture 3.26 Ten Golden Rules for Building Models
6. State any assumptions
7. Refine the model iteratively
8. Make the model as simple as possible - but no simpler (paraphrasing Einstein)
9. Define instability in the model (critical areas where change in output is very large for small changes in inputs)
10. Define uncertainty in the model (critical areas and ranges in the data set where the model produces low confidence predictions/insights)

Lecture 3.27 Object modeling
- The main approach to data modeling assumes an object-oriented framework, where information is represented as objects, their descriptive attributes, and relationships that exist between object classes.
- Example object classes
  - Credit ratings of customers can be checked
  - Contracts can be renewed
  - Telephone calls can be billed
- Identifying attributes
  - In a medical database system, the class patient may have the attributes height, weight, age, gender, etc.

Lecture 3.28 Data Abstraction
- Information can be abstracted such that the analyst can initially get an overall picture of the data and gradually expand in a top-down manner
- Will also permit processing of more data
- Can be used to identify patterns that can only be seen in grouped data, e.g. group patients into broad age groups (0-10, 10-20, 20-30, etc.)
- Clustering can be used to fully or partially automate this process
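Grouping patients into broad age bands can be sketched as below. The slide's bands share their end points (10 appears in both 0-10 and 10-20), so this sketch assumes half-open intervals in which the lower bound is included and the upper bound is not; that rule and the function name are assumptions.

```python
from collections import Counter


def age_group(age, width=10):
    """Label the band [lo, lo + width) that contains the given age."""
    lo = (age // width) * width
    return f"{lo}-{lo + width}"


ages = [7, 10, 29, 23, 3]
group_counts = Counter(age_group(a) for a in ages)
```

Changing the width parameter gives coarser or finer bands, which supports the top-down expansion described above.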

Lecture 3.29 Working with Metadata - 1
- Traditional definition of metadata is “data about data”
- Some data miners include “data within data” in the definition
- Example: Deriving metadata from dates:
  - identifying seasonal sales trends
  - identifying pivot points for some activity
    - e.g. happens on the 2nd Sunday of July
  - Note: “July 4th, 1976” is potentially: 7th Month of the Year, 4th Day of the Month, 1976, Sunday, 1st Day of the Week, 186th Day of Year, 1st Quarter of the Financial Year, Winter (in the southern hemisphere), etc.
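The "July 4th, 1976" note can be reproduced with the standard library. The sketch assumes a week that starts on Sunday, a financial year that runs from July to June (as in Australia), and southern-hemisphere seasons by calendar month; these conventions match the values listed on the slide but are stated here as assumptions. The dictionary keys are illustrative names.

```python
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]
# Indexed by (month % 12) // 3: Dec-Feb, Mar-May, Jun-Aug, Sep-Nov.
SEASONS_SOUTHERN = ["Summer", "Autumn", "Winter", "Spring"]


def date_metadata(d):
    """Derive 'data within data' fields from a single calendar date."""
    return {
        "month": d.month,
        "day": d.day,
        "year": d.year,
        "weekday": WEEKDAYS[d.weekday()],  # date.weekday(): Monday is 0
        "day_of_week_number": (d.weekday() + 1) % 7 + 1,  # Sunday counted as day 1
        "day_of_year": d.timetuple().tm_yday,
        "financial_quarter": ((d.month - 7) % 12) // 3 + 1,  # July starts quarter 1
        "season_southern": SEASONS_SOUTHERN[(d.month % 12) // 3],
    }


meta = date_metadata(date(1976, 7, 4))
```

Each derived field can then be used as an ordinary variable, for example to look for seasonal sales trends.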

Lecture 3.30 Working with Metadata - 2
- Metadata can also be derived from
  - ID numbers
  - passport numbers
  - drivers’ licence numbers
  - post codes
  - etc.
- data can be modeled to make use of these
- Example: Metadata derived from addresses and names
  - identify the general make up of a shop’s clients
    - e.g. correlate addresses with map data to determine the distances customers travel to come to the shop

Lecture 3.31 References
- Dorian Pyle, “Data Preparation for Data Mining”, Morgan Kaufmann Publishers, 1999.