Data mining: some basic ideas. Francisco Moreno. Excerpts from Fundamentals of DB Systems (Elmasri & Navathe) and other sources.


Data mining For many years, organizations have generated large amounts of data in the form of files and databases. These data can be processed using database technology with languages such as SQL. SQL drawbacks: the user is assumed to know the DB schema, and some queries can become very complex, for example those oriented toward discovering information…

Data mining Data mining refers to the discovery of information, in terms of patterns or rules, from vast amounts of data. To be useful, data mining must be carried out efficiently on large files and databases. Data mining uses techniques from areas such as machine learning, statistics, neural networks, and genetic algorithms, among others.

Data mining We will highlight the nature of the information that is discovered, the types of problems faced in databases, and potential applications. Data mining is related to a broader area called knowledge discovery (see below).

Data mining Remember: the goal of a data warehouse (DW) is to support decision making with data. Data mining can be used in conjunction with a DW to help with decision-making processes. It is possible to apply data mining to operational databases (or files) with individual transactions. However, to make data mining more efficient, a DW could be used, where we could take advantage of the aggregated collection of data.

Data mining Data mining helps in extracting meaningful patterns that cannot necessarily be found by merely querying or processing data in the DW. Data mining requirements should be considered early, during the design of a DW. Indeed, for very large databases, successful use of data mining will depend first on the construction of the DW.

Data mining Data mining is a part of the knowledge discovery process. Knowledge discovery in databases (KDD) typically encompasses more than data mining.

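KDD KDD comprises six phases: –Data cleansing –Enrichment –Data transformation and encoding –Data selection –Data mining –Reporting and display of the discovered information. The first three phases (data cleansing, enrichment, and data transformation and encoding) are grouped under data integration.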

KDD [Figure: KDD process diagram. Labels: Databases; Data integration (data cleansing, enrichment, data transformation, encoding); Data Warehouse; Selection; Data Mining; Pattern Evaluation; Knowledge.]

KDD: Data integration During data cleansing, invalid data can be fixed or removed: for example, fixing zip codes or eliminating records with wrong phone prefixes.
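As an illustration of data cleansing, the following Python sketch fixes zip codes and eliminates records with wrong phone prefixes. The record layout, the five-digit zip format, and the set VALID_PREFIXES are assumptions made for this example; they do not come from the source.

```python
# Hypothetical customer records; field names and values are illustrative only.
records = [
    {"name": "Ann", "zip": " 2139", "phone": "617-555-0101"},
    {"name": "Bob", "zip": "94105", "phone": "999-555-0102"},
    {"name": "Cid", "zip": "10001-", "phone": "212-555-0103"},
]

# Assumption: only these phone prefixes are considered valid.
VALID_PREFIXES = {"212", "617"}


def clean_zip(raw):
    """Keep only the digits of a zip code and left-pad it to five digits."""
    digits = "".join(ch for ch in raw if ch.isdigit())
    return digits.zfill(5)


def cleanse(rows):
    """Fix zip codes and drop rows whose phone prefix is not valid."""
    cleaned = []
    for row in rows:
        prefix = row["phone"].split("-")[0]
        if prefix not in VALID_PREFIXES:
            continue  # eliminate the record
        cleaned.append({**row, "zip": clean_zip(row["zip"])})
    return cleaned


clean_records = cleanse(records)
```

In this sketch Bob's record is dropped because of its phone prefix, and the two remaining zip codes are normalized. Real cleansing rules depend on the data source.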

KDD: Data integration Enrichment typically enhances the data with additional information from other sources. For example, given the customer names and phone numbers, an organization can get (perhaps buy) other data such as age, income, and credit card rating and then append them to each customer record.
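A minimal sketch of enrichment in Python, assuming customers are matched to an external source by the pair (name, phone). The field names and the external values are hypothetical.

```python
# Hypothetical in-house customer records.
customers = [
    {"name": "Ann", "phone": "617-555-0101"},
    {"name": "Cid", "phone": "212-555-0103"},
]

# Hypothetical external (perhaps purchased) data, keyed by (name, phone).
external = {
    ("Ann", "617-555-0101"): {"age": 34, "income": 52000, "credit_rating": "good"},
}


def enrich(rows, extra):
    """Append the external attributes, when available, to each customer record."""
    enriched = []
    for row in rows:
        key = (row["name"], row["phone"])
        enriched.append({**row, **extra.get(key, {})})
    return enriched


enriched_customers = enrich(customers, external)
```

Customers with no match in the external source keep their original fields only.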

KDD: Data integration Data transformation and encoding may be done to reduce the amount of data. For example, product codes may be grouped in terms of product categories. Zip codes may be aggregated into geographic regions, incomes may be divided into ranges, and so on.
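The three groupings mentioned above can be sketched in Python as simple lookups and range checks. The product codes, regions, and income thresholds below are invented for the example.

```python
# Hypothetical lookup tables; a real system would load these from reference data.
PRODUCT_CATEGORY = {"P100": "dairy", "P101": "dairy", "P200": "bakery"}
ZIP_REGION = {"02139": "Northeast", "10001": "Northeast", "94105": "West"}


def income_range(income):
    """Divide incomes into coarse ranges (thresholds are assumptions)."""
    if income < 30000:
        return "low"
    if income < 70000:
        return "medium"
    return "high"


def encode(row):
    """Replace detailed values with their category, region, and income range."""
    return {
        "category": PRODUCT_CATEGORY.get(row["product_code"], "other"),
        "region": ZIP_REGION.get(row["zip"], "unknown"),
        "income_range": income_range(row["income"]),
    }


encoded = encode({"product_code": "P101", "zip": "94105", "income": 52000})
```

Each encoded record carries fewer distinct values than the original, which is what reduces the amount of data to be mined.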

Data mining During data selection, data about specific products or categories of products, or from stores in a specific region, may be selected. After such preprocessing, data mining techniques are used to discover rules and patterns.

Data mining For example, the result of mining could be the discovery of: –Association rules: whenever a customer buys video equipment, he or she also buys another electronic gadget –Sequential patterns: a customer who buys a camera will usually buy photographic supplies within the next three months and an accessory item within six months. A customer who buys more than twice in lean periods* may be likely to buy at least once during the Christmas period. * Lean periods: periods of scarcity.

Data mining –Classification trees: customers may be classified by frequency of visits, by types of financing used, by amount of purchase, or by affinity for types of items; some revealing statistics may then be generated for such classes.

Data mining This information can then be used –to plan additional store locations based on demographics –to run store promotions –to combine products in advertisements –to plan seasonal marketing strategies

Goals of data mining and knowledge discovery The goals of data mining fall into the following classes: –Prediction –Identification –Classification –Optimization

Goals of data mining and knowledge discovery Prediction: data mining can show how certain attributes within the data will behave in the future. For example, buying transactions can be analyzed to predict what consumers will buy under certain discounts, how much sales volume a store would generate in a given period, and whether deleting a product line would yield more profits.

Goals of data mining and knowledge discovery Identification: to identify the existence of an item, an event, or an activity: intruders may be identified by the programs executed, files accessed, and CPU time per session; a gene can be identified by certain sequences of nucleotide symbols in the DNA sequence.

Goals of data mining and knowledge discovery Classification: data mining can partition the data so that different classes can be identified based on a combination of parameters. For example, customers in a supermarket can be classified into discount-seekers or shoppers in a rush.

Goals of data mining and knowledge discovery Optimization: to optimize the use of limited resources such as time, space, money, or materials and to maximize output variables such as sales under a given set of constraints. This goal strongly resembles the objective function in the operations research field (there is no sharp line separating data mining from this and other related disciplines).

Data mining Some types of knowledge discovered during data mining: –Association rules –Sequential patterns –Patterns within time series –Categorization and segmentation

Data mining Association rules*: correlate the presence of a set of items with another range of values for another set of variables. For example, when a female retail shopper buys a handbag, she is likely to buy shoes. * Later, we will focus on this type of knowledge.

Data mining Sequential patterns: a sequence of actions or events is sought. For example, if a patient underwent cardiac bypass surgery and developed high blood urea within a year of surgery, he or she is likely to suffer from kidney failure within the next year. Note that detecting sequential patterns is equivalent to detecting associations among events with certain temporal relationships.

Data mining Patterns within time series: similarities can be detected within positions of a time series. For example, stocks of a utility (service) company A and a financial company B show the same pattern during a year; two products show the same selling price pattern in summer but a different one in winter.

Data mining Categorization and segmentation: a given population of events or items can be partitioned into sets of “similar” elements: –a population of treatment data may be divided into groups based on similarity of side effects –a population may be categorized into groups from “most likely to buy” to “least likely to buy” –web accesses made by users may be analyzed in terms of keywords to reveal clusters of users (Web usage mining)

Association rules The database is regarded as a collection of transactions (for example, purchases), each involving a set of items. A common example is that of market-basket data. Consider the following example with four transactions:

Association rules
Transaction_id   Items_bought
1                milk, bread, juice
2                milk, juice
3                milk, eggs
4                bread, cookies, coffee
Note: Some important information is not considered, for example, the quantity of each item purchased in each transaction.

Association rules Another example (text mining, Web content mining): a text document data set, where each document is treated as a set of keywords:
Doc 1: {student, teach, school}
Doc 2: {student, school}
Doc 3: {teach, school, city, game}
Doc 4: {baseball, basketball}
Doc 5: {basketball, team, city, game}
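To make the "documents as transactions" view concrete, the sketch below stores the five documents as Python sets and counts how often each pair of keywords appears in the same document. The threshold of two documents is an arbitrary choice for this illustration.

```python
from collections import Counter
from itertools import combinations

# The five documents from the example, each treated as a set of keywords.
docs = [
    {"student", "teach", "school"},
    {"student", "school"},
    {"teach", "school", "city", "game"},
    {"baseball", "basketball"},
    {"basketball", "team", "city", "game"},
]

# Count, for every unordered pair of keywords, the documents containing both.
pair_counts = Counter()
for keywords in docs:
    for pair in combinations(sorted(keywords), 2):
        pair_counts[pair] += 1

# Assumption: a pair is "frequent" if it occurs in at least two documents.
frequent_pairs = sorted(p for p, c in pair_counts.items() if c >= 2)
```

With this data, three keyword pairs co-occur in two documents each: (city, game), (school, student), and (school, teach).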

Association rules An association rule is of the form LHS (left-hand side) $\Rightarrow$ RHS (right-hand side), written $X \Rightarrow Y$, where $X = \{x_1, x_2, \ldots, x_n\}$ and $Y = \{y_1, y_2, \ldots, y_m\}$ are sets of items, $x_i$ and $y_j$ being distinct items for all $i$ and $j$, and $X \cap Y = \emptyset$. This association states that if a customer buys X, he or she is also likely to buy Y.

Association rules Association rules should include both support (prevalence) and confidence (strength). The support for a rule LHS ⇒ RHS is the percentage of transactions that hold all the items in the set LHS ∪ RHS. If the support is low, there is no overwhelming evidence that the items in LHS ∪ RHS occur together.

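Association rules: Support examples Milk ⇒ Juice has 50% support (2 of the 4 transactions contain both milk and juice). Bread ⇒ Juice has 25% support (1 of the 4 transactions contains both bread and juice).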
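The support figures for the four example transactions can be checked with a short Python sketch. The function name support and the representation of transactions as sets are choices made for this illustration.

```python
# The four example transactions, each represented as a set of items.
transactions = [
    {"milk", "bread", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "cookies", "coffee"},
]


def support(lhs, rhs, txns):
    """Fraction of transactions that hold all items of LHS union RHS."""
    items = set(lhs) | set(rhs)
    hits = sum(1 for t in txns if items <= t)
    return hits / len(txns)


milk_juice = support({"milk"}, {"juice"}, transactions)    # 2 of 4 transactions
bread_juice = support({"bread"}, {"juice"}, transactions)  # 1 of 4 transactions
```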

Association rules To compute confidence, we consider all transactions that include the items in LHS. The confidence for LHS ⇒ RHS is the percentage of such transactions that also include RHS.

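Association rules: Confidence examples Milk ⇒ Juice has 66.7% confidence (2 of the 3 transactions with milk also contain juice). Bread ⇒ Juice has 50% confidence (1 of the 2 transactions with bread also contains juice).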
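Confidence can be checked in the same way. This sketch assumes the four example transactions and uses a hypothetical helper count that returns the number of transactions containing a given item set.

```python
# The four example transactions, each represented as a set of items.
transactions = [
    {"milk", "bread", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "cookies", "coffee"},
]


def count(items, txns):
    """Number of transactions that contain every item in `items`."""
    return sum(1 for t in txns if set(items) <= t)


def confidence(lhs, rhs, txns):
    """Among transactions containing LHS, the fraction that also contain RHS."""
    lhs_count = count(lhs, txns)
    if lhs_count == 0:
        raise ValueError("LHS does not occur in any transaction")
    return count(set(lhs) | set(rhs), txns) / lhs_count


milk_juice = confidence({"milk"}, {"juice"}, transactions)    # 2 of 3
bread_juice = confidence({"bread"}, {"juice"}, transactions)  # 1 of 2
```

Confidence is not symmetric: Juice ⇒ Milk has 100% confidence here, because every transaction with juice also contains milk.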

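Association rules If $n$ is the number of transactions and $S.\text{count}$ is the number of transactions that contain all the items of a set $S$, then: $\text{Support} = \dfrac{(X \cup Y).\text{count}}{n}$ and $\text{Confidence} = \dfrac{(X \cup Y).\text{count}}{X.\text{count}}$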
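As a sketch of how the two formulas combine, the Python code below applies them to every rule of the form {x} ⇒ {y} over the four example transactions. Restricting rules to single items on each side, and skipping pairs that never occur together, are simplifications made for this illustration; the source does not describe a rule-generation procedure at this point.

```python
from itertools import permutations

# The four example transactions, each represented as a set of items.
transactions = [
    {"milk", "bread", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "cookies", "coffee"},
]


def count(items, txns):
    """Number of transactions that contain every item in `items`."""
    return sum(1 for t in txns if set(items) <= t)


def rule_metrics(txns):
    """Map each single-item rule (x, y) to its (support, confidence)."""
    n = len(txns)
    all_items = sorted(set().union(*txns))
    rules = {}
    for x, y in permutations(all_items, 2):
        both = count({x, y}, txns)  # (X union Y).count
        if both == 0:
            continue  # the two items never occur together
        rules[(x, y)] = (both / n, both / count({x}, txns))
    return rules


rules = rule_metrics(transactions)
for (x, y), (sup, conf) in sorted(rules.items()):
    print(f"{x} => {y}: support {sup:.0%}, confidence {conf:.1%}")
```

A rule and its reverse always share the same support but can differ in confidence, as the milk/juice pair shows.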