

Data Mining: Current Status and Directions

What is Data Mining?
- Data mining (also called knowledge discovery in databases): extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) information (knowledge) or patterns from data in large databases or other information repositories
- The goal is to understand and use data, making the data itself something of value and strategic importance

Data is everywhere!
- Relational databases: a commodity of every enterprise
- POS (point of sale): transactional DBs are often terabytes in size
- Legacy databases
- Spatial databases (GIS), remote sensing databases (EOS), and scientific/engineering databases
- Time-series data (e.g., stock trading) and temporal data
- Text (documents, e-mails) and multimedia databases
- WWW: a huge, hyper-linked, dynamic, global information system

The Potential for Data Mining Is Everywhere, Too!
- Knowledge to be mined: characterization, discrimination, association, classification, clustering, trend, deviation and outlier analysis, etc.
- Techniques utilized: database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural networks, etc.
- Applications adapted: retail, telecommunication, banking, fraud analysis, DNA mining, stock market analysis, Web mining, Weblog analysis, etc.

Data Mining: A Confluence of Multiple Disciplines
- Disciplines feeding into data mining: database technology, statistics, machine learning (AI), visualization, information science, other disciplines

Multi-Dimensional Data Analysis
- Data warehousing: integration from heterogeneous or semi-structured databases
- Multi-dimensional modeling of data: star & snowflake schemas (in relational DBMS)
- Efficient and scalable computation of data cubes or iceberg cubes (in MDDB)
- OLAP (on-line analytical processing): drilling, dicing, slicing, etc.
- Discovery-driven (data-driven) exploration of data cubes

Data Cubes

Data cube dimensional hierarchy

Creating Multi-dimensional data warehouses
- Start with standard normalized relational database tables.

Data warehouse ‘STAR’ Schema
- To reduce the number of joins that must be performed, data is reformatted into ‘fact’ tables.
- Fact tables typically consist of many foreign keys.

Data Warehouse ‘Snowflake’ Schema
- Very similar to the star schema. Can you tell what this schema lets us see that the star schema did not?

Making optimal use of storage space
- Many cuboids can be computed from another cuboid rather than from the entire data set
- Example: consider analyzing sales along the dimensions Route, Source, and Time. The number of rows in each view is given in millions.
- Views: (Route, Source, Time); (Route, Time); (Route, Source); (Source, Time); (Route); (Source); (Time); (None)
- Row counts: 6 M, 0.8 M, 6 M, 0.2 M, 0.01 M, 0.1 M
- Materializing all views would require roughly 19.1 million rows
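The idea that one cuboid can be derived from another, without going back to the full data set, can be sketched as follows. This is a minimal illustration, not the method behind the slide: the sales figures, the key values, and the helper name `roll_up` are all made up for the example.

```python
from collections import defaultdict

# Hypothetical finest-level cuboid: (route, source, time) -> total sales.
# The values are illustrative only and do not come from the slide.
base_cuboid = {
    ("R1", "S1", "Q1"): 100,
    ("R1", "S1", "Q2"): 150,
    ("R1", "S2", "Q1"): 80,
    ("R2", "S1", "Q1"): 60,
    ("R2", "S2", "Q2"): 40,
}
DIMENSIONS = ("route", "source", "time")


def roll_up(cuboid, dims, keep):
    """Aggregate a cuboid keyed by `dims` down to the dimensions in `keep`."""
    positions = [dims.index(d) for d in keep]
    result = defaultdict(int)
    for key, value in cuboid.items():
        # Drop the dimensions that are not kept and sum the measure.
        result[tuple(key[p] for p in positions)] += value
    return dict(result)


# (Route, Source) is computed from the base cuboid ...
route_source = roll_up(base_cuboid, DIMENSIONS, ("route", "source"))
# ... and (Route) is computed from the much smaller (Route, Source) cuboid.
route_only = roll_up(route_source, ("route", "source"), ("route",))
# Computing (Route) directly from the base cuboid gives the same answer.
route_direct = roll_up(base_cuboid, DIMENSIONS, ("route",))
# The (None) view is the grand total.
grand_total = roll_up(route_only, ("route",), ())
```

Because the measure here is a sum, a coarser view computed from an intermediate cuboid matches the one computed from the base data, which is what makes selective materialization possible.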

Dependent Cuboids
- Cuboids: (Part, Supplier, Customer); (Part, Customer); (Part, Supplier); (Supplier, Customer)
- Finer partitions: Part (Color), Customer (State); Part (Size), Customer (State); Part (Color), Customer (Country); Part (Size), Customer (Country); Part (Color), Customer (Individual); Part (Size), Customer (Individual)
- Assume that ‘Part’ can be further partitioned into ‘Size’ and ‘Color’, and that ‘Customer’ can be partitioned into ‘Individual’, ‘State’, and ‘Country’
- Part, Customer: 6 M, 0.8 M
- Selective materialization in this case can reduce the number of stored rows by 12 million

Association and Frequent Pattern Analysis
- The objective is to find patterns in the tendency of items to be found together.
- A typical 2-item association rule output will generally look something like this: Computer → Software (7%, 72%)
- This tells you that 7% of your sales transactions involved computers AND software (the support level), and that 72% of all computer sales also involved the sale of software (the confidence level).

Association and Frequent Pattern Analysis
- Associations can also be found among item sets of 3, 4, or more items, for example: (Computers, Software) → Mouse Pad (8%, 65%)
- This tells you that 8% of transactions involved computers, software, and mouse pads, and that 65% of transactions involving computers and software also involved the purchase of a mouse pad.
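The two percentages attached to a rule can be computed directly from a list of transactions. The sketch below assumes each transaction is a set of item names; the sample baskets and the function names `support` and `confidence` are illustrative, not taken from the slides.

```python
# Hypothetical sample of sales transactions, each a set of purchased items.
transactions = [
    {"computer", "software", "mouse pad"},
    {"computer", "software"},
    {"computer"},
    {"software"},
    {"printer", "paper"},
]


def support(itemset, transactions):
    """Fraction of all transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Fraction of transactions containing `antecedent` that also contain `consequent`."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule says nothing
    return support(set(antecedent) | set(consequent), transactions) / base


# 2-item rule: computer -> software
pair_support = support({"computer", "software"}, transactions)
pair_confidence = confidence({"computer"}, {"software"}, transactions)

# 3-item rule: (computer, software) -> mouse pad
triple_support = support({"computer", "software", "mouse pad"}, transactions)
triple_confidence = confidence({"computer", "software"}, {"mouse pad"}, transactions)
```

The same two functions cover rules of any size, since the antecedent and consequent are just sets.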

Association and Frequent Pattern Analysis
- The problem with unguided association analysis is that the number of associations can be enormous.
- Consider a store like L.L. Bean trying to identify meaningful associations. The output could number in the millions.
- To “filter” the output, users will frequently set minimum thresholds for confidence and support.
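Threshold filtering can be illustrated with a brute-force enumeration of 2-item rules. This is only a sketch under simple assumptions: the baskets and the function name `mine_pair_rules` are invented for the example, and production tools use far more efficient algorithms than checking every pair.

```python
from itertools import permutations

# Hypothetical baskets for an outdoor retailer; not real data.
baskets = [
    {"boots", "socks"},
    {"boots", "socks", "tent"},
    {"boots", "socks"},
    {"tent"},
    {"socks"},
]


def mine_pair_rules(transactions, min_support, min_confidence):
    """Return (antecedent, consequent, support, confidence) for every
    2-item rule that meets both thresholds."""
    n = len(transactions)
    items = sorted(set().union(*transactions))

    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    rules = []
    for a, b in permutations(items, 2):
        both = count({a, b})
        sup = both / n
        conf = both / count({a})  # count({a}) >= 1 because a occurs somewhere
        if sup >= min_support and conf >= min_confidence:
            rules.append((a, b, sup, conf))
    return rules


# With no thresholds, every ordered pair of items yields a rule.
all_rules = mine_pair_rules(baskets, 0.0, 0.0)
# With thresholds, only the strong rule boots -> socks survives.
strong_rules = mine_pair_rules(baskets, min_support=0.5, min_confidence=0.8)
```

Even with three distinct items the unfiltered output has six rules, and the count grows quickly with the number of items, which is why thresholds matter.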

Visualization of association rules in MineSet 3.0

Clustering and Outlier Analysis
- The attribute of interest is plotted on a graph whose axes represent the dimensions of interest.
- Cluster analysis is frequently two-dimensional, but does not have to be.
- The objective of the data mining algorithm is to find cluster centers that maximize the distance between cluster centers while minimizing the distance between the points in a cluster and the center of that cluster.
- The center of the cluster typically defines the cluster (e.g., males between 30 and 35 years old with incomes between 50K and 75K), and axes are usually parametric rather than continuous.
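The slide does not name a specific algorithm. One common approach in this spirit is k-means, which repeatedly assigns points to the nearest center and moves each center to the mean of its points. It directly minimizes only the within-cluster distances. The sketch below assumes two dimensions (age, income in thousands), made-up customers, and hand-picked starting centers.

```python
import math

# Hypothetical customers as (age, income in thousands); illustrative only.
customers = [(30, 50), (32, 60), (35, 75), (55, 120), (58, 130), (60, 125)]


def k_means(points, centers, iterations=10):
    """Plain k-means: assign each point to its nearest center, then move
    every center to the mean of the points assigned to it."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers, clusters


# Starting centers are an assumption; here the first and last customers.
centers, clusters = k_means(customers, [(30, 50), (60, 125)])
```

In practice the dimensions would usually be rescaled first, since income in thousands would otherwise dominate age in the distance calculation.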

Clustering Analysis
- Can include user-specified constraints (e.g., no cluster has fewer than 1,000 customers)

Sequential Patterns and Time-Series Analysis
- Trend analysis
  - Trend movement vs. cyclic variations, seasonal variations, and random fluctuations
- Similarity search in time-series databases
  - Handling gaps, scaling, etc.
  - Indexing methods and query languages for time-series
- Sequential pattern mining
  - Various kinds of sequences, various methods
- Periodicity analysis
  - Full periodicity, partial periodicity, cyclic association rules
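One simple way to separate trend movement from seasonal variation is a moving average whose window matches the length of the season. The slide does not prescribe this method. The quarterly series below is invented, and `moving_average` is an illustrative helper name.

```python
def moving_average(series, window):
    """Simple moving average; returns len(series) - window + 1 values."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]


# Hypothetical quarterly sales: a repeating 4-quarter pattern on a rising trend.
quarterly_sales = [10, 14, 8, 12, 14, 18, 12, 16, 18, 22, 16, 20]

# A window of 4 quarters averages out the seasonal swings and exposes the trend.
trend = moving_average(quarterly_sales, window=4)
```

With the window equal to the seasonal period, each average covers exactly one full season, so the seasonal component cancels and only the steady rise remains.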

Data Mining Industry and Applications
- Industry has grown rapidly over the past few years
- From research prototypes to data mining products, languages, and standards
  - IBM Intelligent Miner, SAS Enterprise Miner, SGI MineSet, Clementine, MS/SQLServer 2000, etc.
  - A few data mining languages and standards (esp. MS OLEDB for Data Mining)
- Application achievements in many domains
  - Market analysis, trend analysis, fraud detection, outlier analysis, Web mining, etc.

The data mining industry
- Data mining is growing rapidly
  - R & D has seen huge increases
  - Applications have been broadened substantially
- But not as rapidly as some may have hoped. Why not?
  - Value is easy to objectively measure
  - It is difficult to sell on hype alone, although they try!
  - Not off-the-shelf in nature
  - Needs training, understanding, and customization
  - Definite learning curve associated with effective use
  - Benefit of effective use not seen immediately

Trends in data mining
- Web mining (and incorporating data from outside the organization into the analysis of internal data)
- Towards integrated data mining environments and tools
- “Vertical” (or application-specific) data mining
- Invisible data mining
- Towards intelligent, efficient, and scalable data mining methods

Web Mining: A Rapidly Expanding Area in Data Mining
- Mine what the Web search engine finds
- Automatic classification of Web documents
- Discovery of authoritative Web pages, Web structures, and Web communities
- Meta-Web warehousing: Web yellow page service
- Web usage mining

Mining What the Web Search Engine Finds
- Current Web search engines: keyword-based, return too many answers, often of low quality, still miss a lot, not customized, etc.
- Data mining will help:
  - coverage: “enlarge and then shrink,” using synonyms and conceptual hierarchies
  - better search primitives: user preferences/hints
  - linkage analysis: authoritative pages and clusters
  - customization: home page + Weblog + user profiles
  - identification of “hub” pages
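The slide mentions authoritative pages and hub pages without naming an algorithm. A well-known technique built on exactly these two notions is the HITS-style iteration, in which a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of pages it links to. The tiny link graph and the function name `hits` below are assumptions made for illustration.

```python
import math

# Hypothetical link graph: page -> pages it links to.
links = {
    "portal": ["a", "b", "c"],
    "blog": ["a", "b"],
    "a": [],
    "b": [],
    "c": ["a"],
}


def hits(links, iterations=20):
    """HITS-style iteration returning (hub scores, authority scores)."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages that link to this page.
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ()))
                for p in pages}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub: sum of authority scores of the pages this page links to.
        hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth


hub_scores, authority_scores = hits(links)
best_hub = max(hub_scores, key=hub_scores.get)
best_authority = max(authority_scores, key=authority_scores.get)
```

In this graph the page with the most outgoing links to well-cited pages comes out as the top hub, and the most linked-to page comes out as the top authority.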

A Layered Meta-Web Architecture
- Layers: Layer 0, Layer 1, ..., Layer n
- Labels: Generalized Descriptions, More Generalized Descriptions

Importance of Constructing a Multi-Layer Meta-Web
- Benefits of a multi-layer meta-Web:
  - Multi-dimensional Web info summary analysis
  - Approximate and intelligent query answering
  - Web high-level query answering (WebSQL, WebML)
  - Web content and structure mining
  - Observing the dynamics/evolution of the Web
- Is it realistic to construct such a meta-Web?
  - It is beneficial even if it is only partially constructed
  - The benefit may justify the cost of tool development, standardization, and partial restructuring

Web Usage (Click-Stream) Mining
- Web logs provide rich information about Web dynamics
- Multidimensional Web-log analysis: disclose potential customers, users, markets, etc.
- Plan mining (mining general Web accessing regularities): Web linkage adjustment, performance improvements
- Trend analysis: dynamics of the Web: what has been changing?
- Customized to individual users
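Multidimensional Web-log analysis can be pictured as counting hits along any chosen combination of log fields. The record layout (`user`, `page`, `hour`) and the helper `count_by` below are hypothetical; real Web logs would need parsing first.

```python
from collections import Counter

# Hypothetical, already-parsed click-stream records.
log_entries = [
    {"user": "u1", "page": "/home", "hour": 9},
    {"user": "u1", "page": "/products", "hour": 9},
    {"user": "u2", "page": "/home", "hour": 10},
    {"user": "u3", "page": "/home", "hour": 10},
    {"user": "u2", "page": "/checkout", "hour": 10},
]


def count_by(entries, *fields):
    """Count log entries grouped by the given fields (the chosen dimensions)."""
    return Counter(tuple(entry[f] for f in fields) for entry in entries)


hits_per_page = count_by(log_entries, "page")
hits_per_page_hour = count_by(log_entries, "page", "hour")
hits_per_user = count_by(log_entries, "user")
```

Choosing different field combinations corresponds to looking at different views of the same log, for example by page, by page and hour, or by user.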

Intelligent Tools for Data Mining
- Integration of users and mining algorithms paves the way to intelligent mining
- Smart interface brings intelligence
  - Easy to use, understand, and manipulate
  - One picture may be worth 1,000 words
  - Visual and audio data mining
- Towards self-tuning, self-managing, self-triggering data mining