Published by Charlene Bates. Modified over 9 years ago.
1
Intelligent Data Analysis and Probability Inference
Data Mining: Intelligent Data Analysis for Knowledge Discovery
Yike Guo
Dept. of Computing, Imperial College
2
Course Overview
Goal
– Basic Concepts of Data Mining
– Data Mining Techniques
– Data Mining Applications
– Future Research Trends on Data Mining
Reference Books
– Advances in Knowledge Discovery and Data Mining, U. M. Fayyad and G. Piatetsky-Shapiro, AAAI/MIT Press, 1996
– Predictive Data Mining: A Practical Guide, Sholom M. Weiss and Nitin Indurkhya, Morgan Kaufmann Publishers, Inc., 1997
– Intelligent Data Analysis, Springer, 1999
– Post-genome Informatics, Minoru Kanehisa, Oxford University Press, 2000
3
What does the data say?

Day  Outlook   Temperature  Humidity  Wind    Play Tennis
1    Sunny     Hot          High      Weak    No
2    Sunny     Hot          High      Strong  No
3    Overcast  Hot          High      Weak    Yes
4    Rain      Mild         High      Weak    Yes
5    Rain      Cool         Normal    Weak    Yes
6    Rain      Cool         Normal    Strong  No
7    Overcast  Cool         Normal    Strong  Yes
8    Sunny     Mild         High      Weak    No
9    Sunny     Cool         Normal    Weak    Yes
10   Rain      Mild         Normal    Weak    Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Weak    Yes
14   Rain      Mild         High      Strong  No
4
Turning Data into Knowledge
5
What does the data say?
7
Why Data Mining
Limitation of traditional database querying:
– Most queries of interest to data owners are difficult to state in a query language
  "find me all records indicating fraud" => "tell me the characteristics of fraud" (summarisation)
  "find me who is likely to buy product X" (classification problem)
  "find all records that are similar to records in table X" (clustering problem)
– Supporting analysis and decision making with traditional (SQL) queries becomes infeasible (query formulation problem).
8
Relational Database Revisited
Terabyte databases, consisting of billions of records, are becoming common
The relational data model is the de facto standard
A relational database: a set of relations
A relation: a set of homogeneous tuples
Relations are created, updated and queried using SQL
Query = keyword-based search
    SELECT telephone_number
    FROM telephone_book
    WHERE last_name = 'Smith'
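The keyword-style query on this slide can be tried out with Python's built-in sqlite3 module. This is only a minimal sketch: the table and column names come from the slide, while the column types and the sample rows are assumptions made up for illustration.

```python
import sqlite3

# In-memory database; the rows below are hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE telephone_book (last_name TEXT, telephone_number TEXT)"
)
conn.executemany(
    "INSERT INTO telephone_book VALUES (?, ?)",
    [("Smith", "555-0100"), ("Jones", "555-0101"), ("Smith", "555-0102")],
)

# The slide's query, with the literal passed as a bound parameter.
rows = conn.execute(
    "SELECT telephone_number FROM telephone_book "
    "WHERE last_name = ? ORDER BY telephone_number",
    ("Smith",),
).fetchall()
smith_numbers = [row[0] for row in rows]
print(smith_numbers)
conn.close()
```

The query only retrieves records that match a stated key; it says nothing about what the matching records have in common, which is the gap the later slides address.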
9
SQL: Relational Querying Language
Provides a well-defined set of operations: scan, join, insert, delete, sort, aggregate, union, difference
Scan -- applies a predicate P to relation R
    For each tuple tr from R
        if P(tr) is true, tr is inserted in the output stream
Join -- composes two relations R and S
    For each tuple tr from R
        For each tuple ts from S
            if the join attribute of tr equals the join attribute of ts
                form an output tuple by concatenating tr and ts
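The scan and join loops above translate almost line for line into Python. The sketch below assumes relations are lists of tuples and that attributes are addressed by position; the function names and the paper/author sample data are hypothetical, chosen only to illustrate the two operations.

```python
def scan(relation, predicate):
    """Return every tuple of `relation` for which `predicate` is true."""
    output = []
    for tr in relation:
        if predicate(tr):
            output.append(tr)
    return output


def nested_loop_join(r, s, r_attr, s_attr):
    """Join r and s on r[r_attr] == s[s_attr] by concatenating matching tuples."""
    output = []
    for tr in r:
        for ts in s:
            if tr[r_attr] == ts[s_attr]:
                output.append(tr + ts)
    return output


# Hypothetical sample relations: (paper_id, journal) and (paper_id, author).
papers = [(1, "Journal A"), (2, "Journal B")]
authors = [(1, "Author 1-1"), (1, "Author 1-2"), (2, "Author 2-1")]

journal_a = scan(papers, lambda t: t[1] == "Journal A")
joined = nested_loop_join(papers, authors, 0, 0)
print(journal_a)
print(joined)
```

The nested loop compares every pair of tuples, so its cost grows with the product of the two relation sizes; real database engines usually choose cheaper join strategies when they can.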
10
Relational database. A table (relation) is a set, and the three basic table operations shown here are extensions of the standard set operations.
[Figure: a papers table (MUID, Journal, Volume, Pages, Year) with rows Paper 1 to Paper 4, and an authors table (MUID, Author) with rows Author 1-1 to Author 3-1. Operations shown: SELECT, PROJECT and JOIN; the JOIN result has columns MUID, Journal, Volume, Pages, Year, Author.]
11
The Query Formulation Problem
It is not solvable via query optimisation
It has not received much attention in the database field or in traditional statistical approaches
These problems are inductive in nature: learning from data rather than searching the data
The natural solution is a train-by-example approach that constructs inductive models as the answers
Consider the query: what kinds of weather conditions are suitable for playing tennis?
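One way to see what "learning from data" means for the tennis query is to rank the weather attributes by how well each one separates the Yes days from the No days. The sketch below uses information gain, the splitting criterion of ID3-style decision tree learners. The slides do not name a particular algorithm at this point, so this choice is an assumption; the data is the 14-day table from the earlier slide.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["Outlook", "Temperature", "Humidity", "Wind"]
# (Outlook, Temperature, Humidity, Wind, Play Tennis) for days 1 to 14.
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, column):
    """Entropy reduction obtained by splitting `rows` on attribute `column`."""
    labels = [row[-1] for row in rows]
    remainder = 0.0
    for value in {row[column] for row in rows}:
        subset = [row[-1] for row in rows if row[column] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
best = max(gains, key=gains.get)
for name, gain in sorted(gains.items(), key=lambda item: -item[1]):
    print(f"{name}: {gain:.3f}")
```

On this table Outlook has the highest gain, partly because every Overcast day is a Yes. A pattern of that kind cannot be retrieved by a SELECT statement; it has to be induced from the examples.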
12
Why Data Mining Now
Data Explosion
– Business data: organisations such as supermarket chains, credit card companies, investment banks, government agencies, etc. routinely generate daily volumes of 100 MB of data
– Scientific data: scientific and remote sensing instruments collect data at rates of gigabytes per day, far beyond human analysis abilities
Data Wasting
– Only a small portion (5% - 10%) of the collected data is ever analysed
– Data that may never be analysed continues to be collected, at great expense
We are drowning in data, but starving for knowledge!
13
What is Data Mining
Data Mining: a non-trivial intelligent data analysis process for identifying valid, useful and understandable patterns from databases.
14
Data: a set of facts F (records in a database)
Pattern: an expression E in a language L describing the data in a subset FE of F, where E is simpler than the enumeration of all the facts of FE. FE is also called a class, and E is also called a model or knowledge.
Data Mining Process: data mining is a multi-step process involving multiple choices, iteration and evaluation. It is non-trivial since there is no closed-form solution; it always involves intensive search.
Validity: E is true (with high probability) for F
Useful: patterns are not trivial inductive properties of the data
Understandable: patterns should be understandable by data owners to aid in understanding the data/domain
15
Data Mining and Decision Support
Data Warehousing: create/select target database
Sampling: choose data for building models
Data Cleaning: supply missing values; eliminate noisy data
Data Reduction and Projection: derive useful features; dimensionality reduction
Data Mining: choose data mining tasks; choose data mining methods to extract patterns/knowledge
Model Test and Evaluation: test the accuracy of the model; consistency check; model refinement
Decision Support
Machine Learning Technologies
16
Data Warehousing
"A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." --- W. H. Inmon
A data warehouse is
– A decision support database that is maintained separately from the organization's operational databases.
– It integrates data from multiple heterogeneous sources to support the continuing need for structured and/or ad-hoc queries, analytical reporting, and decision support.
17
Modeling Data Warehouses
Modeling data warehouses: dimensions & measurements
– Star schema: a single object (fact table) in the middle, connected radially to a number of objects (dimension tables).
– Snowflake schema: a refinement of the star schema in which the dimensional hierarchy is represented explicitly by normalizing the dimension tables.
– Fact constellations: multiple fact tables share dimension tables.
Storage of selected summary tables:
– Independent summary tables storing pre-aggregated data, e.g., total sales by product by year.
– Encoding aggregated tuples in the same fact table and the same dimension tables.
18
Example of Star Schema
Sales Fact Table
– Keys: Time_Key, Product_Key, Store_Key, Location_Key
– Measures: unit_sales, dollar_sales, Yen_sales
Time Dimension Table: many time attributes
Product Dimension Table: many product attributes
Store Dimension Table: many store attributes
Location Dimension Table: many location attributes
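A star schema of this shape can be sketched with Python's built-in sqlite3 module. The fact-table key and measure names follow the slide; the dimension attributes (year, product_name, store_name), the omission of the location dimension and of Yen_sales, and all sample rows are assumptions made to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE time_dim    (Time_Key INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE product_dim (Product_Key INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE store_dim   (Store_Key INTEGER PRIMARY KEY, store_name TEXT);
    -- The fact table sits in the middle and references every dimension.
    CREATE TABLE sales_fact (
        Time_Key    INTEGER REFERENCES time_dim(Time_Key),
        Product_Key INTEGER REFERENCES product_dim(Product_Key),
        Store_Key   INTEGER REFERENCES store_dim(Store_Key),
        unit_sales   INTEGER,
        dollar_sales REAL
    );
    """
)
# Hypothetical sample rows.
conn.executemany("INSERT INTO time_dim VALUES (?, ?)", [(1, 1999), (2, 2000)])
conn.executemany(
    "INSERT INTO product_dim VALUES (?, ?)", [(1, "Widget"), (2, "Gadget")]
)
conn.executemany("INSERT INTO store_dim VALUES (?, ?)", [(1, "London")])
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)",
    [(1, 1, 1, 10, 100.0), (1, 1, 1, 5, 50.0), (2, 1, 1, 2, 20.0), (2, 2, 1, 1, 30.0)],
)

# "Total sales by product by year": join the fact table to two dimensions.
sales_by_product_year = conn.execute(
    """
    SELECT p.product_name, t.year, SUM(f.dollar_sales)
    FROM sales_fact AS f
    JOIN product_dim AS p ON p.Product_Key = f.Product_Key
    JOIN time_dim    AS t ON t.Time_Key = f.Time_Key
    GROUP BY p.product_name, t.year
    ORDER BY p.product_name, t.year
    """
).fetchall()
print(sales_by_product_year)
conn.close()
```

Storing the result of such a query as its own table would give one of the independent summary tables mentioned on the previous slide.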
19
A Star-Net Query Model
[Figure: star-net diagram; each dimension is a line radiating from the centre, with its levels marked along it]
Shipping Method: AIR-EXPRESS, TRUCK
Customer Orders: ORDER, CONTRACTS
Customer
Product: PRODUCT ITEM, PRODUCT GROUP, PRODUCT LINE
Organization: SALES PERSON, DISTRICT, DIVISION
Promotion
Geography: DISTRICT, REGION, COUNTRY
Time: DAILY, QTRLY, ANNUALLY
20
OLAP: On-Line Analytical Processing
A multidimensional, LOGICAL view of the data.
Interactive analysis of the data: drill, pivot, slice and dice, filter.
Summarization and aggregations at every dimension intersection.
Retrieval and display of data in 2-D or 3-D crosstabs, charts, and graphs, with easy pivoting of the axes.
Analytical modeling: deriving ratios, variance, etc., involving measurements or numerical data across many dimensions.
Forecasting, trend analysis, and statistical analysis.
Requirement: quick response to OLAP queries.
21
OLAP Architecture
Logical architecture:
– OLAP view: multidimensional and logical presentation of the data in the data warehouse/mart to the business user.
– Data store technology: the technology options for how and where the data is stored.
Three service components:
– data store services
– OLAP services
– user presentation services
Two data store architectures:
– Multidimensional data store: Multidimensional OLAP (MOLAP).
– Relational data store: Relational OLAP (ROLAP).
22
Construction of Data Cubes
[Figure: a three-dimensional data cube with dimensions Amount (0-20K, 20-40K, 40-60K, 60K-, sum), Province (B.C., Prairies, Ontario, sum) and Discipline (Comp_Method, Database, …, sum); one highlighted cell is labelled "All Amount, Comp_Method, B.C."]
Each dimension contains a hierarchy of values for one attribute
A cube cell stores aggregate values, e.g., count, sum, max, etc.
A "sum" cell stores dimension summation values.
Sparse-cube technology and MOLAP/ROLAP integration.
"Chunk"-based multi-way aggregation and single-pass computation.
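The role of the "sum" cells can be illustrated with a small, deliberately naive cube builder in plain Python. This is not the chunk-based multi-way aggregation named on the slide: it simply adds each fact to every cell it contributes to. The ALL marker, the function name and the counts are hypothetical; the dimension values are taken from the slide's figure.

```python
from collections import defaultdict
from itertools import product

ALL = "ALL"  # plays the role of the "sum" position along a dimension


def build_cube(facts):
    """Aggregate (dim_1, ..., dim_n, measure) facts into a full cube.

    Every fact contributes to 2**n cells: along each dimension it is
    counted once under its own value and once under ALL.
    """
    cube = defaultdict(int)
    for *dims, measure in facts:
        for cell in product(*[(d, ALL) for d in dims]):
            cube[cell] += measure
    return dict(cube)


# Hypothetical facts: (Amount band, Province, Discipline, count).
facts = [
    ("0-20K", "B.C.", "Comp_Method", 3),
    ("20-40K", "B.C.", "Comp_Method", 2),
    ("0-20K", "Ontario", "Comp_Method", 4),
    ("0-20K", "B.C.", "Database", 1),
]
cube = build_cube(facts)

# The figure's highlighted cell: all amounts, Comp_Method, B.C.
print(cube[(ALL, "B.C.", "Comp_Method")])

# A slice fixes one dimension, here Province = B.C.
bc_slice = {cell: value for cell, value in cube.items() if cell[1] == "B.C."}
```

Because each fact touches 2**n cells, this approach becomes expensive as dimensions are added, which is one motivation for the sparse-cube and chunk-based techniques the slide lists.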
23
Decision Support with Data Warehouse
Ad Hoc Queries:
Q: How many customers do we have in London?
A: 32776
24
Report and Spreadsheet
25
OLAP:
Q: What are the sales figures for Y in the different regions?
26
Statistics:
Q: Is there a relation between age and buying behaviour?
A: Older clients buy more
27
Data Mining:
Q: What factors influence buying behaviour?
A1: Young men in sports cars buy 3 times as much audio equipment (clustering/regression)
A2: Older women with dark hair more often buy rinse (classification)
A3: Buyers of cars are also the buyers of houses (association)
[Figure: decision tree splitting on Age (Young, Middle, Old), Hair color and Wage, with Y/N leaves]
28
Example Data Mining Applications
Commercial:
– Fraud detection: identify fraudulent transactions
– Loan approval: establish the creditworthiness of a customer requesting a loan
– Investment analysis: predict a portfolio's return on investment
– Marketing and sales data analysis: identify potential customers; establish the effectiveness of a sales campaign
Medical:
– Drug effect analysis: learn drug effects from patient records
– Disease causality analysis
Political policy:
– Election policy: people's voting patterns
– Social policy: tax/benefit policy
Manufacturing:
– Manufacturing process analysis: identify the causes of manufacturing problems
– Experiment result analysis: summarise experiment results and create predictive models
29
Related Fields:
Machine learning: inductive reasoning
Statistics: sampling, statistical inference, error estimation
Pattern recognition: neural networks, clustering
Knowledge acquisition, statistical expert systems
Data visualisation
Databases: OLAP, parallel DBMS, deductive databases
Data warehousing: collection, cleaning of transactional data for on-line retrieval