Published by Gavin Pitts. Modified over 9 years ago.
2 Outline of the presentation
- Objectives, Prerequisite and Content
- Brief Introduction to Lectures
- Discussion and Conclusion
3 Objectives
This course provides:
- fundamental techniques of knowledge discovery and data mining (KDD)
- issues in KDD practical use and tools
- case studies of KDD applications
4 Prerequisite for the course
Nothing special, but the following are expected:
- experience of computer use
- basics of databases, statistics, and mathematics
- programming skills
5 Content of the course
- Overview of KDD
- Mining association rules
- Mining action rules
- Decision tree induction
- Distributed knowledge systems and distributed query answering
- Cluster analysis
6 Outline of the presentation
- Objectives, Prerequisite and Content
- Brief Introduction to Lectures
- Discussion and Conclusion
7 Brief introduction to lectures Overview of KDD
8 Lecture 1: Overview of KDD
1. What is KDD and why?
2. The KDD Process
3. KDD Applications
4. Data Mining Methods
5. Challenges for KDD
9 KDD: A Definition
KDD is the automatic extraction of non-obvious, hidden knowledge from large volumes of data.
- 10^6 to 10^12 bytes: we never see the whole data set, so we put it in the memory of computers
- Then run data mining algorithms
- What is the knowledge? How to represent and use it?
10 Data, Information, Knowledge
- Data: we often see data as a string of bits, or numbers and symbols, or “objects” which we collect daily.
- Information is data stripped of redundancy, and reduced to the minimum necessary to characterize the data.
- Knowledge is integrated information, including facts and their relations, which have been perceived, discovered, or learned as our “mental pictures”.
Knowledge can be considered data at a high level of abstraction and generalization.
11 From Data to Knowledge
...
10, M, 0, 10, 10, 0, 0, 0, SUBACUTE, 37, 2, 1, 0,15,-,-, 6000, 2, 0, abnormal, abnormal,-, 2852, 2148, 712, 97, 49, F,-,multiple,,2137, negative, n, n, ABSCESS,VIRUS
12, M, 0, 5, 5, 0, 0, 0, ACUTE, 38.5, 2, 1, 0,15, -,-, 10700,4,0,normal, abnormal, +, 1080, 680, 400, 71, 59, F,-,ABPC+CZX,, 70, negative, n, n, n, BACTERIA, BACTERIA
15, M, 0, 3, 2, 3, 0, 0, ACUTE, 39.3, 3, 1, 0,15, -, -, 6000, 0,0, normal, abnormal, +, 1124, 622, 502, 47, 63, F, -,FMOX+AMK,, 48, negative, n, n, n, BACTE(E), BACTERIA
16, M, 0, 32, 32, 0, 0, 0, SUBACUTE, 38, 2, 0, 0, 15, -, +, 12600, 4, 0,abnormal, abnormal, +, 41, 39, 2, 44, 57, F, -, ABPC+CZX, ?, ?,negative, ?, n, n, ABSCESS, VIRUS
...
Medical data by Dr. Tsumoto, Tokyo Med. & Dent. Univ., 38 attributes
Annotations on the data: numerical attribute, categorical attribute, missing values, class labels
IF cell_poly 15 THEN Prediction = VIRUS [87.5%] [confidence, predictive accuracy]
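A minimal sketch of how rows like these might be read into typed values, separating numerical attributes, categorical attributes, and missing values. It assumes that an empty field or "?" marks a missing value and that symbols such as "-", "+", and "n" are ordinary categorical values; the slide does not state this encoding. The function names and the shortened sample row are made up for illustration.

```python
from typing import List, Union

Value = Union[float, str, None]


def parse_field(raw: str) -> Value:
    """Convert one raw field to a float, a string, or None if missing."""
    text = raw.strip()
    # Assumption: an empty field or "?" marks a missing value.
    if text in ("", "?"):
        return None
    try:
        return float(text)  # numerical attribute
    except ValueError:
        return text  # categorical attribute, e.g. "M", "ACUTE", "-", "+"


def parse_record(line: str) -> List[Value]:
    """Split a comma-separated record and parse every field."""
    return [parse_field(field) for field in line.split(",")]


# Shortened, made-up row in the style of the records on the slide.
record = parse_record("16, M, 0, 32, SUBACUTE, 38, ?, negative, , VIRUS")
missing_positions = [i for i, value in enumerate(record) if value is None]
```

Keeping missing values as `None`, rather than guessing a default, leaves the choice of how to supply them to a later preprocessing step.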
12 Data Rich, Knowledge Poor
People gather and store so much data because they think some valuable assets are implicitly coded within it. Raw data is rarely of direct benefit. Its true value depends on the ability to extract information useful for decision support.
Manual data analysis is impractical.
How to acquire knowledge for knowledge-based systems (knowledge base, inference engine) remains the main difficult and crucial problem.
- Tradition: via knowledge engineers
- New trend: via automatic programs
13 Benefits of Knowledge Discovery
[Figure: axes Volume and Value; stages EDP, MIS, DSS; labels Generate, Rapid Response, Disseminate]
EDP: Electronic Data Processing
MIS: Management Information Systems
DSS: Decision Support Systems
14 Lecture 1: Overview of KDD
1. What is KDD and why?
2. The KDD Process
3. KDD Applications
4. Data Mining Methods
5. Challenges for KDD
15 The KDD process
“The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” (Fayyad, Piatetsky-Shapiro, Smyth, 1996)
- non-trivial process: a multi-step process
- valid: justified patterns/models
- novel: previously unknown
- useful: can be used
- understandable: by human and machine
16 The Knowledge Discovery Process
KDD is inherently interactive and iterative.
1. Understand the domain and define problems
2. Collect and preprocess data
3. Data mining: extract patterns/models
4. Interpret and evaluate discovered knowledge
5. Put the results into practical use
Data mining is a step in the KDD process consisting of methods that produce useful patterns or models from the data, under some acceptable computational efficiency limitations.
17 The KDD Process
[Figure: the activities below arranged along steps 1 to 5 of the KDD process]
- Data organized by function
- Create/select target database
- Select sampling technique and sample data
- Supply missing values
- Normalize values
- Select DM task(s)
- Transform to different representation
- Eliminate noisy data
- Transform values
- Select DM method(s)
- Create derived attributes
- Extract knowledge
- Find important attributes & value ranges
- Test knowledge
- Refine knowledge
- Query & report generation
- Aggregation & sequences
- Advanced methods
- Data warehousing
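Two of the preprocessing activities named on this slide, supplying missing values and normalizing values, can be sketched as follows. This is only one possible choice for each: the slide does not say which methods the course uses, so mean imputation and min-max scaling are assumptions, and the function names and sample column are hypothetical.

```python
from typing import List, Optional


def fill_missing_with_mean(column: List[Optional[float]]) -> List[float]:
    """Replace each missing entry (None) with the mean of the known entries."""
    known = [v for v in column if v is not None]
    if not known:
        raise ValueError("column has no known values to average")
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in column]


def min_max_normalize(column: List[float]) -> List[float]:
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    low, high = min(column), max(column)
    if high == low:
        # A constant column carries no spread; map every value to 0.
        return [0.0 for _ in column]
    return [(v - low) / (high - low) for v in column]


# Hypothetical numerical attribute with one missing value.
temperatures = [37.0, None, 39.0, 38.0]
filled = fill_missing_with_mean(temperatures)
scaled = min_max_normalize(filled)
```

Filling before scaling matters here: the minimum and maximum are only defined once every entry has a value.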
18 Main Contributing Areas of KDD
- Databases: store, access, search, update data (deduction) [data warehouses: integrated data] [OLAP: On-Line Analytical Processing]
- Statistics: infer info from data (deduction & induction, mainly numeric data)
- Machine Learning: computer algorithms that improve automatically through experience (mainly induction, symbolic data)
19 Lecture 1: Overview of KDD
1. What is KDD and why?
2. The KDD Process
3. KDD Applications
4. Data Mining Methods
5. Challenges for KDD
20 Potential Applications
Business information
- Marketing and sales data analysis
- Investment analysis
- Loan approval
- Fraud detection
- etc.
Manufacturing information
- Controlling and scheduling
- Network management
- Experiment result analysis
- etc.
Scientific information
- Sky survey cataloging
- Biosequence Databases
- Geosciences: Quakefinder
- etc.
Personal information
21 KDD: Opportunity and Challenges
- Data Rich, Knowledge Poor (the resource)
- Enabling Technology (Interactive MIS, OLAP, parallel computing, Web, etc.)
- Competitive Pressure
- Data Mining Technology Mature
[Diagram label: KDD]
22 KDD: A New and Fast Growing Area
- KDD workshops: since 1989.
- International conferences: KDD (USA), first in 1995; PAKDD (Asia), first in 1997; PKDD (Europe), first in 1997. ML’04/PKDD’04 (in Pisa, Italy).
- Industry interests and competition: IBM, Microsoft, Silicon Graphics, Sun, Boeing, NASA, SAS, SPSS, …
- About 80% of the Fortune 500 companies are involved in data mining projects or using data mining systems.
- JAPAN: FGCS Project (logic programming and reasoning).
- “Knowledge Discovery is the most desirable end-product of computing.” Wiederhold, Stanford Univ.
23 Lecture 1: Overview of KDD
1. What is KDD and why?
2. The KDD Process
3. KDD Applications
4. Data Mining Methods
5. Challenges for KDD
24 Primary Tasks of Data Mining
- Classification: finding the description of several predefined classes and classifying a data item into one of them.
- Regression: mapping a data item to a real-valued prediction variable.
- Clustering: identifying a finite set of categories or clusters to describe the data.
- Summarization: finding a compact description for a subset of data.
- Dependency Modeling: finding a model which describes significant dependencies between variables.
- Deviation and change detection: discovering the most significant changes in the data.
25 Classification
“What factors determine cancerous cells?”
[Figure: Data (examples: cancerous cell data) passes through a mining algorithm (classification algorithm) to give general patterns]
Classification algorithms:
- Rule Induction
- Decision tree
- Neural Network
26 Classification: Rule Induction
“What factors determine a cell is cancerous?”
If Color = light and Tails = 1 and Nuclei = 2 Then Healthy Cell (certainty = 92%)
If Color = dark and Tails = 2 and Nuclei = 2 Then Cancerous Cell (certainty = 87%)
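The two rules on this slide can be applied to a cell description with a few lines of code. The sketch below is illustrative only: it assumes "certainty" means the fraction of cells covered by a rule that actually belong to the predicted class, which the slide does not define, and the labeled examples are made up to show the calculation rather than taken from any real data set.

```python
from typing import Dict, List, Optional, Tuple

Cell = Dict[str, object]

# Each rule: (conditions, predicted class, certainty quoted on the slide).
RULES: List[Tuple[Cell, str, float]] = [
    ({"color": "light", "tails": 1, "nuclei": 2}, "healthy", 0.92),
    ({"color": "dark", "tails": 2, "nuclei": 2}, "cancerous", 0.87),
]


def matches(cell: Cell, conditions: Cell) -> bool:
    """True if the cell satisfies every attribute = value condition."""
    return all(cell.get(name) == value for name, value in conditions.items())


def classify(cell: Cell) -> Optional[Tuple[str, float]]:
    """Return (class, certainty) of the first matching rule, or None."""
    for conditions, label, certainty in RULES:
        if matches(cell, conditions):
            return label, certainty
    return None


def rule_certainty(conditions: Cell, label: str,
                   examples: List[Tuple[Cell, str]]) -> float:
    """Assumed definition: share of covered examples that carry the label."""
    covered = [actual for cell, actual in examples if matches(cell, conditions)]
    if not covered:
        return 0.0
    return covered.count(label) / len(covered)


# Hypothetical labeled cells, invented for this example.
dark_cell: Cell = {"color": "dark", "tails": 2, "nuclei": 2}
examples = [(dark_cell, "cancerous")] * 3 + [(dark_cell, "healthy")]
estimate = rule_certainty(dark_cell, "cancerous", examples)
```

A cell that matches neither rule gets no prediction, which reflects that two rules alone do not cover every combination of attribute values.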
27 Classification: Decision Trees
[Figure: decision tree with tests on Color (dark / light), #nuclei (1 / 2), and #tails (1 / 2), and leaves labeled healthy or cancerous]
28 Classification: Neural Networks
“What factors determine a cell is cancerous?”
[Figure: neural network with inputs Color = dark, # nuclei = 1, …, # tails = 2 and outputs Healthy and Cancerous]