Data Mining: A Closer Look Chapter 2. 2.1 Data Mining Strategies.

Presentation transcript:

Data Mining: A Closer Look Chapter 2

2.1 Data Mining Strategies

Figure 2.1 A hierarchy of data mining strategies

Classification Learning is supervised. The dependent variable is categorical. Well-defined classes. Current rather than future behavior.

Estimation Learning is supervised. The dependent variable is numeric. Well-defined classes. Current rather than future behavior.

Prediction The emphasis is on predicting future rather than current outcomes. The output attribute may be categorical or numeric.

The Cardiology Patient Dataset

A Healthy Class Rule for the Cardiology Patient Dataset: IF 169 <= Maximum Heart Rate <= 202 THEN Concept Class = Healthy. Rule accuracy: 85.07%. Rule coverage: 34.55%.

A Sick Class Rule. Rules for Class Sick (93 instances): <= Maximum Heart Rate <= (the bounds are missing in the transcript). Rule accuracy: 75.81%. Rule coverage: 50.54%.

Explanation: If your maximum heart rate is low, you may be at risk of having a heart attack (for prediction). If you have a heart attack, expect your maximum heart rate to decrease (for classification). A low maximum heart rate will cause you to have a heart attack (x, not a valid inference).

A Sick Class Rule for the Cardiology Patient Dataset IF Thal = Rev & Chest Pain Type = Asymptomatic THEN Concept Class = Sick Rule accuracy: 91.14% Rule coverage: 52.17%

Unsupervised Clustering Determine if concepts can be found in the data. Evaluate the likely performance of a supervised model. Determine a best set of input attributes for supervised learning. Detect Outliers.

Market Basket Analysis Find interesting relationships among retail products. Uses association rule algorithms.

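Building Association Rules. Following the procedure designed by Agrawal and Srikant (1994), and viewed from a technical standpoint, building association rules can be divided into two steps: 1. Find all large itemsets in the database, that is, itemsets whose support level is at least the specified minimal support level. 2. Use the large itemsets to generate rules. For example, if the large itemsets found include X and Y, a candidate rule is X→Y. We also compute the probability that Y occurs when X occurs, Support(X ∩ Y) ∕ Support(X), known as the confidence level. If the computed confidence is at least the specified minimal confidence level, the rule is accepted.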

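Suppose the minimal support level is set to 50% and the minimal confidence level to 50%. Of the 4 transactions, 2 contain both A and C (Transaction IDs 2000 and 1000), so the support of AC is 2 / 4 (50%). Of the 4 transactions, 3 contain A (Transaction IDs 2000, 1000, and 4000), and of these, 2000 and 1000 also contain C, so the confidence is 2 / 3 (66.6%). Since the computed support and confidence both meet the minimal support and minimal confidence levels, the rule A→C is accepted.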
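The arithmetic of this example can be reproduced with a short Python sketch. The transaction table itself is not in the transcript, so the table below is an assumption: only the facts that transactions 2000 and 1000 contain both A and C and that 4000 contains A come from the slide. The remaining items, transaction 5000, and the helper names `support` and `confidence` are hypothetical.

```python
# Hypothetical transaction table. Only the A and C memberships of
# transactions 2000, 1000 and 4000 come from the slide; the other items
# and transaction 5000 are assumptions made for illustration.
transactions = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support of antecedent and consequent together, divided by support of antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


MIN_SUPPORT = 0.5
MIN_CONFIDENCE = 0.5

sup_ac = support({"A", "C"}, transactions)          # 2 of 4 transactions
conf_a_c = confidence({"A"}, {"C"}, transactions)   # 2 of the 3 transactions with A
rule_holds = sup_ac >= MIN_SUPPORT and conf_a_c >= MIN_CONFIDENCE

print(f"support(AC) = {sup_ac:.2f}")
print(f"confidence(A -> C) = {conf_a_c:.3f}")
print(f"rule A -> C accepted: {rule_holds}")
```

The thresholds are compared with `>=` because the support of AC equals the 50% minimum exactly, and the slide still accepts the rule.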

2.2 Supervised Data Mining Techniques

The Credit Card Promotion Database

A Hypothesis for the Credit Card Promotion Database: A combination of one or more of the dataset attributes differentiates Acme Credit Card Company card holders who have taken advantage of the life insurance promotion from those card holders who have chosen not to participate in the promotional offer.

A Production Rule for the Credit Card Promotion Database: IF Sex = Female & 19 <= Age <= 43 THEN Life Insurance Promotion = Yes. Rule Accuracy: % Rule Coverage: 66.67%

Production Rules Rule accuracy is a between-class measure. Rule coverage is a within-class measure.
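As a minimal sketch of how the two measures differ, the following Python computes both for a rule of the form used above (Sex = Female & 19 <= Age <= 43, class Yes). It assumes the usual reading: accuracy is the share of instances matching the rule's antecedent that belong to the rule's class, and coverage is the share of the class's instances that the antecedent matches. The records are invented for illustration and are not the Credit Card Promotion Database, so the resulting figures do not correspond to the slide's values.

```python
# Hypothetical records: (sex, age, life_insurance_promotion).
# These are NOT the rows of the Credit Card Promotion Database.
records = [
    ("Female", 35, "Yes"),
    ("Female", 27, "Yes"),
    ("Female", 43, "Yes"),
    ("Female", 29, "No"),
    ("Male", 40, "Yes"),
    ("Male", 22, "No"),
    ("Female", 55, "No"),
    ("Male", 45, "No"),
    ("Female", 60, "Yes"),
]


def antecedent(record):
    """IF part of the rule: Sex = Female & 19 <= Age <= 43."""
    sex, age, _ = record
    return sex == "Female" and 19 <= age <= 43


def rule_accuracy(records, antecedent, target):
    """Of the instances the rule fires on, the fraction in the target class."""
    fired = [r for r in records if antecedent(r)]
    return sum(1 for r in fired if r[2] == target) / len(fired)


def rule_coverage(records, antecedent, target):
    """Of the instances in the target class, the fraction the rule fires on."""
    in_class = [r for r in records if r[2] == target]
    return sum(1 for r in in_class if antecedent(r)) / len(in_class)


accuracy = rule_accuracy(records, antecedent, "Yes")
coverage = rule_coverage(records, antecedent, "Yes")
print(f"rule accuracy = {accuracy:.2%}, rule coverage = {coverage:.2%}")
```

Accuracy looks at everything the rule matches, including instances of other classes, which is why it is a between-class measure. Coverage looks only inside the target class.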

Neural Networks

Figure 2.2 A multilayer fully connected neural network

See CreCardPro_forNNresult.xls

Statistical Regression Life insurance promotion = (credit card insurance) (sex) See credicard_regession.xls

Bayesian Classifier: the instance is assigned to class c_i.

Simplified: the naive Bayes classifier
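The formulas on these two slides did not survive in the transcript. The block below is the standard textbook formulation, which may differ in notation from the slides. The symbols are assumptions: x = (x_1, ..., x_n) denotes the attribute values of an instance and c_i, c_j denote classes.

```latex
% Bayes decision rule: pick the class with the largest posterior probability.
\text{Assign } x \text{ to class } c_i \text{ if } P(c_i \mid x) > P(c_j \mid x) \text{ for all } j \neq i,
\qquad
P(c_i \mid x) = \frac{P(x \mid c_i)\, P(c_i)}{P(x)} .

% Naive Bayes simplification: attributes are assumed conditionally
% independent given the class, so the likelihood factorizes.
P(x \mid c_i) \approx \prod_{k=1}^{n} P(x_k \mid c_i)
```

Because P(x) is the same for every class, comparing P(x | c_i) P(c_i) across classes is enough to pick the winner.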

2.3 Association Rules

An Association Rule for the Credit Card Promotion Database IF Sex = Female & Age = over40 & Credit Card Insurance = No THEN Life Insurance Promotion = Yes

2.4 Clustering Techniques

Figure 2.3 An unsupervised cluster of the credit card database. Name the cluster: negative relation. Name the cluster: positive. Name the cluster: ?

2.5 Evaluating Performance

Evaluating Supervised Learner Models

Confusion Matrix A matrix used to summarize the results of a supervised classification. Entries along the main diagonal are correct classifications. Entries other than those on the main diagonal are classification errors.
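A minimal sketch of building such a matrix in plain Python follows. The label lists are hypothetical, and the function name `confusion_matrix` is chosen here for illustration rather than taken from any library. Rows are actual classes and columns are predicted classes, which is one common convention.

```python
from collections import Counter


def confusion_matrix(actual, predicted, classes):
    """Return a square matrix: rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]


# Hypothetical test-set labels and model predictions.
actual = ["Yes", "Yes", "No", "No", "Yes", "No"]
predicted = ["Yes", "No", "No", "Yes", "Yes", "No"]
classes = ["Yes", "No"]

matrix = confusion_matrix(actual, predicted, classes)

# Main-diagonal entries are correct classifications; everything else is an error.
correct = sum(matrix[i][i] for i in range(len(classes)))
errors = len(actual) - correct
accuracy = correct / len(actual)

for label, row in zip(classes, matrix):
    print(label, row)
print(f"correct = {correct}, errors = {errors}, accuracy = {accuracy:.3f}")
```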

Two-Class Error Analysis

Evaluating Numeric Output: mean absolute error, mean squared error, root mean squared error.
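The three measures can be sketched in a few lines of Python. The actual and predicted values below are made up for illustration, and the function names are chosen here.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Average of (actual - predicted)^2; penalizes large errors more heavily."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the mean squared error, in the same units as the output."""
    return math.sqrt(mean_squared_error(actual, predicted))


# Hypothetical numeric outputs.
actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 8.0]

mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)
rmse = root_mean_squared_error(actual, predicted)
print(f"MAE = {mae}, MSE = {mse}, RMSE = {rmse:.4f}")
```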

Comparing Models by Measuring Lift

Figure 2.4 Targeted vs. mass mailing

Computing Lift

Example: a population of 100K containing 1K potential customers; a mass mailing sends 100K pieces of mail.
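Lift is commonly defined as the ratio of the class rate in the model-selected sample to the class rate in the whole population. The sketch below uses the population of 100K and the 1K potential customers from the example. The size of the targeted sample and the number of customers it contains are hypothetical values chosen for illustration, not figures from the slides.

```python
def lift(class_in_sample, sample_size, class_in_population, population_size):
    """P(class | sample) divided by P(class | population)."""
    sample_rate = class_in_sample / sample_size
    population_rate = class_in_population / population_size
    return sample_rate / population_rate


population_size = 100_000      # from the example
potential_customers = 1_000    # from the example

# Hypothetical outcome of a targeted mailing chosen by a model.
sample_size = 20_000
customers_in_sample = 600

targeted_lift = lift(customers_in_sample, sample_size,
                     potential_customers, population_size)
mass_mailing_lift = lift(potential_customers, population_size,
                         potential_customers, population_size)

print(f"lift of targeted mailing: {targeted_lift:.2f}")
print(f"lift of mass mailing:     {mass_mailing_lift:.2f}")
```

A mass mailing to everyone has a lift of 1 by construction. A model is useful for targeting only to the extent that its lift exceeds 1.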

[Table: lift example; only the totals row survives, with a total of 20000.]

Unsupervised Model Evaluation (refer to sec). 1. Treat each cluster as a new class (i.e., add a column that labels each instance with its cluster), then use a supervised approach to classify the population. A good result indicates a successful clustering. 2. Some clustering results may not be robust (e.g., k-means), so the evaluation results may not be satisfactory. 3. Apply alternative measures, e.g., between-cluster attribute-value comparison.