Main Data Mining Techniques for Biomedical Informatics: Data Analysis (by DM Techniques) for Biomedical Informatics

Presentation transcript:

Slide 1 (Monday, November 2, 2015): Main Data Mining Techniques for Biomedical Informatics
Data Analysis (by DM Techniques) for Biomedical Informatics
Chen, Chun-Hsien, Department of Information Management, Chang Gung University

Slide 2: Outline
- Motivation for data mining in biomedical informatics
- What is data mining?
- Applications of data mining
- Data mining process
- Main data mining techniques

Slide 3: Motivation
- Data explosion problem
  - A tremendous number of Web pages
  - 40 billion photos on Facebook
  - 1 million new transactions per hour in the Walmart database
  - Big data in the cloud
  - National Health Insurance Research Database (NHI inpatient prescription and treatment orders: about 17 × 10^6 records as of December 2008)
- We are drowning in data, but starving for knowledge!
- Solution: data mining (a KDD technology)
  - One of the 10 emerging technologies that will change the world in the near future (MIT Technology Review)

Slide 4: What Is Data Mining?
- Data mining: automatic extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) knowledge (rules, regularities, patterns, trends, associations) from large amounts of data
- What is not data mining?
  - Google or database query processing
  - Expert systems or simple statistical programs

Slide 5: Example: Mining a Concept Hierarchy
[Figure: concept hierarchy with levels all → region → country → city. Regions: Europe, North_America. Countries: Germany, Spain, Canada, Mexico. Cities: Frankfurt, Vancouver, Toronto, ...]

Slide 6: Part of International Sales Data

Slide 7: General Applications of Data Mining
- Decision support
- Biomedical decision support
- Fraud detection and management
- Market analysis and management
- Risk analysis and management

Slide 8: Specific Applications of Data Mining (Related to the Biomedical Domain)
- Using data mining techniques
  - Help with disease screening, diagnosis, and treatment
  - Help identify the genes related to genetic diseases
  - Help with drug design and discovery
- Using text mining and data mining techniques
  - Help find the genes related to genetic diseases in the medical literature

Slide 9: Steps in a KDD Process (Technically)
KDD: Knowledge Discovery in Databases
[Figure: Raw data → Data Preprocessing → Clean, Relevant Data → Data Mining → Pattern → Evaluation/Presentation → Knowledge]

Slide 10: Main Steps of a KDD Process (Fully)
- Domain knowledge acquisition
  - Learning the important, relevant knowledge and the goals of the application
- Data collection and preprocessing (may take 60% of the effort)
  - Data generation, cleaning, and selection
  - Data integration, reduction, and transformation
- Data mining (searching for interesting patterns)
  - Choosing the type of data mining function: classification, association, clustering, summarization, regression
  - Choosing the mining algorithm(s)
- Pattern evaluation and knowledge presentation
  - Visualization, transformation, removing redundant patterns, etc.
- Use of discovered knowledge

Slide 11: Steps in a KDD Process (Step 1)
[Figure: Raw data → Data Preprocessing → Clean, Relevant Data]

Slide 12: Why Data Preprocessing?
- Data in the real world is dirty
  - Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - Noisy: containing errors or outliers
  - Inconsistent: containing discrepancies in codes or names
- No quality data, no quality mining results!
  - Quality decisions must be based on quality data

Slide 13: Major Tasks in Data Preprocessing
- Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers
- Data integration
  - Integration of multiple databases, data sources, or files
- Data transformation
  - Normalization and aggregation
- Data reduction
  - Variable reduction, data set reduction, data representation reduction
- Data discretization
  - Reduce the number of values for variables, especially numerical variables
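Three of the tasks listed above (filling in missing values, normalization, and discretization) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not prescribe mean imputation, min-max scaling, or equal-width binning, and the `ages` attribute is made-up data.

```python
from statistics import mean

def fill_missing(values):
    """Data cleaning: replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def min_max_normalize(values):
    """Data transformation: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def equal_width_bins(values, n_bins):
    """Data discretization: map each value to a bin index 0..n_bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

ages = [20, None, 40, 60]               # hypothetical raw attribute with a gap
cleaned = fill_missing(ages)            # missing value replaced by the mean, 40
normalized = min_max_normalize(cleaned)
binned = equal_width_bins(cleaned, 2)
```

Other choices (median imputation, z-score normalization, equal-frequency bins) fit the same task descriptions; which one is appropriate depends on the data.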

Slide 14: Steps in a KDD Process
[Figure: Raw data → Data Preprocessing → Relevant Data → Data Mining → Pattern]

Slide 15: Main Data Mining Techniques
- Association rule mining
- Classification
- Cluster analysis
- Outlier analysis
- Trend analysis
- Linear regression

Slide 16: Main Data Mining Techniques (1/5): Association Rule Mining
- Goal: find associations and/or correlations
- Finding strong rules:
  - sales(T, “computer”) → sales(T, “software”) [support = 1%, confidence = 75%]
  - sales(T, “beer”) → sales(T, “diaper”) [support = 2%, confidence = 70%]
  - age(X, “20..29”) ^ income(X, “30..39K”) → buys(X, “PC”) [support = 2%, confidence = 60%]
[Figure annotations on the rule notation: data context, data item]

Slide 17: Support and Confidence
- Association rule mining: find all the rules X → Y with minimum support S and minimum confidence C
  - Support S: the probability that a transaction contains both X and Y
  - Confidence C: the conditional probability that a transaction containing X also contains Y
- Example rules, given as (support, confidence):
  - A → C (50%, 66.6%)
  - C → A (50%, 100%)
[Figure: Venn diagram of customers who buy beer (X), customers who buy diapers (Y), and customers who buy both]
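The two definitions above translate directly into code. The sketch below is illustrative only: the four transactions are an assumed data set (the slide does not show one), chosen so that the rules A → C and C → A come out with the support and confidence values quoted on the slide.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)

def confidence(transactions, lhs, rhs):
    """Conditional probability that a transaction containing lhs also contains rhs."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

# Hypothetical transaction database (not taken from the slides).
transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
]

support_ac = support(transactions, {"A", "C"})           # 2 of 4 transactions
conf_a_to_c = confidence(transactions, {"A"}, {"C"})     # 2 of the 3 with A
conf_c_to_a = confidence(transactions, {"C"}, {"A"})     # 2 of the 2 with C
```

Note that support is symmetric in X and Y while confidence is not, which is why the two rules share a support of 50% but differ in confidence.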

Slide 18: Main Data Mining Techniques (2/5): Classification
- Finding models that describe and distinguish classes, for use in future prediction
- Model representations: decision tree, neural network
- Typical applications
  - Disease screening, diagnosis, and treatment
  - Credit card and loan approval
  - Target marketing
  - Pattern recognition

Slide 19: An Example of Classification (Fruit Classifier)
[Figure: a classifier maps input features to an output class label. shape = round, color = red → Apple; shape = round, color = orange → Orange; a third input (listing oval, red, orange, yellow) → Mango]

Slide 20: A General Classifier
[Figure: input features → Classifier → output class label]

Slide 21: Model of Supervised Learning
- The model has the form of a classifier f that maps input features x_1, x_2, ..., x_n to an output y
- Main issues:
  - What are x_1, ..., x_n?
  - How do we get the model f?
  - How do we collect training data with output y?

Slide 22: Classification: A 2-Step Process
- Model construction: Training Data (I, O) → Classification Learning Algorithms → Classifier Model
- Model usage: input features → Classifier Model → output class label

Slide 23: Main Classification Methods
- Decision tree
- Artificial neural networks
- Naïve Bayesian classification
- k-nearest neighbor classifier
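As a small illustration of the last method in this list, here is a minimal k-nearest neighbor sketch organized around the two steps of classification (model construction, then model usage). Everything specific in it is an assumption for the example: the three training records loosely follow the fruit classifier slide, the Hamming distance is one possible choice for nominal features, and the function names `train` and `classify` are hypothetical.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions at which two feature tuples differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

def train(examples):
    """Step 1, model construction. For k-NN the model is just the stored examples."""
    return list(examples)

def classify(model, features, k=1):
    """Step 2, model usage: majority vote among the k nearest stored examples."""
    nearest = sorted(model, key=lambda ex: hamming(ex[0], features))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training data: (input features, class label) pairs.
training_data = [
    (("round", "red"), "Apple"),
    (("round", "orange"), "Orange"),
    (("oval", "yellow"), "Mango"),
]

model = train(training_data)
label = classify(model, ("round", "red"))
```

With only one example per class, ties between equally distant neighbors are resolved by training order, so a realistic data set would need more records per class and a larger k.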

Slide 24: Training Dataset Example for buys_PC (An example from Quinlan’s ID3)

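Slide 25: Example: A Decision Tree for “buys_PC”
[Figure: decision tree. The root node tests age? with branches <=30, 31…40 (labeled “overcast” in the figure), and >40. The <=30 branch leads to the test student? (no / yes), the >40 branch leads to the test credit rating? (fair / excellent), and the leaves hold the class labels no or yes]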

Slide 26: Extracting Classification Rules from a Decision Tree
- Rules are easier for humans to understand
- Represent the knowledge in the form of IF-THEN rules
  - One rule is created for each path from the root to a leaf
  - Each attribute-value pair along a path forms a conjunction
  - The leaf node holds the class prediction
- Rule examples
  - IF age = “<=30” AND student = “no” THEN buys_computer = “no”
  - IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
  - IF age = “31…40” THEN buys_computer = “yes”
  - IF age = “>40” AND rating = “fair” THEN buys_computer = “no”
  - IF age = “>40” AND rating = “excellent” THEN buys_computer = “yes”
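The five rule examples above can be written down as data and applied mechanically. The sketch below assumes a record is a dictionary of attribute values; the names `RULES` and `predict` are hypothetical, and the age range is spelled "31...40" with plain periods.

```python
# Each rule is (conjunction of attribute-value pairs, predicted class),
# transcribed from the five IF-THEN rules on the slide.
RULES = [
    ({"age": "<=30", "student": "no"}, "no"),
    ({"age": "<=30", "student": "yes"}, "yes"),
    ({"age": "31...40"}, "yes"),
    ({"age": ">40", "rating": "fair"}, "no"),
    ({"age": ">40", "rating": "excellent"}, "yes"),
]

def predict(record):
    """Return buys_computer for the first rule whose conditions all hold, else None."""
    for conditions, outcome in RULES:
        if all(record.get(attr) == value for attr, value in conditions.items()):
            return outcome
    return None

young_student = predict({"age": "<=30", "student": "yes"})
older_fair = predict({"age": ">40", "rating": "fair"})
```

Because each rule corresponds to one root-to-leaf path, at most one rule matches a complete record, so the order of the list does not affect the result.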

Slide 27: A Decision Tree for CAD Screening (Constructed from ~500 Records)

Slide 28: Main Data Mining Techniques (3/5): Cluster Analysis
- Clustering: the class label is unknown, so the data are grouped to form new classes
  - e.g. disease profiling, patient profiling
- Clustering is based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity

Slide 29: Example of Cluster Analysis
[Figure: scatter plot with labeled points A, B, C and X, Y, Z. Points A and B are in the same cluster. Points X, Y, and Z are outliers]

Slide 30: Clustering Example in High Dimension (Cluster Analysis of CAD Data)
[Figure: data matrix for visualization with a clustering dendrogram, showing the profile of CAD patients and the profile of healthy people]

Slide 31: Major Clustering Approaches
- Partitioning algorithms: construct various partitions and then evaluate them by some criterion
- Hierarchical algorithms: create a hierarchical clustering structure for the set of data records using some criterion
- Density-based: based on connectivity and density functions
- Grid-based: quantize the data space into a finite number of cells that form a grid structure, on which the clustering is performed
- Model-based: a model is hypothesized for each of the clusters, and the best fit of the records to the given models is found

Slide 32: Hierarchical Clustering
- Uses a distance matrix as the clustering criterion
[Figure: objects a, b, c, d, e. Agglomerative clustering (AGNES) runs from Step 0 to Step 4, merging the single objects into {a, b}, {d, e}, {c, d, e}, and finally {a, b, c, d, e}. Divisive clustering (DIANA) runs the same steps in reverse, from Step 4 back to Step 0]

Slide 33: Dendrogram Showing Hierarchically Merged Clusters
- The data objects are organized into several levels of nested clusters in a tree, called a dendrogram
- A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster
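Agglomerative clustering and the idea of cutting the dendrogram can be illustrated with a short sketch. It rests on several assumptions not stated on the slides: one-dimensional points, single-link distance between clusters, and stopping once a requested number of clusters remains (which corresponds to cutting the dendrogram at that level). The function names and the sample points are hypothetical.

```python
def single_link_distance(c1, c2):
    """Distance between two clusters: the smallest pairwise point distance."""
    return min(abs(a - b) for a in c1 for b in c2)

def agglomerative(points, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[p] for p in points]      # Step 0: every point is its own cluster
    merges = []                           # merge history, i.e. the dendrogram
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_link_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters], merges

points = [1.0, 1.5, 5.0, 5.2, 9.0]        # hypothetical one-dimensional data
three_clusters, history = agglomerative(points, 3)
two_clusters, _ = agglomerative(points, 2)
```

The pairwise search makes this cubic in the number of points, which is fine for a demonstration but not for large data sets.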

Slide 34: Gene Expression Analysis by Clustering
- Analyze gene behavior from gene microarray data
[Figure: clustering of microarrays]

Slide 35: Profile of Stroke Patients (Diagnosis Indices of Chinese Medicine)

Slide 36: Examples of SOM Feature Map

Slide 37: Other Data Mining Techniques (4/5)
- Outlier analysis
  - Outlier: a data object that does not comply with the general behavior of the given data set
  - It can be considered noise or an exception, but it is quite useful in fraud detection and in the analysis of rare events (diseases)
- Trend analysis
  - Trend and deviation: regression analysis
  - Sequential pattern mining, periodicity analysis
- Other pattern-directed or statistical analyses
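One simple way to make "does not comply with the general behavior" concrete is a z-score test, sketched below. This is a swapped-in illustration, not a method named on the slides: the threshold of 2 standard deviations and the sample readings are assumptions.

```python
from statistics import mean, pstdev

def z_score_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)                # population standard deviation
    if sigma == 0:
        return []                         # all values identical: no outliers
    return [v for v in values if abs(v - mu) / sigma > threshold]

readings = [10, 11, 9, 10, 12, 10, 50]    # hypothetical measurements
outliers = z_score_outliers(readings)
```

Because an extreme value inflates both the mean and the standard deviation, this test can miss outliers in small samples; median-based variants are more robust.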

Slide 38: Main Data Mining Techniques (5/5): Linear Regression
- Regression example: predict the value of Y at X_1 using linear regression
[Figure: data points in the x-y plane with the fitted line y = x + 1; at X_1, the observed value is Y_1 and the value predicted by the line is Y_1’]

Slide 39: Regression Analysis and Log-Linear Models
- Linear regression: $Y = \alpha + \beta X$
  - The two parameters, $\alpha$ and $\beta$, specify the line and are estimated from the data at hand
  - They are estimated by applying the least squares criterion to the known values $Y_1, Y_2, \ldots$ and $X_1, X_2, \ldots$
- Multiple regression: $Y = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_n X_n$
  - Many nonlinear functions can be transformed into the above form
- Log-linear models
  - The joint probabilities of a multi-variable table are approximated by a product of single-variable tables
  - Probability: $p(a, b, c, d) \approx p(a)\,p(b)\,p(c)\,p(d)$
  - Taking logarithms: $\log p(a, b, c, d) \approx \log p(a) + \log p(b) + \log p(c) + \log p(d)$
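For the single-variable case, the least squares estimates of the intercept and slope have a closed form, sketched below in plain Python. The function name `fit_line` and the sample points are assumptions for the example; the points are chosen to lie exactly on the line y = x + 1 used in the regression example.

```python
def fit_line(xs, ys):
    """Least squares estimates (alpha, beta) for the line Y = alpha + beta * X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of X and Y divided by the variance of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    alpha = mean_y - beta * mean_x
    return alpha, beta

xs = [0.0, 1.0, 2.0, 3.0]                 # hypothetical known X values
ys = [1.0, 2.0, 3.0, 4.0]                 # matching Y values on y = x + 1
alpha, beta = fit_line(xs, ys)
predicted = alpha + beta * 4.0            # prediction at a new X value
```

The function assumes the X values are not all equal, since the variance term in the denominator would otherwise be zero.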

Slide 40: Are All the “Discovered” Patterns Interesting?
- A data mining system may generate thousands of patterns, and not all of them are interesting
  - Suggested approach: human-centered, query-based, focused mining
- Interestingness measures: a pattern is interesting if it is easily understood, potentially useful, novel, valid on new or test data with some degree of certainty, or if it validates a hypothesis that the user seeks to confirm
- Objective vs. subjective interestingness measures
  - Objective: based on the statistics and structure of the data patterns, e.g., support, confidence, etc.
  - Subjective: based on the user’s beliefs about the data, e.g., unexpectedness, novelty, actionability, etc.

Slide 41: Can We Find All and Only Interesting Patterns?
- Find all the interesting patterns: completeness
  - Can a data mining system find all the interesting patterns?
- Search for only interesting patterns: optimization
  - Can a data mining system find only the interesting patterns?
- Approaches
  - First generate all the relevant patterns and then filter out the uninteresting ones
  - Generate only the interesting patterns (mining query optimization)

Slide 42: Classification of Data Mining Systems
- General functionality
  - Descriptive data mining
  - Predictive data mining
- Different views → different classifications
  - Kinds of databases to be mined
  - Kinds of knowledge to be discovered
  - Kinds of disciplines utilized
  - Kinds of applications adapted

Slide 43: A Multi-Dimensional View of Data Mining Systems
- Kinds of databases to be mined
  - Relational, transactional, WWW, spatial, time-series, text, multimedia, object-oriented, object-relational, heterogeneous, legacy
- Kinds of knowledge to be mined
  - Association, classification, clustering, trend, characterization, outlier analysis, etc.
- Kinds of disciplines utilized
  - Machine learning, statistics, visualization, database-oriented, data warehouse (OLAP), etc.
- Kinds of applications adapted
  - Biomedical informatics, retail, telecommunication, financing, fraud analysis, stock market analysis, Web mining, etc.

Slide 44: Summary
- Data mining: automatically discovering interesting knowledge from large amounts of data
- A natural evolution of database technology, in great demand, with wide applications
- A KDD process includes domain knowledge acquisition, data preprocessing, data mining, pattern evaluation, and knowledge presentation
- Main data mining techniques: association rule mining, classification, clustering, outlier analysis, trend analysis, linear regression, etc.

Slide 45: Thank you! Have a nice day!