Data Mining Functionalities (2)


1 Data Mining Functionalities (2)
Classification and Prediction
  Finding models (functions) that describe and distinguish classes or concepts for future prediction
  E.g., classify countries based on climate, or classify cars based on gas mileage
  Presentation: decision tree, classification rule, neural network
  Prediction: predict some unknown or missing numerical values

2 Data Mining Functionalities (3)
Cluster analysis
  Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
  Clustering is based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity
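To make the grouping idea concrete, here is a minimal sketch of k-means (Lloyd's algorithm), one common clustering method. The slide does not name an algorithm, so the choice of k-means, the house coordinates, and the starting centroids are all assumptions made for illustration.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of the points assigned to it."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]  # keep the old centroid if a cluster is empty
            for i, cluster in enumerate(clusters)
        ]
    return clusters, centroids

# Hypothetical house locations (x, y); no class labels are given.
houses = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
# Assumed starting centroids: one house from each visible group.
clusters, centroids = kmeans(houses, centroids=[(1, 1), (8, 8)])
print(clusters)
```

Houses that lie close together end up in the same cluster (high intra-class similarity), while the two clusters sit far apart (low inter-class similarity).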

3 Classification
[Figure: learning produces a learned model; if the model is acceptable, use it to predict]

4 Classification—A Two-Step Process
Model construction: describing a set of predetermined classes
  Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  The set of tuples used for model construction is the training set
  The model is represented as classification rules, decision trees, or mathematical formulae
Model usage: classifying future or unknown objects
  Estimate the accuracy of the model
    The known label of each test sample is compared with the classified result from the model
    Accuracy rate is the percentage of test set samples that are correctly classified by the model
    The test set must be independent of the training set; otherwise the accuracy estimate is over-optimistic and over-fitting goes undetected
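The two steps can be sketched as follows. This is a minimal illustration, not the slide's algorithm: the model is a one-attribute threshold rule ("years > t means tenured"), and the training and test tuples are hypothetical.

```python
def fit_threshold(train):
    """Step 1, model construction: choose the threshold t on 'years' for
    which the rule 'years > t -> yes' classifies the most training tuples."""
    best_t, best_correct = None, -1
    for t in sorted({years for years, _ in train}):
        correct = sum((years > t) == (label == "yes") for years, label in train)
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t

def predict(t, years):
    return "yes" if years > t else "no"

def accuracy(t, data):
    """Step 2, model usage: fraction of tuples whose known label
    matches the label produced by the model."""
    return sum(predict(t, years) == label for years, label in data) / len(data)

# Hypothetical (years, tenured) tuples; the test set is kept separate.
train = [(3, "no"), (7, "yes"), (2, "no"), (7, "yes"), (6, "no"), (3, "no")]
test = [(4, "no"), (8, "yes"), (5, "yes"), (1, "no")]

t = fit_threshold(train)
print(f"threshold={t}, train={accuracy(t, train):.0%}, test={accuracy(t, test):.0%}")
# prints: threshold=6, train=100%, test=75%
```

The gap between the two figures is the reason the test set has to be independent: accuracy measured on the training tuples overstates how well the model handles new data.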

5 Classification Process (1): Model Construction
[Figure: training data with a class label attribute is fed to classification algorithms, which produce a classifier (model)]
Example of a learned rule: IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’

6 Classification Process (2): Use the Model in Prediction
[Figure: the classifier is applied to testing data and then to unseen data]
Rule: IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Unseen data: (Jeff, Professor, 4). Tenured?
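The rule on this slide is simple enough to run directly. The sketch below applies it to the unseen tuple; the function name is made up for illustration, and the 'no' branch is an assumption, since the slide states only the 'yes' case.

```python
def is_tenured(rank: str, years: int) -> str:
    """IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    if rank.lower() == "professor" or years > 6:
        return "yes"
    return "no"  # assumption: anything not covered by the rule is 'no'

name, rank, years = ("Jeff", "Professor", 4)
print(name, is_tenured(rank, years))  # prints: Jeff yes
```

Jeff has only 4 years, but the rule fires on the rank condition alone, so the model answers 'yes'.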

7 Supervised vs. Unsupervised Learning
Supervised learning (classification)
  Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  New data is classified based on the training set
Unsupervised learning (clustering)
  The class labels of the training data are unknown
  Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

8 (1) Quality Control Issue:
[Figure: expression values for experiment Exp #j]
In each experiment, can you do some preliminary statistical analysis that assures, at a minimum, that the experiment has not been contaminated (e.g., by adding a sufficient number of controls)?

9 (2) Generating Bias Free Data:
[Figure: expression values for experiments Exp #1 through Exp #6]
How should the experiments be designed so that variations caused by experimental biases and errors can be offset, e.g., by redundant spot printing, switching dyes, or changing the spot printing order?
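As one illustration of offsetting a bias by design, the sketch below shows the arithmetic behind a dye swap, under the assumption that the dye bias shifts the log ratio by the same amount in both hybridizations. The intensities and function names are hypothetical.

```python
import math

def log_ratio(red: float, green: float) -> float:
    return math.log2(red / green)

def dye_swap_estimate(forward, swapped):
    """forward: (red, green) with the sample labeled red.
    swapped: (red, green) with the sample labeled green.
    The true log ratio enters the two measurements with opposite signs and
    the dye bias with the same sign, so half the difference cancels the bias."""
    return (log_ratio(*forward) - log_ratio(*swapped)) / 2

# Hypothetical spot: the sample is 2x the reference, and red reads 1.5x too bright.
forward = (2 * 1.5 * 100, 100)   # sample in red, reference in green
swapped = (1 * 1.5 * 100, 200)   # reference in red, sample in green

print(round(log_ratio(*forward), 3))                 # biased single-array estimate
print(round(dye_swap_estimate(forward, swapped), 3)) # bias-corrected estimate
```

A single array reports a log ratio inflated by the dye effect, whereas the swapped pair recovers the assumed 2-fold change (log2 ratio of 1).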

10 Microarray Expression
[Figure: microarray expression values indexed by GeneID]
(3) Getting the Representative Values: How many experiments are needed to produce statistically significant representative expression values? (average vs. mode)
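The "average vs. mode" choice matters most when one replicate is off. A small sketch with hypothetical replicate measurements for a single gene, rounded to one decimal so that a mode is meaningful:

```python
import statistics

# Hypothetical expression values for one gene across five replicate experiments;
# the last replicate is an outlier (e.g., a contaminated spot).
replicates = [2.1, 2.0, 2.1, 2.1, 9.5]

average = statistics.mean(replicates)
most_common = statistics.mode(replicates)

print(f"average = {average:.2f}, mode = {most_common}")
```

The average is pulled toward the outlier, while the mode stays at the value most replicates agree on. The mode, however, only exists in a useful sense once values are rounded or binned, which is itself a design decision.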

11 (4) Clustering
[Figure: two alternative partitionings of the expression data, Clustering 1 and Clustering 2]
How to partition expression data? Is one partitioning better than the other, and can you tell this without performing association mining?
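One way to compare two partitionings using the expression data alone is an internal score: the mean distance between genes in the same cluster divided by the mean distance between genes in different clusters. This criterion is an assumption of the sketch, not something the slide prescribes, and the expression values and labels are hypothetical.

```python
from itertools import combinations

def intra_inter_ratio(values, labels):
    """Mean within-cluster distance divided by mean between-cluster distance.
    Lower is better: tight clusters that are far apart."""
    intra, inter = [], []
    for i, j in combinations(range(len(values)), 2):
        distance = abs(values[i] - values[j])
        (intra if labels[i] == labels[j] else inter).append(distance)
    return (sum(intra) / len(intra)) / (sum(inter) / len(inter))

# Hypothetical expression values for six genes and two candidate partitionings.
expression = [0.1, 0.2, 0.3, 2.0, 2.1, 2.2]
clustering_1 = [0, 0, 0, 1, 1, 1]
clustering_2 = [0, 0, 1, 1, 0, 1]

score_1 = intra_inter_ratio(expression, clustering_1)
score_2 = intra_inter_ratio(expression, clustering_2)
print(f"clustering 1: {score_1:.2f}, clustering 2: {score_2:.2f}")
```

Clustering 1 scores lower and is therefore preferred under this criterion. Whether such a purely numerical score agrees with biological relevance is exactly the question the slide raises.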

12 (5) Association mining - 1
[Figure: Clustering i divides the expression data into fragments A through I]
Can you associate each fragment with some meaningful data?

16 (5) Association mining - 2
[Figure: Clustering i divides the expression data into fragments A through I]
Can you associate each fragment with transcription factor binding site data (promoter analysis)?

17 Signal Transduction Pathway

18 (Gene_id, Family, Domain)
Protein id | Items
MGR4_HUMAN | G-protein Receptor Family3, Extracellular Ligand-binding Receptor
MGR4_RAT | G-protein Receptor Family3,
COPP_HELFE | Copper Ion Binding Family, Heavy Metal Transport/Detoxification Protein Domain
GLHA_RABIT | Glycoprotein Hormones Alpha Chain Family, Growth Factor Domain
Table 6: BioDataset1 Transactional Database
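Treating each protein as a transaction and its annotations as items, the basic association-mining quantities can be computed directly. The sketch below uses the rows of Table 6. The second item for MGR4_RAT is cut off in the table, so it is left out here rather than guessed, and the function names are illustrative.

```python
# Table 6 as a transactional database: protein id -> set of items.
transactions = {
    "MGR4_HUMAN": {"G-protein Receptor Family3",
                   "Extracellular Ligand-binding Receptor"},
    "MGR4_RAT": {"G-protein Receptor Family3"},  # truncated row: only the visible item
    "COPP_HELFE": {"Copper Ion Binding Family",
                   "Heavy Metal Transport/Detoxification Protein Domain"},
    "GLHA_RABIT": {"Glycoprotein Hormones Alpha Chain Family",
                   "Growth Factor Domain"},
}

def support(itemset, db):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= items for items in db.values()) / len(db)

def confidence(antecedent, consequent, db):
    """Support of (antecedent and consequent) divided by support of antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, db) / support(antecedent, db)

family = {"G-protein Receptor Family3"}
domain = {"Extracellular Ligand-binding Receptor"}
print(support(family, transactions))             # 0.5
print(confidence(family, domain, transactions))  # 0.5
```

With the truncated row counted as lacking the domain, the rule "Family3 implies ligand-binding receptor" holds for one of the two Family3 proteins. A complete table could change that figure.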

19 (Gene_id, Domain, GO Cellular Component, GO Biological Process)
Protein name | Items
 | Homeobox Protein Antennapedia Type Domain, Nucleus, Regulation of Transcription DNA-Dependent
HXA7_HUMAN | Homeobox Protein Antennapedia Type Domain, Nucleus, Regulation of Transcription DNA-Dependent
O42504 | 3',5'-Cyclic Nucleotide Phosphodiesterase Domain, ?, Signal Transduction
CN4B_HUMAN | 3',5'-Cyclic Nucleotide Phosphodiesterase Domain, ?, Signal Transduction
CN4D_RAT |
Table 7: BioDataset2 Transactional Database

20 (Gene_id, Family, Fingerprint, GO Cellular Component, GO Biological Process)
Protein name | Items
ORA1_LOCMI | GPCR Rhodopsin-like Superfamily, GPCRRHODPSN, Integral to Membrane, G-Protein Coupled Receptor Protein Signaling Pathway
ORA2_LOCMI | Octopamine Receptor Family, OCTOPAMINER,
A1AB_MESAU |
ML1A_HUMAN | Melatonin1A Receptor Family, MELATONIN1AR,
Table 8: BioDataset3 Transactional Database

21 What are the genotype features that could be used in clustering microarray expression data?
1. Codon usage
2. Degree of phylogenetic conservation
  2.1 Strong purifying selection
  2.2 Positive selection
  2.3 Neutral selection
3. Domain architecture: single or multiple domain
4. Promoter data
  4.1 Boxes (transcription factor binding sites)
  4.2 TATA regions
  4.3 Regulatory proteins (transcription factors)
5. Number of introns and exons

22 Dynamic class labeling?
[Figure: phenotype and genotype data, each indexed by GeneID]

23 Joint DMS/NIGMS Initiative to Support Research in the Area of Mathematical Biology - NSF
Examples of areas of research
1. Evolutionary theory and practice arising from genomics advances
2. Statistical and other approaches to the discovery of genes contributing to complex behavior, and their environmental interactions
3. Explanatory and predictive models of the cellular state
4. Growth, motility, cell division, membrane trafficking, and other cellular behavior
5. Metabolic circuitry and dynamics
6. Signal transduction
7. Informational molecule dynamics
8. Design principles and dynamics of pattern formation in development and differentiation
9. New approaches to the prediction of molecular structure
10. Improved algorithms for structure determination by x-ray crystallography, Nuclear Magnetic Resonance (NMR), and electron microscopy
11. Simulations of the human systemic responses to burn, trauma, and other injury
12. New approaches to understanding system-wide effects of pharmacological agents and anesthetics, and their genetic and environmental modifiers

24 Finding Pathway partners
You find Ecsit. Now study Ecsit!

25 Discussion
Use genotype and phenotype (expression intensity) together to do analysis
The efforts required to build the genotype list
How to utilize the promoter sequences in microarray experiments

