Data Mining Functionalities (2)


Data Mining Functionalities (2)
Classification and Prediction
  Finding models (functions) that describe and distinguish classes or concepts for future prediction
  E.g., classify countries based on climate, or classify cars based on gas mileage
  Presentation: decision tree, classification rule, neural network
  Prediction: predict some unknown or missing numerical values

Data Mining Functionalities (3)
Cluster analysis
  Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
  Clustering is based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity

Classification
(Diagram: learning produces a learned model; if acceptable, use it to predict)

Classification: A Two-Step Process
Model construction: describing a set of predetermined classes
  Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  The set of tuples used for model construction is the training set
  The model is represented as classification rules, decision trees, or mathematical formulae
Model usage: for classifying future or unknown objects
  Estimate the accuracy of the model
    The known label of each test sample is compared with the classified result from the model
    Accuracy rate is the percentage of test set samples that are correctly classified by the model
    The test set must be independent of the training set; otherwise the accuracy estimate is over-optimistic, because it rewards over-fitting to the training data
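
The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the slides: the majority-class learner is a deliberately trivial placeholder, and the function names (`holdout_split`, `build_majority_model`, `accuracy_rate`) and the sample data are assumptions made up for this example.

```python
import random


def holdout_split(samples, test_fraction=0.3, seed=0):
    """Split labeled samples into independent training and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def build_majority_model(training_set):
    """Step 1, model construction: a placeholder learner that always
    predicts the most frequent class label seen in the training set."""
    counts = {}
    for _features, label in training_set:
        counts[label] = counts.get(label, 0) + 1
    majority = max(counts, key=counts.get)
    return lambda features: majority


def accuracy_rate(model, test_set):
    """Step 2, model usage: percentage of test samples classified correctly."""
    correct = sum(1 for features, label in test_set if model(features) == label)
    return 100.0 * correct / len(test_set)


if __name__ == "__main__":
    # Hypothetical (features, class label) samples, made up for illustration.
    samples = [((years,), "yes" if years > 6 else "no") for years in range(1, 11)]
    training_set, test_set = holdout_split(samples)
    model = build_majority_model(training_set)
    print(f"accuracy on held-out test set: {accuracy_rate(model, test_set):.1f}%")
```

Because the test samples are never shown to the learner, the reported accuracy is an estimate of how the model behaves on unseen objects rather than a measure of how well it memorized the training set.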

Classification Process (1): Model Construction
(Diagram: Training Data, with its class label, is fed to the Algorithms, which produce the Classifier (Model))
Classifier (Model): IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’

Classification Process (2): Use the Model in Prediction
(Diagram: Testing Data and Unseen Data are fed to the Classifier)
Classifier: IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Unseen data: (Jeff, Professor, 4). Tenured?
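
The rule on this slide can be applied to the unseen tuple directly. The sketch below is only an illustration: the function name `predict_tenured` is made up, and returning 'no' when the rule does not fire is an assumption, since the slide states only the 'yes' branch.

```python
def predict_tenured(rank, years):
    """Apply the slide's rule:
    IF rank = 'professor' OR years > 6 THEN tenured = 'yes'.
    Falling back to 'no' otherwise is an assumption of this sketch."""
    if rank.strip().lower() == "professor" or years > 6:
        return "yes"
    return "no"


if __name__ == "__main__":
    # The unseen tuple from the slide: (Jeff, Professor, 4).
    name, rank, years = ("Jeff", "Professor", 4)
    print(name, "tenured?", predict_tenured(rank, years))
```

Jeff has only 4 years, so the `years > 6` condition fails, but the rank condition holds and the OR makes the rule fire.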

Supervised vs. Unsupervised Learning
Supervised learning (classification)
  Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  New data is classified based on the training set
Unsupervised learning (clustering)
  The class labels of the training data are unknown
  Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

(1) Quality Control Issue
(Figure: expression values for experiment Exp #j)
In each experiment, can you do some preliminary statistical analysis to assure that, at the least, the experiment has not been contaminated? (E.g., add a sufficient number of controls.)

(2) Generating Bias-Free Data
(Figure: expression values for experiments Exp #1 to Exp #6)
How to design the experiments so that you can offset variations caused by experimental biases and errors? E.g., redundant spot printing, switching dyes, changing the spot printing order, etc.

(3) Getting the Representative Values
(Figure: microarray expression values by GeneID)
How many experiments are needed to produce statistically significant representative expression values? (Average vs. mode)
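
To make the "average vs. mode" choice concrete, the sketch below computes both for one gene's replicate intensities. It is an illustration under stated assumptions: the function name `representative_values` is hypothetical, the readings are invented, and rounding before taking the mode is a choice made here because continuous intensities rarely repeat exactly.

```python
from statistics import mean, multimode


def representative_values(intensities, decimals=1):
    """Return the average and the mode(s) of replicate expression values.

    Values are rounded to `decimals` places before the mode is taken
    (an assumption of this sketch, not something the slide prescribes).
    """
    rounded = [round(x, decimals) for x in intensities]
    return {"average": mean(intensities), "modes": multimode(rounded)}


if __name__ == "__main__":
    # Hypothetical replicate readings for one GeneID; one spot is an outlier.
    readings = [2.0, 2.0, 2.0, 8.0]
    print(representative_values(readings))
```

With an outlying spot among the replicates, the average is pulled toward the outlier while the mode stays at the value most replicates agree on, which is the trade-off the slide's question points at.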

(4) Clustering
(Figure: the same expression data partitioned two ways, Clustering 1 and Clustering 2)
How to partition expression data? Is one partitioning better than the other, and can you tell this without performing association mining?
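
One possible way to compare two partitionings without any external annotation is an internal index built on the clustering principle of high intra-class and low inter-class similarity. The sketch below is an assumption-laden illustration, not the approach proposed in the slides: `separation_ratio` is a made-up name, Euclidean distance stands in for dissimilarity, and the four points are invented.

```python
from itertools import combinations
from math import dist


def separation_ratio(points, labels):
    """Mean within-cluster distance divided by mean between-cluster distance.

    Lower is better: members of a cluster are close to each other and far
    from members of other clusters.
    """
    intra, inter = [], []
    for (p, lp), (q, lq) in combinations(list(zip(points, labels)), 2):
        (intra if lp == lq else inter).append(dist(p, q))
    if not intra or not inter:
        raise ValueError("need at least one within-cluster and one between-cluster pair")
    return (sum(intra) / len(intra)) / (sum(inter) / len(inter))


if __name__ == "__main__":
    # Hypothetical 2-D expression profiles for four genes.
    points = [(0, 0), (0, 1), (10, 0), (10, 1)]
    clustering_1 = [0, 0, 1, 1]  # groups nearby profiles together
    clustering_2 = [0, 1, 0, 1]  # groups distant profiles together
    print("clustering 1:", round(separation_ratio(points, clustering_1), 3))
    print("clustering 2:", round(separation_ratio(points, clustering_2), 3))
```

An index like this only measures geometric compactness; whether the resulting groups are biologically meaningful is a separate question.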

(5) Association mining - 1
(Figure: Clustering i, with fragments labeled A to I)
Can you associate each fragment with some meaningful data?

(5) Association mining - 2
(Figure: Clustering i, with fragments labeled A to I)
Can you associate each fragment with transcription factor binding site data (promoter analysis)?

Signal Transduction Pathway

Table 6: BioDataset1 Transactional Database (Gene_id, Family, Domain)
Protein id | Items
MGR4_HUMAN | G-protein Receptor Family3, Extracellular Ligand-binding Receptor
MGR4_RAT | G-protein Receptor Family3,
COPP_HELFE | Copper Ion Binding Family, Heavy Metal Transport/Detoxification Protein Domain
GLHA_RABIT | Glycoprotein Hormones Alpha Chain Family, Growth Factor Domain
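
A transactional database like Table 6 is the usual input to association mining, where the first step is counting how often each item occurs across transactions. The sketch below shows that support count. It is a hedged illustration: `item_support` is a hypothetical helper, and the MGR4_RAT entry contains only the one item that is legible in the table.

```python
from collections import Counter

# Transactions transcribed from Table 6 (protein id -> set of items).
# MGR4_RAT lists only the single item visible in the table (assumption).
transactions = {
    "MGR4_HUMAN": {"G-protein Receptor Family3",
                   "Extracellular Ligand-binding Receptor"},
    "MGR4_RAT": {"G-protein Receptor Family3"},
    "COPP_HELFE": {"Copper Ion Binding Family",
                   "Heavy Metal Transport/Detoxification Protein Domain"},
    "GLHA_RABIT": {"Glycoprotein Hormones Alpha Chain Family",
                   "Growth Factor Domain"},
}


def item_support(transactions):
    """Fraction of transactions that contain each item."""
    counts = Counter(item for items in transactions.values() for item in items)
    total = len(transactions)
    return {item: count / total for item, count in counts.items()}


if __name__ == "__main__":
    for item, support in sorted(item_support(transactions).items(),
                                key=lambda pair: -pair[1]):
        print(f"{support:.2f}  {item}")
```

Items whose support clears a chosen threshold would then be combined into larger itemsets and rules; that step is omitted here.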

Table 7: BioDataset2 Transactional Database (Gene_id, Domain, GO Cellular Component, GO Biological Process)
Columns: Protein name, Items
Homeobox Protein Antennapedia Type Domain, Nucleus, Regulation of Transcription DNA-Dependent
HXA7_HUMAN
Homeobox Protein Antennapedia Type Domain, Nucleus, Regulation of Transcription DNA-Dependent
O42504
3’5’ Cyclic Nucleotide Phosphodiesterase Domain, ?, Signal Transduction
CN4B_HUMAN
3’5’ Cyclic Nucleotide Phosphodiesterase Domain, ?, Signal Transduction
CN4D_RAT

Table 8: BioDataset3 Transactional Database (Gene_id, Family, fingerprint, GO Cellular Component, GO Biological Process)
Columns: Protein name, Items
ORA1_LOCMI
GPCR Rhodopsin-like Superfamily, GPCRRHODPSN, Integral to Membrane, G-Protein Coupled Receptor Protein Signaling Pathway
ORA2_LOCMI
Octopamine Receptor Family, OCTOPAMINER,
A1AB_MESAU
ML1A_HUMAN
Melatonin1A Receptor Family, MELATONIN1AR,

What are the genotype features that could be used in clustering microarray expression data?
1. Codon usage
2. Degree of phylogenetic conservation
  2.1 Strong purifying selection
  2.2 Positive selection
  2.3 Neutral selection
3. Domain architecture: single or multiple domain
4. Promoter data
  4.1 Boxes (transcription factor binding sites)
  4.2 TATA regions
  4.3 Regulatory proteins (transcription factors)
5. Number of introns, exons

Dynamic class labeling?
(Figure: phenotype and genotype data tables, each indexed by GeneID)

Joint DMS/NIGMS Initiative to Support Research in the Area of Mathematical Biology - NSF 02-125
Examples of areas of research
1. Evolutionary theory and practice arising from genomics advances
2. Statistical and other approaches to the discovery of genes contributing to complex behavior, and their environmental interactions
3. Explanatory and predictive models of the cellular state
4. Growth, motility, cell division, membrane trafficking, and other cellular behavior
5. Metabolic circuitry and dynamics
6. Signal transduction
7. Informational molecule dynamics
8. Design principles and dynamics of pattern formation in development and differentiation
9. New approaches to the prediction of molecular structure
10. Improved algorithms for structure determination by X-ray crystallography, nuclear magnetic resonance (NMR), and electron microscopy
11. Simulations of the human systemic responses to burn, trauma, and other injury
12. New approaches to understanding system-wide effects of pharmacological agents and anesthetics, and their genetic and environmental modifiers

Finding Pathway Partners
You find Ecsit. Now study Ecsit!

Discussion
  Use genotype and phenotype (expression intensity) together to do analysis
  The efforts required to build the genotype list
  How to utilize the promoter sequences in microarray experiments