Data Mining in Genomics: the dawn of personalized medicine

Gregory Piatetsky-Shapiro, KDnuggets
www.KDnuggets.com/gps.html
Connecticut College, October 15, 2003

Overview
- Data Mining and Knowledge Discovery
- Genomics and Microarrays
- Microarray Data Mining

Trends leading to Data Flood
- More data is generated:
  - Bank, telecom, other business transactions ...
  - Scientific data: astronomy, biology, etc.
  - Web, text, and e-commerce
- More data is captured:
  - Storage technology faster and cheaper
  - DBMS capable of handling bigger DB

Knowledge Discovery Process
[Diagram labels, roughly in pipeline order: Raw Data, Integration, Data Warehouse, Selection & Cleaning, Target Data, Transformation, Transformed Data, Data Mining, Patterns and Rules, Interpretation & Evaluation, Knowledge, Understanding]

Major Data Mining Tasks
- Classification: predicting an item class
- Clustering: finding clusters in data
- Associations: e.g. A & B & C occur frequently
- Visualization: to facilitate human discovery
- Summarization: describing a group
- Estimation: predicting a continuous value
- Deviation Detection: finding changes
- Link Analysis: finding relationships

Major Application Areas for Data Mining Solutions
- Advertising
- Bioinformatics
- Customer Relationship Management (CRM)
- Database Marketing
- Fraud Detection
- eCommerce
- Health Care
- Investment/Securities
- Manufacturing, Process Control
- Sports and Entertainment
- Telecommunications
- Web

Genome, DNA & Gene Expression
- An organism's genome is the "program" for making the organism, encoded in DNA
  - Human DNA has about 30,000-35,000 genes
  - A gene is a segment of DNA that specifies how to make a protein
- Cells are different because of differential gene expression
  - About 40% of human genes are expressed at one time
- Microarray devices measure gene expression

Molecular Biology Overview
[Figure labels: Cell, Nucleus, Chromosome, Gene (DNA), Gene (mRNA), single strand, Protein, Gene expression. Graphics courtesy of the National Human Genome Research Institute]

Affymetrix Microarrays
[Figure labels: 1.28 cm, 50 µm]
- ~10^7 oligonucleotides, half Perfectly Match mRNA (PM), half have one Mismatch (MM)
- Gene expression computed from PM and MM

Affymetrix Microarray Raw Image
[Figure labels: Scanner, raw data, enlarged section of raw image]

Gene             Value
D26528_at          193
D26561_cds1_at     -70
D26561_cds2_at     144
D26561_cds3_at      33
D26579_at          318
D26598_at         1764
D26599_at         1537
D26600_at         1204
D28114_at          707

Microarray Potential Applications
- New and better molecular diagnostics
- New molecular targets for therapy
  - few new drugs, large pipeline, …
- Outcome depends on genetic signature
  - best treatment?
- Fundamental Biological Discovery
  - finding and refining biological pathways
- Personalized medicine ?!

Microarray Data Mining Challenges
- Avoiding false positives, due to
  - too few records (samples), usually < 100
  - too many columns (genes), usually > 1,000
- Model needs to be robust in presence of noise
- For reliability need large gene sets; for diagnostics or drug targets, need small gene sets
- Estimate class probability
- Model needs to be explainable to biologists

False Positives in Astronomy
[Cartoon, used with permission]

CATs: Clementine Application Templates
- CATs: examples of complete data mining processes
- Microarray CAT [diagram labels: Preparation, 2-Class, Multi-Class, Clustering]

Key Ideas
- Capture the complete process
- Cross-validation loop with feature selection inside
- Randomization to select significant genes
- Internal iterative feature selection loop
- For each class, separate selection of optimal gene sets
- Neural nets: robust in presence of noise
- Bagging of neural nets

Microarray Classification
[Diagram labels: Data, Train data, Test data, Feature and Parameter Selection, Model Building, Evaluation]

Classification: External X-val
[Diagram labels: Gene Data, Train, Train data, Test data, Feature and Parameter Selection, Model Building, Evaluation, Final Model, Final Test, Final Results]
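
A minimal sketch of cross-validation with feature selection inside the loop. The point is that genes are re-selected from the training part of every fold, so the held-out samples never influence which genes are chosen. Everything here is an assumption made for illustration: the data is synthetic, the function names are invented, and a nearest-centroid classifier stands in for the neural nets used in the talk.

```python
import math
import random
from statistics import mean, variance


def t_value(a, b):
    """T-value for the difference of means of two samples (unequal variances)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def select_genes(rows, labels, k):
    """Rank genes by |T| using ONLY the rows passed in; return the top k column indices."""
    scores = []
    for g in range(len(rows[0])):
        a = [r[g] for r, y in zip(rows, labels) if y == 0]
        b = [r[g] for r, y in zip(rows, labels) if y == 1]
        scores.append((abs(t_value(a, b)), g))
    return [g for _, g in sorted(scores, reverse=True)[:k]]


def centroid_predict(train_rows, train_labels, genes, row):
    """Nearest-centroid prediction restricted to the selected genes (stand-in model)."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        members = [r for r, y in zip(train_rows, train_labels) if y == label]
        centroid = [mean(r[g] for r in members) for g in genes]
        dist = sum((row[g] - c) ** 2 for g, c in zip(genes, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def cross_val_error(rows, labels, k_genes, folds=5, seed=0):
    """Cross-validated error rate with gene selection repeated inside every fold."""
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    errors = 0
    for f in range(folds):
        test_idx = set(order[f::folds])
        train_rows = [rows[i] for i in order if i not in test_idx]
        train_labels = [labels[i] for i in order if i not in test_idx]
        genes = select_genes(train_rows, train_labels, k_genes)  # training fold only
        for i in test_idx:
            if centroid_predict(train_rows, train_labels, genes, rows[i]) != labels[i]:
                errors += 1
    return errors / len(rows)


# Synthetic data: gene 0 differs between the classes, the other 49 genes are noise.
rng = random.Random(1)
labels = [0] * 20 + [1] * 20
rows = [[rng.gauss(4.0 * y, 1.0)] + [rng.gauss(0.0, 1.0) for _ in range(49)]
        for y in labels]
print(cross_val_error(rows, labels, k_genes=2))
```

Selecting genes once on all samples and only then cross-validating would leak test information into the gene list and make the error estimate look better than it is.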

Measuring false positives with randomization
[Figure: a Gene column (178, 105, 4174, 7133) next to a Class column (1, 2, 2, 1); after randomization the Class column is reshuffled (2, 1, ...)]
- Randomize 500 times
- Bottom 1% T-value = -2.08
- Select potentially interesting genes at 1%
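
The randomization step can be sketched as follows: shuffle the class column many times, collect the T-values that arise by chance, and use the bottom and top 1% of that distribution as cutoffs for the real T-values. This is a sketch under assumptions, not the Clementine stream from the talk: the data is synthetic, the function names are invented, and the T-value is the unequal-variance form.

```python
import math
import random
from statistics import mean, variance


def t_value(values, labels):
    """T-value of the mean difference between class 1 and class 2 for one gene."""
    a = [v for v, y in zip(values, labels) if y == 1]
    b = [v for v, y in zip(values, labels) if y == 2]
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def randomized_cutoffs(genes, labels, rounds=500, tail=0.01, seed=0):
    """Shuffle the class column `rounds` times; return the bottom and top `tail`
    quantiles of all T-values observed under the shuffled labels."""
    rng = random.Random(seed)
    shuffled = list(labels)
    null_t = []
    for _ in range(rounds):
        rng.shuffle(shuffled)
        null_t.extend(t_value(values, shuffled) for values in genes)
    null_t.sort()
    cut = int(tail * len(null_t))
    return null_t[cut], null_t[-cut - 1]


# Synthetic data: 19 noise genes plus one gene that really differs between classes.
rng = random.Random(42)
labels = [1] * 10 + [2] * 10
genes = [[rng.gauss(0.0, 1.0) for _ in labels] for _ in range(19)]
genes.append([rng.gauss(5.0 if y == 1 else 0.0, 1.0) for y in labels])

low, high = randomized_cutoffs(genes, labels)
interesting = [i for i, values in enumerate(genes)
               if not low <= t_value(values, labels) <= high]
print(low, high, interesting)
```

Genes whose real T-value falls inside the cutoffs are no more extreme than what random labels produce, so they are likely false positives.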

Gene Reduction improves Classification
- Most learning algorithms look for non-linear combinations of features
  - they can easily find many spurious combinations given a small number of records and a large number of genes
- Classification accuracy improves if we first reduce the number of genes by a linear method, e.g. T-values of mean difference
- Heuristic: select an equal number of genes from each class
- Then apply a favorite machine learning algorithm
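
A small sketch of the reduction step: score every gene by the T-value of its mean difference between two classes, then keep an equal number of genes from each end of the ranking. The gene names and expression values are made up for illustration, and the unequal-variance T-value is an assumed choice, since the slides do not spell out the formula.

```python
import math
from statistics import mean, variance


def t_value(a, b):
    """T-value of the mean difference: (mean_a - mean_b) / standard error."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def reduce_genes(expression, labels, per_class):
    """expression: {gene name: [value per sample]}; labels: class name per sample
    (exactly two classes). Returns the `per_class` genes most elevated in each class."""
    first, second = sorted(set(labels))
    scores = {}
    for gene, values in expression.items():
        a = [v for v, y in zip(values, labels) if y == first]
        b = [v for v, y in zip(values, labels) if y == second]
        scores[gene] = t_value(a, b)
    ranked = sorted(scores, key=scores.get, reverse=True)
    # Most positive T-values favour the first class, most negative the second.
    return {first: ranked[:per_class], second: ranked[::-1][:per_class]}


# Toy values, not taken from any real dataset.
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
expression = {
    "gene_a": [9.0, 10.0, 11.0, 1.0, 2.0, 3.0],   # high in ALL
    "gene_b": [1.0, 2.0, 3.0, 9.0, 10.0, 11.0],   # high in AML
    "gene_c": [5.0, 6.0, 7.0, 5.0, 7.0, 6.0],     # no difference
    "gene_d": [4.0, 6.0, 5.0, 4.0, 5.0, 3.0],     # weak difference
}
print(reduce_genes(expression, labels, per_class=1))
# → {'ALL': ['gene_a'], 'AML': ['gene_b']}
```

The reduced gene list is then the input to whichever learning algorithm is applied next.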

Iterative Wrapper approach to selecting the best gene set
- Test models using 1, 2, 3, …, 10, 20, 30, 40, ..., 100 top genes with cross-validation
- Heuristic 1: evaluate errors from each class; select the number of genes from each class that minimizes error for that class
- For randomized algorithms, average 10+ cross-validation runs!
- Select gene set with lowest average error
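
The final selection step can be sketched as: collect the error rate of every cross-validation run for each candidate gene count, average them, and pick the count with the lowest average. The function name and the error values are hypothetical and are not results from the talk; the tie-break toward the smaller gene set is also an assumption.

```python
from statistics import mean, stdev


def best_gene_count(errors_by_count):
    """errors_by_count: {genes per class: [error rate of each cross-validation run]}.
    Returns (genes per class, mean error, standard deviation) for the lowest mean
    error; ties go to the smaller gene set."""
    summary = {n: (mean(errs), stdev(errs)) for n, errs in errors_by_count.items()}
    best = min(summary, key=lambda n: (summary[n][0], n))
    return best, summary[best][0], summary[best][1]


# Made-up error rates from three runs per gene count, for illustration only.
runs = {
    1: [0.10, 0.20, 0.10],
    5: [0.05, 0.10, 0.05],
    10: [0.00, 0.05, 0.00],
    20: [0.05, 0.05, 0.10],
}
count, avg_error, spread = best_gene_count(runs)
print(count, round(avg_error, 3), round(spread, 3))
```

Keeping the standard deviation alongside the mean makes it possible to see whether the winning gene count is clearly better or within the run-to-run noise of its neighbours.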

Clementine stream for subset selection by x-validation

Microarrays: ALL/AML Example
- Leukemia: Acute Lymphoblastic (ALL) vs Acute Myeloid (AML), Golub et al., Science, v. 286, 1999
  - 72 examples (38 train, 34 test), about 7,000 genes
  - well-studied (CAMDA-2000), good test example
- [Images: ALL, AML] Visually similar, but genetically very different

Gene subset selection: one X-validation
[Chart: single cross-validation run]

Gene subset selection: multiple cross-validation runs
- For ALL/AML data, 10 genes per class had the lowest error (< 1%)
- [Chart] The point in the center is the average error from 10 cross-validation runs
- Bars indicate 1 standard deviation above and below

ALL/AML: Results on the test data
- Genes selected and model trained on Train set ONLY!
- Best net with 10 top genes per class (20 overall) was applied to the test data (34 samples):
  - 33 correct predictions (97% accuracy)
  - 1 error on sample 66: actual class AML, net prediction ALL
  - other methods consistently misclassify sample 66; misclassified by a pathologist?

Pediatric Brain Tumour Data
- 92 samples, 5 classes (MED, EPD, JPA, MGL, RHB) from U. of Chicago Children's Hospital
- Outer cross-validation with gene selection inside the loop
- Ranking by absolute T-test value (selects top positive and negative genes)
- Select best genes by adjusted error for each class
- Bagging of 100 neural nets

Selecting Best Gene Set
- Minimizing combined error for all classes is not optimal
- [Chart: average, high and low error rate for all classes]

Error rates for each class
[Chart: error rate for each class; x-axis: Genes per Class]

Evaluating One Network
Averaged over 100 Networks:

Class    Error rate
MED      2.1%
MGL      17%
RHB      24%
EPD      9%
JPA      19%
*ALL*    8.3%

Bagging 100 Networks

Class    Individual Error Rate    Bag Error Rate    Bag Avg Conf
MED      2.1%                     2% (0)*           98%
MGL      17%                      10%               83%
RHB      24%                      11%               76%
EPD      9%                                         91%
JPA      19%                                        81%
*ALL*    8.3%                     3% (2)*           92%

* Note: suspected error on one sample (labeled as MED but consistently classified as RHB)
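
A minimal sketch of bagging by majority vote. Several assumptions apply: each model is trained on a bootstrap resample of the training data (the talk does not say how its 100 networks were varied), a nearest-centroid classifier stands in for a neural net, the vote share is used as a rough confidence (not necessarily the "Bag Avg Conf" above), and the data and class names are synthetic.

```python
import random
from collections import Counter
from statistics import mean


def train_centroids(rows, labels):
    """Stand-in base learner: one mean vector per class."""
    model = {}
    for label in set(labels):
        members = [r for r, y in zip(rows, labels) if y == label]
        model[label] = [mean(col) for col in zip(*members)]
    return model


def predict(model, row):
    """Return the class whose centroid is closest to the row."""
    return min(model, key=lambda label: sum((x - c) ** 2
                                            for x, c in zip(row, model[label])))


def bag(rows, labels, n_models=100, seed=0):
    """Train each model on a bootstrap resample (drawn with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(len(rows)) for _ in rows]
        models.append(train_centroids([rows[i] for i in idx],
                                      [labels[i] for i in idx]))
    return models


def bag_predict(models, row):
    """Majority vote over all models; the vote share serves as a rough confidence."""
    votes = Counter(predict(m, row) for m in models)
    label, count = votes.most_common(1)[0]
    return label, count / len(models)


# Synthetic two-class data: the first feature separates the classes, the second is noise.
rng = random.Random(7)
labels = ["A"] * 10 + ["B"] * 10
rows = [[rng.gauss(0.0 if y == "A" else 4.0, 1.0), rng.gauss(0.0, 1.0)]
        for y in labels]
models = bag(rows, labels)
print(bag_predict(models, [5.0, 0.0]))
```

Averaging many unstable models tends to reduce the error of any single one, which is the pattern the table shows for the bag versus the individual networks.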

AF1q: New Marker for Medulloblastoma?
- AF1Q: ALL1-fused gene from chromosome 1q
- transmembrane protein
- Related to leukemia (3 PubMed entries) but not to medulloblastoma

Future directions for Microarray Analysis
- Algorithms optimized for small samples
- Integration with other data
  - biological networks
  - medical text
  - protein data
- Cost-sensitive classification algorithms
  - error cost depends on outcome (don't want to miss treatable cancer), treatment side effects, etc.

Acknowledgements
- Eric Bremer, Children's Hospital (Chicago) & Northwestern U.
- Greg Cooper, U. Pittsburgh
- Tom Khabaza, SPSS
- Sridhar Ramaswamy, MIT/Whitehead Institute
- Pablo Tamayo, MIT/Whitehead Institute

Thank you
Further resources on Data Mining: www.KDnuggets.com
Microarrays: www.KDnuggets.com/websites/microarray.html
Contact: Gregory Piatetsky-Shapiro: www.kdnuggets.com/gps.html