The UNIVERSITY of Kansas EECS 800 Research Seminar Mining Biological Data Instructor: Luke Huan Fall, 2006.


Mining Biological Data, KU EECS 800, Luke Huan, Fall’06, 9/6/2006, Frequent Patterns

Slide 2: Administrative
- The book “Elements of Statistical Learning” is on reserve in the engineering library.
- The other two books (Data Mining, Bioinformatics) have been recalled.
- Presentation paper selection is due this Friday; 3 topics remain:
  - Data Mining in Systems Biology
  - Data Mining in Proteomics
  - Analyzing Bionetworks

Slide 3: Administrative
Paper presenter:
- Always keep in mind the following four “w” questions in your presentation:
  - What is the problem?
  - Why is the problem important (“who cares”)?
  - What is the related work (“why bother”)?
  - What are the impacts of the presented work (“so what”)?
- Define and explain your computational task.
- Give intuitions before discussing details.
- Present the pros and cons of the methods.
Audience:
- Ask at least one question.

Slide 4: Outline
- What is a microarray?
  - Terms from molecular biology (this is NOT Bio101)
  - Goals
  - Raw data collection
  - Raw data analysis
- Frequent pattern discovery in microarray data analysis

Slide 5: Microarray
Microarrays are currently used to do many different things:
- to detect and measure gene expression at the mRNA level;
- to find mutations and to genotype;
- to sequence DNA;
- to locate chromosomal changes;
- and more.
There are many different types of microarrays:
- cDNA chips
- Affymetrix chips

Slide 6: Goals of a Microarray Experiment
- Find the genes that change expression between experimental and control samples.
- Classify samples based on a gene expression profile.
- Find patterns: groups of biologically related genes that change expression together across samples/treatments.

Slide 7: Microarray Procedure
In general there are two basic aspects of microarrays:
- Data acquisition
  - producing chips;
  - preparing samples for detection;
  - hybridization;
  - scanning.
- Data analysis
  - Low-level analysis: normalization and significance tests
  - High-level analysis: clustering, classification, and pattern discovery
We are interested in the data analysis part. However, it is dangerous to go into data analysis without knowing how the data are collected.

Slide 8: Microarrays are Popular
PubMed search for "microarray": 13,948 papers
- 2005: 4406
- 2004: 3509
- 2003: 2421
- 2002: 1557
- 2001: 834
- 2000: 294

Slide 9: Necessary Background
- Gene
- Central Dogma
- DNA
- RNA
- Nucleic acid hybridization

Slide 10: Genes
- The human genome contains 23 pairs of chromosomes. In each pair, one chromosome is paternally inherited, the other maternally inherited.
- Chromosomes are made of compressed and entwined DNA.
- A (protein-coding) gene is a segment of chromosomal DNA that directs the synthesis of a protein.

Slide 11: Central Dogma
The expression of the genetic information stored in the DNA molecule occurs in two stages:
- (i) transcription, during which DNA is transcribed into mRNA;
- (ii) translation, during which mRNA is translated to produce a protein.
DNA → mRNA → protein

Slide 12: DNA
- A deoxyribonucleic acid (DNA) molecule is a double-stranded polymer built from basic molecular units called nucleotides.
- There are four types of nucleotides, distinguished by their bases: adenine (A), guanine (G), cytosine (C), and thymine (T).
- Base-pairing occurs according to the following rule: G pairs with C, and A pairs with T.

Slide 13: RNA
- A ribonucleic acid (RNA) molecule is a nucleic acid similar to DNA, but
  - it is single-stranded;
  - uracil (U) replaces thymine (T) as one of the bases.
- RNA plays an important role in protein synthesis and other chemical activities of the cell.
- There are several classes of RNA molecules: messenger RNA (mRNA), transfer RNA (tRNA), ribosomal RNA (rRNA), and other small RNAs.

Slide 14: Nucleic acid hybridization (here DNA-RNA)
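The base-pairing rule is what makes a probe on the chip capture its matching RNA. The following is a minimal sketch, assuming perfect antiparallel complementarity (real hybridization tolerates some mismatches); the function names `complementary_rna` and `hybridizes` are illustrative and do not come from the slides.

```python
# Watson-Crick pairing between a DNA base and the RNA base it binds.
# In RNA, uracil (U) takes the place of thymine (T).
DNA_TO_RNA_PAIR = {"A": "U", "T": "A", "G": "C", "C": "G"}


def complementary_rna(dna: str) -> str:
    """Return the RNA strand (read 5'->3') that pairs perfectly with `dna`."""
    # The two strands run antiparallel, so pair each base and reverse the order.
    return "".join(DNA_TO_RNA_PAIR[base] for base in reversed(dna.upper()))


def hybridizes(dna_probe: str, rna_target: str) -> bool:
    """True if the RNA target is the exact complement of the DNA probe."""
    return complementary_rna(dna_probe) == rna_target.upper()


if __name__ == "__main__":
    print(complementary_rna("ATGC"))
    print(hybridizes("ATGC", "GCAU"))
```

A probe of 25 bases, as on the Affymetrix chips described later, would be checked the same way, only with a longer string.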

Slide 15: DNA Chip Microarrays
- Put a large number (~100K) of DNA sequences or synthetic DNA oligomers onto a glass slide in known locations on a grid.
- Measure the amount of RNA bound to each square in the grid.
- Make comparisons:
  - Cancerous vs. normal tissue
  - Treated vs. untreated
  - Time course
- Many applications in both basic and clinical research.

Slide 16: GeneChip

Slide 17: Spot your own Chip
- Robot spotter
- Ordinary glass microscope slide

Slide 18: Affymetrix “Gene chip” system
- Commercial product.
- Uses 25-base oligos synthesized in place on a chip (20 pairs of oligos for each gene).
- RNA is labeled and scanned in a single “color”.
- Currently mass-produced arrays target 17 different organisms.
- More than 40 different array types/sets.
- Proprietary system: “black box” software, can only use their chips.

Slide 19: Data Acquisition in Microarray
- Scan the arrays.
- Quantitate each spot.
- Subtract background.
- Normalize.
- Export a table of fluorescent intensities for each gene in the array.

Slide 20: Hybridization to the Chip

Slide 21: The Chip is Scanned

Slide 22: Images

Slide 23: Data Acquisition
[Figure: cDNA microarray workflow. cDNA clones (probes) → PCR product amplification → purification → printing → microarray; hybridise target (mRNA) to microarray; scanning with laser 1 and laser 2 (emission); analysis: overlay images and normalize.]

Slide 24: Streamlined Array Analysis
[Figure: analysis workflow. Raw data → Normalize (RMA) → Filter (present/absent, minimum value, fold change) → Significance (t-test, SAM, Rank Product), Classification (PAM, machine learning), Clustering (hierarchical clustering, biclustering) → Gene lists → Function (Gene Ontology).]

Slide 25: Lower Level Data Analysis
- Normalization: when you have variability in measurements, you need replication and statistics to find real differences.
- Significance test: it is not just the genes with a 2-fold increase that matter, but those with a significant p-value across replicates.
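To see why fold change alone is not enough, the sketch below compares two genes that both show a 2-fold increase but differ in how consistent their replicates are. It is only an illustration: the intensity values are made up, the function names are hypothetical, and it computes a Welch t statistic with the standard library rather than a p-value (which would need a t-distribution, e.g. from a statistics package).

```python
import math
import statistics


def log2_fold_change(treated, control):
    """log2 of the ratio of mean treated intensity to mean control intensity."""
    return math.log2(statistics.mean(treated) / statistics.mean(control))


def welch_t(treated, control):
    """Welch's t statistic: mean difference divided by its standard error."""
    se_treated = statistics.variance(treated) / len(treated)
    se_control = statistics.variance(control) / len(control)
    diff = statistics.mean(treated) - statistics.mean(control)
    return diff / math.sqrt(se_treated + se_control)


if __name__ == "__main__":
    # Hypothetical replicate intensities for two genes (3 replicates each).
    noisy_control, noisy_treated = [100, 400, 250], [900, 200, 400]
    stable_control, stable_treated = [240, 250, 260], [490, 500, 510]

    for name, t, c in [("noisy", noisy_treated, noisy_control),
                       ("stable", stable_treated, stable_control)]:
        print(name, log2_fold_change(t, c), round(welch_t(t, c), 2))
```

Both genes have the same fold change, but only the gene with consistent replicates yields a large t statistic.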

Slide 26: Sources of Variability in Raw Data
- Biological variability
- Sample preparation
  - Probe labeling
  - RNA extraction
- Experimental condition
  - temperature, time, mixing, etc.
- Scanning
  - laser and detector, chemistry of the fluorescent label
- Image analysis
  - identifying and quantifying each spot on the array

Slide 27: False colour overlay; self-self hybridizations

Slide 28: Variability
Scatter plot of all genes in a simple comparison of two controls (A) and two treatments (B: high vs. low glucose), showing changes in expression greater than 2.2-fold and 3-fold.

Slide 29: Data Normalization
- Can control for many of the experimental sources of variability (systematic, not random or gene-specific).
- Bring each image to the same average brightness.
- Can use simple math or fancy:
  - divide by the mean (whole chip or by sectors);
  - LOESS (locally weighted regression).
- No sure biological standards.
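The "simple math" option can be sketched as follows: divide each array by its own mean and rescale to a shared target, so every array ends up with the same average brightness. This is a whole-chip version only (no sectors, no LOESS), and the function name `scale_to_common_mean` and the sample intensities are assumptions made for illustration.

```python
def scale_to_common_mean(arrays, target=None):
    """Rescale each array of intensities so that all arrays share one mean.

    arrays: list of lists of spot intensities, one inner list per chip.
    target: the mean every chip is brought to; defaults to the average of
            the per-chip means.
    """
    means = [sum(a) / len(a) for a in arrays]
    if target is None:
        target = sum(means) / len(means)
    # Dividing by the chip mean removes the chip-wide brightness difference.
    return [[x * target / m for x in a] for a, m in zip(arrays, means)]


if __name__ == "__main__":
    # Hypothetical data: the second chip was scanned twice as bright.
    chip_a = [100, 200, 300]
    chip_b = [200, 400, 600]
    print(scale_to_common_mean([chip_a, chip_b]))
```

Because the correction is a single factor per chip, it removes only chip-wide (systematic) effects and leaves gene-specific differences untouched.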

Slide 30: Significance Test
- In a microarray experiment, each gene (each probe or probe set) is really a separate experiment.
- Yet if you treat each gene as an independent comparison, you will always find some with significant differences (the tails of a normal distribution).

Slide 31: False Discovery
- Statisticians call a false positive a "type I error" or a "false discovery".
- The expected number of false discoveries equals the p-value threshold of the t-test × the number of genes in the array. (Strictly, the false discovery rate, FDR, is the expected fraction of false positives among the genes reported as different.)
- For a p-value of 0.01 × 10,000 genes = 100 false “different” genes.
- You cannot eliminate false positives, but by choosing a more stringent p-value you can keep them manageable (try p = 0.001).
- The number of false discoveries must be smaller than the number of real differences that you find, which in turn depends on the size of the differences and the variability of the measured expression values.
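The arithmetic on this slide can be written out directly. The sketch below assumes every gene is tested independently and that no gene is truly different, which is the worst case the slide describes; the function names and the figure of 400 reported genes are hypothetical.

```python
def expected_false_positives(p_threshold: float, n_genes: int) -> float:
    """Expected number of genes passing the threshold by chance alone."""
    return p_threshold * n_genes


def false_discovery_fraction(p_threshold: float, n_genes: int, n_called: int) -> float:
    """Rough estimate of the fraction of reported genes that are false positives."""
    return min(1.0, expected_false_positives(p_threshold, n_genes) / n_called)


if __name__ == "__main__":
    print(expected_false_positives(0.01, 10_000))   # the slide's example
    print(expected_false_positives(0.001, 10_000))  # the more stringent threshold
    # Hypothetical: 400 genes were reported as different at p = 0.01.
    print(false_discovery_fraction(0.01, 10_000, 400))
```

If the experiment reports fewer genes than the expected number of false positives, the list is indistinguishable from chance, which is the point of the last bullet above.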

Slide 32: Higher Level Data Analysis
Computational tasks:
- Clustering
- Classification
- Statistical validation
- Data visualization
- Pattern detection
Biological problems:
- Discovery of common sequences in co-regulated genes
- Meta-studies using data from multiple experiments
- Linkage between gene expression data and gene sequence/function/metabolic pathway databases

Slide 33: Types of Clustering
- Hierarchical: link similar genes, build up to a tree of all genes.
- Self-Organizing Maps (SOM): split all genes into similar sub-groups; finds its own groups (machine learning).
- Principal Component: every gene is a dimension (vector); find a single dimension that best represents the differences in the data.

Slide 34: Cluster by Color Difference

Slide 35: GeneSpring

Slide 36: Classification
How to sort samples into two classes based on gene expression data:
- Cancer vs. normal
- Cancer sub-types: benign vs. malignant
- Responds well to drug vs. poor response

Slide 37: Functional Genomics
- Take a list of "interesting" genes and find their biological relationships.
- Gene lists may come from significance/classification analysis of microarrays, proteomics, or other high-throughput methods.
- Requires a reference set of "biological knowledge".

Slide 38: GO
- Biologists got together a few years ago and developed a sensible system called Gene Ontology (GO).
- 3 hierarchical sets of terminology:
  - Biological Process
  - Cellular Component (location within cell)
  - Molecular Function
- About 1000 categories of functions.

Slide 39: Gene Ontology

Slide 40: Biological Pathways

Slide 41: Microarray Databases
- Large experiments may have hundreds of individual array hybridizations.
- Core lab at an institution, or multiple investigators using one machine: archive data and validate across experiments.
- Data mining: look for similar patterns of gene expression across different experiments.

Slide 42: Public Databases
- Gene expression data is an essential aspect of annotating the genome.
- Publication and data exchange for microarray experiments.
- Data mining/meta-studies.
- Common data format: XML.
- MIAME (Minimum Information About a Microarray Experiment).

Slide 43: GEO at the NCBI

Slide 44: ArrayExpress at EMBL

Slide 45: ArrayExpress at EMBL

Slide 46: Are the Treatments Different?
- Analysis of microarray data has tended to focus on making lists of genes that are up- or down-regulated between treatments.
- Before making these lists, ask the question: "Are the treatments different?"
- Use standard statistical methods to evaluate expression profiles for each treatment (t-test or F-test).
- If there are differences, find the genes most responsible.
- If there are no significant overall differences, then lists of genes with large fold changes may only reflect random variability.

Slide 47: Association Rules in Microarray Analysis
- Row enumeration vs. column enumeration:
  - Suppose we have m rows and I columns.
  - The search space of column enumeration is $2^I$.
  - Row enumeration reduces the search space to $2^m$ when m << I.
- Supports rule set pruning using minimum support, minimum confidence, and chi-square.
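A quick calculation shows how large the gap between the two search spaces is. The sizes used here (40 samples as rows, 10,000 genes as columns) are assumed for illustration and are not taken from the slides.

```python
def column_search_space(n_items: int) -> int:
    """Number of item (column) subsets a column-enumeration miner may visit."""
    return 2 ** n_items


def row_search_space(n_rows: int) -> int:
    """Number of row subsets a row-enumeration miner may visit."""
    return 2 ** n_rows


if __name__ == "__main__":
    rows, items = 40, 10_000  # hypothetical microarray dimensions
    print("row enumeration:", row_search_space(rows))
    # The column count is too long to print usefully, so show its digit count.
    print("column enumeration has",
          len(str(column_search_space(items))), "decimal digits")
```

Both spaces are exponential, so pruning is still needed, but the row space is at least one a well-pruned search can hope to cover.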

Slide 48: Background – Rule Groups
- A microarray data set contains a large number of rules.
- Rule sets contain a lot of redundancy, which makes interpretation difficult.
  - Example: 31 rules could be generated from a single row {a, b, c, d, e} with class label Cancer, all with the same support.
- FARMER finds rule groups:
  - The 31 rules above would be grouped together.
- FARMER only finds interesting rule groups:
  - Consider abcd → Cancer with a confidence of 90% and ab → Cancer with a confidence of 95%. All rows covered by abcd must be covered by ab. Therefore, abcd → Cancer is not interesting.
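The count of 31 comes from the non-empty subsets of the five items, each of which can serve as the antecedent of a rule with consequent Cancer (2^5 - 1 = 31). A short sketch, with the helper name `antecedents` being an illustrative assumption:

```python
from itertools import combinations


def antecedents(items):
    """All non-empty subsets of `items`, each a candidate rule antecedent."""
    items = sorted(items)
    return [frozenset(combo)
            for size in range(1, len(items) + 1)
            for combo in combinations(items, size)]


if __name__ == "__main__":
    rules = antecedents("abcde")
    print(len(rules))
    for antecedent in rules[:5]:
        print("".join(sorted(antecedent)), "-> Cancer")
```

Grouping these rules into one rule group is what keeps the output readable when rows contain thousands of items instead of five.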

Slide 49: Preliminaries and Definitions
- D = dataset (rows), R = set of rows {r1,…,rn}, I = set of items {i1,…,im}, C = set of class labels {c1,…,ck}.
- Row support set: given an item set I’, R(I’) is the set of rows that contain I’.
- Item support set: given a row set R’, I(R’) is the set of items common to all the rows in R’.

Slide 50: Example
- I’ = {a,e,h}, R(I’) = {r2,r3,r4}
- R’ = {r2,r3}, I(R’) = {a,e,h}
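The two support sets can be computed with plain set operations. The table the example refers to is not reproduced in this transcript, so the `TABLE` below is a hypothetical five-row dataset chosen so that it gives the same results as the example; the function names are also illustrative.

```python
# Hypothetical dataset: each row (sample) maps to the set of items it contains.
TABLE = {
    "r1": set("abclos"),
    "r2": set("adehplr"),
    "r3": set("acehoqt"),
    "r4": set("aefhpr"),
    "r5": set("bdfglqst"),
}


def row_support_set(items, table=TABLE):
    """R(I'): the rows that contain every item in `items`."""
    wanted = set(items)
    return {row for row, row_items in table.items() if wanted <= row_items}


def item_support_set(rows, table=TABLE):
    """I(R'): the items common to all rows in `rows` (rows must be non-empty)."""
    return set.intersection(*(table[row] for row in rows))


if __name__ == "__main__":
    print(sorted(row_support_set({"a", "e", "h"})))
    print(sorted(item_support_set({"r2", "r3"})))
```

Note that applying R to I({r2, r3}) returns three rows, not two: r4 also contains a, e, and h, which is why {r2, r3} on its own is not the support set of any item set.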

Slide 51: Preliminaries and Definitions
- IRG (interesting rule group): rules whose item sets come from the same set of rows are clustered into one entity.
  - Reduces the number of rules discovered.
  - Each IRG corresponds to a unique row set R.
  - Each IRG has a unique upper bound rule, e.g. $ab \to \mathrm{Cancer}$.
- Rule group, definition 1: $G = \{A_i \to C \mid A_i \subseteq I\}$ is a rule group with (antecedent) support set $R$ and consequent $C$ iff
  1) $\forall\, A_i \to C \in G:\ R(A_i) = R$. For every rule in group $G$, the set of rows that contain $A_i$ is $R$.
  2) $\forall\, A_i \subseteq I$ with $R(A_i) = R:\ A_i \to C \in G$. For every item set whose row support set equals $R$, the rule $A_i \to C$ exists in group $G$.

Slide 52: Preliminaries and Definitions
Rule group, definition 1 (continued):
- Rule $\gamma_u \in G$ ($\gamma_u: A_u \to C$) is an upper bound of $G$ iff there exists no $\gamma' \in G$ ($\gamma': A' \to C$) such that $A' \supset A_u$.
- Rule $\gamma_l \in G$ ($\gamma_l: A_l \to C$) is a lower bound of $G$ iff there exists no $\gamma' \in G$ ($\gamma': A' \to C$) such that $A' \subset A_l$.
- Lemma 1: Given a rule group $G$ with consequent $C$ and antecedent support set $R$, $G$ has a unique upper bound $\gamma$ ($\gamma: A \to C$).
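The definitions of rule group, upper bound, and lower bound can be checked by brute force on a small table. This is a sketch for building intuition, not the FARMER algorithm: it enumerates every subset of the common items, ignores the consequent C (only antecedents are listed), and uses a hypothetical `TABLE` and hypothetical function names.

```python
from itertools import combinations

# Hypothetical dataset: each row maps to the set of items it contains.
TABLE = {
    "r1": set("abclos"),
    "r2": set("adehplr"),
    "r3": set("acehoqt"),
    "r4": set("aefhpr"),
    "r5": set("bdfglqst"),
}


def rows_containing(items):
    """R(A): the rows that contain every item of antecedent A."""
    wanted = set(items)
    return {row for row, row_items in TABLE.items() if wanted <= row_items}


def rule_group(rows):
    """Return (upper bounds, lower bounds, members) for support set `rows`.

    Members are all antecedents A with R(A) == rows. Any such A must be a
    subset of the items common to `rows`, so only those subsets are tried.
    """
    rows = set(rows)
    common = set.intersection(*(TABLE[row] for row in rows))
    members = [frozenset(combo)
               for size in range(1, len(common) + 1)
               for combo in combinations(sorted(common), size)
               if rows_containing(combo) == rows]
    # Upper bound: no member is a proper superset. Lower bound: no proper subset.
    upper = [a for a in members if not any(b > a for b in members)]
    lower = [a for a in members if not any(b < a for b in members)]
    return upper, lower, members


if __name__ == "__main__":
    upper, lower, members = rule_group({"r2", "r3", "r4"})
    print("members:", sorted("".join(sorted(m)) for m in members))
    print("upper:", ["".join(sorted(u)) for u in upper])
    print("lower:", sorted("".join(sorted(l)) for l in lower))
```

On this table the group for {r2, r3, r4} has a single upper bound, in line with Lemma 1, and two lower bounds; the row set {r2, r3} yields an empty group because no item set is contained in exactly those two rows.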

Slide 53: FARMER - Enumeration
- Row enumeration tree.
- Rows with consequent C are ordered before rows without C to support efficient pruning.
- Each node is a combination of rows R’ and is labeled with I(R’).

Slide 54: FARMER – Transposed tables
Example table and the corresponding transposed table (TT). The transposed table shown is the one at the root of the enumeration tree.

Slide 55: Performance
Varying minsup: FARMER is 2 or 3 orders of magnitude faster.

Slide 56: Performance
Varying minconf: CHARM is unable to finish due to insufficient memory, and ColumnE always runs for more than a day.

Slide 57: Usefulness of IRGs
- IRGs were compared to two other classifiers: Support Vector Machines (SVM) and Classification Based on Associations (CBA).
- IRGs have the highest average accuracy.
- Both efficient and easily understandable.

Slide 58: Summary
- What is a microarray?
- Rule discovery in analyzing microarray data: a high-dimension, low-sample-size problem.

Slide 59: References
- Ch. 14 of the book: Christine Orengo, David Jones, Janet Thornton (eds.), Bioinformatics: Genes, Proteins, and Computers, BIOS Scientific Publishers, 2003. (ISBN: 1-85996-0545)
- Gao Cong, Anthony K. H. Tung, Xin Xu, Feng Pan, Jiong Yang, FARMER: Finding Interesting Rule Groups in Microarray Datasets, SIGMOD'04.
- Feng Pan, Gao Cong, Anthony K. H. Tung, Jiong Yang, Mohammed J. Zaki, Carpenter: Finding Closed Patterns in Long Biological Datasets, SIGKDD'03.
- Many slides are taken from http://bioinformatics.ca/course_work/workshops/ and http://ww.med.nyu.edu/rcr/rcr/course/PPT/microarray.ppt
