Array Based Cancer Diagnostics: Gene Expression Profiling of DNA Microarray Data Abdoulaye Samb DPS 2005 Proceedings Student Research May 06, 2005.

1 Array Based Cancer Diagnostics: Gene Expression Profiling of DNA Microarray Data Abdoulaye Samb DPS 2005 Proceedings Student Research May 06, 2005

2 Outline
Brief Overview of Bioinformatics
Microarray Technology
Motivation and Potential Impacts
“Peano”
Conclusion

3 Brief Overview of Bioinformatics
The term was first coined in 1988 by Dr. Hwa Lim.
The original definition was: “a collective term for data compilation, organisation, analysis and dissemination”.
Bioinformatics uses information technology to help solve biological problems by designing novel and incisive algorithms and methods of analysis.
It also serves to establish innovative software, to create new databases of information and to maintain existing ones, allowing open access to the records held within them.

4 Brief Overview of Bioinformatics
“Bioinformatics” is the new buzzword in the scientific community.
It is an umbrella term covering genomics, proteomics, evolution, and computer science.
Scientists now need to be interdisciplinary.
The data is collected from a variety of sources.
The terminology is specific to each branch of science.

5 Overview of Bioinformatics
The aim is to allow the effortless transfer of gathered information and the interrogation of databases across the global interface, i.e. to make the data easily and universally interpretable by scientists.
It is a vital discipline in the post-genomic era.
Biologists have been classifying data on species of plants and animals since the 17th century.
The knowledge acquired has grown in step with the evolution of technology.

6 In a nutshell…
Bioinformatics will also serve to advance medical research on the drug discovery process and therapeutic intervention.
Applying this information lets us understand biological systems well enough to build models that reflect how natural pathways and processes work, to predict their response and behavior, to manipulate them, and to identify defects in order to better understand and fight disorders and disease.

7 What is a Microarray?
A way of studying how large numbers of genes interact with each other and how a cell's regulatory networks control vast batteries of genes simultaneously.
The method uses a robot to precisely apply tiny droplets containing functional DNA to glass slides.
Fluorescent labels are attached to DNA from the cells under study.

8 What is a Microarray?
Genetics began when Mendel demonstrated his laws of heredity with varieties of peas and flowers in 1865.
The refinement of the compound microscope in the 19th century.
The first protein to be sequenced: insulin.
The first complete sequencing of an enzyme, ribonuclease, in 1960.
The sequencing of the first complete genome (Haemophilus influenzae), published in 1995.
Since then, we have moved on to technologies permitting the sequencing, recombination and cloning of DNA.

9 Microarray Technology
Also referred to by other names:
–microchip, biochip, DNA chip, DNA microarray, gene array, GeneChip®, and genome chip
Microarray analysis encompasses:
–Data Capture
–Data Mining: making sense of gene expression data
–Visualization and Interfaces: how to make all of this complicated data and analysis software accessible

10 Microarray
Two general types are popular:
–Spotted Arrays (Pat Brown, Stanford)
–Oligonucleotide Arrays (Affymetrix)
Both are based on the same basic principles:
–Anchoring pieces of DNA to glass/nylon slides
–Complementary hybridization

12 Motivation
Microarrays provide a tool for answering a wide variety of questions about the dynamics of cells:
–In which cells is each gene active?
–Under what environmental conditions is each gene active?
–How does the activity level of a gene change under different conditions (stage of the cell cycle, environmental conditions, disease)?
–What genes seem to be regulated together?

13 Motivation (2)
Microarrays may be used to assay gene expression within a single sample or to compare gene expression in two different cell types or tissue samples, such as healthy and diseased tissue.
–Follow a population of (synchronized) cells over time to see how expression changes (vs. baseline)
–Expose cells to different external stimuli and measure their response (vs. baseline)
–Take cancer cells (or other pathology) and compare them to normal cells

14 Potential Impacts
Preventative medicine
Ability to subtype disease and design drugs that treat disease causes rather than symptoms
Specific genotype (population) targeted drugs
Targeted drug treatments

15 Spotted Arrays
1. Control cells (left) and target cells (right)
2. Harvesting mRNA from both cell groups
3. Tagging the mRNA with green and red dye
4. Applying the mRNA to the cDNA microarray
5. Reading the result using a laser
6. A false-color composite representing the results

16 Microarray Animation

17 Oligonucleotide Arrays
Gene Chips
–Instead of putting entire genes on the array, put sets of DNA 25-mers (synthesized oligonucleotides)
–Produced using a photolithography process similar to the ones used to make semiconductor chips
–mRNA samples are processed separately instead of in pairs

19 Condition
–1. Internal cellular physiology from different cell lines
–2. Diverse physiological conditions in an intact organism
–3. Pathological tissue specimens from patients
–4. Serial time points following a stimulus to a cell or organism
A profile is the list of measurements along each row or column.
Features are the individual expression measurements within each profile.

20 Where does computer science come into it?
Bioinformaticians act as a bridge between the two sciences.
The Human Genome Project (HGP) has brought to light the limitations of traditional lab work: although mostly automated, lab methods are expensive and time consuming.
We need to incorporate original techniques to allow greater understanding of protein function, protein-protein interactions and protein-DNA interactions, and to put it all in a cellular context.
A gap remains between the data stored and its biological significance.

21 “Peano” Method for Association Rule Mining
“Peano” is a technology that uses Association Rule Mining to mine microarray data.
Association Rule Mining is a data mining technique for deriving meaningful rules from a given data set.
Our approach proposes a new microarray data mining technology built on a "Data Mining Ready" data structure, the Peano count tree (P-tree), to represent gene expression levels.
The method treats the microarray data as spatial data.

22 “Peano” Method for Association Rule Mining
Each spot on the microarray is represented as a pixel with corresponding red and green ratios.
The microarray data is reorganized into 8-bit bSQ files (where each bit position of each attribute or band is stored as a separate file).
Each bit file is then converted into a quadrant-based tree structure, the P-tree, from which a data cube is constructed and meaningful rules are readily obtained.

23 Association Rules
An association rule is a relationship of the form X ⇒ Y, where X is the antecedent itemset and Y is the consequent itemset.
An example of a rule is "customers who purchase an item X are very likely to purchase another item Y at the same time".
The rule X ⇒ Y has support s% in the transaction set T if s% of transactions in T contain X ∪ Y.

24 Association Rules (continued)
The rule has confidence c% if c% of the transactions in T that contain X also contain Y.
The goal of association rule mining is to find all the rules with support and confidence exceeding some user-specified thresholds.
The data mining model for a Market Research dataset can be treated as a relation R(Tid, i1, …, in), where Tid is the transaction identifier and i1, …, in denote the feature attributes: all the items available for purchase from the store.
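The support and confidence definitions above can be sketched in a few lines of Python. This is only an illustration: the function names and the toy market-basket transactions are assumptions made for the example, not part of the presentation.

```python
from typing import FrozenSet, Iterable, List


def support(transactions: List[FrozenSet[str]], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(transactions: List[FrozenSet[str]],
               antecedent: Iterable[str],
               consequent: Iterable[str]) -> float:
    """Among transactions containing X, the fraction that also contain Y."""
    x = frozenset(antecedent)
    xy = x | frozenset(consequent)
    support_x = support(transactions, x)
    if support_x == 0:
        return 0.0  # the rule never applies, so report zero confidence
    return support(transactions, xy) / support_x


# Hypothetical toy data: four shopping baskets.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

rule_support = support(transactions, {"bread", "milk"})
rule_confidence = confidence(transactions, {"bread"}, {"milk"})
```

With these baskets, the rule bread ⇒ milk is supported by two of the four transactions, and two of the three baskets containing bread also contain milk.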

25 Data Model
Transactions constitute the rows in the data table, whereas items form the columns.
The values of the items for the different transactions are in binary representation.
The microarray data is currently represented as a relation R(Gid, T1, …, Tn), where Gid is the gene identifier for each gene and T1, …, Tn are the various treatments to which the genes were exposed.

26 Gene Table
The genes constitute the rows in the data table, whereas treatments are the columns.
The values are normalized Red/Green color ratios representing the abundance of transcript for each spot on the microarray.
This table can be called a "Gene table".
Currently the data mining techniques of clustering and classification are being applied to the Gene table.

27 Clustering and Data Format
Clustering divides the dataset into clusters/classes by grouping on the rows (genes).
For rule mining, we design a data format called the "Treatment table", formed by flipping (transposing) the Gene table.
The relation R of a Treatment table can be represented as R(Tid, G1, …, Gn), where Tid represents the treatment identifiers and G1, …, Gn are the gene identifiers.
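A minimal sketch of the flip from a Gene table R(Gid, T1, …, Tn) to a Treatment table R(Tid, G1, …, Gn). The dictionary layout, the function name, and the ratio values are illustrative assumptions, not data from the presentation.

```python
from typing import Dict, List, Tuple

# Hypothetical Gene table: one row per gene, one ratio per treatment.
treatments = ["T1", "T2", "T3"]
gene_table: Dict[str, List[float]] = {
    "G1": [2.4, 0.8, 3.1],
    "G2": [0.5, 1.9, 2.2],
}


def to_treatment_table(
    gene_table: Dict[str, List[float]], treatments: List[str]
) -> Tuple[List[str], Dict[str, List[float]]]:
    """Transpose the table so that treatments become rows and genes columns."""
    genes = list(gene_table)
    # zip(*rows) turns the list of gene rows into a sequence of treatment rows.
    columns = zip(*(gene_table[g] for g in genes))
    table = {t: list(col) for t, col in zip(treatments, columns)}
    return genes, table


genes, treatment_table = to_treatment_table(gene_table, treatments)
```

After the flip, each treatment plays the role of a transaction and each gene the role of an item, which is the shape association rule mining expects.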

28 Treatment Table
The goal here is to mine for rules among the genes by associating the columns (genes) in the Treatment table.
The Treatment table can be viewed as a 2-dimensional array of gene expression values.
The Treatment table can be organized into a new spatial format called bit Sequential (bSQ), proposed by Qin Ding, Qiang Ding and William Perrizo [1].
Bit Sequential (bSQ) is a new data format for representing spatial data.

29 Why bSQ?
There are several reasons to use the bSQ format.
First, different bits contribute to the value in different degrees. In some applications, the high-order bits alone provide the necessary information.
Second, the bSQ format facilitates the representation of a precision hierarchy (from one-bit precision up to eight-bit precision).
Third, the bSQ format facilitates better compression through the creation of an efficient, rich data structure called the “Peano” Count Tree, and accommodates algorithm pruning based on a one-bit-at-a-time approach.
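The idea of storing one bit position per file, and of reading back only the high-order bits for a coarser precision, can be sketched as follows. This is a hedged illustration using in-memory lists rather than files; the function names and sample values are assumptions.

```python
from typing import List, Optional


def to_bit_planes(values: List[int], bits: int = 8) -> List[List[int]]:
    """Split integers into `bits` planes, most significant bit first."""
    return [[(v >> b) & 1 for v in values] for b in range(bits - 1, -1, -1)]


def from_bit_planes(planes: List[List[int]], keep: Optional[int] = None) -> List[int]:
    """Rebuild values from the first `keep` planes (all planes by default)."""
    bits = len(planes)
    keep = bits if keep is None else keep
    out = [0] * len(planes[0])
    for i, plane in enumerate(planes[:keep]):
        shift = bits - 1 - i  # plane 0 holds the highest-order bit
        for j, bit in enumerate(plane):
            out[j] |= bit << shift
    return out


# Hypothetical 8-bit expression values for four spots.
values = [200, 13, 255, 0]
planes = to_bit_planes(values)
coarse = from_bit_planes(planes, keep=3)  # 3-bit precision view
```

Keeping all eight planes reproduces the original values, while keeping three planes gives the kind of reduced-precision view that the precision hierarchy describes.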

30 Bit Sequence File, P-tree and PM-tree

Bit Sequence File
  1 1 1 1 1 1 0 0
  1 1 1 1 1 0 0 0
  1 1 1 1 1 1 0 0
  1 1 1 1 1 1 1 0
  1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1
  1 1 1 1 1 1 1 1
  0 1 1 1 1 1 1 1

P-tree
                     55
        ___________/ /  \ \___________
       /         ___/    \___         \
      /         /            \         \
    16      ___8___        __15__       16
           / /  |  \      / |  \  \
          3  0  4   1    4  4   3  4
        //|\      //|\        //|\
        1110      0010        1101

PM-tree
                     m
        ___________/ /  \ \___________
       /         ___/    \___         \
      /         /            \         \
     1      ___m___        __m___       1
           / /  |  \      / |  \  \
          m  0  1   m    1  1   m  1
        //|\      //|\        //|\
        1110      0010        1101

Figure 1: P-tree and PM-tree for an 8x8 image

31 Reading the P-tree
Figure 1 can be considered as a set of 8-row by 8-column microarray data, representing the expression levels for 64 different treatments for a single gene.
55 is the number of 1's in the entire microarray data set. This root level is labeled level 0.
The numbers at the next level (level 1), 16, 8, 15 and 16, are the 1-bit counts for the four major quadrants.
The first and last quadrants are composed entirely of 1-bits (each is called a "pure 1 quadrant"), so we terminate them.

32 Reading the P-tree (continued)
We do not need sub-trees for these two quadrants, so these branches terminate.
Similarly, quadrants composed entirely of 0-bits are called "pure 0 quadrants" and also terminate the tree branches.
This pattern is continued recursively using the Peano or Z-ordering of the four sub-quadrants at each new level.
Every branch terminates eventually (at the "leaf" level, each quadrant is a pure quadrant).
If we were to expand all sub-trees, including those for pure quadrants, the leaf sequence would be just the Peano ordering of the original raster data.
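The recursive quadrant counting described above can be sketched in Python. This is an illustrative sketch, not the authors' implementation: the tuple-based node layout and the function name are assumptions, and the 8x8 grid is chosen so that its quadrant counts agree with the counts quoted for Figure 1 (55 at the root, then 16, 8, 15 and 16).

```python
from typing import List, Optional, Tuple

# A node is (count_of_1_bits, children); children is None for a pure quadrant.
Node = Tuple[int, Optional[list]]

# Assumed 8x8 bit grid, consistent with the counts quoted for Figure 1.
GRID = [
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 1, 1, 1, 1, 1, 1],
]


def build_ptree(grid: List[List[int]], row: int = 0, col: int = 0,
                size: Optional[int] = None) -> Node:
    """Build a Peano count tree over a square grid whose side is a power of 2."""
    if size is None:
        size = len(grid)
    count = sum(grid[r][c]
                for r in range(row, row + size)
                for c in range(col, col + size))
    # Pure-0 and pure-1 quadrants terminate the branch.
    if count == 0 or count == size * size:
        return (count, None)
    half = size // 2
    # Peano (Z) order: upper-left, upper-right, lower-left, lower-right.
    children = [
        build_ptree(grid, row, col, half),
        build_ptree(grid, row, col + half, half),
        build_ptree(grid, row + half, col, half),
        build_ptree(grid, row + half, col + half, half),
    ]
    return (count, children)


tree = build_ptree(GRID)
level1_counts = [child[0] for child in tree[1]]
```

Recounting every quadrant from the raw grid keeps the sketch short; a production version would more likely build counts bottom-up so each cell is visited once.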

33 Binary Representation

         Gene
  Array  A  B  C  D  E
  1      0  1  0  1  0
  2      1  0  1  0  0
  3      1  0  1  0  1
  4      0  1  1  0  0
  5      1  1  0  0  0
  6      1  0  1  0  1
  7      0  1  1  0  0
  8      0  0  0  0  0

Determine a cutoff and convert any value >= cutoff (2.0) to 1, others to 0.
Up-regulated genes have a value of 1.
Could convert to -1, 0, and 1 to look at both up-regulation and down-regulation.
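The cutoff step can be sketched as below. The 2.0 cutoff comes from the slide; the lower cutoff of 0.5 used for down-regulation, the function names, and the sample ratios are assumptions made only for illustration.

```python
from typing import List


def binarize(ratios: List[List[float]], cutoff: float = 2.0) -> List[List[int]]:
    """Map each ratio to 1 if it is at or above the cutoff, else 0."""
    return [[1 if v >= cutoff else 0 for v in row] for row in ratios]


def ternarize(ratios: List[List[float]], up: float = 2.0,
              down: float = 0.5) -> List[List[int]]:
    """Map ratios to 1 (up-regulated), -1 (down-regulated) or 0 (unchanged).

    The `down` threshold is a hypothetical choice; the slide names only
    the up-regulation cutoff.
    """
    def code(v: float) -> int:
        if v >= up:
            return 1
        if v <= down:
            return -1
        return 0

    return [[code(v) for v in row] for row in ratios]


# Hypothetical normalized Red/Green ratios: rows are arrays, columns are genes.
ratios = [
    [2.0, 1.9],
    [0.4, 3.5],
]
binary = binarize(ratios)
ternary = ternarize(ratios)
```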

34 Rules of Significance
Mining rules can provide valuable information to the biologist about gene regulatory pathways and identify important relationships between the different gene expression patterns.
The biologist may be interested in some specific kinds of rules, which can be called "rules of significance".
In gene regulatory pathways, a biologist may be interested in identifying genes that govern the expression of other sets of genes.

35 Rules of Significance (continued)
These relationships can be represented as {G1, …, Gn} ⇒ Gm, where G1, …, Gn represent the antecedent and Gm represents the consequent of the rule.
The intuitive meaning of this rule is that, for a given confidence level, the expression of genes G1, …, Gn will result in the expression of gene Gm.
The algorithm used here is described in Figure 2 as the P-ARM algorithm; it assumes a fixed precision in all bands, for example the 3-bit precision used for our experiment.

36 Peano Algorithm

  Procedure P-ARM {
    Data Discretization;
    F_1 = {frequent 1-Asets};
    For (k = 2; F_(k-1) ≠ ∅) do begin
      C_k = p-gen(F_(k-1));
      Forall candidate Asets c ∈ C_k do
        c.count = rootcount(c);
      F_k = {c ∈ C_k | c.count >= minsup}
    end
    Answer = ∪_k F_k
  }

Figure 2: P-ARM algorithm

  insert into C_k
  select p.item_1, p.item_2, …, p.item_(k-1), q.item_(k-1)
  from F_(k-1) p, F_(k-1) q
  where p.item_1 = q.item_1, …, p.item_(k-2) = q.item_(k-2),
        p.item_(k-1) < q.item_(k-1),
        p.item_(k-1).group <> q.item_(k-1).group

Figure 3: Join step in p-gen function
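The level-wise loop of Figure 2 and the join step of Figure 3 can be approximated with a short Python sketch. It is a simplification under stated assumptions, not the P-ARM implementation: each gene column is packed into a plain Python integer instead of a P-tree, `root_count` stands in for the P-tree root count of the ANDed columns, every gene is treated as its own group so the group condition of the join always holds, and `minsup` is an absolute count. The binary data is the 8-array, 5-gene table from the Binary Representation slide.

```python
from typing import Dict, List, Tuple

Itemset = Tuple[str, ...]

GENES = ["A", "B", "C", "D", "E"]
# Rows are arrays 1..8, columns are genes A..E (binary representation slide).
ROWS = [
    [0, 1, 0, 1, 0],
    [1, 0, 1, 0, 0],
    [1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
]


def column_bitsets(rows: List[List[int]], names: List[str]) -> Dict[str, int]:
    """Pack each column into an int; bit i is set when row i holds a 1."""
    cols = {name: 0 for name in names}
    for i, row in enumerate(rows):
        for name, bit in zip(names, row):
            if bit:
                cols[name] |= 1 << i
    return cols


def root_count(cols: Dict[str, int], itemset: Itemset) -> int:
    """Number of rows in which every gene of a non-empty itemset is 1."""
    mask = cols[itemset[0]]
    for item in itemset[1:]:
        mask &= cols[item]
    return bin(mask).count("1")


def p_gen(prev: List[Itemset]) -> List[Itemset]:
    """Join sorted (k-1)-itemsets that agree on their first k-2 items."""
    prev = sorted(prev)
    out = []
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            if p[:-1] == q[:-1]:
                out.append(p + (q[-1],))
    return out


def p_arm(cols: Dict[str, int], minsup: int) -> Dict[Itemset, int]:
    """Return every frequent itemset together with its support count."""
    frequent: Dict[Itemset, int] = {}
    level = [(g,) for g in sorted(cols) if root_count(cols, (g,)) >= minsup]
    while level:
        for c in level:
            frequent[c] = root_count(cols, c)
        level = [c for c in p_gen(level) if root_count(cols, c) >= minsup]
    return frequent


result = p_arm(column_bitsets(ROWS, GENES), minsup=2)
```

Rules such as {A, C} ⇒ E would then be scored by dividing the count of the full itemset by the count of the antecedent, as in the confidence definition.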

37 Conclusions
Controls yielded no rules or itemsets. This shows the associations are not happening by chance: we are detecting order.
Itemset and rule counts peak at n genes per set.
Verifying that the associations are biologically meaningful.
There are few examples in the literature, yet the technique shows promise.
Larger datasets are more likely to yield better results.

