Mining Order Preserving Submatrices (OPSMs) from data with replicates
Presenter: Chun-Kit Chui
Supervisor: Dr. Ben Kao


Presentation Outline
- Conventional Order Preserving Submatrices
- Multiple-value matrix data model
- Mining OPSMs from the new data model
- Efficient methods: bounding techniques
- Experimental evaluation

Order Preserving Submatrix
The Order Preserving Submatrix (OPSM) problem is a pattern-based subspace clustering problem that is usually applied to mining gene expression datasets.
Gene expression dataset: one of the goals of microarray data analysis is to group coexpressing genes into a cluster.
- Coexpressing genes are genes that show similar changes of expression level (expression values going up or down) under some environmental stimuli (experimental conditions).

Raw gene expression dataset (rows: genes; columns: experimental conditions, i.e. experimental settings; each entry: the expression value of a gene under an experimental setting):

      C1  C2  C3  C4  C5  C6  C7  C8
G1    36  32  12  19  18  42  33   8
G2    11  22  33  24  30   3   9  23
G3    14  18  48  28  38  11  33  21
G4    20  14   5  10   7  24  44  13
G5    38  25  10  24  19  39   8  22

Order Preserving Submatrix
A gene's expression level may vary substantially due to its sensitivity to experimental settings.
- i.e. the change of expression level in response to a change of experimental condition is often considered more meaningful than the actual value.
In microarray experiments, the design of experimental conditions is often based on little knowledge of gene functions.
- i.e. clustering algorithms should consider a subset of conditions that maximizes the similarity among a subset of genes.
[Figure: the raw gene expression dataset (genes G1 to G5, conditions C1 to C8) plotted as a data matrix. No obvious patterns are observed.]

Order Preserving Submatrix
[Figure: the raw gene expression dataset (genes G1 to G5, conditions C1 to C8) plotted as a data matrix. No obvious patterns are observed.]
Consider a subset of the experimental conditions and a subset of the genes, and reorder the columns.

Order preserving submatrix (subset of genes; reordered subset of columns):

      C3  C5  C4  C2  C1  C6
G1    12  18  19  32  36  42
G4     5   7  10  14  20  24
G5    10  19  24  25  38  39

The expression values of these genes change in the same way in response to the change of experimental condition: they are all increasing.
Identifying this submatrix (the subset of genes, the subset of conditions, and the ordering of the conditions) is particularly useful for biologists, e.g. to infer gene regulatory networks.
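The reordering on this slide can be checked mechanically. The sketch below is not from the slides; the helper name `is_order_preserving` is made up for illustration, and the matrix values are the ones read off the example dataset. It tests which genes are strictly increasing under the column order C3, C5, C4, C2, C1, C6.

```python
# Example expression matrix (rows G1..G5, columns C1..C8), as read from the slide.
MATRIX = {
    "G1": [36, 32, 12, 19, 18, 42, 33, 8],
    "G2": [11, 22, 33, 24, 30, 3, 9, 23],
    "G3": [14, 18, 48, 28, 38, 11, 33, 21],
    "G4": [20, 14, 5, 10, 7, 24, 44, 13],
    "G5": [38, 25, 10, 24, 19, 39, 8, 22],
}


def is_order_preserving(row_values, column_order):
    """True if the row's values strictly increase along column_order (1-based ids)."""
    picked = [row_values[c - 1] for c in column_order]
    return all(a < b for a, b in zip(picked, picked[1:]))


ORDER = [3, 5, 4, 2, 1, 6]  # column order constraint C3, C5, C4, C2, C1, C6
supporting = [g for g, vals in MATRIX.items() if is_order_preserving(vals, ORDER)]
print(supporting)
```

Only G1, G4 and G5 pass the check, which matches the submatrix shown on the slide.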

Order Preserving Submatrix
Given a data matrix M with n rows and m columns, an order preserving submatrix S consists of:
- A subset of rows R. E.g. R = {G1, G4, G5}
- A subset of columns C. E.g. C = {C1, C2, C3, C4, C5, C6}
- A column order constraint such that the entries of every row in R are increasingly ordered. E.g. <C3, C5, C4, C2, C1, C6>
Mining OPSMs: find all OPSMs with |R| greater than or equal to a user-specified support threshold (i.e. frequent) and |C| greater than or equal to c.
[Figure: the raw gene expression dataset with n genes (G1 to Gn) and m conditions (C1 to Cm), next to the order preserving submatrix over genes G1, G4, G5 with columns reordered as C3, C5, C4, C2, C1, C6.]

Mining OPSMs
Mining OPSMs can be reduced to a special case of sequential pattern mining.
- Transform the data matrix into a sequence dataset by sorting each row in ascending order and replacing the entries with the corresponding column labels.
- Each sequential pattern uniquely specifies an OPSM, with all the supporting sequences as its supporting rows.
A row supports an OPSM if the column order constraint of the OPSM is a subsequence of the transformed column sequence of the row.
[Figure: the raw data matrix (G1 to G5, C1 to C8), the transformed sequence dataset (one column sequence per gene), and a sequential pattern highlighted as an OPSM.]
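The transformation and the support test described above can be sketched as follows. This is an illustrative sketch, not the presenter's code; the function names are hypothetical and the matrix is the example dataset as read from the slides.

```python
MATRIX = {
    "G1": [36, 32, 12, 19, 18, 42, 33, 8],
    "G2": [11, 22, 33, 24, 30, 3, 9, 23],
    "G3": [14, 18, 48, 28, 38, 11, 33, 21],
    "G4": [20, 14, 5, 10, 7, 24, 44, 13],
    "G5": [38, 25, 10, 24, 19, 39, 8, 22],
}


def to_column_sequence(values):
    """Sort a row in ascending order and replace each entry by its 1-based column id."""
    return [col for _, col in sorted((v, c) for c, v in enumerate(values, start=1))]


def is_subsequence(pattern, sequence):
    """True if pattern occurs in sequence in order (not necessarily contiguously)."""
    remaining = iter(sequence)
    return all(col in remaining for col in pattern)


SEQUENCES = {gene: to_column_sequence(vals) for gene, vals in MATRIX.items()}
PATTERN = [3, 5, 4, 2, 1, 6]  # the OPSM <C3, C5, C4, C2, C1, C6>
support = [g for g, seq in SEQUENCES.items() if is_subsequence(PATTERN, seq)]

print(SEQUENCES["G1"])  # column sequence of G1
print(support)          # rows supporting the OPSM
```

The row of G1 becomes the column sequence C8, C3, C5, C4, C2, C7, C1, C6, and the pattern is a subsequence of the sequences of G1, G4 and G5 only.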

Mining OPSMs: Two properties
Apriori property
- If an OPSM is infrequent (its number of supporting rows is smaller than a user-specified support threshold), then every OPSM whose column order constraint contains its constraint as a subsequence is also infrequent.
- E.g. an OPSM that is supported by G1 only is infrequent. Adding more columns to its constraint can only reduce the number of supporting sequences, so every extended OPSM must be infrequent as well.
By the Apriori property, we can prune the search space iteratively: mine the frequent size-k OPSMs, identify and prune the infrequent size-(k+1) OPSMs, then continue with iteration k+1.
[Figure: transformed sequence dataset (one column sequence per gene, G1 to G5).]

Mining OPSMs: Two properties
Transitivity
- If the last columns of the column order constraint of an OPSM O1 (supporting rows R1) coincide with the first columns of the constraint of another OPSM O2 (supporting rows R2), the two constraints can be chained into the constraint of an OPSM O3, and the intersection of R1 and R2 yields the set of supporting rows of O3.
- E.g. if one OPSM is supported by G1 and G4 and the other is supported by G1, the chained OPSM is supported by G1.
By the Transitivity property, we can obtain the supports of size-(k+1) OPSMs from the size-k frequent OPSMs without rescanning the sequence dataset.
[Figure: transformed sequence dataset (one column sequence per gene, G1 to G5).]
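A small check of the Transitivity property on the example dataset may help. The patterns below are chosen for illustration only (they are not the ones on the slide), and the helper names are hypothetical.

```python
MATRIX = {
    "G1": [36, 32, 12, 19, 18, 42, 33, 8],
    "G2": [11, 22, 33, 24, 30, 3, 9, 23],
    "G3": [14, 18, 48, 28, 38, 11, 33, 21],
    "G4": [20, 14, 5, 10, 7, 24, 44, 13],
    "G5": [38, 25, 10, 24, 19, 39, 8, 22],
}


def to_column_sequence(values):
    """Sort a row ascending and replace entries by 1-based column ids."""
    return [col for _, col in sorted((v, c) for c, v in enumerate(values, start=1))]


def supporting_rows(pattern, sequences):
    """Set of rows whose column sequence contains pattern as a subsequence."""
    rows = set()
    for gene, seq in sequences.items():
        remaining = iter(seq)
        if all(col in remaining for col in pattern):
            rows.add(gene)
    return rows


SEQUENCES = {g: to_column_sequence(v) for g, v in MATRIX.items()}

O1 = [8, 4, 2]     # <C8, C4, C2>
O2 = [4, 2, 1]     # <C4, C2, C1>: its first columns equal the last columns of O1
O3 = [8, 4, 2, 1]  # chained constraint <C8, C4, C2, C1>

R1 = supporting_rows(O1, SEQUENCES)
R2 = supporting_rows(O2, SEQUENCES)
R3 = supporting_rows(O3, SEQUENCES)
print(sorted(R1), sorted(R2), sorted(R3))
```

Here R3 is computed by scanning the sequences, and it coincides with the intersection of R1 and R2, so the scan could have been skipped.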

Mining OPSMs
- Start from mining size-2 OPSMs, because size-1 OPSMs do not have any ordering.
- The Subsequence function verifies the supporting rows of the candidate OPSMs.
- The OPSMs whose number of supporting rows is greater than or equal to the support threshold are frequent; they are passed to the OPSM-Gen procedure.
- By the Apriori property, OPSM-Gen only generates the size-(k+1) candidates whose proper sub-patterns are all frequent.
- By the Transitivity property, the supporting rows of the size-(k+1) candidates are obtained from the frequent size-k OPSMs. There is no need to scan the dataset.
- The algorithm terminates when no more candidates are generated.
[Figure: mining loop of size-2 candidate OPSMs, Subsequence function, frequent size-k OPSMs, OPSM-Gen, and size-(k+1) candidate OPSMs.]
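The loop on this slide can be sketched as a level-wise miner. This is a simplified illustration under stated assumptions, not the presenter's implementation: a plain dictionary scan stands in for the candidate generation structure, only the head and tail sub-patterns are checked when joining, and `mine_opsms` and `min_support` are hypothetical names.

```python
from itertools import permutations

MATRIX = {
    "G1": [36, 32, 12, 19, 18, 42, 33, 8],
    "G2": [11, 22, 33, 24, 30, 3, 9, 23],
    "G3": [14, 18, 48, 28, 38, 11, 33, 21],
    "G4": [20, 14, 5, 10, 7, 24, 44, 13],
    "G5": [38, 25, 10, 24, 19, 39, 8, 22],
}


def to_column_sequence(values):
    """Sort a row ascending and replace entries by 1-based column ids."""
    return [col for _, col in sorted((v, c) for c, v in enumerate(values, start=1))]


def mine_opsms(sequences, n_columns, min_support):
    """Return {column order constraint (tuple): frozenset of supporting rows}."""
    # Size-2 candidates are verified against the sequence dataset.
    level = {}
    for a, b in permutations(range(1, n_columns + 1), 2):
        rows = frozenset(g for g, s in sequences.items() if s.index(a) < s.index(b))
        if len(rows) >= min_support:
            level[(a, b)] = rows

    frequent = {}
    while level:
        frequent.update(level)
        next_level = {}
        # Join p and q when the tail of p equals the head of q (Transitivity):
        # the supporting rows of the merged pattern are the intersection.
        for p, rows_p in level.items():
            for q, rows_q in level.items():
                if p[1:] == q[:-1] and p[0] != q[-1]:
                    rows = rows_p & rows_q
                    if len(rows) >= min_support:  # prune infrequent candidates
                        next_level[p + (q[-1],)] = rows
        level = next_level
    return frequent


SEQUENCES = {g: to_column_sequence(v) for g, v in MATRIX.items()}
frequent = mine_opsms(SEQUENCES, n_columns=8, min_support=3)
print(sorted(frequent[(3, 5, 4, 2, 1, 6)]))
```

With a support threshold of 3 rows, the miner recovers the OPSM <C3, C5, C4, C2, C1, C6> with supporting rows G1, G4 and G5, and the dataset is scanned only for the size-2 candidates.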

Mining OPSMs: Data Structure
- In the OPSM-Gen procedure, the size-k frequent OPSMs table stores the column order constraints (OPSMs) and the supporting rows (bit vectors / tid-lists) of the OPSMs.
- A Head-tail tree data structure was proposed to facilitate candidate generation in the OPSM-Gen procedure.
- To add an OPSM to the Head-tail tree, we first follow the "head" of the OPSM to traverse the tree and store the OPSM in the leaf node reached, marking the entry with H to indicate that the leaf was reached by following the head.
- Then we follow the "tail" of the OPSM to traverse the tree and store the OPSM in the leaf node reached, marking the entry with T to indicate that the leaf was reached by following the tail.
[Figure: size-3 frequent OPSMs table (column order constraint; supporting rows, e.g. 1,3,5,6) and a Head-tail tree over columns a, b, c, d whose leaf nodes hold (Head/Tail flag, OPSM pointer) entries.]

Mining OPSMs: Data Structure
- By the Transitivity property, to generate the size-4 candidates we can simply merge the tail OPSMs with the head OPSMs within each leaf node.
- For example, a Tail entry and a Head entry in the same leaf can be merged to form a new size-4 OPSM. Following their pointers, we find their supporting rows. By the Transitivity property, the new OPSM is supported by rows 4 and 5 (the intersection of the two supporting row vectors).
[Figure: size-3 frequent OPSMs table with supporting rows {1,3,5,6}, {4,5,6}, {1,2,3,4,5}, {2,5,6,7}, and the Head-tail tree over columns a, b, c, d with H and T entries pointing into the table.]

The replicated data model Multiple-value matrix

Multiple-value matrix data model
Recent research in microarray data analysis has shown that any single microarray output is subject to substantial variability.
- [Stefan Bleuler et al., Evo Workshops 2005; R. Coombes et al., Journal of Computational Biology 2002; G. C. Tseng et al., Nucleic Acids Research 2001; J. P. Brody et al., National Academy of Sciences 2002]
- The error of the expression values of the genes under an experimental condition can be large.
Replication is strongly supported by biologists as a straightforward approach for improving the quality of inferences made from experimental studies.
- [T.-K. Jenssen et al., Nucleic Acids Research 2002; M.-L. T. Lee et al., PNAS 2000; J. Novak et al., Genomics 2002; R. Ramakrishnan et al., Nucleic Acids Research 2002]
Technical vs. biological:
- Variability. Technical: measurement error of the experimental system (comparatively small). Biological: natural heterogeneity among individuals (can be large).
- Replication. Technical: repeated measurement of the same sample. Biological: the practice of measuring multiple samples under the same experimental condition.

Multiple-value matrix data model
The practice of conducting repeated experiments is stressed in much of the literature on microarray studies.
- A study on the effect of repeated measures on the detection of differentially expressed genes reported that stable results are typically not obtained until at least five biological replicates have been used. [P. Pavlidis et al., Bioinformatics 2003]
- Another study on variability analysis of gene expression data suggests that at least three repeated experiments should be conducted instead of one. [M.-L. T. Lee et al., PNAS 2000]
Therefore, it is necessary to consider the data produced by repeated experiments when analyzing gene expression data.

Multiple-value matrix data model
With repeated experiments, the data produced by microarray experiments can be organized as a matrix in which each entry is a set of expression levels of a gene under an experimental condition.
[Figure: multiple-value matrix with 3 genes (rows) and 8 conditions (columns).]
E.g. 3 repeated experiments are conducted under experimental condition (column) C1; the expression values of gene (row) G1 in the first, second and third repeated experiments (replicates) are 23, 24 and 22 respectively.

Multiple-value matrix data model
Which OPSM does G2 support?
- Consider one set of replicates. Since the expression values of G2 in the chosen columns are increasingly ordered, the OPSM with this column order constraint is one of the OPSMs possibly supported by G2.
- For one OPSM, two of the enumerated column orderings deduced from G2 conform to its column order constraint.
- For another OPSM, six of the enumerated column orderings deduced from G2 conform to its column order constraint.
Which OPSM does G2 conform to more?
[Figure: expression values of G2 across its replicates, with the conforming column orderings highlighted.]

Scoring Model
Which of the two candidate OPSMs does G1 support?
- From the raw dataset, we can enumerate all the possible column orderings and store them in the enumerated column orderings table. In this example there are 16 enumerated column orderings in total.
- We define the score given by a row (gene) to an OPSM as the fraction of all the enumerated column orderings that conform to the column order constraint of the OPSM.
- The first OPSM matches 11 of the 16 enumerated column orderings, so it scores 11/16 from G1. The second matches 5, so it scores 5/16.
- As in conventional OPSM mining, we can transform the raw dataset into a sequence dataset. Obtaining the counts of the enumerated column orderings is equivalent to obtaining the number of subsequence matches of the column order constraint in the transformed sequence dataset.
- The denominator of the score function is the product of the numbers of replicates of the columns involved, i.e. 4*4 = 16.

                                    First OPSM   Second OPSM
Enumerated column ordering count    11           5
#Subsequence matches                11           5
Score                               11/16        5/16
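The score can be computed without enumerating the orderings, by counting subsequence matches with a standard dynamic program. The sketch below is illustrative only: the raw values of the slide are not recoverable, so `ROW` is a hypothetical column sequence for two columns with 4 replicates each, chosen so that it reproduces the counts 11 and 5. Ties between equal values are ignored.

```python
from collections import Counter
from math import prod


def count_matches(pattern, sequence):
    """Number of ways pattern occurs as a subsequence of sequence."""
    # ways[j] = number of matches of pattern[:j] in the part of the sequence seen so far
    ways = [1] + [0] * len(pattern)
    for item in sequence:
        for j in range(len(pattern), 0, -1):
            if pattern[j - 1] == item:
                ways[j] += ways[j - 1]
    return ways[len(pattern)]


def score(pattern, sequence):
    """Fraction of enumerated column orderings that conform to pattern."""
    replicates = Counter(sequence)  # number of replicates per column
    return count_matches(pattern, sequence) / prod(replicates[c] for c in pattern)


# Hypothetical transformed sequence of one row (assumption, not from the slide).
ROW = ["C1", "C1", "C2", "C2", "C1", "C2", "C1", "C2"]

print(count_matches(["C1", "C2"], ROW), score(["C1", "C2"], ROW))
print(count_matches(["C2", "C1"], ROW), score(["C2", "C1"], ROW))
```

For this row the two opposite orderings obtain 11 and 5 matches, i.e. scores 11/16 and 5/16, and the two counts add up to the 16 enumerated orderings.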

Scoring Model
To mine OPSMs under the scoring model, each supporting row (gene) is associated with the #subsequence matches of the column order constraint (OPSM).
- From the #subsequence matches, we can calculate the score contributed by each row (gene) and the total score of the OPSM.
- We use the total score as the support measure of the OPSMs. The OPSMs with total scores over a user-specified threshold are regarded as frequent.

Size-3 frequent OPSMs table (the column order constraints are not recoverable from the slide):

Supporting rows (#subsequence matches)    Total score
1(14), 3(11), 5(22), 6(63)                (14+11+22+63) / 64
4(44), 5(36), 6(25)                       (44+36+25) / 64
1(52), 2(42), 3(14), 4(20)                (52+42+14+20) / 64
2(42), 5(31), 6(36), 7(13)                (42+31+36+13) / 64
...

[Figure: Head-tail tree over columns a, b, c, d with H and T entries pointing into the table.]
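The total scores in the table can be worked out directly. In the sketch below the denominator 64 is taken from the slide (it corresponds to three columns with 4 replicates each); the function name and the threshold value are assumptions made for illustration.

```python
def total_score(matches_by_row, denominator):
    """Sum of the per-row scores: each row contributes matches / denominator."""
    return sum(matches_by_row.values()) / denominator


# Rows of the size-3 frequent OPSMs table: {row id: #subsequence matches}.
TABLE = [
    {1: 14, 3: 11, 5: 22, 6: 63},
    {4: 44, 5: 36, 6: 25},
    {1: 52, 2: 42, 3: 14, 4: 20},
    {2: 42, 5: 31, 6: 36, 7: 13},
]

THRESHOLD = 1.5  # hypothetical user-specified threshold
scores = [total_score(entry, 64) for entry in TABLE]
for s in scores:
    print(s, s >= THRESHOLD)
```

The four total scores are 110/64, 105/64, 128/64 and 122/64, so all four OPSMs would count as frequent for any threshold up to about 1.64.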

Mining OPSMs from multiple-value matrix
An example raw dataset with 3 conditions (columns), where each condition has 4 repeated experiments (replicates).
- We transform the raw dataset into a sequence dataset by sorting the entries of each row in ascending order and replacing the entries with their condition (column) IDs.
- From the row 1 sequence, we find that <C1,C2> has 11 subsequence matches in row 1 (score 11/16). Similarly, <C2,C3> has 10 subsequence matches in row 1 (score 10/16).
[Figure: raw dataset and transformed sequence dataset for row 1.]

Mining OPSMs from multiple-value matrix
After obtaining the total scores of the two OPSMs, we find that they are frequent. Therefore, they are stored in the size-2 frequent OPSMs table.

Size-2 frequent OPSMs table:

Column order constraint    Supporting rows (#subsequence matches)
<C1,C2>                    1(11), ...
<C2,C3>                    1(10), ...
...

In the OPSM-Gen procedure, the OPSMs are organized in a Head-tail tree data structure.
[Figure: Head-tail tree with a root and nodes 1, 2, 3; the leaf for node 2 holds <C1,C2> flagged T and an entry flagged H, each with a pointer into the table.]

Mining OPSMs from multiple-value matrix
By the Transitivity property, Tail <C1,C2> and Head <C2,C3> can be merged to form the size-3 OPSM <C1,C2,C3>.
- Recall that in conventional OPSM mining, we can deduce the support of <C1,C2,C3> from the size-2 frequent OPSMs table by intersecting the supporting rows, so we do not need to rescan the dataset.
- Question: can we deduce the #subsequence matches (score) of <C1,C2,C3> in row 1 from the size-2 frequent OPSMs table?
[Figure: row 1 of the raw and transformed datasets, the size-2 frequent OPSMs table, and the Head-tail tree.]

Mining OPSMs from multiple-value matrix
- Essentially, obtaining the #subsequence matches of <C1,C2,C3> is equivalent to performing a join on column C2 of the two enumerated ordering tables (size 2) for row 1.
- Since the joining information cannot be deduced from the counts (#subsequence matches), we cannot obtain the #subsequence matches of <C1,C2,C3> without revisiting the sequence dataset.
- Question: can we materialize these tables to facilitate the join?
[Figure: enumerated ordering tables (size 2) for row 1, the size-2 frequent OPSMs table, and the Head-tail tree.]
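A concrete counterexample shows why the counts alone are not enough. The two column sequences below are hypothetical (constructed for illustration, not taken from the slide). Both have 11 matches of <C1,C2> and 10 matches of <C2,C3>, yet they differ in the number of matches of <C1,C2,C3>.

```python
def count_matches(pattern, sequence):
    """Number of ways pattern occurs as a subsequence of sequence."""
    ways = [1] + [0] * len(pattern)
    for item in sequence:
        for j in range(len(pattern), 0, -1):
            if pattern[j - 1] == item:
                ways[j] += ways[j - 1]
    return ways[len(pattern)]


# Two hypothetical rows, each with 4 replicates of C1, C2 and C3.
ROW_A = "C1 C1 C2 C3 C2 C1 C3 C2 C1 C3 C2 C3".split()
ROW_B = "C1 C1 C3 C2 C1 C2 C3 C2 C2 C3 C3 C1".split()

for name, row in (("A", ROW_A), ("B", ROW_B)):
    tail = count_matches(["C1", "C2"], row)
    head = count_matches(["C2", "C3"], row)
    merged = count_matches(["C1", "C2", "C3"], row)
    print(name, tail, head, merged)
```

Row A yields 24 matches of the merged pattern and row B yields 27, although their size-2 counts are identical, so the sequence dataset has to be revisited.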

Mining OPSMs from multiple-value matrix
Challenges:
- Combinatorial explosion of the number of candidates.
- Unlike conventional OPSM mining, we have to revisit the dataset to obtain the #subsequence matches of the candidates.
- Obtaining the #subsequence matches of a candidate requires enumerating the column orderings, whose number is exponential in the size of the candidate OPSM. The same process has to be repeated for all rows.
Approaches:
- Reduce the number of candidates through bounding techniques.
- Organize the candidates in a prefix tree and verify the #subsequence matches in a single scan over the dataset.
- Compress the sequence dataset to reduce the effort of obtaining the #subsequence matches (tree traversal).
[Figure: mining loop of size-2 candidate OPSMs, Subsequence function, frequent size-k OPSMs, OPSM-Gen, and size-(k+1) candidate OPSMs.]

Min upper bound
Motivating questions
- Assume the number of replicates of every column is 4.
- We have 11 <C1,C2> subsequences in row 1 and there are 4 "C3"s, so the maximum possible #subsequence matches of <C1,C2,C3> in row 1 is 11*4 = 44. (This assumes that all 4 "C3"s are on the right of the 11 subsequences.)
- We have 10 <C2,C3> subsequences in row 1 and there are 4 "C1"s, so the maximum possible #subsequence matches of <C1,C2,C3> in row 1 is 10*4 = 40.
- Therefore, the upper bound of the possible #subsequence matches of <C1,C2,C3> in row 1 is min(44, 40) = 40.
This is an upper bound of the #subsequence matches of <C1,C2,C3> in row 1. If we apply this bound to all the rows, we obtain an upper bound of the score of an OPSM.
[Figure: size-2 frequent OPSMs table and Head-tail tree.]
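The arithmetic of the Min upper bound is small enough to write down as a function. The function and parameter names below are hypothetical; the numbers are those of the slide.

```python
def min_upper_bound(tail_matches, head_matches, r_first, r_last):
    """Upper bound on the matches of the merged pattern in one row.

    tail_matches: #subsequence matches of the tail OPSM (e.g. <C1,C2>)
    head_matches: #subsequence matches of the head OPSM (e.g. <C2,C3>)
    r_first:      number of replicates of the first column (e.g. C1)
    r_last:       number of replicates of the last column (e.g. C3)
    """
    # Every tail match can be extended by at most r_last copies of the last
    # column; every head match can be preceded by at most r_first copies of
    # the first column. Both products are upper bounds, so take the smaller.
    return min(tail_matches * r_last, head_matches * r_first)


bound = min_upper_bound(tail_matches=11, head_matches=10, r_first=4, r_last=4)
print(bound)
```

For row 1 this evaluates min(44, 40) and returns 40.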

HT arrays
Construct a T-array for the tail OPSM <C1,C2> and an H-array for the head OPSM <C2,C3> from the row 1 sequence.
- T-array: slot q stores how many "C2"s come after the qth "C1".
- H-array: slot q stores how many "C3"s come after the qth "C2".

Slot       1  2  3  4
T-array    4  4  2  1
H-array    4  3  2  1

With these two arrays, we can deduce the #subsequence matches of <C1,C2,C3>.
[Figure: raw dataset and transformed sequence dataset for row 1.]

HT arrays
Deducing the #subsequence matches of <C1,C2,C3> from T-array = (4, 4, 2, 1) and H-array = (4, 3, 2, 1):
- 1st "C1": there are 4 "C2"s after it. There are 4 "C3"s after the 1st "C2", so 4 orderings are formed by the 1st C1 and the 1st C2. Likewise there are 3, 2 and 1 "C3"s after the 2nd, 3rd and 4th "C2"s. Contribution: 4 + 3 + 2 + 1 = 10.
- 2nd "C1": there are again 4 "C2"s after it, so we sum all 4 entries of the H-array. Contribution: 10.
- 3rd "C1": there are only 2 "C2"s after it, so they must be the 3rd and 4th "C2"s (otherwise T-array[3] would not be 2). Contribution: H-array[3] + H-array[4] = 2 + 1 = 3.
- 4th "C1": there is only 1 "C2" after it, so it must be the 4th "C2" (otherwise T-array[4] would not be 1). Contribution: H-array[4] = 1.
#Subsequence matches of <C1,C2,C3> = (4 + 3 + 2 + 1) + 10 + (2 + 1) + 1 = 24.
Finally, we can deduce that the #subsequence matches of <C1,C2,C3> in row 1 is 24.
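The walk-through above amounts to summing, for every "C1", the last T-array[q] slots of the H-array. A minimal sketch, with a hypothetical function name and the arrays of the example:

```python
def matches_from_ht(t_array, h_array):
    """#subsequence matches of the merged pattern, deduced from the HT-arrays.

    t_array[q]: number of "C2"s after the (q+1)-th "C1"
    h_array[q]: number of "C3"s after the (q+1)-th "C2"
    """
    total = 0
    for c2_after in t_array:
        # The "C2"s after this "C1" are necessarily the LAST c2_after "C2"s,
        # so their contributions are the last c2_after slots of the H-array.
        total += sum(h_array[len(h_array) - c2_after:])
    return total


T_ARRAY = [4, 4, 2, 1]
H_ARRAY = [4, 3, 2, 1]
exact = matches_from_ht(T_ARRAY, H_ARRAY)
print(exact)
```

The four contributions are 10, 10, 3 and 1, giving 24 as on the slide.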

HT arrays
Question: can we store the HT-arrays instead of the #subsequence matches, so that we do not need to rescan the dataset in the Subsequence function?
- However, with generalized HT-arrays the number of slots of the H-array is exponential in the number of columns of the OPSM.
[Figure: T-array (4, 4, 2, 1) and H-array (4, 3, 2, 1) attached to the size-2 frequent OPSMs table and the Head-tail tree, next to generalized HT-arrays with many more slots.]

HT upper bound
Motivation: we can bound the #subsequence matches of <C1,C2,C3> by guessing the HT-arrays from the #subsequence matches of the tail and head OPSMs (11 and 10 in the example).
- Each T-array slot indicates how many "C2"s come after a given "C1", so its value cannot be larger than the number of replicates of C2 (4 in the example).
- To obtain an upper bound, we guess the T-array such that the #subsequence matches of <C1,C2,C3> is maximized. This can be done by assigning the "C1"s as far to the left in the column sequence as possible (push left): T-array = (4, 4, 3, 0).
- For the H-array, we assign the "C3"s as far to the right in the column sequence as possible (push right): H-array = (3, 3, 2, 2).
- There is a constraint when assigning values to the HT-arrays: Array[x] >= Array[x+1]. For example, the H-array cannot be (0, 2, 4, 4): if there is no C3 after the 1st C2, there cannot be any C3 after the 2nd, 3rd or 4th C2.
[Figure: size-2 frequent OPSMs table and Head-tail tree.]

HT upper bound
Following the previous algorithm, we can obtain the upper bound of the #subsequence matches of <C1,C2,C3> from the two guessed HT-arrays, T-array = (4, 4, 3, 0) (push left) and H-array = (3, 3, 2, 2) (push right):
Upper bound of the #subsequence matches of <C1,C2,C3> = (3 + 3 + 2 + 2) + (3 + 3 + 2 + 2) + (3 + 2 + 2) = 27.
The upper bound of the #subsequence matches of <C1,C2,C3> in row 1 is 27.
[Figure: size-2 frequent OPSMs table and Head-tail tree.]
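The upper bound can be reproduced by feeding the guessed arrays into the same summation used for the exact count. The sketch below also checks the constraints a guessed array has to satisfy. The guessed arrays are the ones shown on the slide; the function names are assumptions.

```python
def is_valid_guess(array, total, max_value):
    """A guessed HT-array must keep the known total, respect the per-slot
    maximum, and be non-increasing (Array[x] >= Array[x+1])."""
    return (
        sum(array) == total
        and all(0 <= v <= max_value for v in array)
        and all(a >= b for a, b in zip(array, array[1:]))
    )


def matches_from_ht(t_array, h_array):
    """Sum, for every slot of the T-array, the last t slots of the H-array."""
    return sum(sum(h_array[len(h_array) - t:]) for t in t_array)


T_GUESS = [4, 4, 3, 0]  # push left, total 11 (matches of the tail OPSM)
H_GUESS = [3, 3, 2, 2]  # push right, total 10 (matches of the head OPSM)

assert is_valid_guess(T_GUESS, total=11, max_value=4)
assert is_valid_guess(H_GUESS, total=10, max_value=4)
upper_bound = matches_from_ht(T_GUESS, H_GUESS)
print(upper_bound)
print(is_valid_guess([0, 2, 4, 4], total=10, max_value=4))  # violates the ordering
```

The bound evaluates to 10 + 10 + 7 + 0 = 27, and the array (0, 2, 4, 4) is rejected because it is not non-increasing.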

HT bounds
Similarly, we can obtain a lower bound of the #subsequence matches of <C1,C2,C3> by assigning the "C1"s as far to the right of the "C2"s as possible, and the "C3"s as far to the left of the "C2"s as possible.
- Guessed arrays for the lower bound: T-array = (3, 3, 3, 2), H-array = (4, 4, 2, 0).
- Lower bound of the #subsequence matches of <C1,C2,C3> = 6 + 6 + 6 + 2 = 20.
- For comparison, the guessed arrays for the upper bound are T-array = (4, 4, 3, 0) and H-array = (3, 3, 2, 2), giving the upper bound 27.
The lower bound of the #subsequence matches of <C1,C2,C3> is 20.
[Figure: size-2 frequent OPSMs table and Head-tail tree.]
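The lower bound follows from the same summation applied to the other pair of guessed arrays. The sketch below uses the arrays shown on the slides and checks that the exact count lies between the two bounds; the function name is an assumption.

```python
def matches_from_ht(t_array, h_array):
    """Sum, for every slot of the T-array, the last t slots of the H-array."""
    return sum(sum(h_array[len(h_array) - t:]) for t in t_array)


# Arrays taken from the slides (row 1, merged pattern <C1,C2,C3>).
lower = matches_from_ht([3, 3, 3, 2], [4, 4, 2, 0])  # lower-bound guess
exact = matches_from_ht([4, 4, 2, 1], [4, 3, 2, 1])  # true HT-arrays
upper = matches_from_ht([4, 4, 3, 0], [3, 3, 2, 2])  # upper-bound guess

print(lower, exact, upper)
```

The three values are 20, 24 and 27, so the exact count is bracketed by the HT bounds.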

Comparisons
- HT-arrays: T-array = (4, 4, 2, 1), H-array = (4, 3, 2, 1). Recall that the HT-array method returns the exact #subsequence matches of <C1,C2,C3>, which is 24. However, it is not feasible to keep the HT-arrays for each candidate.
- HT bounds: upper bound 27 (T-array = (4, 4, 3, 0), H-array = (3, 3, 2, 2)) and lower bound 20 (T-array = (3, 3, 3, 2), H-array = (4, 4, 2, 0)).
- Min upper bound: returns 40 as the upper bound of the #subsequence matches of <C1,C2,C3>.
Compared with the Min upper bound, the HT upper bound is much tighter.

Generalized HT upper bound
Notation: the Tail sequence and the Head sequence are merged into the Generated (New) sequence; the Middle sequence is the part shared by the Tail and the Head. Assume the number of replicates of column C_y is r(C_y).
- The qth slot of the T-array represents the number of Middle sequence matches after the qth C1. Therefore, the number of slots of the T-array is equal to the number of replicates of C1, and the maximum possible value of each slot is equal to the maximum possible #subsequence matches of the Middle sequence.
- The qth slot of the H-array represents the number of "C_x"s after the qth Middle sequence match. Therefore, the number of slots of the H-array is equal to the maximum possible #subsequence matches of the Middle sequence, and the maximum possible value of each slot is equal to the number of replicates of C_x.
[Figure: Tail, Head, Middle and New sequences, shown with the running example (push-left T-array (4, 4, 3, 0), push-right H-array (3, 3, 2, 2), upper bound 27).]

Generalized HT upper bound

We notice that the "push left" assignment always yields a T-array in which the first k slots are fully filled and every slot after the (k+1)-th slot is zero (4 4 3 0 in the example).

Let T be the #subsequence matches of the Tail sequence (i.e. 11 in the example).

Generalized HT upper bound

Let T be the #subsequence matches of the Tail sequence (i.e. 11 in the example), and let M be the maximum possible value of a T-array slot (4 in the example).

Rule 1: The first $\lfloor T/M \rfloor$ slots have value $M$. E.g. the first $\lfloor 11/4 \rfloor = 2$ slots have value 4.

Generalized HT upper bound

Let T be the #subsequence matches of the Tail sequence (i.e. 11 in the example), and let M be the maximum possible value of a T-array slot (4 in the example).

Rule 1: The first $\lfloor T/M \rfloor$ slots have value $M$.
Rule 2: The $(\lfloor T/M \rfloor + 1)$-th slot has value $T \bmod M$. E.g. the 3rd slot has value 11 mod 4 = 3.

Generalized HT upper bound

Let T be the #subsequence matches of the Tail sequence (i.e. 11 in the example), and let M be the maximum possible value of a T-array slot (4 in the example).

Rule 1: The first $\lfloor T/M \rfloor$ slots have value $M$.
Rule 2: The $(\lfloor T/M \rfloor + 1)$-th slot has value $T \bmod M$.
Rule 3: The other slots have value zero.

In the example these rules give the T-array 4 4 3 0.

Generalized HT upper bound

Let H be the #subsequence matches of the Head sequence (i.e. 10 in the example), and let M be the #slots of the H-array, which equals the maximum possible #subsequence matches of the Middle sequence (4 in the example).

Similar to the T-array, the H-array can be divided into two partitions; the values in the first partition are larger than the values in the second partition by 1.

Rule 1: The first $H \bmod M$ slots have value $\lceil H/M \rceil$.
Rule 2: The other slots have value $\lfloor H/M \rfloor$.

In the example these rules give the H-array 3 3 2 2.

Generalized HT upper bound

Let M be the maximum possible #subsequence matches of the Middle sequence, i.e. the maximum possible value of a T-array slot and the #slots of the H-array.

T-array: let T be the #subsequence matches of the Tail sequence.
Rule 1: The first $\lfloor T/M \rfloor$ slots have value $M$.
Rule 2: The $(\lfloor T/M \rfloor + 1)$-th slot has value $T \bmod M$.
Rule 3: The other slots have value zero.

H-array: let H be the #subsequence matches of the Head sequence.
Rule 1: The first $H \bmod M$ slots have value $\lceil H/M \rceil$.
Rule 2: The other slots have value $\lfloor H/M \rfloor$.

With these rules, we can deduce a formula that calculates the upper bound without constructing these arrays. A similar method can be applied to the HT lower bound, so we do not need to materialize any of the HT-arrays.
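As a hedged illustration of what such a formula could look like, the Python sketch below builds the two arrays from the rules and also evaluates a closed form that never materializes them. Everything here is an assumption made for illustration: the function names are hypothetical, the way the two arrays are combined (each T slot adds the sum of the last T[q] slots of the H-array) is inferred from the example value 27, and the closed form is our own derivation rather than the formula used in the talk.

```python
def push_left(total, num_slots, max_value):
    """T-array rules: fill slots from the left, each up to max_value."""
    full, rest = divmod(total, max_value)
    slots = [max_value] * full
    if rest:
        slots.append(rest)
    return slots + [0] * (num_slots - len(slots))


def push_right(total, num_slots):
    """H-array rules: spread total evenly; the first (total mod num_slots)
    slots are larger than the others by 1."""
    low, extra = divmod(total, num_slots)
    return [low + 1] * extra + [low] * (num_slots - extra)


def combine(t_array, h_array):
    """Assumed combination: slot q adds the last t_array[q] H-array slots."""
    n = len(h_array)
    return sum(sum(h_array[n - t:]) for t in t_array)


def ht_upper_bound(T, H, M):
    """Closed form (our derivation): same value, no arrays constructed."""
    full, rest = divmod(T, M)      # fully filled T slots, and the partial slot
    low, extra = divmod(H, M)      # H-array holds `extra` slots of low + 1
    low_slots = M - extra          # number of trailing slots with value `low`
    partial = rest * low + max(0, rest - low_slots)
    return full * H + partial      # each full T slot covers the whole H-array


t_array = push_left(11, num_slots=4, max_value=4)   # [4, 4, 3, 0]
h_array = push_right(10, num_slots=4)               # [3, 3, 2, 2]
print(combine(t_array, h_array), ht_upper_bound(11, 10, 4))  # 27 27
```

The closed form only needs T, H and M, which is the point of the rules: the arrays are fully determined by those three numbers.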

Compression method

Given a column sequence <C1, C2>, we would like to find its #subsequence matches in the column sequence of row 1. The naive method enumerates all the size-2 subsequences in the transformed sequence dataset and counts the occurrences of <C1, C2>, which requires enumerating 16 column orderings.

In the compressed sequence dataset, there are 3 "C1"s on the left of 1 "C2", therefore there are 3*1 = 3 <C1, C2>s.

#subsequence matches of <C1, C2> in row 1: 3*1

Compression method

Counting the #subsequence matches of <C1, C2> in the compressed column sequence of row 1:

There are 3 "C1"s on the left of 1 "C2", therefore there are 3*1 = 3 <C1, C2>s.
There are 3 "C1"s on the left of 3 "C2"s, therefore there are 3*3 = 9 <C1, C2>s.
There is 1 "C1" on the left of 3 "C2"s, therefore there are 1*3 = 3 <C1, C2>s.

#subsequence matches of <C1, C2> in row 1: 3*1 + 3*3 + 1*3

Compression method

#subsequence matches of <C1, C2> in row 1: 3*1 + 3*3 + 1*3 = 15

There are 15 <C1, C2>s in total. Obtaining the #subsequence matches this way only requires enumerating 3 column orderings, instead of the 16 required by the naive method on the transformed sequence dataset.
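The counting argument can be checked with a small Python sketch. The row-1 sequence below is an assumption reconstructed from the three products on the slide (a run of 3 C1, then 1 C2, then 1 C1, then 3 C2); the original sequence figure is not in the transcript, and the function names are hypothetical.

```python
from itertools import combinations, groupby

# Assumed column sequence of row 1, inferred from 3*1 + 3*3 + 1*3.
row1 = ["C1", "C1", "C1", "C2", "C1", "C2", "C2", "C2"]


def naive_count(seq, first, second):
    """Enumerate every ordered pair of positions and count <first, second>."""
    return sum(1 for a, b in combinations(seq, 2) if (a, b) == (first, second))


def compress(seq):
    """Run-length encode the sequence: [(column, run length), ...]."""
    return [(col, len(list(run))) for col, run in groupby(seq)]


def compressed_terms(runs, first, second):
    """One product per pair of runs (first run before second run)."""
    return [
        n_a * n_b
        for (a, n_a), (b, n_b) in combinations(runs, 2)
        if (a, b) == (first, second)
    ]


runs = compress(row1)
terms = compressed_terms(runs, "C1", "C2")
print(runs)                # [('C1', 3), ('C2', 1), ('C1', 1), ('C2', 3)]
print(terms, sum(terms))   # [3, 9, 3] 15
print(naive_count(row1, "C1", "C2"))  # 15
```

Both methods agree on 15, but the compressed version only evaluates three run pairs, one per term on the slide.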

Experimental Evaluation

Experimental settings

Implementation: C programming language
Machine: CPU 2.6 GHz; memory 1 GB; Fedora

Real dataset: yeast galactose dataset
  Subset of 205 genes (rows) of the yeast galactose data
  20 experimental conditions (columns)
  4 biological replicates per condition
  Publicly available: http://expression.microslu.washington.edu/expression/kayee/medvedovic2003/medvedovic_bioinf2003.html

Synthetic dataset
  Replicate simulation: generate normal distributions according to the means and variances of the replicates in the real dataset, and randomly generate a new replicate value according to the distribution.
  Column simulation: generate a new column by randomly selecting an experimental condition in the real dataset and perturbing its mean and variance.
  Row simulation: generate normal distributions according to the means and variances of the replicates in the real dataset, and generate a new row according to the distributions.
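The replicate simulation step might look like the following Python sketch. It is only an illustration under stated assumptions: the experiments were implemented in C, the function name and sample values are hypothetical, and the use of the sample standard deviation (rather than the population one) is our choice, since the slide does not say which is used.

```python
import random
import statistics


def simulate_replicate(replicates, rng):
    """Draw one new replicate value from a normal distribution fitted to
    the existing replicates of a single (gene, condition) cell."""
    mean = statistics.mean(replicates)
    # Assumption: sample standard deviation; the talk does not specify.
    std = statistics.stdev(replicates)
    return rng.gauss(mean, std)


rng = random.Random(0)                 # fixed seed so runs are repeatable
cell = [36.0, 32.5, 38.0, 35.5]        # hypothetical 4 replicates of one cell
extended = cell + [simulate_replicate(cell, rng)]
print(len(extended))  # 5
```

Column and row simulation would reuse the same fitted distributions, either after perturbing the mean and variance of a randomly chosen condition or by drawing a whole new row.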

Execution time per iteration

Figure: the number of candidates generated in each iteration using different bounding techniques.

The brute-force approach mines the OPSMs without using any bounding techniques. All the algorithms start from mining size-2 OPSMs.

For the HT bounds, we use the HT upper bound to identify infrequent candidates, which can be pruned, and we use the HT lower bound to identify large OPSMs. We do not verify the #subsequence matches for those large OPSMs.

The HT upper bound technique reduces the #candidates by more than half in all iterations.
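The way the two bounds are used can be summarized in a minimal Python sketch. The function name, the return labels and the comparison conventions (strictly below the threshold to prune, at or above it to accept) are assumptions for illustration; the slide does not state how ties are handled or how per-row counts are aggregated into a support value.

```python
def classify_candidate(lower_bound, upper_bound, min_support):
    """Decide what to do with a candidate from its bounds alone."""
    if upper_bound < min_support:
        return "prune"    # cannot reach the threshold: infrequent
    if lower_bound >= min_support:
        return "accept"   # already meets the threshold: no verification
    return "verify"       # bounds are inconclusive: count exactly


# Bounds from the earlier example (lower 20, upper 27), hypothetical thresholds.
print(classify_candidate(20, 27, 30))  # prune
print(classify_candidate(20, 27, 15))  # accept
print(classify_candidate(20, 27, 24))  # verify
```

Only the "verify" case pays the cost of obtaining the exact #subsequence matches, which is where the execution time saving comes from.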

Execution time per iteration

Figures: the number of candidates generated in each iteration using different bounding techniques; execution time in each iteration using different bounding techniques.

The HT bounds + compression approach uses the HT upper and lower bounds to reduce the candidate set, and uses the compression method to reduce the cost of obtaining the #subsequence matches of the remaining candidates.

Vary the support threshold

Figure: scalability test on support threshold; execution time saving (%) compared with the brute-force approach.

The saving from the HT upper bound decreases as the support threshold decreases, because it is harder for an upper bound to fall below the support requirement (the pruning condition) when that requirement is lower.

The saving from the lower bound increases as the support threshold decreases. As the support requirement decreases, the difference between the supports of large candidates and the support requirement increases, so those large OPSMs become easier to identify.

The HT bounds + compression method achieves the best execution time saving.

Vary the #columns

Figure: scalability test on #columns; execution time saving (%) compared with the brute-force approach.

Essentially, an increase in the #columns increases the number of candidates generated but NOT the cost of obtaining the #subsequence matches for the candidates. The pruning power of the bounding techniques is largely independent of the number of columns in the dataset.

Vary the #replicates

Figure: scalability test on #replicates; execution time saving (%) compared with the brute-force approach.

The saving from both the Min upper bound and the HT upper bound decreases as the #replicates increases. Why?

Vary the #replicates

The saving from both the Min upper bound and the HT upper bound decreases as the #replicates increases. Why?

HT upper bound: the numbers of slots of the T-array and the H-array are determined by the #replicates. Essentially, the larger the arrays, the looser the bounds.

Vary the #replicates

The saving from both the Min upper bound and the HT upper bound decreases as the #replicates increases. Why?

HT upper bound: the numbers of slots of the T-array and the H-array are determined by the #replicates. Essentially, the larger the arrays, the looser the bounds.

Min upper bound (40 in the example): row 1 contains 11 matches of the shorter subsequence and 4 "C3"s (the #replicates), so the maximum possible #subsequence matches of the extended sequence in row 1 is 11*4 = 44. In the Min upper bound, we multiply the #replicates of C3 by the #subsequence matches of the shorter sequence, so the tightness of the Min bound is also determined by the #replicates.

Vary the #replicates

Figure: scalability test on #replicates; execution time saving (%) compared with the brute-force approach.

The saving from both the Min upper bound and the HT upper bound decreases as the #replicates increases. In contrast, the saving from the HT bounds + compression method increases as the #replicates increases. This is mainly due to the saving from compressing the sequences, which reduces the #enumerated sequences.

Vary the #rows

Figure: scalability test on #rows.

Conclusion

The output of a single microarray is subject to substantial variability; replication is the common practice to address this issue.

We have proposed a scoring model to mine Order Preserving Submatrices from gene expression datasets with repeated measurements.

Mining OPSMs under the scoring model incurs a heavy computational cost (obtaining the #subsequence matches).
  An HT bounding technique and a compression method are proposed to mine the OPSMs efficiently.
  Experimental results show that the HT bounding technique + compression method achieves the best CPU cost saving.

Things not covered in this talk

Biological evaluation of cluster quality: oPOSSUM, Gene Ontology, ARI
Efficient method for the subsequence function
  Prefix tree to organize the candidates and verify the supports through a single dataset scan
  Compression of the sequence dataset to reduce the #prefix tree traversals
  Bounding techniques
Application in other areas: collaborative filtering
Visualization of OPSMs

End Thank you!
