1 Incremental Interactive Mining of Constrained Association Rules from Biological Annotation Data
Imad Rahal, Dongmei Ren, Amal Perera, Hassan Najadat and William Perrizo, North Dakota State University, USA; Riad Rahhal, University of Iowa, USA; Willy Valdivia, Orion Integrated Biosciences, USA

2 High throughput techniques are producing massive quantities of bioinformatics data
Consequently, there is a need for analysis methodologies that scale to larger and larger datasets.
In this paper we use association rule mining (ARM) to discover relationships in Saccharomyces cerevisiae (yeast) genomic data.
ARM was first proposed for Market Basket Research (MBR).
ARM comes into its own when much of the data is categorical or when there is a very large number of dimensions.
However, ARM has been noted for producing a large number of rules, which can overwhelm researchers.
Frequent itemset mining (the first step in ARM) also provides indexing for attributes that appear often, for faster access to information.

3 We propose a new ARM technique which
Optimizes the rule-discovery process by giving biologists the flexibility of incorporating their knowledge into it,
Reduces the overwhelming number of rules that match the specified minimum support and confidence thresholds,
Operates in an incremental and interactive mode:
    Allows new queries to be posed from old ones (interactive mining),
    Uses previous results to answer new queries (incremental mining),
Stores and processes data vertically.

4 Data Representation
Feature          Total Values
pathway          80
EC               622
complexes        316
function         259
localization     43
protein class    191
phenotype        181
interactions     6347
The data used was extracted mostly from the MIPS database (Munich Information center for Protein Sequences).
The left column shows all considered features (feature groups).
The right column shows the number of distinct feature values in the extent domain of each feature.

5 Data Representation
We built a binary gene-by-feature table.
For a categorical feature, we consider each category as a separate attribute or column by bit-mapping it.
For numeric attributes and hierarchical categorical attributes, we used a bit vector for each bit position or hierarchy level (reducing the number of bit vectors by ~ log(n)).
The resulting table has a total of 8039 distinct feature bit vectors (corresponding to "items" in MBR) for 6374 yeast genes (corresponding to transactions in MBR).
For processing and storage optimization, we use Predicate tree (P-tree) patent-pending technology to vertically store and process the resulting bit vectors.
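The bit-mapping described on this slide can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes plain Python integers as uncompressed bit vectors (bit i stands for gene i), and the function names and toy annotations are hypothetical.

```python
from typing import Dict, Hashable, Iterable, List


def bitmap_categorical(values_per_gene: List[Iterable[Hashable]]) -> Dict[Hashable, int]:
    """Return one bit vector per category; bit i is set if gene i has that category."""
    vectors: Dict[Hashable, int] = {}
    for row, values in enumerate(values_per_gene):
        for value in values:
            vectors[value] = vectors.get(value, 0) | (1 << row)
    return vectors


def bit_slices(numbers: List[int], width: int) -> List[int]:
    """Return one bit vector per bit position (index 0 = lowest-order bit)."""
    slices = [0] * width
    for row, number in enumerate(numbers):
        for position in range(width):
            if (number >> position) & 1:
                slices[position] |= 1 << row
    return slices


# Hypothetical toy data: four genes and their localization annotations.
localization = [{"cytoplasm"}, {"nucleus"}, {"cytoplasm", "nucleus"}, set()]
vectors = bitmap_categorical(localization)

# Hypothetical numeric attribute with values below 8, so 3 bit slices suffice
# instead of one column per distinct value.
slices = bit_slices([5, 3, 0, 7], width=3)
```

Each category becomes one column of the gene-by-feature table, while a numeric attribute needs only one column per bit position.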

6 Predicate tree technology
Current practice: structure data into horizontal records and process them vertically (scans).
Predicate tree technology: vertically project each attribute, then vertically project each bit position of each attribute, then compress each bit slice into a basic Ptree.
[Figure: relation R(A1 A2 A3 A4), shown in base 10 and base 2 as horizontally structured records, is scanned vertically into the bit slices R11 R12 R13 R21 R22 R23 R31 R32 R33 R41 R42 R43, which are compressed into the basic Ptrees P11 P12 P13 P21 P22 P23 P31 P32 P33 P41 P42 P43. Basic Ptrees are ANDed horizontally.]
E.g., compression of R11 into P11 goes as follows. The top-down construction of the 1-dimensional Ptree representation of R11, denoted P11, records the truth of the universal predicate "pure 1" in a tree, recursively on halves (1/2^1 subsets), until purity is achieved:
1. Whole is pure1? false → 0
2. Left half pure1? false → 0 (but it is pure (pure0), so this branch ends)
3. Right half pure1? false → 0
4. Left half of right half pure1? false → 0
5. Right half of right half pure1? true → 1
6. Left half of left half of right half pure1? true → 1
7. Right half of left half of right half pure1? false → 0 (and it is pure, so this branch ends)
For categorical attributes, a bitmap is formed for each category and then compressed into a P-tree.
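The recursive halving on this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not the P-tree API: a node is a hypothetical `(bit, children)` tuple, `build_ptree` and `root_count` are made-up names, and the example bit slice is an assumed input chosen because it reproduces the seven pure1 tests listed on the slide.

```python
from typing import List, Optional, Tuple

# A node is (pure1_bit, children); children is None when the segment is pure
# (all 1s or all 0s) and a (left, right) pair otherwise.
Node = Tuple[int, Optional[tuple]]


def build_ptree(bits: List[int]) -> Node:
    """Top-down construction: record 'pure1?' and split mixed segments in halves."""
    if all(bits):
        return (1, None)          # pure1: branch ends
    if not any(bits):
        return (0, None)          # pure0: branch ends
    mid = len(bits) // 2
    return (0, (build_ptree(bits[:mid]), build_ptree(bits[mid:])))


def root_count(node: Node, size: int) -> int:
    """Number of 1 bits represented by a node that covers `size` positions."""
    bit, children = node
    if children is None:
        return size if bit else 0
    half = size // 2
    return root_count(children[0], half) + root_count(children[1], size - half)


# Assumed example slice (not taken from the slide).
r11 = [0, 0, 0, 0, 1, 0, 1, 1]
p11 = build_ptree(r11)
ones = root_count(p11, len(r11))
```

The pure0 left half is stored as a single node, which is where the compression comes from.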

7 Top-down construction of basic P-trees is best for understanding, but bottom-up is much more efficient.
Bottom-up construction of the 1-dimensional Ptree P11 is done using in-order tree traversal and the collapsing of pure siblings.
[Figure: bottom-up construction of P11 from R11, shown alongside the bit slices R11 through R43.]
Bottom-up construction of 2-dimensional Ptrees (e.g., the natural dimension choice for images): a bit-file (e.g., the hi-order bit of the Green band) is laid out in spatial raster order, and its Ptree is built using 2-dimensional Peano order.
[Figure: example bit-file in raster order and the resulting 2-dimensional Ptree.]
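Collapsing pure siblings can be sketched as below. Note the substitution: the slide mentions an in-order traversal, whereas this hedged sketch merges siblings level by level, which yields the same tree for inputs whose length is a power of two. The `(bit, children)` node layout and the function name are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Node = Tuple[int, Optional[tuple]]


def build_bottom_up(bits: List[int]) -> Node:
    """Build a 1-dimensional P-tree from the leaves up, collapsing pure siblings.

    Assumes len(bits) is a power of two.
    """
    level: List[Node] = [(bit, None) for bit in bits]   # one pure leaf per bit
    while len(level) > 1:
        merged: List[Node] = []
        for i in range(0, len(level), 2):
            left, right = level[i], level[i + 1]
            both_pure = left[1] is None and right[1] is None
            if both_pure and left[0] == right[0]:
                merged.append((left[0], None))           # collapse pure siblings
            else:
                merged.append((0, (left, right)))        # mixed: keep both children
        level = merged
    return level[0]


# Assumed example slice (not taken from the slide).
tree = build_bottom_up([0, 0, 0, 0, 1, 0, 1, 1])
```

Each bit is visited once, with no repeated scans of the same segment to test purity.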

8 Mining The Yeast Genome
A scientist interested in investigating the effect of one subset of the features on another, such as the effect of phenotype on function, would:
Mine the frequent itemsets from the phenotype and function feature values separately (producing two independent sets of frequent itemsets),
Perform a join on the two sets of frequent itemsets and produce a new set containing all frequent itemsets that combine the two features.
We assume the antecedent comes from one feature set and the consequent from the other; thus, each frequent itemset produces at most one rule (if the confidence of that rule is high enough).
All subsequent queries that include phenotype and/or function benefit from the frequent itemset mining already done.

9 The Mining Algorithm
Input: a rule query, minisupp and miniconf
Step 1: Mining of FISs from Individual Features
For each relevant feature F, mine all frequent itemsets from the F-values separately.
Using P-trees: the support of an itemset containing items F1 and F2 is obtained by ANDing their P-trees (P_F1 AND P_F2) and performing the ROOTCOUNT operation on the result.
Because each feature is treated independently, the features involved can be mined in parallel.
Step 2: Joining of Feature FISs
After separately mining all the frequent itemsets from the items of all selected features, we perform a join step.
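The support computation in Step 1 can be illustrated with a small sketch. It assumes uncompressed Python integers in place of compressed P-trees, so `root_count` here is just a population count; the function names and the two item vectors are hypothetical.

```python
from typing import Dict, Iterable


def root_count(vector: int) -> int:
    """Stand-in for ROOTCOUNT: the number of 1 bits in the vector."""
    return bin(vector).count("1")


def support(item_vectors: Dict[str, int], itemset: Iterable[str], n_rows: int) -> float:
    """Fraction of rows (genes) that contain every item in the itemset."""
    result = (1 << n_rows) - 1            # start from the all-ones vector
    for item in itemset:
        result &= item_vectors[item]      # AND the vectors of all items
    return root_count(result) / n_rows


# Hypothetical bit vectors over 8 genes.
item_vectors = {"F1": 0b10110110, "F2": 0b10010011}
supp_f1_f2 = support(item_vectors, ["F1", "F2"], n_rows=8)
```

With a real P-tree the AND and the count operate on the compressed trees directly, without scanning records.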

10 The Mining Algorithm
The join exploits the downward closure property of support with respect to itemset size: any itemset must have support greater than or equal to the support of any of its supersets, and thus no itemset can be frequent unless all of its subsets are also frequent.
E.g., for phenotype → function: if the join of two frequent itemsets I_phenotype and I_function is a non-frequent itemset, then there is no need to join I_phenotype or any of its supersets with I_function or any of its supersets.
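The pruned join can be sketched as follows. This is an illustrative sketch only: itemsets are frozensets mapped to uncompressed integer bit vectors, `join_frequent` is a hypothetical name, and the toy itemsets are invented.

```python
from typing import Dict, FrozenSet, List, Tuple

FIS = Dict[FrozenSet[str], int]   # frequent itemset -> bit vector of supporting rows


def join_frequent(fis_a: FIS, fis_b: FIS, n_rows: int, min_supp: float) -> FIS:
    """Join two sets of frequent itemsets, skipping supersets of failed pairs."""
    joined: FIS = {}
    failed: List[Tuple[FrozenSet[str], FrozenSet[str]]] = []
    # Visit smaller pairs first so that failures can prune larger pairs.
    pairs = sorted(((a, b) for a in fis_a for b in fis_b),
                   key=lambda pair: len(pair[0]) + len(pair[1]))
    for a, b in pairs:
        if any(fa <= a and fb <= b for fa, fb in failed):
            continue                      # downward closure: cannot be frequent
        vector = fis_a[a] & fis_b[b]
        if bin(vector).count("1") / n_rows >= min_supp:
            joined[a | b] = vector
        else:
            failed.append((a, b))
    return joined


# Hypothetical frequent itemsets over 8 genes.
phenotype = {frozenset({"p1"}): 0b00001111,
             frozenset({"p2"}): 0b00111100,
             frozenset({"p1", "p2"}): 0b00001100}
function = {frozenset({"f1"}): 0b00000111,
            frozenset({"f2"}): 0b11110000}
joined = join_frequent(phenotype, function, n_rows=8, min_supp=0.25)
```

In this toy run, the pairs involving {p1, p2} are never ANDed, because a subset pair has already failed.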

11 The Mining Algorithm
Step 3: Producing Strong Rules
No enumeration of the different rules that could be derived from a frequent itemset is needed (the second step in traditional ARM).
Note: computing the confidence of a rule is also efficient using P-trees: the confidence of a rule A → C is ROOTCOUNT(P_AC) / ROOTCOUNT(P_A), where P_AC is the P-tree of the itemset containing both A and C.
Step 4: After examining the returned rules, the user often wishes to issue a related but slightly different query. This can be viewed as the start of the interactive mode.
Such new queries typically involve features that have already been included in a previous query.
Our approach incrementally builds on the results obtained so far to answer the new query.
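The confidence formula in Step 3 can be illustrated with a short sketch. It assumes uncompressed integer bit vectors in place of P-trees; `confidence`, `is_strong`, and the example vectors are hypothetical.

```python
def root_count(vector: int) -> int:
    """Stand-in for ROOTCOUNT: the number of 1 bits in the vector."""
    return bin(vector).count("1")


def confidence(p_a: int, p_c: int) -> float:
    """Confidence of A -> C: rows with both A and C divided by rows with A."""
    rows_with_a = root_count(p_a)
    if rows_with_a == 0:
        return 0.0                        # antecedent never occurs
    return root_count(p_a & p_c) / rows_with_a


def is_strong(p_a: int, p_c: int, min_conf: float) -> bool:
    """A rule is strong when its confidence reaches the threshold."""
    return confidence(p_a, p_c) >= min_conf


# Hypothetical vectors over 8 genes.
antecedent = 0b00000111
consequent = 0b00001111
conf_forward = confidence(antecedent, consequent)    # A -> C
conf_backward = confidence(consequent, antecedent)   # C -> A
```

Because the antecedent and consequent sides are fixed by the query, only one direction needs to be evaluated per joined itemset.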

12 The Mining Algorithm
For example, suppose that the user submits "localization → function" after "phenotype → function". All that needs to be done is to mine the frequent itemsets from localization and join them with those of function.
If a new query, "localization, phenotype → function", is submitted, we reuse all the frequent itemsets from the first request and join them with those derived from localization.
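The reuse across queries can be sketched as a cache of per-feature frequent itemsets. This is an assumption-laden illustration, not the authors' system: `IncrementalMiner`, `toy_miner`, and the cache key of (feature, minimum support) are all hypothetical, and the join step is omitted.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

FIS = Dict[FrozenSet[str], int]


class IncrementalMiner:
    """Caches the frequent itemsets of each feature so later queries can reuse them."""

    def __init__(self, mine_feature: Callable[[str, float], FIS]) -> None:
        self._mine_feature = mine_feature
        self._cache: Dict[Tuple[str, float], FIS] = {}
        self.mined: List[str] = []        # features that actually had to be mined

    def frequent_itemsets(self, feature: str, min_supp: float) -> FIS:
        key = (feature, min_supp)
        if key not in self._cache:        # mine only on a cache miss
            self._cache[key] = self._mine_feature(feature, min_supp)
            self.mined.append(feature)
        return self._cache[key]

    def query(self, antecedents: List[str], consequent: str, min_supp: float) -> Dict[str, FIS]:
        """Collect the frequent itemsets of every feature named in the query."""
        return {feature: self.frequent_itemsets(feature, min_supp)
                for feature in [*antecedents, consequent]}


def toy_miner(feature: str, min_supp: float) -> FIS:
    """Hypothetical stand-in for real per-feature frequent itemset mining."""
    return {frozenset({feature + "=x"}): 0b1}


miner = IncrementalMiner(toy_miner)
miner.query(["phenotype"], "function", 0.05)
miner.query(["localization"], "function", 0.05)
miner.query(["localization", "phenotype"], "function", 0.05)
```

After the three queries, each feature has been mined exactly once; the third query triggers no mining at all.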

13 Algorithmic Details
For the generation of FISs, we utilize a previous P-tree ARM approach [Rahal, Denton, Perrizo JIKM Journal Dec [13] and store them in a (frequent) Set Enumeration (SE) tree containing all frequent itemsets.
[Figure: a) example (frequent) SE tree for function, with nodes including Transcription, Metabolism and Energy; b) example (frequent) SE tree for phenotype, with nodes including Cell cycle defects, Sensitivity to antibiotics and Stress response defects. Ø marks the empty itemset at the root.]

14 Experimental Study
Implementations were coded in C++ and executed on an Intel Pentium-4 2.4GHz processor workstation with 2GB RAM, running Redhat Linux 9.0. All implementations use the P-tree API.
For our approach, we computed the total time for executing 5, 10, 15, 20 and 25 consecutive inter-related queries.
Each query contains up to 3 features and uses at least one feature from a previous query.
We compare with the standard approach (mine over all attribute values), for which we include only the time needed to mine the whole dataset, not the time needed to scan the resulting set of rules for the subset of interest.
We set the minimum confidence threshold to 90% and varied the minimum support threshold between 0.05% and 20%.

15 The figure clearly shows the gain achieved by using our approach
The post-processing approach needs more than 620 seconds at 5.9% support threshold

16 Biologists could go to very low support thresholds and mine frequent itemsets (and eventually rules) that would go undetected in the post-processing approach

17 The brute-force approach returned slightly less than a million rules at support 5.9%, most of which are irrelevant to the queries we've selected.
For our queries, interesting rules started to show up at support ~ 0.5%.
For high support, mostly uninteresting and evident (trivial) rules appeared.
Here is where our results associated the yeast eIF2B factor with specific interactions within the cellular complex.

18 A significant portion of the rules were straightforward in the sense of providing only common knowledge, e.g., complex = cytoplasmic ribosomal large subunit → localization = cytoplasm
Of significant interest to our biological collaborators was a set of rules pertinent to the yeast eukaryotic initiation factor 2B (eIF2B): "complex = eIF2B (5 ORFs)" → "function = ribosome biogenesis"
eIF2B is a multi-subunit guanine nucleotide exchange factor which catalyzes the exchange of GDP bound to initiation factor eIF2 for GTP, generating active eIF2-GTP.
In humans, it is composed of five subunits: alpha, beta, delta, gamma and epsilon.
In yeast, the eIF2B factor mediates the exchange of a series of proteins bound to translation initiation, the process preceding formation of the peptide bond between the first two amino acids of a protein.
Specifically, it catalyzes a vital regulatory step in the initiation of the translation of mRNA.

19 Architecture for the DataMIME™ System
(DataMIME™: data mining, NO NOISE)
[Figure: architecture diagram. YOUR DATA enters through the DII (Data Integration Interface) using the Data Integration Language (DIL). YOUR DATA MINING goes through the DMI (Data Mining Interface) using the Ptree (Predicates) Query Language (PQL). Both connect over the Internet to the Data Repository, a lossless, compressed, distributed, vertically-structured database.]

20 Conclusion
In this paper, we proposed a computational approach targeted at the analysis of yeast genome annotation data.
It gives biologists the flexibility of incorporating domain knowledge, in the form of queries, thus helping them focus their analysis on specific features of interest.
It optimizes the rule-discovery process by allowing operation in the interactive and incremental modes, and it enables parallel processing and the reuse of mined results.
It stores and processes data vertically and efficiently.

21 Future Directions
Extend the features in our analyzed data, for example to include secondary protein structure information.
We also aim to pursue similar analysis over different genomes, such as the human genome.
A broader goal is to look for "inter-organism" association rules that are valid across organisms rather than for a single organism.
Don't do the join: a rare phenotype may imply a common and important function, and that rule may be the most important of all of them (ft.=cell death).

