1 Kansas State University Department of Computing and Information Sciences, Kansas State University KDD Lab (www.kddresearch.org)
Relational Graphical Models for Link Analysis: Learning from Processes
William H. Hsu, Department of Computing and Information Sciences, Kansas State University
http://www.kddresearch.org
Thursday, 04 December 2003
This presentation is http://www.kddresearch.org/KSU/CIS/IA-State-20031204.ppt
Joint work with: S. Harmon, R. Joehanes, J. A. Thornton

2 Outline
Application: Workflow Modeling in Bioinformatics
– Collaborative recommendation
– Example: gene expression modeling
Methodology: Relational Graphical Models (RGMs)
– DESCRIBER project: using RGMs
– Other workflows: proteomics, metabolomics
Link Analysis Problems
– Finding dynamic relational attributes
– Identity uncertainty
Other Applications
– Market basket analysis: cross-selling
– Spatial data cleaning
Software for Building Graphical Models: BNJ, etc.

3 Collaborative Recommendation: Data Mining from Commercial Clickstreams
Explanation from Recommender (Decision Support) System
Classification and Regression based upon Historical Customer Data (Market Basket Analysis)
© 2003 Amazon.com, Inc.

4 Computational Genomics and Microarray Gene Expression Modeling
[Figure: microarray experiment] Treatment 1 (Control), Treatment 2 (Pathogen); Messenger RNA (mRNA) Extract 1, Messenger RNA (mRNA) Extract 2; cDNA; DNA Hybridization Microarray (under LASER)
Adapted from Friedman et al. (2000), http://www.cs.huji.ac.il/labs/compbio/
[Figure: learning environment] D: Data (User, Microarray); [A] Structure Learning over candidate structures G1, G2, G3, G4, G5, yielding G = (V, E); [B] Parameter Estimation, yielding B = (V, E, Θ); Specification Fitness (Inferential Loss); D_val (Model Validation by Inference)

5 Bioinformatics: Data Mining from DNA Hybridization Microarrays
How do we get from microarray data (and other expression data) to a linked network?
© G. Simpson (1999). Used with permission.

6 Bioinformatics: Managing Complexity!
A. thaliana has 25,500 estimated genes
(25,500)^2 = 650 million!
Affy 2-chip complete Arabidopsis genome set costs $1,000, which is 4 cents per data point
Total cost: $26 million
Is there a problem here?
© 2002 S. M. Welch, KSU. Used with permission.
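The figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes that one chip set yields one data point per gene and that the total cost is the number of gene pairs times the rounded per-point price; both readings are inferred from the numbers, not stated on the slide.

```python
# Figures quoted on the slide.
genes = 25_500
chip_set_cost_usd = 1_000.0  # one Affy 2-chip Arabidopsis genome set

# Assumption: one data point per gene per chip set.
cost_per_point = round(chip_set_cost_usd / genes, 2)  # about 4 cents

# Assumption: (25,500)^2 counts ordered gene pairs, one data point each.
pairs = genes ** 2
total_cost_usd = pairs * cost_per_point

print(f"{pairs:,} pairs at ${cost_per_point:.2f} each: ${total_cost_usd:,.0f}")
```

Under these assumptions the total comes out near $26 million, matching the slide.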

7 Bioinformatics: Profiling of Gene Expression (and Regulatory Dynamics)
[Figure: model with states Low and High, labels A and B, and probabilities 0.7, 0.3, 0.2, 0.8]
First profile, running products: 0.8, 0.16, 0.112, 0.0784, = 0.06272
Second profile, running products: 0.2, 0.04, 0.012, 0.0036, = 0.00072
The first profile is 87.1 times as probable as the second
© 2002 S. M. Welch, KSU. Used with permission.
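The comparison on this slide reduces to a ratio of two products of probabilities. The sketch below is a minimal check; the individual factors are not printed on the slide and are inferred here by dividing each running product by the one before it, so treat them as an assumption.

```python
from math import prod

# Factors inferred from the slide's running products (an assumption):
# 0.8 -> 0.16 -> 0.112 -> 0.0784 -> 0.06272
likely_profile = [0.8, 0.2, 0.7, 0.7, 0.8]
# 0.2 -> 0.04 -> 0.012 -> 0.0036 -> 0.00072
unlikely_profile = [0.2, 0.2, 0.3, 0.3, 0.2]

p_likely = prod(likely_profile)
p_unlikely = prod(unlikely_profile)
ratio = p_likely / p_unlikely

print(f"{p_likely:.5f} / {p_unlikely:.5f} = {ratio:.1f}")
```

For longer profiles, summing log-probabilities instead of multiplying avoids numerical underflow.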

8 DESCRIBER Project
Domain-Specific Workflow Repositories
– Workflows: Transactional, Objective Views
– Workflow Components: Data Sources, Transformations; Other Services
Data Entity, Service, and Component Repository Index for Bioinformatics Experimental Research
Learning over Workflow Instances and Use Cases (Historical User Requirements)
Use Case & Query/Evaluation Data
Personalized Interface: Domain-Specific Collaborative Recommendation
User Queries & Evaluations
Decision Support Models
Users of Scientific Workflow Repository
Interface(s) to Distributed Repository
Example Queries: What experiments have found cell cycle-regulated metabolic pathways in Saccharomyces? What codes and microarray data were used? How and why?

10 Relational Graphical Models (RGMs): Computational Genomics Domain
[Figure: schema diagram]
Legend: Relational Link (Reference Key); Probabilistic Dependency
Classes: cDNA, Microarray-Experiment, Gene, Protein, Pathway
Attribute labels: protein-product, role, pathway, functional-description, canonical-name, accession-number, protein-ID, cDNA-sequence, treatment, hybridization, normalization, data, regulation, DNA-sequence, pathway-descriptor, pathway-name, pathway-ID, pathway

11 DESCRIBER: System Overview
Module 1: Collaborative Recommendation Front-End; Personalized Interface
Module 2: Learning & Validation of Relational Graphical Models (RGMs) for Experimental Workflows and Components
Module 3: Estimation of RGM Parameters from Workflow and Component Database
Module 4: Learning & Validation of RGMs for User Requirements
Module 5: RGM Parameters from User Query Data
Data flow labels: User Queries; Workflow Logs, Instances, Templates, Components (Services, Data Sources); Training Data; Structure & Data; RGMs of Queries; Complete RGMs of User Queries; RGMs of Workflows; Complete RGMs of Workflows (Data-Oriented); Recommendations/Evaluations (Before and After Use)

12 DESCRIBER: Using Relational Graphical Models
Module 1: Collaborative Recommendation Front-End
– Personalized User Interface: Workflow Development and Repurposing
– Learning to Predict User Preferences
– Decision Support: Selection of Workflows, Components, and Services to Reuse/Adapt
Diagram labels: Graphical Relational Models of Past User Queries; Graphical Relational Models of Workflows, Components; Models for CR; Recommendations to User; New Queries, Evaluations from User
Connections: User Queries (Use Case Data) to Module 4; Workflow Instances, Logs, Templates, Components to Module 2; inputs from Module 3 and from Module 5

13 Graphical Models [1]: Temporal Probabilistic Reasoning
Goal: Estimate P(X_t | y_1, ..., y_r), the state at time t given observations through time r
Filtering: r = t
– Intuition: infer current state from observations
– Applications: signal identification
– Variation: Viterbi algorithm
Prediction: r < t
– Intuition: infer future state
– Applications: prognostics
Smoothing: r > t
– Intuition: infer past hidden state
– Applications: signal enhancement
CF Tasks
– Plan recognition by smoothing
– Prediction cf. WebCANVAS – Cadez et al. (2000)
Adapted from Murphy (2001), Guo (2002)
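The filtering case (r = t) can be illustrated with the forward recursion for a discrete hidden Markov model. The sketch below is a minimal pure-Python version; the function name `forward_filter` and the two-state toy model are illustrative assumptions and do not come from the slides.

```python
def forward_filter(prior, transition, emission, observations):
    """Return the filtered beliefs P(X_t | y_1..y_t) for t = 1..T.

    prior[i]         : P(X_0 = i), belief before any observation
    transition[i][j] : P(X_t = j | X_{t-1} = i)
    emission[j][y]   : P(Y_t = y | X_t = j)
    """
    n = len(prior)
    belief = list(prior)
    history = []
    for y in observations:
        # Predict: push the current belief through the transition model.
        predicted = [sum(belief[i] * transition[i][j] for i in range(n))
                     for j in range(n)]
        # Update: weight by the observation likelihood, then normalize.
        unnormalized = [predicted[j] * emission[j][y] for j in range(n)]
        total = sum(unnormalized)
        belief = [u / total for u in unnormalized]
        history.append(belief)
    return history


# Hypothetical toy model: state 0 = "on", state 1 = "off";
# observation 0 = "signal seen", observation 1 = "no signal".
prior = [0.5, 0.5]
transition = [[0.7, 0.3], [0.3, 0.7]]
emission = [[0.9, 0.1], [0.2, 0.8]]

beliefs = forward_filter(prior, transition, emission, [0, 0])
for t, b in enumerate(beliefs, start=1):
    print(t, [round(p, 3) for p in b])
```

Prediction reuses only the predict step with no update, while smoothing adds a backward pass over later observations.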

14 Graphical Models [2]: Learning Structure from Data
General-Case BBN Structure Learning: Use Inference to Compute Scores
Optimal Strategy: Bayesian Model Averaging
– Assumption: models h ∈ H are mutually exclusive and exhaustive
– Combine predictions of models in proportion to marginal likelihood
Compute conditional probability of hypothesis h given observed data D
i.e., compute expectation over unknown h for unseen cases
Let h ≡ structure, parameters Θ ≡ CPTs
Equation labels: Posterior Score, Marginal Likelihood, Prior over Structures, Likelihood, Prior over Parameters
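The five labels on this slide (Posterior Score, Marginal Likelihood, Prior over Structures, Likelihood, Prior over Parameters) appear to annotate an equation that did not survive extraction. A standard decomposition consistent with those labels is given below as a reconstruction, not necessarily in the slide's exact notation; x denotes an unseen case.

```latex
% Posterior score of structure h: marginal likelihood times structure prior
P(h \mid D) \;\propto\; P(D \mid h)\, P(h)

% Marginal likelihood: likelihood averaged over the parameter prior
P(D \mid h) \;=\; \int P(D \mid h, \Theta)\, P(\Theta \mid h)\, d\Theta

% Bayesian model averaging: expectation over the unknown h for an unseen case x
P(x \mid D) \;=\; \sum_{h \in H} P(x \mid h, D)\, P(h \mid D)
```

With a uniform prior P(h), the posterior weights are proportional to the marginal likelihoods, which matches the slide's phrase "in proportion to marginal likelihood".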

16 Learning and Inference in Graphical Models of Probability (2000 – present)
[Figure: learning environment] D: Data (User, Microarray); [A] Structure Learning over candidate structures G1, G2, G3, G4, G5, yielding G = (V, E); [B] Parameter Estimation, yielding B = (V, E, Θ); Specification Fitness (Inferential Loss); D_val (Model Validation by Inference)

17 Finding Dynamic Relational Attributes: From Workflows to Class Diagrams
Transactional View (cf. UML Sequence Diagram): TAVERNA Workbench, myGrid Project, © 2003 Oinn et al.
Objective View (cf. UML Class Diagram): DESCRIBER Schema, © 2003 Hsu
Legend: Relational Link (Reference Key); Probabilistic Dependency
Classes: cDNA, Microarray-Experiment, Gene, Protein, Pathway
Attribute labels: protein-product, role, pathway, functional-description, canonical-name, accession-number, protein-ID, cDNA-sequence, treatment, hybridization, normalization, data, regulation, DNA-sequence, pathway-descriptor, pathway-name, pathway-ID, pathway

18 Identity Uncertainty
How to Tell When Two Descriptors Refer to Same Entity?
Problem
– Coalesced databases
– Multiple sources
Errors and Inconsistencies
– Spatial, temporal error
– Inconsistent descriptors
Clues
– Proximity in space, time
– Similarities in values of key variables (attributes, features)
Applications
– Data cleaning
– Fraud detection
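The clues listed on this slide (proximity in space and time, similarity of key attributes) can be combined into a simple match score. The sketch below is a hand-weighted heuristic for illustration only, not the probabilistic relational model the talk describes; the `Record` fields, the thresholds, and the equal weighting are all assumptions.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Record:
    """Hypothetical descriptor of an entity drawn from one source database."""
    x: float      # easting, in meters
    y: float      # northing, in meters
    year: int     # year the record was made
    owner: str    # a key attribute expected to agree for the same entity


def same_entity_score(a, b, max_dist=100.0, max_years=5):
    """Score in [0, 1]; higher means a and b more likely name one entity."""
    # Spatial clue: 1 at zero distance, falling linearly to 0 at max_dist.
    spatial = max(0.0, 1.0 - hypot(a.x - b.x, a.y - b.y) / max_dist)
    # Temporal clue: 1 for the same year, falling to 0 at max_years apart.
    temporal = max(0.0, 1.0 - abs(a.year - b.year) / max_years)
    # Attribute clue: exact match after normalizing case and whitespace.
    attribute = 1.0 if a.owner.strip().lower() == b.owner.strip().lower() else 0.0
    return (spatial + temporal + attribute) / 3.0


well_a = Record(0.0, 0.0, 2000, "J. Smith")
well_b = Record(10.0, 0.0, 2001, " j. smith")
well_c = Record(500.0, 500.0, 1980, "A. Jones")

print(same_entity_score(well_a, well_b))  # near-duplicate records
print(same_entity_score(well_a, well_c))  # unrelated records
```

A learned model would replace the fixed weights with probabilities estimated from labeled matches.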

20 Cross-Selling: Market Basket Analysis and Collaborative Recommendation
Cross-Selling (based upon Market Basket Analysis)
Collaborative Recommendation
© 2002 Amazon.com, Inc.
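Market basket analysis for cross-selling is commonly summarized with support, confidence, and lift for a rule such as "customers who buy A also buy B". The slide does not name these measures, so the sketch below is an assumption about one standard way to compute them; the function name `rule_stats` and the toy baskets are hypothetical.

```python
def rule_stats(baskets, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent.

    baskets is a list of sets of item names; antecedent and consequent are sets.
    """
    n = len(baskets)
    with_antecedent = sum(1 for b in baskets if antecedent <= b)
    with_consequent = sum(1 for b in baskets if consequent <= b)
    with_both = sum(1 for b in baskets if (antecedent | consequent) <= b)

    support = with_both / n
    confidence = with_both / with_antecedent if with_antecedent else 0.0
    # Lift above 1 means the consequent is more common among antecedent buyers.
    lift = confidence / (with_consequent / n) if with_consequent else 0.0
    return support, confidence, lift


# Toy purchase histories (illustrative data only).
baskets = [
    {"book", "cd"},
    {"book", "cd", "dvd"},
    {"book"},
    {"dvd"},
]

support, confidence, lift = rule_stats(baskets, {"book"}, {"cd"})
print(support, round(confidence, 3), round(lift, 3))
```

A recommender would rank candidate cross-sell items by confidence or lift and suppress rules whose support is too low to be reliable.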

21 Spatial Data Cleaning: STARWARD
Groundwater irrigation lifetime estimates in the Ogallala region of the Kansas High Plains aquifer [Wilson et al. 2002], http://snurl.com/39kz
Map legend: Darkest: already depleted; Next darkest: 25-50 years
Problems: Water well location (identity uncertainty in coalesced spatial databases), descriptive statistics (paraconsistency), spatial outlier detection

22 Software Packages for Building Graphical Models: BNJ, etc.
Commercial Tools: Ergo, Netica, TETRAD, Hugin
Bayesian Network tools in Java (BNJ) – Hsu et al. (1999-present)
– Distribution page: http://bndev.sourceforge.net
– Development group: http://groups.yahoo.com/group/bndev
– Current (re)implementation projects for KSU KDD Lab
  Continuous state: Minka (2002) – Harmon, Hsu, Joehanes
  Formats: XML BNIF (MSBN), Netica – Guo, Joehanes, Hsu
  Space-efficient DBN inference – Hsu, Patel, Plummer
  Bounded cutset conditioning – Chandak
Bayes Net Toolbox (BNT) – Murphy (1997-present)
– http://groups.yahoo.com/group/BayesNetToolbox
Graphical Models in R (gR) – Lauritzen et al. (2002-present)

23 Graphical Models for KDD: Research Projects at K-State
Laboratory for Knowledge Discovery in Databases (KDD)
– Computational science and engineering (CSE): decision support
– Bioinformatics and Medical Informatics (BMI) working group: www.kddresearch.org/Groups/Bioinformatics
– Human-Computer Interaction (HCI), e.g., simulation-based training: www.kddresearch.org/Groups/{Intelligent-HCI | Vislab}
Programs and Workshops
– PRISM: CIS (Hsu), EECE (Das), IMSE (Chang): www.kddresearch.org/Groups/Soft-Computing
– Machine Learning and KDD (IJCAI-2001): www.kddresearch.org/Workshops/IJCAI-2001
– Real-Time Decision Support and Diagnosis (AAAI/UAI/KDD-2002): www.kddresearch.org/Workshops/RTDSDS-2002
– Learning Graphical Models for Computational Genomics (IJCAI-2003): www.kddresearch.org/Workshops/IJCAI-2003-Bioinformatics
Research Partnerships
– Defense (ONR), NSF (EPSCoR), industry (Syngenta), academia
– Illinois (NCSA), UC Davis, TIGR, U. of Manchester, U. of Southampton

24 Acknowledgements
Kansas State University Lab for Knowledge Discovery in Databases
– Alumni: Haipeng Guo (hpguo@cis.ksu.edu), Benjamin Perry
– Graduate research assistants: Scott Harmon (sjh4069@cis.ksu.edu), Roby Joehanes (hpguo@cis.ksu.edu)
– Other grad students: Siddharth Chandak, Vinod Chandana, Edwin Rodriguez (affiliate), Julie A. Thornton (affiliate)
– Undergraduate researchers: Chris Meyer, James Plummer, Silpan Patel
Joint Work with
– KSU Bioinformatics and Medical Informatics (BMI) group: Sanjoy Das (EECE), Judith L. Roe (Biology), Stephen M. Welch (Agronomy)
– KSU Microarray group: Scot Hulbert (Plant Pathology), J. Clare Nelson (Plant Pathology), Jan Leach (Plant Pathology)
– Kansas Geological Survey (Geoff Bohling, Bob Buddemeier), Kansas Biological Survey, KU EECS
Other Research Partners
– The Institute for Genomic Research (J. Quackenbush, A. Saeed)
– Manchester (C. Goble, R. Stevens), Southampton (M. Addis)
– International Rice Research Institute (R. M. Bruskiewich)

