Automating Domain-Type-Dependent Data Mining as a Computational Estimation Technique for Decision Support in Materials Science
Ph.D. Dissertation Proposal
Aparna S. Varde, Department of Computer Science, Worcester Polytechnic Institute, September 2004
Dissertation Committee: Prof. Elke A. Rundensteiner (Advisor), Prof. David C. Brown, Prof. Carolina Ruiz, Prof. Neil T. Heffernan, Prof. Richard D. Sisson Jr., CHTE (External Member)
Outline
Introduction
Problem Definition
State-of-the-art
Proposed Approach and Tasks
1: Integrating Clustering and Classification
2: Domain-Type-Dependent Clustering
3: Domain-Type-Dependent Classification
Summarizing Dissertation Contributions
Schedule for Completion
Introduction
Experimental results in science and engineering are often plotted as graphs. Graphs represent the functional behavior of process parameters and are used in analysis to assist decision-making. A lab experiment consumes time and resources, so it is desirable to estimate the resulting graphs given the input conditions.
Science and Engineering Experiments
Consider the domain of heat treating of materials: the controlled heating and cooling of materials to achieve desired mechanical and thermal properties. Equipment costs run to thousands of dollars; recurrent costs to hundreds. Results are plotted as graphs.
(Figure: CHTE experimental setup.)
Graphs Plotted from Experiments
Heat Transfer Coefficient (hc) Curve: a plot of the heat transfer coefficient (hc) vs. part temperature. hc is the heat extraction capacity, determined by density, cooling rate, etc. Domain semantics appear on the hc curve.
(Figure: heat transfer coefficient curve with domain semantics. LF: Leidenfrost point; BP: boiling point of quenchant; MAX: maximum heat transfer.)
Problem Definition
To develop an estimation technique for decision support. Goals:
1. Given the input conditions in an experiment, estimate the resulting graph.
2. Given the desired graph in an experiment, estimate the conditions to obtain it.
Problem Definition (Contd.)
Assumptions:
A graph is a 2-D plot of one dependent vs. one independent variable.
There exists a correlation between the input conditions and the results.
Details of performed experiments are stored in a database.
Evaluation Criteria:
Domain expert intervention should not be needed during estimation.
Estimation accuracy should be satisfactory for decision support.
Estimation should be much faster than a real lab experiment.
State-of-the-art
Naive Similarity Searching [HK-01]: non-matching conditions could be significant in the domain.
Weighted Similarity Searching [WF-00], [M-97]: precise weights denoting relative importance are not known a priori.
Case-based Reasoning [AP-03], [K-88], [AV-01], [S-88]:
Regular CBR involves human intervention.
Exemplar reasoning has the same problem as naive search.
Instance-based reasoning has the same problem as weighted search.
Mathematical Modeling [M-95], [S-60], [PG-60]: in some cases the variables in existing models are not known; in others the precise equations are not known.
Proposed Approach: AutoDomainMine
Knowledge Discovery: cluster existing experiments based on their graphs, incorporating domain semantics. Use classification to learn the clustering criteria and to build domain-specific representative cases.
Estimation: use the learnt criteria and the representative cases to estimate new cases.
Knowledge Discovery in AutoDomainMine
Estimation in AutoDomainMine
Proposed Tasks
1: Integrating Clustering and Classification.
2: Domain-Type-Dependent Clustering.
3: Domain-Type-Dependent Classification.
Task 1: Integrating Clustering and Classification
Learning Analogous to Scientists
(Figure: scientists group experiments, then reason about the cause of similarity from the input conditions. Source: CHTE.)
Motivating Example
The following facts were learnt by materials scientists from the results of experiments:
A thin oxide layer around a part causes the vapor blanket around the part to break, resulting in faster cooling.
A thick oxide layer around a part acts as an insulator, resulting in slower cooling.
This learning was done by:
Performing lab experiments with the given input conditions,
Grouping based on the resulting graphs,
Reasoning based on the input conditions.
Reference: CHTE, WPI.
Clustering
Definition: the process of placing a set of physical or abstract objects into groups of similar objects [HK-01].
Unsupervised Learning: learning by observation; no pre-defined labels.
Analogous to: a scientist grouping similar graphs by observation.
Classification
Definition: a form of data analysis that can be used to extract models to predict categorical labels [HK-01].
Supervised Learning: class labels of training samples are provided; the classification target is pre-defined.
Analogous to: a scientist reasoning about what combination of input conditions characterizes a group, after the groups are formed.
Clustering followed by Classification
Clustering is useful to group the graphs. Before clustering there are no labels; after clustering, labels are available as cluster IDs. The cluster IDs form the classification target.
(Figure: clustering of graphs.)
Clustering followed by Classification (Contd.)
Classification is used to identify the combinations of input conditions that characterize each cluster.
(Figure: input conditions fed to classification, predicting the cluster ID output by clustering.)
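The cluster-then-classify idea can be sketched as follows. This is a minimal stand-in with hypothetical conditions and cluster assignments; the dissertation uses WEKA's decision tree classifier, not this frequency count.

```python
# Cluster IDs from the clustering step become class labels for the
# input conditions of each experiment (toy, hypothetical data).
from collections import Counter

# Output of clustering: each experiment's graph fell into a cluster.
conditions = [("oil", "high"), ("oil", "high"), ("air", "low"), ("air", "low")]
cluster_ids = [0, 0, 1, 1]

# Cluster IDs form the classification target.
training_set = list(zip(conditions, cluster_ids))

# A minimal stand-in for classification: which combination of input
# conditions characterizes each cluster?
by_cluster = {}
for cond, cid in training_set:
    by_cluster.setdefault(cid, Counter())[cond] += 1
characteristic = {cid: ctr.most_common(1)[0][0] for cid, ctr in by_cluster.items()}
```

Here `characteristic` maps each cluster ID to the condition combination most often seen in it, which is the role the decision tree plays in the full system.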
Reason for Clustering Graphs
Possible alternative: cluster the input conditions and learn the clustering criteria from them.
Problem with this alternative: a clustering technique attaches the same weight to all conditions, which adversely affects accuracy. This cannot be corrected by introducing relative weights, since the weights are not known in advance; they need to be learnt. We propose to learn from the results (graphs) of the experiments.
State-of-the-art
Integrating rule-based and case-based reasoning [PC-97]: the combined approach is better than either one individually in some domains, e.g., medicine and law.
Integrating classification and association rule mining [LHM-98]: discover associations between items and use them to mine classification rules; derive rules with confidence and support above a threshold such that they classify a target.
Neural networks can be used for clustering as well as for classification [HK-01], [NIPS-98].
Task 2: Domain-Type-Dependent Clustering
Clustering Graphs
Clustering algorithms: k-means, EM, etc. [HK-01]. The default notion of similarity is Euclidean distance.
Problem: the graphs below are placed in the same cluster, but as per the domain they should be in different clusters. Other distance metrics exist in the literature, but it is not known which one(s) work best in a given domain, and domain semantics may be available only in subjective form.
Issue: learning a domain-specific distance metric.
Categories of Distance Metrics
In the literature:
Position-based: e.g., Euclidean distance, Manhattan distance [HK-01].
Statistical: e.g., mean distance, max distance [PNC-99].
Shape-based: e.g., tri-plots [TTPF-01].
Others: e.g., order-based in DNA data [KB-04].
(Figure: statistical example, DMax(A, B) between Graph A and Graph B.)
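Two of these categories can be illustrated on a pair of sampled curves. The pointwise definitions below are assumptions for illustration (the cited papers define the statistical distances precisely); the curves are toy data.

```python
# Position-based vs. statistical distances on two sampled curves.
a = [9.0, 7.0, 4.0, 2.0]
b = [8.0, 5.0, 4.0, 1.0]
diffs = [abs(x - y) for x, y in zip(a, b)]

d_euclidean = sum(d * d for d in diffs) ** 0.5   # position-based
d_mean = sum(diffs) / len(diffs)                 # statistical: mean distance
d_max = max(diffs)                               # statistical: max distance (DMax)
```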
Categories of Distance Metrics (Contd.)
Defined by us:
Critical Distance: the distance between domain-specific critical regions in two objects in a given domain.
(Figure: critical distance example, Dcritical(A, B) between Graph A and Graph B.)
LearnMet: Learning a Domain-Specific Distance Metric for Graphs
Goal: to learn a domain-specific distance metric for accurately clustering graphs.
Assumptions:
Training set: correct clusters of graphs in the domain.
Possible additional input from domain experts, e.g., the relative importance of distance types, stated subjectively.
General Definition of Distance Metric
The distance metric is defined as a weighted sum of components.
Components: position-based, statistical, etc. Weights: values denoting relative importance.
Formula: distance D is defined as
D = w1*c1 + w2*c2 + ... + wm*cm, i.e., D = Σ{i=1 to m} wi*ci
Example: D = 4*Euclidean + 3*Mean + 5*Critical
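The weighted-sum definition translates directly into code. The component distances and weights below are illustrative (the slide's Critical component is omitted for brevity, since its definition is domain-specific).

```python
# A distance metric as a weighted sum of component distances,
# mirroring D = sum_i w_i * c_i (weights here are illustrative guesses).

def d_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def d_mean(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def make_metric(weighted_components):
    """Combine (weight, component_distance) pairs into one metric D."""
    def D(a, b):
        return sum(w * c(a, b) for w, c in weighted_components)
    return D

# Example from the slide, with Critical omitted: D = 4*Euclidean + 3*Mean
D = make_metric([(4.0, d_euclidean), (3.0, d_mean)])
```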
Steps of LearnMet
1. Guess initial metric. 2. Do clustering. 3. Compare with correct clusters. 4. Adjust and re-execute. 5. Output final metric.
Alternative A: additional domain expert input. Alternative B: no additional input.
Alt A, Step 1: Guess Initial Metric
Domain expert input: the significant distance types, e.g., Euclidean, Mean, Critical. Consider this a guess of the components. If input about relative importance is available, use it to guess the weights; else randomly guess a weight for each component. Thus guess an initial metric, e.g., D = 4*Euclidean + 3*Mean + 5*Critical.
Proposed task: defining a heuristic to guess the weights, considering the relative importance of the components and using fundamental knowledge of distance types in the literature.
Alt A, Step 2: Do Clustering
Use the guessed metric as the "distance" in clustering. Example, if clustering using k-means:
k points are chosen as random cluster centers.
Repeat:
Instances are assigned to the closest cluster center by D = Σ{i=1 to m} wi*ci.
The mean of each cluster is calculated; the means form the new cluster centers.
Until the same points are assigned to each cluster in consecutive iterations.
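The loop above can be sketched as k-means with a pluggable distance, so a guessed or learnt metric D can replace plain Euclidean distance. The data and weights are toy values; this is illustrative, not the WEKA implementation used in the dissertation.

```python
# k-means parameterized by a distance function; stops when the same
# assignment occurs in two consecutive iterations, as in the slide.

def weighted_metric(a, b, weights=(4.0, 3.0)):
    """D = w1*Euclidean + w2*Mean, a guessed metric as in Step 1."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    return (weights[0] * sum(d * d for d in diffs) ** 0.5
            + weights[1] * sum(diffs) / len(diffs))

def kmeans(points, centers, dist):
    labels = None
    while True:
        new_labels = [min(range(len(centers)), key=lambda k: dist(p, centers[k]))
                      for p in points]
        if new_labels == labels:        # same assignment twice in a row
            return labels
        labels = new_labels
        for k in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == k]
            if members:                 # new center = mean of the members
                centers[k] = [sum(c) / len(members) for c in zip(*members)]

points = [[9.0, 7.0], [8.0, 7.0], [3.0, 2.0], [4.0, 2.0]]
labels = kmeans(points, centers=[points[0][:], points[2][:]], dist=weighted_metric)
```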
Alt A, Step 3: Compare With Correct Clusters
Measure the error E between the predicted and actual clusters.
Predicted: the clusters obtained with LearnMet. Actual: the correct clusters in the training set.
E ∝ D(p, a) with this metric, where p denotes predicted and a denotes actual clusters.
In the literature, error functions [WF-00] for n values:
Mean squared error: E = [(p1-a1)^2 + ... + (pn-an)^2] / n
Root mean squared error: E = √{[(p1-a1)^2 + ... + (pn-an)^2] / n}
Mean absolute error: E = [|p1-a1| + ... + |pn-an|] / n
Proposed task: defining a suitable error function, considering error functions in the literature, using domain knowledge and the notion of distance types, and taking into account efficiency and accuracy in learning.
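The three candidate error functions from the slide, computed over paired predicted/actual values (the numbers are toy data):

```python
# Mean squared, root mean squared, and mean absolute error [WF-00].

def mse(p, a):
    return sum((x - y) ** 2 for x, y in zip(p, a)) / len(p)

def rmse(p, a):
    return mse(p, a) ** 0.5

def mae(p, a):
    return sum(abs(x - y) for x, y in zip(p, a)) / len(p)

predicted = [1.0, 2.0, 4.0]
actual    = [1.0, 3.0, 2.0]
```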
Alt A, Step 4: Adjust And Re-Execute
Use the error to adjust the weights of the components for the next iteration, applying the general principle of error back-propagation [RN-95]. Thus make the next guess for the metric, e.g.:
Old: D = 4*Euclidean + 3*Mean + 5*Critical
New: D = 5*Euclidean + 1*Mean + 6*Critical
Use this guessed metric to re-do the clustering. Repeat until the error is minimal OR the maximum number of epochs is reached.
Proposed task: defining a heuristic to distribute the error, considering the impact of the components on the error, using domain knowledge if applicable, and taking into account efficiency and accuracy in learning.
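The slide leaves the error-distribution heuristic as a proposed task, so the update rule below is entirely hypothetical: a simple gradient-style adjustment that raises the weights of components judged to help and lowers the others, scaled by the current error.

```python
# Hypothetical weight-adjustment heuristic (the actual heuristic is a
# proposed task of the dissertation, not this rule).

def adjust_weights(weights, contributions, error, rate=0.1):
    """contributions[i] > 0 means component i pushed the clustering
    toward the correct clusters; < 0 means it pulled it away.
    Weights are clamped at zero."""
    return [max(0.0, w + rate * error * c)
            for w, c in zip(weights, contributions)]

weights = [4.0, 3.0, 5.0]         # Euclidean, Mean, Critical
contributions = [1.0, -2.0, 1.0]  # hypothetical, derived from the error
new_weights = adjust_weights(weights, contributions, error=1.0)
```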
Alt A, Step 5: Output Final Metric
If the error is minimal, then the distance D is likely to give high accuracy in clustering; hence output this D as the learnt distance metric, e.g., D = 3*Euclidean + 2*Mean + 6*Critical. If the maximum number of training epochs is reached, go to Alternative B.
Alternative B
No input about the significant aspects is available. Use Occam's Razor [M-97] to guess the metric: select the simplest hypothesis that fits the data. Example: initially guess only Euclidean distance. Do the clustering and compare with the correct clusters as in Alternative A. To adjust and re-execute:
Pass 1: alter the weights; repeat as in Alternative A until the error is minimal OR the maximum number of epochs is reached.
Pass 2: alter one component at a time; repeat the whole process until the error is minimal OR the maximum number of epochs is reached.
Output the corresponding metric D as the learnt distance metric.
Proposed tasks: determining the simplest hypothesis; outlining a strategy for altering the components.
State-of-the-art
Learning the nearest neighbor in high-dimensional spaces [HAK-00]: given a metric, learn the relative importance of the dimensions.
Distance metric learning given similar pairs of points [XNJR-03]: position-based distances only, using the general formula of the Mahalanobis distance.
Similarity search in multimedia databases [KB-04]: uses various metrics in different applications; does not learn a single distance metric.
Dimensionality Reduction
Each graph has thousands of points, so dimensionality reduction is needed.
Selective Sampling [HK-01], [WF-00]: consider points at regular intervals, e.g., every 10th point, and include all significant aspects.
Fourier Transforms [B-68], [H-03], [AFS-93]: map the data to frequency sinusoids,
Xf = (1/√n) Σ{t=0 to n-1} xt exp(-j2πft/n), where f = 0, 1, ..., n-1 and j = √-1,
and retain the most significant Fourier coefficients.
(Figures: selective sampling; Fourier transforms.)
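Both reduction schemes can be sketched on a toy sampled curve: selective sampling keeps every s-th point, while the Fourier route computes Xf = (1/√n) Σt xt exp(-j2πft/n) and keeps the largest-magnitude coefficients. The curve and the retained-coefficient count are illustrative choices.

```python
# Selective sampling vs. Fourier coefficient retention (toy curve).
import cmath

def selective_sample(points, step=10):
    return points[::step]

def dft(x):
    """Xf = (1/sqrt(n)) * sum_t x_t * exp(-j*2*pi*f*t/n)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            / n ** 0.5 for f in range(n)]

def top_coefficients(x, k):
    """Indices of the k largest-magnitude Fourier coefficients."""
    X = dft(x)
    return sorted(range(len(X)), key=lambda f: -abs(X[f]))[:k]

curve = [5.0, 4.0, 3.0, 2.0, 1.0, 1.0, 1.0, 1.0]
sampled = selective_sample(curve, step=4)  # every 4th point
kept = top_coefficients(curve, k=2)        # DC term dominates a smooth curve
```

For a real-valued curve the spectrum is conjugate-symmetric, so the largest non-DC coefficient appears as a mirrored pair (here at f = 1 and f = n-1).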
Effect of Dimensionality Reduction on Learning the Metric
The learnt distance metric involves components. Issue: how do the components correspond to the reduced space?
Possible solutions:
Option 1: learn the metric in the original space, then map it to the reduced space.
Option 2: reduce the dimensionality, then learn the metric in the reduced space.
Proposed tasks: exploring options 1 and 2 with domain semantics and the properties of the reduction techniques; comparing selective sampling with Fourier transforms given the learnt distance metric.
Task 3: Domain-Type-Dependent Classification
Classification Techniques
Classification techniques [HK-01]: neural networks, decision trees, case-based reasoning.
We use decision tree classification because:
It is eager learning, so it pre-classifies the experiments.
It provides reasons for its decisions, as a basis for representative cases.
(Figure: partial decision tree.)
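The core of decision tree induction is choosing the attribute whose split yields the purest class labels, the idea behind classifiers like J4.8. A one-level sketch on toy data (not the WEKA implementation):

```python
# Information gain for a single split: pick the input condition that
# best separates the cluster IDs (hypothetical conditions and labels).
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Expected entropy drop from splitting on attribute `attr`."""
    total = entropy(labels)
    split = {}
    for row, lab in zip(rows, labels):
        split.setdefault(row[attr], []).append(lab)
    remainder = sum(len(ls) / len(labels) * entropy(ls) for ls in split.values())
    return total - remainder

rows = [{"quenchant": "oil", "agitation": "high"},
        {"quenchant": "oil", "agitation": "low"},
        {"quenchant": "air", "agitation": "high"},
        {"quenchant": "air", "agitation": "low"}]
cluster_ids = [0, 0, 1, 1]          # labels from the clustering step

best = max(rows[0], key=lambda a: info_gain(rows, cluster_ids, a))
```

Here splitting on "quenchant" separates the clusters perfectly, so it becomes the root test; a full learner would recurse on each branch.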
Decision Tree Classification
(Figure: sample partial output of clustering fed to the decision tree classifier; snapshot of the partial decision tree created.)
Selecting Representative Cases
Decision trees form the basis for selecting representative cases. Process: from all paths to a cluster, select one as the representative conditions; from all graphs in the cluster, select any one as the representative graph.
(Figures: sample partial decision tree; sample representative case.)
Need for Designing Representative Cases
Selecting any one representative is not good: it may not incorporate significant aspects of the cluster; e.g., several combinations of input conditions may lead to one graph.
A simple average of the conditions is not good: e.g., given agitation1 = "high" and agitation2 = "low", the common condition agitation = "medium" is not a good representation.
A simple average of the graphs is not good: some features of a graph may be more significant than others.
Issue: designing a good representative case of input conditions and graph to serve as a better classifier.
DesCase: Designing Domain-Specific Representative Cases as Classifiers
Goal: to design, per cluster, a representative case of input conditions and graph that serves as a classifier for new cases.
Assumption: the correct clusters in the domain are given, with the clustering criteria learnt.
Designing a Representative of Conditions
The following alternatives are being explored:
Alternative 1: storing the most frequent combination.
Alternative 2: re-clustering within each cluster.
Alternative 3: retaining all possible combinations, with abstraction.
One or more of them is likely to be used. Domain constraints should be taken into account in each alternative.
Alternative 1: Most Frequent Combination
Frequency denotes likelihood of occurrence. Thus, if two or more paths lead to the same cluster, store the one that resulted from the greater number of experiments. E.g., consider two combinations that lead to cluster E:
E1: Quenchant Name = "Argon", Part Material = "ST4140", Agitation = "Low", Quenchant Temperature = "(20-30]".
E2: Quenchant Name = "Air", Part Material = "SS304", Agitation = "Absent", Quenchant Temperature = "(20-30]".
If E1 resulted from 5 experiments and E2 from 7, then store combination E2.
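Alternative 1 reduces to a frequency-weighted selection over the paths leading to a cluster; the counts and combinations below are the slide's own example, encoded as tuples for brevity.

```python
# Among the condition combinations (tree paths) leading to a cluster,
# keep the one backed by the most experiments.

def most_frequent_combination(paths):
    """paths: {combination: number of experiments that produced it}."""
    return max(paths, key=paths.get)

cluster_E = {
    ("Argon", "ST4140", "Low", "(20-30]"): 5,    # E1
    ("Air", "SS304", "Absent", "(20-30]"): 7,    # E2
}
representative = most_frequent_combination(cluster_E)
```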
Alternative 2: Re-Clustering
Each cluster has many combinations. Re-cluster based on the combinations to form sub-clusters, and for each sub-cluster select a representative. Each original cluster then has fewer, more meaningful combinations. Repeat until a stopping criterion is met.
Proposed task: defining a suitable stopping criterion.
Alternative 3: All with Abstraction
Store the maximum information at minimum cost: use domain knowledge and statistical aspects to abstract the combinations efficiently. E.g., consider two combinations leading to cluster M:
M1: Quenchant Name = "Synabol-2000", Part Material = "ST4140", Quenchant Temperature = "(40-90]".
M2: Quenchant Name = "Bio-Quench", Part Material = "ST4140", Quenchant Temperature = "(90-140]".
Domain knowledge: Synabol-2000 and Bio-Quench are both "Bio-Oils". Thus abstract these into one combination: Quenchant Category = "Bio-Oil", Part Material = "ST4140", Quenchant Temperature = "( ]".
Designing the Representative Graph
Graphs in the same cluster: significant aspects come from the learnt distance metric and from other visual features; e.g., of two graphs in the same cluster, the left graph has a shorter region at "T= " and a more wavy appearance.
Dimensionality reduction: how do the significant aspects get affected?
Redrawing a graph: types of averages; pixel-to-point conversion.
Proposed task: outlining a strategy for designing the representative graph.
Dissertation Contributions
AutoDomainMine basic approach: proposing clustering followed by classification as a learning strategy for estimation. This automates one typical learning method of scientists.
LearnMet technique: learning a domain-specific distance metric for graphs under the effect of dimensionality reduction. This improves clustering accuracy.
DesCase strategy: designing representative cases of input conditions and graph, incorporating domain semantics. This provides better classifiers.
System Development
Development of AutoDomainMine in three stages:
Stage 1: Integrating clustering and classification.
Stage 2: Clustering with LearnMet.
Stage 3: Classification with DesCase.
Implementation details:
Java for the majority of the coding, with JSP for the user interfaces.
WEKA for clustering (k-means) and decision tree classification (J4.8).
Maple and Matlab for dimensionality reduction, etc.
Pixel-to-point conversion tools for graphs stored as bitmaps.
MySQL for database development.
Evaluation Plan
AutoDomainMine experiments:
Stage 1: to judge the effectiveness of the learning strategy.
Stage 2: to evaluate the accuracy of clustering with the domain-specific distance metric.
Stage 3: to assess the whole AutoDomainMine approach with LearnMet and DesCase.
Criteria for comparison:
Real laboratory experiments in heat treating.
Additional experiments with data from other domains in online machine learning repositories.
Schedule for Completion
AutoDomainMine learning strategy: August 2003 to February 2004.
Clustering with LearnMet + proposal + comprehensive exam: March 2004 to December 2004.
Classification with DesCase: December 2004 to May 2005.
Dissertation writing + additional experiments: May 2005 onwards.
Expected date of graduation: August 2005.
Demo of Pilot Tool
This incorporates the basic learning strategy of AutoDomainMine; it does not include LearnMet and DesCase. Clustering is done with Euclidean distance, and any one representative case is selected as a classifier.
Demo of Pilot Tool:
Example 1: Satisfactory.
Example 2: Incorrect Estimation.
Example 3: Can be Improved.
(Screenshots of the pilot tool for each example.)
Thank You