
1 Automating Domain-Type-Dependent Data Mining as a Computational Estimation Technique for Decision Support in Materials Science
Ph.D. Dissertation Proposal
Aparna S. Varde, Department of Computer Science, Worcester Polytechnic Institute, September 2004
Dissertation Committee: Prof. Elke A. Rundensteiner (Advisor), Prof. David C. Brown, Prof. Carolina Ruiz, Prof. Neil T. Heffernan, Prof. Richard D. Sisson Jr., CHTE (External Member)
12/26/2018

2 Outline
Introduction
Problem Definition
State-of-the-art
Proposed Approach and Tasks
  1: Integrating Clustering and Classification
  2: Domain-Type-Dependent Clustering
  3: Domain-Type-Dependent Classification
Summarizing Dissertation Contributions
Schedule for Completion

3 Introduction
Experimental results in science and engineering are often plotted as graphs.
Graphs represent the functional behavior of process parameters.
They are used in analysis to assist decision-making.
Lab experiments consume time and resources.
It is therefore desirable to estimate graphs given the input conditions.

4 Science and Engineering Experiments
Consider the domain "Heat Treating of Materials": controlled heating and cooling of materials to achieve desired mechanical and thermal properties.
Equipment cost: $1000s. Recurrent costs: $100s.
Results are plotted as graphs.
(Figure: CHTE Experimental Setup)

5 Graphs Plotted from Experiments
Heat Transfer Coefficient (hc) Curve: plot of the heat transfer coefficient (hc) vs. part temperature.
hc: heat extraction capacity, determined by density, cooling rate, etc.
Domain semantics appear on the hc curve:
LF: Leidenfrost point
BP: Boiling point of quenchant
MAX: Maximum heat transfer
(Figure: Heat Transfer Coefficient Curve with domain semantics)

6 Problem Definition
To develop an estimation technique for decision support. Goals:
1. Given the input conditions in an experiment, estimate the resulting graph.
2. Given the desired graph in an experiment, estimate the conditions to obtain it.

7 Problem Definition (Contd.)
Assumptions:
The graph is a 2-D plot of one dependent vs. one independent variable.
There exists a correlation between the input conditions and the results.
Details of performed experiments are stored in a database.
Evaluation Criteria:
Domain expert intervention should not be needed during estimation.
Estimation accuracy should be satisfactory for decision support.
Estimation should be much faster than a real lab experiment.

8 State-of-the-art
Naïve Similarity Searching [HK-01]: non-matching conditions could be significant in the domain.
Weighted Similarity Searching [WF-00], [M-97]: precise weights denoting relative importance are not known a priori.
Case-based Reasoning [AP-03], [K-88], [AV-01], [S-88]:
  Regular CBR involves human intervention.
  Exemplar reasoning: same problem as naïve search.
  Instance-based reasoning: same problem as weighted search.
Mathematical Modeling [M-95], [S-60], [PG-60]: in some cases the variables in existing models are not known; in some cases the precise equations are not known.

9 Proposed Approach: AutoDomainMine
Knowledge Discovery:
Cluster existing experiments based on their graphs, incorporating domain semantics.
Use classification to learn the clustering criteria and build domain-specific representative cases.
Estimation:
Use the learnt criteria and representative cases to estimate new cases.

10 Knowledge Discovery in AutoDomainMine

11 Estimation in AutoDomainMine

12 Proposed Tasks
1. Integrating Clustering and Classification.
2. Domain-Type-Dependent Clustering.
3. Domain-Type-Dependent Classification.

13 Task 1: Integrating Clustering and Classification

14 Learning Analogous to Scientists
(Diagram labels: Grouping, Reasoning, Cause of Similarity, Input Conditions. Source: CHTE)

15 Motivating Example
The following facts were learnt by materials scientists from the results of experiments:
A thin oxide layer around a part causes the vapor blanket around the part to break, resulting in faster cooling.
A thick oxide layer around a part acts as an insulator, resulting in slower cooling.
This learning was done by:
performing lab experiments with the given input conditions,
grouping based on the resulting graphs,
reasoning based on the input conditions.
Reference: CHTE, WPI.

16 Clustering
Definition: the process of placing a set of physical or abstract objects into groups of similar objects [HK-01].
Unsupervised Learning: learning by observation; no pre-defined labels.
Analogous to: a scientist grouping similar graphs by observation.

17 Classification
Definition: a form of data analysis that can be used to extract models to predict categorical labels [HK-01].
Supervised Learning: class labels of training samples are provided; the classification target is pre-defined.
Analogous to: a scientist reasoning about what combination of input conditions characterizes a group, after the groups are formed.

18 Clustering followed by Classification
Clustering is useful to group the graphs.
Before clustering, there are no labels; after clustering, labels are available as cluster IDs.
The cluster IDs form the classification target.

19 Clustering followed by Classification (Contd.)
Classification is used to identify the combinations of input conditions that characterize each cluster.
(Diagram labels: Input Conditions, Classification, Cluster ID, Output of Clustering)
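The two-step pipeline above can be sketched in miniature. Everything here is an illustrative assumption (toy condition names, toy graphs, seed-based clustering), not the actual CHTE data or the WEKA-based implementation:

```python
# Step 1: cluster experiments by graph similarity; Step 2: use the resulting
# cluster IDs as class labels and read off the characterizing conditions.

# Each experiment: input conditions + a result "graph" (a list of y-values).
experiments = [
    {"cond": {"agitation": "high", "oxide": "thin"},  "graph": [9.0, 7.0, 5.0]},
    {"cond": {"agitation": "high", "oxide": "thin"},  "graph": [9.1, 7.2, 5.1]},
    {"cond": {"agitation": "low",  "oxide": "thick"}, "graph": [3.0, 2.5, 2.0]},
    {"cond": {"agitation": "low",  "oxide": "thick"}, "graph": [3.1, 2.4, 2.1]},
]

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Step 1: cluster by graph similarity (here simply: nearest of two seed graphs).
seeds = [experiments[0]["graph"], experiments[2]["graph"]]
for e in experiments:
    e["cluster"] = min(range(2), key=lambda k: euclidean(e["graph"], seeds[k]))

# Step 2: cluster IDs become class labels; each cluster ID now labels the
# combination(s) of input conditions that characterize it.
profile = {}
for e in experiments:
    profile.setdefault(e["cluster"], set()).add(tuple(sorted(e["cond"].items())))
```

Each cluster in `profile` ends up characterized by a single condition combination, which is exactly the target the classifier learns.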

20 Reason for Clustering Graphs
Possible alternative: cluster the input conditions, and learn the clustering criteria from them.
Problem with this alternative: the clustering technique attaches the same weight to all conditions, which adversely affects accuracy.
This cannot be corrected by introducing relative weights, since the weights are not known in advance; they need to be learnt.
We propose to learn from the results (graphs) of the experiments.

21 State-of-the-art
Integrating rule-based and case-based reasoning [PC-97]: the combined approach is better than either one individually in some domains, e.g., medicine and law.
Integrating classification and association rule mining [LHM-98]: discover associations between items and use them to mine classification rules; derive rules with confidence and support above a threshold such that they classify a target.
Neural networks can be used for clustering and also for classification [HK-01], [NIPS-98].

22 Task 2: Domain-Type-Dependent Clustering

23 Clustering Graphs
Clustering algorithms: k-means, EM, etc. [HK-01].
Default notion of similarity: Euclidean distance.
Problem: the graphs below are placed in the same cluster, but should be in different clusters as per the domain.
Other distance metrics exist in the literature, but it is not known which one(s) work best in a given domain.
Domain semantics may be available only in subjective form.
Issue: learning a domain-specific distance metric.

24 Categories of Distance Metrics
In the literature:
Position-based: e.g., Euclidean distance, Manhattan distance [HK-01].
Statistical: e.g., mean distance, max distance [PNC-99].
Shape-based: e.g., tri-plots [TTPF-01].
Others: e.g., order-based in DNA data [KB-04].
(Figure: statistical example, DMax(A,B) between Graph A and Graph B)

25 Categories of Distance Metrics (Contd.)
Defined by us:
Critical Distance: the distance between domain-specific critical regions in two objects in a given domain.
(Figure: critical distance example, Dcritical(A,B) between Graph A and Graph B)

26 LearnMet: Learning a Domain-Specific Distance Metric for Graphs
Goal: to learn a domain-specific distance metric for accurately clustering graphs.
Assumptions:
Training set: correct clusters of graphs in the domain.
Possible additional input from domain experts, e.g., the relative importance of distance types, stated subjectively.

27 General Definition of Distance Metric
The distance metric is defined in terms of weights * components.
Components: position-based, statistical, etc.
Weights: values denoting relative importance.
Formula: the distance D is defined as
D = w1*c1 + w2*c2 + … + wm*cm, i.e., D = Σ{i=1 to m} wi*ci
Example: D = 4*Euclidean + 3*Mean + 5*Critical
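As a minimal sketch, the weighted sum can be evaluated directly. The three component distances below are illustrative stand-ins; in particular, using the graphs' maxima as the "critical point" is an assumption, not the domain definition:

```python
# Evaluate D = sum_i w_i * c_i for two toy graphs (lists of y-values).

def euclidean_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def mean_dist(a, b):
    return abs(sum(a) / len(a) - sum(b) / len(b))

def critical_dist(a, b):
    # Assumed placeholder: distance between the maxima, standing in for
    # domain-specific critical points such as the Leidenfrost point.
    return abs(max(a) - max(b))

def metric(weights, components, a, b):
    # The general form D = sum over i of w_i * c_i(a, b).
    return sum(w * c(a, b) for w, c in zip(weights, components))

graph_a, graph_b = [1.0, 2.0, 3.0], [1.0, 2.5, 2.5]
D = metric([4, 3, 5], [euclidean_dist, mean_dist, critical_dist],
           graph_a, graph_b)
```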

28 Steps of LearnMet
1. Guess an initial metric.
2. Do clustering.
3. Compare with the correct clusters.
4. Adjust and re-execute.
5. Output the final metric.
Alternative A: additional domain expert input. Alternative B: no additional input.

29 Alt A, Step 1: Guess Initial Metric
Domain expert input: the significant distance types, e.g., Euclidean, Mean, Critical. Consider these the guess of components.
If input about relative importance is available, then use it to guess the weights; else randomly guess a weight for each component.
Thus guess an initial metric, e.g., D = 4*Euclidean + 3*Mean + 5*Critical.
Proposed task: defining a heuristic to guess the weights, considering the relative importance of components and using fundamental knowledge of distance types in the literature.

30 Alt A, Step 2: Do Clustering
Use the guessed metric as the "distance" in clustering. Example, if clustering using k-means:
k points are chosen as random cluster centers.
Repeat:
  Instances are assigned to the closest cluster center by D = Σ{i=1 to m} wi*ci.
  The mean of each cluster is calculated; the means form the new cluster centers.
Until the same points are assigned to each cluster in consecutive iterations.
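The k-means loop above, with the learnt metric plugged in as the distance, can be sketched as follows. The toy 2-D points and plain Euclidean `dist` are assumptions standing in for reduced graph representations and the learnt D:

```python
# k-means where "dist" is the pluggable distance (stand-in for the learnt
# metric D = sum_i w_i * c_i).

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def kmeans(points, centers):
    while True:
        # Assign each point to its closest cluster center under "dist".
        assign = [min(range(len(centers)), key=lambda k: dist(p, centers[k]))
                  for p in points]
        # Recompute each center as the mean of its cluster's members.
        new = []
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assign) if a == k]
            new.append([sum(col) / len(col) for col in zip(*members)]
                       if members else centers[k])
        if new == centers:  # centers unchanged between iterations: converged
            return assign, centers
        centers = new

points = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]]
assign, centers = kmeans(points, [[0.0, 0.0], [5.0, 5.0]])
```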

31 Alt A, Step 3: Compare With Correct Clusters
Measure the error E between the predicted and actual clusters.
Predicted: the clusters obtained with LearnMet. Actual: the correct clusters in the training set.
E ∝ D(p,a) under this metric, where p denotes predicted and a denotes actual clusters.
Error functions in the literature [WF-00], for n values:
Mean squared error: E = [ (p1-a1)^2 + … + (pn-an)^2 ] / n
Root mean squared error: E = √{ [ (p1-a1)^2 + … + (pn-an)^2 ] / n }
Mean absolute error: E = [ |p1-a1| + … + |pn-an| ] / n
Proposed task: defining a suitable error function, considering error functions in the literature, using domain knowledge and the notion of distance types, and taking into account efficiency and accuracy in learning.
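The three error functions quoted from the literature, written out on toy predicted/actual values:

```python
# Mean squared, root mean squared, and mean absolute error over n values.

def mse(p, a):
    return sum((x - y) ** 2 for x, y in zip(p, a)) / len(p)

def rmse(p, a):
    return mse(p, a) ** 0.5

def mae(p, a):
    return sum(abs(x - y) for x, y in zip(p, a)) / len(p)

predicted, actual = [2.0, 4.0, 6.0], [1.0, 4.0, 8.0]
# mse = (1 + 0 + 4) / 3, rmse = sqrt(5/3), mae = (1 + 0 + 2) / 3
```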

32 Alt A, Step 4: Adjust And Re-Execute
Use the error to adjust the weights of the components for the next iteration, applying the general principle of error back-propagation [RN-95].
Thus make the next guess for the metric, e.g.:
Old: D = 4*Euclidean + 3*Mean + 5*Critical
New: D = 5*Euclidean + 1*Mean + 6*Critical
Use this guessed metric to re-do the clustering. Repeat until the error is at a minimum OR the maximum number of epochs is reached.
Proposed task: defining a heuristic to distribute the error, considering the impact of each component on the error, using domain knowledge where applicable, and taking into account efficiency and accuracy in learning.
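One possible weight-adjustment step, in the spirit of error back-propagation: each component's weight is nudged in proportion to its share of the error. This update rule, the learning rate, and all numbers are assumptions for illustration, not the dissertation's heuristic (defining that heuristic is the proposed task):

```python
# Nudge each weight down in proportion to its component's contribution
# to the error; clamp at zero so weights stay non-negative.

def adjust(weights, contributions, error, lr=0.5):
    total = sum(contributions)
    return [max(0.0, w - lr * error * (c / total))
            for w, c in zip(weights, contributions)]

old = [4.0, 3.0, 5.0]                  # Euclidean, Mean, Critical weights
contrib = [2.0, 1.0, 1.0]              # per-component contribution to D(p, a)
new = adjust(old, contrib, error=2.0)  # reweighted metric for the next epoch
```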

33 Alt A, Step 5: Output Final Metric
If the error is at a minimum, then the distance D is likely to give high accuracy in clustering; hence output this D as the learnt distance metric.
Example: D = 3*Euclidean + 2*Mean + 6*Critical
If the maximum number of training epochs is reached instead, go to Alternative B.

34 Alternative B
No input about significant aspects is available.
Use Occam's Razor [M-97] to guess the metric: select the simplest hypothesis that fits the data. Example: initially guess only the Euclidean distance.
Do clustering and compare with the correct clusters as in Alternative A. To adjust and re-execute:
Pass 1: alter the weights; repeat as in Alternative A until the error is at a minimum OR the maximum number of epochs is reached.
Pass 2: alter one component at a time; repeat the whole process until the error is at a minimum OR the maximum number of epochs is reached.
Output the corresponding metric D as the learnt distance metric.
Proposed tasks: determining the simplest hypothesis; outlining a strategy for altering components.

35 State-of-the-art
Learning nearest neighbors in high-dimensional spaces [HAK-00]: given a metric, learn the relative importance of the dimensions.
Distance metric learning given similar pairs of points [XNJR-03]: position-based distances only, using the general formula of the Mahalanobis distance.
Similarity search in multimedia databases [KB-04]: uses various metrics in different applications; does not learn a single distance metric.

36 Dimensionality Reduction
Each graph has thousands of points, so dimensionality reduction is needed.
Selective Sampling [HK-01], [WF-00]: consider points at regular intervals, e.g., every 10th point, while including all significant aspects.
Fourier Transforms [B-68], [H-03], [AFS-93]: map the data to frequency sinusoids,
Xf = (1/√n) Σ{t = 0 to n-1} xt * exp(-j2πft/n), where f = 0, 1, …, (n-1) and j = √-1,
and retain the most significant Fourier coefficients.

37 Effect of Dimensionality Reduction in Learning the Metric
The learnt distance metric involves components. Issue: how do the components correspond to the reduced space?
Possible solutions:
Option 1: learn the metric in the original space, then map it to the reduced space.
Option 2: reduce the dimensionality, then learn the metric in the reduced space.
Proposed tasks: exploring options 1 and 2 with domain semantics and the properties of the reduction techniques; comparing selective sampling with Fourier transforms given the learnt distance metric.

38 Task 3: Domain-Type-Dependent Classification

39 Classification Techniques
Classification techniques [HK-01]: neural networks, decision trees, case-based reasoning.
We use decision tree classification because:
it is eager learning, so it pre-classifies the experiments;
it provides reasons for its decisions, which serve as the basis for representative cases.
(Figure: partial decision tree)

40 Decision Tree Classification
(Figures: sample partial output of clustering; decision tree classifier; snapshot of the partial decision tree created)

41 Selecting Representative Cases
Decision trees are the basis for selecting representative cases. Process:
From all paths to a cluster, select one as the representative conditions.
From all graphs in the cluster, select any one as the representative graph.
(Figures: sample partial decision tree; sample representative case)

42 Need for Designing Representative Cases
Selecting any one representative is not good: it may not incorporate significant aspects of the cluster; e.g., several combinations of input conditions may lead to one graph.
A simple average of the conditions is not good: e.g., for agitation1 = "high" and agitation2 = "low", the averaged condition agitation = "medium" is not a good representation.
A simple average of the graphs is not good: some features on a graph may be more significant than others.
Issue: designing a good representative case of input conditions and graph to serve as a better classifier.

43 DesCase: Designing Domain-Specific Representative Cases as Classifiers
Goal: to design, per cluster, a representative case of input conditions and graph that serves as a classifier for new cases.
Assumption: the correct clusters in the domain are given, with the clustering criteria learnt.

44 Designing a Representative of Conditions
The following alternatives are being explored:
Alternative 1: storing the most frequent combination.
Alternative 2: re-clustering within each cluster.
Alternative 3: retaining all possible combinations, with abstraction.
One or more of them are likely to be used. Domain constraints should be taken into account in each alternative.

45 Alternative 1: Most Frequent Combination
Frequency denotes likelihood of occurrence. Thus, if two or more paths lead to the same cluster, store the one that resulted from the greater number of experiments.
E.g., consider two combinations that lead to cluster E:
E1: Quenchant Name = "Argon", Part Material = "ST4140", Agitation = "Low", Quenchant Temperature = "(20 – 30]".
E2: Quenchant Name = "Air", Part Material = "SS304", Agitation = "Absent", Quenchant Temperature = "(20 – 30]".
If E1 resulted from 5 experiments and E2 from 7, then store combination E2.
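Alternative 1 amounts to a frequency count over the condition combinations leading to a cluster; a sketch on the slide's E1/E2 example (the per-combination experiment counts are the toy values from the slide):

```python
# Among the combinations leading to cluster E, keep the one backed by the
# most experiments.
from collections import Counter

paths_to_cluster_E = (
    [("Argon", "ST4140", "Low", "(20-30]")] * 5   # combination E1: 5 runs
    + [("Air", "SS304", "Absent", "(20-30]")] * 7  # combination E2: 7 runs
)
representative = Counter(paths_to_cluster_E).most_common(1)[0][0]
# representative is combination E2, since it came from more experiments
```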

46 Alternative 2: Re-Clustering
Each cluster has many combinations. Re-cluster based on the combinations to form sub-clusters, and for each sub-cluster select a representative.
Each original cluster now has fewer and more meaningful combinations. Repeat this until a stopping criterion is met.
Proposed task: defining a suitable stopping criterion.

47 Alternative 3: All with Abstraction
Store the maximum information at the minimum cost, using domain knowledge and statistical aspects to abstract the combinations efficiently.
E.g., consider two combinations leading to cluster M:
M1: Quenchant Name = "Synabol-2000", Part Material = "ST4140", Quenchant Temperature = "(40 – 90]".
M2: Quenchant Name = "Bio-Quench", Part Material = "ST4140", Quenchant Temperature = "(90 – 140]".
Domain knowledge: Synabol-2000 and Bio-Quench are both "Bio-Oils".
Thus abstract these into one combination: Quenchant Category = "Bio-Oil", Part Material = "ST4140", Quenchant Temperature = "(40 – 140]".

48 Designing the Representative Graph
Graphs in the same cluster: significant aspects come from the learnt distance metric, plus other visual features; e.g., of two graphs in the same cluster, the left graph has a shorter region at "T= " and a more wavy appearance.
Dimensionality reduction: how do the significant aspects get affected?
Redrawing a graph: types of averages; pixel-to-point conversion.
Proposed task: outlining a strategy for designing the representative graph.

49 Dissertation Contributions
AutoDomainMine basic approach: proposing clustering followed by classification as a learning strategy for estimation. This automates one typical learning method of scientists.
LearnMet technique: learning a domain-specific distance metric for graphs under the effect of dimensionality reduction. This improves clustering accuracy.
DesCase strategy: designing representative cases of input conditions and graphs, incorporating domain semantics. This provides better classifiers.

50 System Development
Development of AutoDomainMine in three stages:
Stage 1: integrating clustering and classification.
Stage 2: clustering with LearnMet.
Stage 3: classification with DesCase.
Implementation details:
Java for the majority of the coding, with JSP for the user interfaces.
WEKA for clustering (k-means) and decision tree classification (J4.8).
Maple and Matlab for dimensionality reduction, etc.
Pixel-to-point conversion tools for graphs stored as bitmaps.
MySQL for database development.

51 Evaluation Plan
AutoDomainMine experiments:
Stage 1: to judge the effectiveness of the learning strategy.
Stage 2: to evaluate the accuracy of clustering with the domain-specific distance metric.
Stage 3: to assess the whole AutoDomainMine approach with LearnMet and DesCase.
Criteria for comparison:
Real laboratory experiments in heat treating.
Additional experiments with data from other domains in online machine learning repositories.

52 Schedule for Completion
AutoDomainMine learning strategy: August 2003 to February 2004.
Clustering with LearnMet + Proposal + Comprehensive Exam: March 2004 to December 2004.
Classification with DesCase: December 2004 to May 2005.
Dissertation writing + additional experiments: May 2005 onwards.
Expected date of graduation: August 2005.

53 Demo of Pilot Tool
The pilot tool incorporates the basic learning strategy in AutoDomainMine; it does not include LearnMet and DesCase.
Clustering is done with the Euclidean distance, and any one representative case is selected as a classifier.

54–59 Demo of Pilot Tool (screenshots)
Example 1: Satisfactory.
Example 2: Incorrect Estimation.
Example 3: Can be Improved.

60 Thank You

