Clustering and Term Project
Plan for this week
Term Project Questions?
  Examples:
    Research problems in Data Mining
    Industry problems in Data Mining
    Explore new data with existing/new tools
    Explore data with different process (tools, data selection, preprocessing)
    Focus on solving a problem (application or technical)
Data Exploration Process (time %, importance %) --Dorian Pyle
  Exploring the problem space (10, 15)
  Exploring the solution space (9, 14)
  Specifying the implementation method (1, 51)
    (increases profitability, reduces waste, decreases fraud, or meets X goal)
  Mining the data
    Preparing the data (60, 15)
    Surveying the data (15, 3)
    Modeling the data (5, 2)
Ten Golden Rules for Miners --Dorian Pyle
  1. Select clearly defined problems that will yield tangible benefits.
  2. Specify the required solution.
  3. Define how the solution delivered is going to be used.
  4. Understand as much as possible about the problem and data set (the domain).
  5. Let the problem drive the modeling (tool and data preparation for model building).
Ten Golden Rules for Miners (cont.)
  6. Stipulate assumptions.
  7. Refine the model iteratively.
  8. Make the model as simple as possible.
  9. Define instability in the model (critical areas where a small change in input produces a large change in output).
  10. Define uncertainty in the model (low confidence areas).
Selection of Research Paper for Review
  Algorithm-centered
  Application-centered
  Survey-centered
  Selection Due Mar. 24
Plan of the Week
  Monday (Dunham’s ppt Part II clustering 74-128)
    Similarity and distance measures
    Hierarchical algorithms (single link…)
    Partition algorithms (K-Means, MST,…)
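As a preview of Monday's topics, here is a minimal sketch of one distance measure (Euclidean) and one partition algorithm (K-Means) in plain Python. It is not taken from Dunham's slides. The function names `euclidean` and `k_means`, the random choice of initial centroids, and the stopping rule are all assumptions made for illustration.

```python
import math
import random

def euclidean(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def k_means(points, k, iterations=100, seed=0):
    """Basic K-Means sketch; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Assumption: start from k input points chosen at random.
    centroids = rng.sample(points, k)
    labels = [None] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                mean = tuple(sum(dim) / len(members) for dim in zip(*members))
                new_centroids.append(mean)
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        # Stop once neither the assignments nor the centroids change.
        converged = new_labels == labels and new_centroids == centroids
        labels, centroids = new_labels, new_centroids
        if converged:
            break
    return centroids, labels

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = k_means(points, k=2)
```

On this toy data the two well-separated groups end up in different clusters. Which group receives label 0 depends on the random initialization.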
Plan of the Week (cont.)
  Wednesday (Witten’s book 218-224, pdf 94-104; Dunham’s book 47-51)
    Statistical based clustering (EM algorithm)
    Case study: a data mining application using Cubist
    Term Project: directions and discussion
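For Wednesday's statistical clustering topic, the sketch below shows the general shape of the EM algorithm on a mixture of two one-dimensional Gaussians. It is an illustration under stated assumptions, not the presentation in Witten's or Dunham's book. The names `gaussian_pdf` and `em_two_gaussians`, the initial guesses, the fixed iteration count, and the variance floor are all hypothetical choices.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(data, iterations=50):
    """EM sketch for a two-component 1-D Gaussian mixture."""
    n = len(data)
    # Assumed initial guess: means at the extremes, shared variance, equal weights.
    mean_a, mean_b = min(data), max(data)
    overall = sum(data) / n
    var_a = var_b = sum((x - overall) ** 2 for x in data) / n
    weight_a = 0.5
    for _ in range(iterations):
        # E-step: probability that each point belongs to component A.
        resp_a = []
        for x in data:
            pa = weight_a * gaussian_pdf(x, mean_a, var_a)
            pb = (1 - weight_a) * gaussian_pdf(x, mean_b, var_b)
            resp_a.append(pa / (pa + pb))
        # M-step: re-estimate parameters from the weighted points.
        total_a = sum(resp_a)
        total_b = n - total_a
        mean_a = sum(r * x for r, x in zip(resp_a, data)) / total_a
        mean_b = sum((1 - r) * x for r, x in zip(resp_a, data)) / total_b
        # The small floor keeps a variance from collapsing to zero.
        var_a = max(sum(r * (x - mean_a) ** 2 for r, x in zip(resp_a, data)) / total_a, 1e-6)
        var_b = max(sum((1 - r) * (x - mean_b) ** 2 for r, x in zip(resp_a, data)) / total_b, 1e-6)
        weight_a = total_a / n
    return {"mean_a": mean_a, "mean_b": mean_b,
            "var_a": var_a, "var_b": var_b, "weight_a": weight_a}

data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
params = em_two_gaussians(data)
```

Unlike K-Means, each point here receives a probability of membership in each cluster rather than a hard assignment.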