Data Mining (and Machine Learning) With Microsoft Tools
Michael Lisin, Plaster Group
May 8, 2014
Why Reinvent a Toilet?
Definitions
Concept: Definition / Solution For
- Data Mining: Algorithms to discover unknown data patterns
- Machine Learning: Algorithms to predict based on data patterns
- Statistics: Branch of mathematics, methods of data collection and interpretation
- Data Science: All of the above + Data Visualization
What Do You Think?
Is Linear Regression:
- Data Mining
- Machine Learning
- Statistics
- All of the above
Linear Regression is a straight line describing how variable Y responds to changes in variable X.
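To make the straight-line definition concrete, here is a minimal sketch of an ordinary least-squares fit in plain Python. The function names `fit_line` and `predict` and the sample numbers are illustrative assumptions, not part of any Microsoft tool.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of X around its mean, and co-variation of X and Y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Predict Y for a new X using the fitted line."""
    return intercept + slope * x


if __name__ == "__main__":
    # Hypothetical data: Y responds to X as Y = 1 + 2 * X.
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [3.0, 5.0, 7.0, 9.0]
    slope, intercept = fit_line(xs, ys)
    print(slope, intercept, predict(slope, intercept, 5.0))
```

The slope says how much Y changes per unit change in X, which is the "responds to changes" part of the definition.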
MS DM Environment
- SQL Server
- Excel Data Mining Add-Ins (optional, recommended)
- Interact with: Excel (add-ins), SQL Server Management Studio, SQL Server Data Tools (SSDT), Custom Code
Component: Capability, by SQL edition (Enterprise, BI, Standard)
- SSIS: Text Mining
- SSAS: DM basic
- SSAS: DM advanced (CV, prediction queries, …)
- SSDT
- Custom Code
Start With a Question
Many Potential Questions
MS DM Capabilities
Generic question: What are the data patterns?
Best if more specific and directed at a problem, for example:
- How do we combine our products to increase profits?
- How do we predict the demand for a product / service?
- Why are customers buying from us?
- Where can we best cut costs?
- What are the opportunities to reduce risks?
- Who are our best customers?
- …
Approach
1. Define problem / questions
2. Prepare data
3. Build model
4. Validate model
5. Implement predictions
6. Automate model refresh
7. Extend / custom applications
(Later steps are more technical.)
SQL DM Algorithms Summary
Categories: Discrete, Continuous, Sequence, Common Group, Similar Group, TXT Semantic
- Decision Trees [Classify, Estimate]
- Linear Regression [Advanced]
- Time Series [Forecast (T), Forecast]
- Clustering [Detect Categories (T), Except, Cluster]
- Sequence Clustering [Advanced]
- Neural Network [Advanced]
- Logistic Regression [Fill From Sample (T), Scenario Analysis (T), Prediction Calculator (T)]
- Association Rules [Shopping Basket (T), Associate]
- Naive Bayes [Analyze Key Influencers (T)]
- Text Mining (matching, grouping, extracting)
Predict Using Models
DMX = Data Mining Extensions, used to query models for predictions
DMX Query:
    SELECT
        Model.[Bike Buyer],
        PredictProbability(Model.[Bike Buyer]),
        NewData.
    FROM [Model]
    NATURAL PREDICTION JOIN
        (SELECT Age, [Commute Distance], FROM … ) AS NewData
Output: …
Demo
Questions
Appendix
SQL Server Data Mining Algorithms
- Decision Tree
- Linear Regression
- Clustering
- Sequence Clustering
- Association
- Naive Bayes
- Neural Network
- Time Series
- Text Mining: Fuzzy Grouping, Term Extraction, Term Lookup
Key SQL Server Algorithms - 1
- Decision Tree makes predictions based on the relationships between input columns in a dataset. Each input's values are examined for how strongly they tend toward a particular outcome, and the tree predicts based on that tendency. Example: predict which customers are likely to be satisfied with a company, based on some input variables (# purchases, avg. transaction size).
- Linear Regression is a variation of the Decision Trees algorithm that calculates a linear relationship between a dependent and an independent variable, and then uses that relationship for prediction. It is best suited to predicting a continuous attribute. Examples: product demand, price, site visitors.
- Clustering is a segmentation algorithm that uses iterative techniques to group cases in a dataset into clusters with similar characteristics. These groupings are useful for exploring data, identifying anomalies in the data, and segmenting customers.
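The "iterative techniques" behind clustering can be sketched with a small k-means loop in plain Python. This is an illustration of the general idea only, not the SQL Server implementation; the function names, the choice of the first k points as starting centroids, and the sample data are all assumptions.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=10):
    """Group points into k clusters by alternating assignment and update steps."""
    # Assumption: seed the centroids with the first k points (deterministic).
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignment


if __name__ == "__main__":
    # Hypothetical customers described by two numeric attributes.
    customers = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 9), (8.5, 9.5)]
    centroids, assignment = kmeans(customers, k=2)
    print(assignment)
    print(centroids)
```

A case that sits far from every centroid is a candidate anomaly, which is how the same grouping supports both segmentation and anomaly detection.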
Key SQL Server Algorithms - 2
- Sequence Clustering is similar to the Clustering algorithm; however, instead of finding clusters of cases that contain similar attributes, it finds clusters of cases that contain similar paths in a sequence. It is used to explore data that contains events that can be linked by following paths, or sequences. Examples: the click paths created when users navigate a Web site; the order in which a user follows a process.
- Association is useful for recommending products to customers (recommendation engine) based on items they have already bought, or in which they have indicated an interest. Example: market basket analysis.
- Naive Bayes is a classification algorithm. It uses Bayes' theorem but does not take into account dependencies that may exist between inputs, so its assumptions are said to be naive. It can be used for initial exploration of data; the results can later be applied to create additional mining models with other, more computationally intensive and more accurate algorithms. Example: send mailers only to those customers who are likely to respond.
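Market basket analysis can be illustrated by counting how often items appear together. The sketch below derives simple "if A then B" rules for item pairs with their support and confidence; it shows the general idea only, not the Microsoft Association Rules algorithm, and the function name, threshold, and baskets are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support=0.3):
    """Return (antecedent, consequent, support, confidence) for item pairs."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # ignore duplicates within one basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        if support < min_support:
            continue
        # Confidence: of the baskets with the antecedent, the share
        # that also contain the consequent.
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    # Strongest rules first; names break ties for a stable order.
    return sorted(rules, key=lambda r: (-r[3], r[0], r[1]))


if __name__ == "__main__":
    # Hypothetical purchase history.
    baskets = [
        ["bike", "helmet"],
        ["bike", "helmet", "bottle"],
        ["bike", "bottle"],
        ["helmet"],
        ["bike", "helmet"],
    ]
    for rule in pair_rules(baskets):
        print(rule)
```

A recommendation engine would look up the rules whose antecedent is already in the customer's basket and suggest the consequents with the highest confidence.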
Key SQL Server Algorithms - 3
- Neural Network combines each possible state of the input attribute with each possible state of the predictable attribute, and uses the data to calculate probabilities. It is useful for analyzing complex input data, such as from a manufacturing or commercial process, or for business problems where a significant quantity of data is available but rules cannot easily be derived by using other algorithms.
- Time Series provides regression algorithms that are optimized for forecasting continuous values, such as product sales, over time. Whereas other Microsoft algorithms, such as decision trees, require additional columns of new information as input to predict a trend, a time series model does not.
- Text Mining analyzes unstructured text data, such as a "comments" section on a customer satisfaction survey. This algorithm is available in SQL Server Integration Services.
SQL Text Mining
- Term Extraction Transformation
  - Creates (extracts) a list of terms discovered in the source
  - Writes the terms (+ score) to a transformation output column
  - Limitations: English only; nouns or noun phrases only
- Term Lookup Transformation
  - Matches terms extracted from text in an input with terms in a reference table
  - Counts the number of times a term in the lookup table occurs in the input data set, and writes the count together with the term from the reference table to columns in the transformation output
- Fuzzy Grouping Transformation
  - Selects a canonical row and identifies fuzzy (to exact) text fragment matches
  - Output: UID, Group ID, Similarity Score 0..1
- Supplemental: Sampling (training and test sets, uniform representation)
  - Row (Quantity) Sampling Transformation
  - Percentage Sampling Transformation
  - Sort Transformation
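The counting behavior described for Term Lookup can be sketched in a few lines of Python. This is only an illustration of the idea, not the SSIS transformation itself; the function name, the case-insensitive whole-word matching, and the sample comments are assumptions.

```python
import re
from collections import Counter


def term_lookup(texts, reference_terms):
    """Count how often each reference term occurs across the input texts."""
    counts = Counter()
    for text in texts:
        lowered = text.lower()
        for term in reference_terms:
            # Assumption: match whole words only, ignoring case.
            pattern = r"\b" + re.escape(term.lower()) + r"\b"
            counts[term] += len(re.findall(pattern, lowered))
    # Emit (term, count) pairs only for terms that were actually found.
    return [(term, counts[term]) for term in reference_terms if counts[term] > 0]


if __name__ == "__main__":
    # Hypothetical survey comments and reference table.
    comments = [
        "The bike arrived late.",
        "Great bike, slow delivery.",
        "Delivery was fast.",
    ]
    reference = ["bike", "delivery", "refund"]
    print(term_lookup(comments, reference))
```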
Useful Terms