Published by Donald Nichols. Modified over 6 years ago.
1
Business Intelligence for a Tough Economy: Data Mining
Derek Comingore, Senior Architect
Welcome. Thank you for joining us today.
2
Where you can find me…
Prior Wrox Author / TE for SQL Topics
PASS Participation
Blog:
LinkedIn: "MSFT BI Professionals" Group Founder
SQL Server Magazine Author
TechNet Magazine Author (Upcoming)
1 of 4 Microsoft SQL Server 2008 R2 Expert Content Providers (ECPs)
3
BI for a Tough Economy
Finance & Cost Management
  Identify Cost Cutting Opportunities
Operational Improvement
  Free cash from operations
Strengthen Sales & Marketing
  Increase your customer base
Elegant Enterprise Enablement
4
Agenda
Introduction
The Data Mining Development Process with SQL Server 2008
SQL Server 2008 Data Mining Algorithms
Excel 2007 Data Mining Add-in
Summary
Reference
Q&A

Here is the agenda for this presentation. First, as the introduction, we are going to look at a brief definition of data mining and talk about the current Microsoft stack for data mining. After the introduction, we will dive into an overview of all the native data mining algorithms in SQL Server. Next, we are going to cover the data mining development process with BIDS. Then, we will have two demonstrations in which we apply some mining algorithms and show the results through the Excel data mining add-ins. Finally, we will recap what we have talked about and hold a Q&A session.
5
Introduction
What is data mining?
Current Microsoft Data Mining Stack
  Analysis Services 2008
  Excel 2007 Data Mining add-in
  Integration Services 2008
  Reporting Applications with ADOMD & DMX
  Administration Applications with AMO

According to Wikipedia, "Data mining is the process of extracting hidden patterns from data. As more data is gathered, with the amount of data doubling every three years,[1] data mining is becoming an increasingly important tool to transform this data into information. It is commonly used in a wide range of profiling practices, such as marketing, surveillance, fraud detection and scientific discovery."

So data mining means extracting hidden patterns, discovering business trends, or retrieving valuable information hidden in raw data. This information could be a sales prediction, a customer purchasing pattern, business behavior, and so on.

Microsoft has been in this market for a long time, with SQL Server 2000 and 2005. Now, with SQL Server 2008, it offers a much more complete tool set for the BI / data mining domain.

It has Analysis Services (SSAS) 2008 for building data cubes and data mining models. The following is a list of its features:
  Core server components: mining structures, mining models, DMX, AMO/ADOMD
  Tools: BIDS 2008
  Extensibility: plug-in algorithms and DMX functions

It has Excel 2007 with the Data Mining add-ins as a common reporting tool for data mining models:
  Server-based models: reporting using existing Analysis Services 2008 mining models
  Client-based models: creating mining models with the Data Mining add-in
We are going to look at it more closely later in the demos.

It has Integration Services (SSIS) 2008 for the ETL process and data transformation:
  Tasks: Data Mining Query
  Transformations: Data Mining Query
  Destination: Data Mining Model Training

It supports native and custom reporting applications, such as SSRS, via ADOMD and DMX; their requests are transformed into XMLA/HTTP requests.

It also works with both native and custom administration applications via AMO.
6
Core Concepts
Cases & Nested Cases (Records)
Attributes (Columns): Discrete & Continuous
  Key
  Input
  Output
  Predictive
Structures
  Defining Data Domain (Relational or OLAP)
  Data Source View
  Number & Type of Columns
  Optional Partitioning: Training & Test Sets
Models
  Create one model per algorithm
  Algorithms have specific parameters
  Training vs. Testing (Prediction)
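The optional partitioning of a structure's cases into training and test sets can be illustrated outside of SSAS. The following is a minimal Python sketch, not how Analysis Services performs the split; the column names (CustomerKey, Age, BikeBuyer), the 30% test fraction, and the fixed seed are all assumptions made for illustration.

```python
import random


def partition_cases(cases, test_fraction=0.3, seed=42):
    """Randomly split cases into (training_set, test_set).

    test_fraction and seed are illustrative defaults, not SSAS settings.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(cases)                  # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)   # seeded for a repeatable split
    test_size = round(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]


# Hypothetical cases: CustomerKey is the key column, Age is a continuous
# input attribute, and BikeBuyer is a discrete predictable attribute.
cases = [
    {"CustomerKey": key, "Age": 20 + key, "BikeBuyer": key % 2 == 0}
    for key in range(1, 11)
]

training_set, test_set = partition_cases(cases)
print(len(training_set), "training cases,", len(test_set), "test cases")
```

Each case lands in exactly one of the two sets, so a model trained on the training set can later be tested on cases it has not seen.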
7
The Data Mining Development Process with SQL Server 2008
This diagram might look familiar to most of you. I actually borrowed it from the MSDN Library / SQL Server Books Online. Generally, the data mining development process includes the following steps:
Step 1: Defining the Problem
Step 2: Preparing Data
Step 3: Exploring Data
Step 4: Building Mining Models
Step 5: Validating Mining Models
Step 6: Deploying and Updating Mining Models

Step 1: Defining the Problem
In this step, we want to understand the business problem and identify the goal of the process. We should try to answer questions such as: What business problem are you trying to solve? What type of relationship in your data are you trying to find? Are you trying to make a prediction or just explore patterns? Where could patterns possibly exist? How is your data related?

Step 2: Preparing Data
After answering these questions, we can move on to the Preparing Data step. In this step, we consolidate and clean up the data for the data mining models, based on the result of step 1. In BIDS 2008, SSIS contains all kinds of tools to support this ETL process. One important point here is that the resulting data can be stored in an OLAP cube, a database, Excel workbooks, text files, and so on, as long as it is a valid data source for SSAS.

Step 3: Exploring Data
After preparing the data, we should start digging into it in step 3. In this step, we explore the information hidden in the data collected in step 2 through various measurements and calculations. Some of the most common measurements are the minimum, maximum, mean, and standard deviation. These measurements, together with the distribution of the data, are an excellent indicator of whether we have a good dataset for building mining models for the business problem, and of whether the dataset needs additional data. This is also an opportunity to deepen our understanding of how the business behaves.
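The common measurements mentioned for the exploring step (minimum, maximum, mean, standard deviation) can be sketched in a few lines of Python. This is only an illustration using the standard library, not the Data Source View Designer; the `profile_column` helper and the `yearly_income` values are hypothetical.

```python
import statistics


def profile_column(values):
    """Return basic descriptive statistics for one continuous column."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


# Hypothetical continuous attribute from the prepared dataset.
yearly_income = [30000, 40000, 50000, 60000, 70000]

profile = profile_column(yearly_income)
for name, value in profile.items():
    print(f"{name}: {value}")
```

A very small count, a suspicious minimum or maximum, or an unexpectedly large standard deviation are the kinds of signals that suggest going back to clean or extend the dataset.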
The Data Source View Designer in BIDS 2008 contains several tools that you can use to explore data. Sometimes the result of step 3 indicates that a better definition of the business problem or an improved dataset is needed. When that happens, we should go back to step 1 for a better strategy. If the result of step 3 is satisfactory, we should move on to building data mining models in step 4.

Step 4: Building Mining Models
In this step, we add mining structures and then build one or more mining models in SSAS. A mining model is based on a mining structure, and each mining structure can have multiple mining models. Before we process a mining model, it contains only the definition of where to get the data, what data is needed, how the data is related, and how it should be processed mathematically. Processing a model is also referred to as training the mining model: the definition of the model is applied to the data to extract the desired patterns. Mining models should be reprocessed when more recent data becomes available so that they reflect the latest state of the business. The data mining algorithms in SSAS can be applied in these mining models as appropriate.

In BIDS 2008, we can define a mining model with the Data Mining Wizard or with the DMX language. The Data Mining Wizard helps us define new mining structures, choose data sources, partition the data in the mining structure into training and testing sets, and add an initial mining model to the mining structure. After creating and processing the mining structures and mining models, we should validate all of the mining models. It is recommended to build multiple mining models with different configurations in step 4 so that we can find the best-performing model by comparison in the next step.

Step 5: Validating Mining Models
After constructing the mining models, we should evaluate and validate their performance before deployment.
We should also find the best-performing model (in terms of accuracy, reliability, and usefulness) by examining all of the mining models. The Data Mining Designer in BIDS 2008 has various viewers for exploring the output of a mining model, such as discovered patterns and predictions. It also supports data partitioning, mining model cross-validation, and mining model accuracy charts for validating mining models. If none of the mining models yields an acceptable result, we should go back to step 1 to investigate the dataset further and define the business problem better. If the best-performing mining model yields an acceptable result, we should move on to deploying or updating the mining model in a production environment in step 6.

Step 6: Deploying and Updating Mining Models
Once we are ready to deploy the mining model, we can deploy it through SSIS in BIDS 2008. For further details, please refer to the MSDN reference.
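The comparison described in the validation step, where several candidate models are scored on held-out test cases and the process goes back to step 1 if none is acceptable, can be sketched as follows. This is a simplified Python illustration, not the accuracy chart in the Data Mining Designer: the two rule-based "models", the column names, and the accuracy threshold are all assumptions.

```python
def accuracy(model, test_cases, target):
    """Fraction of test cases for which the model predicts the target value."""
    correct = sum(1 for case in test_cases if model(case) == case[target])
    return correct / len(test_cases)


def best_model(models, test_cases, target, min_accuracy):
    """Score every model; return (best name or None, all scores).

    None means no model is acceptable and the problem definition
    and dataset should be revisited.
    """
    scores = {name: accuracy(m, test_cases, target) for name, m in models.items()}
    winner = max(scores, key=scores.get)
    if scores[winner] < min_accuracy:
        return None, scores
    return winner, scores


# Hypothetical held-out test set with a known BikeBuyer outcome.
test_cases = [
    {"Age": 25, "BikeBuyer": True},
    {"Age": 30, "BikeBuyer": True},
    {"Age": 45, "BikeBuyer": False},
    {"Age": 50, "BikeBuyer": True},
    {"Age": 60, "BikeBuyer": False},
]

# Stand-ins for trained mining models: each maps a case to a prediction.
models = {
    "age_rule": lambda case: case["Age"] < 40,
    "always_no": lambda case: False,
}

winner, scores = best_model(models, test_cases, "BikeBuyer", min_accuracy=0.7)
print(winner, scores)
```

Accuracy is only one of the criteria mentioned above; reliability and usefulness would need their own checks.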
8
SQL Server 2008 Data Mining Algorithms
Recommendations
  Association Algorithm
Segmentation
  Clustering Algorithm
Classification
  Decision Trees Algorithm
  Linear Regression Algorithm
  Neural Network Algorithm
  Logistic Regression Algorithm
  Naive Bayes Algorithm
Sequencing
  Sequence Clustering Algorithm
Forecasting
  Time Series Algorithm
    ARTxp
    ARIMA

Let's talk about one of the critical pieces of the data mining tool set: the collection of data mining algorithms in SSAS. We are going to walk through all of them quickly by category, where each category corresponds to the type of business problem or question it addresses.

Category - Recommendations: The Association Algorithm originated in point-of-sale (POS) analysis, but it can be applied in many fields. Its best-known use is market basket analysis, where a market basket is the collection of product items purchased by a customer in a single transaction. By using the Association Algorithm to analyze customers' transactions, we can learn which products are commonly purchased together. This helps us change the product layout to increase sales and manage stock.

Category - Segmentation: The Clustering Algorithm finds natural groups inside a dataset when these groups are not easy to spot by eye or by simple analysis. With respect to the pre-defined input variables, samples in the same group are similar and samples in different groups are different.

Category - Classification: The Decision Trees Algorithm is probably the most popular data mining technique for classification. It can be used for both discrete and continuous attributes. The algorithm creates a split (node) when an input column is found to be significantly correlated with the predictable column.

The Linear Regression Algorithm finds a linear relationship between a dependent variable and one or more independent variables, and it is widely used in practice. It is a variation of the Microsoft Decision Trees Algorithm.
What distinguishes it from the Decision Trees Algorithm is that a regression equation is used to describe the linear relationship between the dependent and independent variables.

The Microsoft Neural Network Algorithm is popular in classification applications. It originated from building a model to simulate how biological neurons work. It creates classification and regression mining models by constructing a multilayer perceptron network. The Neural Network Algorithm has the ability to learn, remember, and draw conclusions. Like the Decision Trees Algorithm, it can find nonlinear relationships between variables.

The Logistic Regression Algorithm is a well-known statistical method for binary data, for instance, Yes or No. It is a variation of the Microsoft Neural Network Algorithm. It uses a modified neural network to find the relationship between inputs and outputs, in which a logistic transformation is used to minimize the effect of extreme values.

The Naive Bayes Algorithm is a relatively simple classification algorithm that can build a predictive model quickly. It applies Bayes' theorem and does not take dependencies between the input variables into account. Bayes' theorem uses a combination of conditional and unconditional probabilities.

Category - Sequencing: The Sequence Clustering Algorithm combines sequence analysis and clustering techniques. The goal of the Microsoft Sequence Clustering Algorithm is to find the common sequences in a set of specific activities. The algorithm finds the most common sequences by placing similar sequences in the same group and comparing them using clustering and Markov chain models. The first step is to identify the groups. The second step is to find the common sequences.

Category - Forecasting: The Time Series Algorithm is used to analyze and forecast time-based data.
Compared with trend forecasting by the Microsoft Decision Trees Algorithm, which needs other input columns, the Time Series Algorithm forecasts a trend based on the series itself. The basic idea of the Microsoft Time Series Algorithm is to find the rule by which the values change over time and to extend that rule into the future for forecasting. It includes two independent algorithms:
ARTXP: a hybrid of decision tree and auto-regression techniques, used for short-term forecasting; it supports cross-prediction.
ARIMA: used for long-term forecasting.
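The idea of finding the rule by which a series changes over time and extending it into the future can be illustrated with a plain first-order autoregression. The Python sketch below is not ARTXP or ARIMA and is not the Microsoft Time Series Algorithm; it is a deliberately simple stand-in, and the `monthly_sales` figures are hypothetical.

```python
def fit_ar1(series):
    """Least-squares fit of x[t] = intercept + slope * x[t-1]."""
    if len(series) < 3:
        raise ValueError("need at least three observations")
    previous, current = series[:-1], series[1:]
    n = len(previous)
    mean_prev = sum(previous) / n
    mean_cur = sum(current) / n
    variance = sum((p - mean_prev) ** 2 for p in previous)
    if variance == 0:
        raise ValueError("series is constant; slope is undefined")
    covariance = sum(
        (p - mean_prev) * (c - mean_cur) for p, c in zip(previous, current)
    )
    slope = covariance / variance
    intercept = mean_cur - slope * mean_prev
    return intercept, slope


def forecast(series, steps):
    """Extend the fitted rule forward, feeding each prediction back in."""
    intercept, slope = fit_ar1(series)
    predictions = []
    last = series[-1]
    for _ in range(steps):
        last = intercept + slope * last
        predictions.append(last)
    return predictions


# Hypothetical monthly sales growing by 10% each month.
monthly_sales = [100.0, 110.0, 121.0, 133.1, 146.41]

next_two_months = forecast(monthly_sales, steps=2)
print([round(value, 2) for value in next_two_months])
```

Because each forecast is fed back in as the next input, errors compound as the horizon grows, which is one reason short-term and long-term forecasting are treated differently.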
9
Excel 2007 Data Mining Add-in
What's in the Add-in?
Table Analysis Tools
  Uses Algorithms on Server
  Mining Results in Client
Data Mining Client
  Persist Objects on Server
Visio Data Mining Templates
  Generate Diagrams from Mining Results

Table Analysis Tools: Create session/client-based data mining analysis without persisting objects on the SSAS server (still requires an SSAS connection).
Data Mining Client: Leverage SSAS storage for persisting shared data mining objects.
Visio Data Mining Templates: Draw diagrams from existing mining models in Visio.
10
Table Analysis Tools
Analyze Key Influencers
Detect Categories
Fill From Example
Highlight Exceptions
Scenarios: Goal Seek & What-if
11
Data Mining Client
Data Preparation
  Explore Data (Data Profiling)
  Clean Data
  Partition Data
Data Modeling
  Classify
  Estimate
  Cluster
  Associate
  Forecast
Accuracy & Validation (Profit & Lift Charts)
Model Usage (DMX Queries)
Management & Connections
12
Visio Data Mining Diagrams
Dependency Network
Cluster
Decision Tree
13
Summary
Data mining is the process of extracting hidden patterns (information) from large volumes of data.
Microsoft SQL Server 2008, through BIDS and other tools, supports the entire data mining development process.
SQL Server 2008 contains many advanced data mining algorithms that solve a variety of business problems.
The Excel 2007 data mining add-in is a reporting tool for SQL Server data mining.
14
Q & A