Advanced Database Course (ESED5204) Eng. Hanan Alyazji University of Palestine Software Engineering Department
Data Mining
What is Data Mining? Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns. It is the discovery of knowledge (in the form of rules, trees, frequent patterns, etc.) from large volumes of data. It is the automated process of finding relationships and patterns in stored data. It is different from the use of SQL queries and other business intelligence tools.
Data Mining – Why is it important? The explosive growth in data collection. Data are being generated in enormous quantities. Data are being collected over long periods of time. Data are being kept for long periods of time. Computing power is formidable and cheap. A variety of Data Mining software is available.
Data Mining: On What Kind of Data? Relational databases. Data warehouses. Transactional databases. Advanced databases and information repositories: object-oriented and object-relational databases, spatial databases, time-series and temporal data, text and multimedia databases, and the World Wide Web.
Knowledge Discovery in Databases (KDD): Knowledge discovery in databases (KDD) is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. KDD is the automatic extraction of non-obvious, hidden knowledge from large volumes of data. KDD is the process of extracting previously unknown, valid, and actionable (understandable) information from large databases, whereas data mining is one step in the KDD process: the application of data analysis and discovery algorithms.
The Knowledge Discovery Process: The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. Non-trivial process: a process with multiple steps. Valid: justified patterns/models. Novel: previously unknown. Useful: can be used. Understandable: by humans and machines.
Data Mining & KDD: Data mining is a step in the KDD process. It consists of particular data mining algorithms that, under specified computational efficiency limitations, produce a specific enumeration of patterns. The KDD process is the process of using data mining methods (algorithms) to extract knowledge according to specified measures, using the database along with any required preprocessing and transformations of that database.
What are the basic steps of data mining for knowledge discovery? 1. Define the business problem. 2. Build the data mining database (not easy). 3. Explore the data. 4. Prepare the data for modeling (select variables, select rows, construct new variables, transform variables). 5. Build the model. 6. Evaluate the model. 7. Deploy the model and results.
Knowledge Discovery Process: The whole process of extracting implicit, previously unknown, and potentially useful knowledge from a large database. It includes data selection, cleaning, coding, data mining, and reporting. Data mining, the process of finding the desired information in a large database, is the key stage of the knowledge discovery process.
Stages of KDD
The Knowledge Discovery Process: KDD is inherently interactive and iterative. Its steps are: 1. Understand the domain and define problems. 2. Collect and preprocess data. 3. Data mining: extract patterns/models. This is the step in the KDD process consisting of methods that produce useful patterns or models from the data, under some acceptable computational efficiency limitations. 4. Interpret and evaluate the discovered knowledge. 5. Put the results to practical use.
Knowledge Discovery in Databases Process
Data Cleaning and Integration: Integration of data from different sources: mapping of attribute names and joining of different tables. Elimination of inconsistencies. Imputation of missing values (if necessary and possible): fill in missing values by some strategy (e.g. a default value or the average value). Normalization.
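Two of the cleaning steps above, imputation by average value and normalization, can be sketched in a few lines of Python. This is only an illustration under assumptions the slide does not state: missing values are represented as `None`, and normalization is taken to mean min-max rescaling to [0, 1]. The `ages` column is hypothetical.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1] (assumed normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical column with one missing value.
ages = [20, None, 40, 30]
filled = impute_mean(ages)          # the gap is filled with the mean of 20, 40, 30
scaled = min_max_normalize(filled)  # smallest value maps to 0.0, largest to 1.0
```

A default-value strategy would only change the `fill` line; which strategy is appropriate depends on the attribute.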
Focusing on task-relevant data: Selections: select the relevant rows from the database tables. Projections: select the relevant attributes/columns from the database tables. Transformations: computation of numerical attributes, and computation of derived rows and derived attributes/columns (new attributes).
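The three operations above map directly onto small functions over rows. The sketch below assumes a table held as a list of dictionaries; the table contents, the column names, and the derived attribute `spend_ratio` are all hypothetical.

```python
def select_rows(rows, predicate):
    """Selection: keep only the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


def project(rows, columns):
    """Projection: keep only the named attributes/columns."""
    return [{c: row[c] for c in columns} for row in rows]


def add_derived(rows, name, fn):
    """Transformation: add a new attribute computed from each row."""
    return [{**row, name: fn(row)} for row in rows]


# Hypothetical customer table.
customers = [
    {"id": 1, "region": "north", "income": 1000, "spend": 250},
    {"id": 2, "region": "south", "income": 800, "spend": 400},
    {"id": 3, "region": "north", "income": 2000, "spend": 500},
]

selected = select_rows(customers, lambda r: r["region"] == "north")
projected = project(selected, ["id", "income", "spend"])
task_data = add_derived(projected, "spend_ratio", lambda r: r["spend"] / r["income"])
```

In a relational database the same three steps would typically be a `WHERE` clause, a column list, and a computed column in a single query.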
Basic Data Mining Tasks: Clustering. Classification. Association rules. Concept characterization and discrimination. Other methods.
Evaluation of patterns: Interestingness of patterns. "A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm."
Visualization (visual data mining): Present the data in some visual form, allowing a person to gain insight into the data, draw conclusions, and directly interact with the data.
Common Types of Information from Data Mining: Associations: identify occurrences that are linked to a single event. Sequences: identify events that are linked over time. Classification: recognizes patterns that describe the group to which an item belongs. Clustering: discovers different groupings within the data. Forecasting: estimates future values.
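As a rough illustration of the "associations" idea, the sketch below counts how often two items occur together in the same event. It assumes that an event is a single purchase transaction, which the slide does not specify, and the `baskets` data is hypothetical. It is a plain co-occurrence count, not a full association rule algorithm.

```python
from collections import Counter
from itertools import combinations


def pair_counts(transactions):
    """Count how many transactions contain each unordered pair of items."""
    counts = Counter()
    for items in transactions:
        # sorted(set(...)) removes duplicates and gives each pair one canonical order.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return counts


# Hypothetical transactions (one list per purchase event).
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "jam"],
    ["bread", "milk", "jam"],
]

counts = pair_counts(baskets)
most_linked = counts.most_common(1)[0]  # the pair seen together most often
```

Pairs that co-occur far more often than the rest are candidates for an association worth inspecting further.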
The Data Mining Process: Requires personnel with domain, data warehousing, and data mining expertise. Requires data selection, data extraction, data cleaning, and data transformation. Is an iterative and interactive process.
The Data Mining Process: Based on the questions being asked and the required "form" of the output: 1. Select the data mining mechanisms that will be used. 2. Make sure the data is properly coded for the selected mechanisms (e.g. a tool may accept numeric input only). 3. Perform a rough analysis using traditional tools: create a simple prediction using statistics. The data mining tools must do better than this prediction. 4. Run the tool and examine the results.
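Steps 2 and 3 above can be sketched as follows. The encoding function shows one simple way to turn categories into numbers for a numeric-only tool, and the baseline is a majority-class prediction, chosen here as one example of "a simple prediction using statistics". The labels and the `tool_predictions` list are hypothetical stand-ins for real data and real tool output.

```python
from collections import Counter


def encode_categories(values):
    """Step 2: map each distinct category to an integer code."""
    # Note: integer codes imply an ordering that the categories may not have.
    codes = {}
    for v in values:
        codes.setdefault(v, len(codes))
    return [codes[v] for v in values], codes


def majority_baseline(train_labels):
    """Step 3: the simplest statistical prediction, the most frequent label."""
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(predicted, actual):
    """Fraction of predictions that match the actual labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


encoded, codes = encode_categories(["red", "blue", "red"])

# Hypothetical labels and hypothetical output of a data mining tool.
train_labels = ["no", "no", "no", "yes"]
test_labels = ["no", "yes", "no", "no"]
tool_predictions = ["no", "yes", "no", "no"]

baseline_label = majority_baseline(train_labels)
baseline_acc = accuracy([baseline_label] * len(test_labels), test_labels)
tool_acc = accuracy(tool_predictions, test_labels)
tool_is_useful = tool_acc > baseline_acc  # the tool must beat the baseline
```

If `tool_is_useful` is false, the mined model adds nothing over the simple statistic and the earlier steps should be revisited.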
Data Mining Tasks: Data mining is generally divided into two kinds of tasks: 1. Predictive tasks: predict the value of a specific attribute based on the values of other attributes. A prediction method uses some variables to predict unknown or future values of other variables. 2. Descriptive tasks: derive patterns that summarize the underlying relationships in the data. A description method finds human-interpretable patterns that describe the data.
Data Mining Tasks: Classification [Predictive]; Clustering [Descriptive]; Association Rule Discovery [Descriptive]; Sequential Pattern Discovery [Descriptive]; Regression [Predictive]; Deviation Detection [Predictive].
Classification: The data are defined in terms of attributes, one of which is the class. The goal is to find a model for the class attribute as a function of the values of the other attributes. The given data set is usually divided into training and test sets. Training data: used to build the model. Test data: used to validate the model (determine its accuracy).
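The training/test workflow above can be sketched with a deliberately simple classifier. The example uses a 1-nearest-neighbour rule with Euclidean distance, which is an assumption made for brevity and not a method the slides prescribe. The records and the 75% split are hypothetical.

```python
import math


def split(records, train_fraction=0.75):
    """Divide the given data into a training set and a test set."""
    cut = int(len(records) * train_fraction)
    return records[:cut], records[cut:]


def predict_1nn(train, features):
    """Predict the class of the closest training record (1-nearest neighbour)."""
    nearest = min(train, key=lambda rec: math.dist(rec[0], features))
    return nearest[1]


def accuracy(train, test):
    """Validate the model: fraction of test records classified correctly."""
    hits = sum(predict_1nn(train, x) == y for x, y in test)
    return hits / len(test)


# Hypothetical records: (attribute values, class label).
# The classes are already interleaved; real data should be shuffled before splitting.
records = [
    ((1.0, 1.0), "low"), ((9.0, 9.0), "high"),
    ((1.5, 2.0), "low"), ((8.0, 9.5), "high"),
    ((2.0, 1.0), "low"), ((9.5, 8.0), "high"),
    ((1.0, 2.5), "low"), ((8.5, 8.5), "high"),
]

train, test = split(records)
test_accuracy = accuracy(train, test)
```

The key point is that `test` is never used to build the model, so `test_accuracy` estimates how the model behaves on unseen data.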
Classification Example [Figure: a data table whose attributes are marked as categorical, continuous, and class. Diagram labels: Training Set, Learn Classifier, Model, Test Set.]
Classification Application: Fraud Detection. Goal: predict fraudulent cases in credit card transactions. Approach: Use credit card transactions and information about the account holder as attributes (when the customer buys, what they buy, how often they pay on time, etc.). Label past transactions as fraud or fair; this forms the class attribute. Derive a model for the class of the transactions. Use this model to detect fraud by observing credit card transactions on an account.
Clustering: Partition a data set into clusters. A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters. Example: given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that data points in one cluster are more similar to one another and data points in separate clusters are less similar to one another.
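One common way to find such clusters is k-means, sketched below. The slides do not name a specific algorithm, so both k-means and the use of Euclidean distance as the (dis)similarity measure are assumptions. The points and the initial centers are hypothetical.

```python
import math


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        for p in points
    ]


def update(points, labels, centers):
    """Move each center to the mean of the points assigned to it."""
    new_centers = []
    for i, old in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == i]
        if not members:
            new_centers.append(old)  # keep an empty cluster's center unchanged
            continue
        new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_centers


def k_means(points, centers, max_iter=100):
    """Alternate assignment and update until the labels stop changing."""
    labels = assign(points, centers)
    for _ in range(max_iter):
        centers = update(points, labels, centers)
        new_labels = assign(points, centers)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers


# Hypothetical 2-D points forming two visibly separate groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
labels, centers = k_means(points, centers=[points[0], points[3]])
```

The result depends on the initial centers; here they are picked by hand, whereas practical implementations usually try several random initializations.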
Data Warehouse: A warehouse is a storage place for data awaiting use. Data warehousing is a process for assembling and managing data from various sources for the purpose of gaining a single detailed view of part or all of a business. It integrates diverse data sources and provides support for decision-making operations. It is usually based on a relational database and DBMS.
Data Warehouse – why? For organisational learning to take place, data from many sources must be gathered together over time and organised in a consistent and useful way. Data warehousing allows an organisation to remember its data and what it has learned about its data. Data mining techniques make use of the data in a data warehouse and subsequently add their results to it.
Data Warehouse - Contents A Data Warehouse is a copy of transaction data specifically structured for querying, analysis and reporting. The data will normally have been transformed when it was copied into the Data Warehouse. The contents of a Data Warehouse, once acquired, are fixed and cannot be updated or changed later by the transaction system - but they can be added to of course.
Questions?