CS157A Spring 05 Data Mining Professor Sin-Min Lee
Today's presentation covers: 1. What is Data Mining? 2. Data Mining Objectives 3. Data Mining Operations 4. Knowledge Discovery 5. Application of Data Mining 6. Summary 7. References
Overview of Data Mining ● [Figure: diagram relating Data Mining to Statistics, Databases, Artificial Intelligence, and Visualization]
1. What is Data Mining? ➔ We usually use Data Mining to: – Discover useful, previously unknown knowledge by analyzing large and complex databases – Carry out knowledge discovery, exploratory data analysis, applied statistics, and machine learning – Search for valuable information in large databases
2. Data Mining Objectives ➔ Find rules and patterns in large-volume databases ➔ Discovery – Finding human-understandable patterns that describe the data ➔ Prediction – Using some variables or fields in the database to predict unknown or future values of other variables of interest
Data Mining Objectives ➔ Knowledge Discovery – Stage somewhat prior to prediction where information is insufficient – It's close to decision support
3. Data Mining Operations ➔ Associations ➔ Sequential Patterns ➔ Time-Series Clustering ➔ Classification ➔ Segmentation ➔ And many more!
Association ● Used to find all rules in basket data ● Basket data is also called transaction data ● Analyzes how the items purchased by customers in a shop are related
Association... ● A formal definition: ● Let I = {i1, i2, …, im} be the total set of items, D a set of transactions, and d one transaction consisting of a set of items, d ⊆ I ● Association rule: X ⇒ Y, where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅ ● Support: number of instances predicted correctly, expressed here as a fraction of all transactions ● Support = (# of transactions containing X ∪ Y) / |D| ● Confidence: number of correct predictions, as a proportion of all instances the rule applies to ● Confidence = (# of transactions containing X ∪ Y) / (# of transactions containing X)
Association... ● Example of transaction data: – Transaction 1: CD player, music CD, music book – Transaction 2: CD player, music CD – Transaction 3: music CD, music book – Transaction 4: CD player ● I = {CD player, music CD, music book} ● |D| = 4 ● # of transactions containing both CD player and music CD = 2 ● # of transactions containing CD player = 3 ● For the rule CD player ⇒ music CD: Support = 2/4, Confidence = 2/3
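The support and confidence figures in this example can be checked with a few lines of Python. This is a minimal sketch, not a full rule-mining algorithm: it only evaluates one given rule, and the function names (`count_containing`, `support`, `confidence`) are hypothetical helpers introduced for illustration.

```python
# Transactions from the slide's example, each modeled as a set of items.
transactions = [
    {"CD player", "music CD", "music book"},
    {"CD player", "music CD"},
    {"music CD", "music book"},
    {"CD player"},
]

def count_containing(items, transactions):
    """Number of transactions that contain every item in `items`."""
    return sum(1 for t in transactions if items <= t)

def support(x, y, transactions):
    """Fraction of all transactions that contain both X and Y."""
    return count_containing(x | y, transactions) / len(transactions)

def confidence(x, y, transactions):
    """Among transactions containing X, the fraction that also contain Y."""
    return count_containing(x | y, transactions) / count_containing(x, transactions)

# Rule under test: CD player => music CD
x, y = {"CD player"}, {"music CD"}
print(support(x, y, transactions))     # 2 of the 4 transactions
print(confidence(x, y, transactions))  # 2 of the 3 transactions with a CD player
```

Sets make the "transaction contains X" test a simple subset check (`items <= t`), which mirrors the set notation of the formal definition.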
Applying Association Rules... ● Example: Books that tend to be bought together. If a customer buys a book, an online bookstore (e.g., Amazon.com) may suggest other associated books. ● Example: If a person buys a laptop, the salesperson may suggest accessories that tend to be bought along with a laptop.
Time Series Clustering ● Given: – A database of time series ● Find: – Groups of similar time series ● Sample Applications: – Determine products with similar selling patterns – Identify companies with similar patterns of growth – Find stocks with similar price movements
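One possible way to group similar time series is sketched below. The slide does not name a similarity measure or a clustering method, so both are assumptions here: Pearson correlation as the similarity measure, and a simple greedy grouping with a hypothetical `threshold` parameter. The sales figures are invented for illustration only.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def group_similar(series, threshold=0.9):
    """Greedy grouping: add each series to the first group whose first
    member correlates with it at `threshold` or above, else start a group."""
    groups = []
    for name, values in series.items():
        for group in groups:
            if pearson(series[group[0]], values) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups

# Hypothetical weekly sales figures (not from the slides).
sales = {
    "umbrellas": [5, 9, 14, 20, 26],
    "raincoats": [3, 5, 8, 11, 15],
    "sunscreen": [30, 24, 17, 11, 4],
}
print(group_similar(sales))
```

Correlation compares the shape of two series rather than their scale, which suits the "similar selling patterns" application; a real system would more likely use an established clustering algorithm.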
Classification ● Classification – Problem: Given that items belong to one of several classes, and given past instances (aka training instances) of items along with the classes to which they belong, the problem is to PREDICT the class to which a new item belongs – The class of the new instance is not known, so other attributes of the instance must be used to predict the class. – It can be done by finding rules that partition the given data into disjoint groups
Classification... ● Dataset is usually in the form of a relation table. ● Data has a set of distinct attributes. ● Each data record is also labeled with a class. ● Goal : To build a model or learn rules that can be used to predict the classes of new cases. ● Training Data are used to build this model.
Classification... ● For example – Suppose that a credit card company wants to decide whether or not to give a credit card to an applicant ● The company has a variety of information about the person, such as their age, educational background, income, etc. ● The company then ranks the applicants (categorizes them into classes) ● Forall person P, P.degree = masters AND P.income > 75,000 ==> P.credit = excellent ● Forall person P, P.degree = bachelors OR (P.income >= 25,000 AND P.income <= 75,000) ==> P.credit = good
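The two credit rules can be written as an ordinary function. This is only a sketch: the function name `credit_rating` is hypothetical, the rules are applied in the order listed, and the upper income bound of 75,000 in the second rule should be treated as an assumption. No rule is given for the remaining applicants, so the function returns `None` for them.

```python
def credit_rating(degree, income):
    """Apply the two example classification rules, in order."""
    # Rule 1: masters degree AND income above 75,000.
    if degree == "masters" and income > 75_000:
        return "excellent"
    # Rule 2: bachelors degree OR income in a middle band.
    # ASSUMPTION: the band is taken to run from 25,000 to 75,000 inclusive.
    if degree == "bachelors" or 25_000 <= income <= 75_000:
        return "good"
    # No rule covers the remaining applicants.
    return None

print(credit_rating("masters", 90_000))
print(credit_rating("bachelors", 10_000))
print(credit_rating("none", 50_000))
print(credit_rating("none", 10_000))
```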
Classification... ● Table:
  Age        Smoke   Risk
  (missing)  No      Low
  25         Yes     High
  44         Yes     High
  18         No      Low
  55         No      High
  35         No      Low
● To identify the risk, we have two groups: – Risk = Low and Risk = High
Classification... ● The following techniques can be used for classification: – Decision trees – Predictive modeling – Association rules – Neural networks – etc.
Decision Trees ● “Divide-and-conquer” approach produces a tree ● Nodes involve testing a particular attribute ● Usually, the attribute value is compared to a constant ● Other possibilities: – Comparing the values of two attributes – Using a function of one or more attributes ● Leaves assign a classification, a set of classifications, or a probability distribution to instances ● An unknown instance is routed down the tree
Decision Tree ● In short, a decision tree is just a series of nested if/then rules. ● Our previous example: – Smoke = Yes: Risk = High – Smoke = No: test Age ● Age 0-35: Risk = Low ● Otherwise: Risk = High
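The "nested if/then rules" reading of the example tree can be sketched directly in Python. The function name `predict_risk` is hypothetical, and the tree shape (Smoke tested first, then Age with a 0-35 branch) is read off the slide's diagram. Only table rows with a legible age are used.

```python
def predict_risk(age, smokes):
    """Nested if/then form of the example tree: test Smoke, then Age."""
    if smokes:
        return "High"
    if age <= 35:      # the tree's "0-35" branch
        return "Low"
    return "High"      # non-smokers older than 35

# Rows of the example table with a legible age: (age, smokes, risk).
rows = [
    (25, True, "High"),
    (44, True, "High"),
    (18, False, "Low"),
    (55, False, "High"),
    (35, False, "Low"),
]
for age, smokes, risk in rows:
    # Print the tree's prediction next to the label from the table.
    print(age, smokes, predict_risk(age, smokes), risk)
```

Routing an unknown instance down the tree is just a call such as `predict_risk(40, False)`.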
Predictive Modeling ● Predict values based on similar groups of data ● Pattern Recognition – Association of an observation with past experience or knowledge – Interchangeable with classification ● Estimation – Assign a numeric label, drawn from an infinite range of possible values, to an observation
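The difference between pattern recognition (a class label) and estimation (a numeric label) can be illustrated with a small nearest-neighbor sketch. The slide does not prescribe a method, so k-nearest neighbors is an assumption chosen because it predicts from "similar" past observations; the function names and the data, including the premium figures, are invented for illustration.

```python
from collections import Counter

def nearest(train, x, k):
    """The k training pairs (feature, label) whose feature is closest to x."""
    return sorted(train, key=lambda pair: abs(pair[0] - x))[:k]

def classify(train, x, k=3):
    """Pattern recognition: majority class label among the k nearest."""
    labels = [label for _, label in nearest(train, x, k)]
    return Counter(labels).most_common(1)[0][0]

def estimate(train, x, k=3):
    """Estimation: mean numeric label among the k nearest."""
    values = [value for _, value in nearest(train, x, k)]
    return sum(values) / len(values)

# Hypothetical (age, label) observations, not taken from the slides.
risk_by_age = [(18, "Low"), (25, "Low"), (35, "Low"), (44, "High"), (55, "High")]
premium_by_age = [(18, 100.0), (25, 110.0), (35, 130.0), (44, 170.0), (55, 220.0)]

print(classify(risk_by_age, 50))     # one of a finite set of classes
print(estimate(premium_by_age, 50))  # any real number
```

Both functions share the same neighbor search; only the way the neighbors' labels are combined differs.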
4. Knowledge Discovery ● Find patterns in a database – For example, if someone buys one thing, what will they buy next? ● Interesting + Certain = Knowledge – The output is usually called “Discovered Knowledge” ● KDD – Knowledge Discovery in Databases ● A non-trivial process of identifying valid, potentially useful, and understandable patterns in data
KDD – Knowledge Discovery in Databases... ● Advances in traditional tasks in data analysis – Classification, Clustering – New Data Mining operations ● Association rules ● Sequential patterns ● Deviations / Exceptions ● New application areas – Spatial, Text, Web, Image, ...
KDD – Knowledge Discovery in Databases ● Applications – Most large companies have data warehouses: platforms for Data Mining projects – Trend towards integrated vertical solutions in areas such as finance and telecom ● Back-end: integration with databases ● Front-end: Campaign Management or CRM (Customer Relationship Management)
KDD – Knowledge Discovery in Databases ● Next-generation knowledge discovery systems should: – Have integrated front-end access to knowledge delivery tools – Have integrated back-end access to enterprise and external databases – Have a knowledge discovery engine as an embedded part of the overall solution – Be oriented to solving a business problem, not a data analysis problem
5. Application of Data Mining ● Medical ● Control Theory ● Engineering ● Marketing and Finance ● Data Mining on the web ● Scientific Databases ● Fraud Detection ● And many more!
6. Summary ● Data Mining IS... – Decision Trees, Nearest Neighbor Classification, Neural Networks, Rule Induction, K-means Clustering – A decision support process in which we search for patterns of information in data ● Data Mining is NOT... – Retrieving data (e.g., Google) ● “Information retrieval” or “Database querying” ● Data Mining infers “the right query” from data – Merging many small databases into a large one
Summary ● Data Mining is not... – Data warehousing – SQL / Ad Hoc Queries / Reporting – Software Agents – Online Analytical Processing (OLAP) – Data Visualization
References ● Dr. Lee's Presentation – Data Mining Section ● Dr. Kurt Thearling's website – An Introduction to Data Mining