1
Business Intelligence
Core Subject – 15 Unit Credits
2
Lecture 10 Tools and Techniques : Descriptive and Predictive analysis
• Predictive modelling, e.g. forecasting: use of statistical models to predict and identify trends.
• Data mining techniques to find anomalies, cluster patterns and/or relationships between data sets.
• Converting data into visual information using charts, graphs, histograms and other visual mediums.
3
Data Mining
• The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data stored in structured databases (Fayyad et al., 1996).
• Other names: knowledge extraction, pattern analysis, knowledge discovery, information harvesting, pattern searching, data dredging, …
• Data: a collection of facts usually obtained as the result of experiences, observations, or experiments.
• Data in data mining may consist of numbers, words, images, …
• Data is the lowest level of abstraction (from which information and knowledge are derived).
4
What does DM do?
• DM extracts patterns from data.
• Pattern? A mathematical (numeric and/or symbolic) relationship among data items.
Data Mining Tasks…
• Predictive methods: use some variables to predict unknown or future values of other variables.
• Descriptive methods: find human-interpretable patterns that describe the data.
– Classification (Predictive)
– Clustering (Descriptive)
– Association Rule Discovery (Descriptive)
– Sequential Pattern Discovery (Descriptive)
– Regression (Predictive)
– Deviation Detection (Predictive)
5
Classification : Definition
Given a collection of records (training set):
• Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be _________________________________.
• A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
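The train/test procedure described on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the records are made up, and a 1-nearest-neighbour rule stands in for "a model"; the slide does not prescribe any particular classifier.

```python
import math
import random

def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle the records and split them into a training set and a test set."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def predict_1nn(training_set, attributes):
    """Return the class of the training record closest to `attributes`."""
    nearest = min(training_set, key=lambda rec: math.dist(rec[0], attributes))
    return nearest[1]

def accuracy(training_set, test_set):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(predict_1nn(training_set, attrs) == label
                  for attrs, label in test_set)
    return correct / len(test_set)

# Hypothetical records: ((income in $1000s, age), class attribute).
records = [((20, 25), "no"), ((25, 30), "no"), ((22, 41), "no"), ((30, 35), "no"),
           ((80, 45), "yes"), ((90, 50), "yes"), ((85, 38), "yes"), ((95, 60), "yes")]

train, test = train_test_split(records)
model_accuracy = accuracy(train, test)
print(len(train), len(test), model_accuracy)
```

The training set is the only data the model sees; accuracy is measured on the held-out test records.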
6
Classification Example
7
Classification : Application 1
Direct Marketing
• Goal: reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
• Approach:
– Use the data for a similar product introduced before.
– We know which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision forms the class attribute.
– Collect various demographic, lifestyle, and company-interaction related information about all such customers (type of business, where they stay, how much they earn, etc.).
– Use this information as ____________________ to learn a classifier model.
8
Classification : Application 2
Fraud Detection
• Goal: predict fraudulent cases in credit card transactions.
• Approach:
– Use credit card transactions and information on the account holder as attributes (when does a customer buy, what does he buy, how often he pays on time, etc.).
– Label past transactions as fraud or fair transactions. This forms the class attribute.
– Learn a model for the class of the transactions.
– Use this model to detect fraud by _______________________ on an account.
9
Clustering : Definition
Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
• Data points in one cluster are more similar to one another.
• Data points in separate clusters are less similar to one another.
Similarity measures:
• Euclidean distance if attributes are continuous.
• Other problem-specific measures.
10
Clustering Based Techniques
Key assumption: normal data records belong to large and dense clusters, while anomalies do not belong to any of the clusters or form very small clusters.
Anomalies detected using clustering-based methods can be:
– Data records that do not fit into any cluster (residuals from clustering)
– Small clusters
– Low-density clusters or local anomalies (far from other points within the same cluster)
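One simple way to illustrate the "does not fit into any dense cluster" idea is to count how many neighbours each record has within a fixed radius. The sketch below is an assumption-laden illustration, not a method named on the slide: the function name, the `radius` and `min_neighbors` parameters, and the sample points are all hypothetical.

```python
import math

def low_density_points(points, radius, min_neighbors):
    """Flag points that have fewer than `min_neighbors` other points within `radius`."""
    flagged = []
    for i, p in enumerate(points):
        neighbors = sum(1 for j, q in enumerate(points)
                        if j != i and math.dist(p, q) <= radius)
        if neighbors < min_neighbors:
            flagged.append(p)
    return flagged

# Two dense groups plus one isolated record (made-up data).
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3),
          (5.0, 5.0), (5.1, 4.8), (4.9, 5.2), (5.2, 5.1),
          (9.0, 1.0)]

anomalies = low_density_points(points, radius=1.0, min_neighbors=2)
print(anomalies)  # [(9.0, 1.0)]
```

The result depends heavily on the chosen radius and neighbour count, which is one reason clustering-based detection usually needs tuning per data set.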
11
Major Clustering Approaches (I)
Partitioning approach: k-means, k-medoids, CLARANS
• Construct k partitions of the given n objects (k ≤ n).
• Each group contains at least one object.
• Each object must belong to exactly one group.
Density-based approach: DBSCAN, OPTICS, DenClue
• Based on connectivity and density functions, i.e., for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points.
Hierarchical approach: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
• Create a hierarchical ____________________ using some criterion (linkage function).
• Agglomerative approach: bottom-up merging
• Divisive approach: top-down splitting
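As a rough sketch of the partitioning approach, the following implements the basic k-means loop (often called Lloyd's algorithm) with Euclidean distance. The initialisation from the first k points and the sample data are assumptions made for brevity; practical implementations normally use better seeding.

```python
import math

def kmeans(points, k, iterations=20):
    """Basic k-means: alternate assignment and centroid-update steps."""
    centroids = [points[i] for i in range(k)]  # naive init: first k points (assumption)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(
                    tuple(sum(coord) / len(cluster) for coord in zip(*cluster)))
            else:
                new_centroids.append(centroids[i])  # keep centroid of an empty cluster
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, clusters

points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 1.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

Every point ends up in exactly one group, matching the partitioning constraints listed above.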
12
Major Clustering Approaches (II)
Grid-based approach:
• Based on a multiple-level granularity structure
• Typical methods: STING, WaveCluster, CLIQUE
Model-based:
• A model is hypothesized for each of the clusters, and the method tries to find the best fit of the data to that model
• Typical methods: EM, SOFM, COBWEB
Frequent pattern-based:
• Based on the analysis of frequent patterns
• Typical methods: pCluster
User-guided or constraint-based:
• Clustering by considering _______________________ constraints
• Typical methods: COD (obstacles), constrained clustering
13
Clustering : Application 1
Market Segmentation
• Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
• Approach:
– Collect different attributes of customers based on their geographical and lifestyle related information.
– Find clusters of similar customers.
– Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.
14
Clustering : Application 2
Document Clustering
• Goal: to find groups of documents that are similar to each other based on the important terms appearing in them.
• Approach: identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
• Gain: information retrieval can utilize the clusters to relate a new document or search term to clustered documents.
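A similarity measure "based on the frequencies of different terms" could, for example, be cosine similarity over raw term counts. The slide does not name a specific measure, so the choice of cosine similarity, the whitespace tokenisation, and the sample documents below are all assumptions.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count how often each lowercase, whitespace-separated term occurs."""
    return Counter(text.lower().split())

def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors (0.0 to 1.0)."""
    dot = sum(tf_a[term] * tf_b[term] for term in tf_a if term in tf_b)
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical documents.
docs = ["data mining finds patterns in data",
        "mining data for hidden patterns",
        "monthly loan payment schedule"]
tfs = [term_frequencies(d) for d in docs]

sim_01 = cosine_similarity(tfs[0], tfs[1])
sim_02 = cosine_similarity(tfs[0], tfs[2])
print(sim_01, sim_02)
```

Documents that share many terms score close to 1, while documents with no terms in common score 0; a clustering method can then group documents by these scores.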
15
Association Rule Discovery : Definition
Given a set of records, each of which contains some number of items from a ___________________, produce dependency rules that predict the occurrence of an item based on the occurrences of other items. Also known as Market Basket Analysis.
16
Association Rule Discovery: Application 1
Marketing and Sales Promotion
• Let the rule discovered be {Bagels, … } --> {Potato Chips}
• Potato Chips as consequent => can be used to determine what should be done to boost its sales.
• Bagels in the antecedent => can be used to see which products would be affected if the store discontinues selling bagels.
• Bagels in antecedent and Potato Chips in consequent => can be used to see what products should be sold with Bagels to promote sale of Potato Chips!
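How strong a rule such as {Bagels} --> {Potato Chips} is can be quantified with support and confidence. These two measures are standard in association rule mining but are not defined on the slide, so treat the definitions and the made-up baskets below as an illustrative assumption.

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent --> consequent.

    support    = fraction of transactions containing antecedent and consequent
    confidence = fraction of antecedent transactions that also contain the consequent
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / len(transactions)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

# Hypothetical market baskets.
baskets = [{"bagels", "potato chips", "milk"},
           {"bagels", "potato chips"},
           {"bagels", "milk"},
           {"milk", "bread"}]

support, confidence = rule_stats(baskets, {"bagels"}, {"potato chips"})
print(support, confidence)
```

Here half of all baskets contain both items, and two of the three bagel baskets also contain potato chips.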
17
Association Rule Discovery: Application 2
Supermarket shelf management
• Goal: to identify items that are bought together by sufficiently many customers.
• Approach: process the point-of-sale data collected with ________________ to find dependencies among items.
• A classic rule: if a customer buys diapers and milk, then he is very likely to buy beer. So, don’t be surprised if you find six-packs stacked next to diapers!
18
Association Rule Mining
Apriori Algorithm
• Finds subsets that are common to at least a minimum number of the item sets.
• Uses a bottom-up approach: frequent subsets are extended one item at a time (the size of frequent subsets increases from one-item subsets to two-item subsets, then three-item subsets, and so on).
• Groups of candidates at each level are tested against the data for minimum support.
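The level-by-level procedure can be sketched as follows. This is a compact, unoptimised illustration: `min_support` is taken here as an absolute transaction count, and the baskets are invented.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support count} for every frequent itemset.

    `min_support` is an absolute number of transactions (assumption).
    """
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        """Keep only the candidates that occur in at least min_support transactions."""
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    items = {item for t in transactions for item in t}
    level = count({frozenset([item]) for item in items})  # frequent 1-itemsets
    frequent = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        level = count(candidates)  # test the surviving candidates against the data
        frequent.update(level)
        k += 1
    return frequent

baskets = [{"diapers", "milk", "beer"},
           {"diapers", "milk", "beer", "bread"},
           {"diapers", "milk"},
           {"milk", "bread"},
           {"diapers", "beer"}]

frequent = apriori(baskets, min_support=3)
for itemset, n in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

The prune step is what keeps the search tractable: a candidate is discarded without scanning the data as soon as one of its subsets is known to be infrequent.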
19
Sequential Pattern Discovery: Definition
Given is a set of objects, with each object associated with its own _____________ , find rules that predict strong sequential dependencies among different events. Rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints.
20
Regression
• Predict a value of a _____________ variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
• Greatly studied in statistics and neural network fields.
• Examples:
– Predicting sales amounts of a new product based on advertising expenditure.
– Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
– Time series prediction of stock market indices.
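For the linear case with a single predictor, the model can be fitted by ordinary least squares. The sketch below uses the advertising example from the slide with invented numbers; the function name and data are assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising spend vs. sales (sales = 10 + 2 * spend exactly).
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [12.0, 14.0, 16.0, 18.0, 20.0]

slope, intercept = fit_line(ad_spend, sales)
predicted = intercept + slope * 6.0  # forecast for an unseen spend level
print(slope, intercept, predicted)
```

Real data will not lie exactly on a line, in which case the fitted line minimises the sum of squared prediction errors.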
21
Deviation /Anomaly Detection
An anomaly is a pattern in the data that does not conform to the expected behavior.
• Real-world anomalies:
– Credit card fraud detection: an abnormally high purchase made on a credit card
– Cyber intrusion detection: a web server involved in FTP traffic
22
Types of Anomaly
• Point Anomalies
• Contextual Anomalies
• Collective Anomalies
23
Point Anomalies
• An individual data instance is anomalous with respect to the rest of the data.
24
Contextual Anomalies
• An individual data instance is anomalous within a context.
• Requires a notion of context.
• Also referred to as conditional anomalies.
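To make the role of context concrete, the sketch below scores each value against other values from the same context only. The z-score rule, the threshold of 2.0, and the temperature readings are illustrative assumptions, not part of the lecture.

```python
from collections import defaultdict
from statistics import mean, pstdev

def contextual_anomalies(readings, threshold=2.0):
    """Flag (context, value) pairs whose value is unusual for that context.

    A value is flagged when it lies more than `threshold` standard
    deviations from the mean of the values sharing its context.
    """
    by_context = defaultdict(list)
    for context, value in readings:
        by_context[context].append(value)

    flagged = []
    for context, value in readings:
        values = by_context[context]
        mu, sigma = mean(values), pstdev(values)
        if sigma > 0 and abs(value - mu) / sigma > threshold:
            flagged.append((context, value))
    return flagged

# Hypothetical temperature readings in degrees Celsius, tagged with a season.
readings = [("winter", 2), ("winter", 3), ("winter", 1), ("winter", 4),
            ("winter", 2), ("winter", 3), ("winter", 25),
            ("summer", 24), ("summer", 26), ("summer", 25),
            ("summer", 27), ("summer", 25), ("summer", 26)]

flagged = contextual_anomalies(readings)
print(flagged)  # [('winter', 25)]
```

A reading of 25 is ordinary in summer but anomalous in winter, so only the winter instance is flagged.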
25
Collective Anomalies
• A collection of related data instances is anomalous.
• Requires a relationship among data instances:
– Sequential data
– Spatial data
– Graph data
• The individual instances within a collective anomaly are not anomalous by themselves.
26
Anomaly Detection Applications
• Credit card fraud detection
• Image processing / video surveillance
• Network intrusion detection
• Healthcare informatics
27
Intrusion Detection
• The process of monitoring the events occurring in a computer system or network and analyzing them for intrusions.
• Intrusions are defined as attempts to bypass the security mechanisms of a computer or network.
• Challenges:
– Traditional signature-based intrusion detection systems are based on signatures of known attacks and _________________ cyber threats
– Substantial latency in deployment of newly created signatures across the computer system
• Anomaly detection can alleviate these limitations.
28
Fraud Detection
• Fraud detection refers to detection of criminal activities occurring in commercial organizations.
– ________________ might be the actual customers of the organization or might be posing as a customer (also known as identity theft).
• Types of fraud:
– Credit card fraud
– Insurance claim fraud
– Mobile / cell phone fraud
– Insider trading
• Challenges:
– Fast and accurate real-time detection
– Misclassification cost is very high
29
Data Visualization
• Data visualization is a general term that describes any effort to help people understand the significance of data by placing it in a pictorial or graphical format (e.g. 3-D images).
• Data visualization provides _________________ to aid the user during both data preprocessing and the actual data mining.
• Data transformation is an important data preprocessing step. During data transformation, visualizing data can help the user to ensure the correctness of the transformation.
• Data from: satellite photos, sonar measurements, surveys, or computer simulations.
30
Data Visualization
Common display types:
• Bar Charts
• Line Charts
• Pie Charts
• Bubble Charts
• Stacked Charts
• Scatterplots
31
Visualization Based Techniques
Use visualization tools to observe the data
• Provide ________________ for manual inspection
• Anomalies are detected visually
• Advantages:
– Keeps a human in the loop
• Disadvantages:
– Works well only for low-dimensional data
– Can provide only aggregated or partial views for high-dimensional data
32
Graphical Approaches
• Boxplot (1-D), scatter plot (2-D), spin plot (3-D)
• Limitations:
– Time consuming
– Subjective
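What a boxplot shows for 1-D data can also be computed directly. The sketch below applies the common convention of flagging values more than 1.5 interquartile ranges outside the quartiles; that convention and the sample values are assumptions, since the slide only names the plot type. Quartiles come from `statistics.quantiles` with its default method, and other quartile definitions can give slightly different fences.

```python
from statistics import quantiles

def boxplot_outliers(values, whisker=1.5):
    """Return the values lying beyond the boxplot whiskers."""
    q1, _median, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lower_fence = q1 - whisker * iqr
    upper_fence = q3 + whisker * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]

# Hypothetical measurements with one unusually large value.
values = [10, 12, 11, 13, 12, 11, 14, 12, 40]
outliers = boxplot_outliers(values)
print(outliers)  # [40]
```

Computing the fences removes some of the subjectivity of reading a plot by eye, although the choice of whisker length is itself a judgment call.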
33
Application of Dynamic Graphics
• Apply dynamic graphics to the exploratory analysis of spatial data.
• Visualization tools are used to ___________________ to detect anomalies.
• Manual inspection of plots of the data that display its marginal and multivariate distributions.
34
Visual Data Mining
• The process of discovering implicit but useful knowledge from large data sets using visualization techniques.
• Example: detecting telecommunication fraud
– Display telephone call patterns as a graph
– Use colors to identify fraudulent telephone calls (anomalies)
35
Spreadsheet
• Most popular end-user modelling tool
• Static model example: simple loan calculation of monthly payments
36
Spreadsheet
• Excel spreadsheet, dynamic model example: simple loan calculation of monthly payments and the effects of prepayment
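The spreadsheet figures themselves are not reproduced in this transcript, so the following is only a sketch of the kind of calculation such a model performs. It uses the standard fixed-rate annuity formula, payment = P·r / (1 − (1 + r)^−n) with monthly rate r and n monthly payments, and then steps through the balance month by month to show the effect of prepayment. The loan amount, rate, term, and function names are hypothetical.

```python
def monthly_payment(principal, annual_rate, years):
    """Fixed monthly payment for a fully amortising loan."""
    r = annual_rate / 12   # monthly interest rate
    n = years * 12         # number of monthly payments
    if r == 0:
        return principal / n
    return principal * r / (1 - (1 + r) ** -n)

def months_to_payoff(principal, annual_rate, payment, extra=0.0):
    """Simulate the balance month by month, with an optional extra prepayment."""
    r = annual_rate / 12
    if payment + extra <= principal * r:
        raise ValueError("payment does not cover the monthly interest")
    balance, months = principal, 0
    while balance > 0.005:  # stop once less than half a cent remains
        balance = balance + balance * r - (payment + extra)
        months += 1
    return months

# Hypothetical loan: 100,000 at 6% per year over 30 years.
payment = monthly_payment(100_000, 0.06, 30)
print(round(payment, 2))  # 599.55

baseline_months = months_to_payoff(100_000, 0.06, payment)
prepaid_months = months_to_payoff(100_000, 0.06, payment, extra=100.0)
print(baseline_months, prepaid_months)
```

The first function corresponds to the static model (a single formula), while the month-by-month loop plays the role of the dynamic model, where changing the prepayment changes how quickly the balance reaches zero.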
37
End of the chapter