Business Intelligence

Business Intelligence Core Subject – 15 Unit Credits

Lecture 10 Tools and Techniques: Descriptive and Predictive Analysis
• Predictive modelling, e.g. forecasting: use of statistical models to predict and identify trends.
• Data mining techniques to find anomalies, clusters, patterns and/or relationships between data sets.
• Converting data into visual information using charts, graphs, histograms and other visual media.

Data Mining
The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data stored in structured databases. - Fayyad et al. (1996)
• Other names: knowledge extraction, pattern analysis, knowledge discovery, information harvesting, pattern searching, data dredging, …
• Data: a collection of facts, usually obtained as the result of experiences, observations, or experiments.
• Data in data mining may consist of numbers, words, images, …
• Data is the lowest level of abstraction (from which information and knowledge are derived).

What does DM do?
DM extracts patterns from data.
• Pattern: a mathematical (numeric and/or symbolic) relationship among data items.
Data Mining Tasks
• Predictive methods: use some variables to predict unknown or future values of other variables.
• Descriptive methods: find human-interpretable patterns that describe the data.
• Classification (Predictive)
• Clustering (Descriptive)
• Association Rule Discovery (Descriptive)
• Sequential Pattern Discovery (Descriptive)
• Regression (Predictive)
• Deviation Detection (Predictive)

Classification : Definition
• Given a collection of records (training set), each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be _________________________________.
• A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
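The training/test split described above can be sketched in a few lines. This is only an illustration: the 1-nearest-neighbour classifier, the toy records, and the names `nearest_neighbor_predict` and `holdout_accuracy` are assumptions made for the example and do not come from the slides.

```python
import math
import random

def nearest_neighbor_predict(train, record):
    """Return the class label of the training record closest to `record`."""
    closest = min(train, key=lambda row: math.dist(row[0], record))
    return closest[1]

def holdout_accuracy(records, test_fraction=0.3, seed=0):
    """Split the records into training and test sets and score the test set."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    test, train = shuffled[:n_test], shuffled[n_test:]
    # The model is "built" from the training set only (1-NN just stores it).
    correct = sum(nearest_neighbor_predict(train, x) == y for x, y in test)
    return correct / len(test)

# Hypothetical records: (income in $1000s, age) -> class attribute
records = [
    ((20, 25), "don't buy"), ((22, 30), "don't buy"), ((25, 28), "don't buy"),
    ((24, 35), "don't buy"), ((21, 40), "don't buy"),
    ((80, 45), "buy"), ((85, 50), "buy"), ((90, 38), "buy"),
    ((78, 42), "buy"), ((88, 55), "buy"),
]
print(holdout_accuracy(records))
```

The accuracy is measured only on records the model has not seen, which is the point of keeping the test set separate.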

Classification Example

Classification : Application 1 Direct Marketing Goal : Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product. Approach : Use the data for a similar product introduced before. We know which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision forms the class attribute. Collect various demographic, lifestyle, and company-interaction related information about all such customers. Type of business, where they stay, how much they earn, etc. Use this information as ____________________ to learn a classifier model.

Classification : Application 2 Fraud Detection
Goal : Predict fraudulent cases in credit card transactions.
Approach :
• Use credit card transactions and the information on their account holders as attributes: when a customer buys, what they buy, how often they pay on time, etc.
• Label past transactions as fraud or fair transactions. This forms the class attribute.
• Learn a model for the class of the transactions.
• Use this model to detect fraud by _______________________ on an account.

Clustering : Definition Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that Data points in one cluster are more similar to one another. Data points in separate clusters are less similar to one another. Similarity Measures: Euclidean Distance if attributes are continuous. Other Problem-specific Measures.

Clustering Based Techniques Key assumption: normal data records belong to large and dense clusters, while anomalies do not belong to any of the clusters or form very small clusters Anomalies detected using clustering based methods can be: – Data records that do not fit into any cluster (residuals from clustering) – Small clusters – Low density clusters or local anomalies (far from other points within the same cluster)
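As a rough illustration of the key assumption above, the sketch below flags records that sit in sparse regions. It is a simplified density check, not a full clustering algorithm such as DBSCAN; the function name `low_density_points` and the parameters `eps` and `min_pts` are assumptions chosen for the example.

```python
import math

def low_density_points(points, eps, min_pts):
    """Flag points with fewer than `min_pts` other points within radius `eps`."""
    flagged = []
    for p in points:
        neighbors = sum(1 for q in points if q is not p and math.dist(p, q) <= eps)
        if neighbors < min_pts:
            flagged.append(p)
    return flagged

# Four records form a dense group; the last one lies far from everything else.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2), (5.0, 5.0)]
print(low_density_points(points, eps=0.5, min_pts=2))
```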

Major Clustering Approaches (I)
• Partitioning approach: k-means, k-medoids, CLARANS
– Construct k partitions of the given n objects (k ≤ n).
– Each group contains at least one object, and each object must belong to exactly one group.
• Density-based approach: DBSCAN, OPTICS, DenClue
– Based on connectivity and density functions, i.e., for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points.
• Hierarchical approach: Diana, Agnes, BIRCH, ROCK, CHAMELEON
– Create a hierarchical ____________________ using some criterion (linkage function).
– Agglomerative approach: bottom-up merging
– Divisive approach: top-down splitting
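For the partitioning approach, a minimal k-means sketch (Lloyd's algorithm) may help. The initial centroids are passed in explicitly to keep the example deterministic; that choice, the toy points, and the function name `k_means` are assumptions for illustration only.

```python
import math

def k_means(points, centroids, max_iter=100):
    """Alternate between assigning points and updating centroids until stable."""
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster
            else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8, 9.5)]
centroids, clusters = k_means(points, [(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

Each object ends up in exactly one group, matching the partitioning constraint stated above.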

Major Clustering Approaches (II)
• Grid-based approach: based on a multiple-level granularity structure. Typical methods: STING, WaveCluster, CLIQUE
• Model-based: a model is hypothesized for each of the clusters, and the method tries to find the best fit of the data to that model. Typical methods: EM, SOFM, COBWEB
• Frequent pattern-based: based on the analysis of frequent patterns. Typical methods: pCluster
• User-guided or constraint-based: clustering by considering _______________________ constraints. Typical methods: COD (obstacles), constrained clustering

Clustering : Application 1 Market Segmentation Goal : subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix. Approach : Collect different attributes of customers based on their geographical and lifestyle related information. Find clusters of similar customers. Measure the clustering quality by observing buying patterns of customers in same cluster VS. those from different clusters.

Clustering : Application 2 Document Clustering Goal : To find groups of documents that are similar to each other based on the important terms appearing in them. Approach : To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster. Gain : Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.
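One common way to form a similarity measure from term frequencies is cosine similarity. The slides do not name a specific measure, so the choice of cosine similarity, the whitespace tokenizer, and the sample documents below are assumptions for illustration.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count how often each lower-cased, whitespace-separated term occurs."""
    return Counter(text.lower().split())

def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors (0.0 to 1.0)."""
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

doc1 = term_frequencies("data mining finds patterns in data")
doc2 = term_frequencies("mining patterns from large data sets")
doc3 = term_frequencies("monthly loan payment schedule")
print(cosine_similarity(doc1, doc2))
print(cosine_similarity(doc1, doc3))
```

A clustering method can then group documents whose pairwise similarity is high.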

Association Rule Discovery : Definition
Given a set of records, each of which contains some number of items from a ___________________, produce dependency rules that predict the occurrence of an item based on the occurrences of other items. Also known as Market Basket Analysis.

Association Rule Discovery: Application 1 Marketing and Sales Promotion • Let the rule discovered be {Bagels, … } --> {Potato Chips} • Potato Chips as consequent => Can be used to determine what should be done to boost its sales. • Bagels in the antecedent => Can be used to see which products would be affected if the store discontinues selling bagels. • Bagels in antecedent and Potato chips in consequent => Can be used to see what products should be sold with Bagels to promote sale of Potato chips!
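Rules such as {Bagels} --> {Potato Chips} are usually judged by their support and confidence. These two measures are standard in association rule mining but are not defined in the slides, so treat the sketch below, including the made-up transactions, as an assumption-laden illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the share that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical point-of-sale baskets
transactions = [
    {"bagels", "potato chips", "milk"},
    {"bagels", "potato chips"},
    {"bagels", "milk"},
    {"milk", "bread"},
]
print(support(transactions, {"bagels", "potato chips"}))
print(confidence(transactions, {"bagels"}, {"potato chips"}))
```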

Association Rule Discovery: Application 2 Supermarket shelf management
Goal: To identify items that are bought together by sufficiently many customers.
Approach: Process the point-of-sale data collected with ________________ to find dependencies among items.
A classic rule: if a customer buys diapers and milk, then he is very likely to buy beer. So, don’t be surprised if you find six-packs stacked next to diapers!

Association Rule Mining: Apriori Algorithm
• Finds subsets that are common to at least a minimum number of the item sets.
• Uses a bottom-up approach: frequent subsets are extended one item at a time (the size of frequent subsets increases from one-item subsets to two-item subsets, then three-item subsets, and so on).
• Groups of candidates at each level are tested against the data for minimum support.
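The level-by-level procedure above can be sketched as follows. This is a compact, unoptimized version; the function name `apriori`, the use of an absolute support count as the threshold, and the sample baskets are assumptions made for the example.

```python
from itertools import combinations

def apriori(transactions, min_support_count):
    """Return {itemset: count} for every itemset meeting the minimum support count."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    candidates = {frozenset([item]) for item in items}
    frequent = {}
    k = 1
    while candidates:
        # Test the size-k candidates against the data for minimum support.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support_count}
        frequent.update(level)
        # Extend frequent k-item subsets by one item, keeping only candidates
        # whose k-item subsets are all frequent.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent

baskets = [
    {"diapers", "milk", "beer"},
    {"diapers", "beer"},
    {"diapers", "milk", "beer"},
    {"milk", "bread"},
]
frequent_itemsets = apriori(baskets, min_support_count=2)
for itemset, count in sorted(frequent_itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The pruning step relies on the fact that any subset of a frequent itemset must itself be frequent, so candidates with an infrequent subset never need to be counted.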

Sequential Pattern Discovery: Definition Given is a set of objects, with each object associated with its own _____________ , find rules that predict strong sequential dependencies among different events. Rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints.

Regression Predict a value of a _____________ variable based on the values of other variables, assuming a linear or nonlinear model of dependency. • Greatly studied in statistics, neural network fields. • Examples: Predicting sales amounts of new product based on advertising expenditure. Predicting wind velocities as a function of temperature, humidity, air pressure, etc. Time series prediction of stock market indices.
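For the linear case, ordinary least squares with a single predictor can be written directly. The advertising and sales figures below are invented for illustration, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y ≈ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising expenditure vs. sales amount
advertising = [1, 2, 3, 4, 5]
sales = [12, 14, 16, 18, 20]
slope, intercept = fit_line(advertising, sales)
predicted_sales = slope * 6 + intercept  # prediction for an unseen expenditure
print(slope, intercept, predicted_sales)
```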

Deviation /Anomaly Detection Anomaly is a pattern in the data that does not conform to the expected behavior • Real World Anomalies: Credit Card Fraud Detection – An abnormally high purchase made on a credit card Cyber Intrusion Detection – A web server involved in ftp traffic

Type of Anomaly Point Anomalies Contextual Anomalies Collective Anomalies

Point Anomalies An individual data instance is anomalous w.r.t. the data

Contextual Anomalies An individual data instance is anomalous within a context Requires a notion of context Also referred to as conditional anomalies
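A small sketch can show why context matters: the same value may be normal in one context and anomalous in another. Here the context is a season and the test is a z-score within each context; the threshold of 2.0, the temperature readings, and the function name are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean, stdev

def contextual_anomalies(records, threshold=2.0):
    """Flag (context, value) pairs whose value is far from its own context's mean."""
    by_context = defaultdict(list)
    for context, value in records:
        by_context[context].append(value)
    flagged = []
    for context, value in records:
        values = by_context[context]
        if len(values) < 2:
            continue  # not enough data to estimate the spread
        spread = stdev(values)
        if spread > 0 and abs(value - mean(values)) / spread > threshold:
            flagged.append((context, value))
    return flagged

# Hypothetical temperature readings in °C, tagged with their season
winter = [-2, 0, 1, -1, 0, 2, -1, 1, 0, 25]
summer = [24, 26, 25, 27, 25, 24, 26, 25, 26, 25]
records = [("winter", t) for t in winter] + [("summer", t) for t in summer]
print(contextual_anomalies(records))
```

A reading of 25 is unremarkable in summer but is flagged in winter, which a context-free point anomaly test over the pooled data would be less likely to catch.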

Collective Anomalies A collection of related data instances is anomalous Requires a relationship among data instances – Sequential Data – Spatial Data – Graph Data The individual instances within a collective anomaly are not anomalous by themselves

Anomaly Detection Applications Credit card fraud detection Image Processing / Video surveillance Network Intrusion detection Healthcare informatics

Intrusion Detection Process of monitoring the events occurring in a computer system or network and analyzing them for intrusions Intrusions are defined as attempts to bypass the security mechanisms of a computer or network • Challenges : Traditional signature-based intrusion detection systems are based on signatures of known attacks and _________________ cyber threats Substantial latency in deployment of newly created signatures across the computer system • Anomaly detection can alleviate these limitations

Fraud Detection Fraud detection refers to detection of criminal activities occurring in commercial organizations – ________________ might be the actual customers of the organization or might be posing as a customer (also known as identity theft). • Types of fraud : – Credit card fraud – Insurance claim fraud – Mobile / cell phone fraud – Insider trading • Challenges : – Fast and accurate real-time detection – Misclassification cost is very high

Data Visualization Data visualization is a general term that describes any effort to help people understand the significance of data by placing it in a pictorial or graphical format (3-D images) Data visualization provides _________________ to aid the user during both data preprocessing and the actual data mining. Data Transformation is an important data preprocessing step. During data transformation, visualizing data can help the user to ensure the correctness of the transformation. Data from: satellite photos, sonar measurements, surveys, or computer simulations

Data Visualization Common Display Types : Bar Charts Line Charts Pie Charts Bubble Chart Stacked Charts Scatterplots

Visualization Based Techniques
Use visualization tools to observe the data
• Provide ________________ for manual inspection
• Anomalies are detected visually
• Advantages :
– Keeps a human in the loop
• Disadvantages :
– Works well only for low-dimensional data
– Can provide only aggregated or partial views for high-dimensional data

Graphical Approaches Boxplot (1-D), Scatter plot (2-D), Spin plot (3-D) Limitations : – Time consuming – Subjective
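The outliers a boxplot draws as individual points are commonly determined by the 1.5 × IQR rule. The slides do not state the rule, so the multiplier, the quartile method (the default of `statistics.quantiles`), and the sample values below are assumptions.

```python
from statistics import quantiles

def boxplot_outliers(values, whisker=1.5):
    """Return values lying more than `whisker` * IQR beyond the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < lower or v > upper]

purchases = [10, 12, 11, 13, 12, 11, 10, 95]
print(boxplot_outliers(purchases))
```

Automating the rule removes some of the subjectivity listed above, although the choice of multiplier is still a judgment call.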

Application of Dynamic Graphics
• Apply dynamic graphics to the exploratory analysis of spatial data.
• Visualization tools are used to ___________________ to detect anomalies.
• Manual inspection of plots of the data that display its marginal and multivariate distributions.

Visual Data Mining The process of discovering implicit but useful knowledge from large data sets using visualization techniques. Detecting Telecommunication fraud • Display telephone call patterns as a graph • Use colors to identify fraudulent telephone calls (anomalies)

Spreadsheet Most popular end-user modelling tool Static Model example : Simple Loan calculation of monthly payments
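The static loan model amounts to a single formula, the standard fixed-payment (annuity) formula M = P·r·(1+r)^n / ((1+r)^n − 1), with monthly rate r and n monthly payments. The slide does not give the formula or any figures, so the loan terms below are invented; spreadsheet users typically get the same result from a built-in payment function.

```python
def monthly_payment(principal, annual_rate, months):
    """Fixed monthly payment for a fully amortizing loan."""
    r = annual_rate / 12  # monthly interest rate
    if r == 0:
        return principal / months
    growth = (1 + r) ** months
    return principal * r * growth / (growth - 1)

# Hypothetical loan: 10,000 borrowed at 12% per year, repaid over 12 months
print(round(monthly_payment(10_000, 0.12, 12), 2))
```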

Spreadsheet Excel spreadsheet – Dynamic Model example : Simple Loan calculation of monthly payments And effects of prepayment
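A dynamic model steps through the loan month by month, which makes the effect of prepayment visible. The sketch below is one possible way to do this; the loan terms, the extra payment of 100 per month, and the helper names are assumptions, not values from the slide.

```python
def monthly_payment(principal, annual_rate, months):
    """Fixed monthly payment for a fully amortizing loan."""
    r = annual_rate / 12
    growth = (1 + r) ** months
    return principal * r * growth / (growth - 1)

def simulate_payoff(principal, annual_rate, payment, extra=0.0):
    """Return (months needed, total interest paid) when paying `payment + extra` monthly."""
    r = annual_rate / 12
    balance, months, total_interest = principal, 0, 0.0
    while balance > 0.005:  # stop once less than half a cent remains
        interest = balance * r
        if payment + extra <= interest:
            raise ValueError("payment too small to ever repay the loan")
        # The final payment may overshoot; a real schedule would reduce it.
        balance = balance + interest - (payment + extra)
        total_interest += interest
        months += 1
    return months, total_interest

payment = monthly_payment(10_000, 0.12, 12)
base_months, base_interest = simulate_payoff(10_000, 0.12, payment)
fast_months, fast_interest = simulate_payoff(10_000, 0.12, payment, extra=100)
print(base_months, round(base_interest, 2))
print(fast_months, round(fast_interest, 2))
```

Because each month's interest depends on the previous month's balance, prepaying shortens the loan and reduces the total interest, which a one-cell static model cannot show.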

End of the chapter
REFERENCE :
https://www.researchgate.net/publication/314529706_Visualization_Techniques_for_Data_Mining
http://www.cs.put.poznan.pl/jstefanowski/sed/DM14-visualisation.pdf