1 Data Mining (Εξόρυξη Γνώσης)
C. Papatheodorou
Laboratory on Digital Libraries and Electronic Publishing
Department of Archives and Library Science, Ionian University

2 Data Mining
- Extraction of knowledge from very large data collections
- Knowledge: rules, behavior patterns, and associations between objects (non-obvious, latent, previously unknown, and useful)
- Object: consists of a set of attributes
- It is not:
  - (Deductive) query processing
  - Expert systems, small machine learning/statistical programs

3 Why Data Mining? Potential Applications
- Database analysis and decision support
  - Market analysis and management
    - target marketing, customer relation management, market basket analysis, cross selling, market segmentation
  - Risk analysis and management
    - forecasting, customer retention, improved underwriting, quality control, competitive analysis
  - Fraud detection and management
- Other applications
  - Text mining (news groups, email, documents) and Web analysis
  - Intelligent query answering

4 Market Analysis and Management (1)
- Where are the data sources for analysis?
  - Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
- Target marketing
  - Find clusters of "model" customers who share the same characteristics: interests, income level, spending habits, etc.
- Determine customer purchasing patterns over time
  - Conversion of a single to a joint bank account: marriage, etc.
- Cross-market analysis
  - Associations/correlations between product sales
  - Prediction based on the association information

5 Market Analysis and Management (2)
- Customer profiling
  - Data mining can tell you what types of customers buy what products (clustering or classification)
- Identifying customer requirements
  - Identifying the best products for different customers
  - Use prediction to find what factors will attract new customers
- Provides summary information
  - Various multidimensional summary reports
  - Statistical summary information (data central tendency and variation)

6 Corporate Analysis and Risk Management
- Finance planning and asset evaluation
  - cash flow analysis and prediction
  - contingent claim analysis to evaluate assets
  - cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
- Resource planning:
  - summarize and compare the resources and spending
- Competition:
  - monitor competitors and market directions
  - group customers into classes and a class-based pricing procedure
  - set pricing strategy in a highly competitive market

7 Steps of a KDD Process
- Learning the application domain:
  - relevant prior knowledge and goals of application
- Creating a target data set: data selection
- Data cleaning and preprocessing (may take 60% of effort!)
- Data reduction and transformation:
  - find useful features, dimensionality/variable reduction, invariant representation
- Choosing functions of data mining
  - summarization, classification, regression, association, clustering
- Choosing the mining algorithm(s)
- Data mining: search for patterns of interest
- Pattern evaluation and knowledge presentation
  - visualization, transformation, removing redundant patterns, etc.
- Use of discovered knowledge

8 Data Mining: A KDD Process
- Data mining: the core of the knowledge discovery process.
Figure: Databases -> Data Cleaning, Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation

9 Data pre-processing
- Data preparation is a big issue for data mining
- Data preparation includes
  - Data cleaning and data integration
  - Data reduction and feature selection
  - Discretization
- A lot of methods have been developed, but this is still an active area of research
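As an illustration of the discretization step, here is a minimal sketch of equal-width binning, which is only one of many possible discretization schemes. The function name `equal_width_bins` and the `ages` values are made up for this example and do not come from the slides.

```python
from typing import List, Sequence


def equal_width_bins(values: Sequence[float], n_bins: int) -> List[int]:
    """Map each numeric value to a bin index in 0..n_bins-1.

    The range [min, max] is split into n_bins intervals of equal width.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical: everything falls into the first bin.
        return [0 for _ in values]
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical attribute values (illustrative only).
ages = [18, 22, 25, 31, 40, 47, 52, 60]
bins = equal_width_bins(ages, 3)
print(bins)
```

Equal-width binning is simple but sensitive to outliers; equal-frequency or entropy-based schemes are common alternatives.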

10 Data pre-processing

11 Clustering
- Partition the data set into clusters; only the cluster representation then needs to be stored
- Clustering can be hierarchical, with the clusters stored in multi-dimensional index tree structures
- There are many choices of clustering definitions and clustering algorithms
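To make "store the cluster representation only" concrete, the following sketch uses k-means, one of the many possible clustering algorithms (the slides do not single it out). Each cluster is represented by its centroid. The sample points and the naive "first k points" initialization are assumptions chosen to keep the example deterministic.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def kmeans(points: List[Point], k: int, n_iter: int = 10):
    """Plain k-means on 2-D points; returns (centroids, clusters)."""
    # Assumption: initialize with the first k points (real code would randomize).
    centroids = list(points[:k])
    clusters: List[List[Point]] = []
    for _ in range(n_iter):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: (p[0] - centroids[i][0]) ** 2
                + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c
            else centroids[i]  # keep the old centroid if a cluster is empty
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical data: two visually separated groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
```

After clustering, the six points can be summarized by the two centroids alone, which is the data-reduction use of clustering mentioned above.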

12 Cluster Analysis

13 Classification
- Classification is an extensively studied problem (mainly in statistics, machine learning, and neural networks)
- Classification is probably one of the most widely used data mining techniques, with a lot of extensions
- Scalability is still an important issue for database applications; combining classification with database techniques is therefore a promising topic
- Research directions: classification of non-relational data, e.g., text, spatial, multimedia, etc.

14 Classification process
- Model construction: describing a set of predetermined classes
  - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  - The set of tuples used for model construction: training set
  - The model is represented as classification rules, decision trees, or mathematical formulae
- Model usage: for classifying future or unknown objects
  - Estimate accuracy of the model
    - The known label of each test sample is compared with the classified result from the model
    - Accuracy rate is the percentage of test set samples that are correctly classified by the model
    - The test set must be independent of the training set; otherwise the accuracy estimate is overly optimistic and over-fitting goes undetected
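The two phases can be sketched with a deliberately tiny model: a one-feature threshold rule ("decision stump"). This is an illustrative stand-in, not a method named in the slides, and the training and test samples are invented for the example.

```python
from typing import Callable, List, Tuple

Sample = Tuple[float, str]  # (feature value, class label); toy data


def train_stump(training_set: List[Sample]) -> Callable[[float], str]:
    """Model construction: choose the threshold t for the rule
    'x > t -> yes, otherwise no' that classifies the training set best."""
    candidates = sorted({x for x, _ in training_set})

    def n_correct(t: float) -> int:
        return sum(("yes" if x > t else "no") == y for x, y in training_set)

    best_t = max(candidates, key=n_correct)
    return lambda x: "yes" if x > best_t else "no"


def accuracy(model: Callable[[float], str], samples: List[Sample]) -> float:
    """Model usage: fraction of samples whose known label matches the prediction."""
    return sum(model(x) == y for x, y in samples) / len(samples)


# Hypothetical, disjoint training and test sets.
training_set = [(1, "no"), (2, "no"), (3, "no"), (7, "yes"), (8, "yes"), (9, "yes")]
test_set = [(2, "no"), (4, "no"), (6, "yes"), (10, "yes")]

model = train_stump(training_set)
train_accuracy = accuracy(model, training_set)
test_accuracy = accuracy(model, test_set)
print(train_accuracy, test_accuracy)
```

On this toy data the model is perfect on the training set but not on the independent test set, which is why accuracy is estimated on samples that were not used for model construction.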

15 Classification Process (1): Model Construction
Figure: Training Data -> Classification Algorithms -> Classifier (Model)
Resulting rule: IF rank = professor OR years > 6 THEN tenured = yes

16 Classification Process (2): Use the Model in Prediction
Figure: Testing Data -> Classifier; Unseen Data (Jeff, Professor, 4) -> Classifier -> Tenured?
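The rule from the model-construction slide can be applied to the unseen tuple directly. The sketch below encodes that rule as a function; the function name `is_tenured` and the case-insensitive comparison of the rank are assumptions made for this example.

```python
def is_tenured(rank: str, years: int) -> str:
    """Classifier from the slides: IF rank = professor OR years > 6 THEN tenured = yes."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"


# Unseen tuple from the slide: (name, rank, years).
name, rank, years = ("Jeff", "Professor", 4)
prediction = is_tenured(rank, years)
print(f"{name}: tenured = {prediction}")
```

Jeff has only 4 years, but the rank condition alone satisfies the rule, so the model predicts tenured = yes.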

17 Supervised vs. Unsupervised Learning
- Supervised learning (classification)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  - New data is classified based on the training set
- Unsupervised learning (clustering)
  - The class labels of the training data are unknown
  - Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

18 Document category modelling
- Example: filtering spam email.
- Task: classify incoming email as spam or legitimate (2 document categories).
- Simple blacklist and keyword-based methods have failed.
- More intelligent, adaptive approaches are needed (e.g. naive Bayesian category modeling).

19 Document category modelling
- Step 1 (linguistic pre-processing): tokenization, removal of stopwords, stemming/lemmatization.
- Step 2 (vector representation): bag-of-words or n-gram modeling (n=2,3).
- Step 3 (feature selection): information gain evaluation.
- Step 4 (machine learning): Bayesian modeling, using word/n-gram frequency.
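A simplified sketch of these steps is shown below: a multinomial naive Bayes classifier over bag-of-words counts with add-one (Laplace) smoothing. It skips stemming/lemmatization and the information-gain feature selection of Step 3. The stopword list and the four training messages are invented for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (assumption, not from the slides).
STOPWORDS = {"the", "a", "to", "is", "for", "at", "you", "your", "and"}


def tokenize(text):
    """Step 1 (simplified): lowercase, keep alphabetic tokens, drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def train(docs):
    """Steps 2 and 4: per-category word frequencies over a bag-of-words view."""
    word_counts = {}        # category -> Counter of word frequencies
    doc_counts = Counter()  # category -> number of training documents
    for text, label in docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the category with the highest log-posterior score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        n_words = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)  # log prior
        for w in tokenize(text):
            if w in vocab:  # words never seen in training are ignored
                score += math.log((counts[w] + 1) / (n_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical labeled messages.
training_docs = [
    ("Win a free prize now", "spam"),
    ("Cheap pills, free offer", "spam"),
    ("Meeting agenda for Monday", "legitimate"),
    ("Please review the project report", "legitimate"),
]
model = train(training_docs)
print(classify("free prize offer", model))
print(classify("project meeting on Monday", model))
```

Working in log space avoids numeric underflow when many word probabilities are multiplied; a realistic filter would need far more training data and the feature selection step omitted here.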

20 What Is Association Mining?
- Association rule mining:
  - Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
- Applications:
  - Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
- Example:
  - Rule form: "Body => Head [support, confidence]".
  - buys(x, "diapers") => buys(x, "beers") [0.5%, 60%]

21 Association Rule: Basic Concepts
- Given: (1) a database of transactions, (2) each transaction is a list of items (purchased by a customer in a visit)
- Find: all rules that correlate the presence of one set of items with that of another set of items
  - E.g., 98% of people who purchase tires and auto accessories also get automotive services done
- Applications
  - * => Maintenance Agreement (what the store should do to boost Maintenance Agreement sales)
  - Home Electronics => * (what other products the store should stock up on)

22 Rule Measures: Support and Confidence
- Find all the rules X & Y => Z with minimum confidence and support
  - support, s: probability that a transaction contains {X & Y & Z}
  - confidence, c: conditional probability that a transaction having {X & Y} also contains Z
- Find the rules with support and confidence equal to or greater than a given threshold
Figure (Venn diagram): Customer buys diaper; Customer buys beer; Customer buys both
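Written as formulas, the two measures read as follows. The notation is an assumption added here: D is the transaction database, T a single transaction, and X, Y, Z are treated as itemsets.

```latex
\[
  \mathrm{support}(X \wedge Y \Rightarrow Z)
    = P(X \cup Y \cup Z)
    = \frac{\lvert \{\, T \in D : X \cup Y \cup Z \subseteq T \,\} \rvert}{\lvert D \rvert}
\]
\[
  \mathrm{confidence}(X \wedge Y \Rightarrow Z)
    = P(Z \mid X \cup Y)
    = \frac{\mathrm{support}(X \cup Y \cup Z)}{\mathrm{support}(X \cup Y)}
\]
```

Support measures how often the whole rule applies in the database, while confidence measures how reliable the rule is among the transactions that contain its body.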

23 Mining Association Rules: An Example
Min. support 50%, min. confidence 50%
For rule A => C:
  support = support({A, C}) = 50%
  confidence = support({A, C}) / support({A}) = 66.6%
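The transaction table for this example did not survive in the transcript. The sketch below therefore uses a hypothetical four-transaction database, chosen only so that it is consistent with the figures quoted for rule A => C; the transaction IDs and item sets are assumptions.

```python
# Hypothetical database: transaction ID -> set of items bought.
transactions = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}


def support(itemset, db):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= items for items in db.values()) / len(db)


def confidence(body, head, db):
    """support(body + head) / support(body), i.e. P(head | body)."""
    return support(set(body) | set(head), db) / support(body, db)


s = support({"A", "C"}, transactions)
c = confidence({"A"}, {"C"}, transactions)
print(f"support = {s:.1%}, confidence = {c:.1%}")

# The rule is kept only if both measures reach the minimum thresholds.
MIN_SUPPORT, MIN_CONFIDENCE = 0.5, 0.5
rule_accepted = s >= MIN_SUPPORT and c >= MIN_CONFIDENCE
```

Two of the four transactions contain both A and C, and three contain A, which gives a support of 2/4 and a confidence of 2/3.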

