1 Knowledge Discovery (Data Mining)
C. Papatheodorou
Laboratory on Digital Libraries and Electronic Publishing
Department of Archives and Library Science, Ionian University
2 Data Mining
- Knowledge extraction from very large data collections
- Knowledge: rules, behaviour patterns, and associations between objects (non-obvious, latent, previously unknown, and useful)
- Object: consists of a set of features
- It is not:
  - (Deductive) query processing
  - Expert systems, small machine learning/statistical programs
3 Why Data Mining? Potential Applications
- Database analysis and decision support
  - Market analysis and management: target marketing, customer relationship management, market basket analysis, cross-selling, market segmentation
  - Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis
  - Fraud detection and management
- Other applications
  - Text mining (newsgroups, email, documents) and Web analysis
  - Intelligent query answering
4 Market Analysis and Management (1)
- Where are the data sources for analysis?
  - Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
- Target marketing
  - Find clusters of "model" customers who share the same characteristics: interests, income level, spending habits, etc.
- Determine customer purchasing patterns over time
  - Conversion of a single to a joint bank account: marriage, etc.
- Cross-market analysis
  - Associations/correlations between product sales
  - Prediction based on the association information
5 Market Analysis and Management (2)
- Customer profiling
  - Data mining can tell you what types of customers buy what products (clustering or classification)
- Identifying customer requirements
  - Identify the best products for different customers
  - Use prediction to find what factors will attract new customers
- Providing summary information
  - Various multidimensional summary reports
  - Statistical summary information (data central tendency and variation)
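As a small illustration of "central tendency and variation", the sketch below summarizes a list of purchase amounts with Python's standard `statistics` module. The `purchases` values are invented for the example and do not come from the presentation.

```python
import statistics

# Hypothetical purchase amounts for one customer segment (illustrative only).
purchases = [12.0, 15.0, 15.0, 18.0, 40.0]

# Central tendency: where the "typical" value lies.
mean_value = statistics.mean(purchases)
median_value = statistics.median(purchases)

# Variation: how widely the values spread around the mean
# (population variance and standard deviation).
variance_value = statistics.pvariance(purchases)
stdev_value = statistics.pstdev(purchases)

print(mean_value, median_value, variance_value, stdev_value)
```

The single large purchase pulls the mean above the median, which is why summary reports usually show both.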
6 Corporate Analysis and Risk Management
- Finance planning and asset evaluation
  - Cash flow analysis and prediction
  - Contingent claim analysis to evaluate assets
  - Cross-sectional and time series analysis (financial ratio, trend analysis, etc.)
- Resource planning
  - Summarize and compare resources and spending
- Competition
  - Monitor competitors and market directions
  - Group customers into classes and apply a class-based pricing procedure
  - Set pricing strategy in a highly competitive market
7 Steps of a KDD Process
- Learning the application domain: relevant prior knowledge and goals of the application
- Creating a target data set: data selection
- Data cleaning and preprocessing (may take 60% of the effort!)
- Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation
- Choosing the data mining function: summarization, classification, regression, association, clustering
- Choosing the mining algorithm(s)
- Data mining: search for patterns of interest
- Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
- Use of discovered knowledge
8 Data Mining: A KDD Process
- Data mining: the core of the knowledge discovery process
- [Figure: KDD pipeline. Labels: Databases, Data Cleaning, Data Integration, Data Warehouse, Selection, Task-relevant Data, Data Mining, Pattern Evaluation]
9 Data pre-processing
- Data preparation is a big issue for data mining
- Data preparation includes:
  - Data cleaning and data integration
  - Data reduction and feature selection
  - Discretization
- A lot of methods have been developed, but this is still an active area of research
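Discretization can be illustrated with equal-width binning, one of the simplest schemes. The slide does not name a particular method, so this is only a sketch under that assumption; the function name `equal_width_bins` and the `ages` values are made up for the example.

```python
def equal_width_bins(values, n_bins):
    """Map each numeric value to a bin index in the range 0 .. n_bins - 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so a single bin is enough.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

# Hypothetical values of a numeric attribute (illustrative only).
ages = [18, 22, 35, 41, 58, 63, 70]
bins = equal_width_bins(ages, 3)
print(bins)
```

Each continuous value is replaced by a small integer code, which is what many rule-based and association mining methods expect as input.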
10 Data pre-processing
11 Clustering
- Partition the data set into clusters; one can then store the cluster representation only
- Clustering can be hierarchical, and the clusters can be stored in multi-dimensional index tree structures
- There are many choices of clustering definitions and clustering algorithms
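The idea of storing only the cluster representation can be sketched with k-means, where each cluster is summarized by its centroid. The slide does not prescribe k-means; it is used here as one common choice. The sample points and the naive "first k points" initialisation are assumptions made for the example.

```python
def kmeans(points, k, iterations=10):
    """Lloyd's algorithm on 2-D points; returns (centroids, assignments)."""
    # Naive initialisation (an assumption of this sketch): the first k points.
    centroids = list(points[:k])
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, assignments

# Hypothetical data: two visibly separated groups (illustrative only).
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, assignments = kmeans(points, k=2)
print(centroids, assignments)
```

After clustering, the six points could be replaced by the two centroids (plus cluster sizes, if needed), which is the data-reduction effect the slide describes.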
12 Cluster Analysis
13 Classification
- Classification is an extensively studied problem (mainly in statistics, machine learning, and neural networks)
- Classification is probably one of the most widely used data mining techniques, with many extensions
- Scalability is still an important issue for database applications, so combining classification with database techniques is a promising topic
- Research directions: classification of non-relational data, e.g., text, spatial, multimedia, etc.
14 Classification process
- Model construction: describing a set of predetermined classes
  - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  - The set of tuples used for model construction is the training set
  - The model is represented as classification rules, decision trees, or mathematical formulae
- Model usage: classifying future or unknown objects
  - Estimate the accuracy of the model
    - The known label of each test sample is compared with the model's classification
    - The accuracy rate is the percentage of test set samples that are correctly classified by the model
    - The test set must be independent of the training set; otherwise the accuracy estimate is over-optimistic and over-fitting goes undetected
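The two phases can be sketched with a deliberately tiny model: a one-threshold rule ("decision stump") learned from a training set and then scored on an independent test set. The stump learner, the function names, and the `(years, tenured)` samples are all assumptions made for illustration, not material from the presentation.

```python
def predict(threshold, years):
    """Apply the learned rule: years > threshold -> 'yes', else 'no'."""
    return "yes" if years > threshold else "no"

def accuracy(threshold, samples):
    """Fraction of samples whose known label matches the model's prediction."""
    correct = sum(predict(threshold, x) == label for x, label in samples)
    return correct / len(samples)

def train_stump(train):
    """Model construction: pick the threshold with the best training accuracy."""
    candidates = sorted({x for x, _ in train})
    # max() keeps the first (smallest) threshold among equally good ones.
    return max(candidates, key=lambda t: accuracy(t, train))

# Hypothetical (years, tenured) samples; the test set is kept separate.
train = [(3, "no"), (7, "yes"), (2, "no"), (7, "yes"), (6, "no"), (8, "yes")]
test = [(2, "no"), (7, "yes"), (5, "yes"), (9, "yes")]

threshold = train_stump(train)
train_accuracy = accuracy(threshold, train)
test_accuracy = accuracy(threshold, test)
print(threshold, train_accuracy, test_accuracy)
```

On this toy data the rule is perfect on the training set but not on the test set, which is why accuracy measured on the training data alone would be misleading.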
15 Classification Process (1): Model Construction
- [Figure: Training Data, Classification Algorithms, Classifier (Model)]
- Rule shown for the classifier: IF rank = professor OR years > 6 THEN tenured = yes
16 Classification Process (2): Use the Model in Prediction
- [Figure: Classifier, Testing Data, Unseen Data]
- Unseen data: (Jeff, Professor, 4). Tenured?
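The rule given on the model-construction slide (IF rank = professor OR years > 6 THEN tenured = yes) can be applied to the unseen sample directly. The function name `is_tenured` and the case-insensitive comparison are assumptions of this sketch.

```python
def is_tenured(rank, years):
    """Rule from the slides: rank = professor OR years > 6 -> tenured = yes."""
    if rank.lower() == "professor" or years > 6:
        return "yes"
    return "no"

# The unseen sample from the slide: (name, rank, years).
name, rank, years = ("Jeff", "Professor", 4)
prediction = is_tenured(rank, years)
print(name, prediction)
```

Jeff has only 4 years, but the rank condition alone satisfies the rule, so the model predicts that he is tenured.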
17 Supervised vs. Unsupervised Learning
- Supervised learning (classification)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  - New data are classified based on the training set
- Unsupervised learning (clustering)
  - The class labels of the training data are unknown
  - Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
18 Document category modelling
- Example: filtering spam email
- Task: classify incoming email as spam or legitimate (2 document categories)
- Simple blacklist and keyword-based methods have failed
- More intelligent, adaptive approaches are needed (e.g. naive Bayesian category modelling)
19 Document category modelling
- Step 1 (linguistic pre-processing): tokenization, removal of stopwords, stemming/lemmatization
- Step 2 (vector representation): bag-of-words or n-gram modelling (n = 2, 3)
- Step 3 (feature selection): information gain evaluation
- Step 4 (machine learning): Bayesian modelling, using word/n-gram frequency
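A minimal sketch of steps 1, 2 and 4 follows: tokenization with stopword removal, a bag-of-words representation, and a multinomial naive Bayes model with add-one smoothing. Stemming and information-gain feature selection are left out to keep it short. The stopword list, the training messages, and the function names are invented for the example.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "to", "is", "you", "for", "at", "now"}

def tokenize(text):
    """Step 1: lowercase, split into words, drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def train(docs):
    """Steps 2 and 4: per-class bag-of-words counts from (text, label) pairs."""
    word_counts = {}
    class_counts = Counter()
    for text, label in docs:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab

def classify(text, model):
    """Pick the class with the highest log-posterior under naive Bayes."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)  # class prior
        for w in tokenize(text):
            if w in vocab:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log(
                    (word_counts[label][w] + 1) / (total_words + len(vocab))
                )
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training messages (illustrative only).
training = [
    ("Win free money now", "spam"),
    ("Free prize, claim your money", "spam"),
    ("Meeting agenda for Monday", "legitimate"),
    ("Lecture notes and agenda attached", "legitimate"),
]
model = train(training)
label = classify("claim your free prize", model)
print(label)
```

Unlike a fixed blacklist, the word counts are re-estimated whenever new labelled messages are added, which is what makes the approach adaptive.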
20 What Is Association Mining?
- Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories
- Applications: basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
- Example
  - Rule form: "Body ⇒ Head [support, confidence]"
  - buys(x, "diapers") ⇒ buys(x, "beers") [0.5%, 60%]
21 Association Rule: Basic Concepts
- Given: (1) a database of transactions, (2) each transaction is a list of items (purchased by a customer in a visit)
- Find: all rules that correlate the presence of one set of items with that of another set of items
  - E.g., 98% of people who purchase tires and auto accessories also get automotive services done
- Applications
  - * ⇒ Maintenance Agreement (what should the store do to boost Maintenance Agreement sales?)
  - Home Electronics ⇒ * (what other products should the store stock up on?)
22 Rule Measures: Support and Confidence
- Find all the rules X & Y ⇒ Z with minimum confidence and support
  - support, s: the probability that a transaction contains {X & Y & Z}
  - confidence, c: the conditional probability that a transaction containing {X & Y} also contains Z
- [Figure: Venn diagram. Labels: Customer buys diaper, Customer buys beer, Customer buys both]
- Find the rules with support and confidence equal to or greater than a given threshold
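Written as formulas, the two measures can be stated as below. Here $P(S)$ is taken to mean the fraction of transactions that contain every item in the set $S$; that reading, and the set-union notation, are assumptions made to state the slide's definitions compactly.

```latex
\[
\mathrm{support}(X \wedge Y \Rightarrow Z) = P(X \cup Y \cup Z)
\]
\[
\mathrm{confidence}(X \wedge Y \Rightarrow Z)
  = P(Z \mid X \cup Y)
  = \frac{\mathrm{support}(X \cup Y \cup Z)}{\mathrm{support}(X \cup Y)}
\]
```

Support measures how often the whole rule applies in the database, while confidence measures how reliable the rule is once its body is present.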
23 Mining Association Rules: An Example
- Min. support 50%, min. confidence 50%
- For rule A ⇒ C:
  - support = support({A, C}) = 50%
  - confidence = support({A, C}) / support({A}) = 66.6%
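The transaction table behind these figures is not part of the transcript. The sketch below uses a hypothetical four-transaction database, chosen only because it reproduces the same support and confidence for A ⇒ C; the function names are likewise illustrative.

```python
# Hypothetical transactions (not from the slides), picked so that
# rule A => C has support 50% and confidence 66.6%.
transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(body, head, transactions):
    """support(body and head together) divided by support(body)."""
    return support(body | head, transactions) / support(body, transactions)

MIN_SUPPORT = 0.5
MIN_CONFIDENCE = 0.5

rule_support = support({"A", "C"}, transactions)
rule_confidence = confidence({"A"}, {"C"}, transactions)
rule_accepted = rule_support >= MIN_SUPPORT and rule_confidence >= MIN_CONFIDENCE
print(rule_support, round(rule_confidence, 3), rule_accepted)
```

On the same data the reverse rule C ⇒ A has the same support but a confidence of 100%, which shows that confidence depends on the direction of the rule while support does not.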
24 References
- U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
- J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.
- T. Imielinski and H. Mannila. A database perspective on knowledge discovery. Communications of the ACM, 39:58-64, 1996.
- G. Piatetsky-Shapiro, U. Fayyad, and P. Smyth. From data mining to knowledge discovery: An overview. In U. M. Fayyad, et al. (eds.), Advances in Knowledge Discovery and Data Mining, 1-35. AAAI/MIT Press, 1996.
- G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991.