
From Patterns in Data to Knowledge Discovery: what Data Mining can do
Francesco Gullo, Barcelona
3rd International Conference Frontiers in Diagnostic Technologies, November 25-27, 2013, Laboratori Nazionali di Frascati

What is Data Mining?
Several definitions:
- “Automated yet non-trivial extraction of implicit, previously unknown, and potentially useful information from data”
- “Automated exploration and analysis of large quantities of data in order to discover meaningful patterns”
- “Computational process of automatically extracting useful knowledge from large amounts of data”
Keywords: large amounts of data, automation, knowledge

What is Data Mining?
The analysis step of the "Knowledge Discovery in Databases" (KDD) process

Why Data Mining?
Lots of data is being collected and stored:
- web data
- e-commerce data
- purchases
- bank transactions
Lots of data is being processed at enormous speeds (GB/minute):
- remote sensors on a satellite
- telescopes scanning the skies
- microarrays generating gene-expression data
- scientific simulations generating terabytes of data
Data analysis in such challenging contexts cannot be performed with traditional data-analysis techniques, whether manual or automated

Data Mining: an inter-disciplinary field
[Figure: Data Mining at the intersection of Database systems, Artificial Intelligence, Statistics, and Machine Learning]

Data-Mining Tasks
Predictive tasks: use some variables to predict unknown or future values of other variables
- Classification
- Regression
- Deviation detection
Descriptive tasks: find human-interpretable patterns that describe the data well
- Clustering
- Association-rule discovery
- Pattern discovery

Classification
Given a collection of records (i.e., the training set):
- Each record contains a set of attributes; one of the attributes denotes the class of the record
- Find a model (i.e., train a classifier) for the class attribute as a function of the values of the other attributes
- Goal: predict the class attribute of previously unobserved records based on the model found
- A test set of records is often used to evaluate the accuracy of the model
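The train-then-evaluate workflow above can be sketched as follows. This is only an illustration: the slides do not name a learning algorithm at this point, so a 1-nearest-neighbour learner over categorical attributes is swapped in as a stand-in, and the records, attribute names, and the functions `train_1nn` and `accuracy` are all hypothetical.

```python
def train_1nn(training_set, class_attr):
    """Return a model that predicts the class of the most similar training record.

    Similarity is the number of non-class attributes on which two records differ
    (an assumption made for this sketch; any other learner could be plugged in).
    """
    def predict(record):
        def mismatches(example):
            return sum(
                1 for attr, value in example.items()
                if attr != class_attr and record.get(attr) != value
            )
        # Ties are resolved in favour of the earliest training record.
        return min(training_set, key=mismatches)[class_attr]
    return predict


def accuracy(model, test_set, class_attr):
    """Fraction of test records whose predicted class equals the true class."""
    correct = sum(1 for record in test_set if model(record) == record[class_attr])
    return correct / len(test_set)


# Hypothetical labeled records, invented for illustration only.
training_set = [
    {"shape": "round", "size": "small", "label": "A"},
    {"shape": "round", "size": "large", "label": "A"},
    {"shape": "square", "size": "small", "label": "B"},
    {"shape": "square", "size": "large", "label": "B"},
]
test_set = [
    {"shape": "round", "size": "small", "label": "A"},
    {"shape": "square", "size": "large", "label": "B"},
    {"shape": "round", "size": "large", "label": "B"},
    {"shape": "square", "size": "small", "label": "B"},
]

model = train_1nn(training_set, class_attr="label")
print(accuracy(model, test_set, class_attr="label"))  # 0.75
```

The test set is kept separate from the training set, so the reported accuracy estimates how the model behaves on records it has not seen.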

Classification: Example
Model construction: Training Data → Classification Algorithms → Classifier (Model)
Learned rule: IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’

Classification: Example
Using the model for prediction: Unseen Data (Jeff, Professor, 4) → Classifier → Tenured?
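The learned rule from this example (IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’) can be written as a small function and applied to the unseen record. The function name `predict_tenured` is hypothetical, and returning 'no' when the rule does not fire is an assumption, since the slide only states the 'yes' branch.

```python
def predict_tenured(rank, years):
    """Apply the rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    if rank.lower() == "professor" or years > 6:
        return "yes"
    # Assumption: records not covered by the rule are classified as 'no'.
    return "no"


# The unseen record from the slide: (name, rank, years).
name, rank, years = ("Jeff", "Professor", 4)
print(name, predict_tenured(rank, years))  # Jeff yes
```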

Classification: Application 1
Fraud Detection
Goal: predict fraudulent cases in credit-card transactions
Approach:
- Use credit-card transactions and information about the account holder as attributes (e.g., when, what, and where does the account holder buy?)
- Assign a {fraud, fair} class attribute value to each transaction based on historical data
- Learn a model based on this data
- Process each new transaction with this model to predict whether it is fraudulent or fair

Classification: Application 2
Sky Survey Cataloging
Goal: predict the class type (e.g., star or galaxy) of sky objects based on telescopic-survey images
Approach:
- Segment each image and represent each segment as a set of attributes, such as RGB values, color intensity, brightness
- Assign a {star, galaxy} class attribute value to each image
- Learn a model based on this data
- Predict the class type of unlabeled images based on the model learnt

Classification: Decision Trees
A decision tree is a tree where:
- Internal nodes: a test on a single attribute
- Branches: an outcome of the test
- Leaf nodes: a class
[Figure: example tree with test nodes A?, B?, C?, D? and a "Yes" leaf]

Decision Trees: example (“Play tennis?”)
Training set (from Quinlan’s book): [table not preserved in the transcript]

Decision Trees: example (“Play tennis?”)
Decision tree obtained with the ID3 algorithm:
outlook = sunny: test humidity
    humidity = high: N
    humidity = normal: Y
outlook = overcast: Y
outlook = rain: test windy
    windy = true: N
    windy = false: Y
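As a sketch of how such a tree is used for prediction, the "Play tennis?" tree can be stored as nested dictionaries and walked from the root to a leaf. The slide layout was flattened in this transcript, so the leaf assignment below assumes the standard ID3 result for Quinlan's data (sunny/high → N, sunny/normal → Y, overcast → Y, rain/windy → N, rain/not windy → Y). The names `TREE` and `classify` are hypothetical.

```python
# Internal nodes are dicts with an attribute to test and one branch per outcome;
# leaves are class labels ("Y" = play, "N" = don't play).
TREE = {
    "attribute": "outlook",
    "branches": {
        "sunny": {
            "attribute": "humidity",
            "branches": {"high": "N", "normal": "Y"},
        },
        "overcast": "Y",
        "rain": {
            "attribute": "windy",
            "branches": {"true": "N", "false": "Y"},
        },
    },
}


def classify(tree, record):
    """Follow the branch matching the record's attribute value until a leaf is reached."""
    node = tree
    while isinstance(node, dict):
        node = node["branches"][record[node["attribute"]]]
    return node


print(classify(TREE, {"outlook": "sunny", "humidity": "normal", "windy": "true"}))  # Y
```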

Clustering
Given a set of data points, each having a set of attributes, and a similarity measure among them, find groups of objects (i.e., clusters) such that:
- Data points in the same cluster are highly similar to each other (high intra-cluster compactness)
- Data points in different clusters are highly dissimilar to each other (high inter-cluster separation)
Clustering is also known as unsupervised classification: unlike (supervised) classification, clustering does not rely on any labeled data
Often used as a preliminary (exploratory) step of more complex tasks

Clustering
[Figure: Euclidean-distance-based clustering in 2D space]

Clustering: Application 1
Market segmentation
Goal: subdivide a market into distinct subsets of customers, where any subset may be selected as a market target to be reached with a distinct marketing mix
Approach:
- Collect different attributes of customers, e.g., geographical and lifestyle-related information
- Define an appropriate measure of distance among customers based on such attributes
- Find clusters of similar customers

Clustering: Application 2
Find topic-coherent documents
Goal: find groups of documents that are about the same (set of) topic(s)
Approach:
- Represent each document as a set of attributes, each corresponding to the frequency of a term in the document
- Define a proper distance measure among documents represented by their term frequencies
- Cluster the documents
- Finally, use the clusters to relate new documents to the clustered ones
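The first two steps of this approach can be sketched as below. The slides do not say which distance to use, so cosine distance is an assumption chosen for this illustration; whitespace tokenization, the function names, and the sample documents are likewise hypothetical.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Represent a document as term -> number of occurrences (naive whitespace split)."""
    return Counter(text.lower().split())


def cosine_distance(tf_a, tf_b):
    """1 - cosine similarity between two term-frequency vectors."""
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # an empty document is treated as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical documents, invented for illustration only.
doc_a = term_frequencies("graph mining finds dense subgraphs")
doc_b = term_frequencies("dense subgraphs in graph mining")
doc_c = term_frequencies("decision trees classify records")

print(round(cosine_distance(doc_a, doc_b), 3))  # 0.2
print(cosine_distance(doc_a, doc_c))            # 1.0
```

Documents sharing most of their terms end up close together, while documents with no terms in common are at distance 1.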

Clustering: the K-means algorithm
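The slide's own description of K-means was not preserved in this transcript. The following is a minimal sketch of the standard Lloyd-style iteration (assign each point to its nearest centroid, then move each centroid to the mean of its points), using Euclidean distance. The function name `kmeans`, its parameters, and the sample points are hypothetical; initial centroids are passed in explicitly to keep the example deterministic, and `math.dist` requires Python 3.8 or later.

```python
import math


def kmeans(points, initial_centroids, max_iter=100):
    """Cluster points (tuples of floats) around len(initial_centroids) centroids."""
    centroids = [tuple(c) for c in initial_centroids]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        new_assignment = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster: converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its assigned points.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return centroids, assignment


# Hypothetical 2D points forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignment = kmeans(points, initial_centroids=[(0, 0), (10, 10)])
print(assignment)  # [0, 0, 0, 1, 1, 1]
```

The result depends on the initial centroids; in practice several random initializations are usually tried and the best clustering is kept.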

Association-rule discovery
Given a set of records (transactions), each containing a number of items from a given collection, produce dependency rules that predict the occurrence of an item based on the occurrences of other items

Association-rule discovery: Application 1
Marketing and sales promotion
Assume a rule {Milk, Cheese} → {Chips} has been learnt:
- Milk and Cheese can be used to boost the sales of Chips (e.g., by placing them close to Chips)
- The sales of Chips will be affected if Milk and Cheese are no longer sold
- Bundling Milk with Cheese in a promotion can boost the sales of Chips
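How strongly a rule such as {Milk, Cheese} → {Chips} holds in the data is commonly quantified with support and confidence. These measures are not stated on the slides, so the sketch below is an assumption about how the rule would be evaluated; the transactions and function names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    antecedent_support = support(transactions, antecedent)
    if antecedent_support == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / antecedent_support


# Hypothetical market-basket transactions, invented for illustration only.
transactions = [
    {"Milk", "Cheese", "Chips"},
    {"Milk", "Cheese", "Chips", "Bread"},
    {"Milk", "Cheese"},
    {"Bread", "Chips"},
    {"Milk", "Bread"},
]

print(support(transactions, {"Milk", "Cheese", "Chips"}))                 # 0.4
print(round(confidence(transactions, {"Milk", "Cheese"}, {"Chips"}), 3))  # 0.667
```

A rule is typically reported only if both values exceed user-chosen thresholds.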

Data Mining in emerging domains: Graph Mining

Graph Data
G = (V, E), where V is a set of vertices (nodes), and E ⊆ V × V is a set of edges (arcs)
G can be directed or undirected
Additional information can be present on vertices and/or edges: weight, label, timestamp, probability of existence, feature vector, …
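One possible in-memory representation of such a graph is an adjacency map with attribute dictionaries on vertices and edges. This is a minimal sketch; the class `Graph` and its method names are hypothetical and not taken from any particular library.

```python
class Graph:
    """G = (V, E) with optional attributes on vertices and edges."""

    def __init__(self, directed=False):
        self.directed = directed
        self.adj = {}            # u -> {v: edge attribute dict}
        self.vertex_attrs = {}   # v -> vertex attribute dict

    def add_vertex(self, v, **attrs):
        self.adj.setdefault(v, {})
        self.vertex_attrs.setdefault(v, {}).update(attrs)

    def add_edge(self, u, v, **attrs):
        self.add_vertex(u)
        self.add_vertex(v)
        self.adj[u][v] = dict(attrs)
        if not self.directed:
            # An undirected edge is stored in both directions.
            self.adj[v][u] = dict(attrs)

    def vertices(self):
        return set(self.adj)

    def edges(self):
        """List each edge once, as a (u, v) pair."""
        seen, result = set(), []
        for u, neighbours in self.adj.items():
            for v in neighbours:
                key = (u, v) if self.directed else frozenset((u, v))
                if key not in seen:
                    seen.add(key)
                    result.append((u, v))
        return result


# Hypothetical example: an undirected graph with a weight and a label on its edges.
g = Graph()
g.add_vertex("a", label="protein")
g.add_edge("a", "b", weight=2.0)
g.add_edge("b", "c", label="x")
print(sorted(g.vertices()))  # ['a', 'b', 'c']
print(len(g.edges()))        # 2
```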

Graphs are ubiquitous
- Computational biology: protein-protein interaction (PPI) networks
- Chemical data analysis: chemical compounds
- Communication networking: device networks, road networks
- Social network analysis
- Web link analysis
- Recommender systems

Mining graph data: Tasks
- Graph clustering
- Graph search
- Dense-subgraph extraction
- Graph classification
- Graph pattern mining
- Graph matching
- Graph querying
- Influence maximization
- …

Graph clustering
Partition the input graph so as to optimize some notion of density
Notions of density:
- Average degree
- Ratio cut
- Normalized cut
- Conductance
- (Quasi-)clique condition
- …
Applications:
- Community detection in a social network
- Identifying highly cohesive structures in biological networks
- Packet delivery on communication networks
- Detecting highly correlated stocks
- …
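Two of the listed measures, average degree and conductance, can be computed for a candidate cluster as sketched below. The definitions used are the common textbook ones (average degree of the induced subgraph; cut size divided by the smaller of the two volumes), which is an assumption since the slides do not spell them out. The graph is an undirected adjacency map of sets, and the example graph and function names are hypothetical.

```python
def average_degree(adj, subset):
    """Average degree of the subgraph induced by subset: 2 * |E(S)| / |S|."""
    subset = set(subset)
    # Each internal edge is counted once from each endpoint, i.e. twice overall.
    internal_endpoints = sum(1 for u in subset for v in adj[u] if v in subset)
    return internal_endpoints / len(subset)


def conductance(adj, subset):
    """Edges leaving subset, divided by the smaller of vol(S) and vol(V minus S)."""
    subset = set(subset)
    cut = sum(1 for u in subset for v in adj[u] if v not in subset)
    vol_inside = sum(len(adj[u]) for u in subset)
    vol_outside = sum(len(adj[u]) for u in adj if u not in subset)
    smaller_volume = min(vol_inside, vol_outside)
    return cut / smaller_volume if smaller_volume > 0 else 0.0


# Hypothetical graph: two triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3.
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
print(average_degree(adj, {0, 1, 2}))         # 2.0
print(round(conductance(adj, {0, 1, 2}), 3))  # 0.143
```

Average degree is to be maximized, whereas cut-based measures such as conductance are to be minimized: a low value means few edges leave the cluster relative to its size.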

Graph search
Given a set of graphs {G_1, ..., G_n} and a query graph Q, find all graphs in {G_1, ..., G_n} that are supergraphs of Q
Applications:
- Chemical compound search: molecules represented in terms of atoms and bonds between atoms
- Context-based image retrieval: images represented in terms of object properties and relationships between objects
- 3D protein structure search: proteins represented as a set of amino acids related to each other
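A brute-force sketch of graph search on small, vertex-labeled, undirected graphs is given below: a graph is reported if some injective mapping of the query's vertices preserves labels and maps every query edge onto an edge of the graph. This exhaustive check takes exponential time and is meant only to make the definition concrete; practical systems rely on indexing and pruning. The toy "molecules" (hydrogens omitted) and all names are hypothetical.

```python
from itertools import permutations


def make_graph(labels, edge_pairs):
    """labels: vertex -> label; edge_pairs: iterable of (u, v) undirected edges."""
    return labels, {frozenset(pair) for pair in edge_pairs}


def is_supergraph(graph, query):
    """True if query can be embedded in graph (label-preserving, edges preserved)."""
    g_labels, g_edges = graph
    q_labels, q_edges = query
    q_vertices = list(q_labels)
    # Try every injective assignment of query vertices to graph vertices.
    for image in permutations(g_labels, len(q_vertices)):
        mapping = dict(zip(q_vertices, image))
        labels_match = all(q_labels[v] == g_labels[mapping[v]] for v in q_vertices)
        edges_match = all(
            frozenset(mapping[v] for v in edge) in g_edges for edge in q_edges
        )
        if labels_match and edges_match:
            return True
    return False


def graph_search(graphs, query):
    """Names of all graphs in the collection that are supergraphs of the query."""
    return [name for name, g in graphs.items() if is_supergraph(g, query)]


# Hypothetical collection of carbon/oxygen skeletons.
graphs = {
    "C-C-O": make_graph({0: "C", 1: "C", 2: "O"}, [(0, 1), (1, 2)]),
    "C-C": make_graph({0: "C", 1: "C"}, [(0, 1)]),
    "C-O": make_graph({0: "C", 1: "O"}, [(0, 1)]),
}
query = make_graph({"a": "C", "b": "O"}, [("a", "b")])
print(graph_search(graphs, query))  # ['C-C-O', 'C-O']
```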

Thanks!

Backup slides

Association-rule discovery: Application 2
Prediction of drug side effects
Goal: detect combinations of drugs that result in particular side effects
Approach:
- Model each patient as a record of two types of items: items representing drugs taken and items representing side effects observed
- Employ an association-rule-discovery method to detect rules like {Marijuana, Heroin} → {Depressed respiration}
- Use the rules discovered for early diagnoses

Mining graph data: Challenges
- Small-dimensional graphs, but lots of graphs: chemical data graphs
- Small number of graphs, but huge dimensionality: social networks, the Web
- Dynamic graphs (i.e., graphs changing over time): PPI networks
- Time-dependent graphs: road networks

Classification: Application 1
Direct marketing
Goal: reduce the cost of mailing by targeting a set of consumers likely to buy a new cell-phone product
Approach:
- Use past data from a (set of) similar product(s) introduced before
- Consider the information about which customers bought and which did not; this {buy, don’t buy} decision forms the class attribute
- Describe each customer according to several other attributes, such as demographic, lifestyle, and company-interaction information
- Use this information to train a classifier that can infer the {buy, don’t buy} class of the various customers for the new product