Francesco Gullo, Barcelona, gullo@yahoo-inc.com
From Patterns in Data to Knowledge Discovery: what Data Mining can do
3rd International Conference Frontiers in Diagnostic Technologies, November 25-27, 2013, Laboratori Nazionali di Frascati

What is Data Mining?
Several definitions:
- “Automated yet non-trivial extraction of implicit, previously unknown, and potentially useful information from data”
- “Automated exploration and analysis of large quantities of data in order to discover meaningful patterns”
- “Computational process of automatically extracting useful knowledge from large amounts of data”
Keywords: large amounts of data, automation, knowledge

What is Data Mining?
The analysis step of the "Knowledge Discovery in Databases" (KDD) process

Why Data Mining?
- Lots of data is being collected and stored:
  - web data
  - e-commerce data
  - purchases
  - bank transactions
- Lots of data is being processed at enormous speeds (GB/minute):
  - remote sensors on a satellite
  - telescopes scanning the skies
  - microarrays generating gene-expression data
  - scientific simulations generating terabytes of data
- Data analysis in such challenging contexts cannot be performed with traditional data-analysis techniques, whether manual or automated

Data Mining: an interdisciplinary field
Data mining sits at the intersection of database systems, artificial intelligence, statistics, and machine learning

Data-Mining Tasks
- Predictive tasks: use some variables to predict unknown or future values of other variables
  - Classification
  - Regression
  - Deviation detection
- Descriptive tasks: find human-interpretable patterns that describe the data well
  - Clustering
  - Association-rule discovery
  - Pattern discovery

Classification
- Given a collection of records (i.e., the training set)
  - Each record contains a set of attributes; one of the attributes denotes the class of the record
- Find a model (i.e., train a classifier) for the class attribute as a function of the values of the other attributes
- Goal: predict the class attribute of previously unobserved records based on the model found
- A test set of records is often used to evaluate the accuracy of the model
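
The train/predict/evaluate workflow described above can be sketched in a few lines. This is only an illustration: the slides do not name a specific classifier here, so a 1-nearest-neighbour rule is swapped in as an assumption, and the records, the class labels "a"/"b", and the function names are all made up.

```python
from math import dist

# Toy records: (attributes, class). All values are hypothetical.
training_set = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"),
                ((5.0, 5.0), "b"), ((5.5, 4.5), "b")]
test_set = [((0.9, 1.1), "a"), ((5.2, 5.1), "b"), ((3.4, 3.4), "a")]

def predict(attributes, training):
    """Return the class of the training record closest to `attributes` (1-NN)."""
    _, label = min(training, key=lambda record: dist(record[0], attributes))
    return label

def accuracy(test, training):
    """Fraction of test records whose predicted class equals the true class."""
    correct = sum(predict(attrs, training) == label for attrs, label in test)
    return correct / len(test)

print(accuracy(test_set, training_set))
```

The test set is kept separate from the training set so that the accuracy reflects behaviour on records the model has not seen.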

Classification: Example
Model construction: the training data is fed to a classification algorithm, which produces the classifier (model):
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’

Classification: Example
Using the model for prediction: the classifier is applied to unseen data, e.g., (Jeff, Professor, 4). Tenured?
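
As a minimal sketch, the rule from the example can be written as a function and applied to the unseen record. The function name `is_tenured`, the lower-casing of the rank, and the fallback answer 'no' when the rule does not fire are assumptions; the slide only states the 'yes' case.

```python
def is_tenured(rank: str, years: int) -> str:
    """Apply the rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    # Assumption: ranks are compared case-insensitively, and records not
    # covered by the rule are labelled 'no'.
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

# Unseen record from the example: (Jeff, Professor, 4)
print(is_tenured("Professor", 4))  # prints "yes"
```

Jeff has only 4 years, but the rank test alone is enough for the rule to fire.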

Classification: Application 1
Fraud Detection
Goal: predict fraudulent cases in credit-card transactions
Approach:
- Use credit-card transactions and the information about the account holder as attributes
  - e.g., when/what/where does the account holder buy?
- Assign a {fraud, fair} class attribute value to each transaction based on historical data
- Learn a model based on this data
- Process each new transaction with this model in order to predict whether the transaction is fraudulent or fair

Classification: Application 2
Sky Survey Cataloging
Goal: predict the class type (e.g., star or galaxy) of sky objects based on telescopic-survey images
Approach:
- Segment each image and represent each segment as a set of attributes, such as RGB values, color intensity, brightness
- Assign a {star, galaxy} class attribute value to each image
- Learn a model based on this data
- Predict the class type of unlabeled images based on the model learnt

Classification: Decision Trees
A decision tree is a tree where:
- Internal nodes: a test on a single attribute
- Branches: an outcome of the test
- Leaf nodes: a class
(Figure: example tree with test nodes A?, B?, C?, D?)

Decision Trees: example (“Play tennis?”) Training set (from Quinlan’s book) :

Decision Trees: example (“Play tennis?”)
Decision tree obtained with the ID3 algorithm:
outlook
  sunny: humidity
    high: N
    normal: Y
  overcast: Y
  rain: windy
    true: N
    false: Y
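
One way to make the tree executable is to encode it as nested dictionaries and walk it from the root. The sketch below is illustrative only: the leaf placement follows the commonly cited version of the "play tennis" tree, which is an assumption since the figure did not survive in this transcript, and the names `TREE` and `classify` are hypothetical.

```python
# Internal node = {attribute: {outcome: subtree}}, leaf = class label.
# Assumption: leaves follow the commonly cited "play tennis" ID3 tree.
TREE = {"outlook": {
    "sunny": {"humidity": {"high": "N", "normal": "Y"}},
    "overcast": "Y",
    "rain": {"windy": {"true": "N", "false": "Y"}},
}}

def classify(record, node=TREE):
    """Follow, at each internal node, the branch matching the record's value."""
    while isinstance(node, dict):
        attribute, branches = next(iter(node.items()))
        node = branches[record[attribute]]
    return node

print(classify({"outlook": "sunny", "humidity": "high", "windy": "false"}))
```

Each step tests a single attribute, so a prediction needs at most as many tests as the tree is deep.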

Clustering
Given a set of data points, each having a set of attributes, and a similarity measure among them, find groups of objects (i.e., clusters) such that:
- Data points in the same cluster are highly similar to each other (high intra-cluster compactness)
- Data points in different clusters are highly dissimilar to each other (high inter-cluster separation)
Clustering is also known as unsupervised classification: unlike (supervised) classification, clustering does not rely on any labeled data
Often used as a preliminary (exploratory) step of more complex tasks

Clustering Euclidean-distance-based clustering in 2D space

Clustering: Application 1
Market segmentation
Goal: subdivide a market into distinct subsets of customers, where any subset may be selected as a market target to be reached with a distinct marketing mix
Approach:
- Collect different attributes of customers based on, e.g., their geographical and lifestyle-related information
- Define an appropriate measure of distance among customers based on such attributes
- Find clusters of similar customers

Clustering: Application 2
Find topic-coherent documents
Goal: find groups of documents that are about the same (set of) topic(s)
Approach:
- Represent each document as a set of attributes, each corresponding to the frequency of a term in the document
- Define a proper distance measure among documents represented by term frequencies
- Cluster the documents
- Afterwards, use the clusters to relate new documents to the clustered ones
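
The first two steps of the approach might look like the sketch below. The slides do not say which distance to use; cosine distance is assumed here as one common choice for term-frequency vectors, whitespace tokenization is a simplification, and the function names are hypothetical.

```python
from collections import Counter
from math import sqrt

def term_frequencies(text):
    """Represent a document as term -> number of occurrences."""
    # Assumption: lower-casing and whitespace splitting stand in for real tokenization.
    return Counter(text.lower().split())

def cosine_distance(tf_a, tf_b):
    """1 - cosine similarity of two term-frequency vectors."""
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = sqrt(sum(c * c for c in tf_a.values()))
    norm_b = sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # an empty document shares no terms with anything
    return 1.0 - dot / (norm_a * norm_b)

doc_a = term_frequencies("graph mining finds dense subgraphs")
doc_b = term_frequencies("dense subgraphs in graph data")
print(cosine_distance(doc_a, doc_b))
```

Documents sharing no terms are at distance 1, while documents with proportional term counts are at distance 0, regardless of their length.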

Clustering: the K-means algorithm
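
A minimal sketch of K-means in its standard form (Lloyd's iteration), assuming Euclidean distance and caller-supplied initial centroids. The slide's own figure is not available in this transcript, so this is the textbook procedure rather than a transcription of the slide; the function name and parameters are illustrative.

```python
from math import dist

def k_means(points, centroids, iterations=100):
    """Alternate between assigning points to the nearest centroid and
    recomputing each centroid as the mean of its assigned points."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # no centroid moved: converged
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, centroids=[(1, 1), (8, 8)])
print(clusters)
```

The result depends on the initial centroids, which is why implementations often run the procedure several times from different starting points.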

Association-rule discovery
Given a set of records (transactions), each containing a number of items from a given collection, produce dependency rules that predict the occurrence of an item based on the occurrences of other items

Association-rule discovery: Application 1
Marketing and sales promotion
Suppose the rule {Milk, Cheese} → {Chips} has been learnt:
- Milk and Cheese can be used to boost the sales of Chips (e.g., by placing the former items close to Chips)
- The sales of Chips will be affected if Milk and Cheese are no longer sold
- Putting Milk in a bundle promotion with Cheese will boost the sales of Chips
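
How strongly a rule such as {Milk, Cheese} → {Chips} holds is usually quantified with support and confidence. These two measures are not named in the slides and are brought in here as the standard ones; the transactions below are hypothetical toy data.

```python
# Hypothetical market-basket transactions, one set of items per purchase.
transactions = [
    {"Milk", "Cheese", "Chips"},
    {"Milk", "Cheese", "Chips", "Bread"},
    {"Milk", "Cheese"},
    {"Milk", "Bread"},
    {"Chips", "Beer"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(support({"Milk", "Cheese", "Chips"}, transactions))
print(confidence({"Milk", "Cheese"}, {"Chips"}, transactions))
```

In this toy data, two of the three baskets containing both Milk and Cheese also contain Chips, so the rule has confidence 2/3.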

Data Mining in emerging domains: Graph Mining

Graph Data
- G = (V, E), where V is a set of vertices (nodes), and E ⊆ V × V is a set of edges (arcs)
- G can be directed or undirected
- Additional information can be present on vertices and/or edges: weight, label, timestamp, probability of existence, feature vector, …
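
A small sketch of one possible in-memory representation of G = (V, E) with optional edge information. The class name `Graph`, its fields, and the choice to store an undirected edge in both orientations are illustrative assumptions, not something the slides prescribe.

```python
from dataclasses import dataclass, field

@dataclass
class Graph:
    directed: bool = False
    vertices: set = field(default_factory=set)
    edges: dict = field(default_factory=dict)  # (u, v) -> attribute dict

    def add_edge(self, u, v, **attributes):
        """Add edge (u, v) with optional attributes such as weight or label."""
        self.vertices.update((u, v))
        self.edges[(u, v)] = attributes
        if not self.directed:
            # Assumption: an undirected edge is stored in both orientations.
            self.edges[(v, u)] = attributes

    def neighbors(self, u):
        """Vertices reachable from u through a single edge."""
        return {v for (s, v) in self.edges if s == u}

g = Graph()
g.add_edge("a", "b", weight=0.5, label="interacts")
print(g.neighbors("b"))
```

Keeping attributes in a per-edge dictionary leaves room for any of the listed kinds of information (weights, labels, timestamps, probabilities) without changing the structure.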

Graphs are ubiquitous
- Computational biology: protein-protein interaction (PPI) networks
- Chemical data analysis: chemical compounds
- Communication networking: device networks, road networks
- Social network analysis
- Web link analysis
- Recommender systems

Mining graph data: Tasks
- Graph clustering
- Graph search
- Dense-subgraph extraction
- Graph classification
- Graph pattern mining
- Graph matching
- Graph querying
- Influence maximization
- …

Graph clustering
Partition the input graph in order to maximize some notion of density
Notions of density:
- Average degree
- Ratio cut
- Normalized cut
- Conductance
- (Quasi-)clique condition
- …
Applications:
- Community detection in a social network
- Identifying highly cohesive structures in biological networks
- Packet delivery on communication networks
- Detecting highly correlated stocks
- ...
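
Two of the listed notions can be computed directly from an edge list. The sketch below assumes an undirected, unweighted graph given as a list of vertex pairs and uses the usual textbook definitions (average degree of the induced subgraph, and cut size over the smaller side's total degree for conductance); the slides do not spell these formulas out, and the function names are hypothetical.

```python
def average_degree(subset, edges):
    """Average degree of the subgraph induced by `subset`: 2 * |E(S)| / |S|."""
    inside = sum(1 for u, v in edges if u in subset and v in subset)
    return 2 * inside / len(subset)

def conductance(subset, edges):
    """Number of edges leaving `subset`, divided by the smaller of the two
    sides' total degree. Undefined if one side has no incident edges."""
    cut = sum(1 for u, v in edges if (u in subset) != (v in subset))
    vol_in = sum((u in subset) + (v in subset) for u, v in edges)
    vol_out = 2 * len(edges) - vol_in
    return cut / min(vol_in, vol_out)

# Two triangles joined by the single edge (3, 4).
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
print(average_degree({1, 2, 3}, edges))
print(conductance({1, 2, 3}, edges))
```

Note the direction of each measure: a good cluster has a high average degree but a low conductance, since conductance counts the edges that escape the cluster.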

Graph search
Given a set of graphs {G_1, ..., G_n} and a query graph Q, find all graphs in {G_1, ..., G_n} that are supergraphs of Q
Applications:
- Chemical compound search: molecules represented in terms of atoms and bonds between atoms
- Context-based image retrieval: images represented in terms of object properties and relationships between objects
- 3D protein structure search: proteins represented as a set of amino acids related to each other
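
For very small graphs, the search can be sketched by brute force. The code below reads "supergraph of Q" as "contains a (not necessarily induced) subgraph isomorphic to Q", which is an assumption; it ignores vertex and edge labels, represents undirected graphs as edge lists, and tries every injective vertex mapping, so it is exponential and only meant to illustrate the problem. Practical systems rely on indexing and pruning instead.

```python
from itertools import permutations

def contains(graph_edges, query_edges):
    """True if some injective mapping of query vertices to graph vertices
    sends every query edge onto a graph edge."""
    g_edges = {frozenset(e) for e in graph_edges}
    g_nodes = list({v for e in graph_edges for v in e})
    q_nodes = list({v for e in query_edges for v in e})
    for image in permutations(g_nodes, len(q_nodes)):
        mapping = dict(zip(q_nodes, image))
        if all(frozenset((mapping[u], mapping[v])) in g_edges for u, v in query_edges):
            return True
    return False

def graph_search(graphs, query_edges):
    """Names of all graphs in the collection that contain the query."""
    return [name for name, edges in graphs.items() if contains(edges, query_edges)]

# Hypothetical collection of three small graphs.
graphs = {
    "triangle": [(1, 2), (2, 3), (1, 3)],
    "path": [(1, 2), (2, 3)],
    "square": [(1, 2), (2, 3), (3, 4), (4, 1)],
}
print(graph_search(graphs, [("a", "b"), ("b", "c"), ("a", "c")]))
```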

Thanks! gullo@yahoo-inc.com

Backup slides

Association-rule discovery: Application 2
Prediction of drug side effects
Goal: detect combinations of drugs that result in particular side effects
Approach:
- Model each patient as a record of two types of items: items representing drugs taken and items representing side effects observed
- Employ an association-rule-discovery method to detect rules like {Marijuana, Heroin} → {Depressed respiration}
- Use the rules discovered for early diagnoses

Mining graph data: Challenges
- Small-dimensional graphs, but lots of graphs: chemical data graphs
- Small number of graphs, but huge dimensionality: social networks, the Web
- Dynamic graphs (i.e., graphs changing over time): PPI networks
- Time-dependent graphs: road networks

Classification: Application 1
Direct marketing
Goal: reduce the cost of mailing by targeting a set of consumers likely to buy a new cell-phone product
Approach:
- Use past data from a (set of) similar product(s) introduced before
- Consider the information about which customers bought and which customers did not. This {buy, don’t buy} decision forms the class attribute
- Describe each customer according to several other attributes, such as demographic, lifestyle, and company-interaction information
- Use this information to train a classifier that can be used to infer the {buy, don’t buy} class of the various customers for the new product
