Data Mining, Neural Network and Genetic Programming


1 Data Mining, Neural Network and Genetic Programming
COMP422 Week 2: DM Tasks and Algorithms
Mengjie Zhang, Yi Mei

2 Outline
DM development history
DM related fields and relationships
DM common tasks and examples
DM common algorithms
Bayesian classification approach
Support vector machine
Web mining
DM issues and challenges

3 KDD/DM History
Time | Area | Contribution
Late 1700s | Stat | Bayes theorem of probability
Early 1900s | | Regression analysis
Early 1920s | | Maximum likelihood estimation
Early 1940s | AI | Neural networks
Early 1950s | | Nearest neighbour; single link
Late 1950s | | Perceptron; resampling, bias reduction
Early 1960s | | Machine learning started
Early 1960s | DB | Batch reports
Mid 1960s | | Decision trees; linear models for classification
Mid 1960s | IR | Similarity measures; clustering

4 KDD/DM History
Time | Area | Contribution
Mid 1960s | Stat | Exploratory data analysis
Late 1960s | DB | Relational data model
Early 1970s | IR | SMART IR system
Early 1970s | AI | Expert systems
Mid 1970s | | Genetic algorithms
Late 1970s | | Estimation with incomplete data; K-means clustering
Early 1980s | | Kohonen self-organising maps
Mid 1980s | | Decision tree algorithms
Early 1990s | DB? | Association rule algorithms; Web and search engines
Early 1990s | | Genetic programming
Late 1990s | AI/Stat | Support vector machines
2000s | AI/DB | Deep learning, Big data, …

5 KDD Related Fields
KDD/DM is multidisciplinary:
AI, Machine Learning, Statistics, Pattern Recognition
Databases (Parallel DBMS, Deductive Databases)
Knowledge Acquisition, Expert Systems, Decision Support Systems
Data Visualisation
Fuzzy Set, Fuzzy Logic, Uncertainty
Machine Discovery and Causal Modelling
Data Warehousing
High Performance Computing
Image Analysis, Signal Processing
Web Search Engines, Information Extraction

6 KDD Related Fields

7 Artificial Intelligence

8 Artificial Intelligence
Machines that mimic "cognitive" functions such as "learning" and "problem solving":
Understanding human speech
Playing games
Autonomous driving
Intelligent routing/delivery
Interpreting complex data
Related to DM, but covers many other areas:
Symbolic AI
Planning and scheduling
Search

9 Machine Learning
Basis for many core DM research topics
Examine existing data (training set) to produce some "rules" (KDD objects)
Apply these rules/KDD objects to unseen data (test set)
Often used for classification and regression (prediction)
Supervised learning vs unsupervised learning
Examples: neural networks, genetic algorithms/programming, decision trees (C4.5, C5.0, …)

10 Statistics
Probability
Distributions to describe domains for different data attributes
Statistical inference
Bayesian approach
Regression

11 Fuzzy Set and Fuzzy Logic
A fuzzy set is a pair (U, m), where U is the underlying set and m: U → [0, 1] is the membership function
Example: the "tall" set, defined as a crisp set vs as a fuzzy set

12 Fuzzy Set and Fuzzy Logic
Can have multiple fuzzy sets over the same domain
Example: characterise temperature with three fuzzy sets: cold, warm, hot

13 Fuzzy Set and Fuzzy Logic
Fuzzy logic: reasoning with uncertainty
Example: characterise temperature with three fuzzy sets (cold, warm, hot) and qualified statements such as "fairly cold", "slightly warm", "not hot"

14 Fuzzy Set and Fuzzy Logic
Fuzzy logic operators: let x and y be two fuzzy logic statements
m(¬x) = 1 − m(x)
m(x ∧ y) = min(m(x), m(y))
m(x ∨ y) = max(m(x), m(y))
Boolean (True/False) | Fuzzy (Membership)
AND(x, y) | min(m(x), m(y))
OR(x, y) | max(m(x), m(y))
NOT(x) | 1 − m(x)
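A minimal Python sketch of the operators above; the trapezoidal "cold" and "warm" membership functions are illustrative assumptions, not taken from the slides.

```python
# Fuzzy membership and the three fuzzy logic operators: NOT, AND, OR.

def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: rises on [a, b], equals 1 on [b, c], falls on [c, d]."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)

def fuzzy_not(mx):       # m(not x) = 1 - m(x)
    return 1.0 - mx

def fuzzy_and(mx, my):   # m(x and y) = min(m(x), m(y))
    return min(mx, my)

def fuzzy_or(mx, my):    # m(x or y) = max(m(x), m(y))
    return max(mx, my)

temp = 17.0
cold = trapezoid(temp, -10, -5, 10, 20)   # assumed "cold" fuzzy set
warm = trapezoid(temp, 10, 18, 24, 30)    # assumed "warm" fuzzy set
print(fuzzy_and(cold, warm), fuzzy_or(cold, warm), fuzzy_not(cold))
```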

15 Fuzzy Set and Fuzzy Logic

16 Data Warehousing
Central repositories of integrated data from one or more disparate sources
Store current and historical data, and create analytical reports
OLAP, as opposed to OLTP in operational databases

17 Web Search Engines
Web search engines find online documents related to a particular topic based on keyword search
Examples: Google, Yahoo, Baidu, …

18 Information Extraction
Extract particular items of interesting information from the Web
Convert content that is only human-readable into machine-readable form
Typical example: natural language processing

19 DB Management vs Machine Learning
Database Management | Machine Learning
The DB is an active, evolving entity | The data set is static
Records may contain erroneous or missing data | The DB is complete and noise-free
The typical field is numeric | The typical field is binary
The DB contains millions of records | The data set contains hundreds of instances
"AI should get down to reality" | "All DB problems have been solved"

20 DM Tasks
Classification
Regression
Prediction
Time Series Analysis
Clustering
Summarisation
Association Rules
Sequence Discovery

21 Classification
Maps data into predefined groups or classes
Supervised learning
Classes are predefined/determined based on data attribute values before examining the data
Examples:
Medical: cancer vs not cancer
Bank: credit reliable vs unreliable
Digit recognition: multi-class
Weather: sunny or rainy
Anomaly detection, e.g. airport security screening: terrorist/criminal or not

22 Regression
Maps a data item to a real-valued prediction variable
Learning a function: assume a certain function type (e.g. linear, logistic, polynomial, …) and determine the best function of this type to fit the given data
Examples: financial prediction, savings prediction, ad cost vs sales (see the sketch below)
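A minimal sketch of regression as function fitting: assume a linear form y = a·x + b and let least squares pick the best a and b. The ad-cost/sales numbers are made up for illustration.

```python
import numpy as np

ad_cost = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical ad spend
sales   = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # hypothetical sales

# Design matrix [x, 1] so that the least-squares solution is [a, b]
X = np.column_stack([ad_cost, np.ones_like(ad_cost)])
(a, b), *_ = np.linalg.lstsq(X, sales, rcond=None)

print(f"fitted model: sales ~ {a:.2f} * ad_cost + {b:.2f}")
print("prediction for ad_cost = 6:", a * 6 + b)
```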

23 Classification vs Regression

24 Prediction
Prediction is similar to classification/regression
The difference is that prediction predicts a future state in time rather than a current state
Examples: flood prediction, weather forecasting

25 Time Series Analysis
A special case: the attribute to be examined varies over time
Example: stock market (prices of X, Y and Z over one month)

26 Clustering
Similar to classification, except that the groups are not predefined, but rather defined by the data itself
Unsupervised learning
Segmenting or partitioning data into groups that may or may not be disjoint
Done by determining the similarity among the data on predefined attributes
A domain expert is needed to interpret the meaning of the clusters
Segmentation is a special type of clustering: a DB is partitioned into disjoint groups of similar tuples called segments (a simple k-means sketch follows below)
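A minimal k-means sketch to make the idea concrete: groups are discovered from similarity (Euclidean distance) alone, with no predefined class labels. The two-blob data is synthetic.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialise centroids by picking k distinct points at random
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points
        # (assumes no cluster ends up empty, which holds for this toy data)
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

data = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids = kmeans(data, k=2)
print(centroids)
```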

27 Clustering

28 Clustering
Self-organising map: a special method for clustering and high-dimensional data visualisation

29 Summarisation
Also called characterisation or generalisation
Maps data into subsets with associated simple descriptions
Examples: clustering, sampling, compression (e.g. images), histograms

30 Association Rules
Link analysis = association
Uncover relationships among data
An association rule is a model that identifies specific types of data associations; these are often used in the retail sales community to identify items that are frequently purchased together

31 Association Rules
Classic example: beer & nappies are frequently purchased together (see the support/confidence sketch below)
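A minimal sketch computing the support and confidence of the rule "beer ⇒ nappies" on a made-up list of baskets. Real association-rule algorithms (e.g. Apriori) search over all frequent itemsets, which this sketch does not do.

```python
# Each transaction is the set of items in one market basket (made-up data).
transactions = [
    {"beer", "nappies", "crisps"},
    {"beer", "nappies"},
    {"bread", "milk"},
    {"beer", "crisps"},
    {"nappies", "milk"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"beer", "nappies"} <= t)
beer = sum(1 for t in transactions if "beer" in t)

support = both / n          # fraction of all baskets containing beer and nappies
confidence = both / beer    # fraction of beer baskets that also contain nappies
print(f"support = {support:.2f}, confidence = {confidence:.2f}")
```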

32 Sequence Discovery
Determine sequential patterns in data
Similar to associations in that data or events are found to be related, but the relationship is based on time
Example 1: most people who purchase a CD player buy CDs within one week
Example 2: a webmaster wants to determine which sequences of web pages are frequently accessed; 70% of the users of page A follow one of the patterns <A, B, C>, <A, D, B, C> or <A, E, B, C> (a counting sketch follows below)
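A minimal sketch for Example 2: count what fraction of the sessions starting at page A contain the ordered pattern A…B…C. The session data is made up.

```python
def contains_subsequence(session, pattern):
    """True if pattern appears in session in order (not necessarily adjacent)."""
    it = iter(session)
    return all(page in it for page in pattern)

sessions = [
    ["A", "B", "C"],
    ["A", "D", "B", "C"],
    ["A", "E", "B", "C"],
    ["A", "D", "E"],
]

starts_a = [s for s in sessions if s and s[0] == "A"]
matches = sum(contains_subsequence(s, ["A", "B", "C"]) for s in starts_a)
print(f"{matches}/{len(starts_a)} sessions starting at A follow A...B...C")
```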

33 Data Mining Algorithms
Statistical-based: regression, Bayesian classification
Distance-based: nearest neighbour, nearest centroid, k-nearest neighbour
Decision tree-based: ID3, C4.5/C5.0, …
Neural network-based: perceptron learning, back-propagation learning, …
Evolutionary-based: genetic programming
Rule-based: generating rules from a decision tree, from a neural network, or from a GP tree
Hybrid: use a GA to train the weights of an NN, NEAT, …

34 Bayes Approach
Statistical inference
Given a data set X = {x_1, x_2, …, x_n} and a hypothesis space H = {h_1, h_2, …, h_m}
Assume that one and only one hypothesis must occur at a time, and that x_i is an observable event
The Bayes rule estimates the likelihood of a hypothesis given the data as evidence or input:
P(h_j | x_i) = P(x_i | h_j) P(h_j) / P(x_i)
Posterior probability: P(h_j | x_i)
Prior probability: P(h_j)
Unconditional probability: P(x_i)
Conditional probability: P(x_i | h_j)

35 Bayes Classification Example
Credit loan authorisation with four classes/hypotheses:
h1: authorise purchase
h2: authorise after further identification
h3: do not authorise
h4: do not authorise and contact police

36 Bayes Classification Example
Categorise income into range1 = [$0, $10,000), range2 = [$10,000, $50,000), range3 = [$50,000, $100,000), range4 = [$100,000, ∞)
Credit records: {excellent, good, bad}
Then we have 12 values (4 × 3) in the data space
Given "Income = $130,000" and "Credit record = Excellent", which class/hypothesis does it belong to?

37 Bayes Classification Example
Estimate from the training data: the prior probabilities P(h_j) (e.g. P(h1) = 6/10, P(h2) = 2/10, P(h3) = 1/10), the unconditional probabilities P(x_i), and the conditional probabilities P(x_i | h_j); many of the estimated probabilities are zero
What is the problem here?

38 Bayes Classification Example
Dealing with zero counts: initialise all occurrence counts to 1 (Laplace smoothing)
Assumption of conditional independence between the attributes
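A minimal naive Bayes sketch matching the two points above: categorical attributes, the conditional-independence assumption, and counts initialised to 1 so that no estimated probability is zero. The tiny training set and its class labels are made up for illustration.

```python
from collections import defaultdict
import math

# (income_range, credit) -> class; made-up training examples
train = [
    (("range1", "bad"), "h3"),
    (("range2", "good"), "h1"),
    (("range3", "excellent"), "h1"),
    (("range4", "excellent"), "h1"),
    (("range2", "bad"), "h2"),
]
classes = ["h1", "h2", "h3", "h4"]
values = {0: ["range1", "range2", "range3", "range4"],
          1: ["excellent", "good", "bad"]}

def train_nb(data):
    prior = defaultdict(int)                       # class counts
    cond = defaultdict(lambda: defaultdict(int))   # per-class attribute-value counts
    for x, h in data:
        prior[h] += 1
        for i, v in enumerate(x):
            cond[h][(i, v)] += 1
    return prior, cond

def predict(x, prior, cond):
    n = sum(prior.values())
    scores = {}
    for h in classes:
        # log P(h), smoothed so unseen classes are not impossible
        s = math.log((prior[h] + 1) / (n + len(classes)))
        for i, v in enumerate(x):
            # log P(x_i | h), every count starts at 1 (Laplace smoothing)
            s += math.log((cond[h][(i, v)] + 1) / (prior[h] + len(values[i])))
        scores[h] = s
    return max(scores, key=scores.get)

prior, cond = train_nb(train)
print(predict(("range4", "excellent"), prior, cond))
```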

39 Support Vector Machines (SVMs)
Find the optimal hyperplane that correctly classifies as many data points as possible and separates the two classes as far as possible

40 Support Vector Machines (SVMs)
Find the optimal hyperplane that correctly classifies as many data points as possible and separates the two classes as far as possible
The same hyperplane is then used to classify unseen data points

41 SVM Main Idea
Given a set of data points which belong to either of two classes (y = 1 or y = −1), an SVM finds the hyperplane that:
leaves the largest fraction of points of the same class on the same side, and
maximises the distance of either class from the hyperplane
Hard margin or soft margin: with a soft margin, points on the wrong side incur a penalty

42 SVM for Non-linear Classification
What if the data is not linearly separable?
Kernel machine: transform the data to a higher dimension so that it becomes linearly separable, e.g. (x, y) → (x, y, x² + y²)

43 SVM for Non-linear Classification
Everything remains the same, except that the dot products are replaced by a non-linear kernel function
The resulting classifier is linear in the transformed high-dimensional space, but non-linear in the original space (see the sketch below)
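A minimal sketch of the mapping from slide 42: two concentric classes are not linearly separable in (x, y), but after adding the feature x² + y² a single threshold on the new coordinate separates them. The data and the threshold are illustrative assumptions, and no actual SVM solver is used.

```python
import numpy as np

rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.concatenate([rng.uniform(0, 1, 100),    # inner class (label -1)
                        rng.uniform(2, 3, 100)])   # outer class (label +1)
x, y = radii * np.cos(angles), radii * np.sin(angles)
labels = np.concatenate([-np.ones(100), np.ones(100)])

z = x**2 + y**2              # the extra dimension of the feature map
threshold = 1.5**2           # any value between 1^2 and 2^2 separates the classes
predictions = np.where(z > threshold, 1, -1)
print("accuracy in the transformed space:", (predictions == labels).mean())
```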

44 Vapnik-Chervonenkis (VC) Dimension
Measures the capacity (complexity, expressive power, richness, or flexibility) of a space of functions that can be learned by a statistical classification algorithm
Shattering: a set of functions f(θ), where θ is the parameter of the function, is said to shatter a set of data points {x_1, …, x_n} if, for every assignment of class labels to those points, there exists a θ such that f(θ) classifies the points perfectly
The VC dimension of f(θ) is the maximum number of points that f(θ) can shatter

45 VC Dimension Examples
If f(θ) = w1·x1 + w2·x2 − b is a linear classifier in 2D space, there exist sets of 3 points that are shattered by the model, but no set of 4 points can be shattered, so its VC dimension is 3

46 VC Dimension Examples
f is a constant classifier (with no parameters): it returns 1 if the input is larger than c, and 0 otherwise. What is its VC dimension?
f is a single-parameter threshold classifier: it returns 1 if the input is larger than θ, and 0 otherwise. What is its VC dimension?
f is a single-parameter interval classifier: it returns 1 if the input is in the interval [θ, θ+4], and 0 otherwise. What is its VC dimension?
f is a single-parameter sine classifier: it returns 1 if the input is larger than sin(θx), and 0 otherwise. What is its VC dimension?
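A minimal sketch for the threshold classifier in the second question: brute-force over all labellings of a point set to check whether it can be shattered. One point can always be shattered, two points cannot, so the VC dimension of the threshold classifier is 1.

```python
from itertools import product

def threshold_can_realise(points, labels):
    """Can some theta make f(x) = [x > theta] produce exactly these labels?"""
    ones = [p for p, l in zip(points, labels) if l == 1]
    zeros = [p for p, l in zip(points, labels) if l == 0]
    # A theta exists iff every 1-labelled point lies above every 0-labelled point
    return not ones or not zeros or min(ones) > max(zeros)

def shatters(points):
    return all(threshold_can_realise(points, labels)
               for labels in product([0, 1], repeat=len(points)))

print(shatters([3.0]))        # True:  any single point can be shattered
print(shatters([1.0, 2.0]))   # False: the labelling (1, 0) cannot be realised
```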

47 Expected Risk
Given N observations, each a pair (x_i, y_i)
The data points are independently and identically distributed, following some unknown distribution P(x, y)
A machine/function is to learn the mapping f(x, α): X → y ∈ {−1, 1}, where α is the parameter
Expected risk: R(α) = ∫ ½ |y − f(x, α)| dP(x, y)
Empirical risk: R_emp(α) = (1 / 2N) Σ_i |y_i − f(x_i, α)|

48 Bound for Expected Risk
According to the PAC model, the following bound for the expected risk holds with probability 1 − η:
R(α) ≤ R_emp(α) + sqrt( (h (ln(2N/h) + 1) − ln(η/4)) / N )
where h is the VC dimension of f(x, α)
Bound of expected risk = empirical risk + VC confidence
Smaller VC dimension → smaller bound of expected risk
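A minimal sketch of the VC confidence term in the bound above, showing how it shrinks as the sample size N grows and grows with the VC dimension h; the values of h and η are arbitrary.

```python
import math

def vc_confidence(h, n, eta=0.05):
    """sqrt( (h*(ln(2N/h) + 1) - ln(eta/4)) / N )"""
    return math.sqrt((h * (math.log(2 * n / h) + 1) - math.log(eta / 4)) / n)

for n in (100, 1000, 10000):
    print(n, round(vc_confidence(h=10, n=n), 3))
```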

49 Structural Risk Minimisation
To make the expected risk small:
minimise the empirical risk, and
minimise the VC dimension

50 Structural Risk Minimisation
1. Using a priori knowledge of the domain, choose a class of functions, such as polynomials of degree n, neural networks having n hidden layer neurons, a set of splines with n nodes, or fuzzy logic models having n rules
2. Divide the class of functions into a hierarchy of nested subsets in order of increasing complexity (VC dimension), for example polynomials of increasing degree
3. Perform empirical risk minimisation on each subset (this is essentially parameter selection)
4. Select the model in the series whose sum of empirical risk and VC confidence is minimal
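A minimal sketch of these four steps on a nested family of polynomials of increasing degree: least squares does the empirical risk minimisation within each subset, and the degree with the smallest penalised risk is selected. Using the VC-confidence formula with h = degree + 1 as the penalty is an illustrative assumption, not the exact bound for polynomial regression.

```python
import math
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 40)
y = np.sin(3 * x) + rng.normal(0, 0.2, x.size)   # made-up noisy data

def penalty(h, n, eta=0.05):
    # VC-confidence-style complexity term (illustrative choice of h)
    return math.sqrt((h * (math.log(2 * n / h) + 1) - math.log(eta / 4)) / n)

best = None
for degree in range(1, 10):
    coeffs = np.polyfit(x, y, degree)                    # empirical risk minimisation
    emp_risk = np.mean((np.polyval(coeffs, x) - y) ** 2)
    score = emp_risk + penalty(degree + 1, x.size)       # empirical risk + complexity
    if best is None or score < best[1]:
        best = (degree, score)
print("selected degree:", best[0])
```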

51 Structural Risk Minimisation
Well founded mathematically, but difficult to implement:
the VC dimension is hard to compute, and there are only a few models for which we know how to compute it
even when the VC dimension is known, the optimisation problem is not easy to solve (minimise the empirical risk on each subset, and then choose the best one)

52 SVM for Structural Risk Minimisation
SVMs follow the spirit of the SRM principle
In an SVM, the model is a hyperplane, f(x) = sign(w·x + b)
The VC dimension of this set of functions is controlled by the margin: the larger the margin, the smaller the VC dimension
The SVM minimises the empirical risk (training errors) while maximising the margin (i.e. keeping the VC dimension small)

53 SVM for Structural Risk Minimisation

54 SVM Issues/Problems
Multiple classes: one-against-the-rest, one-against-one
Feature selection
Need to select a good kernel function
Usually requires lots of memory and CPU time
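A minimal sketch of the one-against-the-rest strategy using scikit-learn's SVC: train one binary SVM per class (that class vs everything else) and predict by the largest decision value. The iris data set is just a stand-in.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# One binary classifier per class: label 1 for "this class", 0 otherwise
models = {c: SVC(kernel="rbf").fit(X, (y == c).astype(int)) for c in classes}

def predict(x):
    # Pick the class whose binary classifier is most confident
    scores = {c: m.decision_function([x])[0] for c, m in models.items()}
    return max(scores, key=scores.get)

preds = np.array([predict(x) for x in X])
print("training accuracy:", (preds == y).mean())
```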

55 Web Mining
Web mining is the mining of data related to the World Wide Web

56 Web Mining

57 Web Mining
Web content mining: mining, extraction and integration of useful data, information and knowledge from Web page content
Topic discovery, extracting association patterns, clustering/classifying Web pages
Related to information retrieval, text mining, natural language processing, computer vision and image processing

58 Web Mining
Web structure mining: use graph theory to analyse the node and connection structure of a web site
Extract patterns from hyperlinks
Mine the document structure (HTML/XML tags)
Example: PageRank (Google), used to rank search results
It estimates the quality of Web pages based on hyperlinks: a page has a high rank if it is pointed to by many highly ranked pages (see the sketch below)
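A minimal PageRank sketch by power iteration over a made-up four-page link graph: each page shares its rank among the pages it links to, with a damping factor, until the ranks converge.

```python
links = {            # page -> pages it links to (illustrative graph)
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = list(links)
n = len(pages)
rank = {p: 1.0 / n for p in pages}
damping = 0.85

for _ in range(50):
    # Every page keeps a small base rank, plus shares received via inlinks
    new_rank = {p: (1 - damping) / n for p in pages}
    for p, outgoing in links.items():
        share = damping * rank[p] / len(outgoing)
        for q in outgoing:
            new_rank[q] += share
    if max(abs(new_rank[p] - rank[p]) for p in pages) < 1e-9:
        break
    rank = new_rank

print(sorted(rank.items(), key=lambda kv: -kv[1]))   # C should rank highest
```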

59 Web Mining
Web usage mining: discover interesting usage patterns, such as the identity or origin of Web users and their browsing behaviours
Applications: personalised marketing, identifying terrorism threats
Ethical issues: invasion of privacy

60 Issues and Challenges
High dimensionality: hundreds of fields and tables, millions of records (Big Data)
Overfitting
Randomness: if the search is over many models, some models will fit well by chance
Dynamism: data and knowledge change over time
Missing and noisy data
Complex relationships between fields
Understandability of patterns
User interaction and prior knowledge
Integration with other systems
Blind use of methods can lead to meaningless and invalid patterns

61 Summary
DM related fields and relationships
DM tasks and examples
DM algorithms: Bayes method, support vector machines, structural risk minimisation
Web mining
DM issues and challenges

