Signal Processing and Networking for Big Data Applications. Lecture 18: Bayesian Nonparametric Learning. Zhu Han, University of Houston. Thanks to Dr. Nam Nguyen for his work.

Outline: nonparametric classification techniques; applications: smart grid, bio imaging, security for wireless devices, location-based services.

Bayesian Nonparametric Classification. Question: how do we cluster smart-meter big data, especially multi-dimensional data? Model selection: how many clusters are there? What hidden process created the observations? What are the latent parameters of that process? Classic parametric methods (e.g., K-means) need to estimate the number of clusters, can suffer a large performance loss under a poor model, and do not scale well. These questions can be addressed with nonparametric Bayesian learning. Nonparametric: the number of clusters (or classes) can grow as more data are observed and need not be known a priori. Bayesian inference: use Bayes' rule to infer the latent variables.

Main objective and key idea: Bayes' rule relates the posterior to the likelihood and the prior. The parameter μ contains information such as how many clusters there are and which sample belongs to which cluster; μ is nonparametric and can take any value. We sample the posterior distribution P(μ | observations) to obtain values of μ: p(μ | observations) = p(observations | μ) p(μ) / p(observations).

Example of Bayesian inference used for parameter update. A Beta distribution is chosen as the prior, e.g., a = 2, b = 2 (head and tail probabilities are equal). A Binomial distribution is the conjugate likelihood: one trial (N = 1) whose result is one head (m = 1). This leads to the posterior, an update of the parameters given the observations; the estimated probability of a head is now higher.
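As a minimal worked version of this conjugate update (the parameterization below is the standard Beta-Binomial one, not copied from the slides):

```latex
p(\theta) = \mathrm{Beta}(\theta \mid a, b), \qquad
p(m \mid \theta) = \binom{N}{m}\,\theta^{m}(1-\theta)^{N-m}
\;\Longrightarrow\;
p(\theta \mid m) = \mathrm{Beta}(\theta \mid a + m,\; b + N - m).
```

With a = b = 2, N = 1, m = 1 the posterior is Beta(3, 2), whose mean 3/5 exceeds the prior mean 1/2, i.e., the probability of a head is now estimated to be higher.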

Dirichlet distribution: an extension of the Beta distribution to multiple dimensions. K: number of clusters; πi: weight of cluster i, whose marginal distribution is Beta; αi: prior concentration parameter.
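A sketch of the density under the standard parameterization (the slide's own formula is not reproduced in this transcript):

```latex
\mathrm{Dir}(\pi_1,\dots,\pi_K \mid \alpha_1,\dots,\alpha_K)
= \frac{\Gamma\!\big(\sum_{i=1}^{K}\alpha_i\big)}{\prod_{i=1}^{K}\Gamma(\alpha_i)}
  \prod_{i=1}^{K}\pi_i^{\alpha_i-1},
\qquad \pi_i \ge 0,\quad \sum_{i=1}^{K}\pi_i = 1 .
```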

Dirichlet process: a random distribution G on Θ is Dirichlet process distributed with base distribution H and concentration parameter α, written G ∼ DP(α, H), if for every finite measurable partition A1, . . ., AK of Θ the vector (G(A1), . . ., G(AK)) is Dirichlet distributed. H(·) is the mean of the DP; α is the strength of the prior.
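The defining property, completing the truncated slide formula with the standard DP definition:

```latex
\big(G(A_1),\dots,G(A_K)\big) \;\sim\; \mathrm{Dir}\big(\alpha H(A_1),\dots,\alpha H(A_K)\big).
```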

Bayesian nonparametric update. Given t observations x1, …, xt, define the posterior distribution on Θ and the posterior Dirichlet process. For a small number of observations t, the prior dominates; as t increases, the prior has less and less impact. α controls the balance between the impact of the prior and of the trials. This update can be used to learn and combine arbitrary distributions.
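The posterior DP referred to here has the standard conjugate form (reconstructed, since the slide equation is not reproduced in the transcript):

```latex
G \mid x_1,\dots,x_t \;\sim\; \mathrm{DP}\!\left(\alpha + t,\;
\frac{\alpha}{\alpha+t}\,H + \frac{1}{\alpha+t}\sum_{i=1}^{t}\delta_{x_i}\right),
```

so for small t the base measure H (the prior) dominates, while for large t the empirical distribution of the observations takes over, with α setting the balance.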

Applications. Distribution estimation: primary user spectrum map; cognitive radio spectrum bidding (estimate the aggregated effect of all other CR users). Primary user spectrum map: different CR users see the spectrum differently; the question is how to combine the others' sensing (as a prior) with one's own sensing. Classification: infinite Gaussian mixture model.

Bayesian Nonparametric Classification: generative model vs. inference algorithm. Generative model: start with the parameters and end up creating observations (the concept and framework). Inference algorithm: start with observations and end up inferring the parameters (the practical applications).

A Die with an Infinite Number of Faces. Generative model, general idea: if we sample the distribution over faces, we obtain the weights, i.e., the probabilities π1, π2, π3, … of each face (a Dirichlet process). Question: if the die has an infinite number of faces (1, 2, 3, …, ∞), how do we handle that situation?

Model for an Infinite Number of Faces. Generative model, stick-breaking process: generate an infinite number of faces and their weights, which sum to 1. Sample a breaking point πk' and calculate the weight πk = πk' ∏_{j<k} (1 − πj'). The slide's figure shows the unit stick being split repeatedly: first into π1' and 1 − π1', then the remainder into π2'(1 − π1') and (1 − π2')(1 − π1'), and so on.
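A minimal sketch of the stick-breaking construction in Python, with a finite truncation and an α value chosen only for illustration:

```python
import numpy as np

def stick_breaking(alpha, num_sticks, rng=None):
    """Draw truncated stick-breaking weights for a DP with concentration alpha."""
    rng = np.random.default_rng(rng)
    # Breaking proportions pi'_k ~ Beta(1, alpha)
    betas = rng.beta(1.0, alpha, size=num_sticks)
    # Weight k is its proportion times the stick length remaining before it
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    return betas * remaining

weights = stick_breaking(alpha=2.0, num_sticks=20, rng=0)
print(weights.sum())  # close to 1; the leftover mass belongs to the truncated tail
```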

Infinite Gaussian Mixture Model, generative model: Stick(α) generates an infinite number of faces/classes with weights π1, π2, …, π∞. Indicators are created according to a multinomial distribution over these weights (e.g., z1, z2, … = 1; z20, z21, … = 2). The observations X1:N then follow a per-class distribution such as a Gaussian with parameters (µk, Σk) for k = 1, 2, …, ∞.
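A small generative sketch of this model, again truncated, with a base distribution and hyperparameters chosen only for illustration:

```python
import numpy as np

def sample_igmm(n, alpha=1.0, dim=2, truncation=30, rng=None):
    """Generate observations from a (truncated) infinite Gaussian mixture model."""
    rng = np.random.default_rng(rng)
    # Mixture weights from the stick-breaking construction
    betas = rng.beta(1.0, alpha, size=truncation)
    weights = betas * np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    weights /= weights.sum()                  # renormalize the truncated tail
    # Class means drawn from an assumed N(0, 5^2 I) base distribution
    means = rng.normal(0.0, 5.0, size=(truncation, dim))
    # Indicators from the multinomial, then Gaussian observations per class
    z = rng.choice(truncation, size=n, p=weights)
    x = means[z] + rng.normal(size=(n, dim))  # unit covariance for simplicity
    return x, z

x, z = sample_igmm(500, alpha=1.0, rng=0)
```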

Inference model: obtain the labels Z from the samples X by finding the posterior of the multivariate distribution P(Z | X), i.e., given observation X, the probability that it belongs to cluster Z (which cluster does a sample belong to?). This is painful because of the integrations that must be carried out. Finding a univariate distribution is much easier to implement: for a new observation we can obtain the marginal distribution of its indicator, in other words the marginal distribution of zi given all the other indicators. Gibbs sampling samples a value for one variable given all the other variables; the process is repeated and has been shown to converge after a number of iterations.

Chinese Restaurant Process: nonparametric Bayesian classification inference. Goal: compute the posterior of zi given z−i, where z−i is the set of all labels except the current, i-th one. The prior (Chinese Restaurant Process) multiplied by the likelihood (e.g., Gaussian) gives the posterior. The prior probability of assigning the observation to a represented class k is proportional to n−i,k, the number of observations already in class k excluding the current, i-th one; the probability of assigning it to an unrepresented class is proportional to α.
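The CRP prior referred to here has the standard form (reconstructed, since the slide's equations are not reproduced in the transcript):

```latex
P(z_i = k \mid z_{-i}) = \frac{n_{-i,k}}{N - 1 + \alpha}
\ \ \text{(represented class $k$)}, \qquad
P(z_i = \text{new} \mid z_{-i}) = \frac{\alpha}{N - 1 + \alpha}.
```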

Student-t Distribution. Inference model, posterior distributions: given the prior and the likelihood, we arrive at the posterior. The probabilities of assigning an observation to an unrepresented cluster and to a represented cluster are given by the slide's equations (1) and (2), where t denotes the Student-t distribution. Intuition: these assignment probabilities provide a stochastic gradient toward better clusterings.

Gibbs Sampler. Inference model: start with a random indicator for each observation; remove the current, i-th observation from its cluster; update its indicator zi according to (1) and (2) given all the other indicators; repeat until convergence, then stop.
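A skeleton of this collapsed Gibbs loop; the predictive density is left as a user-supplied placeholder, since a full implementation would plug in the Student-t predictive from the previous slide:

```python
import numpy as np

def gibbs_crp_mixture(X, alpha, predictive, n_iters=100, rng=None):
    """Collapsed Gibbs sampling sketch for CRP mixture labels.

    `predictive(x, members)` is assumed to return the predictive density of x
    under a cluster containing `members` (or under the prior if empty).
    """
    rng = np.random.default_rng(rng)
    n = len(X)
    z = np.zeros(n, dtype=int)                    # start with everyone in one cluster
    for _ in range(n_iters):
        for i in range(n):
            z[i] = -1                             # remove x_i from its cluster
            clusters = [k for k in np.unique(z) if k >= 0]
            # CRP prior times likelihood for each existing cluster, plus a new one
            weights = [np.sum(z == k) * predictive(X[i], X[z == k]) for k in clusters]
            weights.append(alpha * predictive(X[i], X[:0]))
            weights = np.asarray(weights, dtype=float)
            weights /= weights.sum()
            choice = rng.choice(len(weights), p=weights)
            z[i] = clusters[choice] if choice < len(clusters) else max(clusters, default=-1) + 1
    return z
```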

Amazing clustering results: two Gaussian-distributed clusters with a KL divergence (KLD) of 4.5. Intuition for why it works so well: the method does not learn a boundary or threshold; it clusters so that each cluster looks as much as possible like the assumed distribution (Gaussian), with no prior information on the class probabilities.

Indian Buffet Process (IBP). In the Chinese restaurant process, one point belongs to exactly one cluster. The Indian buffet process gives multiple-assignment clustering, in which one observation can be caused by multiple hidden sources; the assignments are represented by a binary matrix (rows are observations, columns are sources/features).
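A minimal sketch of sampling the IBP binary matrix using the standard buffet metaphor (the α value and matrix orientation are illustrative choices):

```python
import numpy as np

def sample_ibp(n_customers, alpha, rng=None):
    """Sample a binary feature-assignment matrix from the Indian Buffet Process."""
    rng = np.random.default_rng(rng)
    dish_counts = []                      # how many customers took each dish (feature)
    rows = []
    for i in range(n_customers):
        taken = {}
        # Existing dishes are taken with probability m_k / (i + 1)
        for k, m_k in enumerate(dish_counts):
            if rng.random() < m_k / (i + 1):
                taken[k] = 1
                dish_counts[k] += 1
        # A Poisson(alpha / (i + 1)) number of brand-new dishes
        for _ in range(rng.poisson(alpha / (i + 1))):
            taken[len(dish_counts)] = 1
            dish_counts.append(1)
        rows.append(taken)
    Z = np.zeros((n_customers, len(dish_counts)), dtype=int)
    for i, taken in enumerate(rows):
        for k in taken:
            Z[i, k] = 1
    return Z

Z = sample_ibp(10, alpha=2.0, rng=0)   # one row per observation, one column per source
```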

Nonparametric classification: mean shift. Estimate the density with a kernel density estimator, compute the gradient of the density, and move each point toward the densest region of the observations, i.e., toward a region where the gradient is zero.
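A small sketch of the mean-shift update with a Gaussian kernel; the bandwidth and tolerance are illustrative choices:

```python
import numpy as np

def mean_shift(X, bandwidth=1.0, n_iters=50, tol=1e-5):
    """Move every point toward the nearest density mode via mean-shift updates."""
    modes = X.astype(float).copy()
    for _ in range(n_iters):
        shifted = np.empty_like(modes)
        for j, m in enumerate(modes):
            # Gaussian kernel weights of all observations relative to the current point
            w = np.exp(-np.sum((X - m) ** 2, axis=1) / (2 * bandwidth ** 2))
            shifted[j] = (w[:, None] * X).sum(axis=0) / w.sum()
        done = np.max(np.abs(shifted - modes)) < tol   # near a zero-gradient point
        modes = shifted
        if done:
            break
    return modes   # points sharing a mode belong to the same cluster
```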

Outline: nonparametric classification techniques; applications: smart grid, bio imaging, security for wireless devices, location-based services.

Smart Pricing for Maximizing Profit. Profit = sum of utility bills minus the cost of buying power. Different load shapes incur different costs, so pricing incentives are used to reshape the loads; the scheme pays off when the cost reduction exceeds the loss in bills.

Load Profiling. From smart meter data, try to infer users' usage behaviors, e.g., CEO / the 1% / UH Computer Science people; worker / middle class / myself; homeless / "slaves" / Ph.D. students.

Load Profiling Results. The utility company wants to know the benchmark distributions. Nonparametric: the number of benchmarks is not known in advance. Bayesian: the posterior distribution may be time varying. Time scales: daily, weekday, weekend, monthly, yearly.

Outline: nonparametric classification techniques; applications: smart grid, bio imaging, security for wireless devices, location-based services.

Image processing pipeline: (A) a maximum-intensity projection of a small region from the 3-D image montage; (B) the middle 2-D optical slice from the 3-D microglial soma segmentation results (in orange); (C) a 3-D volume rendering of the automated microglia reconstruction results (in white) overlaid on the Iba-1 channel (green), with the soma segmentations in orange; (D) illustration of the L-measure feature computation at multiple levels of granularity: compartment, segment, branch, and cell.

Image processing pipeline (E) Heatmap summary display of the combined L- measure feature table for the datasets in Figs. 1(a) and 1(c), with each row corresponding to a cell and each column corresponding to an L-measure feature.

Comparison of IGMM with other Techniques

Comparisons of the correlation matrices

Outline: nonparametric classification techniques; applications: smart grid, bio imaging, security for wireless devices, location-based services.

Introduction: Security. Security enhancement for wireless networks against masquerade and Sybil attacks, using device-dependent radiometrics as fingerprints. Contributions: a unique and hard-to-spoof device fingerprint; an unsupervised and passive attack detection method; upper and lower bounds on classification performance.

Wireless device security. (Slide diagram: masquerade and Sybil attack scenarios involving Devices 1-3 with example MAC addresses 00-B0-D0-86-BB-F7, 00-0C-F1-56-98-AD, and 00-A0-C9-14-C8-29.) Mechanism: if the number of physical devices can be estimated, compare it with the number of associated MAC addresses to detect an attack; the labels of the observations generated by the devices are then used to mark the malicious nodes.

Security – feature selection. The Carrier Frequency Difference (CFD): the difference between the carrier frequency of the ideal signal and that of the transmitted signal; it depends on the oscillator inside each device. The Phase Shift Difference (PSD): with QPSK modulation, the transmitter amplifiers for the I-phase and Q-phase may differ, so the phase shift shows device-dependent variance.

Security – feature selection (continued): the Second-Order Cyclostationary Feature (SOCF), illustrated on the slide by autocorrelation and spectral coherence plots over cycle frequency and frequency for a BPSK signal, and the received signal amplitude.

Security – inference algorithm: collect data; run the unsupervised clustering method; if two clusters share the same MAC address, declare a masquerade attack; if two MAC addresses fall into the same cluster, declare a Sybil attack; determine the number of attackers and update the blacklist with the MAC addresses.

Masquerade and Sybil attack detection: preliminary USRP2 experiment. We collected fingerprints from several WiFi devices with a USRP2 and ran the algorithm; the results appear on the slide.

Applications: Primary User Emulation (PUE) attack detection. In cognitive radio, a malicious node can pretend to be a Primary User (PU) to keep the network resources (bandwidth) for its own use. How can that be detected? We use the same approach: collect device-dependent fingerprints and classify them. We limit our study to an OFDM system using QPSK modulation.

PUE attack detection: ROC curve of the DECLOAK algorithm (figure on the slide).

Outline: nonparametric classification techniques; applications: smart grid, bio imaging, security for wireless devices, location-based services.

Introduction: Location Based Service (LBS)

Introduction: LBS. Major tasks to enable LBS: localization; estimating dwelling time; prediction of where the user goes next.

LBS – problem statement. Given: mobile devices in indoor environments and their WiFi scans. Goals: identify revisited locations; automatically profile new locations; use an unsupervised approach with no labeling required; use online sampling to reduce complexity; predict the next possible locations.

LBS – current indoor localization solutions. Ngram: based on the order of the APs; needs at least 400 samples per location to achieve good results, hence not energy efficient. SensLoc: based solely on AP names, hence not fine grained; continuously scans for WiFi signals, hence not energy efficient.

LBS – indoor place identification (LOIRE). LOIRE differs from the previous approaches in the following aspects: an unsupervised and nonparametric approach; energy efficiency (it requires only 50 samples per place); a framework for handling missing data; an online batch sampling approach; a quick stopping criterion for the algorithm.

LBS – missing data. WiFi scans are used as signatures to identify a revisited place and to detect a new place. Problems: observations have variable length because different APs are visible (or missing) in each scan, which makes it ambiguous whether the scans come from one room or two.

LBS with IGMM. The number of rooms is not given: how do we identify a revisited room? How do we detect a new room? How do we exploit batch sampling? In the example, WiFi scans from four rooms share the same list of APs.

LBS – online sampling. Assume there are K stored places. Place identification: find a label zi in 1:K for a newly observed WiFi scan Xi. New-place detection: find the label zi for the new scan and check whether zi = K + 1 (a new place). This is a stochastic way of detecting a new place, not a threshold-based approach.

LBS – experimental results. Dataset: 4 weeks of data from 5 phones.

LBS – future location prediction. Once the locations can be determined, we can build algorithms to predict the next visited location. We propose two prediction models: one based on a Markov model (a dynamic hidden Markov model) and one based on deep learning.

LBS – future location prediction 1: proposed Dynamic Hidden Markov Model (DHMM). Hidden states/locations: we propose to use nonparametric Bayesian classification (NBC) to determine the states. Observation dimensions: GPS coordinates, time, and RSS from either WiFi or cell-tower signals. Dynamic HMM: every time the user arrives at a new place, the model automatically updates itself with the new state (states S1, S2, …, SK emitting observations Rss1, Rss2, …, RssK).

LBS – future location prediction 1: proposed algorithm. N training data points are fed to a Gibbs sampler, which determines the number of states and the current state; as more data are observed, the DHMM computes the transition matrix and the distribution of signal strengths in each state, and an MDP produces the optimal scheduling result. Scheduling delay-tolerant data packets: each packet has a deadline, so when should it be sent? Wait until the signal-strength profile is at its best. How do we predict future signal strength? Predict the user's next locations and estimate the signal strength there; then use a Markov Decision Process (MDP) to calculate the expected reward, as sketched below.
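A generic value-iteration sketch for the MDP step; the transition matrices, rewards (e.g., for "send now" vs. "wait"), and discount factor are placeholders, not values from the slides:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Compute the optimal expected reward and a greedy policy for a finite MDP.

    P[a] is the S x S state-transition matrix under action a;
    R[a] is the length-S immediate reward vector under action a.
    """
    n_actions, n_states = len(P), P[0].shape[0]
    V = np.zeros(n_states)
    while True:
        Q = np.array([R[a] + gamma * P[a] @ V for a in range(n_actions)])
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)   # value per state and best action (send/wait)
        V = V_new
```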

LBS – future location prediction 1: simulation results (two states; same room).

LBS – future location prediction 1: cost saving of the proposed approach compared with the naïve scheme that sends immediately.

LBS – future location prediction 2: based on deep learning. Purpose: find the typical moving patterns (E1, E2) and, based on them, identify users and predict future locations (user identification and location prediction).

LBS – future location prediction 2. Bottom up: initialize random weights; learn the activations of the first hidden layer; treat the first hidden layer as the observation layer for the second layer; repeat the process for all layers. Top down: regenerate the observations from the parameters and optimize the weights according to equation (1). Repeat these two steps until convergence.
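The bottom-up/top-down procedure resembles greedy layer-wise pretraining; below is a compact sketch in that spirit (a stacked autoencoder with a linear decoder and squared reconstruction error, an illustrative stand-in rather than the exact model in the slides):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_layer(V, n_hidden, lr=0.05, epochs=200, rng=None):
    """One greedy layer: learn hidden activations that can regenerate the input."""
    rng = np.random.default_rng(rng)
    W_enc = rng.normal(scale=0.1, size=(V.shape[1], n_hidden))
    W_dec = rng.normal(scale=0.1, size=(n_hidden, V.shape[1]))
    for _ in range(epochs):
        H = sigmoid(V @ W_enc)                    # bottom-up activations
        V_rec = H @ W_dec                         # top-down regeneration
        err = V_rec - V                           # reconstruction error
        grad_dec = H.T @ err / len(V)
        grad_enc = V.T @ ((err @ W_dec.T) * H * (1 - H)) / len(V)
        W_dec -= lr * grad_dec
        W_enc -= lr * grad_enc
    return W_enc, sigmoid(V @ W_enc)

def greedy_pretrain(X, layer_sizes, rng=0):
    """Stack layers: each trained layer's activations become the next layer's input."""
    V, weights = X, []
    for size in layer_sizes:
        W, V = train_layer(V, size, rng=rng)
        weights.append(W)
    return weights
```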

LBS – future location prediction 2: experiment results comparing deep-learning features with PCA (figures on the slide).

Thanks