Background Information for Project


CSIT 5220: Background Information for Project. Good morning. Welcome to this tutorial. My name is Nevin L. Zhang. I am a professor of computer science at the Hong Kong University of Science and Technology. This is a picture of our campus from sea level.

Latent Tree Models (LTMs). Tree-structured probabilistic graphical models. Leaves are observed (manifest variables), discrete or continuous. Internal nodes are latent (latent variables), discrete. Each edge is associated with a conditional distribution, and one node with a marginal distribution. Together they define a joint distribution over all the variables (Zhang, JMLR 2004).

What are latent tree models, or LTMs? A latent tree model is a tree-structured probabilistic graphical model where the leaf nodes represent observed variables, while the internal nodes represent latent variables that are not observed. This is an example latent tree model for the academic performance of high school students. The grades that the students get in the 4 subjects, Math, Science, English, and History, are observed variables. The other two variables, analytical skill and literacy skill, are latent. The model says that a student's performance in Math and Science is influenced by analytical skill, performance in English and History is influenced by literacy skill, and the two latent skills are correlated. Each edge in the model is associated with a probability distribution. Collectively, those distributions define a joint distribution over all the variables.
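To make the last point concrete, here is a minimal sketch of how the edge distributions of the student example multiply into a joint distribution. All numbers are hypothetical, and for simplicity every variable is assumed binary (0 = low, 1 = high) and all four subjects share one conditional table; none of this comes from the tutorial itself.

```python
from itertools import product

# Hypothetical parameters (not from the tutorial).
p_analytical = [0.5, 0.5]                        # P(AS), the marginal at the root
p_literacy_given_as = [[0.7, 0.3], [0.3, 0.7]]   # P(LS | AS), rows indexed by AS
p_grade_given_skill = [[0.8, 0.2], [0.2, 0.8]]   # P(grade | skill), rows indexed by skill

def joint(as_, ls, math, sci, eng, hist):
    """P(AS, LS, Math, Science, English, History) as a product over the tree."""
    return (p_analytical[as_]
            * p_literacy_given_as[as_][ls]
            * p_grade_given_skill[as_][math]
            * p_grade_given_skill[as_][sci]
            * p_grade_given_skill[ls][eng]
            * p_grade_given_skill[ls][hist])

def marginal_observed(math, sci, eng, hist):
    """P(Math, Science, English, History): sum the latent skills out."""
    return sum(joint(a, l, math, sci, eng, hist)
               for a, l in product(range(2), repeat=2))

# The joint over all six variables sums to one.
total = sum(joint(*values) for values in product(range(2), repeat=6))
print(total)
print(marginal_observed(1, 1, 1, 1))
```

Summing out the two latent skills gives the distribution over the four grades, which is the only part of the model that can be compared with data.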

Latent Tree Analysis (LTA). From data on observed variables, obtain a latent tree model. Learning latent tree models means determining: the number of latent variables; the number of possible states for each latent variable; the connections among nodes; the probability distributions.

Latent tree analysis, or LTA, refers to the process of obtaining a latent tree model from data on observed variables. For our example, we might start with the scores of the students, and from this data we want to obtain a model like this. To do so, we first need to determine the number of latent variables, the number of states for each latent variable, how the nodes should be connected up to form a tree, and finally the probability distributions. This is a rather difficult problem. In this tutorial, I will discuss various algorithms for solving the problem. Before doing that, I would like to use two examples to illustrate what LTA can be used for.

LTA on Danish Beer Market Survey Data. 463 consumers, 11 beer brands. Questionnaire, for each brand: never seen the brand before (s0); seen before, but never tasted (s1); tasted, but do not drink regularly (s2); drink regularly (s3). (Mourad et al. JAIR 2013)

The first example is about a survey of the Danish beer market. 463 consumers were asked about their experiences with 11 beer brands. It was a multiple-choice survey. For each brand, the possible choices were: never seen before, seen before but never tasted, tasted but do not drink regularly, or drink regularly. On this data set, LTA produced this model. We see that the beer brands were divided into 3 groups. One latent variable was introduced for each group.

Why are the variables grouped as such? Responses on brands in each group are strongly correlated. GronTuborg and Carlsberg: main mass-market beers. TuborgClas and CarlSpec: frequent beers, a bit darker than the above. CeresTop, CeresRoyal, Pokal, …: minor local beers. In general, LTA partitions observed variables into groups such that variables in each group are strongly correlated, and the correlations among each group can be properly modeled using one single latent variable.

Why are the beer brands grouped the way they are? Are the groupings interesting? The answer is yes. It turns out that the two brands under H2, GronTuborg and Carlsberg, are the two most popular beers. On the other hand, CarlSpec and TuborgClas under H1 are less popular and are darker than those in the first group. The brands under H0 are minor local beers. This is clearly meaningful. The beer brands are grouped this way because the consumers' responses on brands in each group are strongly correlated. So, intuitively, LTA is a technique that partitions observed variables into groups in such a way that variables in each group are strongly correlated, and the correlations among each group can be properly modeled using one single latent variable.

Multidimensional Clustering. Each latent variable gives a partition of consumers. H1: Class 1: likely to have tasted TuborgClas, CarlSpec and Heineken, but do not drink regularly. Class 2: likely to have seen or tasted the beers, but did not drink regularly. Class 3: likely to drink TuborgClas and CarlSpec regularly. H0 and H2 give two other partitions. In general, LTA is a technique for multiple clustering. In contrast, K-means and mixture models give only one partition.

Each latent variable represents a soft partition of the consumers surveyed. For example, H1 has three possible states. It represents a partition of the consumers into 3 classes. Here is information about the three classes. They are also obviously meaningful. The three classes consist of 36%, 27%, and 37% of the population respectively. Consumers in Class 1 have high probabilities of answering s2, that is, tasted but do not drink regularly. Consumers in Class 2 have high probabilities of answering s1 and s2, i.e., they have seen the beers or have tasted them, but do not drink regularly. Consumers in Class 3 mostly drink the beers regularly. This is the partition given by H1. H2 and H0 give two different partitions. So, LTA is a technique for obtaining multiple clusterings. Traditional clustering methods such as K-means and mixture models give only one partition of the data.
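A "soft" partition means each consumer gets a probability of belonging to each class rather than a hard label. The sketch below illustrates this with Bayes' rule. The class sizes (36%, 27%, 37%) are the ones quoted for H1; the conditional answer probabilities are hypothetical placeholders chosen only to match the qualitative description of the three classes, and only two of the brands are included.

```python
# Class sizes for H1 as quoted in the talk.
prior = [0.36, 0.27, 0.37]

# Hypothetical P(answer | class) for answers s0..s3; one row per class.
cond = {
    "TuborgClas": [[0.05, 0.15, 0.70, 0.10],
                   [0.10, 0.45, 0.40, 0.05],
                   [0.02, 0.03, 0.25, 0.70]],
    "CarlSpec":   [[0.05, 0.20, 0.65, 0.10],
                   [0.15, 0.45, 0.35, 0.05],
                   [0.02, 0.03, 0.25, 0.70]],
}

def posterior(answers):
    """P(class | answers), assuming answers are independent given the class."""
    weights = []
    for c, p in enumerate(prior):
        w = p
        for brand, state in answers.items():
            w *= cond[brand][c][state]
        weights.append(w)
    z = sum(weights)
    return [w / z for w in weights]

# A consumer who drinks both brands regularly (s3 on both).
regular = posterior({"TuborgClas": 3, "CarlSpec": 3})
print(regular)
```

Each consumer is then described by a vector of class probabilities, and H0 and H2 would each contribute a separate vector of their own.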

Unidimensional vs Multidimensional Clustering. Clustering: grouping of objects into clusters such that objects in the same cluster are similar while objects from different clusters are dissimilar.

Let us spend a bit more time on the issue of unidimensional vs multidimensional clustering. We all know that clustering is about grouping objects into clusters such that … Here I have some pictures. Intuitively, how should we cluster them? (Pause.) Obviously, we should divide them into two groups like this. This is quite clear. The result of clustering is often a partition of all the objects.

How to Cluster Those? Now, how do we cluster those? Pause....

How to Cluster Those? Style of picture. Yes, there are two ways. One way is to divide the images into two groups in terms of the style.

How to Cluster Those? Type of object in picture. Another way is to divide them into two groups in terms of the type of objects.

Multidimensional Clustering (Multi-Clustering). Complex data usually have multiple facets and can be meaningfully partitioned in multiple ways. LTA is a model-based method for multidimensional clustering. Other methods: http://www.siam.org/meetings/sdm11/clustering.pdf

So, we have two different ways to partition the images, NOT only one way. This is the concept of multidimensional clustering. In general, complex data usually have multiple facets and can be meaningfully partitioned in multiple ways. When we perform cluster analysis on data, we should NOT restrict ourselves to one partition. Rather, we should look for multiple partitions. LTA is one method for multidimensional clustering. It should be noted that there are other methods for the problem.
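As a toy illustration of the idea, the sketch below partitions the same set of images along two facets. The image names and the facet values ("photo"/"cartoon", "animal"/"vehicle") are hypothetical stand-ins for the pictures on the slides. Here the facets are given as labels, whereas LTA has to discover them from data.

```python
from collections import defaultdict

# Hypothetical images described by (name, style, object type).
images = [
    ("img1", "photo", "animal"),
    ("img2", "photo", "vehicle"),
    ("img3", "cartoon", "animal"),
    ("img4", "cartoon", "vehicle"),
]

def partition_by(items, facet_index):
    """Group item names by the value of one facet."""
    groups = defaultdict(list)
    for item in items:
        groups[item[facet_index]].append(item[0])
    return dict(groups)

by_style = partition_by(images, 1)  # one partition of the images
by_type = partition_by(images, 2)   # a different partition of the same images
print(by_style)
print(by_type)
```

Both partitions cover all four images, yet neither can be recovered from the other, which is why a single clustering cannot capture both.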

Clustering of Variables and Objects. LTA produces a partition of observed variables. For each cluster of variables, it produces a partition of objects. To recap, LTA first produces a partition of observed variables, and then, for each cluster of variables, it produces a partition of objects.

Social Survey Data

// Survey on corruption in Hong Kong and performance of the anti-corruption agency -- ICAC
// 31 questions, 1200 samples
C_City: s0 s1 s2 s3 // very common, quite common, uncommon, very uncommon
C_Gov: s0 s1 s2 s3
C_Bus: s0 s1 s2 s3
Tolerance_C_Gov: s0 s1 s2 s3 // totally intolerable, intolerable, tolerable, totally tolerable
Tolerance_C_Bus: s0 s1 s2 s3
WillingReport_C: s0 s1 s2 // yes, no, depends
LeaveContactInfo: s0 s1 // yes, no
I_EncourageReport: s0 s1 s2 s3 s4 // very sufficient, sufficient, average, ...
I_Effectiveness: s0 s1 s2 s3 s4 // very e, e, a, in-e, very in-e
I_Deterrence: s0 s1 s2 s3 s4 // very sufficient, sufficient, average, ...
…..
-1 -1 -1 0 0 -1 -1 -1 -1 -1 -1 0 -1 -1 -1 0 1 1 -1 -1 2 0 2 2 1 3 1 1 4 1 0 1.0
-1 -1 -1 0 0 -1 -1 1 1 -1 -1 0 0 -1 1 -1 1 3 2 2 0 0 0 2 1 2 0 0 2 1 0 1.0
-1 -1 -1 0 0 -1 -1 2 1 2 0 0 0 2 -1 -1 1 1 1 0 2 0 1 2 -1 2 0 1 2 1 0 1.0
….

The data is from a survey by ICAC, the Hong Kong anti-corruption agency. The survey was designed to gauge people's views on corruption and on the performance of ICAC. There are 31 questions in the survey and hence 31 observed variables in the data. Here are some of the variables. The first one, C_City, is the corruption level in the city. The possible values are very common, quite common, uncommon, and very uncommon. C_Gov and C_Bus mean the corruption level in the government and in the business sector respectively. Tolerance_C_Gov and Tolerance_C_Bus mean tolerance of corruption in the government and in the business sector respectively. I_Effectiveness means whether ICAC is effective, and I_Deterrence means whether ICAC has enough deterrence against corruption. And so on. At the bottom are the answers given by three people. 0, 1, 2, 3, 4 are possible values of the variables, and -1 means a missing value.
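A record in this format can be read with a few lines of code. The sketch below parses the first of the three rows shown above, mapping -1 to a missing value. The trailing 1.0 on each row is not explained in the talk; treating it as a record weight is an assumption, and `parse_record` is a hypothetical helper name, not part of any tool mentioned here.

```python
from typing import List, Optional, Tuple

NUM_QUESTIONS = 31  # 31 observed variables, as stated in the talk
MISSING = -1        # -1 marks a missing answer

def parse_record(line: str) -> Tuple[List[Optional[int]], float]:
    """Split one row into 31 answers (None if missing) and a trailing number.

    Assumption: the final token (1.0 in the rows shown) is a record weight.
    """
    tokens = line.split()
    if len(tokens) != NUM_QUESTIONS + 1:
        raise ValueError(f"expected {NUM_QUESTIONS + 1} tokens, got {len(tokens)}")
    answers = [None if int(t) == MISSING else int(t)
               for t in tokens[:NUM_QUESTIONS]]
    weight = float(tokens[NUM_QUESTIONS])
    return answers, weight

first_row = ("-1 -1 -1 0 0 -1 -1 -1 -1 -1 -1 0 -1 -1 -1 0 1 1 -1 -1 "
             "2 0 2 2 1 3 1 1 4 1 0 1.0")
answers, weight = parse_record(first_row)
num_missing = answers.count(None)
print(num_missing, weight)
```

Keeping missing answers as `None` rather than dropping the row matters here, since many respondents skipped several questions.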

Latent Structure Discovery. Here is the structure of the LTM obtained from the data. The structure is apparently interesting. First, we see that the variables Education, Income, Age and Sex are grouped under Y2. Hence Y2 is about demographic information. Y3 is directly connected to variables on people's tolerance toward corruption and whether they are willing to report corruption. The connection between Y2 and Y3 indicates that people's demographic background influences their attitude toward corruption. This is reasonable. The variables connected to Y4 are whether ICAC is effective, whether it has sufficient deterrence against corruption, and whether it encourages people to report corruption. So, it is about the performance of ICAC. Y6 is about the status of corruption. It is reasonable that Y6 is directly connected to Y4. For example, if people think corruption is severe, they probably will also think that ICAC is not doing a good job. Another variable connected to Y4 is Y5, which is about change, here we have next year and past year. The connection between Y5 and Y4 is also reasonable. For example, if people think ICAC is doing a good job, then they will think things will improve next year. Finally, Y7 is about accountability, because the related observed variables are whether ICAC abused its powers, whether it has too much power, and whether it is impartial. Y8 is about the economic situation. Different latent variables have different cardinalities. For example, Y3 has 3 states and Y2 has 4. Next we examine the meaning of the states of the latent variables.

Y2: Demographic info; Y3: Tolerance toward corruption; Y4: ICAC performance; Y5: Change in level of corruption; Y6: Level of corruption; Y7: ICAC accountability

Multidimensional Clustering. Y2 has 4 states. So, it divides all the people surveyed into 4 clusters. The first cluster consists of 18% of the people. The value of the age variable is always 1, which means the age is between 15 and 24. The income variable mostly takes value s1, which means low income. So, the first cluster can be interpreted as low income youngsters. The second cluster consists of 24% of the people. The value of the variable sex is always 1, which means woman. The income variable mostly takes values s0, s1, and s2, which mean no or low income. So, this cluster can be interpreted as women with no or low income. The third cluster consists of 33% of the people. The income and education variables take values from the top range. So, this is a cluster of people with good education and good income. The fourth cluster consists of 25% of the people. The income variable takes values from the middle range, and the education variable mostly takes values below s4, which means below senior high school. So, this is a cluster of people with poor education and average income.

Y2=s0: Low income youngsters; Y2=s1: Women with no/low income; Y2=s2: people with good education and good income; Y2=s3: people with poor education and average income.

Multidimensional Clustering. Y3=s0: people who find corruption totally intolerable; 57%. Y3=s1: people who find corruption intolerable; 27%. Y3=s2: people who find corruption tolerable; 15%. Interesting finding: Y3=s2: 29+19=48% find C-Gov totally intolerable or intolerable; 5% for C-Bus. Y3=s1: 54% find C-Gov totally intolerable; 2% for C-Bus. Y3=s0: same attitude toward C-Gov and C-Bus. People who are tough on corruption are equally tough toward C-Gov and C-Bus. People who are lenient about corruption are more lenient toward C-Bus than C-Gov.

Y3 divides the people into 3 clusters. According to the probabilities given in the table, the first cluster consists of people who think corruption is totally intolerable, the second cluster consists of people who think corruption is intolerable, and the third cluster consists of people who think corruption is tolerable. Let us take a closer look at the third cluster. The variable Tolerance_C_Gov takes values s0 and s1 with probabilities 29% and 19%, while the variable Tolerance_C_Bus does not take value s0 at all and takes value s1 with probability only 5%. Here s0 means totally intolerable and s1 means intolerable. This means that people in the group are relatively tougher toward corruption in the government than corruption in the business sector. The cluster in the middle is similar in this regard. However, people in the first cluster think corruption in the government and in the business sector are both totally intolerable. This seems to suggest that people who are tough on corruption, i.e., the people in the first cluster, are equally tough toward corruption in the government and in the business sector. However, people who are more lenient toward corruption are more lenient about corruption in the business sector than about corruption in the government. In other words, they are willing to accept corruption in the business sector, but not in the government.

Multidimensional Clustering. Who are the toughest toward corruption among the 4 groups? Y2=s2 (good education and good income): the least tolerant, 4% tolerable. Y2=s3 (poor education and average income): the most tolerant, 32% tolerable. The other two classes are in between.

Earlier, we talked about 4 clusters of people: youngsters, women with no/low income, people with good education and income, and people with poor education/income. Among those 4 groups, who are the toughest toward corruption? Who are the most lenient? It turns out that people with good education/income are the toughest toward corruption. Here is the conditional probability of Y3 given Y2. We see that when Y2=s2, the probability of Y3=s2 is 4%, the lowest among all the 4 clusters. The same probability for Y2=s3 is 32%, the highest among the 4 groups. So, people with poor education and income are the most lenient toward corruption. The other two clusters are in between.

Summary: Latent tree analysis of social survey data can reveal interesting latent structures, interesting clusters, and interesting relationships among the clusters.
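The comparison across groups is a lookup in the conditional distribution of Y3 given Y2, and the law of total probability ties it back to the overall share of Y3=s2. In the sketch below, the cluster sizes of Y2 and the two values 4% and 32% are the ones reported in the talk; the values for Y2=s0 and Y2=s1 are hypothetical, chosen only to lie "in between" as the talk describes.

```python
# P(Y2): cluster sizes reported in the talk.
p_y2 = {"s0": 0.18, "s1": 0.24, "s2": 0.33, "s3": 0.25}

# P(Y3 = s2 | Y2), i.e. the probability of finding corruption tolerable.
# The s2 and s3 entries are from the talk; s0 and s1 are hypothetical.
p_tolerable_given_y2 = {"s0": 0.13, "s1": 0.14, "s2": 0.04, "s3": 0.32}

# Rank the demographic clusters by tolerance.
least_tolerant = min(p_tolerable_given_y2, key=p_tolerable_given_y2.get)
most_tolerant = max(p_tolerable_given_y2, key=p_tolerable_given_y2.get)

# Law of total probability: P(Y3 = s2) = sum over g of P(Y2 = g) * P(Y3 = s2 | Y2 = g).
p_tolerable = sum(p_y2[g] * p_tolerable_given_y2[g] for g in p_y2)

print(least_tolerant, most_tolerant, round(p_tolerable, 3))
```

With these placeholder values the marginal comes out close to the 15% reported for Y3=s2, which is a useful sanity check when reading conditional tables off a learned model.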

More Information

T. Chen, N. L. Zhang, T. F. Liu, Y. Wang, L. K. M. Poon (2012). Model-based multidimensional clustering of categorical data. Artificial Intelligence, 176(1), 2246-2269.

Software: Lantern, http://www.cse.ust.hk/~lzhang/ltm/softwares/Lantern.zip

Illustration: analysis, model interpretation, conditional probability between latent variables.