(Supplementary Material) MVA

MVA techniques
Interdependence techniques: Exploratory/Confirmatory Factor Analysis, Multidimensional Scaling, Cluster Analysis, Canonical Correlation
Dependence techniques: SEM (Structural Equation Modeling), ANOVA, Discriminant Analysis, Logit Choice Model
Source: Analyzing Multivariate Data, by J. M. Lattin et al.

PCA Basics

1. Concept
Definition: PCA is an orthogonal projection of highly correlated variables onto principal components; the linear transformation is defined in such a way that the first principal component has the largest possible variance. A principal component (PC) is a set of values of linearly uncorrelated variables.
Uses and underlying principle

2. Background Math
2.1 Statistics: mean, standard deviation & variance, covariance & the covariance matrix

Covariance

For 2-dimensional data there is a single covariance, cov(x, y); for 3-dimensional data there are cov(x, y), cov(x, z), and cov(y, z). In general, an n-dimensional data set has n! / ((n-2)! * 2) distinct covariance values. So the covariance matrix for a data set with n dimensions is the n x n matrix C whose (i, j) entry is c_ij = cov(Dim_i, Dim_j), with the variances on the diagonal.
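As a hedged sketch of how these covariances fit together, the R snippet below builds the covariance matrix of a small 3-dimensional data set with the built-in cov() function; the data frame d and its values are made up purely for illustration.

d <- data.frame(x = c(2.5, 0.5, 2.2, 1.9, 3.1),
                y = c(2.4, 0.7, 2.9, 2.2, 3.0),
                z = c(1.0, 0.3, 1.2, 0.8, 1.4))
cov(d)           # full 3 x 3 covariance matrix (variances on the diagonal)
cov(d$x, d$y)    # one off-diagonal entry, cov(x, y)
choose(3, 2)     # number of distinct covariances: 3! / ((3-2)! * 2) = 3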

2.2 Matrix Algebra: eigenvectors and eigenvalues

Eigenvector? A non-zero vector that, after being multiplied by the matrix, remains parallel to the original vector. To keep eigenvectors standard, we usually scale each one to have length 1, so that all eigenvectors have the same length. Example: multiplying a non-eigenvector by the matrix changes its direction, while an eigenvector keeps its direction.
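A minimal illustration of this property in R, using the built-in eigen() function on a small symmetric matrix (the matrix values are chosen only for demonstration):

A <- matrix(c(2, 1,
              1, 2), nrow = 2, byrow = TRUE)
e <- eigen(A)
e$values                      # eigenvalues, largest first
e$vectors                     # eigenvectors in the columns, each scaled to length 1
A %*% e$vectors[, 1]          # A v ...
e$values[1] * e$vectors[, 1]  # ... equals lambda v, so v stays parallel to itself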

3. PCA
3.1 Procedure
Step 1: obtain and prepare the data
Steps 2-3: subtract the mean and compute the covariance matrix
Step 4: compute the eigenvectors and eigenvalues of the covariance matrix
Step 5: choose components and form the feature vector
Step 6: derive the new data set

Choosing components and forming the feature vector: the eigenvector with the highest eigenvalue is the principal component of the data set; the remaining eigenvectors can be omitted…

Step 6: deriving the new data set. FinalData = RowFeatureVector x RowDataAdjust, where RowFeatureVector is the matrix with the eigenvectors in the columns, transposed so that the most significant eigenvector is at the top, and RowDataAdjust is the mean-adjusted data, transposed, i.e. the data items are in the columns, with each row holding a separate dimension. This expresses the original data in terms of the vectors we chose; the patterns are the lines that most closely describe the relationships between the data.
Getting the old data back: reverse the transformation, RowOriginalData = RowFeatureVector^T x FinalData, and then add the original mean back.
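Here is a hedged sketch of Steps 1-6 and the back-transformation in R, done "by hand" with cov() and eigen(); the toy x/y values are illustrative only, and the variable names simply mirror the RowFeatureVector / RowDataAdjust terminology above.

x <- c(2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1)
y <- c(2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9)
X <- cbind(x, y)                                      # Step 1: the data
mu   <- colMeans(X)
Xadj <- sweep(X, 2, mu)                               # Step 2: subtract the mean
C    <- cov(Xadj)                                     # Step 3: covariance matrix
e    <- eigen(C)                                      # Step 4: eigenvectors and eigenvalues
RowFeatureVector <- t(e$vectors[, 1, drop = FALSE])   # Step 5: keep only the top eigenvector
RowDataAdjust    <- t(Xadj)
FinalData <- RowFeatureVector %*% RowDataAdjust       # Step 6: the new (1-dimensional) data set

# Getting the old data back (approximate, since one component was dropped)
RowOriginalData <- t(RowFeatureVector) %*% FinalData
Xback <- sweep(t(RowOriginalData), 2, mu, FUN = "+")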

A biplot shows the proportions of each variable along the two PCs; a scree plot shows the variance captured by each component in decreasing order.
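As an illustration (not part of the original slide), R's prcomp() produces both plots directly; the iris measurements are used here only as convenient example data.

p <- prcomp(iris[, 1:4], scale. = TRUE)   # centre, scale, and rotate
summary(p)                                # proportion of variance per component
biplot(p)                                 # observations plus variable directions on PC1/PC2
screeplot(p, type = "lines")              # scree plot of the component variances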

Distances

Mahalanobis Distance

Mahalanobis distance has the following properties: It accounts for the fact that the variances in each direction are different. It accounts for the covariance between variables. It reduces to the familiar Euclidean distance for uncorrelated variables with unit variance.
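A small hedged example using R's built-in mahalanobis(), which returns the squared distance (x - mu)' S^-1 (x - mu); the iris columns are used purely for illustration.

X  <- iris[, 1:4]
mu <- colMeans(X)
S  <- cov(X)
d2 <- mahalanobis(X, center = mu, cov = S)   # squared Mahalanobis distance of each row from mu
head(sqrt(d2))                               # the distances themselves
# With S replaced by the identity matrix this reduces to the Euclidean distance from mu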

SVD

Three views of SVD: (1) a method for transforming correlated variables into a set of uncorrelated ones that better exposes the various relationships among the original data items; (2) a method for identifying and ordering the dimensions along which the data points exhibit the most variation; (3) a method for data reduction.
Significance of SVD: dimensionality reduction that exposes the substructure of the original data more clearly and orders it from the most variation to the least.
Representative use: NLP, where the dimensionality is reduced drastically while the main relationships latent in the documents are preserved, at the cost of ignoring variation below a certain threshold.

Method: a rectangular matrix A can be broken into the product of three matrices, A = U S V^T, where U is an orthogonal matrix, S is a diagonal matrix, and V^T is the transpose of an orthogonal matrix V, with U^T U = I and V^T V = I. The columns of U are the orthonormal eigenvectors of A A^T, the columns of V are the orthonormal eigenvectors of A^T A, and S is a diagonal matrix containing the square roots of the eigenvalues from U or V.
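A small hedged example of this decomposition with R's built-in svd(); the 2 x 3 matrix below is chosen only to show the shapes involved.

A <- matrix(c( 3, 1, 1,
              -1, 3, 1), nrow = 2, byrow = TRUE)
s <- svd(A)
s$d                                          # singular values: square roots of eigenvalues of A %*% t(A)
s$u                                          # columns: orthonormal eigenvectors of A %*% t(A)
s$v                                          # columns: orthonormal eigenvectors of t(A) %*% A
all.equal(A, s$u %*% diag(s$d) %*% t(s$v))   # A = U S V^T (up to rounding)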

Data Cleaning and EDA

Data Cleaning

1. Technical aspects
(1) Reading data: read.table, read.delim, read.delim2, read.csv, read.csv2, read.fwf. A freshly read data.frame should always be inspected with functions like head, str, and summary.
(2) Type conversion: coercion with as.numeric, as.logical, as.integer, as.factor, as.character, as.ordered; factor conversion with factor(); date conversion with library(lubridate).
(3) Strings and encoding: check the locale with Sys.getlocale("LC_CTYPE"); read a file in a specific encoding, e.g. f <- file("myUTF16file.txt", encoding = "UTF-16").
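A hedged sketch that strings the reading and coercion steps together; the file name mydata.csv and its columns (age, member, grade, joined) are hypothetical.

dat <- read.csv("mydata.csv", stringsAsFactors = FALSE)   # (1) read the data
head(dat); str(dat); summary(dat)                         # always inspect a fresh data.frame

dat$age    <- as.numeric(dat$age)                         # (2) coercion
dat$member <- as.logical(dat$member)
dat$grade  <- factor(dat$grade, levels = c("low", "mid", "high"), ordered = TRUE)
library(lubridate)
dat$joined <- ymd(dat$joined)                             # e.g. "2016-03-01" -> Date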

2. Consistent data
(1) Missing values: e.g. na.rm = TRUE in summary functions, or persons_complete <- na.omit(person) to drop incomplete rows.
(2) Special values (NA, NaN, Inf), e.g.:
is.special <- function(x) {
  if (is.numeric(x)) !is.finite(x) else is.na(x)
}
(3) Outliers
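A brief hedged demonstration of the is.special() helper above, applied column by column to a made-up data frame.

df <- data.frame(x = c(1, Inf, NA, 4),
                 y = c("a", NA, "c", "d"),
                 stringsAsFactors = FALSE)
sapply(df, function(col) sum(is.special(col)))   # how many NA/NaN/Inf values per column
ok <- !apply(sapply(df, is.special), 1, any)     # rows with no special values at all
df[ok, ]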

3. Imputation (applying substitute values)

# Mean imputation with impute() from the Hmisc package
library(Hmisc)
x <- 1:5        # create a vector...
x[2] <- NA      # ...with an empty value
x <- impute(x, mean)
x
##    1     2    3    4    5
## 1.00 3.25* 3.00 4.00 5.00
is.imputed(x)   # TRUE for the positions that were filled in

# Ratio imputation (separate snippet: assumes x has missing values and y is a related, complete variable)
I <- is.na(x)
R <- sum(x[!I]) / sum(y[!I])
x[I] <- R * y[I]

# Regression (model-based) imputation
data(iris)
iris$Sepal.Length[1:10] <- NA
model <- lm(Sepal.Length ~ Sepal.Width + Petal.Width, data = iris)
I <- is.na(iris$Sepal.Length)
iris$Sepal.Length[I] <- predict(model, newdata = iris[I, ])

EDA

Exploratory Data Analysis (EDA): an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to maximize insight into a data set; uncover underlying structure; extract important variables; detect outliers and anomalies; test underlying assumptions; develop parsimonious models; and determine optimal factor settings.

Three approaches to analysis
For classical analysis, the sequence is: Problem => Data => Model => Analysis => Conclusions
For EDA, the sequence is: Problem => Data => Analysis => Model => Conclusions
For Bayesian analysis, the sequence is: Problem => Data => Model => Prior Distribution => Analysis => Conclusions

dplyr basics
The six main verbs:
Pick observations by their values (filter()).
Reorder the rows (arrange()).
Pick variables by their names (select()).
Create new variables as functions of existing variables (mutate()).
Collapse many values down to a single summary (summarise()).
In addition, group_by() changes the scope of each function from operating on the entire dataset to operating on it group by group.
Usage: the first argument is a data frame; the subsequent arguments describe what to do with the data frame, using the variable names (without quotes); the result is a new data frame.
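A short hedged example chaining all six verbs on the built-in mtcars data (the column choices are arbitrary).

library(dplyr)
mtcars %>%
  filter(cyl != 8) %>%                        # pick observations by value
  select(mpg, cyl, wt) %>%                    # pick variables by name
  mutate(wt_kg = wt * 453.6) %>%              # new variable (wt is in 1000 lb)
  arrange(desc(mpg)) %>%                      # reorder rows
  group_by(cyl) %>%                           # work group by group from here on
  summarise(mean_mpg = mean(mpg), n = n())    # collapse each group to one row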

Logical operations in dplyr's filter()
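For illustration (a hedged sketch, again on mtcars), the usual operators &, |, !, and %in% can be combined inside filter().

library(dplyr)
filter(mtcars, cyl == 4 & mpg > 30)     # both conditions must hold
filter(mtcars, cyl == 6 | cyl == 8)     # either condition may hold
filter(mtcars, cyl %in% c(6, 8))        # shorthand for the line above
filter(mtcars, !(gear == 3))            # negation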

Tidy data set

Rules for a tidy dataset:
Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.

> table4a %>%
+   gather(`1999`, `2000`, key = "year", value = "cases")
# A tibble: 6 × 3
      country  year  cases
        <chr> <chr>  <int>
1 Afghanistan  1999    745
2      Brazil  1999  37737
3       China  1999 212258
4 Afghanistan  2000   2666
5      Brazil  2000  80488
6       China  2000 213766

> table4a
# A tibble: 3 × 3
      country `1999` `2000`
*       <chr>  <int>  <int>
1 Afghanistan    745   2666
2      Brazil  37737  80488
3       China 212258 213766

Relational data
A primary key uniquely identifies an observation in its own table, e.g. planes$tailnum is a primary key.
A foreign key uniquely identifies an observation in another table, e.g. flights$tailnum is a foreign key.
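As a hedged illustration of using these keys, here is a dplyr join between the two nycflights13 tables mentioned above (assuming that package is installed).

library(dplyr)
library(nycflights13)
flights %>%
  left_join(planes, by = "tailnum") %>%    # match each flight's foreign key to the planes primary key
  select(carrier, tailnum, manufacturer, model) %>%
  head()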