(Supplementary material) MVA.


1 (Supplementary material) MVA

2 MVA techniques
Interdependence: Exploratory/Confirmatory Factor Analysis, Multidimensional Scaling, Cluster Analysis, Canonical Correlation
Dependence: SEM (Structural Equation Modeling), ANOVA, Discriminant Analysis, Logit Choice Model
Source: Analyzing Multivariate Data by J.M. Lattin et al.

3 PCA basics

4 1. Concept
Significance. PCA: an orthogonal projection of highly correlated variables onto principal components → the linear transformation is defined so that the first principal component has the largest possible variance. PC: a set of values of linearly uncorrelated variables.
Applications
Principle

5 2. Background Math
2.1 Statistics: mean; standard deviation and variance; covariance and covariance matrix
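The statistics listed on this slide can be checked by hand against R's built-in functions. The sketch below uses two short made-up vectors (the values are illustrative assumptions, not data from the slides) and the usual sample formulas with an n - 1 denominator.

```r
# Two small, made-up samples (illustrative values only).
x <- c(2, 4, 4, 5, 7)
y <- c(1, 3, 2, 5, 6)
n <- length(x)

m   <- mean(x)                                  # sample mean
v   <- sum((x - m)^2) / (n - 1)                 # sample variance
s   <- sqrt(v)                                  # standard deviation
cxy <- sum((x - m) * (y - mean(y))) / (n - 1)   # sample covariance of x and y

# The hand-computed values agree with R's built-ins.
stopifnot(isTRUE(all.equal(v, var(x))),
          isTRUE(all.equal(s, sd(x))),
          isTRUE(all.equal(cxy, cov(x, y))))

# Covariance matrix: variances on the diagonal, cov(x, y) off the diagonal.
C <- cov(cbind(x, y))
print(C)
```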

6 Covariance

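7 For 2-dimensional data: cov(x,y)
For 3-dimensional data: cov(x,y), cov(x,z), cov(y,z). For an n-dimensional data set there are $\frac{n!}{(n-2)! \cdot 2}$ different covariance values → so the covariance matrix for a data set with n dimensions is defined as $C^{n \times n} = (c_{i,j})$ with $c_{i,j} = \mathrm{cov}(\mathrm{Dim}_i, \mathrm{Dim}_j)$, where $\mathrm{Dim}_i$ is the i-th dimension.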
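As a quick check of the counting argument, the sketch below builds a hypothetical 3-dimensional data set from random numbers (an assumption made purely for illustration) and confirms that its covariance matrix is n × n, symmetric, and holds n!/((n-2)!·2) distinct off-diagonal covariances.

```r
set.seed(1)
# Hypothetical 3-dimensional data set: 20 observations of x, y and z.
d <- data.frame(x = rnorm(20), y = rnorm(20), z = rnorm(20))

n       <- ncol(d)
n_pairs <- factorial(n) / (factorial(n - 2) * 2)   # number of distinct covariances

C <- cov(d)   # n x n matrix: variances on the diagonal, covariances elsewhere
print(C)

# The entries above the diagonal are exactly the distinct pairs:
# cov(x, y), cov(x, z) and cov(y, z).
print(C[upper.tri(C)])
```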

8 2.2 Matrix Algebra: eigenvectors and eigenvalues

9 Eigenvector? A non-zero vector that, after being multiplied by the matrix, remains parallel to the original vector. To keep eigenvectors standard, we usually scale each one to length 1, so that all eigenvectors have the same length. Illustration: a non-eigenvector versus an eigenvector.
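The contrast between a non-eigenvector and an eigenvector can be reproduced numerically. The 2 × 2 matrix and the two vectors below are assumed for illustration and are not taken from the slide.

```r
A <- matrix(c(2, 3,
              2, 1), nrow = 2, byrow = TRUE)

v_eig <- c(3, 2)   # an eigenvector of A
v_non <- c(1, 3)   # not an eigenvector of A

A %*% v_eig   # (12, 8) = 4 * (3, 2): still parallel, so the eigenvalue is 4
A %*% v_non   # (11, 5): no longer parallel to (1, 3)

e <- eigen(A)
e$values               # eigenvalues of A
colSums(e$vectors^2)   # eigen() returns eigenvectors scaled to length 1
```

Because `eigen()` already normalises each eigenvector to unit length, no extra rescaling step is needed before using them.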

10 3. PCA 3.1 Procedure
Step 1: Obtain and prepare the data
Step 2: Subtract the mean
Step 3: Compute the covariance matrix
Step 4: Compute the eigenvectors and eigenvalues of the covariance matrix
Step 5: Choose components and form the feature vector
Step 6: Derive the new data set
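The six steps can be sketched in base R as follows. The choice of the built-in iris measurements as input and of k = 2 retained components are assumptions made for illustration. The final lines cross-check the result against `prcomp()`, which may flip the sign of individual components.

```r
# Step 1: get the data (the four numeric columns of the built-in iris data).
X <- as.matrix(iris[, 1:4])

# Step 2: subtract the mean of each column.
X_adj <- scale(X, center = TRUE, scale = FALSE)

# Step 3: covariance matrix.
C <- cov(X_adj)

# Step 4: eigenvectors and eigenvalues of the covariance matrix.
e <- eigen(C)   # symmetric input: eigenvalues are returned in decreasing order

# Step 5: keep the k leading eigenvectors as the feature vector.
k <- 2
feature_vector <- e$vectors[, 1:k]
explained <- e$values / sum(e$values)   # share of total variance per component

# Step 6: project the mean-adjusted data onto the chosen components.
scores <- X_adj %*% feature_vector

# Cross-check with prcomp(); signs of components may differ.
p <- prcomp(X)
print(round(explained, 3))
```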

11 Choosing components and forming the feature vector
The eigenvector with the highest eigenvalue is the principal component of the data set. The remaining ones can be omitted…

12 Step 6: Deriving the new data set
The new data set is the product of RowFeatureVector and RowDataAdjust. RowFeatureVector = the matrix with the eigenvectors in the columns, transposed so that the eigenvectors are in the rows, with the most significant eigenvector at the top. RowDataAdjust = the mean-adjusted data transposed, i.e. the data items are in each column, with each row holding a separate dimension. → The original data are re-expressed in terms of the vectors we selected. The patterns are the lines that most closely describe the relationships between the data.
Getting the old data back
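A minimal sketch of the row-oriented transform and of getting the old data back, assuming the iris measurements as input and k = 2 retained eigenvectors. The names `FinalData` and `RowDataApprox` are introduced here for illustration. Because the eigenvectors are orthonormal, transposing RowFeatureVector undoes the rotation. With every eigenvector kept, the recovery is exact. With fewer, it is an approximation.

```r
X  <- as.matrix(iris[, 1:4])   # 150 data items, 4 dimensions
mu <- colMeans(X)
e  <- eigen(cov(X))            # eigenvectors in columns, largest eigenvalue first

RowDataAdjust <- t(sweep(X, 2, mu))   # 4 x 150: one dimension per row

k <- 2
RowFeatureVector <- t(e$vectors[, 1:k])           # k x 4: eigenvectors as rows
FinalData <- RowFeatureVector %*% RowDataAdjust   # k x 150: the new data set

# Getting the old data back: rotate back, then add the mean of each dimension
# (mu is recycled down every column of the 4 x 150 matrix).
RowDataApprox <- t(RowFeatureVector) %*% FinalData + mu

# With all eigenvectors kept, the original data are recovered exactly.
F_all     <- t(e$vectors)
recovered <- t(F_all) %*% (F_all %*% RowDataAdjust) + mu

max(abs(t(recovered) - X))       # numerically zero
sum((t(RowDataApprox) - X)^2)    # information lost by dropping components
```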

13 Biplot: shows the proportions of each variable along the 2 PCs. Scree plot.
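Both a biplot and a scree plot of the component variances can be drawn with base R. The sketch below assumes the built-in USArrests data and standardised variables. Neither choice comes from the slide.

```r
# PCA on standardised variables of a built-in data set (an assumed example).
p <- prcomp(USArrests, scale. = TRUE)

# Biplot: observations as points, variables as arrows, on PC1 and PC2.
biplot(p)

# Scree plot: variance of each component, largest first.
screeplot(p, type = "lines")

# Proportion of total variance carried by each component.
prop_var <- p$sdev^2 / sum(p$sdev^2)
print(round(prop_var, 3))
```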

14 Distances

15 Mahalanobis Distance

16 Mahalanobis distance has the following properties:
It accounts for the fact that the variances in each direction are different. It accounts for the covariance between variables. It reduces to the familiar Euclidean distance for uncorrelated variables with unit variance.
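These properties can be checked with `mahalanobis()` from base R's stats package, which returns the squared distance. The correlated sample and the test point below are made up for illustration.

```r
set.seed(42)
# Hypothetical correlated 2-D sample.
x1 <- rnorm(200)
X  <- cbind(x1 = x1, x2 = 0.8 * x1 + rnorm(200, sd = 0.5))
mu <- colMeans(X)
S  <- cov(X)

pt <- c(1, -1)   # a point that runs against the positive correlation

# stats::mahalanobis() returns the SQUARED distance, so take the square root.
d_mahal  <- sqrt(mahalanobis(pt, center = mu, cov = S))
d_manual <- sqrt(drop(t(pt - mu) %*% solve(S) %*% (pt - mu)))

# With uncorrelated, unit-variance variables (identity covariance) the
# Mahalanobis distance reduces to the Euclidean distance.
d_unit <- sqrt(mahalanobis(pt, center = c(0, 0), cov = diag(2)))
d_eucl <- sqrt(sum(pt^2))

c(mahalanobis = d_mahal, manual = d_manual, unit = d_unit, euclidean = d_eucl)
```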

17 SVD

18 Three perspectives
A method for transforming correlated variables into a set of uncorrelated ones → better exposes the various relationships among the original data items.
A method for identifying and ordering the dimensions along which data points exhibit the most variation.
A method for data reduction.
Significance of SVD: dimensionality reduction → exposes the substructure of the original data more clearly and orders it from most variation to least.
Typical application: NLP. The main relationships latent in the documents are identified while variation below a given threshold is ignored, which allows a large reduction in dimensionality.

19 Method
A rectangular matrix A can be broken into the product of 3 matrices, $A = U S V^T$: an orthogonal matrix U, a diagonal matrix S, and the transpose of an orthogonal matrix V, where $U^T U = I$ and $V^T V = I$. The columns of U are orthonormal eigenvectors of $A A^T$, the columns of V are orthonormal eigenvectors of $A^T A$, and S is a diagonal matrix containing the square roots of the eigenvalues from U or V.
Example:
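The decomposition can be verified numerically with base R's `svd()`. The 2 × 3 matrix below is an assumed example, not necessarily the one shown on the slide. Note that `svd()` returns the thin form, so V has only as many columns as there are singular values.

```r
A <- matrix(c( 3, 1, 1,
              -1, 3, 1), nrow = 2, byrow = TRUE)   # assumed example matrix

s <- svd(A)
U <- s$u          # 2 x 2, orthonormal columns
S <- diag(s$d)    # 2 x 2, singular values on the diagonal
V <- s$v          # 3 x 2 (thin SVD), orthonormal columns

A_rebuilt <- U %*% S %*% t(V)   # reproduces A

# Singular values are the square roots of the eigenvalues of A A^T.
ev <- eigen(A %*% t(A))$values
rbind(d_squared = s$d^2, eigenvalues = ev)

# Data reduction: keep only the largest singular value (rank-1 approximation).
A1  <- s$d[1] * (U[, 1] %*% t(V[, 1]))
err <- sqrt(sum((A - A1)^2))    # equals the dropped singular value
```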

20 Data cleaning and EDA

21 Data cleaning

22

23 1. Technical aspects
(1) Reading data: read.table, read.delim, read.delim2, read.csv, read.csv2, read.fwf. A freshly read data.frame should always be inspected with functions like head, str, and summary.
(2) Type conversion. Coercion: as.numeric, as.logical, as.integer, as.factor, as.character, as.ordered. Factor conversion: factor(). Date conversion: library(lubridate).
(3) Strings and encoding: Sys.getlocale("LC_CTYPE"); f <- file("myUTF16file.txt", encoding = "UTF-16")
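A minimal sketch of the read, inspect, and convert sequence. The inline CSV text stands in for a real file and its contents are hypothetical. Base R's `as.Date()` is used for the date step here to avoid an extra dependency, where the slide points to lubridate.

```r
# Hypothetical raw input, supplied inline instead of a file path.
raw <- "name,age,height
Ann,34,1.70
Bob,unknown,1.82
Cid,29,"

person <- read.csv(text = raw, stringsAsFactors = FALSE)

# Always inspect a freshly read data.frame.
head(person)
str(person)       # age was read as character because of the "unknown" entry
summary(person)

# Coercion: values that cannot be interpreted as numbers become NA
# (as.numeric() also emits a warning, silenced here).
person$age <- suppressWarnings(as.numeric(person$age))

# Factor conversion: map numeric codes to labelled levels.
sex <- factor(c(2, 1, 1), levels = c(1, 2), labels = c("male", "female"))

# Date conversion with an explicit format.
d <- as.Date("17-02-2013", format = "%d-%m-%Y")
```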

24 2. Consistent Data
(1) Missing values: na.rm = TRUE; (persons_complete <- na.omit(person))
(2) Special values, e.g. is.special <- function(x){ if (is.numeric(x)) !is.finite(x) else is.na(x) }
(3) Outliers
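The three checks can be tried on small made-up inputs. The vectors and the `person` data frame below are hypothetical, and `boxplot.stats()` is one common base R way to flag outliers, not necessarily the method the slides intend.

```r
# (1) Missing values
age <- c(23, 16, NA)
mean(age)                  # NA: missing values propagate
mean(age, na.rm = TRUE)    # 19.5

person <- data.frame(age = c(21, 42, NA), height = c(6.0, NA, 5.7))
(persons_complete <- na.omit(person))   # keeps only fully observed rows

# (2) Special values: NA, NaN, Inf and -Inf
is.special <- function(x) {
  if (is.numeric(x)) !is.finite(x) else is.na(x)
}
x <- c(1, Inf, NaN, NA, -Inf, 5)
special <- is.special(x)

# (3) Outliers: points beyond 1.5 times the box length from the hinges
z   <- c(1:10, 20, 30)
out <- boxplot.stats(z)$out
```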

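25 3. Correction: applying replacement values (imputation)
Mean imputation (impute() and is.imputed() come from the Hmisc package):
library(Hmisc)
x <- 1:5 # create a vector...
x[2] <- NA # ...with an empty value
x <- impute(x, mean) # the NA is replaced by the mean of the observed values
x # the imputed value is printed with a trailing *
is.imputed(x) # TRUE only for the imputed element
Ratio imputation (y is assumed to be a fully observed covariate):
I <- is.na(x)
R <- sum(x[!I])/sum(y[!I])
x[I] <- R * y[I]
Model-based imputation:
data(iris)
iris$Sepal.Length[1:10] <- NA
model <- lm(Sepal.Length ~ Sepal.Width + Petal.Width, data = iris)
I <- is.na(iris$Sepal.Length)
iris$Sepal.Length[I] <- predict(model, newdata = iris[I, ])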

26 EDA

27 Exploratory Data Analysis (EDA)
an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to maximize insight into a data set; uncover underlying structure; extract important variables; detect outliers and anomalies; test underlying assumptions; develop parsimonious models; and determine optimal factor settings.

28 Three approaches
For classical analysis, the sequence is Problem => Data => Model => Analysis => Conclusions
For EDA, the sequence is Problem => Data => Analysis => Model => Conclusions
For Bayesian, the sequence is Problem => Data => Model => Prior Distribution => Analysis => Conclusions

29 dplyr basics: the six main functions and how to use them
Pick observations by their values (filter()).
Reorder the rows (arrange()).
Pick variables by their names (select()).
Create new variables with functions of existing variables (mutate()).
Collapse many values down to a single summary (summarise()).
+ group_by(): changes the scope of each function from operating on the entire dataset to operating on it group-by-group.
Usage: The first argument is a data frame. The subsequent arguments describe what to do with the data frame, using the variable names (without quotes). The result is a new data frame.
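The six verbs can be chained in one pipeline. This sketch assumes the dplyr package is installed and uses the built-in mtcars data. The derived column `wt_kg` is a name introduced here for illustration.

```r
library(dplyr)

result <- mtcars %>%
  filter(cyl %in% c(4, 6)) %>%              # pick observations by their values
  select(mpg, cyl, wt) %>%                  # pick variables by their names
  mutate(wt_kg = wt * 1000 * 0.4536) %>%    # new variable (wt is in 1000 lb)
  group_by(cyl) %>%                         # operate group-by-group from here on
  summarise(n = n(),                        # collapse each group to one row
            mean_mpg = mean(mpg),
            mean_wt_kg = mean(wt_kg)) %>%
  arrange(desc(mean_mpg))                   # reorder the rows

print(result)
```

Each step takes a data frame as its first argument (supplied by the pipe) and returns a new data frame, which is what makes the chaining possible.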

30 Logical operations in dplyr's filter()
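A short sketch of how the logical operators combine inside `filter()`, assuming dplyr is installed and using two arbitrary conditions on the built-in mtcars data.

```r
library(dplyr)

# Two conditions on mtcars: four cylinders, manual transmission (am == 1).
both    <- filter(mtcars, cyl == 4, am == 1)         # comma means AND
both2   <- filter(mtcars, cyl == 4 & am == 1)        # explicit AND
either  <- filter(mtcars, cyl == 4 | am == 1)        # OR
only1   <- filter(mtcars, xor(cyl == 4, am == 1))    # exactly one of the two
neither <- filter(mtcars, !(cyl == 4 | am == 1))     # NOT (x OR y)
neither2 <- filter(mtcars, cyl != 4, am != 1)        # same rows, by De Morgan

c(both = nrow(both), either = nrow(either),
  only1 = nrow(only1), neither = nrow(neither))
```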

31 Tidy data set

32 Rules for a tidy dataset:
Each variable must have its own column. Each observation must have its own row. Each value must have its own cell.

33
> table4a %>%
+   gather(`1999`, `2000`, key = "year", value = "cases")
# A tibble: 6 × 3
  country     year   cases
  <chr>       <chr>  <int>
1 Afghanistan 1999     745
2 Brazil      1999   37737
3 China       1999  212258
4 Afghanistan 2000    2666
5 Brazil      2000   80488
6 China       2000  213766
> table4a
# A tibble: 3 × 3
  country     `1999` `2000`
* <chr>        <int>  <int>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766
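In tidyr 1.0 and later, `gather()` is superseded by `pivot_longer()`. The sketch below, which assumes tidyr is installed, performs the same reshaping on the `table4a` data that ships with the package. Note that `pivot_longer()` orders the result by original row rather than by year.

```r
library(tidyr)

# Same reshaping as gather(`1999`, `2000`, key = "year", value = "cases").
long <- pivot_longer(table4a,
                     cols = c(`1999`, `2000`),
                     names_to = "year",
                     values_to = "cases")

print(long)   # one row per country-year combination
```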

34 Relational data
A primary key uniquely identifies an observation in its own table. (ex) planes$tailnum is a primary key.
A foreign key uniquely identifies an observation in another table. (ex) flights$tailnum is a foreign key.
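The two kinds of key can be checked and used as follows. The miniature `planes` and `flights` tables are hypothetical stand-ins with made-up values, not the real data sets, and dplyr is assumed to be installed.

```r
library(dplyr)

# Hypothetical miniature tables (made-up values).
planes  <- tibble(tailnum = c("N101", "N202", "N303"),
                  seats   = c(150, 180, 95))
flights <- tibble(flight  = c(1, 2, 3, 4),
                  tailnum = c("N101", "N101", "N303", "N999"))

# A primary key must be unique in its own table: no value may occur twice.
dup_keys <- planes %>% count(tailnum) %>% filter(n > 1)   # zero rows expected

# The foreign key flights$tailnum links each flight to one row of planes.
joined <- left_join(flights, planes, by = "tailnum")

# Flights whose tailnum has no match in planes.
unmatched <- anti_join(flights, planes, by = "tailnum")

print(joined)
```

In `flights`, the same `tailnum` may appear on many rows, which is expected for a foreign key. Only the table that owns the key needs uniqueness.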

