Multivariate Data and Matrix Algebra Review BMTRY 726 5/15/2018
Syllabus
Instructor and contact information: Bethany Wolf, 135 Cannon Place, Office 302B, 876-1940, wolfb@musc.edu
Office hours: Monday 2-3 or by appointment
Grading: Grades will be based on assigned problem sets, a mid-term exam, class participation, and a final project. Problem sets will require active manipulation of datasets provided by the instructor using standard statistical packages (e.g. R and SAS). Class participation will include presenting journal articles.
The contribution of each component to the final course grade is as follows:
Homework assignments: 50%
Mid-term exam: 20%
Final project: 20%
Class participation: 10%
A few things before we begin
May 24-25th I will be teaching a Summer Institute on Survival Analysis, which we will have in lieu of class on the 24th.
May 29th and May 31st we are in EL115.
Class participation discussion
What is ‘Multivariate’ Data?
Data in which each sampling unit contributes to more than one outcome. For example….
Sampling Unit
Cancer patients: Serum concentrations on a panel of protein markers are collected in chemotherapy patients.
Smoking cessation participants: Collect background information and smoking behavior at multiple visits.
Post-operative patient outcome: Multiple measures of how a patient is doing post-operatively: patient self-reported pain, opioid consumption, ICU/Hospital length of stay.
Diabetics: Each subject assigned to different glucose control option (medication, diet, diet and medication). Fasting blood glucose is monitored at 0, 3, 6, 9, 12, and 15 months.
Goals of Multivariate Analysis Data reduction and structural simplification Say we collect p biological markers to examine patient response to chemotherapy. Ideally we might like to summarize patient response as some simple combination of the markers. How can variation in the p markers be summarized?
Goals of Multivariate Analysis Sorting and grouping data Participants are enrolled in a smoking cessation program for several years Information about the background of each subject and smoking behavior at multiple visits Some patients quit while others do not Can we use the background and smoking behavior information to classify those that quit and those that do not in order to screen future participants?
Goals of Multivariate Analysis
Investigating dependence among variables
Subjects take a standardized test with different categories of questions: sentence completion, number sequences, orientation of patterns, arithmetic (etc.).
Can correlation among scores be attributed to variation in one or more unobserved factors (intelligence, cognitive ability, critical thinking)?
Goals of Multivariate Analysis Prediction based on relationship between variables We conduct a microarray experiment to compare tumor and healthy tissue We want to develop a reliable classification tool based on the gene expression information from our experiment
Goals of Multivariate Analysis Hypothesis testing Participants in a diabetes study are placed into one of three treatment groups Fasting blood glucose is evaluated at 0, 3, 6, 9, 12, and 15 months We want to test the hypothesis that treatment groups are different.
Multivariate Data Properties
What property/ies of multivariate data make commonly used statistical approaches inappropriate?
Notation & Data Organization
Consider an example where we have 15 tumor markers collected on 30 tissue samples. The 15 markers are variables and our samples represent the subjects in the data. These data can most easily be expressed as a 30 by 15 array/matrix.
Notation & Data Organization More generally, let i = 1, 2,…, n represent the unique samples And let j = 1, 2,…, p represent a set of variables collected in a study
Random Vectors
Each experimental unit has multiple outcome measures, thus we can arrange the ith subject’s j = 1, 2,…, p outcomes as a vector Yi = (Yi1, Yi2,…, Yip)’.
Yi is a random vector, and its individual elements are random variables.
p denotes the number of outcomes for subject i.
i = 1, 2,…, n indexes the subjects, where n is the number of subjects.
Descriptive Statistics We can calculate familiar descriptive statistics for this array Mean Variance Covariance (Correlation)
Arranged as Arrays Means Covariance
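For reference, the usual sample versions of these statistics can be written as below. This is a sketch under assumed notation: x_{ij} is the value of variable j on sample i, and the n - 1 divisor is chosen because it is what R's var() uses; the slides may use different symbols or the divisor n.

```latex
\bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}, \qquad
s_{jk} = \frac{1}{n-1}\sum_{i=1}^{n} (x_{ij}-\bar{x}_j)(x_{ik}-\bar{x}_k), \qquad
r_{jk} = \frac{s_{jk}}{\sqrt{s_{jj}\, s_{kk}}}

% Arranged as arrays: a p x 1 mean vector and a p x p covariance matrix
\bar{\mathbf{x}} = (\bar{x}_1, \dots, \bar{x}_p)', \qquad
\mathbf{S} = \{ s_{jk} \}_{p \times p}
```

Setting k = j in s_{jk} gives the sample variance of variable j.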
Quick Example Find the mean and variance of
Easier in R
We can calculate these values in R
> A<-matrix(c(1,2,3,3,4,5,2,3,7), nrow=3, ncol=3, byrow=T)
> A
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    3    4    5
[3,]    2    3    7
> colMeans(A)
[1] 2 3 5
> var(A)
     [,1] [,2] [,3]
[1,]    1    1    1
[2,]    1    1    1
[3,]    1    1    4
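As a check on what var() is doing, the same covariance matrix can be built by hand from the centered data. This is a minimal sketch; the names xbar, centered, S, and R_mat are introduced here for illustration.

```r
# Same matrix as in the slide: rows are samples, columns are variables
A <- matrix(c(1, 2, 3,
              3, 4, 5,
              2, 3, 7), nrow = 3, ncol = 3, byrow = TRUE)

n <- nrow(A)
xbar <- colMeans(A)                    # mean vector: 2 3 5
centered <- sweep(A, 2, xbar)          # subtract each column mean
S <- crossprod(centered) / (n - 1)     # (X - xbar)'(X - xbar) / (n - 1)

all.equal(S, var(A))                   # TRUE: matches R's var()
R_mat <- cov2cor(S)                    # corresponding correlation matrix
```

Dividing by n - 1 rather than n is what makes the hand calculation agree with var().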
Distance Many multivariate statistics are based on the idea of distance For example, if we are comparing two groups we might look at the difference in their means Euclidean distance
Concept of Euclidean Distance Start with distance from the origin for 2-dimensions What about a p dimensional point? What about between two p dimensional points?
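The three cases on this slide (origin in 2 dimensions, origin in p dimensions, and between two p-dimensional points) are all the square root of a sum of squared differences. A minimal sketch, where euclid is a helper name made up for this example:

```r
# Euclidean distance between x and y; y defaults to the origin
euclid <- function(x, y = rep(0, length(x))) {
  sqrt(sum((x - y)^2))
}

euclid(c(3, 4))                      # 2-d point to origin: 5
euclid(c(1, 2, 3), c(4, 6, 3))       # two 3-d points: sqrt(9 + 16 + 0) = 5

# Base R's dist() gives the same number for rows of a matrix
as.numeric(dist(rbind(c(1, 2, 3), c(4, 6, 3))))
```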
Distance But why is Euclidean distance inappropriate in statistics? This leads us to the idea of statistical distance Consider a case where we have two measures
Statistical Distance Consider a case where we have two measures
Statistical Distance Our expression of statistical distance can be generalized to p variables to any fixed set of points
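One common way to define statistical distance is to scale each squared difference by that variable's variance, so that highly variable measures count for less; with correlated measures the full covariance matrix is used (the Mahalanobis distance). The sketch below assumes that definition, which may differ in detail from the one on the slides. stat_dist and the example numbers are made up for illustration.

```r
# Statistical distance when the variables are uncorrelated:
# each squared difference is divided by that variable's variance
stat_dist <- function(x, y, s) {
  sqrt(sum((x - y)^2 / s))
}

x <- c(2, 1)
origin <- c(0, 0)
s <- c(4, 1)                       # hypothetical variances of the two measures

stat_dist(x, origin, s)            # sqrt(4/4 + 1/1) = sqrt(2)

# With a diagonal covariance matrix, mahalanobis() (which returns the
# SQUARED distance) agrees with the formula above
sqrt(mahalanobis(x, center = origin, cov = diag(s)))

# With correlated measures the off-diagonal covariance changes the distance
S2 <- matrix(c(4, 1.5,
               1.5, 1), nrow = 2, byrow = TRUE)
sqrt(mahalanobis(x, center = origin, cov = S2))
```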
Now onto some linear algebra basics…
Basic Matrix Operations Can I add A2x3 and B3x3? What is the product of matrix A and scalar c? When can I multiply the two matrices A and B?
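These questions can be tried directly in R. Addition needs matrices of identical dimensions, and the product AB needs the number of columns of A to equal the number of rows of B. The matrices below are hypothetical examples, not the ones from the slide.

```r
A <- matrix(1:6, nrow = 2)     # 2 x 3
B <- diag(3)                   # 3 x 3 identity

# A (2x3) + B (3x3): dimensions differ, so R signals an error
sum_ok <- tryCatch(A + B, error = function(e) "non-conformable")

# Scalar product: every element of A is multiplied by c
cA <- 2 * A

# A (2x3) times B (3x3) works and gives a 2 x 3 matrix
AB <- A %*% B

# B (3x3) times A (2x3) does not: 3 columns vs 2 rows
BA_ok <- tryCatch(B %*% A, error = function(e) "non-conformable")
```

Note that %*% is matrix multiplication; A * B in R is element-by-element.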
Matrix Transposes The transpose of an n x m matrix A, denoted as A’, is an m x n matrix whose ijth element is the jith element of A Properties of a transpose:
Quick Examples: Matrix Transposes Consider the two matrices
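The standard transpose properties, (A’)’ = A, (A + C)’ = A’ + C’, and (AB)’ = B’A’, can be verified numerically with t(). The matrices here are hypothetical stand-ins for the two on the slide.

```r
A <- matrix(c(1, 2, 3,
              4, 5, 6), nrow = 2, byrow = TRUE)   # 2 x 3
C <- matrix(c(0, 1, 0,
              2, 0, 2), nrow = 2, byrow = TRUE)   # 2 x 3
B <- matrix(c(1, 0,
              2, 1,
              0, 3), nrow = 3, byrow = TRUE)      # 3 x 2

dim(t(A))                                   # 3 2: rows and columns swap
all.equal(t(t(A)), A)                       # (A')' = A
all.equal(t(A + C), t(A) + t(C))            # (A + C)' = A' + C'
all.equal(t(A %*% B), t(B) %*% t(A))        # (AB)' = B'A', note the order
```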
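Types of Matrices
Square matrix: the number of rows equals the number of columns (n x n).
Idempotent: AA = A.
Symmetric: A = A’.
A square matrix is diagonal if all of its off-diagonal elements are zero.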
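A familiar matrix that is both idempotent and symmetric is the "hat" matrix from linear regression, H = X(X’X)⁻¹X’. The sketch below uses a small hypothetical design matrix X to check both properties.

```r
# Hypothetical design matrix: intercept column plus one covariate
X <- cbind(1, c(1, 2, 3, 4))

# Hat (projection) matrix H = X (X'X)^{-1} X', a 4 x 4 square matrix
H <- X %*% solve(crossprod(X)) %*% t(X)

all.equal(H %*% H, H)     # idempotent: HH = H
all.equal(H, t(H))        # symmetric: H = H'

# A diagonal matrix has zeros everywhere off the main diagonal
D <- diag(c(2, 5, 7))
```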
More Definitions
An n x n matrix A is nonsingular if there exists a matrix Bn x n such that AB = BA = I. B is the multiplicative inverse of A and can be written as A⁻¹.
A square matrix with no multiplicative inverse is said to be….
We can calculate the inverse of a matrix assuming one exists but it is tedious (let the computer do it).
Finding an Inverse in R
We can find inverses by hand, but in most cases it is tedious (I won’t ask you to)
Instead use R (base package):
> A<-matrix(c(1,2,-3,-1,1,-1,0,-2,3), nrow=3, ncol=3, byrow=T)
> B<-solve(A); B
     [,1] [,2] [,3]
[1,]    1    0    1
[2,]    3    3    4
[3,]    2    2    3
> A%*%B # Just a check to show this is the inverse
              [,1] [,2]          [,3]
[1,]  1.000000e+00    0  0.000000e+00
[2,] -4.440892e-16    1 -4.440892e-16
[3,]  0.000000e+00    0  1.000000e+00
Entries such as -4.440892e-16 are floating-point rounding error and are effectively zero, so the product is the identity matrix.
Matrix Determinant The determinant of a square matrix A is a scalar given by What is the determinant of
Matrix Determinant What about the determinant of the 3x3 matrix?
Matrix Determinant Using this result what is the determinant of
Easier in R…
We can calculate the determinant of a matrix in R using functions in the base package
> A<-matrix(c(1,4,0,2,2,1,-1,3,0), nrow=3, ncol=3, byrow=T)
> det(A)
[1] -7
Note, R will also give you an error if you try to calculate it for a non-square matrix
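The value R reports can be checked by hand. One standard route, which may or may not be the expansion used on the earlier slides, is a cofactor expansion along the first row of the 3x3 matrix:

```latex
|A| = a_{11}(a_{22}a_{33} - a_{23}a_{32})
    - a_{12}(a_{21}a_{33} - a_{23}a_{31})
    + a_{13}(a_{21}a_{32} - a_{22}a_{31})

% For A = [1 4 0; 2 2 1; -1 3 0] as entered in R above:
|A| = 1\,(2\cdot 0 - 1\cdot 3) \;-\; 4\,(2\cdot 0 - 1\cdot(-1)) \;+\; 0\,(2\cdot 3 - 2\cdot(-1))
    = -3 - 4 + 0 = -7
```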
A Little on Vectors The inner product of two vectors is useful in statistics Think about this in terms of linear regression…
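The inner product x’y is the sum of element-wise products, and in linear regression the least squares estimate is built entirely from inner products: it solves (X’X)b = X’y. A minimal sketch with made-up data; X, yv, and beta_hat are names chosen for this example.

```r
x <- c(1, 2, 3)
y <- c(4, 5, 6)

sum(x * y)                 # inner product x'y = 4 + 10 + 18 = 32
drop(crossprod(x, y))      # same value via t(x) %*% y
sqrt(sum(x * x))           # length of x is sqrt(x'x)

# Regression: every entry of X'X and X'y is an inner product of columns
X <- cbind(1, c(1, 2, 3, 4))          # hypothetical design matrix
yv <- c(2, 4, 5, 8)                   # hypothetical response
beta_hat <- solve(crossprod(X), crossprod(X, yv))   # solves (X'X) b = X'y

fit <- lm(yv ~ X[, 2])                # lm() gives the same coefficients
```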
Orthogonal and Orthonormal vectors A collection of m-dimensional vectors, x1, x2,…, xp are orthogonal if… The collection of vectors is said to be orthonormal if what 2 conditions are met?
Linear Dependence
The p m-dimensional vectors x1, x2,…, xp are linearly dependent if there is a set of constants c1, c2,…, cp, not all zero, for which c1x1 + c2x2 + … + cpxp = 0.
Linear Dependence
Conversely, if no such set of constants (not all zero) exists, the vectors are linearly independent.
Rank of a Matrix
Row rank is the number of linearly independent rows.
Column rank is the number of linearly independent columns.
Find the column rank of
Rank of a Matrix How are row and column rank related? If a matrix is not square, what is the maximum rank a matrix can have? If the rank of Amxn is min(m, n), then A is said to be full rank What does rank tell us about linear dependence of the vectors that make up the matrix?
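Rank and linear dependence can be explored numerically. In the hypothetical example below the third column is the sum of the first two, so the columns are linearly dependent and the matrix is not full rank. The rank is read from the QR decomposition, which is one of several ways to obtain it in base R.

```r
x1 <- c(1, 2, 1)
x2 <- c(0, 1, 3)
x3 <- x1 + x2                  # deliberately dependent on x1 and x2
X <- cbind(x1, x2, x3)

# Constants c = (1, 1, -1), not all zero, give c1*x1 + c2*x2 + c3*x3 = 0
X %*% c(1, 1, -1)

qr(X)$rank                     # column rank: 2, not 3
qr(t(X))$rank                  # row rank: also 2
```

Row and column rank agree here, and a rank below min(m, n) signals that at least one column (or row) is a linear combination of the others.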
Orthogonal Matrices
A square matrix Anxn is said to be orthogonal if its columns form an orthonormal set. This can easily be determined by showing that A’A = I (equivalently, A’ is the inverse of A).
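A 2 x 2 rotation matrix is a convenient example of an orthogonal matrix. The sketch below checks that its columns are orthonormal by computing Q’Q; the angle is arbitrary.

```r
theta <- pi / 6
Q <- matrix(c(cos(theta), -sin(theta),
              sin(theta),  cos(theta)), nrow = 2, byrow = TRUE)

crossprod(Q)                        # Q'Q, numerically the 2 x 2 identity
all.equal(crossprod(Q), diag(2))    # columns are orthonormal
all.equal(solve(Q), t(Q))           # so the inverse is just the transpose
```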
Eigenvalues and Eigenvectors
The eigenvalues of an n x n matrix A are the scalar values λ1, λ2,…, λn that are the solutions to Ae = λe for a set of nonzero eigenvectors e1, e2,…, en.
We typically normalize each eigenvector so that e’e = 1.
Example: Eigen Values Find the eigenvalues for
Example: Eigen Vectors Find the first eigenvector for
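Hand calculations like these can be checked with eigen() in base R. The matrix below is a hypothetical symmetric example, not the one from the slides; its eigenvalues are 3 and 1.

```r
A <- matrix(c(2, 1,
              1, 2), nrow = 2, byrow = TRUE)

e <- eigen(A)
e$values                       # 3 1, returned in decreasing order
e$vectors                      # columns are eigenvectors (signs are arbitrary)

# Check the defining equation A e = lambda e for the first pair
e1 <- e$vectors[, 1]
A %*% e1 - e$values[1] * e1    # numerically zero

# eigen() returns vectors already normalized to unit length
colSums(e$vectors^2)
```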
Quadratic Forms
Given a symmetric matrix Anxn and an n-dimensional vector x, the scalar quantity x’Ax is referred to as a quadratic form. For example, the expression is a quadratic form for some matrix A where x is a vector
Positive Definite Matrices
A symmetric matrix A is said to be positive definite if x’Ax > 0 for every vector x ≠ 0. This implies that every eigenvalue of A is positive.
Spectral Decomposition We can use eigenvalues and vectors to yield the spectral decomposition of a symmetric matrix A Using the spectral decomposition and quadratic forms, we can then show that a symmetric matrix is positive definite.
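A sketch of the argument, assuming normalized eigenvectors e_i with eigenvalues \lambda_i (symbols assumed here) and writing P for the matrix whose columns are the e_i:

```latex
A = \sum_{i=1}^{n} \lambda_i \, e_i e_i' = P \Lambda P',
\qquad \Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)

x'Ax = \sum_{i=1}^{n} \lambda_i \, x' e_i e_i' x
     = \sum_{i=1}^{n} \lambda_i \, (e_i' x)^2
```

Each (e_i'x)^2 is nonnegative, and for x ≠ 0 at least one is strictly positive because the eigenvectors of a symmetric matrix form an orthonormal basis. So x'Ax > 0 for all x ≠ 0 whenever every \lambda_i > 0.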
Quadratic Forms, Spectral Decomposition, and Positive Definite Matrices Given the quadratic form, show A is positive definite
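A numerical version of this kind of exercise, using a hypothetical symmetric matrix (the quadratic form on the slide is not reproduced here): compute the eigen decomposition, rebuild A from it, and confirm that the quadratic form equals the eigenvalue-weighted sum of squares.

```r
# Hypothetical matrix for the quadratic form 3*x1^2 + 2*x2^2 - 2*sqrt(2)*x1*x2
A <- matrix(c(3, -sqrt(2),
              -sqrt(2), 2), nrow = 2, byrow = TRUE)

e <- eigen(A, symmetric = TRUE)
P <- e$vectors                  # columns are normalized eigenvectors
Lambda <- diag(e$values)        # eigenvalues on the diagonal

A_rebuilt <- P %*% Lambda %*% t(P)     # spectral decomposition P Lambda P'
all.equal(A_rebuilt, A)

e$values                        # 4 and 1: both positive, so A is positive definite

# The quadratic form two ways, for an arbitrary nonzero x
x <- c(1, -2)
qf_direct <- drop(t(x) %*% A %*% x)
qf_spectral <- sum(e$values * drop(crossprod(P, x))^2)
all.equal(qf_direct, qf_spectral)
```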
Positive Definite Matrices A real symmetric matrix is:
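The usual classification (positive definite, positive semi-definite, negative definite, negative semi-definite, indefinite) can be read off the signs of the eigenvalues. The helper below is a sketch written for this example, not a base R function, and the tolerance tol is an arbitrary choice to absorb rounding error.

```r
# Classify a real symmetric matrix by the signs of its eigenvalues
classify_definiteness <- function(A, tol = 1e-8) {
  stopifnot(isTRUE(all.equal(A, t(A))))      # must be symmetric
  ev <- eigen(A, symmetric = TRUE, only.values = TRUE)$values
  if (all(ev > tol)) {
    "positive definite"
  } else if (all(ev >= -tol)) {
    "positive semi-definite"
  } else if (all(ev < -tol)) {
    "negative definite"
  } else if (all(ev <= tol)) {
    "negative semi-definite"
  } else {
    "indefinite"
  }
}

classify_definiteness(diag(c(2, 1)))        # "positive definite"
classify_definiteness(matrix(1, 2, 2))      # eigenvalues 2 and 0
classify_definiteness(diag(c(1, -1)))       # mixed signs
```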
Trace Let A be an nxn matrix, the trace of A is given by Properties of the trace:
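The trace is the sum of the diagonal elements. Base R has no trace function, so the sketch below defines a one-line helper (tr, a name chosen here) and checks several standard properties on hypothetical matrices.

```r
tr <- function(M) sum(diag(M))     # trace = sum of diagonal elements

A <- matrix(c(2, 1,
              1, 3), nrow = 2, byrow = TRUE)
B <- matrix(c(0, -1,
              4,  5), nrow = 2, byrow = TRUE)

tr(A)                                        # 2 + 3 = 5
all.equal(tr(3 * A), 3 * tr(A))              # tr(cA) = c tr(A)
all.equal(tr(A + B), tr(A) + tr(B))          # tr(A + B) = tr(A) + tr(B)
all.equal(tr(A %*% B), tr(B %*% A))          # tr(AB) = tr(BA)
all.equal(tr(A), sum(eigen(A)$values))       # trace = sum of eigenvalues
```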
Back to Random Vectors Define Y as a random vector Then the population mean vector is:
Random Vectors Cont’d So Yj is a random variable whose mean and variance can be expressed by:
Covariance of Random Vectors We then define the covariance between the jth and kth trait in Y as Yielding the covariance matrix
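For reference, the standard population definitions can be written as below. The symbols \mu_j and \sigma_{jk} are assumed notation for the mean of Y_j and the covariance of Y_j and Y_k.

```latex
\boldsymbol{\mu} = E(\mathbf{Y}) = \big(E(Y_1), \dots, E(Y_p)\big)' = (\mu_1, \dots, \mu_p)'

\sigma_{jj} = \mathrm{Var}(Y_j) = E\big[(Y_j - \mu_j)^2\big], \qquad
\sigma_{jk} = \mathrm{Cov}(Y_j, Y_k) = E\big[(Y_j - \mu_j)(Y_k - \mu_k)\big]

\boldsymbol{\Sigma} = E\big[(\mathbf{Y} - \boldsymbol{\mu})(\mathbf{Y} - \boldsymbol{\mu})'\big]
  = \{ \sigma_{jk} \}_{p \times p}
```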
Correlation Matrix of Y The correlation matrix for Y is
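Each correlation is the covariance divided by the product of the two standard deviations, which in matrix form is D^(-1/2) Σ D^(-1/2) with D the diagonal matrix of variances. A minimal sketch with a hypothetical covariance matrix; base R's cov2cor() performs the same conversion.

```r
# Hypothetical covariance matrix: variances 4 and 9, covariance 2
Sigma <- matrix(c(4, 2,
                  2, 9), nrow = 2, byrow = TRUE)

# Scale rows and columns by 1 / standard deviation
D_inv_sqrt <- diag(1 / sqrt(diag(Sigma)))
R_mat <- D_inv_sqrt %*% Sigma %*% D_inv_sqrt

R_mat[1, 2]                         # 2 / (2 * 3) = 1/3
all.equal(R_mat, cov2cor(Sigma))    # matches the built-in conversion
```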
Properties of a Covariance Matrix
Σ is symmetric (i.e. σij = σji for all i, j).
Σ is positive semi-definite: a’Σa ≥ 0 for any vector of constants a.
Linear Combinations
Consider linear combinations of the elements of Y.
If Y has mean μ and covariance Σ, then
Linear Combinations Cont’d
If Σ is not positive definite then a’Σa = 0 for at least one a ≠ 0.
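The standard results for a linear combination a’Y are E(a’Y) = a’μ and Var(a’Y) = a’Σa. The same identities hold exactly for sample means and sample covariances, which gives an easy numerical check. The data below are simulated and the coefficient vector a is arbitrary.

```r
set.seed(726)
Ymat <- matrix(rnorm(50 * 3), ncol = 3)   # 50 simulated subjects, p = 3 outcomes
a <- c(1, -2, 0.5)                        # arbitrary vector of constants

lin <- drop(Ymat %*% a)                   # a'Y computed for every subject

# Mean of the combination equals a' times the mean vector
all.equal(mean(lin), sum(a * colMeans(Ymat)))

# Variance of the combination equals a' S a
all.equal(var(lin), drop(t(a) %*% var(Ymat) %*% a))
```

Because a’Sa is a variance it can never be negative, which is the reason a covariance matrix must be positive semi-definite.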
Next Time We will start discussing properties of the multivariate normal distribution…