CORRELATION-REGRESSION ANALYSIS. Tomsk Polytechnic University
When simulating components of complex systems, it is often necessary to establish the qualitative and quantitative relationship between the inputs and outputs of a functional unit. Such a component can be represented as a box that, through its set of internal parameters, connects input stimuli with output signals.
[Figure: Functional unit of a complex system. Inputs $x_1, x_2, \dots$; unit parameters $s_1, \dots, s_k$; outputs $y_1, \dots, y_m$.]
If the mathematical expressions describing the behavior of the box are known, its output signals for a given input stimulus are easily found by solving the direct problem. This is the easiest case for system modeling. It occurs when the object's behavior is uniquely described by known laws of physics (for example, the dependence of current on voltage in a circuit), or when the equations relating the inputs and outputs of functional units have been obtained from previous studies of similar systems.
The simplest way to visually identify a relationship between quantitative variables is to draw a scatterplot: a graph with one variable plotted along the horizontal axis (x) and the other along the vertical axis (y). Each object corresponds to a point whose coordinates are the values of the pair of variables selected for analysis (fig. 4.2). In total, the graph contains n experimental points, corresponding to n observations. The scatterplot is a "cloud" of points in the coordinate plane. If the cloud resembles a line, it can be assumed that the scatterplot shows the form of the dependence, distorted by factors that make the points deviate from the theoretical form.
[Figure: Graphical view of observation results. Actual dependence of energy consumption (ths. kWh) on the number of residents.]
In this example, a linear relationship between the population and the amount of electricity consumed can be assumed, that is, a one-dimensional linear regression model. However, many lines can be drawn through the cloud of points, and the eye cannot determine which one best describes the desired function.
In general, the equation of a straight line is given by the expression:
$Y = A_0 + A_1 X$ (4.1)
Hence, to obtain the regression equation it is necessary to determine the values of the coefficients $A_0$ and $A_1$. One of the most popular methods for calculating these coefficients, i.e. for determining the position of the line that best passes through the cloud of given points, is the method of least squares. The main idea of the method is to minimize the sum of squared errors (vertical distances) between the experimental points and the corresponding points on the theoretical straight line.
To obtain the regression equation by least squares, the following calculations are performed in sequence:
1. For each of the n experimental points, calculate the error $E_i$ between the experimental value $Y_i^{exp}$ and the theoretical value $Y_i^{theor}$ lying on the straight line given by equation (4.1):
$E_i = Y_i^{exp} - Y_i^{theor}, \quad i = 1, \dots, n$
or
$E_i = Y_i - A_0 - A_1 X_i, \quad i = 1, \dots, n$ (4.2)
2. Add up the errors $E_i$ over all n points. So that positive errors do not cancel negative ones in the sum, each error is squared before being added to the total error S:
$S = \sum_{i=1}^{n} E_i^2 = \sum_{i=1}^{n} (Y_i - A_0 - A_1 X_i)^2$ (4.3)
The total error S is a function of two variables, $A_0$ and $A_1$; by changing them we can influence its magnitude. The principle of the least-squares method is to select the coefficients $A_0$, $A_1$ of the linear function $Y = A_0 + A_1 X$ so that its graph passes as close as possible to all experimental points simultaneously:
$S(A_0, A_1) = \sum_{i=1}^{n} (Y_i - A_0 - A_1 X_i)^2 \to \min$ (4.4)
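3. A necessary condition for a minimum of a function of several variables is that all its partial derivatives equal zero. We find the partial derivatives of S with respect to each variable and set them to zero:
$\frac{\partial S}{\partial A_0} = -2 \sum_{i=1}^{n} (Y_i - A_0 - A_1 X_i) = 0, \qquad \frac{\partial S}{\partial A_1} = -2 \sum_{i=1}^{n} (Y_i - A_0 - A_1 X_i) X_i = 0$ (4.5)
After transformation, the system of equations (4.5) can be written as:
$n A_0 + A_1 \sum_{i=1}^{n} X_i = \sum_{i=1}^{n} Y_i, \qquad A_0 \sum_{i=1}^{n} X_i + A_1 \sum_{i=1}^{n} X_i^2 = \sum_{i=1}^{n} X_i Y_i$ (4.6)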
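From the system of linear equations (4.6), formulas for directly determining the coefficients $A_0$, $A_1$ of the desired linear function can be derived:
$A_1 = \frac{n \sum X_i Y_i - \sum X_i \sum Y_i}{n \sum X_i^2 - \left(\sum X_i\right)^2}, \qquad A_0 = \frac{\sum Y_i - A_1 \sum X_i}{n}$ (4.7)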
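The least-squares steps above can be sketched in a few lines of Python. This is only an illustration: the function names `total_squared_error` and `fit_line` and the four observations are hypothetical, and the closed-form expressions assume the standard solution of the two normal equations for a straight line.

```python
def total_squared_error(xs, ys, a0, a1):
    """Total error S: sum of squared errors E_i = Y_i - A0 - A1 * X_i."""
    return sum((y - a0 - a1 * x) ** 2 for x, y in zip(xs, ys))


def fit_line(xs, ys):
    """Return (A0, A1) that minimize S, solving the two normal equations."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    denom = n * sum_xx - sum_x ** 2
    if denom == 0:
        raise ValueError("all X values are equal; the slope is undefined")
    a1 = (n * sum_xy - sum_x * sum_y) / denom
    a0 = (sum_y - a1 * sum_x) / n
    return a0, a1


# Hypothetical observations (not taken from the slides).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 5.0, 6.0, 9.0]

a0, a1 = fit_line(xs, ys)
s_best = total_squared_error(xs, ys, a0, a1)
# An arbitrary line drawn "by eye" through the same cloud of points.
s_guess = total_squared_error(xs, ys, 1.0, 2.0)
print(f"A0 = {a0:.3f}, A1 = {a1:.3f}, S = {s_best:.3f}, S for the guess = {s_guess:.3f}")
```

The fitted line gives a smaller total error than the hand-picked one, which is the sense in which it "best passes through" the points.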
To quantify the closeness of the relationship between variables and to determine its direction, it is necessary to conduct a correlation analysis of the available experimental data. Thus, building a high-quality mathematical model of an object from available statistical (experimental) data is possible only on the basis of correlation-regression analysis. Correlation and regression analysis is a branch of statistics, the science that studies general problems of measuring and analyzing mass quantitative relationships and interactions.
In the terminology of statistics, input variables are called factor characteristics, i.e. characteristics that directly cause a change in other characteristics or create the conditions for such a change. Output variables are called resultant characteristics, i.e. characteristics whose magnitude depends on the factor characteristics. For example, electricity consumption is a resultant characteristic whose value depends on the factor characteristics: the amount and range of products.
Correlation and regression analysis makes it possible to quantify the closeness and direction of a statistical relationship and to establish an analytical expression for the dependence of the result on specific factors while the other factor characteristics that affect the resultant characteristic are held constant. To perform correlation and regression analysis, the following conditions are necessary: a sufficiently large sample (the number of observations should be more than 10 times the number of factors influencing the result); a qualitatively homogeneous sample; a distribution of the population, in both resultant and factor characteristics, that follows the normal distribution law or is close to it.
When carrying out correlation and regression analysis, the following problems are solved: identifying whether a relationship exists between the resultant and factor characteristics; identifying the form of the relationship; identifying the strength (closeness) and direction of the relationship; predicting possible values of the resultant characteristic from specified values of the factor characteristics.
Regression in statistics is the dependence of the mean value of a quantity y on another quantity x or on several quantities $x_i$. Pair regression is a model that expresses the dependence of the mean value of the dependent variable y on a single independent variable x:
$y = f(x)$ (4.10)
where y is the dependent variable (resultant characteristic) and x is the independent variable (factor characteristic). Pair regression is used when there is a dominant factor that accounts for a large proportion of the change in the dependent variable.
Multiple regression is a model that expresses the dependence of the mean value of the dependent variable y on several independent variables $x_1, x_2, \dots, x_n$:
$y = f(x_1, x_2, \dots, x_n)$ (4.11)
Multiple regression is used when no dominant factor can be singled out among the many factors influencing the resultant characteristic, and the simultaneous influence of several factors must be taken into account.
Using the pair regression equation (4.10), the model of the relationship between the variables y and x can be written as:
$y = f(x) + \varepsilon$ (4.12)
where the first term f(x) is the part of the value y explained by the regression equation (4.10), and the second term ε is the unexplained part of y. The relationship between these parts characterizes the quality of the regression equation, i.e. its ability to represent the actual relationship between x and y. The component ε arises from such causes as additional factors that influence y, a wrongly chosen form of the functional dependence f(x), measurement error, and the sampled nature of the input data. When building the regression equation, ε is regarded as the model error: a random variable that satisfies certain conditions.
The main types of pair regression equations

Regression type    Regression equation
Linear             $y = a + b x$
Hyperbolic         $y = a + \frac{b}{x}$
Polynomial         $y = a_0 + a_1 x + a_2 x^2$
Power              $y = a x^b$
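As a quick illustration of what each regression type computes, the four curve families can be written as small Python functions. The equations on the slide did not survive extraction, so the forms used here (linear $a + bx$, hyperbolic $a + b/x$, polynomial $a_0 + a_1 x + a_2 x^2 + \dots$, power $a x^b$) are the usual textbook ones and should be read as an assumption; the function names are hypothetical.

```python
def linear(x, a, b):
    """Linear regression curve: y = a + b * x."""
    return a + b * x


def hyperbolic(x, a, b):
    """Hyperbolic regression curve: y = a + b / x (undefined at x = 0)."""
    if x == 0:
        raise ValueError("hyperbolic form is undefined at x = 0")
    return a + b / x


def polynomial(x, coeffs):
    """Polynomial curve: y = a0 + a1*x + a2*x**2 + ..., coeffs = [a0, a1, a2, ...]."""
    result = 0.0
    # Horner's scheme: start from the highest-order coefficient.
    for c in reversed(coeffs):
        result = result * x + c
    return result


def power(x, a, b):
    """Power regression curve: y = a * x**b."""
    return a * x ** b


# Evaluate each assumed form at the same (arbitrary) point x = 2.
x = 2.0
values = {
    "linear": linear(x, a=1.0, b=2.0),
    "hyperbolic": hyperbolic(x, a=1.0, b=2.0),
    "polynomial": polynomial(x, [1.0, 2.0, 3.0]),
    "power": power(x, a=3.0, b=2.0),
}
for name, value in values.items():
    print(f"{name}: {value}")
```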
To estimate the parameters of a pair regression equation, the method of least squares is used. It consists in finding the coefficients $a_0, a_1, a_2, \dots$ for which the sum of squared deviations of the actual values $y_i$ from the theoretical ones is minimal. The equation of pair linear regression is often written as:
$y = a + b x$ (4.13)
To determine the parameters a, b by the least-squares method, it is necessary to solve the following system of normal equations:
$n a + b \sum x_i = \sum y_i, \qquad a \sum x_i + b \sum x_i^2 = \sum x_i y_i$
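Solving the system of normal equations for the regression (4.13) gives:
$b = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\overline{x^2} - \bar{x}^2}, \qquad a = \bar{y} - b\,\bar{x}$ (4.14)
where $\bar{x}$ is the mean value of the factor x; $\bar{y}$ is the mean value of the resultant variable y; $\overline{x^2}$ is the mean square of the variable x; $\overline{xy}$ is the mean product of the variables x and y.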
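The solution in terms of means translates directly into code. The sketch below assumes the pair linear model $y = a + bx$ with intercept a and slope b; the helper names and the five observations are hypothetical.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def fit_pair_regression(xs, ys):
    """Return (a, b) for y = a + b*x using means of x, y, x*y and x**2."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    mean_x = mean(xs)
    mean_y = mean(ys)
    mean_xy = mean([x * y for x, y in zip(xs, ys)])
    mean_xx = mean([x * x for x in xs])
    var_x = mean_xx - mean_x ** 2  # mean square minus squared mean
    if var_x == 0:
        raise ValueError("x does not vary; the slope is undefined")
    b = (mean_xy - mean_x * mean_y) / var_x
    a = mean_y - b * mean_x
    return a, b


# Hypothetical observations (not taken from the slides).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

a, b = fit_pair_regression(xs, ys)
print(f"y = {a:.2f} + {b:.2f} * x")
```

A useful property to check by hand: the fitted line always passes through the point of means $(\bar{x}, \bar{y})$, because $a = \bar{y} - b\bar{x}$.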
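The closeness and direction of a pair linear correlation are measured by the linear correlation coefficient $r_{xy}$:
$r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n\,\sigma_x \sigma_y}$ (4.15)
where n is the number of observations; $x_i$, $y_i$ are the observation data; $\bar{x}$, $\bar{y}$ are the mean values of the variables x and y; $\sigma_x = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$ is the mean-square deviation of the variable x; $\sigma_y = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2}$ is the mean-square deviation of the variable y.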
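A minimal sketch of the linear correlation coefficient follows. It assumes the population form of the mean-square deviation (division by n rather than n - 1); the function name `correlation` and the sample data are hypothetical.

```python
import math


def correlation(xs, ys):
    """Linear correlation coefficient r_xy using mean-square deviations (divide by n)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sigma_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sigma_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    if sigma_x == 0 or sigma_y == 0:
        raise ValueError("a constant variable has no defined correlation")
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    return covariance / (sigma_x * sigma_y)


# Hypothetical observations (not taken from the slides).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
r_scattered = correlation(xs, [2.0, 4.0, 5.0, 4.0, 5.0])  # positive, but not perfect
r_falling = correlation(xs, [10.0, 8.0, 6.0, 4.0, 2.0])   # exact decreasing line
print(f"r (scattered) = {r_scattered:.4f}, r (falling line) = {r_falling:.4f}")
```

The sign of the result gives the direction of the relationship, and its absolute value gives the closeness; an exact straight-line dependence yields a magnitude of 1.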
Positive values of the correlation coefficient indicate a positive relationship between the characteristics; negative values indicate a negative relationship.
[Figure: Correlation relationships between variables: a) positive; b) negative.]
[Figure: Correlation coefficients for various relationships.]
Having obtained the regression equation, it is necessary to assess its significance. Checking the significance of a regression equation involves answering two important questions: does the mathematical model that expresses the relationship between the variables correspond to the experimental data? Are enough explanatory variables included in the equation to describe the dependent variable?
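The accuracy of the model can be estimated by the regression mean square error (4.21). To assess the quality of the model, the average error of approximation is also used, which is the mean relative deviation of the calculated values from the observed ones:
$\bar{A} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \cdot 100\%$
where $\hat{y}_i$ is the value calculated from the regression equation for the i-th observation.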
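Both accuracy measures can be sketched as follows. The exact form of (4.21) is not reproduced in the text, so the mean square error here is an assumption: the square root of the residual sum of squares divided by n - 2, the residual degrees of freedom of a pair regression. The function names, the data, and the fitted line y = 2.2 + 0.6x are hypothetical.

```python
import math


def mean_approximation_error(ys, fitted):
    """Mean relative deviation of calculated values from observed ones, in percent."""
    if any(y == 0 for y in ys):
        raise ValueError("relative error is undefined for an observed value of 0")
    return 100.0 * sum(abs((y - f) / y) for y, f in zip(ys, fitted)) / len(ys)


def regression_mean_square_error(ys, fitted, n_params=2):
    """Assumed form: sqrt(residual sum of squares / (n - n_params))."""
    n = len(ys)
    if n <= n_params:
        raise ValueError("not enough observations for the number of parameters")
    residual_ss = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    return math.sqrt(residual_ss / (n - n_params))


# Hypothetical observations and the least-squares line fitted to them.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
fitted = [2.2 + 0.6 * x for x in xs]

approx_error = mean_approximation_error(ys, fitted)
mse_root = regression_mean_square_error(ys, fitted)
print(f"average approximation error = {approx_error:.1f}%, mean square error = {mse_root:.3f}")
```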
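Checking the significance of a regression equation is based on analysis of variance (dispersion). Central to it is the analysis of three sums:
- the total sum of squares, $SS_T = \sum_{i=1}^{n} (y_i - \bar{y})^2$: the total sum of squared deviations of the studied parameter y from its mean value;
- the regression sum of squares, $SS_R = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$: the sum of squared deviations of y explained by the regression;
- the error sum of squares, $SS_E = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$: the residual sum of squared deviations of y, due to the influence of factors unaccounted for in the simulation.
Here $\hat{y}_i$ is the value calculated from the regression equation and $\bar{y}$ is the mean of the observed values.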
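The quality of a regression model is estimated using the coefficient of determination:
$R^2 = \frac{SS_R}{SS_T} = 1 - \frac{SS_E}{SS_T}$
where $SS_T$, $SS_R$ and $SS_E$ are the total, regression and error sums of squares. By definition, $0 \le R^2 \le 1$. The closer the value of $R^2$ is to unity, the better the regression equation fits the observation data. When $R^2 = 1$, the relation $y_i = \hat{y}_i$ holds for all observations, i.e. the dependence is functional.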
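The three sums and the coefficient of determination can be computed as in the sketch below. The function names and the data are hypothetical; the fitted values come from the least-squares line y = 2.2 + 0.6x for these points, which is what makes the total sum split exactly into the regression and error parts.

```python
def sums_of_squares(ys, fitted):
    """Return (total, regression, error) sums of squared deviations."""
    mean_y = sum(ys) / len(ys)
    total_ss = sum((y - mean_y) ** 2 for y in ys)
    regression_ss = sum((f - mean_y) ** 2 for f in fitted)
    error_ss = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    return total_ss, regression_ss, error_ss


def determination(ys, fitted):
    """Coefficient of determination R^2 = 1 - error SS / total SS."""
    total_ss, _, error_ss = sums_of_squares(ys, fitted)
    if total_ss == 0:
        raise ValueError("y does not vary; R^2 is undefined")
    return 1.0 - error_ss / total_ss


# Hypothetical observations and the least-squares line fitted to them.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
fitted = [2.2 + 0.6 * x for x in xs]

total_ss, regression_ss, error_ss = sums_of_squares(ys, fitted)
r_squared = determination(ys, fitted)
print(f"total = {total_ss:.2f}, regression = {regression_ss:.2f}, "
      f"error = {error_ss:.2f}, R^2 = {r_squared:.2f}")
```

For a line that is not the least-squares fit, the regression and error sums generally do not add up to the total, so the two expressions for the coefficient of determination can then disagree.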
The value of $R^2$ shows what share of the total dispersion (variance) of the resultant characteristic y is explained by the regression equation. For example, $R^2 = 0.8$ means that the regression equation explains 80% of the total dispersion (variance) of the resultant y. Thus, the value of $R^2$ indicates how well the model fits the original data. Since $R^2$ is defined through sums of squared deviations, it is necessary to know the number of degrees of freedom k, which depends on the number of observations and the number of constants determined from them.
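Dispersion per degree of freedom

Variance (dispersion) source    Sum of squared deviations    Number of degrees of freedom    Dispersion per degree of freedom
Total                           $SS_T$                       n - 1                           $SS_T / (n - 1)$
Explained                       $SS_R$                       1                               $SS_R / 1$
Residual                        $SS_E$                       n - 2                           $SS_E / (n - 2)$

Here $SS_T$, $SS_R$ and $SS_E$ denote the total, regression (explained) and error (residual) sums of squared deviations.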
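The table can be filled in numerically for a pair regression as in the sketch below. The function name `variance_table`, the dictionary layout, and the data are hypothetical; the degrees of freedom (n - 1, 1, n - 2) are those listed in the table.

```python
def variance_table(ys, fitted):
    """Map each variance source to (sum of squares, degrees of freedom, dispersion per df)."""
    n = len(ys)
    if n < 3:
        raise ValueError("a pair regression needs at least 3 observations here")
    mean_y = sum(ys) / n
    total_ss = sum((y - mean_y) ** 2 for y in ys)
    explained_ss = sum((f - mean_y) ** 2 for f in fitted)
    residual_ss = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    rows = {
        "total": (total_ss, n - 1),
        "explained": (explained_ss, 1),
        "residual": (residual_ss, n - 2),
    }
    # Divide each sum of squares by its number of degrees of freedom.
    return {name: (ss, df, ss / df) for name, (ss, df) in rows.items()}


# Hypothetical observations and the least-squares line fitted to them.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
fitted = [2.2 + 0.6 * x for x in xs]

table = variance_table(ys, fitted)
for source, (ss, df, per_df) in table.items():
    print(f"{source:<10} SS = {ss:.2f}  df = {df}  dispersion per df = {per_df:.2f}")
```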