In the previous chapters we developed the basics of the simple regression model, describing how to estimate the population parameters using sample information and how to perform inference on the population. But so far we have said nothing about how well the model describes the data. The two most popular measures of model fit are the so-called coefficient of determination and the adjusted coefficient of determination.

5.1. The coefficient of determination (R²)

In the simple regression model we explain the variation of one variable with the help of another. We can do that because the two variables are correlated. Had they not been correlated, there would be no explanatory power in our X variable. In regression analysis the correlation coefficient and the coefficient of determination are closely related, but their interpretations differ slightly. Furthermore, the correlation coefficient can only be used between pairs of variables, while the coefficient of determination can connect a group of variables with the dependent variable.

In general the correlation coefficient offers no information about the causal relationship between two variables. The aim of this chapter is to put the correlation coefficient in the context of the regression model and to show under what conditions it is appropriate to interpret the correlation coefficient as a measure of the strength of a causal relationship.

The coefficient of determination tries to decompose the average deviation from the mean into an explained part and an unexplained part. It is therefore natural to start the derivation of the measure from the deviation from the mean expression and then introduce the predicted value that comes from the regression model. That is, for a single individual we have:
(Y_i − Ȳ) = (Ŷ_i − Ȳ) + (Y_i − Ŷ_i)   (5.1)
We have to remember that we are trying to explain the deviation from the mean value of Y using the regression model. Hence, the difference between the predicted value (Ŷ) and the mean value (Ȳ) will be denoted the explained part of the mean difference. The remaining part will be denoted the unexplained part. With this simple trick we have decomposed the mean difference for a single observation. We must now transform (5.1) into an expression that is valid for the whole sample, that is, for all observations. We do that by squaring and summing over all n observations:
Σ(Y_i − Ȳ)² = Σ(Ŷ_i − Ȳ)² + Σ(Y_i − Ŷ_i)² + 2Σ(Ŷ_i − Ȳ)(Y_i − Ŷ_i)   (5.2)
It is possible to show that the last sum on the right-hand side equals zero. With that knowledge we may write:
Σ(Y_i − Ȳ)² = Σ(Ŷ_i − Ȳ)² + Σ(Y_i − Ŷ_i)²
With these manipulations we end up with three different components. On the left-hand side we have the Total Sum of Squares (TSS), which represents the total variation of the model. On the right-hand side we first have the Explained Sum of Squares (ESS); the second component on the right-hand side represents the unexplained variation and is called the Residual Sum of Squares (RSS).

Caution: be careful when using different textbooks. The notation is not consistent in the literature, so it is always important to make sure that you know what ESS and RSS stand for.

The identity we have found may now be expressed as:
TSS = ESS + RSS
which may be rewritten in the following way:
1 = ESS/TSS + RSS/TSS
Hence, by dividing both sides by the total variation we may express the explained and unexplained variation as shares of the total variation, and since the right-hand side sums to one, the two shares can be expressed in percentage form. We have:

ESS/TSS = the share of the total variation that is explained by the model

RSS/TSS = the share of the total variation that is unexplained by the model

The coefficient of determination

The percentage of the variation in the dependent variable that is associated with, or explained by, variation in the independent variable in the regression equation:

R² = ESS/TSS = 1 − RSS/TSS

Example 5.1

Assume that an estimated simple linear regression model has an R² equal to 0.65. That would imply that 65 percent of the total variation around the mean value of Y is explained by the variable X included in the model.
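The decomposition above is easy to verify numerically. The following is a minimal sketch with made-up data, assuming NumPy is available; it fits a simple OLS regression, computes TSS, ESS, and RSS, and checks that TSS = ESS + RSS and R² = ESS/TSS:

```python
# Sketch of the sum-of-squares decomposition for a simple OLS regression.
# The data points are hypothetical, chosen only for illustration.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

# OLS estimates for y = b0 + b1*x + e
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x                      # predicted values

tss = np.sum((y - y.mean()) ** 2)        # Total Sum of Squares
ess = np.sum((y_hat - y.mean()) ** 2)    # Explained Sum of Squares
rss = np.sum((y - y_hat) ** 2)           # Residual Sum of Squares

r2 = ess / tss                           # coefficient of determination
print(np.isclose(tss, ess + rss))        # the identity TSS = ESS + RSS
print(np.isclose(r2, 1 - rss / tss))     # R^2 = ESS/TSS = 1 - RSS/TSS
```

Note that the cross-product term drops out exactly, so the identity holds up to floating-point rounding for any data set.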

In the simple regression model there is a nice relationship among the measures of sample correlation coefficient, the OLS estimator of the slope coefficient, and the coefficient of determination. To see this we may rewrite the explained sum of squares in the following way:
ESS = Σ(Ŷ_i − Ȳ)² = b₁²·Σ(X_i − X̄)² = b₁²·(n − 1)·S_X²
Using this transformation we may re-express the coefficient of determination:
R² = ESS/TSS = b₁²·(n − 1)·S_X² / ((n − 1)·S_Y²) = b₁²·S_X²/S_Y²   (5.3)
where S_X and S_Y represent the sample standard deviations of X and Y respectively. Furthermore, we can establish a relation between the OLS slope estimator and the correlation coefficient between X and Y:

b₁ = S_XY/S_X² = (S_XY/(S_X·S_Y))·(S_Y/S_X) = r·(S_Y/S_X)   (5.4)
where S_XY represents the sample covariance between X and Y, and r the sample correlation coefficient for X and Y. Hence, substituting (5.4) into (5.3) shows the relation between the sample correlation coefficient and the coefficient of determination:
R² = (r·S_Y/S_X)²·S_X²/S_Y² = r²   (5.5)
Hence, in the simple regression case the square root of the coefficient of determination is the absolute value of the sample correlation coefficient:
√R² = |r|   (5.6)
This means that the smaller the correlation between X and Y, the smaller is the share of the variation explained by the model, which is the same as saying that the larger is the unexplained share of the variation. That is, the more dispersed the sample points are around the regression line, the smaller are the correlation and the coefficient of determination. This leads to an important conclusion about the coefficient of determination:
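The relations (5.3)–(5.6) can be checked numerically. Below is a minimal sketch with made-up data, assuming NumPy is available; it verifies that in the simple regression model R² equals the squared sample correlation and that the slope equals r·S_Y/S_X:

```python
# Numeric check of the relations between the slope, the correlation
# coefficient, and R^2 in a simple OLS regression (hypothetical data).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.5, 3.1, 4.2, 6.3, 7.4])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
y_hat = y.mean() + b1 * (x - x.mean())   # fitted values

r2 = np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)  # ESS/TSS
r = np.corrcoef(x, y)[0, 1]              # sample correlation coefficient
sx, sy = x.std(ddof=1), y.std(ddof=1)    # sample standard deviations

print(np.isclose(r2, r ** 2))            # (5.5): R^2 = r^2
print(np.isclose(b1, r * sy / sx))       # (5.4): b1 = r * S_Y / S_X
print(np.isclose(np.sqrt(r2), abs(r)))   # (5.6): sqrt(R^2) = |r|
```

Note that these identities are specific to the simple regression model; with several explanatory variables R² is no longer the square of a single pairwise correlation.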

R² and the significance of the OLS estimators

An increased variation in Y, with an unchanged variation in X, will directly reduce the size of the coefficient of determination. But it will not have any effect on the significance of the parameter estimate of the regression model.

From (5.3)-(5.6) it is clear that an increased variation in Y will reduce the size of the coefficient of determination of the regression model. However, when the variation in Y increases, so will the covariance between Y and X, which will increase the value of the parameter estimate. It is therefore not obvious that the significance of the parameter will be unchanged. By forming the t-ratio we can see that:
t = b₁/se(b₁) = b₁/(S/√(Σ(X_i − X̄)²))
where S represents the standard deviation of the residual. The expression for the standard error of the OLS estimator was derived in the previous chapter. Now, let us see what happens to the t-value if we increase the variation of Y by a constant c.
t = c·b₁/(c·S/√(Σ(X_i − X̄)²)) = b₁/(S/√(Σ(X_i − X̄)²))
Hence, increasing the variation of Y by a constant c has no effect whatsoever on the t-value. We should therefore draw the conclusion that the coefficient of determination is just a measure of the linear strength of the model and nothing else. For an applied researcher it is far more interesting and important to analyze the significance of the parameters in the model, which is related to the t-values.
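The invariance argument above can be demonstrated numerically. The following is a minimal sketch with simulated data, assuming NumPy is available; rescaling Y by a constant c multiplies both the slope estimate and its standard error by c, so the t-ratio is unchanged:

```python
# Demonstration that rescaling Y by a constant leaves the slope t-ratio
# of a simple OLS regression unchanged (simulated, hypothetical data).
import numpy as np

def slope_t_value(x, y):
    """t-ratio for the slope in a simple OLS regression of y on x."""
    n = len(x)
    sxx = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    b0 = y.mean() - b1 * x.mean()
    resid = y - (b0 + b1 * x)
    s = np.sqrt(np.sum(resid ** 2) / (n - 2))  # residual standard deviation
    return b1 / (s / np.sqrt(sxx))             # b1 / se(b1)

rng = np.random.default_rng(seed=1)
x = np.arange(1.0, 21.0)
y = 2.0 + 0.5 * x + rng.normal(size=x.size)

c = 3.0
t1, t2 = slope_t_value(x, y), slope_t_value(x, c * y)
print(np.isclose(t1, t2))  # scaling y has no effect on the t-ratio
```

Every quantity in the t-ratio that depends on Y scales by the same factor c, so the factor cancels, just as in the algebraic derivation.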
