10-6. The Coefficient of Determination
Last updated
Last updated
If the scatter diagram of a set of pairs shows neither an upward or downward trend, then the horizontal line fits it well, as illustrated in Figure 10.11.
The lack of any upward or downward trend means that when an element of the population is selected at random, knowing the value of the measurement for that element is not helpful in predicting the value of the measurement .
Figure 10.11
The line fits the scatter diagram well.
If the scatter diagram shows a linear trend upward or downward then it is useful to compute the least squares regression line and use it in predicting .
Figure 10.12 "Same Scatter Diagram with Two Approximating Lines" illustrates this. In each panel we have plotted the height and weight data of Section 10.1 "Linear Relationships Between Variables".
This is the same scatter plot as Figure 10.2 "Plot of Height and Weight Pairs", with the average value line superimposed on it in the left panel and the least squares regression line imposed on it in the right panel. The errors are indicated graphically by the vertical line segments.
Figure 10.12 Same Scatter Diagram with Two Approximating Lines
The sum of the squared errors computed for the regression line, , is smaller than the sum of the squared errors computed for any other line. In particular it is less than the sum of the squared errors computed using the line , which sum is actually the number that we have seen several times already. A measure of how useful it is to use the regression equation for prediction of is how much smaller is than . In particular, the proportion of the sum of the squared errors for the line that is eliminated by going over to the least squares regression line is
We can think of as the proportion of the variability in that cannot be accounted for by the linear relationship between and , since it is still there even when is taken into account in the best way possible (using the least squares regression line; remember that is the smallest the sum of the squared errors can be for any line).
Seen in this light, the coefficient of determination, the complementary proportion of the variability in , is the proportion of the variability in all the measurements that is accounted for by the linear relationship between and .
In the context of linear regression the coefficient of determination is always the square of the correlation coefficient discussed in Section 10.2 "The Linear Correlation Coefficient". Thus the coefficient of determination is denoted , and we have two additional formulas for computing it.
The coefficient of determination of a collection of (x,y)(x,y) pairs is the number r2 computed by any of the following three expressions:
It measures the proportion of the variability in y that is accounted for by the linear relationship between x and y.
If the correlation coefficient is already known then the coefficient of determination can be computed simply by squaring , as the notation indicates, .
EXAMPLE 10. The value of used vehicles of the make and model discussed in Note 10.19 "Example 3" in Section 10.4 "The Least Squares Regression Line" varies widely. The most expensive automobile in the sample in Table 10.3 "Data on Age and Value of Used Automobiles of a Specific Make and Model" has value $30,500, which is nearly half again as much as the least expensive one, which is worth $20,400. Find the proportion of the variability in value that is accounted for by the linear relationship between age and value.
[ Solution ]
The proportion of the variability in value that is accounted for by the linear relationship between it and age is given by the coefficient of determination, .
Since the correlation coefficient was already computed in Note 10.19 "Example 3" as , .
About 67% of the variability in the value of this vehicle can be explained by its age.
EXAMPLE 11. Use each of the three formulas for the coefficient of determination to compute its value for the example of ages and values of vehicles.
[ Solution ] In Note 10.19 "Example 3" in Section 10.4 "The Least Squares Regression Line" we computed the exact values
, , , 
In Note 10.24 "Example 5" in Section 10.4 "The Least Squares Regression Line" we computed the exact value
Inserting these values into the formulas in the definition, one after the other, gives
which rounds to 0.670.
The discrepancy between the value here and in the previous example is because a rounded value of from Note 10.19 "Example 3" was used there.
The actual value of before rounding is 0.8186864772, which when squared gives the value for obtained here.