10-7. Estimation and Prediction

Consider the following pairs of problems, in the context of Note 10.19 "Example 3" in Section 10.4 "The Least Squares Regression Line", the automobile age and value example.

    1. Estimate the average value of all four-year-old automobiles of this make and model.

    2. Construct a 95% confidence interval for the average value of all four-year-old automobiles of this make and model.

    1. Shylock intends to buy a four-year-old automobile of this make and model next week. Predict the value of the first such automobile that he encounters.

    2. Construct a 95% confidence interval for the value of the first such automobile that he encounters.

The method of solution and answer to the first question in each pair, (1a) and (2a), are the same. When we set xx equal to 4 in the least squares regression equation y^=2.05x+32.83\hat{y}=−2.05x+32.83 that was computed in part (c) of Note 10.19 "Example 3" in Section 10.4 "The Least Squares Regression Line", the number returned,

y^=2.05(4)+32.83=24.63\hat{y}=−2.05(4)+32.83=24.63

which corresponds to value $24,630, is an estimate of precisely the number sought in question (1a): the mean E(y)E(y) of all yy values when x=4x = 4 . Since nothing is known about the first four-year-old automobile of this make and model that Shylock will encounter, our best guess as to its value is the mean value E(y)E(y) of all such automobiles, the number 24.63 or $24,630, computed in the same way.

The answers to the second part of each question differ. In question (1b) we are trying to estimate a population parameter: the mean of the all the y-values in the sub-population picked out by the value x=4x = 4, that is, the average value of all four-year-old automobiles. In question (2b), however, we are not trying to capture a fixed parameter, but the value of the random variable yy in one trial of an experiment: examine the first four-year-old car Shylock encounters. In the first case we seek to construct a confidence interval in the same sense that we have done before. In the second case the situation is different, and the interval constructed has a different name, prediction interval. In the second case we are trying to “predict” where a the value of a random variable will take its value.

100(1α)%100(1−α)\% Confidence Interval for the Mean Value of yy at x=xpx=x_p

yp^±tα2sε1n+(xpxˉ)2SSxx\hat{y_p} ±t_{α∕2} s_ε \sqrt{\frac{1}{n} + \frac{(x_p−\bar{x})^2} {SS_{xx}}}

where

  1. xpx_p is a particular value of xx that lies in the range of x-values in the sample data set used to construct the least squares regression line;

  2. yp^\hat{y_p} is the numerical value obtained when the least square regression equation is evaluated at x=xpx=x_p ; and

  3. the number of degrees of freedom for tα2t_{α∕2} is df=(n2)df=(n−2).

The assumptions listed in Section 10.3 "Modelling Linear Relationships with Randomness Present" must hold.

The formula for the prediction interval is identical except for the presence of the number 1 underneath the square root sign. This means that the prediction interval is always wider than the confidence interval at the same confidence level and value of xx . In practice the presence of the number 1 tends to make it much wider.

100(1α)%100(1−α)\% Prediction Interval for an Individual New Value of yy at x=xpx=x_p

yp^±tα2sε1+1n+(xpxˉ)2SSxx\hat{y_p} ±t_{α∕2} s_ε \sqrt{1 + \frac{1}{n} + \frac{(x_p−\bar{x})^2} {SS_{xx}}}

where

  1. xpx_p is a particular value of xx that lies in the range of x-values in the sample data set used to construct the least squares regression line;

  2. yp^\hat{y_p} is the numerical value obtained when the least square regression equation is evaluated at x=xpx=x_p ; and

  3. the number of degrees of freedom for tα2t_{α∕2} is df=(n2)df=(n−2).

The assumptions listed in Section 10.3 "Modelling Linear Relationships with Randomness Present" must hold.

EXAMPLE 12. Using the sample data of Note 10.19 "Example 3" in Section 10.4 "The Least Squares Regression Line", recorded in Table 10.3 "Data on Age and Value of Used Automobiles of a Specific Make and Model", construct a 95% confidence interval for the average value of all three-and-one-half-year-old automobiles of this make and model.

[ Solution ]

Solving this problem is merely a matter of finding the values of yp^\hat{y_p}, αα and tα2t_{α∕2} , sεs_ε , xˉ\bar{x} , and SSxxSS_{xx} and inserting them into the confidence interval formula given just above.

Most of these quantities are already known. From Note 10.19 "Example 3" in Section 10.4 "The Least Squares Regression Line", SSxx=14SS_{xx} = 14and xˉ=4\bar{x} =4. From Note 10.31 "Example 7" in Section 10.5 "Statistical Inferences About ", sε=1.902169814s_ε = 1.902169814. From the statement of the problem xp=3.5x_p=3.5 , the value of xx of interest.

The value of yp^\hat{y_p} is the number given by the regression equation, which by Note 10.19 "Example 3" is y^=2.05x+32.83\hat{y}=−2.05x+32.83 , when x=xpx=x_p, that is, when x=3.5x = 3.5 . Thus here yp^=2.05(3.5)+32.83=25.655\hat{y_p}=−2.05(3.5)+32.83=25.655.

Lastly, confidence level 95% means that α=(10.95)=0.05α=(1−0.95)=0.05 so α2=0.025α∕2=0.025 . Since the sample size is n=10n = 10 , there are (n2)=8(n−2)=8 degrees of freedom. By Figure 12.3 "Critical Values of ", t0.025=2.306t_{0.025}=2.306 .

Thus

yp^±tα2sε1n+(xpxˉ)2SSxx\hat{y_p} ±t_{α∕2} s_ε \sqrt{\frac{1}{n} + \frac{(x_p−\bar{x})^2} {SS_{xx}}} =25.655±(2.306)(1.902169814)110+(3.54)214= 25.655±(2.306)(1.902169814)\sqrt{\frac{1}{10}+ \frac{(3.5−4)^2} {14}} =25.655±4.3864035910.1178571429= 25.655±4.386403591 \sqrt{0.1178571429} =25.655±1.506=25.655±1.506

which gives the interval (24.149,27.161)(24.149,27.161) .

We are 95% confident that the average value of all three-and-one-half-year-old vehicles of this make and model is between $24,149 and $27,161.

EXAMPLE 13. Using the sample data of Note 10.19 "Example 3" in Section 10.4 "The Least Squares Regression Line", recorded in Table 10.3 "Data on Age and Value of Used Automobiles of a Specific Make and Model", construct a 95% prediction interval for the predicted value of a randomly selected three-and-one-half-year-old automobile of this make and model.

[ Solution ]

The computations for this example are identical to those of the previous example, except that now there is the extra number 1 beneath the square root sign.

Since we were careful to record the intermediate results of that computation, we have immediately that the 95% prediction interval is

yp^±tα2sε1+1n+(xpxˉ)2SSxx\hat{y_p} ±t_{α∕2} s_ε \sqrt{1 + \frac{1}{n} + \frac{(x_p−\bar{x})^2} {SS_{xx}}} =25.655±4.3864035911.1178571429=25.655±4.386403591 \sqrt{1.1178571429} =25.655±4.638=25.655±4.638

which gives the interval (21.017,30.293)(21.017,30.293).

We are 95% confident that the value of a randomly selected three-and-one-half-year-old vehicle of this make and model is between $21,017 and $30,293.

Note what an enormous difference the presence of the extra number 1 under the square root sign made. The prediction interval is about two-and-one-half times wider than the confidence interval at the same level of confidence.

Last updated