10-3. Modelling Linear Relationships with Randomness Present

In this chapter we are dealing with a population for which we can associate to each element two measurements, $x$ and $y$.

We are interested in situations in which the value of $x$ can be used to draw conclusions about the value of $y$, such as predicting the resale value $y$ of a residential house based on its size $x$.

Since the relationship between $x$ and $y$ is not deterministic, statistical procedures must be applied. For any statistical procedure, given in this book or elsewhere, the associated formulas are valid only under specific assumptions.

The set of assumptions in simple linear regression is a mathematical description of the relationship between $x$ and $y$. Such a set of assumptions is known as a model.

For each fixed value of $x$ a sub-population of the full population is determined, such as the collection of all houses with 2,100 square feet of living space. For each element of that sub-population there is a measurement $y$, such as the value of any 2,100-square-foot house. Let $E(y)$ denote the mean of all the $y$-values for each particular value of $x$. $E(y)$ can change from $x$-value to $x$-value: there is the mean value of all 2,100-square-foot houses, the (different) mean value of all 2,500-square-foot houses, and so on.
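As a small illustration of how $E(y)$ varies with $x$, the sketch below groups a handful of invented house records by size and averages the values within each group. The data and the helper name `subpopulation_means` are assumptions made for this example. A real sub-population would contain every house of a given size, not three.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (size in sq ft, resale value in $1000s) pairs, invented for illustration.
houses = [
    (2100, 250.0), (2100, 262.0), (2100, 244.0),
    (2500, 301.0), (2500, 289.0), (2500, 310.0),
]

def subpopulation_means(pairs):
    """Group the y-values by their x-value and return {x: mean of y}."""
    groups = defaultdict(list)
    for x, y in pairs:
        groups[x].append(y)
    return {x: mean(ys) for x, ys in groups.items()}

means = subpopulation_means(houses)
print(means)  # one mean per distinct house size
```

Each distinct value of $x$ gets its own mean, which is the role $E(y)$ plays in the model.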

Our first assumption is that the relationship between $x$ and the mean of the $y$-values in the sub-population determined by $x$ is linear. This means that there exist numbers $\beta_1$ and $\beta_0$ such that

$$E(y)=\beta_1 x+\beta_0$$

This linear relationship is the reason for the word “linear” in “simple linear regression” below. (The word “simple” means that $y$ depends on only one other variable and not two or more.)

Our next assumption is that for each value of $x$ the $y$-values scatter about the mean $E(y)$ according to a normal distribution centered at $E(y)$ and with a standard deviation $\sigma$ that is the same for every value of $x$. This is the same as saying that there exists a normally distributed random variable $\varepsilon$ with mean 0 and standard deviation $\sigma$ so that the relationship between $x$ and $y$ in the whole population is

$$y=\beta_1 x+\beta_0+\varepsilon$$

Our last assumption is that the random deviations associated with different observations are independent.
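One way to get a feel for the three assumptions together is to simulate them. The sketch below draws many independent $y$-values for a single fixed $x$ and compares their sample mean and standard deviation with $E(y)$ and $\sigma$. The parameter values and the function name `simulate_y` are hypothetical, chosen only for illustration.

```python
import random
from statistics import mean, stdev

def simulate_y(x, beta1, beta0, sigma, rng):
    """Draw one y from y = beta1*x + beta0 + eps, with eps ~ N(0, sigma^2)."""
    eps = rng.gauss(0.0, sigma)  # the random deviation for this observation
    return beta1 * x + beta0 + eps

# Hypothetical parameter values (not taken from the text).
BETA1, BETA0, SIGMA = 0.1, 40.0, 15.0
rng = random.Random(0)  # fixed seed so the run is repeatable

# Many independent draws from the sub-population with x = 2100.
ys = [simulate_y(2100, BETA1, BETA0, SIGMA, rng) for _ in range(10_000)]

# The sample mean should sit near E(y) = 0.1*2100 + 40 = 250,
# and the sample standard deviation near sigma = 15.
print(round(mean(ys), 1), round(stdev(ys), 1))
```

Each call to `rng.gauss` produces a fresh draw, which is how the independence assumption shows up in the simulation.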

In summary, the model is:

Simple Linear Regression Model

For each point $(x,y)$ in the data set the $y$-value is an independent observation of

$$y=\beta_1 x+\beta_0+\varepsilon$$

where $\beta_1$ and $\beta_0$ are fixed parameters and $\varepsilon$ is a normally distributed random variable with mean 0 and an unknown standard deviation $\sigma$.

The line with equation $y=\beta_1 x+\beta_0$ is called the population regression line.

Figure 10.5 "The Simple Linear Model Concept" illustrates the model. The notation $N(\mu,\sigma^2)$ denotes a normal distribution with mean $\mu$ and variance $\sigma^2$, hence standard deviation $\sigma$.

Figure 10.5 The Simple Linear Model Concept
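Using the $N(\mu,\sigma^2)$ notation, the assumptions about a single observation can be condensed into one line. This only restates the model. The conditional bar "$\mid x$" (read "for a given $x$") is added here for clarity and does not appear elsewhere in the text.

```latex
y \mid x \;\sim\; N\!\left(\beta_1 x + \beta_0,\ \sigma^2\right),
\qquad
E(y) = \beta_1 x + \beta_0,
\qquad
\varepsilon = y - E(y) \;\sim\; N\!\left(0,\ \sigma^2\right)
```

The mean of the distribution moves along the population regression line as $x$ changes, while the variance $\sigma^2$ stays fixed.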

It is conceptually important to view the model $y=\beta_1 x+\beta_0+\varepsilon$ as a sum of two parts:

  1. Deterministic Part. The first part $\beta_1 x+\beta_0$ is the equation that describes the trend in $y$ as $x$ increases. The line that we seem to see when we look at the scatter diagram is an approximation of the line $y=\beta_1 x+\beta_0$. There is nothing random in this part, and therefore it is called the deterministic part of the model.

  2. Random Part. The second part $\varepsilon$ is a random variable, often called the error term or the noise. This part explains why the actual observed values of $y$ do not lie exactly on a line but fluctuate near it. Information about this term is important, since only when one knows how much noise there is in the data can one know how trustworthy the detected trend is.
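The two-part view can be made concrete with a small sketch that splits one observed $y$ into its deterministic and random pieces. It assumes the parameters are known, which is never the case in practice. The values and the function name `decompose` are hypothetical.

```python
def decompose(x, y, beta1, beta0):
    """Split an observed y into (deterministic part, random part)."""
    deterministic = beta1 * x + beta0  # height of the population regression line at x
    noise = y - deterministic          # what is left over is the value taken by epsilon
    return deterministic, noise

# Hypothetical parameters and a single hypothetical observation.
det, eps = decompose(x=2000, y=300.0, beta1=0.125, beta0=40.0)
print(det, eps)  # 290.0 10.0
```

Here the line predicts 290.0 and the observation lies 10.0 above it. That vertical gap is the noise for this particular point.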

There are three parameters in this model: $\beta_0$, $\beta_1$, and $\sigma$. Each has an important interpretation, particularly $\beta_1$ and $\sigma$. The slope parameter $\beta_1$ represents the expected change in $y$ brought about by a unit increase in $x$. The standard deviation $\sigma$ represents the magnitude of the noise in the data.
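The interpretation of the slope follows directly from the first assumption. Writing $E(y \mid x)$ for the mean of $y$ at a given $x$ (a notation introduced here only for this short derivation), a one-unit increase in $x$ gives:

```latex
E(y \mid x+1) - E(y \mid x)
  = \bigl[\beta_1 (x+1) + \beta_0\bigr] - \bigl[\beta_1 x + \beta_0\bigr]
  = \beta_1
```

The difference does not depend on $x$, so the expected change per unit of $x$ is the same everywhere along the line.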

There are procedures for checking the validity of the three assumptions, but for us it will be sufficient to visually verify the linear trend in the data. If the data set is large then the points in the scatter diagram will form a band about an apparent straight line. The normality of $\varepsilon$ with a constant standard deviation corresponds graphically to the band being of roughly constant width, and with most points concentrated near the middle of the band.
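The "band of roughly constant width" can also be described numerically. The sketch below simulates data from the model and measures the vertical deviations from the line on the left and right halves of the $x$-range. It is not one of the formal checking procedures alluded to above, and it uses the true parameters, which are hypothetical here and unknown in real data.

```python
import random
from statistics import stdev

BETA1, BETA0, SIGMA = 0.1, 40.0, 15.0  # hypothetical parameter values
rng = random.Random(1)                 # fixed seed for repeatability

# Simulate 4000 points with x spread over 1000..3000.
xs = [rng.uniform(1000, 3000) for _ in range(4000)]
data = [(x, BETA1 * x + BETA0 + rng.gauss(0.0, SIGMA)) for x in xs]

def deviations(points, lo, hi):
    """Vertical deviations from the line for points with lo <= x < hi."""
    return [y - (BETA1 * x + BETA0) for x, y in points if lo <= x < hi]

left = deviations(data, 1000, 2000)
right = deviations(data, 2000, 3001)  # upper bound padded to catch x = 3000

# Roughly constant band width: the spread should be similar on both halves.
print(round(stdev(left), 1), round(stdev(right), 1))

# Most points near the middle: share of deviations within one sigma of the line.
all_dev = left + right
share = sum(abs(d) <= SIGMA for d in all_dev) / len(all_dev)
print(round(share, 2))
```

Under the model both spreads should come out close to $\sigma$, and roughly two thirds of the points should fall within one $\sigma$ of the line, as expected for a normal distribution.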

Fortunately, the three assumptions do not need to hold exactly in order for the procedures and analysis developed in this chapter to be useful.
