### Least Squares

#### Part 1: An example: Data on cancer deaths

The Hanford Atomic Energy Plant in Washington has been a plutonium production facility since World War II, and some of the wastes have been stored in pits in the same area. Radioactive waste has been seeping into the Columbia River since that time, and eight Oregon counties and the city of Portland have been exposed to radioactive contamination. The table below lists the number of cancer deaths per 100,000 residents for Portland and these counties. It also lists an index of exposure that measures the proximity of the residents to the contamination. The index assumes that exposure is directly proportional to river frontage and inversely proportional both to the distance from Hanford and to the square of the county's (or city's) average depth away from the river. The accompanying figure is a scatter plot of the deaths vs. index data.

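| County/city | Index of exposure | Cancer deaths per 100,000 residents |
|---|---|---|
| Umatilla | 2.5 | 147 |
| Morrow | 2.6 | 130 |
| Gilliam | 3.4 | 130 |
| Sherman | 1.3 | 114 |
| Wasco | 1.6 | 138 |
| Hood River | 3.8 | 162 |
| Portland | 11.6 | 208 |
| Columbia | 6.4 | 178 |
| Clatsop | 8.3 | 210 |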

Even though the data points do not lie on a straight line, they exhibit a definite linear trend. The problem of least squares is to find the line that "best fits" the data. Our next figure shows a candidate for such a line, along with vertical line segments indicating the deviations or residuals, that is, the (directed) vertical distances from the data points to the corresponding points on the model line. For reasons we will see shortly, our criterion for best fit is that the choice of line should minimize the sum of the squares of the residuals, hence the name "least squares." The best-fitting line is called the least squares line or the regression line.
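To make the criterion concrete, here is a minimal Python sketch (assuming NumPy is available) that computes the sum of squared residuals for a hand-picked candidate line and for the line returned by `np.polyfit`. The names `index`, `deaths`, and `sum_squared_residuals`, and the candidate values 10 and 110, are our own choices for illustration and are not part of the module.

```python
import numpy as np

# Exposure index and cancer deaths per 100,000, copied from the table above.
index = np.array([2.5, 2.6, 3.4, 1.3, 1.6, 3.8, 11.6, 6.4, 8.3])
deaths = np.array([147, 130, 130, 114, 138, 162, 208, 178, 210], dtype=float)


def sum_squared_residuals(m, b):
    """Sum of squared vertical deviations of the data from the line y = m*x + b."""
    residuals = deaths - (m * index + b)
    return float(np.sum(residuals ** 2))


# A hand-picked candidate line (hypothetical values, used only for comparison).
candidate_ssr = sum_squared_residuals(10.0, 110.0)

# Degree-1 polynomial fit; np.polyfit returns the coefficients as [slope, intercept].
m_hat, b_hat = np.polyfit(index, deaths, 1)
best_ssr = sum_squared_residuals(m_hat, b_hat)

print(f"least squares line: y = {m_hat:.2f} x + {b_hat:.2f}")
print(f"sum of squares: candidate = {candidate_ssr:.1f}, least squares = {best_ssr:.1f}")
```

Trying other values of `m` and `b` in `sum_squared_residuals` should never give a smaller sum than the fitted line does.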

Remark about notation: Throughout this module, in contrast to our conventions elsewhere in the linear algebra materials, we will maintain a distinction between vectors and scalars by boldfacing vector names but not boldfacing scalars.

1. Suppose the least squares line has the form y = mx + b, where x represents the index value and y the number of cancer deaths per 100,000 population. What does b represent in terms of cancer deaths? What does m represent in terms of cancer deaths?
2. Suppose we represent the data points as

$$(X_1, Y_1),\ (X_2, Y_2),\ \ldots,\ (X_9, Y_9).$$

Explain why the quantity to be minimized is

$$[Y_1 - (mX_1 + b)]^2 + [Y_2 - (mX_2 + b)]^2 + \cdots + [Y_9 - (mX_9 + b)]^2.$$

3. Explain why our least squares problem is to find numbers m and b so as to minimize the distance in $\mathbb{R}^9$ between the vector

$$\mathbf{y} = (Y_1, Y_2, \ldots, Y_9)^T$$

and the set of vectors of the form

$$(mX_1 + b,\ mX_2 + b,\ \ldots,\ mX_9 + b)^T.$$
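As a numerical check on this reformulation, the following sketch (assuming NumPy) compares the sum of squared residuals, computed term by term, with the squared Euclidean distance between the data vector and the vector of model values. The trial values of `m` and `b` are arbitrary and hypothetical, and the variable names are ours.

```python
import numpy as np

# Data from the table: X holds the index values, Y the deaths per 100,000.
X = np.array([2.5, 2.6, 3.4, 1.3, 1.6, 3.8, 11.6, 6.4, 8.3])
Y = np.array([147, 130, 130, 114, 138, 162, 208, 178, 210], dtype=float)

# Arbitrary trial slope and intercept (hypothetical values, not the best fit).
m, b = 9.0, 115.0

# Sum of squares written out one data point at a time.
sum_of_squares = sum((Y[i] - (m * X[i] + b)) ** 2 for i in range(len(X)))

# Euclidean distance in R^9 between Y and the vector (m*X_1 + b, ..., m*X_9 + b).
model_values = m * X + b
distance = np.linalg.norm(Y - model_values)

print(sum_of_squares, distance ** 2)
```

The two printed numbers agree up to rounding for any choice of `m` and `b`, and since squaring is increasing on nonnegative numbers, minimizing one is the same as minimizing the other.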

Let

$$\mathbf{x} = (X_1, X_2, \ldots, X_9)^T$$

and

$$\mathbf{1} = (1, 1, \ldots, 1)^T$$

in $\mathbb{R}^9$. Then the set of vectors in the preceding step is also the set of vectors of the form

$$m\mathbf{x} + b\mathbf{1},$$

where m and b are real numbers.

• Explain why this set is a subspace of $\mathbb{R}^9$. (We call this space the model space.)
• Explain why the least squares problem is to find the closest vector in the model space to the vector $\mathbf{y}$.
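One way to explore the "closest vector" formulation numerically is sketched below, assuming NumPy. We stack the index vector and the all-ones vector as the columns of a matrix `A` (our name), so that every vector in the model space has the form `A @ [m, b]`, and we ask `np.linalg.lstsq` for the coefficients that minimize the distance to the data vector. The final check relies on the standard fact that the closest point in a subspace leaves a residual orthogonal to that subspace.

```python
import numpy as np

# Data from the table: index values and deaths per 100,000.
x = np.array([2.5, 2.6, 3.4, 1.3, 1.6, 3.8, 11.6, 6.4, 8.3])
y = np.array([147, 130, 130, 114, 138, 162, 208, 178, 210], dtype=float)
one = np.ones_like(x)

# Columns x and 1 span the model space: A @ [m, b] equals m*x + b*1.
A = np.column_stack([x, one])

# Coefficients minimizing the Euclidean norm of y - A @ [m, b].
coeffs, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
m, b = coeffs

closest = A @ coeffs      # the vector in the model space nearest to y
residual = y - closest    # what is left over after projecting

# The residual should be orthogonal to both spanning vectors (up to rounding).
print("m =", m, "b =", b)
print("residual . x =", residual @ x)
print("residual . 1 =", residual @ one)
```

Both dot products come out as zero up to floating-point error, which is the geometric signature of an orthogonal projection onto the model space.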

4. What are the possible dimensions of the model space? What would it mean if the dimension were something other than 2?
5. What would it mean if the vector $\mathbf{y}$ were in the model space? Is this the case for the cancer death data? Why or why not?
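After working through the last two questions by hand, you may want to check your answers numerically. The sketch below, assuming NumPy, uses `np.linalg.matrix_rank` to compute the dimension of the model space for the actual data and for a made-up degenerate data set in which every index value is the same (`x_same` is hypothetical, not from the table). It then tests whether the data vector lies in the model space by checking whether appending it as a third column raises the rank.

```python
import numpy as np

# Data from the table: index values and deaths per 100,000.
x = np.array([2.5, 2.6, 3.4, 1.3, 1.6, 3.8, 11.6, 6.4, 8.3])
y = np.array([147, 130, 130, 114, 138, 162, 208, 178, 210], dtype=float)
one = np.ones_like(x)

# Dimension of the model space = rank of the matrix with columns x and 1.
A = np.column_stack([x, one])
rank_model = np.linalg.matrix_rank(A)

# Hypothetical degenerate case: every county has the same index value,
# so x is a multiple of the all-ones vector.
x_same = np.full(9, 3.0)
rank_degenerate = np.linalg.matrix_rank(np.column_stack([x_same, one]))

# y lies in the model space exactly when adding it as a column
# does not increase the rank.
rank_augmented = np.linalg.matrix_rank(np.column_stack([A, y]))
y_in_model_space = bool(rank_augmented == rank_model)

print("dimension of model space:", rank_model)
print("dimension in degenerate case:", rank_degenerate)
print("y in model space:", y_in_model_space)
```

The rank test is a numerical one: `matrix_rank` uses a tolerance on singular values, so it is a check on your reasoning rather than a substitute for it.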

modules at math.duke.edu