You might have heard of linear regression for fitting a straight line to data, such as in the picture below. But did you know that you can fit much more complicated functions to data, and that the procedure is almost as simple? You may be surprised to learn that “linear” covers a lot more than straight lines. This article looks at the general “linear least squares” problem. So why is it called “linear”, then? We’ll get to that.

As we’ll see below, the math becomes a lot simpler but also more generally useful when we introduce Linear Algebra and matrices. But first we consider the common case of fitting a straight line, without the use of matrices.

That is, we do a straight line fit the hard way.

#### Linear Regression (straight-line fitting, non-matrix version)

Often we have data that is noisy, as with the blue dots in the figure, and we are interested in the trend line. We believe that the “true” y data follows a straight-line relationship with respect to x,

$$y = mx + b$$

but has been corrupted by noise, so instead the observed y values are a little above or below the true y data:

$$y_i = m x_i + b + \epsilon_i$$

where the term $\epsilon_i$ refers to a random error that has been added to the i-th data point.

**Statistics Note:** (If statistical language triggers you, skip this paragraph.) For the statistical analysis of this procedure, you need to know that most commonly the errors are assumed to come from a normal distribution with mean = 0, and they are independent. Also, they all have the same standard deviation. If you drew the error bars, they would all be the same size. We’re not doing a statistical analysis, but you can assume we keep those assumptions because the “best” fit is different if they don’t hold.

Note that the equation above with the error term is for the purpose of the analysis. We don’t have access to the error value or the “true y”. All we have is a measured $y_i$ for each $x_i$. This is our blue dot. We then have a predicted or estimated y value for that $x_i$, and we indicate that with a hat: $\hat{y}_i = \hat{m} x_i + \hat{b}$. The hats on m and b indicate that they too are estimated values, our best guess at the true values.

**Statistics Note**: In statistics terms, $\hat{m}$ and $\hat{b}$ are *estimators* of the true values $m$ and $b$. The “best fit” procedure is the one that gives good estimators, which to a statistician usually means unbiased (it’s the right value on average) and minimum variance (most guesses are closer for this procedure than for others). This is what I meant when I said the “best” fit is different for a different error model. Again, I’m not going to actually do a statistical analysis here, but I assure you there is plenty of material out there if you want to find one.

So what is the procedure? For any particular guess at the parameters $\hat{m}$ and $\hat{b}$, we can calculate the *residuals* on each data point, the difference between the measured and predicted y values:

$$r_i = y_i - \hat{y}_i = y_i - (\hat{m} x_i + \hat{b})$$

Some will be positive, some negative, so what we do is square them and add them up. That gives us this quantity (assuming we have $n$ data points), the total squared error:

$$E = \sum_{i=1}^{n} \left(y_i - \hat{m} x_i - \hat{b}\right)^2$$

The $x_i$ and $y_i$ are fixed in this expression. Do not think of them as variables. Instead, think of $\hat{m}$ and $\hat{b}$ as the variables. If we change those parameters, we get a different trend line, different residuals, and a different calculated value of E. The goal in least-squares fitting is to find the parameters which give us the lowest possible value of the total squared error E.
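As a quick illustration, E can be computed for any candidate pair of parameters; a minimal sketch (the data values and function name here are my own invention):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])  # roughly follows y = 2x + 1

def total_squared_error(m_hat, b_hat, x, y):
    """Sum of squared residuals for the candidate line y = m_hat*x + b_hat."""
    residuals = y - (m_hat * x + b_hat)
    return np.sum(residuals**2)

# A line close to the trend gives a much smaller E than a bad guess:
print(total_squared_error(2.0, 1.0, x, y))
print(total_squared_error(0.0, 0.0, x, y))
```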

In other words, we are going to **minimize the function E with respect to the variables** $\hat{m}$ **and** $\hat{b}$.

So that means we’re going to take the derivatives with respect to those variables and set them equal to 0.

First we expand E in terms of the parameters. For simplicity we’ll just write $\sum$ for $\sum_{i=1}^{n}$:

$$E = \sum y_i^2 + \hat{m}^2 \sum x_i^2 + n \hat{b}^2 - 2\hat{m} \sum x_i y_i - 2\hat{b} \sum y_i + 2\hat{m}\hat{b} \sum x_i$$

The n in the third term comes from the fact that we are adding a $\hat{b}^2$ term for each i, a total of n such terms. Now we take the derivatives.

$$\frac{\partial E}{\partial \hat{m}} = 2\hat{m} \sum x_i^2 - 2\sum x_i y_i + 2\hat{b} \sum x_i = 0$$

$$\frac{\partial E}{\partial \hat{b}} = 2n\hat{b} - 2\sum y_i + 2\hat{m} \sum x_i = 0$$

(Remember, the derivatives are with respect to the hatted parameters. The summations over data values are constants.) The second equation can be rearranged to

$$\hat{b} = \frac{\sum y_i - \hat{m} \sum x_i}{n}$$

This can be used to find the value of $\hat{b}$ once $\hat{m}$ is known. Substituting that into the first equation (and dividing out the 2):

$$\hat{m} \sum x_i^2 - \sum x_i y_i + \frac{\sum y_i - \hat{m} \sum x_i}{n} \sum x_i = 0$$

Giving us the standard linear regression equations:

$$\hat{m} = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}, \qquad \hat{b} = \frac{\sum y_i - \hat{m} \sum x_i}{n}$$

Whew! That was a heck of a lot of algebra!
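The standard regression equations above translate directly into code. A minimal sketch on synthetic data (the data, noise level, and names are mine; np.polyfit serves as a cross-check):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=len(x))  # true m = 2, b = 1, plus noise

n = len(x)
# Closed-form least-squares slope and intercept
m_hat = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x)**2)
b_hat = (np.sum(y) - m_hat * np.sum(x)) / n
print(m_hat, b_hat)
```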

Now watch how easy this becomes when we apply linear algebra.

#### Linear Regression (straight-line fitting, matrix version)

We define a column vector $\mathbf{x}$ of our x values and similarly $\mathbf{y}$ of our observed y values. Our fitted y values from the model are $\hat{\mathbf{y}}$.

Anticipating the more general fitting problem, we will give our unknown parameters the names $p_1$ (for the constant term) and $p_2$ (for the slope, the coefficient of $x$). Then

$$\hat{y}_i = p_1 + p_2 x_i$$

and the residuals can be expressed as $r_i = y_i - \hat{y}_i$.

Let’s define a parameter vector $\mathbf{p} = \begin{bmatrix} p_1 \\ p_2 \end{bmatrix}$. Then we can write a matrix equation for the predicted values:

$$\hat{\mathbf{y}} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \begin{bmatrix} p_1 \\ p_2 \end{bmatrix}$$

or $\hat{\mathbf{y}} = A\mathbf{p}$, where A is the n x 2 matrix whose first column is all 1’s and whose second column contains the x values. So the residual vector is $\mathbf{r} = \mathbf{y} - A\mathbf{p}$.

Recall that for any vector $\mathbf{v}$, the magnitude squared, or the sum of the squared components, is

$$\lVert \mathbf{v} \rVert^2 = \mathbf{v}^T \mathbf{v}$$

So the total squared residual E can be written as

$$E = (\mathbf{y} - A\mathbf{p})^T (\mathbf{y} - A\mathbf{p})$$

Expanding a product like this in linear algebra isn’t that different from ordinary algebra, using the Distributive Law (you may have learned the technique called FOIL or First, Outer, Inner, Last). The difference in matrix expressions is that you should keep products in the order they occurred because AB is not in general the same as BA.

We also need the property that the transpose of a product reverses the order of multiplication, i.e., that $(AB)^T = B^T A^T$, so $(A\mathbf{p})^T = \mathbf{p}^T A^T$. Expanding:

$$E = \mathbf{y}^T \mathbf{y} - \mathbf{y}^T A\mathbf{p} - \mathbf{p}^T A^T \mathbf{y} + \mathbf{p}^T A^T A \mathbf{p}$$

Our vectors such as $\mathbf{y}$ and $A\mathbf{p}$ are all n x 1 column vectors, and their transposes are 1 x n row vectors. So every term here is a 1 x 1 scalar, a simple number, as of course it must be, since E is a simple number representing the total squared error.

The middle two terms are transposes of each other, but since they are scalars, the transposes are equal, representing the same number. So we can combine them:

$$E = \mathbf{y}^T \mathbf{y} - 2\mathbf{p}^T A^T \mathbf{y} + \mathbf{p}^T A^T A \mathbf{p}$$

We want to minimize this with respect to $\mathbf{p}$. That is, we want to find the value of $\mathbf{p}$ such that

$$\nabla_{\mathbf{p}} E = \mathbf{0}$$

The first term is constant with respect to $\mathbf{p}$, so its gradient is 0. The second term has the form $-2\mathbf{p}^T \mathbf{c}$ with $\mathbf{c} = A^T \mathbf{y}$, and its gradient is equal to $-2A^T \mathbf{y}$. The third term has the form $\mathbf{p}^T M \mathbf{p}$ with $M = A^T A$, and its gradient is equal to $2A^T A \mathbf{p}$.

(For proofs of those gradients, see this post)

Thus the least squares solution for $\mathbf{p}$ satisfies

$$A^T A \mathbf{p} = A^T \mathbf{y}$$

This is a linear system of equations for $\mathbf{p}$ of the form $M\mathbf{p} = \mathbf{c}$. $A^T$ is 2 x n and $A$ is n x 2, so $A^T A$ is 2 x 2.

This is a general result. We started by asking what vector $\mathbf{p}$ minimizes $\lVert \mathbf{y} - A\mathbf{p} \rVert^2$. If a problem can be put in this form, the solution is the vector $\mathbf{p}$ which solves the linear system $A^T A \mathbf{p} = A^T \mathbf{y}$.

In linear algebra we often run into overdetermined systems (more equations than unknowns) of the form $A\mathbf{p} = \mathbf{y}$ with no exact solution. In these cases the best approximate solution is the one that minimizes the magnitude of the difference between $\mathbf{y}$ and $A\mathbf{p}$. The linear regression problem, with n rows and 2 columns, is just one example.

As we’ll see below, a much wider class of problems than linear regression fits this framework, and all those problems can be solved the same way. Because the equations are a linear system in terms of the unknown parameters, all such problems are considered linear least-squares.

The system $A^T A \mathbf{p} = A^T \mathbf{y}$ is a system of linear equations. Software to solve linear systems is available in many computer languages and mathematics libraries. Python has one called numpy.linalg.solve, and Matlab has the backslash operator (\). So one approach is to build $A^T A$, build $A^T \mathbf{y}$, and then give them to a linear system solver to find the unique exact solution of this system, which will be the best least-squares solution to the original overdetermined system.
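A minimal sketch of this normal-equations route (synthetic, noise-free data for clarity; the names here are my own):

```python
import numpy as np

x = np.linspace(0, 5, 20)
y = 3.0 * x + 2.0  # noise-free, so the fit recovers the parameters exactly

# Build the n x 2 matrix A: a column of ones, then the x values
A = np.column_stack([np.ones(len(x)), x])

# Solve the 2 x 2 system (A^T A) p = A^T y
p = np.linalg.solve(A.T @ A, A.T @ y)
print(p)  # [intercept, slope]
```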

But both those languages also have the built-in capability to find the least-squares solution of an overdetermined system. Python has numpy.linalg.lstsq, and Matlab’s backslash operator will calculate the least-squares solution if the system is overdetermined. So in fact all you have to do is construct $A$ and $\mathbf{y}$.

To summarize, the way to do linear regression with linear algebra is:

- Create the 2-column matrix $A$ whose first column is all 1’s and whose second column is the $x$ values.
- Create a column vector $\mathbf{y}$ of the $y$ values.
- If your computer environment has a least-squares system solver, use it to find the least-squares solution of $A\mathbf{p} = \mathbf{y}$.
- Otherwise, compute $A^T A$ and $A^T \mathbf{y}$ and use a linear system solver to find the solution of the system $A^T A \mathbf{p} = A^T \mathbf{y}$.
- The elements of $\mathbf{p}$ are the least-squares values of intercept and slope, in that order.

Here is the Python code that performed the fit and plotted the results that appear at the top of this page. This code is modeled from the example code at https://numpy.org/doc/stable/reference/generated/numpy.linalg.lstsq.html.

```
A = np.vstack([np.ones(len(x)),x]).T
p = np.linalg.lstsq(A, y, rcond=None)[0]
plt.plot(x, y, 'o', label='Original data', markersize=10)
plt.plot(x, p[1]*x + p[0], 'r', label='Fitted line')
plt.legend()
plt.show()
```

Notice that the actual regression only took two lines here. The other four lines are to generate the plot.

#### General Linear Least Squares

All of the method above is based on the equation $\hat{\mathbf{y}} = A\mathbf{p}$. If the model for $\hat{y}$ can be written in that form, then the total squared error has the form we saw above, and the optimum least-squares solution has exactly the same form. All we need is to properly define the matrix $A$.

We can do this as long as our model has the form

$$\hat{y} = p_1 f_1(x) + p_2 f_2(x) + \cdots + p_r f_r(x)$$

i.e., a sum of functions $f_j(x)$ multiplied by unknown coefficients $p_j$, which we will optimize in the least-squares sense.

This has the matrix representation $\hat{\mathbf{y}} = A\mathbf{p}$ if $A$ is the matrix whose $(i,j)$-th element is $f_j(x_i)$. The $j$-th column contains the values of the function $f_j$ evaluated at each of the x values.

For linear regression, we have $f_1(x) = 1$ and $f_2(x) = x$. This is easily extended to polynomials of any degree by adding columns for $x^2$, $x^3$, etc.

But it is much more general than that, since the $f_j$ can be any functions at all. We’ll see some examples in the next section. The general method is virtually the same as linear regression, with only the first step generalized:

- Create the r-column matrix $A$ whose $j$-th column consists of $f_j(x)$ evaluated at each $x_i$.
- Create a column vector $\mathbf{y}$ of the $y$ values.
- If your computer environment has a least-squares system solver, use it to find the least-squares solution of $A\mathbf{p} = \mathbf{y}$.
- Otherwise, compute $A^T A$ and $A^T \mathbf{y}$ and use a linear system solver to find the solution of the system $A^T A \mathbf{p} = A^T \mathbf{y}$.
- The elements of $\mathbf{p}$ are the least-squares values of the coefficients $p_1, \ldots, p_r$, in that order.
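The steps above can be sketched in a few lines of NumPy (the helper names and the quadratic example are my own):

```python
import numpy as np

def design_matrix(funcs, x):
    """n x r matrix whose j-th column is funcs[j] evaluated at each x value."""
    return np.column_stack([f(x) for f in funcs])

def linear_least_squares(funcs, x, y):
    """Least-squares coefficients p for the model sum_j p_j * funcs[j](x)."""
    A = design_matrix(funcs, x)
    return np.linalg.lstsq(A, y, rcond=None)[0]

# Example: recover a quadratic y = 1 + 2x + 3x^2 (noise-free for clarity)
x = np.linspace(-1, 1, 25)
y = 1 + 2 * x + 3 * x**2
funcs = [lambda x: np.ones_like(x), lambda x: x, lambda x: x**2]
p = linear_least_squares(funcs, x, y)
print(p)
```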

#### Examples of the general method

##### Line constrained to go through the origin

Sometimes we want a line where the y-intercept is forced to be 0. In this case, there’s only one parameter, the slope. It’s a little bit of overkill to use a computer linear algebra system to solve this problem, but we can still take advantage of the equations to immediately write down the solution.

Since we only have one function, $f_1(x) = x$, our matrix $A$ is now just the single column of $x$ values.

Then $A^T A$ is $\sum x_i^2$ and $A^T \mathbf{y}$ is $\sum x_i y_i$. The “system” to solve is

$$\left(\sum x_i^2\right) \hat{m} = \sum x_i y_i$$

so

$$\hat{m} = \frac{\sum x_i y_i}{\sum x_i^2}$$

is the optimal slope. The figure below shows a fit using this model, compared to an ordinary linear regression with a nonzero y-intercept.
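This one-parameter case is a one-liner in code; a quick sketch (the data values are made up):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])  # roughly y = 2x, through the origin

# Optimal slope for a line constrained to pass through the origin
m_hat = np.sum(x * y) / np.sum(x**2)
print(m_hat)
```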

##### Cubic polynomial

The model here is $\hat{y} = p_1 + p_2 x + p_3 x^2 + p_4 x^3$, so our functions are $f_1(x) = 1$, $f_2(x) = x$, $f_3(x) = x^2$, $f_4(x) = x^3$. The matrix $A$ contains those powers of x as its columns.

Again we use Python’s numpy.linalg.lstsq() to calculate the least-squares solution to $A\mathbf{p} = \mathbf{y}$:

```
A = np.vstack([np.ones(len(x)), x, x**2, x**3]).T
p = np.linalg.lstsq(A, y, rcond=None)[0]
```

The result is shown in the figure below.

##### Sinusoid

If we are trying to fit a model of the form $\hat{y} = a \sin(2\pi f t + \phi)$, we cannot use the linear least squares procedure to find the frequency $f$, because $f$ does not appear as the coefficient of a term of the form $p_j f_j(t)$. The model is not linear in the parameter $f$.

However, despite the fact that the phase shift $\phi$ is also a nonlinear parameter, we can fit a phase-shifted sinusoid to data, because a phase-shifted sinusoid can be represented as a sum of an unshifted sine and cosine:

$$a \sin(2\pi f t + \phi) = a\cos\phi \, \sin(2\pi f t) + a\sin\phi \, \cos(2\pi f t)$$

So we will use a model with $f_1(t) = \sin(2\pi f t)$ and $f_2(t) = \cos(2\pi f t)$, where $f$ is either known or estimated by other means.

The result is shown below. The original data had a frequency of 2 Hz, so the angular frequency was $2\pi f = 4\pi \approx 12.57$. In practice you would have some procedure to estimate the frequency, so you might not obtain the correct value. To simulate that, we used a value of 11 in this fit.
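A sketch of such a fit on synthetic data (the amplitude, phase, noise level, and names are my own; the deliberately wrong frequency of 11 mirrors the description above):

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0, 2, 200)
omega_true = 4 * np.pi  # 2 Hz signal
y = 1.5 * np.sin(omega_true * t + 0.7) + rng.normal(0, 0.1, size=len(t))

omega = 11.0  # imperfect frequency estimate
A = np.column_stack([np.sin(omega * t), np.cos(omega * t)])
p = np.linalg.lstsq(A, y, rcond=None)[0]
y_fit = A @ p  # fitted sinusoid at the assumed frequency
```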
