Matrix calculus

This seems kind of esoteric but it’s the basis for really simplifying some complicated mathematics, like the general theory of least squares. It’s well worth working through if you’re going to be doing the linear algebra version of least squares or multivariable optimization.

The two main results proved as theorems below:

$\nabla \bf{a^Tx} = \bf{a} \\ \nabla \bf{x^TQx} = \bf{Qx}$

Definitions

In the expressions below, $\mathbf{x}$ is an (n x 1) column vector of variables,

$\bf{x} = \begin{pmatrix}x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}$

$\mathbf{a}$ is an (n x 1) column vector of coefficients,

$\bf{a} = \begin{pmatrix}a_1 \\ a_2 \\ \vdots \\ a_n \end{pmatrix}$

$\mathbf{Q}$ is a symmetric (n x n) matrix of coefficients

$\bf{Q} = \begin{pmatrix}q_{11} & q_{12} & \cdots & q_{1n} \\ q_{21} & q_{22} & \cdots & q_{2n} \\ \vdots & \vdots & \cdots & \vdots \\ q_{n1} & q_{n2} & \cdots & q_{nn} \\ \end{pmatrix}$

with $q_{ij} = q_{ji}$ for all $i, j$ and the gradient $\nabla$ is with respect to $\mathbf{x}$

$\nabla f(\bf{x}) = \begin{pmatrix}\partial{f} / \partial{x_1} \\ \partial{f} / \partial{x_2} \\ \vdots \\ \partial{f} / \partial{x_n} \end{pmatrix}$

Gradient of linear function

Theorem: $\nabla \bf{a^Tx} = \bf{a}$ .

Proof: To see this, just work out the expression in terms of the individual components.

$\bf{a^Tx} = \sum_{i=1}^n a_i x_i$ is a general linear function of $\bf{x}$ . The partial derivative with respect to $x_i$ is obviously a_i since the i-th term is the only term in which $x_i$ appears.

Thus

$\nabla \bf{a^Tx} = \begin{pmatrix}a_1 \\ a_2 \\ \vdots \\ a_n \end{pmatrix} = \bf{a}$

General quadratic function

Before examining its gradient, let’s examine some properties of the expression $\nabla \bf{x^TQx}$ and why it represents the most general quadratic function.

The most general quadratic function of the $\{x_i\}$ is a function containing all products $x_i x_j$ for $i < j$ in addition to all $x_i^2$ . Every product $x_i x_j$ with $i > j$ is a duplicate of a product with $i < j$ so we do not need those terms. In other words

$f(x) = \sum_{i=1}^n a_i x_i^2 + \sum_{i < j} b_{ij} x_i x_j$

Now let’s examine the expression $\nabla \bf{x^TAx}$ for an arbitrary (not necessarily symmetric) A. We show that we can always replace it with a symmetric one.

First, note that $\bf{Ax}$ is a column vector so $\bf{x^T(Ax)}$ is a scalar. Thus it is equal to its own transpose, $\bf{x^TAx} = \bf{x^TA^Tx}$ . So

$\bf{x^TAx} = \frac{1}{2} \left ( \bf{x^TAx} + \bf{x^TA^Tx} \right ) \\[6 pt] = \frac{1}{2} \bf{x^T} (\bf{Ax} + \bf{A^Tx}) \\[6 pt] = \frac{1}{2} \bf{x^T} (\bf{A} + \bf{A^T}) \bf{x} \\[6 pt] = \bf{x^T} \left [ \frac{1}{2}(\bf{A} + \bf{A^T}) \right ] \bf{x}$

The matrix $\bf{Q} = \frac{1}{2}(\bf{A} + \bf{A^T})$ is symmetric, $\bf{Q^T} = \frac{1}{2}(\bf{A^T} + \bf{A}) = \bf{Q}$ . Thus if you have an expression $\bf{x^TAx}$ with non-symmetric square matrix $\bf{A}$ you can always replace $\bf{A}$ with the symmetric $\bf{Q}$ .

Finally, let’s expand $\bf{x^TQx}$ .

The i-th element of $\bf{Qx}$ is

$\bf(Qx)_i = \sum_{j=1}^n q_{ij} x_j$

$\bf{x^TQx} = \sum_{i=1}^n x_i (\bf{Qx})_i \\[6 pt] = \sum_{i=1}^n \sum_{j=1}^n x_i q_{ij} x_j$

which we can break into diagonal and off-diagonal terms

$\bf{x^TQx} = \sum_{i=1}^n q_{ii} x_i x_i + \sum_{i, j, i \neq j} q_{ij} x_i x_j$

Now note in the second sum that every combination of distinct i and j occurs twice, once with $q_{ij} x_i x_j$ and once with $q_{ji} x_j x_i$ . For instance we have a term $q_{23} x_2 x_3$ and $q_{32} x_3 x_2$ , and because $\bf{Q}$ is symmetric, these are equal. So we can write

$\bf{x^TQx} = \sum_{i=1}^n q_{ii} x_i^2 + \sum_{i, j, i < j} 2 q_{ij} x_i x_j$

Comparing to the general quadratic function at the top of this section, we can see that any such expression can be written in the form $\bf{x^T Q x}$ with a symmetric $\bf{Q}$ , with $q_{ii} = a_i$ and $q_{ij} = q_{ji} = (1/2)b_{ij}$

Gradient of a general quadratic

And now we can attack our final gradient calculation.

Theorem: $\nabla \bf{x^TQx} = \bf{2Qx}$ .

Proof:

Note that $\sum_{i, j, i < j} 2 q_{ij} x_i x_j$ has exactly one term for each distinct combination of i and j, i.e. that each $x_i$ occurs in exactly one term with each $x_j$ with $i != j$ . Thus the partial of this sum with respect to $x_i$ is $\sum_{j \neq i} 2q_{ij} x_j$ .

So the i-th element of the gradient vector is

$\frac{\partial}{\partial {x_i}} \bf{x^TQx} = \frac{\partial}{\partial {x_i}}\sum_{i=1}^n q_{ii} x_i^2 + \frac{\partial}{\partial {x_i}}\sum_{i, j, i < j} 2 q_{ij} x_i x_j \\[6 pt] = 2 q_{ii} x_i + \sum_{j \neq i} 2 q_{ij} x_j \\[6 pt] = 2 \sum_{j=1}^n q_{ij} x_j \\[6 pt] = 2 (\bf{Qx})_i$

and it follows that $\nabla \bf{x^TQx} = \bf{2Qx}$