A principled introduction to calculus for machine learning

Over my five terms as a TA for introductory machine learning classes, I have seen students struggle with the intuition of multivariate calculus as applied to machine learning, in particular with gradient descent and Newton’s method for convex minimization. In this post, I present a more principled and thorough construction of these techniques utilizing elementary functional analysis. It’s likely not the best introduction for someone completely unfamiliar with multivariate calculus but rather for those already familiar with the concepts while having lingering doubts. This post develops the calculus machinery, and I plan on later writing another post how this connects to optimization techniques.

I will assume familiarity with basic linear algebra (vector spaces, norms, inner products), but not of analysis. I will also discuss dual vector spaces, but prior knowledge is not required.

1. Setting: Banach spaces

Most of machine learning takes place in finite-dimensional real vector spaces endowed with the Euclidean inner product and its induced $\ell_2$ norm. This is a specific type of Hilbert space, and we will begin by more abstractly defining Banach and Hilbert spaces. This abstraction will help reconcile the treatment of calculus for functions that map $m$-dimensional tensor inputs to scalar or $n$-dimensional tensor outputs, which is the general setup in ML.

Most of optimization theory takes place in Banach spaces—complete normed vector spaces.

Definition 1 (Completeness). A normed vector space \((X, \|\cdot\|)\) is complete if every Cauchy sequence in \(X\) converges to a limit that still lies in \(X\).

Example 1. The space \(\mathbb{R}^n\) is complete under any norm. In particular, with the usual Euclidean norm, every Cauchy sequence in \(\mathbb{R}^n\) converges to a point of \(\mathbb{R}^n\).

Example 2. The space \(\mathbb{R}^n \setminus \{0\}\) is not complete. For instance, the sequence \(x_k = \left(\frac{1}{k}, 0, \dots, 0\right)\) lies entirely in \(\mathbb{R}^n \setminus \{0\}\) and is Cauchy, but it converges to \(0\), which is not in the space.

Definition 2 (Banach space). A Banach space is a complete normed vector space \((X, \mathbb F, \| \cdot \|)\).

The most familiar example is \(\mathbb{R}^n\) over $\mathbb R$ with any of the \(\ell^p\) norms:

\[\|x\|_p = \left(\sum_{i=1}^n |x_i|^p\right)^{1/p}, \qquad \|x\|_\infty = \max_{1 \le i \le n} |x_i|.\]

Since all norms on \(\mathbb{R}^n\) are equivalent, completeness in one implies completeness in all.

A Banach space that additionally carries an inner product \(\langle \cdot, \cdot \rangle\) inducing the norm (\(\|x\| = \sqrt{\langle x, x \rangle}\)) is a Hilbert space. The distinction between Banach and Hilbert will become apparent when we discuss derivatives, which require Banach spaces, and gradients, which additionally require an inner product.

2. Fréchet derivative

When we say “derivative,” we will refer to the following Fréchet derivative unless otherwise stated.

Definition 3 (Fréchet derivative). Let \((X, \| \cdot \|_X)\) and \((Y, \| \cdot \|_Y)\) be Banach spaces, \(U \subseteq X\) open, and \(f : U \to Y\). We say \(f\) is Fréchet differentiable at \(x \in U\) if there exists a bounded linear operator \(A_x \in \mathcal{L}(X, Y)\) such that

\[\lim_{\|h\| \to 0} \frac{\|f(x + h) - f(x) - A_x h\|_Y}{\|h\|_X} = 0.\]

The operator \(A_x\) is unique and is called the Fréchet derivative of \(f\) at \(x\), denoted \(Df(x)\). Equivalently,

\[f(x+h) = f(x) + Df(x)[h] + o(\|h\|).\]

The key point is that \(Df(x)\) is a bounded linear operator, from \(X\) to \(Y\), so \(Df(x)\) is itself a function. When \(X = \mathbb{R}^n\) and \(Y = \mathbb{R}^m\), \(Df(x)\) is represented by the \(m \times n\) Jacobian matrix, but the abstract definition is free of coordinates.

Let’s take a look at a few examples for vector-valued functions before moving on. Here we will see how the Jacobian matrix relates to the Fréchet derivative.

Example 3. Let \(A \in \mathbb{R}^{m \times n}\) and \(f : \mathbb{R}^n \to \mathbb{R}^m\) be \(f(x) = Ax\). Then

\[f(x + h) = A(x + h) = Ax + Ah = f(x) + Ah,\]

with no error term, so

\[Df(x)[h] = \underbrace{A}_{\text{Jacobian matrix}} h.\]

The derivative is the constant linear map \(A\), independent of \(x\). This generalizes the one-dimensional fact that \(f(x) = ax\) has \(f'(x) = a\).

Example 4. Let \(f : \mathbb{R}_{>0} \times \mathbb{R} \to \mathbb{R}^2\) be the polar-to-Cartesian map \(f(r, \theta) = re^{i\theta} = (r\cos\theta, r\sin\theta)\), identifying \(\mathbb{C} \cong \mathbb{R}^2\). Using \(\cos(\theta + \delta\theta) = \cos\theta - \sin\theta \, \delta\theta + O(\delta\theta^2)\) and \(\sin(\theta + \delta\theta) = \sin\theta + \cos\theta \, \delta\theta + O(\delta\theta^2)\), expanding gives

\[f(r + \delta r, \theta + \delta \theta) = f(r, \theta) + \begin{pmatrix} \cos\theta \, \delta r - r \sin\theta \, \delta \theta \\ \sin\theta \, \delta r + r \cos\theta \, \delta \theta \end{pmatrix} + O(\|h\|^2),\]

where \(h = (\delta r, \delta \theta)\). Hence the Fréchet derivative is represented by the Jacobian

\[Df(r, \theta)[h] = \underbrace{\begin{pmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{pmatrix}}_{\text{Jacobian matrix}} \ h.\]

3. Gradient-derivative duality

This is perhaps the most important conceptual point that we will discuss, a large pain point for students introduced to these concepts. We start by noting that gradients are only defined for scalar-valued functions.

The derivative lives in the dual space

For a scalar-valued function \(f : \mathbb{R}^n \to \mathbb{R}\), the Fréchet derivative at \(x\) is a bounded linear functional:

\[Df(x) \in \mathcal{L}(\mathbb{R}^n, \mathbb{R}) = (\mathbb{R}^n)^*.\]

Definition 4 (Dual space). For a normed vector space \(X\), its dual space \(X^*\) is the set of all bounded linear maps from \(X\) to \(\mathbb{R}\).

That is, \(Df(x)\) is an element of the dual space. It takes a vector \(h\) and returns a scalar—the best local linear prediction of how \(f\) changes. In coordinates, \(Df(x)[h] = \sum_i \frac{\partial f}{\partial x_i} h_i\).

The gradient lives in the primal space

The gradient \(\nabla f(x)\) is not the derivative. It is the Riesz representative of the derivative.

Theorem 1 (Riesz representation). Let \(H\) be a Hilbert space. For every bounded linear functional \(\varphi \in H^*\), there exists a unique \(g \in H\) such that

\[\varphi(h) = \langle g, h \rangle \quad \text{for all } h \in H,\]

and \(\|g\|_H = \|\varphi\|_{H^*}\).

In our setting, since \(Df(x) \in (\mathbb{R}^n)^*\) is a bounded linear functional on the Hilbert space \((\mathbb{R}^n, \langle \cdot, \cdot \rangle)\), there exists a unique vector \(\nabla f(x) \in \mathbb{R}^n\) such that

\[Df(x)[h] = \langle \nabla f(x), h \rangle \quad \text{for all } h \in \mathbb{R}^n.\]

Definition 5 (Gradient). The gradient \(\nabla f(x)\) is the unique element of \(\mathbb{R}^n\) satisfying \(Df(x)[h] = \langle \nabla f(x), h \rangle\) for all \(h\). It is the Riesz representative of the Fréchet derivative.

The conceptual relation is

\[\underbrace{Df(x)}_{\text{derivative} \in X^*} \;\xrightarrow{\text{Riesz map}}\; \underbrace{\nabla f(x)}_{\text{gradient} \in X}.\]

The derivative is intrinsic—it does not depend on the choice of inner product (but may depend on the norm). The gradient depends on the inner product.

Why they often feel the same

With the standard Euclidean inner product \(\langle u, v \rangle = u^\top v\), the Riesz map is effectively the transpose operation (when we represents vectors as column vectors). The Frechet derivative is the inner product of the vector of partials at $x$ times $h$, which is often represented by the row vector of partials \(\frac{\partial f}{\partial x} = \begin{pmatrix} \frac{\partial f}{\partial x_1} \cdots \frac{\partial f}{\partial x_n} \end{pmatrix}\) since \(Df(x)[h] = \frac{\partial f}{\partial x} h\). The gradient equals the column-vector of partial derivatives, \(\nabla f(x) = \frac{\partial f}{\partial x}^\top\).

Why the distinction generally matters

If we instead equip \(\mathbb{R}^n\) with the inner product \(\langle u, v \rangle_G = u^\top G v\) for some symmetric positive definite \(G\), then the gradient becomes

\[\nabla f(x) = G^{-1} \frac{\partial f}{\partial x}.\]

The derivative \(Df(x)\) is unchanged (recall that all norms on \(\mathbb{R}^n\) are equivalent)—only its Riesz representative shifts.

Aside (infinite-dimensions). In infinite dimensions, the distinction is even starker. For $1 < p < \infty$ with $p \neq 2$, the Fréchet derivative of a functional on $L^p(\Omega)$ naturally lives in the dual space

\[(L^p(\Omega))^* \cong L^q(\Omega), \qquad \frac{1}{p} + \frac{1}{q} = 1,\]

not in $L^p(\Omega)$ itself. Since there is no canonical identification between $L^p$ and $L^q$, there is no canonical notion of gradient as an element of $L^p$ unless one chooses additional structure, such as an inner product or a metric.

Example. Let

\[F(u) = \frac{1}{p} \int_\Omega |u(x)|^p \, dx.\]

Then

\[DF(u)[h] = \int_\Omega |u|^{p-2} u h \, d \lambda,\]

so the derivative is represented by

\[|u|^{p-2} u \in L^q(\Omega).\]

This is naturally an element of the dual space. Only when $p = 2$ do we get

\[|u|^{p-2} u = u,\]

so the derivative is represented by an element of the same space in the canonical $L^2$ way.

4. Derivatives of vector- and matrix-valued functions

Vector-to-scalar functions

For \(f : \mathbb{R}^n \to \mathbb{R}\), under the Euclidean inner product, we have already seen that \(Df(x) \in (\mathbb{R}^n)^*\) acts as \(Df(x)[h] = \nabla f(x)^\top h\). The gradient \(\nabla f(x) \in \mathbb{R}^n\) is a column vector of partial derivatives.

Example 5. Let \(f(x) = \sum_{i=1}^n x_i = \mathbf{1}^\top x\), where \(\mathbf{1} \in \mathbb{R}^n\) is the all-ones vector. Then

\[f(x + h) = \mathbf{1}^\top (x + h) = f(x) + \mathbf{1}^\top h,\]

with no higher-order terms, so \(Df(x)[h] = \mathbf{1}^\top h = \langle \mathbf{1}, h \rangle\), giving \(\nabla f(x) = \mathbf{1}\).

Example 6. Let \(f(x) = \|x\|_2^2 = x^\top x\). Then

\[f(x + h) = (x + h)^\top (x + h) = x^\top x + 2 x^\top h + h^\top h = f(x) + 2 x^\top h + O(\|h\|^2),\]

so \(Df(x)[h] = 2 x^\top h = \langle 2x, h \rangle\), giving \(\nabla f(x) = 2x\).

Matrix-to-scalar functions

For \(f : \mathbb{R}^{m \times n} \to \mathbb{R}\), the Fréchet derivative \(Df(X)\) is a linear functional on the space of matrices. Using the Frobenius inner product \(\langle A, B \rangle_F = \operatorname{tr}(A^\top B)\), the gradient \(\nabla f(X) \in \mathbb{R}^{m \times n}\) satisfies

\[Df(X)[H] = \langle \nabla f(X), H \rangle_F = \operatorname{tr}(\nabla f(X)^\top H).\]

The Frobenius inner product is the matrix analogue of the Euclidean inner product. \((\mathbb R^{m \times n}, \langle \cdot, \cdot \rangle_\text{Frobenius})\) is isometrically isomorphic to \((\mathbb{R}^{mn}, \langle \cdot, \cdot \rangle_\text{Euclidean})\) via vectorization, which simply flattens a matrix into a vector.

Example 7. Let \(A \in \mathbb{R}^{m \times m}\) be symmetric and \(f(X) = \operatorname{tr}(X^\top A X)\). Then

\[f(X + H) = f(X) + \operatorname{tr}(H^\top AX) + \operatorname{tr}(X^\top AH) + O(\|H\|^2) = f(X) + \operatorname{tr}(H^\top \cdot 2AX) + O(\|H\|^2),\]

so \(Df(X)[H] = \operatorname{tr}(H^\top \cdot 2AX) = \langle 2AX, H \rangle_F\), giving \(\nabla_X f = 2AX\).

Example 8. Let \(X \succ 0\) and \(f(X) = \log \det(X)\), perturbed by a symmetric \(H\). Factor \(X\) out of \(X + H\):

\[\log \det(X + H) = \log \det\!\big(X(I + X^{-1}H)\big) = \log \det(X) + \log \det(I + X^{-1}H).\]

Let \(\lambda_1, \ldots, \lambda_n\) be the eigenvalues of \(E := X^{-1}H\), which are real since \(E\) is similar to the symmetric matrix \(X^{-1/2} H X^{-1/2}\). For \(H\) small enough that \(| \lambda_i | < 1\), applying \(\log(1 + t) = t - \tfrac{1}{2} t^2 + O(t^3)\) eigenvalue-by-eigenvalue,

\[\log \det(I + E) = \log \prod_i (1 + \lambda_i) = \sum_i \log(1 + \lambda_i) = \sum_i \lambda_i + O\!\Big(\sum_i \lambda_i^2\Big) = \operatorname{tr}(E) + O(\|E\|_F^2),\]

where the last equality uses \(\sum_i \lambda_i = \operatorname{tr}(E)\) and \(\sum_i \lambda_i^2 = \operatorname{tr}(E^2) \le \|E\|_F^2\). Substituting \(E = X^{-1}H\) and absorbing the bounded factor \(\|X^{-1}\|^2\) into the big-\(O\),

\[\log \det(X + H) = \log \det(X) + \operatorname{tr}(X^{-1}H) + O(\|H\|_F^2),\]

so \(Df(X)[H] = \operatorname{tr}(X^{-1}H) = \langle X^{-1}, H \rangle_F\), giving \(\nabla_X \log \det X = X^{-1}\).

This result is quite important as the log-determinant appears in Gaussian log-likelihoods.

Conclusion

I hope this provided some intuition/grounding into the calculus that we encounter in machine learning. I’ll likely write a follow-up post on how this relates to optimization with gradient descent ($L$-smoothness, $\mu$-strong convexity, GD as a majorize-minimize algorithm, convergence rates) and Newton’s method.

If you spot any errors or have suggestions, please feel free to reach out to me at suchir@stanford.edu.