Introduction
Analyzing a deep learning model often involves inspecting the subspaces that its weight matrices span. One notable example is the LoRA (Low-Rank Adaptation) paper, where the authors analyzed the similarity of subspaces to provide solid evidence that their low-rank approximation is sufficient for representing the learned features. Intrigued by these approaches, I am going to explore the concept of the Grassmann distance, its mathematical background, and its derivation. The article is primarily based on the lecture slides of Jackson Van Dyke, titled "Distances between subspaces".
Linear Subspace
Those who have taken linear algebra will already be familiar with the concept of a linear subspace. To recall, a linear subspace of $\mathbb{R}^N$ is a subset which is itself a vector space. You may remember that the definition of a vector space stipulates that:
- If $\mathbf{v}$ and $\mathbf{u}$ are elements of the vector space, $\mathbf{v}+\mathbf{u}$ is also an element of that vector space.
- If $\mathbf{v}$ is an element of a vector space, $a\mathbf{v}$ is also an element for an arbitrary scalar $a$.
Therefore, a linear subspace spanned by a linearly independent set of vectors $\mathbf{v}_1, \mathbf{v}_2, \cdots, \mathbf{v}_k$ is the set
$$\{a_1\mathbf{v}_1+a_2\mathbf{v}_2+\cdots + a_k\mathbf{v}_k \mid a_1, a_2, \cdots, a_k \in \mathbb{R}\}$$
which necessarily passes through the origin. For example, a line through the origin is a 1-dimensional linear subspace of $\mathbb{R}^2$, whereas a plane through the origin is a 2-dimensional linear subspace of $\mathbb{R}^3$.
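To make this concrete, here is a minimal NumPy sketch (the spanning vectors `v1`, `v2` and the helper `in_span` are purely illustrative) that checks membership in a span by solving a least-squares problem and inspecting the residual:

```python
import numpy as np

# Two hypothetical spanning vectors of a plane through the origin in R^3.
v1 = np.array([1.0, 0.0, 1.0])
v2 = np.array([0.0, 1.0, 1.0])
V = np.column_stack([v1, v2])   # 3 x 2 matrix whose columns span the subspace

def in_span(V, x, tol=1e-10):
    """Check whether x is a linear combination of the columns of V."""
    coeffs = np.linalg.lstsq(V, x, rcond=None)[0]
    return np.linalg.norm(V @ coeffs - x) < tol

print(in_span(V, 2.0 * v1 - 3.0 * v2))        # True: closed under linear combinations
print(in_span(V, np.zeros(3)))                # True: every subspace contains the origin
print(in_span(V, np.array([1.0, 0.0, 0.0])))  # False: this vector leaves the plane
```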
We can also define the set of all $k$-dimensional linear subspaces, which is called the Grassmannian. Formally, the Grassmannian $Gr(k, N)$ is defined as
$$Gr(k, N) := \{k-\text{dimensional linear subspaces of } \mathbb{R}^N\}$$
Thus, we can say that a line through the origin in a plane is a point of the Grassmannian $Gr(1, 2)$, while a plane through the origin in $\mathbb{R}^3$ is a point of the Grassmannian $Gr(2, 3)$.
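In computations, a point of $Gr(k, N)$ is usually represented by an $N\times k$ matrix with orthonormal columns. Below is a minimal sketch (the randomly drawn plane is just an illustration, and QR decomposition is one convenient way to orthonormalize) showing that different bases of the same subspace represent the same Grassmannian point:

```python
import numpy as np

# Represent a point of Gr(2, 3), i.e. a plane through the origin in R^3,
# by a 3 x 2 matrix with orthonormal columns.
rng = np.random.default_rng(0)
spanning_vectors = rng.standard_normal((3, 2))   # two (almost surely) independent vectors
Q, _ = np.linalg.qr(spanning_vectors)            # orthonormal basis of their span

# A different basis of the same plane: mix the columns by an invertible 2 x 2 matrix.
other_Q, _ = np.linalg.qr(spanning_vectors @ np.array([[1.0, 2.0], [0.0, 1.0]]))

# Both matrices describe the same subspace, which we can verify by comparing
# the orthogonal projectors Q Q^T onto their column spaces.
print(np.allclose(Q @ Q.T, other_Q @ other_Q.T))   # True
```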
Grassmann Distance
Then how can we measure the similarity between two linear subspaces? When the subspaces in question are lines, the answer is straightforward: we can simply measure the angle between the two lines. However, the problem becomes more complicated in higher dimensions. This is where the Grassmann distance comes into play.
Let's define each subspace by a set of vectors that spans it. Let $a_1, a_2, \cdots, a_k \in \mathbb{R}^N$ and $b_1, b_2, \cdots, b_k\in\mathbb{R}^N$ be linearly independent sets of vectors, and denote the spans of the two sets as
$$A:=\text{span}\{a_1, \cdots, a_k\}, \quad B:=\text{span}\{b_1, \cdots, b_k\}$$
The process of measuring the Grassmann distance between the two subspaces starts by extracting the principal vectors and the principal angles between them. First, we choose one vector from each subspace, $\hat{a}_1\in A$ and $\hat{b}_1\in B$, which
$$\text{maximize } \hat{a}_1^T\hat{b}_1$$
$$\text{ such that } \lVert \hat{a}_1\rVert=\lVert \hat{b}_1\rVert=1 $$
Thus, a pair of vectors that minimizes the angle between them is chosen. Note that the inner product $\hat{a}_1^T\hat{b}_1$ may (and in most cases will) not be 1, since the two vectors are chosen from different subspaces.
Next, unit vectors $\hat{a}_2\in A$ and $\hat{b}_2\in B$ which
$$ \text{maximize } \hat{a}_2^T\hat{b}_2$$
$$\text{ such that } \lVert \hat{a}_2\rVert=\lVert \hat{b}_2\rVert=1 \text{ and } \hat{a}_2^T\hat{a}_1 = \hat{b}_2^T\hat{b}_1=0$$
are chosen. The process continues in the same way, constructing two sets of orthonormal vectors, $\{\hat{a}_1, \hat{a}_2, \cdots, \hat{a}_k\}$ and $\{\hat{b}_1, \hat{b}_2, \cdots, \hat{b}_k\}$. Thus, we obtain vectors from each subspace that are chosen to minimize the angle with their counterparts in the other subspace, while remaining orthogonal to the vectors already selected from the same subspace.
Now, we define the principal angles $\theta_i$ as $\cos\theta_i = \hat{a}_i^T \hat{b}_i$. The angles satisfy $\theta_1\le \cdots\le \theta_k$ by their definition. Based on these principal angles, the Grassmann distance between the subspaces $A$ and $B$ is defined as
$$d_k (A, B) = \left(\sum\limits_{i=1}^k\theta_i^2\right)^{1/2}$$
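As a quick sanity check of this definition, consider two planes $A, B\in Gr(2, 3)$ that intersect in a line. Choosing $\hat{a}_1=\hat{b}_1$ along the common line gives $\theta_1=0$, and the second principal angle is the dihedral angle $\varphi$ between the planes, so
$$d_2(A, B)=\sqrt{0^2+\varphi^2}=\varphi,$$
which agrees with the intuitive notion of the angle between two intersecting planes.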
Computing Principal Angles
In actual computations, it is too demanding to directly optimize the unit vectors to maximize the inner product while obeying the constraints. Therefore, the principal angles are instead computed using a standard linear algebra tool: the singular value decomposition (SVD).
Theorem. Suppose that $M_A$ and $M_B$ are matrices whose columns form orthonormal bases of $A$ and $B$, respectively. Let the singular value decomposition of $M_A^TM_B$ be expressed as follows:
$$M_A^TM_B=U\Sigma V^T\text{ where } \Sigma = \text{diag}(\sigma_1, \sigma_2, \cdots, \sigma_k)$$
Then, the principal angles satisfy $\cos\theta_i=\sigma_i$. Also, the principal vectors are the columns of $M_AU$ and $M_BV$.
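As a sketch of how this theorem translates into code (assuming NumPy, and that `M_A` and `M_B` already have orthonormal columns; the function name is just for illustration):

```python
import numpy as np

def principal_angles_and_vectors(M_A, M_B):
    """Principal angles and principal vectors of the column spaces of M_A and M_B,
    computed from the SVD of M_A^T M_B as in the theorem above."""
    U, sigma, Vt = np.linalg.svd(M_A.T @ M_B)
    # Singular values come out in descending order, so the angles are ascending,
    # matching theta_1 <= ... <= theta_k. The clip guards against values slightly
    # outside [-1, 1] caused by floating-point round-off.
    theta = np.arccos(np.clip(sigma, -1.0, 1.0))
    return theta, M_A @ U, M_B @ Vt.T   # angles, principal vectors of A, of B
```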
Proof. We use a well-known characterization of singular values and singular vectors, which is formulated in the following lemma.
Lemma. Suppose that the singular values of a matrix $M\in \mathbb{R}^{n\times m}$, in descending order, are $\sigma_1, \sigma_2, \cdots, \sigma_k$. Then, the $i$-th singular value $\sigma_i$ satisfies
$$\sigma_i = \max\limits_{\lVert y \rVert = \lVert z \rVert = 1}y^TMz = y_i^T Mz_i$$
subject to the constraints $y^Ty_j = z^T z_j = 0 \;(j=1, \cdots, i-1)$.
Proof. Let the singular value decomposition of the matrix $M$ be given by $U\Sigma V^T$. Then, given the previous singular values $\sigma_1, \cdots, \sigma_{i-1}$ and their corresponding singular vectors $y_1, \cdots, y_{i-1}$ and $z_1, \cdots, z_{i-1}$, the maximization problem becomes:
$$\sigma_i = \max\limits_{\lVert y \rVert = \lVert z \rVert = 1}y^TU\Sigma V^Tz$$
$$ = \max\limits_{\lVert y \rVert = \lVert z \rVert = 1}(U^Ty)^T\Sigma (V^Tz)$$
$$=\max\limits_{\lVert u\rVert = \lVert v\rVert = 1} u^T\Sigma v$$
where $u := U^Ty$ and $v := V^Tz$; since $U$ and $V$ are orthogonal, $u$ and $v$ range over all unit vectors. Also, $u$ and $v$ should be orthogonal to $\hat{e}_1, \hat{e}_2, \cdots, \hat{e}_{i-1}$, since this is equivalent to $y=Uu$ and $z=Vv$ being orthogonal to $y_1, \cdots, y_{i-1}$ and $z_1, \cdots, z_{i-1}$, respectively. Therefore, the above maximum is equal to $\sigma_{i}$, which is attained when $u=\hat{e}_i$ and $v=\hat{e}_i$.
By the lemma above, we know that the singular values of $M_A^TM_B$, say, $\sigma_1, \cdots, \sigma_k$, are the maximum values
$$\max\limits_{\lVert y \rVert = \lVert z \rVert = 1}y^T M_A^TM_B z$$
$$=\max\limits_{\lVert y \rVert = \lVert z \rVert = 1}(M_Ay)^T(M_B z)$$
$$=\max\limits_{\lVert a \rVert = \lVert b \rVert = 1,\; a\in A,\; b\in B}{a}^Tb\quad(a=M_Ay,\ b=M_Bz)$$
subject to the orthogonality constraints above. Therefore, the singular values are exactly the maximized inner products, under the condition that each pair of vectors is orthogonal to the previously chosen ones. Also, the $i$-th principal vector pair consists of the $i$-th columns of $M_AU$ and $M_BV$, since $y_i=U\hat{e}_i$ and $z_i=V\hat{e}_i$, and therefore $a_i=M_AU\hat{e}_i$ and $b_i=M_BV\hat{e}_i$.
Thus, we can compute the singular values of the matrix $M_A^TM_B$ to obtain the principal angles $\theta_i$ between the two subspaces $A$ and $B$. Evaluating $\left(\sum_{i=1}^k\theta_i^2\right)^{1/2}$ then yields the Grassmann distance between $A$ and $B$.
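Putting the pieces together, here is a sketch of the full computation. It assumes the inputs are arbitrary (not necessarily orthonormal) spanning sets, orthonormalized via QR; the random test subspaces are purely illustrative, and SciPy's `scipy.linalg.subspace_angles` is used only as a cross-check.

```python
import numpy as np
from scipy.linalg import subspace_angles

def grassmann_distance(A_vectors, B_vectors):
    """Grassmann distance between span(A_vectors) and span(B_vectors).
    Each argument is an N x k matrix whose columns span the subspace
    (the columns need not be orthonormal)."""
    M_A, _ = np.linalg.qr(A_vectors)   # orthonormal basis of A
    M_B, _ = np.linalg.qr(B_vectors)   # orthonormal basis of B
    sigma = np.linalg.svd(M_A.T @ M_B, compute_uv=False)
    theta = np.arccos(np.clip(sigma, -1.0, 1.0))
    return np.sqrt(np.sum(theta ** 2))

# Two random 3-dimensional subspaces of R^10, purely for illustration.
rng = np.random.default_rng(0)
A, B = rng.standard_normal((10, 3)), rng.standard_normal((10, 3))
print(grassmann_distance(A, B))
# Cross-check: SciPy computes the same principal angles.
print(np.sqrt(np.sum(subspace_angles(A, B) ** 2)))
```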
Distance Between Subspaces of Different Dimensions
The Grassmann distance is an effective way to measure the distance between two linear subspaces of the same dimension. However, it cannot measure the distance between subspaces of different dimensions.
If the subspaces we are dealing with have different dimensions, we can match the dimensions using a concept called Schubert varieties. Given two subspaces $A\in Gr(k,N)$ and $B\in Gr(l, N)$ with $k\le l$, the Schubert varieties are the sets of subspaces that satisfy the following conditions:
- $\Omega_+(A):=\{P\in Gr(l, N) | A\subseteq P\}$
- $\Omega_-(B):=\{P\in Gr(k, N) | P\subseteq B\}$
Thus, $\Omega_+(A)$ is the set of $l$-planes containing $A$, whereas $\Omega_-(B)$ is the set of $k$-planes contained in $B$. For example, $\Omega_+(\text{a line})$ is a set of planes containing the line, whereas $\Omega_-(\text{a plane})$ is a set of lines contained in the plane.
Using the Schubert varieties, we can define the distances between $A$ and $B$ in two different ways. First, we can define the distance $\delta_-$ to be the minimal (Grassmann) distance between $A$ and a member of $\Omega_-(B)$. Alternatively, we can use the distance $\delta_+$ defined as the minimal distance between $B$ and a member of $\Omega_+(A)$. Thus,
- $\delta_- = \min\{d_k (P, A) | P\in \Omega_-(B)\}$
- $\delta_+ = \min\{d_l(P, B) | P\in \Omega_+(A)\}$
The theorem proven by Ye and Lim shows that the two definitions of distance turn out to coincide:
$$\delta(A, B)=\delta_+=\delta_- = \left(\sum\limits_{i=1}^{\min(k, l)}\theta_i^2\right)^{1/2}$$
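The SVD-based recipe from the previous section still applies when $k\ne l$: the matrix $M_A^TM_B$ is $k\times l$, and the arccosines of its $\min(k, l)$ singular values are exactly the principal angles appearing in the formula above. A minimal sketch (the line and plane in $\mathbb{R}^3$ are hypothetical examples, and `delta` is just an illustrative helper name):

```python
import numpy as np

def delta(M_A, M_B):
    """Ye-Lim distance: the min(k, l) principal angles come from the SVD of M_A^T M_B."""
    sigma = np.linalg.svd(M_A.T @ M_B, compute_uv=False)
    return np.sqrt(np.sum(np.arccos(np.clip(sigma, -1.0, 1.0)) ** 2))

# Orthonormal bases of the xy-plane in R^3 and of two lines.
plane, _ = np.linalg.qr(np.array([[1.0, 0.0],
                                  [0.0, 1.0],
                                  [0.0, 0.0]]))
line_inside, _ = np.linalg.qr(np.array([[1.0], [1.0], [0.0]]))       # lies in the plane
line_orthogonal, _ = np.linalg.qr(np.array([[0.0], [0.0], [1.0]]))   # normal to the plane

print(delta(line_inside, plane))       # ~0.0: the line is contained in the plane
print(delta(line_orthogonal, plane))   # ~1.5708: pi/2
```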
Metric?
The Grassmann distance $d$ between subspaces of the same dimension defines a metric on $Gr(k, N)$, since it satisfies:
- $d(A, B)=0$ if and only if $A=B$
- $d(A, B) \ge 0$ (non-negativity)
- $d(A, B) = d(B, A)$ (symmetry)
- $d(A, C) \le d(A, B) + d(B, C)$ (triangle inequality)
However, its extension to subspaces of different dimensions, $\delta$, is not a metric on $Gr(\infty, \infty) = \cup_{k=1}^\infty Gr(k, \infty)$. It violates the first property, since $\delta(A, B) = 0 \Leftrightarrow A\subseteq B \text{ or } B\subseteq A$, so $\delta$ can vanish for two different subspaces. It also fails the triangle inequality: if $L_1$ and $L_2$ are two distinct lines lying in a plane $P$, then $\delta(L_1, L_2) > \delta(L_1, P) +\delta(P, L_2)= 0$, which is a contradiction. The only two properties that $\delta$ retains are non-negativity and symmetry.
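A quick numerical illustration of this triangle-inequality failure, using the xy-plane and two coordinate axes in $\mathbb{R}^3$ as a hypothetical example (`delta` is the same SVD-based helper sketched in the previous section):

```python
import numpy as np

def delta(M_A, M_B):
    """Same SVD-based distance helper as above."""
    sigma = np.linalg.svd(M_A.T @ M_B, compute_uv=False)
    return np.sqrt(np.sum(np.arccos(np.clip(sigma, -1.0, 1.0)) ** 2))

P = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])   # orthonormal basis of the xy-plane
L1 = np.array([[1.0], [0.0], [0.0]])                 # the x-axis
L2 = np.array([[0.0], [1.0], [0.0]])                 # the y-axis

print(delta(L1, L2))                # pi/2 > 0 ...
print(delta(L1, P) + delta(P, L2))  # ... yet this sum is 0.0
```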
References
[1] Jackson Van Dyke. "Distances between subspaces." Lecture slides. https://web.ma.utexas.edu/users/vandyke/notes/deep_learning_presentation/presentation.pdf
[2] Björck, Åke, and Gene H. Golub. "Numerical methods for computing angles between linear subspaces." Mathematics of Computation 27.123 (1973): p. 582.
[3] "Metric space." Wikipedia. https://en.wikipedia.org/wiki/Metric_space