# From Dot Product to Inner Product Space: The Unified Language Behind Linear Algebra, Signals, and AI
## Abstract
The **inner product** is a core algebraic structure shared across linear algebra, functional analysis, signal processing, machine learning, and quantum mechanics. Taking the inner product as its sole thread, this paper begins with the dot product in finite-dimensional Euclidean space and progressively introduces the axioms of inner product spaces, orthogonal decomposition, least-squares projection, Hilbert spaces, Fourier series and transforms, convolution, the discrete cosine transform, wavelet analysis, self-attention mechanisms, kernel methods, and state-vector projection in quantum mechanics. It reveals the mathematical unity underlying these seemingly disparate concepts: **define an inner product → establish an orthonormal basis → project and decompose → extract information**. This paper aims to provide readers with a cognitive map that cuts across mathematics, engineering, and physics.
## Preface: Everything Is a Projection
Mathematics and engineering sciences exhibit a recurring pattern: decomposing a complex object into a linear combination of "elementary components," where the tool of decomposition is precisely **projection**. The essence of projection is the inner product—a binary operation that measures "similarity." From decomposing a signal into sinusoidal waves of different frequencies in Fourier analysis, to finding the best-fit line for data in the method of least squares, to measuring a particle in a superposition state in quantum mechanics—these processes all share the same mathematical language: **define an inner product → establish an orthonormal basis → project → orthogonally decompose → extract information**.
The goal of this paper is to systematically elucidate this unified framework. We begin with the familiar vector dot product, gradually abstract to inner product spaces and Hilbert spaces, and demonstrate how this structure recurs throughout calculus, signal processing, artificial intelligence, and quantum mechanics. Readers need no prior background in functional analysis—only a working knowledge of basic linear algebra and calculus.
---
## Chapter 1 The Ontology of the Inner Product — A Fundamental Operation for Measuring Similarity
### 1.1 Theory and Rigorous Definitions
The concept of the **inner product** originates from the Euclidean dot product, but its mathematical content has been vastly generalized in functional analysis. This section begins with the finite-dimensional case and progressively builds toward a rigorous definition of the inner product.
```ad-definition
title: Definition 1.1 Dot Product
Let $\mathbb{R}^n$ be the $n$-dimensional real Euclidean space. For any two vectors $\mathbf{a} = (a_1, a_2, \dots, a_n)$ and $\mathbf{b} = (b_1, b_2, \dots, b_n)$, their dot product is defined as the sum of the products of corresponding components$^{[1]}$:
$$
\langle \mathbf{a}, \mathbf{b} \rangle = \mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i.
\tag{1.1}
$$
The dot product is a binary operation that maps two vectors to a scalar. Its geometric interpretation is given by the law of cosines:
$$
\mathbf{a} \cdot \mathbf{b} = \|\mathbf{a}\| \|\mathbf{b}\| \cos\theta,
\tag{1.2}
$$
where $\|\mathbf{a}\| = \sqrt{\langle \mathbf{a}, \mathbf{a} \rangle}$ is the Euclidean norm ($L_2$ norm) of the vector, and $\theta$ is the angle between the two vectors.
```
```ad-definition
title: Definition 1.2 Inner Product Space
Let $V$ be a vector space over the field $\mathbb{F}$ (either $\mathbb{R}$ or $\mathbb{C}$). A mapping $\langle \cdot, \cdot \rangle: V \times V \to \mathbb{F}$ is called an **inner product** if it satisfies the following three axioms$^{[8][9]}$:
1. **Conjugate Symmetry**: $\langle \mathbf{u}, \mathbf{v} \rangle = \overline{\langle \mathbf{v}, \mathbf{u} \rangle}$, where the overline denotes complex conjugation. For real vector spaces, this reduces to symmetry $\langle \mathbf{u}, \mathbf{v} \rangle = \langle \mathbf{v}, \mathbf{u} \rangle$.
2. **Linearity in the First Argument**: $\langle \alpha\mathbf{u} + \beta\mathbf{v}, \mathbf{w} \rangle = \alpha\langle \mathbf{u}, \mathbf{w} \rangle + \beta\langle \mathbf{v}, \mathbf{w} \rangle$, for all $\alpha, \beta \in \mathbb{F}$.
3. **Positive Definiteness**: $\langle \mathbf{v}, \mathbf{v} \rangle \geq 0$, and $\langle \mathbf{v}, \mathbf{v} \rangle = 0$ if and only if $\mathbf{v} = \mathbf{0}$.
The inner product induces a norm $\|\mathbf{v}\| = \sqrt{\langle \mathbf{v}, \mathbf{v} \rangle}$, which in turn induces a metric $d(\mathbf{u}, \mathbf{v}) = \|\mathbf{u} - \mathbf{v}\|$. Hence, an inner product space is naturally a normed space, and consequently a metric space.
```
```ad-theorem
title: Theorem 1.1 Cauchy–Schwarz Inequality
For any vectors $\mathbf{u}, \mathbf{v}$ in an inner product space $V$,
$$
|\langle \mathbf{u}, \mathbf{v} \rangle| \leq \|\mathbf{u}\| \|\mathbf{v}\|,
\tag{1.3}
$$
with equality if and only if $\mathbf{u}$ and $\mathbf{v}$ are linearly dependent. This inequality is the most fundamental inequality in inner product spaces; it guarantees that the definition of the angle $\theta$ via $\cos\theta = \langle \mathbf{u}, \mathbf{v} \rangle / (\|\mathbf{u}\|\|\mathbf{v}\|)$ always lies in the interval $[-1, 1]$.
```
### 1.2 Geometric and Spatial Intuition
The geometric intuition of the inner product can be understood through the concept of **projection**. Given vectors $\mathbf{a}$ and $\mathbf{b}$, the scalar projection of $\mathbf{a}$ onto the direction of $\mathbf{b}$ is
$$
\text{comp}_{\mathbf{b}} \mathbf{a} = \|\mathbf{a}\| \cos\theta = \frac{\langle \mathbf{a}, \mathbf{b} \rangle}{\|\mathbf{b}\|}.
\tag{1.4}
$$
This value measures the "component magnitude" of $\mathbf{a}$ in the direction of $\mathbf{b}$. The inner product $\langle \mathbf{a}, \mathbf{b} \rangle$ can then be interpreted as this projection length multiplied by $\|\mathbf{b}\|$, i.e., "projection length $\times$ base length"$^{[10]}$.
When $\langle \mathbf{a}, \mathbf{b} \rangle = 0$, we say $\mathbf{a}$ and $\mathbf{b}$ are **orthogonal**, in which case the projection of $\mathbf{a}$ onto $\mathbf{b}$ is zero. Orthogonality means that the two vectors are completely independent in direction, containing no information about each other.
If we normalize the vectors (i.e., divide by their own norms), the inner product reduces to the cosine of the angle:
$$
\cos\theta = \frac{\langle \mathbf{a}, \mathbf{b} \rangle}{\|\mathbf{a}\| \|\mathbf{b}\|}.
\tag{1.5}
$$
This quantity is called the **cosine similarity**$^{[13]}$. It eliminates the influence of vector magnitude and purely measures directional similarity, finding wide application in information retrieval and natural language processing.
### 1.3 Worked Examples
```ad-example
title: Example 1.1 Dot Product and Projection of High-Dimensional Vectors
Given the vectors $\mathbf{a} = (1, 2, 3, 4)$ and $\mathbf{b} = (2, 0, -1, 3)$ in four-dimensional space, compute: (1) the dot product; (2) their respective norms; (3) the scalar projection of $\mathbf{a}$ onto $\mathbf{b}$; (4) the cosine similarity.
**Solution** (1) By definition (1.1):
$$
\langle \mathbf{a}, \mathbf{b} \rangle = 1\times 2 + 2\times 0 + 3\times (-1) + 4\times 3 = 2 + 0 - 3 + 12 = 11.
$$
(2) Norm computation:
$$
\|\mathbf{a}\| = \sqrt{1^2 + 2^2 + 3^2 + 4^2} = \sqrt{30}, \quad
\|\mathbf{b}\| = \sqrt{2^2 + 0^2 + (-1)^2 + 3^2} = \sqrt{14}.
$$
(3) The scalar projection is given by (1.4):
$$
\text{comp}_{\mathbf{b}} \mathbf{a} = \frac{\langle \mathbf{a}, \mathbf{b} \rangle}{\|\mathbf{b}\|} = \frac{11}{\sqrt{14}} \approx 2.940.
$$
(4) The cosine similarity is given by (1.5):
$$
\cos\theta = \frac{11}{\sqrt{30 \times 14}} = \frac{11}{\sqrt{420}} \approx 0.5367.
$$
This result indicates that $\mathbf{a}$ and $\mathbf{b}$ form an angle of approximately $57.5^\circ$ in four-dimensional space, exhibiting a moderate degree of positive correlation.
```
```ad-example
title: Example 1.2 Inner Product of Continuous Functions: Average Power in an AC Circuit
In an AC circuit, the voltage $v(t) = V_m\cos(\omega t)$ and the current $i(t) = I_m\cos(\omega t + \phi)$ have a phase difference $\phi$. Find the average power over one complete period $T = 2\pi/\omega$.
**Solution** The average power is physically defined as the mean value of the product of voltage and current over one period. Mathematically, this is the inner product of the two functions in the space $L^2[0,T]$ (divided by the period length):
$$
P = \frac{1}{T} \int_0^T v(t) i(t) \, dt.
\tag{1.6}
$$
Substituting the expressions:
$$
P = \frac{1}{T} \int_0^T V_m I_m \cos(\omega t) \cos(\omega t + \phi) \, dt.
$$
Using the product-to-sum formula $\cos\alpha \cos\beta = \frac{1}{2}[\cos(\alpha+\beta) + \cos(\alpha-\beta)]$:
$$
P = \frac{V_m I_m}{2T} \int_0^T [\cos(2\omega t + \phi) + \cos(-\phi)] \, dt.
$$
Since $\cos(-\phi) = \cos\phi$ and $\int_0^T \cos(2\omega t + \phi) dt = 0$ (the integral of a sinusoid over a full period is zero), we obtain:
$$
P = \frac{V_m I_m}{2T} \cdot T \cos\phi = \frac{1}{2} V_m I_m \cos\phi.
\tag{1.7}
$$
The factor $\cos\phi$ in (1.7) is called the **power factor**. When voltage and current are in phase ($\phi = 0$), $\cos\phi = 1$ and the average power reaches its maximum; when they are orthogonal ($\phi = \pi/2$), $\cos\phi = 0$ and the average power is zero. This result reveals the mathematical essence of active and reactive power in AC circuits: only the in-phase component (i.e., the component with non-zero inner product) contributes to net energy transfer.
```
### 1.4 Engineering and Cutting-Edge Applications
One of the most fundamental engineering applications of the inner product appears in natural language processing. **Word embedding**$^{[21]}$ techniques map each word to a dense vector in $\mathbb{R}^d$ space, such that semantically similar words are close together in the vector space. The semantic similarity between two words is typically measured by the cosine similarity (1.5).
Figure 1 shows a $5 \times 5$ cosine similarity heatmap for five English words, generated by the accompanying main.py code. As shown in the figure, the cosine similarity between `king` and `queen` is high (close to $0.92$), while that between `king` and `apple` is low (approximately $0.10$)—a quantitative result that aligns closely with human semantic intuition.

**Figure 1: Cosine similarity heatmap of word vectors.** Each cell $(i,j)$ represents the cosine similarity between the word vectors of the $i$-th and $j$-th words. Warmer colors (closer to 1) indicate greater semantic similarity, while cooler colors (closer to 0) indicate less semantic relatedness. This figure is generated by a simulation of random word vectors in main.py.
Cosine similarity is widely used in recommendation systems, information retrieval, text classification, and other tasks. The core idea remains the same: embed unstructured data into a vector space, measure similarity via the inner product, and then retrieve or rank based on that similarity.
---
## Chapter 2 Orthogonality and Orthogonal Complements — An Algebraic Characterization of Independence
### 2.1 Theory and Rigorous Definitions
The inner product provides a tool for measuring the "similarity" between vectors. When the inner product of two vectors is zero, they are completely independent in direction—a property called **orthogonality**.
```ad-definition
title: Definition 2.1 Orthogonality
Let $V$ be an inner product space. If $\langle \mathbf{u}, \mathbf{v} \rangle = 0$, then the vectors $\mathbf{u}$ and $\mathbf{v}$ are said to be orthogonal, denoted $\mathbf{u} \perp \mathbf{v}$.
```
```ad-definition
title: Definition 2.2 Orthogonal Complement
Let $W$ be a subspace of an inner product space $V$. The **orthogonal complement** of $W$ is defined as$^{[2]}$
$$
W^\perp = \{ \mathbf{v} \in V \mid \langle \mathbf{v}, \mathbf{w} \rangle = 0,\ \forall \mathbf{w} \in W \}.
\tag{2.1}
$$
$W^\perp$ is a subspace of $V$, and $W \cap W^\perp = \{\mathbf{0}\}$.
```
```ad-theorem
title: Theorem 2.1 Orthogonal Decomposition Theorem
Let $W$ be a finite-dimensional subspace of an inner product space $V$. Then for any $\mathbf{x} \in V$, there exists a unique decomposition
$$
\mathbf{x} = \mathbf{x}_W + \mathbf{x}_{W^\perp},
\tag{2.2}
$$
where $\mathbf{x}_W \in W$, $\mathbf{x}_{W^\perp} \in W^\perp$, and $\langle \mathbf{x}_W, \mathbf{x}_{W^\perp} \rangle = 0$.
The vector $\mathbf{x}_W$ is called the **orthogonal projection** of $\mathbf{x}$ onto $W$, denoted $\operatorname{proj}_W \mathbf{x}$. The uniqueness of this decomposition is guaranteed by the positive definiteness of the inner product.
```
```ad-theorem
title: Theorem 2.2 Projection Operator
If $\{\mathbf{w}_1, \dots, \mathbf{w}_k\}$ is an orthonormal basis of $W$, then the orthogonal projection of $\mathbf{x}$ onto $W$ can be expressed explicitly as
$$
\operatorname{proj}_W \mathbf{x} = \sum_{i=1}^k \langle \mathbf{x}, \mathbf{w}_i \rangle \mathbf{w}_i.
\tag{2.3}
$$
Equation (2.3) is one of the most important formulas in inner product theory: it shows that the projection coefficients are precisely the inner products themselves.
```
### 2.2 Geometric and Spatial Intuition
The geometric meaning of orthogonality is "independence." In $\mathbb{R}^3$, given a plane $W$ passing through the origin, its orthogonal complement $W^\perp$ is the line perpendicular to the plane. Any vector $\mathbf{x}$ in space can be uniquely decomposed into a component lying in the plane and a component along the normal direction—the two are mutually perpendicular and contain no information about each other.
This concept extends to function spaces: if two functions have zero inner product over some interval, they are said to be orthogonal in the $L^2$ sense. For example, $\sin(mx)$ and $\sin(nx)$ ($m \neq n$) are orthogonal on $[0, 2\pi]$, meaning that as "signals" they do not interfere with each other—this is the mathematical foundation of frequency-division multiplexing.
### 2.3 Worked Examples
```ad-example
title: Example 2.1 Orthogonal Complement and Orthogonal Decomposition
In $\mathbb{R}^3$, given the subspace $W = \operatorname{span}\{\mathbf{v}_1, \mathbf{v}_2\}$, where
$$
\mathbf{v}_1 = \begin{bmatrix} 1 \\ 1 \\ 0 \end{bmatrix},\quad
\mathbf{v}_2 = \begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix}.
$$
For the vector $\mathbf{x} = \begin{bmatrix} 2 \\ 3 \\ 4 \end{bmatrix}$, find its orthogonal decomposition $\mathbf{x} = \mathbf{x}_W + \mathbf{x}_{W^\perp}$.
**Solution** Step 1: Find a basis for $W^\perp$. Let $\mathbf{n} = (n_1, n_2, n_3)^T \in W^\perp$. Then $\langle \mathbf{n}, \mathbf{v}_1 \rangle = \langle \mathbf{n}, \mathbf{v}_2 \rangle = 0$:
$$
\begin{cases}
n_1 + n_2 = 0, \\
n_1 + n_3 = 0.
\end{cases}
$$
Solving yields $n_2 = n_3 = -n_1$. Taking $n_1 = 1$, we obtain $\mathbf{n} = (1, -1, -1)^T$, so $W^\perp = \operatorname{span}\{\mathbf{n}\}$.
Step 2: Compute the projection of $\mathbf{x}$ onto $W^\perp$. By the projection formula (2.3):
$$
\mathbf{x}_{W^\perp} = \operatorname{proj}_{\mathbf{n}} \mathbf{x} = \frac{\langle \mathbf{x}, \mathbf{n} \rangle}{\|\mathbf{n}\|^2} \mathbf{n}.
$$
Compute the inner product: $\langle \mathbf{x}, \mathbf{n} \rangle = 2\times 1 + 3\times(-1) + 4\times(-1) = -5$.
Compute the squared norm: $\|\mathbf{n}\|^2 = 1^2 + (-1)^2 + (-1)^2 = 3$.
Therefore:
$$
\mathbf{x}_{W^\perp} = \frac{-5}{3} \begin{bmatrix} 1 \\ -1 \\ -1 \end{bmatrix} = \begin{bmatrix} -5/3 \\ 5/3 \\ 5/3 \end{bmatrix}.
$$
Step 3: Compute the projection of $\mathbf{x}$ onto $W$. By the orthogonal decomposition theorem (2.2):
$$
\mathbf{x}_W = \mathbf{x} - \mathbf{x}_{W^\perp} = \begin{bmatrix} 2 \\ 3 \\ 4 \end{bmatrix} - \begin{bmatrix} -5/3 \\ 5/3 \\ 5/3 \end{bmatrix} = \begin{bmatrix} 11/3 \\ 4/3 \\ 7/3 \end{bmatrix}.
$$
Step 4: Verify $\mathbf{x}_W \in W$. Solve for $\alpha, \beta$ such that $\mathbf{x}_W = \alpha\mathbf{v}_1 + \beta\mathbf{v}_2$:
$$
\begin{bmatrix} \alpha + \beta \\ \alpha \\ \beta \end{bmatrix} = \begin{bmatrix} 11/3 \\ 4/3 \\ 7/3 \end{bmatrix} \implies \alpha = \frac{4}{3},\ \beta = \frac{7}{3}.
$$
Verification passes. Additionally, we can verify $\langle \mathbf{x}_W, \mathbf{x}_{W^\perp} \rangle = 0$, confirming the orthogonality of the decomposition.
```
### 2.4 Engineering and Cutting-Edge Applications
The concept of the orthogonal complement finds direct application in communication engineering. The **Gram–Schmidt orthogonalization algorithm**$^{[3]}$ is based precisely on the idea of orthogonal decomposition: given a set of linearly independent vectors, it constructs an orthonormal basis by successively subtracting the projection components onto already-processed directions. This algorithm is the theoretical foundation of QR decomposition.
In the beamforming design of 5G millimeter-wave communications, the transmitted signal vector is placed in the orthogonal complement of the signal subspace of other users, thereby achieving theoretically zero interference—as long as the signal vector is orthogonal to the interference subspace, no matter how large the transmission power, it will not interfere with neighboring users.
---
## Chapter 3 The Method of Least Squares (Linear Algebra Perspective) — Optimal Approximation of Unsolvable Equations
### 3.1 Theory and Rigorous Definitions
In practical engineering problems, we frequently encounter **overdetermined systems**: linear systems $A\mathbf{x} = \mathbf{b}$ with more equations than unknowns, where $A \in \mathbb{R}^{m \times n}$ and $m > n$. Such systems typically have no exact solution, because $\mathbf{b}$ does not lie in the column space $\operatorname{Col}(A)$ of $A$.
```ad-definition
title: Definition 3.1 Least-Squares Problem
For $A \in \mathbb{R}^{m \times n}$ ($m > n$) and $\mathbf{b} \in \mathbb{R}^m$, the least-squares problem is
$$
\min_{\mathbf{x} \in \mathbb{R}^n} \| A\mathbf{x} - \mathbf{b} \|^2.
\tag{3.1}
$$
The geometric interpretation of this problem is: find the vector $\hat{\mathbf{b}} = A\hat{\mathbf{x}}$ in the column space of $A$ that is closest to $\mathbf{b}$.
```
```ad-theorem
title: Theorem 3.1 Normal Equation
The solution $\hat{\mathbf{x}}$ to the least-squares problem (3.1) satisfies
$$
A^T A \hat{\mathbf{x}} = A^T \mathbf{b}.
\tag{3.2}
$$
**Proof** Let $\mathbf{r}(\mathbf{x}) = \mathbf{b} - A\mathbf{x}$ be the residual vector. By the orthogonal decomposition theorem, $\hat{\mathbf{b}} = A\hat{\mathbf{x}}$ is the orthogonal projection of $\mathbf{b}$ onto $\operatorname{Col}(A)$ if and only if the residual $\mathbf{r}(\hat{\mathbf{x}})$ is perpendicular to $\operatorname{Col}(A)$, i.e.,
$$
\langle A\mathbf{y}, \mathbf{r}(\hat{\mathbf{x}}) \rangle = 0, \quad \forall \mathbf{y} \in \mathbb{R}^n.
$$
Equivalently, $A^T \mathbf{r}(\hat{\mathbf{x}}) = \mathbf{0}$, i.e., $A^T(\mathbf{b} - A\hat{\mathbf{x}}) = \mathbf{0}$, which rearranges to (3.2). $\square$
When $A$ has full column rank, $A^T A$ is invertible, and the solution can be written explicitly as
$$
\hat{\mathbf{x}} = (A^T A)^{-1} A^T \mathbf{b}.
\tag{3.3}
$$
```
### 3.2 Geometric and Spatial Intuition
The geometric essence of the method of least squares is illustrated in Figure 2. The data vector $\mathbf{b}$ resides in the high-dimensional space $\mathbb{R}^m$, while the reachable set of the model $\operatorname{Col}(A)$ is an $n$-dimensional subspace of $\mathbb{R}^m$. Since $\mathbf{b}$ is generally not in this subspace, we cannot solve $A\mathbf{x} = \mathbf{b}$ exactly. The optimal strategy is to orthogonally project $\mathbf{b}$ onto $\operatorname{Col}(A)$, obtaining $\hat{\mathbf{b}}$, and then solve for the coefficients $\hat{\mathbf{x}}$.
The residual vector $\mathbf{e} = \mathbf{b} - \hat{\mathbf{b}}$ is perpendicular to $\operatorname{Col}(A)$, i.e., $\mathbf{e} \perp \operatorname{Col}(A)$. This orthogonality condition $A^T \mathbf{e} = \mathbf{0}$ is precisely the equivalent statement of the normal equation (3.2).
### 3.3 Worked Examples
```ad-example
title: Example 3.1 Least-Squares Solution for Linear Regression
Given three data points $(1,2), (2,3), (3,5)$, find the best-fit line $y = kx + c$.
**Solution** Write the problem in matrix form $A\mathbf{x} = \mathbf{b}$:
$$
\begin{bmatrix} 1 & 1 \\ 2 & 1 \\ 3 & 1 \end{bmatrix} \begin{bmatrix} k \\ c \end{bmatrix} = \begin{bmatrix} 2 \\ 3 \\ 5 \end{bmatrix}.
$$
Since $3 > 2$, the system is overdetermined and has no exact solution. We use the normal equation (3.2).
Step 1: Compute $A^T A$:
$$
A^T A = \begin{bmatrix} 1 & 2 & 3 \\ 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 2 & 1 \\ 3 & 1 \end{bmatrix} = \begin{bmatrix} 14 & 6 \\ 6 & 3 \end{bmatrix}.
$$
Step 2: Compute $A^T \mathbf{b}$:
$$
A^T \mathbf{b} = \begin{bmatrix} 1 & 2 & 3 \\ 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} 2 \\ 3 \\ 5 \end{bmatrix} = \begin{bmatrix} 23 \\ 10 \end{bmatrix}.
$$
Step 3: Solve the normal equation:
$$
\begin{bmatrix} 14 & 6 \\ 6 & 3 \end{bmatrix} \begin{bmatrix} k \\ c \end{bmatrix} = \begin{bmatrix} 23 \\ 10 \end{bmatrix}.
$$
From the second row, $6k + 3c = 10$, i.e., $c = (10 - 6k)/3$. Substituting into the first row:
$$
14k + 6 \cdot \frac{10 - 6k}{3} = 23 \implies 14k + 20 - 12k = 23 \implies 2k = 3 \implies k = \frac{3}{2}.
$$
Back-substituting gives $c = \frac{10 - 9}{3} = \frac{1}{3}$. Therefore, the best-fit line is
$$
y = \frac{3}{2}x + \frac{1}{3}.
\tag{3.4}
$$
Step 4: Verify orthogonality. Compute the fitted values $\hat{\mathbf{b}} = A\hat{\mathbf{x}}$ and the residual $\mathbf{e}$:
$$
\hat{\mathbf{b}} = \begin{bmatrix} 1\times 1.5 + 1/3 \\ 2\times 1.5 + 1/3 \\ 3\times 1.5 + 1/3 \end{bmatrix} = \begin{bmatrix} 11/6 \\ 10/3 \\ 29/6 \end{bmatrix},\quad
\mathbf{e} = \mathbf{b} - \hat{\mathbf{b}} = \begin{bmatrix} 1/6 \\ -1/3 \\ 1/6 \end{bmatrix}.
$$
Verify $A^T \mathbf{e} = \mathbf{0}$:
$$
A^T \mathbf{e} = \begin{bmatrix} 1 & 2 & 3 \\ 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} 1/6 \\ -1/3 \\ 1/6 \end{bmatrix} = \begin{bmatrix} 1/6 - 2/3 + 3/6 \\ 1/6 - 1/3 + 1/6 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}.
$$
The orthogonality condition is satisfied, confirming that $\hat{\mathbf{b}}$ is the orthogonal projection of $\mathbf{b}$ onto $\operatorname{Col}(A)$.
```
### 3.4 Engineering and Cutting-Edge Applications
The method of least squares is the foundation of **regression analysis**$^{[5][25]}$ in statistics. Figure 2 illustrates the geometric intuition of the above example: the black dots are the original data points, the red line is the least-squares fit, and the gray dashed lines represent the residuals (i.e., the perpendicular distances from $\mathbf{b}$ to $\operatorname{Col}(A)$).

**Figure 2: Geometric intuition of the method of least squares.** The black dots are the data points, and the red line is the fitted result. The residual vector $\mathbf{e}$ is perpendicular to the column space $\operatorname{Col}(A)$. The numerical verification of the orthogonality condition $A^T \mathbf{e} = \mathbf{0}$ yields $\|A^T \mathbf{e}\|_2 \approx 1.92 \times 10^{-14}$ (computed by main.py), which is zero within floating-point precision.
The method of least squares has broad applications in engineering: the measurement update step of the Kalman filter, parameter estimation in system identification, and linear regression models in machine learning—all of these can be reduced to solving the normal equation (3.2).
---
## Chapter 4 From Finite Dimensions to Infinite Dimensions — Functions as Vectors
### 4.1 Theory and Rigorous Definitions
The inner products discussed in the preceding chapters have been confined to the finite-dimensional Euclidean space $\mathbb{R}^n$. However, the concept of the inner product can be naturally extended to infinite-dimensional function spaces. This generalization is a central topic in functional analysis and serves as the bridge connecting linear algebra with signal processing and quantum mechanics.
```ad-definition
title: Definition 4.1 $L^2$ Inner Product
Let $f, g: [a, b] \to \mathbb{R}$ be square-integrable functions, i.e., $\int_a^b [f(x)]^2 dx < \infty$. Their inner product is defined as
$$
\langle f, g \rangle = \int_a^b f(x) g(x) \, dx.
\tag{4.1}
$$
The norm induced by this inner product is
$$
\|f\| = \sqrt{\langle f, f \rangle} = \sqrt{\int_a^b [f(x)]^2 \, dx},
\tag{4.2}
$$
called the $L^2$ norm, which is often interpreted physically as the "energy" of a signal.
```
```ad-definition
title: Definition 4.2 Hilbert Space
A complete inner product space is called a **Hilbert space**$^{[6][8]}$. Specifically, a Hilbert space $\mathcal{H}$ is an inner product space in which every Cauchy sequence converges in $\mathcal{H}$ (i.e., the space is complete).
The finite-dimensional inner product space $\mathbb{R}^n$ is a special case of a Hilbert space. Infinite-dimensional examples include $L^2[a,b]$ (the space of square-integrable functions) and $\ell^2$ (the space of square-summable sequences). The completeness of Hilbert spaces guarantees the convergence of infinite series expansions such as Fourier series.
```
```ad-theorem
title: Theorem 4.1 Cauchy–Schwarz Inequality in $L^2$ Space
For any functions $f, g$ in $L^2[a,b]$,
$$
\left| \int_a^b f(x) g(x) \, dx \right| \leq \sqrt{\int_a^b [f(x)]^2 \, dx} \cdot \sqrt{\int_a^b [g(x)]^2 \, dx}.
\tag{4.3}
$$
### 4.2 Geometric and Spatial Intuition
The key to understanding functions as vectors lies in the idea of "pointwise correspondence." In $\mathbb{R}^n$, the $i$-th component $v_i$ of a vector $\mathbf{v} = (v_1, \dots, v_n)$ corresponds to the value along the $i$-th coordinate axis. In function spaces, each $x \in [a,b]$ corresponds to an independent "coordinate axis," and the function value $f(x)$ is the component along that axis. Thus, a function $f$ is essentially a vector with uncountably infinitely many components.
Two functions are orthogonal ($\langle f, g \rangle = 0$) means that they "contain no component of each other" in the $L^2$ sense. This concept has profound physical meaning in signal processing: orthogonal signals can be transmitted over the same channel without interfering with each other.
### 4.3 Worked Examples
```ad-example
title: Example 4.1 Orthogonality and Distance in Function Spaces
On the interval $[-1, 1]$, given $f(x) = x$ and $g(x) = x^2$. Determine whether they are orthogonal, and compute their respective norms and the distance between them.
**Solution** (1) Compute the inner product:
$$
\langle f, g \rangle = \int_{-1}^{1} x \cdot x^2 \, dx = \int_{-1}^{1} x^3 \, dx = \left[ \frac{x^4}{4} \right]_{-1}^{1} = \frac{1}{4} - \frac{1}{4} = 0.
$$
Therefore $\langle f, g \rangle = 0$, so $f$ and $g$ are orthogonal on $[-1,1]$. This is because $x^3$ is an odd function, and its integral over a symmetric interval is zero.
(2) Compute the norms:
$$
\|f\| = \sqrt{\int_{-1}^{1} x^2 \, dx} = \sqrt{\left[ \frac{x^3}{3} \right]_{-1}^{1}} = \sqrt{\frac{2}{3}} \approx 0.8165,
$$
$$
\|g\| = \sqrt{\int_{-1}^{1} x^4 \, dx} = \sqrt{\left[ \frac{x^5}{5} \right]_{-1}^{1}} = \sqrt{\frac{2}{5}} \approx 0.6325.
$$
(3) Compute the distance between the functions:
$$
\|f - g\|^2 = \int_{-1}^{1} (x - x^2)^2 \, dx = \int_{-1}^{1} (x^2 - 2x^3 + x^4) \, dx = \frac{2}{3} + 0 + \frac{2}{5} = \frac{16}{15},
$$
hence $d(f, g) = \|f - g\| = \sqrt{16/15} \approx 1.0328$.
This example demonstrates that odd and even functions are naturally orthogonal on symmetric intervals. This property is crucial in Fourier analysis—it guarantees the orthogonality between sine and cosine bases.
```
### 4.4 Engineering and Cutting-Edge Applications
The most direct engineering application of the function inner product is the **matched filter**. In radar and communication systems, the inner product of the received signal $r(t)$ with the transmitted template $s(t)$,
$$
\langle r, s \rangle = \int_{-\infty}^{\infty} r(t) s(t) \, dt
$$
is used to detect the presence of a target. When a target reflection is present in the echo, the inner product value increases significantly. This is essentially a "similarity detection" operation in function space.
Additionally, **kernel methods**$^{[22]}$ are built on the idea of mapping data points into a reproducing kernel Hilbert space (RKHS) and computing inner products in that infinite-dimensional space, thereby implicitly achieving high-dimensional feature transformations. We will explore this topic in depth in Chapter 12.
---
## Chapter 5 Orthogonality of Trigonometric Functions — Basis Functions in the Frequency Domain
### 5.1 Theory and Rigorous Definitions
In the Hilbert space $L^2[-\pi, \pi]$, the trigonometric system forms an important orthogonal basis. Consider the set of functions
$$
\{1,\ \sin x,\ \cos x,\ \sin 2x,\ \cos 2x,\ \dots,\ \sin nx,\ \cos nx,\ \dots\}.
$$
```ad-theorem
title: Theorem 5.1 Orthogonality of Trigonometric Functions
On the interval $[-\pi, \pi]$, the trigonometric system satisfies the following orthogonality relations$^{[4]}$:
$$
\int_{-\pi}^{\pi} \sin(mx) \cos(nx) \, dx = 0, \quad \forall m, n,
\tag{5.1}
$$
$$
\int_{-\pi}^{\pi} \sin(mx) \sin(nx) \, dx = 0, \quad m \neq n,
\tag{5.2}
$$
$$
\int_{-\pi}^{\pi} \cos(mx) \cos(nx) \, dx = 0, \quad m \neq n.
\tag{5.3}
$$
The self-inner products at the same frequency are non-zero:
$$
\int_{-\pi}^{\pi} \sin^2(nx) \, dx = \pi, \quad
\int_{-\pi}^{\pi} \cos^2(nx) \, dx = \pi.
\tag{5.4}
$$
**Proof** These relations can be derived directly from the product-to-sum formulas for trigonometric functions. For example, for (5.2):
$$
\sin(mx)\sin(nx) = \frac{1}{2}[\cos((m-n)x) - \cos((m+n)x)].
$$
When $m \neq n$, the integrals of $\cos((m-n)x)$ and $\cos((m+n)x)$ over $[-\pi, \pi]$ are both zero. $\square$
```
### 5.2 Geometric and Spatial Intuition
The geometric meaning of the orthogonality of trigonometric functions is that sine and cosine waves of different frequencies are mutually perpendicular in the $L^2$ space. This means that as "signals," they do not interfere with each other—this is the mathematical foundation of frequency-division multiplexing.
In communication systems, data from different users can be modulated onto mutually orthogonal carrier waves and transmitted simultaneously. The receiver can separate the signals through inner product operations, even if they completely overlap in the time domain. This principle lies at the heart of **frequency-domain**$^{[16]}$ analysis in modern wireless communications.
### 5.3 Worked Examples
```ad-example
title: Example 5.1 Manual Verification of Trigonometric Orthogonality
Verify the following three inner products on $[-\pi, \pi]$.
**Case A: $\langle \sin(2x), \cos(3x) \rangle$**
$$
\langle \sin(2x), \cos(3x) \rangle = \int_{-\pi}^{\pi} \sin(2x)\cos(3x) \, dx.
$$
Using the product-to-sum formula $\sin\alpha\cos\beta = \frac{1}{2}[\sin(\alpha+\beta) + \sin(\alpha-\beta)]$:
$$
\sin(2x)\cos(3x) = \frac{1}{2}[\sin(5x) + \sin(-x)] = \frac{1}{2}[\sin(5x) - \sin(x)].
$$
Since $\int_{-\pi}^{\pi} \sin(kx) \, dx = 0$ for any integer $k$, we have
$$
\langle \sin(2x), \cos(3x) \rangle = \frac{1}{2} \times 0 - \frac{1}{2} \times 0 = 0.
$$
**Case B: $\langle \sin(2x), \sin(3x) \rangle$**
Using $\sin\alpha\sin\beta = \frac{1}{2}[\cos(\alpha-\beta) - \cos(\alpha+\beta)]$:
$$
\sin(2x)\sin(3x) = \frac{1}{2}[\cos(-x) - \cos(5x)] = \frac{1}{2}[\cos(x) - \cos(5x)].
$$
Since $\int_{-\pi}^{\pi} \cos(kx) \, dx = 0$ for $k \neq 0$, we have
$$
\langle \sin(2x), \sin(3x) \rangle = \frac{1}{2} \times 0 - \frac{1}{2} \times 0 = 0.
$$
**Case C: $\langle \sin(2x), \sin(2x) \rangle$ (self-inner product)**
Using the double-angle formula $\sin^2\theta = (1 - \cos 2\theta)/2$:
$$
\langle \sin(2x), \sin(2x) \rangle = \int_{-\pi}^{\pi} \frac{1 - \cos(4x)}{2} \, dx = \frac{1}{2} \cdot 2\pi - 0 = \pi.
$$
This result shows that $\|\sin(2x)\| = \sqrt{\pi}$, which explains why $\pi$ appears in the denominators of Fourier series coefficients.
```
### 5.4 Engineering and Cutting-Edge Applications
**Orthogonal Frequency-Division Multiplexing (OFDM)** is a core technology in modern 4G/5G wireless communications$^{[16]}$. It splits a high-rate data stream into multiple low-rate substreams, each modulated onto mutually orthogonal subcarriers for parallel transmission. Due to the orthogonality between subcarriers,
$$
\int_0^T \sin(2\pi f_k t) \cdot \sin(2\pi f_l t) \, dt = 0, \quad k \neq l,
$$
the receiver can perfectly separate the subcarrier signals through inner product operations, even if they heavily overlap in the frequency spectrum. This greatly improves spectral efficiency.
---
## Chapter 6 Fourier Series and Fourier Transform — Projection of Functions onto Trigonometric Bases
### 6.1 Theory and Rigorous Definitions
The orthogonality of the trigonometric system allows us to decompose any periodic function into a linear combination of trigonometric functions of different frequencies. This decomposition is called the **Fourier series**$^{[11]}$.
```ad-theorem
title: Theorem 6.1 Fourier Series
Let $f(t)$ be a square-integrable function with period $2\pi$. Its Fourier series expansion is
$$
f(t) = \frac{a_0}{2} + \sum_{n=1}^{\infty} [a_n \cos(nt) + b_n \sin(nt)],
\tag{6.1}
$$
where the coefficients are given by inner products:
$$
a_0 = \frac{1}{\pi} \int_{-\pi}^{\pi} f(t) \, dt,
\tag{6.2}
$$
$$
a_n = \frac{1}{\pi} \int_{-\pi}^{\pi} f(t) \cos(nt) \, dt = \frac{\langle f, \cos(nt) \rangle}{\|\cos(nt)\|^2},
\tag{6.3}
$$
$$
b_n = \frac{1}{\pi} \int_{-\pi}^{\pi} f(t) \sin(nt) \, dt = \frac{\langle f, \sin(nt) \rangle}{\|\sin(nt)\|^2}.
\tag{6.4}
$$
Equations (6.3)–(6.4) reveal the essence of Fourier coefficients: they are precisely the projection coefficients of the function $f$ onto each trigonometric basis (inner product divided by the squared norm of the basis), exactly analogous to computing coordinates of a finite-dimensional vector with respect to an orthonormal basis.
When the period $T \to \infty$, the Fourier series transitions into the **Fourier transform**$^{[12]}$:
$$
X(f) = \int_{-\infty}^{\infty} x(t) e^{-j2\pi ft} \, dt = \langle x(t), e^{j2\pi ft} \rangle.
\tag{6.5}
$$
The Fourier transform projects the time-domain function $x(t)$ onto the complex exponential basis $e^{j2\pi ft}$, yielding the frequency-domain representation $X(f)$.
```
### 6.2 Geometric and Spatial Intuition
The geometric essence of the Fourier transform is the "probe" idea: use complex exponential oscillations of different frequencies as probes, and compute their inner products with the signal under analysis. If the signal contains a particular frequency component, the inner product will be large (producing a spectral peak); if not, the inner product will be close to zero. Each peak in the spectrum corresponds to the projection strength of the signal onto that frequency basis.
### 6.3 Worked Examples
```ad-example
title: Example 6.1 Fourier Series Expansion of a Periodic Square Wave
Given a square wave with period $2\pi$,
$$
f(t) = \begin{cases}
1, & 0 < t < \pi, \\
-1, & -\pi < t < 0,
\end{cases}
$$
find its Fourier series coefficients.
**Solution** $f(t)$ is an odd function, so $a_0 = a_n = 0$ (all cosine coefficients are zero). We only need to compute $b_n$.
$$
b_n = \frac{1}{\pi} \int_{-\pi}^{\pi} f(t) \sin(nt) \, dt = \frac{1}{\pi} \left( \int_{-\pi}^{0} (-\sin(nt)) \, dt + \int_{0}^{\pi} \sin(nt) \, dt \right).
$$
Compute the first term: $\int_{-\pi}^{0} -\sin(nt) \, dt = \left[ \frac{\cos(nt)}{n} \right]_{-\pi}^{0} = \frac{1}{n} - \frac{\cos(-n\pi)}{n} = \frac{1 - (-1)^n}{n}$.
Compute the second term: $\int_{0}^{\pi} \sin(nt) \, dt = \left[ -\frac{\cos(nt)}{n} \right]_{0}^{\pi} = -\frac{\cos(n\pi)}{n} + \frac{1}{n} = \frac{1 - (-1)^n}{n}$.
Therefore:
$$
b_n = \frac{1}{\pi} \cdot \frac{2[1 - (-1)^n]}{n} = \begin{cases}
\dfrac{4}{n\pi}, & n \text{ odd}, \\[6pt]
0, & n \text{ even}.
\end{cases}
\tag{6.6}
$$
Thus, the Fourier series expansion of the square wave is
$$
f(t) = \frac{4}{\pi} \sum_{k=0}^{\infty} \frac{\sin((2k+1)t)}{2k+1} = \frac{4}{\pi} \left( \sin t + \frac{1}{3}\sin 3t + \frac{1}{5}\sin 5t + \cdots \right).
\tag{6.7}
$$
Numerical verification: at $t = \pi/2$, the first 3 terms approximate
$$
f(\pi/2) \approx \frac{4}{\pi} \left( 1 - \frac{1}{3} + \frac{1}{5} \right) = \frac{52}{15\pi} \approx 1.103,
$$
which is already close to the true value $1$. More terms will converge to the square wave (the Gibbs phenomenon produces an overshoot of approximately $9\%$ at the discontinuity).
```
### 6.4 Engineering and Cutting-Edge Applications
Figure 3 shows a typical application of the Fourier transform. A noisy signal $x(t)$ containing three frequency components at 50 Hz, 120 Hz, and 260 Hz appears chaotic in the time domain. After applying the Fourier transform, the spectrum clearly shows three peaks at the corresponding frequencies—these are precisely the projection strengths of the signal onto each frequency basis.

**Figure 3: Frequency-domain projection via the Fourier transform.** The upper panel shows the time-domain waveform of a noisy multi-tone signal $x(t) = 1.2\sin(2\pi\cdot 50t) + 0.7\sin(2\pi\cdot 120t) + 0.4\sin(2\pi\cdot 260t) + \eta(t)$; the lower panel shows the magnitude spectrum, with prominent peaks at 50, 120, and 260 Hz. This figure is generated by `np.fft.rfft` (the discrete Fourier transform) in main.py, which essentially computes inner products between the time-domain sampling vector and complex exponential basis vectors.
Fourier analysis finds applications across all fields of engineering: MP3 audio compression reduces data by discarding high-frequency components to which the human ear is insensitive; JPEG image compression uses the discrete cosine transform (DCT)$^{[18]}$ to project image blocks onto frequency bases; and frequency-domain diagnosis of electrocardiogram (ECG) signals uses spectral features to identify pathological patterns.
---
## Chapter 7 From the Frequency Domain to the Complex Frequency Domain — The Laplace and Z Transforms
### 7.1 Theory and Rigorous Definitions
The Fourier transform requires the signal to satisfy the absolute integrability condition $\int_{-\infty}^{\infty} |f(t)|\,dt < \infty$. For exponentially divergent signals such as $f(t) = e^{2t}$ ($t \geq 0$), the energy diverges as $t$ grows, and the Fourier transform inner product $\langle f(t), e^{-j\omega t} \rangle$ does not converge. To address this issue, the probing basis must be generalized from purely imaginary exponentials $e^{-j\omega t}$ to complex exponentials $e^{-st}$ with a real-valued decay factor, where $s = \sigma + j\omega$. ```ad-definition title: Definition 7.1 Laplace Transform Let $f(t)$ be a function defined on $[0, \infty)$. Its **Laplace transform** is defined as$^{[14]}$: $$F(s) = \mathcal{L}\{f(t)\} = \int_0^{\infty} f(t) e^{-st}\,dt, \quad s = \sigma + j\omega \in \mathbb{C} \tag{7.1}$$ When the real part $\sigma$ of $s$ is sufficiently large, the decay factor $e^{-\sigma t}$ can suppress the divergent trend of $f(t)$, making the integral converge. The set of $s$ values for which (7.1) converges is called the **region of convergence (ROC)**. ``` ```ad-definition title: Definition 7.2 Z Transform Let $x[n]$ be a discrete sequence defined on $\mathbb{Z}$. Its **Z transform** is defined as$^{[15]}$: $$X(z) = \mathcal{Z}\{x[n]\} = \sum_{n=-\infty}^{\infty} x[n] z^{-n}, \quad z = re^{j\omega} \in \mathbb{C} \tag{7.2}$$ The Z transform can be viewed as the discrete-domain counterpart of the Laplace transform: letting $z = e^{sT}$ (where $T$ is the sampling period), the unit circle $|z| = 1$ in the $z$-plane corresponds to the imaginary axis $s = j\omega$ in the $s$-plane. From the inner product perspective, both the Laplace transform and the Z transform can be understood as inner products of the signal with complex exponential basis functions: $$\mathcal{L}\{f(t)\} = \langle f(t), e^{st} \rangle, \quad \mathcal{Z}\{x[n]\} = \langle x[n], z^n \rangle$$ where the basis functions $e^{st}$ and $z^n$ incorporate two degrees of freedom—amplitude scaling (via $\sigma$ or $r$) and phase rotation (via $\omega$)—making them more expressive than the basis functions of the Fourier transform. ``` ### 7.2 Geometric and Spatial Intuition The basis $e^{-j\omega t}$ of the Fourier transform is a uniformly rotating vector on the unit circle of the complex plane, with constant magnitude 1. For a divergent signal $e^{2t}$, the integrand $|e^{2t} \cdot e^{-j\omega t}| = e^{2t}$ diverges as $t$ grows, and the integral never converges. The basis $e^{-(\sigma + j\omega)t} = e^{-\sigma t} e^{-j\omega t}$ of the Laplace transform adds a "decay knob" $\sigma$. When $\sigma > 2$, the decay rate of $e^{-\sigma t}$ exceeds the growth rate of $e^{2t}$, and the inner product integral converges. In the complex $s$-plane:
- **Region of Convergence (ROC)**: the set of $s$ values for which the transform converges;
- **Pole**: a point where $F(s)$ diverges to infinity (the denominator of $F(s)$ vanishes);
- **Zero**: a point where $F(s)$ is zero (the numerator of $F(s)$ vanishes).
The locations of the poles directly determine the stability of the system: the system is stable when all poles lie in the left half-plane ($\text{Re}(s) < 0$); it diverges when any pole lies in the right half-plane. The geometric interpretation of the Z transform is analogous: $z = re^{j\omega}$, where $r$ controls amplitude scaling and $\omega$ controls phase rotation. The ROC is an annular/exterior region $|z| > R$ (for right-sided sequences) or $|z| < R$ (for left-sided sequences). A discrete system is stable when all poles lie inside the unit circle. ### 7.3 Worked Examples ```ad-example title: Example 7.1 Laplace Transform of a Divergent Function — Pole and ROC Analysis Given the exponentially divergent function $f(t) = e^{2t}$ ($t \geq 0$), compute its Laplace transform and analyze the ROC and poles. **Solution**: Substituting into the Laplace transform definition (7.1): $$F(s) = \int_0^{\infty} e^{2t} \cdot e^{-st}\,dt = \int_0^{\infty} e^{-(s-2)t}\,dt$$ Let $a = s - 2 = (\sigma - 2) + j\omega$. Then: $$F(s) = \int_0^{\infty} e^{-at}\,dt = \left[-\frac{1}{a}e^{-at}\right]_{t=0}^{t=\infty}$$ As $t \to \infty$, $e^{-at} \to 0$ requires $\text{Re}(a) > 0$, i.e., $\text{Re}(s - 2) > 0$, or $\sigma > 2$. Under this condition:
$$F(s) = 0 - \left(-\frac{1}{a}\right) = \frac{1}{a} = \frac{1}{s - 2}$$
Therefore:
$$\mathcal{L}\{e^{2t}\} = \frac{1}{s - 2}, \quad \text{ROC: } \text{Re}(s) > 2, \quad \text{Pole: } s = 2$$
**Analysis**: The Fourier transform corresponds to $\sigma = 0$, and the real part of $s = j\omega$ is 0, which is less than 2 and lies outside the ROC—this explains why the Fourier transform of $e^{2t}$ does not exist. By introducing the real-part degree of freedom $\sigma$, the Laplace transform extends the integration path from the imaginary axis to the right half-plane of the complex plane, thereby enabling the analysis of divergent signals.
```
```ad-example
title: Example 7.2 Z Transform of a Discrete Sequence — ROC and Stability Analysis
Given the discrete sequence $x[n] = (0.5)^n u[n]$, where $u[n]$ is the unit step function (0 for $n < 0$, 1 for $n \geq 0$). Compute its Z transform and analyze the ROC and stability. **Solution**: Substituting into the Z transform definition (7.2): $$X(z) = \sum_{n=0}^{\infty} (0.5)^n z^{-n} = \sum_{n=0}^{\infty} (0.5 z^{-1})^n$$ This is a geometric series. It converges when $|0.5 z^{-1}| < 1$, i.e., $|z| > 0.5$:
$$X(z) = \frac{1}{1 - 0.5z^{-1}} = \frac{z}{z - 0.5}, \quad \text{ROC: } |z| > 0.5$$
The ROC is the exterior of a circle of radius 0.5 centered at the origin. The unit circle $|z| = 1$ lies entirely within the ROC, meaning that the discrete-time Fourier transform (DTFT, corresponding to $z = e^{j\omega}$) of this sequence exists. The pole is at $z = 0.5$, inside the unit circle, so the system is stable.
```
### 7.4 Engineering and Cutting-Edge Applications
The Laplace transform is the cornerstone of control theory. In feedback control systems, the pole locations of the system's transfer function $H(s)$ directly determine stability:
- All poles in the left half-plane ($\text{Re}(s) < 0$): system is stable, impulse response decays exponentially;
- Any pole in the right half-plane ($\text{Re}(s) > 0$): system diverges, impulse response grows exponentially;
- Poles on the imaginary axis ($\text{Re}(s) = 0$): system is marginally stable, impulse response oscillates with constant amplitude.
The Z transform is central to digital signal processing. The frequency response of a digital filter is determined by evaluating $H(z)$ on the unit circle, and stability is determined by whether all poles lie inside the unit circle. IIR filter design essentially involves placing poles and zeros in the $z$-plane to approximate a target frequency response.
---
## Chapter 8 The Essence of Convolution — "Sliding Inner Products"
### 8.1 Theory and Rigorous Definitions
**Convolution** is one of the most fundamental operations in signal processing, control theory, and deep learning$^{[17]}$. From the inner product perspective, convolution is essentially a **sequence of inner products over a sliding window**.
```ad-definition
title: Definition 8.1 Convolution
Let $f, g: \mathbb{R} \to \mathbb{R}$ be two continuous functions. Their **convolution** is defined as:
$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau) g(t - \tau)\,d\tau \tag{8.1}$$
For discrete sequences $x, h: \mathbb{Z} \to \mathbb{R}$, their **discrete convolution** is defined as:
$$(x * h)[n] = \sum_{k=-\infty}^{\infty} x[k]\, h[n - k] \tag{8.2}$$
```
```ad-theorem
title: Proposition 8.1 Inner Product Interpretation of Convolution
At a fixed time $t$, the convolution operation $(f * g)(t)$ is equivalent to the inner product between the function $f(\tau)$ and the flipped and shifted version of $g(\tau)$:
$$(f * g)(t) = \langle f(\tau), g(t - \tau) \rangle = \int f(\tau) g(t - \tau)\,d\tau \tag{8.3}$$
where the flipping operation $g(\tau) \to g(-\tau)$ ensures causality—the current output depends only on current and past inputs.
```
```ad-definition
title: Definition 8.2 Cross-Correlation
A closely related operation is **cross-correlation**:
$$(f \star g)(t) = \int_{-\infty}^{\infty} f(\tau) g(\tau + t)\,d\tau \tag{8.4}$$
Cross-correlation does not involve flipping; it directly computes the inner product of signals at different shifts, and is commonly used for template matching and similarity detection.
```
### 8.2 Geometric and Spatial Intuition
The geometric process of convolution can be decomposed into four steps:
1. **Flip**: Reflect the kernel $g(\tau)$ to obtain $g(-\tau)$, ensuring causality;
2. **Shift**: Translate the flipped kernel by $t$ to obtain $g(t - \tau)$;
3. **Multiply**: Multiply $f(\tau)$ and $g(t - \tau)$ pointwise;
4. **Integrate**: Sum (integrate) the product to obtain the inner product value at that instant.
As $t$ varies, the kernel slides along the time axis, computing the inner product between the signal and the kernel at each position. The convolution result $y(t)$ is the curve of inner product values as a function of the sliding position. Positions with large inner product values indicate that the local signal waveform most closely resembles the kernel—this is precisely the principle of the **matched filter**.
In image processing, a two-dimensional convolution kernel slides over the image, computing the 2D inner product between each $k \times k$ neighborhood and the kernel at each position, producing a "response map" (feature map). Regions with high response values indicate that the local image patch best matches the pattern of the convolution kernel.
### 8.3 Worked Examples
```ad-example
title: Example 8.1 Sliding Inner Product Convolution of Discrete Sequences — Pointwise Manual Computation
Given the input sequence $x[n] = [1, 2, 3]$ ($n = 0, 1, 2$) and the convolution kernel $h[n] = [0.5, 1, 0.5]$ ($n = 0, 1, 2$). Compute the convolution $y[n] = (x * h)[n]$.
**Solution**: Using the discrete convolution formula (8.2), compute point by point:
$n = 0$:
$$y[0] = \sum_{k} x[k]h[0-k] = x[0]h[0] = 1 \times 0.5 = 0.5$$
$n = 1$:
$$y[1] = x[0]h[1] + x[1]h[0] = 1 \times 1 + 2 \times 0.5 = 2$$
$n = 2$:
$$y[2] = x[0]h[2] + x[1]h[1] + x[2]h[0] = 1 \times 0.5 + 2 \times 1 + 3 \times 0.5 = 4$$
$n = 3$:
$$y[3] = x[1]h[2] + x[2]h[1] = 2 \times 0.5 + 3 \times 1 = 4$$
$n = 4$:
$$y[4] = x[2]h[2] = 3 \times 0.5 = 1.5$$
Therefore $y[n] = [0.5, 2, 4, 4, 1.5]$. The convolution values are largest at $n = 2, 3$ (equal to 4), where the overlap between the input sequence $[1, 2, 3]$ and the flipped kernel $[0.5, 1, 0.5]$ is greatest, and the inner product reaches its peak.
```
```ad-example
title: Example 8.2 Sobel Edge Detection — 2D Convolution as an Inner Product Template
The Sobel operator consists of two $3 \times 3$ convolution kernels for detecting edges in the horizontal and vertical directions, respectively:
$$S_x = \begin{bmatrix} 1 & 0 & -1 \\ 2 & 0 & -2 \\ 1 & 0 & -1 \end{bmatrix}, \quad S_y = \begin{bmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix}$$
Given a $3 \times 3$ local image patch (grayscale values):
$$I = \begin{bmatrix} 10 & 20 & 30 \\ 10 & 20 & 30 \\ 10 & 20 & 30 \end{bmatrix}$$
This image patch exhibits a horizontal brightness gradient (brightening from left to right) and uniform brightness in the vertical direction.
**Solution**: Compute the 2D inner product of the Sobel X operator with the image patch:
$$G_x = \sum_{i=1}^{3} \sum_{j=1}^{3} S_x(i,j) \cdot I(i,j)$$
$$= (1 \times 10) + (0 \times 20) + (-1 \times 30) + (2 \times 10) + (0 \times 20) + (-2 \times 30) + (1 \times 10) + (0 \times 20) + (-1 \times 30)$$
$$= 10 + 0 - 30 + 20 + 0 - 60 + 10 + 0 - 30 = -80$$
Compute the 2D inner product of the Sobel Y operator:
$$G_y = (1 \times 10) + (2 \times 20) + (1 \times 30) + (0 \times 10) + (0 \times 20) + (0 \times 30) + (-1 \times 10) + (-2 \times 20) + (-1 \times 30)$$
$$= 10 + 40 + 30 + 0 + 0 + 0 - 10 - 40 - 30 = 0$$
Edge strength is:
$$\|\nabla I\| = \sqrt{G_x^2 + G_y^2} = \sqrt{(-80)^2 + 0^2} = 80$$
**Analysis**: $|G_x| = 80$ is large, indicating a significant brightness change in the horizontal direction (vertical edge); $G_y = 0$ indicates uniform brightness in the vertical direction. The essence of Sobel edge detection is to slide two orthogonal convolution kernels (inner product templates) over the image, computing the 2D inner product between each pixel neighborhood and the template. Positions with large inner product magnitudes are where edges are located.
```
### 8.4 Engineering and Cutting-Edge Applications

> **Figure 4: Sliding Inner Product and Matched Filter.** The blue curve shows a noisy random sequence $x[n]$, and the red curve shows the convolution response. The template pulse $h[n] = [0, 0.35, 1.0, 0.35, 0]$ slides along the time axis, computing $\sum x[k]h[n-k]$ at each position. The orange markers (at $n \approx 110, 265, 340$) indicate positions where the convolution value reaches a peak, meaning the local signal waveform best matches the template. The core principle of modern radar signal capture originates from this sliding projection mechanism.

> **Figure 5: 2D Convolution for Edge Feature Extraction (Sobel Edge Detection).** The Sobel operator is a pair of orthogonal $3 \times 3$ differential templates that detect brightness gradients along the $x$ and $y$ directions, respectively. When the template slides over a grayscale image, positive and negative projections cancel out in flat regions (inner product near zero), while pixel steps at edges cause the inner product magnitude to increase significantly. By combining the two orthogonal components via $\|\nabla I\| = \sqrt{G_x^2 + G_y^2}$, edge information in the physical world can be extracted. This is the foundational layer of feature extraction in computer vision.
---
## Chapter 9 Discrete Cosine Transform and JPEG Compression
### 9.1 Theory and Rigorous Definitions
The Discrete Cosine Transform (DCT) is the core algorithm of the JPEG image compression standard$^{[18][19]}$. From the inner product perspective, DCT performs orthogonal projection of image blocks onto a set of discrete cosine basis functions, transforming spatial-domain pixel values into frequency-domain coefficients.
```ad-definition
title: Definition 9.1 2D DCT
Let $f(x, y)$ be an $N \times N$ image block ($x, y = 0, 1, \dots, N-1$). Its 2D DCT is defined as:
$$F(u, v) = \frac{2}{N} C(u) C(v) \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} f(x, y) \cos\left[\frac{(2x+1)u\pi}{2N}\right] \cos\left[\frac{(2y+1)v\pi}{2N}\right] \tag{9.1}$$
where $u, v = 0, 1, \dots, N-1$ are frequency indices, and the normalization coefficients are:
$$C(k) = \begin{cases} 1/\sqrt{2}, & k = 0 \\ 1, & k \neq 0 \end{cases}$$
```
```ad-theorem
title: Proposition 9.1 DCT as Orthogonal Projection
Define $N \times N$ DCT basis functions:
$$B_{u,v}(x, y) = \frac{2}{N} C(u) C(v) \cos\left[\frac{(2x+1)u\pi}{2N}\right] \cos\left[\frac{(2y+1)v\pi}{2N}\right]$$
Then $\{B_{u,v}\}$ forms a complete orthonormal basis for $\mathbb{R}^{N \times N}$, satisfying:
$$\langle B_{u,v}, B_{u',v'} \rangle = \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} B_{u,v}(x, y) B_{u',v'}(x, y) = \delta_{u,u'} \delta_{v,v'}$$
The DCT coefficient $F(u, v)$ is precisely the projection of the image block $f$ onto the basis function $B_{u,v}$:
$$F(u, v) = \langle f, B_{u,v} \rangle = \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} f(x, y) B_{u,v}(x, y) \tag{9.2}$$
```
```ad-theorem
title: Proposition 9.2 Energy Concentration
For natural images, the energy of DCT coefficients is concentrated primarily in the low-frequency region (small $u, v$), while high-frequency coefficients (large $u, v$) have magnitudes close to zero. JPEG compression exploits this property by quantizing and discarding small high-frequency coefficients, achieving significant compression while maintaining visual quality.
```
### 9.2 Geometric and Spatial Intuition
An $8 \times 8$ image block can be viewed as a vector in a 64-dimensional space. The DCT basis functions form a complete orthonormal basis for this 64-dimensional space:
- **$B_{0,0}$ (DC basis)**: a constant function, corresponding to the average brightness of the image block;
- **Low-frequency bases** (small $u, v$): smooth gradient patterns, corresponding to large-scale structures in the image;
- **High-frequency bases** (large $u, v$): dense oscillatory patterns, corresponding to fine textures and noise.
Projecting the image block vector onto these 64 basis directions yields 64 DCT coefficients. For natural images, the projection energy is highly concentrated in the low-frequency coefficients (upper-left corner), while high-frequency coefficients (lower-right corner) are near zero. JPEG compression zeros out tiny high-frequency coefficients through quantization, allowing the original image block to be approximately reconstructed using only a few low-frequency coefficients.
### 9.3 Worked Examples
```ad-example
title: Example 9.1 Manual Computation of DCT Projection Coefficients for a $2 \times 2$ Image Block
To illustrate the projection nature of DCT, consider a miniature $N = 2$ image block. The $2 \times 2$ DCT basis matrix is:
$$T = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}$$
$T$ is an orthogonal matrix satisfying $T^T T = I$. Given a grayscale image block:
$$I = \begin{bmatrix} 100 & 80 \\ 60 & 40 \end{bmatrix}$$
The 2D DCT can be implemented via matrix multiplication: $F = T \cdot I \cdot T^T$.
**Solution**:
**Step 1**: Compute $T \cdot I$.
$$T \cdot I = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \begin{bmatrix} 100 & 80 \\ 60 & 40 \end{bmatrix} = \frac{1}{\sqrt{2}} \begin{bmatrix} 160 & 120 \\ 40 & 40 \end{bmatrix}$$
**Step 2**: Compute $(T \cdot I) \cdot T^T$.
$$F = \frac{1}{\sqrt{2}} \begin{bmatrix} 160 & 120 \\ 40 & 40 \end{bmatrix} \cdot \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} = \frac{1}{2} \begin{bmatrix} 160 & 120 \\ 40 & 40 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}$$
$$= \frac{1}{2} \begin{bmatrix} 280 & 40 \\ 80 & 0 \end{bmatrix} = \begin{bmatrix} 140 & 20 \\ 40 & 0 \end{bmatrix}$$
**Step 3**: Interpret the DCT coefficients.
- $F(0,0) = 140$: DC coefficient, corresponding to the average brightness. $(100+80+60+40)/4 = 70$, multiplied by $N = 2$ gives 140.
- $F(0,1) = 20$: horizontal high-frequency component, reflecting left-right pixel differences.
- $F(1,0) = 40$: vertical high-frequency component, reflecting top-bottom pixel differences.
- $F(1,1) = 0$: diagonal high-frequency component, zero indicates no diagonal texture.
**Key Observation**: $F(1,1) = 0$, meaning the projection onto the diagonal high-frequency basis is zero—this component can be completely discarded without losing information. This is the core principle of JPEG compression: most high-frequency DCT coefficients of natural images are near zero, and after quantization they become zero, enabling substantial compression.
```
### 9.4 Engineering and Cutting-Edge Applications
The JPEG compression pipeline is as follows:
1. **Block splitting**: Divide the image into $8 \times 8$ blocks;
2. **DCT transform**: Apply 2D DCT to each block, obtaining 64 frequency-domain coefficients;
3. **Quantization**: Divide the DCT coefficients by a quantization matrix (with larger quantization steps for high frequencies), zeroing out small coefficients;
4. **Entropy coding**: Apply Huffman or arithmetic coding to the quantized coefficients.
At the decoder side, the image block is reconstructed via the inverse DCT transform. By discarding high-frequency components to which the human eye is insensitive, JPEG can compress images to $1/10$ of their original size or smaller while maintaining visual quality.
DCT is also widely used in video compression (MPEG, H.264/AVC, HEVC), audio compression (MDCT variant in MP3), and decorrelation and feature extraction in signal processing.
---
## Chapter 10 Wavelet Transform — Multi-Resolution Inner Products
### 10.1 Theory and Rigorous Definitions
The Fourier transform projects signals onto infinitely extended sinusoidal bases, obtaining global frequency information but losing time localization—it is impossible to determine from the spectrum when a particular frequency component appears. For non-stationary signals such as music, seismic waves, and ECG signals, this "time blindness" is a fundamental limitation.
```ad-definition
title: Definition 10.1 Short-Time Fourier Transform
To compensate for the lack of time localization, the Short-Time Fourier Transform (STFT) introduces a window function $w(t)$:
$$\text{STFT}\{f(t)\}(\tau, \omega) = \int_{-\infty}^{\infty} f(t) w(t - \tau) e^{-j\omega t}\,dt$$
However, once the window length of STFT is fixed, the time resolution $\Delta t$ and frequency resolution $\Delta f$ are constrained by the Heisenberg uncertainty principle$^{[16]}$:
$$\Delta t \cdot \Delta f \geq \frac{1}{4\pi} \tag{10.1}$$
```
```ad-definition
title: Definition 10.2 Wavelet Transform
The wavelet transform employs a family of scalable and translatable basis functions $\psi_{a,b}(t)$, fundamentally resolving the time-frequency resolution conflict$^{[17]}$. Let $\psi(t)$ be the mother wavelet, satisfying $\int \psi(t)\,dt = 0$ (zero-mean condition). The wavelet basis function family is defined as:
$$\psi_{a,b}(t) = \frac{1}{\sqrt{|a|}} \psi\left(\frac{t - b}{a}\right), \quad a \neq 0, \; b \in \mathbb{R} \tag{10.2}$$
where $a$ is the scale parameter (controlling dilation, corresponding to frequency) and $b$ is the translation parameter (controlling position, corresponding to time). Wavelet basis functions have the property of **compact support**—they are non-zero only over a finite interval—thus naturally possessing time localization capability.
```
```ad-definition
title: Definition 10.3 Continuous Wavelet Transform
The Continuous Wavelet Transform (CWT) of a signal $f(t)$ is defined as the inner product of $f$ with the wavelet basis functions:
$$W_f(a, b) = \langle f, \psi_{a,b} \rangle = \int_{-\infty}^{\infty} f(t) \cdot \frac{1}{\sqrt{|a|}} \psi^*\left(\frac{t - b}{a}\right) dt \tag{10.3}$$
```
```ad-theorem
title: Proposition 10.1 Multi-Resolution Analysis
The time-frequency resolution of the wavelet transform adapts with the scale $a$:
- **Small scale $a$** (high frequency): the wavelet is compressed, offering high time resolution and low frequency resolution, suitable for analyzing transient signals;
- **Large scale $a$** (low frequency): the wavelet is stretched, offering high frequency resolution and low time resolution, suitable for analyzing long-term trends.
This **Multi-Resolution Analysis (MRA)** property is the core advantage of the wavelet transform over the Fourier transform and STFT.
```
### 10.2 Geometric and Spatial Intuition
The geometric process of the wavelet transform can be understood as using a set of differently sized "probes" sliding along the time axis:
- **Large probe (large scale $a$)**: covers a wide time range, sensing the long-term trend of the signal (low frequency), but unable to precisely locate the moment of change;
- **Small probe (small scale $a$)**: covers a narrow time range, precisely locating signal突变 points (high frequency), but unable to see the overall trend.
At each position $b$, the inner product $W_f(a, b)$ between the signal $f(t)$ and the probe $\psi_{a,b}(t)$ is computed. The result forms a **scalogram**, with time $b$ on the horizontal axis, scale $a$ (or equivalent frequency) on the vertical axis, and color intensity representing the inner product strength.
Comparison with the Fourier transform: the Fourier transform uses infinitely long sine waves to "match" the entire signal, producing a global spectrum; the wavelet transform uses finite-length wavelets to "scan" the signal, recording the local match at each position while preserving both time and frequency information.
### 10.3 Worked Examples
```ad-example
title: Example 10.1 Haar Wavelet Decomposition — Manual Computation of First and Second Level Wavelet Transforms
The Haar wavelet is the simplest orthogonal wavelet. Its scaling function $\phi(t)$ and wavelet function $\psi(t)$ are defined as:
$$\phi(t) = \begin{cases} 1, & 0 \leq t < 1 \\ 0, & \text{otherwise} \end{cases}, \quad \psi(t) = \begin{cases} 1, & 0 \leq t < 0.5 \\ -1, & 0.5 \leq t < 1 \\ 0, & \text{otherwise} \end{cases}$$ Given a discrete signal of length 8: $$x = [4, 6, 10, 12, 8, 6, 5, 5]$$ Manually perform Haar wavelet decomposition. **Solution**: **Step 1: First-level decomposition — compute approximation coefficients.** Approximation coefficients are obtained via the inner product with the scaling function, i.e., the average of adjacent pairs: $$a_1 = \frac{4+6}{2} = 5, \quad a_2 = \frac{10+12}{2} = 11, \quad a_3 = \frac{8+6}{2} = 7, \quad a_4 = \frac{5+5}{2} = 5$$ Approximation coefficient vector: $A^{(1)} = [5, 11, 7, 5]$ **Step 2: First-level decomposition — compute detail coefficients.** Detail coefficients are obtained via the inner product with the wavelet function, i.e., half the difference of adjacent pairs: $$d_1 = \frac{4-6}{2} = -1, \quad d_2 = \frac{10-12}{2} = -1, \quad d_3 = \frac{8-6}{2} = 1, \quad d_4 = \frac{5-5}{2} = 0$$ Detail coefficient vector: $D^{(1)} = [-1, -1, 1, 0]$ **Step 3: Verify reconstruction.** The original signal can be perfectly recovered from $A^{(1)}$ and $D^{(1)}$: $$x_1 = a_1 + d_1 = 5 + (-1) = 4, \quad x_2 = a_1 - d_1 = 5 - (-1) = 6$$ $$x_3 = a_2 + d_2 = 11 + (-1) = 10, \quad x_4 = a_2 - d_2 = 11 - (-1) = 12$$ $$x_5 = a_3 + d_3 = 7 + 1 = 8, \quad x_6 = a_3 - d_3 = 7 - 1 = 6$$ $$x_7 = a_4 + d_4 = 5 + 0 = 5, \quad x_8 = a_4 - d_4 = 5 - 0 = 5$$ Reconstruction is perfectly correct. **Step 4: Second-level decomposition.** Apply the Haar wavelet transform to the approximation coefficients $A^{(1)} = [5, 11, 7, 5]$: $$a_1^{(2)} = \frac{5+11}{2} = 8, \quad a_2^{(2)} = \frac{7+5}{2} = 6$$ $$d_1^{(2)} = \frac{5-11}{2} = -3, \quad d_2^{(2)} = \frac{7-5}{2} = 1$$ Second-level approximation: $A^{(2)} = [8, 6]$, second-level detail: $D^{(2)} = [-3, 1]$ **Key Observation**: The original signal requires 8 numerical values for storage. After first-level decomposition, $A^{(1)}$ (4 values) + $D^{(1)}$ (4 values) = 8 values, no compression. However, if detail coefficients with small absolute values (such as $d_4 = 0$) are zeroed out, only 7 effective values need to be stored—this is the principle of wavelet compression. JPEG2000, based on the wavelet transform (CDF 9/7 wavelet), achieves better compression performance than JPEG (DCT) with no blocking artifacts. ``` ### 10.4 Engineering and Cutting-Edge Applications Wavelet analysis has a wide range of applications in signal processing: - **JPEG2000 image compression**: uses CDF 9/7 wavelet for multi-level decomposition, achieving higher compression ratios than JPEG's DCT method with no blocking artifacts; - **ECG analysis**: wavelet transform can precisely locate QRS complexes for arrhythmia detection; - **Seismic signal processing**: wavelet time-frequency spectra can simultaneously reveal seismic wave arrival times and frequency components; - **Wavelet networks in deep learning**: using wavelet transforms as front-end feature extraction layers in neural networks for processing non-stationary signals. --- ## Chapter 11 Self-Attention Mechanism — AI's Inner Product Engine ### 11.1 Theory and Rigorous Definitions Modern artificial intelligence, especially large language models (LLMs) such as GPT and BERT, relies almost entirely on inner products (dot products) for its underlying computation. The core of the Transformer architecture—the **self-attention mechanism**—is essentially a large-scale, parallel, learnable set of vector inner product operations$^{[18]}$. ```ad-definition title: Definition 11.1 Scaled Dot-Product Attention Given an input sequence, each token is linearly projected into three vectors: the query vector $Q$, the key vector $K$, and the value vector $V$. The self-attention output is defined as: $$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V \tag{11.1}$$ where $Q \in \mathbb{R}^{n \times d_k}$, $K \in \mathbb{R}^{n \times d_k}$, $V \in \mathbb{R}^{n \times d_v}$, $n$ is the sequence length, and $d_k$ is the query/key dimension. ``` ```ad-theorem title: Proposition 11.1 Attention Weights as Normalized Inner Products The $(i, j)$-th element of the matrix $QK^T$ is precisely the inner product between the $i$-th query vector and the $j$-th key vector: $$(QK^T)_{ij} = \langle Q_i, K_j \rangle = Q_i \cdot K_j = \sum_{k=1}^{d_k} Q_{i,k} \cdot K_{j,k} \tag{11.2}$$ The larger this inner product, the higher the relevance between the $i$-th token and the $j$-th token. The scaling factor $1/\sqrt{d_k}$ prevents the inner product values from growing too large with dimensionality, which would cause the softmax gradient to vanish. After softmax normalization, the inner product values are transformed into probability weights, used to compute a weighted sum of the value vectors $V$. **Multi-head attention** executes the above process $h$ times in parallel ($h$ is the number of attention heads), with each head learning a different projection subspace: $$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^O \tag{11.3}$$ where $\text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$. ``` ### 11.2 Geometric and Spatial Intuition The self-attention mechanism performs an elegant "projection-retrieval" operation in high-dimensional space: 1. **Query vector $Q_i$**: encodes the query intent "who is relevant to me?"; 2. **Key vector $K_j$**: encodes the identifying information "who am I, what features do I have?"; 3. **Inner product $\langle Q_i, K_j \rangle$**: measures the similarity between the query and the key in high-dimensional space (a scaled version of the cosine of the angle between vectors); 4. **Softmax normalization**: converts similarities into a probability distribution, allowing the model to focus on the most relevant tokens; 5. **Weighted sum**: extracts contextual information from the value vectors according to the attention weights. The entire Transformer model can be viewed as a giant **differentiable inner product engine**: each layer performs inner product operations, and through backpropagation, the projection matrices for $Q$, $K$, and $V$ are continuously adjusted so that the inner product results can accurately capture long-range dependencies in the data. ### 11.3 Worked Examples ```ad-example title: Example 11.1 Manual Computation of Self-Attention for 2 Tokens Consider a minimal sequence with only two tokens: "I" and "love." After embedding and linear projection (let $d_k = 3$): $$Q = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \end{bmatrix}, \quad K = \begin{bmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \end{bmatrix}, \quad V = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$ The first row corresponds to "I," the second row to "love." **Solution**: **Step 1: Compute $QK^T$ (all inner product pairs).** $$QK^T = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 0 & 1 \end{bmatrix}$$ Element-wise computation: - $(QK^T)_{11} = \langle Q_1, K_1 \rangle = 1 \times 1 + 0 \times 1 + 1 \times 0 = 1$ - $(QK^T)_{12} = \langle Q_1, K_2 \rangle = 1 \times 0 + 0 \times 1 + 1 \times 1 = 1$ - $(QK^T)_{21} = \langle Q_2, K_1 \rangle = 0 \times 1 + 1 \times 1 + 1 \times 0 = 1$ - $(QK^T)_{22} = \langle Q_2, K_2 \rangle = 0 \times 0 + 1 \times 1 + 1 \times 1 = 2$ $$QK^T = \begin{bmatrix} 1 & 1 \\ 1 & 2 \end{bmatrix}$$ **Step 2: Scale (divide by $\sqrt{d_k} = \sqrt{3} \approx 1.732$).** $$\frac{QK^T}{\sqrt{3}} = \begin{bmatrix} 0.577 & 0.577 \\ 0.577 & 1.155 \end{bmatrix}$$ **Step 3: Softmax normalization (row-wise).** First row $[0.577, 0.577]$: $$e^{0.577} \approx 1.781, \quad \text{sum} = 3.562$$ $$\text{softmax}_{11} = \frac{1.781}{3.562} = 0.5, \quad \text{softmax}_{12} = \frac{1.781}{3.562} = 0.5$$ Second row $[0.577, 1.155]$: $$e^{0.577} \approx 1.781, \quad e^{1.155} \approx 3.174, \quad \text{sum} = 4.955$$ $$\text{softmax}_{21} = \frac{1.781}{4.955} = 0.359, \quad \text{softmax}_{22} = \frac{3.174}{4.955} = 0.641$$ Attention weight matrix: $$\text{Weights} = \begin{bmatrix} 0.5 & 0.5 \\ 0.359 & 0.641 \end{bmatrix}$$ **Step 4: Weighted sum to obtain output.** $$\text{Output} = \text{Weights} \cdot V = \begin{bmatrix} 0.5 & 0.5 \\ 0.359 & 0.641 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$ - New representation of "I": $0.5 \times [1, 0] + 0.5 \times [0, 1] = [0.5, 0.5]$ - New representation of "love": $0.359 \times [1, 0] + 0.641 \times [0, 1] = [0.359, 0.641]$ **Key Observations**: - "I" distributes attention evenly across both tokens (0.5 each), because its inner products with both are equal; - "love" attends more to itself (0.641) than to "I" (0.359), because its inner product with itself (2) is larger than with "I" (1); - The output vectors are weighted combinations of the value vectors, with weights entirely determined by inner products—this is the core mechanism of "context-aware representation through inner products." ``` ### 11.4 Engineering and Cutting-Edge Applications The computational cost of the self-attention mechanism grows as $O(n^2)$ with sequence length $n$. For large models such as GPT-4 (with context lengths up to 128K), a single forward pass requires tens of trillions of inner product operations. To accelerate computation, the industry has developed various optimization techniques: - **Flash Attention**: reduces memory reads/writes through block-wise computation and memory optimization, accelerating attention computation by 2–4$\times$; - **Sparse Attention**: computes inner products only for a subset of token pairs (e.g., local window + global tokens), reducing complexity to $O(n \log n)$; - **Multi-Query Attention (MQA)**: multiple query heads share the same set of key-value pairs, reducing KV cache size; - **Linear Attention**: uses kernel methods to approximate softmax attention, reducing complexity to $O(n)$. These optimizations essentially seek the optimal balance between "reducing the number of inner product computations" and "preserving model expressiveness." --- ## Chapter 12 Kernel Methods — Implicit High-Dimensional Inner Products ### 12.1 Theory and Rigorous Definitions In low-dimensional spaces, data is often linearly inseparable—for example, concentric circle data in a 2D plane cannot be separated by a straight line. Traditional approaches involve manually constructing high-dimensional features (such as $x_1^2 + x_2^2$), but feature engineering is extremely costly. The core idea of **kernel methods** is: instead of explicitly computing coordinates in the high-dimensional space, directly compute inner products in that space$^{[22]}$. This technique is called the **kernel trick**. ```ad-definition title: Definition 12.1 Kernel Function Let $\phi: \mathcal{X} \to \mathcal{H}$ be a nonlinear mapping from the input space to a high-dimensional (possibly infinite-dimensional) Hilbert space. The kernel function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is defined as: $$k(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal{H}} \tag{12.1}$$ The elegance of the kernel function is that we do not need to know the explicit form of $\phi$; as long as $k(x, y)$ satisfies the **Mercer condition** (symmetric and positive semi-definite), it corresponds to an inner product in some reproducing kernel Hilbert space (RKHS). ``` ```ad-definition title: Definition 12.2 Common Kernel Functions Commonly used kernel functions include: - **Linear kernel**: $k(x, y) = x^T y$ (the inner product in the original space); - **Polynomial kernel**: $k(x, y) = (x^T y + c)^d$ (corresponding to a $d$-th order polynomial feature space); - **Gaussian RBF kernel**: $k(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)$ (corresponding to an infinite-dimensional feature space); - **Sigmoid kernel**: $k(x, y) = \tanh(\alpha x^T y + c)$. ``` ```ad-definition title: Definition 12.3 Support Vector Machine The Support Vector Machine (SVM) is the most classic application of kernel methods$^{[23]}$. SVM finds the maximum-margin hyperplane in the feature space, and its decision function depends only on the inner products between support vectors and the test sample: $$f(x) = \text{sign}\left(\sum_{i=1}^{m} \alpha_i y_i \langle \phi(x_i), \phi(x) \rangle + b\right) = \text{sign}\left(\sum_{i=1}^{m} \alpha_i y_i k(x_i, x) + b\right) \tag{12.2}$$ where $x_i$ are the support vectors, $y_i \in \{-1, +1\}$ are the labels, and $\alpha_i$ are the dual variables. ``` ### 12.2 Geometric and Spatial Intuition The geometric intuition of the kernel trick can be understood as "folding-unfolding": 1. **Input space**: data points are chaotically distributed in the low-dimensional space, where linear classifiers are powerless; 2. **Implicit mapping $\phi$**: "unfolds" the data points into a high-dimensional Hilbert space, where the originally entangled data points are "straightened out"; 3. **Inner product in high-dimensional space**: SVM finds the maximum-margin hyperplane in the high-dimensional space—equivalent to a nonlinear decision boundary in the input space; 4. **Kernel function $k(x, y)$**: directly returns the inner product value in the high-dimensional space, as if the data had been mapped to the high-dimensional space, but with the same computational cost as the low-dimensional space. **Key Insight**: The Taylor expansion of the RBF kernel $\exp(-\gamma\|x - y\|^2)$ contains polynomial features of all orders, so RBF kernel SVM can theoretically approximate arbitrarily complex decision boundaries. ### 12.3 Worked Examples ```ad-example title: Example 12.1 Kernel Trick for the 2D XOR Problem — Manual Derivation XOR dataset: $x_1 = (-1, -1)$ label $-1$, $x_2 = (1, 1)$ label $-1$, $x_3 = (-1, 1)$ label $+1$, $x_4 = (1, -1)$ label $+1$. In 2D space, XOR data is linearly inseparable. **Solution**: **Step 1: Choose a kernel function and find the implicit mapping.** Take the polynomial kernel $k(x, y) = (x^T y)^2$. Expand: $$(x^T y)^2 = (x_1 y_1 + x_2 y_2)^2 = x_1^2 y_1^2 + 2x_1 x_2 y_1 y_2 + x_2^2 y_2^2$$ $$= \langle (x_1^2, \sqrt{2}x_1 x_2, x_2^2), (y_1^2, \sqrt{2}y_1 y_2, y_2^2) \rangle$$ Therefore the implicit mapping is $\phi(x) = (x_1^2, \sqrt{2}x_1 x_2, x_2^2)$, mapping 2D data to 3D space. **Step 2: Compute the coordinates of data points in 3D space.** $$\phi(x_1) = \phi(-1, -1) = (1, \sqrt{2}, 1), \quad \phi(x_2) = \phi(1, 1) = (1, \sqrt{2}, 1)$$ $$\phi(x_3) = \phi(-1, 1) = (1, -\sqrt{2}, 1), \quad \phi(x_4) = \phi(1, -1) = (1, -\sqrt{2}, 1)$$ **Step 3: Verify linear separability.** In 3D space, $x_1, x_2$ (label $-1$) are both at $(1, \sqrt{2}, 1)$, and $x_3, x_4$ (label $+1$) are both at $(1, -\sqrt{2}, 1)$. The two classes can be perfectly separated by the plane $z_2 = 0$ (i.e., $\sqrt{2}x_1 x_2 = 0$)! **Step 4: Verify the kernel trick.** Compute $k(x_1, x_3) = (x_1^T x_3)^2$: $$x_1^T x_3 = (-1)(-1) + (-1)(1) = 0, \quad k(x_1, x_3) = 0^2 = 0$$ In 3D space: $\langle \phi(x_1), \phi(x_3) \rangle = 1 \times 1 + \sqrt{2} \times (-\sqrt{2}) + 1 \times 1 = 0$$ Both are equal, verifying the correctness of the kernel trick. **Step 5: SVM decision.** In 3D space, the maximum-margin hyperplane is $z_2 = 0$, with normal vector $w = (0, 1, 0)$ and bias $b = 0$. The support vectors are all four points, with $\alpha_i = 1$. For a test point $x = (0.5, -0.5)$: $$k(x_1, x) = ((-1)(0.5) + (-1)(-0.5))^2 = 0, \quad k(x_2, x) = ((1)(0.5) + (1)(-0.5))^2 = 0$$ $$k(x_3, x) = ((-1)(0.5) + (1)(-0.5))^2 = 1, \quad k(x_4, x) = ((1)(0.5) + (-1)(-0.5))^2 = 1$$ $$f(x) = \text{sign}(-0 - 0 + 1 + 1) = \text{sign}(2) = +1$$ Prediction is $+1$, correct. **Key Observation**: We never explicitly computed $\phi(x)$; instead, we obtained the inner product in the high-dimensional space directly through the kernel function $k(x, y) = (x^T y)^2$—achieving high-dimensional classification capability with low-dimensional computational cost. ``` ### 12.4 Engineering and Cutting-Edge Applications The applications of kernel methods extend far beyond SVM: - **Kernel PCA**: performs PCA in the kernel-mapped high-dimensional space for nonlinear dimensionality reduction; - **Kernel Ridge Regression**: extends linear ridge regression to nonlinear regression; - **Kernel Mean Matching**: used for domain adaptation and transfer learning; - **Gaussian Processes**: uses the kernel function as the covariance function for Bayesian optimization and regression; - **Neural Tangent Kernel (NTK)**: connects infinitely wide neural networks with kernel methods, providing theoretical analysis tools for deep learning. --- ## Chapter 13 Inner Products in Quantum Mechanics — Probability as Projection ### 13.1 Theory and Rigorous Definitions Quantum mechanics pushes the concept of inner products to the ultimate level of the physical world. In quantum mechanics, the state of a system is described by a **state vector** $|\psi\rangle$ in a Hilbert space $\mathcal{H}$ (Dirac notation)$^{[26]}$. The Hilbert space here is typically an infinite-dimensional complex inner product space. ```ad-definition title: Definition 13.1 State Vector and Inner Product The state vector $|\psi\rangle \in \mathcal{H}$ contains all information about the quantum system. The inner product of two states $\langle \phi | \psi \rangle$ is a complex number whose squared modulus gives the measurement probability. **Axiom 13.1 (Born Rule)** When the system is in state $|\psi\rangle$, the probability of measuring observable $\hat{A}$ and obtaining eigenvalue $\lambda_n$ is$^{[21]}$: $$P(\lambda_n) = |\langle a_n | \psi \rangle|^2 \tag{13.1}$$ where $|a_n\rangle$ is the eigenstate of $\hat{A}$ corresponding to $\lambda_n$. After measurement, the system state collapses to $|a_n\rangle$. The essence of the Born rule is: **probability equals the squared modulus of the projection of the state vector onto the measurement basis**. ``` ```ad-definition title: Definition 13.2 Observables and Self-Adjoint Operators Observables correspond to self-adjoint (Hermitian) operators $\hat{A}$ on the Hilbert space, satisfying $\hat{A}^\dagger = \hat{A}$. The eigenvalues of self-adjoint operators are real, and their eigenstates form a complete orthonormal basis. ``` ```ad-definition title: Definition 13.3 Schrödinger Equation The time evolution of the state vector is described by the Schrödinger equation: $$i\hbar \frac{d}{dt} |\psi(t)\rangle = \hat{H} |\psi(t)\rangle \tag{13.2}$$ where $\hat{H}$ is the Hamiltonian operator (energy operator). This equation is essentially a unitary evolution equation in an infinite-dimensional Hilbert space—an inner-product-preserving rotation. ``` ### 13.2 Geometric and Spatial Intuition The geometric picture of quantum mechanics has profound connections with classical inner product spaces: 1. **State vectors are unit vectors**: physically, $|\psi\rangle$ must be normalized, i.e., $\langle \psi | \psi \rangle = 1$. All possible state vectors form the unit sphere in the complex Hilbert space. 2. **Measurement is orthogonal projection**: the measurement operation projects the state vector $|\psi\rangle$ onto the eigensubspace. The projection length $|\langle a_n | \psi \rangle|$ determines the probability amplitude, whose square is the measurement probability. 3. **Orthogonal states are mutually exclusive**: if $\langle \phi | \psi \rangle = 0$, the two states are orthogonal (mutually exclusive)—when the system is in $|\psi\rangle$, the probability of measuring $|\phi\rangle$ is zero. 4. **Entangled states are inseparable**: for a composite system, if $|\psi\rangle_{AB} \neq |\phi\rangle_A \otimes |\chi\rangle_B$, the two subsystems are in an entangled state. The mathematical essence of entanglement is that the inner product structure of the two subsystems cannot be factored into a direct product form. ### 13.3 Worked Examples ```ad-example title: Example 13.1 Measurement Probability for a Spin-$1/2$ System — Inner Product Computation Consider an electron spin, whose state can be represented as a vector in a 2D complex Hilbert space. Spin $z$-direction eigenstates: $$| \uparrow_z \rangle = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \quad | \downarrow_z \rangle = \begin{pmatrix} 0 \\ 1 \end{pmatrix}$$ Spin $x$-direction eigenstates: $$| \uparrow_x \rangle = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ 1 \end{pmatrix}, \quad | \downarrow_x \rangle = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ -1 \end{pmatrix}$$ The electron is in state $|\psi\rangle = \frac{1}{\sqrt{2}}| \uparrow_z \rangle + \frac{1}{\sqrt{2}}| \downarrow_z \rangle = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ 1 \end{pmatrix}$. **Solution**: **Step 1: Verify normalization.** $$\langle \psi | \psi \rangle = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \end{pmatrix} \cdot \frac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ 1 \end{pmatrix} = \frac{1}{2}(1 + 1) = 1$$ Normalization holds. **Step 2: Probability of measuring $S_z$.** $$P(\uparrow_z) = |\langle \uparrow_z | \psi \rangle|^2 = \left| \begin{pmatrix} 1 & 0 \end{pmatrix} \frac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ 1 \end{pmatrix} \right|^2 = \left| \frac{1}{\sqrt{2}} \right|^2 = \frac{1}{2}$$ $$P(\downarrow_z) = |\langle \downarrow_z | \psi \rangle|^2 = \left| \begin{pmatrix} 0 & 1 \end{pmatrix} \frac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ 1 \end{pmatrix} \right|^2 = \left| \frac{1}{\sqrt{2}} \right|^2 = \frac{1}{2}$$ Each 50%, as expected. **Step 3: Probability of measuring $S_x$.** $$P(\uparrow_x) = |\langle \uparrow_x | \psi \rangle|^2 = \left| \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \end{pmatrix} \cdot \frac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ 1 \end{pmatrix} \right|^2 = \left| \frac{1}{2}(1 + 1) \right|^2 = 1$$ $$P(\downarrow_x) = |\langle \downarrow_x | \psi \rangle|^2 = \left| \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & -1 \end{pmatrix} \cdot \frac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ 1 \end{pmatrix} \right|^2 = \left| \frac{1}{2}(1 - 1) \right|^2 = 0$$ **Key Observation**: $|\psi\rangle = | \uparrow_x \rangle$, so measuring $S_x$ yields $+\hbar/2$ with 100% probability. This verifies the geometric meaning of inner products: when the state vector is perfectly aligned (inner product modulus 1), the probability is 100%; when orthogonal (inner product 0), the probability is 0. **Step 4: State collapse after measurement.** Suppose measuring $S_z$ yields $+\hbar/2$, the state vector collapses: $$|\psi\rangle = \frac{1}{\sqrt{2}}| \uparrow_z \rangle + \frac{1}{\sqrt{2}}| \downarrow_z \rangle \xrightarrow{\text{measure } S_z = +\hbar/2} |\psi'\rangle = | \uparrow_z \rangle$$ At this point, measuring $S_z$ again will yield $+\hbar/2$ with 100% probability, but measuring $S_x$ will return to 50/50 probability. This is the essence of "measurement changes the state"—an orthogonal projection operation. ``` ### 13.4 Engineering and Cutting-Edge Applications The concept of quantum inner products is driving revolutionary technologies: - **Quantum computing**: quantum gate operations are essentially unitary transformations (inner-product-preserving rotations) in Hilbert space. Shor's algorithm and Grover's algorithm leverage superposition and interference (inner product phase) of quantum states to achieve exponential speedup; - **Quantum cryptography**: the BB84 protocol uses the orthogonality of measurement bases to detect eavesdropping—an eavesdropper's measurement collapses the state vector, altering the inner product result, which is detected by the legitimate communicating parties; - **Quantum teleportation**: uses the inner product structure of Bell states (maximally entangled states) to achieve remote transmission of quantum information; - **Quantum machine learning**: quantum kernel methods leverage quantum state inner products to efficiently compute kernel functions in high-dimensional Hilbert spaces, promising quantum advantage. --- ## Final Chapter Unified Knowledge Graph and Philosophical Sublimation ### Everything is Projection — An Inner Product Atlas Across All Disciplines Reviewing the knowledge system constructed throughout this article, from 2D vector dot products to infinite-dimensional complex Hilbert space state vector inner products, the concept of inner products permeates every corner of mathematics, physics, engineering, and computer science. **Core Theme**: The inner product $\langle \cdot, \cdot \rangle$ is a **similarity measure**. Regardless of whether the objects are vectors, functions, signals, images, or quantum states, the inner product answers the same question—"how similar are these two objects?" **Unified Knowledge Graph**: | Domain | Inner Product Form | Geometric Interpretation | Core Application | |--------|-------------------|------------------------|-----------------| | Linear Algebra | $\langle x, y \rangle = x^T y$ | Projection length | Orthogonal decomposition, least squares | | Functional Analysis | $\langle f, g \rangle = \int fg$ | Waveform similarity | Fourier series, wavelet transform | | Signal Processing | $\langle x, h \rangle = \sum x[n]h[n]$ | Matched filtering | Convolution, correlation detection | | Probability & Statistics | $\text{Cov}(X,Y) = E[(X-\mu_X)(Y-\mu_Y)]$ | Correlation direction | PCA, regression analysis | | Machine Learning | $\langle Q_i, K_j \rangle$ | Attention weight | Transformer, self-attention | | Image Processing | $\langle I, K \rangle$ | Feature response | CNN, edge detection | | Quantum Mechanics | $\langle \phi \mid \psi \rangle$ | Probability amplitude | Measurement, quantum computing | | Control Theory | $\langle f, e^{-st} \rangle$ | Complex frequency projection | Laplace transform, stability analysis | ### Philosophical Sublimation — Projection as Cognition From a philosophical perspective, "everything is projection" is not merely a mathematical statement, but a way of understanding the world$^{[22]}$: 1. **Cognition is projection**: the process by which humans understand the world is essentially projecting the complex information of the external world onto a finite set of cognitive basis functions. We do not see "the world as it truly is," but rather the projection coefficients of the real world onto our cognitive basis. 2. **Orthogonality is independence**: when two concepts are orthogonal, it means they do not interfere with or overlap each other. Orthogonal decomposition is the ultimate weapon for simplifying complex problems—decomposing a complex system into mutually independent, non-interfering modules. 3. **Projection is decision-making**: the method of least squares shows that when an exact solution does not exist, finding the projection is the optimal choice. When a perfect solution is unattainable, orthogonal projection onto the feasible region yields the optimal decision. 4. **The choice of basis determines everything**: Fourier chooses sine waves as the basis, wavelets choose compactly supported functions as the basis, and Transformers choose learnable attention bases as the basis—the choice of basis determines what kind of world one can see. ### Final Thoughts The inner product is not merely a mathematical operation; it is a **metalanguage** that connects the microscopic and macroscopic, the continuous and discrete, the deterministic and probabilistic. From the Pythagorean theorem to quantum entanglement, from least squares to large language models, the inner product, in its concise and profound form, unifies every corner of the edifice of human knowledge. --- ## Appendix Code for Generating Figures in This Article All five figures in this article (cosine similarity heatmap, least squares projection, Fourier decomposition, convolution matched filter, Sobel edge detection) are uniformly generated by main.py. This script is based on Python's scientific computing ecosystem (NumPy, SciPy, Matplotlib) and revolves around the core theme of "inner products," transforming the abstract mathematical concepts in the text into intuitive visual graphics.
The core design philosophy of the script is as follows:
1. **Cosine similarity**: computes the normalized inner product between word embedding vectors via the `cosine_similarity()` function, generating a $5 \times 5$ heatmap matrix. This function implements the definition of cosine similarity in equation (1.5).
2. **Least squares method**: uses `np.linalg.lstsq` to solve the normal equation $A^T A \hat{x} = A^T b$ (Theorem 3.1), essentially projecting the observation vector orthogonally onto the model space.
3. **Fourier decomposition**: projects the time-domain signal onto frequency bases via FFT (Theorem 6.1), with each peak in the spectrum corresponding to the inner product coefficient of a frequency component.
4. **Convolution and matched filtering**: treats convolution as a sliding inner product operation (Definition 8.1), using pointwise inner products between the template and the signal to detect pulse positions.
5. **Sobel edge detection**: computes the 2D inner product between the convolution kernel and the image (Example 8.2), calculating the gradient magnitude at each pixel.
The following is the core code snippet from the script that generates the cosine similarity heatmap:
def cosine_similarity(vec_a: np.ndarray, vec_b: np.ndarray) -> float:
dot_product = float(np.dot(vec_a, vec_b))
norm_a = np.linalg.norm(vec_a)
norm_b = np.linalg.norm(vec_b)
return dot_product / (norm_a * norm_b)
def build_semantic_demo() -> tuple[list[str], dict[str, np.ndarray], np.ndarray]:
tokens = ["king", "queen", "man", "woman", "apple"]
embeddings = {
"king": np.array([0.92, 0.10, 0.78, 0.25, 0.60]),
"queen": np.array([0.90, 0.12, 0.80, 0.30, 0.63]),
"man": np.array([0.88, 0.18, 0.40, 0.22, 0.35]),
"woman": np.array([0.86, 0.22, 0.42, 0.28, 0.38]),
"apple": np.array([0.05, 0.95, 0.08, 0.87, 0.10]),
}
matrix = np.array(
[[cosine_similarity(embeddings[left], embeddings[right]) for right in tokens] for left in tokens]
)
return tokens, embeddings, matrix
For the complete code, please refer to main.py
## References
[1] Wikipedia contributors. (2026, April 28). Dot product. In _Wikipedia, The Free Encyclopedia_. Retrieved 11:42, May 24, 2026, from https://en.wikipedia.org/w/index.php?title=Dot_product&oldid=1351567929.
[2] Wikipedia contributors. (2025, November 3). Orthogonal complement. In _Wikipedia, The Free Encyclopedia_. Retrieved 11:43, May 24, 2026, from https://en.wikipedia.org/w/index.php?title=Orthogonal_complement&oldid=1320174088.
[3] Wikipedia contributors. (2025, July 7). Orthogonalization. In _Wikipedia, The Free Encyclopedia_. Retrieved 11:44, May 24, 2026, from https://en.wikipedia.org/w/index.php?title=Orthogonalization&oldid=1299273509.
[4] Wikipedia contributors. (2025, September 1). Orthogonal functions. In _Wikipedia, The Free Encyclopedia_. Retrieved 11:46, May 24, 2026, from https://en.wikipedia.org/w/index.php?title=Orthogonal_functions&oldid=1308940353.
[5] Wikipedia contributors. (2026, March 13). Least squares. In _Wikipedia, The Free Encyclopedia_. Retrieved 11:46, May 24, 2026, from https://en.wikipedia.org/w/index.php?title=Least_squares&oldid=1343263636.
[6] Wikipedia contributors. (2026, May 23). Hilbert space. In _Wikipedia, The Free Encyclopedia_. Retrieved 11:47, May 24, 2026, from https://en.wikipedia.org/w/index.php?title=Hilbert_space&oldid=1355759876.
[7] 卷积、内积、互相关概念. CSDN博客, 2024. https://blog.csdn.net/qq_31073871/article/details/146475191.
[8] Wikipedia contributors. (2026, February 27). Inner product space. In _Wikipedia, The Free Encyclopedia_. Retrieved 11:51, May 24, 2026, from https://en.wikipedia.org/w/index.php?title=Inner_product_space&oldid=1340828148.
[9] 内积和外积[G/OL]. OI Wiki, 2025. https://oi-wiki.org/math/linear-algebra/product/.
[10] 维基百科编者. 内积[G/OL]. 维基百科, 2025(20250703)[2025-07-03]. https://zh.wikipedia.org/w/index.php?title=%E5%86%85%E7%A7%AF&oldid=88045564.
[11] Wikipedia contributors. (2026, April 24). Fourier series. In _Wikipedia, The Free Encyclopedia_. Retrieved 11:55, May 24, 2026, from https://en.wikipedia.org/w/index.php?title=Fourier_series&oldid=1350934101.
[12] Wikipedia contributors. (2026, May 20). Fourier transform. In _Wikipedia, The Free Encyclopedia_. Retrieved 11:55, May 24, 2026, from https://en.wikipedia.org/w/index.php?title=Fourier_transform&oldid=1355147665.
[13] Wikipedia contributors. (2026, May 17). Cosine similarity. In _Wikipedia, The Free Encyclopedia_. Retrieved 11:56, May 24, 2026, from https://en.wikipedia.org/w/index.php?title=Cosine_similarity&oldid=1354643579.
[14] Wikipedia contributors. (2026, May 11). Laplace transform. In _Wikipedia, The Free Encyclopedia_. Retrieved 11:56, May 24, 2026, from https://en.wikipedia.org/w/index.php?title=Laplace_transform&oldid=1353668445.
[15] Wikipedia contributors. (2026, May 8). Z-transform. In _Wikipedia, The Free Encyclopedia_. Retrieved 11:57, May 24, 2026, from https://en.wikipedia.org/w/index.php?title=Z-transform&oldid=1353129057.
[16] Wikipedia contributors. (2025, June 1). Frequency domain. In _Wikipedia, The Free Encyclopedia_. Retrieved 11:57, May 24, 2026, from https://en.wikipedia.org/w/index.php?title=Frequency_domain&oldid=1293464779.
[17] Wikipedia contributors. (2026, May 20). Convolution. In _Wikipedia, The Free Encyclopedia_. Retrieved 11:57, May 24, 2026, from https://en.wikipedia.org/w/index.php?title=Convolution&oldid=1355143781.
[18] Wikipedia contributors. (2026, April 25). Discrete cosine transform. In _Wikipedia, The Free Encyclopedia_. Retrieved 11:58, May 24, 2026, from https://en.wikipedia.org/w/index.php?title=Discrete_cosine_transform&oldid=1350947997.
[19] Wikipedia contributors. (2026, May 19). JPEG. In _Wikipedia, The Free Encyclopedia_. Retrieved 11:58, May 24, 2026, from https://en.wikipedia.org/w/index.php?title=JPEG&oldid=1355030069.
[20] Wikipedia contributors. (2026, April 29). Wavelet. In _Wikipedia, The Free Encyclopedia_. Retrieved 11:58, May 24, 2026, from https://en.wikipedia.org/w/index.php?title=Wavelet&oldid=1351640900.
[21] Wikipedia contributors. (2026, March 22). Word embedding. In _Wikipedia, The Free Encyclopedia_. Retrieved 11:59, May 24, 2026, from https://en.wikipedia.org/w/index.php?title=Word_embedding&oldid=1344811356.
[22] Wikipedia contributors. (2025, November 24). Kernel method. In _Wikipedia, The Free Encyclopedia_. Retrieved 12:00, May 24, 2026, from https://en.wikipedia.org/w/index.php?title=Kernel_method&oldid=1323912764.
[23] Wikipedia contributors. (2026, April 19). Support vector machine. In _Wikipedia, The Free Encyclopedia_. Retrieved 12:00, May 24, 2026, from https://en.wikipedia.org/w/index.php?title=Support_vector_machine&oldid=1350010737.
[24] Wikipedia contributors. (2026, May 23). Cluster analysis. In _Wikipedia, The Free Encyclopedia_. Retrieved 12:00, May 24, 2026, from https://en.wikipedia.org/w/index.php?title=Cluster_analysis&oldid=1355672094.
[25] Wikipedia contributors. (2026, April 8). Regression analysis. In _Wikipedia, The Free Encyclopedia_. Retrieved 12:01, May 24, 2026, from https://en.wikipedia.org/w/index.php?title=Regression_analysis&oldid=1347668389.
[26] Wikipedia contributors. (2026, May 22). Quantum mechanics. In _Wikipedia, The Free Encyclopedia_. Retrieved 12:01, May 24, 2026, from https://en.wikipedia.org/w/index.php?title=Quantum_mechanics&oldid=1355584024.
[27] Wikipedia contributors. (2026, May 20). Uncertainty principle. In _Wikipedia, The Free Encyclopedia_. Retrieved 12:01, May 24, 2026, from https://en.wikipedia.org/w/index.php?title=Uncertainty_principle&oldid=1355179215.

Comments NOTHING