Computer Science Assignment Help | Machine Learning Assignment and Exam Help | COMP30027

December 30, 2022

## Computationally Efficient Kernels

As mentioned above, the performance of properly scaling kernels depends on the kernel function $f$ only via the three parameters $\left(a_1, a_2, v\right)$. It is thus possible to design a prototypical family $\mathcal{F}$ of functions $f$ having (i) universal properties with respect to $\left(a_1, a_2, v\right)$, that is, for each $\left(a_1, a_2, v\right)$ there exists $f \in \mathcal{F}$ with these Hermite coefficients, and (ii) numerically advantageous properties. Thus, any arbitrary kernel function $f$ can be mapped, through $\left(a_1, a_2, v\right)$, to a function in $\mathcal{F}$ with good numerical properties. One such prototypical family $\mathcal{F}$ is the set of “ternary kernel” functions $f$, parametrized by a triplet $\left(t, s_{-}, s_{+}\right)$, and defined as
$$f(x)=\left\{\begin{array}{ll} -r t & x \leq \sqrt{2} s_{-} \\ 0 & \sqrt{2} s_{-}<x \leq \sqrt{2} s_{+} \\ t & x>\sqrt{2} s_{+} \end{array}\right., \qquad \left\{\begin{array}{l} a_1=\frac{t}{\sqrt{2 \pi}}\left(e^{-s_{+}^2}+r e^{-s_{-}^2}\right) \\ a_2=\frac{t}{\sqrt{2 \pi}}\left(s_{+} e^{-s_{+}^2}+r s_{-} e^{-s_{-}^2}\right) \\ v=\frac{t^2}{2}\left(1-\operatorname{erf}\left(s_{+}\right)\right)(1+r) \end{array}\right. \tag{4.31}$$
where $r \equiv \frac{1-\operatorname{erf}\left(s_{+}\right)}{1+\operatorname{erf}\left(s_{-}\right)}$. That is, $f$ only takes three discrete values so that the resulting kernel matrix may be stored and operated on very efficiently. Figure 4.7 displays such a function $f$ in (4.31) together with the cubic function $c_3 x^3+c_2\left(x^2-1\right)+c_1 x$ sharing the same coefficients $\left(a_1, a_2, v\right)$.
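The ternary function and its three parameters can be checked numerically. The sketch below is illustrative only and rests on an assumption not stated in this excerpt: the coefficients are taken as $a_1=\mathbb{E}[\xi f(\xi)]$, $a_2=\mathbb{E}[(\xi^2-1) f(\xi)]/\sqrt{2}$ and $v=\mathbb{E}[f^2(\xi)]$ for $\xi \sim \mathcal{N}(0,1)$. Under that convention, direct integration gives closed forms that are all proportional to $t$ (for $a_1$, $a_2$) or $t^2$ (for $v$). The function names and the parameter values are hypothetical.

```python
import math
import numpy as np

SQRT2 = math.sqrt(2.0)

def ternary_ratio(s_minus, s_plus):
    """r = (1 - erf(s_+)) / (1 + erf(s_-)), which makes E[f(xi)] = 0."""
    return (1.0 - math.erf(s_plus)) / (1.0 + math.erf(s_minus))

def ternary_kernel(x, t, s_minus, s_plus):
    """Evaluate the three-valued function f elementwise on an array x."""
    r = ternary_ratio(s_minus, s_plus)
    x = np.asarray(x, dtype=float)
    return np.where(x > SQRT2 * s_plus, t,
                    np.where(x <= SQRT2 * s_minus, -r * t, 0.0))

def closed_form_coefficients(t, s_minus, s_plus):
    """Closed-form (a1, a2, v) under the convention assumed in the lead-in."""
    r = ternary_ratio(s_minus, s_plus)
    c = t / math.sqrt(2.0 * math.pi)
    e_plus, e_minus = math.exp(-s_plus**2), math.exp(-s_minus**2)
    a1 = c * (e_plus + r * e_minus)
    a2 = c * (s_plus * e_plus + r * s_minus * e_minus)
    v = 0.5 * t**2 * (1.0 - math.erf(s_plus)) * (1.0 + r)
    return a1, a2, v

def monte_carlo_coefficients(t, s_minus, s_plus, n_samples=1_000_000, seed=0):
    """Estimate (a0, a1, a2, v) by sampling xi ~ N(0, 1)."""
    xi = np.random.default_rng(seed).standard_normal(n_samples)
    fx = ternary_kernel(xi, t, s_minus, s_plus)
    a0 = fx.mean()                                # should be close to 0
    a1 = (xi * fx).mean()
    a2 = ((xi**2 - 1.0) * fx).mean() / SQRT2
    v = (fx**2).mean()
    return a0, a1, a2, v

if __name__ == "__main__":
    params = dict(t=2.0, s_minus=-0.5, s_plus=0.8)  # illustrative values only
    print("closed form:", closed_form_coefficients(**params))
    print("Monte Carlo:", monte_carlo_coefficients(**params)[1:])
```

Because $f$ takes only the values $-rt$, $0$ and $t$, one possible storage scheme (again an assumption, not something the text prescribes) is to keep the sign pattern in a small integer array and the two scalars $t$ and $r$ separately.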

The equivalence class of kernel functions induced by this mapping (i.e., those having asymptotically equivalent spectral properties) is quite unlike the equivalence class of the previous section for the “improper” scaling $f\left(\mathbf{x}_i^{\top} \mathbf{x}_j / p\right)$ regime. In the latter, functions $f(x)$ of the same class of equivalence are those having common $f^{\prime}(0)$ and $f^{\prime \prime}(0)$ values, while here these functions may have no similar local behavior (as shown in the example of Figure 4.7).

In pursuit of computationally more efficient kernels by tuning the three key parameters $\left(a_1, a_2, v\right)$, one must be very careful since, by Theorem $4.5$ and Figure $4.5$, taking $a_2 \neq 0$ can result in up to two spurious noninformative spikes that may be mistaken as informative ones by spectral clustering algorithms. We refer the interested readers to Liao et al. [2021] for a thorough discussion on the “complexity and performance tradeoff” of properly scaling kernels for different $\mathcal{F}$ families (e.g., sparse, quantized, and even binarized functions).

## Implications to Kernel Methods

By simply “plugging” the random matrix equivalents of the kernel matrices studied in the previous sections into kernel-based learning algorithms, it is now possible to analyze the asymptotic performance of these algorithms in the large $n, p$ regime. The present section is dedicated to this analysis, successively for unsupervised (kernel spectral clustering in Section 4.4.1), semi-supervised (with kernel graph Laplacian in Section 4.4.2), and fully supervised (kernel ridge regression in Section 4.4.3) learning.
We will discover in this section that, as a result of the curse of dimensionality (following from the convergence $\left\|\mathbf{x}_i-\mathbf{x}_j\right\|^2 / p \stackrel{\text { a.s. }}{\longrightarrow} \tau_p$) and of the induced inappropriate (low-dimensional) intuitions when applied to the large-dimensional setting, all these algorithms (i) behave differently from what is expected, (ii) sometimes fail to perform as intended, and (iii) are often far from optimal. The random matrix analyses performed in the previous section provide new intuitions and, as shall be seen, always allow for a proper adaptation (such as an optimal hyperparameter tuning) and improvement (sometimes via very simple but fundamental modifications) of the algorithms. As another important outcome, the possibility to access the performance of these improved algorithms provides a safer ground for further optimization and even for comparing to the ultimate information-theoretic bounds associated with the machine learning problem at hand.
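The distance concentration behind this curse of dimensionality is easy to observe numerically. The sketch below assumes, purely for illustration, that the data are i.i.d. standard Gaussian vectors $\mathbf{x}_i \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_p)$, in which case the normalized squared distances concentrate around 2. The data model and the helper name `normalized_sq_distances` are assumptions, not taken from the text.

```python
import numpy as np

def normalized_sq_distances(n, p, rng):
    """Return the n(n-1)/2 values ||x_i - x_j||^2 / p for x_i ~ N(0, I_p)."""
    X = rng.standard_normal((n, p))
    sq_norms = np.sum(X**2, axis=1)
    # ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 x_i^T x_j
    D2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    upper = np.triu_indices(n, k=1)   # each pair i < j counted once
    return D2[upper] / p

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for p in (10, 100, 1000, 10000):
        d = normalized_sq_distances(50, p, rng)
        print(f"p={p:6d}  mean={d.mean():.3f}  std={d.std():.3f}  "
              f"min={d.min():.3f}  max={d.max():.3f}")
```

As $p$ grows, the spread of the printed values shrinks: all pairs of points become nearly equidistant, which is why a kernel $f\left(\left\|\mathbf{x}_i-\mathbf{x}_j\right\|^2 / p\right)$ is effectively evaluated only in a small neighborhood of a single point.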

From a machine learning perspective, spectral clustering is often seen as a discrete-to-continuous relaxation of a graph min-cut problem [Luxburg, 2007]. More precisely, assuming $\mathbf{K}$ to be the adjacency matrix of a graph with nodes $\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^p$ and edges $f\left(\left\|\mathbf{x}_i-\mathbf{x}_j\right\|^2 / p\right)$, the min-cut problem consists in determining a $k$-class partition $\mathcal{S}_1 \cup \ldots \cup \mathcal{S}_k$ of $\{1, \ldots, n\}$ that minimizes the affinity across classes, that is,
$$\left(\mathcal{S}_1, \ldots, \mathcal{S}_k\right) \in \underset{\mathcal{S}_1 \cup \ldots \cup \mathcal{S}_k=\{1, \ldots, n\}}{\arg \min } \sum_{a=1}^k \sum_{\substack{i \in \mathcal{S}_a \\ j \notin \mathcal{S}_a}} \frac{f\left(\left\|\mathbf{x}_i-\mathbf{x}_j\right\|^2 / p\right)}{\left|\mathcal{S}_a\right|},$$
where the division by the cardinality $\left|\mathcal{S}_a\right|$ ensures that classes have approximately balanced weights (this is formally known as the ratio-cut adaptation of the original min-cut problem for which the denominator is simply 1). This optimization problem has been shown to be equivalent to finding the isometric matrix $\mathbf{S}=\left[\mathbf{s}_1, \ldots, \mathbf{s}_k\right] \in \mathbb{R}^{n \times k}$ (i.e., $\mathbf{S}^{\top} \mathbf{S}=\mathbf{I}_k$) with columns defined as $\left[\mathbf{s}_a\right]_i=\delta_{i \in \mathcal{S}_a} / \sqrt{\left|\mathcal{S}_a\right|}$, which minimizes
$$\operatorname{tr} \mathbf{S}^{\top}(\mathbf{D}-\mathbf{K}) \mathbf{S}$$
where $\mathbf{D}=\operatorname{diag}\left(\mathbf{K} \mathbf{1}_n\right)$. Solving this discrete problem is known to be NP-hard [Luxburg, 2007], but relaxing $\mathbf{S}$ to be merely an orthonormal matrix with no structure constraint gives the straightforward solution that $\mathbf{S} \in \mathbb{R}^{n \times k}$ is the collection of the $k$ eigenvectors associated with the smallest eigenvalues of $\mathbf{D}-\mathbf{K}$.
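The equivalence between the ratio-cut objective and the trace form, and the relaxed eigenvector solution, can be illustrated with a short sketch. The toy two-class Gaussian data, the choice $f(x)=e^{-x/2}$ and all function names below are assumptions made for the example only.

```python
import numpy as np

def kernel_matrix(X, f):
    """K_ij = f(||x_i - x_j||^2 / p) for the rows x_i of X."""
    n, p = X.shape
    sq = np.sum(X**2, axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return f(D2 / p)

def ratio_cut(K, labels, k):
    """Direct evaluation of the ratio-cut objective for a given partition."""
    total = 0.0
    for a in range(k):
        inside = labels == a
        total += K[np.ix_(inside, ~inside)].sum() / inside.sum()
    return total

def indicator_matrix(labels, k):
    """S with [s_a]_i = 1{i in S_a} / sqrt(|S_a|), so that S^T S = I_k."""
    S = np.zeros((labels.size, k))
    for a in range(k):
        inside = labels == a
        S[inside, a] = 1.0 / np.sqrt(inside.sum())
    return S

def relaxed_solution(K, k):
    """k smallest eigenvalues of D - K and the associated eigenvectors."""
    L = np.diag(K.sum(axis=1)) - K
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    return eigvals[:k], eigvecs[:, :k]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, p, k = 60, 20, 2
    labels = np.repeat(np.arange(k), n // k)
    # Hypothetical toy data: class means +1 and -1 in every coordinate.
    means = np.where(labels[:, None] == 0, 1.0, -1.0)
    X = means + rng.standard_normal((n, p))
    K = kernel_matrix(X, lambda x: np.exp(-x / 2.0))

    S = indicator_matrix(labels, k)
    L = np.diag(K.sum(axis=1)) - K
    print("ratio cut      :", ratio_cut(K, labels, k))
    print("tr S^T (D-K) S :", np.trace(S.T @ L @ S))
    eigvals, _ = relaxed_solution(K, k)
    print("relaxed optimum:", eigvals.sum())
```

The first two printed values agree up to floating-point error, since the diagonal blocks of $\mathbf{K}$ cancel in $\mathbf{s}_a^{\top}(\mathbf{D}-\mathbf{K}) \mathbf{s}_a$. The relaxed optimum can only be smaller than or equal to them, because the minimization is carried out over a larger set of matrices $\mathbf{S}$.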


