Below are some notes I've taken on the k-NN algorithm and some variants. I first introduce the basic algorithm and how k might be chosen. I then discuss weighted variations of the algorithm (optimally weighted k-NN, bagged k-NN, kernel density estimators). Finally, I discuss the use of k-NN in metric learning.
k Nearest Neighbors
kNN Overview
k-nearest neighbors is an over-seventy-year-old supervised learning method. Given \(n\) samples \(\{(X_i, Y_i)\}_{i=1}^n\) from the joint distribution of \(X, Y\), where \(X\in \mathbb{R}^d\) is the vector of independent variables and \(Y\in \mathbb{R}\) is the dependent variable, k-NN regression predicts the dependent variable for a new input \(x\) as
$$\mathbb{E}[Y | x] \approx k\text{NN}_{\text{regr}}(x) = \frac{1}{k}\sum_{i : i\in S_k(x)} Y_i$$where \(S_k(x) = \text{argmin}_{S'\in \mathcal{P}([n]):\, |S'|=k} \sum_{i \in S'} d(x, X_i)\) is the set of indices of the \(k\) training points \(X_i\) closest to \(x\) according to distance metric \(d\). In classification, the formula is a little different: votes for each class are computed separately and the class with the most votes is returned:
$$k\text{NN}_{\text{clf}}(x) = \text{argmax}_y \frac{1}{k}\sum_{i\in S_k(x)} \mathbb{1}_{Y_i=y}$$In practice, most of the literature on k-NN focuses on classification, so unless otherwise noted, assume all results/proofs/etc. discussed here concern classification performance (i.e. error rate).
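As a concrete reference, here is a minimal numpy sketch of the two formulas above (Euclidean distance assumed; the function name and signature are mine):

```python
import numpy as np

def knn_predict(x, X, Y, k=5, regression=False):
    """Predict Y for a single query x from training data (X, Y) via k-NN.

    X: (n, d) array of inputs, Y: (n,) array of targets; Euclidean distance is assumed.
    """
    dists = np.linalg.norm(X - x, axis=1)   # d(x, X_i) for every training point
    neighbors = np.argsort(dists)[:k]       # indices in S_k(x)
    if regression:
        return Y[neighbors].mean()          # kNN_regr: average of the k neighbor targets
    classes, votes = np.unique(Y[neighbors], return_counts=True)
    return classes[np.argmax(votes)]        # kNN_clf: majority vote
```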
Some notable properties of kNN include
- Slowness: kNN predictions require computing distances with respect to every observation in the training dataset, yielding a complexity of \(O(nd)\) per prediction.
- Convergence: Provided that \(k\to\infty\), \(n\to\infty\), and \(k/n\to 0\), kNN eventually converges to the best possible classifier (the Bayes classifier).
- Sensitivity to feature scale: If features are scaled poorly, kNN risks making predictions based on irrelevant features. As the scale of an uninformative feature grows relative to the informative features, kNN approaches random guessing (see the sketch below).
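To illustrate the last point, here is a small sketch using scikit-learn (the synthetic dataset and scale values are arbitrary choices of mine); as the scale of the noise feature grows, cross-validated accuracy should fall toward chance:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Two informative features plus one pure-noise feature (shuffle=False keeps the
# noise feature in the last column).
X, y = make_classification(n_samples=500, n_features=3, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)

for noise_scale in [1, 10, 100]:
    X_scaled = X.copy()
    X_scaled[:, -1] *= noise_scale   # inflate the scale of the uninformative feature
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=5), X_scaled, y, cv=5).mean()
    print(f"noise scale {noise_scale:>3}: CV accuracy {acc:.2f}")
```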
Intuitions for choosing k
At a high level, the best value of k depends on the dataset distribution. For example, Cover and Hart demonstrate, among other things, that k=1 performs at least as well as any other value of k when every in-class distance is smaller than every between-class distance. Concretely, if our background dataset has \(n_0\) observations with \(Y=0\) and the distributions have this property, then k-NN is guaranteed to be correct for a new observation with \(Y=0\) as long as \(k < 2n_0\) (class-0 neighbors still form a majority of the k nearest), while k=1 is always correct.
On the other hand, if \(X | Y=0\) and \(X | Y=1\) have identical distributions, then the best classifier will simply return the more prevalent value of \(Y\). This is most reliably achieved by the k=\(n\) classifier and least reliably achieved by the k=1 classifier. In practice, most distributions are somewhere between these two cases and the appropriate selection of k is dataset dependent.
A useful paper on the selection of k is "Choice of neighbor order in nearest-neighbor classification". There, the k value that minimizes the average error rate over train and test samples (roughly fractions r and 1-r of the original dataset) bootstrapped in a particular way (see the paper) is determined. This value is then multiplied by \(r^{-4/(d+4)}\) to adjust for the size of the full training dataset. This is based on the observation that the excess risk of k-NN over the Bayes classifier is \(O(k^{-1}+(k/n)^{4/d})\), which is minimized by \(k \propto n^{4/(d+4)}\).
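For reference, the final rescaling step might look like the following sketch, under my reading of the rule and assuming a k has already been chosen on the training subsample:

```python
def adjust_k(k_sub, r, d):
    """Rescale a k chosen on a training fraction r of the data to the full dataset,
    using the k ~ n^{4/(d+4)} scaling implied by the O(k^{-1} + (k/n)^{4/d}) risk expansion."""
    return max(1, round(k_sub * r ** (-4 / (d + 4))))

# e.g. a k of 7 chosen on half the data for a 10-dimensional problem
print(adjust_k(k_sub=7, r=0.5, d=10))   # -> 9
```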
Weighted k-NN and Bootstrapping
k-NN with Proximity Rank Based Weights
Samworth proposed replacing standard k-NN's average over neighbors with a weighted average based on proximity ranks. The asymptotically optimal weight for the \(i\)-th nearest neighbor, for a background dataset of size \(n\) and dimension \(d\), is given as
$$w_{ni} = \left\{ \begin{array}{ll} \frac{1}{k^*}\biggl[1+\frac{d}{2}-\frac{d}{2(k^*)^{2/d}}\{i^{1+2/d} - (i-1)^{1+2/d}\}\biggr] & 1\leq i\leq k^{*} \\ 0 & i > k^{*} \\ \end{array} \right.$$This leads to an asymptotic reduction in regret of at least 5% for \(d\leq 15\) and around 2% for \(d=50\), relative to the optimal unweighted k-NN analogue.
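A small sketch of these weights (the function name is mine); note that the weights decrease with rank and sum to one:

```python
import numpy as np

def samworth_weights(k_star, d):
    """Asymptotically optimal proximity-rank weights w_{ni} for i = 1..k*."""
    i = np.arange(1, k_star + 1)
    bracket = 1 + d / 2 - (d / (2 * k_star ** (2 / d))) * (i ** (1 + 2 / d) - (i - 1) ** (1 + 2 / d))
    return bracket / k_star

w = samworth_weights(k_star=10, d=5)
print(w.round(3), w.sum())   # weights decrease with rank and sum to 1
```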
Samworth also points out that bagged k-NN (i.e. ensembling k-NN models fit on resamples of the background dataset) becomes a form of weighted k-NN as the number of bagged models goes to infinity. In this case, the i-th nearest neighbor in the background dataset winds up with a de-facto weight proportional to its probability of being a nearest neighbor in a given resample. While this is a suboptimal weighting strategy, the performance gap is minor and shrinks as \(d\) increases; even \(d>5\) is enough for the difference to be negligible. Both methods empirically beat basic k-NN.
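The de-facto weights of infinitely-bagged 1-NN can be estimated by simulation. Below is a rough Monte Carlo sketch assuming full-size bootstrap resamples drawn with replacement (one of several possible resampling schemes):

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_models = 200, 5000

# Estimate, by simulation, the probability that the rank-i neighbor in the full
# background dataset ends up being the nearest neighbor within a bootstrap resample.
counts = np.zeros(n)
for _ in range(n_models):
    resample = rng.integers(0, n, size=n)   # neighbor ranks drawn with replacement
    counts[resample.min()] += 1             # the lowest rank present is the new nearest neighbor
weights = counts / n_models
print(weights[:10].round(3))                # mass concentrates on the lowest ranks
```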
Distance Weighted k-NN
One might wonder whether, instead of weighting neighbors based on their proximity rank, they could be weighted based on distance. This might be written as:
$$k\text{NN}(x) = \frac{\sum_{i\in S_k(x)} w(d(X_i, x)) Y_i}{\sum_{i\in S_k(x)} w(d(X_i, x))}$$where \(w: \mathbb{R}_{\geq 0}\to \mathbb{R}\) determines the weight associated with a particular distance. This has a functional form resembling Nadaraya-Watson kernel regression and can be viewed as such, with locally dependent truncations of the kernel. As k approaches the size of the background dataset, those truncations go away: one loses the local properties of k-NN and gets kernel regression limited by the "bandwidth" of \(w\), which is usually not what a k-NN user wants.
There is some dissonance in taking the "kernel" \(w\), which has a single scale/bandwidth, and applying an arbitrarily tight, locally dependent truncation to it. For this reason, one usually wants to incorporate both distance and local properties into \(w\), and there are many ways of doing so. For example, when Dudani introduced distance-weighted k-NN, he proposed weights that decrease linearly from one to zero with distance over the course of the k neighbors (which means the weight of the k-th neighbor is defined to be zero):
$$w(d) = \left\{ \begin{array}{ll} \frac{d_{\text{max}}-d}{d_{\text{max}}-d_{\text{min}}} & d_{\text{max}}\neq d_{\text{min}}\\ 1 & d_{\text{max}} = d_{\text{min}} \\ \end{array} \right.$$where \(d_{\text{max}} = \text{max}_{i\in S_k(x)}\, d(X_i , x)\) and \(d_{\text{min}} = \text{min}_{i\in S_k(x)}\, d(X_i , x)\), making the weighting function locally dependent. The choice of weighting function can considerably impact performance, as Macleod and Titterington found in 1987 when they proposed another locally dependent linear weighting that performs better in some cases.
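For concreteness, here is a minimal sketch of Dudani's rule for classification (Euclidean distance and the function name are my choices):

```python
import numpy as np

def dudani_knn_classify(x, X, Y, k=5):
    """Distance-weighted k-NN classification with Dudani's linear weights."""
    dists = np.linalg.norm(X - x, axis=1)
    order = np.argsort(dists)[:k]
    d_neighbors = dists[order]
    d_min, d_max = d_neighbors.min(), d_neighbors.max()
    if d_max == d_min:
        w = np.ones(k)                               # all neighbors equidistant
    else:
        w = (d_max - d_neighbors) / (d_max - d_min)  # 1 for the nearest, 0 for the k-th
    classes = np.unique(Y)
    scores = [w[Y[order] == c].sum() for c in classes]
    return classes[np.argmax(scores)]
```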
Alas, while this method may be somewhat intuitive, it is less desirable than one might think. In "A note on distance-weighted k-nearest neighbour rules" (1978), Bailey and Jain showed that, as the background dataset grows arbitrarily large, the error of regular unweighted k-NN is no worse than that of any distance-weighted k-NN. It follows that k-NN with optimal weighting based on proximity ranks must also match or beat distance weighting in the limit. Macleod and Titterington do argue that, practically speaking, datasets are finite, so the method may still have some empirical use. However, computers and dataset sizes have grown considerably since 1987, so these empirical observations are less compelling now.
Neighborhood Component Analysis
Methodology
As discussed earlier, the speed at which k-NN converges to a good classifier depends on the independent variables in the dataset being scaled so that distances between observations are meaningful. Developed in 2004, Neighborhood Component Analysis (NCA) addresses this challenge by finding a linear transformation \(A\) of the independent features in a dataset or, equivalently, an adjusted distance metric for a k-NN classifier that yields good performance.
Optimization of \(A\) is typically done through gradient descent, but k-NN performance changes discontinuously when an observation in the background dataset moves in or out of the k nearest neighbors. Thus NCA maximizes the performance of stochastic 1-NN instead. Stochastic 1-NN takes a new observation \(x\) and assigns it the class of a single observation \(X_j\) selected at random from the background dataset, where \(X_j\) is selected with probability:
$$ p_{A}(x, X_j; \{X_k\}_{k=1}^n) = \frac{w(d(Ax, A X_j))}{\sum_{k=1}^n w(d(Ax, A X_k))}$$where \(d\) is a distance metric, \(X_j, X_k\) are samples in the background dataset of size \(n\), and \(w:\, \mathbb{R}_{\geq 0}\to \mathbb{R}_{>0}\) determines how the probability of selection scales with distance (the original paper uses \(w(d)=e^{-d^2}\) with Euclidean \(d\)). Using this, one can write the leave-one-out accuracy of stochastic 1-NN on the background dataset given \(A\) as the probability that observation \(i\) is assigned the correct class, summed over all \(i\):
$$ LOO_D(A) = \sum_{i=1}^n \sum_{j\in [n]\backslash i:\, Y_i=Y_j} p_A(X_i, X_j; \{X_k\}_{k=1, k\neq i}^n). $$Now one can simply take the derivative with respect to \(A\) and maximize. Note that this problem is not convex (i.e. multiple local optima exist), so optimization ought to be attempted from different starting positions to ensure good performance.
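A bare-bones sketch of this objective, using the Gaussian weighting from the original paper and scipy's finite-difference gradients in place of the analytic gradient (slow, but enough to illustrate the idea; all names are mine):

```python
import numpy as np
from scipy.optimize import minimize

def neg_nca_objective(A_flat, X, y, dim_out):
    """Negative leave-one-out objective for stochastic 1-NN (negated so scipy can minimize)."""
    n, d = X.shape
    A = A_flat.reshape(dim_out, d)
    Z = X @ A.T                                   # transformed features, shape (n, dim_out)
    sq_dists = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(sq_dists, np.inf)            # leave-one-out: i cannot select itself
    logits = -sq_dists
    logits -= logits.max(axis=1, keepdims=True)   # for numerical stability
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)             # p_A(X_i, X_j; ...) with w(d) = exp(-d^2)
    same_class = y[:, None] == y[None, :]
    return -(P * same_class).sum()

# Hypothetical usage on a small dataset X of shape (n, d) with labels y:
# A0 = np.random.default_rng(0).normal(size=(d, d))
# res = minimize(neg_nca_objective, A0.ravel(), args=(X, y, d), method="L-BFGS-B")
# A = res.x.reshape(d, d)
```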
In the original paper, overfitting was not observed. KL-divergence was also considered as an alternative to the leave-one-out objective and yielded similar results. Furthermore, by making \(A\) a rectangular matrix instead of a square one, one can easily find lower-dimensional representations of the dataset as well.
Adaptation to Kernel Regression
Understandably, NCA uses stochastic 1-NN rather than any true k-NN variant. This is necessary because any incorporation of discrete proximity ranks into the model would prevent its performance from being differentiable. However, proximity rank is also what makes k-NN equivalent to a variable kernel density estimator. Without it, it is hard to equate stochastic 1-NN to k-NN at all, since stochastic 1-NN is not sensitive to the data's topology. With respect to topology, it is more like a kernel regressor with a global bandwidth whose scale is implicitly learned through \(A\).
Since binary classification with \(Y\in \{0,1\}\) can be framed as regression, one can even compare the stochastic 1-NN loss directly to kernel regression using an analogous weighting function:
$$ \begin{aligned} MSE_{\text{Stoch-1NN}}(A) &\propto \sum_{i=1}^n \sum_{j\in[n]\backslash i} p_A(X_i, X_j; \{X_k\}_{k=1, k\neq i}^n)|Y_i - Y_j|^2 \\ MSE_{\text{Kernel}}(A) &= \sum_{i=1}^n \left |Y_i - \sum_{j\in[n]\backslash i} p_A(X_i, X_j; \{X_k\}_{k=1, k\neq i}^n)Y_j \right |^2 \end{aligned} $$While similar in some ways, the losses are materially different. Still, it is clear that both stochastic 1-NN and kernel regression are topology-oblivious estimators with differentiable loss functions and comparable weighting methodologies. One may therefore wonder whether an NCA-style technique can be applied to kernel regression. Weinberger and Tesauro did just this in 2007 with their paper Metric Learning for Kernel Regression (MLKR), explicitly describing it as an adaptation of NCA to kernel regression. The methodology is nearly identical to NCA with a kernel regression loss swapped in, with one distinction: to make the optimization efficient, only probabilities for the top 1000 nearest neighbors of an observation were computed, with those neighbors periodically updated every fifteen gradient steps.
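For comparison, the leave-one-out kernel-regression loss above might be sketched as follows, assuming a Gaussian kernel and omitting the nearest-neighbor truncation trick:

```python
import numpy as np

def mlkr_style_loss(A, X, y):
    """Leave-one-out kernel regression loss with Gaussian weights w(d) = exp(-d^2)."""
    Z = X @ A.T
    sq_dists = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(sq_dists, np.inf)   # observation i does not contribute to its own prediction
    W = np.exp(-sq_dists)                # exp(-inf) = 0 on the diagonal
    y_hat = (W @ y) / W.sum(axis=1)      # Nadaraya-Watson prediction for each X_i
    return ((y - y_hat) ** 2).sum()
```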
Large Margin Nearest Neighbor Classification (LMNN)
NCA can be slow due to the need to recompute the selection probabilities for every training observation each time \(A\) is updated. In 2009, Weinberger and Saul proposed another approach for finding a linear transformation of a dataset that improves k-NN performance in their paper Distance Metric Learning for Large Margin Nearest Neighbor Classification.
Conceptually, this approach notices that, for accuracy on a specific point, only a small subset of the \(n\) observations in the background dataset needs to be considered. Thus, it assigns each observation \(i\) a set \(\text{NS}_i\) of k target neighbors: observations of the same class as \(i\) that ought to be close to it. As a result, this method is sensitive to topology. The set \(\text{NS}_i\) also defines the criteria for imposters: observations of the wrong class that invade a perimeter defined by \(\text{NS}_i\). More precisely, an imposter \(l\) for observation \(i\) satisfies
$$ d(A X_i, A X_l) \leq d(A X_i, A X_j) + 1$$for some \(j\in \text{NS}_i\). Note the presence of the \(+1\) on the right hand side of the expression, to ensure that imposters are separated with a large margin. The loss function to be minimized is then defined by two terms:
$$ \begin{aligned} \epsilon_{\text{pull}} &= \sum_{i=1}^n \sum_{j\in \text{NS}_i} d(A X_i, A X_j) \\ \epsilon_{\text{push}} &= \sum_{i=1}^n \sum_{j\in \text{NS}_i} \sum_{l=1}^n 1_{Y_i \neq Y_l}[1+d(A X_i, A X_j) - d(A X_i, A X_l)]_{+} \\ \epsilon_{\text{loss}} &= (1-u)\epsilon_{\text{pull}} + u\epsilon_{\text{push}} \end{aligned} $$where \(u\) is a user-selected weight parameter (usually \(u=0.5\)) and \([z]_+=\max(z,0)\) denotes the hinge loss. Note that when the imposter inequality is not satisfied (i.e. \(l\) lies safely outside the margin), \(1+d(A X_i, A X_j) - d(A X_i, A X_l)\) is negative and gets zeroed out, so only imposters contribute to the push term. A benefit of this loss function is that, when re-framed in terms of the positive semidefinite matrix \(M = A^\top A\), it can be formulated as an instance of semidefinite programming (a convex optimization method with better properties than naive gradient descent).
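A naive sketch of this loss (squared Euclidean distances; the `target_neighbors` mapping is my own representation, and a practical implementation would use the semidefinite formulation or a specialized solver):

```python
import numpy as np

def lmnn_loss(A, X, y, target_neighbors, u=0.5):
    """LMNN pull/push loss with squared Euclidean distances in the transformed space.

    target_neighbors maps each i to an array of indices NS_i (same-class target neighbors).
    """
    Z = X @ A.T
    sq_dists = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    pull, push = 0.0, 0.0
    for i, NS_i in target_neighbors.items():
        diff_class = y != y[i]                          # candidate imposters l
        for j in NS_i:
            pull += sq_dists[i, j]
            margins = 1 + sq_dists[i, j] - sq_dists[i, diff_class]
            push += np.clip(margins, 0, None).sum()     # hinge: only imposters contribute
    return (1 - u) * pull + u * push
```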
One limitation of LMNN is that there is no principled way of ensuring the original target neighbors are a good choice: they have to be selected before a better feature representation is learned. Thus, one natural extension of LMNN is a multi-pass variant which learns \(A\), uses \(A\) to pick new sets of target neighbors, then re-learns \(A\), and so on. This is comparable to the periodic neighbor updates used for training in Metric Learning for Kernel Regression.
Fast Neighborhood Component Analysis
Fast neighborhood component analysis (FNCA) was developed in 2012 by Yang, Wang and Zuo as another way of accelerating NCA (info on FNCA is freely available in Jan Hanousek's bachelor thesis). Like LMNN, it defines the set \(\text{NS}_i\), corresponding to the \(k\) nearest neighbors of \(i\) in the untransformed dataset that share its class. However, instead of using this set to implicitly describe a set of imposters, another set \(\text{ND}_i\) is defined, corresponding to the \(k\) nearest neighbors in the untransformed dataset that do not share \(i\)'s class. \(A\) is then optimized to maximize the leave-one-out performance of stochastic nearest neighbors where \(j\) has a nonzero probability of selection for observation \(i\) only if \(j\in \text{NS}_i\) or \(j\in \text{ND}_i\).
Assuming the original feature space is good enough, this variant of stochastic nearest neighbors should perform well since \(\text{ND}_i\) and \(\text{NS}_i\) should still capture the relevant neighbors even if \(A\) transforms the features. Furthermore, its performance might be better because there is no chance of selecting extremely distant observations. It is also faster as training scales with the size of the \(\text{ND}_i\) and \(\text{NS}_i\) sets instead of the full dataset. Unfortunately, because these sets are small, overfitting is now possible and a regularization term is needed.
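A rough sketch of the restricted objective (the `NS`/`ND` index-array representation, the Gaussian weighting, and the L2 penalty form are my own assumptions):

```python
import numpy as np

def fnca_objective(A, X, NS, ND, lam=1e-3):
    """Restricted stochastic-NN leave-one-out objective minus an L2 penalty on A.

    NS[i], ND[i]: index arrays of the k same-class / different-class nearest neighbors
    of i in the untransformed feature space; only these get nonzero selection probability.
    """
    Z = X @ A.T
    total = 0.0
    for i in range(len(X)):
        cand = np.concatenate([NS[i], ND[i]])
        sq = ((Z[i] - Z[cand]) ** 2).sum(axis=1)
        p = np.exp(-(sq - sq.min()))      # w(d) = exp(-d^2), shifted for numerical stability
        p /= p.sum()
        total += p[: len(NS[i])].sum()    # probability that i is assigned its own class
    return total - lam * (A ** 2).sum()   # to be maximized over A
```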