What is the K-Nearest Neighbors (KNN) algorithm?
K-Nearest Neighbors is a vector search algorithm that guarantees finding the K vectors nearest (by some distance measure) to a search vector.
The K-Nearest Neighbors algorithm, or KNN, is a straightforward, powerful supervised learning method used extensively in machine learning and data science. It is versatile, handling both classification and regression tasks, and is known for its ease of implementation and effectiveness in various real-world applications.
KNN identifies the K-nearest neighbors to a given data point using a distance metric, making predictions based on similar data points in the dataset. It is also useful in benchmarking Approximate Nearest Neighbors (ANN) search indexes. ANN searches are the most commonly used vector search technique because they are (many) orders of magnitude faster than brute-force KNN searches against high-dimensional data, but this speed comes at the cost of retrieval accuracy. KNN can be used to measure this loss of accuracy, allowing ANN parameters to be adjusted to strike an appropriate balance of accuracy and speed.
This guide will delve into the fundamentals of the KNN algorithm, its workings, and its applications, grounded in the context of vector search.
The KNN algorithm operates on the principle of similarity or “nearness,” predicting the label or value of a new data point by considering the labels or values of its K-nearest (the value of K is simply an integer) neighbors in the training dataset.
Consider the following diagram:
In the solid blue oval are kinds of fruit (nouns), and in the dashed pink oval are kinds of actions (verbs). So if we had a new word in position 1 in the solid blue oval, we would probably categorize it as a “fruit.” If we had to guess what the word in position 2 in the dashed pink oval was, humans might come up with a word like “jog” – but the KNN algorithm would pick “run” as the word because “jog” is not in its vocabulary, and point 2 is nearest to “run.”
In addition to categorizing and predicting new words, KNN can also perform numeric regression. For instance, if actions in the dashed pink oval have associated speeds, the algorithm can predict the speed for a new action at position 2 by averaging the speeds of its nearest neighbors.
This is an example of KNN: based on the relative position of a new value to existing values, we can determine the category a new value belongs to (as in the word 1 example), predict a value for a new word (as in the word 2 example), and perform numeric regression by combining the values of the nearest neighbors.
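To make this concrete, here is a minimal Python sketch of both uses of KNN. The toy 2-D “word vectors,” labels, and speeds are made up for illustration; only the nearest-neighbor logic reflects the algorithm described above.

```python
import numpy as np

def knn_predict(query, points, labels, k=3):
    """Classify `query` by majority vote among its k nearest points."""
    distances = np.linalg.norm(points - query, axis=1)   # Euclidean distance to every point
    nearest = np.argsort(distances)[:k]                  # indices of the k closest points
    votes = [labels[i] for i in nearest]
    return max(set(votes), key=votes.count)              # most common label wins

def knn_regress(query, points, values, k=3):
    """Predict a numeric value for `query` by averaging its k nearest points' values."""
    distances = np.linalg.norm(points - query, axis=1)
    nearest = np.argsort(distances)[:k]
    return float(np.mean(values[nearest]))

# Hypothetical 2-D "word embeddings": fruits cluster together, actions cluster together.
points = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],   # apple, pear, plum
                   [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])  # walk, run, sprint
labels = ["fruit", "fruit", "fruit", "action", "action", "action"]
speeds = np.array([0.0, 0.0, 0.0, 5.0, 15.0, 25.0])      # km/h, invented for the example

new_word = np.array([5.1, 4.9])                           # like "point 2" in the diagram
print(knn_predict(new_word, points, labels, k=3))         # -> "action" (classification)
print(knn_regress(new_word, points, speeds, k=3))         # -> 15.0, mean speed of 3 nearest actions
```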
KNN is widely used within machine learning, but it is also a valuable tool for optimizing ANN searches.
As a vector search algorithm, KNN has many of the same applications as ANN search, but KNN provides a guarantee of returning the closest matches (at the expense of speed). Because it is easy to implement, KNN can also serve as a proof of concept before building out a more complex approach such as a neural network, decision tree, or support vector machine.
Most ANN algorithms have tunable parameters that can optimize the algorithm. For example, within the Hierarchical Navigable Small Worlds (HNSW) algorithm there are parameters to manage the number of layers, the density of each layer, and the number of connections between and within layers.
These parameters influence the speed and space requirements of the ANN index, but generally, the faster and smaller these indexes are, the more likely they are to miss the nearest neighbors. This is measured with something called “recall,” where 100% recall means all of the nearest neighbors are found. KNN always has a 100% recall!
To measure ANN recall performance, run both searches against the same test data using the same definition of “nearest,” and count how many of the true nearest neighbors (found via KNN) the ANN algorithm missed.
We do not generally care too much about KNN's slower performance in this context, since we are treating it as a baseline reference. Further, the K-nearest neighbors for a given dataset only need to be computed once; after that, we can adjust ANN parameters and quickly re-run the ANN evaluations.
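As a sketch of what this measurement looks like, the snippet below compares a hypothetical ANN result set against the exact KNN result set and reports recall; the document IDs are invented for illustration.

```python
def recall_at_k(knn_ids: set, ann_ids: set) -> float:
    """Fraction of the true K nearest neighbors that the ANN search returned."""
    return len(knn_ids & ann_ids) / len(knn_ids)

# Ground truth from a one-time brute-force KNN run (IDs are illustrative).
knn_ids = {12, 7, 93, 41, 8, 77, 3, 55, 20, 64}
# Results from the ANN index under the current parameter settings.
ann_ids = {12, 7, 93, 41, 8, 77, 3, 55, 99, 100}

print(f"recall@10 = {recall_at_k(knn_ids, ann_ids):.0%}")  # -> 80%
```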
The KNN algorithm's workflow can be broken down into the following steps:
To find the “nearest neighbors,” we need some way of defining “nearest”; for this, we use a distance metric. Because we need to compare distances between a given point and all points in the dataset, KNN typically employs fast, low-complexity distance measures. Here are some commonly used measures:
The computational effort for Minkowski distance lies between that of Manhattan distance (easiest) and Euclidean distance (more effort), and the effort for Cosine similarity is generally greater than that for Euclidean distance.
In the following formulas, we will use the following notation:
| Formula Notation | Meaning |
|---|---|
| D | Distance between two points |
| pᵢ | The value of the i-th dimension of some point p |
| s | The source point for the distance calculation |
| x | One of the training points for the distance calculation |
| d | The number of dimensions in the space |
Euclidean distance, often considered the most intuitive metric, calculates the straight-line distance between two data points in a multi-dimensional space. Its formula, rooted in the Pythagorean theorem, is straightforward:
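$$D(s, x) = \sqrt{\sum_{i=1}^{d} (s_i - x_i)^2}$$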
Manhattan distance, also known as taxicab distance, measures the distance between two points by summing the absolute differences of their coordinates along each dimension. In two dimensions, visualize it as navigating a grid of city blocks: you can only move horizontally or vertically.
Manhattan distances become particularly useful in sparsely populated high-dimensional spaces, as described by Aggarwal et al. (“On the Surprising Behavior of Distance Metrics in High Dimensional Space,” 2001), or when features have significantly different scales, where squaring the differences would amplify the contribution of the larger-valued features (imagine a feature with values between 0 and 10 compared to one with values between 0 and 1).
The formula for Manhattan distance is as follows:
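$$D(s, x) = \sum_{i=1}^{d} |s_i - x_i|$$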
Minkowski distance serves as a generalization of both Euclidean and Manhattan distances. It introduces a parameter p (a value of at least 1) that dictates the nature of the distance calculation. The formula for Minkowski distance is given by:
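$$D(s, x) = \left( \sum_{i=1}^{d} |s_i - x_i|^p \right)^{1/p}$$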
When p is set to 1, the Minkowski distance is equivalent to the Manhattan distance, and when p is set to 2, it is equivalent to the Euclidean distance.
Cosine similarity measures the cosine of the angle between two vectors in a multi-dimensional space, indicating how similar the vectors are in terms of their direction. Unlike other distance metrics, cosine similarity focuses on the orientation rather than the magnitude, making it particularly useful in text analysis and high-dimensional data. The formula for cosine similarity is:
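$$\text{cosine similarity}(s, x) = \frac{s \cdot x}{\lVert s \rVert \, \lVert x \rVert} = \frac{\sum_{i=1}^{d} s_i x_i}{\sqrt{\sum_{i=1}^{d} s_i^2}\;\sqrt{\sum_{i=1}^{d} x_i^2}}$$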
If the space is “normalized” (all vectors have a unit length of 1), only the numerator (the dot product) needs to be computed. For a deeper explanation, see our overview of cosine similarity, its real-world use cases, and a deeper dive into the math.
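To tie these together, here is a minimal Python sketch of the four measures using NumPy; the example vectors are arbitrary, and the printed output confirms the Minkowski equivalences noted above.

```python
import numpy as np

def euclidean(s, x):
    return np.sqrt(np.sum((s - x) ** 2))

def manhattan(s, x):
    return np.sum(np.abs(s - x))

def minkowski(s, x, p):
    return np.sum(np.abs(s - x) ** p) ** (1.0 / p)

def cosine_similarity(s, x):
    return np.dot(s, x) / (np.linalg.norm(s) * np.linalg.norm(x))

s = np.array([1.0, 2.0, 3.0])
x = np.array([2.0, 0.0, 4.0])

print(euclidean(s, x), minkowski(s, x, p=2))   # identical: Minkowski with p=2 is Euclidean
print(manhattan(s, x), minkowski(s, x, p=1))   # identical: Minkowski with p=1 is Manhattan
print(cosine_similarity(s, x))                 # direction-based similarity in [-1, 1]
```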
Determining how many neighbors to find (the top K value) is application-dependent. If using KNN to validate ANN performance, the K value is generally the same as the ANN search's K value, which is itself application-dependent!
In general, a small K value can lead to results that are overly sensitive to noise in the data, while a large K value may fail to capture the most relevant neighbors.
Several techniques can help determine the optimal K value for vector search applications.
Employing these techniques can identify the top K value that optimally balances search accuracy and computational efficiency, thereby improving the vector search's performance and robustness.
The KNN algorithm, like any other tool, comes with its own set of advantages and drawbacks:
Limitations
At first glance, it would seem that knowing the “exact” neighbors is preferable to knowing only the “approximate” neighbors. But how long are you willing to wait to find them? Most vector search applications have a user waiting (patiently) on the other end for the result, and in RAG-type applications, these results must first be sent to a large language model (LLM) before being returned to the user.
Imagine a vector database with 1 million documents, each with 768 dimensions in a normalized vector space. We would use cosine similarity to compare distances, and because the data is normalized, we can take the dot-product shortcut.
To conduct a KNN search of one vector via “brute force,” we would need to compute 768 × 10⁶ multiplications plus 767 × 10⁶ additions (for the distance measurements), plus another 10⁶ comparison operations, for a total of 1.536 × 10⁹ (about 1.5 billion) operations. Though this could certainly be done in a highly parallel fashion, that is a lot of calculations! Beygelzimer et al. demonstrated a cover tree index structure (“Cover Trees for Nearest Neighbor,” 2006) that shows speedups of up to several orders of magnitude, though these are not widely used.
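As a rough sketch of what this brute-force search looks like in code, with random unit vectors standing in for the 1 million document embeddings, the dominant cost is a single large matrix-vector product followed by a top-K selection:

```python
import numpy as np

num_vectors, dims, k = 1_000_000, 768, 10   # ~3 GB of float32; shrink num_vectors to experiment
rng = np.random.default_rng(42)

# Stand-in corpus: random vectors normalized to unit length, so a dot product
# is equivalent to cosine similarity.
corpus = rng.random((num_vectors, dims), dtype=np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

query = rng.random(dims, dtype=np.float32)
query /= np.linalg.norm(query)

# Brute-force KNN: ~768 million multiplications and ~767 million additions inside
# the matrix-vector product, then a selection pass over the 1 million scores.
scores = corpus @ query                     # cosine similarity via the dot-product shortcut
top_k = np.argpartition(-scores, k)[:k]     # unordered indices of the k best matches
top_k = top_k[np.argsort(-scores[top_k])]   # order them by decreasing similarity
print(top_k)
```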
As an alternative approach, consider a widely used ANN algorithm such as HNSW, which typically needs to search O(log N) data points for high recall rates; in this case, that is about 20 (of the 1 million) points we need to compare against. The math works out to (768 + 767) × 20 + 20 = 30,720 operations: a factor of 50,000 fewer operations!
ANN can often be tuned to achieve high 90s percent recall without dramatically increasing computational complexity. Indexing techniques such as jVector can further improve latency and space requirements of ANN indexes.
While some vector search applications may “need” a guarantee of 100% recall accuracy, in practice the vector embedding algorithms and LLM training limitations end up being the constraining factor in application accuracy and general performance.
This guide explores the K-Nearest Neighbors (KNN) algorithm, highlighting its simplicity and versatility in handling classification and regression tasks. KNN is often a good starting point for building a machine learning application, but for most generative AI vector search applications it is often too slow, owing to the intensity of the computations. KNN is useful in evaluating and optimizing ANN search parameters, facilitating a testable assertion of ANN recall performance.
One example of such an evaluation is found in an open-source tool called Neighborhood Watch. It was built by DataStax engineering teams to create ground truth datasets for the high-dimensional embedding vectors that are more representative of what people are actually using today. One aspect of the tool is the generation of KNN results using a brute-force approach, which leverages GPU acceleration to compute these results efficiently. The KNN results can then be tested against various ANN configurations, enabling the tuning and optimization of recall and precision metrics for specific use cases. The blog post Vector Search for Production: A GPU-Powered KNN Ground Truth Dataset Generator gives a more detailed overview of this. Such real-world analysis of ANN recall performance against a KNN ground truth baseline has led DataStax to introduce a source_model parameter on vector indexes: testing found that recall and latencies improve when certain ANN parameters are adjusted based on the embedding model.
The K-Nearest Neighbors (KNN) algorithm is a simple, powerful approach to vector search and machine learning tasks. It's easy to implement, and its versatility makes it valuable for proof-of-concept applications and benchmarking Approximate Nearest Neighbors (ANN) search performance.
KNN's shortfall is its computational intensity. It doesn't scale well for large datasets and real-time applications. While KNN guarantees 100% recall accuracy, modern vector search solutions often favor ANN algorithms for their superior speed and efficiency.
When properly tuned, Approximate Nearest Neighbor algorithms achieve high recall rates with significantly reduced computational overhead. For most generative AI and vector search applications, ANN algorithms strike an optimal balance between accuracy and performance, making them the preferred choice in production environments.
KNN remains essential for evaluating and optimizing ANN search parameters to fine-tune vector search implementations for specific use cases.
KNN is a simple machine learning algorithm most commonly used in classification systems. It classifies new data points based on similarity measurements against previously stored points.
KNN is best suited for initial proof of concept applications, to measure the accuracy of other vector search algorithms (such as ANN), and for applications where a 100% guarantee of nearest neighbors is required.
In machine learning applications, KNN has the advantage of not requiring training. Instead, time is spent on feature selection and optimizing the value of K. The introduction of new data does not require retraining.
Some vector databases provide a KNN algorithm, but generally KNN is used only to evaluate and tune ANN vector search index parameters.
While the KNN algorithm can be used with large datasets, it is computationally much more expensive than alternative algorithms. It scales linearly with the number of data points, as opposed to many ANN search algorithms, which scale with the logarithm of the number of data points.