GuideJan 12, 2024

What is the Nearest Neighbor Algorithm? Method & Examples

Nearest neighbor algorithm powers the foundation for vector search functionality, how does nearest neighbor enhance results and power generative AI?

Sign Up for Astra

Finding the nearest neighbor with the nearest neighbor algorithm

Forget about artificial intelligence (AI) and all that fancy math for a moment. Let’s talk about cheese. Specifically when you are creating a charcuterie board. If you’re not familiar, the United State’s version of a charcuterie board is (literally) a board of wood or stone with a spread of meats, cheeses, and other tasty bits. If you’re doing it right, each meat on the board has been meticulously paired with a specific cheese to complement flavor, texture, and appearance (we eat with our eyes, you know). Creating a board is as much an art as it is a culinary delight.

What makes one cheese better than another? What makes one board better than another? It turns out that all cheeses can be categorized by a few distinct characteristics. And one can use these characteristics to craft the perfect board. You could even follow a theme like “cheddar”, “goat’s milk”, or “high-contrast”.

If you’ve spent any time in machine learning (or AI) “categorized” probably tipped you off to where we are headed with the cheese board example. In the ML world, information about each cheese would be referred to as the data. The different characteristics of a cheese are known as features. To find good cheese pairings, you use a nearest neighbor algorithm to analyze the data and features.

What is the nearest neighbor algorithm?

Let’s say you wanted to build an AI application that takes the description of a board and finds complementing cheeses for it. A compliment would be a cheese that shares similar characteristics. Cheese that share similar characteristics will be our definition of the nearest neighbor. Cheddar cheeses share similar attributes like how “smelly” they are and their texture. Thus, they are neighbors.

On the road to using AI to find complementing cheese, we are going to need a large amount of data about cheese. So we will index cheese.com. This is considered factual information about cheese but also includes many opinionated discussions about cheese. All this data together will be a wealth of information for making decisions.

There will be no “data management” of the stored information. We won’t have Cheesemongers combing over new cheese data tagging each entry as a “good fit for these themes” or “complements these other cheeses”. That's the job of the nearest neighbor algorithm!

Howdy, neighbor! 🤠

No better way to do semantic similarity than with Astra DB.

How does the nearest neighbor algorithm work?

A cheese expert would consider all the characteristics together to classify a given cheese. A nearest neighbor algorithm does something similar but in a natural language processing (NLP) way. It doesn’t know what the word “smelly” means. Instead, it compares words (or phrases) from different cheeses against one another. Depending on how similar they are, a probability is returned. That similarity of a word or phrase is known as semantic similarity. That is a core attribute of all nearest neighbor algorithms.

The return from a nearest neighbor algorithm will never be a definitive “I am positive these cheeses are a perfect fit”. It will be a probability that the two cheeses are a good fit as a number with a bunch of decimals from zero (0) to one (1) (zero being don’t ever put these cheeses next to one another or a civil war will break out).

A nearest neighbor algorithm analyzes all the data on every request. Classification, categorization, and everything in between will happen at the time of search (ie: just-in-time results). The search needs to be able to handle an unknown amount of data and an unknown amount of users at any given second. That’s fancy talk for saying it needs to be really really fast. If you’ve ever attempted text comparisons on a large dataset you will know that it’s anything but performant. To overcome this, the text is converted to a collection of numbers called vectors. Computers are very good at comparing numbers ;).

A nearest neighbor algorithm plots all vectors in a multi-dimensional space and uses each of the points to find a neighboring point that is nearest. Different types of nearest neighbor algorithms consider a neighboring point differently (more on that later).

Continuing with our example application, we gathered a bunch of data about cheese as unstructured text (individual documents) in an S3 bucket. Next, each document needs to be converted to a numeric value.

The act of converting text to numerics is known as tokenization. Typically the numeric value is quite a few numbers, like 1562 for each text. We won’t get too deep into the process of vectorization here, but you can learn more in our “What are vector embeddings?” guide.

With the cheese data “vectorized” and stored in a vector database, now we can calculate complementing cheeses (aka nearest neighbors). First, we would take the description provided as input and generate its vectors just like the cheese data was. Those generated vectors will be the context for calculating where their nearest neighbors are.

Each vector in the provided description represents something about a cheese, so another cheese that has the same (or very close) vectors would be a complementing cheese. Say we provided the description “I want a board that includes brie and pepper jack” to the application. Disregard the stop words like “I”, “want”, “that”, etc. Those are discarded. Vectorize the words “board”, “brie”, and “pepper jack”. Anything in the database that has vectors similar to those words is probably a neighbor - complementing cheeses. The search would hopefully return suggestions like cheddar, feta, and maybe colby. It all depends on how cheese.com describes brie and pepper jack, as well as how others discuss the two cheeses.

With the basics of nearest neighbor down, let’s look at the different algorithm types and some common conundrums the calculation runs into.

Finding the nearest neighbor is the process of plotting all the vectors in all their dimensions and then comparing a context collection of vectors to them. Using a simple coordinate system you can mathematically measure how far one point is from another (known as their distance).

The typical American neighborhood is made up of connecting streets and cul-de-sacs. Along every street is a house with an address. When someone speaks of their “neighbors” they could reference the house next door or they could be talking about a house on the other side of the neighborhood. The context is the boundaries of the neighborhood. Neighborhoods come in all shapes and sizes so you need some reference or context when someone says “my neighbor”.

Depending on the precision your app needs when calculating the nearest neighbor, you choose the best fitting algorithm (aka how to establish boundaries).

K-nearest neighbors (KNN)

The goal of KNN is usually to classify some piece of data against a large set of labeled data. Labeled data means a decision about what each item in the data set is, has been previously made. In the above example, the data had no labels that is called unsupervised data. You didn’t know what each piece of text said or how it represented something about cheese. We could have gone through that data and added a label of what cheese was being talked about. Then it would be supervised data and a good candidate for KNN classification.

What if we had a picture of cheese and needed an AI application to figure out what family it was a part of? You could vectorize a whole bunch of cheese pictures with family labels on each, and use those to compare to your picture’s vectors.

The “K” in KNN is a representation of bounds, meaning how many pictures of cheese are you willing to consider? Once those pictures are found, what family has the majority in the group? That’s a simple way to classify something like text or a picture against supervised data.

The return from KNN is a prediction of how well the provided data fits the existing data label. There are usually two values returned, a percentage and a classifier. Our AI application would then need to decide if the percentage is strong enough to apply the given classification or if some other action needs to be taken, to continue.

Go Ahead and Throw All You Want at an Embedding

Astra DB handles nearest neighbor functionality with ease.

Approximate Nearest Neighbor (ANN)

The cheese board example above used ANN to find complementing cheeses to the given description of cheese. It’s one of the most common uses of nearest neighbor, mostly because it works well on non-labeled (unsupervised) data. Today’s AI applications try to use as much data as possible to make informed decisions, but it is such an investment of time and effort to label everything that it’s easier to adapt your algorithm(s).

The “approximate” in ANN should tip you off to the precision of the algorithm. The return will be approximately what data is closely related to the input, but hallucinations are real so be careful.

Fixed radius nearest neighbor

The “K” in K-nearest neighbors is a bound of how many points in space you are willing to consider before finding the majority. The focus is the number of points. Fixed radius is an extended approach to KNN where you are looking at the number of points, but it is limited to a certain distance. This comes with the probability that no points may be found. But if the application is willing to accept that, and can create a new classification, then it’s a quick way to limit the number of data points to consider. Limiting the number of points to consider is an easy way to speed up the overall calculation.

The radius is usually provided as a fixed value, along with the context value. The context value is a vector and the radius is a measure of distance from that vector.

Partitioning with k-dimensional tree(k-d tree)

When data is tokenized (converted to vectors) the number of dimensions is chosen. You choose the number of dimensions based on how accurate you need a search to be. But like all things, there is a trade-off. The more dimensions an embedding has the longer it’s going to take to compute a nearest neighbor. You have to find a balance between the two for your application.

When you have a lot of dimensions (possibly thousands or more), k-d tree comes in quite handy. After the vectors are plotted in a single space (before the comparison of nearest neighbor distance takes place) k-d tree splits the single space into a number of spaces (called partitions). The number of spaces is the “K” part of the name. How the spaces are sorted (so their shared context is not lost) is an implementation choice (median-finding sort).

Ideally, the number of leaves that make up each space in the tree is balanced. This makes for uniform, predictable search performance.

The power of using the k-d tree with nearest neighbor is that while traversing the tree when an initial neighbor is found the algorithm is given a sense of proximity. With that knowledge, it can choose to not search large portions of the tree because it knows those leaves are too far. That can really speed up a search.

If you haven’t heard, Datastax Astra DBCassandra is a very good database for storing and searching vectors. Combine serverless, Storage-Attached Indexing (SAI), and the speed of Astra’s network, and you have a vector search ready for production.

If you are looking for a vector database that can enable your generative AI applications with nearest neighbor functionality with ease, have a look at DataStax Vector Search, which uses the open source JVector project.

You can get started with DataStax Astra by registering here, and learn more about the non-vector capabilities of the NoSQL Astra platform here.

Astra DB: The Performant Vector Database

Making text comparisons on a large dataset performant since 2020

One-stop Data API for Production GenAI

Astra DB gives JavaScript developers a complete data API and out-of-the-box integrations that make it easier to build production RAG apps with high relevancy and low latency.