Sorting by feature vector distance
Sometimes, describing the real world requires us to use a lot of numbers measuring a variety of features. We call a set of these numerical qualities a feature vector. For example, let’s consider the specs of a variety of apples. Suppose we’ve rated the apples for a variety of flavor characteristics from 0-100:
Since we have 5 different numerical features, we can describe the apple characteristics in 5-dimensional vector space.
Suppose, given an apple from this list with 5-dimensional flavor characteristics, we want to find the top 3 other apples in our system that are most similar to it in flavor. In other words, we want to measure the “flavor distance” between apples, and find the 3 apples with the shortest distance to our original apple.
Fortunately, the simplest way to calculate distance in any number of dimensions is one we all learned in middle school: a variation on the familiar Pythagorean Theorem known as Euclidean distance, which simply extends that theorem from 2 to an arbitrary number of dimensions. As in the Pythagorean theorem, we take the difference between points in each dimension, square the differences, then add up the squares. Finally, we take the square root of that sum. So to find the distance between vectors q and p in n dimensions:
If we’re modeling these apples in Ruby and ActiveRecord, we might have a model that looks like:
Now that we have a basic approach for finding the distance between 2 flavor vectors, what if we want to dramatically increase the number of apples in our database? While finding the distance between a single apple and every other apple runs in O(n) time, comparing each apple to every other apple runs in O(n2) time, which becomes prohibitive rather quickly. In addition, a naive approach using this function would be extra slow in Rails:
This is incredibly slow, because not only are the comparisons
To speed this up, we can calculate and store these comparisons in a separate table, so we can query based on distance and avoid the overhead of instantiating that many AR objects. However, even with this approach we’ll need to take the hit of recalculating the distances every time a change is made to the flavor vectors of one of the apples. Fortunately, Postgres gives us tools to avoid ActiveRecord altogether.
Postgres 9 introduces the
column type, which can be used to store our feature vectors. The Rails Postgres
adapter allows us to use this column type in our AR model.
We’ll need to maintain this flavor_vector column whenever a field changes, or else we can just delete the individual columns for the flavor attributes altogether.
Taking this a step further, Postgres’
cube_distance, which calculates the distance between two
vectors in a query. This allows us to sort by distance and return only the
top 3 distances! These distances can be cached at the database level, making
subsequent queries even faster.
This allows us to rewrite our
top_apple_similarities function as follows:
We’re still in
cube_distance function as a column allows us to inspect the computed result on
the returned model.
By leveraging the database to do the hard work, we can achieve significant gains in performance even without dramatically changing the algorithm.
Main image from Flickr user originalbliss.