It’s time to return to our photo album. In the previous installment, we used Rust to build a simple command-line tool that scans our picture library and calculates a 64-bit perceptual hash of every image. We left with an open question: How can we use those hashes to find similar photos in our album?
One intuitive solution would be to put the images’ hashes into a data structure that keeps data sorted and easy to search, such as a B-tree. Given a previously unseen hash, the data structure would efficiently tell us which hashes are the closest to the one given in the query.
Unfortunately, treating perceptual hashes like numbers won’t get us far. Similarity and distance in the space of hashes have little to do with the concept of difference we know from arithmetic. Two similar hashes might be very far from one another when interpreted as unidimensional numbers.
Consider `0x8001`. It’s only one bit away from both `0x8003` and `0x0001`, but it’s much greater than the latter number. We have to try something else.
Let’s return to what similarity means in terms of perceptual hashes. The more two images look alike, the fewer positions there are on which the binary representations of their hashes differ. What we’re describing here is the Hamming distance, and we can implement it as follows.
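A sketch of such a function; the `Hash` and `Distance` aliases are my naming assumption, with `u64` matching the 64-bit pHash from the previous post:

```rust
/// A 64-bit perceptual hash.
type Hash = u64;

/// The number of bit positions on which two hashes differ.
type Distance = u32;

/// The Hamming distance between two hashes: XOR them and count
/// the bit positions that remain set.
fn distance(a: Hash, b: Hash) -> Distance {
    (a ^ b).count_ones()
}
```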
Note that our `distance` function is a valid metric. This means that for arbitrary hashes `a`, `b`, and `c` the function satisfies the following conditions:

- `distance(a, b) >= 0`,
- `distance(a, b) == 0` if and only if `a == b`,
- `distance(a, b) == distance(b, a)`, and
- `distance(a, b) <= distance(a, c) + distance(c, b)`.
The last property is particularly interesting. It’s called the triangle inequality, and we will return to it shortly. For now, let’s talk about a particular kind of tree.
The BK-tree is a data structure described 45 years ago by Burkhard and Keller. The trees we typically work with are optimized to efficiently look up a given key. A BK-tree, on the other hand, can tell us which keys are most similar to the given one without a costly exhaustive search.
Our BK-tree is either empty, or it is a node.
A node contains a hash and non-empty child nodes — its subtrees. There is an important constraint: all hashes in a subtree have exactly the same distance to the hash in the parent node. We’ll illustrate it with an example shortly.
The distance between two different hashes must be between 1 and 64. Therefore, each node can have up to 64 child nodes, one per possible distance value. We model it as a dictionary, `BTreeMap<Distance, Node>`. All hashes in a child node at a key `k` must have distance `k` to the hash in the parent.
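Put together, the types could look like this; a sketch reusing the `Hash` and `Distance` aliases from above, where the exact names, apart from `BTreeMap<Distance, Node>`, are my assumption:

```rust
use std::collections::BTreeMap;

/// A BK-tree is either empty or a single root node.
enum BKTree {
    Empty,
    NonEmpty(Node),
}

/// A node stores one hash; the subtree under key `k` contains only
/// hashes at distance exactly `k` from that hash.
struct Node {
    hash: Hash,
    children: BTreeMap<Distance, Node>,
}
```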
Let’s look at an example. Below, we see a node storing hash `0x2af7`. It has two child nodes. All hashes in the `sub_one` node have a distance of exactly 1 from `0x2af7`. Analogously, all hashes in the `sub_six` node have a distance of exactly 6 from `0x2af7`.
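In code, such a node could be built like this; the child hashes here are hypothetical, picked so the distances work out:

```rust
fn example() -> Node {
    // 0x2af6 differs from 0x2af7 on exactly 1 bit.
    let sub_one = Node { hash: 0x2af6, children: BTreeMap::new() };
    // 0x2ac8 differs from 0x2af7 on exactly 6 bits.
    let sub_six = Node { hash: 0x2ac8, children: BTreeMap::new() };
    Node {
        hash: 0x2af7,
        children: BTreeMap::from([(1, sub_one), (6, sub_six)]),
    }
}
```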
A node storing a single hash can be built using the following function.
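A sketch, using the types assumed above:

```rust
impl Node {
    /// Builds a node that stores `hash` and has no children yet.
    fn single(hash: Hash) -> Node {
        Node {
            hash,
            children: BTreeMap::new(),
        }
    }
}
```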
Let’s consider adding elements to our tree. The insertion function must differentiate between empty and non-empty trees.
In the case of an `Empty`, we replace it with a tree with a single hash. Otherwise, we call the following `insert` function on the nested node.
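A sketch of that dispatch, under the same assumed type layout:

```rust
impl BKTree {
    fn insert(&mut self, new: Hash) {
        match self {
            // An empty tree becomes a tree holding a single hash.
            BKTree::Empty => *self = BKTree::NonEmpty(Node::single(new)),
            // Otherwise, delegate to the node's insert, shown below.
            BKTree::NonEmpty(node) => node.insert(new),
        }
    }
}
```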
If the `new` hash is equal to the value stored in `self`, we’re done inserting. Otherwise, we calculate `new_dist` as the distance between the hashes. Following the constraint guiding the structure of the tree, we insert `new` into the child node stored under the `new_dist` key. If the child node is absent, we create one using `single`.
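A sketch of that logic:

```rust
impl Node {
    fn insert(&mut self, new: Hash) {
        // The hash is already stored in this node; nothing to do.
        if new == self.hash {
            return;
        }
        let new_dist = distance(new, self.hash);
        match self.children.get_mut(&new_dist) {
            // A subtree at this distance exists; insert recursively.
            Some(child) => child.insert(new),
            // No subtree at this distance yet; create a single-hash one.
            None => {
                self.children.insert(new_dist, Node::single(new));
            }
        }
    }
}
```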
Now, it’s time to query our tree.
We want to define a `find` function with two arguments:

- the `needle` hash to look up in the tree, and
- our distance tolerance.
The function should return all hashes in the tree whose distance to the sought hash is within the given tolerance.
In the case of an `Empty`, there is nothing we can return. Otherwise, we call `find` on the nested node. For now, let’s make it visit all the children and exhaustively search the tree.
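A sketch of both functions; collecting matches into a `Vec` through a mutable reference is my assumption:

```rust
impl BKTree {
    fn find(&self, needle: Hash, tolerance: Distance) -> Vec<Hash> {
        match self {
            // An empty tree contains no matches.
            BKTree::Empty => Vec::new(),
            BKTree::NonEmpty(node) => {
                let mut result = Vec::new();
                node.find(needle, tolerance, &mut result);
                result
            }
        }
    }
}

impl Node {
    fn find(&self, needle: Hash, tolerance: Distance, result: &mut Vec<Hash>) {
        let needle_dist = distance(needle, self.hash);
        // Report this node's hash if it's close enough to the needle.
        if needle_dist <= tolerance {
            result.push(self.hash);
        }
        // Exhaustive search: visit every child, with no pruning yet.
        for child in self.children.values() {
            child.find(needle, tolerance, result);
        }
    }
}
```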
Finally, to create an empty `BKTree` we simply have to return `Empty`.
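For instance, as an associated function (the name `new` is my assumption):

```rust
impl BKTree {
    fn new() -> BKTree {
        BKTree::Empty
    }
}
```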
We can now check whether our tree passes a small test case.
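A small test along these lines, with hypothetical hash values:

```rust
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn finds_hashes_within_tolerance() {
        let mut tree = BKTree::new();
        for hash in [0x2af7, 0x2af6, 0x2ac8, 0xffff] {
            tree.insert(hash);
        }
        // Only 0x2af7 itself and 0x2af6 (one bit away) fall within
        // a tolerance of 1.
        let mut found = tree.find(0x2af7, 1);
        found.sort();
        assert_eq!(found, vec![0x2af6, 0x2af7]);
    }
}
```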
Running it with `cargo test` shows no errors.
So far, so good, but our `find` function is still searching through the entire tree. Is there a way of pruning some subtrees and visiting fewer nodes?
Let’s return to the triangle inequality.
The `distance` function is a valid metric, so for arbitrary hashes `a`, `b`, and `c`, the following inequality is always true:

`distance(a, b) <= distance(a, c) + distance(c, b)`
In the `Node::find` function we’re attempting to find `needle`. Let’s consider one of the subtrees in `children`. The subtree at key `dist` has all of its hashes `dist`-distant from `self.hash`. We can apply the triangle inequality to every hash `sub_hash` stored in the subtree `children[dist]`:

`distance(needle, self.hash) <= distance(needle, sub_hash) + distance(sub_hash, self.hash)`
The distance between `needle` and `self.hash` is `needle_dist`. The distance between every `sub_hash` and `self.hash` is `dist`:

`needle_dist <= distance(needle, sub_hash) + dist`

We can simplify:

`distance(needle, sub_hash) >= needle_dist - dist`
We now have a lower bound for distances between `needle` and all hashes in the subtree. Given our `tolerance`, we can only find some similar hashes in the `children[dist]` subtree if:

`distance(needle, sub_hash) <= tolerance`

By putting it together, we end up with the following condition for visiting the subtree:

`needle_dist - dist <= tolerance`
By adding it as an `if` statement in the `for` loop in `find` we can prune some nodes during our search.
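Inside `Node::find`, the loop might become something like this; I use `saturating_sub` here to sidestep the unsigned underflow the post tackles in a moment:

```rust
// Inside Node::find: only descend into subtrees whose lower bound
// does not already exceed the tolerance.
for (&dist, child) in &self.children {
    if needle_dist.saturating_sub(dist) <= tolerance {
        child.find(needle, tolerance, result);
    }
}
```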
We can do better though.
Let’s apply the triangle inequality again, this time to a different edge:

`distance(sub_hash, self.hash) <= distance(sub_hash, needle) + distance(needle, self.hash)`

We plug in our variables:

`dist <= distance(needle, sub_hash) + needle_dist`

Just like above, given that `distance(needle, sub_hash)` must be less than or equal to `tolerance`, we obtain:

`dist - needle_dist <= tolerance`
We end up with two inequalities helping us prune subtrees during our search:

`needle_dist - dist <= tolerance`
`dist - needle_dist <= tolerance`
We can interpret them as a condition that `dist` must belong to the interval:

`[needle_dist - tolerance, needle_dist + tolerance]`
Since we represent all the values above using unsigned integers, subtraction poses a risk of an underflow. To prevent this, we add `dist` and `needle_dist` to both sides of the first and the second inequality, respectively. The result is:

`needle_dist <= tolerance + dist`
`dist <= tolerance + needle_dist`
Let’s add them as a condition to the `find` function. The resulting code looks as follows.
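A sketch of the final `Node::find` with both pruning conditions, superseding the exhaustive version above:

```rust
impl Node {
    fn find(&self, needle: Hash, tolerance: Distance, result: &mut Vec<Hash>) {
        let needle_dist = distance(needle, self.hash);
        if needle_dist <= tolerance {
            result.push(self.hash);
        }
        for (&dist, child) in &self.children {
            // Visit a subtree only if it can contain hashes within the
            // tolerance; both checks use addition, so no underflow.
            if needle_dist <= tolerance + dist && dist <= tolerance + needle_dist {
                child.find(needle, tolerance, result);
            }
        }
    }
}
```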
That’s it. The condition we added in the `for` loop allows us to prune a lot of nodes without ever visiting them; as we showed above, they cannot contain hashes within our tolerance. The lower the value of our tolerance, the more nodes we will ignore.
Our tree implementation is ready. All we need now is a bit of extra code to handle I/O interactions. See the source code in its entirety on GitHub.
I used the BK-tree to scan my photo album for potential duplicates. Some results were exactly what I hoped for. In the GIF above you see a comparison of two images mentioned towards the end of the previous blog post. The similarity is obvious and pHash captures it; the Hamming distance between their hashes is 10.
But there were some results which I found quite surprising. Below you see two photos I took two years apart. Strikingly, the Hamming distance between their hashes is also equal to 10. It’s fascinating that the pHash algorithm was able to preserve the similarity of their composition.
Let’s see how many more surprising similarities I’ll be able to find in my archive in the future. Luckily, even if my album continues to grow and gets scattered across several backup disks, Rust, pHash, and BK-trees will allow me to quickly find what I’m looking for.