Differentiable Memory
Introduction
Before deep learning was a thing, if you had \(N\) vectors and you wanted to match a noisy query against the closest one, you would use kNN (\(k\)-nearest neighbours). This means comparing the input to all candidates, and usually weighting the final answer by the distance \(d_i\) to each of the top \(k\) matches. If \(k = N\) and the weighting has the form \(\exp(-d_i / \rho)\), this becomes the Nadaraya–Watson kernel regressor, which turns out to have the same form as the venerable “attention” mechanism from deep learning.
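The correspondence above can be sketched in a few lines of NumPy. This is a minimal illustration, not anyone's production code: `rho` plays the role of a temperature, the squared Euclidean distance stands in for \(d_i\), and normalising the \(\exp(-d_i/\rho)\) weights is exactly a softmax over negative distances.

```python
import numpy as np

def nadaraya_watson(query, keys, values, rho=1.0):
    """Kernel regression with k = N: weight every stored value by
    exp(-d_i / rho), where d_i is the squared distance to the query.
    The normalised weights are a softmax over -d / rho, i.e. attention."""
    d = np.sum((keys - query) ** 2, axis=1)  # distances d_i to all N keys
    w = np.exp(-d / rho)
    w = w / w.sum()                          # normalise: softmax(-d / rho)
    return w @ values                        # weighted average of values

# Toy memory: four stored (key, value) pairs and one noisy query.
keys = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
values = np.array([0.0, 1.0, 2.0, 3.0])
query = np.array([0.9, 0.1])                 # noisy version of keys[1]
estimate = nadaraya_watson(query, keys, values, rho=0.1)
```

With a small `rho` the weights concentrate on the nearest key, so the estimate is close to the value stored at `keys[1]`; a large `rho` smooths the answer toward the mean of all values, mirroring how a softmax temperature trades sharpness for smoothness in attention.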