Diff Algorithms
However, heuristics like this trade diff minimality for performance, which is not always desirable. Sometimes a minimal diff is exactly what's required. An option disables these heuristics to always find a minimal diff, irrespective of the cost.

We established before that a diff algorithm finds one of many possible solutions. Given such a solution, we can discover more solutions by varying it locally and then selecting the best one according to some metric. This is exactly how Michael Haggerty's indentation heuristic works for text. For any given diff, we can often slide the edits up or down in a way that doesn't change the meaning of the diff: sliding a run of inserted or deleted lines by one position can yield a different diff with the same result. We call edits that can be slid up or down sliders.

The question is: how do we select the best slide? Michael collected human ratings for different sliders of the same diff and used them to develop a heuristic that matches these ratings: diff-slider-tools. However, this heuristic only works for text and is tuned towards code rather than prose, so I decided to make it optional. It can be enabled with an option.

The representation used during the execution of the diff algorithm has a surprising impact on the algorithm's performance and the readability of its results. This is not at all obvious, and it took me a while to figure out that the best approach is akin to a side-by-side view of a diff. You use two slices to represent the left side and the right side respectively: one kind of entry in the left-side slice represents a deletion and, in the right-side slice, an insertion; the other kind of entry is a matching element. This representation has four big advantages: it can be preallocated, the order in which edits are discovered doesn't matter, it's easy to mutate during post-processing, and it's easy to generate other representations from it.

Diff algorithms are relatively complicated by themselves, but they pale in comparison to what's necessary to provide a high-quality diff library.
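To make the slider idea concrete, here is a small, hypothetical example of my own (the `insert` helper is not part of any real library): adding a new function to a file can be attributed to the original text in two different ways that produce the same result.

```go
package main

import "fmt"

// insert returns a copy of x with run inserted before position i.
func insert(x []string, i int, run []string) []string {
	out := append([]string{}, x[:i]...)
	out = append(out, run...)
	return append(out, x[i:]...)
}

func main() {
	x := []string{"func a() {", "}"}

	// Two different insertions with identical results: the inserted run
	// can "slide" over the existing closing brace. A human would usually
	// prefer the second attribution, which keeps func b's body intact
	// instead of splitting the insertion across both functions.
	a := insert(x, 1, []string{"}", "", "func b() {"})
	b := insert(x, 2, []string{"", "func b() {", "}"})

	fmt.Println(a)
	fmt.Println(b) // identical to a
}
```

Both calls produce the same five lines; only the attribution of which lines are "new" differs, and a slider heuristic picks among such equivalent slides.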
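The side-by-side representation can be sketched roughly like this; the names and the `gap` sentinel convention are my own assumptions for illustration, not the library's actual API:

```go
package main

import "fmt"

const gap = -1 // hypothetical sentinel: no element on this side

// sideBySide stores a diff as two equal-length slices of indices into
// the original sequences x and y. A pair (i, gap) is a deletion of
// x[i], (gap, j) an insertion of y[j], and (i, j) a matching element.
type sideBySide struct {
	lx, ly []int
}

// unified renders the diff in a unified-style format.
func (d sideBySide) unified(x, y []string) []string {
	var out []string
	for k := range d.lx {
		switch {
		case d.ly[k] == gap:
			out = append(out, "-"+x[d.lx[k]])
		case d.lx[k] == gap:
			out = append(out, "+"+y[d.ly[k]])
		default:
			out = append(out, " "+x[d.lx[k]])
		}
	}
	return out
}

func main() {
	x := []string{"a", "b", "c"}
	y := []string{"a", "c", "d"}
	// A hand-written side-by-side diff of x and y. The slices can be
	// preallocated (len(x)+len(y) is an upper bound), entries can be
	// filled in any order, and post-processing such as sliding only
	// mutates entries in place.
	d := sideBySide{
		lx: []int{0, 1, 2, gap},
		ly: []int{0, gap, 1, 2},
	}
	for _, line := range d.unified(x, y) {
		fmt.Println(line) // prints " a", "-b", " c", "+d"
	}
}
```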
This article tries to explain what went into my new diff library, but there's still more that I haven't implemented yet.

1. Here is one real-world example of why worst-case scenarios are important: imagine you're breaking an existing feature in a way that triggers a worst-case scenario in a test. If the test runs for a very long time or runs out of memory, you're going to have to debug two problems instead of one.
2. See benchmark_comparison.txt for the source of these ratings.
3. No support for arbitrary sequences.
4. No support for unified diffs.
5. The diffmatchpatch API is very hard to use.
6. No support for structured results.
7. Quadratic memory use; for my test cases, this resulted in >30 GB of memory used.
8. The mb0 API is from before generics and is a bit cumbersome to use.
9. udiff has a very low threshold for when it stops searching for an optimal solution. This improves the speed, but it also results in relatively large diffs.
10. There's no single patience diff heuristic; instead, there are different implementations with different performance characteristics.
11. Stolen from https://blog.jcoglan.com/2017/03/22/myers-diff-in-linear-space-theory/
12. I can recommend https://blog.robertelder.org/diff-algorithm/ and this 5-part series: https://blog.jcoglan.com/2017/02/12/the-myers-diff-algorithm-part-1/
13. Myers, E.W. An O(ND) difference algorithm and its variations. Algorithmica 1, 251-266 (1986). https://doi.org/10.1007/BF01840446