How Collection Diffing works in Swift

How Collection Diffing works in Swift

Ordered Collection Diffing is a feature added in Swift 5.1 that allows you to calculate and apply the difference between two collections. Using diffing libraries is common in iOS for a few reasons, the most popular one being to handle the addition and removal of elements in UITableViews. With the addition of this feature, developers can now diff Collections without having to bother with external libraries.

Have you wondered what efficiently diffing something looks like? In this article, we'll see and analyze the diffing APIs as well as the Myers's Diffing Algorithm used by the Swift Standard Library -- the same used in git.

Collection Diffing APIs

The diffing APIs are available as an extension for any BidirectionalCollection, with an abstracted helper type for ones whose's elements are Equatable:

extension BidirectionalCollection {
  public func difference<C: BidirectionalCollection>(
    from other: C,
    by areEquivalent: (C.Element, Element) -> Bool
  ) -> CollectionDifference<Element>
  where C.Element == Self.Element {
    return ... // Article spoilers removed!
  }
}
extension BidirectionalCollection where Element: Equatable {
  public func difference<C: BidirectionalCollection>(
    from other: C
  ) -> CollectionDifference<Element> where C.Element == Self.Element {
    return difference(from: other, by: ==)
  }
}

The return type for diffing operations is an CollectionDifference object -- basically an array of "moves" necessary to turn the first collection into the second one. The "moves" are represented as an enum of additions and removals, and the CollectionDifference object holds all of them.

public struct CollectionDifference<ChangeElement> {
  /// A single change to a collection.
  @frozen
  public enum Change {
    case insert(offset: Int, element: ChangeElement, associatedWith: Int?)
    case remove(offset: Int, element: ChangeElement, associatedWith: Int?)
  }

  public let insertions: [Change]
  public let removals: [Change]
}

CollectionDifference is a Collection itself, which means that it can be iterated. An important aspect of this is that the iteration will happen in a specific order: First, all the removals from highest to lowest offset, followed by insertions from lowest to highest. This is both for performance reasons (removing elements from the end of an array is faster from the beginning, and adding elements to the front when there are fewer elements is faster than when there's a lot), and for visual reasons as we'll see when the underlying algorithm is inspected.

We can see this behavior with a real example. Let's calculate the difference between these two strings:

let a = "ABCABBA"
let b = "CBABAC"

There are tons of ways of transforming a into b, but our interest is doing so in the least amount of steps. When calling difference(_:) and iterating the result, we'll see one of these solutions:

let diff = b.difference(from: a)
diff.forEach { print($0) }
// remove(offset: 5, element: "B", associatedWith: nil)
// remove(offset: 1, element: "B", associatedWith: nil)
// remove(offset: 0, element: "A", associatedWith: nil)
// insert(offset: 1, element: "B", associatedWith: nil)
// insert(offset: 5, element: "C", associatedWith: nil)

(We'll leave the explanation of the algorithm itself for the last section because it's a bit complicated.)

From here, you can use the resulting CollectionDifference object as you see fit, like animating the addition and removal of rows from an UITableView. The neat thing is that Swift provides an applying(_:) method that allows you to directly apply a CollectionDifference object into a Collection:

let newB = a.applying(diff)
//CBABAC

This particular example looks useless, so here's something where applying diffs is useful: The same three-way merge process that git uses as a strategy to merge commits!

// Split the contents of the sources into lines
let baseLines = base.components(separatedBy: "\n")
let theirLines = theirs.components(separatedBy: "\n")
let myLines = mine.components(separatedBy: "\n")

// Create a difference from base to theirs
let diff = theirLines.difference(from:baseLines)

// Apply it to mine, if possible
guard let patchedLines = myLines.applying(diff) else {
    print("Merge conflict applying patch, manual merge required")
    return
}

// Reassemble the result
let patched = patchedLines.joined(separator: "\n")
print(patched)

How applying() works

The applying(_:) method works by creating a new, empty Collection instance that gets filled as the changes are iterated. Because the changes are iterated in the specific high removals -> low removals -> low insertions -> high insertions, the algorithm is able to build the final result through ranges instead of individual elements. For example, if you have a 10 character string and the first step is to remove the 5th character, we know right off the bat that the final result will contain the range 5...9 in that same order, so "removing the 5th character" is equivalent to adding everything that comes after it. The same logic is applied until all the changes are processed. Because of the interaction with ranges, applying(_:) is only available to RangeReplaceableCollections.

It is possible for this process to fail, for example, when the desired element to be removed doesn't exist, which is possible in cases like the git example due to conflicting information in the data. In these cases, applying(_:) returns nil.

The actual method is a bit long and hard to read, so I'll leave the link to it instead of showing it here.

The complex part: How difference() works

Now that we've seen how the APIs work, we're ready to take a look at what Swift does to generate the final CollectionDifference object.

As stated before, there are many ways to transform a Collection into another one. For example, to transform a: ABCABBA to b: CBABAC, we could simply remove every single element from a and append the ones from b. Easy peasy!

However, at a best case of a.count + b.count "moves", that would be very inefficient. One way to find the best way possible to do this change is to transform this into a graph problem. Imagine that we have a matrix where the columns are characters from a and rows are characters from b:

-  A  B  C  A  B  B  A

-

C

B

A

B

A

C

In this matrix, the top-left position 0,0 represents the original ABCABBA string, while the bottom-right position represents the final CBABAC string. Moving right in this matrix represents removing an element from the original string, while moving down represents adding an element from the new one. Each of these movements counts as a step, but if the elements from both strings at a specific position match (like the position 1,3 - two Cs), you can diagonally move down and to the right for "free", as no changes are necessary for that position.

Given this description, we can transform this problem into another one: How can we reach the bottom right of the matrix in the least amount of steps (a.k.a maximizing the number of diagonal moves)?

If you ever studied graphs you might remember that finding the shortest path between two nodes in an unweighted graph is a classical computer science problem! This problem is solved by the popular graph search algorithm Breadth-First Search -- because the graph is unweighted, taking turns on checking the neighbors of each node is guaranteed to discover the shortest path between a source node to a target one. Applying the algorithm in the previous example will show us a path that transforms our string in 5 moves (remember that diagonal moves do not count as moves):

-  A  B  C  A  B  B  A

- 0  1  2

C          2  3    
         
B                3
     
A                4

B                   4

A                      4

C                      5

The implementation behind the true diffing algorithms is very similar to this, but with a few (unfortunately, relatively complicated) important twists. Although we're looking for the shortest edit path between two Collections, we're not exactly looking for any short path. Especially for visual applications like git, the overall order of how deletions and insertions are present is important. When you're looking at the merge diff in git, you would expect the diff to be somewhat "intelligent" in terms of knowing what was really added and deleted by your changes instead of randomly saying that things were "added" and "deleted" because it fulfills the shortest path condition.

Good:   class Foo {                  Bad:    class Foo {
          init(bar: String) {                  init(bar: String) {
            self.bar = bar                       self.bar = bar
          }                                +   }
      +                                    +
      +   func inspect() {                 +   func inspect() {
      +     print(bar)                     +     print(bar)
      +   }                                    }
        }                                    }

Diffing algorithms carefully choose the paths that Breadth-First Search is going to traverse in order to provide a meaningful diff.

Swift's Diffing Algorithm: Myers's Diffing Algorithm

Swift uses Myers's Diffing Algorithm (paper here) as part of the diffing APIs. Given two collections, the result is the Shortest Edit Script that transforms A into B. Much like the abstracted CollectionDifference type, the Shortest Edit Script is a list of additions and removals. In order to provide a good diff, Myers's prefers deleting over inserting, and it differs from Breadth-First Search by prioritizing what seems to be "good" paths instead of blindly switching nodes like the original algorithm.

This is done by adding two additional components to the matrix: How deep we are into the graph (number of moves, in the X-axis), and the ratio of additions versus deletions, which is a value k that increases by one every time we remove a value (moving right), and decreases by one every time we add a value (moving down), in the Y-axis. This is better seen if we shift a graph by 45 degrees (note that this graph doesn't contain the diagonal connections that make the paths shorter):

|     0     1     2     3     4     5
    +--------------------------------------
 5  |                                   5,0
    |                                 /
 4  |                             4,0 
    |                           /     \
 3  |                       3,0         4,1
    |                     /     \     /
 2  |                 2,0         3,1
    |               /     \     /     \
 1  |           1,0         2,1         3,2
    |         /     \     /     \     /      ...
 0  |     0,0         1,1         2,2
    |         \     /     \     /     \
-1  |           0,1         1,2         2,3
    |               \     /     \     /
-2  |                 0,2         1,3
    |                     \     /     \
-3  |                       0,3   ...  1,4

To find the perfect path, the algorithm iterates the possible depth values. For each depth, it iterates the possible k values at that depth (-d...d) and determines the best move it can make at that point for that k value.

The best move is determined based on a few rules we'll see soon, but the important part here is that this is done based on the decisions made on the previous depth iteration. To make the current iteration not mess with the previous moves, the k values are iterated in steps of 2.

var result = [KValues]()
var currentKValues = KValues()
currentKValues[1] = 0 // Ignore this for now, explained later!

outer: for d in 0...(a.count + b.count) {

    result.append(currentKValues)
    let previousKValues = currentKValues
    currentKValues = KValues()

    for k in stride(from: -d, through: d, by: 2)
        // Determine the best moves for the current K values
    }
}
return backtrackPath(fromKValues: result) // Additions and Deletions

"Determining" the best move means deciding to go right or down from the positions recorded in the previous depth iteration. We should always move right if k == -d (bottom of the currently "accessible" graph), and always down if k == +d (right edge of the currently "accessible" graph). Otherwise, we should pick the position based on the current adjacent k's best values that leads us to the highest x value -- this way, we prioritize deletion.

let currX: Int
if k == -d {
    currX = previousKValues[k+1] // Moving down
} else if k == d {
    currX = previousKValues[k-1] + 1 // Moving to the right
} else if previousKValues[k-1] < previousKValues[k+1] {
    currX = previousKValues[k+1] // Moving down
} else {
    currX = previousKValues[k-1] + 1 // Moving to the right
}
let currY = currX - k

This loop is the reason why we need to start the algorithm with the 1 k value set to 0. This makes sure that the first iteration determines that the "first best move" is to go down to 0,0 -- the beginning of the graph. This allows everything else to proceed as normal.

Once the best move is found, we try to move diagonally as much as we can to claim our free steps.

while currX < a.count && currY < b.count {
    if a[currX] == b[currY] {
      break
    }
    x &+= 1
    y &+= 1
}

Finally, we store the best X value we've found for the current k value and halt the algorithm if we've reached the bottom right. Note how we only store X: Because k is a ratio of X and Y, we can infer Y from it and thus do not need to store Y.

currentKValues[k] = currX
if currX >= a.count && currY >= b.count {
    break outer
}

You might wonder how subscripting currentKValues works if k can be negative -- the Standard Library created a custom type that translates such negative indexes into positive ones. The idea is to create an array long enough to fit both positive and negative values and treat the indexes inside.

The complete algorithm from Swift can be found here.

The result of this implementation of Myers's is an array representing how far each k value has traveled in the X-axis in a specific depth. To get the actual path, all we have to do is backtrack from the only k value that has reached the maximum depth. This is done through a similar loop: For each current value, move up or to the left based on who has the largest X value in the previous iteration of the original method. Here's how this is implemented in the Standard Library (with _V being their version of my KValues type):

// Backtrack through the trace generated by the Myers descent to produce the changes that make up the diff
func _formChanges(
  from a: UnsafeBufferPointer<C.Element>,
  to b: UnsafeBufferPointer<C.Element>,
  using trace: [_V]
) -> [CollectionDifference<C.Element>.Change] {
  var changes = [CollectionDifference<C.Element>.Change]()
  changes.reserveCapacity(trace.count)
  var x = a.count
  var y = b.count
  for d in stride(from: trace.count &- 1, to: 0, by: -1) {
    let v = trace[d]
    let k = x &- y
    let prev_k = (k == -d || (k != d && v[k &- 1] < v[k &+ 1])) ? k &+ 1 : k &- 1
    let prev_x = v[prev_k]
    let prev_y = prev_x &- prev_k

    while x > prev_x && y > prev_y {
      // No change at this position.
      x &-= 1
      y &-= 1
    }

    if y != prev_y {
      changes.append(.insert(offset: prev_y, element: b[prev_y], associatedWith: nil))
    } else {
      changes.append(.remove(offset: prev_x, element: a[prev_x], associatedWith: nil))
   }

    x = prev_x
    y = prev_y
  }

  return changes
}

Now that we have the final CollectionDifference type, we can finally return it and conclude the diffing algorithm. The worst case for Myers's is O(a * b) as it is possible for us to traverse all of the nodes of the graph.

Conclusion

As always, I like to finish these by saying that knowing what's inside of what you're using allows you to make better decisions in code. Different algorithms are meant for different use-cases, and knowing how to make trade-offs is useful everywhere. Plus, you get to learn cool little pieces of trivia -- did you know that Myers's is one of the merging strategies in git? :)

References and Good Reads

Myers's Paper
Diffing.swift
CollectionDifference.swift
The Myers Diff Algorithm Explanation - Great read!