# How Collection Diffing works in Swift

**Ordered Collection Diffing** is a feature added in Swift 5.1 that allows you to calculate and apply the difference between two collections. Using diffing libraries is common in iOS for a few reasons, the most popular one being to handle the addition and removal of elements in **UITableViews**. With the addition of this feature, developers can now diff **Collections** without having to bother with external libraries.

Have you wondered what efficiently diffing something looks like? In this article, we'll see and analyze the diffing APIs as well as the **Myers's Diffing Algorithm** used by the Swift Standard Library -- the same used in **git**.

## Collection Diffing APIs

The diffing APIs are available as an extension for any **BidirectionalCollection**, with an abstracted helper type for ones whose's elements are **Equatable**:

`extension BidirectionalCollection {`

`public func difference<C: BidirectionalCollection>(`

`from other: C,`

`by areEquivalent: (C.Element, Element) -> Bool`

`) -> CollectionDifference<Element>`

`where C.Element == Self.Element {`

`return ... // Article spoilers removed!`

`}`

`}`

`extension BidirectionalCollection where Element: Equatable {`

`public func difference<C: BidirectionalCollection>(`

`from other: C`

`) -> CollectionDifference<Element> where C.Element == Self.Element {`

`return difference(from: other, by: ==)`

`}`

`}`

The return type for diffing operations is an **CollectionDifference** object -- basically an array of "moves" necessary to turn the first collection into the second one. The "moves" are represented as an enum of additions and removals, and the **CollectionDifference** object holds all of them.

`public struct CollectionDifference<ChangeElement> {`

`/// A single change to a collection.`

`@frozen`

`public enum Change {`

`case insert(offset: Int, element: ChangeElement, associatedWith: Int?)`

`case remove(offset: Int, element: ChangeElement, associatedWith: Int?)`

`}`

`public let insertions: [Change]`

`public let removals: [Change]`

`}`

**CollectionDifference** is a **Collection** itself, which means that it can be iterated. **An important aspect of this is that the iteration will happen in a specific order: First, all the removals from highest to lowest offset, followed by insertions from lowest to highest.** This is both for performance reasons (removing elements from the end of an array is faster from the beginning, and adding elements to the front when there are fewer elements is faster than when there's a lot), and for visual reasons as we'll see when the underlying algorithm is inspected.

We can see this behavior with a real example. Let's calculate the difference between these two strings:

`let a = "ABCABBA"`

`let b = "CBABAC"`

There are tons of ways of transforming **a** into **b**, but our interest is doing so in the least amount of steps. When calling **difference(_:)** and iterating the result, we'll see one of these solutions:

`let diff = b.difference(from: a)`

`diff.forEach { print($0) }`

`// remove(offset: 5, element: "B", associatedWith: nil)`

`// remove(offset: 1, element: "B", associatedWith: nil)`

`// remove(offset: 0, element: "A", associatedWith: nil)`

`// insert(offset: 1, element: "B", associatedWith: nil)`

`// insert(offset: 5, element: "C", associatedWith: nil)`

(We'll leave the explanation of the algorithm itself for the last section because it's a bit complicated.)

From here, you can use the resulting **CollectionDifference** object as you see fit, like animating the addition and removal of rows from an **UITableView**. The neat thing is that Swift provides an **applying(_:)** method that allows you to directly apply a **CollectionDifference** object into a **Collection**:

`let newB = a.applying(diff)`

`//CBABAC`

This particular example looks useless, so here's something where applying diffs is useful: The same three-way merge process that **git** uses as a strategy to merge commits!

`// Split the contents of the sources into lines`

`let baseLines = base.components(separatedBy: "\n")`

`let theirLines = theirs.components(separatedBy: "\n")`

`let myLines = mine.components(separatedBy: "\n")`

`// Create a difference from base to theirs`

`let diff = theirLines.difference(from:baseLines)`

`// Apply it to mine, if possible`

`guard let patchedLines = myLines.applying(diff) else {`

`print("Merge conflict applying patch, manual merge required")`

`return`

`}`

`// Reassemble the result`

`let patched = patchedLines.joined(separator: "\n")`

`print(patched)`

## How applying() works

The **applying(_:)** method works by creating a new, empty **Collection** instance that gets filled as the changes are iterated. Because the changes are iterated in the specific high removals -> low removals -> low insertions -> high insertions, the algorithm is able to build the final result through ranges instead of individual elements. For example, if you have a 10 character string and the first step is to remove the 5th character, we know right off the bat that the final result will contain the range **5...9** in that same order, so "removing the 5th character" is equivalent to adding everything that comes after it. The same logic is applied until all the changes are processed. Because of the interaction with ranges, **applying(_:)** is only available to **RangeReplaceableCollections**.

It is possible for this process to fail, for example, when the desired element to be removed doesn't exist, which is possible in cases like the **git** example due to conflicting information in the data. In these cases, **applying(_:)** returns **nil**.

The actual method is a bit long and hard to read, so I'll leave the link to it instead of showing it here.

## The complex part: How difference() works

Now that we've seen how the APIs work, we're ready to take a look at what Swift does to generate the final **CollectionDifference** object.

As stated before, there are many ways to transform a **Collection** into another one. For example, to transform **a: ABCABBA** to **b: CBABAC**, we could simply remove every single element from **a** and append the ones from **b**. Easy peasy!

However, at a best case of **a.count + b.count** "moves", that would be very inefficient. One way to find the best way possible to do this change is to transform this into a **graph problem**. Imagine that we have a **matrix** where the columns are characters from **a** and rows are characters from **b**:

`- A B C A B B A`

`-`

`C`

`B`

`A`

`B`

`A`

`C`

In this matrix, the top-left position 0,0 represents the original **ABCABBA** string, while the bottom-right position represents the final **CBABAC** string. Moving **right** in this matrix represents **removing** an element from the original string, while moving **down** represents **adding** an element from the new one. Each of these movements counts as a step, but if the elements from both strings at a specific position **match** (like the position 1,3 - two Cs), you can **diagonally move down and to the right** for "free", as no changes are necessary for that position.

Given this description, we can transform this problem into another one: **How can we reach the bottom right of the matrix in the least amount of steps (a.k.a maximizing the number of diagonal moves)?**

If you ever studied graphs you might remember that finding the shortest path between two nodes in an unweighted graph is a classical computer science problem! This problem is solved by the popular graph search algorithm Breadth-First Search -- because the graph is unweighted, taking turns on checking the neighbors of each node is guaranteed to discover the shortest path between a source node to a target one. Applying the algorithm in the previous example will show us a path that transforms our string in **5** moves (remember that diagonal moves do not count as moves):

`- A B C A B B A`

`- 0 1 2`

`C 2 3`

`B 3`

`A 4`

`B 4`

`A 4`

`C 5`

The implementation behind the true diffing algorithms is very similar to this, but with a few (unfortunately, relatively complicated) important twists. Although we're looking for the shortest edit path between two **Collections**, we're not exactly looking for **any** short path. Especially for visual applications like **git**, the overall order of how deletions and insertions are present is important. When you're looking at the merge diff in git, you would expect the diff to be somewhat "intelligent" in terms of knowing what was really added and deleted by your changes instead of randomly saying that things were "added" and "deleted" because it fulfills the shortest path condition.

`Good: class Foo { Bad: class Foo {`

`init(bar: String) { init(bar: String) {`

`self.bar = bar self.bar = bar`

`} + }`

`+ +`

`+ func inspect() { + func inspect() {`

`+ print(bar) + print(bar)`

`+ } }`

`} }`

Diffing algorithms carefully choose the paths that Breadth-First Search is going to traverse in order to provide a meaningful diff.

## Swift's Diffing Algorithm: Myers's Diffing Algorithm

Swift uses Myers's Diffing Algorithm (paper here) as part of the diffing APIs. Given two collections, the result is the **Shortest Edit Script** that transforms A into B. Much like the abstracted **CollectionDifference** type, the Shortest Edit Script is a list of additions and removals. In order to provide a good diff, Myers's prefers deleting over inserting, and it differs from Breadth-First Search by prioritizing what seems to be "good" paths instead of blindly switching nodes like the original algorithm.

This is done by adding two additional components to the matrix: How **deep** we are into the graph (number of moves, in the X-axis), and the **ratio** of additions versus deletions, which is a value **k** that increases by one every time we remove a value (moving right), and decreases by one every time we add a value (moving down), in the Y-axis. This is better seen if we shift a graph by 45 degrees (note that this graph doesn't contain the diagonal connections that make the paths shorter):

`| 0 1 2 3 4 5`

`+--------------------------------------`

`5 | 5,0`

`| /`

`4 | 4,0`

`| / \`

`3 | 3,0 4,1`

`| / \ /`

`2 | 2,0 3,1`

`| / \ / \`

`1 | 1,0 2,1 3,2`

`| / \ / \ / ...`

`0 | 0,0 1,1 2,2`

`| \ / \ / \`

`-1 | 0,1 1,2 2,3`

`| \ / \ /`

`-2 | 0,2 1,3`

`| \ / \`

`-3 | 0,3 ... 1,4`

To find the perfect path, the algorithm iterates the possible **depth** values. For each **depth**, it iterates the possible **k values** at that depth (**-d...d**) and determines the best move it can make at that point for that **k value**.

The best move is determined based on a few rules we'll see soon, but the important part here is that this is done based on the decisions made on the previous depth iteration. To make the current iteration not mess with the previous moves, the **k values** are iterated in steps of 2.

`var result = [KValues]()`

`var currentKValues = KValues()`

`currentKValues[1] = 0 // Ignore this for now, explained later!`

`outer: for d in 0...(a.count + b.count) {`

`result.append(currentKValues)`

`let previousKValues = currentKValues`

`currentKValues = KValues()`

`for k in stride(from: -d, through: d, by: 2)`

`// Determine the best moves for the current K values`

`}`

`}`

`return backtrackPath(fromKValues: result) // Additions and Deletions`

"Determining" the best move means deciding to go right or down from the positions recorded in the previous depth iteration. We should always move right if **k == -d** (bottom of the currently "accessible" graph), and always down if **k == +d** (right edge of the currently "accessible" graph). Otherwise, we should pick the position based on the current adjacent k's best values that leads us to the highest x value -- this way, we prioritize deletion.

`let currX: Int`

`if k == -d {`

`currX = previousKValues[k+1] // Moving down`

`} else if k == d {`

`currX = previousKValues[k-1] + 1 // Moving to the right`

`} else if previousKValues[k-1] < previousKValues[k+1] {`

`currX = previousKValues[k+1] // Moving down`

`} else {`

`currX = previousKValues[k-1] + 1 // Moving to the right`

`}`

`let currY = currX - k`

This loop is the reason why we need to start the algorithm with the **1** **k value** set to 0. This makes sure that the first iteration determines that the "first best move" is to go down to 0,0 -- the beginning of the graph. This allows everything else to proceed as normal.

Once the best move is found, we try to move diagonally as much as we can to claim our free steps.

`while currX < a.count && currY < b.count {`

`if a[currX] == b[currY] {`

`break`

`}`

`x &+= 1`

`y &+= 1`

`}`

Finally, we store the best X value we've found for the current **k value** and halt the algorithm if we've reached the bottom right. Note how we only store X: Because **k** is a ratio of X and Y, we can infer Y from it and thus do not need to store Y.

`currentKValues[k] = currX`

`if currX >= a.count && currY >= b.count {`

`break outer`

`}`

You might wonder how subscripting **currentKValues** works if k can be negative -- the Standard Library created a custom type that translates such negative indexes into positive ones. The idea is to create an array long enough to fit both positive and negative values and treat the indexes inside.

The complete algorithm from Swift can be found here.

The result of this implementation of Myers's is an array representing how far each **k value** has traveled in the X-axis in a specific depth. To get the actual path, all we have to do is backtrack from the only **k value** that has reached the maximum depth. This is done through a similar loop: For each current value, move up or to the left based on who has the largest X value in the previous iteration of the original method. Here's how this is implemented in the Standard Library (with **_V** being their version of my **KValues** type):

`// Backtrack through the trace generated by the Myers descent to produce the changes that make up the diff`

`func _formChanges(`

`from a: UnsafeBufferPointer<C.Element>,`

`to b: UnsafeBufferPointer<C.Element>,`

`using trace: [_V]`

`) -> [CollectionDifference<C.Element>.Change] {`

`var changes = [CollectionDifference<C.Element>.Change]()`

`changes.reserveCapacity(trace.count)`

`var x = a.count`

`var y = b.count`

`for d in stride(from: trace.count &- 1, to: 0, by: -1) {`

`let v = trace[d]`

`let k = x &- y`

`let prev_k = (k == -d || (k != d && v[k &- 1] < v[k &+ 1])) ? k &+ 1 : k &- 1`

`let prev_x = v[prev_k]`

`let prev_y = prev_x &- prev_k`

`while x > prev_x && y > prev_y {`

`// No change at this position.`

`x &-= 1`

`y &-= 1`

`}`

`if y != prev_y {`

`changes.append(.insert(offset: prev_y, element: b[prev_y], associatedWith: nil))`

`} else {`

`changes.append(.remove(offset: prev_x, element: a[prev_x], associatedWith: nil))`

`}`

`x = prev_x`

`y = prev_y`

`}`

`return changes`

`}`

Now that we have the final **CollectionDifference** type, we can finally return it and conclude the diffing algorithm. The worst case for Myers's is **O(a * b)** as it is possible for us to traverse all of the nodes of the graph.

## Conclusion

As always, I like to finish these by saying that knowing what's inside of what you're using allows you to make better decisions in code. Different algorithms are meant for different use-cases, and knowing how to make trade-offs is useful everywhere. Plus, you get to learn cool little pieces of trivia -- did you know that Myers's is one of the merging strategies in **git**? :)

## References and Good Reads

Myers's PaperDiffing.swift

CollectionDifference.swift

The Myers Diff Algorithm Explanation - Great read!

### About

**Bruno Rocha**is a Software Engineer at Spotify and is the developer of open sources libraries like SwiftInfo and SwiftShield.