How Hashable works in Swift

How Hashable works in Swift

Hashing algorithms are functions that map an arbitrary string to a seemingly "random" output with a fixed length. Commonly associated as a component of Sets and Dictionaries, hashing algorithms are a major component of several branches of computer science, including cryptography.

In Swift, the Hashable protocol is the main component used to provide hashing capabilities to custom types. It is most commonly used to allow types to be used as the key of Dictionaries and Sets, but it also gives you access to `Sequence` helper methods like `contains(_:)`.

The way it works is that Hashable exposes a hashValue property (now Hasher in Swift 4.2), which is an Int that represents a not-guaranteed-but-hopefully-most-of-the-times unique number that represents the contents of that type. In the case of Dictionaries, this number is used to calculate where in memory a key's value should be stored to later allow you to retrieve that same value in constant time. As the implementations of the protocol aren't the scope of this article, here's a video on hash tables if you need more information on this.

The internal algorithm used to calculate a type's hashValue (now Hasher in Swift 4.2) and the related compiler features changed several times throughout Swift's releases, and it was only after Swift 4.2 that a real universal hashing algorithm was added to Swift. To see how this is implemented today, let's take a look at how this looked like throughout the history of Swift.

Swift 4.0 and below - A lawless world

Throughout the history of Swift up until Swift 4.0, the internal implementation of a Int's hashValue simply returned their own value, while Strings alternated between using the MurmurHash2 algorithm (Swift 3.0 and below) and SipHash1-3 (Swift 4.0 forward).

Although Strings were doing fine in terms of algorithms, adding Hashable conformance to custom types in that period of time was usually an annoyance. If you were around that time, then you probably remember that Apple's recommended way to implement hashValue was to XOR all properties together:

struct Foo: Hashable {
    let someProperty: String
    let anotherProperty: String

    var hashValue: Int {
        return someProperty.hashValue ^ anotherProperty.hashValue
    }
}

As there was no automatic synthesis at the time, this is how things worked for a long time.

Although this did the trick in terms of hashing, the necessity of having users do this in first place was fundamentally flawed:

- Good hashing algorithms are very complex, so it's unreasonable to expect that your average developer will be able to determine if simply XORing properties is enough for the desired use-case.
- Because this is a manual bit manipulation process, this is very easy to screw up.

- Good hashing algorithms have guarantees in terms of collision resistance, distribution and safety, while XORing properties have none of them.

As Swift is known to be very user-friendly, changes were definitely expected -- and they came.

Swift 4.1: Synthesize Hashable with _mixInt

Alongside the changes to synthesize conformance to Equatable types, Swift 4.1 attempted to treat this problem for the first time by synthesizing a custom type's conformance to Hashable as well.

(We won't talk about how compiler synthesis works because it was already covered in How CaseIterable Works Internally in Swift, and Hashable synthesis works pretty much exactly the same as the CaseIterable one. If you want more info on that, check that article out!)

With the proposed changes, while the synthetized hashValue would still XOR a type's properties's hashValues, the values that are being XORed would not be the hashValues themselves -- but a "scrambled" version of them, provided by an internal method called _mixInt:

struct Foo: Hashable {
    let someProperty: String
    let anotherProperty: String

//////// Compiler Generated ///////// 

    var hashValue: Int {
        return _mixInt64(someProperty.hashValue) ^ _mixInt64(anotherProperty.hashValue)
    }

/////////////////////////////////////
}

The implementation of _mixInt progressively multiplies and shifts the original integer's bits based on a randomly picked hardcoded value. Because most of the original value's bit information will be lost during the shifting steps, this will make the resulting value appear to be completely random, which is a very important trait of a hashing algorithm.

static func hash16Bytes(_ low: UInt64, _ high: UInt64) -> UInt64 {
  let mul: UInt64 = 0x9ddfea08eb382d69
  var a: UInt64 = (low ^ high) &* mul
  a ^= (a >> 47)
  var b: UInt64 = (high ^ a) &* mul
  b ^= (b >> 47)
  b = b &* mul
  return b
}

If this looks like a bunch of gibberish to you, welcome to hashing algorithms! In fact, as having pseudorandom results is an important quality of good hashing algorithms, you'll see that most hashing algorithms are based on even weirder non-sense. It's exactly what they are supposed to do!

Although _mixInt is an implementation of its own, its behavior was inspired by a widely-used hashing algorithm called MurmurHash -- which works in a similar fashion of progressively multiplying, shifting and XORing values, although in a more complex way.

While this is better than simply XORing properties, this approach still had some flaws. First of all, while _mixInt ticked a few boxes of what defines a good hashing algorithm, like having avalanche behavior (minor changes in the input have catastrophic, pseudorandom effects in the output), it was never intended to be used to combine hash values.

A second problem is that _mixInt was private, so manual implementations of Hashable were still flawed in the same ways of the past Swift versions.

As a third problem, another issue is that just like MurmurHash, _mixInt is an unsafe algorithm. But whoah! Why would they add something unsafe to the language?

Don't worry! In the world of hashing algorithms, this term is not necessarily a bad thing. Before continuing on why this is bad in this specific case, let's understand what this means.

What makes a hashing algorithm "safe"?

In the world of hashing algorithms, safety is measured by how viable an algorithm is in the context of cryptography. This is measured by how well the algorithms fit some key requirements, like:

- Given a hash, it should be extremely hard to find the input that created it. (pre-image resistance)
- Given an input and a hash, it should be extremely hard to find another input that generates the same hash (collision resistance)
- The output should appear to be completely random. (pseudorandomness)
- Hashes from different inputs should have no concrete relation to each other, even if the inputs have clear similarities (non-malleability)
- The algorithm should be fast, but slow enough so that brute-forcing your way through the previous requirements is unfeasible.

An example of an algorithm that fits most of these boxes is SHA, which is today's standard for checking the integrity of files.

Because Swift needs hashing for the purposes of efficiently storing values in Dictionaries and Sets, having strong cryptographic abilities is simply not important here. Instead, hashing algorithms for purposes like Swift's tend to focus more on speed (as there isn't much to be brute-forced) and good distribution of values in order to minimize (but for performance reasons, not completely avoid) collisions.

Unfortunately, the lack of cryptographic abilities makes unsafe algorithms susceptible to some attacks. In Swift's case, _mixInt was weak to hash flooding: An attack where someone purposively feeds inputs that they know will produce the same final index in a hash table in order to cause as many collisions as possible, in an attempt to bring down the service that is hosting this hash table.

Unsafe algorithms are supposed to be used in environments where the lack of safety isn't a problem, which is not 100% true for hash tables. Fortunately, the solution to these problems came in Swift 4.2 with the addition of Hasher.

Swift 4.2 - Hasher brings SipHash-1-3 for everyone

As part of Hashable's revamp in Swift 4.2, The Hasher type was added as a public API to abstract the process of creating hashValues from the user. Now, instead of manually providing a hashValue, the property was replaced by Hashable's new hash(into:) method which receives a reference to a Hasher instance.

struct Foo: Hashable {
    let someProperty: String
    let anotherProperty: String

//////// Compiler Generated /////////

    func hash(into hasher: inout Hasher) {
        hasher.combine(someProperty)
        hasher.combine(anotherProperty)
    }

/////////////////////////////////////

}

Because Hasher abstracts the hashing process, this means that manual implementations don't have to hope that XORing things together was good enough -- you can now just send everything to Hasher and let it handle the rest with its internal algorithms.

The addition of Hasher also meant that Swift now officially had a universal hashing algorithm, as everything in the Standard Library (including Ints and Strings) was modified to exclusively use Hasher when building hash values.

This algorithm is SipHash -- which at first glance might not look too different from _mixInt as things are still based on random values and operations:

extension Hasher {
  // FIXME: Remove @usableFromInline and @frozen once Hasher is resilient.
  // rdar://problem/38549901
  @usableFromInline @frozen
  internal struct _State {
    // "somepseudorandomlygeneratedbytes"
    private var v0: UInt64 = 0x736f6d6570736575
    private var v1: UInt64 = 0x646f72616e646f6d
    private var v2: UInt64 = 0x6c7967656e657261
    private var v3: UInt64 = 0x7465646279746573

    @inline(__always)
    internal init(rawSeed: (UInt64, UInt64)) {
      v3 ^= rawSeed.1
      v2 ^= rawSeed.0
      v1 ^= rawSeed.1
      v0 ^= rawSeed.0
    }
  }
}

extension Hasher._State {
  @inline(__always)
  private static func _rotateLeft(_ x: UInt64, by amount: UInt64) -> UInt64 {
    return (x &<< amount) | (x &>> (64 - amount))
  }

  @inline(__always)
  private mutating func _round() {
    v0 = v0 &+ v1
    v1 = Hasher._State._rotateLeft(v1, by: 13)
    v1 ^= v0
    v0 = Hasher._State._rotateLeft(v0, by: 32)
    v2 = v2 &+ v3
    v3 = Hasher._State._rotateLeft(v3, by: 16)
    v3 ^= v2
    v0 = v0 &+ v3
    v3 = Hasher._State._rotateLeft(v3, by: 21)
    v3 ^= v0
    v2 = v2 &+ v1
    v1 = Hasher._State._rotateLeft(v1, by: 17)
    v1 ^= v2
    v2 = Hasher._State._rotateLeft(v2, by: 32)
  }

  @inline(__always)
  private func _extract() -> UInt64 {
    return v0 ^ v1 ^ v2 ^ v3
  }
}

However, there is a fundamental difference between the two.

While algorithms like MurmurHash only need the input to produce the output, SipHash is a keyed hash function which also requires an arbitrary secret key value to perform the hashing. Hashing equal values with different keys will produce completely different results, which means that attackers cannot perform hash flooding attacks against SipHash-powered systems without knowing the key that's being used.

For safety, in Swift, SipHash's key is randomly generated during runtime. This is why Apple tells you to never save/compare hashValues of a property between sessions: The same property will have different hash values in different executions of an app because SipHash's key will be different:

static swift::_SwiftHashingParameters initializeHashingParameters() {
  auto determinism = getenv("SWIFT_DETERMINISTIC_HASHING");
  if (determinism && 0 == strcmp(determinism, "1")) {
    return { 0, 0, true };
  }
  __swift_uint64_t seed0 = 0, seed1 = 0;
  swift::swift_stdlib_random(&seed0, sizeof(seed0));
  swift::swift_stdlib_random(&seed1, sizeof(seed1));
  return { seed0, seed1, false };
}

As shown in the previous code, if you require deterministic hash values for the purposes of unit testing, Swift allows you to disable random seeding by adding SWIFT_DETERMINISTIC_HASHING as an environment variable. Note that there's no guarantee that the values will be the same between different Swift releases, so ideally you should never use it.

SipHash will also hash an input multiple times before producing the final output, which is determined by the numbers in the name of the algorithm. In our case, Swift uses SipHash-1-3, which means that there's one round of hashing per message block, with three finalization rounds. This makes SipHash slightly slower than your average unsafe algorithm, which alongside its keyed hash properties allows SipHash to be considered a safe hashing algorithm.

extension Hasher._State {
  @inline(__always)
  internal mutating func compress(_ m: UInt64) {
    v3 ^= m
    _round()
    v0 ^= m
  }

  @inline(__always)
  internal mutating func finalize(tailAndByteCount: UInt64) -> UInt64 {
    compress(tailAndByteCount)
    v2 ^= 0xff
    for _ in 0..<3 {
      _round()
    }
    return _extract()
  }
}

Although SipHash's key-hashing feature is meant to be used as a network traffic authenticator, the improved defense against hash flooding makes it a very good hashing algorithm for hash tables, which is now being used by programming languages like Perl 5, Ruby, Rust and now Swift.

References and Good Reads

The Swift Source Code
SipHash
SE-0206 Hashable Enhancements
SE-0185 Synthesize Hashable