Parsing and Performance: Combinators

Episode #128 • Dec 7, 2020 • Subscriber-Only

We convert some of our substring parsers to work on lower levels of String abstractions, and unlock huge performance gains. Even better, thanks to our generalized parser we can even piece together multiple parsers that work on different abstraction levels, maximizing performance in the process.

Combinators
Introduction
00:05
Benchmarking a simple parser
01:13
Benchmarking a complex parser
17:30
Parsing multiple abstraction levels
33:46
Parsing across abstraction levels
40:47
Benchmarking an even more complex parser
46:28
Next time: even more performance?
49:50

Unlock This Episode

Our Free plan includes 1 subscriber-only episode of your choice, plus weekly updates from our newsletter.

Introduction

So there really are quite substantial performance gains to be had by dropping to lower and lower abstractions levels. The biggest gain is just by using substring over string, but even using unicode scalars and UTF-8 are big enough to consider using it if possible.

Now how do we apply what we’ve learned to parsers?

Well, so far all of our string parsers have been defined on Substring, and we’ve used lots of string APIs such as removeFirst, prefix and range subscripting. As we have just seen in very clear terms, these operations can be a little slow on Substring because of the extra work that must be done to properly handle traversing over grapheme clusters and normalized characters. The time differences may not seem huge, measured in just a few microseconds, but if you are parsing a multi-megabyte file that can really add up.

So, let’s see what kind of performance gains can be had by switching some of our parsers to work with UTF-8 instead of Substring.

This episode is for subscribers only.

Subscribe to Point-Free

Access this episode, plus all past and future episodes when you become a subscriber.

See plans and pricing

Already a subscriber? Log in

Exercises

  1. While string’s UTF8View offers a super-performant, low-level abstraction, we have found that there are other low-level abstractions that offer different performance characteristics. For one, while UTF8View is a Swift Collection of UInt8s, so is a more general array [UInt8]. Write another int parser overload where Input is a more general array. What changes in the implementation? What changes as far as performance is concerned?

    Solution

    For our parser type to work most efficiently, we will constrain our parser to work with ArraySlice, which has a subsequence of the same type and can share storage when removing elements from the start and end of the array by simply moving the indices:

    extension Parser where Input == ArraySlice<UInt8>, Output == Int {
      static let int = Self { input in
        var isFirstCharacter = true
        let intPrefix = input.prefix { c in
          defer { isFirstCharacter = false }
          return (c == UTF8.CodeUnit(ascii: "-") || c == UTF8.CodeUnit(ascii: "+")) && isFirstCharacter
            || (UTF8.CodeUnit(ascii: "0")...UTF8.CodeUnit(ascii: "9")).contains(c)
        }
    
        guard let match = Int(String(decoding: intPrefix, as: UTF8.self))
        else { return nil }
    
        input.removeFirst(intPrefix.count)
        return match
      }
    }
    

    The only two changes are in the constraint, which is ArraySlice<UInt8> instead of Substring.UTF8View, and where the match is converted to a string, which must now use the more generalized String.init(decoding:as:) initializer instead of the String.init-Substring.init that was possible given a Substring.UTF8View.

    As far as performance is concerned, there are two things to consider. We must first convert a string to an array before we can parse it, which is a performance cost that both requires double the memory, and the up-front cost of that conversion. Once the string it converted, however, it appears to be about 30–40% faster to parse.

  2. There is another abstraction level we can parse from, and that’s unsafe pointers (😰), but we don’t necessarily have to fear them!

    Write another int parser that works on an input for one of the safer of unsafe pointers: UnsafeBufferPointer. This is a pointer to a collection of elements stored in contiguous memory. Strings offer two main ways of accessing this kind of data:

    // Directly on the string:
    string.withUTF8 { (ptr: UnsafeBufferPointer<UInt8>) in
      ...
    }
    
    // On any collection:
    string.utf8.withContiguousStorageIfAvailable { (ptr: UnsafeBufferPointer<UInt8>) in
      ...
    }
    

    After writing this parser, benchmark it against each interface.

    Solution

    Starting from the previous exercise, the only thing that needs to change is the extension, which needs to work on a slice of a buffer pointer. Everything else is the same:

    extension Parser where Input == Slice<UnsafeBufferPointer<UInt8>>, Output == Int {
      static let int = Self { input in
        var isFirstCharacter = true
        let intPrefix = input.prefix { c in
          defer { isFirstCharacter = false }
          return (c == UTF8.CodeUnit(ascii: "-") || c == UTF8.CodeUnit(ascii: "+")) && isFirstCharacter
            || (UTF8.CodeUnit(ascii: "0")...UTF8.CodeUnit(ascii: "9")).contains(c)
        }
    
        guard let match = Int(String(decoding: intPrefix, as: UTF8.self))
        else { return nil }
    
        input.removeFirst(intPrefix.count)
        return match
      }
    }
    

    We can benchmark withUTF8 and withContiguousStorageIfAvailable with a little extra upfront work.

    withUTF8 requires a mutable string (or substring), since if the string is not in contiguous memory, it must be moved into contiguous memory beforehand, so we must copy the string into a mutable variable before parsing. The operation returns a non-failable result because strings can always move their bytes into contiguous memory.

    suite.benchmark("withUTF8") {
      var s = string
      precondition(
        s.withUTF8 { ptr in
          var ptr = ptr[...]
          return Parser.int.run(&ptr)!
        }
        == 1234567890
      )
    }
    

    withContiguousStorageIfAvailable on the other hand is more general and more failable. If contiguous memory cannot be allocated for the given collection, it will return nil, so we must further unwrap it in our precondition (we know it should never fail for UTF8Views).

    suite.benchmark("withContiguousStorageIfAvailable") {
      precondition(
        string.utf8.withContiguousStorageIfAvailable { ptr in
          var ptr = ptr[...]
          return Parser.int.run(&ptr)!
        }!
        == 1234567890)
      )
    }
    

    Performance-wise, these methods are even faster than ArraySlice<UInt8> and do not incur the same memory cost: about 2x faster than UTF8View and 70% faster than ArraySlice<UInt8>. Performance between each interface is negligible and can be measured in under 10 nanoseconds.

  3. The int parser on UTF8View, [UInt8] and UnsafeBufferPointer<UInt8> can be unified in a single implementation that is constrained against the Collection protocol. Comment out the other parsers and use this single, more generalized parser, instead. How does this affect performance?

    Solution

    The body of this parser is the same as the last two. We just need to update the extension with more general constraints and use a computed static property instead of a let:

    extension Parser
    where
      Input: Collection,
      Input.SubSequence == Input,
      Input.Element == UInt8,
      Output == Int
    {
      static var int: Self {
        Self { input in
          var isFirstCharacter = true
          let intPrefix = input.prefix { c in
            defer { isFirstCharacter = false }
            return (c == UTF8.CodeUnit(ascii: "-") || c == UTF8.CodeUnit(ascii: "+")) && isFirstCharacter
              || (UTF8.CodeUnit(ascii: "0")...UTF8.CodeUnit(ascii: "9")).contains(c)
          }
    
          guard let match = Int(String(decoding: intPrefix, as: UTF8.self))
          else { return nil }
    
          input.removeFirst(intPrefix.count)
          return match
        }
      }
    }
    

    Performance characteristics seem to be the same (in this single module).

References

swift-benchmark

Google • Friday Mar 13, 2020

A Swift library for benchmarking code snippets, similar to google/benchmark.

UTF-8

Michael Ilseman • Wednesday Mar 20, 2019

Swift 5 made a fundamental change to the String API, making the preferred encoding UTF-8 instead of UTF-16. This brings many usability and performance improves to Swift strings.

Strings in Swift 4

Ole Begemann • Monday Nov 27, 2017

An excerpt from the Advanced Swift that provides a deep discussion of the low-level representations of Swift strings. Although it pre-dates the transition of strings to UTF-8 in Swift 5 it is still a factually correct accounting of how to work with code units in strings.

Improve performance of Collection.removeFirst(_:) where Self == SubSequence

Stephen Celis • Tuesday Jul 28, 2020

While researching the string APIs for this episode we stumbled upon a massive inefficiency in how Swift implements removeFirst on certain collections. This PR fixes the problem and turns the method from an O(n) operation (where n is the length of the array) to an O(k) operation (where k is the number of elements being removed).

Downloads