Collection
Unlock This Episode
Our Free plan includes 1 subscriber-only episode of your choice, plus weekly updates from our newsletter.
Introduction
So there really are quite substantial performance gains to be had by dropping to lower and lower abstractions levels. The biggest gain is just by using substring over string, but even using unicode scalars and UTF-8 are big enough to consider using it if possible.
Now how do we apply what we’ve learned to parsers?
Well, so far all of our string parsers have been defined on Substring
, and we’ve used lots of string APIs such as removeFirst
, prefix
and range subscripting. As we have just seen in very clear terms, these operations can be a little slow on Substring
because of the extra work that must be done to properly handle traversing over grapheme clusters and normalized characters. The time differences may not seem huge, measured in just a few microseconds, but if you are parsing a multi-megabyte file that can really add up.
So, let’s see what kind of performance gains can be had by switching some of our parsers to work with UTF-8 instead of Substring
.
Subscribe to Point-Free
Access this episode, plus all past and future episodes when you become a subscriber.
Already a subscriber? Log in
Exercises
While string’s
UTF8View
offers a super-performant, low-level abstraction, we have found that there are other low-level abstractions that offer different performance characteristics. For one, whileUTF8View
is a SwiftCollection
ofUInt8
s, so is a more general array[UInt8]
. Write anotherint
parser overload whereInput
is a more general array. What changes in the implementation? What changes as far as performance is concerned?Solution
For our parser type to work most efficiently, we will constrain our parser to work with
ArraySlice
, which has a subsequence of the same type and can share storage when removing elements from the start and end of the array by simply moving the indices:extension Parser where Input == ArraySlice<UInt8>, Output == Int { static let int = Self { input in var isFirstCharacter = true let intPrefix = input.prefix { c in defer { isFirstCharacter = false } return (c == UTF8.CodeUnit(ascii: "-") || c == UTF8.CodeUnit(ascii: "+")) && isFirstCharacter || (UTF8.CodeUnit(ascii: "0")...UTF8.CodeUnit(ascii: "9")).contains(c) } guard let match = Int(String(decoding: intPrefix, as: UTF8.self)) else { return nil } input.removeFirst(intPrefix.count) return match } }
The only two changes are in the constraint, which is
ArraySlice<UInt8>
instead ofSubstring.UTF8View
, and where the match is converted to a string, which must now use the more generalizedString.init(decoding:as:)
initializer instead of theString.init
-Substring.init
that was possible given aSubstring.UTF8View
.As far as performance is concerned, there are two things to consider. We must first convert a string to an array before we can parse it, which is a performance cost that both requires double the memory, and the up-front cost of that conversion. Once the string it converted, however, it appears to be about 30–40% faster to parse.
There is another abstraction level we can parse from, and that’s unsafe pointers (😰), but we don’t necessarily have to fear them!
Write another
int
parser that works on an input for one of the safer of unsafe pointers:UnsafeBufferPointer
. This is a pointer to a collection of elements stored in contiguous memory. Strings offer two main ways of accessing this kind of data:// Directly on the string: string.withUTF8 { (ptr: UnsafeBufferPointer<UInt8>) in … } // On any collection: string.utf8.withContiguousStorageIfAvailable { (ptr: UnsafeBufferPointer<UInt8>) in … }
After writing this parser, benchmark it against each interface.
Solution
Starting from the previous exercise, the only thing that needs to change is the extension, which needs to work on a slice of a buffer pointer. Everything else is the same:
extension Parser where Input == Slice<UnsafeBufferPointer<UInt8>>, Output == Int { static let int = Self { input in var isFirstCharacter = true let intPrefix = input.prefix { c in defer { isFirstCharacter = false } return (c == UTF8.CodeUnit(ascii: "-") || c == UTF8.CodeUnit(ascii: "+")) && isFirstCharacter || (UTF8.CodeUnit(ascii: "0")...UTF8.CodeUnit(ascii: "9")).contains(c) } guard let match = Int(String(decoding: intPrefix, as: UTF8.self)) else { return nil } input.removeFirst(intPrefix.count) return match } }
We can benchmark
withUTF8
andwithContiguousStorageIfAvailable
with a little extra upfront work.withUTF8
requires a mutable string (or substring), since if the string is not in contiguous memory, it must be moved into contiguous memory beforehand, so we must copy the string into a mutable variable before parsing. The operation returns a non-failable result because strings can always move their bytes into contiguous memory.suite.benchmark("withUTF8") { var s = string precondition( s.withUTF8 { ptr in var ptr = ptr[...] return Parser.int.run(&ptr)! } == 1234567890 ) }
withContiguousStorageIfAvailable
on the other hand is more general and more failable. If contiguous memory cannot be allocated for the given collection, it will returnnil
, so we must further unwrap it in our precondition (we know it should never fail forUTF8View
s).suite.benchmark("withContiguousStorageIfAvailable") { precondition( string.utf8.withContiguousStorageIfAvailable { ptr in var ptr = ptr[...] return Parser.int.run(&ptr)! }! == 1234567890) ) }
Performance-wise, these methods are even faster than
ArraySlice<UInt8>
and do not incur the same memory cost: about 2x faster thanUTF8View
and 70% faster thanArraySlice<UInt8>
. Performance between each interface is negligible and can be measured in under 10 nanoseconds.The
int
parser onUTF8View
,[UInt8]
andUnsafeBufferPointer<UInt8>
can be unified in a single implementation that is constrained against theCollection
protocol. Comment out the other parsers and use this single, more generalized parser, instead. How does this affect performance?Solution
The body of this parser is the same as the last two. We just need to update the extension with more general constraints and use a computed static property instead of a
let
:extension Parser where Input: Collection, Input.SubSequence == Input, Input.Element == UInt8, Output == Int { static var int: Self { Self { input in var isFirstCharacter = true let intPrefix = input.prefix { c in defer { isFirstCharacter = false } return (c == UTF8.CodeUnit(ascii: "-") || c == UTF8.CodeUnit(ascii: "+")) && isFirstCharacter || (UTF8.CodeUnit(ascii: "0")...UTF8.CodeUnit(ascii: "9")).contains(c) } guard let match = Int(String(decoding: intPrefix, as: UTF8.self)) else { return nil } input.removeFirst(intPrefix.count) return match } } }
Performance characteristics seem to be the same (in this single module).
References
swift-benchmark
Google • Friday Mar 13, 2020A Swift library for benchmarking code snippets, similar to google/benchmark.
UTF-8
Michael Ilseman • Wednesday Mar 20, 2019Swift 5 made a fundamental change to the String API, making the preferred encoding UTF-8 instead of UTF-16. This brings many usability and performance improves to Swift strings.
Strings in Swift 4
Ole Begemann • Monday Nov 27, 2017An excerpt from the Advanced Swift that provides a deep discussion of the low-level representations of Swift strings. Although it pre-dates the transition of strings to UTF-8 in Swift 5 it is still a factually correct accounting of how to work with code units in strings.
Improve performance of Collection.removeFirst(_:) where Self == SubSequence
Stephen Celis • Tuesday Jul 28, 2020While researching the string APIs for this episode we stumbled upon a massive inefficiency in how Swift implements removeFirst
on certain collections. This PR fixes the problem and turns the method from an O(n)
operation (where n
is the length of the array) to an O(k)
operation (where k
is the number of elements being removed).