Parsing and Performance: Strings

Episode #127 • Nov 30, 2020 • Subscriber-Only

We want to explore the performance of composable parsers, but to do so we must first take a deep dive into the Swift string API. There are multiple abstractions of strings in Swift, each with its own benefits and performance characteristics. We will benchmark them in order to get a scientific basis for comparison, and will describe how to properly write a benchmark.

Strings
Introduction
00:05
String vs. substring
02:04
How to benchmark
06:46
Copying substrings
14:26
Substring vs. unicode scalars
20:29
Substring vs. UTF-8
29:27
Next time: parser performance
34:09

Unlock This Episode

Our Free plan includes 1 subscriber-only episode of your choice, plus weekly updates from our newsletter.

Introduction

Over the past many weeks we have built up a pretty impressive parser library. Our parser is a very general type that allows you to parse any kind of nebulous input into any kind of well-structured output. It supports lots of interesting forms of composition that allow you to break large problems down into smaller ones.

On top of all of that we were able to build up an impressive library of parsers and higher-order parsers that work on strings. They allowed us to scan values off the fronts of strings in an efficient manner, such as characters, numbers, prefixes and more. And these parsers dealt with a lower-level API than just plain String, called Substring. A Substring is like a view into a string. It’s not the actual string itself, but rather just a representation of a portion of a base string that is stored somewhere else. This means we can consume little bits off the front of a substring while only changing our view of the underlying string, which is a very lightweight thing to do. If we were dealing with raw Strings then we would need to make a whole copy of the string, which can be a very expensive operation.

So it seems that maybe we are ready to close the book on parsers and open source it! But not so fast. There is a very important topic to consider, especially when it comes to parsers, which is performance. Sometimes we need to parse megabytes or even gigabytes of data, and we need to be as efficient as possible when it comes to scanning the input. And although we have taken a huge step by using Substrings instead of Strings, it turns out there is a lot more we can do.

We want to start off by giving everyone a quick deep dive into Swift strings and their performance characteristics. It’s a tricky subject, and there are a few subtle edge cases to think about, but once that’s done we will find that there is an even lower level representation of strings for which parsing is even more efficient than Substring. It’s really quite amazing to see, but it also kind of opens up a whole can of worms that requires more work to wrangle in.

This episode is for subscribers only.

Subscribe to Point-Free

Access this episode, plus all past and future episodes when you become a subscriber.

See plans and pricing

Already a subscriber? Log in

References

swift-benchmark

Google • Friday Mar 13, 2020

A Swift library for benchmarking code snippets, similar to google/benchmark.

UTF-8

Michael Ilseman • Wednesday Mar 20, 2019

Swift 5 made a fundamental change to the String API, making the preferred encoding UTF-8 instead of UTF-16. This brings many usability and performance improves to Swift strings.

Strings in Swift 4

Ole Begemann • Monday Nov 27, 2017

An excerpt from the Advanced Swift that provides a deep discussion of the low-level representations of Swift strings. Although it pre-dates the transition of strings to UTF-8 in Swift 5 it is still a factually correct accounting of how to work with code units in strings.

Swiftʼs Collection Types

Harshil Shah • Wednesday Aug 5, 2020

In this episode we explored different representations of strings and their subsequences (i.e. Substring, UnicodeScalarView, and UTF8View), but more generally there are collections and slices. This article gives a nice accounting of the zoo of types in Swift’s collections API.

Improve performance of Collection.removeFirst(_:) where Self == SubSequence

Stephen Celis • Tuesday Jul 28, 2020

While researching the string APIs for this episode we stumbled upon a massive inefficiency in how Swift implements removeFirst on certain collections. This PR fixes the problem and turns the method from an O(n) operation (where n is the length of the array) to an O(k) operation (where k is the number of elements being removed).

Downloads