Generalized Parsing: Part 3

Episode #126 • Nov 23, 2020 • Subscriber-Only

Generalizing the parser type has allowed us to parse more types of inputs, but that is only scratching the surface. It also unlocks many new things that were previously impossible to see, including the ability to parse a stream of inputs and stream its output, making our parsers much more performant.

Part 3
Introduction
00:05
Parsing in a memory-efficient manner
04:03
A parser of streaming input
16:38
Streaming a parser's output
26:26
Conclusion
34:35

Introduction

This is pretty impressive. We added just 70 additional lines of parsers to our base library and unlocked the ability to very succinctly and expressively parse incoming requests so that we can route them to different parts of our application or web site. There are entire libraries out there devoted to this functionality, and yet here we have discovered it to be just a small corollary of having a powerful, generalized parsing library available to us.

The only things missing from this routing micro-library are a few more combinators for parsing the headers and body of the request, as well as a way to turn a nebulous URLRequest value into one of these RequestData values, but we will leave both of those as exercises for the viewer.

So I think this is pretty incredible. We have massively generalized our parsing library, all of the parsers we previously wrote still compile and run just like before, but we can now perform parsing tasks on all new types of input that would have previously been impossible.

But as cool as all of that is, we still want to ask the all-important question that we ask at the end of every series of episodes on Point-Free: what’s the point? Because although we have generalized parsing, we have also made it a little more complex. Not only do we have to think a bit harder when writing a general parser, and need a bit of knowledge of generics and the Collection protocol, but we also sometimes have to give the compiler some extra hints so that it can figure out the types.

So, is it worth this extra complexity? Instead of generalizing parsers, should we have spent a little more time creating more robust parsers that perhaps could have handled the complexities of parsing a raw URLRequest, rather than inventing the RequestData type and trying to parse it?

And we of course think it’s absolutely worth this extra complexity. Generalizing the parser signature has allowed us to parse all new types of input, but that’s only the beginning. The very act of generalizing has opened up all new possibilities that were previously impossible to see. For example:

  • With zero changes to our core parser type we can create a new parser operator that allows us to incrementally parse a stream of data coming in from an outside source, such as standard input, and even incrementally stream output to an outside destination, such as a file or standard output. This can dramatically improve the performance of parsers that work on large data sets, where it is unreasonable to load everything into memory and parse it into one big array before processing.
  • So that’s incredible, but it gets better. By generalizing we can now see an all new form of composition that sits right next to our beloved map, zip and flatMap operators. This operator has the same shape that we have discovered on Point-Free time and time again, and it will be instrumental in allowing us to take a parser that works on a small piece of data and transform it into a parser that works on a larger, more complex piece of data.
  • And if that weren’t enough, things get even better. This new form of composition turns out to be the key to unlock a new tier of performance in our parsers. We can increase the performance of some of our parsers by as much as 5-10x with minimal changes to the parsers themselves, which makes their performance competitive with hand-rolled parsers and even beats the performance of Apple’s parser helpers, such as the Scanner type.

These are some really big claims we are making. We are saying that by simply generalizing the input type of our parsers we can unlock the ability to stream input into our parsers, uncover new forms of composition, and immediately improve the performance of our parsers, basically for free.
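
To make this concrete before diving in, here is a minimal, self-contained sketch of the kind of generalized parser type this series has been building up, simplified down to a single int parser on Substring (the real library has many more combinators and conveniences):

    // A parser that is generic over its input, not hard-coded to Substring.
    struct Parser<Input, Output> {
      let run: (inout Input) -> Output?
    }

    extension Parser where Input == Substring, Output == Int {
      // Consumes leading digits from the front of the input and returns them as an Int.
      static let int = Self { input in
        let digits = input.prefix(while: { $0.isNumber })
        guard let number = Int(digits) else { return nil }
        input.removeFirst(digits.count)
        return number
      }
    }

    var input = "123 Hello"[...]
    // Spelling out the generic parameters is one of the "extra hints" the compiler
    // sometimes needs now that Input is generic.
    precondition(Parser<Substring, Int>.int.run(&input) == 123)
    precondition(input == " Hello")

Nothing in this shape is tied to Substring: Input can just as well be a Substring.UTF8View, a slice of bytes, or the RequestData value from the routing discussion, which is exactly the kind of thing the exercises below explore.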

Exercises

  1. Implement the following method on Parser:

    extension Parser {
      func f<NewOutput>(_ parser: Parser<Output, NewOutput>) -> Parser<Input, NewOutput> {
        fatalError("unimplemented")
      }
    }
    

    What might this method be useful for?

    Solution
    extension Parser {
      func f<NewOutput>(_ parser: Parser<Output, NewOutput>) -> Parser<Input, NewOutput> {
        .init { input in
          // Hold on to the original input so we can backtrack if the second parser fails.
          let original = input
          // Run `self` to produce an intermediate output...
          guard var output = self.run(&input) else { return nil }
          // ...and then feed that output to `parser` as its input.
          guard let newOutput = parser.run(&output) else {
            input = original
            return nil
          }
          return newOutput
        }
      }
    }
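
    As a hint toward what this method might be useful for: it feeds the output of one parser into another parser as that parser’s input. Here is a hypothetical usage, assuming the minimal Parser sketch and int parser from the introduction above (the cell parser is purely illustrative and not part of the library):

    // A parser that consumes everything up to the next comma (illustrative only).
    let cell = Parser<Substring, Substring> { input in
      let prefix = input.prefix(while: { $0 != "," })
      input.removeFirst(prefix.count)
      return prefix
    }

    // Piping the cell's Substring output into the int parser produces a
    // Parser<Substring, Int> that parses a single comma-separated cell as an Int.
    let intCell = cell.f(Parser<Substring, Int>.int)

    var row = "42,100,7"[...]
    precondition(intCell.run(&row) == 42)
    precondition(row == ",100,7")

    This looks a lot like the new form of composition hinted at in the introduction: a parser that produces a smaller piece of data hands that piece off to another parser for further processing.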
    
  2. Swift strings have a lower-level representation called UnicodeScalarView that can be more performant to work with. Generalize Parser.int to parse Substring.UnicodeScalarView:

    extension Parser where Input == Substring.UnicodeScalarView, Output == Int {
      static let int = Self { input in
        fatalError("unimplemented")
      }
    }
    
    let string = "123 Hello"
    var input = string[...].unicodeScalars
    precondition(Parser.int.run(&input) == 123)
    precondition(Substring(input) == " Hello")
    

    How do the performance characteristics compare with Substring?

    Solution

    Stay tuned for the solution in coming episodes!

  3. Even lower-level than UnicodeScalarView is UTF8View. Generalize Parser.int to parse Substring.UTF8View:

    extension Parser where Input == Substring.UTF8View, Output == Int {
      static let int = Self { input in
        fatalError("unimplemented")
      }
    }
    
    let string = "123 Hello"
    var input = string[...].utf8
    precondition(Parser.int.run(&input) == 123)
    precondition(Substring(input) == " Hello")
    

    How do the performance characteristics compare with Substring and Substring.UnicodeScalarView?

    Solution

    Stay tuned for the solution in coming episodes!
