Sunday, May 23, 2010

Writing a Perl 6 URI module

I wanted to write a parser of some sort using Perl 6's spiffy parser language otherwise known as "rules". This is the super-extended regular expression syntax that Perl 6's own parser is written in, and it's not just powerful, it's easy to use. In fact, it's so easy to use that almost all of my time writing a URI parser module was spent on other aspects of the code than the parser itself.

First off, some background. Perl 6 has a URI module already. However, it relies on a number of Perl built-in character classes to match things like digits and alphanumerics. In reality, the RFCs that define URIs are very precise, and there are different specifications depending on what you need. So, I decided to re-write the module with a pluggable parser so that you could give a regular, modern URI and have it parse correctly, but you could also ask for special "IRI" parsing on an internationalized URI and the right thing would happen there. I even went so far as to bring in an older version of the specification as a legacy mode.

The current state of the Perl 6 parser and runtime called Rakudo is actually fairly solid for a pre-release implementation of such a complex language spec. There are some gaping holes, but they were all relatively easy to work around. Some of these included overly aggressive list-flattening, some operators that were broken at the time I wrote this code and the big one: named rules only work as a stand-alone grammer with a specific entry-point called TOP.

I worked around all of these issues and have, so far, been able to parse basic URIs according to RFC 3986. Here's a sample of what a Perl grammar for URIs looks like:

    token URI {
        ':' [ '?' ]? [ '#' ]?
    }

Here you can see most of the basics: "token" introduces a single expression within the grammar. It calls out to other tokens by enclosing their names in angle-brackets. Literal sequences are enclosed in single-quotes and sub-expressions can be enclosed in square-brackets with regular expression-like repetition counts such as ? for 0 or 1 matches.

In order to have a pluggable interface, I needed a class capable of providing me with two things for each grammar: the grammar itself and a set of routines which would tell me how to find the resulting URI elements in the match data. For this I defined an interface using Perl 6's roles:

  role URI::Specification {
      method parser() { ... }
      method scheme_path() { ... }
      # ... other _path methods here...
  }

Those ellipses are literal. They cause the methods to be required for any class composed with this role, but do not define any functionality themselves.

Each parser is then defined as:

  class URI::rfc3896 does URI::Specification {
      grammar URI::rfc3896::spec {
          token TOP { }
          # RFC definition of URI goes here.
      }
      method parser() { return ::URI::rfc3896::spec }
      method scheme_path() {
          gather do { take }
      }
      # And so on ...
  }

That's it. The only really funky bit here is the gather/take code in the scheme_path. That's the way Perl 6 defines a coroutine-like interface. The paths define how we traverse the match object to find match results. So, for example, the "scheme" (the "http" in "http:/www.example.com/") can only be matched in the URI rule's scheme sub-rule. Some URI elements, however, such as authority (the host name and port - possibly username as well) can be matched multiple ways, so these routines might return multiple lists of subrule names to traverse. I would have simply returned a list of lists, but Perl 6's parameter passing is very complex and currently some of the specification is not yet implemented. Right now, this manifests as overly aggressive list flattening when returning them from a subroutine or method.

This is why I used coroutines to return each of the sub-lists, one call at a time.

I'll continue to post new updates as my URI module nears readiness. For now, it's just awaiting some love on the other parsers, and I think it'll be ready to go.