Fixing urlparse: A case for Parsec and pyparsing

In a previous post, I described issues with parsing and validating URL's with the functionality provided by Python's stdlib. I will just restate that clearly, all messages exchanged by angel-app nodes must be validated in order for it to work properly. What to do? First of all, I was of course not the first person to notice the module's shortcomings. However, I was surprised at the answers that popped up: It seems like no one was interested in actually coming up with a validating parser (perhaps even just for a subset of the complete URI syntax), but instead people focussed on fixing specific cases where the parser would fail -- in essence adding new features, rather than putting the whole system on a solid basis. Suggestions go so far as to propose a new URI parsing module. However, the proposed new module is again based on the premise that the input represents a valid URI, the behavior in the case of an invalid input is again left undefined. WTF? Have these people never looked beyond string.split() and regexes?

Dudes, writing a VALIDATING PARSER is NOT THAT HARD, if you have a reasonable grammar and good libs. Why do people keep pretending that it is? Sure, you might be afraid of having to fire up lex, yacc and antlr, and for good reason. But with sufficiently dynamic languages, that's usually unnecessary, if you have a parser combinator library handy.

The key idea behind parser combinators is that you write your parser in a bottom up fashion, in just the same way that you would define your grammar. You write a parser for a small part of the grammar, then combine these partial parsers to form a complex whole. The canonical example in this context is Haskell's parsec library. Let's start out with a simple restricted URI grammar:

module RestrictedURI where

import Text.ParserCombinators.Parsec

data URI = URI {
host :: [String],
port :: Int,
path :: [String]
} deriving (Eq, Show, Read)

schemeP = string "http" "scheme"
schemeSepP = string "://" "scheme separator"

hostPartP = many lower "part of a host name"
hostNameP = sepBy hostPartP (string ".") "host name"

pathSegmentP = sepEndBy1 (many1 alphaNum) (string "/") "multiple path segments"
pathP = do {
root - string "/" "absolute path required";
segments - pathSegmentP;
return (root:segments)
} "an absolute path, optionally terminated by a /"

restrictedURIP :: Parser URI
restrictedURIP =
do {
ignored - schemeP;
ignored - schemeSepP;
h - hostNameP;
p - pathP;
return (URI h 80 p)
} "a subset of the full URI grammar"

parseURI :: String -> (Either ParseError URI)
parseURI = parse restrictedURIP ""

(Where you should forgive me for the blog inserting break tags all over the place). But just to illustrate:

vincent$ ghci 
GHCi, version 6.8.1: :? for help
Loading package base ... linking ... done.
Prelude> :l restrictedURI
[1 of 1] Compiling RestrictedURI ( restrictedURI.hs, interpreted )
Ok, modules loaded: RestrictedURI.
*RestrictedURI> parseURI ""
Loading package parsec- ... linking ... done.
Right (URI {host = ["localhost","com"], port = 80, path = ["/","foo","bar"]})

Plus, we get composability, validation and error messages essentially for free:

*RestrictedURI> parseURI "" 
Left (line 1, column 17): unexpected "2" expecting lowercase letter,
"." or an absolute path, optionally terminated by a /

Now consider the following excerpt from Haskell's Network.URI.

--  RFC3986, section 3.1  
uscheme :: URIParser String
uscheme =
do { s - oneThenMany alphaChar (satisfy isSchemeChar)
; char ':'
; return $ s++":"

(Again, please forgive for the blog eating my code, but you can also get it from the haskell web site.) And compare that to the ABNF found in the corresponding section of the RFC:

scheme      = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )

Note how the complete URI grammar specification in the RFC is barely a page long. So yeah, implementing this grammar is a significant amount of work (of course you could always choose to support just a well-defined subset), but if you have a good parser combinator library, it's just a few hours of mechanically transforming the ABNF into your parser grammar. You can even watch the Simpsons while doing it (I did). In the case of Network.URI, this boils down a line count of 1278, with about half of the lines being comments or empty lines. Not only that, but given the complete grammar specification, it's super easy to formulate a modified grammar.

As it turns out, Python has a library quite like parsec, it's called pyparsing and I'll bore you with it in my next (and last) post on this topic.


No new comments allowed (anymore) on this post. twisting values since 1994