Fixing urlparse: Make the simple easy, keep the complex solvable

In my previous post, I presented netaddress, an RFC 3986 compliant (I believe) URI parser, along with all the shenanigans that come with it, such as numerical IP addresses. Now, while it's good to know that it's available, it has made parsing simple URIs (the most common case) more complicated than it needs to be, because it exposes most of the complexity inherent in URIs. But this is yet another place where parser combinators really shine. Say I want to parse URIs of the simplified form $(scheme)://$(host)$(path); then this is all I need to do:

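# scheme, reg_name and path_abempty are rules from the RFC 3986 grammar
from rfc3986 import scheme, reg_name, path_abempty
from pyparsing import Literal
# name the pieces so they can be read back as attributes of the result
host = reg_name.setResultsName("host")
path = path_abempty.setResultsName("path")
# a parser for $(scheme)://$(host)$(path), built by chaining sub-parsers
URI = scheme + Literal("://") + host + path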

And now you've got yourself a validating parser for your reduced grammar. Nice, no? I've added this as an extra module ("notQuiteURI") to netaddress, so you can use it like this:

>>> from netaddress import notQuiteURI 
>>> uri = notQuiteURI.URI.parseString("http://host.name.com/path/to/resource")
>>> uri.scheme
'http'
>>> uri.host
'host.name.com'
>>> uri.path
(['/', 'path', '/', 'to', '/', 'resource'], {})
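
If the "+" in the grammar above looks like magic, here is a toy sketch of the idea in plain Python, with no pyparsing involved. To be clear, this is not how pyparsing or netaddress work internally: the helper names (lit, regex, seq, parse_uri) are made up for illustration, and the character classes are rough stand-ins for the real RFC 3986 rules. It only shows how small parsers can be glued into a bigger one.

```python
import re

def lit(text):
    """Hypothetical helper: a parser that matches a fixed string."""
    def parse(s, pos):
        if s.startswith(text, pos):
            return pos + len(text), {}
        return None  # no match at this position
    return parse

def regex(pattern, name):
    """Hypothetical helper: match a regex and store the text under `name`."""
    compiled = re.compile(pattern)
    def parse(s, pos):
        m = compiled.match(s, pos)
        if m is None:
            return None
        return m.end(), {name: m.group(0)}
    return parse

def seq(*parsers):
    """Run parsers one after another; this plays the role of '+'."""
    def parse(s, pos):
        results = {}
        for p in parsers:
            out = p(s, pos)
            if out is None:
                return None  # one failing part fails the whole sequence
            pos, found = out
            results.update(found)
        return pos, results
    return parse

# Simplified stand-ins, NOT the full RFC 3986 character sets.
scheme = regex(r"[A-Za-z][A-Za-z0-9+.-]*", "scheme")
host = regex(r"[A-Za-z0-9._~-]+", "host")
path = regex(r"(?:/[A-Za-z0-9._~-]*)*", "path")

uri = seq(scheme, lit("://"), host, path)

def parse_uri(text):
    """Parse the whole string or raise ValueError."""
    out = uri(text, 0)
    if out is None or out[0] != len(text):
        raise ValueError("not a valid simplified URI: %r" % text)
    return out[1]
```

Calling parse_uri("http://host.name.com/path/to/resource") gives back a dict with the keys "scheme", "host" and "path", and anything that doesn't fit the reduced grammar raises a ValueError. The real thing gets the same effect with far stricter rules, which is exactly why I'd rather reuse the RFC 3986 building blocks than roll my own.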

Update: netaddress is now available through the Python Cheese Shop (PyPI). If you're interested, you should be able to install it by simply typing:

$ easy_install netaddress