Fixing urlparse: More on pyparsing and introducing netaddress

This is the last in a series of three posts (1, 2) discussing issues with Python's urlparse module. Here, I intend to provide a solution.

In the last post, I talked about parser combinators, and Parsec in particular, mentioning pyparsing towards the end. Since the angel-app is a Python application, Parsec, while cool, is of no immediate use. pyparsing, on the other hand, provides Parsec-like functionality for Python. Consider this excerpt from the RFC 3986-compliant URI parser that I'm about to present in this post (as usual, please ignore the blog's spurious formatting):

dec_octet = Combine(Or([
        Literal("25") + ZeroToFive,            # 250 - 255
        Literal("2") + ZeroToFour + Digit,     # 200 - 249
        Literal("1") + repeat(Digit, 2),       # 100 - 199
        OneToNine + Digit,                     # 10 - 99
        Digit                                  # 0 - 9
        ]))
IPv4address = Group(repeat(dec_octet + Literal("."), 3) + dec_octet)

And now:

>>> from netaddress import IPv4address 
[snipped warning message]
>>> IPv4address.parseString("127.0.0.1")
([(['127', '.', '0', '.', '0', '.', '1'], {})], {})
>>> IPv4address.parseString("350.0.0.1")
Traceback (most recent call last):
File "", line 1, in ?
egg/", line 1244, in parseImpl
raise exc
pyparsing.ParseException: Expected "." (at char 2), (line:1, col:3)
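
As a cross-check on the dec_octet rule, the same five alternatives can be written as a regular expression using nothing but the standard library. This is only a sketch for comparison, not part of netaddress; the names DEC_OCTET, IPV4_ADDRESS and is_ipv4_address are made up for this example.

```python
import re

# Same five alternatives as the dec_octet rule, most specific first.
# Regex alternation tries branches left to right, so the order matters.
DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"

# Three "octet plus dot" groups followed by a final octet.
IPV4_ADDRESS = re.compile(r"(?:%s\.){3}%s" % (DEC_OCTET, DEC_OCTET))


def is_ipv4_address(text):
    """Return True if the whole string is a dotted quad with octets 0-255."""
    return IPV4_ADDRESS.fullmatch(text) is not None


print(is_ipv4_address("127.0.0.1"))  # prints: True
print(is_ipv4_address("350.0.0.1"))  # prints: False
```

Unlike the pyparsing version, this only answers yes or no; it does not return the individual octets or say where the input went wrong.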

Anyhow, what I mean to say is this: We have a validating URI parser now. Apart from the bugs that are still to be expected for a piece of code at this early stage, it should be RFC 3986 compliant. You can get either the Python package, or a tarball of the darcs repository (unfortunately my zope account chokes on the "_darcs" directory filename, so I'm still looking for a good way to host the darcs).

This is how one would use it:

>>> from netaddress import URI
>>> uri = URI.parseString("http://localhost:6221/foo/bar")
>>> uri.port
>>> uri.scheme

Or, in the case of a more complex parse:

>>> uri = URI.parseString("http://vincent@localhost:6221/foo/bar")
>>> uri.asDict().keys()
['scheme', 'hier_part']
>>> uri.hier_part.path_abempty
(['/', 'foo', '/', 'bar'], {})
>>> uri.hier_part.authority.userinfo
>>> uri.hier_part.authority.port
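
For comparison, here is roughly what the standard library gives for the same URI. This sketch assumes Python 3, where the urlparse module lives on as urllib.parse. It splits the string into components but, unlike a validating parser, does not check that the host is well-formed.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://vincent@localhost:6221/foo/bar")
# prints: http vincent localhost 6221 /foo/bar
print(parts.scheme, parts.username, parts.hostname, parts.port, parts.path)

# The standard library splits but does not validate the host:
# an out-of-range octet is handed back unchanged.
lenient = urlsplit("http://350.0.0.1/")
print(lenient.hostname)  # prints: 350.0.0.1
```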

Hope you find this useful.

Comments (5)


Paul McGuire @ 23.11.2007 15:19 CET
Vincent -

Congratulations on a nice use of pyparsing! I did my own flavor of urlparse a long time ago, but more as an exercise than as a serious submission (also, my version did not try to handle IPv6 as yours does). I'll post it on the pyparsing wiki in the Examples page so you can compare with your own work.

Some tips on your grammar:
- Good application of results names. Note that it is not necessary to call asDict() to get the keys; you could call uri.keys() directly. You can also try calling uri.dump() to see more information on which fields were found in a given uri string.

- In your grammar, you have this code and comment:

# it is my understanding that pyparsing is lazy -- we need to change the order here,
# so it tests the longest first. YES?
dec_octet = Combine(Or([
        Literal("25") + ZeroToFive,            # 250 - 255
        Literal("2") + ZeroToFour + Digit,     # 200 - 249
        Literal("1") + repeat(Digit, 2),       # 100 - 199
        OneToNine + Digit,                     # 10 - 99
        Digit                                  # 0 - 9
        ]))

Pyparsing has 2 forms of alternation: MatchFirst and Or. They are mapped to operators '|' and '^', respectively. MatchFirst is lazy and returns the first match, so you would need to order longer ahead of shorter numbers. Or is non-lazy, and evaluates all alternatives and returns the longest match. In this example, since you have already done the work of ordering your expressions from most-restrictive to least, you could replace Or with MatchFirst, and this would then conform to your comment.

- Constructs such as:

dunno = Or([unreserved, sub_delims, Literal(":")])

can also be written as:

dunno = unreserved ^ sub_delims ^ Literal(":")

or even:

dunno = unreserved ^ sub_delims ^ ":"

This is purely a point of personal style or taste - your expression is valid, and it also has the advantage of not being subject to unexpected operator precedence issues.
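
The difference between the two forms of alternation described above can be illustrated without pyparsing. The following is a toy sketch in plain Python; literal, match_first and match_longest are hypothetical helpers written for this example. They only mimic the behaviour of MatchFirst and Or and are not pyparsing's implementation.

```python
def literal(text):
    """Return a parser that yields `text` if the source starts with it, else None."""
    def parse(source):
        return text if source.startswith(text) else None
    return parse


def match_first(*parsers):
    """Lazy alternation: return the result of the first alternative that matches."""
    def parse(source):
        for parser in parsers:
            result = parser(source)
            if result is not None:
                return result
        return None
    return parse


def match_longest(*parsers):
    """Exhaustive alternation: try every alternative and keep the longest match."""
    def parse(source):
        results = [r for r in (p(source) for p in parsers) if r is not None]
        return max(results, key=len) if results else None
    return parse


# Deliberately list the shorter literal first to expose the difference.
short_first = (literal("<"), literal("<="))
print(match_first(*short_first)("<= 3"))    # prints: <
print(match_longest(*short_first)("<= 3"))  # prints: <=
```

With the alternatives already ordered longest first, both strategies return the same result, which is why the ordering in dec_octet makes the cheaper first-match form sufficient.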

Some comments on your pyParseExtensions:
- Here is a shortened version of repeat:

repeat = lambda parser,n : And([parser]*n)

I considered including something like this in pyparsing, but the lambda was so short, I didn't think it was worth the trouble. Seeing that you found it necessary to write your own, maybe I should add it to pyparsing itself. For that matter, I could override the * operator so that you wouldn't even need to invoke a method, you could just write in your grammar:

(dec_octet + Literal(".")) * 3

instead of:

repeat(dec_octet + Literal("."), 3)

This would also permit me to support multiplication by a tuple:

(dec_octet + Literal(".")) * (1,3)

where a tuple multiplier would be interpreted as (min,max), similar to {min,max} in re's. (Is this what you meant by the "ABNF *x operator" in your rfc3986 comments?)
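
To make the proposed multiplier semantics concrete, here is a small stand-alone sketch of a repeat helper that accepts either an integer or a (min, max) tuple. It is written against toy parser functions rather than pyparsing objects, so literal and repeat below are illustrative assumptions, not pyparsing API. The matching is greedy and does no backtracking.

```python
def literal(text):
    """Return a parser giving (match, rest) on success and None on failure."""
    def parse(source):
        if source.startswith(text):
            return text, source[len(text):]
        return None
    return parse


def repeat(parser, count):
    """Apply `parser` exactly `count` times, or (min, max) times for a tuple."""
    low, high = count if isinstance(count, tuple) else (count, count)

    def parse(source):
        matched, rest = [], source
        # Greedily consume up to `high` repetitions.
        while len(matched) < high:
            result = parser(rest)
            if result is None:
                break
            token, rest = result
            matched.append(token)
        # Succeed only if at least `low` repetitions were found.
        return (matched, rest) if len(matched) >= low else None
    return parse


ab = literal("ab")
print(repeat(ab, 3)("ababab!"))      # prints: (['ab', 'ab', 'ab'], '!')
print(repeat(ab, (1, 3))("abab!"))   # prints: (['ab', 'ab'], '!')
print(repeat(ab, 3)("abab!"))        # prints: None
```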

- I would discourage you from defining "oneOf" as this clashes with a similar "oneOf" method in pyparsing. Pyparsing's oneOf is a helper method for matching literal strings - oneOf builds a MatchFirst and reorders the input strings as necessary to check longer strings before shorter (similar to your longest first handling in dec_octet).

Here is how pyparsing's oneOf works:

comparison = oneOf("< = > <= >= !=")

The original version of oneOf reorders the input literals (found by invoking split on the input string) to give:

"<=" | "<" | "=" | ">=" | ">" | "!="

The current version of oneOf goes further and returns a Regex (which is nearly unreadable for all the backslashes).
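
The reordering idea can be sketched with the standard re module. This is not pyparsing's code: one_of is a hypothetical helper that simply sorts the literals longest first (which yields a different but equally safe order than the one shown above) and joins them into a single regular expression.

```python
import re


def one_of(literals):
    """Build a regex matching any of the whitespace-separated literals."""
    words = literals.split()
    # Longest first, so "<=" is tried before its prefix "<".
    # sorted() is stable, so equal-length literals keep their input order.
    ordered = sorted(words, key=len, reverse=True)
    # re.escape guards against literals containing regex metacharacters.
    return re.compile("|".join(re.escape(word) for word in ordered))


comparison = one_of("< = > <= >= !=")
print(comparison.match("<= 3").group())  # prints: <=
print(comparison.match("< 3").group())   # prints: <
```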

Nice work, and welcome to pyparsing!

-- Paul

vincent @ 23.11.2007 15:23 CET

I am honored to receive a comment from the master himself! Thanks for these very helpful suggestions!

I'll make sure they go into the next revision.

Many thanks for pyparsing and cheers,

vincent @ 23.11.2007 15:33 CET

as a clarification: when I mentioned the "ABNF *x" operator, I meant the specification of variable repetition ranges, see, e.g.

It is my understanding that pyparsing currently only supports this in the context of the Word operator. Though maybe I've missed something...

Suggestions are more than welcome.

vincent @ 23.11.2007 15:46 CET

Indeed, the multiplication by a tuple that you propose is exactly what I was looking for.

Paul McGuire @ 24.11.2007 03:09 CET
Please check out the latest version from the pyparsing SourceForge SVN repository. Let me know how it works for you.

-- Paul