Mailing List Archive: Re: Bug! Rebol's parsing of urls is incorrect.

[REBOL] Re: Bug! Rebol's parsing of urls is incorrect.

From: holger::rebol::com at: 9-Feb-2001 16:14


On Sat, Feb 10, 2001 at 12:16:58PM +1300, Andrew Martin wrote:
> Rebol also has a problem with parsing urls:
> >> type? http://www.rebol.com!
> == url!
> >> type? http://www.rebol.com.
> == url!
> >> type? http://www.rebol.com?
> == url!
> >> type? http://www.rebol.com,
> == url!
>
> My email client correctly leaves out the exclamation mark "!", period ".",
> question mark "?" and comma "," but Rebol treats all of them as part of the
> URL, which is clearly incorrect.

There is a difference between a legal, parsed URL and REBOL's detection of
datatypes. REBOL scans pretty much anything of the form text: followed by
something else as a url! datatype. That does not mean that all of them
are necessarily legal URLs. In fact, whether a URL is legal or not depends
completely on the scheme. For instance, the first component after
(scheme):// does not necessarily have to be a host name. Just take
file://a,b,c! as an example. This is completely legal, yet your email
client might incorrectly stop after the "a".

REBOL's scanner has no knowledge of schemes and their particular rules for
URL well-formedness, because otherwise it would be impossible to add
user-defined schemes with their own URL syntax. It would also prevent
you from doing some types of dynamic URL generation/manipulation at the
series level. The actual URL parsing and check for well-formedness is done
much later, when you try to open a port with a URL.
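
To make the two stages concrete, here is a minimal sketch in Python (not
REBOL, and not REBOL's actual implementation). The names scan_type,
register_scheme and open_port, and the toy validation rules, are all
hypothetical and exist only for illustration. The scanner classifies a
token by shape alone; scheme rules are applied only when a port is opened.

```python
import re

# Stage 1: scheme-agnostic scanning. Anything shaped like "word:rest"
# is classified as URL-like; no scheme rules are consulted here.
URL_SHAPE = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:\S+$")

def scan_type(token):
    """Return a datatype name for a token, based on its shape alone."""
    return "url!" if URL_SHAPE.match(token) else "word!"

# Stage 2: per-scheme validation, run only when a port is opened.
# Hypothetical registry mapping a scheme name to its validator.
SCHEME_VALIDATORS = {}

def register_scheme(name, validator):
    """Add a (possibly user-defined) scheme without touching the scanner."""
    SCHEME_VALIDATORS[name] = validator

def open_port(url):
    """Check the URL against its scheme's rules; raise ValueError if bad."""
    scheme, _, rest = url.partition(":")
    validator = SCHEME_VALIDATORS.get(scheme.lower())
    if validator is None:
        raise ValueError(f"unknown scheme: {scheme}")
    if not validator(rest):
        raise ValueError(f"malformed {scheme} URL: {url}")
    return (scheme, rest)

def valid_http(rest):
    # Toy rule (an assumption): "//", then a host made of letters, digits,
    # dots and hyphens that ends in a letter or digit, then an optional path.
    return re.match(r"^//[A-Za-z0-9.-]*[A-Za-z0-9](/.*)?$", rest) is not None

def valid_file(rest):
    # Toy rule (an assumption): anything non-empty after "//" is accepted.
    return rest.startswith("//") and len(rest) > 2

register_scheme("http", valid_http)
register_scheme("file", valid_file)
```

With these toy rules, scan_type reports "url!" for both
http://www.rebol.com! and file://a,b,c!, while open_port accepts the file
URL and raises ValueError for the http one. Because the scanner never
looks at the registry, a new scheme can be registered at any time.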

Individual schemes have their own parsers. There is one universal parser
(decode-url), which is what most schemes use to parse URLs. It knows about
the most common URL formats, including username/password, hostname, port,
directory and file parts.
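
As a rough illustration of what a generic decoder for the common
scheme://user:pass@host:port/dir/file layout might look like, here is a
Python sketch. It is not the code of decode-url; the function name
decode_generic_url, the regular expression and the field names are
assumptions chosen for this example.

```python
import re

# Hypothetical pattern for scheme://user:pass@host:port/dir/file.
GENERIC_URL = re.compile(
    r"^(?P<scheme>[A-Za-z][A-Za-z0-9+.-]*)://"
    r"(?:(?P<user>[^:@/]*)(?::(?P<password>[^@/]*))?@)?"
    r"(?P<host>[^:/]*)"
    r"(?::(?P<port>\d+))?"
    r"(?P<path>/.*)?$"
)

def decode_generic_url(url):
    """Split a URL in the common layout into its component parts."""
    m = GENERIC_URL.match(url)
    if m is None:
        raise ValueError(f"not in the generic format: {url}")
    parts = m.groupdict()
    if parts["port"] is not None:
        parts["port"] = int(parts["port"])
    path = parts.pop("path") or ""
    # Split the path into directory and file parts at the last "/".
    directory, _, filename = path.rpartition("/")
    parts["directory"] = directory + "/" if path else None
    parts["file"] = filename or None
    return parts
```

For example, decode_generic_url("ftp://user:secret@ftp.example.com:2121/pub/docs/readme.txt")
yields the host "ftp.example.com", the port 2121, the directory
"/pub/docs/" and the file "readme.txt". A scheme whose URLs do not follow
this layout would need its own parser instead.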

The same concept applies to scanning vs. parsing of emails.

--
Holger Kruse
[holger--rebol--com]