Bug! Rebol's parsing of urls is incorrect.

[1/12] from: al:bri:xtra at: 10-Feb-2001 12:16

Rebol also has a problem with parsing urls:

>> type? http://www.rebol.com!
== url!
>> type? http://www.rebol.com.
== url!
>> type? http://www.rebol.com?
== url!
>> type? http://www.rebol.com,
== url!

My email client correctly leaves out the exclamation mark "!", period ".", question mark "?" and comma "," but Rebol treats all of them as part of the URL, which is clearly incorrect.

Andrew Martin
ICQ: 26227169
http://members.nbci.com/AndrewMartin/

[2/12] from: holger::rebol::com at: 9-Feb-2001 16:14

On Sat, Feb 10, 2001 at 12:16:58PM +1300, Andrew Martin wrote:

> Rebol also has a problem with parsing urls:
>
> >> type? http://www.rebol.com!

<<quoted lines omitted: 8>>

> question mark "?" and comma "," but Rebol treats all of them as part of the
> URL, which is clearly incorrect.

There is a difference between a legal, parsed URL and REBOL's detection of datatypes. REBOL parses pretty much anything that starts with text: followed by something else as a url! datatype. That does not mean that all of them are necessarily legal URLs.

In fact, whether a URL is legal or not depends completely on the scheme. For instance the first component after (scheme):// does not necessarily have to be a host name. Just take file://a,b,c! as an example. This is completely legal, yet your email client might incorrectly stop after the "a".

REBOL's scanner has no knowledge of schemes and their particular rules for URL wellformedness, because otherwise it would be impossible to add user-defined schemes with their own URL schemes. It would also prevent you from doing some types of dynamic URL generation/manipulation at the series level.

The actual URL parsing and check for wellformedness is done much later, when you try to open a port with a URL. Individual schemes have their own parsers. There is one universal parser (decode-url), which is what most schemes use to parse URLs. It knows about the most common URL formats, including username/password, hostname, ports, directory and file parts.

The same concept applies to scanning vs. parsing of emails.

--
Holger Kruse
[holger--rebol--com]
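The split Holger describes, a loose scanner that only recognises URL-shaped text and a scheme-specific check that runs later when the URL is used, can be sketched outside Rebol. The following is a minimal Python illustration, not Rebol's implementation: the names scan_type, valid_http, VALIDATORS and open_url are hypothetical, the "word:something" pattern is only an approximation of the rule stated above, and the hostname check is deliberately simplistic.

```python
import re
from urllib.parse import urlsplit

# Loose scanner rule (an assumption modelled on the description above):
# a scheme-like word, a colon, then at least one non-space character.
URL_SHAPED = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*:\S+")

# Simplistic hostname pattern, for illustration only.
HOSTNAME = re.compile(r"[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)*")


def scan_type(token):
    """Classify a token by shape only, with no knowledge of any scheme."""
    return "url" if URL_SHAPED.fullmatch(token) else "word"


def valid_http(url):
    """Hypothetical http validator: the host part must look like a hostname."""
    host = urlsplit(url).hostname
    return bool(host) and HOSTNAME.fullmatch(host) is not None


# Each scheme registers its own validator; the scanner never consults this table.
VALIDATORS = {"http": valid_http}


def open_url(url):
    """Validate only at the point of use, as a stand-in for opening a port."""
    scheme = urlsplit(url).scheme
    check = VALIDATORS.get(scheme)
    if check is None:
        raise ValueError("no handler for scheme %r" % scheme)
    if not check(url):
        raise ValueError("malformed %s URL: %r" % (scheme, url))
    return "opened " + url
```

In this sketch scan_type("http://www.rebol.com!") reports "url", just as type? does in the report above, while open_url on the same string raises an error because the http validator rejects the "!" in the host part. New schemes can be added to VALIDATORS without touching the scanner.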

[3/12] from: ryanc:iesco-dms at: 9-Feb-2001 16:36

Hey Andrew,

I am not so quick to call these bugs myself. You do bring up some interesting points though on rebol's data validation capability. To say that http://www.rebol.com? is not a URL at all, is a stretch from my perspective. What about 23'12'67? Is that not a number?

Maybe 'type? should be as loose as possible, allowing for the greatest flexibility on its use? A separate pickier and slower data validation function could be used to say that something is not ready for the outside world. After all, 'type?'s docs just say "Returns a value's datatype," no mention of validity.

What do you think?

--Ryan

Andrew Martin wrote:

> Rebol also has a problem with parsing urls:
>
> >> type? http://www.rebol.com!

<<quoted lines omitted: 15>>

--
Ryan Cole
Programmer Analyst
www.iesco-dms.com
707-468-5400

I am enough of an artist to draw freely upon my imagination. Imagination is more important than knowledge. Knowledge is limited. Imagination encircles the world. -Einstein

[4/12] from: al:bri:xtra at: 12-Feb-2001 15:31

Petr wrote:

> Maybe the 'load is the way?
> ->> email? load "just.a.string"

<<quoted lines omitted: 3>>

> ->>

>> x: load "http://www.rebol.com!"
== http://www.rebol.com!
>> type? x
== url!
>> x
== http://www.rebol.com!

'load was where I first noticed the problem.

Andrew Martin
ICQ: 26227169
http://members.nbci.com/AndrewMartin/

[5/12] from: brett:codeconscious at: 12-Feb-2001 15:04

Hi Andrew

> What we really need is a validating parser/loader/scanner, one for each
> scheme. I've been skimming through the schemes and noticed there's some
> repetition in Rebol's open functions, which could be abstracted out. Also,
> if a validating parser rule was incorporated into each scheme to check for
> wellformedness, that would be good. As it stands now, I can't use Rebol's
> 'load/next function to extract a URL from plain text with punctuation around
> it. For example extracting the URL from the following: "Rebol's HQ"
> http://www.rebol.com! requires me to write my own URL parser.

I suspect that even if you had such a validating parser rule in the scheme it would not make any difference to the way Rebol scans the url! datatype - for the reason that Holger pointed out. Thus it would not help using the load/next function either.

You may be better off writing your own parser. By doing so you are adding crucial knowledge to the solution that Rebol doesn't have - that being that your input is actually plain text - not a Rebol loadable dialect.

Also, I recall a warning that Larry gave some time back about using load - it puts the words in system/words thus using up a finite resource. So if I were to load this email I'm writing (apart from the errors) I would have words like "also", "recall", "warning" and "larry" in system/words. I do use load for loading tag names and attributes in my html manipulation scripts, but I calm myself with the knowledge that the number of tag names is probably finite. Plain text though is coming from a much larger domain of possibilities.

That said, having a validating parser rule as part of each scheme does seem appropriate. It would allow you to make your own parser and know that it will not need modification as more schemes are added.

Brett.

[6/12] from: al:bri:xtra at: 12-Feb-2001 17:33

Brett wrote:

> I suspect that even if you had such a validating parser rule in the scheme it would not make any difference to the way Rebol scans the url! datatype - for the reason that Holger pointed out. Thus it would not help using the load/next function either.

That's right. I found out the problem in the first place by using 'load/next. It would be nice though if the scanner checked the scheme name and "dived into" a parser rule in the appropriate scheme.

> You may be better off writing your own parser. By doing so you are adding crucial knowledge to the solution that Rebol doesn't have - that being that your input is actually plain text - not a Rebol loadable dialect.

It's the best kind of dialect, plain text with rebol values interspersed. :-)

> Also, I recall a warning that Larry gave some time back about using load - it puts the words in system/words thus using up a finite resource. So if I were to load this email I'm writing (apart from the errors) I would have words like "also", "recall", "warning" and "larry" in system/words. I do use load for loading tag names and attributes in my html manipulation scripts, but I calm myself with the knowledge that the number of tag names is probably finite. Plain text though is coming from a much larger domain of possibilities.

I'll have to check for that. Good point.

> That said, having a validating parser rule as part of each scheme does seem appropriate. It would allow you to make your own parser and know that it will not need modification as more schemes are added.

It'll also mean less duplication of effort on the part of everyone. Thanks Brett.

Andrew Martin
ICQ: 26227169
http://members.nbci.com/AndrewMartin/

[7/12] from: al:bri:xtra at: 11-Feb-2001 20:50

Holger wrote:

> The actual URL parsing and check for wellformedness is done much later, when you try to open a port with a URL.

Then it looks like I'll have to build a parser that can check for wellformedness and use that to extract absolute URLs and emails from plain text. I was hoping that 'load/next would do the trick for me. Thanks Holger.

Andrew Martin
ICQ: 26227169
http://members.nbci.com/AndrewMartin/

[8/12] from: al:bri:xtra at: 11-Feb-2001 21:03

Ryan wrote:

> Maybe 'type? should be as loose as possible, allowing for the greatest flexibility on its use? A separate pickier and slower data validation function could be used to say that something is not ready for the outside world. After all, 'type?'s docs just say "Returns a value's datatype," no mention of validity.

What we really need is a validating parser/loader/scanner, one for each scheme. I've been skimming through the schemes and noticed there's some repetition in Rebol's open functions, which could be abstracted out. Also, if a validating parser rule was incorporated into each scheme to check for wellformedness, that would be good.

As it stands now, I can't use Rebol's 'load/next function to extract a URL from plain text with punctuation around it. For example extracting the URL from the following:

"Rebol's HQ" http://www.rebol.com!

requires me to write my own URL parser.

Andrew Martin
ICQ: 26227169
http://members.nbci.com/AndrewMartin/
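The kind of hand-written extractor Andrew mentions might look roughly like the following. This is a Python sketch rather than Rebol, and it rests on a heuristic that is an assumption, not anything Rebol provides: punctuation at the very end of a candidate is treated as belonging to the surrounding sentence. CANDIDATE, TRAILING and extract_urls are hypothetical names.

```python
import re

# A candidate is a scheme, "://", then a run of characters that excludes
# whitespace, double quotes and angle brackets.
CANDIDATE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*://[^\s\"<>]+")

# Characters assumed to be sentence punctuation when they end a candidate.
TRAILING = "!.?,;:"


def extract_urls(text):
    """Return URL-like substrings of plain text, trimming trailing punctuation."""
    urls = []
    for match in CANDIDATE.finditer(text):
        urls.append(match.group().rstrip(TRAILING))
    return urls
```

Applied to the text "Rebol's HQ" http://www.rebol.com! this returns a single entry, http://www.rebol.com. The same heuristic also trims file://a,b,c! to file://a,b,c, which is exactly the scheme-dependent ambiguity Holger raised: a plain-text extractor has to guess, whereas a per-scheme rule would not.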

[9/12] from: petr:krenzelok:trz:cz at: 11-Feb-2001 12:37

----- Original Message -----
From: Andrew Martin <[Al--Bri--xtra--co--nz]>
To: <[rebol-list--rebol--com]>
Sent: Sunday, February 11, 2001 9:03 AM
Subject: [REBOL] Re: Bug! Rebol's parsing of urls is incorrect.

> Ryan wrote:
> > Maybe 'type? should be as loose as possible, allowing for the greatest

<<quoted lines omitted: 8>>

> wellformedness, that would be good. As it stands now, I can't use Rebol's
> 'load/next function to extract a URL from plain text with punctuation around
> it. For example extracting the URL from the following: "Rebol's HQ"
> http://www.rebol.com! requires me to write my own URL parser.

Sorry if off-topic, but I came too late to the discussion. I remember the case when I wanted to check a value to see whether it is an email. Example:

->> eml: to-email "just.a.string"
== just.a.string
->> type? eml
== email!
->> email? eml
== true
->>

Then Elan (IIRC) explained to me that even if I have some value marked as a certain datatype, it doesn't mean it has to be actually a valid email address, for example. But even RT is/was using email? some-value to show the outside world how easy it is to check for email values. But as you can see, it does not always have to be true.

Maybe the 'load is the way?

->> email? load "just.a.string"
== false
->> email? load "[just--a--string]"
== true
->>

-pekr-

[10/12] from: holger:rebol at: 14-Feb-2001 9:43

On Sun, Feb 11, 2001 at 09:03:31PM +1300, Andrew Martin wrote:

Actually those open functions you are seeing are not defined separately for each scheme. They are defined for the root protocol, and most schemes inherit them at object (scheme) creation time.

--
Holger Kruse
[holger--rebol--com]

[11/12] from: g:santilli:tiscalinet:it at: 15-Feb-2001 12:23

Holger Kruse wrote:

> Actually those open functions you are seeing are not defined separately for
> each scheme. They are defined for the root protocol, and most schemes inherit
> them at object (scheme) creation time.

Correct me if I am wrong, but REBOL copies functions when cloning an object (to bind them to the new context). So each handler has a copy of the same function.

I think Andrew is suggesting to use delegation instead.

Regards,
Gabriele.
--
Gabriele Santilli <[giesse--writeme--com]> - Amigan - REBOL programmer
Amiga Group Italia sez. L'Aquila -- http://www.amyresource.it/AGI/

[12/12] from: al:bri:xtra at: 16-Feb-2001 15:05

> Holger Kruse wrote:
> > Actually those open functions you are seeing are not defined separately for each scheme. They are defined for the root protocol, and most schemes inherit them at object (scheme) creation time.

Gabriele wrote:

> Correct me if I am wrong, but REBOL copies functions when cloning an object (to bind them to the new context). So each handler has a copy of the same function.

That's right. It's particularly obvious as a result of:

mold system

> I think Andrew is suggesting to use delegation instead.

Actually, I wasn't, but delegation sounds like a good idea. The area of copying functions in an object is a "painful" area in Rebol. I'd really like to put a path in an object like:

o: make object! [f: func [] []]
oo: make o []

and see that the result of:

probe oo

would be something like:

make object! [f: o/f]

Andrew Martin
ICQ: 26227169
http://members.nbci.com/AndrewMartin/
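The difference between copying functions into each clone and delegating to the parent can be illustrated with a small model. This is a Python analogy only, not a description of Rebol's internals: objects are modelled as dictionaries, and copy_function, clone_by_copy and Delegate are hypothetical names.

```python
import types


def copy_function(fn):
    """Build a distinct function object that shares the original's code,
    as a stand-in for a clone receiving its own rebound copy."""
    return types.FunctionType(
        fn.__code__, fn.__globals__, fn.__name__, fn.__defaults__, fn.__closure__
    )


def clone_by_copy(proto):
    """Copying model: every function in the prototype is duplicated per clone."""
    return {
        name: copy_function(value) if isinstance(value, types.FunctionType) else value
        for name, value in proto.items()
    }


class Delegate:
    """Delegation model: the clone stores only its own fields and forwards
    every other lookup to the parent, so functions are shared, not copied."""

    def __init__(self, parent, **own):
        self.parent = parent
        self.own = own

    def lookup(self, name):
        if name in self.own:
            return self.own[name]
        return self.parent[name]
```

With o = {"f": lambda: None}, clone_by_copy(o)["f"] is a different function object from o["f"], whereas Delegate(o).lookup("f") returns o["f"] itself, which is the effect the make object! [f: o/f] notation above is reaching for.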

Notes

Quoted lines have been omitted from some messages.