Bug! Rebol's parsing of urls is incorrect.
[1/12] from: al:bri:xtra at: 10-Feb-2001 12:16
Rebol also has a problem with parsing urls:
>> type? http://www.rebol.com!
== url!
>> type? http://www.rebol.com.
== url!
>> type? http://www.rebol.com?
== url!
>> type? http://www.rebol.com,
== url!
My email client correctly leaves out the exclamation mark "!", period ".",
question mark "?" and comma "," but Rebol treats all of them as part of the
URL, which is clearly incorrect.
Andrew Martin
ICQ: 26227169 http://members.nbci.com/AndrewMartin/
[2/12] from: holger::rebol::com at: 9-Feb-2001 16:14
On Sat, Feb 10, 2001 at 12:16:58PM +1300, Andrew Martin wrote:
> Rebol also has a problem with parsing urls:
> >> type? http://www.rebol.com!
<<quoted lines omitted: 8>>
> question mark "?" and comma "," but Rebol treats all of them as part of the
> URL, which is clearly incorrect.
There is a difference between a legal, parsed URL and REBOL's detection of
datatypes. REBOL parses pretty much anything that starts with text: followed
by something else as a url! datatype. That does not mean that all of them
are necessarily legal URLs. In fact, whether a URL is legal or not depends
completely on the scheme. For instance the first component after
(scheme):// does not necessarily have to be a host name. Just take
file://a,b,c! as an example. This is completely legal, yet your email
client might incorrectly stop after the "a".
REBOL's scanner has no knowledge of schemes and their particular rules for
URL wellformedness, because otherwise it would be impossible to add
user-defined schemes with their own URL schemes. It would also prevent
you from doing some types of dynamic URL generation/manipulation at the
series level. The actual URL parsing and check for wellformedness is done
much later, when you try to open a port with a URL.
Individual schemes have their own parsers. There is one universal parser
(decode-url), which is what most schemes use to parse URLs. It knows about
the most common URL formats, including username/password, hostname, ports,
directory and file parts.
The same concept applies to scanning vs. parsing of emails.
--
Holger Kruse
[holger--rebol--com]
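
A minimal sketch of the scanner behaviour described above, assuming a REBOL 2 console. The scheme name "myscheme", its path, and the file name "index.html" are made up for illustration and are not registered anywhere.

```rebol
REBOL [Title: "url! scanning sketch (illustrative)"]

; The scanner only looks at the shape "scheme:rest", so a made-up,
; unregistered scheme still loads as a url! value.
custom: load "myscheme://some/made/up/path"
print url? custom

; url! is a series type, so a URL can be built and edited like a
; string. No wellformedness check happens until a port is opened.
base: http://www.rebol.com/
page: join base "index.html"
print url? page
print mold page
```

Nothing here opens a port, so neither value is ever checked against the rules of its scheme.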
[3/12] from: ryanc:iesco-dms at: 9-Feb-2001 16:36
Hey Andrew,
I am not so quick to call these bugs myself. You do bring up some interesting
points though on rebol's data validation capability.
To say that http://www.rebol.com? is not a URL at all, is a stretch from my
perspective. What about 23'12'67? Is that not a number?
Maybe 'type? should be as loose as possible, allowing for the greatest
flexibility on its use? A separate pickier and slower data validation function
could be used to say that something is not ready for the outside world. After
all, 'type?'s docs just say "Returns a value's datatype," no mention of
validity.
What do you think?
--Ryan
Andrew Martin wrote:
> Rebol also has a problem with parsing urls:
> >> type? http://www.rebol.com!
<<quoted lines omitted: 15>>
--
Ryan Cole
Programmer Analyst
www.iesco-dms.com
707-468-5400
I am enough of an artist to draw freely upon my imagination.
Imagination is more important than knowledge. Knowledge is
limited. Imagination encircles the world.
-Einstein
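
One way to picture the "separate pickier" validation function suggested above is the following sketch, assuming REBOL 2. The names strict-url? and sentence-punctuation are hypothetical, and treating a trailing ".", ",", "!" or "?" as invalid is a policy chosen for illustration, not a rule taken from any URL specification.

```rebol
REBOL [Title: "Stricter url check (illustrative sketch)"]

; Hypothetical helper, not part of REBOL: accept a value only if it is
; a url! that does not end in common sentence punctuation.
sentence-punctuation: charset ".,!?"

strict-url?: func [value] [
    to-logic all [
        url? value
        not empty? value
        not find sentence-punctuation last value
    ]
]

print strict-url? load "http://www.rebol.com!"
print strict-url? load "http://www.rebol.com"
```

The helper is intended to reject the first value and accept the second, while type? and url? stay as loose as they are today.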
[4/12] from: al:bri:xtra at: 12-Feb-2001 15:31
Petr wrote:
> Maybe the 'load is the way?
> ->> email? load "just.a.string"
<<quoted lines omitted: 3>>
> ->>
>> x: load "http://www.rebol.com!"
== http://www.rebol.com!
>> type? x
== url!
>> x
== http://www.rebol.com!
'load was where I first noticed the problem.
Andrew Martin
ICQ: 26227169 http://members.nbci.com/AndrewMartin/
[5/12] from: brett:codeconscious at: 12-Feb-2001 15:04
Hi Andrew
> What we really need is a validating parser/loader/scanner, one for each
> scheme. I've been skimming through the schemes and noticed there's some
> repetition in Rebol's open functions, which could be abstracted out. Also,
> if a validating parser rule was incorporated into each scheme to check for
> wellformedness, that would be good. As it stands now, I can't use Rebol's
> 'load/next function to extract a URL from plain text with punctuation
> around it. For example extracting the URL from the following: "Rebol's HQ"
> http://www.rebol.com! requires me to write my own URL parser.
I suspect that even if you had such a validating parser rule in the scheme
it would not make any difference to the way Rebol scans the url! datatype -
for the reason that Holger pointed out. Thus it would not help using the
load/next function either.
You may be better off writing your own parser. By doing so you are adding
crucial knowledge to the solution that Rebol doesn't have - that being that
your input is actually plain text - not a Rebol loadable dialect.
Also, I recall a warning that Larry gave some time back about using load -
it puts the words in system/words thus using up a finite resource. So if I
were to load this email I'm writing (apart from the errors) I would have
words like "also", "recall", "warning" and "larry" in system/words. I do use
load for loading tag names and attributes in my html manipulation scripts, but I
calm myself with the knowledge that the number of tag names is probably
finite. Plain text though is coming from a much larger domain of
possibilities.
That said, having a validating parser rule as part of each scheme does seem
appropriate. It would allow you to make your own parser and know that it
will not need modification as more schemes are added.
Brett.
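
A hand-written extractor along the lines discussed above might look like the following sketch, assuming REBOL 2 parse. The function name extract-urls and the three charset names are hypothetical. It also assumes that only http:// URLs are wanted, that a URL ends at whitespace, and that trailing ".", ",", "!" and "?" belong to the sentence rather than the URL.

```rebol
REBOL [Title: "URL extraction from plain text (illustrative sketch)"]

whitespace: charset " ^-^/"
url-char: complement whitespace
trailing: charset ".,!?"

extract-urls: func [text [string!] /local found candidate] [
    found: copy []
    parse/all text [
        any [
            ; Copy a run of non-whitespace characters starting at "http://"
            copy candidate ["http://" some url-char] (
                ; Drop sentence punctuation from the end of the match
                while [find trailing last candidate] [
                    remove back tail candidate
                ]
                append found to-url candidate
            )
            | skip
        ]
    ]
    found
]

probe extract-urls {"Rebol's HQ" http://www.rebol.com! is online.}
```

For the sample sentence this should yield a block holding http://www.rebol.com without the exclamation mark. Because the text is never passed to load, none of its ordinary words are added to system/words.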
[6/12] from: al:bri:xtra at: 12-Feb-2001 17:33
Brett wrote:
> I suspect that even if you had such a validating parser rule in the scheme
> it would not make any difference to the way Rebol scans the url! datatype -
> for the reason that Holger pointed out. Thus it would not help using the
> load/next function either.
That's right. I found out the problem in the first place by using
'load/next. It would be nice though if the scanner checked the scheme name
and "dived into" a parser rule in the appropriate scheme.
> You may be better off writing your own parser. By doing so you are adding
> crucial knowledge to the solution that Rebol doesn't have - that being that
> your input is actually plain text - not a Rebol loadable dialect.
It's the best kind of dialect, plain text with rebol values interspersed.
:-)
> Also, I recall a warning that Larry gave some time back about using load -
> it puts the words in system/words thus using up a finite resource. So if I
> were to load this email I'm writing (apart from the errors) I would have
> words like "also", "recall", "warning" and "larry" in system/words. I do
> use load for loading tag names and attributes in my html manipulation
> scripts, but I calm myself with the knowledge that the number of tag names
> is probably finite. Plain text though is coming from a much larger domain of
> possibilities.
I'll have to check for that. Good point.
> That said, having a validating parser rule as part of each scheme does
> seem appropriate. It would allow you to make your own parser and know that
> it will not need modification as more schemes are added.
It'll also mean less duplication of effort on the part of everyone.
Thanks Brett.
Andrew Martin
ICQ: 26227169 http://members.nbci.com/AndrewMartin/
[7/12] from: al:bri:xtra at: 11-Feb-2001 20:50
Holger wrote:
> The actual URL parsing and check for wellformedness is done much later,
> when you try to open a port with a URL.
Then it looks like I'll have to build a parser that can check for
wellformedness and use that to extract absolute URLs and emails from plain
text. I was hoping that 'load/next would do the trick for me. Thanks Holger.
Andrew Martin
ICQ: 26227169 http://members.nbci.com/AndrewMartin/
[8/12] from: al:bri:xtra at: 11-Feb-2001 21:03
Ryan wrote:
> Maybe 'type? should be as loose as possible, allowing for the greatest
> flexibility on its use? A separate pickier and slower data validation
> function could be used to say that something is not ready for the outside
> world. After all, 'type?'s docs just say "Returns a value's datatype," no
> mention of validity.
What we really need is a validating parser/loader/scanner, one for each
scheme. I've been skimming through the schemes and noticed there's some
repetition in Rebol's open functions, which could be abstracted out. Also,
if a validating parser rule was incorporated into each scheme to check for
wellformedness, that would be good. As it stands now, I can't use Rebol's
'load/next function to extract a URL from plain text with punctuation around
it. For example extracting the URL from the following: "Rebol's HQ"
http://www.rebol.com! requires me to write my own URL parser.
Andrew Martin
ICQ: 26227169 http://members.nbci.com/AndrewMartin/
[9/12] from: petr:krenzelok:trz:cz at: 11-Feb-2001 12:37
----- Original Message -----
From: Andrew Martin <[Al--Bri--xtra--co--nz]>
To: <[rebol-list--rebol--com]>
Sent: Sunday, February 11, 2001 9:03 AM
Subject: [REBOL] Re: Bug! Rebol's parsing of urls is incorrect.
> Ryan wrote:
> > Maybe 'type? should be as loose as possible, allowing for the greatest
<<quoted lines omitted: 8>>
> wellformedness, that would be good. As it stands now, I can't use Rebol's
> 'load/next function to extract a URL from plain text with punctuation
> around it. For example extracting the URL from the following: "Rebol's HQ"
> http://www.rebol.com! requires me to write my own URL parser.
Sorry if off-topic, but I came too late to the discussion. I remember the
case when I wanted to check for the value and if it is an email. Example:
->> eml: to-email "just.a.string"
== just.a.string
->> type? eml
== email!
->> email? eml
== true
->>
Then Elan (IIRC) explained to me that even if a value is marked as a
certain datatype, that doesn't mean it is actually a valid email address,
for example. But even RT is/was using email? some-value to show the outside
world how easy it is to check for email values. As you can see, the result
does not always have to be true. Maybe 'load is the way?
->> email? load "just.a.string"
== false
->> email? load "[just--a--string]"
== true
->>
-pekr-
[10/12] from: holger:rebol at: 14-Feb-2001 9:43
On Sun, Feb 11, 2001 at 09:03:31PM +1300, Andrew Martin wrote:
> What we really need is a validating parser/loader/scanner, one for each
> scheme. I've been skimming through the schemes and noticed there's some
> repetition in Rebol's open functions, which could be abstracted out.
Actually those open functions you are seeing are not defined separately for
each scheme. They are defined for the root protocol, and most schemes inherit
them at object (scheme) creation time.
--
Holger Kruse
[holger--rebol--com]
[11/12] from: g:santilli:tiscalinet:it at: 15-Feb-2001 12:23
Holger Kruse wrote:
> Actually those open functions you are seeing are not defined separately for
> each scheme. They are defined for the root protocol, and most schemes inherit
> them at object (scheme) creation time.
Correct me if I am wrong, but REBOL copies functions when cloning
an object (to bind them to the new context). So each handler has a
copy of the same function. I think Andrew is suggesting to use
delegation instead.
Regards,
Gabriele.
--
Gabriele Santilli <[giesse--writeme--com]> - Amigan - REBOL programmer
Amiga Group Italia sez. L'Aquila -- http://www.amyresource.it/AGI/
[12/12] from: al:bri:xtra at: 16-Feb-2001 15:05
> Holger Kruse wrote:
> > Actually those open functions you are seeing are not defined separately
> > for each scheme. They are defined for the root protocol, and most schemes
> > inherit them at object (scheme) creation time.
Gabriele wrote:
> Correct me if I am wrong, but REBOL copies functions when cloning an
> object (to bind them to the new context). So each handler has a copy of the
> same function.
That's right. It's particularly obvious as a result of:
mold system
> I think Andrew is suggesting to use delegation instead.
Actually, I wasn't, but delegation sounds like a good idea.
The area of copying functions in an object is a "painful" area in Rebol. I'd
really like to put a path in an object like:
o: make object! [f: func [] []]
oo: make o []
and see that the result of:
probe oo
would be something like:
make object! [f: o/f]
Andrew Martin
ICQ: 26227169 http://members.nbci.com/AndrewMartin/
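
The difference between copying and delegation discussed above can be sketched as follows, assuming REBOL 2. The object name "delegating" and the string returned by f are made up for illustration, and the wrapper is written by hand because the automatic o/f form Andrew wishes for does not exist.

```rebol
REBOL [Title: "Copied versus delegated object functions (illustrative sketch)"]

o: make object! [f: func [] ["result from o/f"]]

; Cloning: per the thread, oo receives its own copy of f, so the two
; function values are not expected to be the same value.
oo: make o []
print same? get in o 'f get in oo 'f

; Delegation, written by hand: the second object stores only a thin
; wrapper that calls through to the single original function.
delegating: make object! [f: func [] [o/f]]
print delegating/f
```

With delegation, a later change to o/f is picked up by every object that calls through to it, whereas copies made at clone time keep the old body.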
Notes
- Quoted lines have been omitted from some messages.